
Implementing a Simple Web Crawler in Python

July 3, 2017 / July 10, 2015 · 2 Comments · by Bigway

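Writing SEO tools in Python has become popular lately, and Python training courses are booming online. Today I'll roughly sketch spider code built on a few modules: urllib, pycurl, requests, and tornado. pycurl, requests, and tornado are third-party libraries; once Python itself is installed, look each module up online (Baidu works) and install it, for example with `pip install requests`.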
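When you first start learning Python, I suggest downloading some video tutorials from the internet; books are not easy for every beginner to follow. If you work in SEO, focus on network application programming. Once you have learned it, your thinking will move up a level, and your boss will start worrying that you might jump ship, haha! Try the code below; just by getting hands-on you are already ahead of a lot of people.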
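Before trying the examples, it may help to check which of the four modules are importable in your environment. The following is a minimal sketch using only the standard library; the names `MODULES` and `check_modules` are made up for illustration, and the `pip install` hint assumes pip is your package manager.

```python
import importlib.util

# The four modules used in this post. urllib ships with Python;
# the other three are third-party and must be installed separately.
MODULES = ['urllib', 'pycurl', 'requests', 'tornado']


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == '__main__':
    for name, available in check_modules(MODULES).items():
        # Hypothetical hint text; adjust to your own package manager.
        status = 'installed' if available else 'missing (try: pip install %s)' % name
        print('%-10s %s' % (name, status))
```

`find_spec` only looks the module up without importing it, so the check is cheap and has no side effects.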
Crawler code with the urllib library

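# -*- coding: utf-8 -*-
from urllib.request import urlopen  # in Python 3, urlopen lives in urllib.request

url = 'https://www.baidu.com/'
page = urlopen(url).read()  # response body as bytes
print(page.decode('utf-8', errors='replace'))  # print the page source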
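The urllib example above has no timeout and no error handling, and it sends urllib's default User-Agent. As a hedged sketch of how those could be added with the standard library only: the helper names `build_request` and `fetch`, the User-Agent string, and the 10-second timeout are all assumptions chosen for illustration.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent='Mozilla/5.0 (compatible; demo-spider)'):
    """Build a GET request that carries an explicit User-Agent header."""
    # The default user_agent value is a made-up example string.
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url, timeout=10):
    """Return the decoded page source, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as resp:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = resp.headers.get_content_charset() or 'utf-8'
            return resp.read().decode(charset, errors='replace')
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print('Request failed:', exc)
        return None
```

With these helpers, `fetch('https://www.baidu.com/')` would return the page source as a string, or `None` after printing the error if the site cannot be reached.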

Crawler code with the pycurl library

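# -*- coding: utf-8 -*-
import pycurl
from io import BytesIO  # pycurl delivers bytes, so buffer them in BytesIO

url = 'https://www.baidu.com'
c = pycurl.Curl()
c.setopt(c.URL, url)
b = BytesIO()
c.setopt(c.WRITEFUNCTION, b.write)  # write the response into the buffer
c.setopt(c.FOLLOWLOCATION, 1)  # follow redirects
c.setopt(c.HEADER, True)  # include the response headers in the output
c.perform()
html = b.getvalue()
print(html.decode('utf-8', errors='replace'))  # print the page source
b.close()
c.close()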

Crawler code with the requests module

# encoding=utf-8
import requests

print(requests.get('https://bigwayseo.com').text)  # print the page source

Crawler code with the tornado module

# encoding=utf-8
import tornado.httpclient


def getpage(url):
    http_header = {'User-Agent': 'Chrome'}
    http_request = tornado.httpclient.HTTPRequest(url=url, method='GET', headers=http_header, connect_timeout=20, request_timeout=600)
    http_client = tornado.httpclient.HTTPClient()
    print('Start downloading data...')
    http_response = http_client.fetch(http_request)
    print('Finish downloading data')

    # Print the status code
    print(http_response.code)

    # Print every response header as a (name, value) pair
    all_fields = http_response.headers.get_all()
    for x in all_fields:
        print(x)

    # Print the page source
    print(http_response.body.decode('utf-8', errors='replace'))
    http_client.close()


if __name__ == '__main__':
    getpage('https://bigwayseo.com')
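The blocking `HTTPClient` used above runs its own event loop, so in Tornado 5 and later it cannot be used where a loop is already running (a Jupyter notebook, for instance); there it raises a RuntimeError. A possible workaround, sketched here under the assumption of Python 3.7+ and a recent Tornado, is to use `AsyncHTTPClient` from a coroutine. The names `getpage_async` and `getpage_blocking` are invented for this example.

```python
import asyncio


async def getpage_async(url):
    """Fetch url with tornado's AsyncHTTPClient; return (status code, body bytes)."""
    # Imported here so this sketch still loads where tornado is not installed.
    from tornado.httpclient import AsyncHTTPClient, HTTPRequest

    request = HTTPRequest(url=url, method='GET',
                          headers={'User-Agent': 'Chrome'},
                          connect_timeout=20, request_timeout=600)
    response = await AsyncHTTPClient().fetch(request)
    return response.code, response.body


def getpage_blocking(url):
    """Wrapper for plain scripts; do not call it from inside a running event loop."""
    return asyncio.run(getpage_async(url))
```

In a plain script, `code, body = getpage_blocking('https://bigwayseo.com')` should behave much like the original function. Inside a notebook, where a loop is already running, `code, body = await getpage_async('https://bigwayseo.com')` would be the form to use.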

You can copy the code and use it as is. Implementing a simple crawler in Python is so easy, do it~
Reposts are welcome; please credit the source!

2 thoughts on “Implementing a Simple Web Crawler in Python”

xu

May 18, 2019 at 11:53 AM

The last code block fails with: RuntimeError: Cannot run the event loop while another loop is running

Environment: Anaconda, Python 3.7

yameimei

July 12, 2015 at 7:28 AM

Giving this a read, haha