Python Crawler: Tornado Web Server

Tornado
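This post introduces a practical crawling tool that is rarely mentioned: Tornado.
On Ubuntu, install it with pip install tornado
The code below downloads a page and prints its status code, headers, and source (the sample output is from Baidu's home page):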

# encoding: utf-8
import tornado.httpclient


def getpage(url):
    http_header = {'User-Agent': 'Chrome'}
    http_request = tornado.httpclient.HTTPRequest(
        url=url, method='GET', headers=http_header,
        connect_timeout=20, request_timeout=600)
    http_client = tornado.httpclient.HTTPClient()
    print('Start downloading data...')
    # fetch() blocks until the whole response has arrived
    http_response = http_client.fetch(http_request)
    print('Finish downloading data')
    http_client.close()

    # Print the status code
    print(http_response.code)

    # Print every response header as a (name, value) pair
    all_fields = http_response.headers.get_all()
    for x in all_fields:
        print(x)

    # Print the page source (body is bytes in Python 3, so decode it first)
    print(http_response.body.decode('utf-8'))


if __name__ == '__main__':
    getpage('https://bigwayseo.com')

The printed output is shown below:

Start downloading data...
Finish downloading data
200
('X-Consumed-Content-Encoding', 'gzip')
('Bduserid', '0')
('Bdqid', '0x8619bea400001fa6')
('X-Powered-By', 'HPHP')
('Transfer-Encoding', 'chunked')
('Set-Cookie', 'BAIDUID=6074595641285942B3B28F52889C30CC:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com')
('Set-Cookie', 'BIDUPSID=6074595641285942B3B28F52889C30CC; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com')
('Set-Cookie', 'PSTM=1453696013; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com')
('Set-Cookie', 'BDSVRTM=0; path=/')
('Set-Cookie', 'BD_HOME=0; path=/')
('Set-Cookie', 'H_PS_PSSID=18286_1434_13549_18879_17949_18205_18964_18778_17000_18782_17072_15444_12239; path=/; domain=.baidu.com')
('Set-Cookie', '__bsi=12997306107329212412_00_11_N_N_2_0303_C02F_N_N_N_0; expires=Mon, 25-Jan-16 04:26:58 GMT; domain=www.baidu.com; path=/')
('Expires', 'Mon, 25 Jan 2016 04:26:30 GMT')
('Vary', 'Accept-Encoding')
('Server', 'bfe/1.0.8.13')
('Connection', 'close')
('Cxy_all', 'baidu+1d7ede856c7c96380845666fcd8157ce')
('Cache-Control', 'private')
('Date', 'Mon, 25 Jan 2016 04:26:53 GMT')
('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "')
('Content-Type', 'text/html; charset=utf-8')
('Bdpagetype', '1')
('X-Ua-Compatible', 'IE=Edge,chrome=1')
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索" /><link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg"><link rel="dns-prefetch" href="//s1.bdstatic.com"/><link rel="dns-prefetch" href="//t1.baidu.com"/><link rel="dns-prefetch" href="//t2.baidu.com"/><link rel="dns-prefetch" href="//t3.baidu.com"/><link rel="dns-prefetch" href="//t10.baidu.com"/><link rel="dns-prefetch" href="//t11.baidu.com"/><link rel="dns-prefetch" href="//t12.baidu.com"/><link rel="dns-prefetch" href="//b1.bdstatic.com"/><title>百度一下,你就知道</title><-------- only part of the source is shown ------>
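
The header listing above repeats Set-Cookie several times, because get_all() yields one (name, value) pair per header line. A minimal sketch of grouping those pairs by name follows; the helper name group_headers is hypothetical (not part of Tornado), and the sample pairs are copied from the output above.

```python
from collections import defaultdict


def group_headers(pairs):
    """Group (name, value) header pairs by lower-cased name, keeping repeats in order."""
    grouped = defaultdict(list)
    for name, value in pairs:
        grouped[name.lower()].append(value)
    return dict(grouped)


# A few pairs copied from the printed output.
sample = [
    ('Set-Cookie', 'BDSVRTM=0; path=/'),
    ('Set-Cookie', 'BD_HOME=0; path=/'),
    ('Server', 'bfe/1.0.8.13'),
]
headers = group_headers(sample)
print(headers['set-cookie'])  # both cookie values, in original order
print(headers['server'])
```

With a real response you would pass http_response.headers.get_all() instead of the sample list. If only one header is of interest, Tornado's HTTPHeaders also provides get_list(name).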

In my experience Tornado is noticeably faster than urllib, but you should benchmark it yourself to confirm.
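
One rough way to run that comparison is sketched below. The helper names (time_fetch, fetch_with_urllib, fetch_with_tornado, compare) are hypothetical, and the repeat count is an arbitrary assumption. Network jitter and caching affect the numbers, so treat the result as an indication rather than a proof.

```python
import time
import urllib.request


def time_fetch(fetch, url, repeats=5):
    """Return the average wall-clock seconds taken by one call to fetch(url)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fetch(url)
    return (time.perf_counter() - start) / repeats


def fetch_with_urllib(url):
    # Standard-library baseline with the same User-Agent as the Tornado example.
    request = urllib.request.Request(url, headers={'User-Agent': 'Chrome'})
    with urllib.request.urlopen(request, timeout=20) as response:
        return response.read()


def fetch_with_tornado(url):
    # Imported lazily so time_fetch can be used even where Tornado is not installed.
    import tornado.httpclient
    client = tornado.httpclient.HTTPClient()
    try:
        return client.fetch(url, headers={'User-Agent': 'Chrome'},
                            connect_timeout=20, request_timeout=600).body
    finally:
        client.close()


def compare(url, repeats=5):
    """Print the average fetch time of each client for the given URL."""
    for name, fetch in [('urllib', fetch_with_urllib), ('tornado', fetch_with_tornado)]:
        print(name, time_fetch(fetch, url, repeats))
```

Calling compare('https://bigwayseo.com') prints one average per client. Each Tornado call here creates and closes its own HTTPClient, so the measurement includes that setup cost.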
Official documentation: https://www.tornadoweb.org/en/stable/
