Scrapy takes some getting used to; its syntax is more complex than some other crawler modules, but the framework itself is very capable. I already covered that in the previous post (https://bigwayseo.com/412), so I won't repeat it here. This script batch-fetches backlink data from Aizhan for a specified domain. It mainly relies on the Scrapy framework, but regular expressions and XPath are used as well; I deliberately used all of them, because if I don't practice them I forget how they work. This code is for learning and exchange only, and is just a record of my own notes while studying Scrapy. Learning this stuff is not easy: there are few code samples for this topic, and the official docs are in English. The code follows below.
I'm on Windows. To run it, go to the project directory created by "scrapy startproject <project name>", hold Shift and right-click, choose "Open command window here", and type "scrapy crawl <name>", where <name> is the spider's name variable in the code. With my source code that is: scrapy crawl domainlink
#encoding=utf-8
# from scrapy.http import Request
# from scrapy.selector import Selector
# from CSDNBlog.items import CsdnblogItem
from scrapy.spiders import Spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
import re


class aizhan(Spider):
    name = 'domainlink'
    download_delay = 2
    allowed_domains = ['link.aizhan.com']
    start_urls = ['https://link.aizhan.com/?url=www.tuniu.com&p=1']

    # rules = (
    #     Rule(LinkExtractor(allow=('?url=www.tuniu.com&vt=c&ob=br&linktext=&linktexttype=&p=.*&aizhan=0')),
    #          callback='parse',
    #          follow=True),
    # )

    def parse(self, response):
        html = response.body
        # Pull out (domain, Baidu weight, link text) triples from the raw HTML.
        c = re.compile(r'href="https://baidurank\.aizhan\.com/baidu/(.*?)"><img border="0" align="absmiddle" src="https://static\.aizhan\.com/images/brs/\d+\.gif" alt="该站权重值为(\d+)".*?</a></dd>.*?<img align="absmiddle" src="https://static\.aizhan\.com/images/pr/pr\d+\.gif"></dd>.*?链接名称.*?target="_blank" rel="nofollow">(.*?) </a></strong></dd></dl>')
        b = re.findall(c, html)
        # print b
        # Open the file once outside the loop instead of reopening it per row.
        op_write_csv = open('aizhan_link.csv', 'a')
        for i in b:
            f = ",".join(i)
            # Re-encode to gbk so the Chinese text displays correctly in Excel on Windows.
            n = f.decode('utf-8').encode('gbk')
            print n
            op_write_csv.write(n + '\n')
        op_write_csv.close()

        # Follow the pagination links at the bottom of the page.
        sel = Selector(response)
        sites = sel.xpath('//*[@id="page_bottom"]/a/@href').extract()
        for site in sites:
            a = 'https://link.aizhan.com/'
            f = a + site
            # print f
            yield Request(f, callback=self.parse)
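To see the extraction idea in isolation, here is a minimal, self-contained sketch of the same pattern: run a grouped regex over an HTML fragment with re.findall, then write the tuples out with the csv module. The HTML snippet and the simplified pattern below are made up for illustration (the real markup on link.aizhan.com is longer, which is why the spider's regex is so much bigger), and I use Python 3's csv writer instead of the manual join-and-re-encode dance from the spider.

```python
import csv
import re

# Hypothetical fragment standing in for the Aizhan backlink page markup.
html = (
    '<a href="https://baidurank.aizhan.com/baidu/example.com">'
    '<img alt="该站权重值为3"></a>'
)

# Simplified version of the spider's regex: each pair of parentheses is a
# capture group, so findall returns a list of (domain, weight) tuples.
pattern = re.compile(
    r'href="https://baidurank\.aizhan\.com/baidu/(.*?)"'
    r'.*?alt="该站权重值为(\d+)"'
)

rows = pattern.findall(html)
# rows == [('example.com', '3')]

# Append the tuples to a CSV file; the csv module handles quoting and
# separators, so no manual ",".join or gbk re-encoding is needed.
with open('aizhan_link.csv', 'a', newline='', encoding='utf-8') as fh:
    csv.writer(fh).writerows(rows)
```

The same group-then-join structure is what the spider's parse method does, just with three capture groups and a far stricter pattern tied to the page's exact attribute order.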