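AliExpress is Alibaba Group's cross-border B2C platform, aimed mainly at small and medium-sized sellers. Its traffic is probably on par with Alibaba.com (the international site), and plenty of small and medium-sized businesses in China sell on it. It is also closely tied to Alibaba.com and was free at the start, so you get the idea... Together, AliExpress and Alibaba.com take on Amazon, eBay, and the other top overseas e-commerce platforms, which is... well, I can't think of the right word. Either way, Alibaba has done its part for China's export trade, and that's that!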
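The fields collected are: rank, number of orders, reviews, feedback score, positive feedback, star rating, price, original price, and supplier. Yep, these 9 fields!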
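To make the record layout concrete, here is a minimal sketch of writing those nine fields with Python's standard `csv` module. It targets Python 3, the English column names are assumed labels, `write_rows` is a hypothetical helper, and the sample values are made-up placeholders rather than scraped data.

```python
import csv
import io

# Column order follows the field list above; the English names are assumed labels.
FIELDS = ["rank", "orders", "reviews", "feedback_score", "positive_feedback",
          "star_rating", "price", "original_price", "supplier"]

def write_rows(handle, rows):
    """Write a header line plus one CSV line per record (a dict keyed by FIELDS)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Placeholder record for illustration only; not real AliExpress data.
sample = {
    "rank": "1", "orders": "1234", "reviews": "567",
    "feedback_score": "8901", "positive_feedback": "97.5",
    "star_rating": "4.8", "price": "US $9.99",
    "original_price": "US $12.99", "supplier": "Example Store, Ltd.",
}

buffer = io.StringIO()
write_rows(buffer, [sample])
print(buffer.getvalue())
```

One reason to consider `csv` here: it quotes any value that contains a comma (such as the supplier name in the sample), so a stray comma cannot shift the columns.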
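Note: a fair amount of data can be collected without changing IPs, so I didn't add rotation of IPs, UAs, cookies, and the like. You could also put that down to being a bit lazy.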
#encoding=utf-8
# Python 2 script: it relies on the print statement and urllib.quote_plus.
import re,requests,threading,urllib

open_file=open('aliexpress_data.csv','a')
open_file.write('rank,orders,reviews,feedback_score,positive_feedback,star_rating,price,original_price,supplier\n')
write_lock=threading.Lock()  # serializes CSV writes across threads

# One match per product listing; the nine groups are unpacked in get_data().
c=re.compile(
    r'<span class="value" itemprop="price">(.*?)</span>[\s|\S]*?'
    r'<del class="original-price">(.*?)</del>[\s|\S]*?'
    r'<span class="star star-s" title="Star Rating:(.*?)" >[\s|\S]*?'
    r'<a class="rate-num " title="Feedback\((.*?)\)"[\s|\S]*?'
    r'rel="nofollow" ><em title="Total Orders"> Orders \((.*?)\)</em></a>[\s|\S]*?'
    r'<a href=".*?>(.*?)</a>[\s|\S]*?'
    r'<a id="talkId(.*?)" class="atm16 atm-link"[\s|\S]*?'
    r'<a class="score-dot".*?feedBackScore="(.*?)" sellerPositiveFeedbackPercentage="(.*?)"></span></a>')

# Build one search URL per keyword in hotword.txt.
target=[]
with open('hotword.txt') as f:
    for word in f:
        word=word.strip()  # drop the trailing newline so it is not URL-encoded
        if word:
            target.append('https://www.aliexpress.com/af/%s.html'%urllib.quote_plus(word))

def search(data):
    # Strip thousands separators so the value does not split a CSV column.
    return data.replace(',','')

def finddata(html):
    # Returns a list of 9-tuples, or an empty list when nothing matches.
    return re.findall(c,html)

class aliexpress(threading.Thread):
    def __init__(self, target):
        threading.Thread.__init__(self)
        self.target=target

    def run(self):
        self.get_data()

    def get_data(self):
        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
        for url in self.target:
            try:
                html=requests.get(url,headers=headers,timeout=20).content
            except requests.RequestException:
                continue  # skip this URL instead of killing the thread
            # print html
            data=finddata(html)
            for price,original_price,Star_Rating,Feedback,Orders,store,rank,feedBackScore,Positive_Feedback in data:
                Orders=search(Orders)
                Feedback=search(Feedback)
                feedBackScore=search(feedBackScore)
                print rank,Orders,Feedback,feedBackScore,Positive_Feedback,Star_Rating,price,original_price,store
                with write_lock:
                    open_file.write('%s,%s,%s,%s,%s,%s,%s,%s,%s\n'%(rank,Orders,Feedback,feedBackScore,Positive_Feedback,Star_Rating,price,original_price,store))

if __name__ == '__main__':
    start_working=[]
    threads=10
    chunk=(len(target)+threads-1)//threads  # URLs per thread, rounded up
    for i in range(threads):
        get_target=aliexpress(target[chunk*i:chunk*(i+1)])
        start_working.append(get_target)
    for worker in start_working:
        worker.start()
    for worker in start_working:
        worker.join()
    open_file.close()
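The main block hands each of the 10 threads a slice of the URL list, with the slice size rounded up so no URL is left over. The sketch below isolates just that arithmetic so it is easier to check by hand. It is written for Python 3, `split_into_chunks` is a hypothetical helper name that does not appear in the script, and the URLs are placeholders.

```python
def split_into_chunks(items, workers):
    """Split items into `workers` consecutive slices of (at most) equal size."""
    # Ceiling division: the smallest size such that size * workers >= len(items).
    size = (len(items) + workers - 1) // workers
    return [items[size * i:size * (i + 1)] for i in range(workers)]

# Placeholder URLs, only there to show how 23 items spread over 10 workers.
urls = ["url%d" % n for n in range(23)]
chunks = split_into_chunks(urls, 10)
print([len(chunk) for chunk in chunks])
# prints [3, 3, 3, 3, 3, 3, 3, 2, 0, 0]
```

Note that rounding up means the last few threads may get an empty slice when there are only a few keywords. That is harmless here, since a thread with nothing to fetch simply exits.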
Is that all? Yep, you've reached the end. Still no Easter egg? This time there can be one, hehe!
Here comes the Easter egg!