Scraping AliExpress SERP data with Python

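AliExpress is Alibaba Group's cross-border B2C platform, aimed mainly at small and medium-sized sellers. Its traffic is probably about the same as Alibaba.com (the international site), and plenty of small and medium-sized businesses in China sell on it. It is also closely tied to Alibaba.com and started out free, you get the idea... Together, AliExpress and Alibaba.com take on Amazon, eBay and the other top overseas e-commerce platforms, and they are still... well, I can't think of the right word. Anyway, Alibaba has done its bit for China's export trade, and that's that!
The fields collected are: rank, order count, reviews, feedback score, positive feedback, star rating, price, original price, and supplier. That's nine fields, do with them what you will!
Note: you can scrape quite a lot of data without changing IP, so I didn't add rotation of IPs, User-Agents, cookies and the like. You could also just call it laziness.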
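For readers who do want the rotation that was left out, here is a minimal sketch of picking a random User-Agent per request. It is written for Python 3, whereas the script in this post is Python 2. The `USER_AGENTS` pool and the `random_headers` helper are hypothetical names made up for illustration; only the first User-Agent string comes from the script itself.

```python
import random

# Hypothetical pool of User-Agent strings (an assumption for illustration).
# Only the first entry is the one the script in this post actually sends.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/601.4.4 (KHTML, like Gecko) Version/9.0.3 Safari/601.4.4',
]


def random_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {'User-Agent': rng.choice(USER_AGENTS)}


# Build a fresh header dict for each request.
headers = random_headers()
```

The result can be passed as `headers=random_headers()` to `requests.get`. Rotating IPs would presumably go through the `proxies` argument of `requests.get` in a similar way, but that is not shown here.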

# encoding=utf-8
# Python 2 script (it uses the print statement and urllib.quote_plus).
import re
import threading
import urllib

import requests


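# Opened in append mode, so the header row is written again on every run.
open_file=open('aliexpress_data.csv','a')
# Header: rank, orders, reviews, feedback score, positive feedback, star rating, price, original price, supplier
open_file.write('排名,订单数,评论,反馈评分,正面反馈,星级,价格,原价格,供应商\n')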

# Capture groups, in order: price, original price, star rating, feedback count, orders,
# store name, talkId suffix (used as the rank), feedback score, positive feedback percentage.
c=re.compile(r'<span class="value" itemprop="price">(.*?)</span>[\s\S]*?<del class="original-price">(.*?)</del>[\s\S]*?<span class="star star-s" title="Star Rating:(.*?)" >[\s\S]*?<a class="rate-num " title="Feedback\((.*?)\)"[\s\S]*?rel="nofollow" ><em title="Total Orders"> Orders \((.*?)\)</em></a>[\s\S]*?<a href=".*?>(.*?)</a>[\s\S]*?<a id="talkId(.*?)" class="atm16 atm-link"[\s\S]*?<a class="score-dot".*?feedBackScore="(.*?)" sellerPositiveFeedbackPercentage="(.*?)"></span></a>')

target=[]
with open('hotword.txt') as f:
    for word in f:
        word=word.strip()  # drop the trailing newline so it is not URL-encoded as %0A
        if word:  # skip blank lines
            target.append('http://www.aliexpress.com/af/%s.html'%urllib.quote_plus(word))



def search(data):
    # Strip thousands separators ("1,234" -> "1234") so the value does not split into extra CSV columns
    return data.replace(',','')


def finddata(html):
    # Returns a list of 9-tuples (one per listing), or an empty list when nothing matches
    return re.findall(c,html)


class aliexpress(threading.Thread):
    # Shared by all threads so that rows written by different workers are not interleaved
    write_lock=threading.Lock()

    def __init__(self, target):
        threading.Thread.__init__(self)
        self.target=target

    def run(self):
        self.get_data()

    def get_data(self):
        headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}
        for url in self.target:
            try:
                html=requests.get(url,headers=headers,timeout=20).content
            except requests.RequestException as e:
                # One failed request should not stop the rest of this thread's URLs
                print 'request failed: %s (%s)'%(url,e)
                continue
            data=finddata(html)

            for price,original_price,Star_Rating,Feedback,Orders,store,rank,feedBackScore,Positive_Feedback in data:
                # Remove thousands separators from the numeric fields
                Orders=search(Orders)
                Feedback=search(Feedback)
                feedBackScore=search(feedBackScore)

                print rank,Orders,Feedback,feedBackScore,Positive_Feedback,Star_Rating,price,original_price,store
                with self.write_lock:
                    open_file.write('%s,%s,%s,%s,%s,%s,%s,%s,%s\n'%(rank,Orders,Feedback,feedBackScore,Positive_Feedback,Star_Rating,price,original_price,store))

if __name__ == '__main__':
    start_working=[]
    threads=10
    # Ceiling division: number of URLs per thread, so that every URL is assigned to some thread
    chunk=(len(target)+threads-1)//threads
    for i in range(threads):
        get_target=aliexpress(target[chunk*i:chunk*(i+1)])
        start_working.append(get_target)
    for worker in start_working:
        worker.start()
    for worker in start_working:
        worker.join()
    # All threads are done, so flush and close the output file
    open_file.close()
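
The slicing in the main block is easier to follow with concrete numbers. Below is a small standalone sketch (Python 3, unlike the Python 2 script above) of the same ceiling-division split. The helper name `split_evenly` and the 23 placeholder URLs are made up for illustration.

```python
def split_evenly(items, threads):
    """Split items into `threads` consecutive slices using ceiling division."""
    chunk = (len(items) + threads - 1) // threads  # ceil(len(items) / threads)
    return [items[chunk * i:chunk * (i + 1)] for i in range(threads)]


# 23 placeholder URLs spread over 10 threads: chunk = (23 + 9) // 10 = 3
urls = ['url%d' % n for n in range(23)]
parts = split_evenly(urls, 10)
print([len(p) for p in parts])  # [3, 3, 3, 3, 3, 3, 3, 2, 0, 0]
```

Note that the last threads can end up with an empty slice. That is harmless here, since a thread with nothing to fetch simply finishes at once.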

That's it? Yep, you've reached the end. Still no Easter egg? This time there can be one, hehe!
Here comes the Easter egg!

Easter egg
