Python crawler: saving the Douban TOP250 movie posters and renaming them

Time: 2019-02-28 07:15:20

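1. Spider code: note that the XPath for title and star differs from the XPath for pic. The first two sit under info, the last under pic. The for loop walks the page item by item; each iteration extracts the title, star, and image information for one item (one movie) and yields it once, so that it can be processed in the pipeline. After all items on the page have been handled, the spider finds the link to the next page and calls parse again to process it.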

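# -*- coding: utf-8 -*-
import scrapy

from douban.items import DoubanItem


class Douban250Spider(scrapy.Spider):
    name = "douban250"
    # allowed_domains = [/]
    # The scheme and host are missing in the source; only the path survives.
    start_urls = ["/top250"]

    def parse(self, response):
        # Each div.item block is one movie.
        for sel in response.xpath('//div[@class="item"]'):
            item = DoubanItem()
            # Title and rating sit under div.info; the poster sits under div.pic.
            item["title"] = sel.xpath(
                'div[@class="info"]/div[@class="hd"]/a/span/text()'
            ).extract()[0]
            item["star"] = sel.xpath(
                'div[@class="info"]/div[@class="bd"]/div[@class="star"]'
                '/span[@class="rating_num"]/text()'
            ).extract()[0]
            item["image_urls"] = sel.xpath('div[@class="pic"]/a/img/@src').extract()
            yield item

        # extract_first() returns None on the last page instead of raising IndexError.
        next_page = response.xpath(
            '//div[@class="paginator"]/span[@class="next"]/a/@href'
        ).extract_first()
        if next_page:
            # The href is relative, so resolve it against the current page URL.
            next_url = response.urljoin(next_page.strip())
            yield scrapy.Request(next_url, callback=self.parse, dont_filter=True)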
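
The pagination step depends on how the relative href of the "next" link is combined with the page URL. Here is a minimal standard-library sketch of that resolution, assuming the href has a query-only form such as `?start=25&filter=`. The host `https://example.com` is a placeholder, since the real host is missing from the listing, and `next_page_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the 'next' link's href against the URL of the current page.

    Returns None when there is no next link (href is None or blank),
    which is how a crawl would know it has reached the last page.
    """
    if not href or not href.strip():
        return None
    return urljoin(current_url, href.strip())


if __name__ == "__main__":
    base = "https://example.com/top250"  # placeholder host, an assumption
    print(next_page_url(base, "?start=25&filter="))
    print(next_page_url(base, None))
```

Inside a spider, `response.urljoin(href)` does the same kind of resolution against the response URL, so the base address does not have to be hard-coded.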

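2. Settings file: specifies the pipelines. There are two here, one that handles the text and one that handles the images, and a random proxy is also configured.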
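
A minimal sketch of the kind of settings that description implies, assuming Scrapy's built-in `ImagesPipeline` handles the posters (it reads the `image_urls` field that the spider fills in, and requires Pillow). The names `douban.pipelines.DoubanTextPipeline` and `douban.middlewares.RandomProxyMiddleware`, the priority numbers, and the `posters` directory are hypothetical placeholders, not values from the original file.

```python
# Sketch of douban/settings.py entries; every value below is an assumption.
BOT_NAME = "douban"

# Lower numbers run first; each enabled pipeline receives every yielded item.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,  # built-in image downloader
    "douban.pipelines.DoubanTextPipeline": 300,  # hypothetical text pipeline
}

# Directory where ImagesPipeline stores the downloaded files.
IMAGES_STORE = "posters"

# Hypothetical middleware that would assign a randomly chosen proxy per request.
DOWNLOADER_MIDDLEWARES = {
    "douban.middlewares.RandomProxyMiddleware": 543,
}
```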

# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     /en/latest/topics/settings.html
#     /en/la
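
The source is cut off at this point, before the renaming step named in the title appears. As a rough sketch of one way to do it, assume the posters were saved by Scrapy's default `ImagesPipeline`, which stores each file as `full/<sha1 of the image URL>.jpg` under the store directory. The helper names `safe_name` and `rename_poster` are hypothetical, and this is not the author's pipeline.

```python
import hashlib
import os
import re
import tempfile


def safe_name(title):
    """Replace whitespace and characters that common filesystems reject."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return cleaned or "untitled"


def rename_poster(store_dir, image_url, title):
    """Rename full/<sha1(url)>.jpg to full/<title>.jpg and return the new path."""
    digest = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    old_path = os.path.join(store_dir, "full", digest + ".jpg")
    new_path = os.path.join(store_dir, "full", safe_name(title) + ".jpg")
    os.replace(old_path, new_path)  # overwrites an existing file with that name
    return new_path


if __name__ == "__main__":
    # Simulate one downloaded poster inside a temporary store directory.
    with tempfile.TemporaryDirectory() as store:
        url = "https://example.com/poster/1.jpg"  # placeholder URL
        os.makedirs(os.path.join(store, "full"))
        hashed = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".jpg"
        open(os.path.join(store, "full", hashed), "wb").close()
        print(os.path.basename(rename_poster(store, url, "肖申克的救赎")))
```

An alternative is to override `file_path` in an `ImagesPipeline` subclass so that each file is written under the desired name in the first place, which avoids the second pass over the directory.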
