
Learning Web Scraping: Trying to Crawl a Novel Site

Date: 2022-04-28 02:17:07


After a first pass at learning Scrapy, I tried crawling every novel on a novel site (once I saw it was working, I stopped the crawl).

Below is the basic Spider version written with Scrapy:

# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.shell import inspect_response  # for interactive debugging
from xiaoshuo.items import XiaoshuoItem


class Ba9dxSpider(scrapy.Spider):
    name = 'ba9dx'
    # NOTE: the domain/URL strings were stripped when this post was republished;
    # the regex in parse_2 suggests the site is https://www.9dxs.com/
    allowed_domains = ['']
    start_urls = ['/xuanhuan_1.html']
    url = '/'

    def parse(self, response):
        # Follow every novel link on the category listing page
        txtlist = response.xpath('//p[@class="articlename"]/a/@href').extract()
        for each in txtlist:
            nextpage = self.url + each.split('/', maxsplit=2)[-1]
            yield scrapy.Request(nextpage, callback=self.parse_1)
        # Follow the "next page" link of the listing
        next_page = response.xpath('//div[@class="pagelink"]/a[@class="next"]/@href').extract_first()
        if next_page is not None:
            nextpage = self.url + next_page.split('/')[-1]
            yield scrapy.Request(nextpage, callback=self.parse)

    def parse_1(self, response):
        # inspect_response(response, self)
        # Novel front page -> link to its chapter index, e.g. /2/2773/index.html
        pageurl = response.xpath('//div[@class="top"]/a/@href').extract_first()
        pageurl = '' + pageurl
        yield scrapy.Request(pageurl, callback=self.parse_2)

    def parse_2(self, response):
        # inspect_response(response, self)
        # Chapter index -> every chapter URL
        pagelist = response.xpath('//div[@class="chapterlist"]/ul/li/a/@href').extract()
        for each in pagelist:
            chapter_url = re.findall(r'https://www\.9dxs\.com/\d+?/\d+?/', response.url)[0] + each
            yield scrapy.Request(chapter_url, callback=self.parse_3)

    def parse_3(self, response):
        # inspect_response(response, self)
        item = XiaoshuoItem()
        item['title'] = response.xpath('//h1/text()').extract_first().split(' ', maxsplit=1)[-1]
        item['text'] = response.xpath('string(//div[@id="content"])').extract_first()
        yield item
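The spider imports XiaoshuoItem from xiaoshuo.items, which the post never shows. Here is a minimal sketch of what that items.py could look like, assuming it only needs the two fields used above (my reconstruction, not the author's file):

# xiaoshuo/items.py -- hypothetical reconstruction, not shown in the original post
import scrapy


class XiaoshuoItem(scrapy.Item):
    title = scrapy.Field()  # chapter title, filled in parse_3
    text = scrapy.Field()   # chapter body text, filled in parse_3

With the project in place, the spider runs with something like scrapy crawl ba9dx -o chapters.jl, and each yielded item passes through whatever item pipelines are enabled in settings.py.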

I also tried writing this with CrawlSpider, since its built-in link following makes the code much more concise. In practice, though, I hit a strange problem: for the same pages, the regex matched the links fine when I downloaded the HTML and ran it by hand, but during the crawl the same regex found nothing. I am posting the code anyway in the hope that someone more experienced can point out the mistake.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from xiaoshuo.items import XiaoshuoItem


class A9dxsSpider(CrawlSpider):
    name = 'a9dxs'
    # Site strings stripped from the post, as in the spider above
    allowed_domains = ['']
    start_urls = ['/xuanhuan_1.html']

    # Novel links on the main listing page
    # (the '/\./' fragments below look garbled by the repost and likely contained the site domain)
    pagelist1 = LinkExtractor(allow=(r'/\./\d+/\d+/'))
    # Pagination links on the main listing page
    pagelist2 = LinkExtractor(allow=(r'/\./xuanhuan_\d+\.html'))

    rules = [
        Rule(pagelist1, callback='parse_item', follow=True),
        Rule(pagelist2, process_links='process_link1', follow=True),
    ]

    def process_link1(self, links):
        # process_links receives a list of Link objects and must return one
        for link in links:
            link.url = '/' + link.url.split('/')[-1]
            print(link.url)
        return links

    def parse_item(self, response):
        url = response.xpath('//div[@class="top"]/a/@href').extract_first()
        url = response.urljoin(url)
        print(response.body)
        # callbacks passed to scrapy.Request must be callables, not strings
        yield scrapy.Request(url, callback=self.parse_next)

    def parse_next(self, response):
        chapterlist = response.xpath('//div[@class="chapterlist"]/ul/li/a/@href').extract()
        for each in chapterlist:
            chapter = response.urljoin(each)
            yield scrapy.Request(chapter, callback=self.parse_text)

    def parse_text(self, response):
        item = XiaoshuoItem()
        item['title'] = response.xpath('//h1/text()').extract_first().split(' ', maxsplit=1)[-1]
        item['text'] = response.xpath('string(//p)').extract_first()
        yield item
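For the mismatch described above, one way to narrow it down is to run the LinkExtractor by hand against a fetched page, for example inside scrapy shell, and compare its output with a raw regex over the HTML. One thing to keep in mind is that LinkExtractor applies the allow patterns to the absolutized link URLs, not to the raw href text, so a pattern written against a relative href can silently match nothing. A rough debugging sketch follows; the pattern at the bottom is a placeholder, since the real one is garbled in this post:

# Run inside `scrapy shell <listing-page-url>`; `response` is provided by the shell.
import re
from scrapy.linkextractors import LinkExtractor


def debug_extractor(response, pattern):
    """Compare what LinkExtractor extracts with what a raw regex finds in the HTML."""
    le = LinkExtractor(allow=(pattern,))
    from_extractor = [link.url for link in le.extract_links(response)]
    from_regex = re.findall(pattern, response.text)
    print('LinkExtractor matched:', from_extractor)
    print('Raw regex matched:    ', from_regex)
    return from_extractor, from_regex


# Example call with a placeholder pattern -- substitute the real one:
# debug_extractor(response, r'/\d+/\d+/')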
