Web Crawlers (6): The Scrapy Crawler Framework

Project directory overview

Create a new directory, then hold Shift, right-click, and choose "Open command window here".

Run: scrapy startproject <project name>

The generated directory structure looks like this:

|- your project name
    |- scrapy.cfg
    |- __init__.py
    |- items.py
    |- middlewares.py
    |- pipelines.py
    |- settings.py
    |- spiders
        |- __init__.py

What each file does:

scrapy.cfg: the project configuration file

spiders: holds your Spider files, i.e. the .py files that do the actual crawling

items.py: acts as a container for the scraped data, much like a dictionary

middlewares.py: where the Downloader Middlewares and Spider Middlewares are implemented

pipelines.py: where Item Pipelines are implemented, for cleaning, storing, and validating the data

settings.py: global project settings

Crawling Qidian (起点中文网) with Scrapy

First, run scrapy genspider xs "" from the project directory to create a spider (your own crawler file) under the spiders directory. Here is the finished spider:

# -*- coding: utf-8 -*-
# This file was created by running the scrapy genspider xs "" command in the project directory.
import scrapy
from python16.items import Python16Item


class XsSpider(scrapy.Spider):
    # The spider's name
    name = 'xs'
    # Domains the spider is allowed to crawl; links on the page that point to
    # other domains are ignored.
    # (The domain string was stripped in the source of this article.)
    allowed_domains = ['']
    # The initial request URLs for the spider; more than one can be listed.
    # (The scheme and host were stripped in the source of this article.)
    start_urls = ['/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1']

    # Called after the start_urls have been requested; this is where the page
    # is parsed and the data we want is extracted.
    # response: the content returned for the request, i.e. the page to parse.
    def parse(self, response):
        # To inspect the raw page body:
        # print(response.body.decode("UTF-8"))

        # Parse the page with XPath (the response object provides this method)
        lis = response.xpath("//ul[contains(@class,'all-img-list')]/li")
        for li in lis:
            item = Python16Item()
            # A text() query in Scrapy returns selectors like
            # [<Selector xpath=".//div[@class='book-mid-info']/h4/a/text()" data='圣墟'>],
            # so we call .extract_first() to pull out the first value.
            name = li.xpath(".//div[@class='book-mid-info']/h4/a/text()").extract_first()
            author = li.xpath(".//div[@class='book-mid-info']/p[@class='author']/a/text()").extract_first()
            # Strip surrounding whitespace
            content = str(li.xpath("./div[@class='book-mid-info']/p[@class='intro']/text()").extract_first()).strip()
            item['name'] = name
            item['author'] = author
            item['content'] = content
            # Hand each item to the pipeline one at a time (never yield a whole
            # collection: keeping everything in memory hurts performance).
            yield item

        # Follow the "next page" link
        nextUrl = response.xpath("//a[contains(@class, 'lbf-pagination-next')]/@href").extract_first()
        if nextUrl != "javascript:;":
            yield scrapy.Request(url="http:" + nextUrl, callback=self.parse)
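If you want to experiment with the XPath expressions outside of a running crawl, here is a minimal sketch using parsel, the selector library behind response.xpath (the inline HTML below is a made-up fragment for illustration, not the real Qidian markup):

from parsel import Selector  # parsel ships with Scrapy and offers the same .xpath() API

# A made-up fragment shaped like the list the spider scrapes
html = """
<ul class="all-img-list">
  <li>
    <div class="book-mid-info">
      <h4><a>圣墟</a></h4>
      <p class="author"><a>辰东</a></p>
      <p class="intro">  简介文字  </p>
    </div>
  </li>
</ul>
"""

sel = Selector(text=html)
for li in sel.xpath("//ul[contains(@class,'all-img-list')]/li"):
    name = li.xpath(".//div[@class='book-mid-info']/h4/a/text()").extract_first()
    author = li.xpath(".//div[@class='book-mid-info']/p[@class='author']/a/text()").extract_first()
    content = str(li.xpath("./div[@class='book-mid-info']/p[@class='intro']/text()").extract_first()).strip()
    print(name, author, content)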

Defining the Item

An Item is the container that holds the scraped data, and it is used much like a dictionary.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# /en/latest/topics/items.html

import scrapy


# Works much like a simple entity/model class
class Python16Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
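To illustrate how dictionary-like an Item really is, here is a throwaway sketch (not part of the project files, to be run from inside the project):

from python16.items import Python16Item

item = Python16Item()
item['name'] = '圣墟'        # assign declared fields like dict keys
item['author'] = '辰东'
print(item['name'])          # read them back the same way
print(dict(item))            # an Item converts cleanly to a plain dict
# item['rating'] = 5         # would raise KeyError: only declared Fields are allowed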

pipelines.py is where we process the scraped data:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: /en/latest/topics/item-pipeline.html
import csv


# Writes the scraped items out to a CSV file
class Python16Pipeline(object):
    def __init__(self):
        # newline="" prevents an extra blank line between rows when writing CSV
        self.f = open("起点中文网.csv", "w", newline="")
        self.writer = csv.writer(self.f)
        self.writer.writerow(["书名", "作者", "简介"])

    # Process each item
    def process_item(self, item, spider):
        # Save the item as a CSV row
        name = item["name"]
        author = item["author"]
        content = item["content"]
        self.writer.writerow([name, author, content])
        # If another pipeline follows, the value returned here is what its
        # process_item receives as item; return nothing and it would get None.
        return item
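Note that the file opened in __init__ is never explicitly closed. A variant sketch (the class name and the utf-8-sig encoding are my own choices here, not the original code) uses the optional open_spider/close_spider hooks that Scrapy calls on item pipelines:

import csv


class CsvFilePipeline(object):
    # Called once when the spider starts
    def open_spider(self, spider):
        # utf-8-sig is an assumption: it keeps the Chinese headers readable in Excel
        self.f = open("起点中文网.csv", "w", newline="", encoding="utf-8-sig")
        self.writer = csv.writer(self.f)
        self.writer.writerow(["书名", "作者", "简介"])

    # Called once when the spider finishes, so the file is flushed and closed
    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        self.writer.writerow([item["name"], item["author"], item["content"]])
        return item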

Below is the code from settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for python16 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# /en/latest/topics/settings.html
# /en/latest/topics/downloader-middleware.html
# /en/latest/topics/spider-middleware.html

BOT_NAME = 'python16'

SPIDER_MODULES = ['python16.spiders']
NEWSPIDER_MODULE = 'python16.spiders'

# Log output level
LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# (this is where you configure your own User-Agent value)
#USER_AGENT = 'python16 (+)'

# Obey robots.txt rules (whether the crawl should respect the site's robots.txt)
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers (these defaults can be replaced here):
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See /en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'python16.middlewares.Python16SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See /en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'python16.middlewares.Python16DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See /en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See /en/latest/topics/item-pipeline.html
# Pipeline priority (e.g. 300 below): the smaller the value, the higher the
# priority (valid range 0-1000).
# Pipelines are disabled by default, so this setting has to be enabled.
# Several pipelines can be configured here, and the priorities decide the order
# they run in; whatever a pipeline returns is passed as the item argument to
# the next pipeline's process_item.
ITEM_PIPELINES = {
    'python16.pipelines.Python16Pipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See /en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See /en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
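To make the priority comment concrete: if the project had a second pipeline (CleanTextPipeline below is hypothetical), the lower number would run first and its return value would be the item the next pipeline receives.

ITEM_PIPELINES = {
    'python16.pipelines.CleanTextPipeline': 200,   # hypothetical: runs first (200 < 300)
    'python16.pipelines.Python16Pipeline': 300,    # receives whatever CleanTextPipeline returns
}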

Testing:

You can now run the spider from the command line with scrapy crawl xs.

That is not the most convenient way to launch it, though. Instead, you can add a start.py file to the project and start the crawler by running that file:

from scrapy import cmdline

# cmdline.execute expects the command as a list of arguments, hence the split()
cmdline.execute("scrapy crawl xs".split())
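An alternative sketch, assuming you prefer Scrapy's CrawlerProcess API over going through cmdline:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py so the pipelines and other settings still apply
process = CrawlerProcess(get_project_settings())
process.crawl("xs")   # the spider is looked up by its name
process.start()       # blocks until the crawl finishes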
