
Python crawlers: a close look at the Scrapy request parameters meta, headers, and cookies


Scrapy's request parameters come up all the time, but I had never looked into them closely.

Today I'll dig into the three important parameters carried by a Scrapy request: headers, cookies, and meta.

Native parameters

First, create a new project named myscrapy, and inside it a spider named my_spider.

We'll test the request parameters by visiting /get.

Run the spider:

# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['']
    start_urls = ['/get']

    def parse(self, response):
        self.write_to_file("*" * 40)
        self.write_to_file("response text: %s" % response.text)
        self.write_to_file("response headers: %s" % response.headers)
        self.write_to_file("response meta: %s" % response.meta)
        self.write_to_file("request headers: %s" % response.request.headers)
        self.write_to_file("request cookies: %s" % response.request.cookies)
        self.write_to_file("request meta: %s" % response.request.meta)

    def write_to_file(self, words):
        with open("logging.log", "a") as f:
            f.write(words + "\n")  # append a newline so the entries don't run together


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())

The information saved to the file:

response text: {"args":{},"headers":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Encoding":"gzip,deflate","Accept-Language":"en","Connection":"close","Host":"","User-Agent":"Scrapy/1.5.1 (+)"},"origin":"223.72.90.254","url":"/get"}
response headers: {b'Server': [b'gunicorn/19.8.1'], b'Date': [b'Sun, 22 Jul 10:03:15 GMT'], b'Content-Type': [b'application/json'], b'Access-Control-Allow-Origin': [b'*'], b'Access-Control-Allow-Credentials': [b'true'], b'Via': [b'1.1 vegur']}
response meta: {'download_timeout': 180.0, 'download_slot': '', 'download_latency': 0.5500118732452393}
request headers: {b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.5.1 (+)'], b'Accept-Encoding': [b'gzip,deflate']}
request cookies: {}
request meta: {'download_timeout': 180.0, 'download_slot': '', 'download_latency': 0.5500118732452393}

meta

Comparing the output above, the response's meta and the request's meta are identical. That is exactly what meta does: information attached to the request is carried along and handed to the response.

Let's modify the code and test the hand-off:

# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['']
    start_urls = ['/get']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, meta={"uid": "this is uid of meta"})

    def parse(self, response):
        print("request meta: %s" % response.request.meta.get("uid"))
        print("response meta: %s" % response.meta.get("uid"))

The output:

request meta: this is uid of meta
response meta: this is uid of meta

So both ways of reading the request's meta work. meta behaves like a dict: values are looked up by key, just as with a dict.
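Beyond inspecting it, the most common use of meta is to carry a partially built item from one callback to the next. A minimal sketch (the URLs and field names here are placeholders for illustration):

from scrapy import Spider, Request


class DetailSpider(Spider):
    name = 'detail_spider'
    start_urls = ['/list']

    def parse(self, response):
        item = {"title": "scraped on the list page"}  # partial data from the list page
        # attach the partial item to the follow-up request
        yield Request("/detail", meta={"item": item}, callback=self.parse_detail)

    def parse_detail(self, response):
        item = response.meta["item"]  # the same dict, handed over via meta
        item["detail"] = response.text
        yield item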

Proxies, too, are configured through meta.

Here is an example of a proxy middleware:

import random

# fill in your own proxy addresses here
proxies = []


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(proxies)
        request.meta["proxy"] = proxy
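For the middleware to run, it also has to be enabled in settings.py. A sketch, where the module path and the priority value are assumptions matching the project layout used later in this article:

DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.ProxyMiddleware': 543,  # priority value is an arbitrary choice
}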

headers

Open Scrapy's default_settings module, reachable via this import path:

from scrapy.settings import default_settings

It contains the following:

USER_AGENT = 'Scrapy/%s (+)' % import_module('scrapy').__version__

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
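Both values can be overridden per project in settings.py; the values below are only illustrative:

# settings.py (example values, not recommendations)
USER_AGENT = 'my-crawler/1.0'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}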

Now let's change the request headers and see what the server reports back:

# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['']
    start_urls = ['/get']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, headers={"User-Agent": "Chrome"})

    def parse(self, response):
        logging.debug("*" * 40)
        logging.debug("response text: %s" % response.text)
        logging.debug("response headers: %s" % response.headers)
        logging.debug("request headers: %s" % response.request.headers)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())

The output:

response text: {"args":{},"headers":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Encoding":"gzip,deflate","Accept-Language":"en","Connection":"close","Host":"","User-Agent":"Chrome"},"origin":"122.71.64.121","url":"/get"}
response headers: {b'Server': [b'gunicorn/19.8.1'], b'Date': [b'Sun, 22 Jul 10:29:26 GMT'], b'Content-Type': [b'application/json'], b'Access-Control-Allow-Origin': [b'*'], b'Access-Control-Allow-Credentials': [b'true'], b'Via': [b'1.1 vegur']}
request headers: {b'User-Agent': [b'Chrome'], b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'Accept-Encoding': [b'gzip,deflate']}

Both the request headers and the headers the server received and echoed back show the new User-Agent, so the default User-Agent has indeed been replaced.

default_settings also enables the UserAgentMiddleware middleware by default:

'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,

Its source:

class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)

Reading the source carefully, it does nothing more than read and set the User-Agent, so we can write our own middleware modeled on it.

Here the fake_useragent library is used to pick a random User-Agent; for details, see:

/mouday/article/details/80476409

In middlewares.py, write the custom middleware:

from fake_useragent import UserAgent


class UserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = UserAgent()
        user_agent = ua.chrome
        request.headers.setdefault(b'User-Agent', user_agent)
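One optional refinement, not in the original article: UserAgent() loads its user-agent data when constructed, so building it once per middleware instance avoids redoing that work on every request. A sketch:

from fake_useragent import UserAgent


class UserAgentMiddleware(object):
    def __init__(self):
        self.ua = UserAgent()  # built once, reused for every request

    def process_request(self, request, spider):
        request.headers.setdefault(b'User-Agent', self.ua.chrome)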

In settings.py, replace the default middleware with your own:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myscrapy.middlewares.UserAgentMiddleware': 500,
}

The output:

request headers: {b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'], b'Accept-Encoding': [b'gzip,deflate']}

For more on setting Scrapy request headers, see my earlier article:

/mouday/article/details/80776030

cookies

The output above is missing response.cookies; if you add it, it raises an error:

AttributeError: 'TextResponse' object has no attribute 'cookies'

So the response does not carry a cookies attribute.
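The server-set cookies are not lost, though: they arrive in the Set-Cookie response headers, which is also where the CookiesMiddleware source shown later reads them from:

# there is no response.cookies, but the raw values are still available:
set_cookies = response.headers.getlist('Set-Cookie')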

Let's test cookies against /cookies:

# -*- coding: utf-8 -*-
from scrapy import Spider, Request
import logging


class MySpider(Spider):
    name = 'my_spider'
    allowed_domains = ['']
    start_urls = ['/cookies']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, cookies={"username": "pengshiyu"})

    def parse(self, response):
        logging.debug("*" * 40)
        logging.debug("response text: %s" % response.text)
        logging.debug("request headers: %s" % response.request.headers)
        logging.debug("request cookies: %s" % response.request.cookies)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl my_spider".split())

The result:

response text: {"cookies":{"username":"pengshiyu"}}
request headers: {b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.5.1 (+)'], b'Accept-Encoding': [b'gzip,deflate'], b'Cookie': [b'username=pengshiyu']}
request cookies: {'username': 'pengshiyu'}

The server received my cookie value, and the request headers contain the same cookie, stored under the key Cookie.

In fact there is no separate cookie channel at all: a request's cookies are wrapped into its headers and sent to the server that way.
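That wrapping amounts to serializing the dict into a single header line, roughly:

cookies = {"username": "pengshiyu", "password": "123456"}
header_value = "; ".join("%s=%s" % (k, v) for k, v in cookies.items())
print(header_value)  # username=pengshiyu; password=123456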

Given that, let's try putting a Cookie directly into headers:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, headers={"Cookie": {"username": "pengshiyu"}})

The result:

response text: {"cookies":{}}
request headers: {b'Accept': [b'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], b'Accept-Language': [b'en'], b'User-Agent': [b'Scrapy/1.5.1 (+)'], b'Accept-Encoding': [b'gzip,deflate']}
request cookies: {}

cookies is empty; setting it this way failed. (The middleware source below shows why: CookiesMiddleware pops any hand-written Cookie header and rebuilds the header from its cookie jar, so a Cookie placed straight into headers is discarded.)

Let's find the cookies middleware in default_settings:

'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700

class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('COOKIES_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings.getbool('COOKIES_DEBUG'))

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return

        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        cookies = self._get_request_cookies(jar, request)
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response

    def _debug_cookie(self, request, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in request.headers.getlist('Cookie')]
            if cl:
                cookies = "\n".join("Cookie: {}\n".format(c) for c in cl)
                msg = "Sending cookies to: {}\n{}".format(request, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _debug_set_cookie(self, response, spider):
        if self.debug:
            cl = [to_native_str(c, errors='replace')
                  for c in response.headers.getlist('Set-Cookie')]
            if cl:
                cookies = "\n".join("Set-Cookie: {}\n".format(c) for c in cl)
                msg = "Received cookies from: {}\n{}".format(response, cookies)
                logger.debug(msg, extra={'spider': spider})

    def _format_cookie(self, cookie):
        # build cookie string
        cookie_str = '%s=%s' % (cookie['name'], cookie['value'])

        if cookie.get('path', None):
            cookie_str += '; Path=%s' % cookie['path']

        if cookie.get('domain', None):
            cookie_str += '; Domain=%s' % cookie['domain']

        return cookie_str

    def _get_request_cookies(self, jar, request):
        if isinstance(request.cookies, dict):
            cookie_list = [{'name': k, 'value': v} for k, v in
                           six.iteritems(request.cookies)]
        else:
            cookie_list = request.cookies

        cookies = [self._format_cookie(x) for x in cookie_list]
        headers = {'Set-Cookie': cookies}
        response = Response(request.url, headers=headers)

        return jar.make_cookies(response, request)

Reading the source, a few methods stand out:

# process_request
jar.add_cookie_header(request)           # write the cookies into the request headers
# process_response
jar.extract_cookies(response, request)   # pull the cookies out of the response
# _debug_cookie
request.headers.getlist('Cookie')        # read the Cookie headers of the request
# _debug_set_cookie
response.headers.getlist('Set-Cookie')   # read the Set-Cookie headers of the response

And a few relevant parameters:

# settings
COOKIES_ENABLED
COOKIES_DEBUG

# meta
dont_merge_cookies
cookiejar

# headers
Cookie
Set-Cookie

Using the code from the start of the cookies section (with the other header fields trimmed from the output for clarity), let's test these one by one.
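For reference, the parse() used in these tests looks roughly like this, a sketch reconstructed from the outputs below (the cookiejar line accounts for the "request cookiejar" entries that appear later):

def parse(self, response):
    logging.debug("response text: %s" % response.text)
    logging.debug("request headers: %s" % response.request.headers)
    logging.debug("request cookies: %s" % response.request.cookies)
    logging.debug("request cookiejar: %s" % response.request.meta.get("cookiejar"))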

1. COOKIES_ENABLED

COOKIES_ENABLED = True (the default)

response text: {"cookies":{"username":"pengshiyu"}}
request headers: {b'Cookie': [b'username=pengshiyu']}
request cookies: {'username': 'pengshiyu'}

Everything works as expected.

COOKIES_ENABLED = False

response text: {"cookies":{}}
request headers: {}
request cookies: {'username': 'pengshiyu'}

Although request.cookies has content, it was never merged into the headers, so the server never received the cookie.

Note: to see the cookie that is actually sent, look in the request's headers.

2. COOKIES_DEBUG

COOKIES_DEBUG = False (the default)

DEBUG: Crawled (200) <GET /cookies> (referer: None)

COOKIES_DEBUG = True

One extra line is printed, showing the cookie I set:

[scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET /cookies>
Cookie: username=pengshiyu

And of course the server still receives my cookie normally in debug mode.

3. dont_merge_cookies

Set meta={"dont_merge_cookies": True} (it defaults to False).
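That is, the flag rides along in the request's meta:

yield Request(url,
              cookies={"username": "pengshiyu"},
              meta={"dont_merge_cookies": True})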

response text: {"cookies":{}}
request headers: {}
request cookies: {'username': 'pengshiyu'}

The server did not receive my cookie.

4. cookiejar

Read it directly via response.request.meta.get("cookiejar"):

response text: {"cookies":{"username":"pengshiyu"}}
request headers: {b'Cookie': [b'username=pengshiyu']}
request cookies: {'username': 'pengshiyu'}
request cookiejar: None

Nothing there, since we never set a cookiejar key.
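But cookiejar does have a real purpose, visible in the middleware source above: it is the key into self.jars, so a single spider can keep several independent cookie sessions. A sketch of that usage (both methods belong inside a spider class):

def start_requests(self):
    for i, url in enumerate(self.start_urls):
        # every distinct cookiejar value gets its own CookieJar
        yield Request(url, meta={'cookiejar': i})

def parse(self, response):
    # pass the same key along to keep follow-up requests in the same session
    yield Request('/cookies', meta={'cookiejar': response.meta['cookiejar']})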

5. Cookie

Read it directly: response.request.headers.get("Cookie")

headers Cookie: b'username=pengshiyu'

Evidently it has already been serialized into a bytestring here.

Change the Request's cookies parameter:

cookies={"username": "pengshiyu", "password": "123456"}

# response.request.headers.get("Cookie")
headers Cookie: b'username=pengshiyu; password=123456'

# request.headers.getlist('Cookie')
headers Cookies: [b'username=pengshiyu; password=123456']

Clearly the two accessors differ: get() returns a bytestring, getlist() returns a list.

6. Set-Cookie

Likewise, I tried the following:

response.headers.get("Set-Cookie")
response.headers.getlist("Set-Cookie")

Still nothing:

headers Set-Cookie: None
headers Set-Cookies: []

Still, up to this point, the rough flow of cookie handling looks like this:

request cookies: {'username': 'pengshiyu', 'password': '123456'}
request cookiejar: None
request Cookie: b'username=pengshiyu; password=123456'
response text: {"cookies":{"password":"123456","username":"pengshiyu"}}
response Set-Cookie: None
response Set-Cookies: []

7. Receiving cookies sent by the server

Change the request URL to /cookies/set/key/value

and turn on COOKIES_DEBUG.

The debug log shows the following changes:

Sending cookies to: <GET /cookies/set/key/value>
Cookie: username=pengshiyu; password=123456

Received cookies from: <302 /cookies/set/key/value>
Set-Cookie: key=value; Path=/

Redirecting (302) to <GET /cookies> from <GET /cookies/set/key/value>

Sending cookies to: <GET /cookies>
Cookie: key=value; username=pengshiyu; password=123456

The log shows two requests were made, with the cookies changing in between:

send -> receive -> send

The cookies sent on the second request include the cookie the server handed back on the first, which shows that Scrapy manages both the server's and the client's cookies.

The final cookie output:

request cookies: {'username': 'pengshiyu', 'password': '123456'}
request cookiejar: None
request Cookie: b'key=value; username=pengshiyu; password=123456'
response text: {"cookies":{"key":"value","password":"123456","username":"pengshiyu"}}
response Set-Cookie: None

request.cookies has not changed, while request.headers.get("Cookie") has.

8. Receiving a server cookie with the same key

Change the request URL to /cookies/set/username/pengpeng:

Sending cookies to: <GET /cookies/set/username/pengpeng>
Cookie: username=pengshiyu

Received cookies from: <302 /cookies/set/username/pengpeng>
Set-Cookie: username=pengpeng; Path=/

Redirecting (302) to <GET /cookies> from <GET /cookies/set/username/pengpeng>

Sending cookies to: <GET /cookies>
Cookie: username=pengshiyu

Although username=pengpeng was received, the second request again sent the original cookie username=pengshiyu.

This shows that a cookie set on the client side takes priority over one sent by the server: process_request merges request.cookies back into the jar on every request (via _get_request_cookies above), so the spider's value overwrites the server's.

9. Disabling the CookiesMiddleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
}

Request URL: /cookies

request cookies: {'username': 'pengshiyu'}
request cookiejar: None
request Cookie: None
response text: {"cookies":{}}
response Set-Cookie: None
response Set-Cookies: []

The effect is similar to COOKIES_ENABLED = False.

10. A custom cookie pool

import random


class RandomCookiesMiddleware(object):
    def process_request(self, request, spider):
        cookies = []  # fill in your own cookie dicts
        cookie = random.choice(cookies)
        request.cookies = cookie

It likewise needs to be registered in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myscrapy.middlewares.RandomCookiesMiddleware': 600,
}

Note that Scrapy's own CookiesMiddleware sits at priority 700. For the custom cookies to take effect, they must be set before that middleware runs; process_request hooks execute in ascending order of priority, so our custom cookie middleware needs a value lower than 700:

'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,

Summary

The commonly used middlewares, collected together:

import random

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = UserAgent()
        user_agent = ua.chrome
        request.headers.setdefault(b'User-Agent', user_agent)


class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        proxies = []  # fill in your own proxies
        proxy = random.choice(proxies)
        request.meta["proxy"] = proxy


class RandomCookiesMiddleware(object):
    def process_request(self, request, spider):
        cookies = []  # fill in your own cookies
        cookie = random.choice(cookies)
        request.cookies = cookie

Of course, the cookies and proxies lists need to be filled in to suit your own situation.
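For illustration only, filled-in lists might look like this (all values are made up):

proxies = [
    "http://127.0.0.1:8000",
    "http://127.0.0.1:8001",
]

cookies = [
    {"username": "user1", "token": "aaa"},
    {"username": "user2", "token": "bbb"},
]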
