Study notes from Cui Qingcai's web-scraping course.

I. Basic usage

1. Downloader Middleware documentation (Chinese translation, 0.25.0)

http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/downloader-middleware.html

2. Downloader middleware

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.
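
To make the "hook" idea concrete, here is a toy model in plain Python. It does not use Scrapy and is not how Scrapy is implemented; the names `AddHeader`, `UpperBody`, `fake_download` and `fetch` are made up for illustration. It only shows the shape of the idea: each middleware sees the request on the way to the downloader and the response on the way back.

```python
class AddHeader:
    """Toy middleware: stamps every outgoing request with a header."""

    def process_request(self, request):
        request["headers"]["X-Demo"] = "1"

    def process_response(self, request, response):
        return response


class UpperBody:
    """Toy middleware: rewrites every incoming response body."""

    def process_request(self, request):
        return None

    def process_response(self, request, response):
        response["body"] = response["body"].upper()
        return response


def fake_download(request):
    # Stand-in for the real downloader: echoes back the headers it received.
    return {"status": 200, "body": "hello", "seen_headers": dict(request["headers"])}


def fetch(request, middlewares):
    # Requests pass through the chain in order...
    for mw in middlewares:
        mw.process_request(request)
    response = fake_download(request)
    # ...and responses travel back through it in reverse order.
    for mw in reversed(middlewares):
        response = mw.process_response(request, response)
    return response


response = fetch({"url": "http://example.com", "headers": {}}, [AddHeader(), UpperBody()])
print(response)
```

Because the hooks sit between the caller and the downloader, neither of them has to know that the header was added or that the body was rewritten.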

3. Activating a downloader middleware

  • Set the DOWNLOADER_MIDDLEWARES option in settings. It is a dict whose keys are middleware class paths and whose values are the middleware orders. For example:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
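  • The DOWNLOADER_MIDDLEWARES setting is merged with the DOWNLOADER_MIDDLEWARES_BASE setting defined by Scrapy (it does not replace it). The merged result is then sorted by order to produce the final ordered list of enabled middlewares: the first middleware is the one closest to the engine, and the last is the one closest to the downloader.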

  • To disable a built-in middleware (one defined in DOWNLOADER_MIDDLEWARES_BASE and enabled by default), define it in your project's DOWNLOADER_MIDDLEWARES setting and assign None as its value. For example, to turn off the user-agent middleware:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
  • To list the built-in DOWNLOADER_MIDDLEWARES_BASE middlewares, run scrapy settings --get=DOWNLOADER_MIDDLEWARES_BASE on the command line. Example output:
{"scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware": 100, 
"scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware": 300,
"scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware": 350,
"scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware": 400,
"scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
"scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
"scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware": 560,
"scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware": 580,
"scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 590,
"scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 600,
"scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
"scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
"scrapy.downloadermiddlewares.stats.DownloaderStats": 850,
"scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware": 900}
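
The merge, disable and sort rules described in this section can be sketched in a few lines of plain Python. This is only an illustration of the rules as stated above, not Scrapy's own code; `build_enabled_list` is a hypothetical helper, and the base dict is trimmed to three entries from the output above to keep the example short.

```python
# A trimmed copy of the built-in defaults (three entries taken from the listing above).
DOWNLOADER_MIDDLEWARES_BASE = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}

# Project settings: add one custom middleware, disable one built-in.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomDownloaderMiddleware": 543,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}


def build_enabled_list(base, custom):
    """Hypothetical helper: merge, drop None entries, sort by order."""
    merged = {**base, **custom}  # project values win on key clashes
    enabled = {path: order for path, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)  # lowest order = closest to the engine


enabled = build_enabled_list(DOWNLOADER_MIDDLEWARES_BASE, DOWNLOADER_MIDDLEWARES)
for path in enabled:
    print(path)
```

With order 543 the custom middleware lands before RetryMiddleware (550), and the user-agent middleware disappears from the list because its value is None.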

4. Writing a downloader middleware

  • process_request(request, spider)

  • process_response(request, response, spider)

  • process_exception(request, exception, spider)

See the official documentation for details.
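
As a rough sketch of what such a class can look like, the example below implements the three methods listed above. A downloader middleware is an ordinary Python class, so the block does not import Scrapy; the `SimpleNamespace` objects at the bottom are stand-ins for real Request and Response objects, and the user-agent string is a made-up placeholder. In Scrapy, `process_request` may return None to continue processing, and `process_response` must hand back a response (or a new request) for the chain to continue.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


class CustomDownloaderMiddleware:
    # Placeholder value, assumed for this example only.
    user_agent = "my-crawler/0.1"

    def process_request(self, request, spider):
        # Called for each request on its way to the downloader.
        # Only set the header if the request does not already carry one.
        request.headers.setdefault("User-Agent", self.user_agent)
        return None  # None means: keep processing this request normally

    def process_response(self, request, response, spider):
        # Called for each response on its way back to the engine.
        if response.status != 200:
            logger.warning("Got status %s for %s", response.status, request.url)
        return response  # pass the response on unchanged

    def process_exception(self, request, exception, spider):
        # Called when the download (or a process_request hook) raises.
        logger.error("Download of %s failed: %r", request.url, exception)
        return None  # None means: let other middlewares handle the exception


# Stand-in objects so the sketch can be exercised without a running crawl.
mw = CustomDownloaderMiddleware()
request = SimpleNamespace(url="http://example.com", headers={})
response = SimpleNamespace(status=404)

mw.process_request(request, spider=None)
returned = mw.process_response(request, response, spider=None)
print(request.headers)
```

To try it in a real project, the class would go in myproject/middlewares.py and be enabled through DOWNLOADER_MIDDLEWARES as shown in section 3.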

5. Other built-in middlewares

See the official documentation for details.

More updates to come…
