Scrapy Crawler Usage Guide: A Complete Tutorial

scrapy note

command

Global commands:

startproject: creates a new Scrapy project named project_name under the project_name directory.
```
scrapy startproject myproject
```
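settings: when run inside a project, prints the project's setting values; otherwise prints Scrapy's default values.
runspider: runs a spider contained in a single Python file, without having to create a project.
shell: starts the Scrapy shell for the given URL, or an empty shell if no URL is given.
fetch: downloads the given URL using the Scrapy downloader and writes the contents to standard output.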
```
scrapy fetch --nolog --headers http://www.example.com/
```
view: opens the given URL in a browser, as your Scrapy spider would see it.
```
scrapy view http://www.example.com/some/page.html
```
version: prints the Scrapy version.

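Project-only commands:

crawl: starts crawling using a spider.

    scrapy crawl myspider

check: runs contract checks.

    scrapy check -l

list: lists all available spiders in the current project, one spider per line.
edit: edits the given spider using the editor defined in the EDITOR setting.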

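parse: fetches the given URL and parses it with the spider that handles it. If you pass the --callback option, that spider method is used; otherwise parse is used.

Supported options:

--spider=SPIDER: bypass spider autodetection and force a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback used for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level up to which requests are followed recursively (default: 1)
--verbose or -v: display information for each depth level

    scrapy parse http://www.example.com/ -c parse_item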

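genspider: creates a new spider in the current project.

    scrapy genspider [-t template] <name> <domain>
    scrapy genspider -t basic example example.com

deploy: deploys the project to a Scrapyd server. (Newer Scrapy releases no longer ship this command; the separate scrapyd-client package provides scrapyd-deploy instead.)
bench: runs a quick benchmark test.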

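Using selectors

    from scrapy.http import HtmlResponse
    from scrapy.selector import Selector

    # Build a selector directly from text
    body = '<html><body><span>good</span></body></html>'
    Selector(text=body).xpath('//span/text()').extract()  # ['good']

    # Build a selector from a response; encoding is required when body is a str
    response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
    Selector(response=response).xpath('//span/text()').extract()  # ['good']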

Scrapy provides two convenient shortcuts: response.xpath() and response.css()

    >>> response.xpath('//base/@href').extract()
    >>> response.css('base::attr(href)').extract()
    >>> response.xpath('//a[contains(@href,"image")]/@href').extract()
    >>> response.css('a[href*=image]::attr(href)').extract()
    >>> response.xpath('//a[contains(@href,"image")]/img/@src').extract()
    >>> response.css('a[href*=image] img::attr(src)').extract()

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example:

    links = response.xpath('//a[contains(@href,"image")]')
    for index, link in enumerate(links):
        args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
        print('Link number %d points to url %s and image %s' % args)

Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. However, unlike .xpath() or .css(), .re() returns a list of strings rather than selectors, so you cannot chain nested .re() calls.

    >>> response.xpath('//a[contains(@href,"image")]/text()').re(r'Name:\s*(.*)')
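
To make the return type concrete, here is a minimal sketch in plain Python that models what .re() does with a list of extracted text nodes: the pattern is applied to each string and the captured groups are collected into one flat list of strings. The helper name re_extract and the sample link texts are assumptions made for illustration; this is a model of the behaviour, not Scrapy's implementation.

```python
import re


def re_extract(texts, pattern):
    """Apply pattern to every string and collect the captured groups as plain strings."""
    results = []
    for text in texts:
        results.extend(re.findall(pattern, text))
    return results


# Hypothetical link texts, as .xpath('//a/text()').extract() might return them.
link_texts = ['Name: My image 1', 'Name: My image 2', 'No name here']
names = re_extract(link_texts, r'Name:\s*(.*)')
print(names)  # ['My image 1', 'My image 2']
```

Because the result is a list of ordinary strings, there is nothing left to call .xpath(), .css() or .re() on, which is why .re() has to come last in a chain.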

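Working with relative XPaths

When nesting selectors, an XPath that starts with / is absolute to the document, not relative to the selector it is called on.

    >>> divs = response.xpath('//div')
    >>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
    ...     print(p.extract())
    >>> for p in divs.xpath('.//p'):  # extracts all <p> inside
    ...     print(p.extract())
    >>> for p in divs.xpath('p'):  # extracts only the direct <p> children
    ...     print(p.extract())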
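
The difference between the three path forms can be checked without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a small subset of XPath, so it is a stand-in for Scrapy selectors rather than the same API. The sample document is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <p> outside the <div>, one direct child, one nested deeper.
doc = ET.fromstring(
    '<html><body>'
    '<p>outside</p>'
    '<div><p>child</p><section><p>nested</p></section></div>'
    '</body></html>'
)
div = doc.find('.//div')

# Every <p> in the document: what an absolute '//p' selects, whatever node it is called on.
whole_document = [p.text for p in doc.iter('p')]
# './/p': every <p> below the div, at any depth.
descendants = [p.text for p in div.findall('.//p')]
# 'p': only the <p> elements that are direct children of the div.
children = [p.text for p in div.findall('p')]

print(whole_document)  # ['outside', 'child', 'nested']
print(descendants)     # ['child', 'nested']
print(children)        # ['child']
```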

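For example, the test() function (from the EXSLT regular expressions namespace, available under the re: prefix) can be very useful when XPath's starts-with() or contains() are not sufficient.

    >>> sel.xpath('//li//@href').extract()
    >>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()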
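
To see which elements a pattern such as item-\d$ keeps, the filter can be reproduced with the standard library alone. This sketch uses xml.etree.ElementTree and the re module as stand-ins for re:test(), so it shows the matching logic rather than Scrapy's API. The sample list markup is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical list markup with a mix of class names.
doc = ET.fromstring(
    '<ul>'
    '<li class="item-0"><a href="link1.html">first</a></li>'
    '<li class="item-1"><a href="link2.html">second</a></li>'
    '<li class="item-inactive"><a href="link3.html">third</a></li>'
    '<li class="item-10"><a href="link4.html">fourth</a></li>'
    '</ul>'
)

pattern = re.compile(r'item-\d$')

# Keep the href of every link whose parent <li> has a class matching the pattern.
hrefs = [
    a.get('href')
    for li in doc.iter('li')
    if pattern.search(li.get('class', ''))
    for a in li.iter('a')
]
print(hrefs)  # ['link1.html', 'link2.html']
```

A plain starts-with(@class, "item-") would also have matched item-inactive and item-10; the trailing \d$ restricts the match to a single digit at the end of the class name.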

XPATH TIPS

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.
Beware of the difference between //node[1] and (//node)[1]
When selecting by class, be as specific as necessary
When querying by class, consider using CSS
Learn to use all the different axes
Useful trick to get text content

Item Loaders

populate items

    from scrapy.loader import ItemLoader
    from myproject.items import Product

    def parse(self, response):
        l = ItemLoader(item=Product(), response=response)
        # Values from both XPaths are collected into the same 'name' field
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('name', '//div[@class="product_title"]')
        l.add_xpath('price', '//p[@id="price"]')
        l.add_css('stock', 'p#stock')
        l.add_value('last_updated', 'today')  # you can also use literal values
        return l.load_item()

Item Pipeline

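Typical uses of item pipelines:

Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing scraped items in a database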

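Writing your own item pipeline

Every item pipeline component must implement process_item(self, item, spider), which Scrapy calls for every item. The method must return an Item (or any subclass) object, or raise a DropItem exception; dropped items are not processed by later pipeline components.
Parameters:

item (Item object): the scraped item
spider (Spider object): the spider that scraped the item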
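
As a minimal sketch of such a component, the hypothetical pipeline below validates a price field and drops duplicates by id. The class name, the field names and the local DropItem stand-in are all assumptions so that the sketch runs without Scrapy; in a real project DropItem would be imported from scrapy.exceptions.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class PriceAndDuplicatesPipeline:
    """Hypothetical pipeline: requires a 'price' field and drops repeated 'id' values."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        # Validation: reject items without a usable price.
        if not item.get('price'):
            raise DropItem('Missing price in %r' % (item,))
        # Deduplication: reject items whose id was already seen.
        if item['id'] in self.seen_ids:
            raise DropItem('Duplicate item found: %r' % (item,))
        self.seen_ids.add(item['id'])
        return item


# Feed a few dict items through the pipeline the way Scrapy would, one at a time.
pipeline = PriceAndDuplicatesPipeline()
kept = []
for item in [{'id': 1, 'price': 10}, {'id': 1, 'price': 10}, {'id': 2}]:
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        pass  # Scrapy would log the drop and skip the remaining pipelines

print(kept)  # [{'id': 1, 'price': 10}]
```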

Write items to MongoDB

    import pymongo

    class MongoPipeline:

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # Read the connection details from the project settings
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Store each item in a collection named after its item class
            collection_name = item.__class__.__name__
            # insert_one() replaces insert(), which was removed in PyMongo 4
            self.db[collection_name].insert_one(dict(item))
            return item

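To activate an item pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

    ITEM_PIPELINES = {
        'myproject.pipelines.PricePipeline': 300,
        'myproject.pipelines.JsonWriterPipeline': 800,
    }

The integer value assigned to each class determines the order in which the pipelines run: items pass through them from the lowest value to the highest. It is customary to define these numbers in the 0-1000 range.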
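
The ordering rule can be illustrated with a few lines of plain Python. This is only a sketch of how the order values translate into a run order, not Scrapy's own loading code; the helper name pipeline_order is an assumption.

```python
# Same class paths as the ITEM_PIPELINES example, deliberately listed out of order.
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 800,
    'myproject.pipelines.PricePipeline': 300,
}


def pipeline_order(pipelines):
    """Return the class paths sorted by their order value, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)
# ['myproject.pipelines.PricePipeline', 'myproject.pipelines.JsonWriterPipeline']
```

The position of an entry in the dictionary does not matter; only the number does, so PricePipeline (300) sees each item before JsonWriterPipeline (800).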

Practical tips

Running multiple spiders in the same process

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    runner = CrawlerRunner(get_project_settings())
    dfs = set()
    for domain in ['scrapinghub.com', 'insophia.com']:
        # Schedule one crawl of the 'followall' spider per domain
        d = runner.crawl('followall', domain=domain)
        dfs.add(d)

    # Stop the reactor once every crawl has finished, whether it succeeded or failed
    defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until all crawling jobs are finished

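Avoiding getting banned

Use a pool of user agents and rotate through them. The pool should contain the user agents of well-known browsers (plenty can be found with a web search).
Disable cookies (see COOKIES_ENABLED), as some sites use cookies to spot crawler behaviour.
Use download delays (2 or higher). See the DOWNLOAD_DELAY setting.
If possible, use Google cache to fetch pages instead of hitting the sites directly.
Use a pool of rotating IPs, for example the free Tor project or a paid service such as ProxyMesh.
Use a highly distributed downloader that circumvents bans internally, so you can focus on parsing pages. One example of such a downloader is Crawlera.

Settings commonly tuned for broad crawls:

Increase concurrency: CONCURRENT_REQUESTS = 100
Disable cookies: COOKIES_ENABLED = False
Disable retries: RETRY_ENABLED = False
Reduce the download timeout: DOWNLOAD_TIMEOUT = 15
Disable redirects: REDIRECT_ENABLED = False
Enable crawling of "Ajax Crawlable Pages": AJAXCRAWL_ENABLED = True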
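
The user agent pool suggested in the list above could be implemented as a small downloader middleware. The following is a sketch under stated assumptions: the class name, the USER_AGENT_POOL setting name and the FakeRequest helper are made up for illustration, and the class is written duck-typed so it runs without Scrapy installed.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that picks a User-Agent per request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_POOL is a made-up setting name for this sketch.
        return cls(crawler.settings.getlist('USER_AGENT_POOL'))

    def process_request(self, request, spider):
        # Overwrite the header; returning None lets Scrapy keep processing the request.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None


class FakeRequest:
    """Minimal stand-in for scrapy.Request, used only to exercise the sketch."""

    def __init__(self):
        self.headers = {}


middleware = RandomUserAgentMiddleware(['agent-one/1.0', 'agent-two/2.0'])
request = FakeRequest()
result = middleware.process_request(request, spider=None)
print(request.headers['User-Agent'])  # one of the two agents, chosen at random
```

In a real project the class would be registered in the DOWNLOADER_MIDDLEWARES setting with an order value, in the same way pipelines are registered in ITEM_PIPELINES.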

Firefox add-ons useful for scraping

Firebug
XPather
XPath Checker
Tamper Data
Firecookie

Auto-throttling: AUTOTHROTTLE_ENABLED = True
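
A slightly fuller settings.py fragment for auto-throttling might look like the sketch below. The setting names are Scrapy's own; the values are illustrative assumptions rather than recommendations and should be tuned per site.

```python
# settings.py fragment (illustrative values)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average number of parallel requests per remote site
DOWNLOAD_DELAY = 2                     # AutoThrottle never lowers the delay below this value
```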

Permanent link to this article: http://www.linuxidc.com/Linux/2017-01/139945.htm

Posted in: Server Applications

2022-01-21

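Reprint notice: unless otherwise stated, articles on this site are published under the CC 4.0 license; please credit the source when reprinting.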
