Scrapy 2.3: Available tool commands

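This section lists the available built-in commands, with a description and some usage examples for each. Remember, you can always get more information about a command by running: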

scrapy <command> -h

You can see all available commands with:

scrapy -h

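There are two kinds of commands: those that only work from inside a Scrapy project (project-specific commands) and those that also work without an active Scrapy project (global commands). Global commands may behave slightly differently when run from inside a project, because they then use the project's overridden settings.

Global commands: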

startproject
genspider
settings
runspider
shell
fetch
view
version

Project-only commands:

crawl
check
list
edit
parse
bench
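
The distinction above depends on whether Scrapy considers the current directory to be inside a project. The sketch below illustrates one way to check this, assuming a project is marked by a scrapy.cfg file in its root directory (a detail this section does not state). The function names are hypothetical, and the project-only set contains only the commands whose entries below say "Requires project: yes".

```python
from pathlib import Path
from typing import Optional

# Commands whose entries in this section say "Requires project: yes".
PROJECT_ONLY = {"crawl", "check", "list", "edit", "parse"}


def find_scrapy_cfg(start: Path) -> Optional[Path]:
    """Return the closest scrapy.cfg at or above `start`, or None."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


def command_is_available(command: str, start: Path) -> bool:
    """Global commands always work; project-only ones need a project."""
    if command not in PROJECT_ONLY:
        return True
    return find_scrapy_cfg(start) is not None


# Report whether the current working directory looks like a project.
print(find_scrapy_cfg(Path.cwd()))
```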

startproject

Syntax: scrapy startproject <project_name> [project_dir]
Requires project: no

Creates a new Scrapy project named project_name under the project_dir directory. If project_dir is not specified, it defaults to project_name.

Usage example:

$ scrapy startproject myproject

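genspider

Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: no

Creates a new spider in the current folder, or in the current project's spiders folder if called from inside a project. The <name> parameter is set as the spider's name, while <domain> is used to generate the spider's allowed_domains and start_urls attributes.

Usage example: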

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

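This is just a convenience shortcut for creating spiders from pre-defined templates, and certainly not the only way to create them. You can write the spider source files yourself instead of using this command.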
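
To make the role of <name> and <domain> concrete, here is a rough sketch that renders a spider source file from a string template. The template text, the class-naming rule, and the render_spider helper are simplified assumptions for illustration; the templates shipped with Scrapy may differ in detail.

```python
from string import Template

# Hypothetical, simplified stand-in for a spider template.
SPIDER_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = "${name}"
    allowed_domains = ["${domain}"]
    start_urls = ["http://${domain}/"]

    def parse(self, response):
        # Placeholder callback: replace with real extraction logic.
        yield {"url": response.url}
''')


def render_spider(name: str, domain: str) -> str:
    """Fill in the spider name and domain, as genspider is described to do."""
    classname = name.capitalize() + "Spider"
    return SPIDER_TEMPLATE.substitute(
        classname=classname, name=name, domain=domain
    )


source = render_spider("example", "example.com")
print(source)
```

Saving the printed text as a .py file in the project's spiders folder is the kind of manual creation the paragraph above refers to.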

crawl

Syntax: scrapy crawl <spider>
Requires project: yes

Starts crawling using a spider.

Usage example:

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]

check

Syntax: scrapy check [-l] <spider>
Requires project: yes

Runs contract checks.

Usage examples:

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4

list

Syntax: scrapy list
Requires project: yes

Lists all available spiders in the current project. The output is one spider per line.

Usage example:

$ scrapy list
spider1
spider2

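edit

Syntax: scrapy edit <spider>
Requires project: yes

Edits the given spider using the editor defined in the EDITOR environment variable or, if that is unset, the EDITOR setting.

This command is provided only as a convenience shortcut for the most common case; you are of course free to choose any tool or IDE to write and debug spiders.

Usage example: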

$ scrapy edit spider1
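
The editor lookup order described above can be sketched as follows. This is an illustration of the stated precedence only, not Scrapy's actual code; the resolve_editor helper and the "vi" fallback value are assumptions.

```python
import os
from typing import Mapping


def resolve_editor(environ: Mapping[str, str], settings: Mapping[str, str]) -> str:
    """Prefer the EDITOR environment variable, then the EDITOR setting."""
    editor = environ.get("EDITOR")
    if editor:
        return editor
    return settings["EDITOR"]


# "vi" is an assumed value for the EDITOR setting, used here for illustration.
print(resolve_editor(os.environ, {"EDITOR": "vi"}))
```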

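fetch

Syntax: scrapy fetch <url>
Requires project: no

Downloads the given URL using the Scrapy downloader and writes the contents to standard output.

The interesting thing about this command is that it fetches the page the way the spider would download it. For example, if the spider has a USER_AGENT attribute that overrides the user agent, the command uses that one.

So this command can be used to "see" how your spider would fetch a certain page.

If used outside a project, no spider-specific behaviour is applied and the command just uses the default Scrapy downloader settings.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--headers: print the response's HTTP headers instead of the response's body
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

Usage examples: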

$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}
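
When calling fetch from a script, it can help to assemble the argument list programmatically. The sketch below builds such a list from the options documented above; the build_fetch_command helper and its parameter names are hypothetical, and the command is only printed, not executed.

```python
import shlex
from typing import List, Optional


def build_fetch_command(
    url: str,
    spider: Optional[str] = None,
    headers: bool = False,
    follow_redirects: bool = True,
) -> List[str]:
    """Return an argv list for `scrapy fetch` using the documented options."""
    argv = ["scrapy", "fetch", "--nolog"]
    if spider is not None:
        argv.append(f"--spider={spider}")
    if headers:
        argv.append("--headers")
    if not follow_redirects:
        argv.append("--no-redirect")
    argv.append(url)
    return argv


command = build_fetch_command("http://www.example.com/", headers=True)
print(shlex.join(command))
# scrapy fetch --nolog --headers http://www.example.com/
```

The resulting list can be passed to subprocess.run, which avoids shell quoting problems with URLs that contain characters such as ? or &.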

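view

Syntax: scrapy view <url>
Requires project: no

Opens the given URL in a browser, as your Scrapy spider would "see" it. Spiders sometimes see pages differently from regular users, so this can be used to check what the spider "sees" and confirm it is what you expect.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

Usage example: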

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

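shell

Syntax: scrapy shell [url]
Requires project: no

Starts the Scrapy shell for the given URL, or an empty shell if no URL is given. It also supports UNIX-style local file paths, either relative with ./ or ../ prefixes or absolute. See Scrapy shell for more information.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL passed as an argument on the command line; once inside the shell, fetch(url) still follows HTTP redirects by default.

Usage examples: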

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')

# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')

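parse

Syntax: scrapy parse <url> [options]
Requires project: yes

Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'
--cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo": "bar"}'
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using Pygments to colorize the output
--depth or -d: depth level to which requests should be followed recursively (default: 1)
--verbose or -v: display information for each depth level
--output or -o: dump scraped items to a file. New in version 2.3.

Usage example: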

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': 'Example item',
 'category': 'Furniture',
 'length': '12 cm'}]

# Requests  -----------------------------------------------------------------
[]
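
Because --meta and --cbkwargs must be valid JSON strings, one way to avoid quoting mistakes is to serialize a Python dict with json.dumps. The following is a minimal sketch; the build_parse_command helper is hypothetical and only prints the command rather than running it.

```python
import json
import shlex
from typing import Any, Dict, List, Optional


def build_parse_command(
    url: str,
    callback: Optional[str] = None,
    meta: Optional[Dict[str, Any]] = None,
    cbkwargs: Optional[Dict[str, Any]] = None,
) -> List[str]:
    """Return an argv list for `scrapy parse` with JSON-encoded options."""
    argv = ["scrapy", "parse", url]
    if callback is not None:
        argv += ["-c", callback]
    if meta is not None:
        # json.dumps guarantees the value is a valid JSON string.
        argv.append("--meta=" + json.dumps(meta))
    if cbkwargs is not None:
        argv.append("--cbkwargs=" + json.dumps(cbkwargs))
    return argv


argv = build_parse_command(
    "http://www.example.com/", callback="parse_item", meta={"foo": "bar"}
)
# shlex.join adds the shell quoting needed around the JSON argument.
print(shlex.join(argv))
```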

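settings

Syntax: scrapy settings [options]
Requires project: no

Gets the value of a Scrapy setting.

If used inside a project, it shows the project's setting value; otherwise it shows the default Scrapy value for that setting.

Usage example: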

$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0
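
The "project value, else default" behaviour described above can be pictured as a layered lookup. The sketch below uses collections.ChainMap purely as an illustration; it is not how Scrapy stores settings internally, and the project override value is made up.

```python
from collections import ChainMap

# Default values taken from the usage example above.
defaults = {"BOT_NAME": "scrapybot", "DOWNLOAD_DELAY": 0}

# Hypothetical project settings that override one default.
project = {"BOT_NAME": "myproject"}


def effective_settings(inside_project: bool) -> ChainMap:
    """Look up project values first when inside a project, else defaults only."""
    layers = [project, defaults] if inside_project else [defaults]
    return ChainMap(*layers)


print(effective_settings(True)["BOT_NAME"])        # myproject
print(effective_settings(False)["BOT_NAME"])       # scrapybot
print(effective_settings(True)["DOWNLOAD_DELAY"])  # 0
```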

runspider

Syntax: scrapy runspider <spider_file.py>
Requires project: no

Runs a spider self-contained in a Python file, without having to create a project.

Usage example:

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]

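version

Syntax: scrapy version [-v]
Requires project: no

Prints the Scrapy version. If used with -v, it also prints Python, Twisted and platform information, which is useful for bug reports.

bench

Syntax: scrapy bench
Requires project: no

Runs a quick benchmark test. See Benchmarking.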
