Scrapy 2.3: Extracting data

The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell. Run:

scrapy shell 'http://quotes.toscrape.com/page/1/'

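Note

Remember to always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise, URLs that contain arguments (that is, the & character) will not work.

On Windows, use double quotes instead: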

scrapy shell "http://quotes.toscrape.com/page/1/"

You will see something like:

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Using the shell, you can try selecting elements with CSS on the response object:

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

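The result of running response.css('title') is a list-like object called SelectorList (scrapy.selector.SelectorList). It represents a list of Selector (scrapy.selector.Selector) objects that wrap XML/HTML elements and let you run further queries to narrow the selection or extract the data.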

To extract the text from the title above, you can do:

>>> response.css('title::text').getall()
['Quotes to Scrape']

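There are two things to note here. The first is that we added ::text to the CSS query, which means we want to select only the text nodes directly inside the <title> element. If we don't specify ::text, we get the full title element, including its tags: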

>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']

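The other thing is that the result of calling .getall() is a list: a selector may return more than one result, so we extract them all. When you know you only want the first result, as in this case, you can do: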

>>> response.css('title::text').get()
'Quotes to Scrape'

As an alternative, you could have written:

>>> response.css('title::text')[0].get()
'Quotes to Scrape'

However, calling .get() directly on a SelectorList instance avoids an IndexError and returns None when it finds no element matching the selection.

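There is a lesson here: for most scraping code, you want it to be resilient to errors caused by things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.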

Besides the getall() and get() methods, you can also use the re() method to extract data using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
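The shape of these results, in particular the flat list returned when the pattern contains two groups, can be approximated in plain Python. The helper regex_matches below is a hypothetical illustration built on the standard re module, not Scrapy's actual implementation:

```python
import re

text = "Quotes to Scrape"


def regex_matches(pattern, value):
    """Return all matches as a flat list of strings."""
    flat = []
    for match in re.findall(pattern, value):
        # findall yields tuples when the pattern has several groups.
        if isinstance(match, tuple):
            flat.extend(match)
        else:
            flat.append(match)
    return flat


print(regex_matches(r"Quotes.*", text))        # ['Quotes to Scrape']
print(regex_matches(r"Q\w+", text))            # ['Quotes']
print(regex_matches(r"(\w+) to (\w+)", text))  # ['Quotes', 'Scrape']
```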

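To find the right CSS selectors to use, you may find it helpful to open the response page from the shell in your web browser with view(response). You can then use your browser's developer tools to inspect the HTML and come up with a selector (see Using your browser's Developer Tools for scraping).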

Selector Gadget is also a nice tool for quickly finding the CSS selector of visually selected elements, and it works in many browsers.
