[Web Scraping] 4.3 Scraping and Storing Data with Scrapy

1. Setting up the web site

2. Writing the item class

3. Writing the spider MySpider

4. Writing the item pipeline class

5. Configuring Scrapy's settings file

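After scraping data from a web site, you usually want to store it, often in a database. Scrapy provides a convenient mechanism for this. To illustrate the storage process, we first set up a simple web site, then write a Scrapy spider to scrape it, and finally store the scraped data.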

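1. Setting up the web site

The site has a single page that returns data for several computer textbooks. It is a Flask program.

The server, server.py, is as follows: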

import flask

app = flask.Flask(__name__)


@app.route("/")
def index():
    html = """
    <books>
    <book>
        <title>Python程序设计</title>
        <author>James</author>
        <publisher>清华大学出版社</publisher>
    </book>
    <book>
        <title>Java程序设计</title>
        <author>Robert</author>
        <publisher>人民邮电出版社</publisher>
    </book>
    <book>
        <title>MySQL数据库</title>
        <author>Steven</author>
        <publisher>高等教育出版社</publisher>
    </book>
    </books> 
    """
    return html


if __name__ == "__main__":
    app.run()

Visiting this site returns XML-style data containing each textbook's title, author, and publisher.
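To get a feel for the structure the spider will have to walk, the returned markup can be parsed with Python's standard library. This is only a sketch for inspecting the data: it uses xml.etree.ElementTree rather than Scrapy, and the markup is pasted in as a string (trimmed to two books) instead of being fetched from the running server.

```python
import xml.etree.ElementTree as ET

# Same markup that server.py returns, trimmed to two books for brevity.
xml_text = """
<books>
    <book>
        <title>Python程序设计</title>
        <author>James</author>
        <publisher>清华大学出版社</publisher>
    </book>
    <book>
        <title>Java程序设计</title>
        <author>Robert</author>
        <publisher>人民邮电出版社</publisher>
    </book>
</books>
"""

# <books> is the root element; each <book> child holds three text fields.
root = ET.fromstring(xml_text.strip())
records = [
    (book.findtext("title"), book.findtext("author"), book.findtext("publisher"))
    for book in root.findall("book")
]

for record in records:
    print(record)
```

Each tuple corresponds to one <book> element, which is the unit the spider later turns into one item.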

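2. Writing the item class

The program scrapes several textbooks, each with a title, an author, and a publisher, so we define a textbook class with the fields title, author, and publisher. In a Scrapy project, the file items.py under the .\example\Test\Test directory is where item classes are defined. Open this file and modify it as follows:

Before: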

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

After:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()

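BookItem is the textbook class we designed. It must inherit from scrapy.Item, and each field declared in it is a scrapy.Field object. Three fields are defined here to store the textbook's title, author, and publisher.

If item is a BookItem object, the field values can be read and set through item["title"], item["author"], and item["publisher"], for example: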

item = BookItem()
item["title"] = "Python程序设计"
item["author"] = "James"
item["publisher"] = "清华大学出版社"
print(item["title"])
print(item["author"])
print(item["publisher"])

3. Writing the spider MySpider

With the item class designed, we can write the spider (.\example\Test\Test\spiders\MySpider.py).

The spider, MySpider.py, is as follows:

import scrapy
from ..items import BookItem


class MySpider(scrapy.Spider):
    name = "mySpider"
    start_urls = ['http://127.0.0.1:5000']

    # Callback that Scrapy invokes with the downloaded response
    def parse(self, response, **kwargs):
        try:
            data = response.body.decode()
            # Extract the data
            selector = scrapy.Selector(text=data)
            books = selector.xpath("//book")
            for book in books:
                item = BookItem()
                item["title"] = book.xpath("./title/text()").extract_first()
                item["author"] = book.xpath("./author/text()").extract_first()
                item["publisher"] = book.xpath("./publisher/text()").extract_first()
                yield item
        except Exception as err:
            print(err)

This program requests the site at http://127.0.0.1:5000 and receives a page containing the textbook data. It works as follows:

(1)

from ..items import BookItem

This imports the BookItem class defined in items.py in the Test folder.

(2)

data = response.body.decode()
selector = scrapy.Selector(text=data)
books = selector.xpath("//book")

This decodes the response body, builds a Selector object from it, and finds every <book> element.

(3)

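for book in books:
    item = BookItem()
    item["title"] = book.xpath("./title/text()").extract_first()
    item["author"] = book.xpath("./author/text()").extract_first()
    item["publisher"] = book.xpath("./publisher/text()").extract_first()
    yield item

For each <book> node, the program finds the <title> node beneath it and takes its text, which is the textbook's title. Note that book.xpath("./title/text()") searches for the text of the <title> node under <book>; the "./" part must not be omitted, because it means "search downward from the current <book> node". The <author> and <publisher> text is found the same way, and together the three values make up one BookItem object. The statement yield item returns this object to the caller, and Scrapy then pushes it to the pipeline class in pipelines.py, which sits in the same directory as items.py, to process the data.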

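4. Writing the item pipeline class

In a Scrapy project, the file pipelines.py under the .\example\Test\Test directory holds the item pipeline classes. Open this file and you will see a default pipeline class.

The default pipelines.py is as follows: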

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class TestPipeline:
    def process_item(self, item, spider):
        return item

Modify pipelines.py to define our own pipeline class as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BookPipeline(object):
    count = 0

    def process_item(self, item, spider):
        BookPipeline.count += 1
        try:
            if BookPipeline.count == 1:
                fobj = open("books.txt", "wt")
            else:
                fobj = open("books.txt", "at")
            print(item["title"], item["author"], item["publisher"])
            fobj.write(item["title"] + "," + item["author"] + "," + item["publisher"] + "\n")
            fobj.close()
        except Exception as err:
            print(err)
        return item

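We name this class BookPipeline; it inherits from object. Its most important method is process_item. When a crawl starts, Scrapy creates one BookPipeline object. After that, each time the spider yields a BookItem, Scrapy passes that item to the BookPipeline object and calls process_item once. The item parameter of process_item is the pushed data, so this method is where the scraped data can be saved. Note that Scrapy expects process_item to return the item at the end.

This program stores the scraped data in a file. BookPipeline defines a class attribute count = 0 that records how many times process_item has been called. On the first call (count == 1), the statement fobj = open("books.txt", "wt") creates a new books.txt, and the item's data is written to it. On later calls (count > 1), the statement fobj = open("books.txt", "at") opens the existing books.txt and appends the item's data. As a result, each run of the spider discards the previous run's data and records only the data from the current crawl.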
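An alternative that avoids the counter and the repeated open/close is to use the optional open_spider and close_spider methods, which Scrapy calls on a pipeline once at the start and once at the end of a crawl. The following is a sketch of that variant, not the article's implementation: the class name BookFilePipeline, the path parameter, and the explicit UTF-8 encoding are assumptions introduced here, and the pipeline is driven by hand with plain dicts so the example runs without a crawl.

```python
import os
import tempfile


class BookFilePipeline:
    """Hypothetical variant of BookPipeline that opens the file once per crawl."""

    def __init__(self, path="books.txt"):
        self.path = path
        self.fobj = None

    def open_spider(self, spider):
        # Mode "w" truncates the file, so each crawl starts from an empty file.
        self.fobj = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        fields = [item["title"], item["author"], item["publisher"]]
        self.fobj.write(",".join(fields) + "\n")
        return item

    def close_spider(self, spider):
        self.fobj.close()


# Drive the pipeline by hand, in the order Scrapy would call it during a crawl.
demo_path = os.path.join(tempfile.mkdtemp(), "books.txt")
pipeline = BookFilePipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"title": "Python程序设计", "author": "James", "publisher": "清华大学出版社"},
    spider=None,
)
pipeline.process_item(
    {"title": "Java程序设计", "author": "Robert", "publisher": "人民邮电出版社"},
    spider=None,
)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The behaviour matches the counter-based version (old data is discarded, new data is written in order), but the file is opened and closed once instead of once per item.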

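5. Configuring Scrapy's settings file

After MySpider runs, every item it scrapes is pushed to the BookPipeline class and process_item is called. How does Scrapy know to do this? We have to set up that channel ourselves. The Test folder contains a settings file, settings.py. Open it and you will see many settings, most of them commented out with #. Find the ITEM_PIPELINES entry and set it as follows: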

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   "Test.pipelines.BookPipeline": 300,
}

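ITEM_PIPELINES is a dictionary whose key must be 'Test.pipelines.BookPipeline' rather than the template's default TestPipeline entry; BookPipeline is the name of the pipeline class defined in pipelines.py. The 300 that follows is the value the template supplies by default. It does not have to be 300: Scrapy uses this integer only to order the enabled pipelines, and values are conventionally chosen in the 0 to 1000 range.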
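The integer matters once more than one pipeline is enabled, because items pass through the pipelines from the lowest value to the highest. The fragment below sketches that ordering in plain Python; "Test.pipelines.CleanPipeline" is a hypothetical second pipeline invented for this illustration and is not part of the project.

```python
# Hypothetical settings fragment: CleanPipeline is an assumed second pipeline.
ITEM_PIPELINES = {
    "Test.pipelines.BookPipeline": 300,
    "Test.pipelines.CleanPipeline": 100,
}

# Items go through the pipelines in ascending order of their numbers,
# regardless of the order in which the dictionary entries are written.
run_order = [
    path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```

With a single pipeline, as in this article, any value gives the same result.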

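Once this is set, the spider MySpider is connected to the pipeline code in pipelines.py. While running, Scrapy passes each item that MySpider returns via yield to the BookPipeline class in pipelines.py and calls its process_item method, which saves the data.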

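Summary:

Scrapy handles scraping and storage as separate steps and schedules them asynchronously. Each time MySpider.py scrapes an item, it yields the item, which is handed to pipelines.py for storage. The spider then moves on to the next item, yields it, and that item is stored in turn. This continues until the crawl ends, at which point books.txt contains all of the scraped data.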
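The hand-off order can be mimicked with an ordinary generator. This is a deliberately simplified, synchronous stand-in: the parse function, the CollectPipeline class, and the for loop playing the engine's role are all assumptions of this sketch, and Scrapy's real engine is asynchronous and considerably more involved.

```python
def parse(rows):
    """Stand-in for MySpider.parse: yields one item per record."""
    for title, author, publisher in rows:
        yield {"title": title, "author": author, "publisher": publisher}


class CollectPipeline:
    """Stand-in for BookPipeline: keeps items in memory instead of a file."""

    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        self.stored.append(item)
        return item


rows = [
    ("Python程序设计", "James", "清华大学出版社"),
    ("Java程序设计", "Robert", "人民邮电出版社"),
    ("MySQL数据库", "Steven", "高等教育出版社"),
]

pipeline = CollectPipeline()
# The loop plays the engine's role: take each yielded item and pass it on.
for item in parse(rows):
    pipeline.process_item(item, spider=None)

titles = [item["title"] for item in pipeline.stored]
print(titles)
```

Because parse is a generator, each item is produced only when the loop asks for it, so scraping and storing alternate item by item rather than running as two separate batches.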

Next article: 4.4 Scraping Website Data with Scrapy
