
Applying the Scrapy Framework

A case study: crawling the Maoyan Top 100 movie chart with the Scrapy framework.

1. Create the Project

```shell
scrapy startproject Maoyan100
# enter the project directory
cd Maoyan100
# create the spider file; note that the argument must be the site's domain
scrapy genspider maoyan www.maoyan.com
```

2. Define the Data Structure

Define the item fields in items.py:

```python
import scrapy


class Maoyan100Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    star = scrapy.Field()
    time = scrapy.Field()
```
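Declaring fields matters because a `scrapy.Item` behaves like a dict restricted to its declared fields: assigning an undeclared key raises an error, which catches typos early. A minimal pure-Python sketch of that behavior (illustration only, not the real `scrapy.Item` implementation):

```python
# Sketch of scrapy.Item's dict-like behavior: only declared fields
# may be assigned; anything else raises KeyError.
class SketchItem(dict):
    fields = {'name', 'star', 'time'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)

item = SketchItem()
item['name'] = '霸王别姬'   # OK: declared field
try:
    item['rating'] = 9.6    # rejected: not declared in fields
except KeyError as e:
    print('rejected:', e)
```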

3. Write the Spider

Write the spider file maoyan.py as follows:

```python
import scrapy

from Maoyan100.items import Maoyan100Item


class MaoyanSpider(scrapy.Spider):
    name = "maoyan"
    allowed_domains = ["www.maoyan.com"]
    start_urls = ["https://www.maoyan.com/board/4?offset=0"]  # first URL to crawl

    offset = 0  # pagination offset

    def parse(self, response, **kwargs):
        # parse with XPath: match the list of dd nodes holding the movie info
        dd_list = response.xpath('//dd')
        # instantiate the Maoyan100Item class defined in items.py for each movie
        for dd in dd_list:
            item = Maoyan100Item()
            item['name'] = dd.xpath('.//div/div/div[1]/p[1]/a/text()').extract_first()
            item['star'] = dd.xpath('.//div/div/div[1]/p[2]/text()').extract_first().replace('\n', '').strip()
            item['time'] = dd.xpath('.//div/div/div[1]/p[3]/text()').extract_first()
            yield item
        if self.offset < 90:  # the Top 100 chart has 10 pages: offsets 0-90
            self.offset += 10
            url = 'https://www.maoyan.com/board/4?offset=' + str(self.offset)
            # hand the URL to the scheduler, which enqueues it;
            # the response is passed automatically to the parse() callback.
            # scrapy.Request() issues the request and delivers the response
            # to the callback for parsing
            yield scrapy.Request(url=url, callback=self.parse)
```
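The pagination above steps the `offset` query parameter by 10 per page; the Top 100 chart spans ten pages. The full set of page URLs the spider works through can be listed up front, which makes the pagination logic easy to verify:

```python
# The ten pages of the Top 100 chart use offsets 0, 10, ..., 90
# in the query string.
base = 'https://www.maoyan.com/board/4?offset='
urls = [base + str(offset) for offset in range(0, 100, 10)]

print(len(urls))   # 10 pages
print(urls[0])
print(urls[-1])
```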

4. Implement Data Storage

Persist the scraped data by writing the pipeline file pipelines.py, which stores the items in a MySQL database. The database and table must exist beforehand; since they were created in an earlier chapter, that step is not repeated here. The code is as follows:

```python
import pymysql
from Maoyan100.settings import *


class Maoyan100Pipeline:
    def process_item(self, item, spider):
        print(item['name'], item['star'], item['time'])
        return item


class Maoyan100MysqlPipeline:

    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host=MYSQL_HOST,
            port=MYSQL_PORT,
            user=MYSQL_USER,
            password=MYSQL_PASSWORD,
            database=MYSQL_DB,
            charset=MYSQL_CHARSET
        )
        self.cursor = self.conn.cursor()
        print('database connection established')

    def process_item(self, item, spider):
        sql = 'insert into filmtab values(%s,%s,%s)'
        self.cursor.execute(sql, (item['name'], item['star'], item['time']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```
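Note that the insert passes values as a separate tuple rather than formatting them into the SQL string, so the driver escapes them safely. The same parameterized-insert pattern can be tried with the standard-library sqlite3 module, which needs no MySQL server (sqlite3 uses `?` placeholders where pymysql uses `%s`):

```python
import sqlite3

# Demonstrate the pipeline's parameterized-insert pattern with the
# stdlib sqlite3 driver; placeholder style is ? instead of pymysql's %s.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table filmtab (name text, star text, time text)')

item = {'name': '霸王别姬', 'star': '主演：张国荣', 'time': '上映时间：1993-01-01'}
cursor.execute('insert into filmtab values (?,?,?)',
               (item['name'], item['star'], item['time']))
conn.commit()

rows = cursor.execute('select name from filmtab').fetchall()
print(rows[0][0])
conn.close()
```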

5. Define a Launch Script

Create a start.py file (the name is up to you) in the project root with the following code:

```python
from scrapy import cmdline

cmdline.execute('scrapy crawl maoyan -o maoyan.csv'.split())
```
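`cmdline.execute()` expects the command as a list of arguments, which is why the string is split:

```python
# str.split() turns the command string into the argument vector
# that cmdline.execute() expects.
argv = 'scrapy crawl maoyan -o maoyan.csv'.split()
print(argv)
```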

Note: the -o option exports the scraped items to a file in the format implied by the extension, such as csv, json, or xml; -o appends to an existing file, while -O (in Scrapy 2.0+) overwrites it.

6. Modify the Configuration File

Finally, modify settings.py. The main changes are: adjusting the log output, activating the pipelines, defining the database constants, and a few other common options.
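A sketch of the relevant settings.py additions, assuming the constant names used by the MySQL pipeline above; the host, credentials, database name, and User-Agent shown here are placeholders to adjust for your environment:

```python
# settings.py (excerpt) -- placeholder values, adjust to your setup
LOG_LEVEL = 'WARNING'                     # quieter log output

ITEM_PIPELINES = {
    'Maoyan100.pipelines.Maoyan100MysqlPipeline': 200,  # lower number runs first
    'Maoyan100.pipelines.Maoyan100Pipeline': 300,
}

# database constants imported by the MySQL pipeline
MYSQL_HOST = 'localhost'
MYSQL_PORT = 3306
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_DB = 'maoyandb'
MYSQL_CHARSET = 'utf8'

ROBOTSTXT_OBEY = False                    # common in tutorials; check the site's policy
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0',          # placeholder UA string
}
```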