Applying the Scrapy Framework
A case study: scraping the Maoyan Top 100 movie chart with the Scrapy framework.
1. Create the Project
```shell
scrapy startproject Maoyan100
# Enter the project directory
cd Maoyan100
# Create the spider file; note that the URL must be the site's domain
scrapy genspider maoyan www.maoyan.com
```
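If the commands succeed, they produce the standard Scrapy skeleton; the layout should look roughly like this (spiders/maoyan.py is the file generated by the second command):

```
Maoyan100/
├── scrapy.cfg
└── Maoyan100/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── maoyan.py
```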
2. Define the Data Structure
Define the data structure in items.py:
```python
import scrapy


class Maoyan100Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # movie title
    star = scrapy.Field()  # starring actors
    time = scrapy.Field()  # release date
```
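As a quick aside, a scrapy.Item behaves much like a dict, except that only declared fields may be assigned; a tiny illustrative snippet (not part of the project files):

```python
from Maoyan100.items import Maoyan100Item

item = Maoyan100Item()
item['name'] = 'Example Movie'   # fine: 'name' is a declared field
# item['rating'] = 9.6           # would raise KeyError: undeclared field
print(item)                      # {'name': 'Example Movie'}
```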
3. Write the Spider
The spider file maoyan.py is written as follows:
```python
import scrapy
from Maoyan100.items import Maoyan100Item


class MaoyanSpider(scrapy.Spider):
    name = "maoyan"
    allowed_domains = ["www.maoyan.com"]
    start_urls = ["https://www.maoyan.com/board/4?offset=0"]  # first URL to crawl
    offset = 0  # pagination offset

    def parse(self, response, **kwargs):
        # Parse with XPath: each movie entry on the page is a dd node
        dd_list = response.xpath('//dd')
        # Instantiate the Maoyan100Item class from items.py for each entry
        for dd in dd_list:
            item = Maoyan100Item()
            item['name'] = dd.xpath('.//div/div/div[1]/p[1]/a/text()').extract_first()
            item['star'] = dd.xpath('.//div/div/div[1]/p[2]/text()').extract_first().replace('\n', '').strip()
            item['time'] = dd.xpath('.//div/div/div[1]/p[3]/text()').extract_first()
            print(item)
            yield item
        if self.offset < 100:
            self.offset += 10
            url = 'https://www.maoyan.com/board/4?offset=' + str(self.offset)
            # Hand the URL to the scheduler's request queue;
            # the response is automatically passed to the parse() callback.
            # scrapy.Request() sends a request to the URL and delivers the
            # response to the registered callback for parsing.
            yield scrapy.Request(url=url, callback=self.parse)
```
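As an alternative to building the next-page URL by hand, Scrapy's response.follow resolves relative URLs and constructs the Request for you. The sketch below is a hypothetical equivalent variant, not the tutorial's code:

```python
import scrapy


class MaoyanFollowSpider(scrapy.Spider):
    # Hypothetical variant of MaoyanSpider, shown only to illustrate response.follow
    name = "maoyan_follow"
    allowed_domains = ["www.maoyan.com"]
    start_urls = ["https://www.maoyan.com/board/4?offset=0"]
    offset = 0

    def parse(self, response, **kwargs):
        # ... extract items exactly as in MaoyanSpider above ...
        if self.offset < 100:
            self.offset += 10
            # response.follow resolves the relative path against the current
            # page URL and returns a Request bound to the given callback
            yield response.follow(f'/board/4?offset={self.offset}', callback=self.parse)
```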
4. Implement Data Storage
Data storage is implemented in the pipeline file pipelines.py, which saves the scraped data to a MySQL database. The database and table must be created in advance; since that was done in an earlier chapter, the steps are not repeated here.
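For reference, here is a minimal sketch of a schema compatible with the INSERT used below; the column names and types are assumptions, not necessarily the exact schema from the earlier chapter:

```python
# One-off helper that creates a filmtab table matching the pipeline's INSERT.
# Connection values and column types are placeholders; adjust to your setup.
import pymysql

conn = pymysql.connect(host='localhost', user='root',
                       password='123456', database='maoyandb', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS filmtab (
            name VARCHAR(100),   -- movie title
            star VARCHAR(300),   -- starring actors
            time VARCHAR(100)    -- release date
        )
    """)
conn.commit()
conn.close()
```

With the table in place, the pipeline code is as follows: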
```python
import pymysql
from itemadapter import ItemAdapter
from Maoyan100.settings import *


class Maoyan100Pipeline:
    def process_item(self, item, spider):
        print(item['name'], item['star'], item['time'])
        return item


class Maoyan100MysqlPipeline:
    def open_spider(self, spider):
        # Open the MySQL connection once, when the spider starts
        self.conn = pymysql.connect(
            host=MYSQL_HOST,
            port=MYSQL_PORT,
            user=MYSQL_USER,
            password=MYSQL_PASSWORD,
            database=MYSQL_DB,
            charset=MYSQL_CHARSET
        )
        self.cursor = self.conn.cursor()
        print('Database connection established')

    def process_item(self, item, spider):
        # Parameterized INSERT guards against SQL injection
        sql = 'insert into filmtab values(%s,%s,%s)'
        self.cursor.execute(sql, (item['name'], item['star'], item['time']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.cursor.close()
        self.conn.close()
```
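Committing once per item is simple but incurs a database round trip per row. A common variation is to buffer items and flush them in batches with executemany; the sketch below is a hypothetical alternative under the same table assumption, not the tutorial's pipeline:

```python
import pymysql
from Maoyan100.settings import *


class Maoyan100BatchMysqlPipeline:
    """Hypothetical batching variant: buffers items and flushes every 10 rows."""

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=MYSQL_HOST, port=MYSQL_PORT,
                                    user=MYSQL_USER, password=MYSQL_PASSWORD,
                                    database=MYSQL_DB, charset=MYSQL_CHARSET)
        self.cursor = self.conn.cursor()
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append((item['name'], item['star'], item['time']))
        if len(self.buffer) >= 10:
            self._flush()
        return item

    def _flush(self):
        # executemany sends all buffered rows in a single call
        self.cursor.executemany('insert into filmtab values(%s,%s,%s)', self.buffer)
        self.conn.commit()
        self.buffer.clear()

    def close_spider(self, spider):
        if self.buffer:
            self._flush()  # write any remaining rows
        self.cursor.close()
        self.conn.close()
```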
5. Define a Launch File
Create a start.py file in the project root (any name works), with the following code:
```python
from scrapy import cmdline

cmdline.execute('scrapy crawl maoyan -o maoyan.csv'.split())
```
Note: the -o option writes the scraped data to a file in the given format, such as csv, txt, or json. (In Scrapy 2.x, -o appends to an existing file, while -O overwrites it.)
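With start.py in place, the whole crawl can be launched from an IDE's run button or directly from the terminal:

```shell
python start.py
```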
6. Modify the Configuration File
Finally, modify the configuration file settings.py. The main changes are: adding log output, activating the pipelines, defining the database constants, and setting a few other common options.
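Below is a minimal sketch of the relevant settings.py entries, assuming the constant names imported by pipelines.py; the concrete values (credentials, log level, user agent) are placeholders to adapt:

```python
# settings.py (excerpt) -- placeholder values, adjust to your environment
BOT_NAME = "Maoyan100"

SPIDER_MODULES = ["Maoyan100.spiders"]
NEWSPIDER_MODULE = "Maoyan100.spiders"

# A browser-like User-Agent (many sites reject Scrapy's default one)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Tutorials commonly disable robots.txt checking; mind the site's terms
ROBOTSTXT_OBEY = False

# Log output: show warnings and errors only, to keep the console readable
LOG_LEVEL = "WARNING"

# Activate the pipelines; lower numbers run earlier
ITEM_PIPELINES = {
    "Maoyan100.pipelines.Maoyan100Pipeline": 300,
    "Maoyan100.pipelines.Maoyan100MysqlPipeline": 400,
}

# Database constants imported by pipelines.py
MYSQL_HOST = "localhost"
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"
MYSQL_DB = "maoyandb"
MYSQL_CHARSET = "utf8"
```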