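Introduction

Many websites now render their pages dynamically in the browser with JavaScript, which means the data cannot be scraped directly from the raw page source.
This post therefore uses Scrapy-Splash, which adds JavaScript rendering to Scrapy.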

Installing Scrapy-Splash

First, note that Splash, the rendering service behind Scrapy-Splash, runs in Docker, so Docker has to be set up beforehand.

Installing Docker

Downloading, installing and deploying Docker on Linux

Installing Splash

Once Docker is installed, start the Splash service in Docker with this command:

docker run -d -p 8050:8050 scrapinghub/splash

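Testing

Once Splash is running, test it from a browser, replacing the IP with the address of the server where Splash is installed. If the page shown in the figure opens, the installation succeeded.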

http://192.168.1.104:8050
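
The same check can be scripted instead of done by hand. The following is a minimal sketch using only the Python standard library; the helper names render_url and splash_is_up are hypothetical and not part of Splash or scrapy-splash. It assumes Splash's render.html endpoint with its url and wait parameters, and treats any HTTP 200 answer from the base address as "Splash is reachable".

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(splash_base, target, wait=0.5):
    """Build a render.html URL asking Splash to render `target`.

    `wait` is the number of seconds Splash waits after the page loads.
    """
    query = urlencode({"url": target, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


def splash_is_up(splash_base, timeout=5):
    """Return True if the Splash base address answers with HTTP 200."""
    try:
        with urlopen(splash_base, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False
```

Calling splash_is_up("http://192.168.1.104:8050") from the crawler machine should return True when the container is running, and opening the address returned by render_url in a browser should show the rendered target page.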

The code

First install the scrapy-splash Python package:

pip install scrapy-splash

Then write the spider:

import scrapy
from scrapy_splash import SplashRequest


class mainSpider(scrapy.Spider):
    name = "test"
    start_urls = ['https://chp.shadiao.app/']
    # Throttle requests through Scrapy instead of time.sleep(), which would
    # block the whole event loop while sleeping.
    custom_settings = {'DOWNLOAD_DELAY': 7}

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Lua script for Splash's `execute` endpoint: open the page, scroll
        # down to trigger lazily loaded content, wait for the JavaScript to
        # finish, then return the rendered HTML.
        self.script = """
        function main(splash)
            --splash.response_body_enabled = true
            splash.private_mode_enabled = false
            splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
            splash:set_viewport_size(1028, 10000)
            splash:go(splash.args.url)
            local scroll_to = splash:jsfunc("window.scrollTo")
            scroll_to(0, 5000)
            splash:wait(15)
            return {
                html = splash:html()
            }
        end
        """
        # `lua_source` is only used by the `execute` endpoint, so every
        # request below targets that endpoint; 'images': 0 asks Splash not
        # to load images.
        self.splash_args = {'lua_source': self.script, 'images': 0}

    def splash_request(self, url):
        # dont_filter=True because parse() requests the same URL again.
        return SplashRequest(url, callback=self.parse, endpoint='execute',
                             args=self.splash_args,
                             meta={'dont_redirect': True}, dont_filter=True)

    def start_requests(self):
        for url in self.start_urls:
            yield self.splash_request(url)

    def parse(self, response):
        # Join all text nodes under the element with id "txt_chp".
        content = ''.join(response.xpath('//*[@id="txt_chp"]//text()').extract())
        self.logger.info(content)

        # Request the page again; this repeats until the spider is stopped.
        yield self.splash_request(self.start_urls[0])
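
When a Lua script misbehaves, it can help to try it against Splash directly, without Scrapy in between. The sketch below is an assumption-laden illustration rather than what scrapy-splash does internally: it posts a JSON body containing lua_source and the script arguments to Splash's execute endpoint using only the standard library. The names build_execute_request and fetch_html, the simplified script, and the localhost address in the usage note are all hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Simplified script for illustration: load the page, wait, return the HTML.
LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(splash.args.wait)
    return {html = splash:html()}
end
"""


def build_execute_request(splash_base, target, wait=2):
    """Build (but do not send) a POST request for Splash's execute endpoint."""
    payload = {"lua_source": LUA_SCRIPT, "url": target, "wait": wait}
    return Request(
        url=splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_html(splash_base, target, wait=2, timeout=60):
    """Send the request and return the `html` field of the JSON reply."""
    req = build_execute_request(splash_base, target, wait)
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["html"]
```

For example, fetch_html("http://localhost:8050", "https://chp.shadiao.app/") should return the rendered page source if a Splash container is listening on that address.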

Settings file

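SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    # Project-specific middleware from the crawlerDemo project.
    'crawlerDemo.middlewares.CrawlerdemoDownloaderMiddleware': 1,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Splash-aware duplicate filter, as recommended by the scrapy-splash README.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Address of the Splash instance started with Docker.
SPLASH_URL = 'http://xxx.xxx.xxx.xxx:8050'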
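
The numbers in DOWNLOADER_MIDDLEWARES are priorities. Scrapy calls process_request in increasing order and process_response in decreasing order. The small sketch below only sorts the dictionary from the settings to make that order visible; the helper names request_order and response_order are hypothetical, and Scrapy's built-in middlewares, which are merged into the same ordering at run time, are left out.

```python
DOWNLOADER_MIDDLEWARES = {
    'crawlerDemo.middlewares.CrawlerdemoDownloaderMiddleware': 1,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


def request_order(middlewares):
    """Middleware paths in the order their process_request is called."""
    return [path for path, _ in sorted(middlewares.items(), key=lambda kv: kv[1])]


def response_order(middlewares):
    """Middleware paths in the order their process_response is called."""
    return list(reversed(request_order(middlewares)))


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```

Read this way, an outgoing request passes the project middleware first and the Splash middlewares afterwards, while a response coming back from Splash is decompressed before SplashMiddleware processes it.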

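Running the Scrapy project

With this in place, running the Scrapy project (for example with scrapy crawl test, since the spider is named test) sends its requests through Splash for rendering.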