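Preface
Many websites now render their pages dynamically with JavaScript in the browser, which means the data cannot be scraped directly from the raw page source.
This post therefore uses Scrapy-Splash, which adds JavaScript rendering to Scrapy.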
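Installing Scrapy-Splash
First, one thing to be clear about: Scrapy-Splash depends on the Splash rendering service, which is normally run as a Docker container, so Docker has to be set up beforehand.
Installing Docker
[Downloading, installing, and deploying Docker on Linux]
Installing Splash
Once Docker is installed, start the Splash container by running this command: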
```shell
docker run -d -p 8050:8050 scrapinghub/splash
```
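Testing
After the container starts, test it from a browser, replacing the IP with the address of the server where Splash is running. If the Splash page opens, the installation succeeded.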
```
http://192.168.1.104:8050
```
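The same check can be scripted. The sketch below is not from the original post: it builds a request URL for Splash's `render.html` HTTP endpoint and, when called, fetches the rendered HTML. The helper names `build_render_url` and `fetch_rendered_html`, the `wait` value, and the choice of `urllib` are assumptions made for illustration; the host is the example address used above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(splash_base, target_url, wait=2.0):
    """Return the render.html URL that asks Splash to render target_url."""
    # "url" and "wait" are query parameters of Splash's render.html endpoint.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


def fetch_rendered_html(splash_base, target_url, wait=2.0, timeout=60):
    """Fetch the JavaScript-rendered HTML (requires a running Splash server)."""
    request_url = build_render_url(splash_base, target_url, wait)
    with urlopen(request_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Hypothetical usage, pointing at your own Splash server:
# html = fetch_rendered_html("http://192.168.1.104:8050", "https://chp.shadiao.app/")
```

If the call returns HTML that contains the content filled in by JavaScript, Splash is working.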
The code
First install the scrapy-splash Python package:
```shell
pip install scrapy-splash
```
```python
import scrapy
from scrapy_splash import SplashRequest


class mainSpider(scrapy.Spider):
    name = "test"
    url = "https://chp.shadiao.app/"

    # Lua script run by Splash's "execute" endpoint: load the page,
    # scroll down, wait for the JavaScript to finish, then return the HTML.
    script = """
    function main(splash)
        splash.private_mode_enabled = false
        splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
        splash:set_viewport_size(1028, 10000)
        splash:go(splash.args.url)
        local scroll_to = splash:jsfunc("window.scrollTo")
        scroll_to(0, 5000)
        splash:wait(15)
        return {html = splash:html()}
    end
    """

    def splash_request(self):
        # dont_filter=True allows the same URL to be requested repeatedly.
        return SplashRequest(
            url=self.url,
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": self.script, "images": 0},
            dont_filter=True,
        )

    def start_requests(self):
        yield self.splash_request()

    def parse(self, response):
        content = "".join(response.xpath('//*[@id="txt_chp"]//text()').extract())
        print(content)
        # Request the page again to fetch the next piece of text.
        # To throttle requests, set DOWNLOAD_DELAY in settings.py rather than
        # calling time.sleep(), which blocks Scrapy's event loop.
        yield self.splash_request()
```
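To make clear what the `execute` endpoint receives, here is a minimal sketch (not part of the original post) that builds the JSON body for a direct HTTP POST to Splash, which is roughly what `SplashRequest` assembles from its `args`. The helper name `build_execute_payload` and the shortened Lua script are illustrative assumptions.

```python
import json

# A shortened Lua script for illustration; Splash calls main(splash).
LUA_SCRIPT = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(2)
    return {html = splash:html()}
end
"""


def build_execute_payload(url, lua_source, **extra_args):
    """Return the JSON body for a POST to <splash>/execute."""
    # Every key in the body becomes available to the script as splash.args.<key>.
    payload = {"url": url, "lua_source": lua_source}
    payload.update(extra_args)
    return json.dumps(payload)


body = build_execute_payload("https://chp.shadiao.app/", LUA_SCRIPT, images=0)
```

Posting `body` to the `/execute` path of the Splash server with a `Content-Type: application/json` header should return the table produced by the script as JSON.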
Configuration file
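```python
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    # The project's own downloader middleware.
    'crawlerDemo.middlewares.CrawlerdemoDownloaderMiddleware': 1,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Address of the Splash server started with Docker.
SPLASH_URL = 'http://xxx.xxx.xxx.xxx:8050'

# Make duplicate filtering aware of Splash arguments (recommended by scrapy-splash).
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```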
Running the Scrapy project
With this in place, running the Scrapy project renders pages through Splash.
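The post does not show the run command itself. As a small sketch, assuming the spider keeps the name `test` and the script is started from the project directory, the standard `scrapy crawl` command can be launched from Python as follows; the helper names `crawl_command` and `run_spider` are hypothetical.

```python
import subprocess


def crawl_command(spider_name):
    """Return the command line that runs one spider by name."""
    return ["scrapy", "crawl", spider_name]


def run_spider(spider_name, project_dir="."):
    """Run the spider inside the Scrapy project directory; return the exit code."""
    return subprocess.run(crawl_command(spider_name), cwd=project_dir).returncode


# Hypothetical usage, equivalent to typing `scrapy crawl test` in a shell:
# run_spider("test")
```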