Is there a way to collect data or parse pages with BeautifulSoup from dynamically generated web pages?
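I have used BeautifulSoup to parse data from web pages. However, I am not sure how to collect data from pages that are populated by scripts (JS and JSON), which is what I see when I view the source. Is there a tool that can fetch or render such pages so that I can collect data or links from them?
I have included an example of such a JSON/JS page source below.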
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" class="__meteor-css__" href="/3688b5ba42be128b061150ae66a2c2f245507d7e.css?meteor_css_resource=true"> <link rel="stylesheet" type="text/css" class="__meteor-css__" href="/4281a8e71152d94a7380f89ab8dd32d9542c9b5c.css?meteor_css_resource=true">
<meta name="fragment" content="!">
<script type="text/inject-data">%7B%22fast-render-data%22%3A%7B%22collectionData%22%3A%7B%22users%22%3A%5B%5B%7B%22emails%22%3A%5B%7B%22address%22%3A%22suhas.servesh%40gmail.com%22%2C%22verified%22%3Afalse%7D%5D%2C%22profile%22%3A%7B%22defaultSiteName%22%3A%22draftkings%22%2C%22defaultSportName%22%3A%22mlb%22%7D%2C%22username%22%3A%22kloudklown%22%2C%22_id%22%3A%22YnZKGMPLrwHCzHRh5%22%7D%5D%5D%2C%22kadira_settings%22%3A%5B%5B%7B%22appId%22%3A%22SiGbMwMEWLf7WK3KB%22%2C%22endpoint%22%3A%22https%3A%2F%2Fenginex.kadira.io%22%2C%22clientEngineSyncDelay%22%3A10000%2C%22enableErrorTracking%22%3Atrue%2C%22_id%22%3A%22SgS4nrWA5a6nDdzaY%22%7D%5D%5D%7D%2C%22subscriptions%22%3A%7B%7D%2C%22loginToken%22%3A%22-cCvsClRaCVlHa24nJLdIjfDp0EOC_flNuR7IR6Qxqj%22%7D%7D</script>
<script type="text/javascript" src="https://js.stripe.com/v2/"></script>
<script type="text/javascript" src="https://checkout.stripe.com/checkout.js"></script>
<link href="https://d1mua5vq38hnzr.cloudfront.net/favicon.ico" rel="icon" type="image/x-icon" />
<script type="text/javascript" src="https://static.leaddyno.com/js"></script>
<!-- Facebook Pixel Code -->
<script>
!function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?
n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;
n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0;
t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,
document,'script','https://connect.facebook.net/en_US/fbevents.js');
fbq('init', '156814968048022');
fbq('track', "PageView");</script>
<noscript><img height="1" width="1" style="display:none"
src="https://www.facebook.com/tr?id=156814968048022&ev=PageView&noscript=1"
/></noscript>
<!-- End Facebook Pixel Code -->
</head>
<body>
<script type="text/javascript">__meteor_runtime_config__ = JSON.parse(decodeURIComponent("%7B%22meteorRelease%22%3A%22METEOR%401.3.4.1%22%2C%22meteorEnv%22%3A%7B%22NODE_ENV%22%3A%22production%22%2C%22TEST_METADATA%22%3A%22%7B%7D%22%7D%2C%22PUBLIC_SETTINGS%22%3A%7B%22ga%22%3A%7B%22account%22%3A%22UA-58886344-1%22%7D%7D%2C%22ROOT_URL%22%3A%22https%3A%2F%2Fdailyfantasynerd.com%22%2C%22ROOT_URL_PATH_PREFIX%22%3A%22%22%2C%22appId%22%3A%228u0umeqb2znyyvsybl%22%2C%22kadira%22%3A%7B%22appId%22%3A%22SiGbMwMEWLf7WK3KB%22%2C%22endpoint%22%3A%22https%3A%2F%2Fenginex.kadira.io%22%2C%22clientEngineSyncDelay%22%3A10000%2C%22enableErrorTracking%22%3Atrue%7D%2C%22autoupdateVersion%22%3A%22cd1f15509aed34ad130a1b1cc1c46cb282abe1dd%22%2C%22autoupdateVersionRefreshable%22%3A%227a8125062727989a665ebc42d995410c7cc05ab7%22%2C%22autoupdateVersionCordova%22%3A%22none%22%7D"));</script>
<script type="text/javascript" src="/e517e573069a465b017732a35a886ff1c36e2550.js?meteor_js_resource=true"></script>
</body>
</html>
You can use PyQt and its WebKit bindings. Here is a sample script, taken from this blog post:
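import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from bs4 import BeautifulSoup


class Render(QWebPage):
    """Load a URL in a WebKit page and block until loading has finished."""

    def __init__(self, url):
        # A QApplication must exist before any QWebPage is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Run the event loop; it exits when _loadFinished calls quit().
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the frame so the rendered HTML can be read afterwards.
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://webscraping.com'
r = Render(url)
# toHtml() returns the DOM as it stands once loading has finished.
html = r.frame.toHtml()
soup = BeautifulSoup(html, 'html.parser')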
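For the particular page shown in the question, rendering may not even be necessary: part of the data appears to sit in the `<script type="text/inject-data">` tag as URL-encoded JSON. The following is a minimal sketch of decoding it with BeautifulSoup alone. It assumes the tag always holds URL-encoded JSON as in the sample; `extract_inject_data` is a hypothetical helper name, and the HTML string is a shortened stand-in for the real page source.

```python
import json
from urllib.parse import unquote

from bs4 import BeautifulSoup

# Shortened stand-in for the page source in the question (assumption:
# the real page keeps the same "text/inject-data" layout).
html = '''<html><head>
<script type="text/inject-data">%7B%22fast-render-data%22%3A%7B%22collectionData%22%3A%7B%22users%22%3A%5B%5B%7B%22username%22%3A%22kloudklown%22%7D%5D%5D%7D%7D%7D</script>
</head><body></body></html>'''


def extract_inject_data(page_html):
    """Return the decoded inject-data payload, or None if the tag is absent."""
    soup = BeautifulSoup(page_html, 'html.parser')
    tag = soup.find('script', attrs={'type': 'text/inject-data'})
    if tag is None or not tag.string:
        return None
    # The payload is percent-encoded JSON: undo the encoding, then parse it.
    return json.loads(unquote(tag.string))


data = extract_inject_data(html)
users = data['fast-render-data']['collectionData']['users']
print(users[0][0]['username'])
```

This only recovers data that the server embedded in the initial HTML. Content that the page's scripts fetch later would still need a rendering approach such as the PyQt script above.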