Is there a way to collect data or parse pages with BeautifulSoup from dynamically generated webpages?

Updated: 2023-09-26

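I have used BeautifulSoup to parse data from webpages. However, I am not sure how to collect data from pages whose content is populated by scripts (JS and JSON), which is all I see when I view the source. Is there a tool that fetches or renders such pages so that I can collect data or links from them?

Below is an example of the source of such a JSON/JS page.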

<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" type="text/css" class="__meteor-css__" href="/3688b5ba42be128b061150ae66a2c2f245507d7e.css?meteor_css_resource=true">  <link rel="stylesheet" type="text/css" class="__meteor-css__" href="/4281a8e71152d94a7380f89ab8dd32d9542c9b5c.css?meteor_css_resource=true">
<meta name="fragment" content="!">
<script type="text/inject-data">%7B%22fast-render-data%22%3A%7B%22collectionData%22%3A%7B%22users%22%3A%5B%5B%7B%22emails%22%3A%5B%7B%22address%22%3A%22suhas.servesh%40gmail.com%22%2C%22verified%22%3Afalse%7D%5D%2C%22profile%22%3A%7B%22defaultSiteName%22%3A%22draftkings%22%2C%22defaultSportName%22%3A%22mlb%22%7D%2C%22username%22%3A%22kloudklown%22%2C%22_id%22%3A%22YnZKGMPLrwHCzHRh5%22%7D%5D%5D%2C%22kadira_settings%22%3A%5B%5B%7B%22appId%22%3A%22SiGbMwMEWLf7WK3KB%22%2C%22endpoint%22%3A%22https%3A%2F%2Fenginex.kadira.io%22%2C%22clientEngineSyncDelay%22%3A10000%2C%22enableErrorTracking%22%3Atrue%2C%22_id%22%3A%22SgS4nrWA5a6nDdzaY%22%7D%5D%5D%7D%2C%22subscriptions%22%3A%7B%7D%2C%22loginToken%22%3A%22-cCvsClRaCVlHa24nJLdIjfDp0EOC_flNuR7IR6Qxqj%22%7D%7D</script>
<script type="text/javascript" src="https://js.stripe.com/v2/"></script>
    <script type="text/javascript" src="https://checkout.stripe.com/checkout.js"></script>
<link href="https://d1mua5vq38hnzr.cloudfront.net/favicon.ico" rel="icon" type="image/x-icon" />
    <script type="text/javascript" src="https://static.leaddyno.com/js"></script>
    <!-- Facebook Pixel Code -->
    <script>
    !function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?
    n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;
    n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0;
    t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,
    document,'script','https://connect.facebook.net/en_US/fbevents.js');
    fbq('init', '156814968048022');
    fbq('track', "PageView");</script>
    <noscript><img height="1" width="1" style="display:none"
    src="https://www.facebook.com/tr?id=156814968048022&ev=PageView&noscript=1"
    /></noscript>
    <!-- End Facebook Pixel Code -->
</head>
<body>

<script type="text/javascript">__meteor_runtime_config__ = JSON.parse(decodeURIComponent("%7B%22meteorRelease%22%3A%22METEOR%401.3.4.1%22%2C%22meteorEnv%22%3A%7B%22NODE_ENV%22%3A%22production%22%2C%22TEST_METADATA%22%3A%22%7B%7D%22%7D%2C%22PUBLIC_SETTINGS%22%3A%7B%22ga%22%3A%7B%22account%22%3A%22UA-58886344-1%22%7D%7D%2C%22ROOT_URL%22%3A%22https%3A%2F%2Fdailyfantasynerd.com%22%2C%22ROOT_URL_PATH_PREFIX%22%3A%22%22%2C%22appId%22%3A%228u0umeqb2znyyvsybl%22%2C%22kadira%22%3A%7B%22appId%22%3A%22SiGbMwMEWLf7WK3KB%22%2C%22endpoint%22%3A%22https%3A%2F%2Fenginex.kadira.io%22%2C%22clientEngineSyncDelay%22%3A10000%2C%22enableErrorTracking%22%3Atrue%7D%2C%22autoupdateVersion%22%3A%22cd1f15509aed34ad130a1b1cc1c46cb282abe1dd%22%2C%22autoupdateVersionRefreshable%22%3A%227a8125062727989a665ebc42d995410c7cc05ab7%22%2C%22autoupdateVersionCordova%22%3A%22none%22%7D"));</script>
  <script type="text/javascript" src="/e517e573069a465b017732a35a886ff1c36e2550.js?meteor_js_resource=true"></script>

</body>
</html>
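
In a source like the one above, some data is already present without any rendering: the script tag of type text/inject-data holds URL-encoded JSON. A minimal sketch of reading it with BeautifulSoup plus the standard library follows. The helper name extract_inject_data and the shortened sample markup are hypothetical, made up for illustration, and this only recovers what the server embedded in the page, not content that the external scripts load later.

```python
import json
from urllib.parse import unquote

from bs4 import BeautifulSoup


def extract_inject_data(html):
    """Return the decoded JSON payload of the first text/inject-data script, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", attrs={"type": "text/inject-data"})
    if tag is None or not tag.string:
        return None
    # The payload is percent-encoded JSON: undo the encoding, then parse it.
    return json.loads(unquote(tag.string.strip()))


# Shortened, made-up sample in the same shape as the page above (assumption).
sample_html = (
    '<html><head><script type="text/inject-data">'
    '%7B%22fast-render-data%22%3A%7B%22collectionData%22%3A%7B%22users%22'
    '%3A%5B%5B%7B%22username%22%3A%22example%22%7D%5D%5D%7D%7D%7D'
    '</script></head><body></body></html>'
)

data = extract_inject_data(sample_html)
users = data["fast-render-data"]["collectionData"]["users"]
```

If the tag is missing or empty the helper returns None, so a caller can fall back to rendering the page.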

You can use PyQt and its WebKit bindings. Here is a sample script, taken from this blog post:

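import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from bs4 import BeautifulSoup


class Render(QWebPage):
    """Load a URL in WebKit and block until the page has finished loading."""

    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Run the Qt event loop; _loadFinished() stops it.
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the frame so the rendered HTML can be read afterwards.
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://webscraping.com'
r = Render(url)
# Markup as it stands when loading finished, including script changes so far.
html = r.frame.toHtml()
soup = BeautifulSoup(html, 'html.parser')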
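
Once the rendered markup is in a string such as html, the usual BeautifulSoup calls apply to it. A minimal sketch of collecting links from rendered HTML follows. The function name collect_links and the sample markup are hypothetical, and the string rendered_html stands in for the output of a real rendering step.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without a target; urljoin resolves relative paths.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


# Made-up stand-in for markup produced by a rendering step (assumption).
rendered_html = (
    '<body><a href="/players">Players</a>'
    '<a href="https://example.com/x">X</a>'
    '<a>no href</a></body>'
)

links = collect_links(rendered_html, "http://webscraping.com")
```

Passing the page URL as base_url keeps relative links usable outside the page they came from.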