Use BeautifulSoup to obtain "View Element" code instead of "View Source" code

Updated: 2023-09-26

I use the following code to get all <script>...</script> content from a web page (see the url in the code):

import urllib2  # Python 2 only; on Python 3 the equivalent is urllib.request
from bs4 import BeautifulSoup

url = "http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/"
page = urllib2.urlopen(url)
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(page.read(), "html.parser")
script = soup.find_all("script")
print script  # just to check the output of script

However, BeautifulSoup searches the page's source code (Ctrl + U in Chrome). I want BeautifulSoup to search the page's element code (Ctrl + Shift + I in Chrome) instead.

I want this because the piece of code I am really interested in is in the element code, not in the source code.

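The first thing to understand is that neither BeautifulSoup nor urllib2 is a browser. urllib2 only fetches the initial "static" page; it cannot execute JavaScript the way a real browser does. That is why you will always get the "View page source" content.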
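To make that concrete, here is a minimal sketch of what a parser sees when a page builds part of its DOM with JavaScript. The HTML string is a made-up stand-in for what urllib2 would download (it is not taken from the page in the question), and the element ids are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML, standing in for the bytes urllib2 would download.
STATIC_HTML = """
<html>
  <body>
    <div id="player"></div>
    <script>
      var video = document.createElement("video");
      video.id = "injected";
      document.getElementById("player").appendChild(video);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The <script> tag is present, but only as text that was never executed.
scripts = soup.find_all("script")
script_count = len(scripts)
mentions_video = "createElement" in scripts[0].string

# The <video> element that a browser would create does not exist here.
injected = soup.find(id="injected")
player_children = soup.find(id="player").find_all(True)

print(script_count, mentions_video, injected, player_children)
```

The parser finds the script as plain text, while the element the script would add in a browser is missing. That missing element is exactly the difference between "View Source" and the element inspector.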

To solve your problem, launch a real browser through selenium, wait for the page to load, get .page_source, and pass it to BeautifulSoup for parsing:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".fluid-width-video-wrapper")))
# get the page source
page_source = driver.page_source
driver.quit()  # quit() ends the whole browser session; close() only closes the current window
# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
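Once the rendered HTML is available, the list of script tags usually needs some sorting before it is useful. The sketch below is one possible way to separate external scripts from inline ones. The helper name split_scripts and the sample HTML are hypothetical; the sample stands in for driver.page_source so the snippet runs without a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered HTML, standing in for driver.page_source.
PAGE_SOURCE = """
<html><head>
<script src="/js/player.js"></script>
<script>var config = {"autoplay": false};</script>
</head><body>
<script type="application/json" id="data">{"round": 1}</script>
</body></html>
"""


def split_scripts(html):
    """Return (external_urls, inline_bodies) for every <script> tag in html."""
    soup = BeautifulSoup(html, "html.parser")
    external, inline = [], []
    for tag in soup.find_all("script"):
        src = tag.get("src")
        if src:
            # External script: only the URL is in the page, not the code.
            external.append(src)
        else:
            # Inline script: the code (or data) is the tag's single string.
            inline.append((tag.string or "").strip())
    return external, inline


external, inline = split_scripts(PAGE_SOURCE)
print(external)
print(inline)
```

For external scripts the page only carries a URL, so their code would have to be fetched separately if it is needed.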

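That is the general approach, but your case is slightly different: the page has an iframe element that contains the video player. To reach the script elements inside the iframe, you need to switch to it first and then get .page_source: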

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://racing4everyone.eu/2015/10/25/formula-e-201516formula-e-201516-round01-china-race/")
# wait for the page to load, switch to iframe
wait = WebDriverWait(driver, 10)
frame = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe[src*=video]")))
driver.switch_to.frame(frame)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".controls")))
# get the page source
page_source = driver.page_source
driver.quit()  # quit() ends the whole browser session; close() only closes the current window
# parse the HTML
soup = BeautifulSoup(page_source, "html.parser")
script = soup.find_all("script")
print(script)
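The reason the switch is needed can be illustrated without a browser. An iframe's content is a separate document, so the outer page's HTML contains only the iframe tag and its src. The sample HTML and the example.com URL below are assumptions made for illustration, not the real markup of the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical outer document; the iframe's own HTML is not part of it.
OUTER_HTML = """
<html><body>
<script>var outer = true;</script>
<div class="fluid-width-video-wrapper">
  <iframe src="https://example.com/video/embed/123"></iframe>
</div>
</body></html>
"""

soup = BeautifulSoup(OUTER_HTML, "html.parser")

# Only the outer document's scripts are visible from here.
outer_scripts = [tag.string for tag in soup.find_all("script")]

# The iframe appears as a tag with a src attribute ...
frames = soup.find_all("iframe")
frame_sources = [f["src"] for f in frames if "video" in f.get("src", "")]

# ... but there is nothing inside it to parse.
frame_scripts = [s for f in frames for s in f.find_all("script")]

print(outer_scripts, frame_sources, frame_scripts)
```

Parsing the outer page therefore never reaches the player's scripts; after driver.switch_to.frame(frame), .page_source returns the iframe's document instead.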