Selenium WebDriver: analyzing a large collection of links quickly

Keywords: links, WebDriver, Selenium. Updated: 2023-09-26

I have a web page with a large number of links (around 300), and I would like to collect information about those links.

Here is my code:

beginning_time = Time.now
#This gets a collection of links from the webpage
tmp = driver.find_elements(:xpath,"//a[string()]")
end_time = Time.now
puts "Execute links:#{(end_time - beginning_time)*1000} milliseconds for #{tmp.length} links"

before_loop = Time.now
#Here I iterate through the links
tmp.each do |link|
    #I am not interested in the links I can't see
    if(link.location.x < windowX and link.location.y < windowY)
        #I then insert the links into a NoSQL database, 
        #but for all purposes you could imagine this as just saving the data in a hash table.
        $elements.insert({
            "text" => link.text,
            "href" => link.attribute("href"),
            "type" => "text",
            "x" => link.location.x,
            "y" => link.location.y,
            "url" => url,
            "accessTime" => accessTime,
            "browserId" => browserId
        })
    end
end
after_loop = Time.now
puts "The loop took #{(after_loop - before_loop)*1000} milliseconds"

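Currently it takes 20 ms to get the collection of links and about 4000 ms (4 seconds) to retrieve the information for the links. When I separate the accessors from the NoSQL insert, I find that the NoSQL insert takes only 20 ms and that most of the time is spent in the accessors (which, for reasons I don't understand, became a lot slower once separated from the NoSQL insert). This leads me to conclude that the accessors must be executing JavaScript.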
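For illustration, separating the accessors from the insert could be measured along these lines. This is only a sketch: `elapsed_ms` and `time_accessors` are hypothetical helper names, not part of Selenium, and the only assumption about each link is that it responds to `text`, `attribute` and `location`, as the elements in the code above do.

```ruby
# Runs the given block and returns the elapsed wall-clock time in milliseconds.
def elapsed_ms
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round(1)
end

# Times each accessor on its own over the whole collection, with no
# database insert involved, so the cost of each call can be compared.
def time_accessors(links)
  {
    "text"     => elapsed_ms { links.each { |link| link.text } },
    "href"     => elapsed_ms { links.each { |link| link.attribute("href") } },
    "location" => elapsed_ms { links.each { |link| link.location } }
  }
end
```

With the collection from the code above, `time_accessors(tmp).each { |name, ms| puts "#{name}: #{ms} ms" }` would print one line per accessor.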

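My question is: how can I collect these links and their information faster?

The first solution that came to mind was to run two drivers in parallel, but WebDriver is not thread-safe, so I would have to create a new WebDriver instance and navigate to the page. That raises the question of how to download the page source so it can be loaded into another driver. This cannot be done in Selenium, so it would have to be done with a desktop automation tool on Chrome itself, which adds considerable overhead.

Another alternative I have heard of is to stop using ChromeDriver and use PhantomJS instead, but I need to display the page in a visible browser.

Are there any other alternatives I haven't considered yet?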

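You appear to be using WebDriver purely to execute JavaScript rather than to access the objects.

If you drop the JavaScript, here are a couple of ideas to try (excuse the Java, but you get the idea):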

        // Needs java.util.List and org.openqa.selenium.By, Point, WebElement
        // Restricting via XPath returns fewer links AND avoids checking the text inside the loop
        List<WebElement> linksWithText = driver.findElements(By.xpath("//a[text() and not(text()='')]"));
        for (WebElement link : linksWithText) {
            // Store the location details rather than re-fetching them each time
            Point location = link.getLocation();
            int x = location.getX();
            int y = location.getY();
            if (x < windowX && y < windowY) {
                // Insert all info using WebDriver commands
            }
        }
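The same idea could be expressed in Ruby, the language of the question, roughly as follows. This is a sketch rather than a tested drop-in: `collect_visible_links` is a hypothetical name, and returning an array of hashes stands in for the NoSQL insert. It assumes only that the driver responds to `find_elements` and that each element responds to `location`, `text` and `attribute`, as in the question's code.

```ruby
# Collects the links that fall inside the given window bounds,
# reading each element's location once instead of up to four times.
def collect_visible_links(driver, window_x, window_y)
  # The XPath keeps only anchors that have a non-empty text node.
  links = driver.find_elements(:xpath, "//a[text() and not(text()='')]")

  links.each_with_object([]) do |link, rows|
    location = link.location # a single call, reused below
    next unless location.x < window_x && location.y < window_y

    rows << {
      "text" => link.text,
      "href" => link.attribute("href"),
      "x"    => location.x,
      "y"    => location.y
    }
  end
end
```

Each returned hash can then be merged with the remaining fields (`url`, `accessTime`, `browserId`) before it is inserted.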

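I typically run against a remote grid, so performance is a key concern in my tests. That is why I always try to restrict the results via CSS selectors or XPath rather than fetching everything and looping.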