Use CasperJS to get all links from a page

Keywords: link extraction, casperjs, usage. Updated: 2023-09-26

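I need to use CasperJS and PhantomJS to get only the links from a href attributes and img src attributes that start with http, https, ftp or ftps (I think this is the regular expression to use: /((http|https|ftp|ftps):\/\/[^"]+)"/g).

I have implemented code that gets links from a tags only, but I need to improve it so that it also gets the links from img tags that match the regular expression...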

var casper = require('casper').create();
var links = [];

// Runs in the page context: collect the src attribute of every img element.
function getLinks() {
    var images = document.querySelectorAll('img');
    return Array.prototype.map.call(images, function (e) {
        return e.getAttribute('src');
    });
}

casper.start('https://marvel.com');

casper.then(function () {
    // Fall back to an empty array if nothing comes back from the page.
    links = this.evaluate(getLinks) || [];
});

casper.run(function () {
    links.forEach(function (link) {
        console.log(link);
    });
    this.exit();
});

Strings provide a match method that evaluates a regular expression against the string. It returns the matches as an array, or null if there are no matches.
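To illustrate the two return values described above, here is a minimal sketch in plain JavaScript. The pattern and the sample strings are assumptions chosen for illustration; they are not taken from any real page.

```javascript
// Illustrative pattern: accepts strings starting with http, https, ftp or ftps.
var pattern = /^(ht|f)tps?:\/\//;

// Made-up sample values standing in for attribute strings.
var hit = 'https://example.com/logo.png'.match(pattern);
var miss = '/images/logo.png'.match(pattern);

console.log(hit[0]); // the matched prefix: "https://"
console.log(miss);   // null, because the string does not start with a scheme
```

Because a failed match yields null rather than an empty array, coercing the result with !! gives a plain boolean that can be returned from a filter callback.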

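casper.then(function () {
    // Keep only src values that start with http, https, ftp or ftps.
    var regex = /^(ht|f)tps?:\/\//;
    var srcs = this.getElementsAttribute('img', 'src').filter(function (src) {
        // src is null for img elements that have no src attribute
        return !!src && !!src.match(regex);
    });
    srcs.forEach(function (src) {
        console.log(src);
    });
});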

The regular expression from the question does not look right for this purpose, so I used var regex = /^(ht|f)tps?:\/\//; instead.
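The question also asks for a href values, so the same filter could be applied to both attribute lists. The following is only a sketch under stated assumptions: keepAbsolute and collectLinks are hypothetical helper names, and the sample arrays stand in for whatever the page actually returns.

```javascript
// Accepts strings starting with http://, https://, ftp:// or ftps://.
var absolute = /^(ht|f)tps?:\/\//;

// Hypothetical helper: drop null, empty and relative values.
function keepAbsolute(values) {
    return values.filter(function (value) {
        // Attribute lookups can yield null for elements without the attribute.
        return typeof value === 'string' && absolute.test(value);
    });
}

// Hypothetical helper: merge the filtered href and src lists.
function collectLinks(hrefs, srcs) {
    return keepAbsolute(hrefs).concat(keepAbsolute(srcs));
}

// Made-up sample data standing in for the two attribute lists.
var result = collectLinks(
    ['https://example.com/a', '#top', null, 'ftp://example.com/file'],
    ['/img/logo.png', 'http://example.com/pic.jpg', '']
);
console.log(result);
```

Inside a casper.then step, the two arguments would presumably come from this.getElementsAttribute('a', 'href') and this.getElementsAttribute('img', 'src'). The code sticks to ES5 syntax so that it should also run under PhantomJS.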