How to crawl a web page that runs a JavaScript cookie check before showing the main content

Updated: 2023-09-26

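I am trying to fetch and parse the following RSS feed: http://english.alarabiya.net/.mrss/en/sports.xml

When I open it in a browser, I get the normal RSS feed that I want to parse. But when I download it in Java, I get the following instead: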

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/javascript">
function getCookie(c_name) { // Local function for getting a cookie value
    if (document.cookie.length > 0) {
        c_start = document.cookie.indexOf(c_name + "=");
        if (c_start!=-1) {
        c_start=c_start + c_name.length + 1;
        c_end=document.cookie.indexOf(";", c_start);
        if (c_end==-1) 
            c_end = document.cookie.length;
        return unescape(document.cookie.substring(c_start,c_end));
        }
    }
    return "";
}
function setCookie(c_name, value, expiredays) { // Local function for setting a value of a cookie
    var exdate = new Date();
    exdate.setDate(exdate.getDate()+expiredays);
    document.cookie = c_name + "=" + escape(value) + ((expiredays==null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";
}
function getHostUri() {
    var loc = document.location;
    return loc.toString();
}
setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '105.183.123.12', 10);
try {  
    location.reload(true);  
} catch (err1) {  
    try {  
        location.reload();  
    } catch (err2) {  
        location.href = getHostUri();  
    }  
}
</script>
</head>
<body>
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript>
</body>
</html>

I am using a plain stream read. Here is my code:

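    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class FeedDownload {
        public static void main(String[] args) {
            try {
                URL url = new URL("http://english.alarabiya.net/.mrss/en/sports.xml");
                // Plain stream read: no cookies are kept and no JavaScript is executed
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(url.openStream()))) {
                    String inputLine;
                    while ((inputLine = in.readLine()) != null) {
                        System.out.println(inputLine);
                    }
                }
            } catch (IOException e) {
                // MalformedURLException is a subclass of IOException
                e.printStackTrace();
            }
        }
    }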

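Does anyone know how to get past the JavaScript cookie check and parse the main RSS content? Any ideas are welcome.

P.S. I am using the ROME library to fetch the RSS feed, but I think this problem is outside its scope.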

Try the HtmlUnit library with setJavaScriptEnabled(true).

Your problem is similar to this one.
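
If a headless browser such as HtmlUnit is more than you need, the challenge page in the question can possibly be handled by hand, since its script only sets one cookie and reloads. The sketch below is an alternative to the HtmlUnit suggestion, not a confirmed fix: it downloads the page once, pulls the name and value out of the setCookie('...', '...', ...) call with a regular expression, and repeats the request with that pair in a Cookie header. The class and method names (CookieChallengeFetcher, extractCookie, fetch, fetchFeed) are made up for illustration. It assumes that the server checks nothing else (for example the User-Agent), that the cookie value needs no escaping, and that the feed is UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replays the cookie that the challenge page sets via JavaScript.
final class CookieChallengeFetcher {

    // Matches the call setCookie('name', 'value', ...) but not the function
    // definition setCookie(c_name, value, expiredays), which has no quotes.
    private static final Pattern SET_COOKIE =
            Pattern.compile("setCookie\\(\\s*'([^']+)'\\s*,\\s*'([^']*)'");

    /** Returns "name=value" taken from the challenge page, or null if no call is found. */
    static String extractCookie(String html) {
        Matcher m = SET_COOKIE.matcher(html);
        return m.find() ? m.group(1) + "=" + m.group(2) : null;
    }

    /** Downloads a URL as text, sending a Cookie header when one is given. */
    static String fetch(String address, String cookie) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        if (cookie != null) {
            conn.setRequestProperty("Cookie", cookie);
        }
        StringBuilder body = new StringBuilder();
        // Assumption: the response can be decoded as UTF-8.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    /** Fetches the feed; if the challenge page comes back, retries once with its cookie. */
    static String fetchFeed(String address) throws IOException {
        String first = fetch(address, null);
        String cookie = extractCookie(first);
        return cookie == null ? first : fetch(address, cookie);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetchFeed("http://english.alarabiya.net/.mrss/en/sports.xml"));
    }
}
```

The string returned by fetchFeed can then be handed to whatever RSS parser you already use. If the second response is still the challenge page, the server is probably checking something this sketch does not reproduce, and a JavaScript-capable client is the safer route.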