Get HTML between two tags

Updated: 2023-09-26

I'm trying to pull some HTML content from an internal forum. To stay independent, we are using Node.js, Express, and similar tools.

When I open the page directly, I get the following HTML back:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="content-type" content="text/html; charset=us-ascii" />
    <meta name="description" content="myForum" />
    <meta name="viewport" content="width=320; user-scalable=no" />
    <title>myForum</title>
</head>
<body>
        <table>
            <tr>
                <td align="left" valign="top" width="100%">
                    <center>
                        <h1><img class="banner" src=
                        "./img/myForum.jpg" width="730"
                        height="117" border="0" alt="myForum" /></h1>
                    </center>
                    <hr />
                    <center>
                        [ <a href="answerswer.php?id=975710">Antworten</a> ]&nbsp;&nbsp;[
                        <a href="index.php">Forum</a> ]&nbsp;&nbsp;[ <a href=
                        "newEntries.php">Neue Beitr&auml;ge</a> ]
                    </center>
                    <hr />
                    <h1>sCHween</h1>geschrieben von&nbsp;<font color=
                    "#FFFFFF">User1</font>&nbsp;&nbsp;am&nbsp;18.06.2014&nbsp;um&nbsp;21:26:15
                    <hr />
                    This is my text! It could contain images and links!
                    <img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png" /><br />
                    <a href="http://www.google.com/">Google</a>
                    <br />
                    <hr />
                    <b>Antworten:</b><br />
                    <a href="thread.php?id=9752">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User2</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;22:56:27<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9756">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User2</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;23:14:44<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9753">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User1</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;23:02:21<br />
                    <a href="showentry.php?id=975713">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User1</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;21:46:13<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9720">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User3</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;22:22:25<br />
                    &nbsp;&nbsp;&nbsp;&nbsp;<a href="showentry.php?id=9755">Re:
                    sCHween</a>&nbsp;-&nbsp;<b><font color=
                    "#FFFFFF">User4</font></b>&nbsp;-&nbsp;18.06.2014&nbsp;21:52:51<br />
                    <hr />
                    <span>
                        <a href="answerswer.php?id=975">Antworten</a><br />
                        <a href="recent.php">Neue Beitr&auml;ge</a><br />
                    </span>
                    <hr />
                </td>
            </tr>
        </table>
</body>
</html>

What we want is the HTML source of the content between two hr tags:

This is my text! It could contain images and links!
<img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png" /><br />
<a href="http://www.google.com/">Google</a>

Is there a simple way to get the source between two hr tags? If not, what is the cleanest and easiest way to extract this content?

Not sure whether this is what you want:

jQuery:

// Walk every child node (elements and text nodes) of the table cell.
var allContent = $("td").contents();
var hrCount = 0;
var addContent = false;
var result = "";
allContent.each(function () {
    if (this.nodeName === "HR") {
        hrCount++;
        // Collect only the nodes between the third and the fourth <hr>.
        addContent = (hrCount === 3);
    } else if (addContent) {
        if (this.nodeType === 1) {
            result += this.outerHTML;   // element node: keep its markup
        } else if (this.nodeType === 3) {
            result += this.nodeValue;   // text node: keep its text
        }
    }
});
alert(result);
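
The same "count the hr tags" idea can also be tried directly on the raw HTML string in Node, without any DOM. The sketch below is only a rough illustration: betweenHrs is a hypothetical helper name, and it assumes that hr tags never occur inside comments, scripts, or attribute values, which a regular expression cannot reliably rule out.

```javascript
// Hypothetical helper: returns the raw markup between the n-th and the
// (n+1)-th <hr> of an HTML string, or null if there is no such pair.
// Assumption: <hr> never appears inside comments, scripts, or attributes.
function betweenHrs(html, n) {
  // Matches <hr>, <hr/>, <hr /> and <hr class="x">, case-insensitively.
  const parts = html.split(/<hr\b[^>]*>/i);
  // parts[0] is everything before the first <hr>,
  // so parts[n] is whatever follows the n-th one.
  if (n < 1 || n >= parts.length - 1) {
    return null;
  }
  return parts[n].trim();
}

// Small made-up sample, not the forum page itself.
const sample =
  '<h1>Title</h1><hr />by User1<hr />Text <a href="#">link</a><br /><hr />Replies';
console.log(betweenHrs(sample, 2));
```

For the forum page in the question, the wanted block sits between the third and fourth hr, so the call would be betweenHrs(pageHtml, 3), where pageHtml is assumed to hold the downloaded page. A real parser is the safer choice as soon as the markup gets less predictable.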

jsdom is a good tool for DOM parsing in Node. Because you want both text nodes and regular elements converted to strings, we have to distinguish between the two:

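var jsdom = require("jsdom");
// Note: jsdom.env() is the legacy callback API of older jsdom releases.
jsdom.env(
  'http://example.com',                 // URL of the page to load
  ['http://code.jquery.com/jquery.js'], // scripts injected into the page
  function (errors, window) {
    var $hr = window.$('hr'),
        node = $hr.get(2).nextSibling,  // first node after the third <hr>
        endNode = $hr.get(3),           // stop at the fourth <hr>
        html = '';
    while (node && node !== endNode) {
        if (node.nodeType === 3) {      // text node
            html += node.textContent;
        } else {                        // element node
            html += node.outerHTML;
        }
        node = node.nextSibling;
    }
  }
);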

The value of html is now:

This is my text! It could contain images and links!
<img src="http://images.google.ch/intl/en_ALL/images/srpr/logo11w.png"><br>
<a href="http://www.google.com/">Google</a>
<br>
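
The sibling walk above can be pulled out into a small reusable function, which makes it easier to call for any pair of boundary nodes. This is a minimal sketch, not part of jsdom or jQuery: htmlBetween is a hypothetical name, and it relies only on the standard DOM properties nodeType, nextSibling, textContent and outerHTML.

```javascript
// Hypothetical helper: serialises every sibling that follows startNode,
// up to but not including endNode. Pass null as endNode to run to the
// last sibling.
function htmlBetween(startNode, endNode) {
  let html = '';
  for (let node = startNode.nextSibling;
       node && node !== endNode;
       node = node.nextSibling) {
    if (node.nodeType === 3) {
      html += node.textContent;   // text node: raw text, not re-escaped
    } else if (node.nodeType === 1) {
      html += node.outerHTML;     // element node: full markup
    }
    // Other node types (for example comments) are skipped.
  }
  return html;
}
```

With the $hr collection from the answer above, the call would be htmlBetween($hr.get(2), $hr.get(3)). Unlike the original loop, this version skips comment nodes instead of appending their undefined outerHTML.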