How do I parse ill-structured HTML code?

Keywords: html, code, ill-structured, parse, structure. Last updated: 2023-09-26

I have the following HTML code that I want to parse (some elements were stripped to improve readability):

</div>
            <article class="article-detail-description">
                <h1 class="page-heading">
                    Postulat operacyjności definicji w naukach społecznych
                    <br /><small>Definition’s Operativeness Postulate in Social Sciences</small>
                </h1>
                <div>
                    <strong>Author(s): </strong>Jakub Karpiński<br /><strong>Subject(s): </strong>Social Sciences<br /><strong>Published by: </strong>Instytut Filozofii i Socjologii Polskiej Akademii Nauk<br/><strong>Keywords: </strong>operationism; definition of property; definition of indicator;  concepts selection
<br/>
                </div>
                  <p class="summary"><strong>Summary/Abstract: </strong> 
The article’s primary goal is to demonstrate the problems inherited in “operationism – antioperationism” polemics.
</p>
                <ul class="nav nav-tabs">
                    <li class="active" ><a href="#details" data-toggle="tab">Details</a></li>
                    <li><a href="#tableOfContents" data-toggle="tab">Contents</a></li>
                </ul>
                <div class="tab-content">
                    <div class="tab-pane fade active in" id="details">
                        <p class="journal-link"><strong>Journal: </strong><a href="/search/journal-detail?id=10">Studia Socjologiczne</a></p>   
                        <ul class="article-additional-info">
                            <li><strong>Issue Year:</strong> 2011</li><li><strong>Issue No:</strong> 1 (200)</li><li><strong>Page Range:</strong> 65-80</li><li><strong>Page Count:</strong> 15</li><li><strong>Language:</strong> Polish</li>
                        </ul>
                    </div>

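I can read all of the content with the following command:

document.getElementsByClassName("article-detail-description")[0].textContent

To read only the `<p class="summary">` element, I use:

document.getElementsByClassName("summary")[0].textContent

However, the latter is not perfect, because it also returns the "Summary/Abstract:" label.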

I am interested in many of the elements; let's take the following as examples:

1. Postulat operacyjności definicji w naukach społecznych

I can get:

Postulat operacyjności definicji w naukach społecznych
Definition’s Operativeness Postulate in Social Sciences

To get it, I use: `document.getElementsByClassName("page-heading")[0].innerText`

How can I get "Postulat operacyjności definicji w naukach społecznych" and "Definition’s Operativeness Postulate in Social Sciences" separately?

2. I would like to get, for example, 2011 from:

`<li><strong>Issue Year:</strong> 2011</li><li>`

This time I have no idea how to get this information. The same goes for Issue No: and the other fields.

That depends on whether the structure is stable, but you can access the text nodes directly:

var heading = document.getElementsByClassName('page-heading')[0];
// childNodes[0] is the text node before <br />, childNodes[2] is the <small> element
var polish = heading.childNodes[0].textContent.trim();
var english = heading.childNodes[2].textContent.trim();
console.log("Polish:", polish);
console.log("English:", english);
// First <li> of the list: childNodes[0] is <strong>, childNodes[1] is the text node " 2011"
var li = document.querySelector('.article-additional-info li');
var issueYear = li.childNodes[1].textContent.trim();
console.log("Issue Year:", issueYear);
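The same text-node idea may extend to the other fields (Issue No:, Page Range:, the summary without its label, and so on). What follows is only a sketch, not part of the original answer: `labelledValues` is a hypothetical helper name, and it assumes every value sits in the node immediately after its `<strong>` label, which holds for the markup in the question but may not hold elsewhere.

```javascript
// Hypothetical helper (not from the original answer): build a
// { label: value } object from a list of <strong> label elements.
function labelledValues(strongNodes) {
  var result = {};
  Array.prototype.forEach.call(strongNodes, function (strong) {
    // "Issue Year:" becomes the key "Issue Year"
    var label = strong.textContent.trim().replace(/:$/, "");
    // Assumption: the value is the node right after the label. Usually that
    // is a text node, sometimes an element such as the journal <a>.
    var next = strong.nextSibling;
    result[label] = next ? next.textContent.trim() : "";
  });
  return result;
}

// Browser-only usage; skipped where no DOM is available.
if (typeof document !== "undefined") {
  var info = labelledValues(
    document.querySelectorAll(".article-detail-description strong")
  );
  console.log("Issue No:", info["Issue No"]);
  console.log("Summary:", info["Summary/Abstract"]);
}
```

Keying the result by label text rather than by position means a reordered list should still resolve, although a renamed label or an extra wrapper element around a value would break it.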