Get The Text Out of HTML Tag via getContext - Google Apps Script - Spreadsheets

Updated: 2023-09-26

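So, I'm in quite a bind with this Google Apps Script. For someone used to traditional JavaScript, it's quite a challenge. I'm currently trying to pull values from Zillow. I've succeeded with the first few items (rent value, Zestimate, school ratings), but now I need to get the school names. This has become very troublesome and I'm really stuck; I can't seem to write the .match() I need. I'll post some code and see if anyone can make sense of it.

The Zillow markup I'm parsing: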

<ul class="nearby-schools-list">
<li class="nearby-schools-header">
    <h4 class="nearby-schools-rating">&nbsp;</h4>
    <h4 class="nearby-schools-name">&nbsp;</h4>
    <h4 class="nearby-schools-grades">Grades</h4>
    <h4 class="nearby-schools-distance">Distance</h4>
</li>
<li class="nearby-school assigned-school">
    <span class="gs-rating-badge">
        <div class="gs-rating gs-rating-8">
            <span class="gs-rating-number">8</span>
            <span class="gs-rating-subtext">out of 10</span>
        </div>
    </span>
    <span class="nearby-schools-name"> <a href="/seattle-wa/schools/salmon-bay-school-93956/" class="ga-tracked-link track-ga-event school-name notranslate" data-ga-action="School details click" data-ga-label="HDP AB Module" data-ga-category="Homes" data-ga-standard-href="true">Salmon Bay School</a> 
        <span class="assigned-label de-emph">(assigned)</span>
    </span>
    <span class="nearby-schools-grades">K-8</span>
    <span class="nearby-schools-distance">0.3 mi</span>
</li>
<li class="nearby-school assigned-school">
    <span class="gs-rating-badge">
        <div class="gs-rating gs-rating-8">
            <span class="gs-rating-number">8</span>
            <span class="gs-rating-subtext">out of 10</span>
        </div>
    </span>
    <span class="nearby-schools-name"> <a href="/seattle-wa/schools/whitman-middle-school-93939/" class="ga-tracked-link track-ga-event school-name notranslate" data-ga-action="School details click" data-ga-label="HDP AB Module" data-ga-category="Homes" data-ga-standard-href="true">Whitman Middle</a> 
        <span class="assigned-label de-emph">(assigned)</span>
    </span>
    <span class="nearby-schools-grades">6-8</span>
    <span class="nearby-schools-distance">1.4 mi</span>
</li>
<li class="nearby-school assigned-school">
    <span class="gs-rating-badge">
        <div class="gs-rating gs-rating-9">
            <span class="gs-rating-number">9</span>
            <span class="gs-rating-subtext">out of 10</span>
        </div>
    </span>
    <span class="nearby-schools-name"> <a href="/seattle-wa/schools/ballard-high-school-92363/" class="ga-tracked-link track-ga-event school-name notranslate" data-ga-action="School details click" data-ga-label="HDP AB Module" data-ga-category="Homes" data-ga-standard-href="true">Ballard High</a> 
        <span class="assigned-label de-emph">(assigned)</span>
    </span>
    <span class="nearby-schools-grades">9-12</span>
    <span class="nearby-schools-distance">0.2 mi</span>
</li>

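It's a big block, but really I'm just trying to grab the text from school-name. It's the class listed under ul > li > span.nearby-schools-name > a.school-name.

Here's my attempt. Everything I try comes back blank.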

// get School Names
var match = contentText.match(/<a href="([^<]*)" class="ga-tracked-link track-ga-event school-name notranslate" /g);
Browser.msgBox(match);
var schoolNameArray = new Array();
while (match.length > 0) {
    var thisSchoolName = new String(schoolName.pop());
    Browser.msgBox(thisSchoolName);
    //schoolNameArray.push(thisSchoolName);
}
var schoolNames = schoolNameArray.toString().replace(/,/g, " _ ");

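A quick FAQ: I've tried replicating the getElementsByClassName implementations found around the web, with no luck. I've also tried grabbing the href.

Here is one approach. First get all the elements by class name: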

var elSchoolNames = document.getElementsByClassName("nearby-schools-name");

What comes back is an object. If you print the variable elSchoolNames to the console with console.log('elSchoolNames: ' + elSchoolNames); it looks like this:

[object HTMLCollection]

Inside the [object HTMLCollection] object are more objects; an array of objects:

[object HTMLHeadingElement]
[object HTMLSpanElement]
[object HTMLSpanElement]
[object HTMLSpanElement] 

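It's important to understand that objects have key:value pairs, but here there is also an array of objects with no keys (property names). To get the child objects out of the main object, refer to them by number: at that level it is an array, so they have no property names.

You need all of the span elements.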
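To make the array-like idea concrete, here is a small sketch that uses a plain object as a stand-in for the HTMLCollection. The `fakeCollection` data and the `toArray` helper are made up for illustration; a real collection would come from the browser. It shows that items are reached by numeric index and counted with `length`:

```javascript
// A stand-in for an HTMLCollection: numeric keys plus a length property.
// (Hypothetical data; in a browser this would come from getElementsByClassName.)
var fakeCollection = {
  0: { tagName: 'H4', textContent: '\u00a0' },
  1: { tagName: 'SPAN', textContent: ' Salmon Bay School (assigned) ' },
  2: { tagName: 'SPAN', textContent: ' Whitman Middle (assigned) ' },
  length: 3
};

// Copy an array-like object into a real array so array methods work on it.
function toArray(arrayLike) {
  return Array.prototype.slice.call(arrayLike);
}

var items = toArray(fakeCollection);
console.log(items.length);               // how many elements the collection holds
console.log(fakeCollection[1].tagName);  // second element, since indexing starts at 0
```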

var theSpanEl = elSchoolNames[1];
var theSpanE2 = elSchoolNames[2];
var theSpanE3 = elSchoolNames[3];
console.log('textContent: ' + theSpanEl.textContent);

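The school name is in the object's textContent property.

How did I know what objects are inside the first object, and what the first span element contains? I looped over all of the object's properties.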
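Rather than naming each span by hand, the same idea can be sketched as a loop. This is only an illustration: `collectSpanText` is a hypothetical helper, and it assumes its argument behaves like the collection returned by getElementsByClassName (numeric indexes, `length`, and elements with `tagName` and `textContent`).

```javascript
// Collect the text of every SPAN in an array-like collection of elements.
// `elements` is assumed to behave like the result of getElementsByClassName.
function collectSpanText(elements) {
  var texts = [];
  for (var i = 0; i < elements.length; i++) {
    // Skip the H4 header row, which shares the same class name.
    if (elements[i].tagName === 'SPAN') {
      texts.push(elements[i].textContent);
    }
  }
  return texts;
}

// In a browser you would call it like this:
// collectSpanText(document.getElementsByClassName('nearby-schools-name'));
```

Filtering on tagName avoids hard-coding index 1 as the starting point, so the loop still works if the header element is absent.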

var elSchoolNames = document.getElementsByClassName("nearby-schools-name");
console.log('namesOfSchools: ' + elSchoolNames);
for (var theProperty in elSchoolNames) {
    console.log('theProperties: ' + theProperty);
    console.log('each value: ' + elSchoolNames[theProperty]);
}
var theSpanEl = elSchoolNames[1];
for (var spanProperty in theSpanEl) {
    console.log('theProperties: ' + spanProperty);
    console.log('each value: ' + theSpanEl[spanProperty]);
}
console.log('textContent: ' + theSpanEl.textContent);

To get the child elements, you need every element after the first one. Because the collection is zero-indexed, the second element is number 1.

var theSpanEl = elSchoolNames[1];

Now, to see what you have, print it to the console:

console.log('textContent: ' + theSpanEl.textContent);

Which gives:

textContent:  Salmon Bay School 
    (assigned)

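Of course, you'll want to use a string method to strip off the trailing (assigned). You don't need .match() or a regex.

I just realized that if you are fetching the HTML content of a site that isn't yours, and that HTML content is a string, then none of this will work. The code above only applies if you inject the HTML into a page using innerHTML.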
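One way to do that with plain string methods is sketched below. The helper name `cleanSchoolName` is made up for illustration, and it assumes the label is always the literal text "(assigned)":

```javascript
// Remove a trailing "(assigned)" label and surrounding whitespace
// from the textContent of a school-name span.
function cleanSchoolName(rawText) {
  var marker = '(assigned)';
  var cut = rawText.indexOf(marker);
  // If the marker is missing, keep the whole string.
  var name = cut === -1 ? rawText : rawText.slice(0, cut);
  return name.trim();
}

console.log(cleanSchoolName(' Salmon Bay School \n    (assigned)\n'));
// prints "Salmon Bay School"
```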
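For the string case, a regular expression is one fallback. The sketch below is not the DOM approach described above; it assumes the page HTML is already in a string (for example the `contentText` variable from the question, presumably obtained with something like `UrlFetchApp.fetch(url).getContentText()`), and that the anchors look like the markup shown in the question. `extractSchoolNames` is a hypothetical helper name. Regex matching on HTML is brittle and will break if the markup changes.

```javascript
// Pull the link text out of every <a> tag whose class list includes "school-name".
function extractSchoolNames(html) {
  // Capture group 1 is the text between the opening <a ...> and the closing </a>.
  var pattern = /<a\b[^>]*class="[^"]*\bschool-name\b[^"]*"[^>]*>([^<]*)<\/a>/g;
  var names = [];
  var m;
  // exec() with the g flag walks through the string one match at a time,
  // which keeps the capture groups that match() with g would discard.
  while ((m = pattern.exec(html)) !== null) {
    names.push(m[1].trim());
  }
  return names;
}

// Hypothetical usage in Apps Script:
// var schoolNames = extractSchoolNames(contentText).join(' _ ');
```

Using exec() in a loop matters here: calling .match() with the g flag returns only the full matched tags, not the captured text.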