Use Regex to find a phone number on a page not in an anchor

Keywords: find phone number, Regex, usage. Updated: 2023-09-26

I have a regex expression that searches for a phone number pattern:

[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}

This matches phone numbers in the following formats:

123 456 7890
(123)456 7890
(123) 456 7890
(123)456-7890
(123) 456-7890
123.456.7890
123-456-7890

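I want to scan an entire page (with JavaScript) looking for this match, but excluding any match that already sits inside an anchor. Once a match is found, I want to convert the phone number into a click-to-call link for mobile devices: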

(123) 456-7890 --> <a href="tel:1234567890">(123) 456-7890</a>

I am pretty sure I need a negative lookaround. I have tried this, but it does not seem to be the right idea:

(?!.*(\<a href.*?\>))[(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4}

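Don't use regular expressions to parse HTML. Use an HTML/DOM parser to get the text nodes (the browser can filter them down for you, for instance by dropping text inside anchor tags and any text too short to contain a phone number), and then check the text directly.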

For example, with XPath (which is a bit ugly, but supports dealing with text nodes directly in a way most other DOM methods do not):

// This query finds all text nodes with at least 12 characters after whitespace
// normalization that are not direct children of an anchor tag
// Letting XPath apply basic filters dramatically reduces the number of elements
// you need to process (there are tons of short and/or pure whitespace text nodes
// in most DOMs)
var xpr = document.evaluate('descendant-or-self::text()[not(parent::A) and string-length(normalize-space(self::text())) >= 12]',
                            document.body, null, XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE, null);
for (var i=0, len=xpr.snapshotLength; i < len; ++i) {
    var txt = xpr.snapshotItem(i);
    // Splits with grouping to preserve the text split on
    var numbers = txt.data.split(/([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/);
    // split will return at least three items on a hit, prefix, split match, and suffix
    if (numbers.length >= 3) {
        var parent = txt.parentNode; // Save parent before replacing child
        // Insert new elements before existing element; first element is just
        // text before first phone number
        parent.insertBefore(document.createTextNode(numbers[0]), txt);
        // Now explicitly create pairs of anchors and following text nodes
        for (var j = 1; j < numbers.length; j += 2) {
            // Operate in pairs; odd index is phone number, even is
            // text following that phone number
            var anc = document.createElement('a');
            anc.href = 'tel:' + numbers[j].replace(/\D+/g, '');
            anc.textContent = numbers[j];
            parent.insertBefore(anc, txt);
            parent.insertBefore(document.createTextNode(numbers[j+1]), txt);
        }
        // Remove original text node now that we've inserted all the
        // replacement elements and don't need it for positioning anymore
        parent.removeChild(txt);
        parent.normalize(); // Merge adjacent text nodes and drop empty ones after rebuilding
    }
}
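
The split step is the part of the loop above that is easiest to get wrong, so here is a minimal sketch of it in isolation, runnable without a DOM. The sample string and the names `phonePattern`, `parts`, and `telTargets` are made up for illustration; only the pattern itself comes from the question.

```javascript
// Same pattern as above, wrapped in one capture group so that split()
// keeps the matched phone numbers in the result array.
const phonePattern = /([(]?\d{3}[)]?[(\s)?.-]\d{3}[\s.-]\d{4})/;

// Hypothetical sample text, not taken from any real page.
const sample = 'Call (123) 456-7890 or 555.123.4567 today';
const parts = sample.split(phonePattern);

// Even indexes hold the surrounding text, odd indexes hold the matches:
// ['Call ', '(123) 456-7890', ' or ', '555.123.4567', ' today']
console.log(parts);

// Digits-only form, as used for the tel: href in the loop above.
const telTargets = parts
    .filter((_, i) => i % 2 === 1)
    .map((num) => 'tel:' + num.replace(/\D+/g, ''));
console.log(telTargets); // ['tel:1234567890', 'tel:5551234567']
```

Because the whole pattern is one capture group, a text node with no phone number yields a single-element array, which is why the loop checks for a length of at least 3.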

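For the record, the basic filters help a lot on most pages. For example, on this page, right now, as I see it (this will vary by user, browser, browser extensions and scripts, etc.), without the filters, the snapshot for the query 'descendant-or-self::text()' would have 1794 items. Omitting text parented by anchor tags, 'descendant-or-self::text()[not(parent::A)]' takes it down to 1538, and the full query, which also verifies that the non-whitespace content is at least 12 characters long, takes it down to 87 items. Applying the regex to 87 items instead of 1794 is a large performance saving, and you have removed the need to parse HTML with an unsuitable tool.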

Use this as your regex:

(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))

Use this as your replacement string:

<a href="tel:$3$7$4$8$5$9">($3$7) $4$8-$5$9</a>

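This finds all phone numbers, both inside and outside anchor tags. In every case, however, it returns the parts of the phone number as specific regex groups. You can therefore wrap every phone number found in a new anchor tag, because wherever an anchor tag already exists, the whole original anchor is replaced.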
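
As a rough sketch of how that regex and replacement string could be applied in JavaScript, the snippet below runs them through `String.prototype.replace` on an HTML string. The input string and the variable names `pattern`, `replacement`, `html`, and `linked` are invented for illustration.

```javascript
// Groups 3-5 capture the digits of a number found inside an anchor,
// groups 7-9 the digits of a number found outside one.
const pattern = /(<a href.*?>.*?([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4})).*?<\/a>)|([(]?(\d{3})[)]?[(\s)?.-](\d{3})[\s.-](\d{4}))/g;

// Only one alternative matches at a time; JavaScript substitutes an empty
// string for groups that did not participate, so $3$7 yields one area code.
const replacement = '<a href="tel:$3$7$4$8$5$9">($3$7) $4$8-$5$9</a>';

// Hypothetical input: one bare number and one number already in an anchor.
const html = 'Call (123) 456-7890 or <a href="tel:5551234567">555-123-4567</a>.';
const linked = html.replace(pattern, replacement);

console.log(linked);
// Call <a href="tel:1234567890">(123) 456-7890</a> or <a href="tel:5551234567">(555) 123-4567</a>.
```

Two things to keep in mind with this approach: the existing anchor is rebuilt from scratch, so its original attributes and link text are discarded, and because `.` matches anything except a line break, the first alternative can span more than one anchor when a bare number sits between two anchors on the same line.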

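Regex groups, or "capture groups", capture specific parts of the overall regex match. They are created by enclosing part of the regex in parentheses. The groups are numbered from left to right in the order of their opening parentheses, and in a JavaScript replacement string the part of the input a group matched can be referenced by placing $ in front of its number. Other implementations use \ for this purpose, as does JavaScript inside the pattern itself. This is called a backreference. Backreferences can appear later in the regex itself or in the replacement string (as shown earlier in this answer). More information: http://www.regular-expressions.info/backref.html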
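
To make the numbering rule concrete, here is a small sketch with a simplified, hypothetical pattern (not the one from the question) that has a nested group, followed by a backreference used inside a pattern. The names `nested`, `m`, and `repeated` are illustrative only.

```javascript
// Groups are numbered by the position of their opening parenthesis:
// group 1 opens first (and contains groups 2 and 3), group 4 opens last.
const nested = /((\d{3})-(\d{3}))-(\d{4})/;
const m = '123-456-7890'.match(nested);

console.log(m[1]); // '123-456'
console.log(m[2]); // '123'
console.log(m[3]); // '456'
console.log(m[4]); // '7890'

// Inside the pattern itself, JavaScript writes a backreference as \1:
// this only matches when the same three digits appear twice.
const repeated = /(\d{3})-\1/;
console.log(repeated.test('555-555')); // true
console.log(repeated.test('555-556')); // false
```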

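To give a simpler example, suppose you have a document containing account numbers and other information. Each account number is preceded by the word "account", which you want to change to "acct", but "account" also appears elsewhere in the document, so you cannot simply find and replace the word on its own. You could use the regex account ([0-9]+). In this regex, ([0-9]+) forms a group that matches the actual account number, and we can backreference it as $1 in the replacement string, which becomes acct $1.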
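
In JavaScript, that account example might look like the sketch below. The sample sentence and the names `text` and `result` are made up for illustration.

```javascript
// Hypothetical document text: "account" appears both before account
// numbers and in an unrelated phrase.
const text = 'Your account 12345 is active. See account settings for account 67890.';

// Only "account" followed by a space and digits is rewritten; $1 puts the
// captured account number back into the output.
const result = text.replace(/account ([0-9]+)/g, 'acct $1');

console.log(result);
// Your acct 12345 is active. See account settings for acct 67890.
```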

You can test it here: http://regexr.com/