Finding Link Text with Regular Expressions

Keywords: link text, regular expression, find. Last updated: 2023-09-26

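Team:

I need some help with regular expressions. The goal is to recognize the three different ways a user might write a link in a comment, as shown below.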

<a href="http://www.msn.com">MSN</a>

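or possibly

http://www.msn.com or https://www.msn.com or www.msn.com

Once I can find them, I can turn each one into a proper A tag as needed. I realize the first example is already an A tag, but I need to add some attributes to it that are specific to our application, such as TARGET and ONCLICK.

Right now I have regular expressions that find each of these individually. They are listed below, in the same order as the examples above.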

<a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?
[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?

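The problem is that I cannot simply run all of them over the same string: the second expression matches part of whatever the first one matches, and the third matches part of what both the first and the second match. I need to find these three variants distinctly so that I can replace each one on its own terms, because, for example, matches of the third expression need http:// prepended.

I look forward to everyone's help!

Assuming that a link starts and ends at whitespace or at the start/end of the line (or sits inside an existing A tag), I ended up with the following code, which also includes some sample text: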
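To make the overlap concrete, here is a minimal sketch that runs the third (bare host name) expression from the question over inputs belonging to the other two cases. The class and method names (`OverlapDemo`, `FirstBareMatch`) are hypothetical and exist only for this illustration; the pattern itself is the one quoted above, written as a C# verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // Third pattern from the question (bare host names), unchanged.
    private const string BarePattern =
        @"[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?";

    // Returns the first substring the bare-host pattern finds in the input
    // (an empty string when there is no match).
    public static string FirstBareMatch(string input)
    {
        return Regex.Match(input, BarePattern).Value;
    }
}
```

Calling `OverlapDemo.FirstBareMatch` on `<a href="http://www.msn.com">MSN</a>` returns `www.msn.com`, and on `http://www.msn.com/path` it returns `www.msn.com/path`. In both cases the expression fires inside text that belongs to one of the other two forms, which is why the three patterns cannot be applied independently.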

using System.Text.RegularExpressions;
using System.Windows.Forms; // needed for MessageBox

// Group 1: what precedes the link (an opening <a ... href=", the start of the line, or whitespace).
// Group 2: the link itself. Group 3: what follows it. Group 4: the link text of an existing A tag.
string regexPattern = "((?:<a (?:.*?)href=\")|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:\\S+?))+?)((?:\"(?:.*?)>(.*?)</a>)|\\s|$)";
string[] examples = new string[] {
    "some text <a href=\"http://www.msn.com/path/file?page=some.page&subpage=9#jump\">MSN</a>  more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text https://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
    "some text www.msn.com/path/file?page=some.page&subpage=9#jump",
    "www.msn.com/path/file?page=some.page&subpage=9#jump more text"
};
Regex re = new Regex(regexPattern);
foreach (string s in examples) {
    MatchCollection mc = re.Matches(s);
    foreach (Match m in mc) {
        // Show the four captured parts of every link found in the sample.
        string prePart = m.Groups[1].Value;
        string actualLink = m.Groups[2].Value;
        string postPart = m.Groups[3].Value;
        string linkText = m.Groups[4].Value;
        MessageBox.Show(" prePart: '" + prePart + "'\n actualLink: '" + actualLink + "'\n postPart: '" + postPart + "'\n linkText: '" + linkText + "'");
    }
}

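Because this code uses numbered groups, the same regular expression can also be used in JavaScript.

Depending on what you need to do with existing A tags, you will also need to parse the first group specifically.

Update: modified the regex as requested so that the link text ends up in group 4.

Update 2: to catch malformed links better, you could try this modified version: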
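As a rough sketch of how the numbered groups could drive the actual replacement, the code below feeds the pattern from the code above into `Regex.Replace` with a match evaluator. The names `NumberedGroupRewriter`, `Rewrite` and `Anchor` are hypothetical, and the `target="_blank"` attribute is only a placeholder for whatever application-specific attributes (TARGET, ONCLICK) are really needed. It assumes an existing A tag can be rebuilt from just its link and link text, which discards any other attributes the tag carried.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberedGroupRewriter
{
    // Same pattern as in the code above, written as a verbatim string.
    private const string Pattern =
        @"((?:<a (?:.*?)href="")|^|\s)((?:http[s]?://)?(?:\S+)(?:\.(?:\S+?))+?)((?:""(?:.*?)>(.*?)</a>)|\s|$)";

    public static string Rewrite(string input)
    {
        return Regex.Replace(input, Pattern, m =>
        {
            string prePart = m.Groups[1].Value;
            string actualLink = m.Groups[2].Value;
            string postPart = m.Groups[3].Value;

            if (prePart.StartsWith("<a ", StringComparison.Ordinal))
            {
                // Existing A tag: rebuild it from the link and the link text (group 4).
                return Anchor(actualLink, m.Groups[4].Value);
            }

            // Plain-text link: add the scheme when it is missing and keep the
            // surrounding whitespace that groups 1 and 3 consumed.
            string href = actualLink.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                ? actualLink
                : "http://" + actualLink;
            return prePart + Anchor(href, actualLink) + postPart;
        });
    }

    // Placeholder tag builder; swap in the attributes the application needs.
    private static string Anchor(string href, string text)
    {
        return "<a href=\"" + href + "\" target=\"_blank\">" + text + "</a>";
    }
}
```

One thing worth testing with real comments: each match consumes the whitespace after the link, so two plain-text links separated by a single space appear to leave the second one without the leading whitespace the pattern requires.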

pattern = "((?:<a (?:.*?)href=\"?)|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:[^>\"\\s]+))+)((?:\"?(?:.*?)>(.*?)</a>)|\\s|$)";

Well, if we want to do it all in a single pass, we can create a named group for each scenario:
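For illustration, here is a small sketch that applies the Update 2 pattern to an A tag whose href value is missing its quotes, one kind of malformed link it is meant to tolerate. `MalformedLinkDemo` and `FirstLink` are hypothetical names, and the sample input is an assumption rather than text from the original examples.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MalformedLinkDemo
{
    // The "Update 2" pattern as a verbatim string: the quotes around the
    // href value are optional, and the link stops at '>', '"' or whitespace.
    private const string Pattern =
        @"((?:<a (?:.*?)href=""?)|^|\s)((?:http[s]?://)?(?:\S+)(?:\.(?:[^>""\s]+))+)((?:""?(?:.*?)>(.*?)</a>)|\s|$)";

    // Returns { actualLink, linkText } for the first match, or null if nothing matches.
    public static string[] FirstLink(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success) return null;
        return new[] { m.Groups[2].Value, m.Groups[4].Value };
    }
}
```

With the input `some text <a href=http://www.msn.com>MSN</a> more text`, `FirstLink` returns `http://www.msn.com` as the link and `MSN` as the link text, even though the href value is unquoted.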

(?<full><a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>.*</a>)|
(?<url>(http|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)|
(?<www>[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)

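Then you have to check which group actually matched:

Match match = regex.Match(input); // input is the text to search (a placeholder name), not the pattern
if (match.Success)
{
    if (match.Groups["full"].Success)
        Console.WriteLine(match.Groups["full"].Value);
    else if (match.Groups["url"].Success)
    ....
}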
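A minimal sketch of the single-pass idea, under some loud assumptions: the three alternatives below are deliberately simplified stand-ins for the longer expressions above, not the original patterns; the extra `href` and `text` groups, the names `LinkRewriter`, `Rewrite` and `Anchor`, and the `target="_blank"` attribute are all hypothetical. The point it tries to show is that, with the full A tag listed first, each link is matched exactly once and the match evaluator can branch on whichever named group succeeded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // Simplified stand-ins for the three patterns. Order matters: the full
    // A tag is tried first, so a URL inside it is never matched separately.
    private static readonly Regex LinkRegex = new Regex(
        @"(?<full><a\s[^>]*?href=""(?<href>[^""]*)""[^>]*>(?<text>.*?)</a>)" +
        @"|(?<url>https?://[^\s<>""]+)" +
        @"|(?<www>www\.[^\s<>""]+)",
        RegexOptions.IgnoreCase);

    public static string Rewrite(string input)
    {
        return LinkRegex.Replace(input, m =>
        {
            if (m.Groups["full"].Success)
                return Anchor(m.Groups["href"].Value, m.Groups["text"].Value);
            if (m.Groups["url"].Success)
                return Anchor(m.Groups["url"].Value, m.Groups["url"].Value);
            // Only the bare www form is left; it needs a scheme prepended.
            return Anchor("http://" + m.Groups["www"].Value, m.Groups["www"].Value);
        });
    }

    // Placeholder tag builder; swap in the attributes the application needs.
    private static string Anchor(string href, string text)
    {
        return "<a href=\"" + href + "\" target=\"_blank\">" + text + "</a>";
    }
}
```

Because `Regex.Replace` resumes scanning after each match in the original input, the text produced by the evaluator is not rescanned, so a rewritten tag is not processed a second time.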