Search through links and identify the RSS source with Regex, PHP or JavaScript

Keywords: JavaScript, identify, RSS, PHP, links, Regex, search. Last updated: 2023-09-26

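I'm building a news/blog aggregator focused on the Syrian conflict, and I want to be able to identify the source of each item. It's a simple website; the aggregator is an external JavaScript that pulls the RSS from my Yahoo Pipe. My problem is that I can't find a way to identify the sources (i.e. CNN, BBC, etc.).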

So I figured that if I scan the document and identify the href of each link, I could do something with it.

Say we have <a href="http://foxnews.com/blahblahblah.php">. I'd like to do something like IF href == http://foxnews.com { logo(fox); }, or something along those lines.

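I'm not sure whether I'm "thinking right" about this, but I'd really like to solve it. Any suggestions? Or is there author information in my RSS pipe that I'm missing?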

http://pipes.yahoo.com/pipes/pipe.run?_id=e9fdf79f13be013e7c3a2e4a7d0f2900&_render=rss

An RSS feed is just XML, so the first thing to do is find an XML parser for the language you want to use.

PHP has SimpleXML built in, which is quick and easy to use.

You would use it to pull out all the links like this:

// $xml is the parsed feed, e.g. the result of simplexml_load_file()
foreach ($xml->channel->item as $key => $item) {
    $link = $item->link;
}

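This is easy to follow: the feed's root element is <rss>, which contains a <channel> element, and inside that are all the news <item> tags. So we loop over those items and pull out the <link> child of each one.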

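When I got this far I realised it wouldn't take much longer to just do the whole thing for you. I strip each link down to just the domain by replacing http:// with an empty string, then explode the string using / as the delimiter. That splits the string into the chunks between the slashes, so the first chunk is our domain.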

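<?php
// Load the feed; the URL is the Yahoo Pipe from the question.
$url = 'http://pipes.yahoo.com/pipes/pipe.run?_id=e9fdf79f13be013e7c3a2e4a7d0f2900&_render=rss';
$xml = simplexml_load_file($url);
foreach ($xml->channel->item as $key => $item) {
    // Cast the SimpleXMLElement to a plain string before manipulating it.
    $link = (string) $item->link;
    // Drop the scheme, then split on "/" so the first chunk is the domain.
    $link = str_replace("http://", "", $link);
    $parts = explode('/', $link);
    $domain = $parts[0];
    print($domain . "<br/>");
}
?>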

This code gives me the following output:

www.ft.com
www.dailystar.com.lb
www.ft.com
www.ft.com
www.ft.com
www.ft.com
www.dailystar.com.lb
www.bbc.co.uk
....
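
Since the aggregator in the question is client-side JavaScript, the same domain extraction could be sketched there as well. This is only a rough sketch and not part of the original answer: it assumes the standard URL constructor is available (modern browsers and Node.js), and the function name domainOf is made up for illustration. Unlike the str_replace approach, it should also cope with https:// links.

```javascript
// Hypothetical helper (not from the original answer): returns the host
// part of a link, or null if the string is not an absolute URL.
function domainOf(link) {
  try {
    // The URL constructor throws a TypeError on strings it cannot parse.
    return new URL(link).hostname;
  } catch (err) {
    return null;
  }
}

console.log(domainOf("http://www.bbc.co.uk/news/world-middle-east"));
```

The returned string can then be compared against known domains in the same way as $domain in the PHP version.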

Then you could use a PHP switch statement to get the result you want for each link, like this:

switch($domain) {
  case "www.bbc.co.uk":
    // Do BBC stuff
    break;
  case "www.dailystar.com.lb":
    // Do daily star stuff
    break;
  default:
    // Do something for domains that aren't covered above
    break;
}
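
The same branching can also be written as a lookup table, which may be easier to extend as more sources are added. Below is a rough JavaScript sketch under some assumptions: the SOURCES table, the sourceFor name and the labels are all illustrative, and stripping a leading "www." is my own addition so that foxnews.com and www.foxnews.com would match the same entry.

```javascript
// Illustrative table: keys are domains without "www.", values are labels
// that the page could use to pick a logo. The entries are examples only.
const SOURCES = new Map([
  ["bbc.co.uk", "BBC"],
  ["dailystar.com.lb", "The Daily Star"],
  ["ft.com", "Financial Times"],
]);

// Hypothetical helper: maps a domain to a label, falling back to a
// generic value for domains that are not in the table (this plays the
// role of the "default" branch in the switch).
function sourceFor(domain) {
  const key = domain.toLowerCase().replace(/^www\./, "");
  return SOURCES.has(key) ? SOURCES.get(key) : "Unknown source";
}

console.log(sourceFor("www.bbc.co.uk"));
console.log(sourceFor("example.org"));
```

Adding a new source is then a one-line change to the table rather than a new case block.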

Good luck!