How to Grab a String from This to That from a Webpage's Source?

Keywords: string, grab, webpage. Last updated: 2023-09-26

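How can I grab a string from a webpage's source? I have looked all over PHP.net and cannot figure out whether PHP has a function, or a set of functions, that can grab a string from this to that.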

For example, this is what I have so far (I want to grab everything from "wgCategories" to "wgMonthNamesShort" out of the webpage stored in $html):

<?php
error_reporting(E_ALL);
$html = file_get_contents('http://en.wikipedia.org/wiki/Los_Angeles');
$string = ''; // placeholder: should hold everything from "wgCategories" to "wgMonthNamesShort"
?>

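First, I put the webpage's source into the $html variable. Now I need a function, or a set of functions, that can grab everything from "wgCategories" to "wgMonthNamesShort" and store it in $string.

Expected result: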

$string = "wgCategories":["All articles with dead external links","Articles with dead external links from March 2013","Articles with dead external links from March 2014","Pages with broken reference names","Articles with dead external links from January 2014","Articles with dead external links from September 2011","Articles with dead external links from October 2011","CS1 errors: dates","Use mdy dates from May 2014","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Coordinates on Wikidata","Articles including recorded pronunciations","Articles containing Spanish-language text","All articles with unsourced statements","Articles with unsourced statements from December 2013","Spoken articles","Articles with hAudio microformats","Los Angeles, California","Cities in Los Angeles County, California","Communities on U.S. Route 66","County seats in California","Incorporated cities and towns in California","Populated coastal places in California","Populated places established in 1781","Port cities and towns of the United States Pacific coast","Butterfield Overland Mail in California","Stockton - Los Angeles Road"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort";

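Finally, note that everything from "wgCategories" to "wgMonthNamesShort" sits between <script> tags (I am not sure whether that matters, but I was told it is worth mentioning).

Let me know if anything needs clarification.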

You can use preg_match with the s flag (DOTALL) to grab the string between two keywords:

error_reporting(E_ALL);
$html = file_get_contents('http://en.wikipedia.org/wiki/Los_Angeles');
// .*? is non-greedy; the s flag lets . match newlines, and i ignores case.
if (preg_match('/wgCategories.*?wgMonthNamesShort/is', $html, $matches)) {
    echo $matches[0];
}

You can also avoid regular expressions and use PHP string functions such as stristr.

The preg_match code above prints:

wgCategories":["All articles with dead external links","Articles with dead external links from March 2013","Articles with dead external links from March 2014","Pages with broken reference names","Articles with dead external links from January 2014","Articles with dead external links from September 2011","Articles with dead external links from October 2011","CS1 errors: dates","Use mdy dates from May 2014","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Coordinates on Wikidata","Articles including recorded pronunciations","Articles containing Spanish-language text","All articles with unsourced statements","Articles with unsourced statements from December 2013","Spoken articles","Articles with hAudio microformats","Los Angeles, California","Cities in Los Angeles County, California","Communities on U.S. Route 66","County seats in California","Incorporated cities and towns in California","Populated coastal places in California","Populated places established in 1781","Port cities and towns of the United States Pacific coast","Butterfield Overland Mail in California","Stockton - Los Angeles Road"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort
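
For the string-function route mentioned above, here is a minimal sketch. It uses stripos and substr rather than stristr, which is my own substitution, and the helper name grab_between is hypothetical (it is not a built-in PHP function). The short $html value is a stand-in for the real page source so the example runs offline. Like the regex, it is case-insensitive and keeps both markers in the result.

```php
<?php
// Hypothetical helper (not part of PHP or of the original answer): returns the
// substring that begins at $start and ends at the first $end found after it,
// both markers included, or null if either marker is missing.
function grab_between(string $haystack, string $start, string $end): ?string
{
    $from = stripos($haystack, $start);
    if ($from === false) {
        return null;
    }
    // Look for the end marker only after the start marker.
    $to = stripos($haystack, $end, $from + strlen($start));
    if ($to === false) {
        return null;
    }
    return substr($haystack, $from, $to + strlen($end) - $from);
}

// Assumed sample input standing in for file_get_contents(...) output.
$html = '<script>{"wgTitle":"Los Angeles","wgCategories":["Spoken articles"],"wgMonthNamesShort":["","Jan"]}</script>';

$string = grab_between($html, 'wgCategories', 'wgMonthNamesShort');
echo $string, PHP_EOL;
```

Returning null when a marker is absent lets the caller tell "not found" apart from an empty match, much like checking the return value of preg_match.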