How to Grab a String from This to That from a Webpage's Source?
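How do I grab a string from a webpage's source? I've looked all over PHP.net and can't figure out whether PHP has a function, or a set of functions, that can grab a string from this to that.
For example, this is what I have so far (I want to grab everything from "wgCategories" to "wgMonthNamesShort" in the webpage stored in $html):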
<?php
error_reporting(E_ALL);
$html = file_get_contents('http://en.wikipedia.org/wiki/Los_Angeles');
$string = <>;
?>
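First, I load the webpage's source into the $html variable. Now I need a function, or a set of functions, that can grab everything from "wgCategories" to "wgMonthNamesShort" and store it in $string.
Expected result: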
$string = "wgCategories":["All articles with dead external links","Articles with dead external links from March 2013","Articles with dead external links from March 2014","Pages with broken reference names","Articles with dead external links from January 2014","Articles with dead external links from September 2011","Articles with dead external links from October 2011","CS1 errors: dates","Use mdy dates from May 2014","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Coordinates on Wikidata","Articles including recorded pronunciations","Articles containing Spanish-language text","All articles with unsourced statements","Articles with unsourced statements from December 2013","Spoken articles","Articles with hAudio microformats","Los Angeles, California","Cities in Los Angeles County, California","Communities on U.S. Route 66","County seats in California","Incorporated cities and towns in California","Populated coastal places in California","Populated places established in 1781","Port cities and towns of the United States Pacific coast","Butterfield Overland Mail in California","Stockton - Los Angeles Road"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort";
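Finally, note that everything from "wgCategories" to "wgMonthNamesShort" sits between <script> tags (I'm not sure whether that matters, but I was told it is worth mentioning).
Let me know if anything needs clarifying.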
You can use preg_match with the s flag (DOTALL) to grab the string between two keywords:
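error_reporting(E_ALL);
$html = file_get_contents('http://en.wikipedia.org/wiki/Los_Angeles');
// "s" (DOTALL) lets "." match newlines; "i" makes the match case-insensitive;
// ".*?" is lazy, so matching stops at the first "wgMonthNamesShort".
if (preg_match('/wgCategories.*?wgMonthNamesShort/is', $html, $matches)) {
    echo $matches[0];
}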
You can also avoid regular expressions and use PHP string functions such as stristr.
The preg_match code above prints:
wgCategories":["All articles with dead external links","Articles with dead external links from March 2013","Articles with dead external links from March 2014","Pages with broken reference names","Articles with dead external links from January 2014","Articles with dead external links from September 2011","Articles with dead external links from October 2011","CS1 errors: dates","Use mdy dates from May 2014","Wikipedia indefinitely semi-protected pages","Wikipedia indefinitely move-protected pages","Coordinates on Wikidata","Articles including recorded pronunciations","Articles containing Spanish-language text","All articles with unsourced statements","Articles with unsourced statements from December 2013","Spoken articles","Articles with hAudio microformats","Los Angeles, California","Cities in Los Angeles County, California","Communities on U.S. Route 66","County seats in California","Incorporated cities and towns in California","Populated coastal places in California","Populated places established in 1781","Port cities and towns of the United States Pacific coast","Butterfield Overland Mail in California","Stockton - Los Angeles Road"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort
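For the string-function route mentioned above, here is a minimal sketch. It uses stripos and substr rather than stristr, since those make it easy to cut at the end marker as well. The helper name grab_between and the small inline $html sample are assumptions made for illustration and are not part of the original answer; the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not from the original answer): returns the part of
// $haystack that runs from the first $start through the next $end, inclusive,
// or null when either marker is missing.
function grab_between(string $haystack, string $start, string $end): ?string
{
    // stripos() is case-insensitive, mirroring the "i" flag in the regex.
    $from = stripos($haystack, $start);
    if ($from === false) {
        return null;
    }

    // Look for the end marker only after the start marker.
    $to = stripos($haystack, $end, $from + strlen($start));
    if ($to === false) {
        return null;
    }

    // Include the end marker itself, as the regex match does.
    return substr($haystack, $from, $to + strlen($end) - $from);
}

// Small inline sample standing in for the downloaded page source.
$html = "<script>{\"wgTitle\":\"Los Angeles\",\n\"wgCategories\":[\"Spoken articles\"],\n\"wgMonthNamesShort\":[\"\",\"Jan\"]}</script>";

$string = grab_between($html, 'wgCategories', 'wgMonthNamesShort');
echo $string, PHP_EOL;
```

With the real page, $html would come from file_get_contents as in the question. Checking the result against null before using it covers the case where the page layout changes and a marker disappears.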