How to scrape a website with infinite scrolling?
I want to scrape this website. I have written a spider, but it only crawls the first page, i.e. the first 52 items.
I have tried this code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
a = []
from aqaq.items import aqaqItem
import os
import urlparse
import ast

class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/",
    ]

    def parse(self, response):
        # ... Extract items in the page using extractors
        n = 3
        ct = 1
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="page"]')
        for site in sites:
            name = site.select('//div[@id="content"]/div[@class="l-pageWrapper"]/div[@class="l-main"]/div[@class="box box-bgcolor"]/section[@class="box-bd pan mtm"]/ul[@id="productsCatalog"]/li/a/@href').extract()
            print name
            print ct
            ct = ct + 1
            a.append(name)
        req = Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=" + str(n),
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse, dont_filter=True)
        return req  # and your items
It shows the following output:
2013-10-31 09:22:42-0500 [jabong] DEBUG: Crawled (200) <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> (referer: http://www.jabong.com/women/clothing/womens-tops/)
2013-10-31 09:22:42-0500 [jabong] DEBUG: Filtered duplicate request: <GET http://www.jabong.com/women/clothing/womens-tops/?page=3> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2013-10-31 09:22:42-0500 [jabong] INFO: Closing spider (finished)
2013-10-31 09:22:42-0500 [jabong] INFO: Dumping Scrapy stats:
When I put dont_filter=True, it never stops.
Yes, dont_filter has to be used here, since only the page GET parameter changes in the XHR request each time you scroll the page down to the bottom: http://www.jabong.com/women/clothing/womens-tops/?page=X.
Now you need to figure out how to stop crawling. This is actually simple: just check when there are no products on the next page and raise a CloseSpider exception.
Here's a complete code example that works for me (it stops at page 234):
import scrapy
from scrapy.exceptions import CloseSpider
from scrapy.spider import BaseSpider
from scrapy.http import Request

class Product(scrapy.Item):
    brand = scrapy.Field()
    title = scrapy.Field()

class aqaqspider(BaseSpider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = [
        "http://www.jabong.com/women/clothing/womens-tops/?page=1",
    ]
    page = 1

    def parse(self, response):
        products = response.xpath("//li[@data-url]")
        if not products:
            raise CloseSpider("No more products!")

        for product in products:
            item = Product()
            item['brand'] = product.xpath(".//span[contains(@class, 'qa-brandName')]/text()").extract()[0].strip()
            item['title'] = product.xpath(".//span[contains(@class, 'qa-brandTitle')]/text()").extract()[0].strip()
            yield item

        self.page += 1
        yield Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=%d" % self.page,
                      headers={"Referer": "http://www.jabong.com/women/clothing/womens-tops/",
                               "X-Requested-With": "XMLHttpRequest"},
                      callback=self.parse,
                      dont_filter=True)
You can try this code, which is slightly different from alecxe's code:
if there are no products, simply return from the function, which eventually causes the spider to close. A simple solution.
import scrapy
from scrapy.spider import Spider
from scrapy.http import Request

class aqaqItem(scrapy.Item):
    brand = scrapy.Field()
    title = scrapy.Field()

class aqaqspider(Spider):
    name = "jabong"
    allowed_domains = ["jabong.com"]
    start_urls = ["http://www.jabong.com/women/clothing/womens-tops/?page=1"]
    page_index = 1

    def parse(self, response):
        products = response.xpath("//li[@data-url]")
        if products:
            for product in products:
                brand = product.xpath(
                    ".//span[contains(@class, 'qa-brandName')]/text()").extract()
                brand = brand[0].strip() if brand else 'N/A'
                title = product.xpath(
                    ".//span[contains(@class, 'qa-brandTitle')]/text()").extract()
                title = title[0].strip() if title else 'N/A'
                item = aqaqItem()
                item['brand'] = brand
                item['title'] = title
                yield item
        # if no products are available, simply return, i.e. exit from
        # parse, which ultimately stops the spider
        else:
            return

        self.page_index += 1
        yield Request(url="http://www.jabong.com/women/clothing/womens-tops/?page=%s" % self.page_index,
                      callback=self.parse)
Although the spider yields more than 12.5K products, it contains a lot of duplicate entries; I have made an ITEM_PIPELINE that removes the duplicates and inserts them into MongoDB.
Pipeline code below:
from pymongo import MongoClient

class JabongPipeline(object):
    def __init__(self):
        self.db = MongoClient().jabong.product

    def isunique(self, data):
        return self.db.find(data).count() == 0

    def process_item(self, item, spider):
        if self.isunique(dict(item)):
            self.db.insert(dict(item))
        return item
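For the pipeline above to take effect it also has to be registered in the project's settings. A minimal sketch, assuming the project module is called jabong and the pipeline class lives in jabong/pipelines.py (both names are assumptions, adjust to your project layout):

```python
# settings.py -- register the dedup pipeline (the module path is an assumption)
ITEM_PIPELINES = {
    'jabong.pipelines.JabongPipeline': 300,  # lower number = runs earlier in the chain
}
```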
And attaching the crawl log stats here:
2015-04-19 10:00:58+0530 [jabong] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 426231,
'downloader/request_count': 474,
'downloader/request_method_count/GET': 474,
'downloader/response_bytes': 3954822,
'downloader/response_count': 474,
'downloader/response_status_count/200': 235,
'downloader/response_status_count/301': 237,
'downloader/response_status_count/302': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 4, 19, 4, 30, 58, 710487),
'item_scraped_count': 12100,
'log_count/DEBUG': 12576,
'log_count/INFO': 11,
'request_depth_max': 234,
'response_received_count': 235,
'scheduler/dequeued': 474,
'scheduler/dequeued/memory': 474,
'scheduler/enqueued': 474,
'scheduler/enqueued/memory': 474,
'start_time': datetime.datetime(2015, 4, 19, 4, 26, 17, 867079)}
2015-04-19 10:00:58+0530 [jabong] INFO: Spider closed (finished)
If you open the developer console on that page, you'll see that the page content is returned by a web request:
http://www.jabong.com/home-living/furniture/new-products/?page=1
This returns an HTML document containing all the items. So I would just increment the page value and parse it until the returned HTML is equal to the previously returned HTML.
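That stopping rule can be sketched as a small loop that keeps fetching ?page=N and stops as soon as a page's HTML repeats the previous one. A minimal sketch, where fetch is a stand-in for whatever HTTP call you use (the function names and structure are assumptions, not a Jabong-specific API):

```python
def iter_pages(fetch):
    """Yield the HTML of ?page=1, ?page=2, ... and stop as soon as
    a page's HTML equals the previously returned HTML."""
    previous = None
    page = 1
    while True:
        html = fetch("http://www.jabong.com/home-living/furniture/new-products/?page=%d" % page)
        if html == previous:  # same content as the last page -> no more items
            break
        yield html
        previous = html
        page += 1

# usage with a fake fetcher that serves three distinct pages, then repeats the last one:
fake = {1: "<p>a</p>", 2: "<p>b</p>", 3: "<p>c</p>"}
pages = list(iter_pages(lambda url: fake.get(int(url.rsplit("=", 1)[1]), "<p>c</p>")))
```

One caveat with comparing raw HTML: if the page embeds anything that changes per request (timestamps, CSRF tokens), the comparison never matches, so comparing only the extracted product list is more robust.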
Using dont_filter and issuing a new request every time will indeed run forever, unless some error response comes back.
Do the infinite scrolling in a browser and see how the page responds when there are no more pages. Then, in the spider, handle that case by not issuing a new request.
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, 'http://www.jabong.com/women/clothing/womens-tops/?page=3');
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0');
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
$htmldata = curl_exec($curl_handle);
curl_close($curl_handle);
It worked for me. Just request the page via PHP cURL.
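The same XHR-style request can be made in Python 3 with only the standard library; a rough equivalent of the cURL call above (the User-Agent string is just an example, and the actual network call is left commented out):

```python
import urllib.request

url = "http://www.jabong.com/women/clothing/womens-tops/?page=3"
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0",
    # marks the request as AJAX, like the curl version above
    "X-Requested-With": "XMLHttpRequest",
})
# htmldata = urllib.request.urlopen(req).read()  # the actual network call
```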
- How to scrape links and images from a website
- HTML-scraping a website that loads wrong with JSoup in Java
- How to get links to all pages of a website for data scraping
- How do I scrape images from an infinite-scrolling website where the API is hidden and I can't get it by using Inspect Element -> Network
- PHP: how to scrape website content based on JavaScript
- Scraping website HTML and JavaScript in a way similar to Googlebot
- How to scrape the HTML of a website that uses Direct Web Remoting (DWR) to return JavaScript that manipulates the page
- Scraping a JavaScript-heavy website with Ruby
- Trying to scrape dynamic data generated by the Google Maps API on a website, but normal scraping returns blanks
- JS slider website - Google scraping
- How to scrape a website with infinite scrolling
- Scraping a website. Can't automate a user click during scraping
- Scraping a website fails because JavaScript is not enabled
- How to scrape website content (*COMPLEX* iframe, JavaScript submit)
- Problems scraping a website with zombie.js
- Python scraping a website gets an Apache Tomcat/6.0.36 error report
- Scraping every page of a website with Google Apps Script
- How to scrape a website with phantomjs
- Scraping a website and inserting the tables into my own HTML document
- How to scrape a website with ASP and AJAX using node.js