Follow each link of a page and scrape content, Scrapy + Selenium

This is the website I am working on. On each page there are 18 posts in a table. I want to visit each post, scrape its content, and repeat this for the first 5 pages.

My approach is to have my spider collect all the post links on the 5 pages and then iterate over them to scrape the content. Since the "next page" button and some of the text in each post are rendered by JavaScript, I am using Selenium together with Scrapy. When I run my spider I can see the Firefox webdriver go through the first 5 pages, but then the spider stops without scraping any content. Scrapy returns no error messages either.

Now I suspect the failure may be due to:

1) No links are stored in all_links.

2) Somehow parse_content never runs.

My diagnosis may be wrong and I need help finding the problem. Thanks!

Here is my spider:

import scrapy
from bjdaxing.items_bjdaxing import BjdaxingItem
from selenium import webdriver
from scrapy.http import TextResponse 
import time
all_links = [] # a global variable to store post links

class Bjdaxing(scrapy.Spider):
    name = "daxing"
    allowed_domains = ["bjdx.gov.cn"] # DO NOT use www in allowed domains
    start_urls = ["http://app.bjdx.gov.cn/cms/daxing/lookliuyan_bjdx.jsp"] # This has to start with http
    def __init__(self):
        self.driver = webdriver.Firefox()
    def parse(self, response):
        self.driver.get(response.url) # request the start url in the browser         
        i = 1
        while i <= 5: # The number of pages to be scraped in this session
            response = TextResponse(url = response.url, body = self.driver.page_source, encoding='utf-8') # Assign page source to response. I can treat response as if it's a normal scrapy project.           
            global all_links
            all_links.extend(response.xpath("//a/@href").extract()[0:18])
            next = self.driver.find_element_by_xpath(u'//a[text()="\u4e0b\u9875\xa0"]') # locate "next" button
            next.click() # Click next page            
            time.sleep(2) # Wait a few seconds for next page to load. 
            i += 1

    def parse_content(self, response):
        item = BjdaxingItem()
        global all_links
        for link in all_links: 
            self.driver.get("http://app.bjdx.gov.cn/cms/daxing/") + link
            response = TextResponse(url = response.url, body = self.driver.page_source, encoding = 'utf-8')
            if len(response.xpath("//table/tbody/tr[1]/td[2]/text()").extract() > 0):
                item['title'] =     response.xpath("//table/tbody/tr[1]/td[2]/text()").extract()
            else: 
                item['title'] = ""    
            if len(response.xpath("//table/tbody/tr[3]/td[2]/text()").extract() > 0):
                item['netizen'] =    response.xpath("//table/tbody/tr[3]/td[2]/text()").extract()
            else: 
                item['netizen'] = ""    
            if len(response.xpath("//table/tbody/tr[3]/td[4]/text()").extract() > 0):
                item['sex'] = response.xpath("//table/tbody/tr[3]/td[4]/text()").extract()
            else: 
                item['sex'] = ""   
            if len(response.xpath("//table/tbody/tr[5]/td[2]/text()").extract() > 0):
                item['time1'] = response.xpath("//table/tbody/tr[5]/td[2]/text()").extract()
            else: 
                item['time1'] = ""
            if len(response.xpath("//table/tbody/tr[11]/td[2]/text()").extract() > 0):
                item['time2'] =   response.xpath("//table/tbody/tr[11]/td[2]/text()").extract()
            else: 
                item['time2'] = "" 
            if len(response.xpath("//table/tbody/tr[7]/td[2]/text()").extract()) > 0:
                question = "".join(response.xpath("//table/tbody/tr[7]/td[2]/text()").extract())
                item['question'] = "".join(map(unicode.strip, question))
            else: item['question'] = ""  
            if len(response.xpath("//table/tbody/tr[9]/td[2]/text()").extract()) > 0:
                reply = "".join(response.xpath("//table/tbody/tr[9]/td[2]/text()").extract()) 
                item['reply'] = "".join(map(unicode.strip, reply))
            else: item['reply'] = ""    
            if len(response.xpath("//table/tbody/tr[13]/td[2]/text()").extract()) > 0:
                agency = "".join(response.xpath("//table/tbody/tr[13]/td[2]/text()").extract())
                item['agency'] = "".join(map(unicode.strip, agency))
            else: item['agency'] = ""    
            yield item 
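
A quick way to check both suspicions would be to log how many links were collected at the end of parse() and to log whenever parse_content() is entered (a hypothetical debugging sketch, not part of the spider above):

class Bjdaxing(scrapy.Spider):
    # ... name, allowed_domains, start_urls and __init__ unchanged ...

    def parse(self, response):
        # ... existing while-loop that fills all_links ...
        self.logger.info("collected %d links", len(all_links))  # suspicion 1: is the list empty?

    def parse_content(self, response):
        self.logger.info("parse_content entered for %s", response.url)  # suspicion 2: does it ever run?
        # ... existing extraction code ...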

There are several problems and possible improvements here:

  • there is no "link" between the parse() and parse_content() methods, so parse_content() is never called (see the short illustration after this list)
  • using a global variable is generally bad practice
  • you don't need selenium here at all; to follow the pagination you only need to make POST requests to the same url, passing the currPage parameter
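
Scrapy only runs a callback like parse_content() if some request is scheduled with it as its callback, along these lines (a generic illustration with hypothetical names, not code from the spider above):

import scrapy

class ExampleSpider(scrapy.Spider):  # hypothetical spider, for illustration only
    name = "example"
    start_urls = ["http://example.com/list"]

    def parse(self, response):
        # every Request yielded here names parse_content as its callback,
        # which is what makes Scrapy call it later
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_content)

    def parse_content(self, response):
        yield {"url": response.url}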

The idea is to use start_requests() and create a list/queue of requests to handle the pagination: follow the pagination and collect the links from the table, and once the request queue is empty, switch to following the previously gathered links. Implementation:

import json
from urlparse import urljoin
import scrapy

NUM_PAGES = 5
class Bjdaxing(scrapy.Spider):
    name = "daxing"
    allowed_domains = ["bjdx.gov.cn"] # DO NOT use www in allowed domains
    def __init__(self):
        self.pages = []
        self.links = []
    def start_requests(self):
        self.pages = [scrapy.Request("http://app.bjdx.gov.cn/cms/daxing/lookliuyan_bjdx.jsp",
                                     body=json.dumps({"currPage": str(page)}),
                                     method="POST",
                                     callback=self.parse_page,
                                     dont_filter=True)
                      for page in range(1, NUM_PAGES + 1)]
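        # kick off the crawl with one request from the queue; the remaining
        # pagination requests are yielded one by one from parse_page()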
        yield self.pages.pop()
    def parse_page(self, response):
        base_url = response.url
        self.links += [urljoin(base_url, link) for link in response.css("table tr td a::attr(href)").extract()]
        try:
            yield self.pages.pop()
        except IndexError:  # no more pages to follow, going over the gathered links
            for link in self.links:
                yield scrapy.Request(link, callback=self.parse_content)
    def parse_content(self, response):
        # your parse_content method here
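        # For example, a minimal sketch reusing the XPaths from the original spider
        # (this assumes the post pages keep the same table layout, and it needs
        # "from bjdaxing.items_bjdaxing import BjdaxingItem" at the top):
        item = BjdaxingItem()
        item['title'] = response.xpath("//table/tbody/tr[1]/td[2]/text()").extract_first(default="")
        item['netizen'] = response.xpath("//table/tbody/tr[3]/td[2]/text()").extract_first(default="")
        item['sex'] = response.xpath("//table/tbody/tr[3]/td[4]/text()").extract_first(default="")
        item['time1'] = response.xpath("//table/tbody/tr[5]/td[2]/text()").extract_first(default="")
        item['time2'] = response.xpath("//table/tbody/tr[11]/td[2]/text()").extract_first(default="")
        item['question'] = "".join(t.strip() for t in response.xpath("//table/tbody/tr[7]/td[2]/text()").extract())
        item['reply'] = "".join(t.strip() for t in response.xpath("//table/tbody/tr[9]/td[2]/text()").extract())
        item['agency'] = "".join(t.strip() for t in response.xpath("//table/tbody/tr[13]/td[2]/text()").extract())
        yield item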