I tried to get the title from this site (http://www.itslaw.com), which is loaded by JavaScript

Updated: 2023-09-26

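Here is my code. I am using Python to fetch the information, and I use a proxy, headers, and a session to simulate a browser, but I keep getting a 501.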

# -*- coding: utf-8 -*-
import requests
from pyquery import PyQuery as pq
from goose import Goose
from goose.text import StopWordsChinese
import json
import time

class ItSlaw(object):
    def __init__(self):
        self.url = 'XXXX'                
        self.headers = {'XXXX'}
        self.result = None
        self.keyword = None
        self.session = requests.Session()
    def reset(self, keyword):
        self.keyword = keyword
        self.result = None
    def fetch(self):
        # Pass the keyword value itself, not the literal string 'self.keyword'.
        url = self.url.format(keyword=self.keyword, keywordcopy=self.keyword)
        res = []
        time.sleep(3)  # pause between requests
        proxies = {"http": "14.111.148.1"}
        r = self.session.get(url, proxies=proxies)
        print(r.status_code)
        # 'url' is a literal placeholder string here, as in the original post.
        completed_url = 'http://www.itslaw.com/' + 'url'
        g = Goose({'stopwords_class': StopWordsChinese})
        article = g.extract(url=completed_url)
        content = article.cleaned_text
        res.append(content)  # append() requires the value to store
        self.result = res
        return self.result
    def get_result(self):
        return self.result
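
Separately from the 501, it can help to check whether the title is present in the raw HTML at all, since a plain HTTP client does not run JavaScript. The following is a minimal sketch using only the standard library; `TitleParser` and `extract_static_title` are hypothetical names introduced here for illustration, not part of the original post.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_static_title(html):
    """Return the <title> text of an HTML string, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

If this returns an empty string for the body of a successful response (for example `r.text`), the title is most likely filled in later by JavaScript, and a real browser is needed to see it.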

You can use Selenium:

  1. Install selenium for Python using pip. On Linux (Ubuntu/Debian), it looks like this:

    sudo apt-get install python-pip

    sudo pip install selenium

(!) You will have to look up how to do this for your own operating system.

  2. Then run this code:
import unittest
from selenium import webdriver


class GetTitle(unittest.TestCase):
    def setUp(self):
        # Start a Firefox session for each test.
        self.driver = webdriver.Firefox()

    def test_get_title(self):
        driver = self.driver
        driver.get("http://www.itslaw.com/")
        # print() with a single argument works on both Python 2 and 3.
        print("Title is: " + driver.title)

    def tearDown(self):
        # Close the browser window opened in setUp.
        self.driver.close()


if __name__ == "__main__":
    unittest.main()

Output:

    Title is: 无讼案例|无讼名片-打造中国最大的互联网律师名片、案例检索服务平台
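
Because the page is loaded by JavaScript, the title may still be empty at the moment `driver.get` returns. Below is a minimal sketch of polling for it, assuming only that the driver object exposes a `title` attribute, as Selenium's WebDriver does. `wait_for_title`, `fetch_title`, and the `timeout` and `poll` parameters are hypothetical names introduced here. Selenium also ships its own `WebDriverWait` helper for this kind of wait.

```python
import time


def wait_for_title(driver, timeout=10.0, poll=0.5):
    """Return driver.title once it is non-empty, polling until timeout."""
    deadline = time.time() + timeout
    while True:
        title = driver.title
        if title:
            return title
        if time.time() >= deadline:
            raise RuntimeError("title still empty after %.1f seconds" % timeout)
        time.sleep(poll)


def fetch_title(url):
    """Open url in Firefox and return its title (needs selenium installed)."""
    from selenium import webdriver  # imported lazily; only needed here

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return wait_for_title(driver)
    finally:
        # Always shut the browser down, even if the wait fails.
        driver.quit()
```

Calling `fetch_title("http://www.itslaw.com/")` would then return the title only after it has been set, or raise `RuntimeError` if it stays empty for the whole timeout.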