Using Selenium, Scrapy, and Python to retrieve public Facebook wall posts from a user profile
I am trying to retrieve wall posts from my public profile. I need to check that a message arrives on my wall and is delivered within a given timestamp window; essentially, I am writing a monitoring check to verify delivery for a messaging system. I am getting "No connection could be made because the target machine actively refused it." Not sure why?
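"Actively refused" usually means nothing is listening on the port the Selenium RC client connects to (localhost:4444 in the code below), typically because the Selenium server jar was never started. A minimal stdlib sketch to confirm that before launching the spider (the host and port match the snippet below; the jar name is the usual standalone server):

```python
import socket

# Probe the port the Selenium RC client in the spider below connects to.
# connect_ex returns 0 on success and a non-zero errno (e.g. ECONNREFUSED)
# instead of raising when nothing is listening.
def selenium_server_up(host="localhost", port=4444):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(2)
    try:
        return s.connect_ex((host, port)) == 0
    finally:
        s.close()

if not selenium_server_up():
    print("Nothing listening on localhost:4444 -- start the Selenium "
          "server (java -jar selenium-server.jar) before running the spider")
```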
#!/usr/bin/env python
# Many times when crawling we run into problems where content rendered on the
# page is generated with Javascript, and therefore Scrapy is unable to crawl
# it (e.g. ajax requests, jQuery craziness). However, if you use Scrapy along
# with the web testing framework Selenium, then you can crawl anything
# displayed in a normal web browser.
#
# Some things to note:
# You must have the Python version of Selenium RC installed for this to work,
# and you must have set up Selenium properly. Also, this is just a template
# crawler; you could get much crazier and more advanced, but I just wanted to
# show the basic idea. As the code stands now, you will be making two requests
# for any given URL: one request is made by Scrapy and the other by Selenium.
# There are ways around this so that Selenium makes the one and only request,
# but I did not bother to implement that, and by making two requests you get
# to crawl the page with Scrapy too.
#
# This is quite powerful, because now you have the entire rendered DOM
# available to crawl while still using all the nice crawling features in
# Scrapy. This makes for slower crawling, of course, but depending on how much
# you need the rendered DOM, it might be worth the wait.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item
import time
from selenium import selenium

class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["https://www.facebook.com/chronotrackmsgcheck"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\.html', )), callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        # The Selenium RC server must already be listening on localhost:4444,
        # otherwise this connection is actively refused.
        self.selenium = selenium("localhost", 4444, "*chrome", "https://www.facebook.com/chronotrackmsgcheck")
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors

    def parse_page(self, response):
        item = Item()
        hxs = HtmlXPathSelector(response)
        # Do some XPath selection with Scrapy
        hxs.select('//div').extract()
        sel = self.selenium
        sel.open(response.url)
        # Wait for Javascript to load in Selenium
        time.sleep(2.5)
        # Do some crawling of Javascript-created content with Selenium
        sel.get_text("//div")
        yield item

SeleniumSpider()
Here is the answer. This uses Selenium to load the user profile, and then parses only what the page renders as text. If you want to use this, you will have to write your own data-mining logic on top of it, but it works for my purposes.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/profileusername")
inputEmail = driver.find_element_by_id("email")
inputEmail.send_keys("facebookemail")
inputPass = driver.find_element_by_id("pass")
inputPass.send_keys("facebookpassword")
inputPass.submit()
page_text = (driver.page_source).encode('utf-8')
soup = BeautifulSoup(page_text, 'html.parser')
# If you use your name exactly as it is displayed on Facebook, this splits on
# every post it sees, because your name appears in every post.
parse_data = soup.get_text().encode('utf-8').split('Grant Zukel')
latest_message = parse_data[3]
driver.close()
print latest_message
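The split-on-name trick above can be illustrated without a browser. A minimal sketch on canned text (the name and sample posts are made-up stand-ins; which index holds the post you want depends on the page layout, which is why the scripts here use [3] and [4]):

```python
# Canned stand-in for soup.get_text(): the profile name precedes every post.
page_text = (
    "Timeline Grant Zukel Monitoring check delivered at 12:01 "
    "Grant Zukel An older wall post "
    "Grant Zukel The oldest wall post"
)

# Splitting on the name yields one chunk per post; chunk 0 is whatever
# page chrome appears before the first occurrence of the name.
posts = page_text.split("Grant Zukel")
latest = posts[1].strip()
print(latest)  # Monitoring check delivered at 12:01
```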
This is how I get the user's latest post:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/fbusername")
inputEmail = driver.find_element_by_id("email")
inputEmail.send_keys("fbemail")
inputPass = driver.find_element_by_id("pass")
inputPass.send_keys("fbpass")
inputPass.submit()
page_text = (driver.page_source).encode('utf-8')
soup = BeautifulSoup(page_text, 'html.parser')
parse_data = soup.get_text().encode('utf-8').split('Grant Zukel')
# The timestamp and the message body within a post are separated by '·'.
latest_message = parse_data[4].split('·')
driver.close()
time = latest_message[0]
message = latest_message[1]
print time, message
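The second script additionally splits the post chunk on the '·' separator that sits between the timestamp and the message body. The same idea on a canned chunk (the text is an illustrative stand-in):

```python
# -*- coding: utf-8 -*-
# A post chunk as produced by splitting the page text on the profile name:
# timestamp, then the '·' separator, then the message body.
chunk = " Yesterday at 4:21pm · Monitoring check message delivered "

# Split on the first '·' only, in case the body itself contains one.
timestamp, message = [part.strip() for part in chunk.split("·", 1)]
print(timestamp)  # Yesterday at 4:21pm
print(message)    # Monitoring check message delivered
```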