Crawler - 1 (Logging In to a Website with Selenium)

Background

A research project that I am working on with colleagues in the Department of Psychology and Cognitive Science, Peking University, needs a large number of facial pictures from well-known social media sites. The project starts with gathering these pictures so that we can run behavioral experiments. This article records what I learned while crawling the pictures with Python.

Phase 1

At first I tried the following code to crawl the site's content:

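import requests as rq
from bs4 import BeautifulSoup

url = "https://www.linkedin.com/mynetwork/"
web_head = rq.get(url)
# Print the raw HTML that the server actually returned
print(web_head.text)
soup = BeautifulSoup(web_head.text, 'lxml')

# BeautifulSoup filters on the id attribute with `id=` (only `class_` takes an underscore)
for person in soup.find_all('ul', id="artdeco-toasts__wormhole"):
    print(person)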

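I then tried to find the ul with that id, but the loop printed nothing. So I checked the output of print(web_head.text) and found that it was not the same HTML I saw in the browser. When I switched the URL to the Bilibili homepage, the same code worked reliably, so I suspected the URL was the problem.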

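As you have probably noticed, the URL I used is /mynetwork/, which is absurd. I knew it would not work either: no URL like that stays logged in on its own, and a plain request arrives without any login.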
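One way to confirm this kind of problem is to compare the URL that was requested with the URL the response finally came from. The sketch below is only an illustration: the helper names and the list of login-like path fragments are assumptions of mine, not anything published by a particular site.

```python
from urllib.parse import urlparse

# Path fragments that often indicate a login wall. These are illustrative
# guesses, not an official list from any site.
LOGIN_HINTS = ("login", "signin", "sign-in", "authwall", "checkpoint")


def landed_on_login_page(requested_url, final_url):
    """Return True if the final URL differs from the requested one
    and its path looks like a login page."""
    requested = urlparse(requested_url)
    final = urlparse(final_url)
    if (requested.netloc, requested.path) == (final.netloc, final.path):
        return False
    path = final.path.lower()
    return any(hint in path for hint in LOGIN_HINTS)


def describe_response(response):
    """Summarise a requests.Response: status code, redirect count, final URL."""
    return "status={} redirects={} final_url={}".format(
        response.status_code, len(response.history), response.url
    )
```

With the code from Phase 1, this could be used as print(describe_response(web_head)) followed by landed_on_login_page(url, web_head.url). A True result would suggest the server sent the request to a login page instead of the page that was asked for.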

Phase 2

So the next question is how to log in to the site. After some searching, it looks like the Session() class in requests might (perhaps?) be able to do it.

Either way, let's write it (copy it) and see.

Well, it failed right from the start.

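Other people's tutorials say that you have to simulate a browser visit yourself, then open the browser's developer tools (F12), find the cookie you need under the Network tab, and note it down. I did not understand any of it, but I was deeply shocked.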
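For reference, the idea those tutorials describe can be sketched roughly as follows. This is a sketch under assumptions, not the code I actually ran: parse_cookie_header and build_session are hypothetical helper names, and the cookie string in the usage note is a placeholder instead of a real cookie from any site.

```python
def parse_cookie_header(raw):
    """Turn 'a=1; b=2' (as copied from the Network tab) into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def build_session(raw_cookie_header, user_agent):
    """Create a requests.Session that sends the given cookies and User-Agent."""
    # Third-party import kept local so the parser above works without requests
    import requests

    session = requests.Session()
    # Present the same User-Agent string that the real browser sends
    session.headers.update({"User-Agent": user_agent})
    # Attach the cookies copied from the browser to every later request
    session.cookies.update(parse_cookie_header(raw_cookie_header))
    return session
```

A call would look like session = build_session("session_id=abc123; theme=dark", "Mozilla/5.0 ..."), followed by session.get(url). Whether a site accepts cookies reused this way depends on the site, so treat this as something to experiment with.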

Phase 3

No matter. More searching turned up another approach: Python's selenium library.

The related link is here: Selenium tutorial (WeChat official account).

An important module of the selenium library is selenium.webdriver, which can create driver instances for different browsers (e.g. Chrome, Firefox, Safari).
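As a small illustration of "one driver class per browser", the sketch below picks the class by name. SUPPORTED_BROWSERS and make_driver are hypothetical names of mine. webdriver.Chrome, webdriver.Firefox and webdriver.Safari are the selenium classes, and each one still needs the matching browser and driver to be available on the machine.

```python
# Maps a lowercase browser name to the class name inside selenium.webdriver
SUPPORTED_BROWSERS = {"chrome": "Chrome", "firefox": "Firefox", "safari": "Safari"}


def make_driver(browser_name):
    """Create a webdriver instance for the named browser."""
    key = browser_name.strip().lower()
    if key not in SUPPORTED_BROWSERS:
        raise ValueError("unsupported browser: {!r}".format(browser_name))
    # Third-party import kept local so the name check works without selenium
    from selenium import webdriver

    driver_class = getattr(webdriver, SUPPORTED_BROWSERS[key])
    # Calling the class launches the browser
    return driver_class()
```

Calling make_driver("chrome") is then equivalent to webdriver.Chrome().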

For this, we first need to install ChromeDriver.

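ChromeDriver downloads for different operating systems are available at https://sites.google.com/chromium.org/driver/home. Make sure the driver version matches your current Chrome version; if you update Chrome, you need to update the driver as well.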

To check your Chrome version, go to Settings > About Chrome.
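The version check can also be done in code. The sketch below assumes that "matching" means the same major version (the first number), which is a simplification of mine. The function names are hypothetical. installed_driver_version relies on the chromedriver --version command, so it only works once the driver is installed.

```python
import re
import subprocess


def major_version(version_text):
    """Extract the major version from text such as
    'ChromeDriver 114.0.5735.90 (...)' or '114.0.5735.198'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError("no version number found in: {!r}".format(version_text))
    return int(match.group(1))


def versions_match(chrome_version, driver_version):
    """Return True if Chrome and the driver share a major version (assumed rule)."""
    return major_version(chrome_version) == major_version(driver_version)


def installed_driver_version(executable="chromedriver"):
    """Ask the driver for its version string via `chromedriver --version`."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

For example, versions_match("114.0.5735.198", installed_driver_version()) compares the version shown in About Chrome with the installed driver.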

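After downloading, unzip the archive to get an executable named chromedriver (chromedriver.exe on Windows). Now we need to make it reachable through the shell's PATH environment variable.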

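First place the file in /usr/local/bin, then run the following on the command line (for zsh). PATH holds directories, not individual files, so the entry to add is /usr/local/bin itself:

$ echo 'export PATH="$PATH:/usr/local/bin"' >> ~/.zshrc
$ source ~/.zshrc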

With that, ChromeDriver is installed.
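A quick way to confirm the installation from Python is to ask whether the shell can find the executable. This is a minimal sketch using only the standard library. Both function names are hypothetical. The second helper looks for a common mistake, a PATH entry that points at a file instead of a directory.

```python
import os
import shutil


def find_executable(name):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


def path_entries_that_are_files():
    """List PATH entries that point at a file instead of a directory."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    return [entry for entry in entries if entry and os.path.isfile(entry)]
```

If find_executable("chromedriver") returns None, the driver is not on PATH yet. If path_entries_that_are_files() returns anything, those entries probably need to be replaced by the directory that contains the file.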

Now we can write the following code in Python (comments are in the code):

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait as Wait
from selenium.webdriver.common.by import By

url = "https://www.linkedin.com/checkpoint/rm/sign-in-another-account"

# Create a webdriver object backed by ChromeDriver
browser = webdriver.Chrome()
# Open the login page
browser.get(url)

# Create a WebDriverWait; it takes two arguments: the driver and a timeout in seconds
WAIT = Wait(browser, 7)

# Find the "input" elements on the page by NAME / ID etc.
username = WAIT.until(EC.presence_of_element_located((By.NAME, "session_key")))
password = WAIT.until(EC.presence_of_element_located((By.NAME, "session_password")))

# Type in the credentials
username.send_keys('Your Username')
password.send_keys('Your Password')

# Locate the login "button" of the webpage
# Use a CSS selector or XPath to locate the login button
submit = WAIT.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[class='The class name of the button']")))

# Click the button you located
submit.click()

# At this point we are logged in and can start collecting the information

With the approach above, we have logged in to a web page from Python with a username and password.
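Once the browser is logged in, one possible next step is to hand its cookies over to requests, so that later pages can be fetched without driving the browser. I have not verified this against any particular site, so treat it as a sketch. The helper names are hypothetical. browser.get_cookies() is the selenium call that returns a list of cookie dicts with "name" and "value" keys.

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Convert the list of cookie dicts returned by browser.get_cookies()
    into a plain {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def session_from_browser(browser):
    """Build a requests.Session that carries the browser's current cookies."""
    # Third-party import kept local so the converter above works without requests
    import requests

    session = requests.Session()
    session.cookies.update(selenium_cookies_to_dict(browser.get_cookies()))
    return session
```

After submit.click() has completed and the next page has loaded, session = session_from_browser(browser) gives a session to try with session.get(...). Calling browser.quit() closes the browser when it is no longer needed.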

To be continued