文章目录
  1. 1. 分析
  2. 2. 实现
    1. 2.1. 添加依赖
    2. 2.2. 下载 driver
    3. 2.3. 安装 Chrome 浏览器
    4. 2.4. 测试
    5. 2.5. 实现爬虫

前些天写了《初识网络爬虫》,一般的网络爬虫是通过获取源码,并从源码中获取数据。问题来了,现在 ajax 技术应用也很普遍,针对如此情况,我们一般的爬虫无法获取到需要的数据。本文主要实现抓取动态页面的数据。

动态加载页面数据有两种方法可以选择:

  1. 模拟页面中的请求,直接获取接口返回的数据
  2. 内建浏览器渲染页面,然后获取渲染后的数据

分析

在页面中通过拼凑参数等方法来模拟网络请求,最终获取接口数据,这种方法是可以行的通的,问题是比较麻烦。本文主要通过内建浏览器渲染这种简单粗暴的方法来实现数据的抓取。

问题来了,如何内建浏览器呢?

熟悉自动化测试同学应该都知道 Selenium,这个模拟浏览器进行自动化测试的工具。Selenium 提供一组 API 可以与真实的浏览器内核交互。Selenium 是跨语言的,有 Java、C#、python 等版本,并且支持多种浏览器,chrome、firefox 以及 IE 都支持。

实现

我们用 Java 来写 Demo。

添加依赖

添加 Selenium 依赖,以 Maven 为例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.0.1</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-chrome-driver</artifactId>
<version>3.0.1</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-server</artifactId>
<version>2.18.0</version>
</dependency>

下载 driver

以 chrome 为例:https://sites.google.com/a/chromium.org/chromedriver/

下载后,最好添加环境变量。当然,也可以在调用前设置环境:

1
2
System.getProperties().setProperty("webdriver.chrome.driver",
"/Users/zhenguo/Documents/chrome/chromedriver");

注意:MAC 环境下需要确认 chromedriver 是可运行的。

安装 Chrome 浏览器

这个,没什么好说的。

测试

测试 selenium ,代码如下:

1
2
3
4
5
6
7
8
9
10
11
@Ignore("need chrome driver")
@Test
public void testSelenium() {
System.getProperties().setProperty("webdriver.chrome.driver",
"/Users/zhenguo/Documents/chrome/chromedriver");
WebDriver webDriver = new ChromeDriver();
webDriver.get("http://huaban.com/");
WebElement webElement = webDriver.findElement(By.xpath("/html"));
System.out.println(webElement.getAttribute("outerHTML"));
webDriver.close();
}

如果出现类似以下结果,就说明 webdriver 配置好了:

1
2
3
4
5
6
Starting ChromeDriver 2.25.426935 (820a95b0b81d33e42712f9198c215f703412e1a1) on port 2052
Only local connections are allowed.
Nov 07, 2016 12:35:11 AM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Attempting bi-dialect session, assuming Postel's Law holds true on the remote end
Nov 07, 2016 12:35:13 AM org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: OSS

PS:每次new ChromeDriver(),Selenium都会建立一个Chrome进程,并使用一个随机端口在Java中与chrome进程进行通信来交互。我们需要调用 webDriver.close() 关闭进程。如果是网络爬虫抓取数据的话,最好用线程池来处理。

实现爬虫

上面步骤都设置好了,基于 webmagic 的爬虫实现就比较简单了,代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
public class HuabanProcessor implements PageProcessor {

private Site site;

@Override
public void process(Page page) {
page.addTargetRequests(
page.getHtml().links().regex("http://huaban\\.com/.*").all());
if (page.getUrl().toString().contains("pins")) {
page.putField("img", page.getHtml().
xpath("//div[@id='baidu_image_holder']/img/@src").toString());
} else {
page.getResultItems().setSkip(true);
}
}

@Override
public Site getSite() {
if (null == site) {
site = Site.me().setDomain("huaban.com").setSleepTime(0);
}
return site;
}

public static void main(String[] args) {
Spider.create(new HuabanProcessor()).thread(5)
.addPipeline(new FilePipeline("/Users/zhenguo/Documents/chrome/webmagic/test/"))
.setDownloader(new SeleniumDownloader("/Users/zhenguo/Documents/chrome/chromedriver"))
.addUrl("http://huaban.com/explore/gufenghaibao/")
.runAsync();
}
}

上面 HuabanProcessor 使用到 SeleniumDownloader,代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
   package us.codecraft.webmagic.downloader.selenium;

import org.apache.log4j.Logger;
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.UrlUtils;

import java.io.Closeable;
import java.io.IOException;
import java.util.Map;

/**
* 使用Selenium调用浏览器进行渲染。目前仅支持chrome。<br>
* 需要下载Selenium driver支持。<br>
*
* @author code4crafter@gmail.com <br>
* Date: 13-7-26 <br>
* Time: 下午1:37 <br>
*/
public class SeleniumDownloader implements Downloader, Closeable {

private volatile WebDriverPool webDriverPool;

private Logger logger = Logger.getLogger(getClass());

private int sleepTime = 0;

private int poolSize = 1;

private static final String DRIVER_PHANTOMJS = "phantomjs";

/**
* 新建
*
* @param chromeDriverPath chromeDriverPath
*/
public SeleniumDownloader(String chromeDriverPath) {
System.getProperties().setProperty("webdriver.chrome.driver",
chromeDriverPath);
}

/**
* Constructor without any filed. Construct PhantomJS browser
*
* @author bob.li.0718@gmail.com
*/
public SeleniumDownloader() {
// System.setProperty("phantomjs.binary.path",
// "/Users/Bingo/Downloads/phantomjs-1.9.7-macosx/bin/phantomjs");
}

/**
* set sleep time to wait until load success
*
* @param sleepTime sleepTime
* @return this
*/
public SeleniumDownloader setSleepTime(int sleepTime) {
this.sleepTime = sleepTime;
return this;
}

@Override
public Page download(Request request, Task task) {
checkInit();
WebDriver webDriver;
try {
webDriver = webDriverPool.get();
} catch (InterruptedException e) {
logger.warn("interrupted", e);
return null;
}
logger.info("downloading page " + request.getUrl());
webDriver.get(request.getUrl());
try {
Thread.sleep(sleepTime);
} catch (InterruptedException e) {
e.printStackTrace();
}
WebDriver.Options manage = webDriver.manage();
Site site = task.getSite();
if (site.getCookies() != null) {
for (Map.Entry<String, String> cookieEntry : site.getCookies()
.entrySet()) {
Cookie cookie = new Cookie(cookieEntry.getKey(),
cookieEntry.getValue());
manage.addCookie(cookie);
}
}

/*
* TODO You can add mouse event or other processes
*
* @author: bob.li.0718@gmail.com
*/

WebElement webElement = webDriver.findElement(By.xpath("/html"));
String content = webElement.getAttribute("outerHTML");
Page page = new Page();
page.setRawText(content);
page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content,
request.getUrl())));
page.setUrl(new PlainText(request.getUrl()));
page.setRequest(request);
webDriverPool.returnToPool(webDriver);
return page;
}

private void checkInit() {
if (webDriverPool == null) {
synchronized (this) {
webDriverPool = new WebDriverPool(poolSize);
}
}
}

@Override
public void setThread(int thread) {
this.poolSize = thread;
}

@Override
public void close() throws IOException {
webDriverPool.closeAll();
}
}

WebDriverPool 代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
package us.codecraft.webmagic.downloader.selenium;

import org.apache.log4j.Logger;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

import java.io.FileReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;

/**
* @author code4crafter@gmail.com <br>
* Date: 13-7-26 <br>
* Time: 下午1:41 <br>
*/
class WebDriverPool {
private Logger logger = Logger.getLogger(getClass());

private final static int DEFAULT_CAPACITY = 5;

private final int capacity;

private final static int STAT_RUNNING = 1;

private final static int STAT_CLODED = 2;

private AtomicInteger stat = new AtomicInteger(STAT_RUNNING);

/*
* new fields for configuring phantomJS
*/
private WebDriver mDriver = null;
private boolean mAutoQuitDriver = true;

private static final String CONFIG_FILE = "/Users/zhenguo/Documents/develop/github/webmagic/webmagic-selenium/config.ini";
private static final String DRIVER_FIREFOX = "firefox";
private static final String DRIVER_CHROME = "chrome";
private static final String DRIVER_PHANTOMJS = "phantomjs";

protected static Properties sConfig;
protected static DesiredCapabilities sCaps;

/**
* Configure the GhostDriver, and initialize a WebDriver instance. This part
* of code comes from GhostDriver.
* https://github.com/detro/ghostdriver/tree/master/test/java/src/test/java/ghostdriver
*
* @author bob.li.0718@gmail.com
* @throws IOException
*/
public void configure() throws IOException {
// Read config file
sConfig = new Properties();
sConfig.load(new FileReader(CONFIG_FILE));

// Prepare capabilities
sCaps = new DesiredCapabilities();
sCaps.setJavascriptEnabled(true);
sCaps.setCapability("takesScreenshot", false);

String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);

// Fetch PhantomJS-specific configuration parameters
if (driver.equals(DRIVER_PHANTOMJS)) {
// "phantomjs_exec_path"
if (sConfig.getProperty("phantomjs_exec_path") != null) {
sCaps.setCapability(
PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
sConfig.getProperty("phantomjs_exec_path"));
} else {
throw new IOException(
String.format(
"Property '%s' not set!",
PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY));
}
// "phantomjs_driver_path"
if (sConfig.getProperty("phantomjs_driver_path") != null) {
System.out.println("Test will use an external GhostDriver");
sCaps.setCapability(
PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_PATH_PROPERTY,
sConfig.getProperty("phantomjs_driver_path"));
} else {
System.out
.println("Test will use PhantomJS internal GhostDriver");
}
}

// Disable "web-security", enable all possible "ssl-protocols" and
// "ignore-ssl-errors" for PhantomJSDriver
// sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new
// String[] {
// "--web-security=false",
// "--ssl-protocol=any",
// "--ignore-ssl-errors=true"
// });

ArrayList<String> cliArgsCap = new ArrayList<String>();
cliArgsCap.add("--web-security=false");
cliArgsCap.add("--ssl-protocol=any");
cliArgsCap.add("--ignore-ssl-errors=true");
sCaps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
cliArgsCap);

// Control LogLevel for GhostDriver, via CLI arguments
sCaps.setCapability(
PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_CLI_ARGS,
new String[] { "--logLevel="
+ (sConfig.getProperty("phantomjs_driver_loglevel") != null ? sConfig
.getProperty("phantomjs_driver_loglevel")
: "INFO") });

// String driver = sConfig.getProperty("driver", DRIVER_PHANTOMJS);

// Start appropriate Driver
if (isUrl(driver)) {
sCaps.setBrowserName("phantomjs");
mDriver = new RemoteWebDriver(new URL(driver), sCaps);
} else if (driver.equals(DRIVER_FIREFOX)) {
mDriver = new FirefoxDriver(sCaps);
} else if (driver.equals(DRIVER_CHROME)) {
mDriver = new ChromeDriver(sCaps);
} else if (driver.equals(DRIVER_PHANTOMJS)) {
mDriver = new PhantomJSDriver(sCaps);
}
}

/**
* check whether input is a valid URL
*
* @author bob.li.0718@gmail.com
* @param urlString urlString
* @return true means yes, otherwise no.
*/
private boolean isUrl(String urlString) {
try {
new URL(urlString);
return true;
} catch (MalformedURLException mue) {
return false;
}
}

/**
* store webDrivers created
*/
private List<WebDriver> webDriverList = Collections
.synchronizedList(new ArrayList<WebDriver>());

/**
* store webDrivers available
*/
private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>();

public WebDriverPool(int capacity) {
this.capacity = capacity;
}

public WebDriverPool() {
this(DEFAULT_CAPACITY);
}

/**
*
* @return
* @throws InterruptedException
*/
public WebDriver get() throws InterruptedException {
checkRunning();
WebDriver poll = innerQueue.poll();
if (poll != null) {
return poll;
}
if (webDriverList.size() < capacity) {
synchronized (webDriverList) {
if (webDriverList.size() < capacity) {

// add new WebDriver instance into pool
try {
configure();
innerQueue.add(mDriver);
webDriverList.add(mDriver);
} catch (IOException e) {
e.printStackTrace();
}

// ChromeDriver e = new ChromeDriver();
// WebDriver e = getWebDriver();
// innerQueue.add(e);
// webDriverList.add(e);
}
}

}
return innerQueue.take();
}

public void returnToPool(WebDriver webDriver) {
checkRunning();
innerQueue.add(webDriver);
}

protected void checkRunning() {
if (!stat.compareAndSet(STAT_RUNNING, STAT_RUNNING)) {
throw new IllegalStateException("Already closed!");
}
}

public void closeAll() {
boolean b = stat.compareAndSet(STAT_RUNNING, STAT_CLODED);
if (!b) {
throw new IllegalStateException("Already closed!");
}
for (WebDriver webDriver : webDriverList) {
logger.info("Quit webDriver" + webDriver);
webDriver.quit();
webDriver = null;
}
}

}

以上代码参考自 https://github.com/code4craft/webmagic

到此,动态加载的页面数据抓取就实现了。本文使用 selenium 作为渲染的方法,还有很多其他的方法,例如 phantomjs 和 htmlunit 等。有空了可以尝试其他的方法,希望本文对你有所帮助。


本文地址 http://94275.cn/2016/11/06/web-crawler-2/ 作者为 Zhenguo

author:Zhenguo
Author: Zhenguo      Blog: 94275.cn/     Email: jinzhenguo1990@gmail.com
I have almost 13 years of application development experience and have a keen interested in the latest emerging technologies. I use my spare time to turn my experience, ideas and love for IT tech into informative articles, tutorials and more in hope to help others and learn more.
文章目录
  1. 1. 分析
  2. 2. 实现
    1. 2.1. 添加依赖
    2. 2.2. 下载 driver
    3. 2.3. 安装 Chrome 浏览器
    4. 2.4. 测试
    5. 2.5. 实现爬虫
返回顶部