Implementing a Web Crawler with Java HttpClient

Please credit the original source when reposting.
In previous work I have implemented simple web crawlers, but never written them up systematically. This post gives a systematic walkthrough of how to implement a web crawler with Java's HttpClient.
For the theory behind web crawlers, the general design ideas, and crawling strategies, see the Baidu Baike entry on "网络爬虫" (web crawler), which already covers them in detail; I won't repeat that here and will focus on the implementation.
HTTP requests:
Before getting to the code, let's first look at how to inspect HTTP request information in the browser; this is the first step in analyzing a site's resources. Right-clicking in the browser window offers an "Inspect Element" option (if you can't find it, F12 works as well). In Chrome it looks like this:
Clicking "Inspect Element" opens the following panel:
The Network tab is the one crawler authors should focus on; opening it shows all the HTTP requests made by the current page, as in the figure below:
Clicking an entry shows the details of that HTTP request, as shown below:
When a program disguises its requests as browser requests, the information under Request Headers is what needs the most attention; sites that require login also depend on these headers. The Response section holds the content returned by the server. This post only deals with text and does not cover images, audio, or video.
The Response is where the content our crawler wants to obtain lives. If it is hard to read there, you can paste the request URL into the browser and use right-click → View Page Source to look at it. By analyzing the strings in the page source and working out consistent rules, you can extract the text you need.
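Before diving into the full classes, here is a minimal, self-contained sketch of that idea, using only the JDK's HttpURLConnection rather than the Commons HttpClient library used in the rest of the post; the URL, class name, and header value are placeholders. It sends a request that carries a browser-like User-Agent, reads back the page source, and pulls one piece of text out of it with a regular expression:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetchDemo {
	public static void main(String[] args) throws Exception {
		//placeholder URL: replace with the page you analyzed in the browser
		URL url = new URL("http://example.com/");
		HttpURLConnection conn = (HttpURLConnection) url.openConnection();
		//copy headers seen under "Request Headers" so the request looks like a browser's
		conn.setRequestProperty("User-Agent", "Mozilla/5.0");
		conn.setConnectTimeout(3500);
		conn.setReadTimeout(3500);

		//read the returned page source line by line
		StringBuilder source = new StringBuilder();
		try (BufferedReader reader = new BufferedReader(
				new InputStreamReader(conn.getInputStream(), "utf-8"))) {
			String line;
			while ((line = reader.readLine()) != null) {
				source.append(line).append("\n");
			}
		}

		//extract the <title> text as a small example of turning source code into data
		Matcher m = Pattern.compile("<title>(.*?)</title>",
				Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(source);
		if (m.find()) {
			System.out.println(m.group(1).trim());
		}
	}
}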
Code implementation:
CrawlBase, the base class that performs the HTTP requests:
/**
 *@Description: base class for fetching page content
 */
package com.lulei.crawl;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.log4j.Logger;

import com.lulei.util.CharsetUtil;

public abstract class CrawlBase {
	private static Logger log = Logger.getLogger(CrawlBase.class);

	//page source code
	private String pageSourceCode = "";
	//response headers
	private Header[] responseHeaders = null;
	//connect timeout
	private static int connectTimeout = 3500;
	//read timeout
	private static int readTimeout = 3500;
	//default maximum number of attempts
	private static int maxConnectTimes = 3;
	//default page charset
	private static String charsetName = "iso-8859-1";
	private static HttpClient httpClient = new HttpClient();

	static {
		httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(connectTimeout);
		httpClient.getHttpConnectionManager().getParams().setSoTimeout(readTimeout);
	}

	/**
	 * @param urlStr
	 * @param charsetName
	 * @param method
	 * @param params
	 * @return whether the request succeeded
	 * @throws HttpException
	 * @throws IOException
	 * @Author: lulei
	 * @Description: access the page with the given method ("get" or "post")
	 */
	public boolean readPage(String urlStr, String charsetName, String method, HashMap<String, String> params) throws HttpException, IOException {
		if ("post".equals(method) || "POST".equals(method)) {
			return readPageByPost(urlStr, charsetName, params);
		}
		return readPageByGet(urlStr, charsetName, params);
	}

	/**
	 * @param urlStr
	 * @param charsetName
	 * @param params
	 * @return whether the request succeeded
	 * @throws HttpException
	 * @throws IOException
	 * @Author: lulei
	 * @Description: access the page with a GET request
	 */
	public boolean readPageByGet(String urlStr, String charsetName, HashMap<String, String> params) throws HttpException, IOException {
		GetMethod getMethod = createGetMethod(urlStr, params);
		return readPage(getMethod, charsetName, urlStr);
	}

	/**
	 * @param urlStr
	 * @param charsetName
	 * @param params
	 * @return whether the request succeeded
	 * @throws HttpException
	 * @throws IOException
	 * @Author: lulei
	 * @Description: access the page with a POST request
	 */
	public boolean readPageByPost(String urlStr, String charsetName, HashMap<String, String> params) throws HttpException, IOException {
		PostMethod postMethod = createPostMethod(urlStr, params);
		return readPage(postMethod, charsetName, urlStr);
	}

	/**
	 * @param method
	 * @param defaultCharset
	 * @param urlStr
	 * @return whether the request succeeded
	 * @throws HttpException
	 * @throws IOException
	 * @Author: lulei
	 * @Description: read the page content and the response headers
	 */
	private boolean readPage(HttpMethod method, String defaultCharset, String urlStr) throws HttpException, IOException {
		int n = maxConnectTimes;
		while (n > 0) {
			try {
				int status = httpClient.executeMethod(method);
				if (status != HttpStatus.SC_OK) {
					log.error("can not connect " + urlStr + "\t" + (maxConnectTimes - n + 1) + "\t" + status);
					return false;
				}
				//response headers
				responseHeaders = method.getResponseHeaders();
				//page source code
				InputStream inputStream = method.getResponseBodyAsStream();
				BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, charsetName));
				StringBuffer stringBuffer = new StringBuffer();
				String lineString = null;
				while ((lineString = bufferedReader.readLine()) != null) {
					stringBuffer.append(lineString);
					stringBuffer.append("\n");
				}
				pageSourceCode = stringBuffer.toString();
				InputStream in = new ByteArrayInputStream(pageSourceCode.getBytes(charsetName));
				String charset = CharsetUtil.getStreamCharset(in, defaultCharset);
				//this special case was added for the IP-location lookup site
				if ("Big5".equals(charset)) {
					charset = "gbk";
				}
				//re-decode the source if the detected charset differs from the one used to read it
				if (!charsetName.toLowerCase().equals(charset.toLowerCase())) {
					pageSourceCode = new String(pageSourceCode.getBytes(charsetName), charset);
				}
				return true;
			} catch (Exception e) {
				e.printStackTrace();
				System.out.println(urlStr + " -- can't connect  " + (maxConnectTimes - n + 1));
				n--;
			}
		}
		return false;
	}

	/**
	 * @param urlStr
	 * @param params
	 * @return GetMethod
	 * @Author: lulei
	 * @Description: build a GET request and set the params as request headers
	 */
	private GetMethod createGetMethod(String urlStr, HashMap<String, String> params) {
		GetMethod getMethod = new GetMethod(urlStr);
		if (params == null) {
			return getMethod;
		}
		Iterator<Entry<String, String>> iter = params.entrySet().iterator();
		while (iter.hasNext()) {
			Map.Entry<String, String> entry = iter.next();
			getMethod.setRequestHeader(entry.getKey(), entry.getValue());
		}
		return getMethod;
	}

	/**
	 * @param urlStr
	 * @param params
	 * @return PostMethod
	 * @Author: lulei
	 * @Description: build a POST request and set the params as form parameters
	 */
	private PostMethod createPostMethod(String urlStr, HashMap<String, String> params) {
		PostMethod postMethod = new PostMethod(urlStr);
		if (params == null) {
			return postMethod;
		}
		Iterator<Entry<String, String>> iter = params.entrySet().iterator();
		while (iter.hasNext()) {
			Map.Entry<String, String> entry = iter.next();
			postMethod.setParameter(entry.getKey(), entry.getValue());
		}
		return postMethod;
	}

	/**
	 * @param urlStr
	 * @param charsetName
	 * @return whether the request succeeded
	 * @throws IOException
	 * @Author: lulei
	 * @Description: access the page with a GET request and no extra header information
	 */
	public boolean readPageByGet(String urlStr, String charsetName) throws IOException {
		return this.readPageByGet(urlStr, charsetName, null);
	}

	/**
	 * @return String
	 * @Author: lulei
	 * @Description: the page source code
	 */
	public String getPageSourceCode() {
		return pageSourceCode;
	}

	/**
	 * @return Header[]
	 * @Author: lulei
	 * @Description: the response headers
	 */
	public Header[] getHeader() {
		return responseHeaders;
	}

	/**
	 * @param timeout
	 * @Author: lulei
	 * @Description: set the connect timeout
	 */
	public void setConnectTimeout(int timeout) {
		httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(timeout);
	}

	/**
	 * @param timeout
	 * @Author: lulei
	 * @Description: set the read timeout
	 */
	public void setReadTimeout(int timeout) {
		httpClient.getHttpConnectionManager().getParams().setSoTimeout(timeout);
	}

	/**
	 * @param maxConnectTimes
	 * @Author: lulei
	 * @Description: set the maximum number of attempts used when a request fails
	 */
	public static void setMaxConnectTimes(int maxConnectTimes) {
		CrawlBase.maxConnectTimes = maxConnectTimes;
	}

	/**
	 * @param connectTimeout
	 * @param readTimeout
	 * @Author: lulei
	 * @Description: set both the connect timeout and the read timeout
	 */
	public void setTimeout(int connectTimeout, int readTimeout) {
		setConnectTimeout(connectTimeout);
		setReadTimeout(readTimeout);
	}

	/**
	 * @param charsetName
	 * @Author: lulei
	 * @Description: set the default charset
	 */
	public static void setCharsetName(String charsetName) {
		CrawlBase.charsetName = charsetName;
	}
}
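As a quick illustration of how this base class is meant to be used, here is a hedged sketch; DemoPage and the URL are invented for illustration only. Any subclass of CrawlBase can fetch a page and then read back the source code and response headers:

import java.io.IOException;

import com.lulei.crawl.CrawlBase;

public class DemoPage extends CrawlBase {
	public static void main(String[] args) throws IOException {
		DemoPage page = new DemoPage();
		//fetch the page with a plain GET (no extra headers) and print what came back
		if (page.readPageByGet("http://example.com/", "utf-8")) {
			System.out.println(page.getPageSourceCode());
			System.out.println("headers returned: " + page.getHeader().length);
		}
	}
}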
CrawlListPageBase is a subclass of CrawlBase; it is the base class for extracting link URLs from a page:
/**
 *@Description: base class for extracting link addresses from a page
 */
package com.lulei.crawl;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import com.lulei.util.DoRegex;

public abstract class CrawlListPageBase extends CrawlBase {
	//URL of the list page itself, used later to resolve relative links
	private String pageurl;

	/**
	 * @param urlStr
	 * @param charsetName
	 * @throws IOException
	 */
	public CrawlListPageBase(String urlStr, String charsetName) throws IOException {
		readPageByGet(urlStr, charsetName);
		pageurl = urlStr;
	}

	/**
	 * @param urlStr
	 * @param charsetName
	 * @param method
	 * @param params
	 * @throws IOException
	 */
	public CrawlListPageBase(String urlStr, String charsetName, String method, HashMap<String, String> params) throws IOException {
		readPage(urlStr, charsetName, method, params);
		pageurl = urlStr;
	}

	/**
	 * @return List<String>
	 * @Author: lulei
	 * @Description: the wanted link addresses found on the page
	 */
	public List<String> getPageUrls() {
		List<String> pageUrls = new ArrayList<String>();
		pageUrls = DoRegex.getArrayList(getPageSourceCode(), getUrlRegexString(), pageurl, getUrlRegexStringNum());
		return pageUrls;
	}

	/**
	 * @return String
	 * @Author: lulei
	 * @Description: regular expression that matches the wanted link addresses on the page
	 */
	public abstract String getUrlRegexString();

	/**
	 * @return int
	 * @Author: lulei
	 * @Description: position (capture group number) of the wanted field in the regular expression
	 */
	public abstract int getUrlRegexStringNum();
}
DoRegex, a utility class that wraps regex-based string matching and searching:
/**
 * @Description: regular-expression utilities
 */
package com.lulei.util;

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoRegex {
	private static String rootUrlRegex = "(http://.*?/)";
	private static String currentUrlRegex = "(http://.*/)";
	private static String ChRegex = "([\u4e00-\u9fa5]+)";

	/**
	 * @param dealStr
	 * @param regexStr
	 * @param splitStr
	 * @param n
	 * @return String
	 * @Author: lulei
	 * @Description: join all matches of group n, separated by splitStr
	 */
	public static String getString(String dealStr, String regexStr, String splitStr, int n) {
		String reStr = "";
		if (dealStr == null || regexStr == null || n < 1 || dealStr.isEmpty()) {
			return reStr;
		}
		splitStr = (splitStr == null) ? "" : splitStr;
		Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
		Matcher matcher = pattern.matcher(dealStr);
		StringBuffer stringBuffer = new StringBuffer();
		while (matcher.find()) {
			stringBuffer.append(matcher.group(n).trim());
			stringBuffer.append(splitStr);
		}
		reStr = stringBuffer.toString();
		//drop the trailing separator
		if (!"".equals(splitStr) && reStr.endsWith(splitStr)) {
			reStr = reStr.substring(0, reStr.length() - splitStr.length());
		}
		return reStr;
	}

	/**
	 * @param dealStr
	 * @param regexStr
	 * @param n
	 * @return String
	 * @Author: lulei
	 * @Description: join all matches of group n into one string
	 */
	public static String getString(String dealStr, String regexStr, int n) {
		return getString(dealStr, regexStr, null, n);
	}

	/**
	 * @param dealStr
	 * @param regexStr
	 * @param n
	 * @return String
	 * @Author: lulei
	 * @Description: first match of group n
	 */
	public static String getFirstString(String dealStr, String regexStr, int n) {
		if (dealStr == null || regexStr == null || n < 1 || dealStr.isEmpty()) {
			return "";
		}
		Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
		Matcher matcher = pattern.matcher(dealStr);
		while (matcher.find()) {
			return matcher.group(n).trim();
		}
		return "";
	}

	/**
	 * @param dealStr
	 * @param regexStr
	 * @param n
	 * @return ArrayList<String>
	 * @Author: lulei
	 * @Description: all matches of group n as a list
	 */
	public static List<String> getList(String dealStr, String regexStr, int n) {
		List<String> reArrayList = new ArrayList<String>();
		if (dealStr == null || regexStr == null || n < 1 || dealStr.isEmpty()) {
			return reArrayList;
		}
		Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
		Matcher matcher = pattern.matcher(dealStr);
		while (matcher.find()) {
			reArrayList.add(matcher.group(n).trim());
		}
		return reArrayList;
	}

	/**
	 * @param url
	 * @param currentUrl
	 * @return String
	 * @Author: lulei
	 * @Description: assemble an absolute URL from a (possibly relative) link and the current page URL
	 */
	private static String getHttpUrl(String url, String currentUrl) {
		try {
			url = encodeUrlCh(url);
		} catch (UnsupportedEncodingException e) {
			e.printStackTrace();
		}
		if (url.indexOf("http") == 0) {
			return url;
		}
		if (url.indexOf("/") == 0) {
			return getFirstString(currentUrl, rootUrlRegex, 1) + url.substring(1);
		}
		return getFirstString(currentUrl, currentUrlRegex, 1) + url;
	}

	/**
	 * @param dealStr
	 * @param regexStr
	 * @param currentUrl
	 * @param n
	 * @return ArrayList<String>
	 * @Author: lulei
	 * @Description: absolute link addresses for all matches of the regular expression
	 */
	public static List<String> getArrayList(String dealStr, String regexStr, String currentUrl, int n) {
		List<String> reArrayList = new ArrayList<String>();
		if (dealStr == null || regexStr == null || n < 1 || dealStr.isEmpty()) {
			return reArrayList;
		}
		Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
		Matcher matcher = pattern.matcher(dealStr);
		while (matcher.find()) {
			reArrayList.add(getHttpUrl(matcher.group(n).trim(), currentUrl));
		}
		return reArrayList;
	}

	/**
	 * @param url
	 * @throws UnsupportedEncodingException
	 * @Author: lulei
	 * @Description: URL-encode the Chinese characters in a link address
	 */
	public static String encodeUrlCh(String url) throws UnsupportedEncodingException {
		while (true) {
			String s = getFirstString(url, ChRegex, 1);
			if ("".equals(s)) {
				return url;
			}
			url = url.replaceAll(s, URLEncoder.encode(s, "utf-8"));
		}
	}

	/**
	 * @param dealStr
	 * @param regexStr
	 * @param array positions of the capture groups to extract
	 * @Author: lulei
	 * @Description: all matches, each returned as a String[] of the requested groups
	 */
	public static List<String[]> getListArray(String dealStr, String regexStr, int[] array) {
		List<String[]> reArrayList = new ArrayList<String[]>();
		if (dealStr == null || regexStr == null || array == null) {
			return reArrayList;
		}
		for (int i = 0; i < array.length; i++) {
			if (array[i] < 1) {
				return reArrayList;
			}
		}
		Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
		Matcher matcher = pattern.matcher(dealStr);
		while (matcher.find()) {
			String[] ss = new String[array.length];
			for (int i = 0; i < array.length; i++) {
				ss[i] = matcher.group(array[i]).trim();
			}
			reArrayList.add(ss);
		}
		return reArrayList;
	}

	/**
	 * @param dealStr
	 * @param regexStr
	 * @param array positions of the capture groups to extract
	 * @Author: lulei
	 * @Description: all matches, the requested groups of each match concatenated into one string
	 */
	public static List<String> getStringArray(String dealStr, String regexStr, int[] array) {
		List<String> reStringList = new ArrayList<String>();
		if (dealStr == null || regexStr == null || array == null) {
			return reStringList;
		}
		for (int i = 0; i < array.length; i++) {
			if (array[i] < 1) {
				return reStringList;
			}
		}
		Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
		Matcher matcher = pattern.matcher(dealStr);
		while (matcher.find()) {
			StringBuffer sb = new StringBuffer();
			for (int i = 0; i < array.length; i++) {
				sb.append(matcher.group(array[i]).trim());
			}
			reStringList.add(sb.toString());
		}
		return reStringList;
	}

	/**
	 * @param dealStr
	 * @param regexStr
	 * @param array positions of the capture groups to extract
	 * @Author: lulei
	 * @Description: first match, returned as a String[] of the requested groups
	 */
	public static String[] getFirstArray(String dealStr, String regexStr, int[] array) {
		if (dealStr == null || regexStr == null || array == null) {
			return null;
		}
		for (int i = 0; i < array.length; i++) {
			if (array[i] < 1) {
				return null;
			}
		}
		Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
		Matcher matcher = pattern.matcher(dealStr);
		while (matcher.find()) {
			String[] ss = new String[array.length];
			for (int i = 0; i < array.length; i++) {
				ss[i] = matcher.group(array[i]).trim();
			}
			return ss;
		}
		return null;
	}
}
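A short usage sketch of the utility above; the HTML string and URLs are made up for illustration. getArrayList both matches the href values and, through getHttpUrl, turns relative links into absolute ones based on the current page URL:

import java.util.List;

import com.lulei.util.DoRegex;

public class DoRegexDemo {
	public static void main(String[] args) {
		String html = "<li>o <a href=\"/news/1.html\">A</a></li>"
				+ "<li>o <a href=\"http://other.site/2.html\">B</a></li>";
		//group 1 of the regex holds the href value; relative links are resolved against the list page URL
		List<String> links = DoRegex.getArrayList(html, "<a href=\"(.*?)\"",
				"http://example.com/list/index.html", 1);
		for (String link : links) {
			System.out.println(link);
		}
		//expected output: http://example.com/news/1.html and http://other.site/2.html
	}
}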
CharsetUtil, the charset-detection class:
/**
 *@Description: charset detection utility
 */
package com.lulei.util;

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;

public class CharsetUtil {
	private static final CodepageDetectorProxy detector;

	static {
		//initialize the detector
		detector = CodepageDetectorProxy.getInstance();
		detector.add(new ParsingDetector(false));
		detector.add(ASCIIDetector.getInstance());
		detector.add(UnicodeDetector.getInstance());
		detector.add(JChardetFacade.getInstance());
	}

	/**
	 * @param url
	 * @param defaultCharset
	 * @Author: lulei
	 * @return the charset of the resource at the URL, or defaultCharset if detection fails
	 */
	public static String getStreamCharset(URL url, String defaultCharset) {
		if (url == null) {
			return defaultCharset;
		}
		try {
			//use the third-party cpdetector library to detect the charset
			Charset charset = detector.detectCodepage(url);
			if (charset != null) {
				return charset.name();
			}
		} catch (Exception e1) {
			e1.printStackTrace();
		}
		return defaultCharset;
	}

	/**
	 * @param inputStream
	 * @param defaultCharset
	 * @Author: lulei
	 * @Description: detect the charset of an input stream, falling back to defaultCharset
	 */
	public static String getStreamCharset(InputStream inputStream, String defaultCharset) {
		if (inputStream == null) {
			return defaultCharset;
		}
		int count = 200;
		try {
			count = inputStream.available();
		} catch (IOException e) {
			e.printStackTrace();
		}
		try {
			//use the third-party cpdetector library to detect the charset
			Charset charset = detector.detectCodepage(inputStream, count);
			if (charset != null) {
				return charset.name();
			}
		} catch (Exception e1) {
			e1.printStackTrace();
		}
		return defaultCharset;
	}
}
The four classes above form the basic framework for crawling text resources from the web. The rest of this post uses a concrete example to show how to fetch web text with them.
Baidu News example:
1) Find a Baidu News update list page, such as /n?cmd=4&class=civilnews&pn=1&from=tab. The page looks like the figure below:
The article URL links look like the figure below:
Based on an analysis of the page source, the BaiduNewList class crawls the Baidu News list page; the code is as follows:
/**
 *@Description: Baidu News rolling list page; extracts the article links on the page
 */
package com.lulei.crawl.news;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlListPageBase;

public class BaiduNewList extends CrawlListPageBase {
	private static HashMap<String, String> params;

	/**
	 * add request headers to disguise the request as a browser request
	 */
	static {
		params = new HashMap<String, String>();
		params.put("Referer", "");
		params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0. Safari/537.36");
	}

	public BaiduNewList(String urlStr) throws IOException {
		super(urlStr, "utf-8", "get", params);
	}

	@Override
	public String getUrlRegexString() {
		//regular expression for the article links on the news list page
		return "o <a href=\"(.*?)\"";
	}

	@Override
	public int getUrlRegexStringNum() {
		//the link address is capture group 1 of the regular expression above
		return 1;
	}

	/**
	 * @param args
	 * @throws IOException
	 * @Author: lulei
	 * @Description: test case
	 */
	public static void main(String[] args) throws IOException {
		BaiduNewList baidu = new BaiduNewList("/n?cmd=4&class=sportnews&pn=1&from=tab");
		for (String s : baidu.getPageUrls()) {
			System.out.println(s);
		}
	}
}
2) The URLs obtained in step one lead to the content pages of the individual news articles. Because the articles on the Baidu News list come from many different sites, it is hard to find one structure that fits them all; on most news sites, however, the body text sits inside p tags, so the content is extracted in the following way, as shown below:
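To make that idea concrete, here is a tiny hedged sketch (the HTML snippet is invented) of the p-tag pattern used by the News class below: with DOTALL the pattern collects the paragraph text of a typical article page while everything else is discarded.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PTagDemo {
	public static void main(String[] args) {
		String html = "<html><title>demo</title><body><div>nav</div>"
				+ "<p class=\"txt\">First paragraph.</p>\n<p>Second\nparagraph.</p></body></html>";
		//DOTALL lets (.*?) cross line breaks inside one paragraph
		Matcher m = Pattern.compile("<p.*?>(.*?)</p>",
				Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
		StringBuilder content = new StringBuilder();
		while (m.find()) {
			content.append(m.group(1)).append(" ");
		}
		//strip any tags left inside the paragraphs, just as the News class does
		System.out.println(content.toString().replaceAll("<.*?>", "").trim());
	}
}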
The News class is implemented as follows:
/**
 *@Description: extracts the content of a news article page
 */
package com.lulei.crawl.news;

import java.io.IOException;
import java.util.HashMap;

import org.apache.commons.httpclient.HttpException;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;

public class News extends CrawlBase {
	private static String contentRegex = "<p.*?>(.*?)</p>";
	private static String titleRegex = "<title>(.*?)</title>";
	private static int maxLength = 300;

	private String url;
	private String content;
	private String title;
	private String type;

	private static HashMap<String, String> params;
	/**
	 * add request headers to disguise the request as a browser request
	 */
	static {
		params = new HashMap<String, String>();
		params.put("Referer", "");
		params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0. Safari/537.36");
	}

	/**
	 * @Author: lulei
	 * @Description: treat the text inside p tags as the body; if it is longer than maxLength, keep only the first part
	 */
	private void setContent() {
		String content = DoRegex.getString(getPageSourceCode(), contentRegex, 1);
		content = content.replaceAll("\n", "")
				.replaceAll("<script.*?/script>", "")
				.replaceAll("<style.*?/style>", "")
				.replaceAll("<.*?>", "");
		this.content = content.length() > maxLength ? content.substring(0, maxLength) : content;
	}

	/**
	 * @Author: lulei
	 * @Description: treat the text inside the title tag as the title
	 */
	private void setTitle() {
		this.title = DoRegex.getString(getPageSourceCode(), titleRegex, 1);
	}

	public News(String url) throws HttpException, IOException {
		this.url = url;
		readPageByGet(url, "utf-8", params);
		setContent();
		setTitle();
	}

	public String getUrl() {
		return url;
	}
	public void setUrl(String url) {
		this.url = url;
	}
	public String getContent() {
		return content;
	}
	public String getTitle() {
		return title;
	}
	public String getType() {
		return type;
	}
	public void setType(String type) {
		this.type = type;
	}
	public static void setMaxLength(int maxLength) {
		News.maxLength = maxLength;
	}

	/**
	 * @param args
	 * @throws HttpException
	 * @throws IOException
	 * @Author: lulei
	 * @Description: test case
	 */
	public static void main(String[] args) throws HttpException, IOException {
		News news = new News("/viewnews-1634777.html");
		System.out.println(news.getContent());
		System.out.println(news.getTitle());
	}
}
3) Write the entry point of the crawl. To keep things simple, only two levels are analyzed here, so the URLs of the news update list pages are written directly into the program, as shown below:
Running one crawl task looks like the figure below:
In the main function you only need to call run() once, or invoke it on a schedule (see the scheduling sketch after the class); the full code is as follows:
/**
 *@Description: crawl entry point: fetch the Baidu News list pages and the articles they link to
 */
package com.lulei.knn.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import com.lulei.crawl.news.BaiduNewList;
import com.lulei.crawl.news.News;
import com.lulei.knn.index.KnnIndex;
import com.lulei.knn.index.KnnSearch;
import com.lulei.util.ParseMD5;

public class CrawlNews {
	private static List<Info> infos;
	//KnnIndex, KnnSearch and NewsBean come from the accompanying KNN classification code
	private static KnnIndex knnIndex = new KnnIndex();
	private static KnnSearch knnSearch = new KnnSearch();
	private static HashMap<String, Integer> result;

	static {
		//list pages to crawl, each with its expected category label
		//(体育类 = sports, 军事类 = military, 财经类 = finance, 互联网 = internet, 房产类 = real estate, 游戏类 = games)
		infos = new ArrayList<Info>();
		infos.add(new Info("/n?cmd=4&class=sportnews&pn=1&from=tab", "体育类"));
		infos.add(new Info("/n?cmd=4&class=sportnews&pn=2&from=tab", "体育类"));
		infos.add(new Info("/n?cmd=4&class=sportnews&pn=3&from=tab", "体育类"));
		infos.add(new Info("/n?cmd=4&class=mil&pn=1&sub=0", "军事类"));
		infos.add(new Info("/n?cmd=4&class=mil&pn=2&sub=0", "军事类"));
		infos.add(new Info("/n?cmd=4&class=mil&pn=3&sub=0", "军事类"));
		infos.add(new Info("/n?cmd=4&class=finannews&pn=1&sub=0", "财经类"));
		infos.add(new Info("/n?cmd=4&class=finannews&pn=2&sub=0", "财经类"));
		infos.add(new Info("/n?cmd=4&class=finannews&pn=3&sub=0", "财经类"));
		infos.add(new Info("/n?cmd=4&class=internet&pn=1&from=tab", "互联网"));
		infos.add(new Info("/n?cmd=4&class=housenews&pn=1&sub=0", "房产类"));
		infos.add(new Info("/n?cmd=4&class=housenews&pn=2&sub=0", "房产类"));
		infos.add(new Info("/n?cmd=4&class=housenews&pn=3&sub=0", "房产类"));
		infos.add(new Info("/n?cmd=4&class=gamenews&pn=1&sub=0", "游戏类"));
		infos.add(new Info("/n?cmd=4&class=gamenews&pn=2&sub=0", "游戏类"));
		infos.add(new Info("/n?cmd=4&class=gamenews&pn=3&sub=0", "游戏类"));
	}

	/**
	 *@Description: a URL to crawl together with its expected category
	 *@Author: lulei
	 */
	static class Info {
		String url;
		String type;
		Info(String url, String type) {
			this.url = url;
			this.type = type;
		}
	}

	/**
	 * @param info
	 * @Author: lulei
	 * @Description: crawl the news articles linked from one list page
	 */
	private void crawl(Info info) throws IOException {
		if (info == null) {
			return;
		}
		BaiduNewList baiduNewList = new BaiduNewList(info.url);
		List<String> urls = baiduNewList.getPageUrls();
		for (String url : urls) {
			try {
				News news = new News(url);
				NewsBean newBean = new NewsBean();
				newBean.setId(ParseMD5.parseStrToMd5L32(url));
				newBean.setType(info.type);
				newBean.setUrl(url);
				newBean.setTitle(news.getTitle());
				newBean.setContent(news.getContent());
				//save the article to the index file
				knnIndex.add(newBean);
				//E = empty content, R = classified into the expected category, W = classified wrongly
				if (news.getContent() == null || "".equals(news.getContent())) {
					result.put("E", 1 + result.get("E"));
					continue;
				}
				if (info.type.equals(knnSearch.getType(news.getContent()))) {
					result.put("R", 1 + result.get("R"));
				} else {
					result.put("W", 1 + result.get("W"));
				}
			} catch (Exception e) {
				e.printStackTrace();
			}
		}
	}

	/**
	 * @Author: lulei
	 * @Description: entry point of one crawl run
	 */
	public void run() {
		result = new HashMap<String, Integer>();
		result.put("R", 0);
		result.put("W", 0);
		result.put("E", 0);
		try {
			for (Info info : infos) {
				System.out.println(info.url + "------start");
				crawl(info);
				System.out.println(info.url + "------end");
			}
			System.out.println("R = " + result.get("R"));
			System.out.println("W = " + result.get("W"));
			System.out.println("E = " + result.get("E"));
			System.out.println("accuracy: " + (result.get("R") * 1.0 / (result.get("R") + result.get("W"))));
			System.out.println("-------------finished---------------");
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	public static void main(String[] args) {
		new CrawlNews().run();
	}
}
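As mentioned above, run() can also be executed periodically instead of just once. A minimal hedged sketch using the JDK's ScheduledExecutorService; the one-hour interval is an arbitrary choice:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CrawlScheduler {
	public static void main(String[] args) {
		ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
		//run one crawl immediately, then repeat every hour
		scheduler.scheduleAtFixedRate(new Runnable() {
			public void run() {
				new CrawlNews().run();
			}
		}, 0, 1, TimeUnit.HOURS);
	}
}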
At this point, a complete crawling program is finished.
