Is Crawlera still free to use?

New to Node.js: a small problem using node-crawler - CNode Community
I'm new to Node.js and wanted to write a crawler. I searched around and found node-crawler on GitHub, installed it with npm, copied the example, compiled and ran it, and got the error below... (the example code can be found on the project's GitHub homepage)
$("#content a").each(function(a) {
^ TypeError: undefined is not a function
at Object.Crawler.callback (D:\Study\nodejs\myCrawler\index.js:10:9)
at exports.Crawler.self.onContent.jsdom.env.done (D:\Study\nodejs\myCrawler\node_modules\crawler\lib\crawler.js:212:37)
at exports.env.exports.jsdom.env.scriptComplete (D:\Study\nodejs\myCrawler\node_modules\crawler\node_modules\jsdom\lib\jsdom.js:205:39)
at process.startup.processNextTick.process._tickCallback (node.js:244:9)
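For reference, what I ran is essentially the snippet from the project README; roughly like this (the URL here is just a placeholder, and the options may differ slightly from the README):

var Crawler = require("crawler").Crawler;

var c = new Crawler({
    "maxConnections": 10,
    // Called for each crawled page; $ is a server-side jQuery instance
    // bound to the page's DOM through jsdom.
    "callback": function(error, result, $) {
        // The TypeError above is thrown on the next line when $ is
        // undefined, i.e. when jQuery could not be injected into the page.
        $("#content a").each(function(index, a) {
            c.queue(a.href);
        });
    }
});

// Queue a start URL with the default callback (placeholder URL)
c.queue("http://www.example.com/");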
My guess is that it's a jQuery path problem? But I checked and everything looks fine; I installed straight from npm and didn't change anything. Could anyone help me figure out what might be causing this?
OS:Windows
NodeJS:0.8.15 (x64)
Bumping this, still hoping for an answer...
Post your code and let us take a look.
I didn't write anything myself. I just used the example from the node-crawler homepage...
So for now I can only assume it's a problem with this version on Windows... I'll try a different version when I get a chance.
OP, did you ever solve this? I'm hitting the same thing.
node-crawler seems to pull in quite a few dependencies; have you checked that they're all installed? Could jsdom be the missing one?
"dependencies": {
  "request": "2.12.0",
  "jsdom": "0.2.19",
  "generic-pool": "2.0.2",
  "htmlparser": "1.7.6",
  "underscore": "1.3.3",
  "jschardet": "1.0.2",
  "iconv-lite": "0.2.7"
}
Don't bother trying; it doesn't work on Windows.
I took another look, and node-crawler does indeed have a bug on Windows:
// jsdom doesn't support adding local scripts,
// We have to read jQuery from the local fs
if (toQueue.jQueryUrl.match(/^(file\:\/\/|\/)/)) {
// TODO cache this
fs.readFile(toQueue.jQueryUrl.replace(/^file\:\/\//,""),"utf-8",function(err,jq) {
The regex match at line 273 of crawler/lib/crawler.js doesn't handle Windows filesystem paths, so jquery.js never gets read. Simply changing the condition to if(true) lets it run normally.
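If you don't want to hard-code if(true), a slightly more targeted workaround is to also accept Windows drive-letter paths in that check. This is only a sketch against the snippet quoted above, not a patch from the project:

// Treat file:// URLs, POSIX absolute paths and Windows drive-letter
// paths (e.g. D:\...) as local jQuery files.
if (toQueue.jQueryUrl.match(/^(file:\/\/|\/|[a-zA-Z]:\\)/)) {
    var jqueryPath = toQueue.jQueryUrl.replace(/^file:\/\//, "");
    fs.readFile(jqueryPath, "utf-8", function(err, jq) {
        // ...then hand jq to jsdom exactly as the original code does
    });
}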
Still, I'd recommend writing your own crawler. node-crawler just glues a few common modules together, and the over-wrapping actually makes it less flexible.
Download a Windows build of jsdom from here.
For node-crawler on Windows, fork my version instead; the original is broken and was probably only ever meant to run on Unix-like systems.
It doesn't work on OS X either.
PHPCrawl webcrawler - Just Code - ITeye
1. PHPCrawl
PHPCrawl is a framework for crawling/spidering websites, written in the programming language PHP, so you can simply call it a webcrawler library or crawler engine for PHP. PHPCrawl "spiders" websites and passes information about all found documents (pages, links, files and so on) to users of the library for further processing. It provides several options for specifying the behaviour of the crawler, such as URL- and content-type filters, cookie handling, robots.txt handling, limiting options, multiprocessing and much more. PHPCrawl is completely free open-source software and is licensed under the .
To get a first impression of how to use the crawler you may want to take a look at the  inside the manual section. A complete reference and documentation of all available options and methods of the framework can be found in the .
The current version of the phpcrawl package and older releases can be downloaded from a . Note to users of phpcrawl version 0.7x or before: although some method names and parameters have changed in version 0.8, it should be fully compatible with older versions of phpcrawl.
Installation & Quickstart
The following steps show how to use phpcrawl:
Unpack the phpcrawl-package somewhere. That's all you have to do for installation.
Include the phpcrawl main class in your script or project. It's located in the "libs" path of the package.
include("libs/PHPCrawler.class.php");
There are no other includes needed.
Extend the PHPCrawler class and override the handleDocumentInfo()-method with your own code to process the information about every document the crawler finds on its way.
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
  {
    // Your code comes here!
    // Do something with the $PageInfo-object that
    // contains all information about the currently
    // received document.

    // As example we just print out the URL of the document
    echo $PageInfo->url."\n";
  }
}
For a list of all available information about a page or file within the handleDocumentInfo()-method, see the PHPCrawlerDocumentInfo-reference. Note to users of phpcrawl 0.7x or before: the old, overridable method "", which receives the document information as an array, is still present and gets called. PHPcrawl 0.8 is fully compatible with scripts written for earlier versions.
Create an instance of that class in your script or project, define the behaviour of the crawler and start the crawling-process.
$crawler = new MyCrawler();
$crawler->setURL("");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->go();
For a list of all available setup options/methods of the crawler, take a look at the PHPCrawler class reference.
Tutorial: Example Script
The following code is a simple example of using phpcrawl. The listed script just "spiders" some pages of www.php.net until a traffic limit of 1 MB is reached and prints out some information about all found documents. Please note that this example script (and others) also comes in a file called "example.php" with the phpcrawl package. It's recommended to run it from the command line (php CLI).
// It may take a while to crawl a site ...
set_time_limit(10000);

// Include the phpcrawl-mainclass
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo()-method
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP-status-Code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;

    // Print the refering URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;

    // Print if the content of the document was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb;

    // Now you should do something with the content of the actual
    // received page or file ($DocInfo->source), we skip it in this example
  }
}

// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.
$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we don't want to "suck" the whole site)
$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;
2. PHP Crawler
PHP Crawler is a simple website search script for small-to-medium websites. The only requirements are PHP and MySQL; no shell access is required.
Source/download:
As you all know, Google crawls web pages and indexes them into its database of millions and trillions of pages, using a piece of software called a spider to do the job. This spider indexes web pages across the internet at great speed. In the same way, I wrote a simple mini web crawler that crawls a given web page, extracts its links, and displays them. I used PHP and jQuery: when someone types in a URL and clicks "crawl", the tool fetches the whole page and displays the links found in it.
You can see the demo here and download the script for free.
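The core of such a link extractor is small. Here is a rough sketch of the same idea in Node.js (for illustration only; the author's actual tool uses PHP and jQuery, and the URL below is a placeholder):

// Fetch a page and print every href found in its anchor tags.
var http = require("http");

function listLinks(url) {
    http.get(url, function(res) {
        var html = "";
        res.on("data", function(chunk) { html += chunk; });
        res.on("end", function() {
            // Naive extraction of href attributes; a real crawler would
            // use an HTML parser and resolve relative URLs.
            var re = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
            var match;
            while ((match = re.exec(html)) !== null) {
                console.log(match[1]);
            }
        });
    });
}

listLinks("http://www.example.com/");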
crawler
n. one that crawls; crawler tractor, track-laying traction unit
Web definitions:
Web crawler (Crawler): a crawler starts from one or more seed hyperlinks, discovers new web pages and saves snapshots of them, then analyzes them, extracts the hyperlinks they contain, and expands its list of links to crawl...
Robot-based search engines: a Robot, sometimes also called a Spider, Wanderer, Crawler, or Worm, is a software program that can recursively visit new documents by following the hyperlinks inside Web documents.
Crawler (business type): this kind of business is the least likely to make a breakthrough in online transactions. It is not necessarily impossible, but only a very small number of exceptions can make online transactions work effectively.
A web crawler (Crawler) is the program a search engine uses to gather information and is an important component of a search engine; the explosive growth of information on the Internet places very high demands on crawler performance.
Usage in academic abstracts:
A web crawler is an important part of a search engine; it is responsible for gathering information from the network.
The discovery of Sybil groups boils down to a max-flow/min-cut problem; a virtual node is introduced and a crawler is used to find the Sybil groups.
Finally, a simple Web community search engine was built using a web crawler and the open-source Lucene library, with ranking of results by relevance and grouping of search results.
A web crawler was designed that downloads only the target web page and its out-linked pages within the same domain.
[ ˈkrɔːlə ]
a person who tries to please someone in order to gain a personal advantage
a person who crawls or creeps along the ground
terrestrial worm that burrows into and helps aerate soil; often surfaces when the ground is cool or wet; used as bait by anglers
a person (or thing) that crawls; a crawling animal (or insect)
[also crawler tractor] (Machinery) a crawler tractor; a tracked vehicle (or crane, etc.)
the caterpillar track (of a tractor, etc.)
[plural] crawlers (a baby's crawling suit; rompers)
[Australian slang] = sycophant
(The above is from the 21st Century Unabridged English-Chinese Dictionary.)
A crawler is a computer program that visits websites and collects information when you do an Internet search.
Example sentences:
The basic design of this crawler is to load the first link to check onto a queue.
A WebSpider or crawler is an automated program that follows links on websites and calls a WebRobot to handle the contents of each link.
Bathed in xenon spotlights, the white spaceship, attached to its twin solid rocket boosters and orange external fuel tank, crept 3.4 miles on the back of an enormous tractor called the Crawler.
Set up a crawler that can scrape some webpages and parse some basic data.
Set up a bigger crawler that has to fill out a form or two.
As with all dungeon-crawler, hack-and-slash action RPGs, the gameplay does get old after a while.
Note: this article builds on a series of earlier posts. If you missed them, you can catch up here:
The pitfalls I hit installing the Scrapy Python crawler, and some thoughts beyond programming
Scrapy crawler diary: creating a project, extracting data, and saving it as JSON
Scrapy crawler diary: writing scraped content to a MySQL database
How to keep your Scrapy crawler from getting banned
Crawlera official site: /crawlera/
Crawlera documentation: /crawlera.html
Part 1: Register a Crawlera account and get a Crawlera API key
1. Register and activate a Crawlera account: /account/signup/
Fill in a username, email address, and password, click sign up to complete the registration, then confirm via the confirmation email you receive.
2. Create an Organization
3. After the Organization is created, add a Crawlera user
4. View the API key
Click the Crawlera user's name (jack) to see the API details (the key).
At this point, the Crawlera API information has been obtained.
Part 2: Modify the Scrapy project
Now let's see how to add Crawlera to a Scrapy project.
1. Install scrapy-crawlera:
pip install scrapy-crawlera
2. Edit settings.py. Under DOWNLOADER_MIDDLEWARES, add:
'scrapy_crawlera.CrawleraMiddleware': 600
Other settings:
CRAWLERA_ENABLED = True
CRAWLERA_USER = '<API key>'
CRAWLERA_PASS = 'your Crawlera account password'
Note: because the earlier project used a custom proxy, the following two entries under DOWNLOADER_MIDDLEWARES need to be commented out:
#'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,  # used for the proxy
#'cnblogs.middlewares.ProxyMiddleware': 100,  # used for the proxy
3. Test whether crawling through Crawlera works:
scrapy crawl CnblogsSpider
4. Check the results
Here you can see that Crawlera is working correctly.
5. You can also check the crawl results on the Crawlera website.
That's it for crawling with Crawlera in Scrapy. Crawlera also offers paid customization; if the budget allows, a paid custom Scrapy crawler is another option. The code has been updated here:
/jackgitgz/CnblogsSpider (the code pushed to GitHub has the API key and password removed; to run it, add your own key and password)
Part 3: An aside. If you are not using Scrapy and just want to make calls from plain Python, Crawlera can also be used directly.
1. Via requests:

import requests

url = ""
proxy = ":8010"
proxy_auth = "<API KEY>:"
proxies = {"http": "http://{0}@{1}/".format(proxy_auth, proxy)}
headers = {"X-Crawlera-Use-HTTPS": 1}

r = requests.get(url, proxies=proxies, headers=headers)
print("""Requesting [{}]
through proxy [{}]

Response Time: {}
Response Code: {}
Response Headers:
{}

Response Body:
{}""".format(url, proxy, r.elapsed.total_seconds(), r.status_code, r.headers, r.text))

2. Rewriting the URL for the requests proxy:

import requests
from requests.auth import HTTPProxyAuth

url = ""
headers = {}
proxy_host = ""
proxy_auth = HTTPProxyAuth("<API KEY>", "")
proxies = {"http": "http://{}:8010/".format(proxy_host)}

if url.startswith("https:"):
    url = "http://" + url[8:]
    headers["X-Crawlera-Use-HTTPS"] = "1"

r = requests.get(url, headers=headers, proxies=proxies, auth=proxy_auth)
print("""Requesting [{}]
through proxy [{}]

Response Time: {}
Response Code: {}
Response Headers:
{}

Response Body:
{}""".format(url, proxy_host, r.elapsed.total_seconds(), r.status_code, r.headers, r.text))

That's all for Crawlera. For more about Crawlera, see the official documentation:
/index.html