| Projects on Google Code | Results 1 - 10 of 154 |
HarvestMan is a modular, extensible, and flexible web crawler program and framework written in pure Python. HarvestMan can be used to download files from websites according to a number of customized rules and constraints. It can be used to find information on websites matching keywords or regular e...
crawler, spider, web-crawler, web-bot, robot, agent, internet, web, data-mining, search, offline-browsing, python
=Super Web Crawler=
==Goal==
* Implement a simple web crawler in Python.
* Capable of serving as the backend crawler for a vertical search engine. (100k pages?)
* Store page text and metadata in a MySQL database.
==Working Environment==
* Python 2.6.1
* MySQL 5.1.30
* gVim 7.2 for Windows
* Mysql...
1. Because Google Code's SVN hosting is very unstable, the hyer code is now hosted on GitHub. Its GitHub home page is
http://github.com/xurenlu/hyer/tree/master
Naturally, the code is now version-controlled with git.
WIVET is a benchmarking project that aims to statistically analyze web link extractors. In general, web application vulnerability scanners fall into this category. Given one or more URLs, these scanners try to extract as many input vectors as they possibly can to increase their coverage of the attack surface.
...
crawler, benchmarking, webappsec, vulnerabilityscanner, linkextractor, javascript, webguvenligi, wgt, flash
==[http://joycrawler.googlecode.com/files/dodo-logo.png]==
=Joycrawler is "Webpage Spider" + "PageRank" using the MapReduce mechanism=
==<font color="blue">Crawling Interval is allowed! A huge update is coming! </font>==
==<font color="red">Like Joycrawler? Use Joycrawler? Seek the quick solution? ...
This project aims to create a web crawler that uses a network of computers to speed up crawling.
The application is a software agent that visits a list of URLs, identifies all the hyperlinks in each page, and adds them to the list of URLs to visit.
The main objectives are to:<br />
- Ana...
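The crawl loop this entry describes (visit a list of URLs, find the hyperlinks in each page, add them to the list) might look roughly like the sketch below. It is not taken from the project: the names `LinkParser`, `fetch`, `crawl`, and `max_pages` are hypothetical, it runs on a single machine rather than a network of computers, and it uses only the Python standard library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page over the network (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=100):
    """Visit URLs breadth-first and return them in visit order."""
    frontier = deque(seeds)   # the list of URLs still to visit
    seen = set(seeds)         # every URL ever queued, to avoid revisits
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited
```

Passing the fetcher in as `fetch_page` is a design choice made for this sketch: it lets the loop be exercised against canned pages, and a distributed variant could hand each dequeued URL to a remote worker at the same point.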
pget is a Python web crawler that aims to cover the shortcomings of [http://www.gnu.org/software/wget/ wget]:
* threaded downloads
* ability to cycle through a provided list of proxies
* can resume a crawl if interrupted (by saving state in a database)
* can restrict to just crawlin...
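As an illustration of the "saving state in a database" feature listed above, here is a minimal sketch of a resumable crawl frontier. It does not reflect pget's actual schema or code, which this listing does not show: the `CrawlState` class, the `urls` table, and the choice of SQLite are all assumptions made for the example.

```python
import sqlite3


class CrawlState:
    """Persists the crawl frontier so an interrupted crawl can resume."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def add(self, url):
        # INSERT OR IGNORE leaves an already-known URL's done flag intact.
        self.db.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.db.commit()

    def mark_done(self, url):
        self.db.execute("UPDATE urls SET done = 1 WHERE url = ?", (url,))
        self.db.commit()

    def pending(self):
        # URLs that were queued but not yet fetched, in insertion order.
        rows = self.db.execute(
            "SELECT url FROM urls WHERE done = 0 ORDER BY rowid"
        )
        return [row[0] for row in rows]

    def close(self):
        self.db.close()
```

After a restart, reopening the same database file and calling `pending()` returns the URLs that still need fetching, so the crawl picks up where it stopped rather than starting over.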
Retriever is a simple crawler packaged as a Java library that allows developers to collect and manipulate documents reachable over a variety of protocols (e.g. http, smb). You can easily crawl documents shared on a LAN, on the Web, and in many other sources.
The goal of the project is to prepare a tool for collecting data from wykop.pl. Developed by students of Wrocław University of Technology and members of this [http://www.qualityproviders.pl IT company].
A simple distributed crawler implemented in Python.