Welcome to Redpoll!
The project is named after the redpoll, a small finch of northern North America and Eurasia with a red crown and a black chin. We hope the project will grow as nimbly as these birds.
What is Redpoll?
Redpoll is a distributed machine learning library written in Java. It runs on Hadoop, an open source implementation of Google's MapReduce computing model. We intend to parallelize traditional classification and clustering algorithms such as Naive Bayes, K-Means, and EM so that they can handle large-scale data sets. Redpoll is licensed under the Apache License 2.0.
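To give a feel for how a clustering algorithm such as K-Means splits into map and reduce phases, here is a minimal single-machine sketch in plain Java. It is not taken from the Redpoll source and uses no Hadoop classes; the class name KMeansSketch and the methods nearest, mean, and iterate are hypothetical names chosen for illustration. The "map" step assigns each point to its nearest centroid, and the "reduce" step averages the points of each cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names are hypothetical, not Redpoll's API. */
final class KMeansSketch {

    /** "Map" step: index of the centroid closest to the point (squared Euclidean distance). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** "Reduce" step: component-wise mean of all points assigned to one cluster. */
    static double[] mean(List<double[]> points) {
        double[] sum = new double[points.get(0).length];
        for (double[] p : points) {
            for (int d = 0; d < sum.length; d++) {
                sum[d] += p[d];
            }
        }
        for (int d = 0; d < sum.length; d++) {
            sum[d] /= points.size();
        }
        return sum;
    }

    /** One K-Means iteration: group points by nearest centroid, then recompute centroids. */
    static double[][] iterate(double[][] data, double[][] centroids) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : data) {
            groups.computeIfAbsent(nearest(p, centroids), k -> new ArrayList<>()).add(p);
        }
        double[][] next = new double[centroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            List<double[]> members = groups.get(c);
            // An empty cluster keeps its previous centroid.
            next[c] = (members == null) ? centroids[c].clone() : mean(members);
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1, 2}, {9, 9}, {10, 8}};
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(iterate(data, centroids)));
    }
}
```

In a MapReduce setting the grouping done here by the HashMap would be performed by the framework's shuffle, with the cluster index as the key.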
Who may use Redpoll?
Anyone who wants to do data mining on large-scale data sets, and anyone who is interested in distributed computing or data mining.
News
25 November 2008
- Added a K-means clusterer implementation.
- Improved Canopy performance.
9 November 2008
This commit adds a canopy implementation that was developed in a rush. It still has some space-inefficiency problems inherited from Mahout.
6 November 2008
- It can now successfully create a tf-idf based vector space model.
- Fixed some bugs.
4 November 2008 - It can now parse Sogou News
- A Hadoop-style Sogou record reader is finished. Link: Sogou news datasets
- The first map/reduce stage is done: it analyzes the text, counts the document frequency of each term, and outputs the results.
- Added a feature selection method to improve the precision of later steps.
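
The document-frequency pass and the tf-idf weighting mentioned in the news items above can be illustrated with a small plain-Java sketch. It is not Redpoll's code and uses no Hadoop classes. The class name TfIdfSketch and its methods are hypothetical, and the weighting formula (raw term count times ln(N / df)) is an assumption, since the variant Redpoll uses is not stated here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and the tf-idf variant are assumptions. */
final class TfIdfSketch {

    /** Counts, for each term, the number of documents that contain it at least once. */
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each term counts once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Builds a sparse tf-idf vector for one document.
     * Assumes every term of the document is present in df,
     * i.e. df was computed over a corpus that includes this document.
     */
    static Map<String, Double> tfIdf(List<String> doc, Map<String, Integer> df, int numDocs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("a", "b", "a"),
            Arrays.asList("b", "c"));
        Map<String, Integer> df = documentFrequency(docs);
        System.out.println(df);
        System.out.println(tfIdf(docs.get(0), df, docs.size()));
    }
}
```

A term that occurs in every document gets weight 0 under this formula, which is one simple reason a document-frequency count is also useful for feature selection.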
Getting Started
Tutorials will come soon.