| Title | Finite Mixture Models for Large Data Sets |
|---|---|
| Student | Arijit Das |
| Mentor | Friedrich Leisch |
| Abstract | |
|
Finite mixtures of distributions provide a mathematically grounded approach to the statistical modeling of a wide variety of random phenomena. Because mixture distributions are well suited to modeling heterogeneity in a cluster-analysis context, they are the current state of the art in marketing research for clustering consumers into market segments. The R package flexmix implements a flexible framework for fitting these types of models and is designed mainly for rapid prototyping.
However, fitting these models with flexmix is rather slow for large data sets. The main aim of this project is therefore to re-implement the most popular models in C so that mining large data sets becomes feasible. Specifically, the C implementation of the EM algorithm for mixture models will have the following components:

- Gaussians, no regression (also known as model-based clustering)
- Binomials, no regression (clustering binary data)
- Multinomial, no regression (clustering categorical data)
- Linear model, regression with Gaussian response
- Generalized linear model, regression with response from an exponential family

Since the EM algorithm converges only to a local maximum of the likelihood, it is recommended to repeat it several times, and for various numbers of components, keeping only the best solution found. In this project, this part of the algorithm will be parallelized for multi-CPU machines or clusters of workstations to achieve greater efficiency.

Eligibility: Masters student at Indian Institute of Technology, Kanpur

Major: Statistics

Mentor: Prof. Dr. Friedrich Leisch is ready to guide me through the project |
|