| Title | Learning a Context Free Grammar by reading Corpus in a given language |
|---|---|
| Student | Siddharth Angrish |
| Mentor | Simon Lin |
| Abstract | |

Abstract:
The goal is to develop a program, based on an Expectation Maximization approach, that learns a probabilistic CFG from a corpus. The main algorithm will operate on the POS tags of the words and will therefore be language independent. The algorithm will require a few seed rules to start, and will then try to describe each sentence it reads as a combination of those seed rules. The new rules generated will be assigned probabilities based on the frequency of occurrence of similar rules. The seed rules rest on the hypothesis that a sentence consists of different entities that are related to each other by relations drawn from a defined set. These posited entities and relations are defined below. The seed rules come from a special grammar formalism being developed for this purpose. The program will require a dictionary that provides the POS tags of the words encountered. If the language is agglutinative, a morphological analyzer is expected to break each encountered word into its constituents and obtain their POS tags. The grammar formalism and the rationale behind it are explained below, and the particular seed rules for English are also given.
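
To make the idea of assigning probabilities from rule frequencies concrete, here is a minimal sketch in Python. It is not the proposed program: it shows only a plain relative-frequency estimate, P(lhs → rhs) = count(lhs → rhs) / count(lhs), which is the kind of re-estimation an EM-style procedure might perform once rule occurrences have been counted. The function name `estimate_rule_probabilities`, the rule representation, and the POS tags used (`DT`, `NN`, `VB`) are assumptions made for illustration and do not come from the proposal.

```python
from collections import Counter, defaultdict


def estimate_rule_probabilities(observed_rules):
    """Relative-frequency estimate for grammar rules.

    observed_rules: iterable of (lhs, rhs) pairs, where rhs is a tuple of
    symbols (POS tags or non-terminals). Returns a dict mapping each
    distinct rule to count(rule) / count(all rules sharing its lhs).
    """
    rule_counts = Counter(observed_rules)

    # Total number of observed expansions for each left-hand side.
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), count in rule_counts.items():
        lhs_totals[lhs] += count

    return {
        (lhs, rhs): count / lhs_totals[lhs]
        for (lhs, rhs), count in rule_counts.items()
    }


# Hypothetical rule occurrences over POS tags (illustrative data only).
observed = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("NN",)),
    ("VP", ("VB", "NP")),
]

probs = estimate_rule_probabilities(observed)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.3f}]")
```

In a full EM setting the integer counts would be replaced by expected counts computed over all candidate parses of each sentence, but the normalisation step would keep this shape: the probabilities of all rules sharing a left-hand side sum to 1.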