Title Learning a Context Free Grammar by reading Corpus in a given language
Student Siddharth Angrish
Mentor Simon Lin
Abstract

       The goal is to develop a program, based on the Expectation
        Maximization approach, that learns a probabilistic CFG from a
        corpus. The main algorithm will operate on the POS tags
        of the words and hence will be language independent.
        The algorithm will require a few seed rules
        to start, and will then try to describe each sentence it reads
        as a combination of the seed rules. The new rules generated
        will be assigned probabilities based on the frequency
        of occurrence of similar rules. The seed rules are
        based on the hypothesis that a sentence consists
        of different entities that are related to each other
        by relations drawn from a defined set. These posited
        entities and relations are defined below. The seed
        rules come from a special grammar formalism being
        developed for this purpose.
            The program will require a dictionary giving the POS
        tags of the words encountered. If the language
        is agglutinative, a morphological analyzer is expected
        to break each encountered word into its constituents and
        supply their POS tags.
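
        The frequency-based assignment of rule probabilities mentioned
        above can be illustrated with a minimal sketch. It is not the
        proposed implementation: it shows only plain relative-frequency
        estimation over observed rule counts, whereas a full Expectation
        Maximization procedure would use expected counts in their place.
        The function name and the tags used (NP, VP, DT, NN, JJ, VB) are
        assumptions chosen for illustration and are not the seed rules
        of the formalism.

```python
from collections import Counter, defaultdict


def estimate_rule_probabilities(rules):
    """Estimate P(lhs -> rhs) as count(lhs -> rhs) / count(lhs).

    `rules` is an iterable of (lhs, rhs) pairs, where rhs is a tuple of
    POS tags or non-terminals. Hypothetical helper, for illustration only.
    """
    counts = Counter(rules)

    # Total number of observed expansions for each left-hand side.
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n

    # Normalise so that the probabilities for each left-hand side sum to 1.
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Illustrative rule occurrences; the symbols are assumed, not from the proposal.
observed = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "JJ", "NN")),
    ("VP", ("VB", "NP")),
]

probs = estimate_rule_probabilities(observed)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.3f}]")
```

        Normalising per left-hand side keeps the result a valid
        probabilistic CFG, since the probabilities of all expansions of
        one symbol sum to one.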

        The grammar formalism and the rationale behind it are
        explained below. The particular seed rules for
        English are also given.