| Title | RelEx Web Crawler and HypergraphDB Manager |
|---|---|
| Student | Rich Jones |
| Mentor | David Hart |
| Abstract | One of the most important abilities of future artificial intelligences will be using the internet efficiently as a source of knowledge in natural language conversation. This project aims to create a program that crawls the web and automatically builds a database of RelEx output, which can then serve as a knowledge base for an intelligent agent to analyze and draw on.<br><br>To do so, I would like to modify either the Grub Distributed Web Crawler (as suggested on the ideas page) or CMU's WebSPHINX to crawl a website and automatically extract and analyze its usable content. This data would then be stored as a file in RelXML or OpenCogXML format for each URL crawled, which could then be reprocessed, analyzed, or queried on demand. After receiving feedback on my initial proposal and researching hypergraphs and their implementation in HyperGraphDB, it seems logical that the RelEx output of each page should be added automatically to a single HGDB HyperGraph. Each XML element becomes a new Atom unless such an Atom already exists, in which case only the appropriate connecting HGLink Atoms are created. Once this HyperGraph is built, HyperGraphDB's built-in HGQuery operations can be used to find subsets of it, making it easy to extract relationships that span all of the crawled web pages.<br><br>The program is not particularly demanding technically, so it should be in a stable and usable form by the end of the project. I hope it will be a useful tool for researchers refining their rankings and framing rules, as well as for developers using the OpenCog platform to create intelligent agents that access online knowledge bases. |

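
The merge step described in the abstract (reuse an Atom if it already exists, otherwise create it, then add only the connecting link) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `TinyHypergraph` is an in-memory stand-in and not the HyperGraphDB API, the `<rel name= head= dep=>` XML layout is a hypothetical simplification and not the actual RelXML or OpenCogXML schema, and the two sample pages are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for per-page RelEx output.
# The real RelXML / OpenCogXML schemas may look quite different.
SAMPLE_PAGES = {
    "http://example.org/a": (
        "<relations>"
        "<rel name='_subj' head='eat' dep='cat'/>"
        "<rel name='_obj' head='eat' dep='fish'/>"
        "</relations>"
    ),
    "http://example.org/b": (
        "<relations>"
        "<rel name='_subj' head='eat' dep='dog'/>"
        "<rel name='_obj' head='eat' dep='fish'/>"
        "</relations>"
    ),
}


class TinyHypergraph:
    """In-memory stand-in for a hypergraph store (not the HyperGraphDB API)."""

    def __init__(self):
        self.handle_of = {}  # atom value -> integer handle
        self.value_of = []   # integer handle -> atom value
        self.links = {}      # (label, target handles) -> set of source URLs

    def add_atom(self, value):
        # Reuse the existing atom if this value was seen on an earlier page.
        if value not in self.handle_of:
            self.handle_of[value] = len(self.value_of)
            self.value_of.append(value)
        return self.handle_of[value]

    def add_link(self, label, targets, source_url):
        # A link is identified by its label and ordered targets; seeing it
        # again only records one more source URL.
        key = (label, tuple(targets))
        self.links.setdefault(key, set()).add(source_url)

    def links_with(self, value):
        # Return every link that has the given atom among its targets.
        handle = self.handle_of.get(value)
        found = []
        for (label, targets), sources in self.links.items():
            if handle in targets:
                words = tuple(self.value_of[t] for t in targets)
                found.append((label, words, sorted(sources)))
        return sorted(found)


def ingest(graph, url, xml_text):
    """Merge one page's relations into the shared graph."""
    root = ET.fromstring(xml_text)
    for rel in root.iter("rel"):
        head = graph.add_atom(rel.get("head"))
        dep = graph.add_atom(rel.get("dep"))
        graph.add_link(rel.get("name"), [head, dep], url)


graph = TinyHypergraph()
for url, xml_text in SAMPLE_PAGES.items():
    ingest(graph, url, xml_text)

print(graph.links_with("fish"))
```

With the two sample pages, the graph ends up with four atoms (`eat`, `cat`, `fish`, `dog`) and three links, and the `_obj` link between `eat` and `fish` records both URLs as sources. In the actual project, deduplication and querying would be delegated to HyperGraphDB and its query facilities instead of Python dictionaries.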