Updated Jun 26, 2009 by David.Jurgens
HyperspaceAnalogueToLanguage  
A description of the Hyperspace Analogue to Language implementation of the S-Space package.

Introduction

Hyperspace Analogue to Language (HAL) creates a semantic space from word co-occurrences. A word-by-word matrix is formed in which each element is the strength of association between the word represented by the row and the word represented by the column. The user then has the option to drop low-entropy columns from the matrix.

As the text is analyzed, a focus word is placed at the beginning of a ten-word window that determines which neighboring words are counted as co-occurring. Matrix values are accumulated by weighting each co-occurrence inversely to its distance from the focus word; closer neighbors are thought to reflect more of the focus word's semantics and so are weighted higher. HAL also records word-ordering information by treating a co-occurrence differently depending on whether the neighboring word appeared before or after the focus word.
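
The windowed accumulation described above can be sketched roughly as follows. This is an illustration only, not the code in HAL.java: the class and method names are hypothetical, and the linear weight (window size minus distance plus one) is an assumed weighting function that merely satisfies "closer words count more". Each word's vector is the concatenation of its row (words that followed it) and its column (words that preceded it), which gives the 2*N length for an N x N matrix.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and the weighting function are assumptions,
// not taken from the S-Space HAL.java source.
class HalCooccurrenceSketch {

    /** Assigns each distinct token an index in order of first appearance. */
    static Map<String, Integer> buildIndex(List<String> tokens) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String token : tokens) {
            index.putIfAbsent(token, index.size());
        }
        return index;
    }

    /**
     * Builds an N x N matrix where m[r][c] accumulates how strongly word c
     * followed word r within the window. Assumed weight: windowSize - d + 1.
     */
    static double[][] cooccurrences(List<String> tokens,
                                    Map<String, Integer> index,
                                    int windowSize) {
        int n = index.size();
        double[][] m = new double[n][n];
        for (int i = 0; i < tokens.size(); i++) {
            int focus = index.get(tokens.get(i));
            // The focus word sits at the start of the window, so only look ahead.
            for (int d = 1; d <= windowSize && i + d < tokens.size(); d++) {
                int neighbor = index.get(tokens.get(i + d));
                m[focus][neighbor] += windowSize - d + 1;
            }
        }
        return m;
    }

    /** Concatenates a word's row (followers) and column (predecessors). */
    static double[] semanticVector(double[][] m, int word) {
        int n = m.length;
        double[] v = new double[2 * n];
        for (int j = 0; j < n; j++) {
            v[j] = m[word][j];      // words seen after the focus word
            v[n + j] = m[j][word];  // words seen before the focus word
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "dog", "chased", "the", "cat");
        Map<String, Integer> index = buildIndex(tokens);
        double[][] m = cooccurrences(tokens, index, 2);
        System.out.println(Arrays.toString(semanticVector(m, index.get("the"))));
    }
}
```

Keeping the "before" and "after" counts in separate halves of the vector is what preserves the word-ordering information; summing the two halves would discard it.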

Typically, all of the co-occurrence information is used to build the semantic vectors (for an N x N matrix, these are 2*N in length). However, HAL also offers two possibilities for dimensionality reduction. Not all columns provide an equal amount of information for distinguishing the meanings of words. Specifically, the information-theoretic entropy of each column can be calculated as a way of ordering the columns by their importance. Using this ranking, either a fixed number of columns may be retained, or a threshold may be set to filter out low-entropy columns.
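
As a rough illustration of the two reduction options, the sketch below ranks columns by Shannon entropy and then either keeps a fixed number of them or keeps those above a threshold. It assumes the entropy of a column is computed by normalizing its values into a probability distribution; the S-Space implementation may define it differently, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the entropy definition and all names are assumptions.
class ColumnEntropySketch {

    /** Shannon entropy (natural log) of one column, treated as a distribution. */
    static double columnEntropy(double[][] m, int col) {
        double sum = 0;
        for (double[] row : m) {
            sum += row[col];
        }
        if (sum == 0) {
            return 0; // an empty column carries no information
        }
        double h = 0;
        for (double[] row : m) {
            if (row[col] > 0) {
                double p = row[col] / sum;
                h -= p * Math.log(p);
            }
        }
        return h;
    }

    /** Option 1: retain the k highest-entropy columns, best first. */
    static List<Integer> topColumns(double[][] m, int k) {
        int cols = m[0].length;
        double[] entropy = new double[cols];
        List<Integer> order = new ArrayList<>();
        for (int c = 0; c < cols; c++) {
            entropy[c] = columnEntropy(m, c);
            order.add(c);
        }
        // Sort by descending entropy; the sort is stable, so ties keep column order.
        order.sort(Comparator.comparingDouble((Integer c) -> -entropy[c]));
        return new ArrayList<>(order.subList(0, Math.min(k, cols)));
    }

    /** Option 2: retain every column whose entropy meets the threshold. */
    static List<Integer> columnsAbove(double[][] m, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int c = 0; c < m[0].length; c++) {
            if (columnEntropy(m, c) >= threshold) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[][] m = {
            {4, 1, 0},
            {4, 1, 0},
            {4, 6, 5},
        };
        System.out.println("top 2 columns: " + topColumns(m, 2));
        System.out.println("entropy >= 1.0: " + columnsAbove(m, 1.0));
    }
}
```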

For more information on HAL, the following paper is the source of this algorithm:

See here for additional papers that use HAL.

HAL Implementation

All of HAL is contained in one file, HAL.java.

"Documents" are given to the algorithm, which allows the user to segment the corpus at paragraph or sentence boundaries. This has the effect of removing co-occurrence relationships between words on either side of a boundary. Segmentation is not required and is at the user's discretion.
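
The effect of segmentation can be seen in a small sketch that accumulates weighted pairs one document at a time, so the window never reaches past the end of a document. The class name, the pair-key format, and the linear weight are hypothetical choices made for illustration, not details of HAL.java.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names, key format, and weighting are assumptions.
class DocumentBoundarySketch {

    /** Accumulates "focus -> neighbor" weights, never crossing a document boundary. */
    static Map<String, Double> pairWeights(List<List<String>> documents, int windowSize) {
        Map<String, Double> weights = new HashMap<>();
        for (List<String> doc : documents) {
            for (int i = 0; i < doc.size(); i++) {
                // The window is clipped at the end of the current document.
                for (int d = 1; d <= windowSize && i + d < doc.size(); d++) {
                    String key = doc.get(i) + " -> " + doc.get(i + d);
                    weights.merge(key, (double) (windowSize - d + 1), Double::sum);
                }
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> whole = Arrays.asList(
            Arrays.asList("rain", "fell", "birds", "sang"));
        List<List<String>> split = Arrays.asList(
            Arrays.asList("rain", "fell"),
            Arrays.asList("birds", "sang"));
        // The unsegmented corpus links "fell" and "birds"; the segmented one does not.
        System.out.println(pairWeights(whole, 2).containsKey("fell -> birds"));
        System.out.println(pairWeights(split, 2).containsKey("fell -> birds"));
    }
}
```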

Software Requirements

HAL requires Java 6.

Running HAL

HAL can be invoked either by class name, java edu.ucla.sspace.mains.HALMain, or through the jar release, java -jar hal.jar. The two are equivalent.

We provide the following options for changing the behavior of HAL and how the program is run.

The program will produce a .sspace file containing the semantic space information from the input corpus. See FileFormats for exact details on the output formatting.

