Updated May 15, 2009 by David.Jurgens
Labels: Featured
PackageStructure  
Introduction to the code base, the basic functionality, and how it is packaged

Introduction

The S-Space package attempts to provide a clean and reusable interface for using and building various semantic memory models. To facilitate this, we have developed a number of interfaces and classes that do the most common work for you. This document outlines which interfaces and classes are already provided, gives some guidance on how to use them, and describes the overall structure of the package.

Organization

As with most Java projects, all source files are in the src directory, under the edu/ucla/sspace subfolders. Test cases are also provided for many classes and are located under the test directory, in the same subfolder layout.

Each Semantic Space model is given its own directory under sspace, so that each model can be relatively self-contained while remaining free to use other public package classes. New models should follow this pattern, using a short abbreviation of the model's name for the directory.

Since many Semantic Space models have a good deal in common, the shared functionality has been collected in the common package. These will be by far the most reused classes in the package for those implementing their own semantic space model.

More on Common

Common provides several interfaces that make designing models easier, along with some ready-made implementations that fit particular use cases.

The utilities provided can broadly be split into the following categories:

SemanticSpace classes

The SemanticSpace interface defines the basic functionality which all Semantic Space models should implement for uniformity of use. Various utilities are then provided for any implemented SemanticSpace, such as storing the sspace as a binary or text file, and retrieving the sspace from a binary or text file for evaluation purposes.
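
As a rough illustration of the save and load round trip described above, the following sketch writes a word-to-vector mapping as text and reads it back. The class name, the method names, and the one-line-per-word `word|v1 v2 ...` layout are assumptions made for this example; they are not the package's actual API or file format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the S-Space package.
class TextSpaceSketch {

    // Writes one line per word in the assumed format: word|v1 v2 v3 ...
    static void write(Map<String, double[]> space, Writer out) {
        PrintWriter pw = new PrintWriter(out);
        for (Map.Entry<String, double[]> e : space.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey()).append('|');
            double[] v = e.getValue();
            for (int i = 0; i < v.length; i++) {
                if (i > 0) sb.append(' ');
                sb.append(v[i]);
            }
            pw.println(sb);
        }
        pw.flush();
    }

    // Reads the same format back into a word-to-vector map.
    static Map<String, double[]> read(Reader in) {
        Map<String, double[]> space = new LinkedHashMap<>();
        try {
            BufferedReader br = new BufferedReader(in);
            for (String line; (line = br.readLine()) != null; ) {
                if (line.isEmpty()) continue;
                int bar = line.lastIndexOf('|');
                if (bar < 0) throw new IllegalArgumentException("malformed line: " + line);
                String rest = line.substring(bar + 1);
                String[] parts = rest.isEmpty() ? new String[0] : rest.split(" ");
                double[] v = new double[parts.length];
                for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
                space.put(line.substring(0, bar), v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return space;
    }

    public static void main(String[] args) {
        Map<String, double[]> space = new LinkedHashMap<>();
        space.put("cat", new double[] {1.0, 0.5});
        space.put("dog", new double[] {0.25, 2.0});
        StringWriter sw = new StringWriter();
        write(space, sw);
        Map<String, double[]> loaded = read(new StringReader(sw.toString()));
        System.out.println(Arrays.toString(loaded.get("dog")));
    }
}
```

A text layout like this is easy to inspect by hand, while a binary layout would typically be smaller and faster to load for evaluation runs.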

Matrix Classes

The Matrix interface defines a consistent set of access and retrieval methods for all Matrix implementations, thereby allowing applications to easily switch between different Matrix implementations as needed. Three implementations are currently provided.

  • SparseMatrix, using the Yale Sparse Matrix Format. This matrix is ideal for sparse matrices that can fit in memory.
  • ArrayMatrix, a dense, in memory matrix for reasonably small matrices.
  • OnDiskMatrix, which stores all values on disk, suitable for extremely large matrices.
Each of these has an intended use case, depending on the amount of data produced, its sparseness, and the desired run-time of operations. In addition to the Matrix classes, the code base supports several common operations on matrices through the Matrices class. This class provides methods for:
  • transposing a matrix
  • getting a synchronized instance of any other Matrix instance
  • automatically selecting and instantiating the right Matrix class to use based on user-provided parameters and available system resources
  • normalizing a matrix by row sum, correlation, column sum, or row length
Since the matrices are intended for very large datasets, multiplication, division, and other common linear algebra abilities are not yet provided, but are likely to be added at a later date.
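
The idea of coding against one matrix interface and swapping implementations can be sketched as follows. Everything here is hypothetical and written for illustration: the `SimpleMatrix` interface, the `Dense` and `Sparse` classes, and `transposedView` are not the package's real classes or signatures, and the sparse class uses a plain hash map rather than the Yale format mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real Matrix interface and classes may differ.
class MatrixSketch {

    interface SimpleMatrix {
        double get(int row, int col);
        void set(int row, int col, double value);
        int rows();
        int columns();
    }

    // Dense storage: every cell is held in memory.
    static final class Dense implements SimpleMatrix {
        private final int rows, cols;
        private final double[] data;
        Dense(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
            this.data = new double[rows * cols];
        }
        public double get(int r, int c) { return data[r * cols + c]; }
        public void set(int r, int c, double v) { data[r * cols + c] = v; }
        public int rows() { return rows; }
        public int columns() { return cols; }
    }

    // Sparse storage: only non-zero cells are kept, keyed by a flat index.
    static final class Sparse implements SimpleMatrix {
        private final int rows, cols;
        private final Map<Long, Double> cells = new HashMap<>();
        Sparse(int rows, int cols) { this.rows = rows; this.cols = cols; }
        public double get(int r, int c) {
            Double v = cells.get((long) r * cols + c);
            return v == null ? 0.0 : v;
        }
        public void set(int r, int c, double v) {
            long key = (long) r * cols + c;
            if (v == 0.0) cells.remove(key); else cells.put(key, v);
        }
        public int rows() { return rows; }
        public int columns() { return cols; }
    }

    // Transpose as a view: indices are swapped, no data is copied.
    static SimpleMatrix transposedView(final SimpleMatrix m) {
        return new SimpleMatrix() {
            public double get(int r, int c) { return m.get(c, r); }
            public void set(int r, int c, double v) { m.set(c, r, v); }
            public int rows() { return m.columns(); }
            public int columns() { return m.rows(); }
        };
    }

    public static void main(String[] args) {
        // The calling code is the same whichever implementation is chosen.
        SimpleMatrix[] candidates = { new Dense(2, 3), new Sparse(2, 3) };
        for (SimpleMatrix m : candidates) {
            m.set(0, 2, 4.0);
            SimpleMatrix t = transposedView(m);
            System.out.println(t.rows() + "x" + t.columns() + " " + t.get(2, 0));
        }
    }
}
```

Because callers only see the interface, a model can start with a dense matrix during development and move to a sparse or disk-backed one when the data grows, without changing the code that reads and writes cells.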

The S-Space package also provides support for I/O operations on Matrix instances through the MatrixIO class. This allows easy conversion between multiple on-disk matrix storage formats.

The SVD (Singular Value Decomposition) class works in conjunction with the Matrix interface by translating a matrix into a suitable file format for a variety of SVD implementations. The Singular Value Decomposition page has more details on this particular interface.

Similarity Functions

Additionally, several similarity measures are provided:
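
In this setting a similarity measure takes two vectors and returns a score. A minimal sketch using cosine similarity, chosen here only as a familiar example; the class and method names are hypothetical, and returning 0 for a zero-length vector is a convention assumed for this sketch.

```java
// Hypothetical sketch of a vector similarity measure; not the package's API.
class CosineSketch {

    // Cosine of the angle between a and b: dot(a, b) / (|a| * |b|).
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumed convention: a zero vector is treated as similar to nothing.
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1}));
        System.out.println(cosine(new double[] {1, 2}, new double[] {2, 4}));
    }
}
```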

Document Parsing Utilities

The Document interface, together with its iterators and implementing classes, provides a uniform way of interacting with corpora, so that methods calling processDocument have a consistent means of iterating through many document files. The provided Document-related classes are:

In addition to these, a WordIterator is provided, which automatically tokenizes the contents of a BufferedReader and provides an iterator over each word read.
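
The behavior described for WordIterator can be approximated with a small iterator over a BufferedReader. This is a sketch under stated assumptions, not the package's implementation: the class name is hypothetical, and splitting on whitespace is an assumed tokenization rule.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch; tokenizes on whitespace, which is an assumption.
class WordIteratorSketch implements Iterator<String> {
    private final BufferedReader reader;
    private final Deque<String> pending = new ArrayDeque<>();

    WordIteratorSketch(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public boolean hasNext() {
        // Read lines until at least one word is buffered or input ends.
        while (pending.isEmpty()) {
            String line;
            try {
                line = reader.readLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (line == null) return false;
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) pending.add(token);
            }
        }
        return true;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return pending.remove();
    }

    public static void main(String[] args) {
        BufferedReader br = new BufferedReader(new StringReader("the quick\n\n  brown fox "));
        Iterator<String> words = new WordIteratorSketch(br);
        while (words.hasNext()) {
            System.out.println(words.next());
        }
    }
}
```

Reading one line at a time keeps memory use independent of document size, which matters when a corpus contains very large files.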

Data Structures

The S-Space package provides several data structures.

  • Two Java Map implementations
    • BoundedSortedMap, which retains only the k largest elements that are stored in it.
    • TrieMap, a trie data structure for storing CharSequence keys using the Map interface.
  • a MultiMap interface for defining one-to-many mapping semantics and a hash-table based implementation, HashMultiMap.
  • a Pair class for storing two objects of a kind.
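
To make two of these structures concrete, here is a minimal sketch of a bounded sorted map and a hash-based multimap. The class names and method signatures are hypothetical, and the bounded map assumes that "largest" refers to the natural ordering of the keys; the package's own classes may behave differently.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: keeps only the `bound` largest keys seen so far.
class BoundedMapSketch<K extends Comparable<K>, V> {
    private final int bound;
    private final TreeMap<K, V> map = new TreeMap<>();

    BoundedMapSketch(int bound) {
        if (bound < 1) throw new IllegalArgumentException("bound must be positive");
        this.bound = bound;
    }

    void put(K key, V value) {
        map.put(key, value);
        // Evict the smallest key once the bound is exceeded.
        if (map.size() > bound) map.pollFirstEntry();
    }

    SortedMap<K, V> view() {
        return Collections.unmodifiableSortedMap(map);
    }
}

// Hypothetical sketch: one key maps to a set of values.
class MultiMapSketch<K, V> {
    private final Map<K, Set<V>> map = new HashMap<>();

    boolean put(K key, V value) {
        return map.computeIfAbsent(key, k -> new HashSet<>()).add(value);
    }

    Set<V> get(K key) {
        Set<V> values = map.get(key);
        return values == null ? Collections.<V>emptySet() : values;
    }
}

class DataStructureDemo {
    public static void main(String[] args) {
        BoundedMapSketch<Integer, String> top = new BoundedMapSketch<>(2);
        top.put(5, "e");
        top.put(1, "a");
        top.put(9, "i");
        top.put(3, "c");
        System.out.println(top.view().keySet());

        MultiMapSketch<String, String> senses = new MultiMapSketch<>();
        senses.put("bank", "river");
        senses.put("bank", "money");
        System.out.println(senses.get("bank").size());
    }
}
```

A bounded map of this kind is handy for keeping the top k most similar words to a query without sorting every candidate.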

Additional utilities, data structures, and interface implementations are likely to be added, or split off into a related package if they grow significant enough to warrant it.

