Updated Aug 17, 2009 by David.Jurgens
Labels: Featured
FrequentlyAskedQuestions  
Frequently Asked Questions about the S-Space Package

Code

How do I get the code?

We are currently working toward an initial 1.0 feature set, after which we will offer an official versioned source archive in the downloads section. Until then, an svn checkout is the only way to get the source.

If you are looking for a pre-built binary, we currently offer a few command-line programs for running specific algorithms in the downloads section.

How stable is the trunk?

We are still actively developing the code base. The interfaces are generally safe to program against: we test and review them internally, and they are designed to be stable at commit time. The utility classes, however, are much more mutable and may change without notice based on our future plans or current needs.

We also have a general rule that the trunk will always compile, even though it may contain bugs. When in doubt, check the commit logs to see whether some code was committed in a partial state. (We rarely do this, but it can be practical when several of us are working together on a specific piece of code.)

You changed some feature in the trunk that I was using! Can I get it back?

Working from the trunk is often exciting, but dangerous. That said, if we have removed something you needed, let us know. Include a use case and we will actively look into either restoring the code or providing alternate functionality that meets your needs.

I found a bug!

Please file it in our issue tracker. Any information you can provide will help us fix it more quickly. We aim for a quick turnaround time on publicly reported bugs.

The code doesn't compile!

Check that you're using Java 6. (This is especially important on OS X, where the default Java version is currently Java 5.) If you're building from the SVN trunk with Java 6 and it still doesn't compile, let us know. We maintain that the trunk should always compile, so if it doesn't for some reason, we will fix it right away.
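To confirm which Java version is actually being picked up, a small check like the one below can help. This is only a minimal sketch: the class name JavaVersionCheck and the helper isAtLeastJava6 are hypothetical names chosen for illustration and are not part of the S-Space package.

```java
class JavaVersionCheck {

    // Returns true if a "java.specification.version" string denotes Java 6
    // or later. Older JVMs report values such as "1.5" or "1.6"; newer JVMs
    // report a single number such as "11".
    static boolean isAtLeastJava6(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        if (major == 1 && parts.length > 1) {
            // Legacy "1.x" scheme: the second component is the real version.
            major = Integer.parseInt(parts[1]);
        }
        return major >= 6;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.specification.version");
        System.out.println("Java specification version: " + version);
        if (!isAtLeastJava6(version)) {
            System.out.println("This JVM is older than Java 6; the code may not compile.");
        }
    }
}
```

Running java -version from the command line gives the same information without compiling anything.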

Can I get feature X?

Probably; the limiting factor is how soon. We often have unfinished code in our private repository that may be what you're looking for. Due to changes in focus, we sometimes never get around to properly testing and commenting this code, so it sits there without being checked in.

If we haven't implemented it yet and the idea makes sense, we'll move it to the top of our to-do list.

As always, the easiest way to get something implemented is to email us at s-space-research-dev@googlegroups.com and let us know what you want to see.

Why do you require external SVD programs when you have an SVD.java class?

Our current code base doesn't have an actual SVD implementation. While other all-Java SVD implementations exist (e.g. COLT and JAMA), they do not support a sparse SVD. In almost all use cases, this results in a prohibitively large memory requirement. SVDLIBC, Octave and Matlab all have sparse SVD implementations that are fast and reasonably efficient. Our SVD.java class is merely a wrapper that calls the appropriate external SVD engine to do the heavy lifting.

However, if there were a fast, memory-efficient, all-Java sparse SVD, we would use it. If you know of one, please let us know. We have searched extensively for one but were unable to find any. We have debated implementing one ourselves, but this would detract from the current focus of the package, so any tentative plans are on hold.

Running Algorithms

How do I run one of the algorithms?

There are several possibilities. One way is to use one of the pre-packaged jar files from the downloads page. These can be run with java -jar <jar-name>, and will print algorithm-specific instructions on how to use them.

The second way uses our Main classes, which are command-line executable programs for running the different algorithms. All of the fully implemented algorithms should have an associated class in the edu.ucla.sspace.mains package. The wiki page for a specific algorithm will also have further details.

Is it possible to run them through a graphical interface?

This is not currently supported. Our resources are focused on getting the various algorithms working, so we have not had time to build a nice-looking graphical front end for them. However, if you want to build one yourself, you can simply instantiate one of the algorithm classes. For example:

import edu.ucla.sspace.lsa.LatentSemanticAnalysis;

public class MyGUI {
  
    // ....

    public void myFunc() {
        LatentSemanticAnalysis lsa = new LatentSemanticAnalysis();
        lsa.processDocument(...);
    }
}

The algorithms are fully self-contained, and can easily be used as libraries.

I keep getting OutOfMemory errors!

This could be due to several issues.

First, consider manually setting the maximum memory for the JVM with -Xmx<size>. See the JVM documentation for further details.
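To verify that the -Xmx setting actually took effect, you can ask the JVM how much memory it is willing to use. The following is a minimal sketch; the class name HeapCheck is hypothetical and not part of the S-Space package.

```java
class HeapCheck {

    // Converts a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most memory the JVM will attempt to use,
        // which reflects the -Xmx value when one is given.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max memory available to this JVM: " + toMiB(maxBytes) + " MiB");
    }
}
```

For example, running java -Xmx2g HeapCheck (2g is only an illustrative size) should report a value close to 2048 MiB; the exact number can be slightly lower depending on the JVM.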

Second, note that many of the algorithms scale based on the number of terms in the corpus. If no pre-processing is done to the corpus, it may contain seemingly duplicate tokens, such as the same word with different capitalization or with punctuation attached. The best way to assess whether this is the root issue is to count how many unique token types are in the input corpus.
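Counting the unique token types can be done with a few lines of Java. The sketch below assumes tokens are separated by whitespace and that the corpus is supplied on standard input; the class name TokenTypeCounter is hypothetical and not part of the S-Space package.

```java
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

class TokenTypeCounter {

    // Counts distinct whitespace-delimited tokens. No case folding or
    // punctuation stripping is done, so "Cat", "cat" and "cat." count as
    // three different types, just as they would in an unprocessed corpus.
    static int countTypes(Scanner input) {
        Set<String> types = new HashSet<String>();
        while (input.hasNext()) {
            types.add(input.next());
        }
        return types.size();
    }

    public static void main(String[] args) {
        // Reads the corpus from standard input.
        System.out.println("Unique token types: " + countTypes(new Scanner(System.in)));
    }
}
```

It could be run as java TokenTypeCounter < corpus.txt, where corpus.txt stands in for your own file. If the count is far larger than the vocabulary you expected, pre-processing the corpus is likely to help.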

If you think the algorithm should scale to the number of unique words in your corpus but it is still throwing errors, please let us know.

I keep getting an Exception in thread "main" java.lang.UnsupportedOperationException: No SVD algorithms are available

It looks like our SVD.java code can't find the backing SVD program that actually does the computation.

For SVDLIBC, Matlab and Octave, check that you can invoke the program from the command line. For example, if you have Matlab installed, you should be able to type matlab at the command line and have the Matlab program start to run.
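The same check can be done from Java by looking through the directories listed in the PATH environment variable. This is only a rough sketch and not how SVD.java itself locates programs: the class name ExecutableFinder is hypothetical, and the executable names svd, matlab and octave are assumptions that may differ on your installation (on Windows, for instance, the files would carry an .exe extension, which this sketch does not handle).

```java
import java.io.File;

class ExecutableFinder {

    // Returns the first executable file called `name` found in the
    // directories of `pathValue` (formatted like the PATH environment
    // variable), or null if there is none.
    static File findOnPath(String name, String pathValue) {
        if (pathValue == null) {
            return null;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.length() == 0) {
                continue;
            }
            File candidate = new File(dir, name);
            if (candidate.isFile() && candidate.canExecute()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed executable names; adjust them to match your installation.
        String[] engines = { "svd", "matlab", "octave" };
        String path = System.getenv("PATH");
        for (String engine : engines) {
            File found = findOnPath(engine, path);
            System.out.println(engine + ": "
                    + (found == null ? "not found on PATH" : found.getPath()));
        }
    }
}
```

If every engine is reported as not found, the JVM is seeing a PATH that does not include the SVD program, even if your interactive shell does.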

For JAMA and COLT, check that the .jar files are specified in the path. If you are running an algorithm from a .jar, ensure that you specify the -Djama.path or -Dcolt.path properties as necessary.

If you can't get any of these to work, let us know so we can help strategize. The SVD is a particular pain point for us as well, so we want to make sure "it just works."

Corpus Questions

I want to filter out certain words in my corpus (e.g. stop words)

This is possible in any of the algorithms. See the documentation for specifics. The current code supports both excluding words, which is useful for stop lists, and strictly inclusive filtering, which keeps only a recognized set of words.
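The difference between the two filtering modes can be illustrated with a short sketch. Note that this is not the S-Space filtering API (see the documentation for that); the class TokenFilterSketch and its filter method are hypothetical and only show the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TokenFilterSketch {

    // Filters tokens against a word list in one of two modes:
    //   exclude == true  -> drop tokens that appear in `words` (a stop list)
    //   exclude == false -> keep only tokens that appear in `words`
    static List<String> filter(List<String> tokens, Set<String> words, boolean exclude) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            boolean listed = words.contains(token);
            if (listed != exclude) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "sat", "on", "the", "mat");

        // Exclusive mode: remove stop words.
        Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "on"));
        System.out.println(filter(tokens, stopWords, true));   // prints [cat, sat, mat]

        // Inclusive mode: keep only a recognized set of words.
        Set<String> allowed = new HashSet<String>(Arrays.asList("cat", "mat"));
        System.out.println(filter(tokens, allowed, false));    // prints [cat, mat]
    }
}
```

An exclusive list is convenient when only a handful of very frequent words need to go, while an inclusive list bounds the vocabulary size up front.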

How big should my corpus be?

That largely depends on the algorithm. If your corpus is too small, the words may not have sufficient co-occurrence statistics to form a semantic vector that is actually representative. Furthermore, some algorithms such as LRA require significantly more documents to produce good semantics. Also, note that even with a large corpus, some words may not occur frequently enough to generate accurate vectors.

Having too large a corpus can also be a problem. LSA is particularly sensitive to the number of documents in the corpus, whereas word co-occurrence algorithms such as HAL and Random Indexing still depend on the number of words. Some algorithms provide additional options to calculate semantics for only a specific number of words, which saves a large amount of space.

Other Questions

Who do I contact with questions about the project?

If it is a general question, contact s-space-research-dev@googlegroups.com. If you need a private question answered, please email David Jurgens or Keith Stevens.

Why does the project activity tend to drop off in the summer?

There are two reasons. First, we often work on much larger-scale projects during the summer. We intend to port these to the S-Space package once they are finished, but we don't want to leave intermediate files in the repository until they're verified as working.

Second, during the summer, many of the graduate students and undergraduates working on the project leave on vacation or for work. This limits how much work can be done. However, work is still being done, even if it isn't committed.

Why Airhead Research?

Airhead Research stands for AI-Head Research. This is a throwback to an older name for the lab, which we have taken as our own. We're still working on a nice-sounding acronym for "head" to fill out the title.

