FAQ  
Frequently asked questions
Updated Jan 30, 2012 by mans.hul...@gmail.com

Foma FAQ

How do I best view/export graphics of my transducers?

If you only want to view the transducers, the following options are available:

  • Mac/OSX: install the GraphViz package (for OSX), and the view command within foma will be available.
  • UNIX/Linux: install Graphviz (e.g. the graphviz package with apt-get), and the view command will be available and will launch your default viewer for png files (using xdg-open).

If you need to export the transducers (say for LaTeX use), the OSX GraphViz viewer allows for direct pdf export of a transducer, which can be used elsewhere. For Linux/UNIX in general, you can do something like:

foma[1]: print dot > myfile.dot            [in foma]

mylinux$ dot -Tpdf myfile.dot > myfile.pdf [on command line]

Of course, if you want to change the default look of the automata, you can always edit the .dot-file foma produces before converting.

Can I use Unicode/UTF8 with foma?

Yes, UTF8 input is the only format foma accepts. If you need to compile legacy grammars in other encodings, you need to recode them first.

For example, a latin1-encoded lexc file should first be converted with, say, the Unix recode utility:

recode latin1..utf8 myoldlexcgrammar.lexc
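
If recode is not installed, the same conversion can be sketched in a few lines of Python. This is only an illustration and not part of foma: the function name recode_to_utf8 is made up here, and it assumes the source file really is Latin-1. Unlike recode, it writes to a second file instead of converting in place.

```python
def recode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Re-encode a text file as UTF-8, writing the result to dst_path."""
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

For example, recode_to_utf8("myoldlexcgrammar.lexc", "mynewlexcgrammar.lexc") would leave the old grammar as it is and write a UTF8 copy next to it (the second file name is just a placeholder).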

Foma works fine with ASCII but not with special characters (á,é,í, etc.)

  • Make sure they are UTF8 encoded.
  • Some mathematical operators are reserved symbols and need to be escaped with % or enclosed in quotes. See the list of reserved symbols.

How do I use my automata/transducers elsewhere?

You can

  • Pipe text through the flookup utility, or run a bidirectional pipe between it and some other application (C/C++/Perl/PHP/Ruby/Python/...)
  • Use the foma C API to load/apply transducers (C/C++/Obj C/C#)
  • Javascript/mobile devices: there is a simple js runtime available
  • Roll your own automaton applyer after exporting the foma transducers in a text format: the prolog format (write prolog) or the AT&T format (write att).

Using flookup on its own

You can pipe words or text through flookup. Each line is one "word" as far as flookup is concerned, but they could consist of sentences or arbitrary text.

echo "word" | flookup -i -x mytransducer.foma

Flookup with a bidirectional pipe

The flookup utility flushes its output with every set of outputs it gives for a single input (after 0.9.15, the -b flag is required for this behavior), giving the opportunity to use a bidirectional pipe to link against flookup.

Here's an example in perl that illustrates the usage in a program that opens "mytransducer.foma" and does what "apply down" would do inside foma:

#!/usr/bin/perl

use FileHandle;
use IPC::Open2;
use locale;

my $pidFOMA = open2(*Reader, *Writer, "flookup -i -x -b ./mytransducer.foma");

print Writer "word\n";

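# Read results up to the blank line that ends the output for one input.
# The defined() check stops the loop if flookup closes the pipe early.
my $returnword;
while (defined($returnword = <Reader>) && $returnword ne "\n") {
    chomp($returnword);
    print "$returnword\n";
}
close(Writer); close(Reader);
waitpid($pidFOMA, 0);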

[Note: versions later than 0.9.15 require the -b flag to flush the output between every word].
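
The same pattern can be sketched in Python with the standard subprocess module. This is a rough equivalent of the perl example, not an official binding: the class name FlookupPipe is made up for illustration, and it assumes, as the perl code does, that the answers for each input word are followed by a blank line.

```python
import subprocess


class FlookupPipe:
    """Keep one lookup process alive and query it one word at a time."""

    def __init__(self, command):
        # command is an argument list, for example
        # ["flookup", "-i", "-x", "-b", "./mytransducer.foma"]
        self.proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            encoding="utf-8",
            bufsize=1,  # line-buffered on our side of the pipe
        )

    def lookup(self, word):
        """Send one word and collect the output lines up to the blank line."""
        self.proc.stdin.write(word + "\n")
        self.proc.stdin.flush()
        results = []
        while True:
            line = self.proc.stdout.readline()
            if line == "" or line == "\n":  # EOF, or end of this answer set
                break
            results.append(line.rstrip("\n"))
        return results

    def close(self):
        """Close both ends of the pipe and wait for the process to exit."""
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()
```

A caller would create the object once with the flookup command line shown in the comment, call lookup("word") as often as needed, and call close() at the end. Passing the command as an argument list avoids going through a shell.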

Through the foma API

The following code example loads a transducer, applies it to a word, and prints all the possible outputs:

#include <stdio.h>
#include <stdlib.h>
#include "fomalib.h"

int main() {

  struct fsm *net;
  struct apply_handle *ah;
  char *result;

  net = fsm_read_binary_file("mytransducer.foma");
  if (net == NULL) {
    perror("Error loading file");
    exit(EXIT_FAILURE);
  }
  ah = apply_init(net);
  result = apply_down(ah, "word");
  while (result != NULL) {
    printf("%s\n", result);
    result = apply_down(ah, NULL);
  }
  apply_clear(ah);
  fsm_destroy(net);
}

Compile with something like:

gcc -o fomaread fomaread.c -lfoma

or, to include the foma library statically

gcc -o fomaread fomaread.c /usr/local/lib/libfoma.a -lz

The Javascript runtime

A simple Javascript runtime is found in the contrib directory. It provides a function foma_apply_down() that accepts as input arguments an automaton object (which can be generated by a separate script), and an input word. All the possible outputs are returned in an array. In order to use the runtime you need to:

  • First convert the desired transducer/automaton in a foma binary file to a javascript object source using foma2js.perl. For example:
./foma2js.perl -n myNet mynet.foma > mynet.js

Example:

<script type="text/javascript" src="mynet.js"></script>
<script type="text/javascript" src="foma_apply_down.js"></script>
...
<script type="text/javascript">
...
// foma_apply_down returns an array with all the outputs
// for the input inputString
var returnArray = foma_apply_down(myNet, inputString);
...
</script>

Some caveats: the js runtime is a recursive implementation of the application code and, unlike the C library code, does not check for input-side epsilon-loops, which can cause it to recurse infinitely. Also, flag diacritics are not supported. Performance is not nearly as good as with the C library, but perhaps acceptable for many purposes (reaching 10,000 to 100,000 output words per second for average transducers on Safari/Mozilla/Chrome).

Rolling your own

For using a custom applyer, the prolog format is probably the most convenient export format to use in foma. This is because it declares alphabet symbols that don't appear on transitions. Consider:

foma[0]: regex \a b;
339 bytes. 3 states, 3 arcs, 2 paths.

foma[1]: write prolog
network(41A7).
symbol(41A7, "a").
arc(41A7, 0, 1, "?").
arc(41A7, 0, 1, "b").
arc(41A7, 1, 2, "b").
final(41A7, 2).

Here, the symbol a, which never appears on a transition, is declared with symbol(), so a matcher can know that, for instance, the ?-transition from 0 to 1 should not match a. However, if you're always using closed alphabets (transducers that don't carry the @ or ? symbols), this is of no concern.
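
As a rough illustration of what such a matcher could look like, here is a minimal Python sketch that reads the prolog output shown above and tests whether a word is accepted. It is not part of foma and rests on several assumptions: state 0 is the start state, every input character is one symbol, and the network is a plain automaton (no input:output pairs, no epsilon arcs, no flag diacritics). The function names are invented for the example.

```python
import re

# Patterns for the three line types used in the example above.
ARC = re.compile(r'arc\(\w+,\s*(\d+),\s*(\d+),\s*"(.*)"\)\.')
FINAL = re.compile(r'final\(\w+,\s*(\d+)\)\.')
SYMBOL = re.compile(r'symbol\(\w+,\s*"(.*)"\)\.')


def parse_prolog(text):
    """Return (arcs, finals, alphabet) read from 'write prolog' output."""
    arcs, finals, alphabet = {}, set(), set()
    for line in text.splitlines():
        line = line.strip()
        if m := ARC.fullmatch(line):
            src, dst, label = int(m[1]), int(m[2]), m[3]
            arcs.setdefault(src, []).append((label, dst))
            if label != "?":
                alphabet.add(label)
        elif m := FINAL.fullmatch(line):
            finals.add(int(m[1]))
        elif m := SYMBOL.fullmatch(line):
            # Symbols declared here never appear on an arc.
            alphabet.add(m[1])
    return arcs, finals, alphabet


def accepts(net, word, start=0):
    """Check whether the automaton accepts word (one character per symbol)."""
    arcs, finals, alphabet = net
    states = {start}
    for ch in word:
        nxt = set()
        for s in states:
            for label, dst in arcs.get(s, []):
                # "?" only matches symbols outside the known alphabet.
                if label == ch or (label == "?" and ch not in alphabet):
                    nxt.add(dst)
        states = nxt
    return bool(states & finals)
```

With the network above, accepts() returns True for "bb" and "cb" but False for "ab", because a is in the declared alphabet and so the ?-arc does not match it.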

My final transducer is HUGE! Isn't there a way to avoid composing all the components?

Maybe. As long as the transducer you want to parse with is not a morphological analyzer that generates too much intermediate ambiguity, you can use the flookup utility to virtualize part of the composition. For example, if you have all the component transducers in order on the stack, you can save them all to one file, and flookup will pass the output of each one as input to the next:

foma[0]: regex a -> b;
366 bytes. 1 states, 3 arcs, Cyclic.
foma[1]: regex b -> c;
366 bytes. 1 states, 3 arcs, Cyclic.
foma[2]: regex c -> d;
366 bytes. 1 states, 3 arcs, Cyclic.
foma[3]: save stack testchain.foma
Writing to file testchain.foma.

...
echo "a" | flookup -i -x testchain.foma 
d

Doing this is often a good idea if you have a tagger or chunker that consists of multiple phases where each tagger transducer addresses a specific component (say tags dates, names, places) and produces no ambiguous outputs. In this case, composing them all provides little gain in application speed, and much loss in space efficiency, so the above technique is advisable.
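
Conceptually, this chaining amounts to feeding every output of one stage into the next. The Python sketch below illustrates only that idea: plain string functions stand in for the three transducers, and the names cascade and rewrite are made up for the example. It says nothing about how flookup implements it.

```python
def cascade(stages, word):
    """Apply each stage to every output of the previous one."""
    current = {word}
    for stage in stages:
        nxt = set()
        for w in current:
            nxt.update(stage(w))
        current = nxt
    return current


def rewrite(old, new):
    """Stand-in for a rule like 'old -> new': a stage with a single output."""
    return lambda w: {w.replace(old, new)}


# Same order as the stack example: a -> b, then b -> c, then c -> d.
stages = [rewrite("a", "b"), rewrite("b", "c"), rewrite("c", "d")]
print(cascade(stages, "a"))  # prints {'d'}
```

Because each stage here produces exactly one output, the set never grows. That is the unambiguous case in which chaining costs little compared to a fully composed transducer.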

How do I deal with reduplication without compile-replace?

Use the _eq() operator. See the _eq() definition in the regular expression reference for an example.

It takes forever to compose all the rewrite rules together in my morphology

It's best to not do:

define Allrules Rule1 .o. ... .o. RuleN;
regex Lexicon .o. Allrules;

But rather to do:

regex Lexicon .o. Rule1 .o. ... .o. RuleN;

Or, alternatively, filter the input to the first rule with the lower side of the lexicon, thusly:

define Allrules Lexicon.l .o. Rule1 .o. ... .o. RuleN;
regex Lexicon .o. Allrules;

I need to tokenize some text. I want to use foma, but there is no tokenize utility

You can create a transducer that inserts newlines (or your preferred token boundary). Here's a crude tokenizer. It uses a separate mandatory stoplist of words like "Dr." in tokenabbreviations.txt, one on each line:

# tokenizer.script
define Boundary ["("|"\"|"."|","|{"}|";"|":"|"?"|"!"|"¿"|"¡"|"«"|"»"|"'"|"`"|")"|
                 "^"|"@"|"~"|"|"|"_"|"/"|"+"|"="|"&"|"$"|"€"|"£"|"¢"|"¥"|"#"|"*"|
                 "+"|"%"|" "|"-"+|"\u0009"];

define DIGIT [%0|1|2|3|4|5|6|7|8|9];
define TOKENSYM "\u000a";
define SPACE " "|"\u0009"|"\u000a"|"\u000d" ;
define WordsCompounds [\Boundary+ ("_" \Boundary+)+];
define Initials [\Boundary "."]+;
define Numbers DIGIT+ [(%, DIGIT+)* (%. DIGIT+)];
define Abbreviations @txt"tokenabbreviations.txt";
define Tokenizer [WordsCompounds|Numbers|Abbreviations|Boundary|Initials] @-> ... TOKEN .o. 
                 " " -> 0 .o. TOKEN+ @-> TOKEN .o. SPACE+ @-> 0 .o. TOKEN+ @-> "\u000a";
regex Tokenizer;
save stack mytokenizer.foma

The produced transducer can be used like so:

$ echo "This is a sentence, a sentence to be tokenized." | flookup -i -x -w "" mytokenizer.foma

Yielding

This
is
a
sentence
,
a
sentence
to
be
tokenized
.

My replace rules don't work when I have flag diacritics in the input

Although flags behave as epsilons when a transducer is applied, they do need to be declared in replacement rule contexts. One should say, e.g.,

define BPchange b -> p || _ ("@U.SOMEFLAG.SOMEVALUE@") .#. ;

if there's a possibility that there's a flag between the target and the end-of-word, otherwise the rule won't trigger.

You can also issue the (mildly deprecated) command

SET flag-is-epsilon ON

before composing the rules together with the lexicon. This will allow the rules to trigger even with flags intervening in the contexts.

There's a word path in a transducer which I can see with mine own eyes, but foma won't accept the word in apply down/up

This is probably a symbol tokenization issue. Foma does a leftmost-longest symbol tokenization on all words before looking them up in a transducer. It's dangerous to use multicharacter symbols where their prefixes overlap. [Well, it's dangerous to use multicharacter symbols, period.]

Example:

regex c a t | ca r;

and now

apply down> cat
???

The word is not accepted because the input cat gets tokenized ca+t (because we have symbols ca and t in the alphabet). But there is no path matching ca+t, only c+a+t; hence, it won't match.

A good debugging strategy is to always look at the alphabet in situations like this (with the sigma command).
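
To see why cat ends up as ca+t, the tokenization step can be imitated in a few lines of Python. This is a simplified sketch of leftmost-longest matching, not foma's actual code. The alphabet is typed in by hand from the regex above, and the function name is invented.

```python
def tokenize(word, alphabet):
    """Split word into symbols, taking the longest match at each position."""
    # Try longer symbols before shorter ones.
    symbols = sorted(alphabet, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        for sym in symbols:
            if word.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            # Unknown character: pass it through as a single symbol.
            tokens.append(word[i])
            i += 1
    return tokens


# Alphabet of "regex c a t | ca r;": four single symbols plus ca.
sigma = {"c", "a", "t", "r", "ca"}
print(tokenize("cat", sigma))  # prints ['ca', 't']
```

The multicharacter symbol ca wins over c at the first position, so the path c+a+t is never tried.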

Comment by medarith...@gmail.com, May 17, 2012

How can we invoke the flookup utility from a Java program?

Comment by project member mans.hul...@gmail.com, May 17, 2012

The easiest way is probably to use standard methods for executing shell commands in Java to call flookup. See this link for an example: http://www.dzone.com/snippets/execute-shell-command-java

Comment by medarith...@gmail.com, May 18, 2012

I tried this approach earlier, but ran into a problem when I did the following:

Process pr = run.exec("echo " + "\""+ "city" + "\""+ " | flookup -x eng.foma");

The output is as follows:

"city" | flookup -x eng.foma

The output of echo is not transferred to flookup

