FAQ
Frequently asked questions
Foma FAQ

How do I best view/export graphics of my transducers?
If you only want to view the transducers, the following options are available:
If you need to export the transducers (say, for LaTeX use), the OSX GraphViz viewer allows direct PDF export of a transducer, which can be used elsewhere. For Linux/UNIX in general, you can do something like:

foma[1]: print dot > myfile.dot              [in foma]
mylinux$ dot -Tpdf myfile.dot > myfile.pdf   [on command line]

Of course, if you want to change the default look of the automata, you can always edit the .dot file foma produces before converting.

Can I use Unicode/UTF8 with foma?
Yes, UTF8 input is the only format foma accepts. If you need to compile legacy grammars in other encodings, you need to recode them first. For example, a latin1-encoded lexc file should first be converted with, say, the Unix recode utility:

recode latin1..utf8 myoldlexcgrammar.lexc

Foma works fine with ASCII but not with special characters (á,é,Ã, etc.)

How do I use my automata/transducers elsewhere?
You can

Using flookup on its own
You can pipe words or text through flookup. Each line is one "word" as far as flookup is concerned, but a line could also consist of a sentence or arbitrary text.

echo "word" | flookup -i -x mytransducer.foma

Flookup with a bidirectional pipe
The flookup utility flushes its output after every set of outputs it gives for a single input (after 0.9.15, the -b flag is required for this behavior), which makes it possible to use a bidirectional pipe to link against flookup. Here's an example in Perl that opens "mytransducer.foma" and does what "apply down" would do inside foma:

#!/usr/bin/perl
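use FileHandle;
use IPC::Open2;
use locale;
# Start flookup with pipes attached to its stdout (Reader) and stdin (Writer)
my $pidFOMA = open2(*Reader, *Writer, "flookup -i -x -b ./mytransducer.foma");
print Writer "word\n";
# Read results until the empty line that ends this word's output (or EOF)
while (defined(my $returnword = <Reader>)) {
    last if $returnword eq "\n";
    chomp($returnword);
    print "$returnword\n";
}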
close(Reader); close(Writer);

[Note: versions later than 0.9.15 require the -b flag to flush the output between every word.]

Through the foma API
The following code example loads a transducer, applies it to a word, and prints all the possible outputs:

#include <stdio.h>
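#include <stdlib.h>
#include "fomalib.h"

int main() {
    struct fsm *net;
    struct apply_handle *ah;
    char *result;

    /* Load the compiled transducer from disk */
    net = fsm_read_binary_file("mytransducer.foma");
    if (net == NULL) {
        perror("Error loading file");
        exit(EXIT_FAILURE);
    }
    /* The first call passes the word; later calls pass NULL to fetch
       the remaining outputs, until NULL is returned */
    ah = apply_init(net);
    result = apply_down(ah, "word");
    while (result != NULL) {
        printf("%s\n", result);
        result = apply_down(ah, NULL);
    }
    /* Release the apply handle and the network */
    apply_clear(ah);
    fsm_destroy(net);
    return 0;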
}

Compile with something like:

gcc -o fomaread fomaread.c -lfoma

or, to link the foma library statically:

gcc -o fomaread fomaread.c /usr/local/lib/libfoma.a -lz

The Javascript runtime
A simple Javascript runtime is found in the contrib directory. It provides a function foma_apply_down() that takes as arguments an automaton object (which can be generated by a separate script) and an input word. All the possible outputs are returned in an array. In order to use the runtime you need to:
./foma2js.perl -n myNet mynet.foma > mynet.js
Example:

<script type="text/javascript" src="mynet.js"></script>
<script type="text/javascript" src="foma_apply_down.js"></script>
...
<script type="text/javascript">
...
// foma_apply_down returns an array with all the outputs
// for the input inputString
var returnArray = foma_apply_down(myNet, inputString);
...
</script>

Some caveats: the js runtime is a recursive implementation of the application code and, unlike the C library code, does not check for input-side epsilon-loops, which can cause it to recurse infinitely. Also, flag diacritics are not supported. Performance is not nearly as good as with the C library, but perhaps acceptable for many purposes (reaching 10,000 to 100,000 output words per second for average transducers on Safari/Mozilla/Chrome).

Rolling your own
For using a custom applyer, the prolog format is probably the most convenient export format to use in foma. This is because it declares alphabet symbols that don't appear on transitions. Consider:

foma[0]: regex \a b;
339 bytes. 3 states, 3 arcs, 2 paths.
foma[1]: write prolog
network(41A7).
symbol(41A7, "a").
arc(41A7, 0, 1, "?").
arc(41A7, 0, 1, "b").
arc(41A7, 1, 2, "b").
final(41A7, 2).

Here, the symbol a, which never appears on a transition, is declared as symbol(), so a matcher can know that, for instance, the ?-transition from 0 to 1 should not match a. However, if you're always using closed alphabets (transducers that don't carry the @ or ? symbols), this is of no concern.

My final transducer is HUGE! Isn't there a way to avoid composing all the components?
Maybe. If what you want to parse with is not a morphological analyzer that generates too much intermediate ambiguity, you can use the flookup utility to virtualize part of the composition. For example, if you have all the component transducers in order on the stack, you can save them all to one file, and flookup will pass the output of each one as input to the next:

foma[0]: regex a -> b;
366 bytes.
1 states, 3 arcs, Cyclic.
foma[1]: regex b -> c;
366 bytes. 1 states, 3 arcs, Cyclic.
foma[2]: regex c -> d;
366 bytes. 1 states, 3 arcs, Cyclic.
foma[3]: save stack testchain.foma
Writing to file testchain.foma.
...
echo "a" | flookup -i -x testchain.foma
d

Doing this is often a good idea if you have a tagger or chunker that consists of multiple phases where each tagger transducer addresses a specific component (say, tags dates, names, places) and produces no ambiguous outputs. In this case, composing them all provides little gain in application speed and much loss in space efficiency, so the above technique is advisable.

How do I deal with reduplication without compile-replace?
Use the _eq() operator. See the `eq()` definition in the regular expression reference for an example.

It takes forever to compose all the rewrite rules together in my morphology
It's best not to do:

define Allrules Rule1 .o. ... .o. RuleN;
regex Lexicon .o. Allrules;

But rather to do:

regex Lexicon .o. Rule1 .o. ... .o. RuleN;

Or, alternatively, filter the input to the first rule with the lower side of the lexicon, like so:

define Allrules Lexicon.l .o. Rule1 .o. ... .o. RuleN;
regex Lexicon .o. Allrules;

I need to tokenize some text. I want to use foma, but there is no tokenize utility
You can create a transducer that inserts newlines (or your preferred token boundary). Here's a crude tokenizer. It uses a separate mandatory stoplist of words like "Dr." in tokenabbreviations.txt, one on each line:

# tokenizer.script
define Boundary ["("|"\"|"."|","|{"}|";"|":"|"?"|"!"|"¿"|"¡"|"«"|"»"|"'"|"`"|")"|
"^"|"@"|"~"|"|"|"_"|"/"|"+"|"="|"&"|"$"|"€"|"£"|"¢"|"¥"|"#"|"*"|
"+"|"%"|" "|"-"+|"\u0009"];
define DIGIT [%0|1|2|3|4|5|6|7|8|9];
define TOKENSYM "\u000a";
define SPACE " "|"\u0009"|"\u000a"|"\u000d" ;
define WordsCompounds [\Boundary+ ("_" \Boundary+)+];
define Initials [\Boundary "."]+;
define Numbers DIGIT+ [(%, DIGIT+)* (%. DIGIT+)];
define Abbreviations @txt"tokenabbreviations.txt";
define Tokenizer [WordsCompounds|Numbers|Abbreviations|Boundary|Initials] @-> ... TOKEN .o.
" " -> 0 .o. TOKEN+ @-> TOKEN .o. SPACE+ @-> 0 .o. TOKEN+ @-> "\u000a";
regex Tokenizer;
save stack mytokenizer.foma

The produced transducer can be used like so:

$ echo "This is a sentence, a sentence to be tokenized." | flookup -i -x -w "" mytokenizer.foma

Yielding

This
is
a
sentence
,
a
sentence
to
be
tokenized
.

My replace rules don't work when I have flag diacritics in the input
Although flags are treated as epsilons when a transducer is applied, they do need to be declared in replacement rule contexts. One should say, e.g.,

define BPchange b -> p || _ ("@U.SOMEFLAG.SOMEVALUE@") .#. ;

if there's a possibility that there's a flag between the target and the end-of-word; otherwise the rule won't trigger. You can also issue the (mildly deprecated) command SET flag-is-epsilon ON before composing the rules together with the lexicon. This will allow the rules to trigger even with flags intervening in the contexts.

There's a word path in a transducer which I can see with mine own eyes, but foma won't accept the word in apply down/up
This is probably a symbol tokenization issue. Foma does a leftmost-longest symbol tokenization on all words before looking them up in a transducer. It's dangerous to use multicharacter symbols where their prefixes overlap. [Well, it's dangerous to use multicharacter symbols, period.] Example:

regex c a t | ca r;

and now

apply down> cat
???

The word is not accepted because the input cat gets tokenized ca+t (because we have symbols ca and t in the alphabet). But there is no path matching ca+t, only c+a+t; hence, it won't match. A good debugging strategy is to always look at the alphabet in situations like this (with the sigma command).
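
To make the leftmost-longest tokenization concrete, here is a minimal Java sketch of the idea. It is only an illustration, not foma's actual code: the class name SymbolTokenizer and the method tokenize are made up for this example, and treating an unknown character as a single-character token is an assumption of the sketch. The alphabet is the one produced by the regex c a t | ca r; example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of foma.
class SymbolTokenizer {

    // At each position, take the longest alphabet symbol that matches there.
    // Assumption of this sketch: a character not covered by any symbol
    // becomes a token of its own.
    static List<String> tokenize(String word, List<String> alphabet) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < word.length()) {
            String best = null;
            for (String sym : alphabet) {
                if (!sym.isEmpty() && word.startsWith(sym, pos)
                        && (best == null || sym.length() > best.length())) {
                    best = sym;
                }
            }
            if (best == null) {
                // No symbol matches: consume one Unicode code point.
                int end = word.offsetByCodePoints(pos, 1);
                best = word.substring(pos, end);
            }
            tokens.add(best);
            pos += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Alphabet of "regex c a t | ca r;": c, a, t, r and the multichar symbol ca
        List<String> sigma = Arrays.asList("c", "a", "t", "ca", "r");
        System.out.println(tokenize("cat", sigma)); // prints [ca, t]
        System.out.println(tokenize("car", sigma)); // prints [ca, r]
    }
}
```

Since cat comes out as ca+t, a network that only has the path c+a+t cannot accept it, which is the failure described above.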
How can we invoke the flookup utility from a Java program?
The easiest way is probably to use standard methods for executing shell commands in Java to call flookup. See this link for an example: http://www.dzone.com/snippets/execute-shell-command-java
I tried this approach earlier but ran into a problem. When I did the following:
Process pr = run.exec("echo " + "\""+ "city" + "\""+ " | flookup -x eng.foma");
the output was:
"city" | flookup -x eng.foma
The output of echo is not passed on to flookup.
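
The pipe is not interpreted because Runtime.exec does not run the command through a shell: the string is split on whitespace, echo is started directly, and "|" and everything after it are passed to echo as ordinary arguments. One way around this is to start flookup itself and write the word to its standard input. The following is a minimal sketch under some assumptions: the class name FlookupRunner and the method lookup are made up for this example, flookup is assumed to be on the PATH, and eng.foma is the file from the post above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
class FlookupRunner {

    // Starts the command, writes one word to its stdin, and returns the
    // non-empty lines it prints on stdout.
    static List<String> lookup(List<String> command, String word)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Let error messages go to our own stderr instead of an unread pipe.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();

        // foma expects UTF-8; closing the writer signals end of input.
        try (Writer w = new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8)) {
            w.write(word + "\n");
        }

        List<String> results = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (!line.isEmpty()) {
                    results.add(line); // skip blank separator lines
                }
            }
        }
        p.waitFor();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed invocation, taken from the post above.
        List<String> cmd = Arrays.asList("flookup", "-x", "eng.foma");
        try {
            for (String out : lookup(cmd, "city")) {
                System.out.println(out);
            }
        } catch (IOException e) {
            System.err.println("Could not run flookup: " + e.getMessage());
        }
    }
}
```

This sketch sends a single word and then closes the input, so it is fine for one-off lookups. To keep one flookup process open for many words, the -b flag and the read-until-empty-line loop from the Perl example in the FAQ would be the pattern to follow.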