Frequently Asked Questions List for GMA


last updated: Feb. 16, 2002


Table of Contents:

I. Administrative

  1. How can I make sure that the GMA package I received is genuine?
  2. How do I sign up for the GMA email lists?

II. Technical

  1. What is "mapping bitext correspondence" and how does it differ from inducing translation models?
  2. On what platforms does GMA run?
  3. How efficient is GMA?
  4. What language pairs can GMA be used for?
  5. What language-specific resources are required/desirable for use with GMA? What language-specific resources are included with this package?
  6. When should I re-optimize the GMA parameters?
  7. Where can I learn more about how SIMR and GSA work?

I. Administrative



  1. How can I make sure that the GMA package I received is genuine?
    Verify the uudecoded (but not gunzipped) package against the detached GPG signature at http://www.cs.nyu.edu/~melamed/GMA/GPG-sigs/GMA-VVV.tgz.sig, where VVV is your version number (for example, 1.1).
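
    The check described above can be scripted. The following is only a sketch: it assumes that GnuPG is installed as "gpg", that the signer's public key is already in your keyring, and that the uudecoded package is named GMA-VVV.tgz to match the signature file. The function names are hypothetical and are not part of GMA.

```python
import subprocess

# URL pattern for detached signatures, as given in this FAQ.
SIG_URL_TEMPLATE = "http://www.cs.nyu.edu/~melamed/GMA/GPG-sigs/GMA-{version}.tgz.sig"


def signature_url(version):
    """Return the signature URL for a given GMA version string, e.g. "1.1"."""
    return SIG_URL_TEMPLATE.format(version=version)


def verify_command(package_path, signature_path):
    """Build the gpg command line that checks a detached signature.

    `gpg --verify SIGFILE DATAFILE` verifies SIGFILE against DATAFILE.
    """
    return ["gpg", "--verify", signature_path, package_path]


def is_genuine(package_path, signature_path):
    """Run gpg and report whether it accepted the signature.

    gpg exits with a non-zero status if the signature is bad or if the
    signer's public key is missing from the keyring.
    """
    result = subprocess.run(
        verify_command(package_path, signature_path),
        capture_output=True,
    )
    return result.returncode == 0
```

    For version 1.1, you would download the file at signature_url("1.1") and then call is_genuine("GMA-1.1.tgz", "GMA-1.1.tgz.sig") on the still-gzipped package.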

  2. How do I sign up for the GMA email lists?
    You can use the web-based interface to (un)subscribe to the moderated GMA-announce list and the unmoderated GMA list.

II. Technical

  1. What is "mapping bitext correspondence" and how does it differ from inducing translation models?
    A bitext map is a partial (ideally quite dense) relation between the tokens and token boundaries of a text and those of its translation. Translation models are relations between types, not tokens. E.g., GMA can tell you that the 3rd word in text X arose as a translation of the 4th word in X's translation Y, but it cannot tell you whether that pair of words would be a good entry in a bilingual dictionary. Methods exist for converting between bitext maps and translation models, but the reliable ones are not trivial.
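
    To make the token/type distinction concrete, here is a toy illustration. It represents a bitext map as a list of (position in X, position in Y) pairs, which is a hypothetical in-memory form and not GMA's actual output format, and then collapses it into type-level pair counts. Positions are 0-based here, whereas the prose above counts from 1. Naive counting like this is exactly the kind of conversion that is easy but not reliable.

```python
from collections import Counter


def type_pair_counts(tokens_x, tokens_y, bitext_map):
    """Collapse token-level links (positions) into type-level pair counts."""
    counts = Counter()
    for i, j in bitext_map:
        # The map links positions; the words at those positions are types.
        counts[(tokens_x[i].lower(), tokens_y[j].lower())] += 1
    return counts


# Made-up example sentence pair and map, for illustration only.
tokens_x = ["the", "white", "house", "is", "white"]
tokens_y = ["la", "maison", "blanche", "est", "blanche"]
bitext_map = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 4)]

counts = type_pair_counts(tokens_x, tokens_y, bitext_map)
```

    The map contains five token-level links, but only four distinct type pairs, because both occurrences of "white" are linked to an occurrence of "blanche". The map says where each occurrence corresponds; it says nothing about whether ("white", "blanche") belongs in a dictionary.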

  2. On what platforms does GMA run?
    So far, GMA has successfully run on Linux/i386 and Solaris/SPARC. I welcome proposals to port it to other platforms.

  3. How efficient is GMA?
    The underlying algorithms are all linear in the size of the input. However, the core of the code in this package comes from an inefficient research prototype written before I knew how to program. As a result, the current implementation is very slow and memory-intensive. As a rule of thumb, I tend to need about 50X bytes of RAM for every X bytes of bitext.
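
    The rule of thumb above translates into a quick back-of-the-envelope estimate. This sketch assumes that "X bytes of bitext" means the combined size of both halves of the bitext; the factor of 50 is the rough figure quoted above, not a measured guarantee, and the function name is hypothetical.

```python
RAM_FACTOR = 50  # rough rule of thumb from this FAQ, not a guarantee


def estimated_ram_bytes(bitext_bytes, factor=RAM_FACTOR):
    """Estimate RAM needed for a bitext of the given total size in bytes."""
    return bitext_bytes * factor


MIB = 1024 * 1024

# Assumed example: the two halves of the bitext total 10 MiB.
bitext_size = 10 * MIB
estimate = estimated_ram_bytes(bitext_size)
estimate_mib = estimate / MIB  # 500.0 MiB under these assumptions
```

    If the estimate exceeds the memory you have available, splitting the bitext into smaller pieces before mapping is one obvious workaround.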

  4. What language pairs can GMA be used for?
    I'm not aware of any written languages that GMA cannot be used for. So far, GMA has been applied to:
    • French/English
    • Spanish/English
    • Korean/English
    • Chinese/English
    • Arabic/English
    • Czech/English
    • Malay/English
    Please do not hesitate to contact me for assistance with porting GMA to new language pairs.

  5. What language-specific resources are required/desirable for use with GMA? What language-specific resources are included with this package?
    GMA is based on an implementation of the Smooth Injective Map Recognizer (SIMR) algorithm. SIMR works best when supplied with language-specific information such as seed translation lexicons and lists of stop words. No such resources are included with this distribution, except stop words for English, French, Spanish, German, Italian and Malay (all encoded in ISO-8859-1). Even without seed lexicons, the software can be useful for language pairs that share many cognates, but performance will suffer without lists of stop words. If you want to work with a language that does not use the Roman alphabet, then you definitely need a seed translation lexicon (see the HOWTO section on matching predicates). If you have resources of this type that you would like to share, I'd be happy to include them in the next GMA release and to give you credit.
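
    The following sketch shows how stop words, a seed lexicon, and cognates might interact in a matching predicate. It is not GMA's code: the cognate test used here (longest common subsequence ratio, LCSR), the 0.7 threshold, and all function and parameter names are assumptions made for illustration. See the HOWTO section on matching predicates for what the package actually provides.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """LCS length divided by the length of the longer string."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def tokens_match(x, y, stop_x=frozenset(), stop_y=frozenset(),
                 seed_lexicon=frozenset(), threshold=0.7):
    """Hypothetical matching predicate for a pair of tokens.

    Stop words never match; seed lexicon entries always match;
    otherwise the pair matches if it looks like a cognate pair.
    """
    x, y = x.lower(), y.lower()
    if x in stop_x or y in stop_y:
        return False
    if (x, y) in seed_lexicon:
        return True
    return lcsr(x, y) >= threshold
```

    With these definitions, tokens_match("government", "gouvernement") succeeds on spelling alone, while tokens_match("house", "maison") succeeds only if ("house", "maison") is in the seed lexicon. For language pairs written in different scripts, the spelling test can never fire, which is why a seed lexicon becomes essential there.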

  6. When should I re-optimize the GMA parameters?
    SIMR has several numerical parameters that should be re-optimized every time you switch to a new resource, a new tokenization of the input, a new matching predicate, etc. If you just use the default parameters, as many people have done with Gale & Church's algorithm, the accuracy of the output may suffer greatly. Tools for re-optimizing GMA parameters are included in the package, in the train/ directory. To learn how to re-optimize the parameters, read the tech report on "Porting..." mentioned below, and the HOWTO-train file in the docs/ directory.

  7. Where can I learn more about how SIMR and GSA work?
    To better understand what this software does, I suggest you read one or more of my publications on this subject. Or just get the book: