Chapter 1 Introduction

Part I Translational Equivalence among Word Tokens
--------------------------------------------------

Chapter 2 A Geometric Approach to Mapping Bitext Correspondence
  2.1 Introduction
  2.2 Bitext Geometry
  2.3 Previous Work
  2.4 The Smooth Injective Map Recognizer (SIMR)
    2.4.1 Overview
    2.4.2 Point Generation
    2.4.3 Noise Filter
    2.4.4 Point Selection
    2.4.5 Reduction of the Search Space
    2.4.6 Enhancements
  2.5 Parameter Optimization
  2.6 Evaluation
  2.7 Implementation of SIMR for New Language Pairs
    2.7.1 Step 1: Construct Matching Predicate
    2.7.2 Step 2: Construct Axis Generators
    2.7.3 Step 3: Re-optimize Parameters
  2.8 Conclusion

Chapter 3 Application: Alignment
  3.1 Introduction
  3.2 Correspondence is Richer than Alignment
  3.3 The Geometric Segment Alignment (GSA) Algorithm
  3.4 Evaluation
  3.5 Conclusion

Chapter 4 Application: Automatic Detection of Omissions in Translations
  4.1 Introduction
  4.2 The Basic Method
  4.3 Noise-Free Bitext Maps
  4.4 A Translators' Tool
  4.5 Noisy Bitext Maps
  4.6 ADOMIT
  4.7 Simulation of Omissions
  4.8 Evaluation
  4.9 Conclusion

Part II The Type-Token Interface
--------------------------------

Chapter 5 Models of Co-occurrence
  5.1 Introduction
  5.2 Relevant Regions of the Bitext Space
  5.3 Co-occurrence Counting Methods
  5.4 Language-Specific Filters
  5.5 Conclusion

Chapter 6 Manual Annotation of Translational Equivalence
  6.1 Introduction
  6.2 The Gold Standard Bitext
  6.3 The Blinker Annotation Tool
  6.4 Methods for Increasing Reliability
  6.5 Inter-Annotator Agreement
  6.6 Conclusion

Part III Translational Equivalence among Word Types
---------------------------------------------------

Chapter 7 Word-to-Word Models of Translational Equivalence
  7.1 Introduction
  7.2 Translation Model Decomposition
  7.3 The One-to-One Assumption
  7.4 Previous Work
    7.4.1 Non-Probabilistic Translation Lexicons
    7.4.2 Re-estimated Sequence-to-Sequence Translation Models
    7.4.3 Re-estimated Bag-to-Bag Translation Models
  7.5 Parameter Estimation
    7.5.1 Method A: The Competitive Linking Algorithm
    7.5.2 Method B: Improved Estimation Using an Explicit Noise Model
    7.5.3 Method C: Improved Estimation Using Pre-Existing Word Classes
  7.6 Effects of Sparse Data
  7.7 Evaluation
    7.7.1 Evaluation at the Token Level
    7.7.2 Evaluation at the Type Level
  7.8 Application to MT Lexicon Development
  7.9 Conclusion

Chapter 8 Automatic Discovery of Non-Compositional Compounds
  8.1 Introduction
  8.2 Objective Functions
  8.3 Search
  8.4 Predictive Value Functions
  8.5 Iteration
  8.6 Credit Estimation
  8.7 Single-Best Translation
  8.8 Experiments
  8.9 Related Work
  8.10 Conclusion

Chapter 9 Sense-to-Sense Models of Translational Equivalence
  9.1 Introduction
  9.2 Previous Work
  9.3 Formulation of the Problem
  9.4 A Solution
    9.4.1 Noise Filters
    9.4.2 The SenseClusters Algorithm
  9.5 An Application
  9.6 Experiments
    9.6.1 Quantitative Results
    9.6.2 Qualitative Results
  9.7 Conclusion

Chapter 10 Summary and Outlook

Appendix A Annotation Style Guide for the Blinker Project
---------------------------------------------------------

  A.1 General Guidelines
    A.1.1 Omissions in Translation
    A.1.2 Phrasal Correspondence
  A.2 Detailed Guidelines
    A.2.1 Idioms and Near Idioms
    A.2.2 Referring Expressions
    A.2.3 Verbs
    A.2.4 Prepositions
    A.2.5 Determiners
    A.2.6 Punctuation

Bibliography