Cyril Allauzen

Senior Research Scientist
Computer Science Department, Courant Institute of Mathematical Sciences, NYU
Address: 251 Mercer Street, Room 819, New York, NY 10012
Email: my_last_name at google dot com

I am a Research Scientist at the Courant Institute. Before joining Courant, I was a member of the Speech Algorithms Department at AT&T Labs - Research. I obtained my Ph.D. in 2001 at the Institut Gaspard-Monge at the Université Paris-Est Marne-la-Vallée.

Update: As of January 2008, I am a Research Scientist at Google.

Research Interests

My current topics of interest are:
  • weighted automata and finite-state transducers (theory and algorithms),
  • machine learning (kernel methods),
  • natural language processing (speech recognition, speech synthesis),
  • text algorithms (string matching, indexing).

Software

  • Finite-State Transducer Library (OpenFst Library):
    An open-source software library for constructing, combining, optimizing, and searching weighted finite-state transducers.

  • Kernel Library (OpenKernel Library):
    An open-source software library for creating, combining, learning and using kernels for machine learning applications.

  • Grammar Library (GRM Library):
    A general software collection for constructing and modifying weighted automata and transducers representing weighted grammars or statistical language models.

Publications

[1]
Cyril Allauzen and Michael Riley. Pre-initialized composition for large-vocabulary speech recognition. In Interspeech 2013, pages 666-670, 2013.

[2]
Hasim Sak, Yun-hsuan Sung, Françoise Beaufays, and Cyril Allauzen. Written-domain language modeling for automatic speech recognition. In Interspeech 2013, pages 675-679, 2013.

[3]
Brian Roark, Cyril Allauzen, and Michael Riley. Smoothed marginal distribution constraints for language modeling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 43-52, 2013.

[4]
Hasim Sak, Françoise Beaufays, Kaisuke Nakajima, and Cyril Allauzen. Language model verbalization for automatic speech recognition. In Proceedings of ICASSP 2013. IEEE, 2013.

[5]
Cyril Allauzen, Edward Benson, Ciprian Chelba, Michael Riley, and Johan Schalkwyk. Voice query refinement. In Interspeech 2012, 2012.

[6]
Cyril Allauzen and Michael Riley. A pushdown transducer extension for the OpenFst library. In Implementation and Application of Automata - 17th International Conference, CIAA 2012, volume 7381 of Lecture Notes in Computer Science, pages 66-77. Springer, 2012.

[7]
Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of ACL (System Demonstrations) 2012, pages 61-66, 2012.

[8]
Cyril Allauzen, Corinna Cortes, and Mehryar Mohri. A dual coordinate descent algorithm for SVMs combined with rational kernels. International Journal of Foundations of Computer Science, 22(8):1761-1779, 2011.

[9]
Cyril Allauzen and Michael Riley. Bayesian language model interpolation for mobile speech input. In Interspeech 2011, pages 1429-1432, 2011. (PDF, 91052 bytes)

[10]
Jeffrey Sorensen and Cyril Allauzen. Unary data structures for language models. In Interspeech 2011, pages 1425-1428, 2011. (PDF, 244941 bytes)

[11]
Gonzalo Iglesias, Cyril Allauzen, William Byrne, Adrià de Gispert, and Michael Riley. Hierarchical phrase-based translation representations. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011). Association for Computational Linguistics, 2011. (PDF, 161045 bytes)

[12]
Cyril Allauzen, Mehryar Mohri, and Ashish Rastogi. General algorithms for testing the ambiguity of finite automata and the double-tape ambiguity of finite-state transducers. International Journal of Foundations of Computer Science, 22(4):883-904, 2011. (PDF, 270824 bytes)

[13]
Cyril Allauzen, Corinna Cortes, and Mehryar Mohri. Large-scale training of SVMs with automata kernels. In Implementation and Applications of Automata, 15th International Conference, CIAA 2010, volume 6482 of Lecture Notes in Computer Science, pages 17-27. Springer, 2011. (PDF, 139916 bytes)

[14]
Cyril Allauzen, Michael Riley, and Johan Schalkwyk. Filters for efficient composition of weighted finite-state transducers. In Implementation and Applications of Automata, 15th International Conference, CIAA 2010, volume 6482 of Lecture Notes in Computer Science, pages 28-38. Springer, 2011. (PDF, 189785 bytes)

[15]
Brandon Ballinger, Cyril Allauzen, Alexander Gruenstein, and Johan Schalkwyk. On-demand language model interpolation for mobile speech input. In Interspeech 2010, pages 1812-1815, 2010. (PDF, 227629 bytes)

[16]
Cyril Allauzen, Shankar Kumar, Wolfgang Macherey, Mehryar Mohri, and Michael Riley. Expected sequence similarity maximization. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 957-965, Los Angeles, California, June 2010. Association for Computational Linguistics.

[17]
Cyril Allauzen, Michael Riley, and Johan Schalkwyk. A generalized composition algorithm for weighted finite-state transducers. In Interspeech 2009, pages 1203-1206. ISCA, 2009. (PDF, 182281 bytes)

[18]
Cyril Allauzen and Mehryar Mohri. N-way composition of weighted finite-state transducers. International Journal of Foundations of Computer Science, 20(4):613-627, 2009. (PDF, 244348 bytes)

[19]
Cyril Allauzen and Mehryar Mohri. Linear-space computation of the edit-distance between a string and a finite automaton. In Joseph Chan, Jacqueline W. Daykin, and M. Sohel Rahman, editors, London Algorithmics 2008: Theory and Practice, volume 11 of Texts in Algorithmics. College Publications, 2009. Dedicated to Maxime Crochemore on his 60th birthday. (PDF, 226226 bytes)

[20]
Cyril Allauzen, Mehryar Mohri, and Ashish Rastogi. General algorithms for testing the ambiguity of finite automata. In Developments in Language Theory, 12th International Conference, DLT 2008, volume 5257 of Lecture Notes in Computer Science, pages 108-120. Springer, 2008.

[21]
Cyril Allauzen, Mehryar Mohri, and Ameet Talwalkar. Sequence kernels for predicting protein essentiality. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pages 9-16. ACM, 2008.

[22]
Cyril Allauzen and Mehryar Mohri. 3-way composition of weighted finite-state transducers. In Implementation and Applications of Automata, 13th International Conference, CIAA 2008, volume 5148 of Lecture Notes in Computer Science, pages 262-273. Springer, 2008.

[23]
Cyril Allauzen and Mehryar Mohri. N-way composition of weighted finite-state transducers. Technical Report TR2007-902, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, August 2007. (PostScript, 12 pages, 495845 bytes) (PDF, 170317 bytes)

[24]
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. OpenFst: a general and efficient weighted finite-state transducer library. In Proceedings of the 12th International Conference on Implementation and Application of Automata, (CIAA 2007), volume 4783 of Lecture Notes in Computer Science, pages 11-23. Springer, 2007. (PDF, 439327 bytes)

[25]
Cyril Allauzen and Mehryar Mohri. A unified construction of the Glushkov, follow, and Antimirov automata. In Proceedings of the 31st International Symposium on Mathematical Foundations of Computer Science (MFCS 2006), volume 4162 of Lecture Notes in Computer Science, pages 110-121. Springer, 2006. (PostScript, 12 pages, 421075 bytes) (PDF, 183721 bytes)

[26]
Cyril Allauzen and Mehryar Mohri. A unified construction of the Glushkov, follow and Antimirov automata. Technical Report TR2006-880, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, April 2006. (PostScript) (PDF)

[27]
Cyril Allauzen and Mehryar Mohri. The design principles and algorithms of a weighted grammar library. International Journal of Foundations of Computer Science, 16(3):403-421, 2005. (PostScript, 19 pages, 461500 bytes) (PDF, 207665 bytes)

[28]
Sarangarajan Parthasarathy, Cyril Allauzen, and Rungsun Munkong. Robust access to large structured data using voice form-filling. In Proceedings of Eurospeech-Interspeech 2005, pages 2493-2496, 2005.

[29]
Vincent Goffin, Cyril Allauzen, Enrico Bocchieri, Dilek Hakkani-Tür, Andrej Ljolje, Sarangarajan Parthasarathy, Mazin Rahim, Giuseppe Riccardi, and Murat Saraclar. The AT&T Watson speech recognizer. In Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'2005), volume 1, pages 1033-1036, 2005. (PDF, 154911 bytes)

[30]
Cyril Allauzen, Mehryar Mohri, and Michael Riley. Statistical modeling for unit selection in speech synthesis. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL'2004), pages 55-62, 2004. (PostScript, 8 pages, 261280 bytes) (PDF, 108214 bytes)

[31]
Cyril Allauzen, Mehryar Mohri, and Brian Roark. A general weighted grammar library. In Proceedings of the Ninth International Conference on Implementation and Application of Automata (CIAA'2004), volume 3317 of Lecture Notes in Computer Science, pages 23-34. Springer, 2005. (PostScript, 11 pages, 363398 bytes) (PDF, 138900 bytes)

[32]
Cyril Allauzen and Mehryar Mohri. An optimal pre-determinization algorithm for weighted transducers. Theoretical Computer Science, 328(1-2):3-18, 2004. (PostScript, 18 pages, 521242 bytes) (PDF, 201575 bytes)

[33]
Cyril Allauzen, Mehryar Mohri, Michael Riley, and Brian Roark. A generalized construction of integrated speech recognition transducers. In Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'2004), volume I, pages 761-764, 2004. (PostScript, 4 pages, 254138 bytes) (PDF, 84958 bytes)

[34]
Cyril Allauzen, Mehryar Mohri, and Murat Saraclar. General indexation of weighted automata -- application to spoken utterance retrieval. In Proceedings of the Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval at HLT/NAACL 2004, pages 33-40, 2004. (PostScript, 8 pages, 277832 bytes) (PDF, 114898 bytes)

[35]
Cyril Allauzen and Mehryar Mohri. Finitely subsequential transducers. International Journal of Foundations of Computer Science, 14(6):983-994, 2003. (PostScript, 12 pages, 244150 bytes) (PDF, 228635 bytes)

[36]
Cyril Allauzen and Mehryar Mohri. An efficient pre-determinization algorithm. In Proceedings of the Eighth International Conference on Implementation and Application of Automata (CIAA'2003), volume 2759 of Lecture Notes in Computer Science, pages 83-95. Springer, 2003. (PostScript, 12 pages, 296741 bytes) (PDF, 214919 bytes)

[37]
Cyril Allauzen, Mehryar Mohri, and Brian Roark. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL'2003), pages 40-47, 2003. (PostScript, 8 pages, 230300 bytes) (PDF, 161534 bytes)

[38]
Cyril Allauzen and Mehryar Mohri. Efficient algorithms for testing the twins property. Journal of Automata, Languages and Combinatorics, 8(2):117-144, 2003. (PostScript, 29 pages, 579078 bytes) (PDF, 355609 bytes)

[39]
Cyril Allauzen and Mehryar Mohri. Generalized optimization algorithm for speech recognition transducers. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'2003), volume I, pages 352-355. IEEE, 2003. (PostScript, 4 pages, 194582 bytes) (PDF, 99019 bytes)

[40]
Cyril Allauzen and Mehryar Mohri. p-Subsequentiable transducers. In Proceedings of the Seventh International Conference on Implementation and Application of Automata (CIAA'2002), volume 2608 of Lecture Notes in Computer Science, pages 24-34. Springer, 2003. (PostScript, 12 pages, 250466 bytes) (PDF, 212437 bytes)

[41]
Cyril Allauzen, Maxime Crochemore, and Mathieu Raffinot. Efficient experimental string matching by weak factor recognition. In Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching (CPM'2001), volume 2089 of Lecture Notes in Computer Science, pages 51-72. Springer, 2001.

[42]
Cyril Allauzen. Combinatoire sur les mots et recherche de motifs. PhD thesis, Université de Marne-la-Vallée, 2001.

[43]
Cyril Allauzen and Mathieu Raffinot. Simple optimal string matching. Journal of Algorithms, 36(1):102-116, 2000.

[44]
Cyril Allauzen and Mathieu Raffinot. Simple optimal string matching (extended abstract). In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000), volume 1848 of Lecture Notes in Computer Science, pages 364-374. Springer, 2000.

[45]
Cyril Allauzen. Calcul efficace du shuffle de k mots. Technical report 2000-02, Institut Gaspard-Monge, Université de Marne-la-Vallée, 2000.

[46]
Cyril Allauzen and Mathieu Raffinot. Oracle des facteurs d'un ensemble de mots. Technical report 99-11, Institut Gaspard-Monge, Université de Marne-la-Vallée, 1999.

[47]
Cyril Allauzen, Maxime Crochemore, and Mathieu Raffinot. Factor oracle: a new structure for pattern matching. In Proceedings of SOFSEM'99, volume 1725 of Lecture Notes in Computer Science, pages 295-310. Springer, 1999.

[48]
Cyril Allauzen. Une caractérisation simple des nombres de Sturm. Journal de Théorie des Nombres de Bordeaux, 10(2):237-241, 1998.

[49]
Cyril Allauzen and Bruno Durand. Tiling problems. In E. Börger, E. Grädel, and Y. Gurevich, editors, The classical decision problem. Springer, 1997.

[50]
Cyril Allauzen and Bruno Durand. Pavages du plan: périodicité et décidabilité. Technical report 95-28, Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon, 1995.

Co-authors