AlignTool: The automatic temporal alignment of spoken utterances in German, Dutch, and British English for psycholinguistic purposes
In language production research, the latency with which speakers produce a spoken response to a stimulus and the onset and offset times of words in longer utterances are key dependent variables. Measuring these variables automatically often yields partially incorrect results, whereas exact measurement through visual inspection of the recordings is extremely time-consuming. We present AlignTool, an open-source alignment tool that first establishes preliminary onset and offset times of words and phonemes in spoken utterances using Praat, and subsequently performs a forced alignment of the spoken utterances and their orthographic transcriptions using the automatic speech recognition system MAUS. AlignTool creates a Praat TextGrid file that the user can inspect and, if necessary, correct manually. We evaluated AlignTool's performance with recordings of single-word and four-word utterances as well as semi-spontaneous speech. AlignTool performs well with audio signals that have an excellent signal-to-noise ratio, requiring virtually no corrections. For audio signals of lesser quality, AlignTool is still highly functional, but its results may require more frequent manual corrections. We also found that audio recordings that include long silent intervals tended to pose greater difficulties for AlignTool than recordings filled with speech, which AlignTool analyzed well overall. We expect that by semi-automating the temporal analysis of complex utterances, AlignTool will open new avenues in language production research.
Keywords: Language production, Time course, Voice-key, Automatic alignment
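The dependent variables named in the abstract (speech onset latency, word onsets and offsets) can be read off an aligned word tier once it is available. The following is a minimal sketch, assuming the word tier has already been exported from a TextGrid into (label, start, end) entries; the `Interval` type, the function names, and the example times are hypothetical illustrations and are not part of AlignTool, Praat, or MAUS.

```python
from typing import List, NamedTuple, Optional, Tuple


class Interval(NamedTuple):
    """One entry of a word tier (hypothetical representation)."""
    label: str    # word label; an empty label marks silence
    start: float  # interval onset in seconds
    end: float    # interval offset in seconds


def speech_onset_latency(tier: List[Interval],
                         stimulus_onset: float = 0.0) -> Optional[float]:
    """Return the time from stimulus onset to the first labelled word,
    or None if the tier contains no labelled word."""
    for interval in tier:
        if interval.label.strip():
            return interval.start - stimulus_onset
    return None


def word_durations(tier: List[Interval]) -> List[Tuple[str, float]]:
    """Return (word, duration in seconds) for every labelled interval."""
    return [(iv.label, iv.end - iv.start) for iv in tier if iv.label.strip()]


# Invented example: a four-word utterance preceded and followed by silence.
example_tier = [
    Interval("", 0.00, 0.62),
    Interval("the", 0.62, 0.75),
    Interval("red", 0.75, 1.05),
    Interval("small", 1.05, 1.50),
    Interval("ball", 1.50, 1.95),
    Interval("", 1.95, 2.50),
]

latency = speech_onset_latency(example_tier)
durations = word_durations(example_tier)
print(latency)
print(durations)
```

Here the stimulus is assumed to coincide with the start of the recording; if it does not, its time can be passed as `stimulus_onset`. Differences between floating-point times may carry small rounding errors, so comparisons are best made with a tolerance.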