Abstract
In this paper, we propose a novel similarity measure, based on co-occurrence probabilities, for inducing semantic classes. Clustering with the new similarity measure outperforms the widely used distance based on Kullback-Leibler divergence in precision, recall, and F1. In our experiments, we induced semantic classes from an unannotated in-domain corpus and then used the induced classes and structures to generate a large in-domain corpus, which was in turn used for language model adaptation. The character recognition rate improved from 85.2% to 91%. This offers a new approach to the lack of in-domain data in a dialogue system: induction first, then generation.
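For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of a distance based on Kullback-Leibler divergence between the context distributions of two words. It is not the measure proposed in the paper, which the abstract does not specify. The toy corpus, the use of left neighbors only, the add-alpha smoothing, and all function names are illustrative assumptions.

```python
import math
from collections import Counter

# Toy in-domain corpus (hypothetical, for illustration only).
sentences = [
    "i want to fly to boston".split(),
    "i want to fly to denver".split(),
    "show flights from boston".split(),
    "show flights from denver".split(),
    "i want a ticket".split(),
]

# Vocabulary of possible left neighbors, including a sentence-start marker.
vocab = sorted({w for s in sentences for w in s} | {"<s>"})


def left_context_distribution(word, sentences, vocab, alpha=1.0):
    """Add-alpha smoothed distribution over the words immediately preceding `word`."""
    counts = Counter()
    for sent in sentences:
        padded = ["<s>"] + sent
        for left, right in zip(padded, padded[1:]):
            if right == word:
                counts[left] += 1
    total = sum(counts.values()) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}


def kl(p, q):
    """Kullback-Leibler divergence D(p || q); smoothing keeps every q[v] positive."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)


def symmetric_kl(p, q):
    """Symmetrized KL distance: D(p || q) + D(q || p)."""
    return kl(p, q) + kl(q, p)


boston = left_context_distribution("boston", sentences, vocab)
denver = left_context_distribution("denver", sentences, vocab)
ticket = left_context_distribution("ticket", sentences, vocab)

d_same_class = symmetric_kl(boston, denver)
d_diff_class = symmetric_kl(boston, ticket)
```

In this toy corpus, "boston" and "denver" occur after the same words, so their distance is smaller than the distance between "boston" and "ticket". A clustering step would merge the closest pairs into candidate semantic classes.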
Additional information
This work is partially supported by the National Natural Science Foundation of China under Grant Nos. 10925419, 90920302, 10874203, 60875014, 61072124, 11074275, 11161140319.
About this article
Cite this article
Li, YL., Xu, WQ. & Yan, YH. A Novel Similarity Measure to Induce Semantic Classes and Its Application for Language Model Adaptation in a Dialogue System. J. Comput. Sci. Technol. 27, 443–450 (2012). https://doi.org/10.1007/s11390-012-1233-0
DOI: https://doi.org/10.1007/s11390-012-1233-0