Abstract
This paper describes a method for finding suitable points at which to divide a corpus with time-series information in order to extract local and frequent keywords. Such keywords characterize a time-series corpus and help readers comprehend it. Previous work proposed a corpus-separation method for extracting these keywords. However, that method divides the corpus at equal intervals and therefore cannot take changes of topic into account. To divide the corpus according to topic changes, we use the idea of a topic model and the topics extracted by Latent Dirichlet Allocation (LDA). In an experiment using five years of newspaper articles, we confirm from the LDA output that the topics of the documents change over time, and that points at which the corpus can be divided according to notable topic changes are observable.
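The idea of dividing a corpus where its topics change can be illustrated with a minimal sketch. It is not the procedure used in the paper, which the abstract does not specify. It assumes that each document already has a time period and an LDA topic distribution, and it flags a period as a candidate division point when its mean topic distribution differs from the previous period's by more than a cutoff. The Jensen-Shannon divergence, the function names, and the `threshold` value are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def mean_topic_distribution(rows):
    """Average a non-empty list of per-document topic distributions."""
    n_topics = len(rows[0])
    return [sum(row[k] for row in rows) / len(rows) for k in range(n_topics)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with a == 0 contribute nothing; y > 0 whenever x > 0 here.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def find_division_points(doc_topics, threshold=0.1):
    """Return periods whose topics differ notably from the previous period.

    doc_topics: list of (period, topic_distribution) pairs, where the
    distribution is assumed to come from an LDA model.
    threshold: hypothetical cutoff on the divergence between periods.
    """
    by_period = defaultdict(list)
    for period, dist in doc_topics:
        by_period[period].append(dist)
    periods = sorted(by_period)
    means = [mean_topic_distribution(by_period[p]) for p in periods]
    return [
        periods[i]
        for i in range(1, len(periods))
        if js_divergence(means[i - 1], means[i]) > threshold
    ]


# Toy data: two topics, with a shift between the second and third period.
docs = [
    (2006, [0.8, 0.2]), (2006, [0.7, 0.3]),
    (2007, [0.75, 0.25]),
    (2008, [0.2, 0.8]), (2008, [0.1, 0.9]),
]
print(find_division_points(docs))
```

With this toy data only the last period is flagged, because the first two periods have nearly identical mean distributions. Unlike division at equal intervals, the segment boundaries here follow the data. The period granularity and the cutoff would both need tuning on a real corpus.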
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Kobayashi, H., Saga, R. (2014). Finding Division Points for Time-Series Corpus Based on Topic Changes. In: Yamamoto, S. (eds) Human Interface and the Management of Information. Information and Knowledge Design and Evaluation. HIMI 2014. Lecture Notes in Computer Science, vol 8521. Springer, Cham. https://doi.org/10.1007/978-3-319-07731-4_37
DOI: https://doi.org/10.1007/978-3-319-07731-4_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07730-7
Online ISBN: 978-3-319-07731-4
eBook Packages: Computer Science, Computer Science (R0)