Finding Division Points for Time-Series Corpus Based on Topic Changes
This paper describes a method for finding appropriate points at which to divide a corpus with time-series information in order to extract local and frequent keywords. Local and frequent keywords characterize such a corpus and are useful for comprehending it. To extract these keywords, previous work proposed a corpus-separating method. However, that method divides the corpus at equal intervals, so it cannot take changes of topic into account. To divide the corpus according to topic changes, we use the idea of a topic model and the topics extracted by Latent Dirichlet Allocation (LDA). In an experiment using five years of newspaper articles, we confirm from the LDA output that the topics of the documents change as time passes, and that points suitable for dividing the corpus are clearly observable from these topic changes.
Keywords: keyword extraction, time series information, LDA
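The abstract does not spell out how a topic change is detected, so the following is only a minimal sketch of one plausible reading, not the authors' procedure. It assumes that each document already has a topic distribution (for example, the per-document output of an LDA tool), that documents are grouped into consecutive time periods, and that a large Jensen-Shannon divergence between the mean topic distributions of adjacent periods marks a candidate division point. The function names, the toy data, the year labels, and the threshold value are all hypothetical.

```python
import math


def mean_distribution(rows):
    """Average a non-empty list of topic distributions (lists of probabilities)."""
    k = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(k)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def candidate_division_points(periods, threshold):
    """Return labels of periods whose mean topic mix differs strongly from the previous one.

    periods: list of (label, [topic distribution per document]) in time order.
    threshold: assumed cut-off on the Jensen-Shannon divergence (hypothetical).
    """
    means = [(label, mean_distribution(docs)) for label, docs in periods]
    points = []
    for (_, prev), (label, cur) in zip(means, means[1:]):
        if js_divergence(prev, cur) > threshold:
            points.append(label)
    return points


# Toy data (made up for illustration): three topics, two documents per period.
periods = [
    ("2001", [[0.80, 0.10, 0.10], [0.70, 0.20, 0.10]]),
    ("2002", [[0.75, 0.15, 0.10], [0.80, 0.10, 0.10]]),
    ("2003", [[0.10, 0.80, 0.10], [0.20, 0.70, 0.10]]),
    ("2004", [[0.15, 0.75, 0.10], [0.10, 0.80, 0.10]]),
]

print(candidate_division_points(periods, threshold=0.1))
```

With this toy data the dominant topic shifts between the second and third periods, so only the third period is reported as a candidate division point. In practice the period length and the threshold would need to be chosen for the corpus at hand.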