Contextual correlation based thread detection in short text message streams

Journal of Intelligent Information Systems

Abstract

Short text message streams are produced by Instant Messaging and Short Message Service, both of which are widely used nowadays. Each stream usually contains more than one thread. Detecting threads in these streams is helpful to various applications, such as business intelligence, crime investigation and public opinion analysis. Existing works, which are mainly based on text similarity, encounter many challenges, including sparse feature vectors and the anomalies of short text messages. This paper introduces a novel concept of contextual correlation, in place of traditional text similarity, into the single-pass clustering algorithm to address the challenges of thread detection. We first analyze the contextually correlative nature of conversations in short text message streams, and then propose an unsupervised method to compute the degree of correlation. As a reference, a single-pass algorithm employing the contextual correlation is developed to detect threads in massive short text streams. Experiments on large real-life online chat logs show that our approach improves performance by 11% in terms of F1 measure when compared with the best similarity-based algorithm.
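To make the single-pass clustering idea concrete, here is a minimal sketch of the general technique: each incoming message is scored against every existing thread and either joins the best-scoring thread or starts a new one. The abstract does not say how the contextual correlation is computed, so the `jaccard_score` function, the `single_pass` signature and the `threshold` value below are all hypothetical. The score in particular is only a toy token-overlap placeholder (exactly the kind of text similarity the paper argues against), included solely so the example runs. It is not the authors' method.

```python
from typing import Callable, List

Thread = List[str]


def jaccard_score(message: str, thread: Thread) -> float:
    """Toy placeholder score (an assumption, not the paper's contextual
    correlation): Jaccard overlap between the message's tokens and all
    tokens already seen in the thread."""
    msg_tokens = set(message.lower().split())
    thread_tokens = set()
    for earlier in thread:
        thread_tokens.update(earlier.lower().split())
    union = msg_tokens | thread_tokens
    return len(msg_tokens & thread_tokens) / len(union) if union else 0.0


def single_pass(
    messages: List[str],
    score: Callable[[str, Thread], float] = jaccard_score,
    threshold: float = 0.2,  # hypothetical cut-off, chosen for the demo
) -> List[Thread]:
    """Generic single-pass clustering: read each message once, in order."""
    threads: List[Thread] = []
    for message in messages:
        best_thread, best_score = None, 0.0
        for thread in threads:
            s = score(message, thread)
            if s > best_score:
                best_thread, best_score = thread, s
        if best_thread is not None and best_score >= threshold:
            best_thread.append(message)   # join the best-matching thread
        else:
            threads.append([message])     # otherwise open a new thread
    return threads


stream = [
    "anyone up for lunch today",
    "the build server is down again",
    "lunch at noon works for me",
    "restarting the build server now",
]
threads = single_pass(stream)
for i, thread in enumerate(threads):
    print(i, thread)
```

Because the scoring function is passed in as a parameter, a correlation measure such as the one the paper proposes could be substituted for the placeholder without changing the clustering loop.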

Acknowledgements

This work is supported in part by the Key Project of the National Natural Science Foundation of China under Grant No. 60933005, the National Natural Science Foundation of China under Grant No. 60873204, and the National High-Tech Research and Development Plan of China under Grant No. 2001AA012505. The authors are grateful to the text mining team in Lab 613 of NUDT, which contributed a great deal to this paper.

Author information

Corresponding author

Correspondence to Jiuming Huang.

About this article

Cite this article

Huang, J., Zhou, B., Wu, Q. et al. Contextual correlation based thread detection in short text message streams. J Intell Inf Syst 38, 449–464 (2012). https://doi.org/10.1007/s10844-011-0162-7

  • DOI: https://doi.org/10.1007/s10844-011-0162-7
