Clustering Weblogs on the Basis of a Topic Detection Method

  • Fernando Perez-Tellez
  • David Pinto
  • John Cardiff
  • Paolo Rosso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6256)


In recent years, the volume of information published on weblog sites has grown enormously, alongside new web technologies through which people discuss current events. The need for automatic tools to organize this massive amount of information is clear, but particular characteristics of weblogs, such as the shortness of posts and their overlapping vocabulary, make this task difficult. In this work, we present a novel methodology for clustering weblog posts according to the topics they discuss. The methodology combines a generative probabilistic model with a Self-Term Expansion methodology. Our results demonstrate a considerable improvement over the baseline.


Keywords: Clustering; Weblogs; Topic Detection



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Fernando Perez-Tellez (1)
  • David Pinto (2)
  • John Cardiff (1)
  • Paolo Rosso (3)

  1. Social Media Research Group, Institute of Technology Tallaght, Dublin, Ireland
  2. Benemérita Universidad Autónoma de Puebla, Mexico
  3. Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Spain
