Fine-grained document clustering via ranking and its application to social media analytics

  • Taufik Sutanto
  • Richi Nayak
Original Article

Abstract

Extracting valuable insights from large volumes of unstructured data such as text through cluster analysis is paramount to many big data applications. However, document clustering is challenged by the computational complexity of the underlying methods and the high dimensionality of the data, especially when the number of required clusters is large. A fine-grained clustering solution is required to understand a data set that covers heterogeneous topics, such as social media data. This paper presents the Fine-Grained document Clustering via Ranking (FGCR) approach, which leverages a search engine's ability to handle big data efficiently. Ranking scores from a search engine are used to calculate dynamic cluster representations, called loci, in an unsupervised learning setting. Clustering decisions are made efficiently by selecting the best match from a small subset of loci instead of from the entire cluster set, as in conventional centroid-based clustering. A comprehensive empirical study on several social media data sets shows that FGCR is able to produce insightful and accurate fine-grained solutions. Moreover, it is orders of magnitude faster and requires fewer computational resources than other state-of-the-art document clustering approaches.
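The idea of deciding cluster membership from a few ranked search results, rather than from every cluster, can be illustrated with a minimal sketch. This is not the authors' FGCR implementation: the toy TF-IDF ranking, the use of neighbours' labels as a stand-in for loci, and all function names (`build_index`, `search`, `cluster_via_ranking`) are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query, top_k):
    """Toy search engine: return the top_k (doc_id, score) pairs by TF-IDF."""
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def cluster_via_ranking(docs, top_k=3):
    """Assign each document using only the clusters of its top-ranked hits.

    Assumption: the clusters that own the top-ranked neighbours play the
    role of the small candidate subset; no other cluster is ever compared.
    """
    index = build_index(docs)
    labels = [None] * len(docs)
    next_label = 0
    for doc_id, text in enumerate(docs):
        # Use the document itself as the query; ask for one extra hit
        # because the document will normally retrieve itself.
        hits = search(index, len(docs), text, top_k + 1)
        votes = defaultdict(float)
        for hit_id, score in hits:
            if hit_id != doc_id and labels[hit_id] is not None:
                votes[labels[hit_id]] += score
        if votes:
            # Pick the candidate cluster with the largest summed ranking score.
            labels[doc_id] = max(votes, key=votes.get)
        else:
            # No labelled neighbour was retrieved: open a new cluster.
            labels[doc_id] = next_label
            next_label += 1
    return labels


docs = [
    "flood warning river brisbane",
    "river flood brisbane evacuation",
    "football final score tonight",
    "tonight football final tickets",
]
print(cluster_via_ranking(docs))
```

In this sketch the cost of each assignment depends on `top_k` rather than on the total number of clusters, which is the property the abstract attributes to selecting from a small subset of loci.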

Keywords

Clustering, Fine-grained, Loci, Ranking, Social media analytics

Copyright information

© Springer-Verlag GmbH Austria, part of Springer Nature 2018

Authors and Affiliations

  1. Syarif Hidayatullah State Islamic University Jakarta, Jakarta, Indonesia
  2. Queensland University of Technology (QUT), Brisbane, Australia