Journal of Intelligent Information Systems, Volume 39, Issue 3, pp 577–610

DoSO: a document self-organizer

  • Gerasimos Spanakis
  • Georgios Siolas
  • Andreas Stafylopatis

Abstract

In this paper, we propose the Document Self Organizer (DoSO), an extension of the classic Self Organizing Map (SOM) model designed to handle document clustering more efficiently. We start from a document representation model that we previously developed to overcome some of the shortcomings of the Bag-of-Words (BOW) model; it is based on important “concepts” that exploit Wikipedia knowledge. We demonstrate how the SOM’s performance can be boosted by using the most important concepts of the document collection to explicitly initialize the neurons. We also show how a hierarchical approach can be used within the SOM model, and how this can lead to a more comprehensive final clustering with hierarchical descriptive labels attached to neurons and clusters. Experiments show that DoSO yields promising results in terms of both extrinsic and SOM evaluation measures.
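
The idea of explicitly initializing SOM neurons from the most important concepts can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how concept importance is scored or how neurons are seeded, so the summed-weight ranking, the one-hot seeding, the 1-D map, the decay schedule, and all function names below are assumptions made purely for illustration.

```python
import math

def top_concepts(docs, k):
    """Rank concepts by total weight over the collection (assumed importance score)."""
    totals = {}
    for doc in docs:
        for concept, weight in doc.items():
            totals[concept] = totals.get(concept, 0.0) + weight
    # Descending total weight; ties broken alphabetically for determinism.
    return sorted(totals, key=lambda c: (-totals[c], c))[:k]

def init_neurons(vocab, seeds):
    """One neuron per seed concept: weight 1.0 on that concept, 0.0 elsewhere."""
    return [[1.0 if concept == seed else 0.0 for concept in vocab] for seed in seeds]

def to_vector(doc, vocab):
    """Dense concept-weight vector for one document."""
    return [doc.get(concept, 0.0) for concept in vocab]

def best_matching_unit(neurons, x):
    """Index of the neuron closest to x in squared Euclidean distance."""
    dists = [sum((w - v) ** 2 for w, v in zip(neuron, x)) for neuron in neurons]
    return dists.index(min(dists))

def train_som(neurons, vectors, epochs=20, lr0=0.5, sigma0=0.5):
    """Online Kohonen updates on a 1-D map with a Gaussian neighbourhood."""
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # linear decay (assumed schedule)
        lr = lr0 * frac
        sigma = max(sigma0 * frac, 0.1)      # floor keeps the division safe
        for x in vectors:
            bmu = best_matching_unit(neurons, x)
            for j, neuron in enumerate(neurons):
                h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(len(neuron)):
                    neuron[d] += lr * h * (x[d] - neuron[d])
    return neurons

# Toy collection: each document is a mapping from concept to weight (made-up data).
docs = [
    {"neural network": 0.9, "clustering": 0.4},
    {"neural network": 0.8, "wikipedia": 0.1},
    {"football": 0.9, "league": 0.5},
    {"football": 0.7, "league": 0.6},
]
vocab = sorted({concept for doc in docs for concept in doc})
vectors = [to_vector(doc, vocab) for doc in docs]

seeds = top_concepts(docs, k=2)              # concepts used to seed the neurons
neurons = train_som(init_neurons(vocab, seeds), vectors)
assignments = [best_matching_unit(neurons, x) for x in vectors]

for doc, unit in zip(docs, assignments):
    print(sorted(doc), "-> neuron", unit, "seeded with", repr(seeds[unit]))
```

Because each neuron starts on a named concept rather than on random weights, the seed concept can double as a rough descriptive label for the documents that the neuron attracts, which is the intuition behind the labelled clusters mentioned above.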

Keywords

Document representation, Document clustering, SOM, Wikipedia

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Gerasimos Spanakis (1)
  • Georgios Siolas (1)
  • Andreas Stafylopatis (1)

  1. National Technical University of Athens, Athens, Greece
