A Methodology for Mining Document-Enriched Heterogeneous Information Networks

  • Miha Grčar
  • Nada Lavrač
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6926)


The paper presents a new methodology for mining heterogeneous information networks, motivated by the fact that, in many real-life scenarios, documents are available in heterogeneous information networks, such as interlinked multimedia objects containing titles, descriptions, and subtitles. The methodology consists of transforming documents into bag-of-words vectors, decomposing the corresponding heterogeneous network into separate graphs and computing structural-context feature vectors with PageRank, and finally constructing a common feature vector space in which knowledge discovery is performed. We exploit this feature vector construction process to devise an efficient classification algorithm. We demonstrate the approach by applying it to the task of categorizing video lectures. We show that our approach exhibits low time and space complexity without compromising classification accuracy.
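The pipeline described above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: the four "lectures", the single co-authorship graph, the equal weights, the square-root weighting used when concatenating, and all function names are hypothetical. The paper decomposes the network into several graphs, whereas the sketch uses one for brevity.

```python
import math
from collections import Counter

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def bow_vector(text, vocab):
    """Unit-length term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return normalize([float(counts[t]) for t in vocab])

def personalized_pagerank(adj, source, damping=0.85, iters=50):
    """Power iteration with restart at `source`; adj maps node -> neighbours."""
    nodes = sorted(adj)
    rank = {n: 1.0 if n == source else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            if adj[n]:
                share = damping * rank[n] / len(adj[n])
                for m in adj[n]:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the source.
                nxt[source] += damping * rank[n]
        nxt[source] += 1.0 - damping  # restart probability
        rank = nxt
    return normalize([rank[n] for n in nodes])

def combine(vectors, weights):
    """Concatenate feature vectors into one space (assumed sqrt weighting,
    so dot products become weighted sums of the per-part dot products)."""
    out = []
    for v, w in zip(vectors, weights):
        out.extend(math.sqrt(w) * x for x in v)
    return out

def centroid(vectors):
    """Normalized mean of a class's feature vectors."""
    dim = len(vectors[0])
    return normalize([sum(v[i] for v in vectors) / len(vectors) for i in range(dim)])

def classify(x, centroids):
    """Assign the label whose centroid has the largest dot product with x."""
    return max(centroids, key=lambda c: sum(a * b for a, b in zip(x, centroids[c])))

# Hypothetical toy data: lecture descriptions and co-authorship links.
docs = {
    "a": "kernel methods svm",
    "b": "svm kernel learning",
    "c": "protein gene expression",
    "d": "gene protein folding",
}
graph = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}

vocab = sorted({w for t in docs.values() for w in t.split()})
features = {
    n: combine([bow_vector(docs[n], vocab), personalized_pagerank(graph, n)], [0.5, 0.5])
    for n in docs
}

# Train on one labelled lecture per class, then classify the other two.
centroids = {"ml": centroid([features["a"]]), "bio": centroid([features["c"]])}
predicted_b = classify(features["b"], centroids)
predicted_d = classify(features["d"], centroids)
print(predicted_b, predicted_d)
```

Because every object ends up as one vector in a shared space, any vector-based learner could replace the centroid classifier; the centroid variant is used here only because the keywords name it.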


Keywords: text mining, heterogeneous information networks, data fusion, classification, centroid-based classifier, diffusion kernels





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Miha Grčar (1)
  • Nada Lavrač (1)
  1. Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
