On a Novel Representation of Multiple Textual Documents in a Single Graph

  • Nikolaos Giarelis
  • Nikos Kanakaris
  • Nikos Karacapilidis
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 193)

Abstract

This paper introduces a novel approach to representing multiple documents as a single graph, namely the graph-of-docs model, together with an associated novel algorithm for text categorization. The proposed approach makes it possible to assess the importance of a term across a whole corpus of documents and supports the inclusion of relationship edges between documents, thus enabling the calculation of important document-level metrics. Compared to well-established existing solutions, our initial experiments demonstrate a significant improvement in the accuracy of the text categorization process. For the experiments reported in this paper, we used a well-known dataset containing about 19,000 documents organized in various subjects.
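
As a rough illustration of the idea of placing several documents in one graph, the following minimal sketch may help. It is not the algorithm defined in the paper: the abstract does not specify how edges are built, so the sliding-window co-occurrence rule, the Jaccard overlap used for document-to-document edges, and the names build_graph_of_docs and term_degree are all assumptions made here for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_graph_of_docs(docs, window=2):
    """Merge all documents into one graph (illustrative assumptions only).

    docs: dict mapping a document id to its list of tokens.
    Returns (term_edges, doc_terms, doc_edges).
    """
    term_edges = defaultdict(int)  # (term_a, term_b) -> co-occurrence count over the corpus
    doc_terms = {}                 # document id -> set of terms it contains

    for doc_id, tokens in docs.items():
        doc_terms[doc_id] = set(tokens)
        for i, term in enumerate(tokens):
            # Assumed rule: link each term to the next `window` terms after it.
            for other in tokens[i + 1 : i + 1 + window]:
                if other != term:
                    term_edges[tuple(sorted((term, other)))] += 1

    # Assumed document-to-document relationship: Jaccard overlap of term sets.
    doc_edges = {}
    for a, b in combinations(sorted(doc_terms), 2):
        shared = doc_terms[a] & doc_terms[b]
        if shared:
            union = doc_terms[a] | doc_terms[b]
            doc_edges[(a, b)] = len(shared) / len(union)

    return dict(term_edges), doc_terms, doc_edges


def term_degree(term_edges):
    """Weighted degree of each term, computed over the whole corpus graph."""
    degree = defaultdict(int)
    for (a, b), weight in term_edges.items():
        degree[a] += weight
        degree[b] += weight
    return dict(degree)


# Toy corpus (made up for this example).
docs = {
    "d1": ["graph", "model", "text"],
    "d2": ["graph", "text", "mining"],
    "d3": ["music", "theory"],
}
term_edges, doc_terms, doc_edges = build_graph_of_docs(docs)
degree = term_degree(term_edges)
```

Because term edges are accumulated over all documents, a measure such as the weighted degree reflects a term's importance in the corpus rather than in one document, and the document-to-document edges give a direct handle on which documents are related.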

Keywords

Natural language processing · Text categorization · Document clustering · Document representation

Notes

Acknowledgements

The work presented in this paper is supported by the OpenBio-C project (www.openbio.eu), which is co-financed by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH—CREATE—INNOVATE (Project id: T1EDK-05275).

Copyright information

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. Industrial Management and Information Systems Lab, MEAD, University of Patras, Rio Patras, Greece
