Semantic Marking Method for Non-text Documents of Website Based on Their Context in Hypertext Clustering

  • Sergey Papshev
  • Alexander Sytnik
  • Nina Melnikova
  • Alexey Bogomolov
Conference paper
Part of the Studies in Systems, Decision and Control book series (SSDC, volume 199)

Abstract

Initial indexing and structuring of information on the Internet are prerequisites for effective search, that is, for finding the information that best matches a user’s query. These tasks are mostly handled by text-based processing methods, which are time-consuming. The hyperlinked structure of the web offers an alternative approach, but websites also contain information in non-text formats (images, movies, PDF files, etc.). Such documents are intended primarily for human perception rather than for automated processing. In this article, we propose a method that addresses this problem by semantically marking non-text documents based on their context in a hypertext clustering. We also develop an approach to context-independent semantic clustering of a website using web-analytics information; it relies on the internal hypertext structure and user behavior statistics and does not require full-text content analysis. To this end, we represent the hypertext structure of the site as a graph and apply flow simulation algorithms to cluster its pages. We then describe each cluster semantically by a set of keywords. Non-text documents are connected by hyperlinks to some of the web clusters, so we take the keywords extracted for the related cluster as the document’s semantic marking. We tested the proposed method on the website sstu.ru.
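The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names only "flow simulation algorithms", so a Markov-clustering-style expansion and inflation loop is assumed here, links are treated as undirected for simplicity, and per-page keywords are assumed to be available from some lightweight source such as page titles. All function names, page names, weights and parameters are hypothetical.

```python
from collections import Counter


def mcl(weights, inflation=2.0, iterations=30, eps=1e-6):
    """Markov-clustering-style flow simulation (an assumed algorithm choice).

    weights: {(src_page, dst_page): weight}; the weight could be a click
    count taken from web analytics. Returns a list of frozensets of pages.
    """
    nodes = sorted({page for edge in weights for page in edge})
    idx = {page: i for i, page in enumerate(nodes)}
    size = len(nodes)
    m = [[0.0] * size for _ in range(size)]
    for (src, dst), w in weights.items():
        # Simplification: links are symmetrized so flow can travel both ways.
        m[idx[dst]][idx[src]] += w
        m[idx[src]][idx[dst]] += w
    for i in range(size):
        m[i][i] += 1.0  # self-loops, commonly added to stabilize the iteration

    def normalize(mat):
        # Scale every column so that it sums to 1 (column-stochastic matrix).
        for j in range(size):
            total = sum(mat[i][j] for i in range(size))
            for i in range(size):
                mat[i][j] /= total
        return mat

    m = normalize(m)
    for _ in range(iterations):
        # Expansion: square the matrix, which spreads flow along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(size)) for j in range(size)]
             for i in range(size)]
        # Inflation: element-wise power, which strengthens strong flows.
        m = normalize([[x ** inflation for x in row] for row in m])

    clusters = []
    for i in range(size):
        # Each row that still carries flow lists the pages of one cluster.
        members = frozenset(nodes[j] for j in range(size) if m[i][j] > eps)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


def cluster_keywords(cluster, page_keywords, top_n=3):
    """Describe a cluster by its most frequent page keywords."""
    counts = Counter(kw for page in sorted(cluster)
                     for kw in page_keywords.get(page, []))
    return [kw for kw, _ in counts.most_common(top_n)]


def mark_documents(doc_links, clusters, page_keywords, top_n=3):
    """Give each non-text document the keywords of the clusters linking to it.

    doc_links: {document: [pages that contain a hyperlink to the document]}.
    """
    marking = {}
    for doc, pages in doc_links.items():
        keywords = []
        for cluster in clusters:
            if cluster & set(pages):
                for kw in cluster_keywords(cluster, page_keywords, top_n):
                    if kw not in keywords:
                        keywords.append(kw)
        marking[doc] = keywords
    return marking


if __name__ == "__main__":
    # Hypothetical site fragment: weights stand in for observed transitions.
    links = {("/admission", "/exams"): 40, ("/exams", "/fees"): 25,
             ("/admission", "/fees"): 30, ("/fees", "/labs"): 1,
             ("/labs", "/robotics"): 35, ("/robotics", "/grants"): 20,
             ("/labs", "/grants"): 15}
    keywords = {"/admission": ["admission"], "/exams": ["admission", "exam"],
                "/fees": ["admission", "fee"], "/labs": ["research", "lab"],
                "/robotics": ["research", "robot"], "/grants": ["research"]}
    site_clusters = mcl(links)
    print(mark_documents({"plan.pdf": ["/exams"], "arm.jpg": ["/robotics"]},
                         site_clusters, keywords))
```

The inflation exponent controls granularity in this kind of scheme: larger values tend to produce more, smaller clusters. A document linked from pages in several clusters receives the keywords of all of them, which is one possible reading of "the related cluster" when the context is ambiguous.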

Keywords

Semantic marking, Hypertext clustering, Graph, Non-text document

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Yuri Gagarin State Technical University of Saratov, Saratov, Russia
  2. Institute of Precision Mechanics and Control, Russian Academy of Sciences, Saratov, Russia
