Pattern Recognition and Image Analysis

Volume 28, Issue 4, pp 771–782

Relevance of a Set of Topical Texts to a Knowledge Unit and the Estimation of the Closeness of Linguistic Forms of Its Expression to a Semantic Pattern

  • G. M. Emelyanov
  • D. V. Mikhailov
  • A. P. Kozlov
Applied Problems


This paper analyzes two interrelated problems in the completeness of knowledge extraction from a set (corpus) of subject-oriented texts: estimating the relevance of the corpus to a source phrase, and searching for the most rational linguistic variant for describing a selected knowledge fragment. These problems are topical for constructing systems of information processing, analysis, estimation, and understanding. The image components of a source phrase are extracted on the basis of a joint estimation of the coupling strength of its word combinations encountered in the phrases of the analyzed text, together with the splitting of these words into classes by the value of the TF-IDF metric relative to the corpus texts. The relevance of a text corpus to a source knowledge unit is introduced as the degree to which the words of the source phrase are covered by the most relevant sets of relations, relative to the documents in which its image components are represented most fully; for this purpose, word relations are expanded to three or more elements (both with and without a base of known syntactic relations). This estimation is proposed for the targeted selection of text-corpus phrases that are either mutually equivalent or semantically complementary and that represent the same image. To rank the selected phrases by their closeness to a semantic pattern (i.e., a sense standard), three alternative estimations are introduced: one based on splitting the source-phrase words into classes by the value of the TF-IDF metric, and two based on the numerical estimation of their binding strength (with and without prepositions and conjunctions taken into account). In addition, the text information necessary to represent a selected knowledge unit is compressed by at least a factor of two while preserving its meaning.
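As a rough illustration of the TF-IDF weighting mentioned in the abstract, the sketch below scores the words of a source phrase against one document of a small corpus. It uses the textbook definition (term frequency times the logarithm of inverse document frequency); the exact TF-IDF variant and the class-splitting procedure used by the authors are not given here, so the function names `tf_idf` and `phrase_word_scores` and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(word, doc_tokens, corpus):
    """Textbook TF-IDF of `word` in one tokenized document.

    Assumption: tf = count / document length, idf = ln(N / df).
    The paper's exact variant may differ.
    """
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    # Number of corpus documents that contain the word at least once.
    df = sum(1 for doc in corpus if word in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def phrase_word_scores(phrase_tokens, doc_tokens, corpus):
    """Map each source-phrase word to its TF-IDF in the given document."""
    return {w: tf_idf(w, doc_tokens, corpus) for w in phrase_tokens}


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["tf", "idf", "weights", "words", "in", "a", "text", "corpus"],
    ["a", "text", "corpus", "is", "a", "set", "of", "documents"],
    ["relevance", "of", "a", "corpus", "to", "a", "source", "phrase"],
]
scores = phrase_word_scores(["source", "phrase", "corpus"], corpus[2], corpus)
for word, score in scores.items():
    print(f"{word}: {score:.4f}")
```

In this toy corpus, a word that occurs in every document (such as "corpus") receives a weight of zero, while words specific to one document receive positive weights. Splitting the phrase words into classes would then operate on these values; that step is deliberately left out because its details are not stated in the abstract.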


pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge, contextual annotation, document ranking in information retrieval



Copyright information

© Pleiades Publishing, Ltd. 2018

Authors and Affiliations

  • G. M. Emelyanov (1, corresponding author)
  • D. V. Mikhailov (1)
  • A. P. Kozlov (1)
  1. Novgorod State University, Velikii Novgorod, Russia
