Relevance of a Set of Topical Texts to a Knowledge Unit and the Estimation of the Closeness of Linguistic Forms of Its Expression to a Semantic Pattern
Two interrelated problems are analyzed: the completeness of knowledge extraction from a set (corpus) of subject-oriented texts, assessed through the relevance of the corpus to a source phrase, and the search for the most rational linguistic variant for describing a selected knowledge fragment. Both problems are relevant to the construction of systems for information processing, analysis, estimation, and understanding. The image components of a source phrase are extracted on the basis of two jointly applied estimates: the coupling strength of those of its word combinations that occur in the phrases of the analyzed text, and the splitting of these words into classes by the value of the TF-IDF metric relative to the corpus texts. The relevance of a text corpus to a source knowledge unit is introduced as the degree to which the words of the source phrase are covered by the most relevant sets of relations, taken relative to the documents in which its image components are represented most fully. To this end, word relations are expanded to three or more elements, both with and without a base of known syntactic relations. This estimate is proposed for the targeted selection of text-corpus phrases that represent the same image and are either mutually equivalent or semantically complementary. To rank the selected phrases by their closeness to a semantic pattern (i.e., a sense standard), three alternative estimates are introduced: one based on splitting the source-phrase words into classes by the value of the TF-IDF metric, and two based on the numerical estimate of their binding strength, computed with and without prepositions and conjunctions. In addition, the text information needed to represent a selected knowledge unit is compressed by a factor of at least two while preserving its meaning.
Keywords: pattern recognition, intelligent data analysis, information theory, open-form test assignment, natural-language expression of expert knowledge, contextual annotation, document ranking in information retrieval
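The abstract mentions splitting the words of a source phrase into classes by the value of the TF-IDF metric relative to the corpus texts, without giving the procedure. The following is a minimal sketch of that idea only, not the authors' method. It assumes the classic TF-IDF definition (relative term frequency times the logarithm of the inverse document frequency) and, as a stand-in for whatever class-splitting rule the paper actually uses, simple equal-width binning into three classes. The function names `tf_idf`, `phrase_scores`, and `split_into_classes`, and the toy corpus, are hypothetical.

```python
import math
from collections import Counter


def tf_idf(word, doc_tokens, corpus_tokens):
    """Classic TF-IDF of `word` in one document relative to the corpus.

    Assumption: tf = count / document length, idf = ln(N / df).
    """
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for doc in corpus_tokens if word in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus_tokens) / df)


def phrase_scores(phrase_tokens, doc_tokens, corpus_tokens):
    """TF-IDF of every distinct source-phrase word for one document."""
    return {w: tf_idf(w, doc_tokens, corpus_tokens) for w in set(phrase_tokens)}


def split_into_classes(scores, n_classes=3):
    """Assign each word a class index; class 0 holds the highest scores.

    Assumption: equal-width bins over the observed score range. This is
    an illustrative placeholder for the paper's own splitting rule.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    width = (hi - lo) / n_classes
    classes = {}
    for word, score in scores.items():
        if width == 0:
            classes[word] = 0  # all scores equal: a single class
        else:
            classes[word] = min(int((hi - score) / width), n_classes - 1)
    return classes


# Toy, already tokenized corpus (hypothetical data).
corpus = [
    ["tf", "idf", "metric", "text"],
    ["text", "corpus", "phrase"],
    ["text", "metric"],
]
source_phrase = ["idf", "metric", "text"]

scores = phrase_scores(source_phrase, corpus[0], corpus)
classes = split_into_classes(scores)
```

In this toy corpus the word that occurs in every document receives a zero score and falls into the lowest class, while the word unique to the first document falls into the highest one. A real pipeline would also need lemmatization and the coupling-strength estimate, both of which are outside this sketch.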