
Interpretation of text patterns


Abstract

Patterns are used as a fundamental means to analyse data in many text mining applications. Many efficient techniques have been developed to discover patterns. However, the excessive number of discovered patterns and lack of grounded (e.g. a priori defined) semantics have made it difficult for a user to interpret and explore the patterns. An insight into the meanings of the patterns can benefit users in the process of exploring them. In this regard, this paper presents a model to automatically interpret patterns by achieving two goals: (1) providing the meanings of patterns in terms of ontology concepts and (2) providing a new method for generating and extracting features from an ontology to describe the relevant information more effectively. Taking advantage of a domain ontology and a set of relevant statistics (e.g. term frequency in a document, inverse term frequency in a domain ontology, etc.), our proposed model can give an insight into the hidden meanings of the patterns. The model is evaluated by comparing it with different baseline models on three standard datasets. The results show that the performance of the proposed model is significantly better than baseline models.




Acknowledgements

This paper was partially supported by Grant DP140103157 from the Australian Research Council (ARC Discovery Project). We thank Dr Yan Shen and Dr Yang Gao for their constructive comments and their support with the experiments, and the anonymous reviewers for their valuable comments.

Author information


Corresponding author

Correspondence to Yuefeng Li.

Additional information

Responsible editor: Hendrik Blockeel.

Appendix A: Description of evaluation measures

For a given topic, recall is the fraction of relevant documents that are retrieved, i.e.

$$\begin{aligned} r_{c} = \frac{\left| \left\{ rl \right\} \cap \left\{ rt \right\} \right| }{\left| \left\{ rl \right\} \right| }; \end{aligned}$$

precision is the fraction of retrieved documents that are relevant, i.e.

$$\begin{aligned} p_{c} = \frac{\left| \left\{ rl \right\} \cap \left\{ rt \right\} \right| }{\left| \left\{ rt \right\} \right| }; \end{aligned}$$

where \(\{rl\}\) is the set of relevant documents and \(\{rt\}\) is the set of retrieved documents.

We want both precision and recall to be high, rather than one being high while the other is low. The \(F_{score}\) measures this property and is defined by the following formula:

$$\begin{aligned} F_{score} = (1+\sigma ^{2})\frac{p_{c}\times r_{c}}{\sigma ^{2}p_{c}+r_{c}}. \end{aligned}$$

The parameter \(\sigma \) is a user-defined value that reflects the relative importance of false negatives versus false positives; it is conventionally set to 1, in which case the measure is called \(F_{1}\). The \(F_{score}\) is the harmonic mean of recall and precision. Because the harmonic mean tends to be closer to the smaller of the two values, \(F_{score}\) is high only when both recall and precision are high. The break-even point (BP) is the value at which recall and precision are equal.
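As a concrete illustration, the following minimal Python sketch computes recall, precision and \(F_{score}\) for a single topic; the document identifiers and judgements are hypothetical, not data from the paper.

```python
# Minimal sketch of the set-based measures above (hypothetical document sets).
relevant = {"d1", "d3", "d5", "d8"}      # {rl}: documents judged relevant for a topic
retrieved = {"d1", "d2", "d3", "d9"}     # {rt}: documents retrieved by the system

hits = relevant & retrieved
recall = len(hits) / len(relevant)       # r_c
precision = len(hits) / len(retrieved)   # p_c

sigma = 1.0                              # conventional setting, giving F_1
f_score = (1 + sigma**2) * precision * recall / (sigma**2 * precision + recall)

print(recall, precision, f_score)        # 0.5 0.5 0.5
```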

Precision, recall, the F measure and the break-even point are set-based measures, computed over unordered sets of documents. We can also use measures that evaluate ranked (ordered) document lists, which are now standard in information filtering systems. All retrieved documents are taken into account when computing precision; in a ranked context, however, precision can be evaluated at a given cutoff that considers only the topmost results returned by the system. This measure is called top-\(u\) precision; in our evaluation we use top-20 precision.
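A minimal sketch of top-\(u\) precision on a ranked list, again with hypothetical documents and judgements:

```python
# Sketch of top-u precision: precision computed over only the first u returned documents.
ranked = ["d1", "d2", "d3", "d9", "d5"]   # hypothetical ranking returned by the system
relevant = {"d1", "d3", "d5", "d8"}       # hypothetical relevance judgements

def top_u_precision(ranked, relevant, u):
    top = ranked[:u]
    return sum(1 for d in top if d in relevant) / u

print(top_u_precision(ranked, relevant, 3))   # 2 of the top 3 are relevant -> 0.667
```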

For each of the top \(u\) returned documents, precision and recall values can be plotted on a precision-recall curve. If the \((u+1)\)th document returned by the system is non-relevant, recall is the same as for the top \(u\) documents but precision drops; if the \((u+1)\)th document is relevant, both precision and recall increase. The resulting curve has a saw-tooth shape, and the jiggles are often removed by using interpolated precision. The interpolated precision \(p_{int}\) at a recall level \(r_c\) is defined as the highest precision found at any recall level \(r'_c \ge r_c\), i.e. \(p_{int} (r_c) = \max _{r'_c \ge r_c} p_c(r'_c)\), where \(p_c(r'_c)\) is the precision at recall level \(r'_c\). By definition, the interpolated precision at a recall of 0 is 1.
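The interpolation step can be sketched as follows, using hypothetical (recall, precision) points read off a ranked list:

```python
# Sketch of interpolated precision: p_int(r_c) = max precision at any recall level r'_c >= r_c.
points = [(0.25, 1.0), (0.5, 0.67), (0.5, 0.4), (0.75, 0.6), (1.0, 0.5)]  # hypothetical (recall, precision) pairs

def interpolated_precision(points, r_c):
    """Highest precision found at any recall level r' >= r_c."""
    return max(p for r, p in points if r >= r_c)

for r in (0.0, 0.5, 1.0):
    print(r, interpolated_precision(points, r))   # 1.0, 0.67, 0.5
```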

Mean Average Precision (MAP) is the mean of average precision (AP) values over a set of topics. Suppose we are given a set of topics and, for each topic, a list of documents sorted by their relevance to that topic. The AP for a filtering system that returns \(u\) documents sorted according to their relevance to a topic is:

$$\begin{aligned} AP = \frac{\sum _{i=1}^{u} (p_{c_i})\times (r_{v_i})}{\left| \left\{ rl \right\} \right| }, \end{aligned}$$

where \(p_{c_i}\) is \(p_c\) at the \(i\)th position and \(r_{v_i}\) is the relevance value (i.e. 0 or 1) of the document at the \(i\)th position of the sorted list. MAP is the average of the APs over all topics. It is commonly used by TREC participants and indicates precision in a way that takes the ranking order into account.
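A minimal sketch of AP and MAP following the formula above, with hypothetical rankings and relevance judgements:

```python
# Sketch of AP (per topic) and MAP (mean over topics), following the formula above.
def average_precision(ranked, relevant):
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        r_v = 1 if doc in relevant else 0   # relevance value r_{v_i} at position i
        hits += r_v
        total += (hits / i) * r_v           # p_{c_i} * r_{v_i}
    return total / len(relevant)            # divide by |{rl}|

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [(["d1", "d2", "d3"], {"d1", "d3"}),   # hypothetical topics
        (["d4", "d5"], {"d5", "d6"})]
print(mean_average_precision(runs))           # ~0.542
```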

A \(t\)-test is a parametric statistical hypothesis test, while the Wilcoxon signed-rank test is a non-parametric one. Both are used to determine whether two sets of data differ significantly. Unlike parametric statistics, non-parametric statistics do not assume any specific probability distribution for the variables being assessed. We apply both tests to analyse the statistical significance of the differences between the results of our proposed model and the best baseline model, for the measures top-20 precision, \(F_{1}\), BP and MAP.
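Such paired significance testing can be sketched with SciPy as follows; the per-topic scores here are hypothetical placeholders, not results from the paper:

```python
# Sketch of paired significance tests over per-topic scores (hypothetical values).
from scipy import stats

proposed = [0.71, 0.65, 0.80, 0.62, 0.75, 0.69, 0.73, 0.66]   # e.g. per-topic MAP of a proposed model
baseline = [0.64, 0.60, 0.77, 0.58, 0.70, 0.66, 0.69, 0.61]   # per-topic MAP of a baseline model

t_stat, t_p = stats.ttest_rel(proposed, baseline)   # paired t-test (parametric)
w_stat, w_p = stats.wilcoxon(proposed, baseline)    # Wilcoxon signed-rank test (non-parametric)

print(f"paired t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")
```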


Cite this article

Bashar, M.A., Li, Y. Interpretation of text patterns. Data Min Knowl Disc 32, 849–884 (2018). https://doi.org/10.1007/s10618-018-0556-z
