Cost-effective conceptual design using taxonomies
Abstract
Annotating entities in unstructured and semi-structured datasets with their concepts is known to improve the effectiveness of answering queries over these datasets. Ideally, one would annotate the entities of every relevant concept in a dataset. However, annotating concepts in large datasets takes substantial time and computational resources, and an organization may have sufficient resources to annotate only a subset of the relevant concepts. Such an organization would want to annotate the subset of concepts that provides the most effective answers to queries over the dataset. We propose a formal framework that quantifies how much annotating the entities of concepts from a taxonomy improves the effectiveness of answering queries over a dataset. Because the problem of selecting such a subset is \(\mathbf {NP}\)-hard, we propose efficient approximation and pseudo-polynomial time algorithms for several cases of the problem. Our extensive empirical studies validate the framework and demonstrate the accuracy and efficiency of our algorithms.
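To make the notion of a pseudo-polynomial time algorithm concrete, the following is a minimal sketch of budgeted concept selection under strong simplifying assumptions that are not taken from the paper: each concept has a positive integer annotation cost, the benefits of concepts are independent and additive, and the taxonomy structure is ignored. Under those assumptions the problem reduces to the classic 0/1 knapsack problem, and the sketch is the textbook dynamic program for it, not the algorithm proposed in this work. The function name `select_concepts` and its parameters are hypothetical.

```python
from typing import List, Tuple


def select_concepts(
    costs: List[int], benefits: List[float], budget: int
) -> Tuple[float, List[int]]:
    """Pick concept indices maximizing total benefit within the budget.

    Assumptions (illustrative only): costs are positive integers and
    benefits are additive across concepts. Runs in O(n * budget) time,
    which is pseudo-polynomial because it depends on the numeric value
    of the budget rather than on its encoding length.
    """
    n = len(costs)
    # best[b] = maximum benefit achievable with total cost at most b
    best = [0.0] * (budget + 1)
    # take[i][b] = True if concept i improved best[b] when it was considered
    take = [[False] * (budget + 1) for _ in range(n)]

    for i in range(n):
        # Iterate budgets downward so each concept is used at most once.
        for b in range(budget, costs[i] - 1, -1):
            candidate = best[b - costs[i]] + benefits[i]
            if candidate > best[b]:
                best[b] = candidate
                take[i][b] = True

    # Backtrack to recover which concepts were chosen.
    chosen: List[int] = []
    b = budget
    for i in range(n - 1, -1, -1):
        if take[i][b]:
            chosen.append(i)
            b -= costs[i]
    chosen.reverse()
    return best[budget], chosen


if __name__ == "__main__":
    # Three hypothetical concepts with made-up costs and benefits.
    value, picked = select_concepts([3, 4, 2], [4.0, 5.0, 3.0], budget=6)
    print(value, picked)
```

With the made-up inputs above, the concepts at indices 1 and 2 together cost exactly 6 and yield the largest total benefit within that budget. Real benefits would depend on the query workload and on how concepts relate in the taxonomy, which this sketch deliberately leaves out.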
Keywords
Concept annotation, Information extraction, Data curation, Taxonomies, Cost-effective data curation
Acknowledgements
This study was funded by the National Science Foundation under Grant Nos. IIS-1421247, CCF-0938071, CCF-0938064 and CNS-0716532.