The VLDB Journal, Volume 27, Issue 3, pp 369–394

Cost-effective conceptual design using taxonomies

  • Yodsawalai Chodpathumwan
  • Ali Vakilian
  • Arash Termehchy
  • Amir Nayyeri
Regular Paper


Annotating entities in unstructured and semi-structured datasets with their concepts is known to improve the effectiveness of answering queries over these datasets. Ideally, one would like to annotate the entities of all relevant concepts in a dataset. However, annotating concepts in large datasets takes substantial time and computational resources, and an organization may have sufficient resources to annotate only a subset of the relevant concepts. The organization would then like to annotate the subset of concepts that provides the most effective answers to queries over the dataset. We propose a formal framework that quantifies the amount by which annotating entities of concepts from a taxonomy in a dataset improves the effectiveness of answering queries over the dataset. Because the problem is \(\mathbf{NP}\)-hard, we propose efficient approximation and pseudo-polynomial time algorithms for several cases of the problem. Our extensive empirical studies validate our framework and show the accuracy and efficiency of our algorithms.


Keywords: Concept annotation, Information extraction, Data curation, Taxonomies, Cost-effective data curation



This study was funded by the National Science Foundation under Grant Nos. IIS-1421247, CCF-0938071, CCF-0938064, and CNS-0716532.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  • Yodsawalai Chodpathumwan (1)
  • Ali Vakilian (2)
  • Arash Termehchy (3)
  • Amir Nayyeri (3)
  1. University of Illinois at Urbana-Champaign, Urbana, USA
  2. Massachusetts Institute of Technology, Cambridge, USA
  3. Oregon State University, Corvallis, USA
