Cost-effective conceptual design using taxonomies

  • Regular Paper
The VLDB Journal

Abstract

It is known that annotating entities in unstructured and semi-structured datasets with their concepts improves the effectiveness of answering queries over these datasets. Ideally, one would like to annotate the entities of all relevant concepts in a dataset. However, annotating concepts in large datasets takes substantial time and computational resources, and an organization may have sufficient resources to annotate only a subset of the relevant concepts. Such an organization would want to annotate the subset of concepts that provides the most effective answers to queries over the dataset. We propose a formal framework that quantifies the amount by which annotating entities of concepts from a taxonomy in a dataset improves the effectiveness of answering queries over the dataset. Because the problem is \(\mathbf {NP}\)-hard, we propose efficient approximation and pseudo-polynomial time algorithms for several cases of the problem. Our extensive empirical studies validate our framework and demonstrate the accuracy and efficiency of our algorithms.

References

  1. Abiteboul, S., Manolescu, I., Rigaux, P., Rousset, M., Senellart, P.: Web Data Management. Cambridge University Press, Cambridge (2011)

  2. Anderson, M., Cafarella, M., Jiang, Y., Wang, G., Zhang, B.: An integrated development environment for faster feature engineering. PVLDB 7(13), 1657–1660 (2014)

  3. Anderson, M., et al.: Brainwash: a data system for feature engineering. In: CIDR (2013)

  4. Arora, S., Manokaran, R., Moshkovitz, D., Weinstein, O.: Inapproximability of densest \(\kappa \)-subgraph from average-case hardness. people.csail.mit.edu/dmoshkov/papers (2011)

  5. Arulselvan, A.: A note on the set union knapsack problem. Discrete Appl. Math. 169, 214–218 (2014)

  6. Bhaskara, A., Charikar, M., Vijayaraghavan, A., Guruswami, V., Zhou, Y.: Polynomial integrality gaps for strong SDP relaxations of densest K-subgraph. In: SODA, pp. 388–405 (2012)

  7. Boehm, B., et al.: Software development cost estimation approaches: a survey. Ann. Softw. Eng. 10(1–4), 177–205 (2000)

  8. Chakrabarti, S., Puniyani, K., Das, S.: Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In: WWW, pp. 717–726 (2007)

  9. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. TKDE 18, 1411–1428 (2006)

  10. Chiticariu, L., Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F.R., Vaithyanathan, S.: SystemT: an algebraic approach to declarative information extraction. In: ACL, pp. 128–137 (2010a)

  11. Chiticariu, L., Li, Y., Raghavan, S., Reiss, F.R.: Enterprise information extraction: recent developments and open challenges. In: SIGMOD, pp. 1257–1258 (2010b)

  12. Chiticariu, L., Li, Y., Reiss, F.R.: Rule-based information extraction is dead! long live rule-based information extraction systems! In: EMNLP, pp. 827–832 (2013)

  13. Chu-Carroll, J., et al.: Semantic Search via XML fragments: a high-precision approach to IR. In: SIGIR, pp. 445–452 (2006)

  14. Demidova, E., Zhou, X., Oelze, I., Nejdl, W.: Evaluating evidences for keyword query disambiguation in entity centric database search. In: DEXA, pp. 240–247 (2010)

  15. Deshpande, O., Lamba, D., Tourn, M., Das, S., Subramaniam, S., Rajaraman, A., Harinarayan, V., Doan, A.: Building, maintaining, and using knowledge bases: a report from the trenches. In: SIGMOD, pp. 1209–1220 (2013)

  16. Dill, S., et al.: SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation. In: WWW, pp. 178–186 (2003)

  17. Doan, A., Ramakrishnan, R., Vaithyanathan, S.: Managing information extraction: state of the art and research directions. In: SIGMOD, pp. 799–800 (2006)

  18. Doan, A., Naughton, J.F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F., Chen, T., Chu, E., DeRose, P., Gao, B.J., Gokhale, C., Huang, J., Shen, W., Vuong, B.: Information extraction challenges in managing unstructured data. SIGMOD Rec. 37(4), 14–20 (2008)

  19. Dong, X.L., Saha, B., Srivastava, D.: Less is more: selecting sources wisely for integration. PVLDB 6(2), 37–48 (2013)

  20. Downey, D., Etzioni, O., Soderland, S.: A probabilistic model of redundancy in information extraction. In: IJCAI (2005)

  21. Fagin, R., Kimelfeld, B., Reiss, F., Vansummeren, S.: Spanners: a formal framework for information extraction. In: PODS, pp. 37–48 (2013)

  22. Furche, T., Guo, J., Maneth, S., Schallhart, C.: Robust and noise resistant wrapper induction. In: SIGMOD, pp. 773–784 (2016)

  23. Garcia-Molina, H., Ullman, J., Widom, J.: Database Systems: The Complete Book. Prentice Hall, Upper Saddle River (2008)

  24. Gulhane, P., et al.: Web-scale information extraction with vertex. In: ICDE, pp. 1209–1220 (2011)

  25. Gupta, S., Manning, C.D.: Improved pattern learning for bootstrapped entity extraction. In: CoNLL, pp. 98–108 (2014)

  26. Gupta, S., MacLean, D.L., Heer, J., Manning, C.D.: Research and applications: induced lexico-syntactic patterns improve information extraction from online medical forums. JAMIA 21(5), 902–909 (2014)

  27. Hua, W., Wang, Z., Wang, H., Zheng, K., Zhou, X.: Short text understanding through lexical-semantic analysis. In: ICDE, pp. 495–506 (2015)

  28. Huang, J., Yu, C.: Prioritization of domain-specific web information extraction. In: AAAI (2010)

  29. Isozaki, H., Kazawa, H.: Efficient support vector classifiers for named entity recognition. In: COLING, pp. 1–7 (2002)

  30. Jain, A., Doan, A., Gravano, L.: Optimizing SQL queries over text databases. In: ICDE (2008)

  31. Kanani, P., et al.: Selecting actions for resource-bounded information extraction using reinforcement learning. In: WSDM, pp. 253–262 (2012)

  32. Khot, S.: Ruling out PTAS for graph min-bisection, densest subgraph and bipartite clique. In: FOCS, pp. 136–145 (2004)

  33. Kimelfeld, B.: Database principles in information extraction. In: PODS, pp. 156–163 (2014)

  34. Liu, K., et al.: MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence. Bioinformatics 31(13), i339–i347 (2015)

  35. Manning, C.D., Raghavan, P., Schütze, H., et al.: An Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

  36. Manurangsi, P.: Almost-polynomial ratio ETH-hardness of approximating densest \(k\)-subgraph. CoRR abs/1611.05991 (2016). http://arxiv.org/abs/1611.05991

  37. McCallum, A.: Information extraction: distilling structured data from unstructured text. ACM Queue pp. 48–57 (2005)

  38. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: ACL, pp. 1003–1011 (2009)

  39. Mork, J., Demner-Fushman, D., Schmidt, S., Aronson, A.: Recent enhancements to the NLM medical text indexer. In: CLEF (Working Notes), pp. 1328–1336 (2014)

  40. Nallapati, R., Manning, C.D.: Legal docket-entry classification: where machine learning stumbles. In: EMNLP, pp. 438–446 (2008)

  41. Pound, J., Ilyas, I., Weddell, G.: Expressive and flexible access to web-extracted data: a keyword-based structured query language. In: SIGMOD, pp. 423–434 (2010)

  42. Ratner, A.J., De Sa, C.M., Wu, S., Selsam, D., Ré, C.: Data programming: creating large training sets, quickly. In: NIPS, pp. 3567–3575 (2016)

  43. Rekatsinas, T., Dong, X.L., Srivastava, D.: Characterizing and selecting fresh data sources. In: SIGMOD, pp. 919–930 (2014)

  44. Sanderson, M.: Ambiguous queries: test collections need more sense. In: SIGIR, pp. 499–506 (2008)

  45. Sarawagi, S.: Information extraction. Found. Trends Databases 1, 261–377 (2008)

  46. Satpal, S., Bhadra, S., Sellamanickam, S., Rastogi, R., Sen, P.: Web information extraction using Markov logic networks. In: KDD, pp. 1406–1414 (2011)

  47. Shen, W., Doan, A., Naughton, J.F., Ramakrishnan, R.: Declarative information extraction using datalog with embedded extraction predicates. In: PVLDB, pp. 1033–1044 (2007)

  48. Shen, W., DeRose, P., McCann, R., Doan, A., Ramakrishnan, R.: Toward best-effort information extraction. In: SIGMOD, pp. 1031–1042 (2008)

  49. Suchanek, F., et al.: YAGO: a core of semantic knowledge unifying WordNet and Wikipedia. In: WWW, pp. 697–706 (2007)

  50. Termehchy, A., Vakilian, A., Chodpathumwan, Y., Winslett, M.: Which concepts are worth extracting? In: SIGMOD, pp. 779–790 (2014)

  51. Vakilian, A., Chodpathumwan, Y., Termehchy, A., Nayyeri, A.: Cost-effective conceptual design using taxonomies. CoRR abs/1503.05656, arXiv:1503.05656 (2015)

  52. Vakilian, A., Chodpathumwan, Y., Termehchy, A., Nayyeri, A.: Cost-effective conceptual design over taxonomies. In: WebDB, pp. 35–40 (2017)

  53. Vazirani, V.: Approximation Algorithms. Springer, Berlin (2001)

  54. Wang, D.Z., Franklin, M.J., Garofalakis, M., Hellerstein, J.M., Wick, M.L.: Hybrid in-database inference for declarative information extraction. In: SIGMOD, pp. 517–528 (2011)

  55. Wu, W., Li, H., Wang, H., Zhu, K.: Probase: a probabilistic taxonomy for text understanding. In: SIGMOD, pp. 481–492 (2012)

Acknowledgements

This study was funded by the National Science Foundation under Grant Nos. IIS-1421247, CCF-0938071, CCF-0938064, and CNS-0716532.

Author information

Corresponding author

Correspondence to Yodsawalai Chodpathumwan.

Appendix

Proofs

Proof of Theorem 1

The problem of choosing a cost-effective design from a set of concepts can be reduced to CECD by creating a taxonomy \(\mathcal {X}=(R,\mathcal {C},\mathcal {R})\) in which all nodes except for R are leaf concepts. Since the problem of choosing cost-effective concepts from a set of concepts is \(\mathbf {NP}\)-hard [50], CECD is also \(\mathbf {NP}\)-hard. \(\square \)
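The construction used in this proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Taxonomy` class, the function `flat_taxonomy`, the choice to store the root among the concepts, and the sample concept names are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Taxonomy:
    """Hypothetical container for a taxonomy X = (R, C, edges)."""
    root: str
    concepts: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (parent, child) pairs

    def leaves(self):
        # A leaf is a concept that never appears as a parent.
        parents = {parent for parent, _ in self.edges}
        return {c for c in self.concepts if c not in parents}


def flat_taxonomy(concepts, root="R"):
    """Embed a plain set of concepts as a taxonomy whose only
    non-leaf node is the root, as in the proof of Theorem 1."""
    if root in concepts:
        raise ValueError("root name must not clash with a concept")
    nodes = set(concepts) | {root}
    edges = {(root, c) for c in concepts}
    return Taxonomy(root=root, concepts=nodes, edges=edges)


# Hypothetical concept names, for illustration only.
tax = flat_taxonomy({"person", "drug", "disease"})
```

Under this embedding, every design over the original concept set is a design over the leaves of the taxonomy, which is what lets hardness carry over to CECD.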

Proof of Lemma 1

We have \(\sum \limits _{C\in \texttt {free}(\mathcal {S})} u(C)d(C) \le u(C_{\max }) \sum \limits _{C\in \texttt {free}(\mathcal {S})}d(C)\). Since the frequencies of leaf concepts in \(\mathcal {X}\) follow a power-law distribution, we have that \(\sum \limits _{C \in \texttt {leaf}(\mathcal {C})} d(C) \le 1+\log (|\texttt {leaf}(\mathcal {C})|)\), where \(\texttt {leaf}(\mathcal {C})\) is the set of leaf concepts in \(\mathcal {C}\) and \(|\texttt {leaf}(\mathcal {C})|\) is the number of such concepts. Since \(|\texttt {leaf}(\mathcal {C})| \le |\mathcal {C}|\), we have that \(QU(\texttt {free}(\mathcal {S})) \le \sum \limits _{C\in \texttt {free}(\mathcal {S})} u(C)d(C) \le (1 + \log |\mathcal {C}|)\ u(C_{\max }) \le 2u(C_{\max })\log |\mathcal {C}|\), where the last inequality holds whenever \(\log |\mathcal {C}| \ge 1\). \(\square \)
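The logarithmic bound in this proof can be checked numerically. The sketch below assumes one concrete reading of "power-law" that the appendix does not spell out: the i-th most frequent leaf concept has frequency 1/i (a Zipf law with exponent 1), and the logarithm is natural. The function names are hypothetical.

```python
import math


def harmonic_mass(n):
    """Total frequency under the assumed law d(C_i) = 1/i, i = 1..n."""
    return sum(1.0 / i for i in range(1, n + 1))


def lemma1_bound(n):
    """Right-hand side 1 + log(n) used in the proof of Lemma 1."""
    return 1.0 + math.log(n)


# The bound holds for every tested number of leaf concepts.
for n in (2, 10, 1000, 100000):
    assert harmonic_mass(n) <= lemma1_bound(n)

# The final step, 1 + log|C| <= 2 log|C|, needs log|C| >= 1.
# With natural logarithms it fails for |C| = 2 and holds for |C| = 3.
assert 1 + math.log(2) > 2 * math.log(2)
assert 1 + math.log(3) <= 2 * math.log(3)
```

With a different exponent or logarithm base the constants change, so this check illustrates the shape of the argument rather than the exact setting of the paper.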

Proof of Theorem 2

Let \(\mathcal {S}^*\) be a disjoint design over \(\mathcal {X}\) with total cost at most B that maximizes the QU function. Let \(\mathcal {S}^*[i]\) be the set of concepts in \(\mathcal {S}^*\) of level i, i.e., \(\mathcal {S}^*[i] =\mathcal {S}^*\cap \mathcal {C}[i]\). Since \(\mathcal {S}^*\) is a disjoint design, for all \(1\le i, j\le h\) with \(i \ne j\), \(\texttt {part}(\mathcal {S}^*[i]) \cap \texttt {part}(\mathcal {S}^*[j]) =\emptyset \). Thus, \(QU(\mathcal {S}^*) =\sum \limits _{1\le i\le h}{QU(\mathcal {S}^*[i])} + QU(\texttt {free}(\mathcal {S}^*))\). Next, we consider the two possible cases. First, assume that \(\sum _{i=1}^h QU(\mathcal {S}^*[i]) \ge QU(\texttt {free}(\mathcal {S}^{*}))\). It follows that the Level-wise algorithm returns a \((\frac{2h}{\texttt {pr}_{\min }})\)-approximate solution of the disjoint CECD defined over \(\mathcal {X}\).

In the other case, in which \(QU(\texttt {free}(\mathcal {S}^{*}))\) is larger than the other term, we have \(QU(\mathcal {S}^*) < 2\,QU(\texttt {free}(\mathcal {S}^{*}))\). Hence, by Lemma 1, extracting the concept with the maximum u value gives a \((\frac{4\log (|\mathcal {C}|)}{\texttt {pr}_{\min }})\)-approximation. These two cases together imply that we have an \(O(\frac{h+\log |\mathcal {C}|}{\texttt {pr}_{\min }})\)-approximation. \(\square \)
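The arithmetic behind the two cases can be laid out as a short worked derivation. It is a sketch only: it uses the decomposition of \(QU(\mathcal {S}^*)\) and Lemma 1 from this appendix, and it assumes that the factor \(1/\texttt {pr}_{\min }\) in each ratio comes from guarantees stated in the main text, which are not reproduced here.

```latex
\begin{align*}
\text{Case 1:}\quad
  QU(\mathcal{S}^*)
  &\le 2\sum_{i=1}^{h} QU(\mathcal{S}^*[i])
   \le 2h \max_{1\le i\le h} QU(\mathcal{S}^*[i]), \\
\text{Case 2:}\quad
  QU(\mathcal{S}^*)
  &< 2\,QU(\texttt{free}(\mathcal{S}^*))
   \le 4\,u(C_{\max})\log|\mathcal{C}|, \\
\text{Combined:}\quad
  \max\!\left(\frac{2h}{\texttt{pr}_{\min}},
              \frac{4\log|\mathcal{C}|}{\texttt{pr}_{\min}}\right)
  &\le \frac{2h + 4\log|\mathcal{C}|}{\texttt{pr}_{\min}}
   = O\!\left(\frac{h+\log|\mathcal{C}|}{\texttt{pr}_{\min}}\right).
\end{align*}
```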

Proof of Theorem 4

Construct an instance \(\mathcal {M}\) of SU-Knapsack from the MC-CECD instance as described in the paragraph preceding Theorem 4.

For each pair of concepts \(C_i, C_j\), the total cost of \(\mathcal {I}_1 := (\{ {C_i} \},\{ {C_j} \})\) is equal to the total cost of \(\mathcal {I}_2 := (\{ {C_i} \},\{ {C_j} \},\{ {C_i,C_j} \})\), and all elements have positive profits. Therefore, if items \(\{ {C_i} \}, \{ {C_j} \}\) are selected in an optimal solution of \(\mathcal {M}\), item \(\{ {C_i,C_j} \}\) will also be picked. Thus, an optimal solution of \(\mathcal {M}\) corresponds to a design \(\mathcal {S}\) in MC-CECD: \(\mathcal {I}_{\mathcal {S}} =\emptyset \cup \bigcup _{C \in \mathcal {S}} \{ {C} \} \cup \bigcup _{C_1, C_2 \in \mathcal {S}} \{ {C_1, C_2} \}\). Furthermore, the profit of \(\mathcal {I}_{\mathcal {S}}\) is \(p(\mathcal {I}_{\mathcal {S}}) =p(\emptyset ) + \sum _{C\in \mathcal {S}} p(\{ {C} \}) + \sum _{C_1, C_2\in \mathcal {S}} p(\{ {C_1, C_2} \}) =QU(\mathcal {S})\). Hence, the algorithm of [5] returns a design whose queriability is at least \((1-1/e^{\frac{1}{k+1}})\) times the queriability of an optimal design. \(\square \)
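The correspondence between a design and its item family can be illustrated with a small sketch. The exact SU-Knapsack construction is given in the main text and is not reproduced in this appendix, so everything below is an assumption made for illustration: pairs are taken to be unordered and distinct, and the concept names, costs, and profit values are hypothetical.

```python
from itertools import combinations


def items_for_design(design):
    """Item family I_S: the empty item, every singleton {C}, and every
    unordered pair {C1, C2} of distinct concepts in the design."""
    items = {frozenset()}
    items |= {frozenset([c]) for c in design}
    items |= {frozenset(pair) for pair in combinations(sorted(design), 2)}
    return items


def union_cost(items, cost):
    """Set-union knapsack weight: each covered concept is paid for once."""
    covered = set().union(*items)
    return sum(cost[c] for c in covered)


def total_profit(items, profit):
    """Sum of item profits; items without a listed profit count as 0."""
    return sum(profit.get(item, 0.0) for item in items)


# Hypothetical costs and profits, for illustration only.
cost = {"a": 2, "b": 3}
profit = {
    frozenset(): 0.1,
    frozenset(["a"]): 0.3,
    frozenset(["b"]): 0.2,
    frozenset(["a", "b"]): 0.05,
}

singles = {frozenset(["a"]), frozenset(["b"])}
with_pair = singles | {frozenset(["a", "b"])}
items = items_for_design({"a", "b"})
```

In this toy instance `union_cost(singles, cost)` and `union_cost(with_pair, cost)` are equal, which mirrors the observation that adding the pair item costs nothing once both singletons are selected.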

Proof of Theorem 5

We use the following construction to reduce Densest-k-Subgraph to the CECD problem over DAG taxonomies. Given a graph \(G =(V,E)\) and a number k, we build an instance of CECD over a DAG taxonomy as follows. For each edge \(e \in E\), we introduce a leaf concept \(a_e\), and for each vertex \(v \in V\), we introduce a leaf concept \(a_v\) and a non-leaf concept \(S_v\) such that \(S_v\) is an ancestor of \(a_v\) and of every \(a_e\) whose edge e is incident to v in G. Further, we set the budget B to k, the cost of each non-leaf concept to 1, and the cost of each leaf concept to \(k+1\). If we select \(S_v\) and \(S_u\) in the design and \(e = (u,v) \in E\), then \(a_{e}\) will be a singleton partition. We set the popularities and frequencies of all concepts in the taxonomy to the same fixed values u and d, respectively. Let m be the number of edges in G and n be the number of vertices in G. For each partition \(p\in \texttt {part}(\mathcal {S})\), we set \(d(p) =1/\beta \) if p is a singleton edge concept and \(d(p) =1\) otherwise, where \(\beta \) is a parameter that we determine later in the proof.

Since leaf concepts are not affordable, there is an optimal design with exactly k non-leaf concepts. In each design \(\mathcal {S}\) of size k, every leaf concept that does not form a singleton edge-concept partition contributes exactly ud. Let \(H_{\mathcal {S}}\) be the set of vertices in G whose corresponding non-leaf concepts in \(\mathcal {C}\) are in \(\mathcal {S}\), and let \(E(H_{\mathcal {S}})\) denote the set of edges with both endpoints in \(H_{\mathcal {S}}\). \(E(H_{\mathcal {S}})\) corresponds to the set of edge concepts of \(\mathcal {C}\) for which the non-leaf concepts associated with both endpoints are in \(\mathcal {S}\). Let \(\mathcal {S}_{\textsc {OPT}}\) be the solution of CECD corresponding to an optimal solution of Densest-k-Subgraph. We have \(QU(\mathcal {S}_{\textsc {OPT}}) =ud(\beta \cdot t + m + n -t)\), where t denotes the number of edges in \(E(H_{\mathcal {S}_{\textsc {OPT}}})\).

Let \(\mathcal {A}\) be an \(\alpha \)-approximation algorithm for CECD and \(\mathcal {S}_{\mathcal {A}}\) be the \(\alpha \)-approximate design returned by \(\mathcal {A}\). Then \(QU(\mathcal {S}_{\mathcal {A}}) =ud (t'\cdot \beta + m + n -t')\), where \(t'\) is the number of edges with both endpoints in \(H_{\mathcal {S}_{\mathcal {A}}}\). Since \(\mathcal {A}\) is an \(\alpha \)-approximation algorithm for CECD, \(t\cdot \beta + m + n -t \le \alpha ( t'\cdot \beta + m + n -t')\). Rearranging gives \(t(\beta -1) \le \alpha t'(\beta -1) + (\alpha -1)(m+n)\), and thus \(t \le t'\cdot \alpha + (m+n)\cdot {\alpha -1 \over \beta -1}\). Setting \(\beta =(m+n)(\alpha -1) + 1\) yields \(t \le \alpha t' + 1\), which leads to an \(O(\alpha )\)-approximation for Densest-k-Subgraph. \(\square \)
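The reduction in this proof can be sketched as code that builds the taxonomy from a graph and evaluates the quantity \(\beta t + m + n - t\). This is an illustrative sketch only: the node labels `("S", v)` and `("a", x)`, the function names, and the sample graph are hypothetical, and the objective is reported divided by the constant ud.

```python
def build_cecd_instance(vertices, edges, k):
    """Build the DAG taxonomy of the reduction from G = (V, E) and k.
    Returns (children, cost, budget); the labels are hypothetical."""
    children, cost = {}, {}
    for v in vertices:
        incident = [e for e in edges if v in e]
        # S_v is an ancestor of a_v and of a_e for each incident edge e.
        children[("S", v)] = [("a", v)] + [("a", e) for e in incident]
        cost[("S", v)] = 1          # non-leaf concepts cost 1
        cost[("a", v)] = k + 1      # leaf concepts exceed the budget
    for e in edges:
        cost[("a", e)] = k + 1
    return children, cost, k        # budget B = k


def edges_inside(selected_vertices, edges):
    """Number of edges with both endpoints among the selected vertices."""
    chosen = set(selected_vertices)
    return sum(1 for (u, v) in edges if u in chosen and v in chosen)


def qu_over_ud(t, m, n, beta):
    """QU(S) / (u d) = beta * t + m + n - t, as in the proof."""
    return beta * t + m + n - t


# Hypothetical example: a triangle {1, 2, 3} with a pendant vertex 4.
V = [1, 2, 3, 4]
E = [(1, 2), (2, 3), (1, 3), (3, 4)]
children, cost, budget = build_cecd_instance(V, E, k=3)

alpha = 2
beta = (len(E) + len(V)) * (alpha - 1) + 1   # choice of beta in the proof
t = edges_inside([1, 2, 3], E)
value = qu_over_ud(t, len(E), len(V), beta)
```

Each edge concept ends up with two parents, one per endpoint, which is why the resulting taxonomy is a DAG rather than a tree.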

Proof of Corollary 1

By setting \(\beta = {(m+n)(\alpha -1)\over \epsilon } + 1\) in the proof of Theorem 5 and using the hardness result of [32], no polynomial-time approximation scheme exists for CECD unless \(\mathbf {NP}\subseteq \cap _{\epsilon >0} \mathbf {BPTIME}(2^{n^{\epsilon }})\). \(\square \)

Proof of Corollary 2

The claim follows from the hardness result of [36] and Theorem 5. \(\square \)

About this article

Cite this article

Chodpathumwan, Y., Vakilian, A., Termehchy, A. et al. Cost-effective conceptual design using taxonomies. The VLDB Journal 27, 369–394 (2018). https://doi.org/10.1007/s00778-018-0501-1

  • DOI: https://doi.org/10.1007/s00778-018-0501-1
