Cost-effective conceptual design using taxonomies

Chodpathumwan, Yodsawalai; Vakilian, Ali; Termehchy, Arash; Nayyeri, Amir

doi:10.1007/s00778-018-0501-1

Regular Paper
Published: 24 March 2018

Volume 27, pages 369–394, (2018)
The VLDB Journal

Yodsawalai Chodpathumwan, Ali Vakilian, Arash Termehchy & Amir Nayyeri

Abstract

It is known that annotating entities in unstructured and semi-structured datasets by their concepts improves the effectiveness of answering queries over these datasets. Ideally, one would like to annotate entities of all relevant concepts in a dataset. However, it takes substantial time and computational resources to annotate concepts in large datasets, and an organization may have sufficient resources to annotate only a subset of relevant concepts. Clearly, the organization would like to annotate the subset of concepts that provides the most effective answers to queries over the dataset. We propose a formal framework that quantifies the amount by which annotating entities of concepts from a taxonomy in a dataset improves the effectiveness of answering queries over the dataset. Because the problem is \(\mathbf {NP}\)-hard, we propose efficient approximation and pseudo-polynomial time algorithms for several cases of the problem. Our extensive empirical studies validate our framework and show the accuracy and efficiency of our algorithms.

References

Abiteboul, S., Manolescu, I., Rigaux, P., Rousset, M., Senellart, P.: Web Data Management. Cambridge University Press, Cambridge (2011)
Anderson, M., Cafarella, M., Jiang, Y., Wang, G., Zhang, B.: An integrated development environment for faster feature engineering. PVLDB 7(13), 1657–1660 (2014)
Anderson, M., et al.: Brainwash: a data system for feature engineering. In: CIDR (2013)
Arora, S., Manokaran, R., Moshkovitz, D., Weinstein, O.: Inapproximability of densest \(\kappa \)-subgraph from average-case hardness. people.csail.mit.edu/dmoshkov/papers (2011)
Arulselvan, A.: A note on the set union knapsack problem. Discrete Appl. Math. 169, 214–218 (2014)
Bhaskara, A., Charikar, M., Vijayaraghavan, A., Guruswami, V., Zhou, Y.: Polynomial integrality gaps for strong SDP relaxations of densest K-subgraph. In: SODA, pp. 388–405 (2012)
Boehm, B., et al.: Software development cost estimation approaches: a survey. Ann. Softw. Eng. 10(1–4), 177–205 (2000)
Chakrabarti, S., Puniyani, K., Das, S.: Optimizing scoring functions and indexes for proximity search in type-annotated corpora. In: WWW, pp. 717–726 (2007)
Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. TKDE 18, 1411–1428 (2006)
Chiticariu, L., Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F.R., Vaithyanathan, S.: SystemT: an algebraic approach to declarative information extraction. In: ACL, pp. 128–137 (2010a)
Chiticariu, L., Li, Y., Raghavan, S., Reiss, F.R.: Enterprise information extraction: recent developments and open challenges. In: SIGMOD, pp. 1257–1258 (2010b)
Chiticariu, L., Li, Y., Reiss, F.R.: Rule-based information extraction is dead! long live rule-based information extraction systems! In: EMNLP, pp. 827–832 (2013)
Chu-Carroll, J., et al.: Semantic Search via XML fragments: a high-precision approach to IR. In: SIGIR, pp. 445–452 (2006)
Demidova, E., Zhou, X., Oelze, I., Nejdl, W.: Evaluating evidences for keyword query disambiguation in entity centric database search. In: DEXA, pp. 240–247 (2010)
Deshpande, O., Lamba, D., Tourn, M., Das, S., Subramaniam, S., Rajaraman, A., Harinarayan, V., Doan, A.: Building, maintaining, and using knowledge bases: a report from the trenches. In: SIGMOD, pp. 1209–1220 (2013)
Dill, S., et al.: Semtag and seeker: bootstrapping the semantic web via automated semantic annotation. In: WWW, pp. 178–186 (2003)
Doan, A., Ramakrishnan, R., Vaithyanathan, S.: Managing information extraction: state of the art and research directions. In: SIGMOD, pp. 799–800 (2006)
Doan, A., Naughton, J.F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F., Chen, T., Chu, E., DeRose, P., Gao, B.J., Gokhale, C., Huang, J., Shen, W., Vuong, B.: Information extraction challenges in managing unstructured data. SIGMOD Rec. 37(4), 14–20 (2008)
Dong, X.L., Saha, B., Srivastava, D.: Less is more: selecting sources wisely for integration. PVLDB 6(2), 37–48 (2013)
Downey, D., Etzioni, O., Soderland, S.: A probabilistic model of redundancy in information extraction. In: IJCAI (2005)
Fagin, R., Kimelfeld, B., Reiss, F., Vansummeren, S.: Spanners: a formal framework for information extraction. In: PODS, pp. 37–48 (2013)
Furche, T., Guo, J., Maneth, S., Schallhart, C.: Robust and noise resistant wrapper induction. In: SIGMOD, pp. 773–784 (2016)
Garcia-Molina, H., Ullman, J., Widom, J.: Database Systems: The Complete Book. Prentice Hall, Upper Saddle River (2008)
Gulhane, P., et al.: Web-scale information extraction with vertex. In: ICDE, pp. 1209–1220 (2011)
Gupta, S., Manning, C.D.: Improved pattern learning for bootstrapped entity extraction. In: CoNLL, pp. 98–108 (2014)
Gupta, S., MacLean, D.L., Heer, J., Manning, C.D.: Research and applications: induced lexico-syntactic patterns improve information extraction from online medical forums. JAMIA 21(5), 902–909 (2014)
Hua, W., Wang, Z., Wang, H., Zheng, K., Zhou, X.: Short text understanding through lexical-semantic analysis. In: ICDE, pp. 495–506 (2015)
Huang, J., Yu, C.: Prioritization of domain-specific web information extraction. In: AAAI (2010)
Isozaki, H., Kazawa, H.: Efficient support vector classifiers for named entity recognition. In: COLING, pp. 1–7 (2002)
Jain, A., Doan, A., Gravano, L.: Optimizing SQL queries over text databases. In: ICDE (2008)
Kanani, P., et al.: Selecting actions for resource-bounded information extraction using reinforcement learning. In: WSDM, pp. 253–262 (2012)
Khot, S.: Ruling out PTAS for graph min-bisection, densest subgraph and bipartite clique. In: FOCS, pp. 136–145 (2004)
Kimelfeld, B.: Database principles in information extraction. In: PODS, pp. 156–163 (2014)
Liu, K., et al.: MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence. Bioinformatics 31(13), i339–i347 (2015)
Manning, C.D., Raghavan, P., Schütze, H., et al.: An Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Manurangsi, P.: Almost-polynomial ratio ETH-hardness of approximating densest \(k\)-subgraph. CoRR abs/1611.05991, (2016) http://arxiv.org/abs/1611.05991
McCallum, A.: Information extraction: distilling structured data from unstructured text. ACM Queue pp. 48–57 (2005)
Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: ACL, pp. 1003–1011 (2009)
Mork, J., Demner-Fushman, D., Schmidt, S., Aronson, A.: Recent enhancements to the nlm medical text indexer. In: CLEF (Working Notes), pp. 1328–1336 (2014)
Nallapati, R., Manning, C.D.: Legal docket-entry classification: where machine learning stumbles. In: EMNLP, pp. 438–446 (2008)
Pound, J., Ilyas, I., Weddell, G.: Expressive and flexible access to web-extracted data: a keyword-based structured query language. In: SIGMOD, pp. 423–434 (2010)
Ratner, A.J., De Sa, C.M., Wu, S., Selsam, D., Ré, C.: Data programming: creating large training sets, quickly. In: NIPS, pp. 3567–3575 (2016)
Rekatsinas, T., Dong, X.L., Srivastava, D.: Characterizing and selecting fresh data sources. In: SIGMOD, pp. 919–930 (2014)
Sanderson, M.: Ambiguous queries: test collections need more sense. In: SIGIR, pp. 499–506 (2008)
Sarawagi, S.: Information extraction. Found. Trends Databases 1, 261–377 (2008)
Satpal, S., Bhadra, S., Sellamanickam, S., Rastogi, R., Sen, P.: Web information extraction using markov logic networks. In: KDD, pp. 1406–1414 (2011)
Shen, W., Doan, A., Naughton, J.F., Ramakrishnan, R.: Declarative information extraction using datalog with embedded extraction predicates. In: PVLDB, pp. 1033–1044 (2007)
Shen, W., DeRose, P., McCann, R., Doan, A., Ramakrishnan, R.: Toward best-effort information extraction. In: SIGMOD, pp. 1031–1042 (2008)
Suchanek, F., et al.: Yago: A core of semantic knowledge unifying wordnet and wikipedia. In: WWW, pp. 697–706 (2007)
Termehchy, A., Vakilian, A., Chodpathumwan, Y., Winslett, M.: Which concepts are worth extracting? In: SIGMOD, pp. 779–790 (2014)
Vakilian, A., Chodpathumwan, Y., Termehchy, A., Nayyeri, A.: Cost-effective conceptual design using taxonomies. CoRR abs/1503.05656, arXiv:1503.05656 (2015)
Vakilian, A., Chodpathumwan, Y., Termehchy, A., Nayyeri, A.: Cost-effective conceptual design over taxonomies. In: WebDB, pp. 35–40 (2017)
Vazirani, V.: Approximation Algorithms. Springer, Berlin (2001)
Wang, D.Z., Franklin, M.J., Garofalakis, M., Hellerstein, J.M., Wick, M.L.: Hybrid in-database inference for declarative information extraction. In: SIGMOD, pp. 517–528 (2011)
Wu, W., Li, H., Wang, H., Zhu, K.: Probase: a probabilistic taxonomy for text understanding. In: SIGMOD, pp. 481–492 (2012)

Acknowledgements

This study was funded by the National Science Foundation under Grant Nos. IIS-1421247, CCF-0938071, CCF-0938064, and CNS-0716532.

Author information

The first two authors contributed equally to the paper.

Authors and Affiliations

University of Illinois at Urbana-Champaign, Urbana, IL, USA
Yodsawalai Chodpathumwan
Massachusetts Institute of Technology, Cambridge, MA, USA
Ali Vakilian
Oregon State University, Corvallis, OR, USA
Arash Termehchy & Amir Nayyeri

Corresponding author

Correspondence to Yodsawalai Chodpathumwan.

Appendix

Proofs

Proof of Theorem 1

The problem of choosing a cost-effective design from a set of concepts can be reduced to CECD by creating a taxonomy \(\mathcal {X}=(R,\mathcal {C},\mathcal {R})\) in which all nodes except for the root R are leaf concepts. Since the problem of choosing cost-effective concepts from a set of concepts is \(\mathbf {NP}\)-hard [50], CECD is also \(\mathbf {NP}\)-hard. \(\square \)
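
The flat taxonomy used in this reduction can be sketched in a few lines of Python. This is only an illustration of the construction, not code from the paper: the function names `flat_taxonomy` and `leaves`, the child-list representation, and the sample concept names are all assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable, List, Set


def flat_taxonomy(concepts: Iterable[Hashable],
                  root: Hashable = "R") -> Dict[Hashable, List[Hashable]]:
    """Build a taxonomy (node -> list of children) in which every
    concept is attached directly to the root and has no children."""
    children: Dict[Hashable, List[Hashable]] = {root: []}
    for concept in concepts:
        if concept in children:
            raise ValueError(f"duplicate or reserved concept name: {concept!r}")
        children[root].append(concept)
        children[concept] = []  # no children, so the concept is a leaf
    return children


def leaves(children: Dict[Hashable, List[Hashable]]) -> Set[Hashable]:
    """Return the nodes that have no children."""
    return {node for node, kids in children.items() if not kids}


# Hypothetical concept set, used only to exercise the construction.
taxonomy = flat_taxonomy(["person", "city", "drug"])
```

With this representation, every node other than `"R"` is a leaf, so a design over the taxonomy is simply a subset of the original concept set.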

Proof of Lemma 1

We have \(\sum \limits _{C\in \texttt {free}(\mathcal {S})} u(C)d(C) \le u(C_{\max }) \sum \limits _{C\in \texttt {free}(\mathcal {S})}d(C)\). Since the frequencies of leaf concepts in \(\mathcal {X}\) follow a power-law distribution, we have \(\sum \limits _{C \in \texttt {leaf}(\mathcal {C})} d(C) \le 1+\log |\texttt {leaf}(\mathcal {C})|\), where \(\texttt {leaf}(\mathcal {C})\) is the set of leaf concepts in \(\mathcal {C}\) and \(|\texttt {leaf}(\mathcal {C})|\) is the number of such concepts. Since \(|\texttt {leaf}(\mathcal {C})| \le |\mathcal {C}|\), we have \(QU(\texttt {free}(\mathcal {S})) \le \sum \limits _{C\in \texttt {free}(\mathcal {S})} u(C)d(C) \le (1 + \log |\mathcal {C}|)\ u(C_{\max }) \le 2u(C_{\max })\log |\mathcal {C}|\), where the last inequality holds whenever \(\log |\mathcal {C}| \ge 1\). \(\square \)
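
The two inequalities in this proof can be checked numerically. The sketch below rests on assumptions that are not stated in the text: the power law is taken to be \(d(C) = 1/\text{rank}(C)\), so that the sum of frequencies is a harmonic number, and \(\log\) is taken to be the natural logarithm. The function names are hypothetical.

```python
import math


def harmonic(n: int) -> float:
    """Sum of 1/rank for rank = 1..n (assumed power-law frequencies)."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def lemma1_bounds(n: int, u_max: float) -> tuple:
    """Return the two upper bounds from the proof for n concepts:
    (1 + log n) * u_max and 2 * u_max * log n (natural log assumed)."""
    tight = (1.0 + math.log(n)) * u_max
    loose = 2.0 * u_max * math.log(n)
    return tight, loose


# Check the frequency-sum bound and the final relaxation on a few sizes.
checks = []
for n in (3, 10, 1000):
    tight, loose = lemma1_bounds(n, u_max=0.7)
    checks.append((n, harmonic(n) <= 1.0 + math.log(n), tight <= loose))
```

Under the natural-log assumption, the final relaxation to \(2u(C_{\max})\log |\mathcal {C}|\) needs \(|\mathcal {C}| \ge 3\); for two concepts the tighter bound \((1+\log |\mathcal {C}|)\,u(C_{\max})\) is the one that applies.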

Proof of Theorem 2

Let \(\mathcal {S}^*\) be a disjoint design over \(\mathcal {X}\) with total cost at most B that maximizes the QU function. Let \(\mathcal {S}^*[i]\) be the set of concepts in \(\mathcal {S}^*\) of level i, i.e., \(\mathcal {S}^*[i] =\mathcal {S}^*\cap \mathcal {C}[i]\). Since \(\mathcal {S}^*\) is a disjoint design, for all \(1\le i, j\le h\) with \(i \ne j\), \(\texttt {part}(\mathcal {S}^*[i]) \cap \texttt {part}(\mathcal {S}^*[j]) =\emptyset \). Thus, \(QU(\mathcal {S}^*) =\sum \limits _{1\le i\le h}{QU(\mathcal {S}^*[i])} + QU(\texttt {free}(\mathcal {S}^*))\). Next, we consider the two possible cases. First, assume that \(\sum _{i=1}^h QU(\mathcal {S}^*[i]) \ge QU(\texttt {free}(\mathcal {S}^{*}))\). Then \(\sum _{i=1}^h QU(\mathcal {S}^*[i]) \ge \frac{1}{2} QU(\mathcal {S}^*)\), so some level i satisfies \(QU(\mathcal {S}^*[i]) \ge \frac{1}{2h} QU(\mathcal {S}^*)\). It follows that the Level-wise algorithm returns a \((\frac{2h}{\texttt {pr}_{\min }})\)-approximate solution of the disjoint CECD defined over \(\mathcal {X}\).

In the other case, \(QU(\texttt {free}(\mathcal {S}^{*})) > \sum _{i=1}^h QU(\mathcal {S}^*[i])\), and hence \(QU(\texttt {free}(\mathcal {S}^{*})) > \frac{1}{2} QU(\mathcal {S}^*)\). By Lemma 1, extracting the concept with the maximum u value gives a \((\frac{4\log (|\mathcal {C}|)}{\texttt {pr}_{\min }})\)-approximation. These two cases together imply that we have an \(O(\frac{h+\log |\mathcal {C}|}{\texttt {pr}_{\min }})\)-approximation. \(\square \)
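
The case split in this proof is an averaging argument, which the following sketch makes concrete. It is not the Level-wise algorithm itself, only a check that one of the parts of \(QU(\mathcal {S}^*)\) must be large; the function name `dominant_part` and the sample numbers are assumptions made for illustration.

```python
from typing import List, Tuple


def dominant_part(level_qu: List[float], free_qu: float) -> Tuple[str, float, float]:
    """Given QU(S*[i]) for each level i and QU(free(S*)), return
    (case, value of the dominant part, guaranteed fraction of QU(S*))."""
    if not level_qu:
        raise ValueError("the taxonomy must have at least one level")
    h = len(level_qu)
    if sum(level_qu) >= free_qu:
        # Case 1: the levels carry at least half of QU(S*), so the best
        # single level carries at least a 1/(2h) fraction of it.
        return "level", max(level_qu), 1.0 / (2 * h)
    # Case 2: the free part alone carries more than half of QU(S*).
    return "free", free_qu, 0.5


# Hypothetical per-level values, chosen only to exercise both cases.
case1 = dominant_part([3.0, 1.0, 2.0], free_qu=4.0)
case2 = dominant_part([1.0, 1.0], free_qu=5.0)
```

In the first example the total is 10 and the best level contributes 3, which is at least 10/6; in the second the total is 7 and the free part contributes 5, which is more than half.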

Proof of Theorem 4

Construct an instance \(\mathcal {M}\) of SU-Knapsack from the MC-CECD instance as described in the paragraph preceding Theorem 4.

For each pair of concepts \(C_i, C_j\), the total cost of \(\mathcal {I}_1 := (\{ {C_i} \},\{ {C_j} \})\) is equal to the total cost of \(\mathcal {I}_2 := (\{ {C_i} \},\{ {C_j} \},\{ {C_i,C_j} \})\), and all elements have positive profits. Hence, if items \(\{ {C_i} \}, \{ {C_j} \}\) are selected in an optimal solution of \(\mathcal {M}\), item \(\{ {C_i,C_j} \}\) will also be picked, since it adds profit at no additional cost. Thus, an optimal solution of \(\mathcal {M}\) corresponds to a design \(\mathcal {S}\) in MC-CECD: \(\mathcal {I}_{\mathcal {S}} =\emptyset \cup \bigcup _{C \in \mathcal {S}} \{ {C} \} \cup \bigcup _{C_1, C_2 \in \mathcal {S}} \{ {C_1, C_2} \}\). Furthermore, the profit of \(\mathcal {I}_{\mathcal {S}}\) is \(p(\mathcal {I}_{\mathcal {S}}) =p(\emptyset ) + \sum _{C\in \mathcal {S}} p(\{ {C} \}) + \sum _{C_1, C_2\in \mathcal {S}} p(\{ {C_1, C_2} \}) =QU(\mathcal {S})\). Hence, the algorithm of [5] returns a design whose queriability is at least \((1-1/e^{\frac{1}{k+1}})\) times the queriability of an optimal design. \(\square \)
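
The item family \(\mathcal {I}_{\mathcal {S}}\) and its profit can be enumerated directly. The sketch below assumes that the pair sum ranges over unordered pairs of distinct concepts and that profits are supplied as a dictionary keyed by item; neither detail is fixed by the text, and the names `items_for_design` and `profit_of_design` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List


def items_for_design(design: Iterable[str]) -> List[FrozenSet[str]]:
    """List the SU-Knapsack items for a design S: the empty item, one
    singleton per concept, and one item per unordered pair of concepts."""
    concepts = sorted(set(design))
    items: List[FrozenSet[str]] = [frozenset()]
    items += [frozenset([c]) for c in concepts]
    items += [frozenset(pair) for pair in combinations(concepts, 2)]
    return items


def profit_of_design(design: Iterable[str],
                     profit: Dict[FrozenSet[str], float]) -> float:
    """Sum the profits of all items in I_S (the quantity equated with
    QU(S) in the proof)."""
    return sum(profit[item] for item in items_for_design(design))


# Hypothetical instance: three concepts, every item given profit 1.0.
all_items = items_for_design({"A", "B", "C"})
unit_profit = {item: 1.0 for item in all_items}
total_ab = profit_of_design({"A", "B"}, unit_profit)
```

For a design with s concepts the family has \(1 + s + \binom{s}{2}\) items, so selecting both singletons of a pair always brings the pair item along in the sum.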

Proof of Theorem 5

We use the following construction to reduce Densest-k-Subgraph to the CECD problem over DAG taxonomies. Given a graph \(G =(V,E)\) and a number k, we build an instance of CECD over a DAG taxonomy as follows. For each edge \(e \in E\), we introduce a leaf concept \(a_e\), and for each vertex \(v \in V\), we introduce a leaf concept \(a_v\) and a non-leaf concept \(S_v\) such that \(S_v\) is an ancestor of \(a_v\) and of every \(a_e\) whose edge e is incident to v in G. Further, we set the budget B to k, the cost of each non-leaf concept to 1, and the cost of each leaf concept to \(k+1\). If we select \(S_v\) and \(S_u\) in the design and \(e =(u,v) \in E\), then \(a_{e}\) will be a singleton partition.

We set the popularities and frequencies of all concepts in the taxonomy to the same fixed values u and d, respectively. Let m be the number of edges in G and n be the number of vertices in G. For each partition \(p\in \texttt {part}(\mathcal {S})\), we set \(d(p) =1/\beta \) if p is a singleton edge concept and \(d(p) =1\) otherwise, where \(\beta \) is a parameter that we determine later in the proof. Since leaf concepts are not affordable, there is an optimal design with exactly k non-leaf concepts. In each design \(\mathcal {S}\) of size k, the contribution of every leaf concept in a partition that is not a singleton edge concept is exactly ud.

Let \(H_{\mathcal {S}}\) be the set of vertices in G whose corresponding non-leaf concepts in \(\mathcal {C}\) are in \(\mathcal {S}\), and let \(E(H_{\mathcal {S}})\) denote the set of edges with both endpoints in \(H_{\mathcal {S}}\). This set corresponds to the edge concepts of \(\mathcal {C}\) for which the non-leaf concepts associated with both endpoints are in \(\mathcal {S}\). Let \(\mathcal {S}_{\textsc {OPT}}\) be the solution of CECD corresponding to an optimal solution of Densest-k-Subgraph. We have \(QU(\mathcal {S}_{\textsc {OPT}}) =ud(\beta \cdot t + m + n -t)\), where t denotes the number of edges in \(E(H_{\mathcal {S}_{\textsc {OPT}}})\).

Let \(\mathcal {A}\) be an \(\alpha \)-approximation algorithm for CECD and \(\mathcal {S}_{\mathcal {A}}\) be the \(\alpha \)-approximate design returned by \(\mathcal {A}\). Then \(QU(\mathcal {S}_{\mathcal {A}}) =ud (t'\cdot \beta + m + n -t')\), where \(t'\) is the number of edges with both endpoints in \(H_{\mathcal {S}_{\mathcal {A}}}\). Since \(\mathcal {A}\) is an \(\alpha \)-approximation algorithm for CECD, \(t\cdot \beta + m + n -t \le \alpha ( t'\cdot \beta + m + n -t')\). Rearranging gives \(t(\beta -1) \le \alpha t'(\beta -1) + (\alpha -1)(m+n)\), and thus \(t \le t'\cdot \alpha + (m+n)\cdot \frac{\alpha -1}{\beta -1}\). Setting \(\beta =(m+n)(\alpha -1) + 1\) gives \(t \le \alpha t' + 1\), which leads to an \(O(\alpha )\)-approximation for Densest-k-Subgraph. \(\square \)
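
The objective used in this reduction can be evaluated on a toy graph to see that maximizing QU over k selected vertex concepts is the same as maximizing the number of induced edges. The sketch below uses the closed form \(ud(\beta t + m + n - t)\) from the proof rather than building the taxonomy explicitly; the function names, the example graph, and the brute-force search are assumptions made for illustration, and the search is only feasible on tiny inputs.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Tuple

Edge = Tuple[int, int]


def induced_edges(edges: Iterable[Edge], chosen: Iterable[int]) -> int:
    """Count edges with both endpoints among the chosen vertices (t)."""
    picked = set(chosen)
    return sum(1 for a, b in edges if a in picked and b in picked)


def qu_of_selection(edges: List[Edge], n: int, chosen: Iterable[int],
                    beta: float, u: float = 1.0, d: float = 1.0) -> float:
    """Closed-form QU of the design that selects S_v for each chosen v."""
    m = len(edges)
    t = induced_edges(edges, chosen)
    return u * d * (beta * t + m + n - t)


def best_selection(vertices: Sequence[int], edges: List[Edge],
                   k: int, beta: float) -> Tuple[int, ...]:
    """Brute-force the k-subset of vertices with maximum QU."""
    return max(combinations(vertices, k),
               key=lambda c: qu_of_selection(edges, len(vertices), c, beta))


# Toy graph: a triangle on {0, 1, 2} with a path 2-3-4 attached.
vertices = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
alpha = 2.0
beta = (len(edges) + len(vertices)) * (alpha - 1) + 1  # beta = 11
best = best_selection(vertices, edges, k=3, beta=beta)
best_qu = qu_of_selection(edges, len(vertices), best, beta)
```

On this graph the search returns the triangle, which induces three edges. With \(\beta = 11\) and \(m + n = 10\), the approximation inequality reduces to \(t \le 2t' + 1\), matching the bound derived in the proof.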

Proof of Corollary 1

By setting \(\beta = {(m+n)(\alpha -1)\over \epsilon } + 1\) in the proof of Theorem 5 and using the hardness result of [32], no polynomial-time approximation scheme exists for CECD unless \(\mathbf {NP}\subseteq \cap _{\epsilon >0} \mathbf {BPTIME}(2^{n^{\epsilon }})\). \(\square \)

Proof of Corollary 2

It follows from the hardness result of [36] and Theorem 5. \(\square \)

About this article

Cite this article

Chodpathumwan, Y., Vakilian, A., Termehchy, A. et al. Cost-effective conceptual design using taxonomies. The VLDB Journal 27, 369–394 (2018). https://doi.org/10.1007/s00778-018-0501-1

Received: 23 June 2017
Revised: 11 January 2018
Accepted: 12 March 2018
Published: 24 March 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s00778-018-0501-1
