Holistic primary key and foreign key detection

  • Lan JiangEmail author
  • Felix Naumann


Primary keys (PKs) and foreign keys (FKs) are important elements of relational schemata in various applications, such as query optimization and data integration. However, in many cases, these constraints are unknown or not documented. Detecting them manually is time-consuming and even infeasible in large-scale datasets. We study the problem of discovering primary keys and foreign keys automatically and propose an algorithm to detect both, namely Holistic Primary Key and Foreign Key Detection (HoPF). PKs and FKs are subsets of the sets of unique column combinations (UCCs) and inclusion dependencies (INDs), respectively, for which efficient discovery algorithms are known. Using score functions, our approach is able to effectively extract the true PKs and FKs from the vast sets of valid UCCs and INDs. Several pruning rules are employed to speed up the procedure. We evaluate precision and recall on three benchmarks and two real-world datasets. The results show that our method is able to retrieve on average 88% of all primary keys, and 91% of all foreign keys. We compare the performance of HoPF with two baseline approaches that both assume the existence of primary keys.


Data profiling application Primary key Foreign key Database management 



  1. Abedjan, Z., Golab, L., Naumann, F. (2015). Profiling relational data: a survey. VLDB Journal, 24(4), 557–581.CrossRefGoogle Scholar
  2. Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35, 99–109.MathSciNetzbMATHGoogle Scholar
  3. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R. (2003). Robust and efficient fuzzy match for online data cleaning. In Proceedings of the international conference on management of data (SIGMOD) (pp. 313–324).Google Scholar
  4. Chen, Z., Narasayya, V.R., Chaudhuri, S. (2014). Fast foreign-key detection in microsoft SQL server powerpivot for excel. Proceedings of the VLDB Endowment, 7 (13), 1417–1428.CrossRefGoogle Scholar
  5. Faust, M., Schwalb, D., Plattner, H. (2014). Composite group-keys – space-efficient indexing of multiple columns for compressed in-memory column stores. In Memory data management and analysis - first and second international workshops, revised selected papers (pp. 139–150).Google Scholar
  6. Ilyas, I.F., Markl, V., Haas, P.J., Brown, P., Aboulnaga, A. (2004). CORDS: automatic discovery of correlations and soft functional dependencies. In Proceedings of the international conference on management of data (SIGMOD) (pp. 647–658).Google Scholar
  7. Kantola, M., Mannila, H., Räihä, K., Siirtola, H. (1992). Discovering functional and inclusion dependencies in relational databases. International Journal of Intelligence Systems, 7(7), 591–607.CrossRefzbMATHGoogle Scholar
  8. Lopes, S., Petit, J., Toumani, F. (2002). Discovering interesting inclusion dependencies: application to logical database tuning. Information Systems (IS), 27(1), 1–19.CrossRefzbMATHGoogle Scholar
  9. Lucchesi, C.L., & Osborn, S.L. (1978). Candidate keys for relations. Journal of Computer and System Sciences, 17(2), 270–279.MathSciNetCrossRefzbMATHGoogle Scholar
  10. Marchi, F.D., Lopes, S., Petit, J. (2009). Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information System, 32(1), 53–73.CrossRefGoogle Scholar
  11. Memari, M., Link, S., Dobbie, G. (2015). SQL data profiling of foreign keys. In Proceedings of the international conference on conceptual modeling (ER) (pp. 229–243).Google Scholar
  12. Papenbrock, T., Bergmann, T., Finke, M., Zwiener, J., Naumann, F. (2015). Data profiling with metanome. Proceedings of the VLDB Endowment, 8(12), 1860–1863.CrossRefGoogle Scholar
  13. Papenbrock, T., & Naumann, F. (2017). Data-driven schema normalization. In Proceedings of the international conference on extending database technology (EDBT) (pp. 342–353).Google Scholar
  14. Rostin, A., Albrecht, O., Bauckmann, J., Naumann, F., Leser, U. (2009). A machine learning approach to foreign key discovery. In Proceedings of the ACM SIGMOD workshop on the web and databases (WebDB).Google Scholar
  15. Rubner, Y., Tomasi, C., Guibas, L.J. (1998). A metric for distributions with applications to image databases. In Procedings of the international conference on computer vision (ICCV) (pp. 59–66).Google Scholar
  16. Tschirschnitz, F., Papenbrock, T., Naumann, F. (2017). Detecting inclusion dependencies on very many tables. ACM Transactions on Database Systems (TODS), 42(3), 18:1–18,29.MathSciNetCrossRefGoogle Scholar
  17. Venetis, P., Halevy, A.Y., Madhavan, J., Pasca, M., Shen, W., Wu, F., Miao, G., Wu, C. (2011). Recovering semantics of tables on the web. Proceedings of the VLDB Endowment, 4(9), 528–538.CrossRefGoogle Scholar
  18. Zhang, M., Hadjieleftheriou, M., Ooi, B.C., Procopiuc, C.M., Srivastava, D. (2010). On multi-column foreign key discovery. Proceedings of the VLDB Endowment, 3(1–2), 805–814.CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. 1.Hasso Plattner InstituteUniversity of PotsdamPotsdamGermany

Personalised recommendations