Hollow-tree: a metric access method for data with missing values

  • Safia Brinis
  • Caetano Traina Jr.
  • Agma J. M. Traina

Abstract

Similarity search is fundamental to storing and retrieving the large volumes of complex data required by many real-world applications, and query-by-similarity is the usual mechanism for it. Thanks to their topological properties, metric similarity functions can be used to index data sets so that metric access methods can query them effectively and efficiently. However, data produced in many application domains, across varying data types, often contain missing values, and elements with missing values do not satisfy the requirements of a metric space. As a consequence, missing data distort the index structure and bias query answers. In this paper, we propose the Hollow-tree, a novel access method aimed at retrieving data with missing attribute values. It employs new indexing and searching strategies that handle missing data when the cause of missingness is ignorable. The indexing strategy is based on a family of distance functions that measure the distance between elements with missing values, along with a set of policies that organize the elements in the index without distorting its internal structure. The searching strategy uses the fractal dimension of the data to produce accurate query answers while including elements with missing values in the response. Experiments on a variety of real and synthetic data sets showed that, while other metric access methods deteriorate with small amounts of missing values, the Hollow-tree maintains almost 100% precision and recall for range queries, and more than 90% for k-nearest neighbor queries, with up to 40% of missing values.
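
To make the idea of measuring distance between elements with missing values concrete, the following is a minimal Python sketch of one common choice, a partial Euclidean distance computed only over attributes observed in both elements and rescaled to the full dimensionality. This is an illustrative assumption, not the family of distance functions defined in the paper; the function name `partial_euclidean` and the use of `None` to mark a missing attribute are hypothetical.

```python
import math


def partial_euclidean(a, b):
    """Distance over attributes observed in both a and b (None = missing).

    Illustrative only: NOT the Hollow-tree distance family from the paper.
    The squared sum is rescaled by (total attributes / shared attributes)
    so that elements with few shared attributes are not artificially close.
    Returns NaN when the two elements share no observed attribute.
    """
    if len(a) != len(b):
        raise ValueError("elements must have the same number of attributes")
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.nan
    squared_sum = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared_sum * len(a) / len(shared))


# Three elements with three attributes each; None marks a missing value.
a = [0, 0, None]
b = [0, None, 5]
c = [0, 5, 5]

d_ab = partial_euclidean(a, b)  # only attribute 0 is shared
d_bc = partial_euclidean(b, c)  # attributes 0 and 2 are shared
d_ac = partial_euclidean(a, c)  # attributes 0 and 1 are shared
violates_triangle = d_ac > d_ab + d_bc
```

In this small example `d_ab` and `d_bc` are both zero while `d_ac` is not, so the triangle inequality fails. This is the kind of violation of metric requirements that the abstract says distorts an index built on a metric access method.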

Keywords

Missing values; Missing at random; Similarity search; Fractal dimension

Notes

Acknowledgements

This research was financed, in part, by grant number 2016/17078-0 from the São Paulo Research Foundation (FAPESP), by grant number 1406799 from the Coordination for the Improvement of Higher Education Personnel (CAPES), and by grant numbers 150626/2017-7, 433328/2018-5, 309061/2017-2, 307615/2017-0, and 437420/2018-3 from the National Council for Scientific and Technological Development (CNPq).

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Computer Science Department, Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
