Hollow-tree: a metric access method for data with missing values

Brinis, Safia; Traina, Caetano; Traina, Agma J. M.

doi:10.1007/s10844-019-00567-8

Published: 09 July 2019

Volume 53, pages 481–508 (2019)

Journal of Intelligent Information Systems

Abstract

Similarity search is fundamental to storing and retrieving the large volumes of complex data required by many real-world applications, and query-by-similarity is a useful mechanism for it. Metric similarity functions, by virtue of their topological properties, can be used to index data sets so that they can be queried effectively and efficiently by so-called metric access methods. However, data produced by various application domains, and the variety of data types handled, often contain missing values and therefore do not satisfy the requirements of metric similarity. As a consequence, missing data distort the index structure and bias the query answers. In this paper, we propose the Hollow-tree, a novel access method aimed at successfully retrieving data with missing attribute values. It employs new strategies for indexing and searching data elements that can handle missing data when the cause of missingness is ignorable. The indexing strategy is based on a family of distance functions that measure the distance between elements with missing values, along with a set of policies that organize the elements in the index without distorting its internal structure. The searching strategy uses the fractal dimension of the data to achieve accurate query answers while considering elements with missing values as part of the response. Experiments on a variety of real and synthetic data sets showed that, while other metric access methods deteriorate with small amounts of missing values, the Hollow-tree maintains almost 100% precision and recall for range queries and more than 90% for k-nearest neighbor queries, with up to 40% of missing values.
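
As a rough illustration of what measuring the distance between elements with missing values can look like, the sketch below computes a Euclidean distance over the attributes observed in both elements and rescales it by the fraction of attributes used. This partial-distance heuristic is an assumption chosen for illustration and is not necessarily one of the distance functions defined in the paper. The names `partial_distance` and `range_query` are hypothetical, missing values are encoded as `None`, and the range query is a plain linear scan rather than a tree traversal.

```python
import math
from typing import List, Optional, Sequence

# None marks a missing attribute value (an encoding assumed for this sketch).
Vector = Sequence[Optional[float]]


def partial_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance over attributes present in both vectors,
    rescaled to the full dimensionality (illustrative heuristic)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of attributes")
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        # No attribute is observed in both elements: treat as incomparable.
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    # Scale up to compensate for the attributes that could not be compared.
    return math.sqrt(squared * len(a) / len(shared))


def range_query(data: List[Vector], query: Vector, radius: float) -> List[int]:
    """Return the indices of all elements within `radius` of `query`
    using a linear scan (no index structure involved)."""
    return [
        i for i, item in enumerate(data)
        if partial_distance(item, query) <= radius
    ]


if __name__ == "__main__":
    data = [[0.0, 0.0, 0.0], [3.0, None, 4.0], [None, None, None]]
    query = [0.0, 0.0, 0.0]
    print(range_query(data, query, radius=7.0))
```

A function of this kind may violate the triangle inequality, which is one way missing values can break the assumptions that metric access methods rely on when pruning.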

Acknowledgements

This research was financed, in part, by grant number 2016/17078-0 from the São Paulo Research Foundation (FAPESP), by grant number 1406799 from the Coordination for the Improvement of Higher Education Personnel (CAPES), and by grant numbers 150626/2017-7, 433328/2018-5, 309061/2017-2, 307615/2017-0, and 437420/2018-3 from the National Council for Scientific and Technological Development (CNPq).

Author information

Authors and Affiliations

Computer Science Department, Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
Safia Brinis, Caetano Traina Jr. & Agma J. M. Traina

Corresponding author

Correspondence to Safia Brinis.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper was supported by CNPq, CAPES and FAPESP.

About this article

Cite this article

Brinis, S., Traina, C. & Traina, A.J.M. Hollow-tree: a metric access method for data with missing values. J Intell Inf Syst 53, 481–508 (2019). https://doi.org/10.1007/s10844-019-00567-8

Received: 06 July 2018
Revised: 10 May 2019
Accepted: 04 June 2019
Published: 09 July 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s10844-019-00567-8
