Abstract
Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering to different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are on an interval scale: a unit difference is assumed to have the same meaning irrespective of the magnitudes of the values it separates. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However, the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure that does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications, DBScan clustering and content-based multimedia information retrieval, using real-world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.
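As a concrete illustration of the rank transform discussed in the abstract (a hypothetical sketch, not the paper's implementation), each attribute's values can be replaced by their ranks, with tied values assigned the average of their tied positions:

```python
def rank_transform(rows):
    """Replace each attribute's values by their 1-based ranks,
    assigning tied values the average of their tied positions.
    Keeping only the order discards magnitudes, so the result is
    invariant to strictly monotonic per-attribute transforms."""
    if not rows:
        return []
    n, d = len(rows), len(rows[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in rows]
        sorted_vals = sorted(col)
        first_pos = {}
        for pos, v in enumerate(sorted_vals):
            first_pos.setdefault(v, pos)
        for i, v in enumerate(col):
            lo = first_pos[v]
            hi = lo + sorted_vals.count(v) - 1
            out[i][j] = (lo + hi) / 2 + 1  # average rank, 1-based
        # example: column [10, 5, 5] becomes ranks [3.0, 1.5, 1.5]
    return out
```

Because each instance's rank depends on every other instance, the transform must process all values of an attribute together and be recomputed whenever data change, which is the time and storage overhead the abstract refers to.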
Notes
A monotonic transform \(f:{\mathbb {R}} \rightarrow {\mathbb {R}}\) satisfies either \(x > y \Rightarrow f(x) \ge f(y)\) for all \(x, y\), or \(x > y \Rightarrow f(x) \le f(y)\) for all \(x, y\). Such a transform can map distinct values to the same output, producing ambiguities in the order. A strictly monotonic transform satisfies either \(x > y \Leftrightarrow f(x) > f(y)\) or \(x > y \Leftrightarrow f(x) < f(y)\). Hence, a strictly monotonic transform guarantees either order preservation or order reversal.
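The distinction can be made concrete with a small illustrative check (a sketch, not from the paper): `floor` is monotonic but not strictly so, and can collapse distinct values into ties, whereas cubing is strictly monotonic and preserves the order exactly:

```python
import math

def preserves_strict_order(f, xs):
    """True iff f maps every strictly increasing pair of distinct
    values in xs to a strictly increasing pair of outputs."""
    s = sorted(xs)
    return all(f(a) < f(b) for a, b in zip(s, s[1:]))

xs = [1.2, 1.8, 2.5]
# floor is monotonic but not strictly: 1.2 and 1.8 become tied at 1
print(preserves_strict_order(math.floor, xs))        # False
# cubing is strictly monotonic: the order is preserved exactly
print(preserves_strict_order(lambda x: x ** 3, xs))  # True
```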
Datasets are generally subjected to min-max normalization. As a result, linear order preserving transforms do not alter the similarity scores.
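A sketch of why (assuming the usual min-max formula): after rescaling each attribute to \([0, 1]\), a linear order-preserving transform \(ax + b\) with \(a > 0\) cancels out of the normalized values:

```python
def min_max_normalize(col):
    """Rescale a list of values to [0, 1]; assumes max(col) > min(col)."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

col = [2.0, 4.0, 6.0]
linear = [3.0 * v + 1.0 for v in col]  # a linear order-preserving transform
print(min_max_normalize(col))     # [0.0, 0.5, 1.0]
print(min_max_normalize(linear))  # [0.0, 0.5, 1.0] -- identical
```

Since every distance-based similarity score is computed from the normalized values, the linear transform leaves the scores unchanged.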
In information retrieval, only the top \(k\) instances are important to the user.
ReFeat works only with imbalanced trees; a sample size of 2 can only produce balanced trees.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Ashby FG, Ennis DM (2007) Similarity measures. Scholarpedia 2(12):4116
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cha SH (2007) Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Models Methods Appl Sci 1(4):300–307
Conover WJ (1980) Practical nonparametric statistics. Wiley, New York
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second ACM international conference on knowledge discovery and data mining, pp 226–231
Faith DP, Minchin PR, Belbin L (1987) Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69(1–3):57–68
Giacinto G, Roli F (2005) Instance-based relevance feedback for image retrieval. Adv Neural Inf Process Syst 17:489–496
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann, Burlington
He J, Li M, Zhang HJ, Tong H, Zhang C (2004) Manifold-ranking based image retrieval. In: Proceedings of the 12th annual ACM international conference on multimedia, MULTIMEDIA ’04, ACM, New York, pp 9–16
Lichman M (2014) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 22 Oct 2014
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York
Osborne J (2002) Notes on the use of data transformations. Pract Assess Res Eval 8(6):1–8
Osborne JW (2010) Improving your data transformations: applying the Box-Cox transformation. Pract Assess Res Eval 15(12):1–9
Petitjean F, Gançarski P (2012) Summarizing a set of time series by averaging: from Steiner sequence to compact multiple alignment. Theor Comput Sci 414(1):76–91
Rocchio JJ (1971) Relevance feedback in information retrieval. In: Salton G (ed) The SMART retrieval system: experiments in automatic document processing. Prentice-Hall, Englewood Cliffs, pp 313–323
Shi T, Horvath S (2006) Unsupervised learning with random forest predictors. J Comput Graph Stat 15(1):118–138
SIGKDD (2015) 2014 SIGKDD test of time award winners. http://www.kdd.org/awards/view/2014-sikdd-test-of-time-award-winners. Accessed 16 May 2015
Stevens S (1946) On the theory of scales of measurement. Science 103(2684):677–680
University of Eastern Finland (2015) Clustering datasets. https://cs.joensuu.fi/sipu/datasets/. Accessed 19 Nov 2015
Zezula P, Amato G, Dohnal V, Batko M (2006) Similarity search: the metric space approach, advances in database systems, vol 32. Springer, Berlin
Zhang R, Zhang ZM (2006) BALAS: empirical Bayesian learning in the relevance feedback for image retrieval. Image Vis Comput 24(3):211–223
Zhou G, Ting K, Liu F, Yin Y (2012) Relevance feature mapping for content-based multimedia information retrieval. Pattern Recognit 45(4):1707–1720
Zhou ZH, Dai HB (2006) Query-sensitive similarity measure for content-based image retrieval. In: Proceedings of the sixth international conference on data mining, ICDM ’06, IEEE Computer Society, Washington, DC, pp 1211–1215
Acknowledgments
We are grateful to Francois Petitjean for valuable feedback and suggestions. This research has been supported by the Australian Research Council under Grant DP140100087.
Additional information
Responsible editor: Eamonn Keogh.
Cite this article
Fernando, T.L., Webb, G.I. SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data Min Knowl Disc 31, 264–286 (2017). https://doi.org/10.1007/s10618-016-0463-0