Abstract
In this chapter we focus on two fundamental concepts of sequence analysis, distance and similarity: their axiomatic foundations and the relations between them. We discuss and interpret each of the individual axioms and point out their relevance in practical applications. We also discuss units of distance, admissible transformations, and normalization as a method for interpreting the size of distances and similarities. Finally, we discuss how similarity and distance can be derived from each other and, in passing, address some common misunderstandings about these concepts.
Notes
1. Please note that I write x, y for sequences, x1, y1 for the states of the sequences, x, y for the representing vectors and x1, y1 for the coordinates of the vectors.
2. This example was suggested to me by Matthias Studer through personal communication.
3. Chen et al. (2009) use the expression “similarity metric” for any s that satisfies the axioms S1–S5. This is defensible, since a similarity s “metricizes” the sequence space, just like a distance. However, we prefer to call such an s a “similarity”, since the noun “metric” has been associated with “distance” for over a century now.
References
Batagelj, V., & Bren, M. (1995). Comparing resemblance measures. Journal of Classification, 12, 73–90.
Berghammer, C. (2010). Family life trajectories and religiosity in Austria. European Sociological Review, 26, 1–18.
Bonetti, M., Piccarreta, R., & Salford, G. (2013). Parametric and nonparametric analysis of life courses: An application to family formation patterns. Demography, 50, 881–902.
Bras, H., Liefbroer, A. C., & Elzinga, C. H. (2010). Standardization of pathways to adulthood? An analysis of Dutch cohorts born between 1850 and 1900. Demography, 47(4), 1013–1034.
Chen, S., Ma, B., & Zhang, K. (2009). On the similarity metric and the distance metric. Theoretical Computer Science, 410(24–25), 2365–2376.
Crochemore, M., Hancart, C., & Lecroq, T. (2007). Algorithms on strings. Cambridge: Cambridge University Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.
Elzinga, C. H. (2003). Sequence similarity: A non-aligning technique. Sociological Methods & Research, 32(1), 3–29.
Elzinga, C. H. (2005). Combinatorial representation of token sequences. Journal of Classification, 22(1), 87–118.
Elzinga, C. H., & Liefbroer, A. C. (2007). De-standardization and differentiation of family life trajectories. European Journal of Population, 23(3–4), 225–250.
Elzinga, C. H., & Wang, H. (2013). Versatile string kernels. Theoretical Computer Science, 495, 50–65.
Elzinga, C. H., Rahmann, S., & Wang, H. (2008). Algorithms for subsequence combinatorics. Theoretical Computer Science, 409(3), 394–404.
Elzinga, C. H., Wang, H., Lin, Z., & Kumar, Y. (2011). Concordance and consensus. Information Sciences, 181, 2529–2549.
Emms, M., & Franco-Penya, H.-H. (2013). On the expressivity of alignment-based distance and similarity measures on sequences and trees in inducing orderings. In P. Latorre Carmona, J.J. Sánchez, & A.L. Fred (Eds.), Mathematical methodologies in pattern recognition and machine learning. ICPRAM 2012 international conference on pattern recognition applications and methods, vol. 30 of Springer proceedings in mathematics & statistics (pp. 1–18). New York: Springer.
Fasang, A. E. (2010). Retirement: Institutional pathways and individual trajectories in Britain and Germany. Sociological Research Online, 15(2), 1.
Fréchet, M. R. (1906). Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo, 22(1), 1–72.
Gabadinho, A., Ritschard, G., Müller, N., & Studer, M. (2011). Analyzing and visualizing state sequences in R with TraMineR. Journal of Statistical Software, 40(4), 1–37.
Gauthier, J.-A., Widmer, E., Bucher, P., & Notredame, C. (2010). Multichannel sequence analysis applied to social science data. Sociological Methodology, 40(1), 1–38.
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857–871.
Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5–48.
Halpin, B. (2010). Optimal matching analysis and life-course data: The importance of duration. Sociological Methods & Research, 38(3), 365–388.
Hamming, R. W. (1950). Error-detecting and error-correcting codes. Bell System Technical Journal, 29(2), 147–160.
Han, S.-K., & Moen, P. (1999). Clocking out: Temporal patterning of retirement. American Journal of Sociology, 105(1), 191–236.
Holliday, J. D., Hu, C.-Y., & Willett, P. (2002). Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D-fragment bit-strings. Combinatorial Chemistry and High-Throughput Screening, 5, 155–166.
Hollister, M. N. (2009). Is optimal matching sub-optimal? Sociological Methods & Research, 38, 235–264.
Lipkus, A. H. (1999). A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry, 26, 263–265.
Manzoni, A., Vermunt, J. K., Luijkx, R., & Muffels, R. (2010). Memory bias in retrospective collected employment careers: A model-based approach to correct for measurement error. Sociological Methodology, 40(1), 39–73.
Martin, P., Schoon, I., & Ross, A. (2008). Beyond transitions: Applying optimal matching analysis to life course research. International Journal of Social Research Methodology, 11(3), 179–199.
Massoni, S., Olteanu, M., & Rousset, P. (2009). Career-path analysis using optimal matching and self-organizing maps. In J.C. Príncipe & R. Miikkulainen (Eds.), Advances in self-organizing maps. Lecture notes in computer science 5629 (pp. 154–162). New York: Springer.
Rogers, D. J., & Tanimoto, T. T. (1960). A computer program for classifying plants. Science, 132, 1115–1118.
Rohwer, G., & Pötter, U. (1999). TDA User’s Manual. Bochum: Ruhr-Universität.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Studer, M., Ritschard, G., Gabadinho, A., & Müller, N. S. (2011). Discrepancy analysis of state sequences. Sociological Methods & Research, 40(3), 471–510.
Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352.
Wang, H. (2006). Nearest neighbors by neighborhood counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6), 1–12.
Yujian, L., & Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091–1095.
© 2014 Springer New York Heidelberg Dordrecht London
About this chapter
Cite this chapter
Elzinga, C. (2014). Distance, Similarity and Sequence Comparison. In: Blanchard, P., Bühlmann, F., Gauthier, J.-A. (eds) Advances in Sequence Analysis: Theory, Method, Applications. Life Course Research and Social Policies, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-319-04969-4_4
DOI: https://doi.org/10.1007/978-3-319-04969-4_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04968-7
Online ISBN: 978-3-319-04969-4
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)