Distance, Similarity and Sequence Comparison

Elzinga, Cees H.

doi:10.1007/978-3-319-04969-4_4

Part of the book series: Life Course Research and Social Policies (LCRS, volume 2)

Abstract

In this chapter we focus on the axiomatic foundations of, and the relations between, two fundamental concepts of sequence analysis: distance and similarity. We discuss and interpret each of the individual axioms and point out their relevance in practical applications. We discuss units of distance, admissible transformations, and normalization as a method for interpreting the size of distances and similarities. We also discuss how similarity and distance can be derived from each other and, in passing, address some common misunderstandings about these concepts.
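As a rough illustration of the ideas named in the abstract, the sketch below checks the usual metric axioms (non-negativity, identity, symmetry, triangle inequality) by brute force, normalizes a distance by its maximum, and derives a similarity from the normalized distance. The chapter's own axioms and examples are not reproduced in this preview, so everything here is an assumption: the Hamming distance (Hamming 1950) stands in as the example distance, the function names are hypothetical, and "one minus the normalized distance" is only one common way to obtain a similarity.

```python
from itertools import product


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


def normalized_hamming(x, y):
    """Hamming distance divided by its maximum (the common length): lies in [0, 1]."""
    return hamming(x, y) / len(x) if len(x) else 0.0


def similarity(x, y):
    """Assumed conversion: similarity as one minus the normalized distance."""
    return 1.0 - normalized_hamming(x, y)


def satisfies_metric_axioms(d, items):
    """Brute-force check of the standard metric axioms on a finite set of items."""
    for x, y in product(items, repeat=2):
        if d(x, y) < 0:                   # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):    # zero distance only between identical items
            return False
        if d(x, y) != d(y, x):            # symmetry
            return False
    for x, y, z in product(items, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):   # triangle inequality
            return False
    return True


# Hypothetical toy data: all length-3 sequences over the states S(ingle) and M(arried).
sequences = ["".join(p) for p in product("SM", repeat=3)]

print(satisfies_metric_axioms(hamming, sequences))                            # True
print(satisfies_metric_axioms(lambda x, y: hamming(x, y) ** 2, sequences))    # False
print(normalized_hamming("SSM", "SMM"), similarity("SSM", "SMM"))
```

The second check shows why the choice of transformation matters: squaring the Hamming distance keeps the ordering of distances but breaks the triangle inequality (for example, SSS to MMS via SMS), so the result is no longer a metric.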

Notes

1. Please note that I write x, y for sequences, x1, y1 for the states of the sequences, x, y for the representing vectors and x1, y1 for the coordinates of the vectors.
2. This example was suggested to me by Matthias Studer in a personal communication.
3. Chen et al. (2009) use the expression “similarity metric” for any s that satisfies the axioms S1–S5. This is defensible, since a similarity s “metricizes” the sequence space, just like a distance. However, we prefer to call such an s a “similarity”, since the noun “metric” has been associated with “distance” for over a century now.

References

Batagelj, V., & Bren, M. (1995). Comparing resemblance measures. Journal of Classification, 12, 73–90.
Berghammer, C. (2010). Family life trajectories and religiosity in Austria. European Sociological Review, 26, 1–18.
Bonetti, M., Piccarreta, R., & Salford, G. (2013). Parametric and nonparametric analysis of life courses: An application to family formation patterns. Demography, 50, 881–902.
Bras, H., Liefbroer, A. C., & Elzinga, C. H. (2010). Standardization of pathways to adulthood? An analysis of Dutch cohorts born between 1850 and 1900. Demography, 47(4), 1013–1034.
Chen, S., Ma, B., & Zhang, K. (2009). On the similarity metric and the distance metric. Theoretical Computer Science, 410(24–25), 2365–2376.
Crochemore, M., Hancart, C., & Lecroq, T. (2007). Algorithms on strings. Cambridge: Cambridge University Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.
Elzinga, C. H. (2003). Sequence similarity—A non-aligning technique. Sociological Methods & Research, 31(4), 3–29.
Elzinga, C. H. (2005). Combinatorial representation of token sequences. Journal of Classification, 22(1), 87–118.
Elzinga, C. H., & Liefbroer, A. C. (2007). De-standardization and differentiation of family life trajectories. European Journal of Population, 23(3–4), 225–250.
Elzinga, C. H., & Wang, H. (2013). Versatile string kernels. Theoretical Computer Science, 495, 50–65.
Elzinga, C. H., Rahmann, S., & Wang, H. (2008). Algorithms for subsequence combinatorics. Theoretical Computer Science, 409(3), 394–404.
Elzinga, C. H., Wang, H., Lin, Z., & Kumar, Y. (2011). Concordance and consensus. Information Sciences, 181, 2529–2549.
Emms, M., & Franco-Penya, H.-H. (2013). On the expressivity of alignment-based distance and similarity measures on sequences and trees in inducing orderings. In P. Latorre Carmona, J. J. Sánchez, & A. L. Fred (Eds.), Mathematical methodologies in pattern recognition and machine learning. ICPRAM 2012 international conference on pattern recognition applications and methods, vol. 30 of Springer proceedings in mathematics & statistics (pp. 1–18). New York: Springer.
Fasang, A. E. (2010). Retirement: Institutional pathways and individual trajectories in Britain and Germany. Sociological Research Online, 15(2), 1.
Fréchet, M. R. (1906). Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo, 22(1), 1–72.
Gabadinho, A., Ritschard, G., Müller, N., & Studer, M. (2011). Analyzing and visualizing state sequences in R with TraMineR. Journal of Statistical Software, 40(4), 1–37.
Gauthier, J.-A., Widmer, E., Bucher, P., & Notredame, C. (2010). Multichannel sequence analysis applied to social science data. Sociological Methodology, 40(1), 1–38.
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857–871.
Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5–48.
Halpin, B. (2010). Optimal matching analysis and life-course data: The importance of duration. Sociological Methods & Research, 38(3), 365–388.
Hamming, R. W. (1950). Error-detecting and error-correcting codes. Bell System Technical Journal, 29(2), 147–160.
Han, S.-K., & Moen, P. (1999). Clocking out: Temporal patterning of retirement. American Journal of Sociology, 105(1), 191–236.
Holliday, J. D., Hu, C.-Y., & Willett, P. (2002). Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D-fragment bit-strings. Combinatorial Chemistry and High-Throughput Screening, 5, 155–166.
Hollister, M. N. (2009). Is optimal matching sub-optimal? Sociological Methods & Research, 38, 235–264.
Lipkus, A. H. (1999). A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry, 26, 263–265.
Manzoni, A., Vermunt, J. K., Luijkx, R., & Muffels, R. (2010). Memory bias in retrospectively collected employment careers: A model-based approach to correct for measurement error. Sociological Methodology, 40(1), 39–73.
Martin, P., Schoon, I., & Ross, A. (2008). Beyond transitions: Applying optimal matching analysis to life course research. International Journal of Social Research Methodology, 11(3), 179–199.
Massoni, S., Olteanu, M., & Rousset, P. (2009). Career-path analysis using optimal matching and self-organizing maps. In J. C. Príncipe & R. Miikkulainen (Eds.), Advances in self-organizing maps. Lecture notes in computer science 5629 (pp. 154–612). New York: Springer.
Rogers, D. H., & Tanimoto, T. T. (1960). A computer program for classifying plants. Science, 132, 1115–1118.
Rohwer, G., & Pötter, U. (1999). TDA User’s Manual. Bochum: Ruhr-Universität.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.
Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.
Studer, M., Ritschard, G., Gabadinho, A., & Müller, N. S. (2011). Discrepancy analysis of state sequences. Sociological Methods & Research, 40(3), 471–510.
Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352.
Wang, H. (2006). Nearest neighbors by neighborhood counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6), 1–12.
Yujian, L., & Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091–1095.

Author information

Authors and Affiliations

VU University Amsterdam, Amsterdam, The Netherlands
Cees H. Elzinga

Corresponding author

Correspondence to Cees H. Elzinga.

Editor information

Editors and Affiliations

Institute of Political and International Studies, University of Lausanne, Lausanne, Switzerland
Philippe Blanchard
Centre LINES/LIVES, University of Lausanne, Lausanne, Switzerland
Felix Bühlmann
Centre LINES/LIVES, University of Lausanne, Lausanne, Switzerland
Jacques-Antoine Gauthier

About this chapter

Cite this chapter

Elzinga, C. H. (2014). Distance, Similarity and Sequence Comparison. In: Blanchard, P., Bühlmann, F., Gauthier, J.-A. (eds) Advances in Sequence Analysis: Theory, Method, Applications. Life Course Research and Social Policies, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-319-04969-4_4

DOI: https://doi.org/10.1007/978-3-319-04969-4_4
Published: 03 July 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04968-7
Online ISBN: 978-3-319-04969-4
eBook Packages: Humanities, Social Sciences and Law; Social Sciences (R0)
