Distance, Similarity and Sequence Comparison

Advances in Sequence Analysis: Theory, Method, Applications

Part of the book series: Life Course Research and Social Policies (LCRS, volume 2)

Abstract

In this chapter we focus on the axiomatic foundations of, and the relations between, two fundamental concepts of sequence analysis: distance and similarity. We discuss and interpret each of the individual axioms and point out their relevance in practical applications. We also discuss units of distance, admissible transformations, and normalization as a method that allows for interpreting the size of distances and similarities. Finally, we discuss how similarity and distance can be derived from each other and, in passing, address some common misunderstandings pertaining to these concepts.
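
As a rough illustration of what checking distance axioms and normalizing a distance into a similarity can look like, the following Python sketch uses the Hamming distance on equal-length state sequences. The axioms tested are the standard metric axioms (non-negativity, identity, symmetry, triangle inequality). The helper names, the toy sequences, and the normalization 1 - d/n are assumptions made for this example and are not taken from the chapter.

```python
from itertools import product


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


def satisfies_metric_axioms(d, items):
    """Brute-force check of the standard metric axioms on a finite collection."""
    for x, y in product(items, repeat=2):
        if d(x, y) < 0:                      # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):       # zero distance only for identical items
            return False
        if d(x, y) != d(y, x):               # symmetry
            return False
    for x, y, z in product(items, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):      # triangle inequality
            return False
    return True


def normalized_similarity(x, y):
    """Assumed normalization: 1 for identical sequences, 0 if no position agrees.

    Requires non-empty sequences of equal length.
    """
    return 1 - hamming(x, y) / len(x)


# Hypothetical toy state sequences (e.g. S = single, M = married, C = with child).
sequences = ["SSMM", "SMMC", "SSSS", "MMCC"]

print(satisfies_metric_axioms(hamming, sequences))
print(normalized_similarity("SSMM", "SMMC"))

# Squaring preserves the ordering of distances but can break the triangle inequality.
print(satisfies_metric_axioms(lambda x, y: hamming(x, y) ** 2, sequences))
```

On these four sequences the Hamming distance passes every check, whereas its square fails the triangle inequality (9 > 4 + 4 for SMMC, SSMM and SSSS). An order-preserving transformation of a distance therefore need not be a distance itself.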

Notes

  1. Please note that I write x, y for sequences, x1, y1 for the states of the sequences, x, y for the representing vectors and x1, y1 for the coordinates of the vectors.

  2. This example was suggested to me by Matthias Studer through personal communication.

  3. Chen et al. (2009) use the expression “similarity metric” for any s that satisfies the axioms S1–S5. This is defensible, since a similarity s “metricizes” the sequence space just as a distance does. However, we prefer to call such an s a “similarity”, since the noun “metric” has been associated with “distance” for over a century now.

References

  • Batagelj, V., & Bren, M. (1995). Comparing resemblance measures. Journal of Classification, 12, 73–90.

  • Berghammer, C. (2010). Family life trajectories and religiosity in Austria. European Sociological Review, 26, 1–18.

  • Bonetti, M., Piccarreta, R., & Salford, G. (2013). Parametric and nonparametric analysis of life courses: An application to family formation patterns. Demography, 50, 881–902.

  • Bras, H., Liefbroer, A. C., & Elzinga, C. H. (2010). Standardization of pathways to adulthood? An analysis of Dutch cohorts born between 1850 and 1900. Demography, 47(4), 1013–1034.

  • Chen, S., Ma, B., & Zhang, K. (2009). On the similarity metric and the distance metric. Theoretical Computer Science, 410(24–25), 2365–2376.

  • Crochemore, M., Hancart, C., & Lecroq, T. (2007). Algorithms on strings. Cambridge: Cambridge University Press.

  • Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.

  • Elzinga, C. H. (2003). Sequence similarity—A non-aligning technique. Sociological Methods & Research, 31(4), 3–29.

  • Elzinga, C. H. (2005). Combinatorial representation of token sequences. Journal of Classification, 22(1), 87–118.

  • Elzinga, C. H., & Liefbroer, A. C. (2007). De-standardization and differentiation of family life trajectories. European Journal of Population, 23(3–4), 225–250.

  • Elzinga, C. H., & Wang, H. (2013). Versatile string kernels. Theoretical Computer Science, 495, 50–65.

  • Elzinga, C. H., Rahmann, S., & Wang, H. (2008). Algorithms for subsequence combinatorics. Theoretical Computer Science, 409(3), 394–404.

  • Elzinga, C. H., Wang, H., Lin, Z., & Kumar, Y. (2011). Concordance and consensus. Information Sciences, 181, 2529–2549.

  • Emms, M., & Franco-Penya, H.-H. (2013). On the expressivity of alignment-based distance and similarity measures on sequences and trees in inducing orderings. In P. Latorre Carmona, J.J. Sánchez, & A.L. Fred (Eds.), Mathematical methodologies in pattern recognition and machine learning. ICPRAM 2012 international conference on pattern recognition applications and methods, vol. 30 of Springer proceedings in mathematics & statistics (pp. 1–18). New York: Springer.

  • Fasang, A. E. (2010). Retirement: Institutional pathways and individual trajectories in Britain and Germany. Sociological Research Online, 15(2), 1.

  • Fréchet, M. R. (1906). Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo, 22(1), 1–72.

  • Gabadinho, A., Ritschard, G., Müller, N., & Studer, M. (2011). Analyzing and visualizing state sequences in R with TraMineR. Journal of Statistical Software, 40(4), 1–37.

  • Gauthier, J.-A., Widmer, E., Bucher, P., & Notredame, C. (2010). Multichannel sequence analysis applied to social science data. Sociological Methodology, 40(1), 1–38.

  • Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27, 857–871.

  • Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5–48.

  • Halpin, B. (2010). Optimal matching analysis and life-course data: The importance of duration. Sociological Methods & Research, 38(3), 365–388.

  • Hamming, R. W. (1950). Error-detecting and error-correcting codes. Bell System Technical Journal, 29(2), 147–160.

  • Han, S.-K., & Moen, P. (1999). Clocking out: Temporal patterning of retirement. American Journal of Sociology, 105(1), 191–236.

  • Holliday, J. D., Hu, C.-Y., & Willett, P. (2002). Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D-fragment bit-strings. Combinatorial Chemistry and High-Throughput Screening, 5, 155–166.

  • Hollister, M. N. (2009). Is optimal matching sub-optimal? Sociological Methods & Research, 38, 235–264.

  • Lipkus, A. H. (1999). A proof of the triangle inequality for the Tanimoto distance. Journal of Mathematical Chemistry, 26, 263–265.

  • Manzoni, A., Vermunt, J. K., Luijkx, R., & Muffels, R. (2010). Memory bias in retrospective collected employment careers: A model-based approach to correct for measurement error. Sociological Methodology, 40(1), 39–73.

  • Martin, P., Schoon, I., & Ross, A. (2008). Beyond transitions: Applying optimal matching analysis to life course research. International Journal of Social Research Methodology, 11(3), 179–199.

  • Massoni, S., Olteanu, M., & Rousset, P. (2009). Career-path analysis using optimal matching and self-organizing maps. In J.C. Príncipe & R. Miikkulainen (Eds.), Advances in self-organizing maps. Lecture notes in computer science 5629. (pp. 154–612). New York: Springer.

  • Rogers, D. H., & Tanimoto, T. T. (1960). A computer program for classifying plants. Science, 132, 1115–1118.

  • Rohwer, G., & Pötter, U. (1999). TDA User’s Manual. Bochum: Ruhr-Universität.

  • Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press.

  • Shawe-Taylor, J., & Cristianini, N. (2004). Kernel methods for pattern analysis. Cambridge: Cambridge University Press.

  • Studer, M., Ritschard, G., Gabadinho, A., & Muller, N. S. (2011). Discrepancy analysis of state sequences. Sociological Methods & Research, 40(3), 471–510.

  • Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352.

  • Wang, H. (2006). Nearest neighbors by neighborhood counting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6), 1–12.

  • Yujian, L., & Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091–1095.

Author information

Correspondence to Cees H. Elzinga.

Copyright information

© 2014 Springer New York Heidelberg Dordrecht London

Cite this chapter

Elzinga, C. (2014). Distance, Similarity and Sequence Comparison. In: Blanchard, P., Bühlmann, F., Gauthier, JA. (eds) Advances in Sequence Analysis: Theory, Method, Applications. Life Course Research and Social Policies, vol 2. Springer, Cham. https://doi.org/10.1007/978-3-319-04969-4_4
