Asymptotic properties of Lee distance

Nikolov, Nikolay I.; Stoimenova, Eugenia

doi:10.1007/s00184-018-0687-7

Published: 29 September 2018

Metrika, volume 82, pages 385–408 (2019)

A Correction to this article was published on 20 February 2021

This article has been updated

Abstract

Distances on permutations are often convenient tools for analyzing and modeling rank data. They measure the closeness between two rankings and can be very useful and informative for revealing the main structure and features of the data. In this paper, some statistical properties of the Lee distance are studied. Asymptotic results for the random variable induced by Lee distance are derived and used to compare the Distance-based probability model and the Marginals model for complete rankings. Three rank datasets are analyzed as an illustration of the presented models.

Change history

20 February 2021
A Correction to this paper has been published: https://doi.org/10.1007/s00184-021-00810-9


Acknowledgements

The work of the first author was supported by the Support Program of Bulgarian Academy of Sciences for Young Researchers under Grant 17-95/2017. The work of the second author was supported by the National Science Fund of Bulgaria under Grant DH02-13.

Author information

Authors and Affiliations

Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Acad. G. Bontchev Str., Block 8, 1113, Sofia, Bulgaria
Nikolay I. Nikolov & Eugenia Stoimenova

Corresponding author

Correspondence to Nikolay I. Nikolov.

Appendix

To prove Theorem 3, consider the random variables $D_{N,k}=d_{L}\left( \pi ,e_{N}\right) $, where $k=1,2,\ldots ,N$ and $\pi $ is drawn uniformly at random from ${\mathbf {S}}_{N,k}=\left\{ \sigma \in {\mathbf {S}}_{N}: \sigma (N)=k\right\} $, i.e. $\pi \sim Uniform({\mathbf {S}}_{N,k})$. Then, for fixed $k$,

$$\begin{aligned} D_{N,k}(\pi )=\sum \limits _{i=1}^{N}c_{N}(\pi (i),i)=\sum \limits _{i=1}^{N-1}c_{N}(\pi (i),i) + c_{N}(k,N)=\sum \limits _{i=1}^{N-1}\tilde{c}_{N}(\sigma (i),i) + c_{N}(k,N), \end{aligned}$$

where $\sigma \in {\mathbf {S}}_{N-1}$ and for $i,j=1,2,\ldots ,N-1$,

$$\begin{aligned} \sigma (i)= {\left\{ \begin{array}{ll} \pi (i), &{} \text{ if } \pi (i)<k\\ \pi (i)-1, &{} \text{ if } \pi (i)>k, \end{array}\right. } \qquad \tilde{c}_{N}(j,i)= {\left\{ \begin{array}{ll} c_{N}(j,i), &{} \text{ if } j<k \\ c_{N}(j+1,i), &{} \text{ if } j\ge k. \end{array}\right. } \end{aligned}$$

(21)

Lemma 1

Let $ \tilde{D}_{N-1}\left( \sigma \right) =\sum \nolimits _{i=1}^{N-1}\tilde{c}_{N}(\sigma (i),i)$, where $\sigma \sim Uniform({\mathbf {S}}_{N-1})$ and $\tilde{c}_{N}(\cdot ,\cdot )$ is as in (21). Then the distribution of $\tilde{D}_{N-1}$ is asymptotically normal and the mean and variance of $\tilde{D}_{N-1}$ are

$$\begin{aligned} {\mathbf {E}}\left( \tilde{D}_{N-1}\right)&= \displaystyle \frac{c_{N}(k,N)}{N-1}+\frac{N-2}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] , \\ {\mathbf {Var}} \left( \tilde{D}_{N-1}\right)&= \displaystyle \frac{ \displaystyle N^{2} \left( c_{N}\left( k,N\right) \right) ^{2}- 2N\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] c_{N}\left( k,N\right) }{\left( N-2\right) \left( N-1\right) ^{2}} + \beta _{N-1}, \end{aligned}$$

where

$$\begin{aligned} \beta _{N-1} = {\left\{ \begin{array}{ll} \displaystyle \frac{N^{2}\left( N^{3}-2N^{2}+10N-12\right) }{48(N-1)^{2}}, &{}\quad \text{ for } N \text{ even } \\ \displaystyle \frac{\left( N+1\right) \left( N^{3}-3N^{2}+6N-6\right) }{48(N-2)}, &{}\quad \text{ for } N \text{ odd. } \end{array}\right. } \end{aligned}$$

(22)
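As a sanity check on Lemma 1, the closed forms can be compared with brute-force enumeration of ${\mathbf {S}}_{N-1}$ for small $N$. The sketch below is illustrative only and is not part of the original proof. The helper names (`lee_cost`, `reduced_cost`, `exact_moments`, `lemma_moments`) are hypothetical, and the code assumes the Lee cost $c_{N}(a,b)=\min (|a-b|,N-|a-b|)$, which matches the expressions used for $Q_{1}^{(1)},\ldots ,Q_{1}^{(5)}$ in the proof.

```python
from fractions import Fraction
from itertools import permutations


def lee_cost(a, b, n):
    """Assumed Lee cost c_N(a, b) = min(|a - b|, N - |a - b|)."""
    d = abs(a - b)
    return min(d, n - d)


def reduced_cost(j, i, k, n):
    """c-tilde_N(j, i) from (21): values >= k are shifted up by one."""
    return lee_cost(j if j < k else j + 1, i, n)


def exact_moments(n, k):
    """Exact mean and variance of D-tilde_{N-1}, enumerating all of S_{N-1}."""
    totals = [
        sum(reduced_cost(sigma[i - 1], i, k, n) for i in range(1, n))
        for sigma in permutations(range(1, n))
    ]
    mean = Fraction(sum(totals), len(totals))
    var = Fraction(sum(t * t for t in totals), len(totals)) - mean**2
    return mean, var


def lemma_moments(n, k):
    """Mean and variance as stated in Lemma 1, with beta_{N-1} from (22)."""
    c = lee_cost(k, n, n)
    r = ((n + 1) // 2) * (n // 2)  # [(N+1)/2] * [N/2]
    mean = Fraction(c, n - 1) + Fraction(n - 2, n - 1) * r
    if n % 2 == 0:
        beta = Fraction(n**2 * (n**3 - 2 * n**2 + 10 * n - 12), 48 * (n - 1) ** 2)
    else:
        beta = Fraction((n + 1) * (n**3 - 3 * n**2 + 6 * n - 6), 48 * (n - 2))
    var = Fraction(n**2 * c**2 - 2 * n * r * c, (n - 2) * (n - 1) ** 2) + beta
    return mean, var
```

Calling `exact_moments(n, k) == lemma_moments(n, k)` compares the two for a given pair; enumeration costs $(N-1)!$ evaluations, so only small $N$ is practical. Exact rational arithmetic is used so that the comparison is not blurred by rounding.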

Proof

From (6) of Theorem 1 and formulas (21) and (10), it follows that

$$\begin{aligned}&{\mathbf {E}}\left( \tilde{D}_{N-1}\right) {\mathop {=}\limits ^{(6)}}\frac{1}{N-1} \sum _{i=1}^{N-1}\sum _{j=1}^{N-1}\tilde{c}_{N}(i,j) {\mathop {=}\limits ^{(21)}}\frac{1}{N-1}\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}\sum _{j=1}^{N-1}c_{N}(i,j)\\&\quad =\frac{1}{N-1}\sum _{i=1}^{N}\sum _{j=1}^{N}c_{N}(i,j)-\frac{1}{N-1} \sum _{i=1}^{N}c_{N}(i,N)-\frac{1}{N-1}\sum _{j=1}^{N}c_{N}(k,j)+\frac{c_{N}(k,N)}{N-1}\\&\quad {\mathop {=}\limits ^{(10)}}\frac{N}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] -\frac{1}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] -\frac{1}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] +\frac{c_{N}(k,N)}{N-1}\\&\quad =\frac{c_{N}(k,N)}{N-1}+\frac{N-2}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] . \end{aligned}$$

Using (7) of Theorem 1,

$$\begin{aligned} {\mathbf {Var}} \left( \tilde{D}_{N-1}\right)= & {} \frac{1}{N-2}\sum _{i=1}^{N-1}\sum _{j=1}^{N-1}\tilde{b}_{N}^{2}(i,j)=\frac{1}{N-2}\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}\sum _{j=1}^{N-1}b_{N}^{2}(i,j), \quad \text{ where }\nonumber \\ b_{N}(i,j)= & {} c_{N}(i,j)- \sum _{\begin{array}{c} g=1 \\ g\ne k \end{array}}^{N}\frac{c_{N}(g,j)}{N-1}-\sum _{h=1}^{N-1}\frac{c_{N}(i,h)}{N-1}+\frac{1}{\left( N-1\right) ^{2}} \sum _{\begin{array}{c} g=1 \\ g\ne k \end{array}}^{N}\sum _{h=1}^{N-1}c_{N}(g,h), \nonumber \\ \end{aligned}$$

(23)

for $i,j=1,2,\ldots ,N$. Simplifying expression (23) gives

$$\begin{aligned} b_{N}(i,j)=c_{N}(i,j)+\frac{c_{N}(i,N)+c_{N}(k,j)}{N-1}+\frac{c_{N}(k,N)}{\left( N-1\right) ^{2}}-\frac{N}{\left( N-1\right) ^{2}}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] . \end{aligned}$$

(24)

When N is even, the variance of $\tilde{D}_{N-1}$ can be calculated by

$$\begin{aligned} {\mathbf {Var}} \left( \tilde{D}_{N-1}\right)&=\frac{1}{N-2}\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}\left\{ \sum _{j=1}^{k-\frac{N}{2}}b_{N}^{2}(i,j)+\sum _{j=k-\frac{N}{2}+1}^{\frac{N}{2}}b_{N}^{2}(i,j)+\sum _{j=\frac{N}{2}+1}^{k}b_{N}^{2}(i,j)\right. \\&\quad \left. +\sum _{j=k+1}^{N-1}b_{N}^{2}(i,j)\right\} =\frac{1}{N-2}\left( Q_{1}+Q_{2}+Q_{3}+Q_{4}\right) , \end{aligned}$$

where the summation $\sum _{j=l_{1}}^{l_{2}}=0$, if $l_{1}>l_{2}$. Since the computations for $Q_{1}$, $Q_{2}$, $Q_{3}$ and $Q_{4}$ are similar, only the steps for $Q_{1}$ are shown here.

$$\begin{aligned} Q_{1}&=\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}\sum _{j=1}^{k-\frac{N}{2}}b_{N}^{2}(i,j)=\sum _{j=1}^{k-\frac{N}{2}}\sum _{\begin{array}{c} i=1 \\ i\ne k \end{array}}^{N}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\left\{ \sum _{i=1}^{j-1}b_{N}^{2}(i,j)+ \sum _{i=j}^{\frac{N}{2}}b_{N}^{2}(i,j)\right. \\&\quad \left. +\sum _{i=\frac{N}{2}+1}^{\frac{N}{2}+j-1}b_{N}^{2}(i,j)+\sum _{i=\frac{N}{2}+j}^{N}b_{N}^{2}(i,j)-b_{N}^{2}(k,j)\right\} =Q_{1}^{(1)}+Q_{1}^{(2)}+Q_{1}^{(3)}+Q_{1}^{(4)}-Q_{1}^{(5)}, \end{aligned}$$

where

$$\begin{aligned} Q_{1}^{(1)}&= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=1}^{j-1}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=1}^{j-1} \left( j-i+\frac{i+(N-k+j)}{N-1}+B_{N}(k)\right) ^{2}, \\ Q_{1}^{(2)}&= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=j}^{\frac{N}{2}}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=j}^{\frac{N}{2}} \left( i-j+\frac{i+(N-k+j)}{N-1}+B_{N}(k)\right) ^{2}, \\ Q_{1}^{(3)}&= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=\frac{N}{2}+1}^{\frac{N}{2}+j-1}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=\frac{N}{2}+1}^{\frac{N}{2}+j-1} \left( i-j+\frac{N-i+(N-k+j)}{N-1}+B_{N}(k) \right) ^{2}, \\ Q_{1}^{(4)}&= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=\frac{N}{2}+j}^{N}b_{N}^{2}(i,j)= \sum _{j=1}^{k-\frac{N}{2}}\sum _{i=\frac{N}{2}+j}^{N} \left( N-i+j+\frac{N-i+(N-k+j)}{N-1}+B_{N}(k)\right) ^{2}, \\ Q_{1}^{(5)}&= \sum _{j=1}^{k-\frac{N}{2}}b_{N}^{2}(k,j)= \sum _{j=1}^{k-\frac{N}{2}} \left( N-k+j+\frac{N-k+(N-k+j)}{N-1}+B_{N}(k)\right) ^{2}, \end{aligned}$$

for $ B_{N}(k)=\frac{c_{N}(k,N)}{\left( N-1\right) ^{2}}-\frac{N}{\left( N-1\right) ^{2}}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] =\frac{4(N-k)-N^{3}}{4\left( N-1\right) ^{2}}$ and $\sum _{i=l_{1}}^{l_{2}}=0$, if $l_{1}>l_{2}$. The calculation of $Q_{1}$ is completed by repeatedly using the formula

$$\begin{aligned} \sum _{i=1}^{n}\left( i-a\right) ^{2}=na^{2}+\frac{n(n+1)(2n+1-6a)}{6} \end{aligned}$$

(25)

for appropriate values of a and n.
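For completeness, formula (25) follows from the standard sums $\sum _{i=1}^{n}i=\frac{n(n+1)}{2}$ and $\sum _{i=1}^{n}i^{2}=\frac{n(n+1)(2n+1)}{6}$:

$$\begin{aligned} \sum _{i=1}^{n}\left( i-a\right) ^{2}&=\sum _{i=1}^{n}i^{2}-2a\sum _{i=1}^{n}i+na^{2} =\frac{n(n+1)(2n+1)}{6}-an(n+1)+na^{2}\\&=na^{2}+\frac{n(n+1)(2n+1-6a)}{6}. \end{aligned}$$

As a quick check, for $n=3$ and $a=2$ the left-hand side is $1+0+1=2$, and the right-hand side is $12+\frac{12\cdot (7-12)}{6}=12-10=2$.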

The quantities $Q_{2}$, $Q_{3}$ and $Q_{4}$ can be decomposed and calculated in the same way as $Q_{1}$. For even $N$, the resulting variance of $\tilde{D}_{N-1}$ is

$$\begin{aligned} {\mathbf {Var}} \left( \tilde{D}_{N-1}\right) = \displaystyle \frac{ \displaystyle 2N^{2} \left( c_{N}\left( k,N\right) \right) ^{2}- N^{3}c_{N}\left( k,N\right) }{2\left( N-2\right) \left( N-1\right) ^{2}} + \frac{N^{2}\left( N^{3}-2N^{2}+10N-12\right) }{48(N-1)^{2}}. \end{aligned}$$

For odd $N$, the variance ${\mathbf {Var}} \left( \tilde{D}_{N-1}\right) $ is obtained in the same way, by splitting it into four double sums and applying formula (25).

From (24) and (2), it follows that

$$\begin{aligned} \displaystyle \max _{1 \le i,j \le N}b_{N}^{2}(i,j) \le \left( \left[ \frac{N}{2}\right] +\frac{\displaystyle \left[ \frac{N}{2}\right] +\left[ \frac{N}{2}\right] }{N-1}+\frac{\displaystyle \left[ \frac{N}{2}\right] }{\left( N-1\right) ^{2}}-\frac{N}{\left( N-1\right) ^{2}}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] \right) ^{2}. \end{aligned}$$

Since $0\le c_{N}(k,N)\le \left[ \frac{N}{2}\right] $, the first term of ${\mathbf {Var}} \left( \tilde{D}_{N-1}\right) $ in Lemma 1 is non-positive and of order $O(N)$. Hence, by (22),

$$\begin{aligned} \displaystyle \frac{1}{N-1}\sum _{i=1}^{N-1}\sum _{j=1}^{N-1}\tilde{b}_{N}^{2}(i,j)=\frac{N-2}{N-1} {\mathbf {Var}} \left( \tilde{D}_{N-1}\right) = \frac{N-2}{N-1}\left( \beta _{N-1}+O\left( N\right) \right) = N^{3}\left( \frac{1}{48}+O\left( \frac{1}{N}\right) \right) , \end{aligned}$$

where $\lim _{N \rightarrow \infty }O\left( \frac{1}{N}\right) =0$. Therefore,

$$\begin{aligned} \lim _{N \rightarrow \infty } \frac{ \max _{1 \le i,j \le N-1}\tilde{b}_{N}^{2}(i,j)}{ \frac{1}{N-1}\sum _{i=1}^{N-1}\sum _{j=1}^{N-1}\tilde{b}_{N}^{2}(i,j)}\le \lim _{N \rightarrow \infty } \frac{N^{2}\left( \frac{1}{16}+O\left( \frac{1}{N}\right) \right) }{N^{3}\left( \frac{1}{48}+O\left( \frac{1}{N}\right) \right) }=0, \end{aligned}$$

i.e. the condition (8) of Theorem 1 is fulfilled and the distribution of $\tilde{D}_{N-1}$ is asymptotically normal. $\square $

Proof of Theorem 3

From (14), (19) and (15), it follows that

$$\begin{aligned} m_{ij}(\theta ,N)=\sum _{\pi (i)=j} \exp \left( \theta d(\pi ,e_{N})-\psi _{N}(\theta )\right) =\frac{(N-1)!\tilde{g}_{N-1}(\theta )}{N!g_{N}(\theta )}=\frac{1}{N}\frac{\tilde{g}_{N-1}(\theta )}{g_{N}(\theta )}, \end{aligned}$$

where $g_{N}(\cdot )$ and $\tilde{g}_{N-1}(\cdot )$ are the moment generating functions of $D_{L}(\pi )$ and $D_{i,j}(\sigma )$, for $\pi \sim Uniform({\mathbf {S}}_{N})$ and $\sigma \sim Uniform({\mathbf {S}}_{i,j})$. Since $D_{i,j}$ depends on i and j only through $c_{N}(i,j)$, the random variables $D_{i,j}$ and $D_{N,k}$ are identically distributed for ${k=N-c_{N}(i,j)}$. From Theorem 2 and Lemma 1, $g_{N}(\cdot )$ and $\tilde{g}_{N-1}(\cdot )$ can be approximated, so

$$\begin{aligned} m_{ij}(\theta ,N) \frac{N}{ \exp \left( \theta \mu + \displaystyle \frac{\theta ^{2}\nu ^{2}}{2}\right) } \xrightarrow [N \rightarrow \infty ]{} 1, \end{aligned}$$

where $\mu ={\mathbf {E}}\left( D_{i,j}\right) -{\mathbf {E}}(D_{L})$ and $\nu ^{2}={\mathbf {Var}}\left( D_{i,j}\right) -{\mathbf {Var}}(D_{L})$.
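To see what this limit says for a concrete $N$, the marginals $m_{ij}(\theta ,N)$ can be computed exactly by enumeration and set beside $\frac{1}{N}\exp \left( \theta \mu +\frac{\theta ^{2}\nu ^{2}}{2}\right) $. The following is a rough sketch rather than the authors' procedure: the function names are hypothetical, the Lee distance is assumed to be $d_{L}(\pi ,e_{N})=\sum _{i}\min (|\pi (i)-i|,N-|\pi (i)-i|)$, and $\mu $, $\nu ^{2}$ are obtained by enumeration instead of the closed forms.

```python
import math
from itertools import permutations
from statistics import mean, pvariance


def lee_distance(pi, n):
    """Assumed d_L(pi, e_N): sum of min(|pi(i) - i|, N - |pi(i) - i|)."""
    return sum(min(abs(p - i), n - abs(p - i)) for i, p in enumerate(pi, start=1))


def exact_marginals(theta, n):
    """m[i-1][j-1] = P(pi(i) = j) when P(pi) is proportional to exp(theta * d_L)."""
    weights = [[0.0] * n for _ in range(n)]
    total = 0.0
    for pi in permutations(range(1, n + 1)):
        w = math.exp(theta * lee_distance(pi, n))
        total += w
        for i, p in enumerate(pi):
            weights[i][p - 1] += w
    return [[w / total for w in row] for row in weights]


def normal_approximation(theta, n, i, j):
    """(1/N) * exp(theta*mu + theta^2*nu^2/2), with mu and nu^2 found by enumeration."""
    perms = list(permutations(range(1, n + 1)))
    all_d = [lee_distance(pi, n) for pi in perms]
    fixed_d = [lee_distance(pi, n) for pi in perms if pi[i - 1] == j]
    mu = mean(fixed_d) - mean(all_d)
    nu2 = pvariance(fixed_d) - pvariance(all_d)
    return math.exp(theta * mu + theta**2 * nu2 / 2) / n
```

For instance, `exact_marginals(-0.1, 6)[0][1]` and `normal_approximation(-0.1, 6, 1, 2)` give the exact and approximate values of $m_{12}$ for $N=6$. The statement above is asymptotic, so close agreement should not be expected for very small $N$.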

According to Lemma 1,

$$\begin{aligned} {\mathbf {E}}\left( D_{i,j}\right)&= \displaystyle \frac{c_{N}(i,j)}{N-1}+\frac{N-2}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] +c_{N}(i,j),\\ {\mathbf {Var}} \left( D_{i,j}\right)&= \displaystyle \frac{ \displaystyle N^{2} \left( c_{N}\left( i,j\right) \right) ^{2}- 2N\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] c_{N}\left( i,j\right) }{\left( N-2\right) \left( N-1\right) ^{2}} + \beta _{N-1}. \end{aligned}$$

The values of $\mu $ and $\nu ^{2}$ are obtained by combining the results above with formulas (10) and (12). $\square $
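As an illustration of this last step, the mean shift $\mu $ can be written out explicitly. The computation below assumes that formula (10) states ${\mathbf {E}}(D_{L})=\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] $, as the step labelled (10) in the proof of Lemma 1 suggests; formula (12) for ${\mathbf {Var}}(D_{L})$ is not reproduced in this appendix, so $\nu ^{2}$ is not expanded here.

$$\begin{aligned} \mu&={\mathbf {E}}\left( D_{i,j}\right) -{\mathbf {E}}(D_{L}) =\frac{c_{N}(i,j)}{N-1}+\frac{N-2}{N-1}\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] +c_{N}(i,j)-\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] \\&=\frac{1}{N-1}\left( N c_{N}(i,j)-\left[ \frac{N+1}{2}\right] \left[ \frac{N}{2}\right] \right) . \end{aligned}$$

For example, with $N=4$ and $i=j$, so that $c_{N}(i,j)=0$, this gives $\mu =-\frac{4}{3}$.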


Nikolov, N.I., Stoimenova, E. Asymptotic properties of Lee distance. Metrika 82, 385–408 (2019). https://doi.org/10.1007/s00184-018-0687-7

Received: 29 March 2018
Published: 29 September 2018
Issue Date: 04 April 2019
DOI: https://doi.org/10.1007/s00184-018-0687-7
