New Approaches to Principal Component Analysis for Trees

Aydın, Burcu; Pataki, Gábor; Wang, Haonan; Ladha, Alim; Bullitt, Elizabeth; Marron, J. S.

doi:10.1007/s12561-012-9055-8

Published: 08 February 2012

Volume 4, pages 132–156, (2012)

Statistics in Biosciences

Abstract

Object Oriented Data Analysis is a new area in statistics that studies populations of general data objects. In this article we consider populations of tree-structured objects as our focus of interest. We develop improved analysis tools for data lying in a binary tree space analogous to classical Principal Component Analysis methods in Euclidean space. Our extensions of PCA are analogs of one dimensional subspaces that best fit the data. Previous work was based on the notion of tree-lines.

In this paper, a generalization of the previous tree-line notion is proposed: k-tree-lines. Previously proposed tree-lines are k-tree-lines where k=1. New sub-cases of k-tree-lines studied in this work are the 2-tree-lines and tree-curves, which explain much more variation per principal component than tree-lines. The optimal principal component tree-lines were computable in linear time. Because 2-tree-lines and tree-curves are more complex, they are computationally more expensive, but yield improved data analysis results.

We provide a comparative study of all these methods on a motivating data set consisting of brain vessel structures of 98 subjects.

Notes

For k=2, a simple counter-example in which the projection is not unique can be constructed. By definition, the set of k₁-tree-lines includes the set of k₂-tree-lines if k₁≥k₂. Therefore the non-uniqueness trivially extends to all k>1.

Acknowledgements

During this research, Burcu Aydın was partially supported by NSF grants DMS-0606577 and DMS-0854908, and NIH Grant RFA-ES-04-008. Haonan Wang was partially supported by NSF grants DMS-0706761 and DMS-0854903. Alim Ladha and Elizabeth Bullitt were partially supported by NIH grants R01EB000219-NIH-NIBIB and R01 CA124608-NIH-NCI. J.S. Marron was partially supported by NSF grants DMS-0606577 and DMS-0854908, and NIH Grant RFA-ES-04-008.

Author information

Authors and Affiliations

HP Laboratories, 1501 Page Mill Rd MS 1140, Palo Alto, CA, 94304, USA
Burcu Aydın
UNC at Chapel Hill, 307 Hanes Hall CB 3260, Chapel Hill, NC, 27599, USA
Gábor Pataki
Colorado State University, 216 Statistics Building, Fort Collins, CO, 80523, USA
Haonan Wang
UNC at Chapel Hill, 2160 Bioinformatics #7060, Chapel Hill, NC, 27599, USA
Alim Ladha
UNC at Chapel Hill, CB 7062 Department of Neurosurgery UNC-CH, Chapel Hill, NC, 27599, USA
Elizabeth Bullitt
UNC at Chapel Hill, 352 Hanes Hall CB 3260, Chapel Hill, NC, 27599, USA
J. S. Marron

Corresponding author

Correspondence to Burcu Aydın.

Appendix

6.1 Proof of Lemma 1

The approach taken here is to count all possible 2-tree-lines on a given data set. A polynomial bound on this number will suffice to conclude that we have a problem that can be solved in polynomial time, as the process of calculating the total distance of a given 2-tree-line to the points in the data set is a linear-time process.

For a given data set, the number of possible k-tree-lines depends on the size of its support tree only, and not on the number of data trees in it. In this section, it will be assumed that the support tree is a full tree, i.e. all levels of the support tree include all the nodes on those levels. Another simplification is that we will assume the starting tree for the 2-tree-lines considered is the root node. This approach will give an upper bound on the 2-tree-line count, since arranging the same number of nodes in a full tree and starting from the root node would give the highest number of possible 2-tree-lines. These two assumptions will enable us to disregard the structure of an arbitrary starting tree and the support tree in finding an upper bound that depends on the node count only.

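Let f₁(n) be the number of 2-tree-lines that end at level n with a single node, f₂(n) the number of 2-tree-lines that end at level n with two nodes, and f(n) the total number of 2-tree-lines that end at level n.

We know that: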

$$f(n) = f_1(n)+f_2(n), \quad \forall n \geq0.$$

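We will write a recursive formula for f(n). If we consider the most trivial case, where our tree is only the root node, the only 2-tree-line is the root itself, which ends with a single node. This gives the initial condition for the recursion:

$$f_1(1) = 1, \quad f_2(1) = 0.$$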

To get the recursive formula, assume that we know the values of f₁(n) and f₂(n), and we are looking for f₁(n+1) and f₂(n+1). First let us count the 2-tree-lines that end at the (n+1)st level with a single node. This single node can be either one of the two children of a 2-tree-line ending at level n with a single node, or one of the four children of a 2-tree-line ending at level n with two nodes. Therefore:

$$f_1(n+1)=2f_1(n)+4f_2(n).$$

For f₂(n+1), first consider f₁(n). These lines end with a single node at the nth level. That node has two children, and both of them need to be added. Since the order of addition matters, each such line gives us two options for extension. For f₂(n), we need to choose two nodes out of the four children of the nth-level nodes. However, not all of the 2-combinations of these are available. Let us name the nodes on the nth level a and b, b being the last added node, and name their children a₁, a₂ and b₁, b₂, respectively. The possible choices for addition are (a₁,a₂), (a₂,a₁), (b₁,a₁), (b₁,a₂), (b₂,a₁), (b₂,a₂). Summing all the choices up, we get

$$f_2(n+1)=2f_1(n)+6f_2(n).$$

These two formulas are valid for all n ≥ 1. Now let us write them in matrix form:

$$\left[\begin{array}{c} f_1(n+1) \\ f_2(n+1)\end{array}\right] = \left[\begin{array}{c@{\quad}c}2 & 4 \\2 & 6\end{array}\right] \left[\begin{array}{c} f_1(n) \\ f_2(n)\end{array}\right].$$

Using this formula, we can write

$$\left[\begin{array}{c} f_1(n+1) \\ f_2(n+1)\end{array}\right] = {\left[\begin{array}{c@{\quad}c}2 & 4 \\2 & 6\end{array}\right]}^{n}\left[\begin{array}{c} f_1(1) \\ f_2(1)\end{array}\right].$$
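The recursion above can be iterated directly. The following is a minimal Python sketch, not the authors' code. The function name `two_tree_line_counts` is hypothetical, and the starting vector (f₁(1), f₂(1)) = (1, 0) is an assumption corresponding to a support tree that consists of the root node only.

```python
def two_tree_line_counts(levels):
    """Return (f1, f2) for the deepest level of a full tree with `levels` levels.

    f1 counts 2-tree-lines ending with a single node, f2 those ending
    with two nodes, following the recursion
        f1(n+1) = 2*f1(n) + 4*f2(n)
        f2(n+1) = 2*f1(n) + 6*f2(n)
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    f1, f2 = 1, 0  # assumed initial condition: the tree is only the root
    for _ in range(levels - 1):
        # Tuple assignment evaluates both right-hand sides with the old values.
        f1, f2 = 2 * f1 + 4 * f2, 2 * f1 + 6 * f2
    return f1, f2


if __name__ == "__main__":
    for n in range(1, 6):
        f1, f2 = two_tree_line_counts(n)
        print(n, f1, f2, f1 + f2)
```

Applying the update `levels - 1` times is the same as multiplying the initial vector by the (levels − 1)th power of the coefficient matrix.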

To further simplify this, we can re-write the coefficient matrix using spectral decomposition:

$$\left[\begin{array}{c@{\quad}c} 2 & 4 \\ 2 & 6\end{array}\right] = {\left[\begin{array}{c@{\quad}c}\sqrt{3}-1 & 1 \\1& \frac{1-\sqrt{3}}{2}\end{array}\right]} {\left[\begin{array}{c@{\quad}c} \lambda_1 & 0 \\ 0 & \lambda_2\end{array}\right] } {\left[\begin{array}{c@{\quad}c}\sqrt{3}-1 & 1 \\1& \frac{1-\sqrt{3}}{2}\end{array}\right]}^{-1}$$

where $\lambda_{1} = 4+2\sqrt{3}$ and $\lambda_{2} = 4-2\sqrt{3}$ are the eigenvalues of the coefficient matrix. Now we can easily get the nth power of this matrix:

$${\left[\begin{array}{c@{\quad}c} 2 & 4 \\ 2 & 6\end{array}\right]}^{n} = {\left[\begin{array}{c@{\quad}c}\sqrt{3}-1 & 1 \\1& \frac{1-\sqrt{3}}{2}\end{array}\right]} {\left[\begin{array}{c@{\quad}c} \lambda_1 & 0 \\ 0 & \lambda_2\end{array}\right] }^{n} {\left[\begin{array}{c@{\quad}c}\sqrt{3}-1 & 1 \\1& \frac{1-\sqrt{3}}{2}\end{array}\right]}^{-1}.$$
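The decomposition can be checked numerically. The sketch below uses only the Python standard library and compares the nth power obtained from the eigenvalues with the one obtained by repeated multiplication. All helper names (`matmul`, `inverse`, `power_via_spectrum`, `power_by_repeated_product`) are hypothetical and introduced only for illustration.

```python
import math

SQRT3 = math.sqrt(3.0)
LAMBDA_1 = 4 + 2 * SQRT3
LAMBDA_2 = 4 - 2 * SQRT3
M = [[2.0, 4.0], [2.0, 6.0]]
# Columns of P are the eigenvectors used in the decomposition above.
P = [[SQRT3 - 1, 1.0], [1.0, (1 - SQRT3) / 2]]


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse(a):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def power_via_spectrum(n):
    """Compute M**n as P * diag(lambda_1**n, lambda_2**n) * P^{-1}."""
    d_n = [[LAMBDA_1 ** n, 0.0], [0.0, LAMBDA_2 ** n]]
    return matmul(matmul(P, d_n), inverse(P))


def power_by_repeated_product(n):
    """Compute M**n by multiplying M into the identity n times."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        result = matmul(result, M)
    return result


if __name__ == "__main__":
    print(power_via_spectrum(3))
    print(power_by_repeated_product(3))
```

The two results agree up to floating-point rounding, which is what the decomposition predicts.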

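Insert this into the formula for f(n+1) along with the initial conditions f₁(1)=1 and f₂(1)=0, and carry out the necessary simplifications:

$$f_1(n+1) = \frac{3-\sqrt{3}}{6}\lambda_1^n + \frac{3+\sqrt{3}}{6}\lambda_2^n, \qquad f_2(n+1) = \frac{\sqrt{3}}{6}\bigl(\lambda_1^n - \lambda_2^n\bigr).$$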

Summing these up, we get the desired quantity:

$$f(n+1) = f_1(n+1)+f_2(n+1) = \frac{(\lambda_1)^n + (\lambda_2)^n}{2}.$$
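As a sanity check, the closed form can be compared with the recursion it was derived from. This is a small Python sketch under the assumption that the recursion starts from (f₁(1), f₂(1)) = (1, 0), the root-only tree. The function names are hypothetical.

```python
import math

LAMBDA_1 = 4 + 2 * math.sqrt(3)
LAMBDA_2 = 4 - 2 * math.sqrt(3)


def total_by_recursion(levels):
    """f(levels) = f1 + f2, obtained by iterating the two-term recursion."""
    f1, f2 = 1, 0  # assumed initial condition for the root-only tree
    for _ in range(levels - 1):
        f1, f2 = 2 * f1 + 4 * f2, 2 * f1 + 6 * f2
    return f1 + f2


def total_by_closed_form(levels):
    """f(n+1) = (lambda_1**n + lambda_2**n) / 2 with n = levels - 1."""
    n = levels - 1
    return (LAMBDA_1 ** n + LAMBDA_2 ** n) / 2


if __name__ == "__main__":
    for levels in range(1, 8):
        print(levels, total_by_recursion(levels),
              round(total_by_closed_form(levels)))
```

The closed form is evaluated in floating point, so it is rounded before comparison. For much deeper trees an exact integer computation would be the safer choice.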

We know that a full support tree with n levels has ∼2ⁿ nodes. If we call the total number of nodes in the support tree m, we have n=log₂(m). So for a problem with full support tree size m, the total number of 2-tree-lines is:

$$\frac{\lambda_1^{(\log_2m) - 1} + \lambda_2^{(\log_2m) - 1}}{2}.$$

So the order of the problem of finding all 2-tree-lines is:

$$O\biggl(\frac{1}{2\lambda_1}m^{\log_2\lambda_1}\biggr) = O\bigl (m^{2.9}\bigr).$$
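The exponent 2.9 is log₂(λ₁) rounded to one decimal place. The short Python sketch below evaluates it and compares the exact count with the leading term of the bound. It takes m = 2ⁿ, as in the approximation above, and the function names are hypothetical.

```python
import math

LAMBDA_1 = 4 + 2 * math.sqrt(3)
LAMBDA_2 = 4 - 2 * math.sqrt(3)
EXPONENT = math.log2(LAMBDA_1)  # the exponent of m in the bound


def count_for_levels(n):
    """Number of 2-tree-lines for a full support tree with n levels."""
    return (LAMBDA_1 ** (n - 1) + LAMBDA_2 ** (n - 1)) / 2


def leading_term(m):
    """Leading term m**log2(lambda_1) / (2 * lambda_1) of the bound."""
    return m ** EXPONENT / (2 * LAMBDA_1)


if __name__ == "__main__":
    print(round(EXPONENT, 4))
    for n in (5, 10, 20):
        m = 2 ** n  # approximate node count of a full tree with n levels
        print(n, count_for_levels(n) / leading_term(m))
```

The ratio approaches 1 as n grows, because the λ₂ term is negligible next to the λ₁ term.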

6.2 Proof of Theorem 1

Lemma 1 already establishes the order for the total count of 2-tree-lines. To prove Theorem 1, we also need the maximum length of these 2-tree-lines.

The full support tree with m nodes has depth log₂(m+1). It is easy to see that the 2-tree-line with the maximum number of nodes that can be defined on this support tree has 1+2 log₂(m+1) nodes. We obtain this number by observing that a 2-tree-line starting from the root can contain at most two nodes from each level, except the root level. Therefore, the maximum number of nodes contained in a 2-tree-line is O(log m). Combining this with Lemma 1, we see that the total number of nodes contained in the list of all 2-tree-lines is O(m^2.9 log m). The final step is to show that the brute-force method needs to account for every node on the 2-tree-line list only once to form the list. This step is rather trivial, so it will not be elaborated here.

6.3 Proof of Proposition 1

The projection of a data point onto an object is, naturally, a point on that object. In our case, this implies that the projection of a data point t_i onto a 2-tree-line K, P_K(t_i), is a tree that is contained in Pa(K). Therefore we can write:

(1)

Let K^∗ be any maximal 2-tree-line that can be extended from K. Naturally, K^∗⊇K, and

$$ \sum_{v \in \mathit{MP}(K)}{w(v)} \geq\sum _{v \in \mathit{Pa}(K^*)}{w(v)}.$$

(2)

Now, using (1) and (2), we can show

This proves that any maximal 2-tree-line extending from K will have a worse objective function value than $\sum_{t_{i} \in T}{|t_{i}|} -\sum_{v \in \mathit{MP}(K)}{w(v)}$, and therefore $\sum_{v \in \mathit{MP}(K)}{w(v)}$ provides a lower bound.

About this article

Cite this article

Aydın, B., Pataki, G., Wang, H. et al. New Approaches to Principal Component Analysis for Trees. Stat Biosci 4, 132–156 (2012). https://doi.org/10.1007/s12561-012-9055-8

Received: 27 May 2011
Accepted: 19 January 2012
Published: 08 February 2012
Issue Date: May 2012
DOI: https://doi.org/10.1007/s12561-012-9055-8
