New Approaches to Principal Component Analysis for Trees

Statistics in Biosciences

Abstract

Object Oriented Data Analysis is a new area in statistics that studies populations of general data objects. In this article we consider populations of tree-structured objects as our focus of interest. We develop improved analysis tools for data lying in a binary tree space analogous to classical Principal Component Analysis methods in Euclidean space. Our extensions of PCA are analogs of one dimensional subspaces that best fit the data. Previous work was based on the notion of tree-lines.

In this paper, a generalization of the previous tree-line notion is proposed: k-tree-lines. Previously proposed tree-lines are k-tree-lines where k=1. New sub-cases of k-tree-lines studied in this work are the 2-tree-lines and tree-curves, which explain much more variation per principal component than tree-lines. The optimal principal component tree-lines were computable in linear time. Because 2-tree-lines and tree-curves are more complex, they are computationally more expensive, but yield improved data analysis results.

We provide a comparative study of all these methods on a motivating data set consisting of brain vessel structures of 98 subjects.

Notes

  1. For k=2, a simple counter-example where projection is not unique can be constructed. By definition, the set of \(k_1\)-tree-lines includes the set of \(k_2\)-tree-lines if \(k_1 \geq k_2\). Therefore the non-uniqueness trivially extends to all k>1.

Acknowledgements

During this research, Burcu Aydın was partially supported by NSF grants DMS-0606577 and DMS-0854908, and NIH Grant RFA-ES-04-008. Haonan Wang was partially supported by NSF grants DMS-0706761 and DMS-0854903. Alim Ladha and Elizabeth Bullitt were partially supported by NIH grants R01EB000219-NIH-NIBIB and R01 CA124608-NIH-NCI. J.S. Marron was partially supported by NSF grants DMS-0606577 and DMS-0854908, and NIH Grant RFA-ES-04-008.

Author information

Corresponding author

Correspondence to Burcu Aydın.

Appendix

6.1 Proof of Lemma 1

The approach taken here is to count all possible 2-tree-lines on a given data set. A polynomial bound on this number will suffice to conclude that we have a problem that can be solved in polynomial time, as the process of calculating the total distance of a given 2-tree-line to the points in the data set is a linear-time process.

For a given data set, the number of possible k-tree-lines depends on the size of its support tree only, and not on the number of data trees in it. In this section, it will be assumed that the support tree is a full tree, i.e. all levels of the support tree include all the nodes on those levels. Another simplification is that we will assume the starting tree for the 2-tree-lines considered is the root node. This approach will give an upper bound on the 2-tree-line count, since arranging the same number of nodes in a full tree and starting from the root node would give the highest number of possible 2-tree-lines. These two assumptions will enable us to disregard the structure of an arbitrary starting tree and the support tree in finding an upper bound that depends on the node count only.

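Let:

$$\begin{aligned} f(n) &= \text{the number of 2-tree-lines (starting at the root) that end at the } n\text{th level}, \\ f_1(n) &= \text{the number of those that end at the } n\text{th level with a single node}, \\ f_2(n) &= \text{the number of those that end at the } n\text{th level with two nodes}. \end{aligned}$$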

We know that:

$$f(n) = f_1(n)+f_2(n), \quad \forall n \geq0.$$

We will write a recursive formula for f(n). If we consider the most trivial case where our tree is only the root node, we obtain the initial condition for the recursion:

$$f_1(1) = 1, \quad f_2(1) = 0.$$

To get the recursive formula, assume that we know the values of \(f_1(n)\) and \(f_2(n)\), and we are looking for \(f_1(n+1)\) and \(f_2(n+1)\). First let us count the 2-tree-lines that end at the (n+1)st level with a single node. This single node can be either one of the two children of a 2-tree-line ending at level n with a single node, or it can be one of the four children of a 2-tree-line ending at level n with two nodes. Therefore:

$$f_1(n+1)=2f_1(n)+4f_2(n).$$

For \(f_2(n+1)\), first consider \(f_1(n)\). These lines end with a single node at the nth level, which has two children, both of which need to be added. Since the order of addition matters, each such line gives two options for extension. For \(f_2(n)\), we need to choose two nodes out of the four children of the nth-level nodes. However, not all ordered pairs of these are available. Name the nodes on the nth level a and b, b being the last added node, and name their children \(a_1, a_2\) and \(b_1, b_2\), respectively. The possible choices for addition are \((a_1,a_2)\), \((a_2,a_1)\), \((b_1,a_1)\), \((b_1,a_2)\), \((b_2,a_1)\), \((b_2,a_2)\). Summing all the choices up, we get

$$f_2(n+1)=2f_1(n)+6f_2(n).$$

These two formulas are valid for all n ≥ 1. Now let us write these two in matrix form:

$$\left[\begin{array}{c} f_1(n+1) \\ f_2(n+1)\end{array}\right] = \left[\begin{array}{c@{\quad}c}2 & 4 \\2 & 6\end{array}\right] \left[\begin{array}{c} f_1(n) \\ f_2(n)\end{array}\right].$$

Using this formula, we can write

$$\left[\begin{array}{c} f_1(n+1) \\ f_2(n+1)\end{array}\right] = {\left[\begin{array}{c@{\quad}c}2 & 4 \\2 & 6\end{array}\right]}^{n}\left[\begin{array}{c} f_1(1) \\ f_2(1)\end{array}\right].$$
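As a quick sanity check on the recursion, the two update rules can be iterated directly. The following is a minimal Python sketch, not part of the original proof. It assumes that the root is level 1 and that the starting values are \(f_1(1)=1\), \(f_2(1)=0\); these values are an assumption chosen to agree with the closed form for f(n+1) obtained in this proof, and the function name `count_two_tree_lines` is purely illustrative.

```python
def count_two_tree_lines(levels):
    """Return (f1, f2) after iterating the recursion up to the given level.

    Assumes f1(1) = 1 and f2(1) = 0 for the root-only tree (an assumption
    of this sketch, chosen to match the closed form in the proof).
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    f1, f2 = 1, 0
    for _ in range(levels - 1):
        # f1(n+1) = 2 f1(n) + 4 f2(n);  f2(n+1) = 2 f1(n) + 6 f2(n)
        f1, f2 = 2 * f1 + 4 * f2, 2 * f1 + 6 * f2
    return f1, f2


# Print level, f1, f2 and their sum f for the first few levels.
for level in range(1, 6):
    f1, f2 = count_two_tree_lines(level)
    print(level, f1, f2, f1 + f2)
```

Integer arithmetic keeps the counts exact, which makes the rapid growth of f(n) with the number of levels easy to see.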

To further simplify this, we can re-write the coefficient matrix using spectral decomposition:

$$\left[\begin{array}{c@{\quad}c} 2 & 4 \\ 2 & 6\end{array}\right] = {\left[\begin{array}{c@{\quad}c}\sqrt{3}-1 & 1 \\1& \frac{1-\sqrt{3}}{2}\end{array}\right]} {\left[\begin{array}{c@{\quad}c} \lambda_1 & 0 \\ 0 & \lambda_2\end{array}\right] } {\left[\begin{array}{c@{\quad}c}\sqrt{3}-1 & 1 \\1& \frac{1-\sqrt{3}}{2}\end{array}\right]}^{-1}$$

where \(\lambda_{1} = 4+2\sqrt{3}\) and \(\lambda_{2} = 4-2\sqrt{3}\) are the eigenvalues of the coefficient matrix. Now the nth power of this matrix follows easily:

$${\left[\begin{array}{c@{\quad}c} 2 & 4 \\ 2 & 6\end{array}\right]}^{n} = {\left[\begin{array}{c@{\quad}c}\sqrt{3}-1 & 1 \\1& \frac{1-\sqrt{3}}{2}\end{array}\right]} {\left[\begin{array}{c@{\quad}c} \lambda_1 & 0 \\ 0 & \lambda_2\end{array}\right] }^{n} {\left[\begin{array}{c@{\quad}c}\sqrt{3}-1 & 1 \\1& \frac{1-\sqrt{3}}{2}\end{array}\right]}^{-1}.$$
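The decomposition can be checked numerically. The sketch below uses only the Python standard library and verifies that each column of the eigenvector matrix is mapped by the coefficient matrix to the corresponding multiple of itself. The helper names `matvec` and `is_eigenpair` and the tolerance are illustrative choices, not part of the original text.

```python
import math

A = [[2.0, 4.0], [2.0, 6.0]]
sqrt3 = math.sqrt(3.0)
lam1, lam2 = 4 + 2 * sqrt3, 4 - 2 * sqrt3

# Columns of the eigenvector matrix used in the decomposition.
v1 = [sqrt3 - 1, 1.0]
v2 = [1.0, (1 - sqrt3) / 2]


def matvec(M, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]


def is_eigenpair(M, lam, v, tol=1e-12):
    """Check M v == lam v componentwise, up to a small tolerance."""
    Mv = matvec(M, v)
    return all(abs(Mv[i] - lam * v[i]) < tol for i in range(2))


print(is_eigenpair(A, lam1, v1), is_eigenpair(A, lam2, v2))
# Trace and determinant give a second, independent check.
print(math.isclose(lam1 + lam2, 8.0), math.isclose(lam1 * lam2, 4.0))
```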

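Insert this into the f(n+1) formula along with the initial conditions, and do the necessary simplifications:

$$f_1(n+1) = \frac{(3-\sqrt{3})\,\lambda_1^n + (3+\sqrt{3})\,\lambda_2^n}{6}, \qquad f_2(n+1) = \frac{\sqrt{3}\,\bigl(\lambda_1^n - \lambda_2^n\bigr)}{6}.$$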

Summing these up, we get the desired quantity:

$$f(n+1) = f_1(n+1)+f_2(n+1) = \frac{(\lambda_1)^n + (\lambda_2)^n}{2}.$$
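The closed form can be compared against the recursion without any floating-point error. Both eigenvalues are roots of \(x^2 - 8x + 4 = 0\), so the power sum \(s_n = \lambda_1^n + \lambda_2^n\) satisfies \(s_n = 8s_{n-1} - 4s_{n-2}\) with \(s_0 = 2\) and \(s_1 = 8\). The sketch below relies on that identity; the function names are illustrative, and the starting values \(f_1(1)=1\), \(f_2(1)=0\) are an assumption consistent with the closed form.

```python
def power_sum(n):
    """Return lambda1**n + lambda2**n exactly, for lambda = 4 +/- 2*sqrt(3).

    Both values are roots of x**2 - 8x + 4 = 0, so
    s_n = 8 * s_{n-1} - 4 * s_{n-2}, with s_0 = 2 and s_1 = 8.
    """
    s_prev, s_curr = 2, 8
    if n == 0:
        return s_prev
    for _ in range(n - 1):
        s_prev, s_curr = s_curr, 8 * s_curr - 4 * s_prev
    return s_curr


def closed_form(n):
    """f(n+1) = (lambda1**n + lambda2**n) / 2 as an exact integer."""
    return power_sum(n) // 2  # s_n is always even, so this is exact


def by_recursion(n):
    """f(n+1) from the update rules, starting at f1(1)=1, f2(1)=0 (assumed)."""
    f1, f2 = 1, 0
    for _ in range(n):
        f1, f2 = 2 * f1 + 4 * f2, 2 * f1 + 6 * f2
    return f1 + f2


# The two computations agree for every n tried here.
print(all(closed_form(n) == by_recursion(n) for n in range(20)))
```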

We know that a full support tree with n levels has \(\sim 2^n\) nodes. If we call the total number of nodes in the support tree m, we have \(n = \log_2(m)\). So for a problem with full support tree size m, the total number of 2-tree-lines is:

$$\frac{\lambda_1^{(\log_2m) - 1} + \lambda_2^{(\log_2m) - 1}}{2}.$$

So the order of the problem of finding all 2-tree-lines is:

$$O\biggl(\frac{1}{2\lambda_1}m^{\log_2\lambda_1}\biggr) = O\bigl (m^{2.9}\bigr).$$
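The exponent 2.9 comes from the identity \(\lambda_1^{\log_2 m} = m^{\log_2 \lambda_1}\) with \(\log_2 \lambda_1 \approx 2.9\). The short Python sketch below evaluates the exponent and confirms numerically that the two ways of writing the leading term agree. The function names and the sample values of m are illustrative only, and the \(\lambda_2\) term is ignored because \(\lambda_2 < 1\).

```python
import math

lam1 = 4 + 2 * math.sqrt(3)
exponent = math.log2(lam1)  # roughly 2.9


def leading_term(m):
    """Leading term lambda1**(log2(m) - 1) / 2 of the 2-tree-line count."""
    return lam1 ** (math.log2(m) - 1) / 2


def as_power_of_m(m):
    """The same quantity rewritten as m**log2(lambda1) / (2 * lambda1)."""
    return m ** exponent / (2 * lam1)


print(round(exponent, 2))
for m in (2**4, 2**8, 2**12):
    # The two columns should agree up to floating-point rounding.
    print(m, leading_term(m), as_power_of_m(m))
```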

6.2 Proof of Theorem 1

Lemma 1 already establishes the order for the total count of 2-tree-lines. To prove Theorem 1, we also need the maximum length of these 2-tree-lines.

The full support tree with m nodes has depth \(\log_2(m+1)\). It is easy to see that a 2-tree-line with the maximum number of nodes that can be defined on this support tree has at most \(1 + 2\log_2(m+1)\) nodes. We obtain this number from the observation that a 2-tree-line starting from the root can contain at most two nodes from each level, except the root level. Therefore, the maximum number of nodes contained in each 2-tree-line is \(O(\log m)\). Combining this with Lemma 1, we see that the total number of nodes contained in the list of all 2-tree-lines is \(O(m^{2.9}\log m)\). The final step is to show that the brute-force method needs to account for every node on the 2-tree-line list only once to form the list. This step is rather trivial, so it will not be elaborated here.

6.3 Proof of Proposition 1

The projection of a data point onto an object is, naturally, a point on that object. In our case, this implies that the projection of a data point \(t_i\) onto a 2-tree-line K, \(P_K(t_i)\), is a tree that is contained in \(\mathit{Pa}(K)\). Therefore we can write:

(1)

Let \(K^*\) be any maximal 2-tree-line that can be extended from K. Naturally, \(K \subseteq K^*\), and

$$ \sum_{v \in \mathit{MP}(K)}{w(v)} \geq\sum _{v \in \mathit{Pa}(K^*)}{w(v)}.$$
(2)

Now, using (1) and (2), we can show

This proves that any maximal 2-tree-line extending from K will have an objective function value no better than \(\sum_{t_{i} \in T}{|t_{i}|} -\sum_{v \in \mathit{MP}(K)}{w(v)}\), and therefore \(\sum_{v \in \mathit{MP}(K)}{w(v)}\) provides a lower bound.

About this article

Cite this article

Aydın, B., Pataki, G., Wang, H. et al. New Approaches to Principal Component Analysis for Trees. Stat Biosci 4, 132–156 (2012). https://doi.org/10.1007/s12561-012-9055-8

  • DOI: https://doi.org/10.1007/s12561-012-9055-8
