
Inference of a Dyadic Measure and Its Simplicial Geometry from Binary Feature Data and Application to Data Quality


Part of the book series: Association for Women in Mathematics Series ((AWMS,volume 17))

Abstract

We propose a new method for representing data sets with an ordered set of binary features which summarizes both measure-theoretic and topological properties. The method does not require any assumption of metric space properties for the data. A data set with an ordered set of binary features is viewed as a dyadic set with a dyadic measure. We prove that dyadic sets with dyadic measures have a canonical set of binary features and determine canonical nerve simplicial complexes. The method computes two related representations: multiscale parameters for the dyadic measure and the Betti numbers of the simplicial complex. It exploits the dyadic product formula representation formulated in previous work. The parameters characterize the relative skewness of the measure at dyadic scales and localities. The more abstract Betti number statistics summarize the simplicial geometry of the support of the measure. We prove that they provide a simple privacy property. Our methods are compared with other results for measures on sets with tree structures, recent multi-resolution theory, and computational topology. We illustrate the method on a data quality data set and propose future research directions.

Notes

  1. A dyadic set is a set X together with a collection of subsets of X organized as an ordered binary tree: the root set is X, and each parent set is the disjoint union of its left and right child subsets. By a dyadic measure on a dyadic set we mean a measure on the sigma algebra generated by this collection of subsets. The measure is additive in the sense that the sum of the measures of the left and right child sets is the measure of the parent [2, 15].

  2. A nerve simplicial complex is a collection of sets, all of whose non-empty subsets obtained by intersection are contained in the collection [11].
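
As an illustration of Note 1, the following Python sketch builds a dyadic measure from rows of ordered binary features by counting, for every binary prefix, how many rows begin with that prefix. It is a minimal sketch and not the chapter's implementation: the function names `dyadic_measure` and `is_additive`, the encoding of tree nodes as bit-string prefixes, the use of raw counts as the measure, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter


def dyadic_measure(rows):
    """Map each binary prefix to the number of rows that start with it.

    The empty prefix "" plays the role of the root set X; prefix + "0" and
    prefix + "1" are its left and right child sets (an assumed encoding).
    """
    counts = Counter()
    for row in rows:
        bits = "".join(str(b) for b in row)
        for depth in range(len(bits) + 1):
            counts[bits[:depth]] += 1
    return counts


def is_additive(counts, depth):
    """Check measure(parent) == measure(left) + measure(right) above `depth`."""
    return all(
        counts[p] == counts.get(p + "0", 0) + counts.get(p + "1", 0)
        for p in counts
        if len(p) < depth
    )


# Made-up sample: four data items, three ordered binary features each.
rows = [(0, 1, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1)]
mu = dyadic_measure(rows)
print(mu[""], mu["0"], mu["1"], is_additive(mu, depth=3))
```

Encoding nodes as prefixes keeps the ordered binary tree implicit, so only the nodes that actually carry mass are stored.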

References

  1. L. Ahlfors, Lectures on Quasi-Conformal Mappings, vol. 10 (van Nostrand Mathematical Studies, Princeton, 1966)


  2. D. Bassu, P.W. Jones, L. Ness, D. Shallcross, Product Formalisms for Measures on Spaces with Binary Tree Structures: Representation, Visualization and Multiscale Noise, submitted to SIGMA Forum of Maths (under revision) (2016). https://arxiv.org/abs/1601.02946

  3. A. Beurling, L. Ahlfors, The boundary correspondence under quasi-conformal mappings. Acta Math. 96, 125–142 (1956)


  4. L. Billera, S. Holmes, K. Vogtmann, Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27, 733–767 (2001)


  5. C. Dwork, A. Roth, The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 211–401 (2014)


  6. H. Edelsbrunner, J. Harer, Persistent homology—a survey. Contemp. Math. 453, 257–282 (2008)


  7. B. Fasy, F. Lecci, A. Rinaldo, L. Wasserman, S. Balakrishnan, A. Singh, Confidence sets for persistence diagrams. Ann. Stat. 42, 2301–2339 (2014)


  8. R. Fefferman, C. Kenig, J. Pipher, The theory of weights and the Dirichlet problem for elliptic equations. Ann. Math. 134, 65–124 (1991)


  9. M. Gavish, B. Nadler, R. Coifman, Multiscale wavelets on trees, graphs and high dimensional data: theory and applications to semi supervised learning, in Proceedings of the 27th International Conference on Machine Learning (Omnipress, Madison, 2010), pp. 367–374


  10. S. Harker, K. Mischaikow, M. Mrozek, V. Nanda, Discrete Morse theoretic algorithms for computing homology of complexes and maps. Found. Comput. Math. 14, 151–184 (2014)


  11. T. Kaczynski, K. Mischaikow, M. Mrozek, Computational Homology. Applied Mathematical Sciences, vol. 157 (Springer, New York, 2004)


  12. J.-P. Kahane, Sur le chaos multiplicatif. Ann. Sci. Math. Québec 9, 105–150 (1985)


  13. E. Kolaczyk, R. Nowak, Multiscale likelihood analysis and complexity penalized estimation. Ann. Stat. 32, 500–527 (2004)


  14. X. Meng, A trio of inference problems that could win you a Nobel Prize in statistics (if you help fund it), in Past, Present, Future Stat. Sci. (CRC Press, Boca Raton, 2014), pp. 537–562


  15. L. Ness, Dyadic product formula representations of confidence measures and decision rules for dyadic data set samples, in MISNC SI DS 201 (ACM, New York, 2016)


  16. R. Rhodes, V. Vargas, Gaussian multiplicative chaos and applications: a review. Probab. Surv. 11, 315–392 (2014)


  17. K. Turner, S. Mukherjee, D. Boyer, Persistent homology transform for modeling shapes and surfaces. Inf. Inference 3, 310–344 (2014)


Acknowledgements

The author gratefully acknowledges the CCICADA Center at Rutgers and the CCICADA Data Quality Team for providing the raw data quality statistics and thanks Christie Nelson for explaining them. The author also gratefully acknowledges use of the open source Computational Homology Project software (CHomP) [10] and thanks Shaun Harker for assistance with the installation and use of the software. This work was partially enabled by DIMACS through support from the National Science Foundation under Grant No. CCF-1445755 and partially supported by DARPA SocialSim-W911NF-17-C-0098.

Author information

Corresponding author

Correspondence to Linda Ness.

Appendices

Appendix 1: Mutual Constraint Violations by Source

The data in Tables 7, 8, 9, 10, 11, 12, 13, and 14 in this appendix is sufficient to reproduce the data quality analysis described in the paper. The data consisted of data quality statistics for each source, describing the constraint violations for each data item in that source. The sources were the individual sources 2, 3, 4, and 5 and the composite source 1. The mutual constraint violations view of the data for each of the individual sources is shown in Tables 7 through 14. The data for the composite source is a composite of the data in the individual tables, so it is not shown in a separate table. The raw input spreadsheet tables were pre-processed into 3-column tables. In each of these tables, there is one row for each unique maximal set of constraints violated by a data element. The rows are listed in decreasing order of violation. The first column lists the number of elements whose maximal violation set is the one listed in the row. The second column is the label of the path from the root of the binary tree to the node corresponding to the maximal set. The third column lists the numbers of the constraints in the maximal violation set. The mutual constraint violations for source 2 are shown in Table 7. The constraint violations for source 3 are shown in Tables 8, 9, and 10. The constraint violations for source 4 are shown in Tables 11 and 12. The constraint violations for source 5 are shown in Tables 13 and 14. The simplicial complex for each data source is generated by listing the mutual constraint violation sets for the data items in that source. These lists are shown for sources 2, 3, 4, and 5 in the right hand columns of these long tables.

Table 7 Mutual constraint violations for source 2 in decreasing order of violation
Table 8 Mutual constraint violations for source 3 in decreasing order of violation part 1
Table 9 Mutual constraint violations for source 3 in decreasing order of violation part 2
Table 10 Mutual constraint violations for source 3 in decreasing order of violation part 3
Table 11 Mutual constraint violations for source 4 in decreasing order of violation part 1
Table 12 Mutual constraint violations for source 4 in decreasing order of violation part 2
Table 13 Mutual constraint violations for source 5 in decreasing order of violation part 1
Table 14 Mutual constraint violations for source 5 in decreasing order of violation part 2
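
The pre-processing into 3-column tables described in this appendix can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: the helper name `violation_table`, the input representation (one set of violated constraint numbers per data element), the reading of "decreasing order of violation" as decreasing element count, and the path-label convention (one bit per constraint, with 1 meaning violated) are all hypothetical, and the sample data is made up.

```python
from collections import Counter


def violation_table(violations_per_item, num_constraints):
    """Return rows (count, path_label, constraint_numbers), largest count first."""
    tally = Counter(frozenset(v) for v in violations_per_item)
    rows = []
    for constraint_set, count in tally.items():
        # Assumed labeling: bit i is "1" when constraint i (1-based) is violated.
        label = "".join(
            "1" if i in constraint_set else "0"
            for i in range(1, num_constraints + 1)
        )
        rows.append((count, label, sorted(constraint_set)))
    # Sort by decreasing count; ties are broken by the label for determinism.
    rows.sort(key=lambda r: (-r[0], r[1]))
    return rows


# Made-up input: the set of violated constraints for each of six data elements.
items = [{1, 3}, {2}, {1, 3}, set(), {2}, {1, 3}]
table = violation_table(items, num_constraints=3)
for count, label, constraints in table:
    print(count, label, constraints)
```

The third column of such a table is what would serve as the list of generators for the simplicial complex of a source.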

Appendix 2: Example: Simplicial Complexes for Source 2

The lexicographically sorted list of mutual constraint violations for source 2, inferred from the right hand column of the table for source 2, is shown in Table 15. This was the input to chomp-simplicial for computation of the Betti numbers for source 2. Note that the input to chomp-simplicial for computing the Betti numbers of the other sources can be inferred in the same way from the right hand columns of the long tables in Appendix 1.

Table 15 Simplicial complex generators for source 2: maximal mutual constraint violation sets
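
To make the step from generators to Betti numbers concrete, the following Python sketch expands a list of maximal sets into the full simplicial complex and computes Betti numbers from the ranks of the boundary matrices. It is not the chomp-simplicial algorithm or its input format. The function names are hypothetical, and the choice of coefficients in GF(2) is an assumption made to keep the linear algebra short; Betti numbers over GF(2) can differ from those computed over the integers or rationals when the complex has 2-torsion.

```python
from itertools import combinations


def closure(generators):
    """All non-empty subsets of the generators, grouped by dimension."""
    by_dim = {}
    for g in generators:
        verts = sorted(set(g))
        for k in range(1, len(verts) + 1):
            by_dim.setdefault(k - 1, set()).update(combinations(verts, k))
    return {d: sorted(s) for d, s in by_dim.items()}


def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are Python ints used as bitsets."""
    basis = {}  # leading-bit position -> reduced row
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]
    return len(basis)


def betti_numbers(generators):
    """Betti numbers b_0, ..., b_top with GF(2) coefficients."""
    simplices = closure(generators)
    top = max(simplices)
    ranks = {0: 0}  # the boundary map on vertices is zero
    for d in range(1, top + 1):
        index = {s: i for i, s in enumerate(simplices[d - 1])}
        rows = []
        for s in simplices[d]:
            row = 0
            for face in combinations(s, d):  # faces of s with d vertices
                row |= 1 << index[face]
            rows.append(row)
        ranks[d] = rank_gf2(rows)
    return [
        len(simplices[d]) - ranks[d] - ranks.get(d + 1, 0)
        for d in range(top + 1)
    ]


hollow = betti_numbers([(1, 2), (2, 3), (1, 3)])  # boundary of a triangle
filled = betti_numbers([(1, 2, 3)])               # solid triangle
print(hollow, filled)
```

The hollow triangle has one connected component and one loop, while filling it in removes the loop, which is the kind of difference the Betti number statistics are meant to detect.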

Appendix 3: Representation Lemma for Measures on Tree-Structured Spaces

A natural question to ask is: Does the representation lemma for dyadic measures discussed in Sect. 3.3 and proven in [2] and [8] generalize to measures on sets with general tree structures? We provide one answer in the next lemma. Fix notions by defining a tree structure on a set X to consist of a tree \(\mathcal {T}\), a mapping \(S: nodes(\mathcal {T}) \rightarrow 2^X\) from nodes to subsets of X, and constraints on the mapping: \(S(root) = X\), and the image of a parent node is the disjoint union of the images of its child nodes, i.e., \(S(n) = \bigcup _{c: child(n)} S(c)\).

Lemma 2 (Representation Lemma for Measures on Tree-Structured Sets)

Let \( (X, \mathcal {T}, S)\) denote a tree-structured set. Let ν and μ denote strictly positive and non-negative measures, respectively, on the sigma algebra \(\varSigma (\mathcal {T})\) generated by the sets in the image of S. For a non-root node \(n \in \mathcal {T}\), let \(r \geq p \geq n\) denote the set of nodes p on the path from n to the root node r, ordered by the parent relationship. Let \(a_n\) be the parameter uniquely defined by

$$\displaystyle \begin{aligned} \mu(S(n)) = (1 + a_n) \frac{\nu(S(n))}{\nu(S(parent(n)))} \mu(S(parent(n))) \end{aligned} $$
(12)

if μ(S(parent(n))) ≠ 0. If μ(S(parent(n))) = 0, define \(a_n = 0\). Then the following statements hold.

  1. \(\nu (S(n)) = \nu (X) \prod _{r > p \geq n} {\frac {\nu (S(p))}{\nu (S(parent(p)))}}\)

  2. \(\mu (S(n)) = \mu (X) \prod _{r > p \geq n} { (1 + a_p) \frac {\nu (S(p))}{\nu (S(parent(p)))}}\)

  3. For a non-leaf node p, the function \(f: \{S(c): c \in child(p)\} \rightarrow \mathbb {R}\), \(f(S(c)) = a_c \cdot \nu (S(c))\), is orthogonal to the functions which are constant on \(\{S(c) : c \in child(p)\}\):

    $$\displaystyle \begin{aligned} 0 = \sum_{c: child(p)} {a_c \cdot \nu(S(c))}\end{aligned} $$
    (13)

    Thus, relative to a choice of a multi-resolution basis for functions on the tree consisting of the parent node p and its child nodes, the function f can be expressed as a unique linear combination of \(card(\{S(c) : c \in child(p)\}) - 1\) basis functions. For binary trees, the function \(h_p\) which is 1 on the left child node \(c_L\) and −1 on the right child node \(c_R\) is a basis function, and the linear combination is \(a_{c_L} \cdot h_p\).

  4. S maps the set of nodes at each level i in the tree to a partition \(\mathcal {T}_i\) of X into disjoint sets. The partition at level i + 1 refines the partition at level i. Let \(\nu _i\) and \(\mu _i\) denote the measures on \(\varSigma (\mathcal {T}_i)\), the sigma algebra generated by the sets in \(\mathcal {T}_i\), determined by the first two path formulas. The weak star limits of \(\nu _i\) and \(\mu _i\) exist.

  5. \(-1 \leq a_n \leq \frac {\nu (S(parent(n))) } {\nu (S(n))} - 1 \)

Proof

The first statement, the path formula for ν, is trivially true by telescoping cancellation. Informally, it says that ν(S(n)) equals ν(X) multiplied by the conditional probabilities determined by the path from the root node to n. None of the denominators in the conditional probabilities is zero because ν is strictly positive. The second statement, the path formula for μ, is proved by induction on the length of the path. It is true for a path of length 1 beginning at the root, by the definition of the parameter in the statement of the lemma. The induction step substitutes the induction hypothesis for μ(S(parent(n))). The third statement is true if μ(S(p)) = 0 because then the parameters \(a_c\) for the child nodes are all 0. If μ(S(p)) ≠ 0, the third statement is proved by noting that for a non-leaf node \(\mu (S(p)) = \sum _{c: child(p)} \mu (S(c))\), because measures are additive on disjoint sets and the tree structure definition requires that the disjoint union of the set images of child nodes equals the set image of the parent node. Substituting the path formula for μ into the right side, expanding, and factoring out μ(S(p)) give

$$\displaystyle \begin{aligned} \mu(S(p)) = \mu(S(p)) \cdot \left(\sum_{c: child(p)} {\frac{\nu(S(c))}{\nu(S(p))} } + \sum_{c: child(p)} {a_c \cdot \frac{\nu(S(c))}{\nu(S(p))} } \right) \end{aligned} $$
(14)

The first sum term in parentheses is 1 because the sum of the ν conditional probabilities of the child nodes equals 1. Since μ(S(p)) ≠ 0, dividing both sides by μ(S(p)) shows that the second sum term in parentheses equals 0. Multiply by ν(S(p)) to obtain the third statement. The first two sentences in the fourth statement are implied by the definition of tree structure. The claim in the fourth statement that the weak star limits exist is proved using the same method as used for Lemma 2.1 in [2] and Lemma 3.20 in [8]. The key point is that \(\nu _i(X)\) and \(\mu _i(X)\) are constant for all levels i. The fifth statement is true if μ(S(parent(n))) = 0 since then \(a_n = 0\) by definition. If μ(S(parent(n))) ≠ 0, the definition can be rewritten as

$$\displaystyle \begin{aligned} \frac{\mu(S(n))}{ \mu(S(parent(n)))} \cdot \frac{\nu(S(parent(n))) } {\nu(S(n))} = 1 + a_n \end{aligned} $$
(15)

Since ν is strictly positive and μ is non-negative, the left side of the equation is non-negative, so \(1 + a_n \geq 0\), implying \(a_n \geq -1\), with \(a_n = -1\) only if μ(S(n)) = 0 and μ(S(parent(n))) ≠ 0. If μ(S(n)) > 0 and μ(S(parent(n))) ≠ 0, \(1 + a_n\) is biggest when μ(S(n)) = μ(S(parent(n))) (making the first factor in the equation equal to 1), so the result follows. If the tree is a regular k-adic tree and ν is the naive measure, with ν(X) = 1 and each child node c receiving the fraction \(\frac {1}{|child\; nodes|} = \frac {1}{k}\) of the measure of its parent, then \(\frac {\nu (S(parent(n))) } {\nu (S(n))}\) is constant for all nodes and equals k, so the upper bound on \(a_n\) is k − 1 and is independent of n. This agrees with the dyadic theory where k = 2.
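
The statements of the lemma can be checked numerically on a small example. The Python sketch below computes the parameters from the defining relation (12) for a tree with mixed branching and exposes the quantities needed to check the path formula for μ, the orthogonality relation (13), and the bounds of the fifth statement. The tree, the node names, the leaf masses, and the helper names `mass`, `coefficient`, and `path_product` are all hypothetical choices made for illustration; this is a sanity check on one example, not a proof.

```python
from math import isclose

# Hypothetical tree: the root has three children, and child "a" has two.
children = {"root": ["a", "b", "c"], "a": ["a1", "a2"]}
parent = {c: p for p, cs in children.items() for c in cs}

# Made-up leaf masses; nu must be strictly positive, mu only non-negative.
nu_leaf = {"a1": 1.0, "a2": 3.0, "b": 2.0, "c": 2.0}
mu_leaf = {"a1": 1.0, "a2": 4.0, "b": 1.0, "c": 2.0}


def mass(leaf_mass, node):
    """Measure of S(node): the sum of the leaf masses below the node."""
    if node not in children:
        return leaf_mass[node]
    return sum(mass(leaf_mass, c) for c in children[node])


nu = {n: mass(nu_leaf, n) for n in ["root", *parent]}
mu = {n: mass(mu_leaf, n) for n in ["root", *parent]}


def coefficient(n):
    """a_n from the defining relation (12); 0 when the parent has no mu-mass."""
    p = parent[n]
    if mu[p] == 0:
        return 0.0
    return (mu[n] / mu[p]) * (nu[p] / nu[n]) - 1.0


a = {n: coefficient(n) for n in parent}


def path_product(n):
    """Right side of the path formula for mu (statement 2)."""
    value = mu["root"]
    while n != "root":
        value *= (1.0 + a[n]) * nu[n] / nu[parent[n]]
        n = parent[n]
    return value


for n in parent:
    upper = nu[parent[n]] / nu[n] - 1.0
    print(n, round(a[n], 4), "upper bound:", round(upper, 4))
for p, cs in children.items():
    print(p, "orthogonality sum:", sum(a[c] * nu[c] for c in cs))
```

Because the upper bound depends on the ratio of ν at the parent to ν at the node, the bounds printed for the children of the root differ from those for the children of "a", which illustrates the node-by-node dependence on ν.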

Statements 3 and 5 in the lemma show that the parameter space for general tree measures is much more complex and depends node by node on the measure ν. Statement 5 explains why the parameter space for dyadic measures on binary trees is simpler and easily related to the Haar-like basis. While it is easy to define a dyadic version of Wasserstein distance between dyadic measures, definition of a canonical Wasserstein-like distance between measures on tree-structured sets appears to be a research issue. Perhaps the tree distance theory developed in [4] could be exploited.

Copyright information

© 2019 The Author(s) and the Association for Women in Mathematics

About this chapter

Cite this chapter

Ness, L. (2019). Inference of a Dyadic Measure and Its Simplicial Geometry from Binary Feature Data and Application to Data Quality. In: Gasparovic, E., Domeniconi, C. (eds) Research in Data Science. Association for Women in Mathematics Series, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-030-11566-1_6
