Inference of a Dyadic Measure and Its Simplicial Geometry from Binary Feature Data and Application to Data Quality
We propose a new method for representing data sets with an ordered set of binary features that summarizes both measure-theoretic and topological properties. The method does not require the data to have metric space properties. A data set with an ordered set of binary features is viewed as a dyadic set with a dyadic measure. We prove that dyadic sets with dyadic measures have a canonical set of binary features and determine canonical nerve simplicial complexes. The method computes two related representations: multiscale parameters for the dyadic measure and the Betti numbers of the simplicial complex. It exploits the dyadic product formula representation formulated in previous work. The parameters characterize the relative skewness of the measure at dyadic scales and localities. The more abstract Betti number statistics summarize the simplicial geometry of the support of the measure. We prove that they provide a simple privacy property. We compare our methods with other results for measures on sets with tree structures, recent multi-resolution theory, and computational topology. We illustrate the method on a data quality data set and propose future research directions.
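As a rough illustration of the multiscale parameters, the following Python sketch computes one skewness coefficient per dyadic set from rows of ordered binary features. It is not the paper's implementation. The function names and the sign convention a_S = (mu(S0) - mu(S1)) / mu(S), where S0 and S1 are the halves of S on which the next feature is 0 and 1, are assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction


def dyadic_product_coefficients(rows):
    """Map each binary prefix (a dyadic set) to its skewness coefficient.

    rows: iterable of equal-length tuples of 0/1, features in a fixed order.
    Assumed convention: a_S = (mu(S0) - mu(S1)) / mu(S).
    Only dyadic sets that contain at least one row receive a coefficient.
    """
    rows = [tuple(r) for r in rows]
    depth = len(rows[0])
    # Count how many rows fall in each dyadic set (each prefix of the features).
    counts = Counter()
    for r in rows:
        for k in range(depth + 1):
            counts[r[:k]] += 1
    coeffs = {}
    for prefix, total in counts.items():
        if len(prefix) == depth:
            continue  # finest scale: no further split
        left = counts.get(prefix + (0,), 0)
        right = counts.get(prefix + (1,), 0)
        coeffs[prefix] = Fraction(left - right, total)
    return coeffs


def measure_from_coefficients(coeffs, cell):
    """Rebuild the normalized measure of one finest-scale cell as a product."""
    mass = Fraction(1)
    for k, bit in enumerate(cell):
        # Prefixes with no data carry no coefficient; 0 is used as a placeholder.
        a = coeffs.get(cell[:k], Fraction(0))
        mass *= (1 + a) / 2 if bit == 0 else (1 - a) / 2
    return mass
```

A coefficient near +1 or -1 means almost all of the mass of a dyadic set lies in one of its halves, and a coefficient of 0 means the split is balanced. Exact fractions are used here so that the product of the factors reproduces the empirical measure without rounding error.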
The author gratefully acknowledges the CCICADA Center at Rutgers and the CCICADA Data Quality Team for providing the raw data quality statistics and thanks Christie Nelson for explaining them. The author also gratefully acknowledges use of the open source Computational Homology Project software (CHomP) and thanks Shaun Harker for assistance with the installation and use of the software. This work was partially enabled by DIMACS through support from the National Science Foundation under Grant No. CCF-1445755 and partially supported by DARPA SocialSim-W911NF-17-C-0098.
- 2. D. Bassu, P.W. Jones, L. Ness, D. Shallcross, Product Formalisms for Measures on Spaces with Binary Tree Structures: Representation, Visualization and Multiscale Noise, submitted to SIGMA Forum of Maths (under revision) (2016). https://arxiv.org/abs/1601.02946
- 9. M. Gavish, B. Nadler, R. Coifman, Multiscale wavelets on trees, graphs and high dimensional data: theory and applications to semi supervised learning, in Proceedings of the 27th International Conference on Machine Learning (Omnipress, Madison, 2010), pp. 367–374
- 11. T. Kaczynski, K. Mischaikow, M. Mrozek, Computational Homology, Applied Mathematical Sciences, vol. 157 (Springer, New York, 2004)
- 14. X. Meng, A trio of inference problems that could win you a Nobel Prize in statistics (if you help fund it), in Past, Present, Future Stat. Sci. (CRC Press, Boca Raton, 2014), pp. 537–562
- 15. L. Ness, Dyadic product formula representations of confidence measures and decision rules for dyadic data set samples, in MISNC SI DS 201 (ACM, New York, 2016)