A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data

Chen, Guanhua; Ma, Xiuli; Yang, Dongqing; Tang, Shiwei; Shuai, Meng

doi:10.1007/978-3-642-02279-1_41

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5566)

Abstract

Data summarization is an important data mining task that aims to find a compact description of a dataset. Emerging applications place special requirements on data summarization techniques: the ability to find a concise and informative summary of high-dimensional data; the ability to handle different types of attributes, such as binary, categorical and numeric attributes; end-user comprehensibility of the summary; insensitivity to noise and missing values; and scalability with data size and dimensionality. In this work, a general framework that satisfies all of these requirements is proposed for summarizing high-dimensional data. We formulate the problem in a bipartite graph scheme, mapping objects (data records) and attribute values onto two disjoint groups of nodes of a graph, in which a set of representative objects is discovered as the summary of the original data. Further, representativeness is measured using the MDL (minimum description length) principle, which helps to yield a highly intuitive summary made up of the most informative objects of the input data. While finding the optimal summary with minimal representation cost is computationally infeasible, an approximately optimal summary is obtained by a heuristic algorithm whose computation cost is quadratic in the size of the data and linear in its dimensionality. In addition, several techniques are developed to improve both the quality of the resulting summary and the efficiency of the algorithm. A detailed study on both real and synthetic datasets shows the effectiveness and efficiency of our approach in summarizing high-dimensional datasets with binary, categorical and numeric attributes.
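
The abstract does not give the cost function or the heuristic itself, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that numeric attributes have already been binned into discrete values, that a missing value simply produces no edge, and that a toy cost (one pointer plus one unit per differing edge) stands in for the MDL-based representation cost. All function names (`build_bipartite`, `encoding_cost`, `greedy_summary`) are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, List, Optional, Sequence, Tuple

Record = Dict[str, Optional[Hashable]]
ValueNode = Tuple[str, Hashable]  # one node on the "attribute value" side


def build_bipartite(records: Sequence[Record]) -> List[FrozenSet[ValueNode]]:
    """Link each object to the (attribute, value) nodes it takes.

    Assumption: a missing value (None) produces no edge.
    """
    return [
        frozenset((attr, val) for attr, val in rec.items() if val is not None)
        for rec in records
    ]


def encoding_cost(obj: FrozenSet[ValueNode], rep: FrozenSet[ValueNode]) -> int:
    """Toy cost of describing obj via rep: one pointer plus one unit per differing edge."""
    return 1 + len(obj ^ rep)


def greedy_summary(edges: Sequence[FrozenSet[ValueNode]], max_reps: int) -> List[int]:
    """Greedily pick representative objects while the total toy cost keeps dropping."""
    # cost[j]: cheapest known encoding of object j (initially: list all its edges).
    cost = [len(e) for e in edges]
    reps: List[int] = []
    while len(reps) < max_reps:
        best_gain, best_idx = 0, None
        for i, cand in enumerate(edges):
            if i in reps:
                continue
            # Savings over all objects, minus the price of storing the candidate.
            saving = sum(
                max(0, cost[j] - encoding_cost(obj, cand))
                for j, obj in enumerate(edges)
            )
            gain = saving - len(cand)
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:  # no candidate lowers the total cost any further
            break
        reps.append(best_idx)
        chosen = edges[best_idx]
        cost = [min(c, encoding_cost(obj, chosen)) for c, obj in zip(cost, edges)]
    return reps


# Illustrative data only; not taken from the paper.
records: List[Record] = [
    {"color": "red", "size": "S", "shape": "round"},
    {"color": "red", "size": "S", "shape": "round"},
    {"color": "red", "size": "S", "shape": "square"},
    {"color": "blue", "size": "L", "shape": "flat"},
    {"color": "blue", "size": "L", "shape": "flat"},
    {"color": "blue", "size": "L", "shape": None},
]
edges = build_bipartite(records)
summary = greedy_summary(edges, max_reps=3)
print(summary)
```

Each round of this sketch compares every candidate against every object using set operations over at most one edge per attribute, so a round costs time quadratic in the number of objects and linear in the number of attributes. That mirrors the complexity stated in the abstract, but the paper's actual heuristic and its additional optimizations may differ.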

Author information

Authors and Affiliations

School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Guanhua Chen, Xiuli Ma, Dongqing Yang & Shiwei Tang
Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, 100871, China
Xiuli Ma, Shiwei Tang & Meng Shuai
Key Laboratory of High Confidence Software Technologies (Ministry of Education), Peking University, Beijing, 100871, China
Dongqing Yang

Editor information

Editors and Affiliations

Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Avenue, Urbana, IL 61801, USA
Marianne Winslett

About this paper

Cite this paper

Chen, G., Ma, X., Yang, D., Tang, S., Shuai, M. (2009). A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data. In: Winslett, M. (ed.) Scientific and Statistical Database Management. SSDBM 2009. Lecture Notes in Computer Science, vol 5566. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02279-1_41

DOI: https://doi.org/10.1007/978-3-642-02279-1_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02278-4
Online ISBN: 978-3-642-02279-1
eBook Packages: Computer Science, Computer Science (R0)
