A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5566)

Abstract

Data summarization is an important data mining task that aims to find a compact description of a dataset. Emerging applications place special requirements on data summarization techniques: the ability to find a concise and informative summary of high-dimensional data; the ability to handle different attribute types, such as binary, categorical and numeric attributes; end-user comprehensibility of the summary; insensitivity to noise and missing values; and scalability with data size and dimensionality. In this work, a general framework that satisfies all of these requirements is proposed for summarizing high-dimensional data. We formulate the problem in a bipartite graph scheme, mapping objects (data records) and attribute values to two disjoint groups of nodes of a graph, in which a set of representative objects is discovered as the summary of the original data. Further, representativeness is measured using the MDL (minimum description length) principle, which helps to yield a highly intuitive summary consisting of the most informative objects of the input data. Because finding the optimal summary with minimal representation cost is computationally infeasible, an approximately optimal summary is obtained by a heuristic algorithm whose computation cost is quadratic in the size of the data and linear in its dimensionality. In addition, several techniques are developed to improve both the quality of the resulting summary and the efficiency of the algorithm. A detailed study on both real and synthetic datasets shows the effectiveness and efficiency of our approach in summarizing high-dimensional datasets with binary, categorical and numeric attributes.
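To make the bipartite formulation concrete, the following is a minimal illustrative sketch in Python. It is not the authors' algorithm, whose details the abstract does not give. It assumes a toy description-length measure (a representative costs one unit per edge; any other object costs either its own edges or one pointer plus the edges on which it differs from its closest representative) and a simple greedy selection loop. The function names `build_bipartite`, `description_length` and `greedy_summary` are hypothetical, and numeric attributes are assumed to have been binned into discrete values beforehand.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Set, Tuple

Record = Dict[str, Optional[Hashable]]
ValueNode = Tuple[str, Hashable]  # one node per (attribute, value) pair


def build_bipartite(records: Sequence[Record]) -> List[Set[ValueNode]]:
    """Link each object node (a record index) to its attribute-value nodes.

    A missing value (None) simply contributes no edge.
    """
    return [{(a, v) for a, v in r.items() if v is not None} for r in records]


def description_length(edges: List[Set[ValueNode]], reps: Set[int]) -> int:
    """Toy MDL-style cost (an assumption, not the paper's measure)."""
    total = 0
    for i, e in enumerate(edges):
        if i in reps:
            total += len(e)  # a representative is stored in full
        else:
            # one pointer unit plus the edges that differ from the closest rep
            via_rep = min((1 + len(e ^ edges[r]) for r in reps), default=len(e))
            total += min(len(e), via_rep)
    return total


def greedy_summary(edges: List[Set[ValueNode]]) -> Tuple[List[int], int]:
    """Add the representative that lowers the cost most until nothing helps."""
    reps: Set[int] = set()
    best = description_length(edges, reps)
    while True:
        scored = [(description_length(edges, reps | {i}), i)
                  for i in range(len(edges)) if i not in reps]
        if not scored:
            break
        new_cost, pick = min(scored)  # ties go to the lowest index
        if new_cost >= best:
            break
        reps.add(pick)
        best = new_cost
    return sorted(reps), best


# Made-up example data: two groups of similar records, one missing value.
records = [
    {"color": "red", "size": "S", "sale": 1},
    {"color": "red", "size": "S", "sale": 1},
    {"color": "red", "size": "S", "sale": 0},
    {"color": "blue", "size": "L", "sale": 0},
    {"color": "blue", "size": "L", "sale": 0},
    {"color": "blue", "size": "L", "sale": None},
]
edges = build_bipartite(records)
reps, cost = greedy_summary(edges)
print(reps, cost)  # → [0, 3] 13
```

On this toy input the uncompressed cost is 17 units, and the greedy loop settles on one representative per group. Each greedy step compares every candidate against every object, so this naive sketch does not achieve the quadratic running time stated in the abstract; it only illustrates the object-node and value-node mapping and the idea of choosing representatives by description length.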

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chen, G., Ma, X., Yang, D., Tang, S., Shuai, M. (2009). A Bipartite Graph Framework for Summarizing High-Dimensional Binary, Categorical and Numeric Data. In: Winslett, M. (eds) Scientific and Statistical Database Management. SSDBM 2009. Lecture Notes in Computer Science, vol 5566. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02279-1_41

  • DOI: https://doi.org/10.1007/978-3-642-02279-1_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02278-4

  • Online ISBN: 978-3-642-02279-1

  • eBook Packages: Computer Science, Computer Science (R0)
