Parameterless Data Compression and Noise Filtering Using Association Rule Mining

  • Yew-Kwong Woon
  • Xiang Li
  • Wee-Keong Ng
  • Wen-Feng Lu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2737)

Abstract

The explosion of raw data in our information age necessitates the use of unsupervised knowledge discovery techniques to understand mountains of data. Cluster analysis is suitable for this task because of its ability to discover natural groupings of objects without human intervention. However, noise in the data greatly affects clustering results. Existing clustering techniques use density-based, grid-based or resolution-based methods to handle noise, but they require the fine-tuning of complex parameters. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. There are several noise/outlier detection techniques, but they too need suitable parameters. In this paper, we present a novel parameterless method of filtering noise using ideas borrowed from association rule mining. We term our technique FLUID (Filtering Using Itemset Discovery). FLUID automatically discovers representative points in the dataset without any input parameter by mapping the dataset into a form suitable for frequent itemset discovery. After frequent itemsets are discovered, they are mapped back to their original form and become representative points of the original dataset. As such, FLUID accomplishes both data and noise reduction simultaneously, making it an ideal preprocessing step for cluster analysis. Experiments involving a prominent synthetic dataset demonstrate the effectiveness and efficiency of FLUID.
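
The map, mine and map-back idea described in the abstract can be loosely illustrated as follows. This is a minimal sketch and not the FLUID algorithm itself, since the abstract does not give its details. It assumes 2-D points, treats each grid cell as an "item", and keeps cells that occur often enough. The function name representative_points and the parameters cell_size and min_count are hypothetical, and unlike FLUID this sketch does require parameters.

```python
from collections import Counter

def representative_points(points, cell_size, min_count):
    """Illustrative only: NOT the FLUID algorithm from the paper.

    cell_size and min_count are assumed parameters; FLUID is
    described as needing none.
    """
    # Map each 2-D point to a discrete grid cell (the "item").
    cells = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    # Keep the frequently occurring cells, then map each one back
    # to the original space as the centre of its cell.
    return sorted(
        ((cx + 0.5) * cell_size, (cy + 0.5) * cell_size)
        for (cx, cy), count in cells.items()
        if count >= min_count
    )

# Four points crowd into one cell; one isolated point acts as noise.
points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.5, 5.5), (0.4, 0.3)]
reps = representative_points(points, cell_size=1.0, min_count=2)
print(reps)
```

In this toy run the four nearby points collapse into a single representative and the isolated point is dropped, which shows data reduction and noise reduction happening in one pass.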

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Yew-Kwong Woon (1)
  • Xiang Li (2)
  • Wee-Keong Ng (1)
  • Wen-Feng Lu (2, 3)

  1. Nanyang Technological University, Singapore
  2. Singapore Institute of Manufacturing Technology, Singapore
  3. Singapore-MIT Alliance