High Performance Data Mining and Knowledge Discovery
Many, perhaps most, organizations use computers when they interact with their customers. As a result, and almost by accident, many organizations have accumulated huge amounts of data about such interactions. Over the past five to ten years, they have increasingly tried to use this data for commercial advantage. This process began by accumulating transaction data into data warehouses, where it could be made available for decision support and retrospective analysis. The effectiveness of such analysis largely depends on the ability of individuals to formulate queries that will reveal key facts about the organization and its customers.
Increasingly, both the volume of data and its complexity have taken the problem beyond the ability of any individual to analyze. Data mining is the automated analysis of large volumes of data, looking for relationships and knowledge that are implicit in the data and are ‘interesting’ in the sense of impacting an organization’s practice. Research and development work in the area of knowledge discovery and data mining concerns the study and definition of techniques, methods, and tools for the extraction of novel, useful, and implicit patterns from data. It builds on machine learning, database technology, and statistics, but is distinguished by problems of scale: the data sets involved are so large that most applications tend to use conceptually straightforward, but carefully optimized, algorithms.
There is a natural confluence between parallel computation and data mining. For researchers in parallel computation, data mining is an application area that is growing in importance, and that introduces interesting new problems (irregularity, data representation and storage, multiple parallelization strategies, symbolic computation) that have not been so critical in scientific and numerical computing. For organizations that want to use data mining in their day-to-day work, parallel computation offers increased performance, which in turn may translate into commercial advantage. When data mining tools are implemented on high-performance parallel computers, they can analyze massive databases in a reasonable time. Faster processing also means that users can experiment with more models to understand complex data, and it makes it practical to analyze greater quantities of data. Larger databases, in turn, can yield improved predictions.
Data mining, even sequentially, is not yet mature, and many of the existing applications are relatively unsophisticated. Nevertheless, it seemed useful to explore the fledgling projects that are looking at the connections between parallel computing and data mining. This track has assembled a small number of papers that describe such research experiences.
The first paper, “Mining of Association Rules in Very Large Databases: A Structured Parallel Approach” by Becuzzi, Coppola, and Vanneschi, presents a case study implementing a parallel version of the Apriori association rule algorithm using the skeleton-based language SkIE.