Comparing Datasets Using Frequent Itemsets: Dependency on the Mining Parameters

Ntoutsi, Irene; Theodoridis, Yannis

doi:10.1007/978-3-540-87881-0_20

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5138)

Abstract

Comparison between sets of frequent itemsets has been traditionally utilized for raw dataset comparison assuming that frequent itemsets inherit the information lying in the original raw datasets. In this work, we revisit this assumption and examine whether dissimilarity between sets of frequent itemsets could serve as a measure of dissimilarity between raw datasets. In particular, we investigate how the dissimilarity between two sets of frequent itemsets is affected by the minSupport threshold used for their generation and the adopted compactness level of the itemsets lattice, namely frequent itemsets, closed frequent itemsets or maximal frequent itemsets. Our analysis shows that utilizing frequent itemsets comparison for dataset comparison is not as straightforward as related work has argued, a result which is verified through an experimental study and opens issues for further research in the KDD field.
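The dependency described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two toy datasets, the brute-force miner, the helper names, and in particular the Jaccard-style dissimilarity, which stands in for whatever measure the authors actually use. The sketch mines frequent, closed and maximal itemsets from two datasets and compares the resulting sets at several minSupport thresholds.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force miner: return {itemset: absolute count} for every itemset
    whose relative support is at least min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            count = sum(1 for t in transactions if itemset <= t)
            if count >= min_support * n:
                result[itemset] = count
                found = True
        if not found:
            # No frequent k-itemset implies no frequent (k+1)-itemset.
            break
    return result


def closed_itemsets(freq):
    """Keep itemsets that have no proper superset with the same count."""
    return {x: c for x, c in freq.items()
            if not any(x < y and freq[y] == c for y in freq)}


def maximal_itemsets(freq):
    """Keep itemsets that have no frequent proper superset."""
    return {x: c for x, c in freq.items() if not any(x < y for y in freq)}


def jaccard_dissimilarity(a, b):
    """Illustrative measure (an assumption): 1 - |A ∩ B| / |A ∪ B|,
    computed over the itemsets only and ignoring their supports."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


# Two hypothetical toy datasets over the items a, b, c.
D1 = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
D2 = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"c"}]

if __name__ == "__main__":
    for min_support in (0.25, 0.5, 0.75):
        f1 = frequent_itemsets(D1, min_support)
        f2 = frequent_itemsets(D2, min_support)
        print(
            min_support,
            round(jaccard_dissimilarity(f1, f2), 3),
            round(jaccard_dissimilarity(closed_itemsets(f1), closed_itemsets(f2)), 3),
            round(jaccard_dissimilarity(maximal_itemsets(f1), maximal_itemsets(f2)), 3),
        )
```

On these toy inputs the same pair of datasets receives different dissimilarity scores depending on the threshold and on whether frequent, closed or maximal itemsets are compared, which is the kind of parameter dependency the abstract refers to. The numbers themselves carry no meaning beyond this illustration.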

Author information

Authors and Affiliations

Department of Informatics, University of Piraeus, Greece
Irene Ntoutsi & Yannis Theodoridis

Editor information

John Darzentas, George A. Vouros, Spyros Vosinakis, Argyris Arnellos

Cite this paper

Ntoutsi, I., Theodoridis, Y. (2008). Comparing Datasets Using Frequent Itemsets: Dependency on the Mining Parameters. In: Darzentas, J., Vouros, G.A., Vosinakis, S., Arnellos, A. (eds) Artificial Intelligence: Theories, Models and Applications. SETN 2008. Lecture Notes in Computer Science, vol 5138. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87881-0_20

DOI: https://doi.org/10.1007/978-3-540-87881-0_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87880-3
Online ISBN: 978-3-540-87881-0
eBook Packages: Computer Science, Computer Science (R0)
