Abstract
In the frequent string mining problem, one is given m databases \(\mathcal{D}_1,\ldots,\mathcal{D}_m\) of strings and searches for strings that fulfill certain frequency constraints. The constraints consist of m pairs of thresholds \((\mathit{minf}_1,\mathit{maxf}_1),\ldots,(\mathit{minf}_m,\mathit{maxf}_m)\), and one wants to find all strings φ that satisfy \(\mathit{minf}_i \le \mathit{freq}(\phi, \mathcal{D}_i) \le \mathit{maxf}_i\) for all i with 1 ≤ i ≤ m, where \(\mathit{freq}(\phi,\mathcal{D}_i) = |\{ \psi \in \mathcal{D}_i : \phi \text{ is a substring of } \psi \}|\).
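To make the problem statement concrete, the following minimal sketch computes freq(φ, D) directly from its definition and enumerates solutions by brute force. It is not the algorithm of this paper or of Fischer et al. [2]: it runs in far more than linear time and is meant only to illustrate the definitions. The names `freq`, `substrings`, and `mine_naive` are hypothetical, and the sketch assumes that minf_1 ≥ 1, so that every solution occurs in the first database and the candidate set is finite.

```python
from typing import List, Sequence, Set, Tuple


def freq(phi: str, database: Sequence[str]) -> int:
    """Number of strings in `database` that contain `phi` as a substring.

    Each string counts at most once, however often `phi` occurs in it.
    """
    return sum(1 for psi in database if phi in psi)


def substrings(database: Sequence[str]) -> Set[str]:
    """All non-empty substrings of all strings in `database`."""
    return {
        psi[i:j]
        for psi in database
        for i in range(len(psi))
        for j in range(i + 1, len(psi) + 1)
    }


def mine_naive(
    databases: Sequence[Sequence[str]],
    thresholds: Sequence[Tuple[int, int]],
) -> List[str]:
    """Brute-force frequent string mining (illustration only).

    `thresholds[i]` is the pair (minf_i, maxf_i) for `databases[i]`.
    Assumption: minf_1 >= 1, so only substrings of the first database
    need to be considered as candidates.
    """
    if len(databases) != len(thresholds):
        raise ValueError("need exactly one threshold pair per database")
    if thresholds[0][0] < 1:
        raise ValueError("this sketch assumes minf_1 >= 1")
    candidates = substrings(databases[0])
    return sorted(
        phi
        for phi in candidates
        if all(
            minf <= freq(phi, db) <= maxf
            for db, (minf, maxf) in zip(databases, thresholds)
        )
    )


if __name__ == "__main__":
    d1 = ["abab", "abc"]
    d2 = ["bcd", "xab"]
    # Strings contained in both strings of d1 and in at most one string of d2.
    print(mine_naive([d1, d2], [(2, 2), (0, 1)]))
```

In this example, "a", "ab", and "b" each occur in both strings of the first database, but "b" occurs in two strings of the second database and therefore violates maxf_2 = 1.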
Fischer et al. [2] presented an algorithm that solves the frequent string mining problem in linear time under the assumption that the number of databases is treated as a constant. The space consumption of this algorithm, however, is proportional to the total size of all databases. We improve this algorithm in such a way that its space consumption is proportional to the size of the largest database, and it takes linear time regardless of the number of databases. Also, our algorithm is more flexible in the sense that one of several databases can be replaced without having to recalculate everything, that is, intermediate data can be stored on file and be reused.
This is an extended abstract of an article published in the Data Mining and Knowledge Discovery journal [1].
References
[1] Kügel, A., Ohlebusch, E.: A space efficient solution to the frequent string mining problem. Data Mining and Knowledge Discovery 17(1), 24–38 (2008)
[2] Fischer, J., Heun, V., Kramer, S.: Optimal string mining under frequency constraints. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 139–150. Springer, Heidelberg (2006)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kügel, A., Ohlebusch, E. (2008). A Space Efficient Solution to the Frequent String Mining Problem for Many Databases. In: Daelemans, W., Goethals, B., Morik, K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2008. Lecture Notes in Computer Science, vol 5211. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87479-9_14
DOI: https://doi.org/10.1007/978-3-540-87479-9_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87478-2
Online ISBN: 978-3-540-87479-9
eBook Packages: Computer Science, Computer Science (R0)