Abstract
The amount of data collected by enterprises keeps growing, and today it is already feasible to have Terabyte- or even Petabyte-scale datasets that must be submitted to data mining processes. However, given a Terabyte-scale dataset of moderate-to-high dimensionality, how could one cluster its points? Numerous successful serial clustering algorithms for data in five or more dimensions exist in the literature, including the algorithm Halite that we described in the previous chapter. However, the existing algorithms are impractical for datasets spanning Terabytes and Petabytes, and examples of applications with such huge amounts of data in five or more dimensions abound (e.g., Twitter crawl: >12 TB; Yahoo! operational data: 5 Petabytes [6]). This limitation was previously summarized in Table 3.1. For datasets that do not even fit on a single disk, parallelism is a first-class option, and thus we must re-think, re-design and re-implement existing serial algorithms to allow for parallel processing. With that in mind, this chapter presents one work that explores parallelism using MapReduce for clustering huge datasets. Specifically, we describe in detail a second algorithm, named BoW [5], that focuses on data mining in large sets of complex data.
Notes
1. Provided by Yahoo! Research (www.yahoo.com).
2. www.hadoop.com
3. www.yahoo.com
4. http://twitter.com/
References
Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 27(2), 94–105 (1998). doi: 10.1145/276305.276314
Agrawal, R., Gehrke, J.: Automatic subspace clustering of high dimensional data. Data Min. Knowl. Discov. 11(1), 5–33 (2005). doi: 10.1007/s10618-005-1396-1
Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Finding clusters in subspaces of very large, multi-dimensional datasets. In: Li, F., Moro, M.M., Ghandeharizadeh, S., Haritsa, J.R., Weikum, G., Carey, M.J., Casati, F., Chang, E.Y., Manolescu, I., Mehrotra, S., Dayal, U., Tsotras, V.J. (eds.) ICDE, pp. 625–636. IEEE (2010)
Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C.: Halite: Fast and scalable multi-resolution local correlation clustering. IEEE Trans. Knowl. Data Eng. 99(PrePrints), 16 (2011). doi: 10.1109/TKDE.2011.176
Cordeiro, R.L.F., Traina Jr., C., Traina, A.J.M., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with mapreduce. In: Apté, C., Ghosh, J., Smyth, P. (eds.) KDD, pp. 690–698. ACM (2011)
Fayyad, U.: A data miner’s story–getting to know the grand challenges. In: Invited Innovation Talk, KDD (2007). Slide 61. Available at: http://videolectures.net/kdd07_fayyad_dms/
Moise, G., Sander, J.: Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, pp. 533–541 (2008).
Moise, G., Sander, J., Ester, M.: P3C: A robust projected clustering algorithm. In: ICDM, pp. 414–425. IEEE Computer Society (2006)
Moise, G., Sander, J., Ester, M.: Robust projected clustering. Knowl. Inf. Syst. 14(3), 273–298 (2008). doi: 10.1007/s10115-007-0090-6
Yiu, M.L., Mamoulis, N.: Iterative projected clustering by subspace mining. TKDE 17(2), 176–189 (2005). doi: 10.1109/TKDE.2005.29
© 2013 The Author(s)
About this chapter
Cite this chapter
Cordeiro, R.L., Faloutsos, C., Traina Júnior, C. (2013). BoW. In: Data Mining in Large Sets of Complex Data. SpringerBriefs in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-4890-6_5
DOI: https://doi.org/10.1007/978-1-4471-4890-6_5
Publisher Name: Springer, London
Print ISBN: 978-1-4471-4889-0
Online ISBN: 978-1-4471-4890-6
eBook Packages: Computer Science, Computer Science (R0)