BoW

Cordeiro, Robson L. F.; Faloutsos, Christos; Traina Júnior, Caetano

doi:10.1007/978-1-4471-4890-6_5

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

Enterprises are accumulating large amounts of data, and today it is already feasible to have Terabyte- or even Petabyte-scale datasets that must be submitted to data mining processes. Given a Terabyte-scale dataset of moderate-to-high dimensionality, how could one cluster its points? Numerous successful serial clustering algorithms for data in five or more dimensions exist in the literature, including the Halite algorithm described in the previous chapter. However, the existing algorithms are impractical for datasets spanning Terabytes and Petabytes, and applications with such huge amounts of data in five or more dimensions abound (e.g., Twitter crawl: >12 TB; Yahoo! operational data: 5 Petabytes [6]). This limitation was previously summarized in Table 3.1. For datasets that do not even fit on a single disk, parallelism is a first-class option, and thus we must re-think, re-design and re-implement existing serial algorithms to allow for parallel processing. With that in mind, this chapter presents one work that explores parallelism using MapReduce for clustering huge datasets. Specifically, we describe in detail a second algorithm, named BoW [5], that focuses on data mining in large sets of complex data.

Notes

1. Provided by Yahoo! Research (www.yahoo.com).
2. www.hadoop.com
3. www.yahoo.com
4. http://twitter.com/

References

1. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 27(2), 94–105 (1998). doi:10.1145/276305.276314
2. Agrawal, R., Gehrke, J.: Automatic subspace clustering of high dimensional data. Data Min. Knowl. Discov. 11(1), 5–33 (2005). doi:10.1007/s10618-005-1396-1
3. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Finding clusters in subspaces of very large, multi-dimensional datasets. In: Li, F., Moro, M.M., Ghandeharizadeh, S., Haritsa, J.R., Weikum, G., Carey, M.J., Casati, F., Chang, E.Y., Manolescu, I., Mehrotra, S., Dayal, U., Tsotras, V.J. (eds.) ICDE, pp. 625–636. IEEE (2010)
4. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C.: Halite: Fast and scalable multi-resolution local correlation clustering. IEEE Trans. Knowl. Data Eng. 99(PrePrints), 16 (2011). doi:10.1109/TKDE.2011.176
5. Cordeiro, R.L.F., Traina Jr., C., Traina, A.J.M., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with MapReduce. In: Apté, C., Ghosh, J., Smyth, P. (eds.) KDD, pp. 690–698. ACM (2011)
6. Fayyad, U.: A data miner’s story–getting to know the grand challenges. In: Invited Innovation Talk, KDD (2007). Slide 61. Available at: http://videolectures.net/kdd07_fayyad_dms/
7. Moise, G., Sander, J.: Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, pp. 533–541 (2008)
8. Moise, G., Sander, J., Ester, M.: P3C: A robust projected clustering algorithm. In: ICDM, pp. 414–425. IEEE Computer Society (2006)
9. Moise, G., Sander, J., Ester, M.: Robust projected clustering. Knowl. Inf. Syst. 14(3), 273–298 (2008). doi:10.1007/s10115-007-0090-6
10. Yiu, M.L., Mamoulis, N.: Iterative projected clustering by subspace mining. TKDE 17(2), 176–189 (2005). doi:10.1109/TKDE.2005.29

Author information

Authors and Affiliations

Computer Science Department, ICMC, University of Sao Paulo, Avenue do Trabalhador Saocarlense 400, Sao Carlos, Sao Paulo, 13566-590, Brazil
Robson L. F. Cordeiro & Caetano Traina Júnior
Department of Computer Science, Carnegie Mellon University, Forbes Avenue 5000, Pittsburgh, Pennsylvania, 15213, USA
Christos Faloutsos

Corresponding author

Correspondence to Robson L. F. Cordeiro.

About this chapter

Cite this chapter

Cordeiro, R.L., Faloutsos, C., Traina Júnior, C. (2013). BoW. In: Data Mining in Large Sets of Complex Data. SpringerBriefs in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-4890-6_5

DOI: https://doi.org/10.1007/978-1-4471-4890-6_5
Published: 11 January 2013
Publisher Name: Springer, London
Print ISBN: 978-1-4471-4889-0
Online ISBN: 978-1-4471-4890-6
eBook Packages: Computer Science, Computer Science (R0)
