Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

Enterprises are accumulating large amounts of data, and it is already feasible today to have Terabyte- or even Petabyte-scale datasets that must be submitted to data mining processes. However, given a Terabyte-scale dataset of moderate-to-high dimensionality, how could one cluster its points? Numerous successful serial clustering algorithms for data in five or more dimensions exist in the literature, including the Halite algorithm described in the previous chapter. However, these algorithms are impractical for datasets spanning Terabytes and Petabytes, and examples of applications with such huge amounts of data in five or more dimensions abound (e.g., Twitter crawl: >12 TB; Yahoo! operational data: 5 Petabytes [6]). This limitation was previously summarized in Table 3.1. For datasets that do not even fit on a single disk, parallelism is a first-class option, and thus we must rethink, redesign and reimplement existing serial algorithms to allow for parallel processing. With that in mind, this chapter presents one work that explores parallelism using MapReduce for clustering huge datasets. Specifically, we describe in detail a second algorithm, named BoW [5], that focuses on data mining in large sets of complex data.

Notes

  1. Provided by Yahoo! Research (www.yahoo.com).

  2. www.hadoop.com

  3. www.yahoo.com

  4. http://twitter.com/

References

  1. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec. 27(2), 94–105 (1998). doi: 10.1145/276305.276314

  2. Agrawal, R., Gehrke, J.: Automatic subspace clustering of high dimensional data. Data Min. Knowl. Discov. 11(1), 5–33 (2005). doi: 10.1007/s10618-005-1396-1

  3. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C., Traina Jr., C.: Finding clusters in subspaces of very large, multi-dimensional datasets. In: Li, F., Moro, M.M., Ghandeharizadeh, S., Haritsa, J.R., Weikum, G., Carey, M.J., Casati, F., Chang, E.Y., Manolescu, I., Mehrotra, S., Dayal, U., Tsotras, V.J. (eds.) ICDE, pp. 625–636. IEEE (2010)

  4. Cordeiro, R.L.F., Traina, A.J.M., Faloutsos, C.: Halite: Fast and scalable multi-resolution local correlation clustering. IEEE Trans. Knowl. Data Eng. 99(PrePrints), 16 (2011). doi: 10.1109/TKDE.2011.176

  5. Cordeiro, R.L.F., Traina Jr., C., Traina, A.J.M., López, J., Kang, U., Faloutsos, C.: Clustering very large multi-dimensional datasets with MapReduce. In: Apté, C., Ghosh, J., Smyth, P. (eds.) KDD, pp. 690–698. ACM (2011)

  6. Fayyad, U.: A data miner’s story–getting to know the grand challenges. In: Invited Innovation Talk, KDD (2007). Slide 61. Available at: http://videolectures.net/kdd07_fayyad_dms/

  7. Moise, G., Sander, J.: Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, pp. 533–541 (2008)

  8. Moise, G., Sander, J., Ester, M.: P3C: A robust projected clustering algorithm. In: ICDM, pp. 414–425. IEEE Computer Society (2006)

  9. Moise, G., Sander, J., Ester, M.: Robust projected clustering. Knowl. Inf. Syst. 14(3), 273–298 (2008). doi: 10.1007/s10115-007-0090-6

  10. Yiu, M.L., Mamoulis, N.: Iterative projected clustering by subspace mining. TKDE 17(2), 176–189 (2005). doi: 10.1109/TKDE.2005.29

Author information

Corresponding author

Correspondence to Robson L. F. Cordeiro.

Copyright information

© 2013 The Author(s)

About this chapter

Cite this chapter

Cordeiro, R.L., Faloutsos, C., Traina Júnior, C. (2013). BoW. In: Data Mining in Large Sets of Complex Data. SpringerBriefs in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-4890-6_5

  • DOI: https://doi.org/10.1007/978-1-4471-4890-6_5

  • Published:

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-4889-0

  • Online ISBN: 978-1-4471-4890-6

  • eBook Packages: Computer Science, Computer Science (R0)