
Classifying Streaming Data II: Time-Dependent Data

Principles of Data Mining

Part of the book series: Undergraduate Topics in Computer Science (UTICS)


Abstract

This chapter builds on the description in Chapter 21 of the H-Tree algorithm for classifying streaming data, i.e. data that arrives (generally in large quantities) from some automatic process over a period of days, months, years or potentially forever. Chapter 21 was concerned with stationary data generated from a fixed causal model; this chapter (Chapter 22) is concerned with time-dependent data, where the underlying model can change from time to time, perhaps seasonally. This phenomenon is known as concept drift.

The algorithm given here, CDH-Tree, is a variant of the popular CVFDT algorithm which generates a type of decision tree called a Hoeffding Tree. The algorithm is described and explained in detail with accompanying pseudocode for the benefit of readers who may be interested in developing their own implementations. A detailed example using synthetic data is given to illustrate the way in which the classification tree evolves as more and more records are processed in the presence of concept drift.
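The abstract does not state the statistical test behind a Hoeffding Tree. As a minimal sketch, assuming the standard Hoeffding bound used by Domingos and Hulten (reference 1), a leaf is split once the gap between the two best candidate attributes exceeds a margin that shrinks as more records arrive. The function names and parameters below are hypothetical and are not taken from the chapter's pseudocode, which may differ in detail.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Margin epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples.

    Standard form: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
    """
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Hypothetical helper: split when the best attribute beats the runner-up
    by more than the Hoeffding margin for the n records seen at this leaf."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # The margin shrinks as n grows, so a fixed gap eventually becomes decisive.
    for n in (100, 1000, 10000):
        print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

The values of `value_range` and `delta` are choices made by the implementer; the ones used in the demonstration loop are illustrative only.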


Notes

  1. We will assume that each record comprises a set of attribute values together with a classification.

  2. Figure 22.4 is identical to Figure 22.1. It is repeated for ease of reference.

  3. This is a restriction imposed by the CDH-Tree algorithm. It would be possible to allow nodes in an alternate tree to have their own alternate nodes but at the risk of creating and needing to maintain an increasingly unwieldy structure, most of which will never form part of the main tree. It is only the current main tree that is used for prediction.

  4. Although nodes 14, 15, 16, 29 and 30 were previously parts of an alternate tree they are now in the main tree and so potentially can have alternate tree structures attached to them.

  5. Nodes 4 and 8 in Figure 22.8 and the subtrees hanging from them are not part of the revised structure and are no longer accessible. It may be possible for a practical implementation to reuse the memory they occupy but we will not pursue this here.

  6. We will adopt the convention that the branches at each internal node correspond to attribute values 1 and 2 (or 1, 2 and 3 in the case of age) in that order, working from left to right. So, for example, node 6 corresponds to a rule with left-hand side IF \(\textit{tears}=2\) AND \(\textit{astig}=1\) AND \(\textit{age}=2\). (The corresponding classifications on the right-hand side are not shown.)

  7. Up to the point where the sliding window is full, and provided \(D\) is greater than \(W\), CDH-Tree is effectively the same algorithm as H-Tree.

  8. We have left attribute \(\textit{age}\) unchanged to avoid irrelevant complications. It has three attribute values whereas the other attributes all have only two.

  9. Strictly, the nodes were numbered differently from Figure 22.15, but in the same order.
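Note 7 refers to a sliding window of \(W\) records. The following is a rough sketch of that idea only, not the chapter's pseudocode: it assumes, as in note 1, that each record is a set of attribute values plus a classification, and it keeps per-attribute counts for the most recent \(W\) records by adding each new record and forgetting the oldest once the window is full. The class name, method names and class labels are hypothetical, and the parameter \(D\) mentioned in the note is not modelled here.

```python
from collections import Counter, deque


class SlidingWindowCounts:
    """Counts of (attribute, value, classification) over the last W records."""

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self.window = deque()    # records currently inside the window
        self.counts = Counter()  # frequency counts for those records only

    def add(self, attributes: dict, classification: str) -> None:
        """Add a new record; once the window is full, forget the oldest one."""
        self.window.append((attributes, classification))
        self._update(attributes, classification, +1)
        if len(self.window) > self.window_size:
            old_attributes, old_class = self.window.popleft()
            self._update(old_attributes, old_class, -1)

    def _update(self, attributes: dict, classification: str, step: int) -> None:
        for name, value in attributes.items():
            key = (name, value, classification)
            self.counts[key] += step
            if self.counts[key] == 0:
                del self.counts[key]  # drop entries that no record supports


if __name__ == "__main__":
    stats = SlidingWindowCounts(window_size=2)
    stats.add({"tears": 1, "astig": 2}, "a")
    stats.add({"tears": 2, "astig": 2}, "b")
    stats.add({"tears": 2, "astig": 1}, "b")  # the first record is forgotten
    print(dict(stats.counts))
```

Until the window fills, nothing is ever forgotten, so the counts are the same as they would be with no window at all.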

References

  1. Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80). New York: ACM.


  2. Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining (pp. 97–106). New York: ACM.



Copyright information

© 2016 Springer-Verlag London Ltd.

About this chapter

Cite this chapter

Bramer, M. (2016). Classifying Streaming Data II: Time-Dependent Data. In: Principles of Data Mining. Undergraduate Topics in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-7307-6_22


  • DOI: https://doi.org/10.1007/978-1-4471-7307-6_22


  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-7306-9

  • Online ISBN: 978-1-4471-7307-6

  • eBook Packages: Computer Science, Computer Science (R0)
