Privacy-Preserving Data Stream Classification

  • Yabo Xu
  • Ke Wang
  • Ada Wai-Chee Fu
  • Rong She
  • Jian Pei
Part of the Advances in Database Systems book series (ADBS, volume 34)

In a wide range of applications, multiple data streams need to be examined together to discover trends or patterns that span several streams. One common practice is to redirect all data streams to a central place for joint analysis. This “centralized” practice is challenged by the fact that data streams are often private, because they come from different owners. In this paper, we focus on the problem of building a classifier in this context and assume that the classification evolves as the current window of the streams slides forward. This problem faces two major challenges. First, the many-to-many join relationship among streams will blow up the already fast arrival rate of data streams. Second, the privacy requirement implies that data exchange among owners should be minimal. These considerations rule out all classification methods that require producing the join in the current window. We show that Naive Bayesian Classification (NBC) presents a unique opportunity to address this problem. Our main contribution is to adopt NBC to solve the classification problem for private data streams.
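The join blow-up, and the reason a count-based classifier such as NBC can avoid it, can be illustrated with a small sketch. This is not the chapter's algorithm and it ignores the privacy requirement entirely; it only shows that the (attribute value, class) counts that a naive Bayes model is built from can be obtained from per-key counts without producing the join. The two-stream schema, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def join_size(left_keys, right_keys):
    """Size of the equi-join, computed from per-key counts only."""
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(n * right[k] for k, n in left.items())


def joined_counts(s1, s2):
    """Count (attribute value, class) pairs in the join of s1 and s2.

    s1: list of (key, class_label) tuples from one window (hypothetical schema).
    s2: list of (key, attr_value) tuples from another window.
    The join itself is never materialized.
    """
    by_class = Counter(s1)  # (key, class) -> count
    by_attr = Counter(s2)   # (key, attr)  -> count
    attrs_of_key = {}
    for (k, a), n in by_attr.items():
        attrs_of_key.setdefault(k, []).append((a, n))
    out = Counter()
    for (k, c), n in by_class.items():
        for a, m in attrs_of_key.get(k, []):
            out[(a, c)] += n * m  # n tuples on one side match m on the other
    return out


def materialized_counts(s1, s2):
    """Reference version: build the join explicitly, then count."""
    return Counter((a, c) for (k1, c), (k2, a) in product(s1, s2) if k1 == k2)


# Hypothetical windows: 4 tuples each, joined many-to-many on the key.
s1 = [("k1", "yes"), ("k1", "no"), ("k1", "yes"), ("k2", "no")]
s2 = [("k1", "high"), ("k1", "low"), ("k2", "low"), ("k2", "low")]

print(join_size([k for k, _ in s1], [k for k, _ in s2]))
print(joined_counts(s1, s2) == materialized_counts(s1, s2))
```

In this toy window, two streams of four tuples each already produce a larger join, while the count tables stay bounded by the number of distinct (key, value) pairs. Exchanging per-key counts in the clear, as done here, would still reveal information between owners, so a privacy-preserving method needs more than this sketch.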


Keywords: privacy, data streams, classification, Naive Bayesian classification





Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Yabo Xu (1)
  • Ke Wang (1)
  • Ada Wai-Chee Fu (2)
  • Rong She (1)
  • Jian Pei (1)
  1. School of Computing Science, Simon Fraser University, Burnaby, Canada
  2. Department of Computer Science, Chinese University of Hong Kong, China
