Backtrack-Based and Window-Oriented Optimistic Failure Recovery in Distributed Stream Processing

  • Qiming Chen
  • Meichun Hsu
  • Malu Castellanos
Conference paper
Part of the Lecture Notes in Business Information Processing book series (LNBIP, volume 206)


Supporting transactional properties and fault tolerance is key to applying stream processing to industry-scale applications; however, the corresponding latency overhead must be minimized to accommodate real-time analytics. This issue has been studied in various contexts. In this work we develop a backtrack failure recovery mechanism that allows a task to roll forward without waiting for acknowledgements from its downstream target tasks in the failure-free case, and to request its upstream source tasks to resend missing tuples only during failure recovery, which is the rare case and thus has limited impact on overall performance. To reduce latency further, we extend our solution in another dimension by applying the notion of optimistic checkpointing to stream processing, and propose the Continued stream processing with Window-based Checkpoint and Recovery (CWCR) approach, which allows a task to emit results continuously, tuple by tuple, but to checkpoint and acknowledge in batch, only once per window (e.g., a time window). We also tackle the hard problems found in implementing a transactional layer on top of an existing stream processing platform. We have implemented the proposed mechanisms on Fontainebleau, the distributed stream analytics infrastructure we built on top of the open-source Storm platform. Our experimental results demonstrate the novelty of the proposed techniques and the feasibility of supporting fault tolerance with minimal latency overhead for real-time stream processing.
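The interplay between per-tuple emission, per-window checkpointing, and backtrack recovery can be illustrated with a minimal sketch. It is not the Fontainebleau or Storm API: the classes `SourceTask` and `SumTask`, the count-based window, and the in-memory "checkpoint" are all assumptions made for illustration, and a real system would persist checkpoints durably and handle duplicate output downstream.

```python
from typing import List, Tuple


class SourceTask:
    """Hypothetical upstream task that buffers emitted tuples for possible resend."""

    def __init__(self) -> None:
        self.log: List[Tuple[int, int]] = []  # (sequence number, value)

    def emit(self, seq: int, value: int) -> Tuple[int, int]:
        self.log.append((seq, value))
        return seq, value

    def ack(self, seq: int) -> None:
        # Downstream has checkpointed up to seq, so older tuples can be dropped.
        self.log = [t for t in self.log if t[0] > seq]

    def resend_after(self, seq: int) -> List[Tuple[int, int]]:
        return [t for t in self.log if t[0] > seq]


class SumTask:
    """Hypothetical task: emits a running sum per tuple, checkpoints once per window."""

    def __init__(self, upstream: SourceTask, window_size: int) -> None:
        self.upstream = upstream
        self.window_size = window_size  # assumed count-based window
        self.total = 0
        self.last_seq = 0
        self.checkpoint = (0, 0)  # (last_seq, total); stands in for durable storage
        self.emitted: List[Tuple[int, int]] = []

    def process(self, seq: int, value: int) -> None:
        if seq <= self.last_seq:
            return  # already reflected in the current state
        self.total += value
        self.last_seq = seq
        self.emitted.append((seq, self.total))  # result goes out immediately
        if seq % self.window_size == 0:  # window boundary: checkpoint and ack once
            self.checkpoint = (self.last_seq, self.total)
            self.upstream.ack(seq)

    def crash(self) -> None:
        self.total, self.last_seq = 0, 0  # volatile state is lost

    def recover(self) -> None:
        # Restore the last window checkpoint, then backtrack: ask the upstream
        # task to resend everything processed since that checkpoint.
        self.last_seq, self.total = self.checkpoint
        for seq, value in self.upstream.resend_after(self.last_seq):
            self.process(seq, value)


def run_demo() -> SumTask:
    source = SourceTask()
    task = SumTask(source, window_size=3)
    for seq in range(1, 6):  # tuples 1..5; checkpoint taken after tuple 3
        task.process(*source.emit(seq, seq * 10))
    task.crash()
    task.recover()  # replays tuples 4 and 5 from the upstream buffer
    for seq in range(6, 8):  # tuples 6..7; checkpoint taken after tuple 6
        task.process(*source.emit(seq, seq * 10))
    return task
```

In this sketch the failure-free path never waits on a downstream acknowledgement, and the upstream buffer is trimmed only at window boundaries. Replayed tuples are emitted again during recovery, so `emitted` contains duplicates for tuples 4 and 5; suppressing or tolerating them is left out here.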


Stream processing, Failure recovery, Dataflow transaction, Pessimistic checkpointing, Optimistic checkpointing



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. HP Labs, Palo Alto, USA
