Parsing gigabytes of JSON per second

  • Geoff Langdale
  • Daniel Lemire
Regular Paper

Abstract

JavaScript Object Notation (JSON) is a ubiquitous data exchange format on the web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Although JSON parsing is a mature problem, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. Our parser uses a quarter or fewer of the instructions required by a state-of-the-art reference parser such as RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of single instruction, multiple data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.

Keywords

JSON parsing SIMD Software performance DFA 

Notes

Acknowledgements

The vectorized UTF-8 validation was motivated by a blog post by O. Goffart. K. Willets helped design the current vectorized UTF-8 validation. In particular, he provided the algorithm and code to check that sequences of two, three and four non-ASCII bytes match the leading byte. The authors are grateful to W. Muła for sharing related number parsing code online. The software library has benefited from the contributions of T. Navennec, K. Wolf, T. Kennedy, F. Wessels, G. Fotopoulos, H. N. Gies, E. Gedda, G. Floros, D. Xie, N. Xiao, E. Bogatov, J. Wang, L. F. Peres, W. Bolsterlee, A. Karandikar, R. Urban, T. Dyson, I. Dotsenko, A. Milovidov, C. Liu, S. Gleason, J. Keiser, Z. Bjornson, V. Baranov, I. A. Daza Dillon and others.

The work is supported in part by the Natural Sciences and Engineering Research Council of Canada under grant RGPIN-2017-03910.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. branchfree.org, Sydney, Australia
  2. Université du Québec (TELUQ), Montreal, Canada