Tree Contraction for Compressed Suffix Arrays on Modern Processors

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2015)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9050))

Abstract

We propose a novel processor-aware compaction technique for pattern matching, which is widely used in databases, information retrieval, and text mining. As data volumes grow, storing data efficiently in memory becomes increasingly important. A compressed suffix array (CSA) is a compact data structure for in-memory pattern matching. However, because it lacks a processor-aware design, CSA incurs heavy processor penalties, such as a flood of instructions and cache/TLB misses. To mitigate these penalties, we propose a novel compaction technique for CSA called suffix trie contraction (STC). The frequently accessed suffixes of CSA are transformed into a trie (e.g., a suffix trie), and then interconnected nodes in the trie are repeatedly "contracted" into a single node, which enables lightweight sequential scans in a processor-friendly way. In detail, STC consists of two contraction techniques: fixed-length path contraction (FPC) and sub-tree contraction (SC). FPC is applied to the parts of the trie with few branches, and SC is applied to the parts with many branches. Our experimental results indicate that FPC outperforms naive CSA by two orders of magnitude for short pattern queries and by a factor of three for long pattern queries. As the number of branches inside the trie increases, SC gradually becomes superior to CSA and FPC for short pattern queries. Finally, the latency and throughput of STC are 7 times and 72 times better, respectively, than those of CSA on the TREC test data set, at the expense of an additional 7.1% space overhead.
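The general idea of contracting connected trie nodes into a single node can be illustrated with a small sketch. This is not the paper's FPC or SC layout, which the abstract does not specify; it is ordinary path compression on a suffix trie, written under the assumption that a chain of single-child nodes can be replaced by one node holding the whole label as a contiguous string. The names `Node`, `build_trie`, `contract`, `count_nodes`, and `contains_prefix` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str = ""                               # characters on the edge into this node
    children: dict = field(default_factory=dict)  # first character of child label -> Node
    terminal: bool = False                        # True if a stored string ends here


def build_trie(words):
    """Build an uncontracted trie with one character per node."""
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node(label=ch))
        node.terminal = True
    return root


def contract(node):
    """Merge each chain of single-child, non-terminal nodes into one node (in place)."""
    for key, child in list(node.children.items()):
        while len(child.children) == 1 and not child.terminal:
            (only,) = child.children.values()
            child = Node(child.label + only.label, only.children, only.terminal)
        node.children[key] = child
        contract(child)
    return node


def count_nodes(node):
    """Count nodes below `node`, excluding `node` itself."""
    return sum(1 + count_nodes(c) for c in node.children.values())


def contains_prefix(root, pattern):
    """Return True if `pattern` is a prefix of some stored string."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        # Compare against the contiguous label in one sequential scan.
        seg = pattern[i:i + len(child.label)]
        if not child.label.startswith(seg):
            return False
        i += len(seg)
        node = child
    return True


text = "banana"
suffixes = [text[i:] for i in range(len(text))]
trie = build_trie(suffixes)
nodes_before = count_nodes(trie)
contract(trie)
nodes_after = count_nodes(trie)
print(nodes_before, nodes_after, contains_prefix(trie, "nan"))
```

After contraction, a lookup touches fewer nodes and compares each label as a contiguous run of characters, which is the kind of sequential access the abstract describes as processor-friendly. A real implementation would also need the fixed-length and sub-tree variants and a cache-aware memory layout, none of which this sketch attempts.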

Corresponding author

Correspondence to Takeshi Yamamuro.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Yamamuro, T., Onizuka, M., Honjo, T. (2015). Tree Contraction for Compressed Suffix Arrays on Modern Processors. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_22

  • DOI: https://doi.org/10.1007/978-3-319-18123-3_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18122-6

  • Online ISBN: 978-3-319-18123-3

  • eBook Packages: Computer Science, Computer Science (R0)
