TextBenDS: a Generic Textual Data Benchmark for Distributed Systems

Abstract

Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for a range of analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because it is costly to update them and to keep track of every modification made to the dataset. Furthermore, calculation errors are introduced when only subsets of the dataset are analyzed, i.e., wrong weights are computed, because weighting schemes use the number of documents to score keywords and documents. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which largely relies on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries of various complexities and selectivities to construct the term weighting schemes used to extract top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. In our experiments, for example, MongoDB shows the best overall performance, while Spark’s execution time remains almost constant regardless of the weighting scheme.
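The dependence of the weights on the number of documents can be illustrated with a small sketch. It assumes TF-IDF as the weighting scheme (one common choice; the abstract does not name a specific scheme), uses naive whitespace tokenization, and runs on a made-up three-document corpus. The function name `top_k_keywords` and the corpus are hypothetical and are not taken from the benchmark's source code.

```python
import math
from collections import Counter


def top_k_keywords(docs, k):
    """Return the k terms with the highest summed TF-IDF weight over docs.

    Assumption: TF is the relative term frequency in a document and
    IDF is log(N / df), where N is the number of documents.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / df[term])
            scores[term] += tf * idf
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "spark runs queries on hadoop",
    "mongodb runs aggregation queries",
    "hive runs on hadoop",
]

# Weights over the full corpus versus over a two-document subset.
full = dict(top_k_keywords(corpus, k=100))
subset = dict(top_k_keywords(corpus[:2], k=100))
```

Because both N and the document frequencies change when only `corpus[:2]` is analyzed, the weight of a term such as `hadoop` differs between `full` and `subset`. Weights precomputed on the whole dataset therefore cannot simply be reused for a subset.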

Notes

  1. Source code: https://github.com/cipriantruica/TextBenDS

Acknowledgements

This research was funded by grant No. PN-III-P1-1.2-PCCDI-2017-0734.

Author information

Corresponding author

Correspondence to Ciprian-Octavian Truică.

About this article

Cite this article

Truică, CO., Apostol, ES., Darmont, J. et al. TextBenDS: a Generic Textual Data Benchmark for Distributed Systems. Inf Syst Front 23, 81–100 (2021). https://doi.org/10.1007/s10796-020-09999-y

Keywords

  • Benchmark
  • Distributed frameworks
  • Distributed DBMSs
  • Top-k keywords
  • Top-k documents
  • Weighting schemes