
T\(^2\)K\(^2\): The Twitter Top-K Keywords Benchmark

  • Ciprian-Octavian Truică
  • Jérôme Darmont
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 767)

Abstract

Information retrieval from textual data focuses on constructing vocabularies of weighted term tuples. Such vocabularies can then be exploited by various text analysis algorithms to extract new knowledge, e.g., top-k keywords or top-k documents. Top-k keywords are commonly used for various purposes and are often computed on the fly, so they must be computed efficiently. To compare competing weighting schemes and database implementations, benchmarking is customary. Yet, to the best of our knowledge, no existing benchmark addresses these problems. Hence, in this paper, we present a top-k keywords benchmark, T\(^2\)K\(^2\), which features a real tweet dataset and queries of various complexities and selectivities. T\(^2\)K\(^2\) helps evaluate weighting schemes and database implementations in terms of computing performance. To illustrate T\(^2\)K\(^2\)’s relevance and genericity, we show how to implement the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and relational and document-oriented database instantiations, on the other hand.
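The abstract names TF-IDF and Okapi BM25 as the two weighting schemes the benchmark evaluates. As a minimal sketch of how top-k keywords can be derived from such weights, the Python function below scores the terms of a pre-tokenized corpus with either scheme and returns the k highest-weighted terms. The function name top_k_keywords, the aggregation by summing per-document weights, and the BM25 constants k1 = 1.2 and b = 0.75 are illustrative assumptions, not the benchmark's specification.

    import math
    from collections import Counter

    def top_k_keywords(documents, k=5, scheme="tfidf", k1=1.2, b=0.75):
        """Rank terms of a pre-tokenized corpus by summed TF-IDF or Okapi BM25 weight (sketch)."""
        n = len(documents)
        doc_freq = Counter()                      # number of documents containing each term
        for doc in documents:
            doc_freq.update(set(doc))
        avg_len = sum(len(doc) for doc in documents) / n

        scores = Counter()
        for doc in documents:
            tf = Counter(doc)
            for term, freq in tf.items():
                if scheme == "tfidf":
                    idf = math.log(n / doc_freq[term])
                    weight = (freq / len(doc)) * idf
                else:  # Okapi BM25 (non-negative idf variant)
                    idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
                    weight = idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * len(doc) / avg_len))
                scores[term] += weight            # aggregate per-document weights over the corpus
        return scores.most_common(k)

    # Usage on a toy corpus of tokenized tweets (illustrative only)
    tweets = [["storm", "power", "outage"], ["storm", "rain"], ["election", "vote", "storm"]]
    print(top_k_keywords(tweets, k=3, scheme="bm25"))

In the paper's setting, the same ranking would instead be expressed as queries over a relational or document-oriented store holding the weighted vocabulary; the in-memory version above only illustrates the weighting formulas themselves.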

Keywords

Top-k keywords · Benchmark · Term weighting · Database systems


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest, Romania
  2. Université de Lyon, Lyon 2, ERIC EA 3083, Lyon, France
