Evolution of the PAN Lab on Digital Text Forensics

Information Retrieval Evaluation in a Changing World

Abstract

PAN is a networking initiative for digital text forensics, where researchers and practitioners study technologies for text analysis with regard to originality, authorship, and trustworthiness. The practical importance of such technologies is obvious for law enforcement, cyber-security, and marketing, yet the general public needs to be aware of their capabilities as well to make informed decisions about them. This is particularly true since almost all of these technologies are still in their infancy, and active research is required to push them forward. Hence PAN focuses on the evaluation of selected tasks from digital text forensics in order to develop large-scale, standardized benchmarks and to assess the state of the art. In this chapter we present the evolution of three shared tasks: plagiarism detection, author identification, and author profiling.

Acknowledgements

The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31). The work on the author profiling data in Arabic was made possible by NPRP grant #9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

Author information

Corresponding author

Correspondence to Paolo Rosso.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Rosso, P., Potthast, M., Stein, B., Stamatatos, E., Rangel, F., Daelemans, W. (2019). Evolution of the PAN Lab on Digital Text Forensics. In: Ferro, N., Peters, C. (eds) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-22948-1_19

  • DOI: https://doi.org/10.1007/978-3-030-22948-1_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-22947-4

  • Online ISBN: 978-3-030-22948-1

  • eBook Packages: Computer Science (R0)
