Towards a Statistical-Enriched Corpus Containing Portuguese Collocations in Use: Reviewing Possible Extraction Tools

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9727)

Abstract

Collocations pose a major problem for any natural language processing task, from machine translation to summarization. With the goal of building a corpus of collocations enriched with statistical information about them, we survey in this paper four tools for extracting collocations. These tools allow us to collect sentences containing collocations and to gather statistics on this particular type of co-occurrence, such as Mutual Information and log-likelihood values.

Notes

  1. http://corpora.informatik.uni-leipzig.de.

  2. https://gramtrans.com/deepdict/.

  3. http://www.clul.ul.pt/pt/recursos/.

  4. https://www.sketchengine.co.uk.

  5. http://www.linguateca.pt/COMPARA/listas_freq.php.

  6. http://beta.visl.sdu.dk/constraint_grammar.html.

  7. http://linguateca.dei.uc.pt/Floresta/InicialFloresta.html.

  8. http://www.linguateca.pt/floresta/principal.html.

  9. This measure is based only on the frequencies of the words w1 and w2 and of the bigram w1 w2; it is not affected by the size of the corpus.

  10. If the clustering option is selected, the collocates within a word sketch are clustered according to any such clusters from the distributional thesaurus that they appear in. The words from the thesaurus are clustered according to their distributional similarity scores.

  11. Salience is a statistical measure of how salient a word or lemma is in a given context, given the frequency of the word and the context. This is measured with logDice.

  12. [7] suggests that MI is generally used for lexicographical purposes, while MI3 is probably more useful for second language learning.
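The notes above mention logDice, MI and MI3 without stating formulas. The following is a minimal sketch, assuming the definitions commonly documented for corpus query tools such as the Sketch Engine; the formulas, function names and example counts are illustrative assumptions and are not taken from the paper. Reading note 9 as a description of logDice is likewise an assumption, based on the fact that logDice uses only the two word frequencies and the bigram frequency.

```python
import math


def mutual_information(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """MI = log2((f_xy * N) / (f_x * f_y)).

    f_xy: bigram frequency, f_x / f_y: word frequencies, n: corpus size.
    Assumed standard definition; the score depends on the corpus size.
    """
    return math.log2((f_xy * n) / (f_x * f_y))


def mi3(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """MI3 = log2((f_xy ** 3 * N) / (f_x * f_y)).

    Cubing the bigram frequency gives more weight to frequent pairs
    than plain MI does (assumed standard definition).
    """
    return math.log2((f_xy ** 3 * n) / (f_x * f_y))


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice = 14 + log2((2 * f_xy) / (f_x + f_y)).

    Uses only the word and bigram frequencies, so the corpus size
    does not appear in the formula (assumed standard definition).
    """
    return 14 + math.log2((2 * f_xy) / (f_x + f_y))


if __name__ == "__main__":
    # Hypothetical counts for a word pair, chosen only for illustration.
    f_xy, f_x, f_y = 10, 100, 100
    for n in (1_000_000, 2_000_000):
        print(
            f"N={n}: MI={mutual_information(f_xy, f_x, f_y, n):.3f} "
            f"MI3={mi3(f_xy, f_x, f_y, n):.3f} "
            f"logDice={log_dice(f_xy, f_x, f_y):.3f}"
        )
```

Under these definitions, doubling the corpus size while keeping the three counts fixed raises MI and MI3 by exactly one, whereas logDice stays the same, which is the property note 9 describes.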

References

  1. Anagnostou, N.K., Weir, G.R.S.: Review of software applications for deriving collocations. In: ICT in the Analysis, Teaching and Learning of Languages, Preprints of the ICTATLL Workshop 2006, Glasgow, pp. 91–100 (2006)

  2. van den Bosch, A., Daelemans, W.: Memory-based morphological analysis. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, ACL 1999, pp. 285–292. Association for Computational Linguistics, Stroudsburg (1999). http://dx.org/10.3115/1034678.1034726

  3. Branco, A., Silva, J.: Evaluating solutions for the rapid development of state-of-the-art POS taggers for Portuguese. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-2004). European Language Resources Association (ELRA), Lisbon. http://www.lrec-conf.org/proceedings/lrec2004/pdf/572.pdf, ACL Anthology Identifier: L04-1354

  4. Correia, J.M.P.: Syntax Deep Explorer. Ph.D. thesis. Instituto Superior Técnico (2015)

  5. Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: Mbt: a memory-based part of speech tagger-generator. In: Proceedings of Fourth Workshop on Very Large Corpora, pp. 14–27. ACL SIGDAT (1996)

  6. Kilgarriff, A., Rychly, P., Smrz, P., Tugwell, D.: The sketch engine. In: Proceedings of EURALEX (2004)

  7. McEnery, T., Xiao, R., Tono, Y.: Corpus-Based Language Studies: An Advanced Resource Book. Taylor & Francis (2006)

  8. Mendes, A., Antunes, S., do Nascimento, M.F.B., Miguel, J., Casteleiro, L.P., Sá, T.: Combina-pt: a large corpus-extracted and hand-checked lexical database of Portuguese multiword expressions. In: Proceedings of LREC, pp. 1900–1905 (2006)

  9. Santos, D., Rocha, P.: Evaluating CETEMPúblico, a free resource for Portuguese. In: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pp. 450–457. Association for Computational Linguistics (2001)

  10. Tutin, A., Grossmann, F.: Collocations régulières et irrégulières: esquisse de typologie du phénomène collocatif. Revue française de linguistique appliquée 7(1), 7–25 (2002)

Acknowledgments

This work was partially supported by national funds through FCT - Fundação para a Ciência e a Tecnologia, reference UID/CEC/50021/2013. Ângela Costa is supported by a PhD fellowship from FCT (SFRH/BD/85737/2012).

Author information

Correspondence to Ângela Costa.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Costa, Â., Coheur, L. (2016). Towards a Statistical-Enriched Corpus Containing Portuguese Collocations in Use: Reviewing Possible Extraction Tools. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_32

  • DOI: https://doi.org/10.1007/978-3-319-41552-9_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41551-2

  • Online ISBN: 978-3-319-41552-9

  • eBook Packages: Computer Science, Computer Science (R0)
