Abstract

The tools developed in the ACCURAT project and presented in this book are packaged into the ACCURAT toolkit (Pinnis et al. 2012a), a collection of tools for collecting comparable corpora, analysing them, and extracting parallel data from them. The ACCURAT toolkit produces

Notes

  1. http://www.accurat-project.eu/

  2. Whilst they may not be directly applicable, it is straightforward to adopt and apply our methods for building comparable corpora from the web to digital archives or other off-line textual data collections that are very large.

  3. The EU’s multilingual thesaurus, http://eurovoc.europa.eu/

  4. http://htmlparser.sourceforge.net/

  5. http://www.w3.org/DOM/

  6. Open NLP, http://incubator.apache.org/opennlp/

  7. http://www.racai.ro/en/tools/text/

  8. The first version of Sisyphus was created by the Belgian METAL team in 1987, in pre-Windows times, to speed up system development. This kind of tool is still needed.

  9. Full requirements are defined in the documentation of each tool (ACCURAT D2.6 2012).

  10. http://www.accurat-project.eu/index.php?p=toolkit

References

  • ACCURAT D2.6. (2012). Toolkit for multi-level alignment and information extraction from comparable corpora. http://www.accurat-project.eu

  • Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. Proceedings of the EACL Workshop on New Text, Trento, Italy.

  • Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC 2004 (pp. 1313–1316).

  • Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.

  • Evert, S. (2005). The statistics of word cooccurrences: Word pairs and collocations. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.

  • Ion, R., Ceauşu, A., & Irimia, E. (2011). An expectation maximization algorithm for textual unit alignment. Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC 2011), held at the 49th Annual Meeting of the Association for Computational Linguistics (pp. 128–135), Portland, OR, June 24, 2011.

  • Ion, R. (2012). PEXACC: A parallel data mining algorithm from comparable corpora. Proceedings of LREC 2012, May 21–27, Istanbul, Turkey.

  • Pecina, P. (2009). Lexical association measures: Collocation extraction. Studies in computational and theoretical linguistics. Prague, Czech Republic: Institute of Formal and Applied Linguistics.

  • Petrović, S., Šnajder, J., & Bašić, B. D. (2010). Extending lexical association measures for collocation extraction. Computer Speech and Language, 24(2), 383–394.

  • Pinnis, M. (2012). Latvian and Lithuanian named entity recognition with TildeNER. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey.

  • Pinnis, M., Ion, R., Ştefănescu, D., Su, F., Skadiņa, I., Vasiļjevs, A., et al. (2012a). Toolkit for multi-level alignment and information extraction from comparable corpora. Proceedings of ACL 2012, System Demonstrations Track, Jeju Island, Republic of Korea, July 8–14, 2012.

  • Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I., Tadić, M., & Gornostay, T. (2012b). Term extraction, tagging, and mapping tools for under-resourced languages. Proceedings of the 10th Conference on Terminology and Knowledge Engineering (TKE 2012), June 20–21, Madrid, Spain.

  • Skadiņa, I., Aker, A., Giouli, V., Tufiş, D., Gaizauskas, R., Mieriņa, M., et al. (2010). Collection of comparable corpora for under-resourced languages. Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications (Vol. 219, pp. 161–168). IOS Press.

  • Ştefănescu, D. (2012). Mining for term translations in comparable corpora. Proceedings of the 5th Workshop on Building and Using Comparable Corpora (BUCC 2012), held at the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, May 23–25, 2012.

  • Ştefănescu, D., Ion, R., & Hunsicker, S. (2012). Hybrid parallel sentence mining from comparable corpora. Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012), Trento, Italy.

  • Su, F., & Babych, B. (2012a). Development and application of a cross-language document comparability metric. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey.

  • Su, F., & Babych, B. (2012b). Measuring comparability of documents in non-parallel corpora for efficient extraction of (semi-) parallel translation equivalents. Proceedings of EACL’12 Joint Workshop on Exploiting Synergies Between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), Avignon, France.

Author information

Corresponding author

Correspondence to Inguna Skadiņa.

Additional information

Chapter editor: Inguna Skadiņa

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Aker, A. et al. (2019). Appendices. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_8

  • DOI: https://doi.org/10.1007/978-3-319-99004-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99003-3

  • Online ISBN: 978-3-319-99004-0

  • eBook Packages: Computer Science, Computer Science (R0)