Identifying Parallel Web Documents by Filenames

  • Conference paper
Advanced Web Technologies and Applications (APWeb 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3007)

Abstract

Parallel Web documents are a crucial lexical basis for constructing robust multilingual Web-based linguistic knowledge resources. To identify parallel Web documents efficiently and effectively, this paper develops a new automatic approach based on filenames, exploiting the naming conventions commonly used for parallel documents on the Web. The approach involves three procedures that identify, respectively, the common file descriptor, the language flag, and the language flag-pair among all filenames examined. To determine how these three procedures can best be used, five methods are developed that combine them in different ways. An experimental study on a Hong Kong government Web site evaluates the performance of the five methods in terms of recall and precision. The results show that the method combining file descriptor alignment with language flag-pair alignment outperforms the other methods, with a precision of 95.3% and a recall of 91.0%.
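The abstract does not spell out the three procedures, so the following is only a minimal sketch of the general idea of pairing documents by filename, not the authors' method. It assumes a known, hypothetical language flag-pair (`_e` and `_c` as filename suffixes), treats what remains after stripping the flag as the common file descriptor, and scores the result with the standard precision and recall definitions. The function names and the sample filenames are illustrative assumptions.

```python
import os
from collections import defaultdict

# Hypothetical flag-pair, assumed for illustration. The paper identifies
# flags from the filenames themselves; this sketch takes them as given.
FLAG_PAIR = ("_e", "_c")


def split_flag(filename, flags):
    """Return (descriptor, flag) if the file stem ends with a flag, else None."""
    stem, ext = os.path.splitext(filename)
    for flag in flags:
        if stem.endswith(flag) and len(stem) > len(flag):
            # Descriptor = filename with the language flag removed.
            return stem[: -len(flag)] + ext, flag
    return None


def pair_by_filename(filenames, flag_pair=FLAG_PAIR):
    """Pair files that share a descriptor and carry the two flags of flag_pair."""
    groups = defaultdict(dict)
    for name in filenames:
        result = split_flag(name, flag_pair)
        if result is not None:
            descriptor, flag = result
            groups[descriptor][flag] = name
    first, second = flag_pair
    return {
        (found[first], found[second])
        for found in groups.values()
        if first in found and second in found
    }


def precision_recall(predicted, gold):
    """Precision = correct / predicted; recall = correct / gold."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Illustrative filenames and gold pairs (made up for this example).
files = ["about_e.htm", "about_c.htm", "news_e.htm", "contact_c.htm", "index.htm"]
gold = {("about_e.htm", "about_c.htm"), ("news_e.htm", "news_tc.htm")}

pairs = pair_by_filename(files)
print(pairs)                          # {('about_e.htm', 'about_c.htm')}
print(precision_recall(pairs, gold))  # (1.0, 0.5)
```

In this toy run the single predicted pair is correct, giving a precision of 1.0, while the second gold pair uses a flag the sketch does not know about, giving a recall of 0.5. That gap is the kind of case that discovering language flags from the data, rather than fixing them in advance, would be expected to address.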

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chen, J., Yeh, CH., Chau, R. (2004). Identifying Parallel Web Documents by Filenames. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_14

  • DOI: https://doi.org/10.1007/978-3-540-24655-8_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21371-0

  • Online ISBN: 978-3-540-24655-8

  • eBook Packages: Springer Book Archive
