Building Cantonese Dictionaries Using Crowdsourcing Strategies: The words.hk Project

  • Chapter
Digital Humanities and New Ways of Teaching

Part of the book series: Digital Culture and Humanities (DICUHU, volume 1)

Abstract

The words.hk project is the first attempt to build a Cantonese-to-Cantonese dictionary using a lean start-up model (see Ries, The lean startup: How today’s entrepreneurs use continuous innovation to create radically successful businesses. New York: Crown Business, 2011) combined with crowdsourcing strategies. The goal is to produce a comprehensive dictionary written for Cantonese and in Cantonese. Existing resources are often (1) not available electronically, (2) out of date, or (3) too Anglo- or Sino-centric. Building large data sets from these existing resources requires a lot of editing and ‘data-janitorial’ work, which can be done far better by a large group of less-experienced people than by just a handful of experts, so crowdsourcing strategies are particularly appropriate in these cases. We started with a small team of editors and software developers in 2014. In less than 3 years, we grew into an organisation with over 400 volunteers and gathered over 42,000 entries, of which more than 36,000 had been edited with Written Cantonese descriptions, examples, and translations as of June 2017. Given the nature of the project and the member composition – a language with no authority to fall back on and most members with no formal linguistics or lexicographical training – we adhere to two simple principles in order to keep the dictionary growing without introducing major issues in the core data: ‘usage over etymology’ and ‘decision problem avoidance’. I will discuss how these principles have shaped the architecture of the project and the editing workflow, as well as the technological difficulties that we face.

Notes

  1. A Cantonese-to-English dictionary is being compiled at the same time.

  2. These nicely rephrased questions were perhaps brought up because some think (1) there is no use; (2) online projects should never be trusted; and (3) how dare you write a Cantonese dictionary?

  3. I refer readers to Li (2011) and Snow (2004), respectively, for the status of Written Cantonese in its early stage and subsequent development until the early 2000s. The recent development of Written Cantonese as a fully functional written variety is discussed at length in a submitted manuscript by the author titled ‘The Transformation of “Cantonese in Written Materials” into “Written Cantonese”’.

  4. Cantonese Wikipedia (n.d.). In Wikipedia. Retrieved June 25, 2017, from https://en.wikipedia.org/wiki/Cantonese_Wikipedia

  5. Premodern Cantonese word glossaries do exist, for example Chapter 11 of Guangdong Xinyu 廣東新語 (Qu 1678) and the preface of Yue’ou 粵謳 (Zhao 1821). Their scale, however, is not comparable with that of a full dictionary.

  6. Native speakers usually look at character-based collections, such as A Chinese Syllabary Pronounced According to the Dialect of Canton (Wong 1941) or online references like Chinese Character Database: With Word-formations Phonologically Disambiguated According to the Cantonese Dialect (accessible from http://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/), to look up the Cantonese pronunciation of a character. Most speakers do not consult a dictionary to check the meaning of unknown Cantonese words.

  7. Prominent ones include A Chinese Dictionary in the Cantonese Dialect (Eitel 1877), The Student’s Cantonese-English Dictionary (Meyer and Wempe 1935), Cantonese Speaker’s Dictionary (Cowles 1965), Cantonese Dictionary: Cantonese-English, English-Cantonese (Huang 1970), A Practical Cantonese-English Dictionary (Lau 1977), and Tōhō Kantongo Jiten (Chishima 2005).

  8. Hong Kong Cantonese is the de facto standard, but we also recruited editors from other Cantonese-speaking regions.

  9. See http://beta.words.hk/base/hoifong/ for the full licence text.

  10. Copyright in Hong Kong is generally retained for 50 years after the author’s death or 50 years after publication.

  11. Our “Non-Commercial Open Data Licence” (see Footnote 9) can be readily adapted to any project with valuable content. In addition, the Creative Commons family of licences (http://creativecommons.org) is also a popular choice for open licensing of data.

  12. Which contains words in Mandarin, but most of them are shared among Mandarin and Cantonese, and it also contains local words that are used by Cantonese speakers only.

  13. A software development practice ‘in which requirements and solutions evolve through collaboration between self-organising, cross-functional teams’ (Wikipedia).

  14. The practice of separating development tasks into clear-cut phases; the process flows downwards through these phases.

  15. The character was originally proposed by Kong (1933) to be the character for ‘a bit’, but it was listed without any justification. The character is pronounced as zit3 according to correspondence rules between Middle Chinese and Cantonese pronunciation and had never been listed in any Cantonese materials except for that single occurrence, before the character was rediscovered and popularised by columnist Pang Chi-Ming.

  16. That is, when the speaker is forced to guess the pronunciation of a word or required to write out certain spoken words with which they are unfamiliar.

  17. Although Cantonese orthography and pronunciation are not officially regulated or standardised by any institutional body, elements that are shared between Cantonese and Standard Written Chinese are standardised and well-documented.

  18. Cantonese Wikipedia used to have a set of standard characters, but they are no longer in use. Since there is a large community of Written Cantonese users, as well as publications using Written Cantonese, a socially emerged standardisation is taking place, and a rigid top-down standardisation seems unnecessary.

References

  • Bauer, R. (1988). Written Cantonese of Hong Kong. Cahiers de Linguistique-Asie Orientale, 17(2), 245–293.

  • Caau2. n.d. Retrieved June 25, 2017, from Words.hk http://beta.words.hk/zidin/%E7%82%92.

  • Cantonese Wikipedia. n.d. Retrieved June 25, 2017, from Wikipedia https://en.wikipedia.org/wiki/Cantonese_Wikipedia.

  • Chin, A. C.-O. (2018). Initiatives of digital humanities in Cantonese studies: A corpus of mid-20th century Hong Kong Cantonese. In K.-K. Tam (Ed.), Digital humanities and new ways of teaching. Singapore: Springer.

  • Chishima, E. (2005). Tōhō Kantongo Jiten [Tōhō Cantonese dictionary]. Tōkyō: Tōhō Shoten.

  • Cowles, R. (1965). Cantonese speaker’s dictionary. Hong Kong: Hong Kong University Press.

  • Eitel, E. (1877). A Chinese dictionary in the Cantonese dialect. London: Trübner and Co. 57 & 59, Ludgate Hill and Hong Kong: Lane, Crawford & Co.

  • Ferguson, C. (1959). Diglossia. Word, 15, 325–340.

  • Howe, J. (2006). The rise of crowdsourcing. Wired Magazine. Retrieved from http://sistemas-humano-computacionais.wikidot.com/local--files/capitulo:redessociais/Howe_The_Rise_of_Crowdsourcing.pdf.

  • Huang, P. (1970). Cantonese dictionary: Cantonese-English, English-Cantonese. New Haven: Yale University Press.

  • Hutton, C., & Bolton, K. (2005). A dictionary of Cantonese slang: The language of Hong Kong movies, street gangs and city life. Honolulu: University of Hawaii Press.

  • Kong, Z. N. (1933). Guangdong Suyu Kao [Study on common sayings in Cantonese]. Guangzhou: Nanfang Fulunshe.

  • Lau, S. (1977). A practical Cantonese-English dictionary. Hong Kong: Hong Kong Government Printer.

  • Leung, M., & Law, S. (2002). HKCAC: The Hong Kong Cantonese adult language corpus. International Journal of Corpus Linguistics, 6(2), 305–326.

  • Li, Y. M. F. (2011). Qingmo Minchu de Yueyu Shuxie [Cantonese writing in late Qing and early Republic of China]. Hong Kong: Joint Publishing (HK).

  • Luke, K., & Wong, M. (2015). The Hong Kong Cantonese corpus: Design and uses. In B. K. Tsou & O. Y. Kwong (Eds.), JCL monograph series no. 25: Linguistic corpus and corpus linguistics in the Chinese context (pp. 312–333). Hong Kong: The Chinese University Press.

  • Meyer, B., & Wempe, T. (1935). The student’s Cantonese-English dictionary. Unknown: St. Louis Industrial School Printing Press.

  • Qu, D. J. (1678). Guangdong Xinyu [New words about Guangdong]. (n.p.)

  • Rieder, B., & Röhle, T. (2012). Digital methods: Five challenges. In D. Berry (Ed.), Understanding digital humanities (pp. 67–84). London: Palgrave Macmillan.

  • Ries, E. (2011). The lean startup: How today’s entrepreneurs use continuous innovation to create radically successful businesses. New York: Crown Business.

  • Snow, D. B. (2004). Cantonese as written language: The growth of a written Chinese vernacular. Hong Kong: Hong Kong University Press.

  • Snow, D. B. (2008). Cantonese as written standard? Journal of Asian Pacific Communication, 18(2), 190–208.

  • Tang, S. W. (2015). Yueyu Yufa Jiangyi [Lectures on Cantonese grammar]. Hong Kong: Commercial Press.

  • Wong, S. L. (1941). Yueyin Yunhui [A Chinese syllabary pronounced according to the dialect of Canton]. Hong Kong.

  • Zhao, Z. Y. (1821). Yue’ou [Cantonese folklore].

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Lau, Cm. (2019). Building Cantonese Dictionaries Using Crowdsourcing Strategies: The words.hk Project. In: Tso, A.Wb. (eds) Digital Humanities and New Ways of Teaching. Digital Culture and Humanities, vol 1. Springer, Singapore. https://doi.org/10.1007/978-981-13-1277-9_6

  • DOI: https://doi.org/10.1007/978-981-13-1277-9_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1276-2

  • Online ISBN: 978-981-13-1277-9

  • eBook Packages: Social Sciences (R0)
