Project: Resources for Closely-related Languages

Long name

Human Language Technology Resources for Closely-Related Languages

Abbreviation

RCRL

DURATION

2008-2010

FUNDED BY:

National Research Foundation

PROJECT LEADERS

  • GERHARD B VAN HUYSSTEEN
    • CTexT (Centre for Text Technology), North-West University, South Africa
  • FEBE DE WET
    • Stellenbosch University, South Africa

PROJECT COLLABORATORS

  • Suléne Pilon (North-West University, South Africa)
  • Martin Puttkammer (North-West University, South Africa)
  • Martin Schlemmer (North-West University, South Africa)
  • Handré Groenewald (North-West University, South Africa)
  • Linsen Loots (Stellenbosch University, South Africa)
  • Thomas Niesler (Stellenbosch University, South Africa)
  • Marelie Davel (Council for Scientific and Industrial Research, South Africa)
  • Georg Schlünz (Council for Scientific and Industrial Research, South Africa)
  • Etienne Barnard (Council for Scientific and Industrial Research, South Africa)
  • Wilbert Heeringa (Meertens Institute, The Netherlands)
  • Liesbeth Augustinus (University of Leuven, Belgium)
  • Walter Daelemans (University of Antwerp, Belgium)

OVERVIEW

Two kinds of resources are considered fundamental for the development of Human Language Technology (HLT) applications (such as dictation software, automatic machine translation systems, or intelligent search engines), viz.:

  • Core technologies (also sometimes called “lingware”), i.e. reusable, efficient natural language processing modules that are integrated in end-user applications; and
  • Data, i.e. language data (corpora and lexica), formal descriptions of language structures (grammars), and language models (Daelemans & Strik, 2002).

As these resources are essential to the development of most HLT applications, it is vital to develop sophisticated, reusable resources for all the South African languages before venturing into the development of advanced HLT applications for these languages. Although both deep and shallow methods are used for most of the processes involved in developing these resources, these methods are, on the one hand, far from perfect (Gaustad & Bouma, 2001), and, on the other hand, have mostly been developed for the commercially more important languages of the world (e.g. English, Dutch, Spanish, German and Japanese) (Cole, 1995: 111). To develop such resources for the indigenous South African languages (all of which could be considered so-called resource-scarce languages), these methods should therefore either be adapted for these languages, or new methods must be sought to deal with the idiosyncrasies of these languages.

One method to fast-track the development of resources for resource-scarce languages is to re-cycle (port/transfer/re-engineer) existing technologies from one language L1 to another, closely-related language L2. The basic hypothesis, which is also the hypothesis of this project, is that “[if] the languages L1 and L2 are similar enough, then it should be easier [and quicker] to recycle software applicable to L1 than to rewrite it from scratch for L2”, thereby taking care of “most of the drudgery before any human has to become involved” (Rayner et al., 1997: 65). Scannell (2006) argues that resource-scarce languages could benefit from such an approach, especially where L1 is a global, well-resourced language.

To illustrate this hypothesis in real terms, let us assume that Dutch (L1) and Afrikaans (L2) are similar enough for purposes of technology transfer. The hypothesis then is that it would be easier and quicker to use and adapt, for example, an existing Dutch syntactic parser to parse Afrikaans sentences, than to develop an Afrikaans syntactic parser from scratch. One could therefore use the Dutch parser to parse an Afrikaans corpus, and afterwards only correct systematic errors manually or semi-automatically.
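One way to picture such a pipeline is to first convert the Afrikaans text into Dutch-like text, so that an existing Dutch tool can process it. The sketch below is a minimal illustration only, assuming a small wordlist lookup followed by a few orthographic rewrite rules. The word pairs, the rules and the function names are hypothetical examples chosen for this illustration; they are not taken from the project's own wordlists or rule sets.

```python
import re

# Illustrative Afrikaans -> Dutch word pairs (hypothetical; NOT the project's wordlists).
WORDLIST = {"ek": "ik", "nie": "niet", "vir": "voor"}

# Illustrative orthographic rules as (pattern, replacement) pairs.
# These are rough heuristics and will overgenerate on real text.
RULES = [
    (re.compile(r"y"), "ij"),             # tyd -> tijd, wyn -> wijn
    (re.compile(r"^s(?=[aeiou])"), "z"),  # sien -> zien, son -> zon
]


def convert_word(word: str) -> str:
    """Convert one lower-case Afrikaans token to a Dutch-like form."""
    if word in WORDLIST:  # exact wordlist entries take priority over rules
        return WORDLIST[word]
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


def convert(text: str) -> str:
    """Convert whitespace-separated Afrikaans text token by token."""
    return " ".join(convert_word(token) for token in text.lower().split())


print(convert("ek sien die tyd nie"))
```

For the input "ek sien die tyd nie", this sketch prints "ik zien die tijd niet". That is recognisably Dutch-like but not grammatical Dutch (which would be "ik zie de tijd niet"), which illustrates the kind of systematic error that would still have to be corrected after the automatic step.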

Up to 2007, only a few projects had exploited this approach, for language pairs such as Irish – Scottish Gaelic (Scannell, 2006), Spanish – Catalan and Spanish – Galician (Corbí-Bellot et al., 2005), and English – French and Swedish – Danish (Rayner et al., 1997). Almost all of these projects were conducted within the context of either automatic text-based machine translation or speech-to-speech translation.

Although the idea of re-cycling technologies between closely-related languages is not a novel one, numerous questions and opportunities for research remain. For instance, what “similar enough” entails in the above hypothesis is completely unclear from the literature. Also, almost all the above-mentioned projects were conducted using rule-based methods (e.g. finite-state grammars); the question remains whether technologies that have been developed using machine learning would also be appropriate for this approach. Moreover, with the exception of some linguistic and lexicographic studies (e.g. Jansen & Olivier, 1986; Prinsloo, 2006), very little similar research has been done involving South African languages.

Hence, the central problem addressed in this project is to investigate the possibilities of technology re-cycling between closely-related languages. The focus in this project is on Afrikaans, with Dutch as the closely-related, well-resourced language.

In the period 2008-2010, we focused on the experimental development of various reusable resources (core technologies and data) for Afrikaans, including:

  • Annotated wideband speech data for large-vocabulary continuous speech recognition;
  • Pronunciation dictionary;
  • Afrikaans-Dutch/Dutch-Afrikaans convertor, including a bilingual translation dictionary;
  • High-accuracy chunker (i.e. shallow parser);
  • Improved part-of-speech tagger and lemmatiser for Afrikaans.

REFERENCES

  • Cole, R.A. (editor in chief). 1995. Survey of the State of the Art in Human Language Technology. Available at: https://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html. Accessed on: May 15, 2002.
  • Corbí-Bellot, A.M. et al. 2005. An open-source shallow-transfer machine translation engine for the Romance languages of Spain. In: Proceedings of the 10th Annual EAMT Conference. Budapest, Hungary, 30-31 May 2005.
  • Daelemans, W. & Strik, H. 2002. Actieplan voor het Nederlands in de taal- en spraaktechnologie: Prioriteiten voor basisvoorzieningen. [Action Plan for Dutch Language and Speech Technology: Priorities for Basic Resources]. Report for the Nederlandse Taalunie. Available at: https://cnts.uia.ac.be/~walter/TST/. Accessed on: April 30, 2004.
  • Gaustad, T. & Bouma, G. 2001. Accurate Stemming of Dutch for Text Classification. Computational Linguistics in the Netherlands 2001. Amsterdam: Rodopi. pp. 104-117.
  • Jansen, E. & Olivier, G. 1986. Praktiese Nederlands. Pretoria: Academica.
  • Prinsloo, D.J. 2006. Compiling a Bidirectional Dictionary Bridging English and the Sotho Languages: A Viability Study. Lexikos. 16: 193-204.
  • Rayner, M. et al. 1997. Recycling Lingware in a Multilingual MT System. In: Burstein, J. & Leacock, C. From Research to Commercial Applications: Making NLP Work in Practice. Somerset, New Jersey: Association for Computational Linguistics. pp. 65-70.
  • Scannell, K. 2006. Machine translation for closely related language pairs. In: Proceedings of the LREC2006 Workshop on Strategies for developing machine translation for minority languages. European Language Resources Association: Paris.
  • Vandeghinste, V., Schuurman, I., Carl, M., Markantonatou, S. and Badia, T. 2006. METIS-II: Machine Translation for Low Resource Languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy. May 24-26. European Language Resources Association: Paris.

OUTPUTS

PUBLICATIONS

  1. Daelemans, W, Groenewald, HJ & Van Huyssteen, GB. 2009. Prototype-based Active Learning for Lemmatization. In: Angelova, G, Bontcheva, K, Mitkov, R, Nikolov, N & Nikolov, N. (eds.). Proceedings of Recent Advances in Natural Language Processing 2009. ISSN: 1313-8502. 14-16 September 2009. Borovets, Bulgaria. pp 65-70.
  2. Davel, MH & De Wet, F. 2010. Verifying pronunciation dictionaries using conflict analysis. In: Proceedings of Interspeech. ISSN: 1990-9772. 26-30 September. Makuhari, Chiba, Japan. pp 1898-1901.
  3. De Wet, F, De Waal, A & Van Huyssteen, GB. 2011. Developing a broadband automatic speech recognition system for Afrikaans. In: Proceedings of the 12th Annual Conference of the International Speech Communication Association (Interspeech 2011). ISSN: 1990-9772. 27-31 August. Florence, Italy. pp 3185-3188.
  4. Heeringa, W & De Wet, F. 2008. The origin of the Afrikaans pronunciation: a comparison to West Germanic languages and Dutch dialects. In: Proceedings of the 19th Annual Symposium of the Pattern Recognition Association of South Africa. ISBN 978-0-7992-2350-7. 27-28 November. Cape Town, South Africa. pp 159-164.
  5. Heeringa, W, De Wet, F & Van Huyssteen, GB. submitted. Afrikaans and Dutch as Closely-related Languages: A Comparison to West Germanic Languages and Dutch Dialects.
  6. Loots, L, De Wet, F & Niesler, T. 2010. Extending an Afrikaans pronunciation dictionary using Dutch resources and P2P/GP2P. In: Proceedings of the 21st Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-7992-2470-2. 22-23 November. Stellenbosch, South Africa. [no page numbers].
  7. Pilon, S, Van Huyssteen, GB & Augustinus, L. 2010. Converting Afrikaans to Dutch for technology recycling. In: Proceedings of the 21st Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-7992-2470-2. 22-23 November. Stellenbosch, South Africa. pp 219-224.
  8. Schlünz, GI, Barnard, E & Van Huyssteen, GB. 2010. Part-of-Speech Effects on Text-to-Speech Synthesis. In: Proceedings of the 21st Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-7992-2470-2. 22-23 November. Stellenbosch, South Africa. pp 257-262.
  9. Van Huyssteen, GB & Davel, M. 2010. Learning Rules and Categorization Networks for Language Standardization. In: Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) Workshop on Extracting and Using Constructions in Computational Linguistics. 6 June. Los Angeles, USA. pp 39-46.
  10. Van Huyssteen, GB & Pilon, S. 2009. Rule-based Conversion of Closely-related Languages: A Dutch-to-Afrikaans Convertor. In: Nicolls, F. (ed.). Proceedings of the 20th Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-7992-2356-9. 30 November – 01 December. Stellenbosch, South Africa. pp 23-28.

RESOURCES

  1. Van Niekerk, D, Van Huyssteen, GB & Puttkammer, MJ. 2015. Closely-related Languages Convertor V2.0.0. Potchefstroom: Centre for Text Technology (CTexT), North-West University.
    • Language-independent convertor for converting text from one language to another, closely related language.
    • Written in Python
    • Includes Afrikaans-to-Dutch and Dutch-to-Afrikaans wordlists (including false friends) and rules for orthographic conversion.
    • Includes METIS II test data, also with Afrikaans translations.
    • Includes code for a web demo.
  2. 2009. Dutch-2-Afrikaans Convertor/Afrikaans-2-Dutch Convertor. (GB van Huyssteen, S Pilon, L Augustinus, MJ Puttkammer, and M Schlemmer). Potchefstroom: NWU.
    • A freely available, open-source rule-based system for converting Dutch text to Afrikaans, and vice versa.
    • Could be used as a pre-processing step in machine translation from Dutch to Afrikaans.
  3. 2010. Convertor v1.2.0. (GB Van Huyssteen, S Pilon, MJ Puttkammer, and M Schlemmer). Potchefstroom: NWU.
    • Language-independent convertor for converting text from one language to another, closely related language.
    • Written in Perl.
  4. 2010. Afrikaans and Dutch Lists and Rules v1.0.2. (GB van Huyssteen and S Pilon). Potchefstroom: NWU.
    • Afrikaans-to-Dutch and Dutch-to-Afrikaans wordlists (including false friends) and rules for orthographic conversion.
    • Includes METIS II test data, also with Afrikaans translations.
  5. 2010. Resources for Closely-related Languages: Afrikaans Pronunciation Dictionary (RCRL APD) v1.4.1. (F de Wet, M Davel, L Loots, T Niesler). Potchefstroom: NWU.
    • Contains more than 24,000 Afrikaans words.
    • Developed in collaboration with Stellenbosch University and CSIR Meraka Institute.
  6. 2010. Afrikaans Radio News Corpus v1.0.0. (F de Wet). Potchefstroom: NWU.
    • 330 bulletins; circa 27 hours of audio data.
    • SABC radio news bulletins from 2001-2004, as well as from 2010 onwards.
    • Manually transcribed.
    • Available for research purposes.

TALKS AND POSTERS

  1. Heeringa, W & De Wet, F. 2009. The origin of Afrikaans pronunciation: a comparison to West Germanic languages and Dutch dialects. Presentation given at the 30th TABU Dag. University of Groningen, GRONINGEN, The Netherlands. 11-12 June.
  2. Heeringa, W, De Wet, F & Van Huyssteen, GB. 2011. Afrikaans and Dutch as Closely-related Languages: a Comparison to West Germanic Languages and Dutch Dialects. Methods in Dialectology 14. University of Western Ontario, LONDON, Ontario, Canada. 2-6 August.
  3. Van Huyssteen, GB & Pilon, S. 2010. A Dutch-to-Afrikaans Convertor. 20th Meeting of Computational Linguistics in the Netherlands (CLIN) 2010. Utrecht University, UTRECHT, The Netherlands. 5 February.
  4. Pilon, S & Van Huyssteen, GB. 2011. Technology recycling for closely related languages: Dutch and Afrikaans. 21st Meeting of Computational Linguistics in the Netherlands (CLIN) 2011. University College Ghent, GHENT, Belgium. 11 February.
  5. Van Huyssteen, GB & Pilon, S. 2010. Some thoughts on a Dutch-to-Afrikaans convertor. Guest lecture, University of Antwerp, ANTWERP, Belgium. 2 February.

DISSERTATIONS (UNPUBLISHED)

  1. Schlünz, GI. 2010. The effects of part-of-speech tagging on text-to-speech synthesis for resource-scarce languages. Unpublished MSc Eng dissertation. Potchefstroom: North-West University.

RELATED PROJECTS AND EVENTS

  • Project: Mutual Intelligibility of Closely Related Languages
  • Workshop on comparing approaches to measuring linguistic differences