Human Language Technology Resources for Closely-Related Languages
National Research Foundation
Gerhard B van Huyssteen
- CTexT (Centre for Text Technology), North-West University, South Africa
Febe de Wet
- Stellenbosch University, South Africa
- Suléne Pilon (North-West University, South Africa)
- Martin Puttkammer (North-West University, South Africa)
- Martin Schlemmer (North-West University, South Africa)
- Handré Groenewald (North-West University, South Africa)
- Linsen Loots (Stellenbosch University, South Africa)
- Thomas Niesler (Stellenbosch University, South Africa)
- Marelie Davel (Council for Scientific and Industrial Research, South Africa)
- Georg Schlünz (Council for Scientific and Industrial Research, South Africa)
- Etienne Barnard (Council for Scientific and Industrial Research, South Africa)
- Wilbert Heeringa (Meertens Institute, The Netherlands)
- Liesbeth Augustinus (University of Leuven, Belgium)
- Walter Daelemans (University of Antwerp, Belgium)
Two kinds of resources are considered fundamental for the development of Human Language Technology (HLT) applications (such as dictation software, automatic machine translation systems, or intelligent search engines), viz.:
- Core Technologies (also sometimes called “lingware”; i.e. reusable, efficient natural language processing modules, integrated in end-user applications); and
- Data (i.e. language data (corpora and lexica), formal descriptions of language structures (grammars), and language models) (Daelemans & Strik, 2002).
As these resources are essential in the development of most HLT applications, it is vital to develop sophisticated, reusable resources for all the South African languages, before venturing into the development of advanced HLT applications for these languages. Although both deep and shallow methods are used for most of the processes involved in developing these resources, these methods are, on the one hand, far from perfect (Gaustad & Bouma, 2001), and, on the other hand, have mostly been developed for the commercially more important languages of the world (e.g. English, Dutch, Spanish, German, Japanese) (Cole, 1995: 111). To develop such resources for the indigenous South African languages (all of which could be considered so-called resource-scarce languages), these methods should therefore either be adapted for these languages, or alternatively, new methods must be sought to deal with the idiosyncrasies of these languages.
One method to fast-track the development of resources for resource-scarce languages is to re-cycle (port/transfer/re-engineer) existing technologies from one language L1 to another, closely-related language L2. The basic hypothesis, which is also the hypothesis of this project, is that "[if] the languages L1 and L2 are similar enough, then it should be easier [and quicker] to recycle software applicable to L1 than to rewrite it from scratch for L2", thereby taking care of "most of the drudgery before any human has to become involved" (Rayner et al., 1997: 65). Scannell (2006) argues that resource-scarce languages could benefit from such an approach, especially where L1 is a global, well-resourced language.
To illustrate this hypothesis in real terms, let us assume that Dutch (L1) and Afrikaans (L2) are similar enough for purposes of technology transfer. The hypothesis then is that it would be easier and quicker to use and adapt, for example, an existing Dutch syntactic parser to parse Afrikaans sentences, than to develop an Afrikaans syntactic parser from scratch. One could therefore use the Dutch parser to parse an Afrikaans corpus, and afterwards only correct systematic errors manually or semi-automatically.
Until 2007, only a few projects had exploited this approach, amongst others for Irish – Scottish Gaelic (Scannell, 2006), Spanish – Catalan, Spanish – Galician (Corbí-Bellot et al., 2005), English – French, and Swedish – Danish (Rayner et al., 1997). Almost all of these projects were conducted within the context of either automatic text-based machine translation, or speech-to-speech translation.
Although the idea of re-cycling technologies between closely-related languages is not a novel one, numerous questions and opportunities for research remain. For instance, what "similar enough" in the above hypothesis entails is completely unclear from the literature. Also, almost all the above-mentioned projects were conducted using rule-based methods (e.g. finite-state grammars); the question remains whether technologies that have been developed using machine learning would also be appropriate for this approach. Moreover, with the exception of some linguistic and lexicographic studies (e.g. Jansen & Olivier, 1986; Prinsloo, 2006), very little similar research has been done involving South African languages.
Hence, the central problem addressed in this project is the extent to which technologies can be re-cycled between closely-related languages. The focus in this project is on Afrikaans, with Dutch as the closely-related, well-resourced language.
In the period 2008-2010, we directed our attention to the experimental development of various re-usable resources (core technologies and data) for Afrikaans, including:
- Annotated wideband speech data for large-vocabulary continuous speech recognition;
- Pronunciation dictionary;
- Afrikaans-Dutch/Dutch-Afrikaans convertor, including a bilingual translation dictionary;
- High-accuracy chunker (i.e. shallow parser);
- Improved part-of-speech tagger and lemmatiser for Afrikaans.
- Cole, R.A. (editor in chief). 1995. Survey of the State of the Art in Human Language Technology. Available at: http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html. Accessed on: May 15, 2002.
- Corbí-Bellot, A.M. et al. 2005. An open-source shallow-transfer machine translation engine for the Romance languages of Spain. In: Proceedings of the 10th Annual EAMT Conference. Budapest, Hungary, 30-31 May 2005.
- Daelemans, W. & Strik, H. 2002. Actieplan voor het Nederlands in de taal- en spraaktechnologie: Prioriteiten voor basisvoorzieningen. [Action Plan for Dutch Language and Speech Technology: Priorities for Basic Resources]. Report for the Nederlandse Taalunie. Available at: http://cnts.uia.ac.be/~walter/TST/. Accessed on: April 30, 2004.
- Gaustad, T. & Bouma, G. 2001. Accurate Stemming of Dutch for Text Classification. Computational Linguistics in the Netherlands 2001. Amsterdam: Rodopi. pp. 104-117.
- Jansen, E. & Olivier, G. 1986. Praktiese Nederlands. Pretoria: Academica.
- Prinsloo, D.J. 2006. Compiling a Bidirectional Dictionary Bridging English and the Sotho Languages: A Viability Study. Lexikos. 16: 193-204.
- Rayner, M. et al. 1997. Recycling Lingware in a Multilingual MT System. In: Burstein, J. & Leacock, C. From Research to Commercial Applications: Making NLP Work in Practice. Somerset, New Jersey: Association for Computational Linguistics. pp. 65-70.
- Scannell, K. 2006. Machine translation for closely related language pairs. In: Proceedings of the LREC2006 Workshop on Strategies for developing machine translation for minority languages. European Language Resources Association: Paris.
- Vandeghinste, V., Schuurman, I., Carl, M., Markantonatou, S. and Badia, T. 2006. METIS-II: Machine Translation for Low Resource Languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy. May 24-26. European Language Resources Association: Paris.
- Language independent convertor for converting text from one language to another, closely related language
- Written in Python
- Includes Afrikaans-to-Dutch and Dutch-to-Afrikaans wordlists (including false friends) and rules for orthographic conversion.
- Includes METIS II test data, also with Afrikaans translations.
- Includes code for web demo, available here
2009. Dutch-2-Afrikaans Convertor/Afrikaans-2-Dutch Convertor. (GB van Huyssteen, S Pilon, L Augustinus, MJ Puttkammer, and M Schlemmer). Potchefstroom: NWU.
A freely available, open-source rule-based system for converting Dutch text to Afrikaans, and vice versa.
Could be used as a pre-processing step in machine translation from Dutch to Afrikaans.
Language independent convertor for converting text from one language to another, closely related language.
Written in Perl.
Afrikaans-to-Dutch and Dutch-to-Afrikaans wordlists (including false friends) and rules for orthographic conversion.
Includes METIS II test data, also with Afrikaans translations.
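The combination of wordlists and orthographic conversion rules described above can be illustrated with a minimal sketch. This is not the project's convertor: the names `WORDLIST`, `RULES`, `convert_word` and `convert` are hypothetical, and the handful of entries and rules are illustrative assumptions chosen only to show the mechanism (whole-word lookup first, ordered spelling rules otherwise).

```python
import re

# Hypothetical whole-word list (Dutch -> Afrikaans); illustrative entries only.
# "het" is a false friend: a Dutch article/pronoun, but the verb "have" in
# Afrikaans. Mapping it to the article "die" is a deliberate simplification.
WORDLIST = {
    "de": "die",
    "het": "die",
}

# Hypothetical orthographic rules, applied in order to words not in WORDLIST.
RULES = [
    (re.compile(r"lijk$"), "lik"),  # moeilijk -> moeilik
    (re.compile(r"^sch"), "sk"),    # school -> skool
    (re.compile(r"^z"), "s"),       # zee -> see
    (re.compile(r"ij"), "y"),       # wijn -> wyn
]


def convert_word(word):
    """Convert one word: wordlist lookup first, spelling rules otherwise."""
    lower = word.lower()
    if lower in WORDLIST:
        result = WORDLIST[lower]
    else:
        result = lower
        for pattern, replacement in RULES:
            result = pattern.sub(replacement, result)
    # Restore an initial capital if the input word had one.
    if word[:1].isupper():
        result = result[:1].upper() + result[1:]
    return result


def convert(text):
    """Convert every word in the text, leaving spaces and punctuation as-is."""
    return re.sub(r"\w+", lambda match: convert_word(match.group(0)), text)


print(convert("De school is moeilijk."))
```

Checking the wordlist before the rules lets lexical exceptions such as false friends override the regular spelling correspondences. The rule order also matters in this sketch: the suffix rule for "lijk" has to run before the general "ij" rule, otherwise "moeilijk" would become "moeilyk".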
Contains more than 24,000 Afrikaans words.
Developed in collaboration with Stellenbosch University and CSIR Meraka Institute.
330 bulletins; circa 27 hours of audio data.
SABC radio news bulletins from 2001-2004, as well as from 2010 onwards.
Available for research purposes; contact Martin Puttkammer.
Talks and posters
Related projects and events
Project: Mutual Intelligibility of Closely Related Languages
Workshop on comparing approaches to measuring linguistic differences