Van Huyssteen & Van Zaanen 2004

Van Huyssteen, Gerhard B., and Menno M. Van Zaanen. 2004. “Learning compound boundaries for Afrikaans spelling checking.” Proceedings of First Workshop on International Proofing Tools and Language Technologies. Patras: University of Patras, 101-108.

English: Afrikaans, compound, compound splitting, human language technology, morphology, spelling checker

Afrikaans: Afrikaans, kompositum, kompositumverdeling, mensetaaltegnologie, morfologie, samestelling, samestellingverdeling, speltoetser

English: Current spelling checkers for Afrikaans still do not provide full access to desired linguistic performance, especially with respect to high lexical recall and error precision. One of the main problems is that Afrikaans is an agglutinative language with a high lexical generative power using concatenative compound formation. This means that the lexicon in an Afrikaans spelling checker can never account for all possible compounds, and other means should therefore be sought to recognise valid compounds. In this article, we investigate two approaches to finding compound boundaries. First, we describe a longest string-matching algorithm, which searches for known words at the beginning and end of the compound. Next, a machine-learning approach using decision trees is implemented. Results of both approaches are presented, indicating that the longest string-matching algorithm outperforms the machine-learning approach. However, the machine-learning approach has many advantages over the longest string-matching algorithm. The article concludes with a discussion of the advantages and disadvantages of the systems, remaining problems and possible solutions.
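
Illustration: the abstract describes the longest string-matching idea only in outline (search for known words at the beginning and end of the compound). The sketch below is one possible reading of that outline, not the authors' implementation. The toy lexicon, the function names, and the minimum part length of two characters are all assumptions made for this example.

```python
# Hypothetical toy lexicon; a real spelling checker would use a full word list.
LEXICON = {"spel", "toetser", "woord", "boek"}


def longest_prefix(word, lexicon, min_len=2):
    """Return the longest proper prefix of `word` found in `lexicon`, or None."""
    for end in range(len(word) - 1, min_len - 1, -1):
        if word[:end] in lexicon:
            return word[:end]
    return None


def longest_suffix(word, lexicon, min_len=2):
    """Return the longest proper suffix of `word` found in `lexicon`, or None."""
    for start in range(1, len(word) - min_len + 1):
        if word[start:] in lexicon:
            return word[start:]
    return None


def split_compound(word, lexicon):
    """Propose a two-part split if a known word at one end leaves a known remainder.

    Assumption: only a single boundary is sought, and linking morphemes
    are not handled.
    """
    if word in lexicon:
        return [word]
    prefix = longest_prefix(word, lexicon)
    if prefix is not None and word[len(prefix):] in lexicon:
        return [prefix, word[len(prefix):]]
    suffix = longest_suffix(word, lexicon)
    if suffix is not None and word[:-len(suffix)] in lexicon:
        return [word[:-len(suffix)], suffix]
    return None
```

With this toy lexicon, split_compound("speltoetser", LEXICON) yields the parts "spel" and "toetser", while "woordeboek" is left unsplit because the linking "e" between "woord" and "boek" matches no lexicon entry. A practical system would need to account for such linking elements.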



In: English

On: Afrikaans