Van Huyssteen, Gerhard B., and Menno M. Van Zaanen. 2004. “Learning compound boundaries for Afrikaans spelling checking.” Proceedings of First Workshop on International Proofing Tools and Language Technologies, 101-108. Patras: University of Patras.
Van Huyssteen & Van Zaanen 2004
Abstract
Current spelling checkers for Afrikaans still do not provide full access to desired linguistic performance, especially with respect to high lexical recall and error precision. One of the main problems is that Afrikaans is an agglutinative language with a high lexical generative power using concatenative compound formation. This means that the lexicon in an Afrikaans spelling checker can never account for all possible compounds, and other means should therefore be sought to recognise valid compounds. In this article, we investigate two approaches to finding compound boundaries. First, we describe a longest string-matching algorithm, which searches for known words at the beginning and end of the compound. Next, a machine-learning approach using decision trees is implemented. Results of both approaches are presented, indicating that the longest string-matching algorithm outperforms the machine-learning approach. However, the machine-learning approach has many advantages over the longest string-matching algorithm. The article concludes with a discussion of the advantages and disadvantages of the systems, remaining problems and possible solutions.
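The abstract describes the longest string-matching approach only in outline, so the following Python sketch is an illustration of the general idea and not the authors' implementation. It assumes a lexicon held as a plain set of strings and repeatedly strips the longest known word from the beginning and from the end of the input. The names `split_compound`, `_longest_known`, `LEXICON` and the `min_len` parameter are all hypothetical.

```python
def _longest_known(text, lexicon, min_len, from_start):
    """Return the longest proper prefix (or suffix) of text found in lexicon."""
    if from_start:
        # Try prefixes from longest to shortest, excluding the whole string.
        for end in range(len(text) - 1, min_len - 1, -1):
            if text[:end] in lexicon:
                return text[:end]
    else:
        # Try suffixes from longest to shortest, excluding the whole string.
        for start in range(1, len(text) - min_len + 1):
            if text[start:] in lexicon:
                return text[start:]
    return None


def split_compound(word, lexicon, min_len=2):
    """Split word into known parts, or return None if no full split is found.

    Assumption: parts are simply concatenated; linking morphemes and
    spelling changes at boundaries are not handled in this sketch.
    """
    left, right = [], []
    rest = word
    while rest:
        if rest in lexicon:
            return left + [rest] + right
        prefix = _longest_known(rest, lexicon, min_len, from_start=True)
        suffix = _longest_known(rest, lexicon, min_len, from_start=False)
        if prefix is None and suffix is None:
            return None  # an unknown remainder: no valid analysis
        if prefix and suffix and len(prefix) + len(suffix) <= len(rest):
            # Both ends match without overlapping: strip both.
            left.append(prefix)
            right.insert(0, suffix)
            rest = rest[len(prefix):len(rest) - len(suffix)]
        elif prefix:
            left.append(prefix)
            rest = rest[len(prefix):]
        else:
            right.insert(0, suffix)
            rest = rest[:len(rest) - len(suffix)]
    return left + right


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"spel", "toets", "toetser", "woord", "boek"}

print(split_compound("speltoetser", LEXICON))  # ['spel', 'toetser']
print(split_compound("woordeboek", LEXICON))   # None
```

The second call fails because the "e" left between "woord" and "boek" is not a lexicon entry. A greedy matcher of this kind needs extra rules for such linking elements, and it can only ever find boundaries between words it already knows.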
Written in:
English
Dealing with:
Afrikaans
Keywords
Afrikaans, compound, compound splitting, human language technology, morphology, spelling checker
Afrikaans keywords
Afrikaans, kompositum, kompositumverdeling, mensetaaltegnologie, morfologie, samestelling, samestellingverdeling, speltoetser