Van Huyssteen & Van Zaanen 2004

Van Huyssteen, Gerhard B., and Menno M. Van Zaanen. 2004. “Learning compound boundaries for Afrikaans spelling checking.” Proceedings of First Workshop on International Proofing Tools and Language Technologies, 101-108. Patras: University of Patras.

Abstract

Current spelling checkers for Afrikaans still do not provide full access to desired linguistic performance, especially with respect to high lexical recall and error precision. One of the main problems is that Afrikaans is an agglutinative language with a high lexical generative power using concatenative compound formation. This means that the lexicon in an Afrikaans spelling checker can never account for all possible compounds, and other means should therefore be sought to recognise valid compounds. In this article, we investigate two approaches to finding compound boundaries. First, we describe a longest string-matching algorithm, which searches for known words at the beginning and end of the compound. Next, a machine-learning approach using decision trees is implemented. Results of both approaches are presented, indicating that the longest string-matching algorithm outperforms the machine-learning approach. However, the machine-learning approach has many advantages over the longest string-matching algorithm. The article concludes with a discussion of the advantages and disadvantages of the systems, remaining problems and possible solutions.
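As an illustration only, the general idea of longest string matching at both ends of a compound might be sketched as follows. This is not the authors' implementation: the function name split_compound, the min_len parameter, the recursion on the middle part and the toy lexicon are all assumptions made for this sketch.

```python
def split_compound(word, lexicon, min_len=2):
    """Split `word` into known lexicon words, or return None.

    Hypothetical sketch: take the longest known word at the start,
    then the longest known word at the end of what remains, and
    recurse on whatever is left in the middle.
    """
    if not word:
        return []
    if word in lexicon:
        return [word]
    # Try proper prefixes from longest to shortest.
    for end in range(len(word) - 1, min_len - 1, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        # Try suffixes of the remainder from longest to shortest.
        for start in range(0, len(rest) - min_len + 1):
            tail = rest[start:]
            if tail not in lexicon:
                continue
            middle = split_compound(rest[:start], lexicon, min_len)
            if middle is not None:
                return [head] + middle + [tail]
    return None


# Toy lexicon, invented for this example.
LEXICON = {"spel", "toetser", "woord", "boek"}

print(split_compound("speltoetser", LEXICON))    # ['spel', 'toetser']
print(split_compound("woordspelboek", LEXICON))  # ['woord', 'spel', 'boek']
print(split_compound("xyz", LEXICON))            # None
```

A spelling checker built along these lines would accept an unknown word as a valid compound whenever the function returns a list rather than None.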

Written in:

English

Dealing with:

Afrikaans

Keywords

Afrikaans, compound, compound splitting, human language technology, morphology, spelling checker

Afrikaans keywords

Afrikaans, kompositum, kompositumverdeling, mensetaaltegnologie, morfologie, samestelling, samestellingverdeling, speltoetser