Van Zaanen et al. 2014

van Zaanen, Menno M., Gerhard B. Van Huyssteen, Suzanna Aussems, Chris Emmery, and Roald Eiselen. 2014. “The development of Dutch and Afrikaans language resources for compound boundary analysis.” Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland.

Abstract

In most languages, new words can be created through the process of compounding, which combines two or more words into a new lexical unit. Whereas in languages such as English the components that make up a compound are separated by a space, in languages such as Finnish, German, Afrikaans and Dutch these components are concatenated into one word. Compounding is very productive and leads to practical problems in developing machine translators and spelling checkers, as newly formed compounds cannot be found in existing lexicons. The Automatic Compound Processing (AuCoPro) project deals with the analysis of compounds in two closely-related languages, Afrikaans and Dutch. In this paper, we present the development and evaluation of two datasets, one for each language, that contain compound words with annotated compound boundaries. Such datasets can be used to train classifiers to identify the compound components in novel compounds. We describe the process of annotation and provide an overview of the annotation guidelines as well as global properties of the datasets. The inter-rater agreements between the annotators are considered highly reliable. Furthermore, we show the usability of these datasets by building an initial automatic compound boundary detection system, which assigns compound boundaries with approximately 90% accuracy.

Written in:

English

Dealing with:

Afrikaans and Dutch

Keywords

compound boundary annotation, language resource development, Dutch, Afrikaans

Afrikaans keywords

Afrikaans, Nederlands, samestellingsgrensannotasie, taalhulpbronontwikkeling