Daelemans, Groenewald & Van Huyssteen 2009

Daelemans, Walter, Hendrik J. Groenewald, and Gerhard B. Van Huyssteen. 2009. “Prototype-based active learning for lemmatization.” Proceedings of Recent Advances in Natural Language Processing 2009.

  • Files
  • Keywords
  • Abstract
  • Languages

English: active learning, Afrikaans, lemmatization, prototype theory

Afrikaans: Afrikaans, aktiewe leer, lemmatisering, prototipeteorie


English: Annotation of training data for machine learning is often a laborious and costly process. In Active Learning (AL), criteria are investigated that allow ordering the unannotated data in such a way that those instances potentially contributing most to the speed of learning can be annotated first. Within this context we explore a new approach that focuses on prototypicality as a criterion for the selection of instances to act as training data in order to optimize prediction accuracy. In parallel with the prototype-based active classification (PBAC) approach of Cebron & Berthold (2009), we investigate whether the basic PBAC assumption rings true for linguistic data. The NLP task we address is lemmatization, the reduction of inflected word forms to their base form. We operationalize prototypicality as features (i.e. word frequency and word length) of the already available training data items, and combine this with a measure of uncertainty (entropy). We show that selecting less prototypical instances first provides better performance than selecting data randomly or using state-of-the-art AL methods. We argue that this improvement is possible because language processing tasks have highly disjunctive instance spaces, as there are often few regularities and many irregularities.
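
The selection idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the function names (`prototypicality`, `rank_for_annotation`), the assumption that frequent and short words count as more prototypical, and the equal weighting of entropy and atypicality are all hypothetical choices made for illustration only.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def prototypicality(word, freq, max_freq, max_len):
    """Hypothetical prototypicality score in [0, 1].

    Assumption (not from the paper): frequent, short words are treated
    as more prototypical; both features are weighted equally.
    """
    freq_score = freq / max_freq
    length_score = 1 - len(word) / max_len
    return (freq_score + length_score) / 2


def rank_for_annotation(pool):
    """Order unannotated items so that uncertain, less prototypical ones come first.

    pool: list of (word, corpus_frequency, predicted_class_probabilities).
    The additive combination below is an illustrative assumption.
    """
    max_freq = max(freq for _, freq, _ in pool)
    max_len = max(len(word) for word, _, _ in pool)

    def score(item):
        word, freq, probs = item
        atypicality = 1 - prototypicality(word, freq, max_freq, max_len)
        return entropy(probs) + atypicality

    return sorted(pool, key=score, reverse=True)


# Toy pool with made-up frequencies and classifier probabilities.
pool = [
    ("die", 1000, [0.9, 0.1]),
    ("onderwysers", 5, [0.5, 0.5]),
    ("katte", 50, [0.7, 0.3]),
]
ranked = [word for word, _, _ in rank_for_annotation(pool)]
print(ranked)
```

In this toy pool the long, rare, maximally uncertain word is proposed for annotation first and the short, frequent, confidently classified word last. A real setup would take the probabilities from the current lemmatizer and the frequencies from a corpus.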

In: English

On: Afrikaans