Snyman, Van Huyssteen & Daelemans 2011

2011, genre classification, inverse document frequency, resource-scarce languages, term frequency, text classification algorithm

Snyman, Dirk P., Gerhard B. Van Huyssteen, and Walter Daelemans. 2011. “Automatic genre classification for resource scarce languages.” Proceedings of the 2011 Conference of the Pattern Recognition Association of South Africa, Vanderbijlpark, South Africa.

Abstract

In this article we present research on the development of automatic genre classification systems for resource-scarce languages. The main approaches to text classification from the literature are presented and weighed against each other in an experimental phase to identify the most appropriate text classification approach for a genre classification system. A fixed feature set is extracted for seven classes from the available training data for each of the six languages under scrutiny and paired with each classification algorithm to test the algorithms’ performance. The best-performing algorithm is support vector machines, in conjunction with term frequency and inverse document frequency features.
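As an illustration of the term frequency and inverse document frequency features mentioned in the abstract, here is a minimal sketch of one common TF-IDF weighting. The abstract does not state which variant the authors used, so the formula (relative term frequency multiplied by the log of the inverse document frequency), the function name `tf_idf`, and the toy corpus are all assumptions for illustration, not the paper's implementation. Training a support vector machine on these weights is not shown.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed weighting (not taken from the paper):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus, not data from the paper.
docs = [
    "the court ruled on the appeal".split(),
    "the recipe needs flour and sugar".split(),
    "the court heard the recipe dispute".split(),
]
vectors = tf_idf(docs)
```

Under this weighting, a term that occurs in every document (here "the") receives weight zero, while a term confined to one document (here "appeal") is weighted most heavily, which is what makes such features useful for separating classes.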

Written in:

English

Dealing with:

Afrikaans; Sepedi; Sesotho; Setswana; isiXhosa and isiZulu

Keywords

genre classification, resource-scarce languages, text classification algorithm, term frequency, inverse document frequency

Afrikaans keywords

genreklassifikasie, hulpbronskaars tale, inverse dokumentfrekwensie, teksklassifikasiealgoritme, termfrekwensie
