Project: Genre Classification for South African Languages
DURATION
2011-2012
FUNDED BY
- Dutch Language Union (Belgium, The Netherlands)
- Department of Arts and Culture (South Africa)
PROJECT URLS
- https://gcsal.sf.net/
- https://gerhard.pro/genre-classification/
PROJECT LEADERS
- GERHARD B VAN HUYSSTEEN – PROJECT LEADER AND LINGUISTICS
- CTexT (Centre for Text Technology), North-West University, South Africa
- WALTER DAELEMANS – COMPUTATIONAL LINGUISTICS
- CLiPs (Computational Linguistics and Psycholinguistics), University of Antwerp, Belgium
PROJECT COLLABORATORS
- DIRK SNYMAN
- CTexT (Centre for Text Technology), North-West University, Potchefstroom, South Africa
OVERVIEW
During 2011 and 2012, the Department of Arts and Culture of the South African Government funded a small-scale project on genre classification for document management.
During the project, the following tasks were undertaken:
- We investigated appropriate ontologies and optimal supervised and unsupervised machine learning methods for the development of genre classifiers, specifically for resource-scarce languages (the findings are captured in a master’s dissertation and in a scholarly publication);
- We developed genre classifiers (and their associated resources) for the ten official indigenous languages of South Africa (available here);
- We implemented these classifiers as a web-based demo, where users can either upload a document or provide a URL for classification (depending on the chosen genre classification ontology); and
- We organised a training event on “New Applications of Automatic Text Categorization”, presented by Prof Walter Daelemans on 25 January 2012 at the CSIR, Pretoria.
The project was executed and managed by Trifonius, in collaboration with partners, including:
- Prof Walter Daelemans (University of Antwerp; Belgium)
- Centre for Text Technology (CTexT) (North-West University; Potchefstroom)
- Human Language Technology Competence Area (Council for Scientific and Industrial Research; Pretoria)
AIMS
The primary aim of this project was to develop resources (including annotation protocols, and training and testing data) for the development of automatic genre classifiers for ten South African languages.
Secondary aims included:
- to report on the research and development process in the form of:
- one Master’s degree dissertation;
- at least two scholarly papers, to be published in relevant journals or peer-reviewed conference proceedings;
- various annotation protocols, made available publicly;
- to contribute towards human capital development and growth of the pool of experts in descriptive linguistics and computational linguistics in South Africa, Belgium and The Netherlands by offering bursaries, grants or contract work to undergraduate and post-graduate students;
- to extend the collaboration network between Trifonius, North-West University (NWU) and University of Antwerp (UA), by introducing young scholars and students to each other;
- to identify new research issues as they unfold in the research and development process; and
- to contribute to the HLT-enabling of the languages of South Africa.
OUTPUTS
PEER-REVIEWED PUBLICATIONS
- Snyman, DP, Van Huyssteen, GB & Daelemans, W. 2014. Outomatiese Genreklassifikasie vir Afrikaans [Automatic genre classification for Afrikaans]. DOI: 10.4102/satnt.v33i1.759. Suid-Afrikaanse Tydskrif vir Natuurwetenskap en Tegnologie. 33(1): 12 pp.
- Snyman, D, Van Huyssteen, GB & Daelemans, W. 2012. Cross-Lingual Genre Classification for Closely Related Languages. In: Proceedings of the Twenty-Third Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-620-54601-0. 29-30 November. Pretoria, South Africa. pp. 133-137.
- Snyman, DP, Van Huyssteen, GB & Daelemans, W. 2011. Automatic genre classification for resource scarce languages. In: Proceedings of the 2011 Conference of the Pattern Recognition Association of South Africa. ISBN: 978-0-620-51914-4. 22-25 November. Vanderbijlpark, South Africa. pp. 132-137.
RESOURCES
- Genre Classification Corpora for South African Languages 1.0. (Project leader, with Walter Daelemans as co-project leader, and Dirk Snyman as main collaborator and scientific programmer). Potchefstroom: NWU.
- Corpora that can be used to train genre classifiers for South African languages.
- Afrikaans Genre Classification Corpus (ISLRN: 666-908-651-526-7)
- isiNdebele Genre Classification Corpus (ISLRN: 248-916-003-745-6)
- isiXhosa Genre Classification Corpus (ISLRN: 418-998-894-930-1)
- isiZulu Genre Classification Corpus (ISLRN: 457-135-629-106-1)
- Sesotho Genre Classification Corpus (ISLRN: 469-495-440-934-0)
- Sesotho sa Leboa Genre Classification Corpus (ISLRN: 676-872-880-082-8)
- Setswana Genre Classification Corpus (ISLRN: 921-735-738-409-8)
- Siswati Genre Classification Corpus (ISLRN: 718-674-341-027-9)
- Tshivenda Genre Classification Corpus (ISLRN: 098-827-706-093-4)
- Xitsonga Genre Classification Corpus (ISLRN: 210-849-527-713-3)
- Cite as: Snyman, D, Van Huyssteen, GB & Daelemans, W. 2012. Genre classification corpora for South African languages 1.0. Potchefstroom: North-West University. Available at gcsal.sf.net.
TUTORIAL
Tutorial: New Applications of Automatic Text Categorization
Presenter: Prof Walter Daelemans (University of Antwerp, Belgium)
Date: 25 January 2012
Time: 09:00-16:00
Place: Knowledge Commons, CSIR, Pretoria
Cost: Free
Automatic text categorization is a mature language technology that sorts documents into different categories on the basis of examples. Its applications range from e-mail routing and spam filtering to topic detection and text genre assignment. A text categorization system combines an approach to document representation (mostly a set of relevant terms or n-grams of words found in the document) with a machine learning method. In the first part of the tutorial, this basic architecture was described at an introductory level, and an overview of state-of-the-art document representation and machine learning methods was presented.
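The basic architecture described above (a document representation plus a machine learning method) can be illustrated with a minimal sketch. It assumes word unigrams and bigrams as the representation and a multinomial Naive Bayes learner with add-one smoothing; both choices, the class and function names, and the toy training texts are assumptions made for illustration, not the classifiers or data used in the project or the tutorial.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_max=2):
    """Return all word 1..n_max-grams of a lower-cased, whitespace-tokenised text."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes over n-gram counts (illustrative, not the project's model)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per category
        self.feature_counts = defaultdict(Counter)   # n-gram counts per category
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            grams = word_ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocabulary.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(doc_count / total_docs)  # log prior
            for gram in word_ngrams(text):
                if gram in self.vocabulary:            # ignore unseen n-grams
                    score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training examples (invented for this sketch).
train_texts = [
    "the minister announced the new budget in parliament today",
    "police confirmed the arrest after the incident on friday",
    "mix the flour and sugar then bake for twenty minutes",
    "add the onions and stir until the sauce thickens",
]
train_labels = ["news", "news", "recipe", "recipe"]

model = NaiveBayesCategorizer().fit(train_texts, train_labels)
predicted = model.predict("stir the sugar into the sauce and bake")
print(predicted)
```

In practice the same two-part design applies regardless of the learner: swapping the n-gram function changes the representation, and swapping the class changes the learning method, without affecting the other part.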
The second part of the tutorial focused in more technical detail on a new application area of this technology: automatic profiling of text. In this application, we are interested in which metadata we can infer from a document. More specifically, we are interested in how far we can get with text categorisation techniques in tasks like the following:
(i) Text profiling: predicting age, gender, personality, and region of the author of the text.
(ii) Intrinsic plagiarism detection: finding passages in text not written by the author.
(iii) Deception detection: finding out whether reviews and reports are truthful, detecting pedophile grooming in social networks, etc.
To achieve this, we need document representations that differ from those used in other applications: instead of (patterns of) content words, we need other linguistic categories. Some of the tasks also call for special-purpose machine learning algorithms, such as Koppel et al.’s unmasking algorithm.
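As a rough illustration of a representation that is not based on content words, the sketch below describes a text by the relative frequencies of a few function words. Function-word frequencies are only one possible choice of "other linguistic categories" and are an assumption here, as are the tiny English word list and the function name; the tutorial may have covered different features.

```python
from collections import Counter

# Hypothetical, deliberately tiny list of English function words (illustration only).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "i", "it", "but", "not")


def function_word_profile(text):
    """Return the relative frequency of each function word in the text."""
    # Strip surrounding punctuation and lower-case each whitespace-separated token.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


profile = function_word_profile(
    "I think that it is not the end of the story, but I may be wrong."
)
print(dict(zip(FUNCTION_WORDS, profile)))
```

A vector like this can be passed to any standard learner in place of content-word counts, which makes it less sensitive to the topic of a text than a content-word representation.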
This workshop was hosted and organised by Trifonius, and was made possible through funding by the National Centre for Human Language Technologies of the Department of Arts and Culture, and a financial contribution by the Human Language Technology Competence Area of the CSIR. The workshop was attended by eleven scholars and students.
DISSERTATIONS (UNPUBLISHED)
Snyman, DP. 2012. Outomatiese genreklassifikasie vir hulpbronskaars tale [Automatic genre classification for resource-scarce languages]. MA Thesis. Potchefstroom: North-West University.
DEMO
Final version of a web-based demonstrator.