Software & Resources

2015

GridLine and CTexT. 2015. Afrikaanse Klinkende Taal. Web demo (prototype). Amsterdam: GridLine. 

  • Online, browser-based analysis of Afrikaans texts with suggestions to improve readability of uploaded text.
  • Deliverable of a project funded by the Dutch Language Union and the South African Department of Arts and Culture.
  • Username for demo: kta-demo-001; Password: vmWgTE

Puttkammer, MJ & Van Huyssteen, GB. 2015. Afrikaanse werkwoorde met verledetydvorme 1.0 [Afrikaans verbs with past tense forms 1.0]. Potchefstroom: Centre for Text Technology (CTexT), North-West University. 

  • Version 1.0 comprises 5,523 Afrikaans verbs with associated past tense forms (tab separated).
  • Verbs were drawn from several sources, mainly CTexT’s Afrikaans lemmatisation data, supplemented with data from the Afrikaans/Dutch-Dutch/Afrikaans Dictionary (ANNA).
  • Past tense forms were generated and manually verified.
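As a minimal sketch of how such a tab-separated list could be read, assuming each line holds a verb followed by one or more past tense forms (the exact column layout of the released file is an assumption, and the two sample rows are illustrative rather than copied from the resource):

```python
import csv
import io


def load_verbs(handle):
    """Map each verb to the list of past tense forms found on its line."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and rows without a verb in the first column.
        if not row or not row[0].strip():
            continue
        verb, *forms = [cell.strip() for cell in row]
        table.setdefault(verb, []).extend(f for f in forms if f)
    return table


# Hypothetical sample standing in for the real file.
SAMPLE = "loop\tgeloop\nverstaan\tverstaan\n"

verbs = load_verbs(io.StringIO(SAMPLE))
print(verbs["loop"])
```

For the real file, an opened file handle (with `encoding="utf-8"`) would take the place of the `io.StringIO` sample.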

Van Huyssteen, GB, Coetzee, M, Eiselen, ER, Fourie, W, Hocking, J, Lavangee, I, Puttkammer, MJ and Van der Walt, C. 2015. VivA interfaces 1.0. Johannesburg: Virtual Institute for Afrikaans (VivA). 

  • Includes:
    • Offline mobile app for Android and iOS
    • Web site, with online access to, amongst others, dictionary portal, language advice portal, corpus portal, grammar portal, terminology platform, forum, and ticketing system
    • APIs for access to external resources
  • Role: Project leader; lead functional designer

Van Niekerk, DR, Van Huyssteen, GB & Puttkammer, MJ. 2015. Closely related languages convertor v2.0.0. Potchefstroom: Centre for Text Technology (CTexT), North-West University.

  • Language-independent convertor for converting text from one language to another, closely related language
  • Written in Python
  • Includes Afrikaans-to-Dutch and Dutch-to-Afrikaans wordlists (including false friends) and rules for orthographic conversion.
  • Includes METIS II test data, also with Afrikaans translations.
  • Includes code for web demo, available here
  • Role: Project leader
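The general idea of combining a wordlist with orthographic rules can be sketched as follows. This is only an illustration under stated assumptions, not the convertor's actual code or data: the names `WORDLIST`, `RULES`, `convert_word` and `convert` are hypothetical, and the handful of Afrikaans-to-Dutch entries and rules are chosen as examples.

```python
import re

# Hypothetical wordlist: whole-word replacements take priority over rules.
WORDLIST = {"baie": "veel", "nie": "niet"}

# Hypothetical orthographic rules, applied in order to words not in the list.
RULES = [
    (re.compile(r"y"), "ij"),     # e.g. tyd -> tijd
    (re.compile(r"^sk"), "sch"),  # e.g. skool -> school
]


def convert_word(word):
    """Convert one word: wordlist lookup first, then rule-based rewriting."""
    lower = word.lower()
    if lower in WORDLIST:
        return WORDLIST[lower]
    for pattern, replacement in RULES:
        lower = pattern.sub(replacement, lower)
    return lower


def convert(text):
    """Convert a whitespace-separated text word by word."""
    return " ".join(convert_word(w) for w in text.split())


print(convert("skool tyd baie"))  # prints "school tijd veel"
```

Looking words up before applying rules lets irregular pairs and false friends bypass the general rewriting, which is presumably one reason to ship wordlists alongside rules.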

2014

CTexT. 2014. Afrikaans NCHLT Annotated Text Corpora 1.0. ISLRN: 139-586-400-050-9. Potchefstroom: Resource Management Agency.  

  • Monolingual text corpus, annotated with lemma, part of speech and morphological analyses, and manually verified
  • Developed during the NCHLT Text project, funded by Department of Arts and Culture
  • Based on documents from the South African government domain, crawled from gov.za websites and collected from various language units
  • Includes:
    • 58,000 word annotated corpus
    • 6,100 word annotated test corpus
    • Annotation protocols for lemmatisation, part of speech tagging, and morphological analysis
  • Cite as:
    • Eiselen, E.R. & Puttkammer, M.J. 2014. Developing text resources for ten South African languages. (In Proceedings of the 9th International Conference on Language Resources and Evaluation, Reykjavik, Iceland. pp. 3698-3703)
  • Available here.
  • Role: Project leader of Afrikaans annotation team (including AP Butler and S Pilon); author of annotation protocols; annotation and verification of data

Van Huyssteen, GB, Daelemans, W, Van Zaanen, MM & Verhoeven, B. 2014. Resources for Compound Processing. North-West University: Potchefstroom, South Africa; University of Antwerp: Antwerp, Belgium; University of Tilburg: Tilburg, The Netherlands. 

  • Annotation Guidelines for Compound Analysis
  • Annotation Guidelines for Compound Segmentation.
  • Annotation Guidelines for the Semantic Analysis of Noun-Noun Compounds in English, Dutch and Afrikaans. 
    • Including: Decision Tree and Paraphrasing Table
  • Annotation Guidelines for the Semantic Analysis of Other Nominal Compounds in Dutch and Afrikaans. 
    • Specifically: Adjective-Noun, Verb-Noun, Quantifier-Noun and Preposition-Noun
  • Compound Semantics Dataset (compounds with semantic annotation) (ISLRN: 018-728-255-533-1)
    • Afrikaans
      • Afr-NN-FirstRound (1449 compounds) 
      • Afr-NN-SecondRound (2328 compounds)
      • Afr-XN (4553 compounds)
    • Dutch
      • Ned-NN-FirstRound (1766 compounds)
      • Ned-NN-SecondRound (2000 compounds)
      • Ned-XN (600 compounds)
  • Compound Splitting Dataset (compounds annotated with constituent boundaries and linking elements) (ISLRN: 672-510-089-719-2)
    • Afrikaans (25,266 compounds)
    • Dutch (26,000 compounds)

2012

Butler, A & Van Huyssteen, GB. 2012. 5000 Afrikaans woorde, gekategoriseer volgens spellingprobleme [5000 Afrikaans words, categorised according to spelling problems]. Version 1.0. Unpublished. Potchefstroom: North-West University. 

  • A list of 5,000 Afrikaans words, categorised according to potential spelling problems.
  • The spelling of the words is based on the ninth edition of the Afrikaanse Woordelys en Spelreëls.
  • This wordlist is the outcome of a project that investigated the potential of a spelling game for Afrikaans. The project was completed in 2005 and was sponsored by the Chancellor's Trust Fund of the North-West University, Potchefstroom, South Africa. 
  • Available under a Creative Commons Attribution 2.5 South Africa License.

Construction Morphology Toolkit (cx-morph-toolkit) 0.2. (Project leader, with Marelie H Davel as main collaborator and scientific programmer). Potchefstroom: NWU. 

  • Analyses a set of word pairs (base form + modified form) to automatically extract and draw categorisation networks (à la Cognitive Grammar).
  • Written in Perl.
  • Cite as: Davel, Marelie H & Van Huyssteen, Gerhard B. 2012. Construction Morphology Toolkit (cx-morph-toolkit) 0.2. Potchefstroom: North-West University. Available at cx-morph-kit.sourceforge.net.

Genre classification corpora for South African languages 1.0. (Project leader, with Walter Daelemans as co-project leader, and Dirk Snyman as main collaborator and scientific programmer). Potchefstroom: NWU. 

  • Corpora that can be used to train genre classifiers for South African languages.
    • Afrikaans Genre Classification Corpus  (ISLRN: 666-908-651-526-7)
    • isiNdebele Genre Classification Corpus  (ISLRN: 248-916-003-745-6)
    • isiXhosa Genre Classification Corpus  (ISLRN: 418-998-894-930-1)
    • isiZulu Genre Classification Corpus  (ISLRN: 457-135-629-106-1)
    • Sesotho Genre Classification Corpus  (ISLRN: 469-495-440-934-0)
    • Sesotho sa Leboa Genre Classification Corpus  (ISLRN: 676-872-880-082-8)
    • Setswana Genre Classification Corpus  (ISLRN: 921-735-738-409-8)
    • Siswati Genre Classification Corpus  (ISLRN: 718-674-341-027-9)
    • Tshivenda Genre Classification Corpus  (ISLRN: 098-827-706-093-4)
    • Xitsonga Genre Classification Corpus  (ISLRN: 210-849-527-713-3)
  • Cite as: Snyman, D, Van Huyssteen, GB & Daelemans, W. 2012. Genre classification corpora for South African languages 1.0. Potchefstroom: North-West University. Available at gcsal.sf.net.

2011

Taalkommissie van die Suid-Afrikaanse Akademie vir Wetenskap en Kuns. 2011. Taalkommissiekorpus 1.1. Noordwes-Universiteit: CTexT. 

  • A stratified corpus of formal, written Standard Afrikaans, comprising circa 57 million words.
  • Available for research purposes only. Get online access (also to various other corpora) at the Virtual Institute for Afrikaans.
  • To request a downloadable version of the corpus for specific research purposes (e.g. computational linguistics), write an email to sulene (dot) pilon (at) up (dot) ac (dot) za

2010

Convertor v1.2.0. (Project leader, in collaboration with S Pilon, MJ Puttkammer, and M Schlemmer). Potchefstroom: NWU. 

  • Language-independent convertor for converting text from one language to another, closely related language.
  • Written in Perl.

Afrikaans and Dutch Lists and Rules v1.0.2. (Project leader, in collaboration with S Pilon). Potchefstroom: NWU. 

  • Afrikaans-to-Dutch and Dutch-to-Afrikaans wordlists (including false friends) and rules for orthographic conversion.
  • Includes METIS II test data, also with Afrikaans translations.

2009

Dutch-2-Afrikaans Convertor/Afrikaans-2-Dutch Convertor. (Project leader, in collaboration with S Pilon, L Augustinus, MJ Puttkammer, and M Schlemmer). Potchefstroom: NWU. 

  • A freely available, open-source rule-based system for converting Dutch text to Afrikaans, and vice versa.
  • Could be used as a pre-processing step in machine translation from Dutch to Afrikaans.

2008

Afrikaanse SkryfGoed 2008, including Afrikaanse Speltoetser 3.1, Afrikaanse Grammatikatoetser 1.0, Tesourus 1.0 en Woordafbreker. (Project leader, in collaboration with, amongst others, ER Eiselen, S Pilon, MJ Puttkammer, HJ Groenewald, U Janke and M Schlemmer). Potchefstroom: PUCHE.

  • Official spelling checker for Microsoft Office 2010.

LaraTK. (Project member, in collaboration with, amongst others, MJ Puttkammer and M Schlemmer). Potchefstroom: NWU. Available on request. 

  • Lexicon Annotation and Regulation Assistant for “Taalkommissie” is software developed for the Technical Committee on Standardisation (“Taalkommissie”) of the Pan South African Language Board (PanSALB) and the “Suid-Afrikaanse Akademie vir Wetenskap en Kuns”.
  • Tool for the annotation and editing of word-lists.

2007

Van Huyssteen, GB, Pilon, S, Puttkammer, MJ, Groenewald, HJ, Wissing, DP & Kotzé, GJ. 2007. ALEXANDER: Annotated lexical database for Afrikaans.

  • Annotated on phonological, morphological, syntactic and semantic levels.

TurboAnnotate. (Project leader, in collaboration with, amongst others, MJ Puttkammer and S Pilon). Potchefstroom: NWU. 

  • A freely available, open-source system for bootstrapping linguistic data for machine-learning purposes, or for manually creating gold standards or other annotated lists.

2006

Brits, KC, Pretorius, R & Van Huyssteen, GB. 2006. A Lemmatiser for Setswana.

2005

Afrikaanse Speltoetser 3.0, Tesourus 1.0 en Woordafbreker. (Project leader, in collaboration with, amongst others, ER Eiselen, S Pilon, MJ Puttkammer, MM van Zaanen and JC Muller). Potchefstroom: PUCHE.

  • Official spelling checker for Microsoft Office 2003.

isiXhosa Spelling Checker 1.0 & Hyphenator. (Project leader, in collaboration with, amongst others, K Podile, J Jones, MJ Puttkammer, DJ Prinsloo, L Pretorius and JC Muller). Potchefstroom: NWU. 

  • Official spelling checker for Microsoft Office 2003.

isiZulu Spelling Checker 1.0 & Hyphenator. (Project leader, in collaboration with, amongst others, SE Bosch, ER Eiselen, DJ Prinsloo, L Pretorius and JC Muller). Potchefstroom: NWU. 

  • Official spelling checker for Microsoft Office 2003.

Sesotho sa Leboa Spelling Checker 1.0 & Hyphenator. (Project leader, in collaboration with, amongst others, DJ Prinsloo, ER Eiselen, MJ Puttkammer and JC Muller). Potchefstroom: NWU. 

  • Official spelling checker for Microsoft Office 2003.

Setswana Spelling Checker 1.0 & Hyphenator. (Project leader, in collaboration with, amongst others, PM Sebate, ER Eiselen, MJ Puttkammer, DJ Prinsloo and JC Muller). Potchefstroom: NWU. 

  • Official spelling checker for Microsoft Office 2003.

CKARMA (“C5 KompositumAnaliseerder vir Robuuste Morfologiese Analise”). [C5 Compound Analyser for Robust Morphological Analysis]. (Project leader, in collaboration with, amongst others, MJ Puttkammer and MM van Zaanen). Potchefstroom: CTexT, North-West University.

CALOMO (“C5 Afrikaanse Lettergreepverdeler vir Outomatiese Morfologiese Ontleding”). [C5 Afrikaans Hyphenator for Automatic Morphological Analysis]. (Project leader, in collaboration with, amongst others, MJ Puttkammer and MM van Zaanen). Potchefstroom: CTexT, North-West University.

2004

RAGEL (“Reëlgebaseerde Afrikaanse Grondwoord- En Lemma-identifiseerder”). [Rule-Based Afrikaans Stemmer and Lemmatiser]. (Project leader, in collaboration with, amongst others, S Pilon). Potchefstroom: CTexT, North-West University.

2003

Afrikaanse Speltoetser 2.0 en Woordafbreker. (Project leader, in collaboration with, amongst others, ER Eiselen, S Pilon, MJ Puttkammer, MM van Zaanen and JC Muller). Potchefstroom: PUCHE. 

  • Official spelling checker for Microsoft Office 2003.
  • Awarded a silver award by “SA Computer Magazine” (2004).