- Dutch Language Union (Belgium, The Netherlands)
- Department of Arts and Culture (South Africa)
- National Research Foundation (South Africa) (Grant number: 81794)
- European Network on Word Structure (NetWordS) (European Science Foundation) (Grant number: 5570)
Gerhard B van Huyssteen - Project Coordinator & Linguistics
- CTexT (Centre for Text Technology), North-West University, South Africa
Walter Daelemans - Compound Semantics
- CLiPS (Computational Linguistics and Psycholinguistics), University of Antwerp, Belgium
Menno van Zaanen - Compound Splitting
- TiCC (Tilburg Centre for Cognition and Communication), Tilburg University, The Netherlands
- CLiPS (Computational Linguistics and Psycholinguistics), University of Antwerp, Belgium
North-West University (South Africa)
Roald Eiselen, Benito Trollip, Joani Liversage, Zandre Botha, Martin Puttkammer, Martin Schlemmer, Carli de Wet, Nadia Schultz, Nanette Van Den Berg, Sansi Eiselen
Tilburg University (The Netherlands)
Rick Smetsers, Nanne van Noord, Vincent Lichtenberg, Bas Goris, Sylvie Bruys, Suzanne Aussems
University of Antwerp (Belgium)
Natasja Loyens, Maxim Baetens
In many human language technology applications (e.g. machine translators, spelling checkers), concatenatively written compounds (e.g. “skrywerspen”/“schrijverspen” ‘writer’s pen’) are often processed incorrectly (e.g. not found in a lexicon). From a technological perspective, deficiencies in automatic compound segmentation are particularly problematic, since concatenative compounding is a highly productive process in many languages, including Dutch and Afrikaans. Although a compound splitter has already been developed for Afrikaans (Van Huyssteen and Van Zaanen, 2004), the reported accuracy of circa 90% could be improved, and the annotation protocol and data need to be revised.
More importantly, no stand-alone compound splitter for Dutch is available; existing research in this field is more than ten years old (e.g. Pohlmann and Kraaij, 1996), uses expensive resources (e.g. Ordelman et al., 2003), performs complete morphological analysis (e.g. De Pauw et al., 2004), and/or has not been released for re-use in the open-source domain. In subproject 1, we will therefore attempt to develop robust compound splitters for both Afrikaans and Dutch through a combination of technology recycling (Pilon et al., 2010) and data pooling (i.e. joining (converted) training material for the two languages into one training set), as well as experimentation with sequence classification (Van Zaanen & Gaustad, 2010; Van Zaanen et al., 2011).
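As a rough illustration of how compound splitting can be cast as a classification problem over pooled data, the minimal sketch below turns segmented compounds into per-character boundary labels and joins Afrikaans and Dutch items into one training set. The "+" notation, the treatment of the linking "s" as its own segment, and the function names `boundary_labels` and `pool` are assumptions made for this example; they are not the project's annotation format or tooling, and the Afrikaans-to-Dutch conversion step is left out.

```python
from typing import Iterable, List, Tuple


def boundary_labels(segmented: str, sep: str = "+") -> Tuple[str, List[int]]:
    """Turn a segmented compound into (word, labels).

    labels[i] == 1 means a constituent boundary follows character i.
    The "+" separator is an assumed notation, not the project's format.
    """
    parts = [p for p in segmented.split(sep) if p]  # ignore empty pieces
    word = "".join(parts)
    labels = [0] * len(word)
    pos = 0
    for part in parts[:-1]:  # no boundary after the last constituent
        pos += len(part)
        labels[pos - 1] = 1
    return word, labels


def pool(*datasets: Iterable[str]) -> List[Tuple[str, List[int]]]:
    """Data pooling: join the material of several languages into one training set."""
    return [boundary_labels(item) for data in datasets for item in data]


# Hypothetical one-item datasets built from the example compound in the text.
afrikaans = ["skrywer+s+pen"]
dutch = ["schrijver+s+pen"]
training_set = pool(afrikaans, dutch)
```

Each (word, labels) pair could then be fed to any sequence classifier; which learner performs best on the pooled data is exactly what the subproject sets out to test.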
In addition to segmentation, a second part of this proposed project will focus on the semantic analysis of compounds, i.e. determining that “boekrak” construes ‘case for books’, while “houtrak” means ‘case made of wood’. More advanced HLT applications, such as information extraction, question answering and machine translation systems, require proper semantic analysis of compounds. Internationally, research on automatic compound analysis has focused almost exclusively on English; no such work has been done for either Afrikaans or Dutch, so this proposed project will do pioneering work in this regard.
Although linguistic research on the topic has been done for both these languages, a uniform, cross-lingual framework does not exist yet, nor does an understanding of how compounding in these two languages differs systematically (see examples above). An attempt will therefore be made to consolidate existing research on both these languages (and other languages), and to postulate a cross-lingual annotation scheme compatible with the work of Ó Séaghdha (2008).
Since no semantic analyser exists for either language, in subproject 2 we will develop first-generation analysers for Afrikaans and Dutch simultaneously, using bootstrapping and data pooling (i.e. first develop a small training set of Afrikaans data, then train an Afrikaans analyser, then analyse Dutch data with the Afrikaans analyser, and subsequently join the data to train a next Afrikaans and/or Dutch analyser; this process continues in small increments until the desired performance has been reached). We will start with techniques that work well for English (based on distributional semantics and machine learning); see Hendrickx et al. (2010) for an overview of the current state of the art. We will try to improve these techniques and adapt them to the specific requirements of Afrikaans and Dutch.
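The bootstrapping and data pooling procedure described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the relation labels "FOR" and "MADE_OF", the names `train_majority` and `bootstrap`, and the `evaluate` callback are hypothetical, and the majority-class learner is a toy stand-in for a real analyser. Whether automatically analysed items are manually corrected before being pooled is not stated in the description; the sketch pools them as predicted.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]        # (compound, semantic relation label)
Model = Callable[[str], str]     # maps a compound to a predicted label


def train_majority(data: List[Example]) -> Model:
    """Toy stand-in for a real learner: always predicts the most frequent label."""
    label = Counter(lbl for _, lbl in data).most_common(1)[0][0]
    return lambda compound: label


def bootstrap(
    seed: List[Example],
    unlabelled_batches: Iterable[List[str]],
    train: Callable[[List[Example]], Model],
    evaluate: Callable[[Model], float],
    target: float,
) -> Tuple[Model, List[Example]]:
    """Train on the seed set, analyse new data, pool it, and retrain in small steps."""
    pooled = list(seed)
    model = train(pooled)
    for batch in unlabelled_batches:
        if evaluate(model) >= target:    # stop once desired performance is reached
            break
        pooled.extend((c, model(c)) for c in batch)  # analyse and pool the new items
        model = train(pooled)                        # train the next analyser
    return model, pooled


# Hypothetical seed (Afrikaans) and unlabelled (Dutch) data with made-up labels.
seed = [("boekrak", "FOR"), ("houtrak", "MADE_OF"), ("boekwinkel", "FOR")]
dutch_batches = [["boekenkast"], ["houtkachel"]]
model, pooled = bootstrap(seed, dutch_batches, train_majority,
                          evaluate=lambda m: 0.0, target=1.0)
```

Here `evaluate` always returns 0.0, so every batch is consumed; in practice it would score the current analyser on held-out annotated data for the language in question.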
- Daelemans, W., Buchholz, S. and Veenstra, J. 1999. Memory-Based Shallow Parsing. In: Proceedings of CoNLL-99. Bergen, Norway. June 12, 1999.
- Davel, M. and Barnard, E. 2004. A default-and-refinement approach to pronunciation prediction. In: Proceedings of PRASA. South Africa, November 2004, pp. 119–123.
- De Knop, S. and Dirven, R. 2008. Motion and location events in German, French and English: A typological, contrastive and pedagogical approach. In: De Knop, S. and De Rycker, T. (eds.) Cognitive Approaches to Pedagogical Grammar: A Volume in Honour of René Dirven. Berlin: Mouton de Gruyter.
- De Pauw, G., Laureys, T., Daelemans, W. and Van Hamme, H. 2004. A Comparison of Two Different Approaches to Morphological Analysis of Dutch. In: Proceedings of the Workshop of the ACL Special Interest Group on Computational Phonology (SIGPHON). Barcelona, Spain. pp. 62-69.
- Gast, V. forthcoming. Contrastive analysis: Theories and methods. In: Kortmann, B. and Kabatek, J. (eds.). Dictionaries of Linguistics and Communication Science: Linguistic theory and methodology. Berlin: Mouton de Gruyter.
- González, M. D. L. Á. G., Mackenzie, J. L. and Álvarez, E. M. G. 2008. Current Trends in Contrastive Linguistics: Functional and cognitive perspectives. Amsterdam: John Benjamins.
- Hendrickx, I, Kim, SM, Kozareva, Z, Nakov, P, Ó Séaghdha, D, Padó, S, Pennacchiotti, M, Romano, L & Szpakowicz, S. 2010. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals. In: Proceedings of the SemEval-2 Workshop. Uppsala, Sweden.
- Hüning, M. 2009. Semantic niches and analogy in word formation: Evidence from contrastive linguistics. Languages in Contrast. 9(2): 183-201.
- Hüning, M. 2010. Diachronie in de synchronie. Over contrastieve taalkunde en taal(veranderings)theorie. In: Fenoulhet, J. and Renkema, J. (eds.) Internationale neerlandistiek: een vak in beweging. Gent: Academia Press.
- Mitchell, T.M. 1997. Machine learning. Boston: McGraw-Hill.
- Ó Séaghdha, D. 2008. Learning compound noun semantics. Technical report 735. Cambridge: University of Cambridge.
- OECD. 2002. Proposed standard practice for surveys on research and experimental development (Frascati Manual). Eurostat.
- Ordelman, R., Van Hessen, A. and De Jong, F. 2003. Compound decomposition in Dutch large vocabulary speech recognition. In: Proceedings of Eurospeech 2003. Geneva, Switzerland. 225–228.
- Pilon, S, Van Huyssteen, GB and Augustinus, L. 2010. Converting Afrikaans to Dutch for technology recycling. In: Proceedings of the 21st Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-7992-2470-2. 22-23 November. Stellenbosch, South Africa. pp 219-224.
- Pohlmann, R and Kraaij, W. 1996. Improving the precision of a text retrieval system with compound analysis. In: Proceedings of the 7th Computational Linguistics in the Netherlands (CLIN 1996). pp. 115-129.
- Quinlan, J.R. 1987. Generating production rules from decision trees. In: McDermott, J. (ed.). Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87). pp. 304–307.
- Van Huyssteen, GB and Van Zaanen, MM. 2004. Learning Compound Boundaries for Afrikaans Spelling Checking. In: Proceedings of First Workshop on International Proofing Tools and Language Technologies. Patras, Greece. pp. 101-108.
- Van Huyssteen, GB. 2005. ’n Kognitiewe gebruiksgebaseerde beskrywingsmodel vir die Afrikaanse grammatika. [A Cognitive Usage-Based Description Model for Afrikaans Grammar]. Southern African Linguistics and Applied Language Studies. 23(2): pp. 125-137.
- Van Zaanen, M & Gaustad T. 2010. Grammatical Inference as Class Discrimination. In: Sempere, J & García, P. (eds.). Grammatical Inference: Theoretical Results and Applications. 6339, 245–257.
- Van Zaanen, M, Gaustad T & Feijen J. 2011. Influence of Size on Pattern-based Sequence Classification. In: Van der Putten, P, Veenman, C, Vanschoren, J, Israel, M & Blockeel, H. (eds.). Proceedings of the 20th Belgian-Dutch Conference on Machine Learning. The Hague, The Netherlands. pp 53–60.
- Veenstra, J., Van den Bosch, A., Buchholz, S., Daelemans, W. and Zavrel, J. 2000. Memory-Based Word Sense Disambiguation. Computers and the Humanities. 34(1-2): 171-177.
The primary aim of this project was to develop resources (including annotation protocols, and training and testing data) for the development of:
- robust compound splitters (subproject 1); and
- first-generation compound analysers (subproject 2);
for Afrikaans and Dutch, through a combination of cross-language transfer (i.e. technology recycling), data pooling, and various machine learning approaches.
Other secondary aims included:
- to report on the research and development process in the form of:
- one Master’s degree dissertation;
- two fourth-year students’ projects (mini-dissertations);
- at least two scholarly papers, to be published in relevant journals or peer-reviewed conference proceedings;
- various annotation protocols, made available publicly; and
- to contribute towards human capital development and growth of the pool of experts in descriptive linguistics and computational linguistics in South Africa, Belgium and The Netherlands by offering bursaries, grants or contract work to undergraduate and post-graduate students;
- to extend the collaboration network between North-West University (NWU), Tilburg University (TU) and University of Antwerp (UA), by introducing young scholars and students to each other (i.e. extending the existing collaboration beyond Van Huyssteen–Van Zaanen–Daelemans);
- to identify new research issues as they unfold in the research and development process; and
- to contribute to the HLT-enabling of the languages of South Africa.
Annotation Guidelines for Compound Analysis
Verhoeven, B., van Huyssteen, G., van Zaanen, M., & Daelemans, W. 2014. Annotation Guidelines for Compound Analysis. In: CLiPS Technical Report Series (CTRS), 5. ISSN: 2033-3544.
- Annotation Guidelines for Compound Segmentation.
- Annotation Guidelines for the Semantic Analysis of Noun-Noun Compounds in English, Dutch and Afrikaans. Including: Decision Tree and Paraphrasing Table
- Annotation Guidelines for the Semantic Analysis of Other Nominal Compounds in Dutch and Afrikaans. Specifically: Adjective-Noun, Verb-Noun, Quantifier-Noun and Preposition-Noun
Compound Semantics Dataset (compounds with semantic annotation)
- Afr-NN-FirstRound (1449 compounds)
- Afr-NN-SecondRound (2328 compounds)
- Afr-XN (4553 compounds)
- Ned-NN-FirstRound (1766 compounds)
- Ned-NN-SecondRound (2000 compounds)
- Ned-XN (600 compounds)
Compound Splitting Dataset (compounds annotated with constituent boundaries and linking elements)
- Afrikaans (25,266 compounds)
- Dutch (26,000 compounds)
- Verhoeven, B. 2012. A Computational Semantic Analysis of Noun Compounds in Dutch. MA Thesis, University of Antwerp, Belgium.
- Liversage, J. 2013. Verifiëring van semantiese verhoudings in Afrikaanse naamwoord-naamwoordsamestellings [Verification of semantic relations in Afrikaans noun-noun compounds]. Potchefstroom: North-West University.
- Trollip, B. 2013. Herbeskouing van die interfiks in Afrikaanse komposita [Reconsideration of the interfix in Afrikaans compounds]. Potchefstroom: North-West University.
- Van den Berg, N. 2013. Samestellings met en afleidings van meerledige eiename in Afrikaans en Nederlands [Compounds with and derivations of multiword proper names in Afrikaans and Dutch]. Potchefstroom: North-West University.
- Trollip, B. 2012. Die klassifikasiemoontlikhede van nie-prototipiese samestellings. [The classification possibilities of non-prototypical compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
- De Wet, C. 2012. Semantiese ontleding van Afrikaanse NN-samestellings. [Semantic analysis of Afrikaans NN-compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
- Schultz, N. 2012. Die ontwikkeling van 'n verteenwoordigende verwysende datastel van Afrikaanse samestellings. [The development of a representative referential dataset of Afrikaans compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
- Liversage, J. 2012. Voorgestelde protokol vir die verwerking van X+N samestellings. [Proposed protocol for the processing of X+N compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
Scalise, S. CompoNet. University of Bologna, Italy.
CompoNet is a descriptive compound database for 27 languages, including Dutch and Afrikaans.
Ó Séaghdha, D. Compound Noun Bibliography. University of Cambridge, United Kingdom.
Bibliography of computational and linguistic literature relating to compound nouns.