Project: Automatic Compound Processing
ABBREVIATION
AuCoPro
DURATION
2012-2014
FUNDED BY
- Dutch Language Union (Belgium, The Netherlands)
- Department of Arts and Culture (South Africa)
- National Research Foundation (South Africa) (Grant number: 81794)
- European Network on Word Structure (NetWordS) (European Science Foundation) (Grant number: 5570)
PROJECT URLS
PROJECT LEADERS
- GERHARD B VAN HUYSSTEEN – PROJECT COORDINATOR & LINGUISTICS
CTexT (Centre for Text Technology), North-West University, South Africa
- WALTER DAELEMANS – COMPOUND SEMANTICS
CLiPS (Computational Linguistics and Psycholinguistics), University of Antwerp, Belgium
- MENNO VAN ZAANEN – COMPOUND SPLITTING
TiCC (Tilburg Centre for Cognition and Communication), University of Tilburg, The Netherlands
PROJECT COLLABORATORS
- BEN VERHOEVEN
CLiPS (Computational Linguistics and Psycholinguistics), University of Antwerp, Belgium
- NORTH-WEST UNIVERSITY (SOUTH AFRICA)
Roald Eiselen, Benito Trollip, Joani Liversage, Zandre Botha, Martin Puttkammer, Martin Schlemmer, Carli de Wet, Nadia Schultz, Nanette Van Den Berg, Sansi Eiselen
- TILBURG UNIVERSITY (THE NETHERLANDS)
Rick Smetsers, Nanne van Noord, Vincent Lichtenberg, Bas Goris, Sylvie Bruys, Suzanne Aussems
- UNIVERSITY OF ANTWERP (BELGIUM)
Natasja Loyens, Maxim Baetens
OVERVIEW
Many human language technology applications (e.g. machine translators, spelling checkers) often process concatenatively written compounds (e.g. “skrywerspen”/“schrijverspen” ‘writer’s pen’) incorrectly (e.g. the compound is not found in a lexicon). From a technological perspective, deficiencies in automatic compound segmentation are particularly problematic, since concatenative compounding is a highly productive process in many languages, including Dutch and Afrikaans. Although a compound splitter has already been developed for Afrikaans (Van Huyssteen and Van Zaanen, 2004), its reported accuracy of circa 90% could be improved, and the annotation protocol and data need to be revised.
More importantly, no stand-alone compound splitter for Dutch is available; research that has been done in this field is more than ten years old (e.g. Pohlmann and Kraaij, 1996), uses expensive resources (e.g. Ordelman et al., 2003), does complete morphological analysis (e.g. De Pauw et al., 2004), and/or has not been released for re-use in the open-source domain. In subproject 1, we will therefore attempt to develop robust compound splitters for both Afrikaans and Dutch through a combination of technology recycling (Pilon et al., 2010) and data pooling (i.e. joining (converted) training material for the two languages in one training set), as well as experimentation with sequence classification (Van Zaanen & Gaustad, 2010; Van Zaanen et al., 2011).
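As a rough illustration of how compound splitting can be cast as sequence classification, the sketch below turns a segmented compound into one boundary label per character and pools Afrikaans and Dutch examples into a single training set. The `+` segmentation notation, the function names `to_labels` and `split`, and the choice to treat the linking element as its own segment are assumptions made for this example; they are not taken from the project's annotation protocol or tools.

```python
from typing import List, Tuple


def to_labels(segmented: str, sep: str = "+") -> Tuple[str, List[int]]:
    """Convert a segmented compound into (surface word, boundary labels).

    A label of 1 means "a boundary follows this character"; 0 means no boundary.
    The `sep` notation is hypothetical, chosen only for this sketch.
    """
    chars: List[str] = []
    labels: List[int] = []
    for ch in segmented:
        if ch == sep:
            if labels:              # ignore a separator at the very start
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def split(word: str, labels: List[int]) -> List[str]:
    """Rebuild the constituents from a word and its per-character labels."""
    parts: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            parts.append(word[start:i + 1])
            start = i + 1
    if start < len(word):           # remaining characters form the last part
        parts.append(word[start:])
    return parts


# Data pooling: join the (toy) material for both languages in one training set.
afrikaans = ["skrywer+s+pen"]
dutch = ["schrijver+s+pen"]
pooled = [to_labels(item) for item in afrikaans + dutch]

for word, labels in pooled:
    print(word, labels, split(word, labels))
```

A sequence classifier would then be trained to predict the label for each character from its context; that learning step is left out here because the text above does not specify it.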
In addition to segmentation, another part of this proposed project will focus on the semantic analysis of compounds, i.e. determining that “boekrak” means ‘case for books’, while “houtrak” means ‘case made of wood’. More advanced HLT applications, such as information extraction, question answering and machine translation systems, require proper semantic analysis of compounds. Internationally, research on automatic compound analysis has focused almost exclusively on English; no such work has been done for either Afrikaans or Dutch, so this proposed project will be pioneering.
Although linguistic research on the topic has been done for both these languages, a uniform, cross-lingual framework does not exist yet, nor does an understanding of how compounding in these two languages differs systematically (see examples above). An attempt will therefore be made to consolidate existing research on both these languages (and other languages), and to postulate a cross-lingual annotation scheme compatible with the work of Ó Séaghdha (2008).
Since no semantic analyser exists for either language, in subproject 2 we will develop first-generation analysers for Afrikaans and Dutch simultaneously, using bootstrapping and data pooling (i.e. first develop a small training set of Afrikaans data, then train an Afrikaans analyser, then analyse Dutch data with the Afrikaans analyser, and subsequently join the data to train the next Afrikaans and/or Dutch analyser; this process continues in small increments until the desired performance has been reached). We will start with techniques that work well for English (based on distributional semantics and machine learning); see Hendrickx et al. (2010) for an overview of the current state of the art. We will try to improve these techniques and adapt them to the specific requirements of Afrikaans and Dutch.
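The bootstrapping and data pooling procedure described above can be sketched as a simple loop. This is only a minimal illustration under stated assumptions: the toy learner (majority label per modifier), the relation labels `PURPOSE` and `MATERIAL`, the `review` hook for optional manual correction, and the example compounds are all hypothetical and stand in for the distributional and machine learning techniques the project actually set out to use.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Compound = Tuple[str, str]        # (modifier, head), e.g. ("boek", "rak")
Labelled = Tuple[Compound, str]   # a compound plus a semantic relation label
Model = Callable[[Compound], str]


def train(data: List[Labelled]) -> Model:
    """Toy learner: majority label per modifier, else the overall majority."""
    overall = Counter(label for _, label in data)
    per_modifier: Dict[str, Counter] = defaultdict(Counter)
    for (modifier, _), label in data:
        per_modifier[modifier][label] += 1
    default = overall.most_common(1)[0][0]

    def predict(compound: Compound) -> str:
        counts = per_modifier.get(compound[0])
        return counts.most_common(1)[0][0] if counts else default

    return predict


def accuracy(model: Model, gold: List[Labelled]) -> float:
    """Fraction of held-out compounds that receive the gold label."""
    if not gold:
        return 0.0
    return sum(model(c) == label for c, label in gold) / len(gold)


def bootstrap(seed: List[Labelled],
              batches: List[List[Compound]],
              dev: List[Labelled],
              target: float,
              review: Callable[[Compound, str], str] = lambda c, label: label,
              ) -> Tuple[Model, List[Labelled]]:
    """Grow the pooled training set in small increments.

    Each round analyses one new batch with the current model, optionally
    passes the predictions through `review` (a hypothetical correction step),
    joins them to the pool and retrains, until `target` accuracy is reached
    on `dev` or no batches are left.
    """
    pool = list(seed)
    model = train(pool)
    for batch in batches:
        if accuracy(model, dev) >= target:
            break
        pool += [(c, review(c, model(c))) for c in batch]
        model = train(pool)
    return model, pool


# Toy data only: a small Afrikaans seed set and unlabelled Dutch batches.
seed_afrikaans = [(("boek", "rak"), "PURPOSE"),
                  (("hout", "rak"), "MATERIAL"),
                  (("koffie", "beker"), "PURPOSE")]
dutch_batches = [[("hout", "blok")], [("boek", "handel")]]
dev_set = [(("boek", "kast"), "PURPOSE"), (("steen", "muur"), "MATERIAL")]

model, pool = bootstrap(seed_afrikaans, dutch_batches, dev_set, target=0.9)
print(len(pool), accuracy(model, dev_set))
```

Keeping the learner behind a plain `train` function means a stronger classifier could be swapped in without changing the loop itself.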
REFERENCES
- Daelemans, W., Buchholz, S. and Veenstra, J. 1999. Memory-Based Shallow Parsing. Proceedings of CoNLL-99, Bergen, Norway. June 12, 1999.
- Davel, M. and Barnard, E. 2004. A default-and-refinement approach to pronunciation prediction. In: Proceedings of PRASA. South Africa, November 2004, pp. 119–123.
- De Knop, S. and Dirven, R. 2008. Motion and location events in German, French and English: A typological, contrastive and pedagogical approach. In: De Knop, S. and De Rycker, T. (eds.) Cognitive Approaches to Pedagogical Grammar: A Volume in Honour of René Dirven. Berlin: Mouton de Gruyter.
- De Pauw, G., Laureys, T., Daelemans, W. and Van Hamme, H. 2004. A Comparison of Two Different Approaches to Morphological Analysis of Dutch. In: Proceedings of the Workshop of the ACL Special Interest Group on Computational Phonology (SIGPHON). Barcelona, Spain. pp. 62-69.
- Gast, V. forthcoming. Contrastive analysis: Theories and methods. In: Kortmann, B. and Kabatek, J. (eds.). Dictionaries of Linguistics and Communication Science: Linguistic theory and methodology. Berlin: Mouton de Gruyter.
- González, M. D. L. Á. G., Mackenzie, J. L. and Álvarez, E. M. G. 2008. Current Trends in Contrastive Linguistics: Functional and cognitive perspectives. Amsterdam: John Benjamins.
- Hendrickx, I, Kim, SM, Kozareva, Z, Nakov, P, Ó Séaghdha, D, Padó, S, Pennacchiotti, M, Romano, L & Szpakowicz, S. 2010. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals. In: Proceedings of the SemEval-2 Workshop. Uppsala, Sweden.
- Hüning, M. 2009. Semantic niches and analogy in word formation: Evidence from contrastive linguistics. Languages in Contrast. 9(2): 183-201.
- Hüning, M. 2010. Diachronie in de synchronie. Over contrastieve taalkunde en taal(veranderings)theorie. In: Fenoulhet, J. and Renkema, J. (eds.) Internationale neerlandistiek: een vak in beweging. Gent: Academia Press.
- Mitchell, T.M. 1997. Machine learning. Boston: McGraw-Hill.
- Ó Séaghdha, D. 2008. Learning compound noun semantics. Technical report 735. Cambridge: University of Cambridge.
- OECD. 2002. Proposed standard practice for surveys on research and experimental development (Frascati Manual). Eurostat.
- Ordelman, R., Van Hessen, A. and De Jong, F. 2003. Compound decomposition in Dutch large vocabulary speech recognition. In: Proceedings of Eurospeech 2003. Geneva, Switzerland. 225–228.
- Pilon, S, Van Huyssteen, GB and Augustinus, L. 2010. Converting Afrikaans to Dutch for technology recycling. In: Proceedings of the 21st Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-7992-2470-2. 22-23 November. Stellenbosch, South Africa. pp 219-224.
- Pohlmann, R and Kraaij, W. 1996. Improving the precision of a text retrieval system with compound analysis. In: Proceedings of the 7th Computational Linguistics in the Netherlands (CLIN 1996). pp. 115-129.
- Quinlan, J.R. 1987. Generating production rules from decision trees. In: McDermott, J. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87): 304–307.
- Van Huyssteen, GB and Van Zaanen, MM. 2004. Learning Compound Boundaries for Afrikaans Spelling Checking. In: Proceedings of First Workshop on International Proofing Tools and Language Technologies. Patras, Greece. pp. 101-108.
- Van Huyssteen, GB. 2005. ’n Kognitiewe gebruiksgebaseerde beskrywingsmodel vir die Afrikaanse grammatika. [A Cognitive Usage-Based Description Model for Afrikaans Grammar]. Southern African Linguistics and Applied Language Studies. 23(2): pp. 125-137.
- Van Zaanen, M & Gaustad T. 2010. Grammatical Inference as Class Discrimination. In: Sempere, J & García, P. (eds.). Grammatical Inference: Theoretical Results and Applications. 6339, 245–257.
- Van Zaanen, M, Gaustad T & Feijen J. 2011. Influence of Size on Pattern-based Sequence Classification. In: Van der Putten, P, Veenman, C, Vanschoren, J, Israel, M & Blockeel, H. (eds.). Proceedings of the 20th Belgian-Dutch Conference on Machine Learning. The Hague, The Netherlands. pp 53–60.
- Veenstra, J., Van den Bosch, A., Buchholz, S., Daelemans, W. and Zavrel, J. 2000. Memory-Based Word Sense Disambiguation. Computers and the Humanities. 34(1-2): 171-177.
AIMS
The primary aim of this project was to develop resources (including annotation protocols, and training and testing data) for the development of:
- robust compound splitters (subproject 1); and
- first-generation compound analysers (subproject 2);
for Afrikaans and Dutch, through a combination of cross-language transfer (i.e. technology recycling), data pooling, and various machine learning approaches.
Secondary aims included:
- to report on the research and development process in the form of:
  - one Master’s degree dissertation;
  - two fourth-year students’ projects (mini-dissertations);
  - at least two scholarly papers, to be published in relevant journals or peer-reviewed conference proceedings;
  - various annotation protocols, made available publicly;
- to contribute towards human capital development and growth of the pool of experts in descriptive linguistics and computational linguistics in South Africa, Belgium and The Netherlands by offering bursaries, grants or contract work to undergraduate and post-graduate students;
- to extend the collaboration network between North-West University (NWU), Tilburg University (TU) and University of Antwerp (UA), by introducing young scholars and students to each other (i.e. extending the existing collaboration beyond Van Huyssteen–Van Zaanen–Daelemans);
- to identify new research issues as they unfold in the research and development process; and
- to contribute to the HLT-enabling of the languages of South Africa.
OUTPUTS
PEER-REVIEWED PUBLICATIONS
- Aussems, S, Goris, B, Lichtenberg, V, Van Noord, N, Smetsers, R, & Van Zaanen, M. 2013. Unsupervised identification of compounds. In: Proceedings of the 22nd Annual Belgian-Dutch Conference on Machine Learning (Benelearn). 3 June. Nijmegen, The Netherlands.
- Botha, Z., Eiselen, R., & Van Huyssteen, G. 2013. Automatic Compound Semantic Analysis using Wordnets. In: Proceedings of the Twenty-Fourth Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-86970-771-5. 3 December. Pretoria, South Africa. pp. 1-6.
- Van Zaanen, M, Van Huyssteen, GB, Aussems, S, Emmery, C, & Eiselen, R. 2014. The Development of Dutch and Afrikaans Language Resources for Compound Boundary Analysis. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). May. Reykjavik, Iceland.
- Van Huyssteen, GB. 2014. Morfologie. In: Carstens, WAM & Bosman, N. (reds.). Kontemporêre Afrikaanse Taalkunde. ISBN 978-0-62703-019-2. Pretoria: Van Schaik Uitgewers. pp. 171-208.
- Van Huyssteen, GB & Verhoeven, B. 2014. A Taxonomy for Afrikaans and Dutch compounds. In: Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014): The First Workshop on Computational Approaches to Compound Analysis (ComAComA). ISBN: 978-1-873769-43-0. 21-22 August. Dublin, Ireland. pp. 31-40.
- Verhoeven, B., & Daelemans, W. 2013. Semantic Classification of Dutch Noun-Noun Compounds: A Distributional Semantics Approach. In: CLIN Journal, 3: 2-18. ISSN: 2211-4009.
- Verhoeven, B., Daelemans, W., & van Huyssteen, G.B. 2012. Classification of Noun-Noun Compound Semantics in Dutch and Afrikaans. In: Proceedings of the Twenty-Third Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-620-54601-0. 29-30 November. Pretoria, South Africa. pp. 121-125.
- Verhoeven, B, & Van Huyssteen, GB. 2013. More Than Only Noun-Noun Compounds: Towards an annotation scheme for the semantic modelling of other noun compound types. In: Proceedings of the Ninth Joint ACL – ISO Workshop on Interoperable Semantic Annotation. 19-20 March. Potsdam, Germany.
- Verhoeven, B, Van Zaanen, MM, Daelemans, W & Van Huyssteen, GB. 2014. Automatic compound processing: Compound splitting and semantic analysis for Afrikaans and Dutch. In: Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014): The First Workshop on Computational Approaches to Compound Analysis (ComAComA). ISBN: 978-1-873769-43-0. 21-22 August. Dublin, Ireland. pp. 20-30.
RESOURCES
ANNOTATION GUIDELINES FOR COMPOUND ANALYSIS
- Verhoeven, B., van Huyssteen, G., van Zaanen, M., & Daelemans, W. 2014. Annotation Guidelines for Compound Analysis. In: CLiPS Technical Report Series (CTRS), 5. ISSN: 2033-3544.
- Annotation Guidelines for Compound Segmentation.
- Annotation Guidelines for the Semantic Analysis of Noun-Noun Compounds in English, Dutch and Afrikaans. Including: Decision Tree and Paraphrasing Table.
- Annotation Guidelines for the Semantic Analysis of Other Nominal Compounds in Dutch and Afrikaans. Specifically: Adjective-Noun, Verb-Noun, Quantifier-Noun and Preposition-Noun.
COMPOUND SEMANTICS DATASET (COMPOUNDS WITH SEMANTIC ANNOTATION)
Afrikaans
- Afr-NN-FirstRound (1449 compounds)
- Afr-NN-SecondRound (2328 compounds)
- Afr-XN (4553 compounds)
Dutch
- Ned-NN-FirstRound (1766 compounds)
- Ned-NN-SecondRound (2000 compounds)
- Ned-XN (600 compounds)
COMPOUND SPLITTING DATASET (COMPOUNDS ANNOTATED WITH CONSTITUENT BOUNDARIES AND LINKING ELEMENTS)
- Afrikaans (25,266 compounds)
- Dutch (26,000 compounds)
TALKS
- Aussems, S, Bruys, S, Goris, B, Lichtenberg, V, Van Noord, N, Smetsers, R, & Van Zaanen, M. 2013. Automatically Identifying Compounds. Presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands. 18 January 2013.
- Liversage, J, & Van Huyssteen, GB. 2013. Verifiëring van semantiese verhoudings in Afrikaanse naamwoord-naamwoordsamestellings. [Verification of semantic relations in Afrikaans noun-noun compounds.] Presented at South African Microlinguistics Workshop (SAMWOP 2013), Vanderbijlpark, South Africa. 1 November 2013.
- Trollip, B, & Van Huyssteen, G.B. 2013. Herbeskouing van die interfiks in Afrikaans. [Reconsideration of the interfix in Afrikaans.] Presented at South African Microlinguistics Workshop (SAMWOP 2013), Vanderbijlpark, South Africa. 1 November 2013.
- Van den Berg, N, & Van Huyssteen, GB. 2013. Samestellings met en afleidings van meerledige eiename. [Compounds with and derivations of multi-part proper names.] Presented at South African Microlinguistics Workshop (SAMWOP 2013), Vanderbijlpark, South Africa. 1 November 2013.
- Van Huyssteen, GB, Verhoeven, B, & Daelemans, W. 2013. Bringing together interdisciplinary perspectives on compound semantics: Examples from Afrikaans and Dutch in the CompoNet database. Presented at South African Microlinguistics Workshop (SAMWOP 2013), Vanderbijlpark, South Africa. 1 November 2013.
- Van Huyssteen, GB & Verhoeven, B. 2014. A Taxonomy for Afrikaans and Dutch compounds. Presented at the 25th International Conference on Computational Linguistics (COLING 2014): The First Workshop on Computational Approaches to Compound Analysis (ComAComA). 21-22 August. Dublin, Ireland.
- Van Zaanen, M. 2012. Automatic Compound Processing (AuCoPro) – Identification for Segmentation. Presented at ATILA 2012, Groesbeek, The Netherlands. 23 November 2012.
- Van Zaanen, M, Van Huyssteen, GB, Aussems, S, Emmery, C & Eiselen, R. 2014. The development of Dutch and Afrikaans language resources for compound boundary analysis. Presented at the 9th International Conference on Language Resources and Evaluation (LREC 2014). 26-31 May. Reykjavik, Iceland.
- Verhoeven, B, & Daelemans, W. 2012. Automatic Compound Processing (AuCoPro) – Semantic Analysis. Presented at ATILA 2012, Groesbeek, The Netherlands. 23 November 2012.
- Verhoeven, B, Daelemans, W, & Van Huyssteen, GB. 2013. Semantic Classification of Dutch and Afrikaans Noun-Noun Compounds. Presented at the 5th Workshop on African Language Technology (AfLaT 2013), Ghent, Belgium. 6 December 2013.
- Verhoeven, B, Daelemans, W, & Van Huyssteen, GB. 2013. Semantic Classification of Dutch and Afrikaans Noun-Noun Compounds. Presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands. 18 January 2013.
- Verhoeven, B, Van Huyssteen, GB, & Daelemans, W. 2013. Samenstellingen in het Afrikaans en Nederlands: Automatische semantische analyse en taalkundige implicaties. [Compounding in Afrikaans and Dutch: Automatic semantic analysis and linguistic implications.] Presented at Graduate Conference of the Department of Linguistics, University of Antwerp, Belgium. 2 October 2013.
- Verhoeven, B, Van Huyssteen, GB, & Daelemans, W. 2012. AuCoPro: Project Presentation and Recent Developments. Presented at Centre for Text Technology (CTexT), North-West University. Potchefstroom, South Africa. 7 September 2012.
- Verhoeven, B, Van Zaanen, MM, Daelemans, W & Van Huyssteen, GB. 2014. Automatic compound processing: Compound splitting and semantic analysis for Afrikaans and Dutch. Presented at the 25th International Conference on Computational Linguistics (COLING 2014): The First Workshop on Computational Approaches to Compound Analysis (ComAComA). 21-22 August. Dublin, Ireland.
DISSERTATIONS (UNPUBLISHED)
MASTERS
- Verhoeven, B. 2012. A Computational Semantic Analysis of Noun Compounds in Dutch. MA Thesis, University of Antwerp, Belgium.
HONOURS
- Liversage, J. 2013. Verifiëring van semantiese verhoudings in Afrikaanse naamwoord-naamwoordsamestellings [Verification of semantic relations in Afrikaans noun-noun compounds]. Potchefstroom: North-West University.
- Trollip, B. 2013. Herbeskouing van die interfiks in Afrikaanse komposita [Reconsideration of the interfix in Afrikaans compounds]. Potchefstroom: North-West University.
- Van den Berg, N. 2013. Samestellings met en afleidings van meerledige eiename in Afrikaans en Nederlands [Compounds with and derivations of multiword proper names in Afrikaans and Dutch]. Potchefstroom: North-West University.
BACHELORS
- Trollip, B. 2012. Die klassifikasiemoontlikhede van nie-prototipiese samestellings. [The classification possibilities of non-prototypical compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
- De Wet, C. 2012. Semantiese ontleding van Afrikaanse NN-samestellings. [Semantic analysis of Afrikaans NN-compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
- Schultz, N. 2012. Die ontwikkeling van ’n verteenwoordigende verwysende datastel van Afrikaanse samestellings. [The development of a representative referential dataset of Afrikaans compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
- Liversage, J. 2012. Voorgestelde protokol vir die verwerking van X+N samestellings. [Proposed protocol for the processing of X+N compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
RELATED PROJECTS/LINKS
- Scalise, S. CompoNet. University of Bologna, Italy.
CompoNet is a descriptive compound database for 27 languages, including Dutch and Afrikaans.
- Ó Séaghdha, D. Compound Noun Bibliography. University of Cambridge, United Kingdom.
Bibliography of computational and linguistic literature relating to compound nouns.