Stats calculators: Frequency information in VivA’s Afrikaans corpus collection

Van Huyssteen, Gerhard B. 2021. “Stats calculators: Frequency information in VivA’s Afrikaans corpus collection.” https://gerhard.pro/software/stats-calculators-frequency-info-viva/.

Introduction

Here I provide a number of word frequency calculators for some of the Afrikaans corpora in the corpus portal of the Virtual Institute for Afrikaans (VivA). The calculators come preloaded with the frequency of the most frequent word and the number of word types in each corpus, based on the frequency counts available in the VivA Korpusportaal. These figures are updated regularly.

All of these calculators require as input the frequency F(n) of the word (or multiword item) in one or more of the corpora. For tf-idf (term frequency–inverse document frequency), the number of documents F(d) in which the word occurs is also required. All of these numbers can easily be obtained from VivA’s corpus portal.

Based on the input, the following results are calculated automatically:

Relative frequency class (N) based on Perkuhn et al. (2012), plus its interpreted frequency category based on Van Huyssteen’s (2017b) proposal.
Zipfian scale (Z) based on Van Heuven et al. (2014)
Frequency per million words (fpmw)
Frequency per thousand words (fptw)
Frequency relative to most frequent word (f(n)”) (also called strengthened frequency)
Term frequency–inverse document frequency (tf-idf)
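
As a rough illustration of what such calculators compute, here is a minimal Python sketch of the six measures. It is not the code behind the calculators on this page. The function and parameter names are hypothetical, and the formulas are the commonly cited ones: the frequency class as floor(log2(f_max / f) + 0.5), and the Zipf value as log10 of the smoothed frequency per million words plus 3. The tf-idf variant (term frequency relative to corpus size, base-10 logarithm) is an assumption, since several variants are in use.

```python
import math


def frequency_class(f_n: int, f_max: int) -> int:
    """Frequency class N: the most frequent word is roughly 2**N times more frequent."""
    return math.floor(math.log2(f_max / f_n) + 0.5)


def per_million(f_n: int, corpus_tokens: int) -> float:
    """Frequency per million words (fpmw)."""
    return f_n / corpus_tokens * 1_000_000


def per_thousand(f_n: int, corpus_tokens: int) -> float:
    """Frequency per thousand words (fptw)."""
    return f_n / corpus_tokens * 1_000


def zipf_value(f_n: int, corpus_tokens: int, corpus_types: int) -> float:
    """Zipf value Z with add-one smoothing over tokens and types."""
    smoothed_fpmw = (f_n + 1) / ((corpus_tokens + corpus_types) / 1_000_000)
    return math.log10(smoothed_fpmw) + 3.0


def relative_to_most_frequent(f_n: int, f_max: int) -> float:
    """Frequency of the word relative to the most frequent word in the corpus."""
    return f_n / f_max


def tf_idf(f_n: int, corpus_tokens: int, f_d: int, corpus_documents: int) -> float:
    """tf-idf, assuming tf = F(n) / tokens and idf = log10(documents / F(d))."""
    tf = f_n / corpus_tokens
    idf = math.log10(corpus_documents / f_d)
    return tf * idf
```

Corpus size, number of types, number of documents and the frequency of the most frequent word are passed in as arguments here. In the calculators they are the values already included for each corpus.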


In another post, you can also find generic versions of the first two calculators (N and Z), which you can use with corpora not included in the list below.

Instructions

In VivA’s corpus portal, obtain the frequency (F(n)) of your search string (e.g. a word form or lemma) in each of the corpora that you are interested in. If you want to calculate the tf-idf, also get the number of documents (F(d)) in which the word occurs in each corpus.
For each corpus in the table below, enter the F(n) and/or F(d) in the white cells.
- You don’t need to enter these frequencies for all the corpora – only for those you are interested in.
- However, if you enter the frequencies of your search string in all the corpora, the total for that corpus collection will be calculated (in the row Total:). The two totals in the coloured rows are for either:
  - orange: corpora in VivA’s comprehensive corpus collection (omvattende versameling), excluding transcriptions of speech corpora; or
  - yellow: corpora in VivA’s exclusive corpus collection (eksklusiewe versameling), excluding corpora of historical texts.
- The two corpora at the bottom (VSK and THT) are excluded from the calculations for the totals, since these corpora contain texts that are fundamentally different from those in the other corpora.
Copy the table, or the relevant sections of it, into your article.
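
The totalling rule in the instructions above can be sketched in Python as follows. This is only an illustration under stated assumptions. The function name and the example corpus labels are hypothetical, and returning no total until every counted corpus has a frequency is my reading of the instruction that totals are calculated once frequencies have been entered for all the corpora. Only the exclusion of VSK and THT comes directly from the text.

```python
from typing import Dict, Optional

# The two corpora that the instructions exclude from the totals.
EXCLUDED_FROM_TOTALS = {"VSK", "THT"}


def collection_total(frequencies: Dict[str, Optional[int]]) -> Optional[int]:
    """Sum F(n) over the corpora that count towards a collection total.

    Returns None while any counted corpus still has no frequency entered.
    """
    counted = [
        f_n for name, f_n in frequencies.items()
        if name not in EXCLUDED_FROM_TOTALS
    ]
    if any(f_n is None for f_n in counted):
        return None
    return sum(counted)


# Hypothetical corpus labels, for illustration only.
entered = {"CORPUS_A": 120, "CORPUS_B": 35, "VSK": 400, "THT": None}
total = collection_total(entered)
```

A collection's total frequency can then be fed into the same measures as a single corpus, given the matching token and type counts for that collection.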

Abbreviations

Links

Many other online calculators for corpus linguistics are available, such as Lancaster Stats Tools online and Paul Rayson’s Log-likelihood and effect size calculator.

References

Perkuhn, R., Keibel, H. & Kupietz, M. 2012. Korpuslinguistik. Paderborn: Wilhelm Fink Verlag.
Van Heuven, W. J. B., P. Mandera, E. Keuleers, and M. Brysbaert. 2014. “SUBTLEX-UK: A new and improved word frequency database for British English.” Quarterly Journal of Experimental Psychology 67: 1176-1190.

In addition to the descriptions by the original authors, you can find descriptions in Afrikaans in the following publications:

Regarding N: Van Huyssteen, Gerhard B. 2017b. “Die aard, doel en omvang van die Afrikaanse woordelys en spelreëls. Deel 1 [The nature, goal and scope of the Afrikaanse woordelys en spelreëls. Part 1].” Tydskrif vir Geesteswetenskappe 57 (2-1): 323-345. https://doi.org/10.17159/2224-7912/2017/v57n2-1a7.
Regarding Z: Van Huyssteen, Gerhard B. 2018. “’n Korpusondersoek na ‘huidiglik’ [A corpus exploration of ‘huidiglik’].” Literator 39 (2): a1527. https://doi.org/10.4102/lit.v39i2.1527.