Stats calculators: Frequency information in VivA’s Afrikaans corpus collection

Van Huyssteen, Gerhard B. 2021. “Stats calculators: Frequency information in VivA’s Afrikaans corpus collection.” https://gerhard.pro/software/stats-calculators-frequency-info-viva/.

Introduction

Here I provide a number of word frequency calculators for some of the Afrikaans corpora in the Virtual Institute for Afrikaans’ (VivA) corpus portal. Each calculator is preloaded with the frequency of the most frequent word and the number of word types in the relevant corpus, based on the frequency counts available in the VivA Korpusportaal. These figures are updated regularly.

All of these calculators require as input the frequency of the word (or multiword item), F(n), in one or more of the corpora. For the tf-idf (term frequency–inverse document frequency), the number of documents in which the word occurs, F(d), is also required. All of these numbers can be obtained easily from VivA’s corpus portal.

Based on the input, the following results are calculated automatically:

  1. Relative frequency class (N) based on Perkuhn et al. (2012), plus its interpreted frequency category based on Van Huyssteen’s (2017b) proposal.
  2. Zipfian scale (Z) based on Van Heuven et al. (2014)
  3. Frequency per million words (fpmw)
  4. Frequency per thousand words (fptw)
  5. Frequency relative to most frequent word (f(n)”) (also called strengthened frequency)
  6. Term frequency–inverse document frequency (tf-idf)
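The six measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the calculators: the formulas are the commonly cited variants (frequency class as a rounded base-2 logarithm of the ratio to the most frequent word, and the Zipf value with the add-one smoothing that uses the number of word types), and the tf-idf shown is just one of several variants in use. The function names and the parameters `corpus_tokens`, `corpus_types` and `total_docs` are hypothetical names introduced here.

```python
import math


def frequency_class(f_word: int, f_max: int) -> int:
    """Frequency class N: floor(log2(f_max / f_word) + 0.5).

    f_max is the frequency of the most frequent word in the corpus.
    """
    return math.floor(math.log2(f_max / f_word) + 0.5)


def zipf_value(f_word: int, corpus_tokens: int, corpus_types: int) -> float:
    """Zipf value with add-one smoothing (assumed variant):
    log10((f + 1) / ((tokens + types) / 1 000 000)) + 3.
    """
    millions = (corpus_tokens + corpus_types) / 1_000_000
    return math.log10((f_word + 1) / millions) + 3


def per_million(f_word: int, corpus_tokens: int) -> float:
    """Frequency per million words (fpmw)."""
    return f_word * 1_000_000 / corpus_tokens


def per_thousand(f_word: int, corpus_tokens: int) -> float:
    """Frequency per thousand words (fptw)."""
    return f_word * 1_000 / corpus_tokens


def relative_to_max(f_word: int, f_max: int) -> float:
    """Frequency relative to the most frequent word."""
    return f_word / f_max


def tf_idf(f_word: int, corpus_tokens: int,
           docs_with_word: int, total_docs: int) -> float:
    """One common tf-idf variant (assumption): relative term frequency
    multiplied by log10(total_docs / docs_with_word)."""
    tf = f_word / corpus_tokens
    idf = math.log10(total_docs / docs_with_word)
    return tf * idf


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    f_n, f_d = 250, 40
    print(frequency_class(f_n, f_max=1_000))
    print(zipf_value(f_n, corpus_tokens=5_000_000, corpus_types=200_000))
    print(per_million(f_n, corpus_tokens=5_000_000))
    print(tf_idf(f_n, 5_000_000, f_d, total_docs=1_000))
```

In this sketch F(n) corresponds to `f_word` and F(d) to `docs_with_word`; the remaining values are the per-corpus constants that the calculators already have built in.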

In another post, you can also find generic versions of the first two calculators (N and Z), which you can use with corpora not included in the list below.

Instructions

  1. In VivA’s corpus portal, obtain the frequency (F(n)) of your search string (e.g. word form, lemma, etc.) in each of the corpora that you are interested in. If you want to calculate the tf-idf, also get the number of documents (F(d)) in which the word occurs in each corpus.
  2. For each corpus in the table below, enter the F(n) and/or F(d) in the white cells.
    • You don’t need to enter these frequencies for all the corpora – only for those you are interested in.
    • However, if you enter the frequencies of your search string in all the corpora, the total for that corpus collection will be calculated (in the row Total:). The two totals in the coloured rows are for either:
      • orange: corpora in VivA’s comprehensive corpus collection (omvattende versameling), excluding transcriptions of speech corpora; or
      • yellow: corpora in VivA’s exclusive corpus collection (eksklusiewe versameling), excluding corpora of historical texts.
    • The two corpora at the bottom (VSK and THT) are excluded from the calculations for the totals, since these corpora contain texts that are fundamentally different from those in the other corpora.
  3. Copy the table, or the relevant sections of it, into your article.
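The rule for the Total: rows in step 2 can be sketched as follows. This is a minimal illustration that assumes a total is only shown once a frequency has been entered for every corpus in the collection, and that VSK and THT never count towards it; how the calculators treat blank cells internally is not documented here. The corpus labels `A` and `B`, and the function name `collection_total`, are hypothetical.

```python
from typing import Dict, Optional, Set

# Corpora that never contribute to a collection total (named in step 2).
EXCLUDED: Set[str] = {"VSK", "THT"}


def collection_total(freqs: Dict[str, Optional[int]],
                     members: Set[str]) -> Optional[int]:
    """Sum F(n) over the corpora in a collection.

    Returns None while any included corpus still lacks a frequency,
    mirroring the assumption that a total needs all cells filled in.
    """
    included = members - EXCLUDED
    values = [freqs.get(name) for name in included]
    if any(value is None for value in values):
        return None
    return sum(values)


if __name__ == "__main__":
    entered = {"A": 3, "B": 4, "VSK": 100}
    print(collection_total(entered, {"A", "B", "VSK"}))
```

The resulting total can then be used as F(n) for the collection as a whole in the same way as a single-corpus frequency.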
Abbreviations

Links

A multitude of online calculators for corpus linguistics are available, such as Lancaster Stats Tools online, and Paul Rayson’s Log-likelihood and effect size calculator (to name but a few).

References

  • Perkuhn, R., H. Keibel, and M. Kupietz. 2012. Korpuslinguistik. Paderborn: Wilhelm Fink Verlag.
  • Van Heuven, W. J. B., P. Mandera, E. Keuleers, and M. Brysbaert. 2014. “Subtlex-UK: A new and improved word frequency database for British English.” Quarterly Journal of Experimental Psychology 67: 1176–1190.

In addition to the descriptions by the original authors, you can find descriptions in Afrikaans in the following publications:

English: Afrikaans; calculator; corpus linguistics; online calculator; statistics; Virtual Institute for Afrikaans; VivA; word frequency; word frequency class; Zipf; Zipf scale

Afrikaans: aanlyn berekenaar; Afrikaans; berekenaar; korpuslinguistiek; statistiek; Virtuele Instituut vir Afrikaans; VivA; woordfrekwensie; woordfrekwensieklas; Zipf; Zipfskaal

English: A number of word frequency calculators are provided for Afrikaans corpora in the Virtual Institute for Afrikaans’ (VivA) corpus portal.


Afrikaans: ’n Aantal woordfrekwensieberekenaars word vir Afrikaanse korpusse in die Virtuele Instituut vir Afrikaans (VivA) se korpusportaal voorsien.

In: English

On: Afrikaans