Preprocessing the Norwegian Web as Corpus (NoWaC) in R

The present script can be used to preprocess data from a frequency list of the Norwegian Web as Corpus (NoWaC; Guevara, 2010).

Before using the script, download the frequency list from this URL. The list is described as ‘frequency list sorted primary alphabetic and secondary by frequency within each character’, and this is the direct URL. The download requires signing in to an institutional network. Finally, unzip the downloaded file.
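Once the file is unzipped, loading it into R could look roughly like the sketch below. This is an illustration only, not the script referred to in this post. It assumes a layout that has not been checked against the real file: one entry per line, with a frequency count followed by a word form, separated by whitespace. The function name `read_freq_list` and the small demo file are hypothetical.

```r
# Minimal sketch under assumptions; not the original script.
# ASSUMED layout (hypothetical): "<count> <word>" on each line.
# The real archive could first be extracted with utils::unzip(),
# e.g. unzip(zipfile = <path to downloaded file>, exdir = "nowac").

read_freq_list <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]  # drop empty lines

  # Take the first whitespace-delimited token as the count, the rest as the word
  count <- sub("^(\\S+)\\s+.*$", "\\1", lines, perl = TRUE)
  word  <- sub("^\\S+\\s+", "", lines, perl = TRUE)

  out <- data.frame(
    word = word,
    frequency = suppressWarnings(as.numeric(count)),
    stringsAsFactors = FALSE
  )

  # Drop lines whose first token was not a number
  out <- out[!is.na(out$frequency), ]

  # Most frequent entries first
  out[order(-out$frequency), ]
}

# Demo on a tiny made-up file, so that the sketch runs without the download
tmp <- tempfile(fileext = ".txt")
writeLines(c("  12 hus", "   3 bok", "", "  40 og"), tmp)
freq_list <- read_freq_list(tmp)
print(freq_list)
```

If the real file uses a different separator or column order, only the two `sub()` calls would need to change.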

The script is shown below.


Guevara, E. R. (2010). NoWaC: A large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop (pp. 1–7).

Rodina, Y., & Westergaard, M. (2015). Grammatical gender in Norwegian: Language acquisition and language change. Journal of Germanic Linguistics, 27(2), 145–187.

Rodina, Y., & Westergaard, M. (2021). Grammatical gender and declension class in language change: A study of the loss of feminine gender in Norwegian. Journal of Germanic Linguistics, 33(3), 235–263.
