Preprocessing the Norwegian Web as Corpus (NoWaC) in R

The present script can be used to pre-process data from a frequency list of the Norwegian Web as Corpus (NoWaC).

Before using the script, the frequency list should be downloaded. The list is described as ‘frequency list sorted primary alphabetic and secondary by frequency within each character’. The download requires signing in to an institutional network. Finally, the downloaded file should be unzipped.
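As a rough illustration of what pre-processing such a list can involve, here is a minimal sketch in base R. It is not the script itself. The layout of the list (a count followed by a word form on each line), the sample entries, and the cleaning steps (keeping alphabetic entries, lower-casing, merging duplicates) are all assumptions made for the example and may not match the actual file.

```r
# Hypothetical sample standing in for the unzipped frequency list.
# Assumption: each line holds a count followed by a word form,
# separated by whitespace. The real file may be laid out differently.
sample_lines <- c(
  "  120 og",
  "   45 Oslo",
  "    3 oslo",
  "    7 123"
)

# Split each line into its count and its word form
parts <- strsplit(trimws(sample_lines), "\\s+")
freq_list <- data.frame(
  frequency = as.integer(vapply(parts, `[`, character(1), 1)),
  word = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Keep purely alphabetic entries, then lower-case them
freq_list <- freq_list[grepl("^[[:alpha:]]+$", freq_list$word), ]
freq_list$word <- tolower(freq_list$word)

# Sum the counts of word forms that became identical after lower-casing
freq_list <- aggregate(frequency ~ word, data = freq_list, FUN = sum)

# Sort from most to least frequent
freq_list <- freq_list[order(-freq_list$frequency), ]
print(freq_list)
```

With a real file, `sample_lines` would instead come from something like `readLines()` called on the path of the unzipped list, with the encoding set to whatever the file actually uses.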

Reference of the corpus

Guevara, E. R. (2010). NoWaC: A large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop (pp. 1-7).
