corpus

Preprocessing the Norwegian Web as Corpus (NoWaC) in R

The present script can be used to pre-process data from a frequency list of the Norwegian as Web Corpus (NoWaC; Guevara, 2010). Before using the script, the frequency list should be downloaded from this URL. The list is described as ‘frequency list sorted primary alphabetic and secondary by frequency within each character’, and this is the direct URL. The download requires signing in to an institutional network. Last, the downloaded file should be unzipped.