Preprocessing the Norwegian Web as Corpus (NoWaC) in R

2023 R

The present script can be used to preprocess data from a frequency list of the Norwegian Web as Corpus (NoWaC; Guevara, 2010).

Before using the script, the frequency list should be downloaded from this URL. The list is described as ‘frequency list sorted primary alphabetic and secondary by frequency within each character’, and this is the direct URL. The download requires signing in to an institutional network. Finally, the downloaded file should be unzipped.

The script is shown below.
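As a rough illustration of the kind of preprocessing involved, and not the original script itself, the following minimal sketch uses base R only. It rests on assumptions that the source does not confirm: that each line of the unzipped list holds a frequency count followed by a word form, separated by whitespace, and that the file is encoded in UTF-8. The function name `parse_frequency_lines`, the column names and the file path are hypothetical, and the toy input is invented for demonstration.

```r
# Minimal sketch, not the original script.
# ASSUMPTION: each line has the form "<count> <word>", separated by whitespace.
# The function name and column names are hypothetical.

parse_frequency_lines <- function(lines) {
  # Strip surrounding whitespace and skip empty lines
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]

  # First field is taken as the count, the remainder as the word form
  frequency <- suppressWarnings(
    as.numeric(sub("^(\\S+)\\s+.*$", "\\1", lines))
  )
  word <- sub("^\\S+\\s+", "", lines)

  freq_list <- data.frame(word = word, frequency = frequency,
                          stringsAsFactors = FALSE)

  # Discard lines whose first field is not a number
  freq_list <- freq_list[!is.na(freq_list$frequency), ]

  # Sort from most to least frequent and reset the row names
  freq_list <- freq_list[order(-freq_list$frequency), ]
  rownames(freq_list) <- NULL
  freq_list
}

# Toy input invented for illustration only (not real NoWaC counts)
toy_lines <- c("  12 bok", "  30 hus", "", "   7 jente")
toy_freq <- parse_frequency_lines(toy_lines)
print(toy_freq)

# With the real, unzipped file (hypothetical path; encoding is an assumption):
# freq_list <- parse_frequency_lines(
#   readLines("path/to/unzipped_list.txt", encoding = "UTF-8")
# )
```

If the actual file uses a different column order or encoding, the two regular expressions and the `encoding` argument would need to be adjusted accordingly.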

References

Guevara, E. R. (2010). NoWaC: A large web-based corpus for Norwegian. In Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop (pp. 1–7). https://aclanthology.org/W10-1501

Rodina, Y., & Westergaard, M. (2015). Grammatical gender in Norwegian: Language acquisition and language change. Journal of Germanic Linguistics, 27(2), 145–187. https://doi.org/10.1017/S1470542714000245

Rodina, Y., & Westergaard, M. (2021). Grammatical gender and declension class in language change: A study of the loss of feminine gender in Norwegian. Journal of Germanic Linguistics, 33(3), 235–263. https://doi.org/10.1017/S1470542719000217
