Here you can find a range of language resources produced and shared by researchers affiliated with the Text Mining and Retrieval research group at Leiden University.
BERT Models
We pre-trained two Dutch BERT models, one cased and one uncased, on the SoNaR corpus, a 500-million-word reference corpus of contemporary written Dutch. The uncased model is useful for tasks where the input is entirely lowercased (such as text classification), while the cased model is better suited to tasks like NER, where the casing of words can carry useful information for classification. License: GNU GPLv3.
By Alex Brandsen
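To make the cased/uncased distinction above concrete, here is a minimal sketch of the kind of normalization usually applied to text before it reaches an uncased BERT model. It assumes the uncased model follows the original BERT convention of lowercasing and stripping accents; that is an assumption, not something stated for these models. The function name and the example sentence are illustrative only.

```python
import unicodedata


def normalize_for_uncased(text: str) -> str:
    """Lowercase text and strip combining accent marks.

    Assumption: this mirrors the original uncased BERT preprocessing;
    the Dutch models described above may be configured differently.
    """
    lowered = text.lower()
    # NFD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


sentence = "De opgraving in Leiden leverde één Romeinse munt op."
print(normalize_for_uncased(sentence))
# → de opgraving in leiden leverde een romeinse munt op.
```

After normalization, "Leiden" and "Romeinse" lose their capital letters, which is the cue a cased model can use in NER. Under the accent-stripping assumption, "één" (one) also collapses to "een" (a), which is worth checking for Dutch input.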
110kDBRD
110k Dutch Book Reviews Dataset
This dataset contains book reviews with associated binary sentiment polarity labels. It is heavily influenced by the Large Movie Review Dataset and is intended as a benchmark for sentiment classification in Dutch. The 110kDBRD GitHub repository contains the scripts that were used to scrape the reviews from Hebban, as well as the Dutch language model for FastAI, trained on the Dutch Wikipedia.
By Benjamin van der Burgh
dutch-archaeo-NER-dataset
A manually tagged Dutch NER dataset in the archaeology domain, drawn specifically from excavation reports. It contains roughly 31k annotations across 6 entity types.
By Alex Brandsen
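As a rough illustration of how annotation counts per entity type might be tallied for an NER dataset like the one above, here is a minimal sketch. It assumes the annotations are available as (token, tag) pairs in BIO format; the actual file format of the dataset is not described here. The labels ART, PER and LOC are hypothetical placeholders, not the dataset's real entity types.

```python
from collections import Counter


def count_entities(tagged_tokens):
    """Count entity mentions per type from (token, BIO tag) pairs.

    Each mention is counted once, at its "B-" tag, so this assumes
    well-formed BIO sequences.
    """
    counts = Counter()
    for _token, tag in tagged_tokens:
        if tag.startswith("B-"):
            counts[tag[2:]] += 1
    return counts


# Hypothetical example sentence with hypothetical labels.
example = [
    ("Een", "O"),
    ("bronzen", "O"),
    ("fibula", "B-ART"),
    ("uit", "O"),
    ("de", "O"),
    ("Romeinse", "B-PER"),
    ("tijd", "I-PER"),
    ("in", "O"),
    ("Leiden", "B-LOC"),
]

counts = count_entities(example)
print(dict(counts))
# → {'ART': 1, 'PER': 1, 'LOC': 1}
```

Counting only the "B-" tags means that a multi-token mention such as "Romeinse tijd" is counted as a single annotation.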