Language Resources By TMR

This page lists language resources produced and shared by researchers affiliated with the Text Mining and Retrieval (TMR) research group at Leiden University.

BERT Models

We pre-trained two Dutch BERT models, one cased and one uncased, on the SoNaR corpus, a 500-million-word reference corpus of contemporary written Dutch. The uncased model suits tasks where the input is lowercased, such as text classification. The cased model is better suited to tasks like named entity recognition (NER), where the capitalization of a word can carry useful information for classification. License: GNU GPLv3.
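The difference between cased and uncased input can be illustrated with a minimal, standard-library-only sketch. It does not load either model; `prepare_input` and `capitalized_tokens` are hypothetical helpers introduced here for illustration, and the example sentence is made up. The point is only that lowercasing discards the capitalization cue that an NER model might otherwise use.

```python
def prepare_input(text: str, cased: bool) -> str:
    """Hypothetical helper: keep the text as-is for a cased model,
    lowercase it for an uncased model."""
    return text if cased else text.lower()


def capitalized_tokens(text: str) -> list[str]:
    """Whitespace tokens that start with an uppercase letter.
    A crude stand-in for the casing signal, not a real NER feature."""
    return [tok for tok in text.split() if tok[:1].isupper()]


# Made-up example sentence (assumption, not taken from SoNaR).
sentence = "De opgraving bij Leiden leverde Romeins aardewerk op"

cased_input = prepare_input(sentence, cased=True)
uncased_input = prepare_input(sentence, cased=False)

# Capitalized tokens survive only in the cased input.
print(capitalized_tokens(cased_input))
print(capitalized_tokens(uncased_input))
```

In the cased input the place name and the period name remain distinguishable by their capital letters; in the uncased input that signal is gone, which matters little for sentence-level classification but can matter for entity tagging.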


Download Cased Model
Download Uncased Model

By Alex Brandsen

110kDBRD

110k Dutch Book Reviews Dataset

This dataset contains book reviews with associated binary sentiment polarity labels. It is heavily influenced by the Large Movie Review Dataset and is intended as a benchmark for sentiment classification in Dutch. The 110kDBRD GitHub repository contains the scripts used to scrape the reviews from Hebban, as well as a Dutch language model for FastAI trained on the Dutch Wikipedia.


View on GitHub

By Benjamin van der Burgh

dutch-archaeo-NER-dataset

A manually annotated Dutch NER dataset for the archaeology domain, built from excavation reports. It contains roughly 31,000 annotations across six entity types.


Download Data from Zenodo

By Alex Brandsen
