DutchParl. A Corpus of Parliamentary Documents in Dutch


For a detailed description of this corpus, please read:

M. Marx and A. Schuth. DutchParl. A Corpus of Parliamentary Documents in Dutch. 2010

Please cite this paper if you use the corpus in your own work.
@InProceedings{marx:dutc10,
  author = 	 {M. Marx and A. Schuth},
  title = 	 {{DutchParl.  A Corpus of Parliamentary Documents in Dutch}},
  booktitle = 	 {Proceedings LREC 2010},
  year = 	 2010
}

The aim of DutchParl is to create a corpus containing all digitally available parliamentary documents written in the Dutch language. The main reason for creating the corpus is to provide one portal from which these documents are accessible both in their original official version (in PDF format) and in a uniform XML format with extensive metadata. The corpus was designed to be useful as a data set across scientific disciplines. For example, it can be used for (comparative) corpus-linguistic and political science research, and as a test set for information-theoretic experiments. The corpus was developed following the guidelines set out in Wynne 2005.


Release DutchParl 1

DutchParl is released as a collection of XML files in the format described in the paper, and as a set of text files containing all text from the XML files concatenated together. For each text file, a frequency list of the tokens occurring in it is also provided.
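As an illustration of what such a frequency list contains, here is a minimal Python sketch that counts tokens in a text file line by line. It is not the procedure used to build the released .dict files: the whitespace tokenization, the lowercasing, and the tab-separated output layout are all assumptions made for this example, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator


def tokenize(line: str) -> Iterator[str]:
    """Yield lowercased, whitespace-separated tokens.

    Assumption: the released .dict files may use a different tokenization.
    """
    for token in line.split():
        yield token.lower()


def frequency_list(lines: Iterable[str]) -> Counter:
    """Count tokens over an iterable of lines, one line at a time,
    so that multi-gigabyte files never have to fit in memory."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def format_frequency_list(counts: Counter) -> str:
    """Render counts as 'token<TAB>count' lines, most frequent first.

    Ties are broken alphabetically so the output is deterministic.
    The output layout is an assumption, not the .dict format.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "\n".join(f"{token}\t{count}" for token, count in ordered)
```

Since an open text file is an iterable of lines, a file such as NL-DIG/HAN.txt could be passed directly to `frequency_list`, for instance via `open(path, encoding="utf-8")`; the UTF-8 encoding is likewise an assumption.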

These text files are organized by country and by document type (OCR and DIG, for scanned and digital documents, respectively). For each type, we provide both the complete text and the text of the verbatim proceedings. For the Belgian federal documents, we additionally provide separate files with the words in Dutch and in French, respectively.

The sizes of the files are listed below:

 1.7M 2009-10-22 13:50 B/be-PLENCOMM-fr.dict
 230M 2009-10-22 11:11 B/be-PLENCOMM-fr.txt
 3.5M 2009-10-22 13:57 B/be-PLENCOMM-nl.dict
 264M 2009-10-22 11:11 B/be-PLENCOMM-nl.txt
 4.1M 2009-10-06 20:11 B/be-PLENCOMM-tekst.dict
 503M 2009-10-06 19:33 B/be-PLENCOMM-tekst.txt
 1.6M 2009-10-22 14:02 B/be-QA-fr.dict
 148M 2009-10-22 11:23 B/be-QA-fr.txt
 2.7M 2009-10-22 14:06 B/be-QA-nl.dict
 134M 2009-10-22 11:11 B/be-QA-nl.txt
 2.9M 2009-10-06 20:16 B/be-QA-tekst.dict
 299M 2009-10-06 19:34 B/be-QA-tekst.txt
 3.7K 2009-10-22 14:38 B/README
 5.7M 2009-10-16 18:50 Flanders-DIG/flanders.dict
 4.0M 2009-10-08 12:34 Flanders-DIG/flanders-proc.dict
 311M 2009-10-08 12:27 Flanders-DIG/flanders-proc.txt
 454M 2009-10-07 17:41 Flanders-DIG/flanders.txt
 3.8M 2009-10-16 18:45 Flanders-OCR/flanders.dict
 2.7M 2009-10-08 12:32 Flanders-OCR/flanders-proc.dict
 140M 2009-10-08 12:24 Flanders-OCR/flanders-proc.txt
 195M 2009-10-07 17:30 Flanders-OCR/flanders.txt
  16M 2009-10-09 12:56 NL-DIG/BLG.dict
 1.2G 2009-10-07 11:34 NL-DIG/BLG.txt
 6.0M 2009-10-09 13:11 NL-DIG/HAN.dict
 782M 2009-10-07 12:04 NL-DIG/HAN.txt
  14M 2009-10-09 12:24 NL-DIG/KST.dict
 2.1G 2009-10-07 14:33 NL-DIG/KST.txt
 4.0M 2009-10-09 13:15 NL-DIG/KVR.dict
 149M 2009-10-07 15:15 NL-DIG/KVR.txt
 2.3M 2009-10-09 13:16 NL-DIG/NDS.dict
  62M 2009-10-07 15:30 NL-DIG/NDS.txt
 311K 2009-10-09 13:16 NL-DIG/OVG.dict
  11M 2009-10-07 15:31 NL-DIG/OVG.txt
 3.6K 2009-10-09 12:32 NL-DIG/README
 2.1K 2009-10-16 14:20 NL-OCR/README
  38M 2009-10-16 12:59 NL-OCR/SGD-proc-text.dict
 2.5G 2009-10-16 12:15 NL-OCR/SGD-proc-text.txt
  65M 2009-10-16 14:12 NL-OCR/SGDtext.dict
 6.6G 2009-10-16 11:56 NL-OCR/SGDtext.txt

Download

The XML corpora are available on request. Send an email to maartenmarx APESTAART uva.nl for instructions. The RelaX-NG schemas are here.
The above listed text files are available as gzipped tar files per country:

NL-DIG 1.3G
NL-OCR 2.7G
Flanders-DIG 225M
Flanders-OCR 103M
B 483M
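Because some of the text files are several gigabytes in size, it can be convenient to read them straight from a gzipped tar file without unpacking it first. The following Python sketch shows one way to do this with the standard `tarfile` module. The function name, the UTF-8 encoding, and the assumption that member names inside the archive end in `.txt` (as in the listing above) are illustrative and have not been checked against the actual archives.

```python
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_lines(
    archive: BinaryIO, suffix: str = ".txt", encoding: str = "utf-8"
) -> Iterator[Tuple[str, str]]:
    """Yield (member_name, line) pairs for every regular file in a
    gzipped tar archive whose name ends with `suffix`.

    Lines are decoded one at a time, so no member is loaded whole.
    Assumptions: the archive is gzip-compressed and the text is UTF-8;
    undecodable bytes are replaced rather than raising an error.
    """
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            raw = tar.extractfile(member)
            if raw is None:
                continue
            for raw_line in raw:
                text = raw_line.decode(encoding, errors="replace")
                yield member.name, text.rstrip("\r\n")
```

The archive is passed as a file object opened in binary mode, for example `open(path, "rb")` on a downloaded file; the generator can then be combined with any line-based processing step.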

Sizes of the subcorpora


Terms of Use

We are not aware of any copyright restrictions on the material. If you use the corpus, we would like to know about it. Send an email to maartenmarx APESTAART uva.nl.

Additions to the corpus

If you are aware of a dataset that could (or, in your view, should) be included, please write to us. This could be proceedings of lower governmental bodies such as provinces or towns. Together with your suggestion we would like to see

Acknowledgments

Maarten Marx acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FET-Open grant agreement FOX, number FP7-ICT-233599.