M. Marx and A. Schuth. DutchParl. A Corpus of Parliamentary Documents in Dutch. 2010
Please cite this paper if you use the corpus in your own work.
@InProceedings{marx:dutc10,
author = {M. Marx and A. Schuth},
title = {{DutchParl. A Corpus of Parliamentary Documents in Dutch}},
booktitle = {Proceedings LREC 2010},
year = 2010
}
The aim of DutchParl is to create a corpus containing all digitally available parliamentary documents written in the Dutch language. The main reason to create the corpus is to provide one portal from which these documents are accessible both in their original official version (in PDF format) and in a uniform XML format with extensive metadata. The corpus is designed to be useful as a data set across scientific disciplines. For example, it can be used for (comparative) corpus-linguistic and political science research, and as a test set for information-theoretic experiments. The corpus was developed following the guidelines set out in Wynne (2005).
DutchParl is released as a collection of XML files in the format described in the paper, and as a set of text files containing all text from the XML files concatenated together. For each text file, a frequency list of the tokens occurring in it is also provided.
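To illustrate what a token frequency list for one of the text files could look like, here is a minimal Python sketch. The tokenizer (lowercased runs of word characters), the function names, and the tab-separated output are assumptions made for this example; the tokenization and file layout actually used for the distributed frequency lists are not specified here and may differ.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: a token is a maximal run of word characters, compared
# case-insensitively. The corpus' own tokenization may differ.
TOKEN_RE = re.compile(r"\w+")


def token_frequencies(lines: Iterable[str]) -> Counter:
    """Count tokens over an iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(TOKEN_RE.findall(line.lower()))
    return counts


def frequencies_for_file(path: str, encoding: str = "utf-8") -> Counter:
    """Count tokens in a text file, reading it line by line.

    Reading line by line keeps memory use bounded by the vocabulary size,
    which matters for text files of several gigabytes.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        return token_frequencies(handle)


if __name__ == "__main__":
    sample = [
        "De voorzitter opent de vergadering.",
        "De vergadering is gesloten.",
    ]
    # Print the three most frequent tokens as "count<TAB>token".
    for token, count in token_frequencies(sample).most_common(3):
        print(f"{count}\t{token}")
```

For a real file, `frequencies_for_file` would be called with the path to one of the text files; the encoding is a parameter because it is assumed here rather than known.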
These text files are organized by country and by document type (OCR and DIG, for scanned and digital documents, respectively). For each type, both the complete text and the text of the verbatim proceedings are provided. For the Belgian federal documents we additionally provide separate files with the words in Dutch and in French, respectively.
The sizes of the files are listed below (size, date and time, path):
1.7M 2009-10-22 13:50 B/be-PLENCOMM-fr.dict
230M 2009-10-22 11:11 B/be-PLENCOMM-fr.txt
3.5M 2009-10-22 13:57 B/be-PLENCOMM-nl.dict
264M 2009-10-22 11:11 B/be-PLENCOMM-nl.txt
4.1M 2009-10-06 20:11 B/be-PLENCOMM-tekst.dict
503M 2009-10-06 19:33 B/be-PLENCOMM-tekst.txt
1.6M 2009-10-22 14:02 B/be-QA-fr.dict
148M 2009-10-22 11:23 B/be-QA-fr.txt
2.7M 2009-10-22 14:06 B/be-QA-nl.dict
134M 2009-10-22 11:11 B/be-QA-nl.txt
2.9M 2009-10-06 20:16 B/be-QA-tekst.dict
299M 2009-10-06 19:34 B/be-QA-tekst.txt
3.7K 2009-10-22 14:38 B/README
5.7M 2009-10-16 18:50 Flanders-DIG/flanders.dict
4.0M 2009-10-08 12:34 Flanders-DIG/flanders-proc.dict
311M 2009-10-08 12:27 Flanders-DIG/flanders-proc.txt
454M 2009-10-07 17:41 Flanders-DIG/flanders.txt
3.8M 2009-10-16 18:45 Flanders-OCR/flanders.dict
2.7M 2009-10-08 12:32 Flanders-OCR/flanders-proc.dict
140M 2009-10-08 12:24 Flanders-OCR/flanders-proc.txt
195M 2009-10-07 17:30 Flanders-OCR/flanders.txt
16M 2009-10-09 12:56 NL-DIG/BLG.dict
1.2G 2009-10-07 11:34 NL-DIG/BLG.txt
6.0M 2009-10-09 13:11 NL-DIG/HAN.dict
782M 2009-10-07 12:04 NL-DIG/HAN.txt
14M 2009-10-09 12:24 NL-DIG/KST.dict
2.1G 2009-10-07 14:33 NL-DIG/KST.txt
4.0M 2009-10-09 13:15 NL-DIG/KVR.dict
149M 2009-10-07 15:15 NL-DIG/KVR.txt
2.3M 2009-10-09 13:16 NL-DIG/NDS.dict
62M 2009-10-07 15:30 NL-DIG/NDS.txt
311K 2009-10-09 13:16 NL-DIG/OVG.dict
11M 2009-10-07 15:31 NL-DIG/OVG.txt
3.6K 2009-10-09 12:32 NL-DIG/README
2.1K 2009-10-16 14:20 NL-OCR/README
38M 2009-10-16 12:59 NL-OCR/SGD-proc-text.dict
2.5G 2009-10-16 12:15 NL-OCR/SGD-proc-text.txt
65M 2009-10-16 14:12 NL-OCR/SGDtext.dict
6.6G 2009-10-16 11:56 NL-OCR/SGDtext.txt
The XML corpora are available on request. Send an email to
maartenmarx APESTAART uva.nl for instructions. The RELAX NG schemas are here.
The text files listed above are available as gzipped tar files per
country:
| NL-DIG | 1.3G |
| NL-OCR | 2.7G |
| Flanders-DIG | 225M |
| Flanders-OCR | 103M |
| B | 483M |
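Because the unpacked text files are several times larger than the archives, it can be convenient to read a file directly from its gzipped tar archive instead of unpacking everything first. The following Python sketch uses the standard `tarfile` module to do that. The archive file name `Flanders-DIG.tar.gz`, the member path, and the UTF-8 encoding in the usage note are hypothetical; the actual names and encoding depend on the files you download.

```python
import io
import tarfile
from typing import Iterator, List, Tuple


def list_members(archive_path: str) -> List[Tuple[str, int]]:
    """Return (name, size in bytes) for every regular file in the archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


def iter_member_lines(
    archive_path: str, member_name: str, encoding: str = "utf-8"
) -> Iterator[str]:
    """Yield the lines of one archived file without extracting it to disk."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # extractfile raises KeyError for an unknown member name and
        # returns None for members that are not regular files or links.
        raw = archive.extractfile(member_name)
        if raw is None:
            raise ValueError(f"{member_name} is not a regular file")
        with io.TextIOWrapper(raw, encoding=encoding, errors="replace") as text:
            for line in text:
                yield line.rstrip("\n")
```

With a hypothetical archive name, `list_members("Flanders-DIG.tar.gz")` would show what the archive contains, and `iter_member_lines("Flanders-DIG.tar.gz", "Flanders-DIG/flanders.txt")` would stream one text file line by line.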