<![CDATA[NCSE v2.0: A Dataset of OCR-Processed 19th Century English Newspapers]]>

DOI: 10.5522/04/28381610.v1
URL: https://figshare.com/articles/dataset/NCSE_v2_0_A_Dataset_of_OCR-Processed_19th_Century_English_Newspapers/28381610
Files:
https://ndownloader.figshare.com/files/52260821
https://ndownloader.figshare.com/files/52260746
https://ndownloader.figshare.com/files/52256369
https://ndownloader.figshare.com/files/52256378
https://ndownloader.figshare.com/files/52260608
https://ndownloader.figshare.com/files/52260611
https://ndownloader.figshare.com/files/52260923
https://ndownloader.figshare.com/files/52260941
Author: Jonno Bourne (ORCID 0000-0003-2616-3716)
Subjects: Natural language processing; Library studies; Open access; Digital history; Media studies; British history; newspapers; OCR; NLP
Date: 2025-02-11
Type: Dataset
Publisher: University College London

This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models".

Dataset Overview

The NCSE v2.0 is a digitized collection of six 19th-century English periodicals containing:

82,690 pages
1.4 million entries
321 million words
1.9 billion characters

The dataset includes:

1.1 million text entries
198,000 titles
17,000 figure descriptions
16,000 tables

Repository Contents

NCSE v2.0 Dataset
- NCSE_v2.zip: a folder containing a parquet file for each of the periodicals as well as a readme file.
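
Because the archive holds one parquet file per periodical, a first step is usually to list those files and load them one at a time. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the internal folder layout of NCSE_v2.zip is not described here, and pandas (with a parquet engine such as pyarrow) is an assumed third-party dependency that the dataset itself does not prescribe.

```python
import io
import zipfile


def list_parquet_members(zip_path):
    """Return the sorted names of all .parquet files inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(n for n in zf.namelist() if n.lower().endswith(".parquet"))


def read_parquet_member(zip_path, member):
    """Load one parquet member of the archive into a pandas DataFrame.

    pandas is imported lazily (assumed dependency), so the listing helper
    above still works when pandas is not installed.
    """
    import pandas as pd

    with zipfile.ZipFile(zip_path) as zf:
        data = zf.read(member)  # read the whole member into memory
    return pd.read_parquet(io.BytesIO(data))
```

For example, `list_parquet_members("NCSE_v2.zip")` returns the member names, and each name can then be passed to `read_parquet_member`. Reading a member fully into memory keeps the sketch simple; for very large files, extracting to disk first may be preferable.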
Bounding Box Dataset
A zip file called bounding_box.zip, containing:
- post_process: A folder of the processed periodical bounding box data
- post_process_fill: A folder of the processed periodical bounding box data WITH column filling.
- bbox_readme.txt: a readme file and data description for the bounding boxes
Test Sets
- cropped_images.zip: 378 images cropped from the NCSE test set pages, all 2-bit png files
- ground_truth: 358 text files corresponding to the text from the cropped_images folder
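
The listing above gives 378 cropped images but 358 ground-truth text files, so any evaluation script should tolerate images without a matching transcription. The sketch below pairs files by shared file stem. That naming convention and the `.txt` extension are assumptions, not something the description confirms, and `pair_by_stem` is a hypothetical helper.

```python
from pathlib import Path


def pair_by_stem(image_dir, text_dir):
    """Match PNG images to .txt files that share the same file stem.

    Returns (pairs, unmatched_images): pairs is a list of
    (image_path, text_path) tuples ordered by image name, and
    unmatched_images lists images with no text file of the same stem.
    """
    texts = {p.stem: p for p in Path(text_dir).glob("*.txt")}
    pairs, unmatched = [], []
    for img in sorted(Path(image_dir).glob("*.png")):
        if img.stem in texts:
            pairs.append((img, texts[img.stem]))
        else:
            unmatched.append(img)
    return pairs, unmatched
```

Calling `pair_by_stem("cropped_images", "ground_truth")` on the extracted folders would return the usable pairs and report any images left without ground truth.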
Classification Training Data
The files below are used for training the classification models. Each contains 12,000 observations, 2,000 from each periodical. The labels were assigned using mistral-large-2411. This data is used to train the ModernBERT classifier described in the paper. The topics are taken from the International Press Telecommunications Council (IPTC) subject codes.
- silver_IPTC_class.parquet: IPTC topic classification silver set
- silver_text_type.parquet: Text-type classification silver set
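
The silver sets are described as balanced, with the same number of observations from each periodical. A quick way to confirm this after loading is sketched below. It works on plain dictionaries so it stays independent of any dataframe library; the column name `periodical` is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def count_per_group(rows, key="periodical"):
    """Count how many rows belong to each value of `key`."""
    return Counter(row[key] for row in rows)


def is_balanced(rows, expected, key="periodical"):
    """True if there is at least one group and every group has `expected` rows."""
    counts = count_per_group(rows, key)
    return bool(counts) and all(n == expected for n in counts.values())
```

With the real data, `is_balanced(rows, 2000)` would be the check corresponding to the description, where `rows` is the loaded silver set converted to a list of dictionaries.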
Classified Data
The zip file "classification_data.zip" contains all rows classified using the ModernBERT classifier described in the paper.
- IPTC_type_classified.zip: contains one parquet file per periodical
- text_type_classified.zip: contains one parquet file per periodical
- classification_readme.md: Description of the data
Classification Mappings
Data for mapping the classification codes to human readable names.
- class_mappings.zip: contains a json for each classification type
  - IPTC_class_mapping.json
  - text_type_class_mapping.json
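
The mapping files translate classification codes into human-readable names. The sketch below shows one way to apply such a mapping. It assumes each JSON file is a flat object of the form code -> name, which is not confirmed by the description (classification_readme.md is the place to check); `load_mapping` and `to_names` are hypothetical helpers.

```python
import json


def load_mapping(path):
    """Load a code -> name mapping from a JSON file (assumed flat object)."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so normalise lookups to str.
    return {str(code): name for code, name in raw.items()}


def to_names(codes, mapping, default="unknown"):
    """Translate codes to names, using `default` for codes not in the mapping."""
    return [mapping.get(str(code), default) for code in codes]
```

Converting codes to strings before lookup means integer codes from a parquet column and string keys from JSON match without extra handling.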

Original Images

The original page images can be found at the King's College London Repositories:

Or via the project central archive

Citation

If you use this dataset, please cite:

No citation data currently available

Related Code

All original code related to this project, including the creation of the datasets and their analysis, can be found at:
https://github.com/JonnoB/ereading_the_unreadable

Contact

For questions about the dataset, please create an issue in this repository.

Usage Rights

In keeping with the original NCSE dataset, all data is made available under a Creative Commons Attribution 4.0 International License (CC BY).

]]>