<![CDATA[Transcribed newspaper articles from the NCSE collection]]>

2026-05-07T16:08:38Z https://api.figshare.com/v2/oai

oai:figshare.com:article/25805008 2025-01-02T09:33:48Z category_29134 category_29137 category_29263 category_27328 category_27322 category_28849 portal_549 item_type_3 month_year_01_2025

10.5522/04/25805008.v1 https://figshare.com/articles/dataset/Transcribed_newspaper_articles_from_the_NCSE_collection/25805008 https://ndownloader.figshare.com/files/46281040 Bourne, Jonno Jonno Bourne 0000-0003-2616-3716 <![CDATA[Transcribed newspaper articles from the NCSE collection]]> Library studies Open access Media studies Digital history British history Natural language processing newspapers archives NLP OCR 2025-01-02 2025-01-02 Dataset 2025 University College London CLOCR-C: Transcribed newspaper articles from the NCSE collection

This dataset contains 91 pairs of newspaper articles from the Nineteenth Century Serials Edition (NCSE). Each pair consists of the original OCR text from the NCSE and its transcribed equivalent. The data was used in "CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models" to demonstrate that pre-trained language models are able to perform post-OCR correction, improving the accuracy of corrupted OCR text. The paper can be found on arXiv at https://arxiv.org/abs/2408.17428

Data Details

The dataset is drawn from 6 different publications and is made up of 91 articles containing a total of 40,712 words. The articles are distributed across the 19th century.

The dataset is a zip file made up of two sub-folders, each containing 91 files. Each file shares its name with a corresponding file in the other folder.

transcription_files: contains .txt files of the transcribed articles
transcription_raw_ocr: contains .txt files of the original OCR
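
Because the two folders use matching file names, the OCR and transcribed versions of each article can be paired by name. The following is a minimal sketch, assuming the zip file has already been extracted into a local directory and that the files are UTF-8 text; the function name `load_pairs` and the directory passed to it are illustrative and not part of the dataset.

```python
from pathlib import Path


def load_pairs(root):
    """Pair each transcribed article with its raw OCR counterpart by file name.

    `root` is assumed to be the directory the zip file was extracted into.
    Returns a dict mapping file name -> (raw_ocr_text, transcribed_text).
    """
    root = Path(root)
    transcribed_dir = root / "transcription_files"
    ocr_dir = root / "transcription_raw_ocr"

    pairs = {}
    for transcribed_path in sorted(transcribed_dir.glob("*.txt")):
        ocr_path = ocr_dir / transcribed_path.name
        if not ocr_path.exists():
            # Skip files without a counterpart rather than failing.
            continue
        # UTF-8 is an assumption; undecodable bytes are replaced, not raised.
        pairs[transcribed_path.name] = (
            ocr_path.read_text(encoding="utf-8", errors="replace"),
            transcribed_path.read_text(encoding="utf-8", errors="replace"),
        )
    return pairs
```

Calling `load_pairs` on the extracted directory and checking `len(pairs)` against the 91 articles described above is a quick way to confirm the archive was unpacked completely.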

]]>