<?xml version='1.0' encoding='utf-8'?>
<?xml-stylesheet type="text/xsl" href="/v2/static/oai2.xsl"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2026-05-16T14:36:34Z</responseDate>
  <request identifier="oai:figshare.com:article/27108334" metadataPrefix="oai_datacite" verb="GetRecord">https://api.figshare.com/v2/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:figshare.com:article/27108334</identifier>
        <datestamp>2024-09-27T11:58:55Z</datestamp>
        <setSpec>category_28849</setSpec>
        <setSpec>category_29134</setSpec>
        <setSpec>category_29137</setSpec>
        <setSpec>category_27328</setSpec>
        <setSpec>portal_549</setSpec>
        <setSpec>item_type_3</setSpec>
        <setSpec>month_year_09_2024</setSpec>
      </header>
      <metadata>
        <resource xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.3/metadata.xsd">
          <identifier identifierType="DOI">10.5522/04/27108334.v1</identifier>
          <alternateIdentifiers>
            <alternateIdentifier alternateIdentifierType="URL">https://figshare.com/articles/dataset/Scrambled_text_training_Language_Models_to_correct_OCR_errors_using_synthetic_data/27108334</alternateIdentifier>
          </alternateIdentifiers>
          <relatedIdentifiers>
            <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://ndownloader.figshare.com/files/49417669</relatedIdentifier>
            <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://ndownloader.figshare.com/files/49417675</relatedIdentifier>
            <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://ndownloader.figshare.com/files/49417678</relatedIdentifier>
            <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart">https://ndownloader.figshare.com/files/49433905</relatedIdentifier>
          </relatedIdentifiers>
          <creators>
            <creator>
              <creatorName>Bourne, Jonno</creatorName>
              <givenName>Jonno</givenName>
              <familyName>Bourne</familyName>
              <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org">0000-0003-2616-3716</nameIdentifier>
            </creator>
          </creators>
          <titles>
            <title><![CDATA[Scrambled text: training Language Models to correct OCR errors using synthetic data]]></title>
          </titles>
          <subjects>
            <subject>Natural language processing</subject>
            <subject>Library studies</subject>
            <subject>Open access</subject>
            <subject>Digital history</subject>
            <subject>Newspapers</subject>
            <subject>OCR</subject>
            <subject>NLP</subject>
            <subject>synthetic data</subject>
          </subjects>
          <dates>
            <date dateType="Created">2024-09-27</date>
            <date dateType="Updated">2024-09-27</date>
          </dates>
          <resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>
          <publicationYear>2024</publicationYear>
          <publisher>University College London</publisher>
          <rightsList>
            <rights rightsURI="https://opensource.org/licenses/MIT" rightsIdentifier="MIT"/>
            <rights rightsURI="http://purl.org/coar/access_right/c_abf2" rightsIdentifier="open access"/>
          </rightsList>
          <descriptions>
            <description descriptionType="Abstract"><![CDATA[<p dir="ltr">This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data".</p><p dir="ltr">In addition it contains the 10,000 synthetic 19th century articles generated using GPT4o. These articles are available both as a csv with the prompt parameters as columns as well as the articles as individual text files.</p><p dir="ltr">The files in the repository are as follows</p><ul><li><b>ncse_hf_dataset</b>: A huggingface dictionary dataset containing 91 articles from the Nineteenth Century Serials Edition (NCSE) with original OCR and the transcribed groundtruth. This dataset is used as the testset in the paper</li><li><b>synth_gt.zip</b>: A zip file containing 5 parquet files of training data from the 10,000 synthetic articles. The each parquet file is made up of observations of a fixed length of tokens, for a total of 2 Million tokens. The observation lengths are 200, 100, 50, 25, 10.</li><li><b>synthetic_articles.zip</b>: A zip file containing the csv of all the synthetic articles and the prompts used to generate them.</li><li><b>synthetic_articles_text.zip</b>: A zip file containing the text files of all the synthetic articles. The file names are the prompt parameters and the id reference from the synthetic article csv.</li></ul><p dir="ltr">The data in this repo is used by the code repositories associated with the project </p><ul><li>https://github.com/JonnoB/scrambledtext_analysis</li><li>https://github.com/JonnoB/training_lms_with_synthetic_data</li></ul><p dir="ltr"><br></p>]]></description>
          </descriptions>
        </resource>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
