Structuring 800,000 Advertisements from the Avisblatt for Digital Humanities Research
During 2017 the “Printed Markets” project of Professor Susanna Burghartz, working with Basel University Library and Data Futures GmbH, digitised all issues of the Avisblatt newspaper, one of the earliest advertising publications, which first appeared in 1729, and created an Invenio repository. annostor then enabled the creation and management of Web Annotation Data Model (WADM) annotation collections, making the Avisblatt accessible as 800,000 computer-readable personal advertisements for use by historians of the 18th and 19th centuries.
The Avisblatt is remarkable not just for its pioneering print techniques and large distribution volumes, but also because it appeared with articles and advertisements in French, German, Italian, and Latin. Its typefaces, which evolved from Fraktur and Schwabacher over the 116 years of its publication, presented serious challenges for conventional optical character recognition (OCR) approaches and products.
From Annotations to Research Data
Eric Decker’s Research & Infrastructure Support department at the University of Basel (RISE) worked with Data Futures and the Transkribus project at Innsbruck University to develop neural network-based automation for analysis of the Avisblatt. Professor Burghartz’s team created training data for OCR using annostor’s Mirador integration, and the trained neural net then generated highly reliable machine-readable text for all the Avisblatt issues.
Training Neural Networks on 800,000 Advertisements
Individual advertisements were defined in annostor to train a second neural net, enabling advertisement markup to be generated automatically at scale. annostor converted the Transkribus output into WADM annotations, so that they could be displayed interactively in any IIIF viewer, preserved effectively, and exported via APIs in dataset formats such as CSV for use by scholars.
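To make the conversion step more concrete, the following is a minimal sketch of how a recognised text region might be expressed as a WADM annotation and then flattened into CSV. It is not annostor's implementation: the function names, the CSV column layout, and the example.org URIs are all hypothetical, and the input is assumed to be a plain bounding box plus transcribed text rather than actual Transkribus output.

```python
import csv
import io

# JSON-LD context defined by the W3C Web Annotation Data Model.
WADM_CONTEXT = "http://www.w3.org/ns/anno.jsonld"


def region_to_annotation(anno_id, canvas_uri, text, x, y, w, h):
    """Build a WADM annotation for one text region (hypothetical helper).

    The region is targeted with a media-fragment selector (xywh=...),
    which IIIF viewers commonly use to place annotations on a canvas.
    """
    return {
        "@context": WADM_CONTEXT,
        "id": anno_id,
        "type": "Annotation",
        "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
        "target": {
            "source": canvas_uri,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }


def annotations_to_csv(annotations):
    """Flatten annotations into CSV text (assumed column layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "canvas", "xywh", "text"])
    for anno in annotations:
        target = anno["target"]
        # Drop the "xywh=" prefix so the column holds only the coordinates.
        coords = target["selector"]["value"].split("=", 1)[1]
        writer.writerow([anno["id"], target["source"], coords, anno["body"]["value"]])
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = region_to_annotation(
        "https://example.org/anno/1",
        "https://example.org/canvas/1",
        "Zu verkaufen: ein Tisch",
        10, 20, 300, 120,
    )
    print(annotations_to_csv([sample]))
```

Keeping the annotation as the primary record and deriving CSV from it means the same data can serve both a IIIF viewer and tabular analysis.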
The annotation collections are preserved for the long term in the InvenioRDM corpus repository (https://avisblatt.dg-basel.hasdai.org/), which provides the IIIF service for the complete digital Avisblatt as well as unrestricted download of the annotation collections.
The “Printed Markets” project of Professor Burghartz and Dr Alexander Engel then processed the Avisblatt annotations further using R, leading to additional research outcomes (https://avisblatt.ch/), including CSV datasets deposited as a Zenodo record (DOI: 10.5281/zenodo.8278751) and a GitHub project (https://github.com/Avisblatt/avisdata/tree/v1.0.0).