Smithsonian Institution builds AI trial to capture biodiversity data from historic research

During 2024, Smithsonian Institution researchers working with the Biodiversity Heritage Library (BHL) extracted structured taxonomic data from large numbers of historic field notebooks.

This material is a crucial resource for understanding historical populations and habitats, many of which have since been degraded by pollution and climate change.

Creating a digital corpus from the primarily handwritten material and pasted-in photographs, the project used AI to extract the text, and provided workflows for human correction of text recognition errors and for expert confirmation of specimen identifications.


Extraction of biodiversity data using a pipeline of AI products

rapidly advancing AI workflows extract structured data from handwritten field notebooks, giving researchers access to historic species populations and habitat conditions

annostor provides human-in-the-loop validation workflows

annostor's hybrid workflows combine AI speed at scale with expert review, to establish scientific facts

Integrating AI data derivatives with long-term biodiversity infrastructure

integrating annotation with the InvenioRDM repository enables seamless connection with GBIF and GNA, and guarantees long-term access and protection of investment

Creating new FAIR biodiversity data resources

unlocking critical historic records using AI analysis validated via annostor workflows creates freely accessible FAIR resources for researchers worldwide, and for education

Integrating AI outputs with source materials for long-term preservation

Working with annostor, the Smithsonian team developed a repository providing IIIF image services for the field notebook page imagery and supporting Web Annotation Data Model (WADM) annotation. They imported the corrected AI-generated texts of the handwritten notes to create annotations anchored to the digitised pages, and then created links to the Global Biodiversity Information Facility (GBIF) using the Global Names Architecture (GNA).

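To illustrate what an annotation anchored to a digitised page can look like, the following is a minimal sketch of a WADM record for one transcribed region of a IIIF canvas. It is not the project's actual data model: the function name, the example.org URIs, the region coordinates and the sample text are all hypothetical, and the `supplementing` motivation is the one the IIIF Presentation API 3.0 suggests for transcriptions rather than anything the project is known to use.

```python
import json


def make_transcription_annotation(anno_id, canvas_uri, xywh, text):
    """Build a Web Annotation (WADM) dict anchoring text to a page region.

    All arguments are illustrative; a real repository would supply its own
    annotation and canvas identifiers.
    """
    x, y, w, h = xywh
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        # IIIF Presentation 3.0 motivation for transcribed content (assumed here)
        "motivation": "supplementing",
        "body": {
            "type": "TextualBody",
            "value": text,
            "format": "text/plain",
        },
        "target": {
            "source": canvas_uri,
            # Media-fragment selector: a rectangle in canvas pixel coordinates
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }


# Invented sample values, for illustration only
example = make_transcription_annotation(
    anno_id="https://example.org/annotations/1",
    canvas_uri="https://example.org/iiif/notebook-1/canvas/p12",
    xywh=(120, 340, 800, 60),
    text="corrected transcription of one handwritten line",
)
print(json.dumps(example, indent=2))
```

Keeping the corrected text in the annotation body and the page region in the target means the transcription can be stored and exchanged separately from the image while still pointing back to the exact place on the page it was read from.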

Connecting historic field observations with global data resources using Persistent IDentifiers

Extending this investigation during 2025, the Smithsonian analysed tens of thousands of heritage biodiversity images (originally deposited in BHL from the Flickr Foundation) and created AI profiles of the plants and other organisms found in them. These profiles, together with the high-resolution digital imagery, were added to the new repository to automatically create new collections of standards-based annotation records.

Hybrid workflows were then developed to support human-in-the-loop identification of the specimens via GBIF and GNA, producing a new corpus of historic field observation records that is freely accessible to the global scientific community.
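The human-in-the-loop step can be pictured as a simple review gate: an AI-proposed identification becomes a published record only after an expert confirms it. The sketch below is an assumption about how such a gate might be modelled, not a description of annostor's implementation. The class, function and status names are hypothetical, the taxon URI is a placeholder rather than a real GBIF identifier, and no GBIF or GNA service is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProposedIdentification:
    """An AI-suggested identification awaiting expert review (hypothetical model)."""
    canvas_uri: str          # page or image the specimen appears on
    verbatim_name: str       # name as proposed by the AI profile
    taxon_uri: str           # placeholder for a resolved taxon identifier
    status: str = "proposed"  # proposed -> confirmed | rejected
    reviewer: Optional[str] = None


def review(record: ProposedIdentification, reviewer: str, accept: bool) -> ProposedIdentification:
    """Record an expert's decision on a proposed identification."""
    record.status = "confirmed" if accept else "rejected"
    record.reviewer = reviewer
    return record


def to_identifying_annotation(record: ProposedIdentification) -> dict:
    """Turn a confirmed identification into a WADM 'identifying' annotation."""
    if record.status != "confirmed":
        raise ValueError("only expert-confirmed identifications are published")
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "identifying",
        "creator": {"type": "Person", "name": record.reviewer},
        "body": {"type": "SpecificResource", "source": record.taxon_uri},
        "target": record.canvas_uri,
    }


# Invented sample values, for illustration only
proposal = ProposedIdentification(
    canvas_uri="https://example.org/iiif/image-42/canvas/1",
    verbatim_name="placeholder name",
    taxon_uri="https://example.org/taxon/placeholder",
)
confirmed = review(proposal, reviewer="Example Reviewer", accept=True)
annotation = to_identifying_annotation(confirmed)
```

Recording the reviewer as the annotation's creator keeps the provenance of each confirmed identification alongside the record itself.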
