Smithsonian Institution builds AI trial to capture biodiversity data from historic research

During 2024, Smithsonian Institution researchers working with the Biodiversity Heritage Library (BHL) extracted structured taxonomic data from large numbers of historic field notebooks.

This material is a crucial resource for understanding historical populations and habitats, which have now been drastically altered.

The project digitised the primarily handwritten material and pasted-in photographs to create a digital corpus, used AI to extract the text, and provided workflows for humans to correct the text recognition and for experts to confirm specimen identifications.

Extraction of biodiversity data using a pipeline of AI products

rapidly advancing AI tools extract structured data from handwritten field notebooks, giving researchers access to records of historic species populations and habitat conditions

Human-in-the-loop validation workflows

hybrid workflows combine the speed and scale of AI with expert review to establish scientific facts

Integrating AI data derivatives with long-term biodiversity infrastructure

built using IIIF and WADM, the InvenioRDM repository platform connects with GBIF, using GNA to link names to GBIF records, and preserves the investment for the long term

Creating new FAIR biodiversity data resources

unlocking critical historic records with AI analyses, validated through annostor workflows, creates freely accessible resources for worldwide research and education

Integrating AI outputs with source materials for long-term preservation

Working with annostor, the Smithsonian team developed a repository that provides IIIF image services for the field notebook page imagery and supports Web Annotation Data Model (WADM) annotation. They imported the corrected AI-generated texts of the handwritten notes to create annotations anchored to the digitised pages, and then created links to the Global Biodiversity Information Facility (GBIF) using the Global Names Architecture (GNA).
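
As an illustration of what an annotation anchored to a digitised page can look like, the following minimal Python sketch builds a WADM record that attaches a transcribed line of text to a rectangular region of a IIIF canvas. It is not the project's actual code: the function name, the example.org identifiers, the region coordinates, and the sample transcription are all hypothetical, and the "describing" motivation is one reasonable choice from the WADM vocabulary (IIIF manifests often use "supplementing" for transcriptions instead).

```python
import json


def make_transcription_annotation(anno_id, canvas_id, xywh, text, language="en"):
    """Build a W3C Web Annotation (WADM) dict for a transcribed page region.

    anno_id and canvas_id are URIs; xywh is (x, y, width, height) in the
    pixel coordinates of the canvas. All values here are illustrative.
    """
    x, y, w, h = xywh
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "describing",
        "body": {
            "type": "TextualBody",
            "value": text,
            "format": "text/plain",
            "language": language,
        },
        "target": {
            "type": "SpecificResource",
            "source": canvas_id,
            # Media-fragment selector: a rectangle on the page image.
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical identifiers and text, for illustration only.
    example = make_transcription_annotation(
        anno_id="https://example.org/annotations/1",
        canvas_id="https://example.org/iiif/notebook-1/canvas/p12",
        xywh=(120, 340, 900, 80),
        text="Turdus migratorius, 3 seen near creek",
    )
    print(json.dumps(example, indent=2))
```

Because the target points at the canvas rather than at a particular image file, the same annotation stays valid whichever image size or format the IIIF image service delivers.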

Connecting historic field observations with global data resources using persistent identifiers

Extending this investigation during 2025, the Smithsonian analysed tens of thousands of heritage biodiversity images, originally deposited in BHL from the Flickr Foundation, and created AI profiles of the plants and other organisms identified in them. These profiles, together with the high-resolution digital imagery, were added to the new repository to automatically create new collections of standards-based annotation records.

Hybrid workflows were then developed to support human-in-the-loop identification of the specimens via GBIF and GNA, producing a new corpus of historic field observation records that is freely accessible to the global scientific community.
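
One way such a hybrid step could be organised is sketched below in Python. It builds a request URL for GBIF's public species match endpoint and sorts a match response into one of three expert queues. The queue names, the `triage_match` function, and the confidence threshold of 90 are assumptions made for illustration; the source does not describe the project's actual routing rules. The `matchType`, `confidence`, and `usageKey` fields are those returned by the GBIF match service.

```python
from urllib.parse import urlencode

# Public GBIF name-matching endpoint (API v1).
GBIF_MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def gbif_match_url(name):
    """Return the GBIF species match URL for an AI-proposed scientific name."""
    return f"{GBIF_MATCH_ENDPOINT}?{urlencode({'name': name})}"


def triage_match(match, min_confidence=90):
    """Choose a (hypothetical) expert queue for one GBIF match response.

    Every record still goes to a human; the queue only signals how much
    work the expert is likely to need to do.
    """
    match_type = match.get("matchType", "NONE")
    confidence = match.get("confidence", 0)
    if match_type == "NONE" or "usageKey" not in match:
        return "identify"  # no usable match: expert identifies from scratch
    if match_type == "EXACT" and confidence >= min_confidence:
        return "confirm"   # strong match: expert confirms the proposal
    return "review"        # fuzzy, higher-rank, or low-confidence match


if __name__ == "__main__":
    print(gbif_match_url("Puma concolor"))
    # Hand-written sample response; usageKey 1 is a placeholder value.
    print(triage_match({"matchType": "FUZZY", "confidence": 85, "usageKey": 1}))
```

Keeping the routing decision separate from the network call makes the thresholds easy to adjust and test without contacting GBIF.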
