Sidero-mining: Systematic extraction of siderophore biosynthetic information using large language models reveals habitat- and pathogen-specific bacterial iron interaction networks

2026-03-01


Authors: Jiqi Shao, Yanzhao Wu, Ningbo Xi, Ruolin He, Ruichen Xu, Peng Guo, Shaohua Gu, Zhiyuan Li.

Journal: bioRxiv

DOI: 10.1101/2025.03.09.642270v1

Link: https://www.biorxiv.org/content/10.1101/2025.03.09.642270v1.full.pdf

Published: Mar 13, 2025

Document Type: Research Article

Abstract:

Siderophores are essential secondary metabolites widely distributed across microorganisms, displaying remarkable diversity. Despite extensive research, public databases contain limited information on siderophore biosynthetic gene clusters (BGCs), particularly lacking cross-species distribution and biosynthetic substrate annotations. Systematically collecting and organizing siderophore BGC synthesis data on a large scale would significantly enhance the use of domain knowledge and support data-driven research.

Large language models (LLMs) now offer a practical and scalable approach for mining and curating biological data, especially for converting literature insights into structured datasets. In this work, we developed the Sidero-Mining pipeline, which uses LLMs to efficiently extract siderophore BGC synthesis information. By employing LLMs to screen over 10,000 publications, we identified 1,843 high-quality articles for data mining based on the Sidero-Mining framework, manual validation, and data integration.

This effort culminated in the creation of the most comprehensive siderophore BGC dataset to date, containing 728 BGCs and 325 NRPS A domain substrate entries across various species. Our results highlight LLMs’ potential to accelerate secondary metabolite dataset construction, and our methodological framework can be adapted for systematically exploring other secondary metabolites.

