Authors: Jiqi Shao, Yanzhao Wu, Ningbo Xi, Ruolin He, Ruichen Xu, Peng Guo, Shaohua Gu, Zhiyuan Li.
Journal: bioRxiv
DOI: 10.1101/2025.03.09.642270v1
Link: https://www.biorxiv.org/content/10.1101/2025.03.09.642270v1.full.pdf
Published: Mar 13, 2025
Document Type: Research Article
Abstract:
Siderophores are essential secondary metabolites widely distributed across microorganisms, displaying remarkable diversity. Despite extensive research, public databases contain limited information on siderophore biosynthetic gene clusters (BGCs), particularly lacking cross-species distribution and biosynthetic substrate annotations. Systematically collecting and organizing siderophore BGC synthesis data on a large scale would significantly enhance the use of domain knowledge and support data-driven research.
Large language models (LLMs) now offer a practical and scalable approach for mining and curating biological data, especially for converting literature insights into structured datasets. In this work, we developed the Sidero-Mining pipeline, which uses LLMs to efficiently extract siderophore BGC synthesis information. By employing LLMs to screen over 10,000 publications, we identified 1,843 high-quality articles for data mining based on the Sidero-Mining framework, manual validation, and data integration.
This effort culminated in the creation of the most comprehensive siderophore BGC dataset to date, containing 728 BGCs and 325 NRPS A domain substrate entries across various species. Our results highlight LLMs’ potential to accelerate secondary metabolite dataset construction, and our methodological framework can be adapted for systematically exploring other secondary metabolites.