Ground truth dataset of life sciences software, comprising 44 tools analyzed by experts. In total, 16 experts took part in building the ground truth: 2 in Genetic Variant, 2 in Metagenomics, 3 in Phylogeny, 2 in Single Cell, 3 in Systems Biology, and 4 in Bio-imaging, which covers the fields of Microscopy and Neuroimaging. An LLM (DeepSeek V3.1) was used to annotate the ground-truth data, and these annotations were reviewed and validated by the experts. The LLM annotations selected by the experts were incorporated into the expert-LLM consensus. The expert-LLM annotations of the tools are compared to those in the bio.tools registry using the EDAM ontology. This ground truth is supported by ShareFAIR.
- Clone the EDAM_ground_truth repository
git clone https://github.com/ulysseLeclanche/EDAM_ground_truth.git
cd EDAM_ground_truth
- All the packages used are listed in environment.yml; a Conda environment can be created and activated with
conda env create -f environment.yml
conda activate EDAM_GT
This repository contains resources related to the EDAM ground truth dataset of life sciences software.
Description of the prompt design and the example provided to experts for annotation.
Example prompts used with the LLM (DeepSeek V3.1) to replicate expert annotation procedures.
List of software tools included in the dataset, with links to their corresponding entries in bio.tools.
Raw_free_text_annotation/ (text and JSON formats): JSON files are generated using Confusion_matrix_json.ipynb.
Distribution_free_text_annotation/: Free-text annotations by type (topic, operation, input/output data type, and input/output format), generated with Distribution_free_text_annotation.ipynb.
EDAM_terms_URI_ground_truth_validated.tsv
Contains the URIs of validated EDAM terms, their labels, and the associated free-text annotation proposals from experts. Each annotation is linked to one or more tools, within a domain and a category (topics, operations, format, or data).
All metrics are calculated by comparing DeepSeek's proposals and the expert consensus against the expert-LLM consensus.
Metrics are computed using Confusion_matrix_json.ipynb, and figures are produced with Contribution_LLM_expert_Precision_recall_F1_figures.ipynb.
Recall, Precision, and F1 score were calculated for all tools across each domain, as well as by annotation type.
- All confusion matrices are in:
Confusion_matrix_free_text_annotation/
Confusion matrices contain: TP, FN, FP, recall, precision, F1 score, annotations retained or rejected, and mixed annotations.
plots/: Figures for recall, precision, and F1 score by domain and annotation type.
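As an illustration of how the TP, FP, and FN counts relate to recall, precision, and F1 score, here is a minimal sketch using the standard definitions. It assumes annotations are compared as plain sets of strings; the function names and sample annotations are hypothetical and are not taken from Confusion_matrix_json.ipynb.

```python
from __future__ import annotations


def confusion_counts(proposed: set[str], reference: set[str]) -> tuple[int, int, int]:
    """Count TP, FP, FN for proposed annotations against a reference set.

    Assumption: an annotation counts as correct only on an exact string match.
    """
    tp = len(proposed & reference)   # proposed and present in the reference
    fp = len(proposed - reference)   # proposed but absent from the reference
    fn = len(reference - proposed)   # in the reference but never proposed
    return tp, fp, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Standard precision, recall, and F1; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations for a single tool.
llm_proposals = {"alignment", "variant calling", "visualisation"}
consensus = {"variant calling", "visualisation", "filtering", "annotation"}

tp, fp, fn = confusion_counts(llm_proposals, consensus)
scores = precision_recall_f1(tp, fp, fn)
```

Scores per domain or per annotation type could be obtained the same way, by summing the counts over the relevant tools before calling `precision_recall_f1`.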
Identification of free-text annotations found by experts/LLM but absent from the EDAM ontology.
The notebook Contingency_table_annotation_consensus_mapped.ipynb calculates the contingency table for free-text annotations, regardless of whether they have been validated as EDAM annotations.
Comparison between ground-truth validated annotations (expert + LLM consensus) and existing annotations in bio.tools, using the notebook Contingency_table_Biotools_vs_ground_truth.ipynb.
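The idea behind such a comparison can be sketched as follows. This is a simplified illustration, not the notebook's code: it assumes each tool's annotations are available as sets of EDAM term identifiers, and the function name and sample identifiers are placeholders.

```python
from __future__ import annotations


def contingency_counts(ground_truth: set[str], biotools: set[str]) -> dict[str, int]:
    """Split the annotations of one tool into three contingency cells."""
    return {
        "both": len(ground_truth & biotools),               # shared annotations
        "ground_truth_only": len(ground_truth - biotools),  # missing from bio.tools
        "biotools_only": len(biotools - ground_truth),      # not in the ground truth
    }


def sum_counts(per_tool: list[dict[str, int]]) -> dict[str, int]:
    """Aggregate per-tool cells, for example over all tools of one domain."""
    total = {"both": 0, "ground_truth_only": 0, "biotools_only": 0}
    for counts in per_tool:
        for cell, value in counts.items():
            total[cell] += value
    return total


# Placeholder identifiers, used only to show the shape of the data.
tool_a = contingency_counts({"topic_0001", "operation_0002", "format_0003"},
                            {"operation_0002", "format_0003", "data_0004"})
tool_b = contingency_counts({"topic_0001"}, {"topic_0001", "topic_0005"})
domain_total = sum_counts([tool_a, tool_b])
```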
Annotations inherited from the EDAM ontology, derived from the direct annotations, are calculated using edam_neighbors.py.
All notebooks and contingency tables are in the folder: Contingency_table/
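To illustrate what inherited annotations can mean, the sketch below collects every ancestor of the directly annotated terms in a small hand-written hierarchy. It is only an assumption about the general approach: the actual behaviour of edam_neighbors.py is not described here, the `parents` mapping is a toy stand-in for the EDAM ontology, and all names are hypothetical.

```python
from __future__ import annotations


def ancestors(term: str, parents: dict[str, set[str]]) -> set[str]:
    """Return all terms reachable from `term` by following parent links."""
    seen: set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:   # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def inherited_annotations(direct: set[str], parents: dict[str, set[str]]) -> set[str]:
    """Ancestors of the direct annotations, excluding the direct ones themselves."""
    inherited: set[str] = set()
    for term in direct:
        inherited |= ancestors(term, parents)
    return inherited - direct


# Toy hierarchy (child -> parents); these are not real EDAM terms.
toy_parents = {
    "leaf_term": {"mid_term"},
    "mid_term": {"root_term"},
    "other_leaf": {"root_term"},
}

inherited = inherited_annotations({"leaf_term"}, toy_parents)
```

An iterative traversal with a `seen` set is used so that terms with several parents are visited only once.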
Authors:
Ulysse Le Clanche¹, Melvin Selim Atay², Elise Bannier²,³, Anaïs Baudot⁴,⁵, Lea Bellenger⁶, Alexandrina Bodrug⁶, Samuel Chaffron⁷, Eric Charpentier⁶, Erwan Corre⁸, Clémence Frioux⁹, Aurélie Lardenois¹⁰, Frédéric Lemoine¹¹, Camille Maumet², Cyril Noël¹², Perrine Paul-Gilloteaux¹³, Paul Simion¹⁴, Morgane Térézol⁴, Olivier Dameron¹, and Alban Gaignard⁶
Affiliations:
1. Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F-35000 Rennes, France
2. Univ Rennes, CNRS, Inria, Inserm, IRISA UMR 6074, EMPENN - ERL U 1228, F-35000 Rennes, France
3. CHU Rennes, Radiology Department, Rennes, France
4. Aix Marseille Université, INSERM, MMG, Marseille, France
5. CNRS, Marseille, France
6. Nantes Université, CNRS, INSERM, l'institut du thorax, F-44000 Nantes, France
7. Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
8. ABiMS-IFB, Station Biologique de Roscoff, CNRS/Sorbonne Université, Roscoff, France
9. Inria, Univ. Bordeaux, INRAE, 33400, Talence, France
10. Institut National de Santé et de Recherche Médicale, U1085-Irset, Université de Rennes 1, F-35042 Rennes, France
11. Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
12. SeBiMER Service de Bioinformatique de l'Ifremer, Ifremer, IRSI, Plouzané, France
13. Nantes Université, CHU Nantes, CNRS, Inserm, BioCore, US16, SFR Bonamy, Nantes, France
14. EcoBio - Ecosystems, Biodiversity, Evolution, Université de Rennes 1, 35042 Rennes, France
This project is licensed under the MIT License.