EDAM_ground_truth

Ground-truth dataset of life sciences software, comprising 44 tools analyzed by experts. In total, 16 experts took part in the ground truth: 2 in Genetic Variant, 2 in Metagenomics, 3 in Phylogeny, 2 in Single Cell, 3 in Systems Biology, and 4 in Bio-imaging, which covers the fields of Microscopy and Neuroimaging. An LLM (DeepSeek V3.1) was used to annotate the ground-truth data, and these annotations were reviewed and validated by experts. The LLM annotations selected by the experts were incorporated into the expert-LLM consensus. The expert-LLM annotations of the tools are compared with those in the bio.tools registry, using the EDAM ontology. This ground truth is supported by ShareFAIR.

Installation

Clone the EDAM_ground_truth repository

git clone https://github.com/ulysseLeclanche/EDAM_ground_truth.git

cd EDAM_ground_truth

All required packages are listed in environment.yml. Create and activate the Conda environment:

conda env create -f environment.yml

conda activate EDAM_GT

Content and Usage

This repository contains resources related to the EDAM ground truth dataset of life sciences software.

1. Ground-Truth Dataset

1.1. Prompting Strategy

Supplementary_material.pdf

Description of the prompt design and the example provided to experts for annotation.

Prompt_exemples_two_annotated_tools.txt

Example prompts used with the LLM (DeepSeek V3.1) to replicate expert annotation procedures.

1.2. List of software tools

Tools_lists.csv

List of software tools included in the dataset, with links to their corresponding entries in bio.tools.

1.3. Raw free-text annotations from experts and DeepSeek

Raw_free_text_annotation/ (text and JSON formats). The JSON files are generated using Confusion_matrix_json.ipynb.

1.4. Distribution of free-text annotations by type

Distribution_free_text_annotation/ contains the free-text annotations by type (topic, operation, input/output data type, and input/output format), generated with Distribution_free_text_annotation.ipynb.

1.5. Ground-truth EDAM terms validated by experts

EDAM_terms_URI_ground_truth_validated.tsv
Contains the URIs of validated EDAM terms, their labels, and the associated free-text annotation proposals from experts. Each annotation is linked to one or more tools, within a domain and a category (topics, operations, format, or data).
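As a minimal sketch of how the validated terms could be summarized, the snippet below counts rows per value of one column in tab-separated text. The column names (`uri`, `label`, `category`) and the toy rows are assumptions for illustration only; check the header of EDAM_terms_URI_ground_truth_validated.tsv for the real names before using it.

```python
import csv
from collections import Counter


def count_by_column(lines, column):
    """Count data rows per distinct value of `column` in tab-separated text.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"{column!r} not found in header {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Toy rows with hypothetical column names and placeholder values.
sample = [
    "uri\tlabel\tcategory\n",
    "placeholder:1\tterm one\ttopics\n",
    "placeholder:2\tterm two\toperations\n",
    "placeholder:3\tterm three\ttopics\n",
]
counts = count_by_column(sample, "category")
print(counts["topics"], counts["operations"])
```

To run it on the real file, pass an open handle instead of the toy list, for example `open("EDAM_terms_URI_ground_truth_validated.tsv", newline="", encoding="utf-8")`, together with the actual column name.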

2. LLM vs Expert Annotation Analysis

2.1. Metrics: Recall, Precision, and F1 score

All metrics are calculated by comparing DeepSeek's proposals and the expert consensus against the LLM-Expert consensus.

Metrics are computed with Confusion_matrix_json.ipynb, and figures are produced with Contribution_LLM_expert_Precision_recall_F1_figures.ipynb.
Recall, precision, and F1 score are calculated over all tools within each domain, as well as per annotation type.
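The metrics follow the standard definitions, which can be sketched as below for one set of proposals scored against a consensus set. This is an illustration, not the notebook's code: it assumes annotations are compared by exact string match, and the function name and toy annotations are hypothetical.

```python
def precision_recall_f1(proposed, consensus):
    """Score `proposed` annotations against the `consensus` annotations."""
    proposed, consensus = set(proposed), set(consensus)
    tp = len(proposed & consensus)   # proposed and kept in the consensus
    fp = len(proposed - consensus)   # proposed but not in the consensus
    fn = len(consensus - proposed)   # in the consensus but not proposed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "F1": f1}


# Toy free-text annotations (hypothetical values).
llm = {"sequence alignment", "phylogenetic tree", "FASTA"}
consensus = {"sequence alignment", "FASTA", "Newick"}
scores = precision_recall_f1(llm, consensus)
print(scores["TP"], scores["FP"], scores["FN"])
```

The same function can score the expert consensus against the LLM-Expert consensus by passing the expert annotations as `proposed`.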

2.2. Resources and plots

All confusion matrices are in Confusion_matrix_free_text_annotation/.
Each confusion matrix contains TP, FN, FP, recall, precision, F1 score, the annotations retained or rejected, and mixed annotations.
plots/ contains the figures for recall, precision, and F1 score by domain and annotation type.

3. Added Value of LLM and Expert Annotations

3.1. Missing annotations in EDAM

Identification of free-text annotations proposed by experts or the LLM but absent from the EDAM ontology.
The notebook Contingency_table_annotation_consensus_mapped.ipynb calculates the contingency table for free-text annotations, regardless of whether they have been validated as EDAM annotations.
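A contingency table of this kind can be sketched as a cross-tabulation of two attributes of each annotation. The record fields used here (`source`, `edam`) and their values are assumptions chosen for illustration; the notebook may cross different variables.

```python
from collections import Counter


def contingency_table(records, row_key, col_key):
    """Cross-tabulate `records` (dicts) by the values of two keys."""
    counts = Counter((rec[row_key], rec[col_key]) for rec in records)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    # Counter returns 0 for combinations that never occur.
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}


# Toy annotations with hypothetical fields.
annotations = [
    {"source": "expert", "edam": "mapped"},
    {"source": "expert", "edam": "mapped"},
    {"source": "LLM", "edam": "mapped"},
    {"source": "LLM", "edam": "unmapped"},
]
table = contingency_table(annotations, "source", "edam")
for source, row in table.items():
    print(source, row)
```

In this toy layout, the `unmapped` column would hold the annotations with no EDAM counterpart.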

3.2. Contribution to bio.tools

Comparison between ground-truth validated annotations (expert-LLM consensus) and existing annotations in bio.tools, using the notebook Contingency_table_Biotools_vs_ground_truth.ipynb. Annotations inherited through the EDAM ontology from the direct annotations are calculated using edam_neighbors.py.
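The comparison can be pictured as set operations on EDAM terms after each side has been expanded with inherited terms. The sketch below assumes that "inherited" means the ancestors of each directly annotated term; the `parents` mapping, the function names, and the placeholder term identifiers are hypothetical and do not reflect the interface of edam_neighbors.py.

```python
def with_ancestors(terms, parents):
    """Return `terms` plus every ancestor reachable through `parents`."""
    seen = set(terms)
    stack = list(terms)
    while stack:
        term = stack.pop()
        for parent in parents.get(term, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare_annotations(ground_truth, biotools, parents):
    """Split expanded annotations into shared and source-specific sets."""
    gt = with_ancestors(ground_truth, parents)
    bt = with_ancestors(biotools, parents)
    return {
        "shared": gt & bt,
        "ground_truth_only": gt - bt,
        "biotools_only": bt - gt,
    }


# Toy hierarchy with placeholder identifiers (child -> parents).
parents = {
    "term:C": {"term:B"},
    "term:B": {"term:A"},
    "term:D": {"term:A"},
}
result = compare_annotations({"term:C"}, {"term:D"}, parents)
print(sorted(result["shared"]))
```

In this toy case the two tools' direct annotations differ, yet they share the common ancestor `term:A`, which is the kind of overlap that only appears once inherited annotations are taken into account.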

All notebooks and contingency tables are in the folder: Contingency_table/

Authors and Affiliations:

Authors:
Ulysse Le Clanche¹, Melvin Selim Atay², Elise Bannier²,³, Anaïs Baudot⁴,⁵, Lea Bellenger⁶, Alexandrina Bodrug⁶, Samuel Chaffron⁷, Eric Charpentier⁶, Erwan Corre⁸, Clémence Frioux⁹, Aurélie Lardenois¹⁰, Frédéric Lemoine¹¹, Camille Maumet², Cyril Noël¹², Perrine Paul-Gilloteaux¹³, Paul Simion¹⁴, Morgane Térézol⁴, Olivier Dameron¹, and Alban Gaignard⁶

Affiliations:

¹ Univ Rennes, Inria, CNRS, IRISA - UMR 6074, F-35000 Rennes, France
² Univ Rennes, CNRS, Inria, Inserm, IRISA UMR 6074, EMPENN — ERL U 1228, F-35000 Rennes, France
³ CHU Rennes, Radiology Department, Rennes, France
⁴ Aix Marseille Université, INSERM, MMG, Marseille, France
⁵ CNRS, Marseille, France
⁶ Nantes Université, CNRS, INSERM, l'institut du thorax, F-44000 Nantes, France
⁷ Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
⁸ ABiMS-IFB, Station Biologique de Roscoff, CNRS/Sorbonne Université, Roscoff, France
⁹ Inria, Univ. Bordeaux, INRAE, 33400, Talence, France
¹⁰ Institut National de Santé et de Recherche Médicale, U1085-Irset, Université de Rennes 1, F-35042 Rennes, France
¹¹ Institut Pasteur, Université Paris Cité, Bioinformatics and Biostatistics Hub, F-75015 Paris, France
¹² SeBiMER Service de Bioinformatique de l'Ifremer, Ifremer, IRSI, Plouzané, France
¹³ Nantes Université, CHU Nantes, CNRS, Inserm, BioCore, US16, SFR Bonamy, Nantes, France
¹⁴ EcoBio - Ecosystems, Biodiversity, Evolution, Université de Rennes 1, 35042 Rennes, France

License

This project is licensed under the MIT License.
