Skip to content

Add parentage functions#56

Closed
josuechinchilla wants to merge 50 commits intodevelopmentfrom
add_parentage_functions
Closed

Add parentage functions#56
josuechinchilla wants to merge 50 commits intodevelopmentfrom
add_parentage_functions

Conversation

@josuechinchilla
Copy link
Copy Markdown
Collaborator

Added 2 parentage functions along with the test files and updated package giles to include updates

Cristianetaniguti and others added 30 commits October 3, 2025 15:11
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands BIGr with new pedigree/parentage utilities and broader updates to MADC→VCF tooling, pedigree QC, and related documentation/tests, culminating in a version bump to 0.7.0.

Changes:

  • Added new parentage/pedigree functions: find_parentage() and validate_pedigree() with corresponding test coverage.
  • Added/updated MADC processing utilities (check_madc_sanity, madc2vcf_multi, get_countsMADC/get_counts) plus extensive integration-style tests.
  • Updated check_ped(), filterVCF(), package metadata, NAMESPACE, NEWS, docs, and CI workflow dependencies.

Reviewed changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
R/validate_pedigree.R New trio validation + correction file writing.
tests/testthat/test-validate_pedigree.R New unit tests for validate_pedigree().
R/find_parentage.R New parentage assignment (single-parent + best-pair) logic.
tests/testthat/test-find_parentage.R New unit tests for find_parentage().
R/madc2vcf_multi.R New madc2vcf_multi() polyRAD-based multi-allelic conversion.
tests/testthat/test-madc2vcf_multi.R New integration tests for madc2vcf_multi().
R/check_madc_sanity.R New MADC sanity checker + internal check_botloci().
tests/testthat/test-check_madc_sanity.R New tests for check_madc_sanity().
R/get_countsMADC.R Extended count-extraction API (madc_object, collapsing matches, verbose).
R/utils.R Added vmsg() + url_exists() and updated globalVariables().
R/filterVCF.R Added quality.rates metrics and adjusted console output behavior.
R/check_ped.R Reworked pedigree QC and added optional “corrected” outputs.
tests/testthat/test-check_ped.R Updated expected output list length.
tests/testthat/test-madc2vcf_targets.R Removed cross-function REF/ALT comparison; added large remote-file test block.
tests/testthat/test-madc2vcf_all.R Reduced verbosity; added large remote-file test block.
tests/testthat/test-imputation_concordance.R Minor whitespace cleanup.
R/imputation_concordance.R Added plotting + configurable printing; updated docs.
NEWS.md Added release notes through 0.7.0 and prior versions’ details.
DESCRIPTION Version bump to 0.7.0; imports/suggests updated (data.table, ggplot2, polyRAD).
NAMESPACE Exported new functions; added imports (data.table, ggplot2, etc.).
man/validate_pedigree.Rd Generated docs for validate_pedigree().
man/find_parentage.Rd Generated docs for find_parentage().
man/madc2vcf_multi.Rd Generated docs for madc2vcf_multi().
man/check_madc_sanity.Rd Generated docs for check_madc_sanity().
man/get_countsMADC.Rd Updated docs for new get_countsMADC() signature.
man/get_counts.Rd Added docs for internal get_counts().
man/imputation_concordance.Rd Updated docs for new args/behavior.
man/madc2vcf_targets.Rd Updated docs for new args/requirements.
man/madc2vcf_all.Rd Updated docs to reflect new required args and options.
man/filterVCF.Rd Updated docs for quality.rates.
man/check_ped.Rd Updated docs for revised check_ped() behavior.
man/vmsg.Rd Added docs for vmsg().
.github/workflows/R-CMD-check.yaml CI installs polyRAD + VariantAnnotation for tests.
.gitignore Added .DS_Store.
BIGr.Rproj Added ProjectId.
CRAN-SUBMISSION Removed file.
cran-comments.md Removed file.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/validate_pedigree.R Outdated
Comment on lines +168 to +172
(sire_vec == 0 & dam_vec == 0 & progeny_vec > 0) |
(sire_vec == 2 & dam_vec == 2 & progeny_vec < 2) |
((sire_vec == 0 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 0)) & (progeny_vec == 2) |
((sire_vec == 2 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 2)) & (progeny_vec == 0) |
((sire_vec == 0 & dam_vec == 2) | (sire_vec == 2 & dam_vec == 0)) & (progeny_vec != 1),
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Mendelian error conditions mix | and & without grouping, so R’s operator precedence makes these tests evaluate differently than intended (e.g., A | B & C parses as A | (B & C)). This will miscount mismatches for several parent genotype combinations. Wrap each multi-part condition in parentheses so each rule is evaluated as ( (parent_combo) & (progeny_condition) ) before OR’ing the rules together.

Suggested change
(sire_vec == 0 & dam_vec == 0 & progeny_vec > 0) |
(sire_vec == 2 & dam_vec == 2 & progeny_vec < 2) |
((sire_vec == 0 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 0)) & (progeny_vec == 2) |
((sire_vec == 2 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 2)) & (progeny_vec == 0) |
((sire_vec == 0 & dam_vec == 2) | (sire_vec == 2 & dam_vec == 0)) & (progeny_vec != 1),
((sire_vec == 0 & dam_vec == 0 & progeny_vec > 0)) |
((sire_vec == 2 & dam_vec == 2 & progeny_vec < 2)) |
((((sire_vec == 0 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 0)) & (progeny_vec == 2))) |
((((sire_vec == 2 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 2)) & (progeny_vec == 0))) |
((((sire_vec == 0 & dam_vec == 2) | (sire_vec == 2 & dam_vec == 0)) & (progeny_vec != 1))),

Copilot uses AI. Check for mistakes.
Comment thread R/find_parentage.R Outdated
Comment on lines +197 to +201
(sire_genos_mat == 0 & dam_genos_mat == 0 & progeny_vec > 0) |
(sire_genos_mat == 2 & dam_genos_mat == 2 & progeny_vec < 2) |
((sire_genos_mat == 0 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 0)) & (progeny_vec == 2) |
((sire_genos_mat == 2 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 2)) & (progeny_vec == 0) |
((sire_genos_mat == 0 & dam_genos_mat == 2) | (sire_genos_mat == 2 & dam_genos_mat == 0)) & (progeny_vec != 1),
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same precedence issue as in validate_pedigree(): the Mendelian error rules combine | and & without parentheses, so the logic does not match the intended ((parent_combo) & (progeny_condition)) form and will miscount mismatches. Add explicit parentheses around each rule before combining with |.

Suggested change
(sire_genos_mat == 0 & dam_genos_mat == 0 & progeny_vec > 0) |
(sire_genos_mat == 2 & dam_genos_mat == 2 & progeny_vec < 2) |
((sire_genos_mat == 0 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 0)) & (progeny_vec == 2) |
((sire_genos_mat == 2 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 2)) & (progeny_vec == 0) |
((sire_genos_mat == 0 & dam_genos_mat == 2) | (sire_genos_mat == 2 & dam_genos_mat == 0)) & (progeny_vec != 1),
((sire_genos_mat == 0 & dam_genos_mat == 0 & progeny_vec > 0)) |
((sire_genos_mat == 2 & dam_genos_mat == 2 & progeny_vec < 2)) |
(((sire_genos_mat == 0 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 0)) & (progeny_vec == 2)) |
(((sire_genos_mat == 2 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 2)) & (progeny_vec == 0)) |
(((sire_genos_mat == 0 & dam_genos_mat == 2) | (sire_genos_mat == 2 & dam_genos_mat == 0)) & (progeny_vec != 1)),

Copilot uses AI. Check for mistakes.
Comment thread R/filterVCF.R

if (!inherits(vcf.file, "vcfR")) {
vcf <- read.vcfR(vcf.file, verbose = FALSE)
vcf <- read.vcfR(vcf.file)
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read.vcfR() is now called without verbose = FALSE, which can produce a lot of console output from filterVCF() even though the function has no verbosity control. Consider restoring verbose = FALSE (or adding a verbose argument to filterVCF()) to avoid noisy output in pipelines.

Suggested change
vcf <- read.vcfR(vcf.file)
vcf <- read.vcfR(vcf.file, verbose = FALSE)

Copilot uses AI. Check for mistakes.
Comment thread R/filterVCF.R
Comment on lines +103 to +107

mean_depth_marker <- rowMeans(dp_orig, na.rm = TRUE)
genotype_present <- !is.na(gt_orig)
genotyping_rate_marker <- rowMeans(genotype_present)

Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

genotype_present <- !is.na(gt_orig) treats missing genotypes like "./." as present (they are strings, not NA), which inflates genotyping-rate calculations (and downstream sample/marker stats). Treat "./." (and possibly ".|.") as missing when computing genotype_present and heterozygosity.

Copilot uses AI. Check for mistakes.
Comment thread R/filterVCF.R
}

#Message that includes the output vcf stats
print(vcf)
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print(vcf) is executed unconditionally at the end, which can be very expensive/noisy for large VCFs and cannot be disabled. Consider removing this print or guarding it behind a new verbose flag (or reuse quality.rates).

Suggested change
print(vcf)
if (isTRUE(quality.rates)) {
print(vcf)
}

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +4
test_that("check madc",{

github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/test_madcs/"
names <- c("Columns", "FixAlleleIDs", "IUPACcodes", "LowerCase", "Indels", "ChromPos", "allNAcol", "allNArow", "RefAltSeqs", "OtherAlleles")
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests download files from GitHub but don’t call skip_if_offline(), so they will fail in offline/CRAN/air-gapped CI environments. Add skip_if_offline("raw.githubusercontent.com") (or equivalent) near the top of the test file.

Copilot uses AI. Check for mistakes.
Comment thread tests/testthat/test-find_parentage.R Outdated
Comment on lines +79 to +83
# CONCLUSION: We must allow heterozygous markers but avoid the
# operator-precedence bug. Looking at the source code condition:
#
# ((sire==0 & dam==1) | (sire==1 & dam==0)) & (prog==2)
#
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test design/comments explicitly rely on a known operator-precedence bug in the Mendelian mismatch logic. Tests should validate intended behavior, not depend on buggy behavior to break ties; otherwise fixing the code will require rewriting tests. After correcting precedence in find_parentage(), update toy genotypes/expectations accordingly.

Copilot uses AI. Check for mistakes.
Comment thread R/madc2vcf_multi.R
Comment on lines +6 to +8
#' rows/columns, and sets `n.header.rows` automatically based on whether the
#' MADC file follows the raw DArT format (6 header rows) or the fixed allele ID
#' format (no header rows).
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs claim n.header.rows is set automatically for raw vs fixed MADC formats, but the implementation stops when FixAlleleIDs is FALSE and always calls readDArTag(..., n.header.rows = 0L). Please update the documentation to reflect actual behavior (or implement the advertised auto-detection).

Suggested change
#' rows/columns, and sets `n.header.rows` automatically based on whether the
#' MADC file follows the raw DArT format (6 header rows) or the fixed allele ID
#' format (no header rows).
#' rows/columns, writes the corrected data to a temporary file, and then reads
#' that file with `polyRAD::readDArTag`. It does not automatically detect raw
#' DArT files with 6 header rows; inputs are handled as fixed/headerless MADC
#' data after preprocessing.

Copilot uses AI. Check for mistakes.
Comment thread R/check_madc_sanity.R
Comment on lines +3 to +7
#' @description
#' Performs nine quick validations on an allele report:
#' 1) **Columns** - required columns are present (`CloneID`, `AlleleID`, `AlleleSequence`);
#' 2) **FixAlleleIDs** - first column's first up-to-6 rows are not all blank or `"*"`
#' *and* both `_0001` and `_0002` appear in `AlleleID`;
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs say “nine quick validations”, but the implementation includes an additional OtherAlleles check (and checks/messages are initialized with OtherAlleles). Update the description/list so the documented number and items match what the function returns.

Copilot uses AI. Check for mistakes.
Comment thread NEWS.md
Comment on lines +1 to +5
# BIGr 0.7.0

## New function `madc2vcf_multi`

- New function `madc2vcf_multi` to convert a DArTag MADC file to a VCF using the polyRAD pipeline for multiallelic genotyping
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description focuses on adding parentage functions, but this PR also introduces/changes substantial MADC→VCF functionality (and bumps the package to 0.7.0). Please update the PR description (or split the PR) so reviewers can accurately track scope and release impact.

Copilot uses AI. Check for mistakes.
Modified Sire and Dam calls to male_parent and female_parent to be species agnostic.

Modified the test files to accomodate the changes in functions

Modified importFrom statement for thinSNP to call specific functions and resolve warnings when installing BIGr

Updated package documentation
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 71.80212% with 399 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.39%. Comparing base (4f30e52) to head (9824e29).

Files with missing lines Patch % Lines
R/madc2vcf_all.R 65.65% 113 Missing ⚠️
R/madc2vcf_targets.R 74.49% 63 Missing ⚠️
R/validate_pedigree.R 76.00% 54 Missing ⚠️
R/filterVCF.R 21.66% 47 Missing ⚠️
R/check_ped.R 56.09% 36 Missing ⚠️
R/check_madc_sanity.R 81.15% 26 Missing ⚠️
R/madc2vcf_multi.R 72.22% 25 Missing ⚠️
R/imputation_concordance.R 50.00% 18 Missing ⚠️
R/find_parentage.R 93.33% 9 Missing ⚠️
R/get_countsMADC.R 84.61% 8 Missing ⚠️
Additional details and impacted files
@@               Coverage Diff               @@
##           development      #56      +/-   ##
===============================================
- Coverage        83.05%   78.39%   -4.67%     
===============================================
  Files               19       23       +4     
  Lines             1369     2536    +1167     
===============================================
+ Hits              1137     1988     +851     
- Misses             232      548     +316     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants