Conversation
…t/BIGr into ped_indels_update
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Madc2vcf updates
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ht/BIGr into ped_indels_update
There was a problem hiding this comment.
Pull request overview
This PR expands BIGr with new pedigree/parentage utilities and broader updates to MADC→VCF tooling, pedigree QC, and related documentation/tests, culminating in a version bump to 0.7.0.
Changes:
- Added new parentage/pedigree functions:
find_parentage()andvalidate_pedigree()with corresponding test coverage. - Added/updated MADC processing utilities (
check_madc_sanity,madc2vcf_multi,get_countsMADC/get_counts) plus extensive integration-style tests. - Updated
check_ped(),filterVCF(), package metadata, NAMESPACE, NEWS, docs, and CI workflow dependencies.
Reviewed changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| R/validate_pedigree.R | New trio validation + correction file writing. |
| tests/testthat/test-validate_pedigree.R | New unit tests for validate_pedigree(). |
| R/find_parentage.R | New parentage assignment (single-parent + best-pair) logic. |
| tests/testthat/test-find_parentage.R | New unit tests for find_parentage(). |
| R/madc2vcf_multi.R | New madc2vcf_multi() polyRAD-based multi-allelic conversion. |
| tests/testthat/test-madc2vcf_multi.R | New integration tests for madc2vcf_multi(). |
| R/check_madc_sanity.R | New MADC sanity checker + internal check_botloci(). |
| tests/testthat/test-check_madc_sanity.R | New tests for check_madc_sanity(). |
| R/get_countsMADC.R | Extended count-extraction API (madc_object, collapsing matches, verbose). |
| R/utils.R | Added vmsg() + url_exists() and updated globalVariables(). |
| R/filterVCF.R | Added quality.rates metrics and adjusted console output behavior. |
| R/check_ped.R | Reworked pedigree QC and added optional “corrected” outputs. |
| tests/testthat/test-check_ped.R | Updated expected output list length. |
| tests/testthat/test-madc2vcf_targets.R | Removed cross-function REF/ALT comparison; added large remote-file test block. |
| tests/testthat/test-madc2vcf_all.R | Reduced verbosity; added large remote-file test block. |
| tests/testthat/test-imputation_concordance.R | Minor whitespace cleanup. |
| R/imputation_concordance.R | Added plotting + configurable printing; updated docs. |
| NEWS.md | Added release notes through 0.7.0 and prior versions’ details. |
| DESCRIPTION | Version bump to 0.7.0; imports/suggests updated (data.table, ggplot2, polyRAD). |
| NAMESPACE | Exported new functions; added imports (data.table, ggplot2, etc.). |
| man/validate_pedigree.Rd | Generated docs for validate_pedigree(). |
| man/find_parentage.Rd | Generated docs for find_parentage(). |
| man/madc2vcf_multi.Rd | Generated docs for madc2vcf_multi(). |
| man/check_madc_sanity.Rd | Generated docs for check_madc_sanity(). |
| man/get_countsMADC.Rd | Updated docs for new get_countsMADC() signature. |
| man/get_counts.Rd | Added docs for internal get_counts(). |
| man/imputation_concordance.Rd | Updated docs for new args/behavior. |
| man/madc2vcf_targets.Rd | Updated docs for new args/requirements. |
| man/madc2vcf_all.Rd | Updated docs to reflect new required args and options. |
| man/filterVCF.Rd | Updated docs for quality.rates. |
| man/check_ped.Rd | Updated docs for revised check_ped() behavior. |
| man/vmsg.Rd | Added docs for vmsg(). |
| .github/workflows/R-CMD-check.yaml | CI installs polyRAD + VariantAnnotation for tests. |
| .gitignore | Added .DS_Store. |
| BIGr.Rproj | Added ProjectId. |
| CRAN-SUBMISSION | Removed file. |
| cran-comments.md | Removed file. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| (sire_vec == 0 & dam_vec == 0 & progeny_vec > 0) | | ||
| (sire_vec == 2 & dam_vec == 2 & progeny_vec < 2) | | ||
| ((sire_vec == 0 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 0)) & (progeny_vec == 2) | | ||
| ((sire_vec == 2 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 2)) & (progeny_vec == 0) | | ||
| ((sire_vec == 0 & dam_vec == 2) | (sire_vec == 2 & dam_vec == 0)) & (progeny_vec != 1), |
There was a problem hiding this comment.
The Mendelian error conditions mix | and & without grouping, so R’s operator precedence makes these tests evaluate differently than intended (e.g., A | B & C parses as A | (B & C)). This will miscount mismatches for several parent genotype combinations. Wrap each multi-part condition in parentheses so each rule is evaluated as ( (parent_combo) & (progeny_condition) ) before OR’ing the rules together.
| (sire_vec == 0 & dam_vec == 0 & progeny_vec > 0) | | |
| (sire_vec == 2 & dam_vec == 2 & progeny_vec < 2) | | |
| ((sire_vec == 0 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 0)) & (progeny_vec == 2) | | |
| ((sire_vec == 2 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 2)) & (progeny_vec == 0) | | |
| ((sire_vec == 0 & dam_vec == 2) | (sire_vec == 2 & dam_vec == 0)) & (progeny_vec != 1), | |
| ((sire_vec == 0 & dam_vec == 0 & progeny_vec > 0)) | | |
| ((sire_vec == 2 & dam_vec == 2 & progeny_vec < 2)) | | |
| ((((sire_vec == 0 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 0)) & (progeny_vec == 2))) | | |
| ((((sire_vec == 2 & dam_vec == 1) | (sire_vec == 1 & dam_vec == 2)) & (progeny_vec == 0))) | | |
| ((((sire_vec == 0 & dam_vec == 2) | (sire_vec == 2 & dam_vec == 0)) & (progeny_vec != 1))), |
| (sire_genos_mat == 0 & dam_genos_mat == 0 & progeny_vec > 0) | | ||
| (sire_genos_mat == 2 & dam_genos_mat == 2 & progeny_vec < 2) | | ||
| ((sire_genos_mat == 0 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 0)) & (progeny_vec == 2) | | ||
| ((sire_genos_mat == 2 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 2)) & (progeny_vec == 0) | | ||
| ((sire_genos_mat == 0 & dam_genos_mat == 2) | (sire_genos_mat == 2 & dam_genos_mat == 0)) & (progeny_vec != 1), |
There was a problem hiding this comment.
Same precedence issue as in validate_pedigree(): the Mendelian error rules combine | and & without parentheses, so the logic does not match the intended ((parent_combo) & (progeny_condition)) form and will miscount mismatches. Add explicit parentheses around each rule before combining with |.
| (sire_genos_mat == 0 & dam_genos_mat == 0 & progeny_vec > 0) | | |
| (sire_genos_mat == 2 & dam_genos_mat == 2 & progeny_vec < 2) | | |
| ((sire_genos_mat == 0 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 0)) & (progeny_vec == 2) | | |
| ((sire_genos_mat == 2 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 2)) & (progeny_vec == 0) | | |
| ((sire_genos_mat == 0 & dam_genos_mat == 2) | (sire_genos_mat == 2 & dam_genos_mat == 0)) & (progeny_vec != 1), | |
| ((sire_genos_mat == 0 & dam_genos_mat == 0 & progeny_vec > 0)) | | |
| ((sire_genos_mat == 2 & dam_genos_mat == 2 & progeny_vec < 2)) | | |
| (((sire_genos_mat == 0 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 0)) & (progeny_vec == 2)) | | |
| (((sire_genos_mat == 2 & dam_genos_mat == 1) | (sire_genos_mat == 1 & dam_genos_mat == 2)) & (progeny_vec == 0)) | | |
| (((sire_genos_mat == 0 & dam_genos_mat == 2) | (sire_genos_mat == 2 & dam_genos_mat == 0)) & (progeny_vec != 1)), |
|
|
||
| if (!inherits(vcf.file, "vcfR")) { | ||
| vcf <- read.vcfR(vcf.file, verbose = FALSE) | ||
| vcf <- read.vcfR(vcf.file) |
There was a problem hiding this comment.
read.vcfR() is now called without verbose = FALSE, which can produce a lot of console output from filterVCF() even though the function has no verbosity control. Consider restoring verbose = FALSE (or adding a verbose argument to filterVCF()) to avoid noisy output in pipelines.
| vcf <- read.vcfR(vcf.file) | |
| vcf <- read.vcfR(vcf.file, verbose = FALSE) |
|
|
||
| mean_depth_marker <- rowMeans(dp_orig, na.rm = TRUE) | ||
| genotype_present <- !is.na(gt_orig) | ||
| genotyping_rate_marker <- rowMeans(genotype_present) | ||
|
|
There was a problem hiding this comment.
genotype_present <- !is.na(gt_orig) treats missing genotypes like "./." as present (they are strings, not NA), which inflates genotyping-rate calculations (and downstream sample/marker stats). Treat "./." (and possibly ".|.") as missing when computing genotype_present and heterozygosity.
| } | ||
|
|
||
| #Message that includes the output vcf stats | ||
| print(vcf) |
There was a problem hiding this comment.
print(vcf) is executed unconditionally at the end, which can be very expensive/noisy for large VCFs and cannot be disabled. Consider removing this print or guarding it behind a new verbose flag (or reuse quality.rates).
| print(vcf) | |
| if (isTRUE(quality.rates)) { | |
| print(vcf) | |
| } |
| test_that("check madc",{ | ||
|
|
||
| github_path <- "https://raw.githubusercontent.com/Breeding-Insight/BIGapp-PanelHub/refs/heads/long_seq/test_madcs/" | ||
| names <- c("Columns", "FixAlleleIDs", "IUPACcodes", "LowerCase", "Indels", "ChromPos", "allNAcol", "allNArow", "RefAltSeqs", "OtherAlleles") |
There was a problem hiding this comment.
These tests download files from GitHub but don’t call skip_if_offline(), so they will fail in offline/CRAN/air-gapped CI environments. Add skip_if_offline("raw.githubusercontent.com") (or equivalent) near the top of the test file.
| # CONCLUSION: We must allow heterozygous markers but avoid the | ||
| # operator-precedence bug. Looking at the source code condition: | ||
| # | ||
| # ((sire==0 & dam==1) | (sire==1 & dam==0)) & (prog==2) | ||
| # |
There was a problem hiding this comment.
The test design/comments explicitly rely on a known operator-precedence bug in the Mendelian mismatch logic. Tests should validate intended behavior, not depend on buggy behavior to break ties; otherwise fixing the code will require rewriting tests. After correcting precedence in find_parentage(), update toy genotypes/expectations accordingly.
| #' rows/columns, and sets `n.header.rows` automatically based on whether the | ||
| #' MADC file follows the raw DArT format (6 header rows) or the fixed allele ID | ||
| #' format (no header rows). |
There was a problem hiding this comment.
The docs claim n.header.rows is set automatically for raw vs fixed MADC formats, but the implementation stops when FixAlleleIDs is FALSE and always calls readDArTag(..., n.header.rows = 0L). Please update the documentation to reflect actual behavior (or implement the advertised auto-detection).
| #' rows/columns, and sets `n.header.rows` automatically based on whether the | |
| #' MADC file follows the raw DArT format (6 header rows) or the fixed allele ID | |
| #' format (no header rows). | |
| #' rows/columns, writes the corrected data to a temporary file, and then reads | |
| #' that file with `polyRAD::readDArTag`. It does not automatically detect raw | |
| #' DArT files with 6 header rows; inputs are handled as fixed/headerless MADC | |
| #' data after preprocessing. |
| #' @description | ||
| #' Performs nine quick validations on an allele report: | ||
| #' 1) **Columns** - required columns are present (`CloneID`, `AlleleID`, `AlleleSequence`); | ||
| #' 2) **FixAlleleIDs** - first column's first up-to-6 rows are not all blank or `"*"` | ||
| #' *and* both `_0001` and `_0002` appear in `AlleleID`; |
There was a problem hiding this comment.
Docs say “nine quick validations”, but the implementation includes an additional OtherAlleles check (and checks/messages are initialized with OtherAlleles). Update the description/list so the documented number and items match what the function returns.
| # BIGr 0.7.0 | ||
|
|
||
| ## New function `madc2vcf_multi` | ||
|
|
||
| - New function `madc2vcf_multi` to convert a DArTag MADC file to a VCF using the polyRAD pipeline for multiallelic genotyping |
There was a problem hiding this comment.
The PR description focuses on adding parentage functions, but this PR also introduces/changes substantial MADC→VCF functionality (and bumps the package to 0.7.0). Please update the PR description (or split the PR) so reviewers can accurately track scope and release impact.
Modified Sire and Dam calls to male_parent and female_parent to be species agnostic. Modified the test files to accomodate the changes in functions Modified importFrom statement for thinSNP to call specific functions and resolve warnings when installing BIGr Updated package documentation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## development #56 +/- ##
===============================================
- Coverage 83.05% 78.39% -4.67%
===============================================
Files 19 23 +4
Lines 1369 2536 +1167
===============================================
+ Hits 1137 1988 +851
- Misses 232 548 +316 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Added 2 parentage functions along with the test files and updated package giles to include updates