Tree Nine

The Tree Nine system is the basis of a tuberculosis cluster tracker built with funding from the California Department of Public Health (CDPH), but it can also be run as a standalone workflow that simply places samples on an existing phylogenetic tree using UShER. For CDPH's TB cluster tracker, this system runs on Terra, but like any WDL workflow it can run on basically anything that supports Docker (Singularity is untested). We have verified compatibility with both Cromwell and miniwdl.

If clustering is enabled, the SNP distance between samples is calculated from branch lengths, allowing samples to be placed into clusters. Clustering can be done on a subset of samples or across the entire tree. Every run of Tree Nine with clustering outputs files with cluster information; if you feed those files back into a later run, you can track changes to your clusters over time while keeping cluster IDs persistent. Every cluster can optionally generate a Microreact template JSON and communicate with the Microreact API to automatically create projects containing the cluster's subtree, distance matrix, links to parent/subclusters, and a sample-level metadata table.
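As a rough illustration of distance-based clustering from branch lengths, the sketch below sums branch lengths along the path between two samples on a toy tree and then groups samples that fall within a threshold of each other. This is not the logic in find_clusters.py: the tree, the `max_snps` threshold, the single-linkage grouping rule, and all function names are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length in SNPs).
TREE = {
    "n1": ("root", 1),
    "n2": ("root", 6),
    "A": ("n1", 1),
    "B": ("n1", 2),
    "C": ("n2", 0),
    "D": ("n2", 3),
}


def path_to_root(node):
    """Return {ancestor: distance from node}, including the node itself."""
    dists, total = {node: 0}, 0
    while node in TREE:
        parent, length = TREE[node]
        total += length
        dists[parent] = total
        node = parent
    return dists


def snp_distance(a, b):
    """Sum branch lengths on the path between a and b.

    With non-negative branch lengths, the shortest route through any shared
    ancestor is the route through the most recent common ancestor.
    """
    up_a, up_b = path_to_root(a), path_to_root(b)
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def cluster_samples(samples, max_snps):
    """Single-linkage grouping (an assumed rule): link any pair within max_snps."""
    parent = {s: s for s in samples}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for a, b in combinations(samples, 2):
        if snp_distance(a, b) <= max_snps:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    # A sample with no close neighbours is not treated as a cluster here.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


clusters = cluster_samples(["A", "B", "C", "D"], max_snps=5)
print(clusters)
```

On this toy tree, A and B are 3 SNPs apart and C and D are 3 SNPs apart, while every cross pair is at least 8 SNPs apart, so a threshold of 5 yields two clusters.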

Tree Nine takes in samples as MAPLE-formatted diff files, which you can generate with myco or convert from VCF.
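For readers who have not seen a diff file before, here is a minimal parsing sketch. It assumes the commonly described MAPLE layout: a `>sample` header line followed by tab-separated entries of allele, 1-based position, and an optional run length. The function name and the example positions are made up, and files produced by myco may differ in detail, so check a real file before relying on this.

```python
def parse_maple_diff(text):
    """Parse diff text into {sample: [(allele, position, length), ...]}.

    Assumed layout (not verified against Tree Nine's own parser):
      >sample_name
      allele<TAB>position[<TAB>length]
    """
    samples, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            samples[current] = []
            continue
        if current is None:
            raise ValueError("entry found before any '>sample' header")
        fields = line.split("\t")
        allele, position = fields[0], int(fields[1])
        # Entries without a length column are treated as single positions.
        length = int(fields[2]) if len(fields) > 2 else 1
        samples[current].append((allele, position, length))
    return samples


# Made-up example content, for illustration only.
EXAMPLE = ">sample_1\n-\t1\t50\nt\t1200\na\t3400\t1\n"
parsed = parse_maple_diff(EXAMPLE)
print(parsed)
```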

This repo also contains the following subworkflows:

Annotate
Convert to Nextstrain (for viewing in Auspice, non-clade sample annotations, etc.)
Extract
Mask tree
Mask subtree
Summarize

features

Highly scalable, even on lower-end hardware
- Preliminary development tests run directly on a seven-year-old MacBook
- Places >11,000 new samples on a base tree of >130,000 other samples in less than two hours (on GCP, not the laptop)
Automatic clustering (including recursive subclusters) based on genetic distance
- Clustering can be limited to only samples specified by the user, all newly added samples, or all samples on the entire tree
- Create per-cluster subtrees (pb/nwk)
- Cluster IDs can be made persistent to track changes over multiple runs, but starting from scratch is also supported
Includes a sample input tree created from SRA data if no input tree is specified
Trees automatically converted to UShER (.pb), Taxonium (.jsonl.gz), Newick (.nwk), and Nextstrain (.json) formats
Reroot the tree to a specified node
Mimic BioNumerics's rules by "backmasking" related samples against each other to hide ambiguous positions
- Designed for highly clonal samples which have a plausible direct epidemiological relationship
- Backmasking can only be performed on samples which have sample-level diff files
Summarize input, rerooted, and output trees with matUtils
Filter out positions by coverage at that position and/or entire samples by overall coverage
Specify your own reference genome if you don't want to work with H37Rv
Annotate clades via matUtils with a specified annotation TSV
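To make the idea of persistent cluster IDs more concrete, the sketch below carries IDs from a previous run over to the clusters found in a new run by matching on shared samples. It is only an illustration under stated assumptions: the matching rule (largest overlap wins, larger clusters choose first), the integer IDs, and the function name are all hypothetical and are not taken from process_clusters.py.

```python
def assign_persistent_ids(previous, current, next_id):
    """Carry cluster IDs forward between runs.

    previous: {cluster_id: set of sample names} from the earlier run
    current:  list of sets of sample names from the new run
    next_id:  first unused integer ID
    Returns ({cluster_id: set of sample names}, next unused ID).
    """
    assigned = {}
    # Larger clusters choose first, so after a split the bigger piece keeps
    # the old ID. The secondary sort key only makes the order deterministic.
    for members in sorted(current, key=lambda c: (-len(c), sorted(c))):
        best_id, best_overlap = None, 0
        for cid, old_members in previous.items():
            overlap = len(members & old_members)
            if cid not in assigned and overlap > best_overlap:
                best_id, best_overlap = cid, overlap
        if best_id is None:
            # No unclaimed earlier cluster shares a sample: mint a new ID.
            best_id = next_id
            next_id += 1
        assigned[best_id] = set(members)
    return assigned, next_id


previous_run = {1: {"A", "B"}, 2: {"C", "D"}}
new_run = [{"A", "B", "E"}, {"C", "D"}, {"F", "G"}]
ids, next_unused = assign_persistent_ids(previous_run, new_run, next_id=3)
print(ids, next_unused)
```

In this example, cluster 1 grows by one sample and keeps its ID, cluster 2 is unchanged, and the brand-new cluster of F and G receives ID 3.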

compatibility

Hardware values in the WDL's runtime sections are not minimum requirements; they are just the default values we use for cloud runs. If your machine uses ARM hardware (including Apple Silicon), Docker must use a compatibility layer and may have diminished performance, but it does work. Singularity is reported to work with myco with some adjustments, but it has not been tested with Tree Nine. If running this workflow with miniwdl, include the --copy-input-files flag.

benchmarking

Formal benchmarks have not been established, but if your computer can run Docker, it can probably handle small-scale runs of the entirety of Tree Nine (placing ~10 samples on a 70 sample base tree, including clustering) in about a minute.

Placing approximately 11,000 real-world samples on a ~130,000 sample base tree takes approximately one hour, plus one hour of optional matOptimize. Clustering time depends heavily on the number of samples you are considering for clustering and how many clusters are actually found. Due to a matUtils limitation, we currently must open and close the tree file multiple times per cluster (which only takes ~2 seconds on >130,000 sample trees, but that adds up quickly if you have >3500 clusters).
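A quick back-of-the-envelope calculation shows how that per-cluster cost accumulates. The helper below is hypothetical, and it assumes each open/close cycle costs about 2 seconds; the number of cycles per cluster is left as a parameter because the text above only says "multiple".

```python
def tree_io_overhead_hours(n_clusters, seconds_per_open=2.0, opens_per_cluster=1):
    """Estimate hours spent just reopening the tree file during clustering."""
    return n_clusters * seconds_per_open * opens_per_cluster / 3600


# 3500 clusters at one ~2 second open each: 7000 seconds in total.
one_open = tree_io_overhead_hours(3500)
# The same run if each cluster needed three opens (an assumed figure).
three_opens = tree_io_overhead_hours(3500, opens_per_cluster=3)
print(f"{one_open:.2f} h, {three_opens:.2f} h")
```

Even at a single open per cluster, 3500 clusters cost roughly two hours of file handling alone, which is comparable to the placement and optimization time quoted above.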
