Tutorial of ggComp

ggComp (Genomic-based Germplasm Compare) is a novel strategy fitting complex genomes to evaluate germplasm resources by identifying shared genomic regions and excluding pervasive copy number variations for pairwise accessions.

Installation

ggComp is a light weight software that only need to make sure Python 3.8.5, samtools 1.4 and bcftools 1.9 are well installed and set in environment path.

Clone the ggComp repository

git clone https://github.com/zack-young/ggComp.git

Usage


Program: ggComp (A pairwised comparison method to identify similar genetic regions and shared CNV regions between accessions.)
Version: 1.0

Usage:   ggComp [-v|--version] [-h|--help] <command> <argument>

Commands:

    CNV_detector            detect CNV regions

    SNP_extractor           extract SNP information from VCF

    DSR_counter             compute the DSR

    SGR_PHR_definer         identify SGR and PHR

    HMM_smoother            smoothing SGR and PHR result using HMM

CNV_detector

detect CNV regions

Usage: ggComp CNV_detector <--chr_lis <STRING>>  <--single_CNV <FILE> | --pair_CNV <FILE>>

    --chr_lis <STRING>      chromosomes that listed in behind file 
                            e.g. 'chr1A chr1B chr1D'

Options:

    --single_CNV            detect CNV region of samples ::: input file contains chromosme
                            list, path of BED file, path of BAM file and output file
                            e.g. chromosome BED BAM output_dev
                            produce three files: *.mask_CNV
                                                 *.mask_CNV_deletion
                                                 *.mask_CNV_duplication
 
    --pair_CNV              identify pairwise sample specific CNV regions and shared CNV
                            regions::: input file contains sample names, path of CNV file
                            and path of output file directory
                            e.g. sample1_name sample1_directory sample2_name sample2_directory pair_dircetory

chromosome length in bed file shall not exceed the limit of bedtools (400Mb)
see test/single_CNV.config and test/pair_CNV.config for more details
(colunms separated by tab)

SNP_extractor

extract SNP information from VCF for DSR calculation

Usage: ggComp SNP_extractor <--config <FILE>>

Options:"

    --config <FILE>         file contains path of vcf files, sample ID list in vcf for
                            extracting(separated by ','), path of output file
                            e.g. VCF_file    vcf_ID   output_file

see test/SNP_extractor.config for more details
(colunms separated by tab)

DSR_counter

compute the DSR (Different SNP ratio) by bin

Usage: ggComp DSR_counter <--config <file>>  [Options]

    --config <FILE>         file contains:path of files contain SNP inforamtion;
                            path of output files; end position of chromosomes
                            (use 'defalut' to call built-in end position of each chromosomes)
                            e.g. SNP_file output_file END\default

Options:

    --DP_low <INT>          exclude SNP with DP <= <int>. default 3

    --DP_high <INT>         exclude SNP with DP >= <int>. default 99

    --GQ_sample <INT>       exclude SNP with GQ <= <int>. default 8

    --bin_size <INT>        bln size. default 1000000

see test/DSR_counter.config for more details (colunms separated by tab)

SGR_PHR_definer

detect SGR (Similar Genetic Regions) and PHR (Polymorphism Hotspot Regions) between sample pairs

Usage: ggComp SGR_PHR_definer <--noCNV <file> | --plus_CNV <file>> [Options]

    --no_CNV <FILE>         ignore CNV
                            file contains path of files contain DSR information
                            and path of output files.
                            e.g. DSR_file  output

    --plus_CNV <FILE>       take CNV into consideration"  
                            file contains path of files contain DSR information
                            path of CNV file and path of output files.
                            e.g. DSR_file  CNV_file output

Options

    --LEVEL <INT>           threshold divide SGR and PHR. default 10

see test/SGR_PHR_definer.config for more details (colunms separated by tab)

HMM_smoother

smoothing SGR (Similar Genetic Regions) and PHR (Polymorphism Hotspot Regions) result using HMM

Usage: ggComp HMM_smoother [Options]

Options:

    --input <FOLDER>        folder path of SGR and PHR phasing results

    --output <FOLDER>       output path

    --folder_lis <FILE>     containing a list of folder names of SGR and PHR phasing
                            results that are goining to be smoothed
                            e.g. C2_C10
                                C2_C10

    --processes <INT>       (optional) maximum worker processes"
                            default: 2"

    --train                 (optional) train the model using input data (otherwise using"
                            the model published in the article)"

    --niter <INT>           (optional) maximum number of iterations to perform in training"
                            default: 60"

Visualization

Plot the distribution of SGR PHR and CNV across whole genome

"Usage: ggComp Visualization <--config <FILE>>"

Options:

   --config <FILE>       file contains path of vcf files, sample ID list in vcf for"
                         extracting(separated by ','), path of output file"
                         e.g. path    suffix   SAMPLE1    SAMPLE1_name    SAMPLE2 SAMPLE2_name    pdf_path"

see test/plot.config for more details (colunms separated by tab)"

Quickstart with an example

CNV_detector

This is the first step of ggComp that detect CNV regions through BAM files.
Each BAM file should contain only one chromosome and the chromosome name in first column of config file should consistent with that in BAM file.
BED file records bins that going to be processed.
Chromosomes that list in –chr_lis may not consistent with config file as chromosomes may be separated into parts in BAM file.
e.g. --chr_lis 'chr1A chr1B' but in config file is chr1A.1 chr1A.2 chr1B.2 chr1B.2.

single_CNV.config

chr1A   test/CNV_detector/chr1A_11-12Mb.bed test/CNV_detector/C2/chr1A_11-12Mb_C10.bam  test/CNV_detector/C10
chr1A   test/CNV_detector/chr1A_11-12Mb.bed test/CNV_detector/C2/chr1A_11-12Mb_C2.bam   test/CNV_detector/C2
sh src/ggComp.sh CNV_detector --chr_lis chr1A --single_CNV test/single_CNV.config

pair_CNV.config

C2  test/CNV_detector/C2    C10 test/CNV_detector/C10   test/CNV_detector/C2_C10
sh src/ggComp.sh CNV_detector --chr_lis chr1A --pair_CNV test/pair_CNV.config

SNP_extractor

extract SNP information from VCF, including ‘CHROM POS REF ALT GT DP GQ’ only biallele by MAF<0.01.

SNP_extractor.config

test/SNP_extractor/chr1A_11-12Mb.bcf    2-90377,10-83979    test/SNP_extractor/chr1A_11-12Mb.SNP_gt
sh src/ggComp.sh SNP_extractor --config test/SNP_extractor.config

DSR_counter

compute the DSR from SNP by bins (default bin length: 1000000bp)

DSR_counter.config

test/SNP_extractor/chr1A_11-12Mb.SNP_gt test/DSR_counter/chr1A_11-12Mb.DSR  12000000
sh src/ggComp.sh DSR_counter --config test/DSR_counter.config

SGR_PHR_definer

identify SGR and PHR from DSR file

SGR_PHR_plus_CNV.config

test/DSR_counter/chr1A_11-12Mb.DSR  test/CNV_detector/C2_C10/chr1A.C2toC10all_CNV   test/SGR_PHR_definer/chr1A_11-12Mb_combineCNV.level
sh src/ggComp.sh SGR_PHR_definer --plus_CNV test/SGR_PHR_plus_CNV.config

SGR_PHR_noCNV.config

test/DSR_counter/chr1A_11-12Mb.DSR  test/SGR_PHR_definer/chr1A_11-12Mb_noCNV.level
sh src/ggComp.sh SGR_PHR_definer --no_CNV test/SGR_PHR_noCNV.config

HMM_smoother

smoothing SGR and PHR result using HMM #### Smooth only

sh WheatComp.sh HMM_smoother \
    -i /data3/user3/wangwx/projs/HMM_for_yzz_comp/210328-modify/test/data \
    --folder_lis /data3/user3/wangwx/projs/HMM_for_yzz_comp/210328-modify/test/folders.txt \
    -o /data3/user3/wangwx/projs/HMM_for_yzz_comp/210328-modify/test/out\
    --processes 21

Train & Smooth

sh WheatComp.sh HMM_smoother \
    -i /data3/user3/wangwx/projs/HMM_for_yzz_comp/210328-modify/test/data \
    --folder_lis /data3/user3/wangwx/projs/HMM_for_yzz_comp/210328-modify/test/folders.txt \
    -o /data3/user3/wangwx/projs/HMM_for_yzz_comp/210328-modify/test/out \
    --processes 21 \
    --train \
    --niter 30

Visualization

SGR_PHR_noCNV.config

test/Visualization   .homo_undefined_snp_level.HMMv1 Zang1817    Zang1817    S14 ZXM1341 test/Visualization
sh src/ggComp.sh Visualization --config test/Visualization.config