CGtag: Complete Genomics Analysis Toolkit and Annotation in Galaxy

Here we describe several of the tools contained in the CGtag toolkit by way of example. Example data is provided in a shared data library, or can be imported to your history directly from this page.

Example Data

Complete Genomics provides free public access to a variety of whole human genome data sets generated from Complete Genomics’ sequencing service (link). Here we will be using the breast cancer cell line HCC1187.  

Data is also available from the CGtag data library, which can be accessed by going to "Shared Data" at the top of your screen.

Alternatively, here is an example history containing all the shared files, which can be copied to your own history:

Small variants (SNPs and small insertions, deletions and substitutions) are reported in the CG varfile format. SV breakpoints (junctions) are reported in the CG junctions file.


Complete Genomics provides a suite of command-line tools, cgatools, for downstream analysis of their data.

Let us perform a tumour-normal (T/N) comparison of the small variants and the SVs.

For SVs we use the cgatools "JunctionDiff" tool. This tool outputs those junctions that are present in the first input file but not in the second.
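The idea behind this comparison can be sketched in a few lines of Python. This is a simplified illustration and not the cgatools implementation: the `Junction` record, the `junction_diff` function and the 200 bp matching distance are all assumptions made for the example, and real junctions files carry many more fields (strands, scores and so on).

```python
from typing import List, NamedTuple


class Junction(NamedTuple):
    """Hypothetical, minimal junction record: one position per breakpoint side."""
    left_chr: str
    left_pos: int
    right_chr: str
    right_pos: int


def same_junction(a: Junction, b: Junction, max_distance: int) -> bool:
    """Treat two junctions as the same event if both sides lie on the same
    chromosomes and each breakpoint is within max_distance of its counterpart."""
    return (
        a.left_chr == b.left_chr
        and a.right_chr == b.right_chr
        and abs(a.left_pos - b.left_pos) <= max_distance
        and abs(a.right_pos - b.right_pos) <= max_distance
    )


def junction_diff(first: List[Junction], second: List[Junction],
                  max_distance: int = 200) -> List[Junction]:
    """Return the junctions of `first` that have no match in `second`."""
    return [
        j for j in first
        if not any(same_junction(j, k, max_distance) for k in second)
    ]


# Made-up coordinates, for illustration only.
tumour = [
    Junction("chr5", 1_000_000, "chr5", 2_500_000),
    Junction("chr1", 150_000, "chr8", 900_000),
]
normal = [
    Junction("chr1", 150_050, "chr8", 899_980),  # close to the second tumour junction
]

somatic = junction_diff(tumour, normal)
print(somatic)
```

With the tumour junctions as the first input and the normal junctions as the second, only the junction without a nearby counterpart in the normal sample is kept.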

For small variants we will use the ListVariants and TestVariants tools to perform the comparison. ListVariants takes a varfile as input and outputs a list of all the variants found in the sample. This list can then be compared to an arbitrary number of varfiles using the TestVariants tool. The output consists of all variants in the variant list, with one additional column per varfile indicating whether that variant was present in that sample or not. In our case we are interested in the variants present in the tumour, and wish to compare these to the variants found in the normal sample. We will therefore use the tumour varfile as input to ListVariants, and then compare the resulting list of variants to the varfile of the normal sample with the TestVariants tool.
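As a rough illustration of how such a table can be used afterwards, the Python sketch below keeps only the variants that are absent from the normal sample. The table layout is an assumption made for this example: a tab-separated file with a header, one presence column named `normal` (a hypothetical name), and a two-character code per variant in which each allele is marked `0` (variant absent), `1` (variant present) or `N` (no-call). Check the header of your own output before relying on these names.

```python
import csv
import io

# Hypothetical excerpt of a TestVariants-style table (tab-separated).
example = (
    "variantId\tchromosome\tbegin\tend\tvarType\treference\talleleSeq\tnormal\n"
    "1\tchr1\t1000\t1001\tsnp\tA\tG\t00\n"
    "2\tchr1\t2000\t2001\tsnp\tC\tT\t01\n"
    "3\tchr2\t3000\t3001\tsnp\tG\tA\t0N\n"
    "4\tchr2\t4000\t4001\tsnp\tT\tC\t11\n"
)


def tumour_only(table_text, normal_column="normal"):
    """Keep rows where both alleles of the normal sample are called and
    neither carries the variant (code "00" under the assumed encoding)."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [row for row in reader if row[normal_column] == "00"]


somatic_ids = [row["variantId"] for row in tumour_only(example)]
print(somatic_ids)
```

Requiring exactly `00` is a deliberately conservative choice in this sketch: a variant with a no-call in the normal sample (`0N`) is dropped, because its absence from the normal cannot be confirmed.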

Notice that if we had selected the wrong build (as we did in dataset 3), the tool would have let us know and the job would have failed.

We could also have performed some preprocessing on the varfile before running the ListVariants tool. This can be done with the cgatools VarFilter tool.


There are several annotation tools available. We start by running ANNOVAR on the result of the T/N comparison from the previous section. Because our input is neither a VCF file nor a varfile, we select the option "other" and specify which columns contain the information ANNOVAR needs. We then select the annotations we would like to perform and click Execute. Note that for a file of this size (some 3.5 million variants), execution may take quite some time, so any prefiltering that can be done beforehand will save a lot of time at this step.

We can now run Condel to obtain the consensus deleteriousness score from the SIFT and PolyPhen-2 annotations we got from ANNOVAR. MutationAssessor may also be run to get more information about SNPs in exonic regions. Because this tool connects to an online service, queries are limited to 3 per second, so filtering your list of variants before this step is highly recommended.
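To give a feel for what a limit of 3 queries per second means in practice, here is a minimal sliding-window throttle in Python. It is a generic sketch, not the code used by the Galaxy tool: the `RateLimiter` class, the `annotate_all` helper and the `fake_query` stand-in are all hypothetical names, and no real web service is contacted.

```python
import time


class RateLimiter:
    """Allow at most max_calls calls within any window of `period` seconds."""

    def __init__(self, max_calls=3, period=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the limiter can be tested
        self.sleep = sleep
        self.stamps = []      # start times of the calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have left the window.
            self.stamps = [t for t in self.stamps if now - t < self.period]
            if len(self.stamps) < self.max_calls:
                break
            # Sleep until the oldest call in the window expires.
            self.sleep(self.period - (now - self.stamps[0]))
        self.stamps.append(now)


def annotate_all(variants, query, limiter):
    """Run `query` on every variant, never exceeding the limiter's rate."""
    results = []
    for variant in variants:
        limiter.wait()
        results.append(query(variant))
    return results


def fake_query(variant):
    """Stand-in for a call to an online annotation service."""
    return variant + ":annotated"


annotated = annotate_all(["chr1:1001A>G", "chr2:3001G>A"], fake_query,
                         RateLimiter(max_calls=3, period=1.0))
print(annotated)
```

At this rate, one million variants would need more than 90 hours of queries, which is why reducing the list first matters so much.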


The integrated Circos plot tool can visualize SVs and small variants from Complete Genomics files, and B-allele frequency and coverage data from Illumina SNP array data processed by Nexus. The output can be any combination of a whole-genome plot, a zip file containing all individual chromosome plots, a single image containing all single-chromosome plots, and a custom-defined region. Below is an example of the output for the VCaP sample; all data needed for these plots can be found in the CGtag shared data library. On the left is the whole-genome plot, in the middle the montage of the chromosome plots, and on the right the individual plot for chromosome 5. These images clearly show the chromothripsis present on chromosome 5.

As a first step, we made a list of all the variants found in the varfile by running the cgatools ListVariants tool on it. Next, we ran the integrated Circos plot tool. The history is shown below:

If you do not have all the required files, any subset can be given as input, and only the tracks corresponding to the supplied data will be generated. We chose to make a whole-genome plot, separate per-chromosome plots (delivered in a .tar.gz file), a montage of these per-chromosome plots, and a plot of our own custom-defined region.

Example Full Workflow

The following is an example of a full workflow for T/N pairs. It consists of these steps:
1) Normal correction of junctions (breakpoints) using JunctionDiff
2) Normal correction of small variants using ListVariants/TestVariants followed by the Filter tool
3) Annotation of somatic small variants using ANNOVAR
4) Filtering of annotated somatic variants for presence in the 1000 Genomes Project and the Exome Variant Server (ESP)
5) Circos visualisation of the somatic junctions, filtered somatic small variants, and B-allele frequency and coverage obtained from SNP array data (whole-genome plot)
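
The database filter in step 4 can be pictured with the short Python sketch below. It assumes that the aim is to discard variants already reported in either database, and that each annotated variant carries one field per database that is empty when the variant is not listed. The column names `1000g_freq` and `esp_freq`, the placeholder values and the example rows are all hypothetical; the actual column names depend on the annotations selected in ANNOVAR.

```python
# Values treated as "no entry in this database" (an assumption of this sketch).
EMPTY_VALUES = ("", ".", "NA")


def is_unreported(row, db_columns=("1000g_freq", "esp_freq")):
    """True if the variant has no entry in any of the given database columns."""
    return all(row.get(col, "") in EMPTY_VALUES for col in db_columns)


# Made-up annotated variants, for illustration only.
annotated_rows = [
    {"id": "v1", "gene": "GENE_A", "1000g_freq": "", "esp_freq": ""},
    {"id": "v2", "gene": "GENE_B", "1000g_freq": "0.12", "esp_freq": ""},
    {"id": "v3", "gene": "GENE_C", "1000g_freq": "", "esp_freq": "0.003"},
]

candidates = [row["id"] for row in annotated_rows if is_unreported(row)]
print(candidates)
```

Only the variant without an entry in either database remains as a candidate somatic variant; the other two are removed as likely known polymorphisms.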

example workflow

To use this workflow yourself, navigate to Shared Data on the top menu bar, and then to the Published workflows, and select the CGtag workflow.

An example run of this workflow, using the data in the data library, is shown below:

(Of course, one would normally use data that all come from the same sample.)