Mitochondria play critical cellular roles such as in the respiratory electron transport chain, synthesis of adenosine triphosphate, and involvement in apoptosis. Mitochondrial DNA copy number (mtDNA CN) is a surrogate measure of mitochondrial function and serves as a clinically relevant biomarker. For example, lower mtDNA CN has been associated with an increased risk of mortality (Ashar et al. 2015), diabetes, and of chronic kidney disease (Tin et al. 2016). Therefore, estimating mtDNA copy number from SNP microarray data in large and richly phenotyped consortia may lead to the identification of new disease associations and genetic variation influencing mitochondrial function.
Genvisis is a robust open-source software package that takes advantage of unique compressed data structures to efficiently process, assess quality of, analyze, and visualize the intensity data from GWAS arrays. MitoPipeline relies on the Genvisis framework to obtain mtDNA CN estimates that are corrected for batch effects, total amount of nuclear DNA, and sample quality. Source code for the entire MitoPipeline and Genvisis is available on GitHub.
Please contact us if you have ideas on how to improve the analysis, or if you would like to contribute!
Downloading and installing Genvisis (which includes MitoPipeline) is pretty easy:
If you have Java installed, you can run java -version
from a terminal to check what version is installed. You should see something like this:
java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
Download the latest version of Genvisis here and save to your favorite location
Now, any command-line analyses can be run with java -jar /path/to/genvisis.jar [command]
. If you would like to access Genvisis interactively (e.g., use all the visualization options), then simply double click on genvisis.jar to launch the graphical user interface (GUI).
This website is the best place to check for documentation and usage, as it will be updated from time to time. If you see any portions that are unclear or inaccurate, send an email and let us know. If you really want a hard copy of the MitoPipeline manual (which is likely out of date and has some formatting issues), you can get that here.
The good news is that you should be able to drop your raw datafiles into a directory (along with marker lists for Illumina arrays) and just run the provided example scripts to get reasonable mtDNA CN estimates. The flip side is that there are a lot of parameters that can be adjusted to optimize these estiamtes which we go into in depth below. Examples are available for the Affy SNP 6.0 array and most Illumina arrays (Sorry, but Illumina 370Duo and OmniExpress arrays do not contain any mitochondrial content!). Each setup contains .bat files that are populated commands to execute in the same directory (file paths are relative to the example directory). On Windows machines, .bat files can simply be double-clicked on to run. Feel free to edit the arguments as desired (such as fully qualified paths to file names in other directories or drives, changing number of threads, etc.) as you would a normal text file. These commands can easily be converted to bash scripts for Linux/OSX, or can be edited to submit to a compute cluster. If you are more comfortable running the command from a terminal, “cd” into the example directory, copy the java -Xmx ...
line, and paste into a terminal to run.
Download the Affymetrix package here and unzip, which should then have a directory structure similar to below. The entire package is almost 4 GB (5.4 GB uncompressed), as it contains a copy of Affymetrix Power Tools (1.8 GB) and the dbSNP database (3.2 GB).
affyGw6/
├── AffyPowerTools
├── cels
├── docs
├── example
├── projects
├── resources
├── processAffy.bat
├── generateAffyMitoCN.bat
├── Launch.bat
├── Launch.command
├── launch.properties
├── Launch.sh
└── genvisis.jar
The affyGw6/ directory contains two fully populated commands that will assist in generating mtDNA CN estimates in the same directory (after placing all desired .CEL files in the “cels/” directory, or altering the argument in the batch file). If you would like to test out the command you can download 36 example .CEL files from 1000 genomes (Consortium 2015) here.
Table 1: Populated commands for affy processing
Command | Description |
---|---|
processAffy.bat | runs affy step 1 |
generateAffyMitoCN.bat | runs affy step 2 |
NOTE: you will need to adjust the aptExeDir=
argument within the processAffy.bat file to match your operating system, the default is set to run on windows.
Operatating System | aptExeDir= path |
---|---|
windows | AffyPowerTools/bin/windows/ |
mac (OS X) | AffyPowerTools/bin/osx/ |
linux | AffyPowerTools/bin/linux/ |
See the output section for a description of the results.
Download the Illumina package here, which should have a directory structure similar to below. The entire package is almost 5 GB, as it contains some raw example data from 1000G (1.4 GB) and the dbSNP database (3.4 GB). This example analysis of the 1000G Omni2.5 data can be run using “exampleMitoCN.bat”. See the Illumina section for a more detailed description of the exampleMitoCN.bat command, and the output section for a description of the results.
Illumina/
├── docs
├── example
├── Omni2_5_1000g_Genvisis
├── projects
├── resources
├── Launch.bat
├── Launch.command
├── launch.properties
├── Launch.sh
├── exampleMitoCN.bat
└── genvisis.jar
We generate mtDNA CN estimates via the following basic pipeline highlighted in Figure 1 and expanded below.
Figure 1: Diagram of mtDNA CN estimation pipeline
Affymetrix processing requires two steps to obtain mtDNA CN estimates.
First, raw .CEL files are genotyped and normalized with Affymetrix Power Tools (APT). We provide a helper command to automate this process.
java -jar genvisis.jar affy.AffyPipeline \
cels=cels/ \
sketch=resources/hapmap.quant-norm.normalization-target.txt \
aptExeDir=AffyPowerTools/bin/linux/ \
aptLibDir=AffyPowerTools/lib/ \
outDir=genvisisAffyProject1/ \
markerPositions=resources/markerPositionsHG18.txt \
build=HG18 \
-full \
threads=24
Below are the arguments of this helper command. If you downloaded the full Affymetrix package, the relative paths in the example above should work.
cels=
argument points to a directory containing all .CEL files to be analyzed.sketch=
is a reference quantile distribution provided as part of the PennCNV-affy package (Wang et al. 2007).The aptExeDir=
is a directory that contains the APT executables appropriate for your system. At a minimum, the directory must contain apt-probeset-summarize and apt-probeset-genotype.
The aptLibDir=
is a directory that contains some necessary APT library files. There are two different library sets, depending on whether the -full
argument is supplied, and are detailed in Table 2 below.
Table 2: Required Affymetrix Library files
File Type | Default | Full |
---|---|---|
CDF | GenomeWideSNP_6.cdf | GenomeWideSNP_6.Full.cdf |
Special SNPs | GenomeWideSNP_6.specialSNPs | GenomeWideSNP_6.Full.specialSNPs |
Birdseed models | GenomeWideSNP_6.birdseed-v2.models | GenomeWideSNP_6.birdseed-v2.models |
ChrX probes | GenomeWideSNP_6.chrXprobes | GenomeWideSNP_6.chrXprobes |
ChrY probes | GenomeWideSNP_6.chrYprobes | GenomeWideSNP_6.chrYprobes |
The markerPositions=
argument points to a file with tab-separated columns and an entry for each marker to be analyzed as shown in Table 3 below, and only markers present in this file will be analyzed.
Table 3: Affymetrix Marker Position file
MarkerName | Chr | Position |
---|---|---|
SNP_A-1780419 | 1 | 84647761 |
SNP_A-1780418 | 5 | 156323558 |
… | .. | … |
Our Affy SNP 6.0 marker positions files are included in the Affymetrix package, but can also be downloaded separately here: HG18 marker postions or HG18 marker postions or HG18 marker postions or HG19 marker positions. Alternatively, Affymetrix hosts the latest array annotations available through the NetAffx Analysis Center.
The outDir=
argument is where the results will be placed
The build=
argument refers to the genomic build reflected by the markerPositions=
argument. Options are HG18 or HG19.
-full
argument tells the pipeline to use the full library files as described above. The full library files contain 465 mitochondrial (MT) SNPs, while the default contains only 119.The threads=
refers to the number of threads used in parallelized portions of the analysis. For optimal performance, this should not be set higher than the number of processors on your computer or compute node.
For more options: run java -jar genvisis.jar affy.AffyPipeline -h
.
With the arguments above populated for affy.AffyPipeline, the command executes two different APT calls and parses the output into the Genvisis binary data format.
Genotyping
The first APT command genotypes the .CEL files and executes a command like below (file names will change if the -full
argument is selected). For more info on this command, refer to the apt-probeset-genotype manual
apt-probeset-genotype
-c AffyPowerTools/lib/GenomeWideSNP_6.cdf
--table-output true
-a birdseed-v2
--set-gender-method cn-probe-chrXY-ratio
--read-models-birdseed AffyPowerTools/lib/GenomeWideSNP_6.birdseed-v2.models
--special-snps AffyPowerTools/lib/GenomeWideSNP_6.specialSNPs
--chrX-probes AffyPowerTools/lib/GenomeWideSNP_6.chrXprobes
--chrY-probes AffyPowerTools/lib/GenomeWideSNP_6.chrYprobes
--probeset-ids genvisisAffyProject1/Genvisis_affy_pipeline.probesetIdsSNPS.txt
--set-analysis-name Genvisis_affy_pipeline
-out-dir genvisisAffyProject1/Genvisis_affy_pipeline_Genotypes/
--cel-files genvisisAffyProject1/Genvisis_affy_pipeline.celList.txt
The results of apt-probeset-genotype will be placed in genvisisAffyProject1/Genvisis_affy_pipeline_Genotypes/ (or relative to the outDir
you supplied) and should have the files listed in Table 4 below.
Table 4: Files produced by APT genotyping
File Name | Description |
---|---|
Genvisis_affy_pipeline.calls.txt | Genotype matrix |
Genvisis_affy_pipeline.confidences.txt | Confidences of genotype calls |
Genvisis_affy_pipeline.report.txt | Sample information (w/ computed gender) |
Normalizing
The second APT command normalizes the .CEL files (including intensity only CN_ probesets) to the target distribution (defined by the sketch=
argument) and executes a command like below (once again with file names adjusted if the -full
argument is selected). For more info on this command, refer to the apt-probeset-summarize manual.
apt-probeset-summarize
-cdf-file AffyPowerTools/lib/GenomeWideSNP_6.cdf
--analysis quant-norm.sketch=50000,pm-only,med-polish,expr.genotype=true
--target-sketch resources/hapmap.quant-norm.normalization-target.txt
--out-dir genvisisAffyProject1/Genvisis_affy_pipeline_Normalization/
--cel-files genvisisAffyProject1/Genvisis_affy_pipeline.celList.txt
--probeset-ids genvisisAffyProject1/Genvisis_affy_pipeline.probesetIdsAll.txt
--set-analysis-name Genvisis_affy_pipeline
The results of apt-probeset-summarize will be placed in genvisisAffyProject1/Genvisis_affy_pipeline_Normalization/ (or relative to the outDir
you supplied) and should have the files listed in Table 5 below.
Table 5: Files produced by APT normalization.
File Name | Description |
---|---|
Genvisis_affy_pipeline.summary.txt | probeset intensities |
Genvisis_affy_pipeline.report.txt | APT sample QC |
Genvisis takes the results from APT and converts them to a compressed binary data format. A project “.properties” file (such as Genvisis_affy_pipeline.properties) will be generated in the outDir
directory supplied earlier. The properties file stores custom parameters associated with your project and will be used in the next step of the analysis.
Some of the Genvisis specific directories present within outDir
are listed in Table 6 below.
Table 6: Genvisis directories
Directory | Description |
---|---|
transposed/ | marker-dominant binary data |
samples/ | sample-dominant binary data |
data/ | contains auxilary sample and marker data |
The basic command to generate mtDNA CN estimates from Affy SNP 6.0 arrays looks like the following:
java -jar genvisis.jar cnv.manage.MitoPipeline \
proj=genvisisAffyProject1/Genvisis_affy_pipeline.properties \
dirSrc=NA \
dirProj=genvisisAffyProject1/ \
PCmarkers=resources/autosomal_PC_markers.oneHitWonders_20.txt \
mitochondrialMarkers=resources/gw6_MT_USE.oneHitWonders_20.txt \
markerPositions=resources/markerPositionsHG18.txt \
threads=24
proj=
argument specifies the path to the properties file associated with the current analysis.dirSrc=
argument is not required, and only applies to the case when your Affymetrix data has previously been formatted into text files. In this version of the pipeline, simply set it to “NA”.dirProj=
points to the directory containing the project’s data. In this case, it is the same directory as the outDir
from Step 1.PCmarkers=
refers to a text file ( with no header ) that contains probeset IDs (typically high-quality, LD-pruned, and autosomal) to use when performing an intensity PCA on Log R Ratios (LRRs) to detect batch effects. A recommended list of PCmarkers to use can be found in Table 7.mitochondrialMarkers=
refers to a text file ( with no header ) that contains probeset IDs to use when computing the raw mtDNA CN estimates. A curated list of mt markers can be found in Table 7.The markerPositions=
and threads=
are the same as in Step 1.
output=
argument to specify an analysis name. This name will serve as a prefix for all output files from an analysis (default is “GENVISIS_PCA”).numComponents=
argument to specify the number of principal components to intitially compute, and should be around 5% of your sample size.Optional: Provide pcOptFile=
argument to specify a list of samples to use when evalating mtDNA CN estimates using WBC betas (such a list of samples from a single race/ethnicity), described more here.
For a description of the output of this command, see the Output section. To see more options, run java -jar genvisis.jar cnv.manage.MitoPipeline -h
Table 7: Affymetrix marker lists
Array | PCmarkers= |
mitochondrialMarkers= |
---|---|---|
Affy SNP 6.0 | Affy PC markers | Affy Mito markers |
Illumina processing requires only a single command, and is very similar to the Step 2 of the Affymetrix processing. For example:
java -jar genvisis.jar cnv.manage.MitoPipeline \
dirSrc=00src/ \
dirProj=Omni2_5_1000g_Genvisis/ \
PCmarkers=Omni2_5autosomal_PC_markers.oneHitWonders_40.txt \
mitochondrialMarkers=Omni2_5Mitochondrial.oneHitWonders_40.txt \
markerPositions=markerPositionsHG19.txt \
dirExt=.txt.gz \
threads=24 \
build=HG19
dirSrc=
is the full path for a source data directory (e.g., where Illumina FinalReport files are located). All FinalReport files in this directory will be imported to Genvisis. These files can be gzipped to save space, but remember to set dirExt=.csv.gz
or dirExt=.txt.gz
, etc. as appropriate. See the note about FinalReport files for more information.dirProj=
is where the results will be placed (and where a project “.properties file” will be generated).PCmarkers=
refers to a text file ( with no header ) that contains probeset IDs (typically high-quality, LD-pruned, and autosomal) to use when performing an intensity PCA on Log R Ratios (LRRs) to detect batch effects. See Table 9.mitochondrialMarkers=
refers to a text file ( with no header ) that contains probeset IDs to use when computing the raw mtDNA CN estimates. If a list in Table 9 does not represent your array, the easiest path is simply to include all MT markers; however, the best way would be to visualize each one (using Genvisis/ScatterPlot) and only include a marker if it does not show evidence of hybridization to another part of the genome (e.g., contains a null cluster or there are more than two main clusters).The markerPositions=
argument points to a file with tab-separated columns and an entry for each marker to be analyzed as shown in Table 8 below. All markers present in final report file must be represented, or else you will receive an error message and progress will halt.
Table 8: Illumina marker positions
MarkerName | Chr | Position |
---|---|---|
rs4433435 | 1 | 171223951 |
rs12456529 | 18 | 48691665 |
… | .. | … |
Most marker positions for Illumina arrays can be found at the Illumina Downloads site, where you can look for a “Manifest” (CSV) file.
threads=
refers to the number of threads used in parallelized portions of the analysis. For optimal performance, this should not be set higher than the number of processors on your computer or compute node.build=
argument refers to the genomic build reflected by the markerPositions=
argument. Options are HG18 or HG19.output=
argument to specify an analysis name. This name will serve as a prefix for all output files from an analysis (default is “GENVISIS_PCA”).numComponents=
argument to specify the number of principal components to intitially compute, and should be around 5% of your sample size.Optional: Provide pcOptFile=
argument to specify a list of samples to use when evalating mtDNA CN estimates using WBC betas (such a list of samples from a single race/ethnicity), described more here.
For a description of the output of this command, see the Output section. To see more options, run java -jar genvisis.jar cnv.manage.MitoPipeline -h
Here are some of the marker lists we have compiled. You will likely be able to use a list if your array is similar to any of these.
Table 9: Illumina marker lists
Array | PCmarkers= |
mitochondrialMarkers= |
---|---|---|
Exome Chip Arrays | Exome PC markers | Exome Mito markers |
Omni Arrays (1, 2.5, 5) | Omni PC markers | Omni Mito markers |
370Duo or OmniExpress | Does not include any MT markers!! | |
550, 610 and 660 | 550, 610 and 660 PC markers | 550, 610 and 660 Mito markers |
When exporting FinalReport files from GenomeStudio, the following six fields are required: SNP Name, Sample ID, X, Y, Allele1 - AB, Allele2 - AB. If your current files do not contain one of these, we recommend re-exporting and also including five additional fields: GC Score, Allele1 - Forward, Allele2 - Forward, Log R Ratio, and B Allele Freq. These files can be gzipped to save space.
This section details some of the output produced from both the single Illumina command and the cnv.manage.MitoPipeline command via Step 2 of the Affymetrix pipeline. All files will be present under projDir=
directory from Illumina processing, or the outDir=
directory form Affymetrix SNP 6.0 processing.
Table 10: Files generated by the pipeline; file names will begin with either the default PCA_GENVISIS or whatever you supplied using the output=
command
File Suffix | File Description |
---|---|
markers_to_QC.txt | Autosomal markers that are not copy number only |
markerQC.txt | Marker QC metrics |
lrr_sd.txt | Sample QC using markers that passed QC thresholds |
samples.USED_PC.txt | Samples used to compute PCA |
samples.QC_Summary.txt | Summary of samples passing QC |
GC_ADJUSTMENT/*.ser | Serialized (binary) files to compute GC corrected LRRs |
MitoMarkers.MarkersUsed.txt | Boolean matrix for MT markers used |
MitoMarkers.RawValues.txt | Intensity values for MT markers |
PCs.MarkerLoadings.txt | Marker loadings from PCA |
PCs.MarkerReport.txt | Reports if marker was used in the PCA |
PCs.SingularValues.txt | Singular values from PCA |
PCs.txt | PCs from HQ samples used to compute PCA |
PCs.extrapolated.txt | PCs extrapolated to all samples |
PCs.summary.report.txt | mtDNA CN estimates from highest order PC and principal components |
PCs.summary.report.finalReport.txt | mtDNA CN estimates from highest order PC, principal components, and sampleQC |
PCs.extrapolated_eval/typed/ | Full mtDNA CN estimates (primary results), see here |
PCs.extrapolated_betaOpt/ | Experimental evaluation of estimates, see here |
Note: the mtDNA CN estimates in PCs.summary.report.txt and PCs.summary.report.finalReport.txt are computed by regressing out all PCs (the number supplied by the numComponents=
argument). These estimates are only used if the numComponents=
is the actual number of PCs desired. Otherwise, see here.
The primary mtDNA CN estimates across each PC can be found in the .gz files under the PCs.extrapolated_eval/typed/ directory. The two files that are typically analyzed are called:
Table 11: mtDNA estimates analyzed
File |
---|
WITH_QC_BUILDERS_NATURAL_WITHOUT_INDEPS_finalSummary.estimates.txt.gz |
WITH_QC_BUILDERS_STEPWISE_RANK_R2_WITHOUT_INDEPS_finalSummary.estimates.txt.gz |
A brief description of the naming scheme is as follows:
Table 12: File naming scheme
Key | Description |
---|---|
WITH_QC_BUILDERS | Regression model and PCA with HQ samples (as opposed to all samples) |
NATURAL | PCs are in order of autosomal variance explained |
STEPWISE_RANK_R2 | PCs are ordered according to step-wise selection evaluated against the raw mtDNA CN estimate |
WITHOUT_INDEPS | Covariates (like age and sex) are not regressed out |
An example of these result files will have a header like the following:
Table 13: Example output
DNA | PC0 | PC1 | PC2 | … |
---|---|---|---|---|
HG00279 | 0.042 | 0.044 | 0.040 | … |
Where DNA is the sample ID, PC0 is the initial raw mtDNA CN estimate (simply the median LRR of homozygous markers for that sample), PC1 is the estimate after regressing out 1 PC, PC2 is the estimate after regressing out 2 PCs, and so on.
Note that the number of estimates (and thus the number of PCs used) reported in these files (usually around 100 ) are much larger than the optimal estimates (which we find to be around 10-15). Therefore, our current default recommendation is to use the estimates from PC15 (for both natural and stepwise) in your analysis unless there is compelling evidence that another estimate produces more robust results. You can find out more about selecting the most appropriate estimates here, and how the estimates are computed in the basic methods.
Since mtDNA CN is computed after sequentially regressing out PCs (i.e., estimates are obtained from PC0…PC100) and are computed two different ways (Natural vs Step-wise ordering), it is valuable to evaluate which method and which PCs produce the best estimates. Each study may present with a different degree of batch effects - requiring more or less PCs than our default recommendation of 15. If you have mtDNA CN estiamtes from another source (e.g., quantitative PCR, WES or WGS), then correlating it with that may be helpful, otherwise we recommend examining known biological associations such as age, sex, and WBC. An experimental optimization using WBC beta correlation can also be examined.
mtDNA CN is known to be associated with several variables that are commonly available to most GWAS studies. In the plots below, our default number of PCs (15) is highlighted with a vertical bar.
Figure 2: Correlation of mtDNA CN array estimates with WBC count
Figure 2: Our experimental method of optimiziation utilizing betas from WBC count meta-analyses relies on the negative assocition between raw WBC counts and mtDNA CN depicted above.
Figure 3: Correlation of mtDNA CN array estimates with qPCR
Figure 3: Currently, qPCR is the de-facto method to quantify mtDNA CN. If a subset of samples have qPCR mtDNA CN estimates, they can be used to calibrate the array-based method.
Figure 4: Correlation of mtDNA CN array estimates with sample sex
Figure 4: mtDNA CN is typically higher in females and lower in males. Assuming a sex coding of 1 for males and 2 for females, there should be a positive correlation with sex in your mtDNA CN estimates.
Figure 5: Correlation of mtDNA CN array estimates with age at blood draw
Figure 5: mtDNA CN declines after middle age and is likely to be negatively correlated with mtDNA CN estimates in older cohorts
An experimental method to evaluate the accuracy of our estimates relies on the negative correlation between mtDNA CN and white blood cell (WBC) count . We use SNPs associated with WBC count as a proxy to evaluate the number of PCs to regress out and which method to select to most accurately measure mtDNA-CN (see Figure 6).
Figure 6: Diagram of WBC beta optimization
We are currently hosting three WBC beta files (from a CHARGE/Exome chip meta-analysis) to evaluate mtDNA CN estimates. These files should be automatically downloaded by Genvisis (or included with the quickstart downloads). If need be, they can be manually downloaded below and saved to resources/MitoCN/ relative to genvisis.jar
Table 14: Available WBC meta-analysis files
File | Description |
---|---|
Whites_WBC_TOTAL_SingleSNPmatched.final.beta | WBC betas from participants with European ancestry |
Blacks_WBC_TOTAL_SingleSNPmatched.final.beta | WBC betas from African American participants |
WBC_TOTAL_SingleSNPmatched.final.beta | WBC betas, European and African ancestry meta-analyzed |
The directory ending with PCs.extrapolated_betaOpt/ will contain the results of the pipeline outlined in Figure 6.
Table 15: Notable results from WBC optimization
File | Description |
---|---|
AB_LookupBeta.dat | Describes the nucleotide interrogated by each probeset (A and B) |
rsIdLookup.txt | Mapping of each probeset to an rsID |
*beta_summary.txt | Results of the WBC/mtDNA beta correlation at p-value thresholds supplied |
*beta_summary.txt description
The beta_summary.txt files summarize the correlations of betas for mtDNA CN and WBC count at different p-value thresholds and across methods (Natural and Stepwise PC selection). The description of some of the patterns found in the header are detailed below:
Here is an example R snippet to visulize the mtDNA CN estimate optimization from an example WBC_TOTAL_SingleSNPmatched.final_beta_summary.txt file - which uses WBC count betas generated from a non-stratified analysis
library(ggplot2)
require(reshape2)
theme_set(theme_bw(10))
#Read in the results
betaEval=read.delim("./ARIC/PCA_GENVISIS_beta_opt/WBC_TOTAL_SingleSNPmatched.final_beta_summary.txt")
#subset to pvalue of interest
betaEval.subP <- subset(betaEval,betaEval$pvalCutoff==1e-04)
#focus on estimates from all samples
betaEval.subP <- subset(betaEval.subP,Method %in% c("AllPCASamps__STEPWISE_RANK_R2" ,"AllPCASamps__NATURAL"))
#plot results
allSampBetaPlot <- ggplot(betaEval.subP,aes(x=PC)) +
geom_point(aes(y=inv_correl_spearmanSigned, color=factor(Method))) +
scale_color_discrete("Method") +
xlab("number of PCs corrected for") +
ylab("correlation (spearman) of mtDNA betas with WBC betas (p < 1e-04)") +
geom_vline(xintercept =15.0) +
labs(title = "Beta optimization using 50 markers and 12827 samples")+
ylim(-1,1)+
facet_wrap(~ Method)+
#Typical range of interest
xlim(0,100)
Figure 7: Correlation of mtDNA CN and WBC count (meta-analysis of all ethnicities) betas
Figure 7 Utilizes the WBC count betas from the meta-analysis across ethnicities and shows the correlation to mtDNA CN betas (using all samples in the data set) at each estimate. The vertical line at PC15 is our default recomendation and appears to be appropriate for both methods.
Now, we can use similar code as above to look at the stratified results:
Figure 8: Correlation of mtDNA CN and WBC count (European ancestry) betas
Figure 8 shows the mtDNA beta correlation to WBC count betas generated from a meta-analysis using only particiapnts with European ancestry using 32 markers and comparing whites (n=9150) to non-whites (n=3677).
And the African American WBC count betas:
Figure 9: Correlation of mtDNA CN and WBC count (African Americans) betas
Figure 9 shows the mtDNA beta correlation to WBC count betas generated from a blacks only meta-analysis using 51 markers and comparing Whites (n=9150) to non-whites (n=3677).
We are currently writing methods to automatically generate these plots without having to use R (check back soon!)
We recommend giving the java virtual machine ~80% of your max RAM. If you have 8 GB of RAM then pass it 6; if you have 128 GB, then pass it 100. Simply start any command with java -Xmx100g -jar genvisis.jar
to change how much RAM is allocated.
When running the PCA the number of samples times the number of markers needs to be less than 2^31 (2147483648, the maximum integer value). If 50,000 markers are used for the PCA, that leaves us with a max number of samples of 42,949. If your study has more than this, you may consider reducing the number of markers used. This can also be done to lower the memory requirements, especially if you see a java.lang.OutOfMemoryError
.
The pipeline will try to download missing files when available, but often Linux clusters will refuse connections. If you see something like the following:
Error - Could not find local file resources/MitoCN/Blacks_WBC_TOTAL_SingleSNPmatched.final.beta and could not download it from http://genvisis.umn.edu/rsrc/MitoCN/Blacks_WBC_TOTAL_SingleSNPmatched.final.beta please manually download and save to resources/MitoCN/Blacks_WBC_TOTAL_SingleSNPmatched.final.beta
you can try manually downloading the link and saving it to the correct location. If that fails, email us and we will verify the file is correctly hosted on our server.
We have seen that the performance can be both array and study dependent, but generally extracts meaningful mtDNA CN signal (as evaluated by age, sex, and WBC count association). With the current methodology, Affymetrix SNP 6.0 performs the best (and may outperform qPCR methods), followed by larger Illumina Arrays (Omni 1, 2.5, 5) and then Illumina Exome Chips. Definitely let us know if you have any suggestions for improvement.
First, verify that the aptExeDir=
argument is properly set for your operating system, see affy quick start above
Also, if you are using OS X or Linux, Genvisis will try to make the APT executable, but if that fails you may need to chmod +x
the APT executables described in Step 1 of the Affy processing.
Dang, but these things happen. If you see anything like the following :
java.lang.NullPointerException
java.lang.NumberFormatException
Or any other errors, please let us know!
As APT processing produces a lot of temporary files, we recommend about one TB of hard-drive space for every 5000 samples. Once processing is complete, the temporary files can be deleted (such as those listed in Table 4 and Table 5. This will leave ~500GB per 5000 samples.
Since different arrays take up a very different amount of space, it is reasonable to estimate that disk space used for the analysis will be around 2.5 times the size of all gzipped FinalReport files.
Genvisis contains a ScatterPlot module to visualize the intensity of SNP array markers overlaid with genotypes. ScatterPlot is especially useful if you are not satisfied with the lists of mitochondrial markers provided, or are using an array that is not represented. To visualize markers, see the short tutorial here.
If interested, the revision history of this website and the publication can be found at https://github.com/PankratzLab/MitoDocs
Allaire, JJ, Joe Cheng, Yihui Xie, Jonathan McPherson, Winston Chang, Jeff Allen, Hadley Wickham, Aron Atkins, and Rob Hyndman. 2016. Rmarkdown: Dynamic Documents for R. http://rmarkdown.rstudio.com.
Ashar, Foram N., Anna Moes, Ann Z. Moore, Megan L. Grove, Paulo H. M. Chaves, Josef Coresh, Anne B. Newman, et al. 2015. “Association of Mitochondrial Dna Levels with Frailty and All-Cause Mortality.” Journal of Molecular Medicine 93 (2): 177–86. doi:10.1007/s00109-014-1233-3.
Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden. 2009. “BLAST+: Architecture and Applications.” BMC Bioinformatics 10 (1): 1–9. doi:10.1186/1471-2105-10-421.
Consortium, The 1000 Genomes Project. 2015. “A Global Reference for Human Genetic Variation.” Nature 526 (7571). Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.: 68–74. http://dx.doi.org/10.1038/nature15393.
Diskin, Sharon J., Mingyao Li, Cuiping Hou, Shuzhang Yang, Joseph Glessner, Hakon Hakonarson, Maja Bucan, John M. Maris, and Kai Wang. 2008. “Adjustment of Genomic Waves in Signal Intensities from Whole-Genome Snp Genotyping Platforms.” Nucleic Acids Research 36 (19): e126. doi:10.1093/nar/gkn556.
Letaw, Alathea. 2015. Captioner: Numbers Figures and Creates Simple Captions. https://github.com/adletaw/captioner.
R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.
“RB.” 2016. RotBlauer LLC (Web Design). Accessed June 11. http://rotblauer.com/.
RStudio Team. 2015. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.
Sveidqvist, Knut, Mike Bostock, Chris Pettitt, Mike Daines, Andrei Kashcha, and Richard Iannone. 2016. DiagrammeR: Create Graph Diagrams and Flowcharts Using R. https://CRAN.R-project.org/package=DiagrammeR.
Tin, Adrienne, Morgan E. Grams, Foram N. Ashar, John A. Lane, Avi Z. Rosenberg, Megan L. Grove, Eric Boerwinkle, et al. 2016. “Association Between Mitochondrial Dna Copy Number in Peripheral Blood and Incident Ckd in the Atherosclerosis Risk in Communities Study.” Journal of the American Society of Nephrology. doi:10.1681/ASN.2015060661.
Wang, Kai, Mingyao Li, Dexter Hadley, Rui Liu, Joseph Glessner, Struan F.A. Grant, Hakon Hakonarson, and Maja Bucan. 2007. “PennCNV: An Integrated Hidden Markov Model Designed for High-Resolution Copy Number Variation Detection in Whole-Genome Snp Genotyping Data.” Genome Research 17 (11): 1665–74. doi:10.1101/gr.6861907.
Wickham, Hadley. 2007. “Reshaping Data with the reshape Package.” Journal of Statistical Software 21 (12): 1–20. http://www.jstatsoft.org/v21/i12/.
———. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.