# Introduction

Mitochondria play critical cellular roles such as in the respiratory electron transport chain, synthesis of adenosine triphosphate, and involvement in apoptosis. Mitochondrial DNA copy number (mtDNA CN) is a surrogate measure of mitochondrial function and serves as a clinically relevant biomarker. For example, lower mtDNA CN has been associated with an increased risk of mortality (Ashar et al. 2015), diabetes, and of chronic kidney disease (Tin et al. 2016). Therefore, estimating mtDNA copy number from SNP microarray data in large and richly phenotyped consortia may lead to the identification of new disease associations and genetic variation influencing mitochondrial function.

### Built on Genvisis

Genvisis is a robust open-source software package that takes advantage of unique compressed data structures to efficiently process, assess quality of, analyze, and visualize the intensity data from GWAS arrays. MitoPipeline relies on the Genvisis framework to obtain mtDNA CN estimates that are corrected for batch effects, total amount of nuclear DNA, and sample quality. Source code for the entire MitoPipeline and Genvisis is available on GitHub.

### Installation

1. Update your version of Java if below Java 8
• If you have Java installed, you can run java -version from a terminal to check what version is installed. You should see something like this:

  java version "1.8.0_31"
Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

3. Now, any command-line analyses can be run with java -jar /path/to/genvisis.jar [command]. If you would like to access Genvisis interactively (e.g., use all the visualization options), then simply double click on genvisis.jar to launch the graphical user interface (GUI).

### Documentation Notes

This website is the best place to check for documentation and usage, as it will be updated from time to time. If you see any portions that are unclear or inaccurate, send an email and let us know. If you really want a hard copy of the MitoPipeline manual (which is likely out of date and has some formatting issues), you can get that here.

# Quickstart

The good news is that you should be able to drop your raw datafiles into a directory (along with marker lists for Illumina arrays) and just run the provided example scripts to get reasonable mtDNA CN estimates. The flip side is that there are a lot of parameters that can be adjusted to optimize these estiamtes which we go into in depth below. Examples are available for the Affy SNP 6.0 array and most Illumina arrays (Sorry, but Illumina 370Duo and OmniExpress arrays do not contain any mitochondrial content!). Each setup contains .bat files that are populated commands to execute in the same directory (file paths are relative to the example directory). On Windows machines, .bat files can simply be double-clicked on to run. Feel free to edit the arguments as desired (such as fully qualified paths to file names in other directories or drives, changing number of threads, etc.) as you would a normal text file. These commands can easily be converted to bash scripts for Linux/OSX, or can be edited to submit to a compute cluster. If you are more comfortable running the command from a terminal, “cd” into the example directory, copy the java -Xmx ... line, and paste into a terminal to run.

### Affy quickstart

Download the Affymetrix package here and unzip, which should then have a directory structure similar to below. The entire package is almost 4 GB (5.4 GB uncompressed), as it contains a copy of Affymetrix Power Tools (1.8 GB) and the dbSNP database (3.2 GB).

affyGw6/
├── AffyPowerTools
├── cels
├── docs
├── example
├── projects
├── resources
├── processAffy.bat
├── generateAffyMitoCN.bat
├── Launch.bat
├── Launch.command
├── launch.properties
├── Launch.sh
└── genvisis.jar

The affyGw6/ directory contains two fully populated commands that will assist in generating mtDNA CN estimates in the same directory (after placing all desired .CEL files in the “cels/” directory, or altering the argument in the batch file). If you would like to test out the command you can download 36 example .CEL files from 1000 genomes (Consortium 2015) here.

Table 1: Populated commands for affy processing

Command Description
processAffy.bat runs affy step 1
generateAffyMitoCN.bat runs affy step 2

NOTE: you will need to adjust the aptExeDir= argument within the processAffy.bat file to match your operating system, the default is set to run on windows.

Operatating System aptExeDir= path
windows AffyPowerTools/bin/windows/
mac (OS X) AffyPowerTools/bin/osx/
linux AffyPowerTools/bin/linux/

See the output section for a description of the results.

### Illumina quickstart

Download the Illumina package here, which should have a directory structure similar to below. The entire package is almost 5 GB, as it contains some raw example data from 1000G (1.4 GB) and the dbSNP database (3.4 GB). This example analysis of the 1000G Omni2.5 data can be run using “exampleMitoCN.bat”. See the Illumina section for a more detailed description of the exampleMitoCN.bat command, and the output section for a description of the results.

Illumina/
├── docs
├── example
├── Omni2_5_1000g_Genvisis
├── projects
├── resources
├── Launch.bat
├── Launch.command
├── launch.properties
├── Launch.sh
├── exampleMitoCN.bat
└── genvisis.jar

# Basic methods

We generate mtDNA CN estimates via the following basic pipeline highlighted in Figure 1 and expanded below.

Figure 1: Diagram of mtDNA CN estimation pipeline

1. LRR is an established intensity measure commonly used in the identification of copy number variation (CNVs) from whole-genome SNP microarrays. We co-opt the LRR to obtain population normalized intensity values for each SNP on the array, computing LRR from samples with LRR SD < 0.30 for Illumina arrays, LRR SD < 0.35 for Affymetrix arrays, and call rate > 0.96 for all arrays. Our current pipeline automates the generation of LRRs for both Illumina (also provided by GenomeStudio) and Affymetrix arrays(similar to the PennCNV-affy pipeline). LRRs, however, often exhibit a correlation with GC content that is variable across samples. Thus we correct all LRRs for GC content on a sample by sample basis (Diskin et al. 2008).
2. Mitochondrial markers are manually reviewed in Genvisis for off-axis genotype clusters. To avoid potential cross-hybridization to the nuclear genome, we BLAST (Camacho et al. 2009) each probe sequence and remove markers that do not have a perfect (100%) match to the annotated location or have any off-target matches > 80% of the probe-sequence. The median LRR of all homozygous markers for each sample serves as a raw mtDNA copy number estimate. Mitochondrial marker lists for select Illumina arrays are available here and Affymetrix SNP 6.0 lists are available here.
3. Autosomal markers are selected for PCA inclusion using typical GWAS quality filters (call rate > 98%, HWE p-value > 0.00001, PLINK mishap p-value > 0.0001, association with sex p-value > 0.00001), as well as LD-pruning (r2 < 0.30), BLAST filtering (number of alignments greater than 80% must equal exactly one), maximum autosomal spacing, and removal of known problematic regions (markers in or neear T cell receptor genes and immunoglobulin regions). Autosomal marker lists for select Illumina arrays are available here and Affymetrix SNP 6.0 lists are available here.
4. In order to avoid over-representation of poor quality samples in the variation captured by the PCA, we again filter for LRR SD < 0.30 for Illumina arrays, LRR SD < 0.35 for Affymetrix arrays, and call rate > 0.96 for all arrays. PCs for excluded samples are later imputed from the marker loadings.
5. A PCA is then performed on LRRs following marker and sample exclusions. PCs generated from LRRs typically capture batch effects, quality metrics, and latent confounding variables. We find that PCs robustly capture these effects, but recommend including at least 40K markers as input. Using all qualifying markers can produce slightly better results, but may take significantly longer to compute.
6. The number of PCs that are initially computed is 5% of the filtered sample size, and the resulting PCs can then be subset using two methods:
• Simply retaining the top PCs representing 5% of the sample size (termed NATURAL)
• Using a data driven approach to select informative PCs via forward step-wise linear regression evaluated against the raw mtDNA copy number estimate. Only PCs that pass a bonferroni adjusted p-value (.05/number of input PCs) enter into the final set of analysis PCs (termed STEPWISE_RANK_R2).
7. Lastly, we optimize which and how many PCs (from those selected in step 6) to include in the final regression model by computing the correlation of mtDNA CN betas to betas generated from a GWAS meta-analysis of white blood cell (WBC) counts. As WBC count is known to be associated with mtDNA CN, we find that this method (explained more here) selects estimates that best capture real signal.

# Affymetrix SNP 6.0

Affymetrix processing requires two steps to obtain mtDNA CN estimates.

## Step 1: Processing .CEL files

First, raw .CEL files are genotyped and normalized with Affymetrix Power Tools (APT). We provide a helper command to automate this process.

java -jar genvisis.jar affy.AffyPipeline \
cels=cels/ \
sketch=resources/hapmap.quant-norm.normalization-target.txt \
aptExeDir=AffyPowerTools/bin/linux/ \
aptLibDir=AffyPowerTools/lib/ \
outDir=genvisisAffyProject1/ \
markerPositions=resources/markerPositionsHG18.txt \
build=HG18 \
-full \
threads=24

Below are the arguments of this helper command. If you downloaded the full Affymetrix package, the relative paths in the example above should work.

• The cels= argument points to a directory containing all .CEL files to be analyzed.
• The sketch= is a reference quantile distribution provided as part of the PennCNV-affy package (Wang et al. 2007).
• The aptExeDir= is a directory that contains the APT executables appropriate for your system. At a minimum, the directory must contain apt-probeset-summarize and apt-probeset-genotype.

• The aptLibDir= is a directory that contains some necessary APT library files. There are two different library sets, depending on whether the -full argument is supplied, and are detailed in Table 2 below.

Table 2: Required Affymetrix Library files

File Type Default Full
CDF GenomeWideSNP_6.cdf GenomeWideSNP_6.Full.cdf
Special SNPs GenomeWideSNP_6.specialSNPs GenomeWideSNP_6.Full.specialSNPs
Birdseed models GenomeWideSNP_6.birdseed-v2.models GenomeWideSNP_6.birdseed-v2.models
ChrX probes GenomeWideSNP_6.chrXprobes GenomeWideSNP_6.chrXprobes
ChrY probes GenomeWideSNP_6.chrYprobes GenomeWideSNP_6.chrYprobes
• The markerPositions= argument points to a file with tab-separated columns and an entry for each marker to be analyzed as shown in Table 3 below, and only markers present in this file will be analyzed.

Table 3: Affymetrix Marker Position file

MarkerName Chr Position
SNP_A-1780419 1 84647761
SNP_A-1780418 5 156323558
..

Our Affy SNP 6.0 marker positions files are included in the Affymetrix package, but can also be downloaded separately here: HG18 marker postions or HG18 marker postions or HG18 marker postions or HG19 marker positions. Alternatively, Affymetrix hosts the latest array annotations available through the NetAffx Analysis Center.

• The outDir= argument is where the results will be placed

• The build= argument refers to the genomic build reflected by the markerPositions= argument. Options are HG18 or HG19.

• The -full argument tells the pipeline to use the full library files as described above. The full library files contain 465 mitochondrial (MT) SNPs, while the default contains only 119.
• The threads= refers to the number of threads used in parallelized portions of the analysis. For optimal performance, this should not be set higher than the number of processors on your computer or compute node.

For more options: run java -jar genvisis.jar affy.AffyPipeline -h.

### APT commands

With the arguments above populated for affy.AffyPipeline, the command executes two different APT calls and parses the output into the Genvisis binary data format.

Genotyping

The first APT command genotypes the .CEL files and executes a command like below (file names will change if the -full argument is selected). For more info on this command, refer to the apt-probeset-genotype manual

apt-probeset-genotype
-c AffyPowerTools/lib/GenomeWideSNP_6.cdf
--table-output true
-a birdseed-v2
--set-gender-method cn-probe-chrXY-ratio
--special-snps AffyPowerTools/lib/GenomeWideSNP_6.specialSNPs
--chrX-probes AffyPowerTools/lib/GenomeWideSNP_6.chrXprobes
--chrY-probes AffyPowerTools/lib/GenomeWideSNP_6.chrYprobes
--probeset-ids genvisisAffyProject1/Genvisis_affy_pipeline.probesetIdsSNPS.txt
--set-analysis-name Genvisis_affy_pipeline
-out-dir genvisisAffyProject1/Genvisis_affy_pipeline_Genotypes/
--cel-files genvisisAffyProject1/Genvisis_affy_pipeline.celList.txt


The results of apt-probeset-genotype will be placed in genvisisAffyProject1/Genvisis_affy_pipeline_Genotypes/ (or relative to the outDir you supplied) and should have the files listed in Table 4 below.

Table 4: Files produced by APT genotyping

File Name Description
Genvisis_affy_pipeline.calls.txt Genotype matrix
Genvisis_affy_pipeline.confidences.txt Confidences of genotype calls
Genvisis_affy_pipeline.report.txt Sample information (w/ computed gender)

Normalizing

The second APT command normalizes the .CEL files (including intensity only CN_ probesets) to the target distribution (defined by the sketch= argument) and executes a command like below (once again with file names adjusted if the -full argument is selected). For more info on this command, refer to the apt-probeset-summarize manual.

apt-probeset-summarize
-cdf-file AffyPowerTools/lib/GenomeWideSNP_6.cdf
--analysis quant-norm.sketch=50000,pm-only,med-polish,expr.genotype=true
--target-sketch resources/hapmap.quant-norm.normalization-target.txt
--out-dir genvisisAffyProject1/Genvisis_affy_pipeline_Normalization/
--cel-files genvisisAffyProject1/Genvisis_affy_pipeline.celList.txt
--probeset-ids genvisisAffyProject1/Genvisis_affy_pipeline.probesetIdsAll.txt
--set-analysis-name Genvisis_affy_pipeline


The results of apt-probeset-summarize will be placed in genvisisAffyProject1/Genvisis_affy_pipeline_Normalization/ (or relative to the outDir you supplied) and should have the files listed in Table 5 below.

Table 5: Files produced by APT normalization.

File Name Description
Genvisis_affy_pipeline.summary.txt probeset intensities
Genvisis_affy_pipeline.report.txt APT sample QC

### Genvisis Affy Processing

Genvisis takes the results from APT and converts them to a compressed binary data format. A project “.properties” file (such as Genvisis_affy_pipeline.properties) will be generated in the outDir directory supplied earlier. The properties file stores custom parameters associated with your project and will be used in the next step of the analysis.

Some of the Genvisis specific directories present within outDir are listed in Table 6 below.

Table 6: Genvisis directories

Directory Description
transposed/ marker-dominant binary data
samples/ sample-dominant binary data
data/ contains auxilary sample and marker data

## Step 2: Generating Affy mtDNA CN estimates

The basic command to generate mtDNA CN estimates from Affy SNP 6.0 arrays looks like the following:

java -jar genvisis.jar cnv.manage.MitoPipeline \
proj=genvisisAffyProject1/Genvisis_affy_pipeline.properties \
dirSrc=NA \
dirProj=genvisisAffyProject1/ \
PCmarkers=resources/autosomal_PC_markers.oneHitWonders_20.txt \
mitochondrialMarkers=resources/gw6_MT_USE.oneHitWonders_20.txt \
markerPositions=resources/markerPositionsHG18.txt \
threads=24
• The proj= argument specifies the path to the properties file associated with the current analysis.
• The dirSrc= argument is not required, and only applies to the case when your Affymetrix data has previously been formatted into text files. In this version of the pipeline, simply set it to “NA”.
• The dirProj= points to the directory containing the project’s data. In this case, it is the same directory as the outDir from Step 1.
• The PCmarkers= refers to a text file ( with no header ) that contains probeset IDs (typically high-quality, LD-pruned, and autosomal) to use when performing an intensity PCA on Log R Ratios (LRRs) to detect batch effects. A recommended list of PCmarkers to use can be found in Table 7.
• The mitochondrialMarkers= refers to a text file ( with no header ) that contains probeset IDs to use when computing the raw mtDNA CN estimates. A curated list of mt markers can be found in Table 7.
• The markerPositions= and threads= are the same as in Step 1.

• Optional: Provide an output= argument to specify an analysis name. This name will serve as a prefix for all output files from an analysis (default is “GENVISIS_PCA”).
• Optional: Provide a numComponents= argument to specify the number of principal components to intitially compute, and should be around 5% of your sample size.
• Optional: Provide pcOptFile= argument to specify a list of samples to use when evalating mtDNA CN estimates using WBC betas (such a list of samples from a single race/ethnicity), described more here.

For a description of the output of this command, see the Output section. To see more options, run java -jar genvisis.jar cnv.manage.MitoPipeline -h

### Affymetrix marker lists

Table 7: Affymetrix marker lists

Array PCmarkers= mitochondrialMarkers=
Affy SNP 6.0 Affy PC markers Affy Mito markers

# Illumina

### Generating Illumina mtDNA CN estimates

Illumina processing requires only a single command, and is very similar to the Step 2 of the Affymetrix processing. For example:

java -jar genvisis.jar cnv.manage.MitoPipeline \
dirSrc=00src/ \
dirProj=Omni2_5_1000g_Genvisis/ \
PCmarkers=Omni2_5autosomal_PC_markers.oneHitWonders_40.txt \
mitochondrialMarkers=Omni2_5Mitochondrial.oneHitWonders_40.txt \
markerPositions=markerPositionsHG19.txt \
dirExt=.txt.gz \
build=HG19

• The dirSrc= is the full path for a source data directory (e.g., where Illumina FinalReport files are located). All FinalReport files in this directory will be imported to Genvisis. These files can be gzipped to save space, but remember to set dirExt=.csv.gz or dirExt=.txt.gz, etc. as appropriate. See the note about FinalReport files for more information.
• The dirProj= is where the results will be placed (and where a project “.properties file” will be generated).
• The PCmarkers= refers to a text file ( with no header ) that contains probeset IDs (typically high-quality, LD-pruned, and autosomal) to use when performing an intensity PCA on Log R Ratios (LRRs) to detect batch effects. See Table 9.
• The mitochondrialMarkers= refers to a text file ( with no header ) that contains probeset IDs to use when computing the raw mtDNA CN estimates. If a list in Table 9 does not represent your array, the easiest path is simply to include all MT markers; however, the best way would be to visualize each one (using Genvisis/ScatterPlot) and only include a marker if it does not show evidence of hybridization to another part of the genome (e.g., contains a null cluster or there are more than two main clusters).
• The markerPositions= argument points to a file with tab-separated columns and an entry for each marker to be analyzed as shown in Table 8 below. All markers present in final report file must be represented, or else you will receive an error message and progress will halt.

Table 8: Illumina marker positions

MarkerName Chr Position
rs4433435 1 171223951
rs12456529 18 48691665
..

Most marker positions for Illumina arrays can be found at the Illumina Downloads site, where you can look for a “Manifest” (CSV) file.

• The threads= refers to the number of threads used in parallelized portions of the analysis. For optimal performance, this should not be set higher than the number of processors on your computer or compute node.
• The build= argument refers to the genomic build reflected by the markerPositions= argument. Options are HG18 or HG19.
• Optional: Provide an output= argument to specify an analysis name. This name will serve as a prefix for all output files from an analysis (default is “GENVISIS_PCA”).
• Optional: Provide a numComponents= argument to specify the number of principal components to intitially compute, and should be around 5% of your sample size.
• Optional: Provide pcOptFile= argument to specify a list of samples to use when evalating mtDNA CN estimates using WBC betas (such a list of samples from a single race/ethnicity), described more here.

For a description of the output of this command, see the Output section. To see more options, run java -jar genvisis.jar cnv.manage.MitoPipeline -h

### Illumina marker lists

Here are some of the marker lists we have compiled. You will likely be able to use a list if your array is similar to any of these.

Table 9: Illumina marker lists

Array PCmarkers= mitochondrialMarkers=
Exome Chip Arrays Exome PC markers Exome Mito markers
Omni Arrays (1, 2.5, 5) Omni PC markers Omni Mito markers
370Duo or OmniExpress Does not include any MT markers!!
550, 610 and 660 550, 610 and 660 PC markers 550, 610 and 660 Mito markers

### Final Report Format

When exporting FinalReport files from GenomeStudio, the following six fields are required: SNP Name, Sample ID, X, Y, Allele1 - AB, Allele2 - AB. If your current files do not contain one of these, we recommend re-exporting and also including five additional fields: GC Score, Allele1 - Forward, Allele2 - Forward, Log R Ratio, and B Allele Freq. These files can be gzipped to save space.

# Output

This section details some of the output produced from both the single Illumina command and the cnv.manage.MitoPipeline command via Step 2 of the Affymetrix pipeline. All files will be present under projDir= directory from Illumina processing, or the outDir= directory form Affymetrix SNP 6.0 processing.

Table 10: Files generated by the pipeline; file names will begin with either the default PCA_GENVISIS or whatever you supplied using the output= command

File Suffix File Description
markers_to_QC.txt Autosomal markers that are not copy number only
markerQC.txt Marker QC metrics
lrr_sd.txt Sample QC using markers that passed QC thresholds
samples.USED_PC.txt Samples used to compute PCA
samples.QC_Summary.txt Summary of samples passing QC
GC_ADJUSTMENT/*.ser Serialized (binary) files to compute GC corrected LRRs
MitoMarkers.MarkersUsed.txt Boolean matrix for MT markers used
MitoMarkers.RawValues.txt Intensity values for MT markers
PCs.MarkerReport.txt Reports if marker was used in the PCA
PCs.SingularValues.txt Singular values from PCA
PCs.txt PCs from HQ samples used to compute PCA
PCs.extrapolated.txt PCs extrapolated to all samples
PCs.summary.report.txt mtDNA CN estimates from highest order PC and principal components
PCs.summary.report.finalReport.txt mtDNA CN estimates from highest order PC, principal components, and sampleQC
PCs.extrapolated_eval/typed/ Full mtDNA CN estimates (primary results), see here
PCs.extrapolated_betaOpt/ Experimental evaluation of estimates, see here

Note: the mtDNA CN estimates in PCs.summary.report.txt and PCs.summary.report.finalReport.txt are computed by regressing out all PCs (the number supplied by the numComponents= argument). These estimates are only used if the numComponents= is the actual number of PCs desired. Otherwise, see here.

### mtDNA CN Results

The primary mtDNA CN estimates across each PC can be found in the .gz files under the PCs.extrapolated_eval/typed/ directory. The two files that are typically analyzed are called:

Table 11: mtDNA estimates analyzed

File
WITH_QC_BUILDERS_NATURAL_WITHOUT_INDEPS_finalSummary.estimates.txt.gz
WITH_QC_BUILDERS_STEPWISE_RANK_R2_WITHOUT_INDEPS_finalSummary.estimates.txt.gz

A brief description of the naming scheme is as follows:

Table 12: File naming scheme

Key Description
WITH_QC_BUILDERS Regression model and PCA with HQ samples (as opposed to all samples)
NATURAL PCs are in order of autosomal variance explained
STEPWISE_RANK_R2 PCs are ordered according to step-wise selection evaluated against the raw mtDNA CN estimate
WITHOUT_INDEPS Covariates (like age and sex) are not regressed out

An example of these result files will have a header like the following:

Table 13: Example output

DNA PC0 PC1 PC2
HG00279 0.042 0.044 0.040

Where DNA is the sample ID, PC0 is the initial raw mtDNA CN estimate (simply the median LRR of homozygous markers for that sample), PC1 is the estimate after regressing out 1 PC, PC2 is the estimate after regressing out 2 PCs, and so on.

Note that the number of estimates (and thus the number of PCs used) reported in these files (usually around 100 ) are much larger than the optimal estimates (which we find to be around 10-15). Therefore, our current default recommendation is to use the estimates from PC15 (for both natural and stepwise) in your analysis unless there is compelling evidence that another estimate produces more robust results. You can find out more about selecting the most appropriate estimates here, and how the estimates are computed in the basic methods.

# Evaluating mtDNA CN estimates

Since mtDNA CN is computed after sequentially regressing out PCs (i.e., estimates are obtained from PC0…PC100) and are computed two different ways (Natural vs Step-wise ordering), it is valuable to evaluate which method and which PCs produce the best estimates. Each study may present with a different degree of batch effects - requiring more or less PCs than our default recommendation of 15. If you have mtDNA CN estiamtes from another source (e.g., quantitative PCR, WES or WGS), then correlating it with that may be helpful, otherwise we recommend examining known biological associations such as age, sex, and WBC. An experimental optimization using WBC beta correlation can also be examined.

## Known associations

mtDNA CN is known to be associated with several variables that are commonly available to most GWAS studies. In the plots below, our default number of PCs (15) is highlighted with a vertical bar.

### WBC count

Figure 2: Correlation of mtDNA CN array estimates with WBC count

Figure 2: Our experimental method of optimiziation utilizing betas from WBC count meta-analyses relies on the negative assocition between raw WBC counts and mtDNA CN depicted above.

### qPCR

Figure 3: Correlation of mtDNA CN array estimates with qPCR

Figure 3: Currently, qPCR is the de-facto method to quantify mtDNA CN. If a subset of samples have qPCR mtDNA CN estimates, they can be used to calibrate the array-based method.

### Sex

Figure 4: Correlation of mtDNA CN array estimates with sample sex

Figure 4: mtDNA CN is typically higher in females and lower in males. Assuming a sex coding of 1 for males and 2 for females, there should be a positive correlation with sex in your mtDNA CN estimates.

### Age

Figure 5: Correlation of mtDNA CN array estimates with age at blood draw

Figure 5: mtDNA CN declines after middle age and is likely to be negatively correlated with mtDNA CN estimates in older cohorts

## WBC beta correlation (experimental)

An experimental method to evaluate the accuracy of our estimates relies on the negative correlation between mtDNA CN and white blood cell (WBC) count . We use SNPs associated with WBC count as a proxy to evaluate the number of PCs to regress out and which method to select to most accurately measure mtDNA-CN (see Figure 6).

Figure 6: Diagram of WBC beta optimization

We are currently hosting three WBC beta files (from a CHARGE/Exome chip meta-analysis) to evaluate mtDNA CN estimates. These files should be automatically downloaded by Genvisis (or included with the quickstart downloads). If need be, they can be manually downloaded below and saved to resources/MitoCN/ relative to genvisis.jar

Table 14: Available WBC meta-analysis files

File Description
Whites_WBC_TOTAL_SingleSNPmatched.final.beta WBC betas from participants with European ancestry
Blacks_WBC_TOTAL_SingleSNPmatched.final.beta WBC betas from African American participants
WBC_TOTAL_SingleSNPmatched.final.beta WBC betas, European and African ancestry meta-analyzed

The directory ending with PCs.extrapolated_betaOpt/ will contain the results of the pipeline outlined in Figure 6.

Table 15: Notable results from WBC optimization

File Description
AB_LookupBeta.dat Describes the nucleotide interrogated by each probeset (A and B)
rsIdLookup.txt Mapping of each probeset to an rsID
*beta_summary.txt Results of the WBC/mtDNA beta correlation at p-value thresholds supplied

*beta_summary.txt description

The beta_summary.txt files summarize the correlations of betas for mtDNA CN and WBC count at different p-value thresholds and across methods (Natural and Stepwise PC selection). The description of some of the patterns found in the header are detailed below:

• PC The number of PCs used to correct the raw estimate (PC0 is the raw estimate and has no PC correction)
• (inv_)(correl/pval)_(pearson/spearman)_(Signed/UnSigned) The mtDNA/WBC correlation results:
• inv denotes whether the mtDNA CN estimates were inverse normalized prior to computing the beta correlations.
• We recommend using the inverse normalized estimates to evaluate correlations.
• correl/pval denotes whether the beta correlation or p-value of the beta correlation is reported.
• pearson/spearman denotes the method for computing the correlation
• Signed/UnSigned denotes whether the absolute value of betas (UnSigned) were used to compute the correlation or not (Signed)
• We recommend using the “Signed” metric if the meta analysis betas used are also +/- signed (as the meta-analysis files we provide are)
• numberOfMarkers The number of markers used to compute the beta correlation
• BetaFile The file from which betas where extracted
• Method The method used to select PCs (Natural or Stepwise), and any sample subset
• A sample subset can be provided via the pcOptFile= command from the cnv.manage.MitoPipeline command in Step 2 of the Affymetrix pipeline, or the single Illumina command
• pvalCutoff The pvalue cutoff applied to the WBC count betas
• We recommend at least p = 1e-04, contingent on numberOfMarkers being greater than 20. Ideally, hundreds of markers, but is dependent on identical SNPs being present on both your array and present in the meta-analysis.
• numSamples The number of samples used to compute the beta correlation
• markerCallRateThreshold The call rate threshold applied to the markers used

### Beta Plots

Here is an example R snippet to visulize the mtDNA CN estimate optimization from an example WBC_TOTAL_SingleSNPmatched.final_beta_summary.txt file - which uses WBC count betas generated from a non-stratified analysis

library(ggplot2)
require(reshape2)
theme_set(theme_bw(10))
#subset to pvalue of interest
betaEval.subP <- subset(betaEval,betaEval\$pvalCutoff==1e-04)
#focus on estimates from all samples
betaEval.subP <- subset(betaEval.subP,Method %in% c("AllPCASamps__STEPWISE_RANK_R2" ,"AllPCASamps__NATURAL"))

#plot results
allSampBetaPlot <- ggplot(betaEval.subP,aes(x=PC)) +
geom_point(aes(y=inv_correl_spearmanSigned, color=factor(Method))) +
scale_color_discrete("Method") +
xlab("number of PCs corrected for") +
ylab("correlation (spearman) of mtDNA betas with WBC betas (p < 1e-04)")  +
geom_vline(xintercept =15.0) +
labs(title = "Beta optimization using 50 markers and 12827 samples")+
ylim(-1,1)+
facet_wrap(~ Method)+
#Typical range of interest
xlim(0,100) 

Figure 7: Correlation of mtDNA CN and WBC count (meta-analysis of all ethnicities) betas

Figure 7 Utilizes the WBC count betas from the meta-analysis across ethnicities and shows the correlation to mtDNA CN betas (using all samples in the data set) at each estimate. The vertical line at PC15 is our default recomendation and appears to be appropriate for both methods.

Now, we can use similar code as above to look at the stratified results:

Figure 8: Correlation of mtDNA CN and WBC count (European ancestry) betas

Figure 8 shows the mtDNA beta correlation to WBC count betas generated from a meta-analysis using only particiapnts with European ancestry using 32 markers and comparing whites (n=9150) to non-whites (n=3677).

And the African American WBC count betas:

Figure 9: Correlation of mtDNA CN and WBC count (African Americans) betas

Figure 9 shows the mtDNA beta correlation to WBC count betas generated from a blacks only meta-analysis using 51 markers and comparing Whites (n=9150) to non-whites (n=3677).

## Automatically generating plots

We are currently writing methods to automatically generate these plots without having to use R (check back soon!)

# Practical Notes

1. We recommend giving the java virtual machine ~80% of your max RAM. If you have 8 GB of RAM then pass it 6; if you have 128 GB, then pass it 100. Simply start any command with java -Xmx100g -jar genvisis.jar to change how much RAM is allocated.

2. When running the PCA the number of samples times the number of markers needs to be less than 2^31 (2147483648, the maximum integer value). If 50,000 markers are used for the PCA, that leaves us with a max number of samples of 42,949. If your study has more than this, you may consider reducing the number of markers used. This can also be done to lower the memory requirements, especially if you see a java.lang.OutOfMemoryError.

# FAQ

## I seem to be missing some files

The pipeline will try to download missing files when available, but often Linux clusters will refuse connections. If you see something like the following:

Error - Could not find local file resources/MitoCN/Blacks_WBC_TOTAL_SingleSNPmatched.final.beta and could not download it from http://genvisis.umn.edu/rsrc/MitoCN/Blacks_WBC_TOTAL_SingleSNPmatched.final.beta please manually download and save to resources/MitoCN/Blacks_WBC_TOTAL_SingleSNPmatched.final.beta


you can try manually downloading the link and saving it to the correct location. If that fails, email us and we will verify the file is correctly hosted on our server.

## Does this method always work?

We have seen that the performance can be both array and study dependent, but generally extracts meaningful mtDNA CN signal (as evaluated by age, sex, and WBC count association). With the current methodology, Affymetrix SNP 6.0 performs the best (and may outperform qPCR methods), followed by larger Illumina Arrays (Omni 1, 2.5, 5) and then Illumina Exome Chips. Definitely let us know if you have any suggestions for improvement.

## Why does the quick start example fail to run the APT executables?

First, verify that the aptExeDir= argument is properly set for your operating system, see affy quick start above

Also, if you are using OS X or Linux, Genvisis will try to make the APT executable, but if that fails you may need to chmod +x the APT executables described in Step 1 of the Affy processing.

## I saw an error!

Dang, but these things happen. If you see anything like the following :

• java.lang.NullPointerException
• java.lang.NumberFormatException

Or any other errors, please let us know!

## How much disk space is required?

### Affymetrix

As APT processing produces a lot of temporary files, we recommend about one TB of hard-drive space for every 5000 samples. Once processing is complete, the temporary files can be deleted (such as those listed in Table 4 and Table 5. This will leave ~500GB per 5000 samples.

### Illumina

Since different arrays take up a very different amount of space, it is reasonable to estimate that disk space used for the analysis will be around 2.5 times the size of all gzipped FinalReport files.

# Visualizing mitochondrial markers

Genvisis contains a ScatterPlot module to visualize the intensity of SNP array markers overlaid with genotypes. ScatterPlot is especially useful if you are not satisfied with the lists of mitochondrial markers provided, or are using an array that is not represented. To visualize markers, see the short tutorial here.

# Notes

If interested, the revision history of this website and the publication can be found at https://github.com/PankratzLab/MitoDocs

# References

Allaire, JJ, Joe Cheng, Yihui Xie, Jonathan McPherson, Winston Chang, Jeff Allen, Hadley Wickham, Aron Atkins, and Rob Hyndman. 2016. Rmarkdown: Dynamic Documents for R. http://rmarkdown.rstudio.com.

Ashar, Foram N., Anna Moes, Ann Z. Moore, Megan L. Grove, Paulo H. M. Chaves, Josef Coresh, Anne B. Newman, et al. 2015. “Association of Mitochondrial Dna Levels with Frailty and All-Cause Mortality.” Journal of Molecular Medicine 93 (2): 177–86. doi:10.1007/s00109-014-1233-3.

Camacho, Christiam, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L. Madden. 2009. “BLAST+: Architecture and Applications.” BMC Bioinformatics 10 (1): 1–9. doi:10.1186/1471-2105-10-421.

Consortium, The 1000 Genomes Project. 2015. “A Global Reference for Human Genetic Variation.” Nature 526 (7571). Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.: 68–74. http://dx.doi.org/10.1038/nature15393.

Diskin, Sharon J., Mingyao Li, Cuiping Hou, Shuzhang Yang, Joseph Glessner, Hakon Hakonarson, Maja Bucan, John M. Maris, and Kai Wang. 2008. “Adjustment of Genomic Waves in Signal Intensities from Whole-Genome Snp Genotyping Platforms.” Nucleic Acids Research 36 (19): e126. doi:10.1093/nar/gkn556.

Letaw, Alathea. 2015. Captioner: Numbers Figures and Creates Simple Captions. https://github.com/adletaw/captioner.

R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

“RB.” 2016. RotBlauer LLC (Web Design). Accessed June 11. http://rotblauer.com/.

RStudio Team. 2015. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.

Sveidqvist, Knut, Mike Bostock, Chris Pettitt, Mike Daines, Andrei Kashcha, and Richard Iannone. 2016. DiagrammeR: Create Graph Diagrams and Flowcharts Using R. https://CRAN.R-project.org/package=DiagrammeR.

Tin, Adrienne, Morgan E. Grams, Foram N. Ashar, John A. Lane, Avi Z. Rosenberg, Megan L. Grove, Eric Boerwinkle, et al. 2016. “Association Between Mitochondrial Dna Copy Number in Peripheral Blood and Incident Ckd in the Atherosclerosis Risk in Communities Study.” Journal of the American Society of Nephrology. doi:10.1681/ASN.2015060661.

Wang, Kai, Mingyao Li, Dexter Hadley, Rui Liu, Joseph Glessner, Struan F.A. Grant, Hakon Hakonarson, and Maja Bucan. 2007. “PennCNV: An Integrated Hidden Markov Model Designed for High-Resolution Copy Number Variation Detection in Whole-Genome Snp Genotyping Data.” Genome Research 17 (11): 1665–74. doi:10.1101/gr.6861907.

Wickham, Hadley. 2007. “Reshaping Data with the reshape Package.” Journal of Statistical Software 21 (12): 1–20. http://www.jstatsoft.org/v21/i12/.

———. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.