apex

Toolkit for QTL mapping and meta-analysis.

APEX: cis-xQTL analysis guide

This page describes benchmarking experiments with APEX for several QTL analysis tasks. It is intended as a guide to time and memory requirements, a description of numerical differences between software tools, and a summary of features unique to APEX.

LMM Benchmarking

We benchmarked time and memory usage for trans-xQTL analysis under a linear mixed model (LMM) with APEX LMM, FastGWA^[1], BOLT-LMM^[2], and GMMAT^[3-4], using empirical genotype data for 10,000 samples and simulated molecular phenotype data.

Input data

Sample size	No. traits	No. SNPs	No. covariates
10,000	16,329	590,606	10

Time and memory usage

	CPU hours	Relative CPU time (APEX = 1.00)	Max memory
APEX, p < 5e-5	7.5	0.36	4.88 GB
APEX	20.8	1.00	4.88 GB
FastGWA	52.1	2.50	0.14 GB
BOLT-LMM	1,068.9	51.39	0.67 GB
GMMAT	~5,692.5	~273.68	N/A
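
The third column of the table above is consistent with each tool's CPU hours divided by the CPU hours of the full APEX run. The following minimal sketch reproduces that arithmetic from the reported CPU hours; the variable names are illustrative and not part of APEX.

```python
# CPU hours copied from the table above (the GMMAT value is approximate).
cpu_hours = {
    "APEX, p < 5e-5": 7.5,
    "APEX": 20.8,
    "FastGWA": 52.1,
    "BOLT-LMM": 1068.9,
    "GMMAT": 5692.5,
}

# Use the full APEX run as the reference (ratio 1.00).
baseline = cpu_hours["APEX"]
relative_time = {tool: hours / baseline for tool, hours in cpu_hours.items()}

for tool, ratio in relative_time.items():
    print(f"{tool}: {ratio:.2f}")
```

Values above 1 indicate that a tool needed more CPU time than the full APEX run; values below 1 indicate less.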

Software commands

Commands used in LMM benchmarking analysis are given below:

APEX command:

parallel over chromosomes, all traits at once
sparse GRM
compressed BCF format genotype data

apex trans --bed $all_traits_bed --bcf $bcf --cov $covar_txt --grm $grm --region ${chr} --out trans_chr${chr}
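
As a rough illustration of the "parallel over chromosomes" setup, the sketch below builds one `apex trans` command per chromosome using only the flags shown in the command above. The file names and the choice of chromosomes 1 to 22 are assumptions made for illustration, and the helper `apex_trans_command` is hypothetical; the commands are printed, not executed.

```python
import shlex

def apex_trans_command(chrom, bed, bcf, cov, grm):
    """Return the argument list for one per-chromosome APEX trans run."""
    return [
        "apex", "trans",
        "--bed", bed,
        "--bcf", bcf,
        "--cov", cov,
        "--grm", grm,
        "--region", str(chrom),
        "--out", f"trans_chr{chrom}",
    ]

# Hypothetical input paths; replace with the real files.
inputs = dict(
    bed="all_traits.bed.gz",
    bcf="genotypes.bcf",
    cov="covariates.txt",
    grm="sparse_grm.txt",
)

# Assumption: autosomes 1-22, one independent job per chromosome.
commands = [apex_trans_command(chrom, **inputs) for chrom in range(1, 23)]

for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each printed line can be submitted as a separate job to a scheduler, or each argument list can be passed to `subprocess.run` from a worker pool.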

BOLT-LMM command:

parallel over traits, all chroms at once
172,045 LD-pruned SNPs (no sparse GRM)
uncompressed BED/BIM/FAM format genotype data

bolt --lmm --LDscoresFile=$ldsc_f --bfile=$bfile --phenoFile=trait_${trait}.ped --phenoCol=${trait} --qCovarCol=PC{1:10} --covarFile=$covar_ped --modelSnps=${snp_file} --statsFile=bolt_${trait}

fastGWA command:

parallel over traits, all chroms at once
sparse GRM
uncompressed BED/BIM/FAM format genotype data

gcta64 --fastGWA-mlm --bfile $bfile --grm-sparse $sgrm --pheno trait_${trait}.ped --qcovar $covar_ped --threads 1 --out gcta_${trait}

GMMAT command:

parallel over traits, parallel over chroms
sparse GRM
compressed GDS format genotype data

eqn <- as.formula( Y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10)
null_fit <- GMMAT::glmmkin(
 eqn, data = trait_data, id = "id", 
 kins = GRM, family = gaussian(), 
 method = "REML", method.optim = "AI", verbose = TRUE
)
GMMAT::glmm.score(obj = null_fit, infile = gds_file, outfile = trait_out_file, verbose = TRUE)

Software concordance:

[Figure: LMM concordance]

LMM association tests from APEX and GMMAT are numerically nearly equivalent, as expected. BOLT-LMM uses the conjugate gradient method to avoid storing an explicit GRM, together with a retrospective quasi-likelihood score test, and FastGWA uses the GRAMMAR-Gamma approximation to calculate association tests; these methodological differences may explain why their results deviate from those of APEX and GMMAT. Results are shown for one gene on chromosome 1 and reflect the overall trends.

cis-eQTL Benchmarking

We compared cis-eQTL analysis using APEX, FastQTL^[5], and QTLtools^[6] in the Geuvadis LCL eQTL data set^[7-8] (454 samples, 17,815 genes, and 70 covariates including PEER factors^[12]). Single-variant association slopes and p-values are equivalent between FastQTL and APEX (not shown). QTLtools fits simple linear regressions of trait residuals on genotypes (lm(y_resid ~ geno)) rather than using multiple-regression t-tests (lm(y ~ geno + covariates)), and its p-values are therefore slightly conservative.
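
The difference between the two regression strategies can be illustrated on simulated data. The sketch below is not code from QTLtools, FastQTL, or APEX: it uses a single covariate that is deliberately correlated with genotype (which exaggerates the effect), and the helpers `residualize`, `corr`, and `t_from_r` are illustrative. It relies on the fact that the genotype t-statistic from lm(y ~ geno + covariate) can be obtained from the partial correlation between trait and genotype after both are residualized on the covariate.

```python
import math
import random

def residualize(v, c):
    """Residuals of v after simple linear regression on c (with intercept)."""
    n = len(v)
    mv, mc = sum(v) / n, sum(c) / n
    slope = (sum((ci - mc) * (vi - mv) for ci, vi in zip(c, v))
             / sum((ci - mc) ** 2 for ci in c))
    return [(vi - mv) - slope * (ci - mc) for vi, ci in zip(v, c)]

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def t_from_r(r, df):
    """t-statistic corresponding to a correlation r with df degrees of freedom."""
    return r * math.sqrt(df / (1.0 - r * r))

random.seed(1)
n = 454  # sample size chosen to match Geuvadis; all data here are simulated
geno = [float(sum(random.random() < 0.3 for _ in range(2))) for _ in range(n)]
cov = [g + random.gauss(0, 1) for g in geno]  # covariate correlated with genotype
y = [0.2 * g + 0.5 * c + random.gauss(0, 1) for g, c in zip(geno, cov)]

y_resid = residualize(y, cov)
g_resid = residualize(geno, cov)

# Residual-based approach: lm(y_resid ~ geno), n - 2 degrees of freedom.
r_resid = corr(y_resid, geno)
t_resid = t_from_r(r_resid, n - 2)

# Multiple regression: lm(y ~ geno + cov), via partial correlation, n - 3 df.
r_full = corr(y_resid, g_resid)
t_full = t_from_r(r_full, n - 3)

print(f"residual-based t = {t_resid:.3f}, multiple-regression t = {t_full:.3f}")
```

Because residualizing the genotype can only reduce its variance, the residual-based correlation is never larger in magnitude than the partial correlation. When genotypes are only weakly correlated with the covariates, as is typical, the two statistics are much closer than in this example.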

APEX calculates gene-level cis-eQTL p-values accounting for LD using ACAT^[11], which aggregates all single-variant p-values across the cis region. FastQTL and QTLtools calculate gene-level cis-eQTL p-values accounting for LD by modeling the null distribution of the minimum single-variant p-value as a beta density, with beta parameters estimated by permutation. In the examples below, results from the two methods are highly similar, but ACAT is ~30x faster. In simulation studies, type I error rates from both methods (ACAT and the permutation-based beta approximation) are well calibrated.
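
For intuition, the ACAT combination rule from Liu et al.^[11] can be written in a few lines: each p-value is transformed to a Cauchy variate, the variates are averaged with weights, and the result is mapped back to a p-value. The sketch below assumes equal weights by default and is not taken from the APEX source; the weights APEX actually uses may differ, and production code would also treat extremely small p-values with a numerically stable approximation.

```python
import math

def acat(pvalues, weights=None):
    """Combine p-values with the aggregated Cauchy association test (ACAT).

    Assumes every p-value lies strictly between 0 and 1.
    Equal weights are used when none are given (an assumption of this sketch).
    """
    if weights is None:
        weights = [1.0] * len(pvalues)
    total_weight = sum(weights)
    # Weighted average of Cauchy-transformed p-values.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues)
    ) / total_weight
    # Upper tail probability of a standard Cauchy distribution.
    return 0.5 - math.atan(statistic) / math.pi

# Hypothetical single-variant p-values for one gene's cis region.
single_variant_p = [1e-6, 0.5, 0.9]
gene_level_p = acat(single_variant_p)
print(f"gene-level p-value: {gene_level_p:.3g}")
```

No permutations are needed: the gene-level p-value is a closed-form function of the single-variant p-values, which is why this approach is much faster than estimating beta parameters by permutation.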

Results

[Figure: cis benchmark]

A: Gene-level cis-eQTL p-values in the Geuvadis LCL dataset. Associations detected only by ACAT are marked in green in each column; associations detected only by FastQTL, QTLtools, or Bonferroni (but not ACAT) are marked in magenta.

B: Comparison of time and memory usage for APEX, FastQTL, and QTLtools. FastQTL was run using the adaptive p-value setting with 100 to 1000 permutations; QTLtools was run using 1000 permutations; APEX uses ACAT to account for LD rather than permutations. For each tool, autosomal cis-eQTL analysis was run in parallel with 1 CPU per chromosome.

Meta-analysis

Multiple-variant meta-analysis in APEX requires sumstat files (generated by ./apex cis), which store SNP-trait association data, and vcov files (./apex store), which store covariate-adjusted linkage disequilibrium (LD) data.

We compared sumstat and vcov file storage sizes across 3 QTL studies:

Data sets

	GTEx LCLs^[9]	Geuvadis^[7-8]	HapMap^[10]
Variants	10,932,660	10,945,700	10,943,352
Sample size	147	454	518
Covariates	23	70	97
Genes	22,759	17,815	16,329

Storage size

	GTEx v8 LCLs	Geuvadis	HapMap
vcov file size (2 Mb window)	19.45 GB	30.38 GB	36.16 GB
vcov index size	0.56 GB	1.61 GB	2.67 GB
cis sumstats size	0.45 GB	0.39 GB	0.43 GB

References

1. Jiang, Longda, et al. “A resource-efficient tool for mixed model association analysis of large-scale data.” Nature genetics 51.12 (2019): 1749-1755. fastGWA software.
2. Loh, Po-Ru, et al. “Efficient Bayesian mixed-model analysis increases association power in large cohorts.” Nature genetics 47.3 (2015): 284. BOLT-LMM software.
3. Chen, Han, et al. “Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models.” The American Journal of Human Genetics 98.4 (2016): 653-666. GMMAT software.
4. Chen, Han, Matthew P. Conomos, and Duy T. Pham. “GMMAT-package: Generalized Linear Mixed Model Association Tests.” (2020).
5. Ongen, Halit, et al. “Fast and efficient QTL mapper for thousands of molecular phenotypes.” Bioinformatics 32.10 (2016): 1479-1485. FastQTL software.
6. Delaneau, Olivier, et al. “A complete tool set for molecular QTL discovery and analysis.” Nature communications 8.1 (2017): 1-7. QTLtools software.
7. Lappalainen, Tuuli, et al. “Transcriptome and genome sequencing uncovers functional variation in humans.” Nature 501.7468 (2013): 506-511.
8. 1000 Genomes Project Consortium. “An integrated map of genetic variation from 1,092 human genomes.” Nature 491.7422 (2012): 56-65.
9. GTEx Consortium. “The GTEx Consortium atlas of genetic regulatory effects across human tissues.” Science 369.6509 (2020): 1318-1330.
10. Bell, Jordana T., et al. “DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines.” Genome biology 12.1 (2011): R10.
11. Liu, Yaowu, et al. “ACAT: A fast and powerful p value combination method for rare-variant analysis in sequencing studies.” The American Journal of Human Genetics 104.3 (2019): 410-421.
12. Stegle, Oliver, et al. “Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses.” Nature protocols 7.3 (2012): 500.