vignettes/bmassIntro.2.RealData.Rmd
bmassIntro.2.RealData.Rmd
This is an advanced introductory vignette for running bmass
. Here, we will use real data and explore multiple components of the bmass()
output. Specifically, we will:
bmass
For a more basic introductory vignette using a smaller, simulated dataset, please see here.
GlobalLipids2013
For this vigenette, we will be download and analyze the GlobalLipids2013
dataset. To download the necessary files from http://csg.sph.umich.edu/willer/public/lipids2013/, run the following:
mkdir my-directory/GlobalLipids2013
cd my-directory/GlobalLipids2013
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_LDL.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_HDL.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TG.txt.gz
wget http://csg.sph.umich.edu/abecasis/public/lipids2013/jointGwasMc_TC.txt.gz
To run bmass, it is recommended that at least a minimal level of QC is performed on the datasets of interest. Additionally, it is important that all the single nucleotide polymorphisms (SNPs) being analyzed are oriented to the same reference allele across all the phenotypes being included. This latter situation is emphasized so that the correct direction of effect is assigned to each phenotype for a given SNP (which is oriented based on reference allele).
Below is example bash
code on how to reformat, conduct some basic QC, and orient the reference alleles on the GlobalLipids2013
dataset. First, we will reformat our input datafiles to match the requirements for bmass
. bmass
expects input files to have the following columns –
To create files that match these specifications, we will conduct the following: extract the columns that are needed, orient SNPs to the minor allele, and extract the direction of effect from the provided effect sizes.
zcat jointGwasMc_HDL.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_HDL.formatted.txt.gz
zcat jointGwasMc_LDL.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_LDL.formatted.txt.gz
zcat jointGwasMc_TG.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_TG.formatted.txt.gz
zcat jointGwasMc_TC.txt.gz | sed 's/:/ /g' | sed 's/chr//g' | awk '{ print $1 "\t" $2 "\t" $5 "\t" $12 "\t" $6 "\t" $7 "\t" $8 "\t" $11 "\t" $10 }' | perl -lane '$F[4] = uc($F[4]); $F[5] = uc($F[5]); if ($F[6] > 0) { $F[6] = "+"; } elsif ($F[6] < 0) { $F[6] = "-"; } else { $F[6] = 0; } print join("\t", @F);' | grep -v SNP_hg18 | gzip > jointGwasMc_TC.formatted.txt.gz
Next, we will conduct some basic QC. These steps include: remove SNPs that have duplicate entries, that do not have entries in every field, do not have MAF information, that are fixed (MAF = 0), are rare (MAF < .01), have sample sample sizes < 50000, and have effect sizes of 0 (ie direction of effect is indeterminable). We will also orient each SNP/phenotype file to its respective minor allele.
join <(zcat jointGwasMc_HDL.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_HDL.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_HDL.formatted.QCed.txt.gz
join <(zcat jointGwasMc_LDL.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_LDL.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_LDL.formatted.QCed.txt.gz
join <(zcat jointGwasMc_TG.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_TG.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_TG.formatted.QCed.txt.gz
join <(zcat jointGwasMc_TC.formatted.txt.gz | awk '{ print $1 "_" $2 }' | grep -v hg18 | sort | uniq -u) <(zcat jointGwasMc_TC.formatted.txt.gz | grep -v hg | grep -v ^rs | grep -v NA | perl -lane 'if ($F[3] > .5) { ($F[4], $F[5]) = ($F[5], $F[4]); if ($F[6] eq "+") { $F[6] = "-"; } elsif ($F[6] eq "-") { $F[6] = "+"; } else { $F[6] = 0; } $F[3] = 1 - $F[3]; } if ($F[6] ne "0") { if ($F[3] > .01) { if ($F[8] >= 50000) { print join("\t", @F); } } }' | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'print join("\t", @F[1..$#F]);' | gzip > jointGwasMc_TC.formatted.QCed.txt.gz
Lastly, we will orient all phenotype files to the HDL minor allele:
join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_LDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_LDL.formatted.QCed.HDLMatch.txt.gz
join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_TG.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_TG.formatted.QCed.HDLMatch.txt.gz
join <(zcat jointGwasMc_HDL.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $5 }' | sort -k 1,1) <(zcat jointGwasMc_TC.formatted.QCed.txt.gz | awk '{ print $1 "_" $2 "\t" $0 }' | sort -k 1,1) | perl -lane 'if ($F[1] eq $F[7]) { ($F[6], $F[7]) = ($F[7], $F[6]); if ($F[8] eq "-") { $F[8] = "+"; } elsif ($F[8] eq "+") { $F[8] = "-"; } else { print STDERR "Error1 -- direction of effect allele matching"; } } print join("\t", @F[2..$#F]);' | cat <(echo -e "Chr\tBP\tMarker\tMAF\tA1\tA2\tDirection\tpValue\tN") - | gzip > jointGwasMc_TC.formatted.QCed.HDLMatch.txt.gz
bmass
Now with our four GlobalLipids2013
files properly formatted and QCed, we are able to run bmass
. To do so, we will also need the list of published univariate GWAS SNPs from the GlobalLipids2013
publication (presented as two columns per SNP, chromosome and basepair position) – these have been provided in the data
subdirectory (GlobalLipids2013.GWASsnps.txt
). To run bmass
with our prepared input datafiles and our list of univariate GWAS SNPs, we run the following R code:
library("bmass");
HDL <- read.table("jointGwasMc_HDL.formatted.QCed.txt.gz", header=T);
LDL <- read.table("jointGwasMc_LDL.formatted.QCed.HDLMatch.txt.gz", header=T);
TG <- read.table("jointGwasMc_TG.formatted.QCed.HDLMatch.txt.gz", header=T);
TC <- read.table("jointGwasMc_TC.formatted.QCed.HDLMatch.txt.gz", header=T);
load("bmassDirectory/data/GlobalLipids2013.GWASsnps.rda");
Phenotypes <- c("HDL", "LDL", "TG", "TC");
bmassResults <- bmass(Phenotypes, GlobalLipids2013.GWASsnps);
Note once again that to run bmass
we supply a vector of phenotype names that also correspond to the variable names referencing each phenotype’s respective data file. Also note that you will see a warning that one or more of the GlobalLipids2013
datafiles contains p-values smaller than the threshold at which R can handle converting them into logarthmic scale – this is to be expected.
bmass
returns a list containing multiple areas of information, including separate sublists pertaining to the previous univariate GWAS SNPs used for training the multivariate model priors and the new multivariate SNPs (if any) bmass has idenfitifed as significant. Run summary(bmassResults)
and you should see the following:
> summary(bmassResults)
Length Class Mode
MarginalSNPs 3 -none- list
PreviousSNPs 4 -none- list
NewSNPs 3 -none- list
LogFile 20 -none- character
ZScoresCorMatrix 16 -none- numeric
Models 324 -none- numeric
ModelPriors 1134 -none- numeric
GWASlogBFMinThreshold 1 -none- numeric
We will touch on all the above outputs.
The NewSNPs
sublist contains information pertaining to the new significant multivariate associations found by bmass
, if any were identified. Running summary(bmassResults$NewSNPs)
you should see the following:
> summary(bmassResults$NewSNPs)
Length Class Mode
SNPs 30 data.frame list
logBFs 5427 -none- numeric
Posteriors 5427 -none- numeric
bmassResults$NewSNPs$SNPs
contains the ‘main’ bmass
output of interest – the list of new multivariate associations found by bmass
and those SNPs’ related information. The format of this output appears as follows:
> head(bmassResults$NewSNPs$SNPs, n=3)
ChrBP Chr BP Marker MAF A1 HDL_A2 HDL_Direction
1704 10_101902054 10 101902054 rs2862954 0.4631 C T +
72106 10_5839619 10 5839619 rs2275774 0.1781 G A +
118903 11_109521729 11 109521729 rs661171 0.2876 T G +
HDL_pValue HDL_N HDL_ZScore LDL_Direction LDL_pValue LDL_N
1704 1.287e-06 186893 4.841751 + 5.875e-01 172821.0
72106 7.601e-07 179144 4.945343 - 7.773e-05 165198.0
118903 1.705e-06 186946 4.785573 + 1.653e-02 172877.9
LDL_ZScore TG_Direction TG_pValue TG_N TG_ZScore TC_Direction
1704 0.5424624 + 0.013930 177587.1 2.459063 +
72106 -3.9512933 - 0.001035 169853.0 -3.280836 -
118903 2.3969983 - 0.155800 177645.0 -1.419340 +
TC_pValue TC_N TC_ZScore GWASannot mvstat mvstat_log10pVal unistat
1704 2.526e-04 187083 3.659609 0 48.70946 9.173067 23.44255
72106 1.911e-02 179333 -2.343378 0 38.26288 7.004804 24.45642
118903 1.785e-05 187131 4.290215 0 37.84098 6.917765 22.90171
unistat_log10pVal Nmin logBFWeightedAvg
1704 5.890421 172821.0 7.068306
72106 6.119129 165198.0 5.447201
118903 5.768276 172877.9 5.438490
NewSNPs$logBFs
contains the log10 Bayes Factor (BF) for each multivariate model tested (in the form of a Model x SNP matrix, with the first columns of the matrix describing each model). NewSNPs$Posteriors
is a similar matrix, just with each matrix entry representing the posterior probability for each model/SNP combination instead of the log10 BF.
> dim(bmassResults$NewSNPs$logBFs)
[1] 81 67
> bmassResults$NewSNPs$logBFs[1:5,1:10]
HDL LDL TG TC 10_101902054 10_5839619 11_109521729 11_13313759
[1,] 0 0 0 0 0.000000 0.0000000 0.0000000 0.00000000
[2,] 1 0 0 0 -233.831047 -235.0781367 -234.6670299 -234.15472067
[3,] 2 0 0 0 0.000000 0.0000000 0.0000000 0.00000000
[4,] 0 1 0 0 0.165855 0.3172489 -0.1100418 0.06393596
[5,] 1 1 0 0 -64.774919 -66.2959645 -69.2388829 -69.93309528
11_45696596 11_47251202
[1,] 0.00000000 0.00000000
[2,] -231.68478695 -219.10321932
[3,] 0.00000000 0.00000000
[4,] -0.04838241 0.04997886
[5,] -65.59917325 -45.04269241
PreviousSNPs
contains similar information as NewSNPs
does, but pertaining to the set of previous univariate GWAS significant SNPs used to train the multivariate model priors and set the empirical significance threshold for weighted average BF. Running summary(bmassResults$PreviousSNPs)
you see:
> summary(bmassResults$PreviousSNPs)
Length Class Mode
logBFs 12069 -none- numeric
SNPs 30 data.frame list
DontPassSNPs 30 data.frame list
Posteriors 12069 -none- numeric
PreviousSNPs$SNPs
, PreviousSNPs$logBFs
, and PreviousSNPs$Posteriors
all match the descriptions as above with NewSNPs
. However PreviousSNPs$DontPassSNPs
refers to any GWAS SNPs which are included in the final publicaiton list but which do meet the original study’s univariate GWAS p-value threshold based on the dataset released. This can for instance occur when a final GWAS SNP list is produced from combining a discovery dataset with a replication dataset, and the former data is the only part of the study publicly released. Therefore bmass
keeps track of these SNPs so as not to call them a completely novel multivariate SNP, but also to not use them to train the multivariate model priors since they do not technically reach univariate GWAS significance based on the dataset provided alone.
The format of DontPassSNPs
is the same seen in NewSNPs$SNPs
and PreviousSNPs$SNPs
.
MarginalSNPs
contains the same information as NewSNPs
and PreviousSNPs
, but pertaining to the set of marginally significant SNPs extracted from the intermediate MergedDataSources
dataset (a data.frame that contains all the input phenotype files combined) based on the SNPMarginalUnivariateThreshold
and SNPMarginalMultivariateThreshold
values (default of 1e-6 for each). summary(bmassResults$MarginalSNPs)
returns:
> summary(bmassResults$MarginalSNPs)
Length Class Mode
SNPs 30 data.frame list
logBFs 20493 -none- numeric
Posteriors 20493 -none- numeric
MarginalSNPs$SNPs
, MarginalSNPs$logBFs
, and MarginalSNPs$Posteriors
are all as described above.
This is the phenotype correlation matrix (V_0 hat in Detailed Methods (Global Lipids Analysis)
of Stephens 2013 PLoS ONE) derived from extracting all the ‘null’ SNPs (abs(ZScore) < 2 for every phenotype) in the dataset.
This is a matrix displaying all the model combinations possible given the set of d input phenotypes and 3 model categories of directly associated (1), indirectly associated (2), and unassociated (0). The total number of models displayed should be 3^d.
This is the set of trained priors determined for each multivariate model using the previous univariate GWAS SNPs and an EM algorithm (see eq. 36 in Stephens 2013). Note that for any model which does not contain at least one phenotype in the directly associated category, the prior was automatically set to 0 (aside from the global null model of all phenotypes set to unassociated).
Also note that ModelPriors contains a full set of model priors for each value of the hyperparameter sigma_alpha (see “Prior on sigma_alpha” in Stephens 2013); these values are stored in SigmaAlphas
, and by default are set to c(0.005,0.0075,0.01,0.015,0.02,0.03,0.04,0.05,0.06,0.07,0.08,0.09,0.1,0.15)
(hence 81 models * 14 sigma_alphas = 1134 priors). To see a more interpretable form of the model priors, use the GetModelPriorMatrix()
function as shown below.
> length(bmassResults$ModelPriors)
[1] 1134
> bmassResults[c("ModelPriorMatrix", "LogFile")] <- GetModelPriorMatrix(Phenotypes, bmassResults$Models, bmassResults$ModelPriors, bmassResults$LogFile)
> head(bmassResults$ModelPriorMatrix)
HDL LDL TG TC Prior Cumm_Prior OrigOrder
1 1 2 1 1 0.32744537 0.3274454 44
2 1 2 2 1 0.13788501 0.4653304 53
3 1 1 1 2 0.11727440 0.5826048 68
4 1 1 1 1 0.07801825 0.6606230 41
5 1 2 1 2 0.06210658 0.7227296 71
6 2 1 2 2 0.05698876 0.7797184 78
ModelPriorMatrix
contains the model descriptions as provided in Models
along with the combined trained priors across each sigma_alpha value; the ModelPriomatrix
is sorted by decreasing model prior value and also shows the cumulative distribution of the priors as well as the original order position of each model.
This is the empirical weighted average BF threshold (BFavg) used to determine whether any new SNPs are ‘multivariate significant’. As described in Stephens 2013, each SNP has a single summary metric combining all the information across each multivariate model tested; this summary metric is the weighted average of each model’s BF, where weights are determined from the previous univariate GWAS SNPs (eq. 1 in Turchin and Stephens 2019 or eq. 8 in Stephens 2013). This summary metric is also calculated for the previous unviariate GWAS SNPs themselves, and the smallest BFavg among them becomes the threshold above which new SNPs are considered multivariate significant.
So every new multivariate SNP bmass
identifies in this GlobalLipids2013
analysis has a BFavg value >= 4.289906 (and is also more than a 1Mb window away from a previous univariate GWAS SNP).