vignettes/MAPITR.Intro.SimulatedData.Rmd
MAPITR.Intro.SimulatedData.Rmd
This is a simple introductory vignette to the MAPITR
package. In this introductory example, we will use a simulated dataset to run MAPITR
and explore some basic output. The dataset is provided with the package.
MAPITR
and input dataFor this introductory vignette, we will be using MAPITR_SimData_Phenotype
, MAPITR_SimData_Genotypes
, MAPITR_SimData_Pathways
, and MAPITR_SimData_PCs
. These are simulated datasets created for the purposes of this vignette. To load MAPITR
and the datasets, run the following:
library("MAPITR") data(MAPITR_SimData_Phenotype, MAPITR_SimData_Genotypes, MAPITR_SimData_Pathways, MAPITR_SimData_PCs)
First, we’ll take a look at the formats of our simulated input data:
MAPITR_SimData_Genotypes[1:10,1:10]
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 1 2 0 1 1 0 1 0 0 2
## 2 2 0 1 2 1 2 1 1 1 2
## 3 0 0 1 2 1 1 1 1 1 0
## 4 1 2 1 0 1 2 2 0 2 1
## 5 1 0 1 2 0 0 2 2 1 1
## 6 2 1 1 1 2 0 0 0 2 1
## 7 0 1 1 1 1 1 0 1 0 0
## 8 1 1 1 1 1 2 0 0 2 1
## 9 0 2 1 0 2 2 2 1 2 0
## 10 2 2 0 2 1 0 0 1 1 0
Here you should see the first ten rows and columns of our genotype matrix. The genotype matrix is n x p, so the rows are our n individuals and the columns are our p SNPs. SNP genotypes, the entries of each column, are assumed to be 0, 1, or 2 – so note, diploid information is assumed to have been collapsed into a single value for each SNP. For example, the typical diploid setup of 0/0, 0/1, and 1/1, which represents the three possible combinations a given pair of reference (0) and alternative (1) alleles can have across both chromosomal copies in an individual, is now instead represented as 0 (0/0), 1 (0/1), or 2 (1/1). One way to derive a genotype matrix with this type of encoding is to use PLINK’s --recodeAD
output option and remove all the *_HET
columns afterwards.
Also note that the genotype matrix cannot have any missing genotypes (ie 0% rate of genotype missingness). This is necessary due to how the genetic relatedness matrices (GRMs), required components for the MAPITR
variance component model, are constructed. For each entry in the genotype matrix, the only acceptable values are 0, 1, and 2.
head(MAPITR_SimData_Phenotype)
## V1
## 1 -0.4518125
## 2 -0.4628268
## 3 0.4655829
## 4 -0.3667877
## 5 0.5072770
## 6 -0.0278133
Here you should see the first ten values of our input phenotype. MAPITR
only runs on one phenotype at a time currently, so the main function only requires a single vector of phenotypic values as input.
head(MAPITR_SimData_Pathways)
## V1
## 1 Pathway1
## 2 Pathway2
## 3 Pathway3
## 4 Pathway4
## 5 Pathway5
## V2
## 1 663,259,220,149,567,514,624,661,14,142,476,394,711,68,565,138,326,689,665,490,674,360,536,515,521,147,172,459,465,240,435,317,703,643,316,503,359,492,534,177,38,717,7,146,28,496,430,370,89,135,96,200,518,441,57,383,500,315,630,171,302,523,437,418,384,25,82,284,247,444,213,362,429,653,740,527,745,389,342,162,81,153,163,563,325,634,616,633,116,434,337,594,405,367,622,548,228,51,78,36
## 2 649,232,196,550,455,727,457,290,365,184,176,555,454,420,59,377,556,245,543,298,266,629,233,283,437,367,237,351,328,182,456,49,742,171,424,558,494,165,427,686,728,253,662,641,174,10,381,199,294,459,387,687,32,297,678,30,55,27,142,39,87,694,160,252,270,552,128,11,206,732,91,545,413,617,132,219,306,511,661,522,688,636,280,4,58,105,26,483,704,682,595,708,500,46,498,744,519,47,325,339
## 3 311,750,370,280,240,551,146,107,425,582,391,643,507,134,660,349,683,305,430,296,639,414,510,463,110,396,18,353,154,288,451,676,651,401,302,725,516,552,276,377,571,71,468,424,435,259,84,498,350,329,187,91,323,713,557,696,330,320,304,174,429,658,131,732,648,337,408,100,244,400,104,432,623,630,485,169,540,80,471,23,520,261,334,410,197,192,355,398,217,325,533,178,326,542,549,181,439,625,592,493
## 4 17,242,560,104,96,95,406,191,266,411,230,320,311,545,591,589,386,408,227,45,616,462,414,596,196,538,226,623,82,15,66,442,500,232,173,435,564,144,693,667,608,29,563,11,28,100,61,669,463,369,221,516,267,629,401,354,138,180,670,576,33,514,429,622,148,651,277,107,535,729,181,436,733,128,119,198,206,460,372,709,485,335,428,171,495,203,554,459,42,642,351,660,605,127,14,53,416,137,41,510
## 5 41,553,729,696,322,38,458,684,210,450,411,457,94,185,297,494,206,104,577,539,251,274,141,8,481,124,743,734,40,45,631,705,595,397,306,440,342,525,739,575,167,360,611,348,189,50,282,357,530,133,582,464,142,722,615,151,218,288,376,154,620,727,418,541,569,265,139,284,109,533,459,605,88,136,312,359,334,82,340,486,316,66,9,391,742,404,690,91,71,456,623,536,442,273,15,329,663,373,603,5
Here you should see a matrix with two columns. The first column should list the names of each pathway being analyzed. The second column should be a list of comma-separated column indices, representing the set of SNPs belonging to each associated pathway. These indices will be used by MAPITR
to extract the proper SNPs per pathway from the input genotype matrix to conduct the marginal epistasis analysis. Note that the values for this second column are not the SNP IDs (eg rsID#s) or column headers (if your genotype matrix has them), they are the specific column locations in the input genotype matrix for each SNP.
MAPITR
Now to run MAPITR
, we use the following code:
MAPITR_Output <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways)
And to see the output MAPITR()
provides, we can run the following:
head(MAPITR_Output$Results)
## Pathways pValues Est PVE
## 1 Pathway1 0.00511314 1.09663702 1.32243728
## 2 Pathway2 0.31118083 0.40971253 0.45740085
## 3 Pathway3 0.33131510 0.36199437 0.43933281
## 4 Pathway4 0.96134690 0.01706105 0.02017861
## 5 Pathway5 0.94181499 0.02552678 0.03176702
The output shown from MAPITR_Output$Results
should be a matrix with four columns. The first column should be the list of pathway names being analyzed (same order as input file), the second column should be the MAPITR
p-values for each pathway, the third column should be the estimate of the epistatic GRM variance component (\(g_l\) in equation 3 of the MAPITR
paper) for each pathway, and the fourth column should be the proportion of variance explained (PVE) for the phenotype by each pathway. Note that PVE estimates here are for pathways that generally contain overlapping variants – this impacts our interpretation of PVE. When single variant models are used in complex trait analysis, which is often the case, PVE can theoretically sum to 1 across all variants. A priori it is unclear what PVE should theoretically sum to across multiple pathways due to this common scenario where variants are shared across pathway definitions.
In this example, there should be one pathway that has a p-value below the Bonferonni-corrected threshold of .01 (.05 / 5, the number of pathways tested):
MAPITR_Output$Results[MAPITR_Output$Results[,2] < .01,]
## Pathways pValues Est PVE
## 1 Pathway1 0.00511314 1.096637 1.322437
Also note this is a simulated dataset made to run quickly for this vignette – you will very likely need many more individuals and SNPs to identify an epistatic signal, and PVE should never be > 1 in real data.
Lastly, we will go over how to incorporate covariates into this test. Our main recommendation is that most covariates should be directly regressed out from the phenotype prior to MAPITR
analysis. In the MAPITR
paper this is how we correct for sex, age, and assessment center. The main MAPITR
function also automatically regresses out the additive effects of each pathway from the phenotype. However, we also provide an option to remove the effects of a covariate from both the phenotype and genotypes – specificially this is the orthogonality projection matrix, H, used to produce equation 3 from in the MAPITR
paper. This projection matrix is applied to both sides of the model, thus allowing an effect to removed both from the phenotype and the GRMs. By default a column of ones is included in this matrix to act as a y-intercept term. However, any additional covariates can be appended to this matrix. We currently include top principal components into this matrix to help remove more subtle population stratification effects from the GRMs.
To include the top ten principal components of our simulated dataset into the MAPITR
analysis, we can run the following:
MAPITR_SimData_PCs[1:5,1:10]
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] 0.1127320 -0.4486358 0.6656510 -3.0188042 1.54672672 -0.9981512
## [2,] 1.9343189 -0.2892368 -1.3136277 0.4662708 0.04351503 -0.1170024
## [3,] 1.9908207 -4.1118289 0.5600524 -0.4923170 1.26330957 0.4864907
## [4,] -0.1753511 -0.8886352 -1.2132968 1.0306922 0.26797451 -2.0537700
## [5,] -1.4277754 -1.2106821 0.5611386 -0.9433904 1.01580351 -3.1516246
## PC7 PC8 PC9 PC10
## [1,] 0.2063473 -0.8389474 1.3046610 1.8209930
## [2,] -0.2457400 0.5114055 2.1700065 0.9436350
## [3,] 2.0332192 0.9221459 1.7075565 0.4946276
## [4,] 1.0631471 0.9807124 -0.6004786 0.1891073
## [5,] 0.7364327 -0.3032539 -1.3603440 2.8166732
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs) MAPITR_Output2$Results[MAPITR_Output2$Results[,2] < .01,]
## Pathways pValues Est PVE
## 1 Pathway1 0.002208826 1.212191 1.479455
And after doing so, we find that including top principal components does further help refine our epistatic signal in this simulated dataset.
OpenMP versions of the MAPITR
code is available. If possible, it is recommended to use OpenMP versions since it will allow the code to run more quickly. Note that OpenMP must already be installed on your system however to run these versions of the functions.
To run OpenMP versions of MAPITR
, use the same commands as before but with the inclusion of the following option: OpenMP = TRUE
. Examples are shown below:
MAPITR_Output1 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, OpenMP=TRUE)
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs, OpenMP=TRUE)
For further details on the MAPITR
model, please see the associated manuscript.