The following script represents the transcription factor motif-based analysis performed in our recent paper “An OCT2 / OCA-B / MEF2B Ternary Complex Controls the Activity and Architecture of an Essential Locus Control Region for Normal and Malignant Germinal Center B-cells”. This script runs through the following steps for the two TFs we are focusing on, OCT2 (POU2F2) and MEF2B :
The data generated here (i.e. TF motif matches to the BCL6 locus) are used in the analysis of TF motif density in essential vs non-essential enhancers described in a separate script.
The script is extensively annotated and can thus be used as a hands-on tutorial for TF motif analysis. Please feel free to email me with any questions, comments or suggestions.
info at jchellmuth.com
library(universalmotif)
library(JASPAR2018)
library(BSgenome.Hsapiens.UCSC.hg38)
library(dplyr)
library(ggplot2)
Note on nomenclature: this is the ‘search subject’, i.e. the region to be searched for a given motif.
The region of interest here is the BCL6 locus, losely defined as:
chr3:187,700,000-188,300,000
hg38 <- BSgenome.Hsapiens.UCSC.hg38
chr3 <- getSeq(hg38,"chr3")
roi <- subseq(chr3,start=187700000,end=188300000)
Download TF motif from JASPAR using TFBSTools.
http://jaspar.genereg.net/matrix/MA0507.1/
# import motif from file:
pfm <- TFBSTools::getMatrixByID(x = JASPAR2018,ID = "MA0507.1")
##
# check key tags
pfm@tags[c("symbol","description","collection","type","source","species")]
## $symbol
## [1] "POU2F2"
##
## $description
## [1] "POU class 2 homeobox 2"
##
## $collection
## [1] "CORE"
##
## $type
## [1] "ChIP-seq"
##
## $source
## [1] "ENCODE"
##
## $species
## 9606
## "Homo sapiens"
# convert to universalmotif class for easier handling
pfm <- convert_motifs(pfm,class = "universalmotif-universalmotif")
Motifs frequently contain positions with low information content (see non-trimmed motif below). Here, I am using universalmotif’s trim_motif
to remove edges of motifs with an information content <0.5
# visualize full motif
view_motifs(pfm)
# trim
pfm.trim <- trim_motifs(pfm,min.ic = 0.5)
# visualize trimmed motif:
view_motifs(pfm.trim)
There are 3 commonly used TF motif representations:
Each of these is informative in a different way and is therefore displayed below.
This is the most basic representation of a motif and is how the motif is provided by Jaspar.
knitr::kable(pfm.trim@motif,right = F)
A | T | T | T | G | C | A | T | |
---|---|---|---|---|---|---|---|---|
A | 2016 | 0 | 256 | 0 | 158 | 0 | 2257 | 0 |
C | 19 | 0 | 2 | 0 | 0 | 2271 | 30 | 2 |
G | 1 | 0 | 0 | 0 | 1997 | 0 | 0 | 0 |
T | 251 | 2287 | 2029 | 2287 | 132 | 16 | 0 | 2285 |
ICMs are usually used to represent the motif as a probability sequence logo (as above). The height of the letters represent their probabilities at each position. The ICM has a column sum between 0 (no base preference) and 2 (only 1 base used).
icm <- convert_type(pfm.trim,type = "ICM")
knitr::kable(icm@motif)
A | T | T | T | G | C | A | T | |
---|---|---|---|---|---|---|---|---|
A | 1.2560832 | 0.000218 | 0.1659172 | 0.000218 | 0.0915148 | 0.0002115 | 1.8698794 | 0.0002170 |
C | 0.0119924 | 0.000218 | 0.0014568 | 0.000218 | 0.0001446 | 1.9217363 | 0.0250587 | 0.0019526 |
G | 0.0007787 | 0.000218 | 0.0001619 | 0.000218 | 1.1549947 | 0.0002115 | 0.0002071 | 0.0002170 |
T | 0.1565237 | 1.994559 | 1.3139025 | 1.994559 | 0.0764792 | 0.0137494 | 0.0002071 | 1.9831807 |
PWMs are generally used to evaluate how well a sequence matches a motif. Thus, use PWM for scan_sequences
below.
Following Nishida et al. Nucleic Aciids Research 2009, a pseudocount of 0.8 is added for conversion to PWM.
pwm <- convert_type(pfm.trim,type = "PWM",pseudocount = 0.8)
knitr::kable(pwm@motif)
A | T | T | T | G | C | A | T | |
---|---|---|---|---|---|---|---|---|
A | 1.817678 | -11.481673 | -1.158619 | -11.481673 | -1.854139 | -11.481673 | 1.980573 | -11.481673 |
C | -4.896711 | -11.481673 | -8.022242 | -11.481673 | -11.481673 | 1.989494 | -4.243269 | -8.022242 |
G | -8.896711 | -11.481673 | -11.481673 | -11.481673 | 1.804018 | -11.481673 | -11.481673 | -11.481673 |
T | -1.187053 | 1.999622 | 1.826950 | 1.999622 | -2.113167 | -5.141823 | -11.481673 | 1.998360 |
From the universalmotif vignette: Given a motif of length n, scan_sequences() considers every possible n-length subset in a sequence and scores it using the PWM format. If the match surpasses the minimum threshold, it is reported.
For now, I am using a low threshold. Which thershold / match to consider significant is a complex question and will be dealt with in downstream analysis.
# scan sequence for matches:
res <- scan_sequences(pwm,DNAStringSet(roi),threshold = 0.75,threshold.type = "logodds",RC = T,verbose = 0)
Correct match coordinates by adding the genomic coordinate of the start pos of the region of interest. and add chromosome name.
res$start <- res$start+187700000-1
res$stop <- res$stop+187700000-1
res$sequence <- "chr3"
P-value calculation as per universalMotif Vignette:
https://bioconductor.org/packages/release/bioc/vignettes/universalmotif/inst/doc/MotifComparisonAndPvalues.pdf
Note: smaller k leads to faster but worse approximations, and larger k leads to slower but better approximations. If k is equal to the width of the motif, then the calculation is exact.
# get background nucleotide frequency:
bkg <- Biostrings::oligonucleotideFrequency(roi,width = 1,as.prob = T)
# calculate p values:
res$pvalue <- motif_pvalue(pwm, res$score, bkg.probs = bkg,k = 8)
For minus strand hits produced by scan_sequences
, the ‘start’ position of after the ’end position. To make these genomic positions work in bed files and UCSC, I will flip start and stop positions for minus strand hits.
minus <- res$strand=="-"
res[minus, c("start", "stop")] <- res[minus, c("stop", "start")]
Similarly, add the reverse complement of hits on the minus strand.
res$match.stranded <- res$match
res$match.stranded[minus] <- reverseComplement(DNAStringSet(res$match.stranded[minus]))
Save results for later analysis
# save results:
save(pwm,file = "OCT2_MA0507.1_PWM.Rda")
save(res,file = "BCL6locus-matches_OCT2_MA0507.1.Rda")
The question of which match is to be considered significant or biologically relevant is difficult to answer and will depend on things like the type of analysis, the TF at hand or chromatin context.
The following chunk is meant to give a rough overview of how relative match scores (score.pct
in the scan_sequences results) relate to PWM scores and p-values (an vice versa).
df <- data.frame(rel.score=c(0.75,0.8,0.85,0.9,0.95,0.99,1))
df$pwm.score <- motif_score(pwm,df$rel.score)
df$pvalue <- motif_pvalue(pwm,score = df$pwm.score)
knitr::kable(df)
rel.score | pwm.score | pvalue |
---|---|---|
0.75 | -10.75675 | 0.0082550 |
0.80 | -5.52300 | 0.0044250 |
0.85 | -0.28925 | 0.0010376 |
0.90 | 4.94450 | 0.0003815 |
0.95 | 10.17825 | 0.0000763 |
0.99 | 14.36525 | 0.0000153 |
1.00 | 15.41200 | 0.0000000 |
The following chunk gives you example matches at different threshold.
res$cut <- cut(res$score.pct,breaks = c(75,80,85,90,95,99,100))
res %>% group_by(cut) %>% top_n(-1,score.pct) %>% filter(!duplicated(score.pct)) %>% arrange(score.pct) %>% ungroup() %>% select(-cut) %>% knitr::kable()
motif | sequence | start | stop | score | max.score | score.pct | match | strand | pvalue | match.stranded |
---|---|---|---|---|---|---|---|---|---|---|
POU2F2 | chr3 | 187700654 | 187700661 | -9.955 | 15.41557 | 75.21742 | TTAGGCCT | + | 0.0135559 | TTAGGCCT |
POU2F2 | chr3 | 187749361 | 187749368 | -5.058 | 15.41557 | 80.00093 | GTCTGCAT | + | 0.0071330 | GTCTGCAT |
POU2F2 | chr3 | 187717154 | 187717161 | 0.842 | 15.41557 | 85.76419 | GTTTTCAT | + | 0.0018727 | GTTTTCAT |
POU2F2 | chr3 | 187711693 | 187711700 | 5.275 | 15.41557 | 90.09445 | ATTTTCCT | + | 0.0006622 | ATTTTCCT |
POU2F2 | chr3 | 187740876 | 187740883 | 11.496 | 15.41557 | 96.17127 | ATTTTCAT | + | 0.0001145 | ATTTTCAT |
POU2F2 | chr3 | 187707731 | 187707738 | 15.411 | 15.41557 | 99.99553 | ATTTGCAT | + | 0.0000262 | ATTTGCAT |
Analysis as above but w/o annotation. ### Motif download http://jaspar.genereg.net/matrix/MA0660.1/
# import motif from file:
pfm <- TFBSTools::getMatrixByID(x = JASPAR2018,ID = "MA0660.1")
# check key tags
pfm@tags[c("remap_tf_name","collection","type","source","species")]
## $remap_tf_name
## [1] "MEF2B"
##
## $collection
## [1] "CORE"
##
## $type
## [1] "HT-SELEX"
##
## $source
## [1] "23332764"
##
## $species
## 9606
## "Homo sapiens"
# convert to universalmotif class for easier handling
pfm <- convert_motifs(pfm,class = "universalmotif-universalmotif")
# visualize full motif
view_motifs(pfm)
# trim
pfm.trim <- trim_motifs(pfm,min.ic = 0.5)
# visualize trimmed motif:
view_motifs(pfm.trim)
knitr::kable(pfm.trim@motif,right = F)
C | T | A | W | A | A | A | T | A | G | |
---|---|---|---|---|---|---|---|---|---|---|
A | 21 | 0 | 3710 | 1768 | 3032 | 3561 | 3675 | 0 | 3718 | 112 |
C | 3693 | 29 | 0 | 2 | 0 | 3 | 0 | 0 | 4 | 22 |
G | 0 | 0 | 0 | 0 | 12 | 2 | 0 | 0 | 2 | 3591 |
T | 14 | 3699 | 18 | 1958 | 684 | 162 | 53 | 3728 | 4 | 3 |
icm <- convert_type(pfm.trim,type = "ICM")
knitr::kable(icm@motif)
C | T | A | W | A | A | A | T | A | G | |
---|---|---|---|---|---|---|---|---|---|---|
A | 0.0108964 | 0.0001295 | 1.9435226 | 0.4714365 | 1.0412121 | 1.6469185 | 1.8627202 | 0.0001339 | 1.9614370 | 0.0524562 |
C | 1.8937980 | 0.0151530 | 0.0001310 | 0.0005999 | 0.0000858 | 0.0015030 | 0.0001267 | 0.0001339 | 0.0022419 | 0.0103978 |
G | 0.0001282 | 0.0001295 | 0.0001310 | 0.0000667 | 0.0042064 | 0.0010405 | 0.0001267 | 0.0001339 | 0.0011869 | 1.6782495 |
T | 0.0073070 | 1.9164031 | 0.0095598 | 0.5220928 | 0.2349573 | 0.0750334 | 0.0269886 | 1.9965197 | 0.0022419 | 0.0015188 |
pwm <- convert_type(pfm.trim,type = "PWM",pseudocount = 0.8)
knitr::kable(pwm@motif)
C | T | A | W | A | A | A | T | A | G | |
---|---|---|---|---|---|---|---|---|---|---|
A | -5.458503 | -12.186424 | 1.992786 | 0.923570 | 1.7016535 | 1.933652 | 1.979111 | -12.186424 | 1.995893 | -3.054567 |
C | 1.986160 | -4.996599 | -12.186424 | -8.726992 | -12.1864238 | -8.186424 | -12.186424 | -12.186424 | -7.794106 | -5.392008 |
G | -12.186424 | -12.186424 | -12.186424 | -12.186424 | -6.2556865 | -8.726992 | -12.186424 | -12.186424 | -8.726992 | 1.945755 |
T | -6.036677 | 1.988502 | -5.678629 | 1.070817 | -0.4462214 | -2.522866 | -4.131141 | 1.999768 | -7.794106 | -8.186424 |
# scan sequence for matches:
res <- scan_sequences(pwm,DNAStringSet(roi),threshold = 0.75,threshold.type = "logodds",RC = T,verbose = 0)
res$start <- res$start+187700000-1
res$stop <- res$stop+187700000-1
res$sequence <- "chr3"
# get background nucleotide frequency:
bkg <- Biostrings::oligonucleotideFrequency(roi,width = 1,as.prob = T)
# calculate p values:
res$pvalue <- motif_pvalue(pwm, res$score, bkg.probs = bkg,k = 8)
minus <- res$strand=="-"
res[minus, c("start", "stop")] <- res[minus, c("stop", "start")]
res$match.stranded <- res$match
res$match.stranded[minus] <- reverseComplement(DNAStringSet(res$match.stranded[minus]))
# save results:
save(pwm,file = "MEF2B_MA0660.1_PWM.Rda")
save(res,file = "BCL6locus-matches_MEF2B_MA0660.1.Rda")
df <- data.frame(rel.score=c(0.75,0.8,0.85,0.9,0.95,0.99,1))
df$pwm.score <- motif_score(pwm,df$rel.score)
df$pvalue <- motif_pvalue(pwm,score = df$pwm.score)
knitr::kable(df)
rel.score | pwm.score | pvalue |
---|---|---|
0.75 | -13.79400 | 0.0060081 |
0.80 | -7.31760 | 0.0018015 |
0.85 | -0.84120 | 0.0004683 |
0.90 | 5.63520 | 0.0000858 |
0.95 | 12.11160 | 0.0000095 |
0.99 | 17.29272 | 0.0000019 |
1.00 | 18.58800 | 0.0000000 |
res$cut <- cut(res$score.pct,breaks = c(75,80,85,90,95,99,100))
res %>% group_by(cut) %>% top_n(-1,score.pct) %>% filter(!duplicated(score.pct)) %>% arrange(score.pct) %>% ungroup() %>% select(-cut) %>% knitr::kable()
motif | sequence | start | stop | score | max.score | score.pct | match | strand | pvalue | match.stranded |
---|---|---|---|---|---|---|---|---|---|---|
MEF2B | chr3 | 187926606 | 187926615 | -13.206 | 18.59354 | 75.00016 | CTTTCTTTAG | + | 0.0108889 | CTTTCTTTAG |
MEF2B | chr3 | 188055725 | 188055734 | -6.846 | 18.59354 | 80.00020 | GCAAATATAG | + | 0.0035803 | GCAAATATAG |
MEF2B | chr3 | 188233541 | 188233550 | -0.469 | 18.59354 | 85.01361 | CTATGATTAA | + | 0.0009354 | CTATGATTAA |
MEF2B | chr3 | 188247283 | 188247292 | 5.878 | 18.59354 | 90.00343 | CTATTTTTAG | + | 0.0001938 | CTATTTTTAG |
MEF2B | chr3 | 188145045 | 188145054 | 12.333 | 18.59354 | 95.07815 | CTAATTTTAG | - | 0.0000208 | CTAAAATTAG |
MEF2B | chr3 | 188291474 | 188291483 | 18.441 | 18.59354 | 99.88008 | CTAAAAATAG | + | 0.0000021 | CTAAAAATAG |
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_3.2.1 dplyr_0.8.3
## [3] BSgenome.Hsapiens.UCSC.hg38_1.4.1 BSgenome_1.52.0
## [5] rtracklayer_1.44.4 Biostrings_2.52.0
## [7] XVector_0.24.0 GenomicRanges_1.36.1
## [9] GenomeInfoDb_1.20.0 IRanges_2.18.3
## [11] S4Vectors_0.22.1 BiocGenerics_0.30.0
## [13] JASPAR2018_1.1.1 universalmotif_1.2.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-142 bitops_1.0-6
## [3] matrixStats_0.55.0 ggtree_1.16.6
## [5] DirichletMultinomial_1.26.0 TFBSTools_1.22.0
## [7] bit64_0.9-7 httr_1.4.1
## [9] tools_3.6.1 backports_1.1.5
## [11] R6_2.4.1 seqLogo_1.50.0
## [13] DBI_1.0.0 lazyeval_0.2.2
## [15] colorspace_1.4-1 withr_2.1.2
## [17] tidyselect_0.2.5 processx_3.4.1
## [19] bit_1.1-14 compiler_3.6.1
## [21] Biobase_2.44.0 DelayedArray_0.10.0
## [23] labeling_0.3 caTools_1.17.1.3
## [25] scales_1.1.0 readr_1.3.1
## [27] stringr_1.4.0 digest_0.6.23
## [29] Rsamtools_2.0.3 R.utils_2.9.0
## [31] rmarkdown_1.18 pkgconfig_2.0.3
## [33] htmltools_0.4.0 bibtex_0.4.2
## [35] highr_0.8 rlang_0.4.2
## [37] VGAM_1.1-2 RSQLite_2.1.2
## [39] farver_2.0.1 jsonlite_1.6
## [41] BiocParallel_1.18.1 gtools_3.8.1
## [43] R.oo_1.23.0 RCurl_1.95-4.12
## [45] magrittr_1.5 GO.db_3.8.2
## [47] GenomeInfoDbData_1.2.1 Matrix_1.2-18
## [49] Rcpp_1.0.3 munsell_0.5.0
## [51] ape_5.3 R.methodsS3_1.7.1
## [53] lifecycle_0.1.0 stringi_1.4.3
## [55] yaml_2.2.0 gbRd_0.4-11
## [57] SummarizedExperiment_1.14.1 zlibbioc_1.30.0
## [59] plyr_1.8.4 grid_3.6.1
## [61] ggseqlogo_0.1 blob_1.2.0
## [63] crayon_1.3.4 CNEr_1.20.0
## [65] lattice_0.20-38 splines_3.6.1
## [67] annotate_1.62.0 KEGGREST_1.24.1
## [69] hms_0.5.2 zeallot_0.1.0
## [71] knitr_1.26 ps_1.3.0
## [73] pillar_1.4.2 reshape2_1.4.3
## [75] TFMPvalue_0.0.8 XML_3.98-1.20
## [77] glue_1.3.1 evaluate_0.14
## [79] BiocManager_1.30.10 png_0.1-7
## [81] vctrs_0.2.0 treeio_1.8.2
## [83] Rdpack_0.11-0 gtable_0.3.0
## [85] poweRlaw_0.70.2 purrr_0.3.3
## [87] tidyr_1.0.0 assertthat_0.2.1
## [89] xfun_0.11 xtable_1.8-4
## [91] tidytree_0.3.0 tibble_2.1.3
## [93] rvcheck_0.1.7 AnnotationDbi_1.46.1
## [95] GenomicAlignments_1.20.1 memoise_1.1.0