Introduction

The following script represents the transcription factor motif-based analysis performed in our recent paper “An OCT2 / OCA-B / MEF2B Ternary Complex Controls the Activity and Architecture of an Essential Locus Control Region for Normal and Malignant Germinal Center B-cells”. This script runs through the following steps for the two TFs we are focusing on, OCT2 (POU2F2) and MEF2B :

The data generated here (i.e. TF motif matches to the BCL6 locus) are used in the analysis of TF motif density in essential vs non-essential enhancers described in a separate script.
The script is extensively annotated and can thus be used as a hands-on tutorial for TF motif analysis. Please feel free to email me with any questions, comments or suggestions.
info at jchellmuth.com

Load required libraries

library(universalmotif)
library(JASPAR2018)
library(BSgenome.Hsapiens.UCSC.hg38)
library(dplyr)
library(ggplot2)

Get the genomic sequence for the region of interest

Note on nomenclature: this is the ‘search subject’, i.e. the region to be searched for a given motif.
The region of interest here is the BCL6 locus, losely defined as:
chr3:187,700,000-188,300,000

hg38 <- BSgenome.Hsapiens.UCSC.hg38
chr3 <- getSeq(hg38,"chr3") 
roi <- subseq(chr3,start=187700000,end=188300000)

OCT2 motif analysis

Motif download

Download TF motif from JASPAR using TFBSTools.
http://jaspar.genereg.net/matrix/MA0507.1/

# import motif from file:
pfm <- TFBSTools::getMatrixByID(x = JASPAR2018,ID = "MA0507.1")
## 
# check key tags
pfm@tags[c("symbol","description","collection","type","source","species")]
## $symbol
## [1] "POU2F2"
## 
## $description
## [1] "POU class 2 homeobox 2"
## 
## $collection
## [1] "CORE"
## 
## $type
## [1] "ChIP-seq"
## 
## $source
## [1] "ENCODE"
## 
## $species
##           9606 
## "Homo sapiens"
# convert to universalmotif class for easier handling
pfm <- convert_motifs(pfm,class = "universalmotif-universalmotif")

Trim motif

Motifs frequently contain positions with low information content (see non-trimmed motif below). Here, I am using universalmotif’s trim_motif to remove edges of motifs with an information content <0.5

# visualize full motif
view_motifs(pfm)

# trim
pfm.trim <- trim_motifs(pfm,min.ic = 0.5)
# visualize trimmed motif:
view_motifs(pfm.trim)

Motif representations

There are 3 commonly used TF motif representations:

  • Position Frequncy Matrix (PFM) aka Postion Count Matrix (PCM)
  • Information Content Matrix (ICM)
  • Postion Weight Matrix (PWM) aka Logodds Scoring Matrix

Each of these is informative in a different way and is therefore displayed below.

Position Frequency Matrix (PFM) aka Postion Count Matrix (PCM)

This is the most basic representation of a motif and is how the motif is provided by Jaspar.

knitr::kable(pfm.trim@motif,right = F)
A T T T G C A T
A 2016 0 256 0 158 0 2257 0
C 19 0 2 0 0 2271 30 2
G 1 0 0 0 1997 0 0 0
T 251 2287 2029 2287 132 16 0 2285

Information Content Matrix (ICM)

ICMs are usually used to represent the motif as a probability sequence logo (as above). The height of the letters represent their probabilities at each position. The ICM has a column sum between 0 (no base preference) and 2 (only 1 base used).

icm <- convert_type(pfm.trim,type = "ICM")
knitr::kable(icm@motif)
A T T T G C A T
A 1.2560832 0.000218 0.1659172 0.000218 0.0915148 0.0002115 1.8698794 0.0002170
C 0.0119924 0.000218 0.0014568 0.000218 0.0001446 1.9217363 0.0250587 0.0019526
G 0.0007787 0.000218 0.0001619 0.000218 1.1549947 0.0002115 0.0002071 0.0002170
T 0.1565237 1.994559 1.3139025 1.994559 0.0764792 0.0137494 0.0002071 1.9831807

Postion Weight Matrix (PWM) aka Logodds Scoring Matrix

PWMs are generally used to evaluate how well a sequence matches a motif. Thus, use PWM for scan_sequences below.
Following Nishida et al. Nucleic Aciids Research 2009, a pseudocount of 0.8 is added for conversion to PWM.

pwm <- convert_type(pfm.trim,type = "PWM",pseudocount = 0.8)
knitr::kable(pwm@motif)
A T T T G C A T
A 1.817678 -11.481673 -1.158619 -11.481673 -1.854139 -11.481673 1.980573 -11.481673
C -4.896711 -11.481673 -8.022242 -11.481673 -11.481673 1.989494 -4.243269 -8.022242
G -8.896711 -11.481673 -11.481673 -11.481673 1.804018 -11.481673 -11.481673 -11.481673
T -1.187053 1.999622 1.826950 1.999622 -2.113167 -5.141823 -11.481673 1.998360

Search for motif in region of interest

From the universalmotif vignette: Given a motif of length n, scan_sequences() considers every possible n-length subset in a sequence and scores it using the PWM format. If the match surpasses the minimum threshold, it is reported.
For now, I am using a low threshold. Which thershold / match to consider significant is a complex question and will be dealt with in downstream analysis.

# scan sequence for matches:
res <- scan_sequences(pwm,DNAStringSet(roi),threshold = 0.75,threshold.type = "logodds",RC = T,verbose = 0)

Correct match coordinates by adding the genomic coordinate of the start pos of the region of interest. and add chromosome name.

res$start <- res$start+187700000-1
res$stop <- res$stop+187700000-1
res$sequence <- "chr3"

Calculate p-values

P-value calculation as per universalMotif Vignette:
https://bioconductor.org/packages/release/bioc/vignettes/universalmotif/inst/doc/MotifComparisonAndPvalues.pdf
Note: smaller k leads to faster but worse approximations, and larger k leads to slower but better approximations. If k is equal to the width of the motif, then the calculation is exact.

# get background nucleotide frequency:
bkg <- Biostrings::oligonucleotideFrequency(roi,width = 1,as.prob = T)
# calculate p values:
res$pvalue <- motif_pvalue(pwm, res$score, bkg.probs = bkg,k = 8)

Modify results object

For minus strand hits produced by scan_sequences, the ‘start’ position of after the ’end position. To make these genomic positions work in bed files and UCSC, I will flip start and stop positions for minus strand hits.

minus <- res$strand=="-"
res[minus, c("start", "stop")] <- res[minus, c("stop", "start")]

Similarly, add the reverse complement of hits on the minus strand.

res$match.stranded <- res$match
res$match.stranded[minus] <- reverseComplement(DNAStringSet(res$match.stranded[minus]))

Save results for later analysis

# save results:
save(pwm,file = "OCT2_MA0507.1_PWM.Rda")
save(res,file = "BCL6locus-matches_OCT2_MA0507.1.Rda")

Cutoff evaluation

The question of which match is to be considered significant or biologically relevant is difficult to answer and will depend on things like the type of analysis, the TF at hand or chromatin context.
The following chunk is meant to give a rough overview of how relative match scores (score.pct in the scan_sequences results) relate to PWM scores and p-values (an vice versa).

df <- data.frame(rel.score=c(0.75,0.8,0.85,0.9,0.95,0.99,1))
df$pwm.score <- motif_score(pwm,df$rel.score)
df$pvalue <- motif_pvalue(pwm,score = df$pwm.score)
knitr::kable(df)
rel.score pwm.score pvalue
0.75 -10.75675 0.0082550
0.80 -5.52300 0.0044250
0.85 -0.28925 0.0010376
0.90 4.94450 0.0003815
0.95 10.17825 0.0000763
0.99 14.36525 0.0000153
1.00 15.41200 0.0000000

The following chunk gives you example matches at different threshold.

res$cut <- cut(res$score.pct,breaks = c(75,80,85,90,95,99,100))
res %>% group_by(cut) %>% top_n(-1,score.pct) %>% filter(!duplicated(score.pct)) %>% arrange(score.pct) %>% ungroup() %>% select(-cut) %>% knitr::kable()
motif sequence start stop score max.score score.pct match strand pvalue match.stranded
POU2F2 chr3 187700654 187700661 -9.955 15.41557 75.21742 TTAGGCCT + 0.0135559 TTAGGCCT
POU2F2 chr3 187749361 187749368 -5.058 15.41557 80.00093 GTCTGCAT + 0.0071330 GTCTGCAT
POU2F2 chr3 187717154 187717161 0.842 15.41557 85.76419 GTTTTCAT + 0.0018727 GTTTTCAT
POU2F2 chr3 187711693 187711700 5.275 15.41557 90.09445 ATTTTCCT + 0.0006622 ATTTTCCT
POU2F2 chr3 187740876 187740883 11.496 15.41557 96.17127 ATTTTCAT + 0.0001145 ATTTTCAT
POU2F2 chr3 187707731 187707738 15.411 15.41557 99.99553 ATTTGCAT + 0.0000262 ATTTGCAT

MEF2B motif analysis

Analysis as above but w/o annotation. ### Motif download http://jaspar.genereg.net/matrix/MA0660.1/

# import motif from file:
pfm <- TFBSTools::getMatrixByID(x = JASPAR2018,ID = "MA0660.1")
# check key tags
pfm@tags[c("remap_tf_name","collection","type","source","species")]
## $remap_tf_name
## [1] "MEF2B"
## 
## $collection
## [1] "CORE"
## 
## $type
## [1] "HT-SELEX"
## 
## $source
## [1] "23332764"
## 
## $species
##           9606 
## "Homo sapiens"
# convert to universalmotif class for easier handling
pfm <- convert_motifs(pfm,class = "universalmotif-universalmotif")

Trim motif

# visualize full motif
view_motifs(pfm)

# trim
pfm.trim <- trim_motifs(pfm,min.ic = 0.5)
# visualize trimmed motif:
view_motifs(pfm.trim)

Motif representations

Position Frequency Matrix (PFM) aka Postion Count Matrix (PCM)

knitr::kable(pfm.trim@motif,right = F)
C T A W A A A T A G
A 21 0 3710 1768 3032 3561 3675 0 3718 112
C 3693 29 0 2 0 3 0 0 4 22
G 0 0 0 0 12 2 0 0 2 3591
T 14 3699 18 1958 684 162 53 3728 4 3

Information Content Matrix (ICM)

icm <- convert_type(pfm.trim,type = "ICM")
knitr::kable(icm@motif)
C T A W A A A T A G
A 0.0108964 0.0001295 1.9435226 0.4714365 1.0412121 1.6469185 1.8627202 0.0001339 1.9614370 0.0524562
C 1.8937980 0.0151530 0.0001310 0.0005999 0.0000858 0.0015030 0.0001267 0.0001339 0.0022419 0.0103978
G 0.0001282 0.0001295 0.0001310 0.0000667 0.0042064 0.0010405 0.0001267 0.0001339 0.0011869 1.6782495
T 0.0073070 1.9164031 0.0095598 0.5220928 0.2349573 0.0750334 0.0269886 1.9965197 0.0022419 0.0015188

Postion Weight Matrix (PWM) aka Logodds Scoring Matrix

pwm <- convert_type(pfm.trim,type = "PWM",pseudocount = 0.8)
knitr::kable(pwm@motif)
C T A W A A A T A G
A -5.458503 -12.186424 1.992786 0.923570 1.7016535 1.933652 1.979111 -12.186424 1.995893 -3.054567
C 1.986160 -4.996599 -12.186424 -8.726992 -12.1864238 -8.186424 -12.186424 -12.186424 -7.794106 -5.392008
G -12.186424 -12.186424 -12.186424 -12.186424 -6.2556865 -8.726992 -12.186424 -12.186424 -8.726992 1.945755
T -6.036677 1.988502 -5.678629 1.070817 -0.4462214 -2.522866 -4.131141 1.999768 -7.794106 -8.186424

Search for motif in region of interest

# scan sequence for matches:
res <- scan_sequences(pwm,DNAStringSet(roi),threshold = 0.75,threshold.type = "logodds",RC = T,verbose = 0)
res$start <- res$start+187700000-1
res$stop <- res$stop+187700000-1
res$sequence <- "chr3"

Calculate p-values

# get background nucleotide frequency:
bkg <- Biostrings::oligonucleotideFrequency(roi,width = 1,as.prob = T)
# calculate p values:
res$pvalue <- motif_pvalue(pwm, res$score, bkg.probs = bkg,k = 8)

Modify results object

minus <- res$strand=="-"
res[minus, c("start", "stop")] <- res[minus, c("stop", "start")]
res$match.stranded <- res$match
res$match.stranded[minus] <- reverseComplement(DNAStringSet(res$match.stranded[minus]))
# save results:
save(pwm,file = "MEF2B_MA0660.1_PWM.Rda")
save(res,file = "BCL6locus-matches_MEF2B_MA0660.1.Rda")

Cutoff evaluation

df <- data.frame(rel.score=c(0.75,0.8,0.85,0.9,0.95,0.99,1))
df$pwm.score <- motif_score(pwm,df$rel.score)
df$pvalue <- motif_pvalue(pwm,score = df$pwm.score)
knitr::kable(df)
rel.score pwm.score pvalue
0.75 -13.79400 0.0060081
0.80 -7.31760 0.0018015
0.85 -0.84120 0.0004683
0.90 5.63520 0.0000858
0.95 12.11160 0.0000095
0.99 17.29272 0.0000019
1.00 18.58800 0.0000000
res$cut <- cut(res$score.pct,breaks = c(75,80,85,90,95,99,100))
res %>% group_by(cut) %>% top_n(-1,score.pct) %>% filter(!duplicated(score.pct)) %>% arrange(score.pct) %>% ungroup() %>% select(-cut) %>% knitr::kable()
motif sequence start stop score max.score score.pct match strand pvalue match.stranded
MEF2B chr3 187926606 187926615 -13.206 18.59354 75.00016 CTTTCTTTAG + 0.0108889 CTTTCTTTAG
MEF2B chr3 188055725 188055734 -6.846 18.59354 80.00020 GCAAATATAG + 0.0035803 GCAAATATAG
MEF2B chr3 188233541 188233550 -0.469 18.59354 85.01361 CTATGATTAA + 0.0009354 CTATGATTAA
MEF2B chr3 188247283 188247292 5.878 18.59354 90.00343 CTATTTTTAG + 0.0001938 CTATTTTTAG
MEF2B chr3 188145045 188145054 12.333 18.59354 95.07815 CTAATTTTAG - 0.0000208 CTAAAATTAG
MEF2B chr3 188291474 188291483 18.441 18.59354 99.88008 CTAAAAATAG + 0.0000021 CTAAAAATAG

Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.2.1                     dplyr_0.8.3                      
##  [3] BSgenome.Hsapiens.UCSC.hg38_1.4.1 BSgenome_1.52.0                  
##  [5] rtracklayer_1.44.4                Biostrings_2.52.0                
##  [7] XVector_0.24.0                    GenomicRanges_1.36.1             
##  [9] GenomeInfoDb_1.20.0               IRanges_2.18.3                   
## [11] S4Vectors_0.22.1                  BiocGenerics_0.30.0              
## [13] JASPAR2018_1.1.1                  universalmotif_1.2.1             
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-142                bitops_1.0-6               
##  [3] matrixStats_0.55.0          ggtree_1.16.6              
##  [5] DirichletMultinomial_1.26.0 TFBSTools_1.22.0           
##  [7] bit64_0.9-7                 httr_1.4.1                 
##  [9] tools_3.6.1                 backports_1.1.5            
## [11] R6_2.4.1                    seqLogo_1.50.0             
## [13] DBI_1.0.0                   lazyeval_0.2.2             
## [15] colorspace_1.4-1            withr_2.1.2                
## [17] tidyselect_0.2.5            processx_3.4.1             
## [19] bit_1.1-14                  compiler_3.6.1             
## [21] Biobase_2.44.0              DelayedArray_0.10.0        
## [23] labeling_0.3                caTools_1.17.1.3           
## [25] scales_1.1.0                readr_1.3.1                
## [27] stringr_1.4.0               digest_0.6.23              
## [29] Rsamtools_2.0.3             R.utils_2.9.0              
## [31] rmarkdown_1.18              pkgconfig_2.0.3            
## [33] htmltools_0.4.0             bibtex_0.4.2               
## [35] highr_0.8                   rlang_0.4.2                
## [37] VGAM_1.1-2                  RSQLite_2.1.2              
## [39] farver_2.0.1                jsonlite_1.6               
## [41] BiocParallel_1.18.1         gtools_3.8.1               
## [43] R.oo_1.23.0                 RCurl_1.95-4.12            
## [45] magrittr_1.5                GO.db_3.8.2                
## [47] GenomeInfoDbData_1.2.1      Matrix_1.2-18              
## [49] Rcpp_1.0.3                  munsell_0.5.0              
## [51] ape_5.3                     R.methodsS3_1.7.1          
## [53] lifecycle_0.1.0             stringi_1.4.3              
## [55] yaml_2.2.0                  gbRd_0.4-11                
## [57] SummarizedExperiment_1.14.1 zlibbioc_1.30.0            
## [59] plyr_1.8.4                  grid_3.6.1                 
## [61] ggseqlogo_0.1               blob_1.2.0                 
## [63] crayon_1.3.4                CNEr_1.20.0                
## [65] lattice_0.20-38             splines_3.6.1              
## [67] annotate_1.62.0             KEGGREST_1.24.1            
## [69] hms_0.5.2                   zeallot_0.1.0              
## [71] knitr_1.26                  ps_1.3.0                   
## [73] pillar_1.4.2                reshape2_1.4.3             
## [75] TFMPvalue_0.0.8             XML_3.98-1.20              
## [77] glue_1.3.1                  evaluate_0.14              
## [79] BiocManager_1.30.10         png_0.1-7                  
## [81] vctrs_0.2.0                 treeio_1.8.2               
## [83] Rdpack_0.11-0               gtable_0.3.0               
## [85] poweRlaw_0.70.2             purrr_0.3.3                
## [87] tidyr_1.0.0                 assertthat_0.2.1           
## [89] xfun_0.11                   xtable_1.8-4               
## [91] tidytree_0.3.0              tibble_2.1.3               
## [93] rvcheck_0.1.7               AnnotationDbi_1.46.1       
## [95] GenomicAlignments_1.20.1    memoise_1.1.0