Introduction

The following script represents the transcription factor motif-based analysis performed in our recent paper “An OCT2 / OCA-B / MEF2B Ternary Complex Controls the Activity and Architecture of an Essential Locus Control Region for Normal and Malignant Germinal Center B-cells”. This script runs through the following steps for the two TFs we are focusing on, OCT2 (POU2F2) and MEF2B :

download TF motifs from Jaspar
trim motifs to remove low information content
convert the basic Position Count Matrix (PCM) to Position Weight Matrix (PWM)
search genomic region of interest for matches to PWM motif
calculate p values for each match

The data generated here (i.e. TF motif matches to the BCL6 locus) are used in the analysis of TF motif density in essential vs non-essential enhancers described in a separate script.
The script is extensively annotated and can thus be used as a hands-on tutorial for TF motif analysis. Please feel free to email me with any questions, comments or suggestions.
info at jchellmuth.com

Load required libraries

library(universalmotif)
library(JASPAR2018)
library(BSgenome.Hsapiens.UCSC.hg38)
library(dplyr)
library(ggplot2)

Get the genomic sequence for the region of interest

Note on nomenclature: this is the ‘search subject’, i.e. the region to be searched for a given motif.
The region of interest here is the BCL6 locus, losely defined as:
chr3:187,700,000-188,300,000

hg38 <- BSgenome.Hsapiens.UCSC.hg38
chr3 <- getSeq(hg38,"chr3") 
roi <- subseq(chr3,start=187700000,end=188300000)

OCT2 motif analysis

Motif download

Download TF motif from JASPAR using TFBSTools.
http://jaspar.genereg.net/matrix/MA0507.1/

# import motif from file:
pfm <- TFBSTools::getMatrixByID(x = JASPAR2018,ID = "MA0507.1")

##

# check key tags
pfm@tags[c("symbol","description","collection","type","source","species")]

## $symbol
## [1] "POU2F2"
## 
## $description
## [1] "POU class 2 homeobox 2"
## 
## $collection
## [1] "CORE"
## 
## $type
## [1] "ChIP-seq"
## 
## $source
## [1] "ENCODE"
## 
## $species
##           9606 
## "Homo sapiens"

# convert to universalmotif class for easier handling
pfm <- convert_motifs(pfm,class = "universalmotif-universalmotif")

Trim motif

Motifs frequently contain positions with low information content (see non-trimmed motif below). Here, I am using universalmotif’s trim_motif to remove edges of motifs with an information content <0.5

# visualize full motif
view_motifs(pfm)

# trim
pfm.trim <- trim_motifs(pfm,min.ic = 0.5)

# visualize trimmed motif:
view_motifs(pfm.trim)

Motif representations

There are 3 commonly used TF motif representations:

Position Frequncy Matrix (PFM) aka Postion Count Matrix (PCM)
Information Content Matrix (ICM)
Postion Weight Matrix (PWM) aka Logodds Scoring Matrix

Each of these is informative in a different way and is therefore displayed below.

Position Frequency Matrix (PFM) aka Postion Count Matrix (PCM)

This is the most basic representation of a motif and is how the motif is provided by Jaspar.

knitr::kable(pfm.trim@motif,right = F)

	A	T	T	T	G	C	A	T
A	2016	0	256	0	158	0	2257	0
C	19	0	2	0	0	2271	30	2
G	1	0	0	0	1997	0	0	0
T	251	2287	2029	2287	132	16	0	2285

Information Content Matrix (ICM)

ICMs are usually used to represent the motif as a probability sequence logo (as above). The height of the letters represent their probabilities at each position. The ICM has a column sum between 0 (no base preference) and 2 (only 1 base used).

icm <- convert_type(pfm.trim,type = "ICM")
knitr::kable(icm@motif)

	A	T	T	T	G	C	A	T
A	1.2560832	0.000218	0.1659172	0.000218	0.0915148	0.0002115	1.8698794	0.0002170
C	0.0119924	0.000218	0.0014568	0.000218	0.0001446	1.9217363	0.0250587	0.0019526
G	0.0007787	0.000218	0.0001619	0.000218	1.1549947	0.0002115	0.0002071	0.0002170
T	0.1565237	1.994559	1.3139025	1.994559	0.0764792	0.0137494	0.0002071	1.9831807

Postion Weight Matrix (PWM) aka Logodds Scoring Matrix

PWMs are generally used to evaluate how well a sequence matches a motif. Thus, use PWM for scan_sequences below.
Following Nishida et al. Nucleic Aciids Research 2009, a pseudocount of 0.8 is added for conversion to PWM.

pwm <- convert_type(pfm.trim,type = "PWM",pseudocount = 0.8)
knitr::kable(pwm@motif)

	A	T	T	T	G	C	A	T
A	1.817678	-11.481673	-1.158619	-11.481673	-1.854139	-11.481673	1.980573	-11.481673
C	-4.896711	-11.481673	-8.022242	-11.481673	-11.481673	1.989494	-4.243269	-8.022242
G	-8.896711	-11.481673	-11.481673	-11.481673	1.804018	-11.481673	-11.481673	-11.481673
T	-1.187053	1.999622	1.826950	1.999622	-2.113167	-5.141823	-11.481673	1.998360

Search for motif in region of interest

From the universalmotif vignette: Given a motif of length n, scan_sequences() considers every possible n-length subset in a sequence and scores it using the PWM format. If the match surpasses the minimum threshold, it is reported.
For now, I am using a low threshold. Which thershold / match to consider significant is a complex question and will be dealt with in downstream analysis.

# scan sequence for matches:
res <- scan_sequences(pwm,DNAStringSet(roi),threshold = 0.75,threshold.type = "logodds",RC = T,verbose = 0)

Correct match coordinates by adding the genomic coordinate of the start pos of the region of interest. and add chromosome name.

res$start <- res$start+187700000-1
res$stop <- res$stop+187700000-1
res$sequence <- "chr3"

Calculate p-values

P-value calculation as per universalMotif Vignette:
https://bioconductor.org/packages/release/bioc/vignettes/universalmotif/inst/doc/MotifComparisonAndPvalues.pdf
Note: smaller k leads to faster but worse approximations, and larger k leads to slower but better approximations. If k is equal to the width of the motif, then the calculation is exact.

# get background nucleotide frequency:
bkg <- Biostrings::oligonucleotideFrequency(roi,width = 1,as.prob = T)
# calculate p values:
res$pvalue <- motif_pvalue(pwm, res$score, bkg.probs = bkg,k = 8)

Modify results object

For minus strand hits produced by scan_sequences, the ‘start’ position of after the ’end position. To make these genomic positions work in bed files and UCSC, I will flip start and stop positions for minus strand hits.

minus <- res$strand=="-"
res[minus, c("start", "stop")] <- res[minus, c("stop", "start")]

Similarly, add the reverse complement of hits on the minus strand.

res$match.stranded <- res$match
res$match.stranded[minus] <- reverseComplement(DNAStringSet(res$match.stranded[minus]))

Save results for later analysis

# save results:
save(pwm,file = "OCT2_MA0507.1_PWM.Rda")
save(res,file = "BCL6locus-matches_OCT2_MA0507.1.Rda")

Cutoff evaluation

The question of which match is to be considered significant or biologically relevant is difficult to answer and will depend on things like the type of analysis, the TF at hand or chromatin context.
The following chunk is meant to give a rough overview of how relative match scores (score.pct in the scan_sequences results) relate to PWM scores and p-values (an vice versa).

df <- data.frame(rel.score=c(0.75,0.8,0.85,0.9,0.95,0.99,1))
df$pwm.score <- motif_score(pwm,df$rel.score)
df$pvalue <- motif_pvalue(pwm,score = df$pwm.score)
knitr::kable(df)

rel.score	pwm.score	pvalue
0.75	-10.75675	0.0082550
0.80	-5.52300	0.0044250
0.85	-0.28925	0.0010376
0.90	4.94450	0.0003815
0.95	10.17825	0.0000763
0.99	14.36525	0.0000153
1.00	15.41200	0.0000000

The following chunk gives you example matches at different threshold.

res$cut <- cut(res$score.pct,breaks = c(75,80,85,90,95,99,100))
res %>% group_by(cut) %>% top_n(-1,score.pct) %>% filter(!duplicated(score.pct)) %>% arrange(score.pct) %>% ungroup() %>% select(-cut) %>% knitr::kable()

motif	sequence	start	stop	score	max.score	score.pct	match	strand	pvalue	match.stranded
POU2F2	chr3	187700654	187700661	-9.955	15.41557	75.21742	TTAGGCCT	+	0.0135559	TTAGGCCT
POU2F2	chr3	187749361	187749368	-5.058	15.41557	80.00093	GTCTGCAT	+	0.0071330	GTCTGCAT
POU2F2	chr3	187717154	187717161	0.842	15.41557	85.76419	GTTTTCAT	+	0.0018727	GTTTTCAT
POU2F2	chr3	187711693	187711700	5.275	15.41557	90.09445	ATTTTCCT	+	0.0006622	ATTTTCCT
POU2F2	chr3	187740876	187740883	11.496	15.41557	96.17127	ATTTTCAT	+	0.0001145	ATTTTCAT
POU2F2	chr3	187707731	187707738	15.411	15.41557	99.99553	ATTTGCAT	+	0.0000262	ATTTGCAT

MEF2B motif analysis

Analysis as above but w/o annotation. ### Motif download http://jaspar.genereg.net/matrix/MA0660.1/

# import motif from file:
pfm <- TFBSTools::getMatrixByID(x = JASPAR2018,ID = "MA0660.1")
# check key tags
pfm@tags[c("remap_tf_name","collection","type","source","species")]

## $remap_tf_name
## [1] "MEF2B"
## 
## $collection
## [1] "CORE"
## 
## $type
## [1] "HT-SELEX"
## 
## $source
## [1] "23332764"
## 
## $species
##           9606 
## "Homo sapiens"

# convert to universalmotif class for easier handling
pfm <- convert_motifs(pfm,class = "universalmotif-universalmotif")

Trim motif

# visualize full motif
view_motifs(pfm)

# trim
pfm.trim <- trim_motifs(pfm,min.ic = 0.5)

# visualize trimmed motif:
view_motifs(pfm.trim)

Motif representations

Position Frequency Matrix (PFM) aka Postion Count Matrix (PCM)

knitr::kable(pfm.trim@motif,right = F)

	C	T	A	W	A	A	A	T	A	G
A	21	0	3710	1768	3032	3561	3675	0	3718	112
C	3693	29	0	2	0	3	0	0	4	22
G	0	0	0	0	12	2	0	0	2	3591
T	14	3699	18	1958	684	162	53	3728	4	3

Information Content Matrix (ICM)

icm <- convert_type(pfm.trim,type = "ICM")
knitr::kable(icm@motif)

	C	T	A	W	A	A	A	T	A	G
A	0.0108964	0.0001295	1.9435226	0.4714365	1.0412121	1.6469185	1.8627202	0.0001339	1.9614370	0.0524562
C	1.8937980	0.0151530	0.0001310	0.0005999	0.0000858	0.0015030	0.0001267	0.0001339	0.0022419	0.0103978
G	0.0001282	0.0001295	0.0001310	0.0000667	0.0042064	0.0010405	0.0001267	0.0001339	0.0011869	1.6782495
T	0.0073070	1.9164031	0.0095598	0.5220928	0.2349573	0.0750334	0.0269886	1.9965197	0.0022419	0.0015188

Postion Weight Matrix (PWM) aka Logodds Scoring Matrix

pwm <- convert_type(pfm.trim,type = "PWM",pseudocount = 0.8)
knitr::kable(pwm@motif)

	C	T	A	W	A	A	A	T	A	G
A	-5.458503	-12.186424	1.992786	0.923570	1.7016535	1.933652	1.979111	-12.186424	1.995893	-3.054567
C	1.986160	-4.996599	-12.186424	-8.726992	-12.1864238	-8.186424	-12.186424	-12.186424	-7.794106	-5.392008
G	-12.186424	-12.186424	-12.186424	-12.186424	-6.2556865	-8.726992	-12.186424	-12.186424	-8.726992	1.945755
T	-6.036677	1.988502	-5.678629	1.070817	-0.4462214	-2.522866	-4.131141	1.999768	-7.794106	-8.186424

Search for motif in region of interest

# scan sequence for matches:
res <- scan_sequences(pwm,DNAStringSet(roi),threshold = 0.75,threshold.type = "logodds",RC = T,verbose = 0)

res$start <- res$start+187700000-1
res$stop <- res$stop+187700000-1
res$sequence <- "chr3"

Calculate p-values

# get background nucleotide frequency:
bkg <- Biostrings::oligonucleotideFrequency(roi,width = 1,as.prob = T)
# calculate p values:
res$pvalue <- motif_pvalue(pwm, res$score, bkg.probs = bkg,k = 8)

Modify results object

minus <- res$strand=="-"
res[minus, c("start", "stop")] <- res[minus, c("stop", "start")]

res$match.stranded <- res$match
res$match.stranded[minus] <- reverseComplement(DNAStringSet(res$match.stranded[minus]))

# save results:
save(pwm,file = "MEF2B_MA0660.1_PWM.Rda")
save(res,file = "BCL6locus-matches_MEF2B_MA0660.1.Rda")

Cutoff evaluation

df <- data.frame(rel.score=c(0.75,0.8,0.85,0.9,0.95,0.99,1))
df$pwm.score <- motif_score(pwm,df$rel.score)
df$pvalue <- motif_pvalue(pwm,score = df$pwm.score)
knitr::kable(df)

rel.score	pwm.score	pvalue
0.75	-13.79400	0.0060081
0.80	-7.31760	0.0018015
0.85	-0.84120	0.0004683
0.90	5.63520	0.0000858
0.95	12.11160	0.0000095
0.99	17.29272	0.0000019
1.00	18.58800	0.0000000

res$cut <- cut(res$score.pct,breaks = c(75,80,85,90,95,99,100))
res %>% group_by(cut) %>% top_n(-1,score.pct) %>% filter(!duplicated(score.pct)) %>% arrange(score.pct) %>% ungroup() %>% select(-cut) %>% knitr::kable()

motif	sequence	start	stop	score	max.score	score.pct	match	strand	pvalue	match.stranded
MEF2B	chr3	187926606	187926615	-13.206	18.59354	75.00016	CTTTCTTTAG	+	0.0108889	CTTTCTTTAG
MEF2B	chr3	188055725	188055734	-6.846	18.59354	80.00020	GCAAATATAG	+	0.0035803	GCAAATATAG
MEF2B	chr3	188233541	188233550	-0.469	18.59354	85.01361	CTATGATTAA	+	0.0009354	CTATGATTAA
MEF2B	chr3	188247283	188247292	5.878	18.59354	90.00343	CTATTTTTAG	+	0.0001938	CTATTTTTAG
MEF2B	chr3	188145045	188145054	12.333	18.59354	95.07815	CTAATTTTAG	-	0.0000208	CTAAAATTAG
MEF2B	chr3	188291474	188291483	18.441	18.59354	99.88008	CTAAAAATAG	+	0.0000021	CTAAAAATAG

Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggplot2_3.2.1                     dplyr_0.8.3                      
##  [3] BSgenome.Hsapiens.UCSC.hg38_1.4.1 BSgenome_1.52.0                  
##  [5] rtracklayer_1.44.4                Biostrings_2.52.0                
##  [7] XVector_0.24.0                    GenomicRanges_1.36.1             
##  [9] GenomeInfoDb_1.20.0               IRanges_2.18.3                   
## [11] S4Vectors_0.22.1                  BiocGenerics_0.30.0              
## [13] JASPAR2018_1.1.1                  universalmotif_1.2.1             
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-142                bitops_1.0-6               
##  [3] matrixStats_0.55.0          ggtree_1.16.6              
##  [5] DirichletMultinomial_1.26.0 TFBSTools_1.22.0           
##  [7] bit64_0.9-7                 httr_1.4.1                 
##  [9] tools_3.6.1                 backports_1.1.5            
## [11] R6_2.4.1                    seqLogo_1.50.0             
## [13] DBI_1.0.0                   lazyeval_0.2.2             
## [15] colorspace_1.4-1            withr_2.1.2                
## [17] tidyselect_0.2.5            processx_3.4.1             
## [19] bit_1.1-14                  compiler_3.6.1             
## [21] Biobase_2.44.0              DelayedArray_0.10.0        
## [23] labeling_0.3                caTools_1.17.1.3           
## [25] scales_1.1.0                readr_1.3.1                
## [27] stringr_1.4.0               digest_0.6.23              
## [29] Rsamtools_2.0.3             R.utils_2.9.0              
## [31] rmarkdown_1.18              pkgconfig_2.0.3            
## [33] htmltools_0.4.0             bibtex_0.4.2               
## [35] highr_0.8                   rlang_0.4.2                
## [37] VGAM_1.1-2                  RSQLite_2.1.2              
## [39] farver_2.0.1                jsonlite_1.6               
## [41] BiocParallel_1.18.1         gtools_3.8.1               
## [43] R.oo_1.23.0                 RCurl_1.95-4.12            
## [45] magrittr_1.5                GO.db_3.8.2                
## [47] GenomeInfoDbData_1.2.1      Matrix_1.2-18              
## [49] Rcpp_1.0.3                  munsell_0.5.0              
## [51] ape_5.3                     R.methodsS3_1.7.1          
## [53] lifecycle_0.1.0             stringi_1.4.3              
## [55] yaml_2.2.0                  gbRd_0.4-11                
## [57] SummarizedExperiment_1.14.1 zlibbioc_1.30.0            
## [59] plyr_1.8.4                  grid_3.6.1                 
## [61] ggseqlogo_0.1               blob_1.2.0                 
## [63] crayon_1.3.4                CNEr_1.20.0                
## [65] lattice_0.20-38             splines_3.6.1              
## [67] annotate_1.62.0             KEGGREST_1.24.1            
## [69] hms_0.5.2                   zeallot_0.1.0              
## [71] knitr_1.26                  ps_1.3.0                   
## [73] pillar_1.4.2                reshape2_1.4.3             
## [75] TFMPvalue_0.0.8             XML_3.98-1.20              
## [77] glue_1.3.1                  evaluate_0.14              
## [79] BiocManager_1.30.10         png_0.1-7                  
## [81] vctrs_0.2.0                 treeio_1.8.2               
## [83] Rdpack_0.11-0               gtable_0.3.0               
## [85] poweRlaw_0.70.2             purrr_0.3.3                
## [87] tidyr_1.0.0                 assertthat_0.2.1           
## [89] xfun_0.11                   xtable_1.8-4               
## [91] tidytree_0.3.0              tibble_2.1.3               
## [93] rvcheck_0.1.7               AnnotationDbi_1.46.1       
## [95] GenomicAlignments_1.20.1    memoise_1.1.0

TF motif analysis including motif scanning / matching to region of interest

BCL6 LCR paper - script 3.1

Johannes C. Hellmuth

2/6/2020

Introduction

Load required libraries

Get the genomic sequence for the region of interest

OCT2 motif analysis

Motif download

Trim motif

Motif representations

Position Frequency Matrix (PFM) aka Postion Count Matrix (PCM)

Information Content Matrix (ICM)

Postion Weight Matrix (PWM) aka Logodds Scoring Matrix

Search for motif in region of interest

Calculate p-values

Modify results object

Cutoff evaluation

MEF2B motif analysis

Trim motif

Motif representations

Position Frequency Matrix (PFM) aka Postion Count Matrix (PCM)

Information Content Matrix (ICM)

Postion Weight Matrix (PWM) aka Logodds Scoring Matrix

Search for motif in region of interest

Calculate p-values

Modify results object

Cutoff evaluation

Session info