A Comparison of Peak Callers Used for DNase-Seq Data

Hashem Koohy, Thomas Down, Mikhail Spivakov and Tim Hubbard
Spivakov's and Fraser's Lab
September 16, 2014

Biology Becomes the Most Data-Intensive Science!
Biological Processes; Experimental Data; Probability; Machine Learning; Statistical Inference; Pattern Recognition; Mathematical and Statistical Modelling; Software Engineering; Annotations; Visualisation

ChIP-Seq Data Analysis
Sequencing, Mapping and Quality Control: Sequencing is getting cheaper, providing us with more data! Mapping is possibly still the most computationally expensive part.
Peak Calling: Gauging the statistical significance of read enrichment, generally known as peak calling, is central to ChIP-Seq data analysis.
Post-Peak-Calling Analysis: Different directions and purposes, including differential binding analysis, motif discovery, detection of regulatory regions, genome segmentation and so on.

Why So Many Peak Callers?
Different protein classes have distinct modes of interaction:
Point-Source: These factors and chromatin marks are localised at specific sites and have a high signal-to-noise ratio.
Broad-Source: These factors are associated with wide genomic domains, generating broad but noisier signals, e.g. H3K9me3, H3K36me3.
Mixed-Source: These factors show a point-source style signal at some regions and a broader signal at others, e.g. RNA Pol II.

ChIP-Seq vs DNase-Seq
Note that DNase HS is different from its sister, DNase footprinting.
Figure sources: ChIP-Seq: Nature Reviews, Peter J. Park, 2009; DNase HS: Duke protocol.

TF ChIP-Seq vs DNase-Seq
Some key differences between TF ChIP-Seq and DNase-Seq:
In ChIP-Seq data, a protein is usually either bound or unbound, whereas DNase I shows a more generic behaviour, representing the openness of the chromatin to any regulatory feature;
DNase HS signal is strand-independent, so there is no need for shift-size estimation or tag extension;
DNase HS data sometimes show less enrichment over wider regions (a kind of Mixed-Source).

Currently Existing DNase-Seq Protocols
Double Hit Protocol: Developed in John Stamatoyannopoulos's lab (the Stam Lab) at the University of Washington; it has been used extensively for the detection of DHSs in the ENCODE project.
End Capture Protocol: Developed in Greg Crawford's lab at Duke University. It has been used for the detection of DHSs in ENCODE and is also widely used by other researchers worldwide.
ATAC-Seq: Developed in the Greenleaf Lab at Stanford University. This is a very new protocol (published 2013) and has been reported to be very fast and very efficient.

End Capture (Duke) vs Double Hit (UW) Protocol
[Figure: protocol steps shown: ligate biotinylated linker 1, MmeI digestion, ligate linker 2, PCR amplification, sequencing.]
End Capture Protocol: Greg Crawford Lab, Duke; Double Hit Protocol: John Stamatoyannopoulos Lab, UW.

Study Design
We sought to assess four peak callers used for DNase-Seq data: Hotspot, F-Seq, MACS and ZINBA;
The comparison was repeated on three human cell lines (GM12878, K562 and HeLa-S3), on chr22 only;
Raw data were obtained from the ENCODE repository (from both the Duke and UW protocols);
The comparison was made over a range of signal thresholds (statistical significance of signals);
All remaining parameters were kept at their defaults (although we also assessed them individually);
The level of overlap between detected peaks and TF binding sites was defined as the measure of comparison;
The same process was repeated with the Duke data too.

DNase-Seq Peak Callers
Hotspot: The peak caller behind the ENCODE DHSs.
F-Seq: Initially developed with DNase I-Seq data in mind, but it has been used for TF ChIP-Seq data too.
MACS: Initially developed for TF ChIP-Seq, but has shown great performance on DNase-Seq data. This is the most used and most cited peak caller.
ZINBA: Meant to be a generic peak caller for TF and chromatin ChIP-Seq, DNase-Seq, RNA-Seq and FAIRE-Seq.

Hotspot
Tries to gauge the local enrichment of tags by centring each tag in a small (250 bp) and a large (50 kb) window;
The ratio of the number of tags in the two windows is assigned to each position;
These scores are standardised (converted to Z scores) by assuming a binomial distribution;
Regions with Z scores above the threshold are reported;
This process is applied in two phases: the highly enriched regions are filtered out, and a second phase is applied to recover the regions that are overshadowed by monster peaks in phase one.
FDR: random tags are generated (uniformly distributed), and the ratio of the number of random tags to real tags at a specific Z score is reported as the FDR for that Z score.
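The scoring step described above can be sketched as follows. This is a minimal illustration, not Hotspot's actual code: it assumes the binomial success probability is simply the ratio of the two window widths (250/50000), and the function names `count_in_window`, `binomial_z_score` and `empirical_fdr` are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right


def count_in_window(sorted_tags, centre, width):
    """Count tags within a window of roughly `width` bp centred on `centre`."""
    half = width // 2
    lo = bisect_left(sorted_tags, centre - half)
    hi = bisect_right(sorted_tags, centre + half)
    return hi - lo


def binomial_z_score(sorted_tags, centre, small=250, large=50_000):
    """Z score of the small-window count given the large-window count.

    Assumption: each of the N tags in the large window falls in the small
    window with probability p = small / large (uniform background).
    """
    n = count_in_window(sorted_tags, centre, small)
    big_n = count_in_window(sorted_tags, centre, large)
    p = small / large
    mean = big_n * p
    sd = math.sqrt(big_n * p * (1 - p))
    return (n - mean) / sd if sd > 0 else 0.0


def empirical_fdr(real_scores, random_scores, threshold):
    """Ratio of random to real tags scoring at or above `threshold`."""
    real_hits = sum(z >= threshold for z in real_scores)
    random_hits = sum(z >= threshold for z in random_scores)
    return random_hits / real_hits if real_hits else 0.0


# Toy data: a tight cluster near position 1020 and a few scattered tags.
tags = sorted([1000, 1010, 1020, 1030, 1040, 5000, 9000, 14000, 20000, 24000])
z_cluster = binomial_z_score(tags, 1020)
z_isolated = binomial_z_score(tags, 14000)
```

On this toy input the tag in the cluster scores far higher than the isolated tag, which is the behaviour the Z-score threshold relies on.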

Hotspot Cont.
The core of Hotspot is implemented in C++ and its statistical analysis in R;
It is wrapped in Python and bash scripts;
It is relatively fast;
I found it poorly documented and not easy to work with!

F-Seq
A histogram-based approach (number of tags per bin) is possibly the most naive way of gauging the enrichment of short-read tags;
However, it suffers from some problems, including boundary effects and the choice of bin width;
To overcome these, F-Seq was proposed, in which a kernel density estimator (with a kernel K of mean 0 and variance 1) is applied to obtain the distribution of reads:
$$p(x) = \frac{1}{nb} \sum_{i=1}^{n} K\left(\frac{x - x_i}{b}\right)$$
where n is the number of reads, x_i are their positions and b is the bandwidth;
F-Seq is implemented in Java and is easy to use, though it doesn't support some commonly used file formats.
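As a rough illustration of the estimator above, the sketch below evaluates p(x) directly from the formula. It assumes K is the standard Gaussian kernel (mean 0, variance 1); the names `gaussian_kernel` and `kde` and the toy read positions are hypothetical and not taken from F-Seq.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: mean 0, variance 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, positions, bandwidth):
    """p(x) = 1/(n*b) * sum_i K((x - x_i) / b)."""
    n = len(positions)
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in positions)
    return total / (n * bandwidth)


# Toy read positions: a dense cluster around 1000 and one distant read.
reads = [990, 995, 1000, 1005, 1010, 3000]
density_in_cluster = kde(1000, reads, bandwidth=50)
density_in_gap = kde(2000, reads, bandwidth=50)
```

Unlike a histogram, the estimate is smooth in x and does not depend on where bin boundaries happen to fall; only the bandwidth b has to be chosen.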

MACS
The most used peak caller for ChIP-Seq data;
It has been reviewed and benchmarked in several studies;
At the time of development, the emphasis was on handling shift size and local biases from sequenceability and mappability;
A Poisson model is employed to identify statistically significant enriched regions;
MACS is implemented in Python and is relatively fast. It is user-friendly and fairly well supported.
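The Poisson test mentioned above can be illustrated with a small sketch. It computes the upper-tail probability of seeing at least k tags in a region when the background rate is lam; how the background rate is estimated is left to the caller here, and the function name `poisson_upper_tail` is hypothetical rather than part of MACS.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# With an expected background of 2 tags, 15 observed tags is very unlikely.
p_enriched = poisson_upper_tail(15, 2.0)
p_background = poisson_upper_tail(2, 2.0)
```

A region would be called enriched when this p-value falls below the chosen significance threshold.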

Zero-Inflated Negative Binomial Approach: ZINBA
ZINBA is a generic peak caller, meant to be used for TF ChIP-Seq, histone ChIP-Seq, RNA-Seq and open-chromatin data (both DNase-Seq and FAIRE-Seq);
The short-read tags are summarised into counts over non-overlapping windows (250 bp) of the genome;
Read counts per bin, G/C content, mappability scores and copy number variation are the parameters of its underlying mixture regression model;
Based on this model, each region of the genome is assigned to one of three groups: enriched, background or zero;
ZINBA is implemented in R.
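The first step above, summarising tags into non-overlapping 250 bp windows, can be sketched as below. Only the binning is shown; the mixture regression itself is not reproduced. The function name `window_counts` and the toy positions are hypothetical.

```python
def window_counts(positions, chrom_length, window=250):
    """Count tags per non-overlapping window along one chromosome.

    Positions are 0-based; tags outside [0, chrom_length) are ignored.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if 0 <= pos < chrom_length:
            counts[pos // window] += 1
    return counts


# Toy example: five tags on a 1 kb stretch give four windows.
counts = window_counts([0, 10, 249, 250, 999], chrom_length=1000)
zero_fraction = counts.count(0) / len(counts)
```

On real data a large share of windows have a count of zero, which is the reason for modelling a separate zero group alongside background and enriched.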

Two More Peak Callers
Two more peak callers for DNase-Seq are now available:
PeaKDEck: The idea behind PeaKDEck is a kind of combination of Hotspot (in that it tries to learn the local background) and F-Seq (in that it applies a Gaussian kernel to estimate the probability distribution), though there is surely more to it than that.
DNase2Hotspots: DNase2Hotspots is actually a modification of Hotspot. A key difference is that the two phases of hotspot detection in Hotspot are combined. It has also been claimed to be faster and more efficient.

A Visual Inspection Shows Some Inconsistency

Sensitivity vs Specificity Shows up to 10% Difference

Number of Peaks Detected

Distribution of Peak Lengths

Chromosome-wide Coverage

F_β Score: A Metric to Measure the Performance of a Test
The F_β score is a commonly used measure for gauging the performance of a test;
It is normally consistent with AUC;
The F_β score is defined as:
$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{prec} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{prec} + \mathrm{recall}}$$
Normally β = 1, but it can be changed depending on whether recall or precision is to be emphasised (2 and 0.5 are very common).
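A short sketch of the definition above shows how β shifts the balance between precision and recall. The function name `f_beta` and the example values are illustrative only.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * prec * recall / (beta^2 * prec + recall)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1.0 + b2) * precision * recall / denominator


# A peak set with high precision but low recall.
precision, recall = 0.8, 0.4
f_half = f_beta(precision, recall, beta=0.5)  # favours precision
f_one = f_beta(precision, recall, beta=1.0)   # harmonic mean
f_two = f_beta(precision, recall, beta=2.0)   # favours recall
```

For this precision-heavy example the score is highest at β = 0.5 and lowest at β = 2, since smaller β puts more weight on precision.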

Gold Standard Set
It is generally accepted that open chromatin regions (DHSs in ENCODE data) are the regions of the genome accessible to TFs;
Therefore it makes sense to compare the DNase peaks with TF binding sites (TFBSs);
The problem, though, is that the set of known TFBSs is incomplete;
For each of the three cell lines in our study, ChIP-Seq data were available for more than 50 TFs;
The union of the binding sites of these TFs was used as our Reference Set;
We set β = 0.5 to compensate for the incompleteness of our Reference Set.
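One way to build such a reference set and score a peak list against it is sketched below. The overlap criterion is an assumption made for illustration (a peak counts as correct if it overlaps any reference interval by at least one base; intervals are half-open), and the names `merge_intervals`, `overlaps_any` and `precision_recall` are hypothetical, not the study's actual procedure.

```python
def merge_intervals(intervals):
    """Union of half-open (start, end) intervals as a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps_any(interval, others):
    """True if `interval` shares at least one base with any of `others`."""
    start, end = interval
    return any(start < o_end and o_start < end for o_start, o_end in others)


def precision_recall(peaks, reference):
    """Precision: share of peaks hitting the reference; recall: the converse."""
    if not peaks or not reference:
        return 0.0, 0.0
    precision = sum(overlaps_any(p, reference) for p in peaks) / len(peaks)
    recall = sum(overlaps_any(r, peaks) for r in reference) / len(reference)
    return precision, recall


# Toy binding sites for two TFs, merged into one reference set.
tf_a = [(100, 200), (500, 600)]
tf_b = [(150, 250), (900, 1000)]
reference = merge_intervals(tf_a + tf_b)
peaks = [(120, 180), (550, 700), (2000, 2100), (3000, 3100)]
precision, recall = precision_recall(peaks, reference)
```

Because the reference set is incomplete, some peaks counted as false positives here may mark real open chromatin, so the precision computed this way is only as reliable as the reference.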

Improving the Performance by Adjusting the Parameters

DNase-Seq is gaining popularity as a genome-wide chromatin accessibility analysis method;
Its applications have led to new insights into genome function and variation;
Robust peak detection on these data is therefore instrumental to the research community;
The tools should be publicly available, well documented and user-friendly, so that they can be easily used in any lab.

Acknowledgments
I am grateful to the members of Spivakov's group for their comments. This study was carried out during my transition from the Sanger Institute to the Babraham Institute. I therefore appreciate financial support from both institutes.