Base, Pulse, and Trace File Reference Guide

Introduction

This document describes the contents of the three main files generated by the Pacific Biosciences primary analysis pipeline:

bas.h5 (Base file; includes Circular Consensus Sequencing basecalls created by the Base2Circular Consensus pipeline step)
trc.h5 (Trace file)
pls.h5 (Pulse file)

Trace (.trc.h5)
  File format: HDF5
  Generated by: Movie2Trace
  Approximate size (2 x 45-minute movies): > 75 GB
  Contains: Real-time trace data processed from image frames from selected ZMWs.

Pulse (.pls.h5)
  File format: HDF5
  Generated by: Trace2Pulse
  Approximate size (2 x 45-minute movies): ~ 20 GB
  Contains: Pulse characteristics from trace data: pulse height, width, inter-pulse distance, and so on.

Base (.bas.h5)
  File format: HDF5
  Generated by: Pulse2Base
  Approximate size (2 x 45-minute movies): ~ 2 GB
  Contains: Raw base calls from each defined pulse, along with quality metrics.

You can browse HDF5 files using HDFView, a free utility. See http://www.hdfgroup.org/hdf-java-html/hdfview/. For information about the HDF5 format, see http://www.hdfgroup.org/hdf5/.

API Software

Pacific Biosciences provides Java and R APIs to read the three types of HDF5 files (Base, Trace, and Pulse) produced by the primary analysis pipeline. The APIs allow you to query for details in the HDF5 files for post-processing analysis. You can also work with the Base, Trace, and Pulse files from any programming language that can access HDF5 files.

Note: There are objects in the Base, Pulse, and Trace files that cannot be accessed using the Pacific Biosciences API software.

The latest version of the API software and documentation are available from the PacBio Developer's Network at http://www.pacbiodevnet.com/smrt-analysis/tools.
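Because the files are ordinary HDF5 containers, one way to get oriented is to list every group and dataset path. The following is a minimal sketch, not part of the Pacific Biosciences API. It assumes only that the opened file behaves like a nested mapping (an object with an items() method, which is how Python HDF5 bindings such as h5py expose groups). The walk function and the fake_bas stand-in are hypothetical names used for illustration, and the stand-in holds only a small subset of the layout described in this guide.

```python
def walk(node, path="/", out=None):
    """Collect (path, kind) pairs for every object below a mapping-like node."""
    if out is None:
        out = []
    for name, child in node.items():
        child_path = path.rstrip("/") + "/" + name
        if hasattr(child, "items"):
            # Group-like object: record it, then recurse into its children.
            out.append((child_path, "group"))
            walk(child, child_path, out)
        else:
            # Anything without items() is treated as a dataset.
            out.append((child_path, "dataset"))
    return out


# Hypothetical stand-in for an opened bas.h5 file (nested dicts, tiny subset).
fake_bas = {
    "PulseData": {
        "BaseCalls": {"BaseCall": b"ACGT", "ZMW": {"NumEvent": [4]}},
    },
    "ScanData": {"RunInfo": {}},
}

listing = walk(fake_bas)
for object_path, kind in listing:
    print(kind, object_path)
```

With a real file, the same function could be given the root group returned by an HDF5 library in place of fake_bas, provided that library exposes groups through items().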

Base File (bas.h5)

The bas.h5 file is created by the Pulse2Base primary analysis pipeline step. The file is processed by the SMRT Pipe secondary analysis pipeline to generate mapping, alignment, consensus, and variants information. If you need to archive primary analysis results, we recommend that you keep only the bas.h5 file, as it is the only file needed for reprocessing secondary analysis results.

The contents of bas.h5 are a proper subset of the pls.h5 file. Both files contain the raw base calls generated from the pulse metrics, but bas.h5 lacks the majority of pulse features and keeps only minimal pulse data for kinetic analysis purposes. Because of this, the bas.h5 file can be recreated from the pls.h5 file, but not vice versa.

The following table describes the contents of the bas.h5 file, including Circular Consensus Sequencing basecalls created by the Base2Circular Consensus pipeline step:

Base File HDF5 Object: Description

Root (/): Top-level container.
/PulseData: Container for the BaseCalls and ConsensusBaseCalls groups.
/PulseData/BaseCalls: Contains base metrics produced by the PulseToBase pipeline stage. Each child dataset is a 1-dimensional array of length numbases, where numbases is the total number of basecalls in the file (that is, the sum of ZMW/NumEvent).
  BaseCall: An ASCII representation of the called base for a ZMW.
  PulseIndex: The index into the pulse stream corresponding to the pulse that was called as this base.
  QualityValue: The Phred-style quality values of the bases.
  DeletionQV: A Phred-style quality value indicating the total probability of a deleted base before the current base.
  DeletionTag: The ASCII code of the most likely base to have been deleted before the current base.
  InsertionQv: A Phred-style quality value indicating the probability that the current base is an insertion.
  PreBaseFrames: The number of frames between the start of the base and the end of the previous base.
  SubstitutionQV: A Phred-style quality value indicating the total probability that the current base call is a substitution error.
  SubstitutionTag: The ASCII code of the most likely alternative base call at this position for a ZMW.

  WidthInFrames: The width of the pulse that generated this base, in frames.
  ZMW: ZMW identifiers.
  ZMW/HoleNumber: The hole numbers of the ZMWs.
  ZMW/HoleStatus: Indicates how to decode the HoleStatus field. Only ZMWs with a HoleStatus == 0 can generate a sequence.
  ZMW/HoleXY: The X, Y coordinates of a ZMW.
  ZMW/NumEvent: The number of bases per ZMW.
  ZMWMetrics: Pulse metrics, per ZMW.
  ZMWMetrics/BaseFraction: The fraction of the bases called by channel within a ZMW.
  ZMWMetrics/BaseRate: The average (global) pulse rate (in pulses/second) of called bases within the high quality region for a ZMW.
  ZMWMetrics/BaseWidth: The mean pulse width of called bases within the high quality region in the ZMW, in seconds.
  ZMWMetrics/BaseIpd: The robust mean pulse IPD (interpulse distance) for called bases within a high quality region for a ZMW.
  ZMWMetrics/CmBasQv: The mean Phred-style quality values by base channel over the read for a ZMW.
  ZMWMetrics/Productivity: A classification corresponding to hole productivity for a ZMW. The values are: 0 = Empty, 1 = Sequencing (Good), and 2 = Other (Bad, Multiple occupation).
  ZMWMetrics/ReadScore: A score corresponding to the predicted accuracy of the read within a ZMW.
  ZMWMetrics/RmBasQv: The mean Phred-style quality value over all bases in the read for a ZMW.
  ZMWMetrics/CmDelQv: The mean deletion quality values by base channel over the read for a ZMW.
  ZMWMetrics/CmInsQv: The mean insertion quality values by base channel over the read for a ZMW.
  ZMWMetrics/CmSubQv: The mean substitution quality values by base channel over the read for a ZMW.
  ZMWMetrics/HQRegionSNR: The signal-to-noise ratio (SNR) of the pulses inside the HQRegion, where the HQRegion is a trimmed region predicted to be the high quality subset of the basecalls in the trace.
  ZMWMetrics/LocalBaseRate: An estimate of the local pulse rate for called bases, excluding polymerase, and pauses within a high quality region, for a ZMW.
  ZMWMetrics/RmDelQv: The mean deletion quality value over all bases in the read for a ZMW.

  ZMWMetrics/RmInsQv: The mean insertion quality value over all bases in the read for a ZMW.
  ZMWMetrics/RmSubQv: The mean substitution quality value over all bases in the read for a ZMW.
  ZMWMetrics/HQRegionStartTime: The start time of the HQRegion from the beginning of the movie.
  ZMWMetrics/HQRegionEndTime: The end time of the HQRegion from the beginning of the movie.
  ZMWMetrics/DarkBaseRate: The predicted local base rate when the chip is not illuminated (1/sec). This is a robust estimate of the local polymerization rate in this ZMW when the lasers are off.
/PulseData/Regions: Single molecule consensus region information for a ZMW. Each row in this table applies an annotation to a region of basecalls in one trace. These regions are used by downstream secondary analysis algorithms.
    Column 0: The HoleNumber of the ZMW that the annotation is being applied to.
    Column 1: The RegionType index. This value is an index into the 'RegionType' attribute of the Regions dataset.
    Column 2: The start base of the region.
    Column 3: The end base of the region.
    Column 4: The score applied to the region.
/PulseData/ConsensusBaseCalls: Information on Single Molecule Consensus reads produced by the Single Molecule Consensus pipeline stage.
  BaseCall: An ASCII representation of the base calls for every read.
  QualityValue: The Phred-style quality values of the bases.
  DeletionQV: A Phred-style quality value indicating the total probability of a deleted base before the current base.
  DeletionTag: The likely identity of the deleted base (if one exists) in a ZMW.
  InsertionQv: A Phred-style quality value indicating the probability that the current base is an insertion.
  SubstitutionQV: A Phred-style quality value indicating the total probability that the current base call is a substitution error for bases in a ZMW.
  SubstitutionTag: The ASCII code of the most likely alternative base call at this position for a ZMW.
  Passes: Information from Single Molecule Consensus processing of the raw read.
  Passes/AdapterHitAfter: For each pass, 1 if the pass ended with an adapter hit; 0 if it didn't.
  Passes/AdapterHitBefore: For each pass, 1 if the pass began with an adapter hit; 0 if it didn't.
  Passes/NumPasses: The number of passes detected in a ZMW.

  Passes/PassDirection: The pass direction per hole. 0 for a forward pass, 1 for a reverse pass.
  Passes/PassNumBases: The number of bases in a circular consensus pass.
  Passes/PassStartBase: The index of the first base in a circular consensus pass.
  ZMW: ZMW identifiers.
  ZMW/HoleNumber: The hole numbers of the ZMWs.
  ZMW/HoleStatus: The hole status per ZMW.
/PulseData/ConsensusBaseCalls/ZMW/HoleXY: The X, Y coordinates of a ZMW.
/PulseData/ConsensusBaseCalls/ZMW/NumPasses: The number of SMC passes detected in each ZMW.
/ScanData: Container for instrument and acquisition metadata.
/ScanData/AcqParams: Acquisition-related information, such as number of lasers, laser intensity, number of frames acquired, and so on.
/ScanData/ChipArray: Chip-related information, such as chip identifier, chip layout, number of non-sequencing dark holes, and so on.
/ScanData/ChipArray/ChipMask DataSet: The binary matrix of a chip indicating which ZMWs have been masked.
/ScanData/DyeSet: The dye set name.
/ScanData/DyeSet/Analog[0]
/ScanData/DyeSet/Analog[0]/
/ScanData/DyeSet/Analog[1]
/ScanData/DyeSet/Analog[1]/
/ScanData/DyeSet/Analog[2]
/ScanData/DyeSet/Analog[2]/
/ScanData/DyeSet/Analog[3]
/ScanData/DyeSet/Analog[3]/
/ScanData/Experiment: Experiment-related information.
/ScanData/RunInfo: Run-related information, such as platform name, run ID, instrument ID, and so on.
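The table states that every dataset under /PulseData/BaseCalls is a single flat array whose length is the sum of ZMW/NumEvent. The sketch below shows one way to recover per-ZMW reads from such an array. It assumes that the bases are stored ZMW by ZMW in the same order as the ZMW/NumEvent entries, and that "Phred-style" means the standard Phred relation Q = -10 log10(p); neither assumption is confirmed by this guide. The function names and sample values are hypothetical, and plain Python lists and bytes stand in for arrays that would be read with an HDF5 library.

```python
from itertools import accumulate


def split_by_zmw(flat, num_event):
    """Slice a flat per-base array into one chunk per ZMW using NumEvent counts."""
    if sum(num_event) != len(flat):
        raise ValueError("sum(NumEvent) must equal the array length")
    ends = list(accumulate(num_event))   # running totals give each chunk's end offset
    starts = [0] + ends[:-1]             # each chunk starts where the previous one ended
    return [flat[s:e] for s, e in zip(starts, ends)]


def phred_to_error_prob(q):
    """Convert a Phred quality value to an error probability (standard definition)."""
    return 10 ** (-q / 10)


# Made-up example data: three ZMWs with 3, 0 and 5 base calls.
base_call = b"ACGTTGCA"                  # BaseCall holds ASCII codes
quality = [10, 20, 30, 20, 10, 10, 20, 30]
num_event = [3, 0, 5]

reads = [chunk.decode("ascii") for chunk in split_by_zmw(base_call, num_event)]
quals = split_by_zmw(quality, num_event)

for read, qv in zip(reads, quals):
    print(repr(read), [phred_to_error_prob(q) for q in qv])
```

The same offsets can be reused for every sibling dataset (QualityValue, DeletionQV, PreBaseFrames, and so on), since the table describes them all as having the same length.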

Pulse File (pls.h5)

The Pulse file is created by the Trace2Pulse primary analysis pipeline step. The following table describes the contents of the pls.h5 file:

Pulse File HDF5 Object: Description

Root (/): Top-level container.
/PulseData: Container for the PulseCalls, Basecalls, and ConsensusBasecalls groups.
/PulseData/PulseCalls: Container for pulsecalls data.
  Channel: The classified channel index (0-based), corresponding to the channel order in the input trace file.
  Chi2: The chi-squared values, Chi2/DOF per dye, integrated over pulse signals for a ZMW.
  ClassifierQV: The quality values of a pulse, as determined by the CRF in the trace-to-pulse portion of the primary analysis pipeline, for a ZMW.
  IsPulse: The classification of the event as a valid pulse (true, false, or otherwise) by the trace-to-pulse CRF, for a ZMW.
  MaxSignal: The maximum signal levels over the pulse frames for a ZMW.
  MeanSignal: The mean signal levels over the pulse frames for a ZMW.
  MidSignal: The mean signal levels over the mid-pulse frames, excluding the leading and trailing frame.
  WidthInFrames: The duration of a pulse in frames for a ZMW.
  StartFrame: The acquisition frame when the pulse began for a ZMW.
  ZMW: ZMW identifiers.
  ZMW/BaselineLevel: The mean bias of the baseline signal as estimated by the trace signal processing algorithm for a ZMW.
  ZMW/BaselineSigma: The standard deviation of the baseline signal for a ZMW; that is, the average noise level.
  ZMW/HoleNumber: The hole numbers of the ZMWs.
  ZMW/HoleStatus: A number specifying the status of a ZMW.
  ZMW/HoleXY: The X, Y coordinates of a ZMW.
  ZMW/NumEvent: The number of pulses per ZMW.
  ZMWMetrics: Container for pulse metrics.
  ZMWMetrics/PulseRate: The global pulse rate (in pulses per second) in the ZMW.

  ZMWMetrics/PulseWidth: The mean pulse width (in pulses per second) for a ZMW.
  ZMWMetrics/Snr: The signal-to-noise ratio (SNR) for a ZMW.
/ScanData: Container for instrument and acquisition metadata.
/ScanData/AcqParams: Acquisition-related information, such as number of lasers, laser intensity, number of frames acquired, and so on.
/ScanData/ChipArray: Chip-related information, such as chip identifier, chip layout, number of non-sequencing dark holes, and so on.
/ScanData/ChipArray/ChipMask DataSet: The binary matrix of a chip indicating which ZMWs have been masked.
/ScanData/DyeSet: The dye set name.
/ScanData/DyeSet/Analog[0]
/ScanData/DyeSet/Analog[0]/
/ScanData/DyeSet/Analog[1]
/ScanData/DyeSet/Analog[1]/
/ScanData/DyeSet/Analog[2]
/ScanData/DyeSet/Analog[2]/
/ScanData/DyeSet/Analog[3]
/ScanData/DyeSet/Analog[3]/
/ScanData/Experiment: Experiment-related information.
/ScanData/RunInfo: Run-related information, such as platform name, run ID, instrument ID, and so on.
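The StartFrame and WidthInFrames datasets described above are enough to derive an inter-pulse distance in frames for the pulses of a single ZMW. The following sketch assumes that a pulse ends at StartFrame + WidthInFrames and that the input lists hold the pulses of one ZMW in acquisition order; the guide does not state either convention explicitly. The function name and the sample values are hypothetical.

```python
def inter_pulse_distances(start_frame, width_in_frames):
    """Frames between the end of each pulse and the start of the next one."""
    if len(start_frame) != len(width_in_frames):
        raise ValueError("StartFrame and WidthInFrames must have the same length")
    # Assumed convention: a pulse occupies frames [start, start + width).
    ends = [s + w for s, w in zip(start_frame, width_in_frames)]
    # Pair each pulse end with the start of the following pulse.
    return [nxt - end for end, nxt in zip(ends, start_frame[1:])]


# Made-up pulses for a single ZMW.
start_frame = [100, 112, 140]
width_in_frames = [5, 10, 4]

ipd = inter_pulse_distances(start_frame, width_in_frames)
print(ipd)  # [7, 18]
```

Converting these values to seconds would require the acquisition frame rate, which this sketch leaves out because the guide does not say where it is stored.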

Trace File (trc.h5)

The Trace file is created by the Movie2Trace primary analysis pipeline step. The following table describes the contents of the trc.h5 file:

Trace File HDF5 Object: Description

Root (/): Top-level container.
/TraceData: Container for the trace data.
/TraceData/HoleNumber: The ZMW hole number.
/TraceData/HoleStatus: The ZMW hole status.
/TraceData/HoleXY: The X, Y coordinates for a ZMW location.
/TraceData/ReadVariance: The variance estimates corresponding to the camera read noise contribution for the trace.
/TraceData/Spectra: The spectral distribution for a ZMW.
/TraceData/Traces: 8-bit trace data in a ZMW for all frames.
/TraceData/Variances: Channel variance estimates (post-spatial reduction) that were used to construct the weights for the dye-weighted sum (spectral) reduction, by block increments. The units correspond to decoded trace units.
/TraceData/HolePhase: The time delay, as a fraction of the frame interval, of each ZMW for each camera, where 0 <= HolePhase <= 1.
/TraceData/Codec: Container for encoding method and attributes.
/TraceData/Codec/Decode: Codec look-up table.
/ScanData: Container for instrument and acquisition metadata.
/ScanData/AcqParams: Acquisition-related information, such as number of lasers, laser intensity, number of frames acquired, and so on.
/ScanData/ChipArray: Chip-related information, such as chip identifier, chip layout, number of non-sequencing dark holes, and so on.
/ScanData/ChipArray/ChipMask DataSet: The binary matrix of a chip indicating which ZMWs have been masked.
/ScanData/DyeSet: The dye set name.
/ScanData/DyeSet/Analog[0]
/ScanData/DyeSet/Analog[0]/
/ScanData/DyeSet/Analog[1]
/ScanData/DyeSet/Analog[1]/
/ScanData/DyeSet/Analog[2]
/ScanData/DyeSet/Analog[2]/

/ScanData/DyeSet/Analog[3]
/ScanData/DyeSet/Analog[3]/
/ScanData/Experiment: Experiment-related information.
/ScanData/RunInfo: Run-related information, such as platform name, run ID, instrument ID, and so on.

For Research Use Only. Not for use in diagnostic procedures. Copyright 2011, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html.

Pacific Biosciences, the Pacific Biosciences logo, SMRT and SMRTbell are trademarks of Pacific Biosciences in the United States and/or certain other countries. All other trademarks are the sole property of their respective owners.

P/N 001-564-663-01