Base, Pulse, and Trace File Reference Guide

Introduction

This document describes the contents of the three main files generated by the Pacific Biosciences primary analysis pipeline:

bas.h5 (Base file; includes Circular Consensus Sequencing basecalls created by the Base2Circular Consensus pipeline step)
trc.h5 (Trace file)
pls.h5 (Pulse file)

Trace (.trc.h5)
  File Format: HDF5
  Generated by: Movie2Trace
  Approximate Size (2 x 45 Minute Movies): > 75 GB
  Contains: Real-time trace data processed from image frames from selected ZMWs.

Pulse (.pls.h5)
  File Format: HDF5
  Generated by: Trace2Pulse
  Approximate Size (2 x 45 Minute Movies): ~ 20 GB
  Contains: Pulse characteristics from trace data: pulse height, width, inter-pulse distance, and so on.

Base (.bas.h5)
  File Format: HDF5
  Generated by: Pulse2Base
  Approximate Size (2 x 45 Minute Movies): ~ 2 GB
  Contains: Raw base calls from each defined pulse, along with quality metrics.

You can easily browse HDF5 files using HDFView, a free utility. See http://www.hdfgroup.org/hdf-java-html/hdfview/. For information about the HDF5 format, see http://www.hdfgroup.org/hdf5/.

API Software

Pacific Biosciences provides Java and R APIs to read the three types of HDF5 files (Base, Trace, and Pulse) produced by the primary analysis pipeline. The APIs allow you to query the HDF5 files for details needed in post-processing analysis. You can also work with the Base, Trace, and Pulse files from any programming language that can access HDF5 files.

Note: There are objects in the Base, Pulse, and Trace files that cannot be accessed using the Pacific Biosciences API software.

The latest version of the API software and documentation are available from the PacBio Developer's Network at http://www.pacbiodevnet.com/smrt-analysis/tools.
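As an illustration of accessing the files from a general-purpose language, the following Python sketch lists every object path below a group. It is not part of the Pacific Biosciences API software. The function name list_objects is hypothetical, and the nested dictionaries only stand in for an opened file. The sketch assumes that groups expose an items() method and that datasets do not, which is also how the third-party h5py package behaves, so an opened h5py file could be passed in the same way.

```python
def list_objects(group, prefix=""):
    """Return the full path of every object below `group`, depth first."""
    paths = []
    for name, obj in group.items():
        path = f"{prefix}/{name}"
        paths.append(path)
        # Assumption: groups expose .items(); datasets (leaves) do not.
        if hasattr(obj, "items"):
            paths.extend(list_objects(obj, path))
    return paths


# Stand-in for an opened file: nested dicts play the role of HDF5 groups,
# and bytes/lists play the role of datasets. The values are made up.
example = {
    "PulseData": {
        "BaseCalls": {
            "BaseCall": b"ACGT",
            "QualityValue": [10, 12, 9, 14],
        }
    },
    "ScanData": {"RunInfo": {}},
}

paths = list_objects(example)
for p in paths:
    print(p)

# With h5py installed, and a hypothetical file name, the same walk would be:
#   import h5py
#   with h5py.File("movie.bas.h5", "r") as f:
#       print(list_objects(f))
```

The walk only reports names. Attributes attached to groups and datasets are not visited.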
Base File (bas.h5)

The bas.h5 file is created by the Pulse2Base primary analysis pipeline step. The file is processed by the SMRT Pipe secondary analysis pipeline to generate mapping, alignment, consensus, and variant information. If you need to archive primary analysis results, we recommend that you keep only the bas.h5 file, as it is the only file needed to reprocess secondary analysis results.

bas.h5 is a proper subset of the pls.h5 file: it lacks the majority of the pulse features. Both bas.h5 and pls.h5 contain raw base calls generated from the pulse metrics, but bas.h5 contains only the minimal pulse data needed for kinetic analysis. Because of this, the bas.h5 file can be recreated from the pls.h5 file, but not vice versa.

The following table describes the contents of the bas.h5 file, including Circular Consensus Sequencing basecalls created by the Base2Circular Consensus pipeline step:

Base File HDF5 Object: Description
Root (/): Top-level container.
/PulseData: Container for the BaseCalls and ConsensusBaseCalls groups.
/PulseData/BaseCalls: Contains base metrics produced by the Pulse2Base pipeline stage. Each child dataset is a 1-dimensional array of length numbases, where numbases is the total number of basecalls in the file (that is, the sum of ZMW/NumEvent).
  BaseCall: An ASCII representation of the called base for a ZMW.
  PulseIndex: The index into the pulse stream corresponding to the pulse that was called as this base.
  QualityValue: The Phred-style quality values of the bases.
  DeletionQV: A Phred-style quality value indicating the total probability of a deleted base before the current base.
  DeletionTag: The ASCII code of the most likely base to have been deleted before the current base.
  InsertionQv: A Phred-style quality value indicating the probability that the current base is an insertion.
  PreBaseFrames: The number of frames between the start of the base and the end of the previous base.
  SubstitutionQV: A Phred-style quality value indicating the total probability that the current base call is a substitution error.
  SubstitutionTag: The ASCII code of the most likely alternative base call at this position for a ZMW.
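The quality datasets above are described as Phred-style. The sketch below shows the standard Phred relationship between a quality value and an error probability, QV = -10 * log10(p). This guide does not state whether the pipeline rounds or caps the values, so the code assumes the unmodified Phred definition. The function names and sample values are illustrative only.

```python
import math


def qv_to_error_prob(qv):
    """Convert a Phred-style quality value to an error probability."""
    return 10 ** (-qv / 10)


def error_prob_to_qv(p):
    """Convert an error probability (0 < p <= 1) to a Phred-style value."""
    return -10 * math.log10(p)


def expected_errors(qvs):
    """Sum of per-base error probabilities for a run of quality values."""
    return sum(qv_to_error_prob(q) for q in qvs)


# Made-up quality values for a four-base read.
sample_qvs = [10, 20, 30, 20]
print([qv_to_error_prob(q) for q in sample_qvs])
print(expected_errors(sample_qvs))
```

Under this definition a value of 10 corresponds to a 1-in-10 chance of error, 20 to 1-in-100, and so on. The same conversion would apply to DeletionQV, InsertionQv, and SubstitutionQV if they follow the standard scale.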
Base File HDF5 Object (continued): Description
  WidthInFrames: The width of the pulse that generated this base, in frames.
  ZMW: ZMW identifiers.
  ZMW/HoleNumber: The hole numbers of the ZMWs.
  ZMW/HoleStatus: Indicates how to decode the HoleStatus field. Only ZMWs with a HoleStatus == 0 can generate a sequence.
  ZMW/HoleXY: The X, Y coordinates of a ZMW.
  ZMW/NumEvent: The number of bases per ZMW.
  ZMWMetrics: Pulse metrics, per ZMW.
  ZMWMetrics/BaseFraction: The fraction of the bases called, by channel, within a ZMW.
  ZMWMetrics/BaseRate: The average (global) pulse rate (in pulses/second) of called bases within the high quality region for a ZMW.
  ZMWMetrics/BaseWidth: The mean pulse width of called bases within the high quality region in the ZMW, in seconds.
  ZMWMetrics/BaseIpd: The robust mean pulse IPD (interpulse distance) for called bases within a high quality region for a ZMW.
  ZMWMetrics/CmBasQv: The mean Phred-style quality values by base channel over the read for a ZMW.
  ZMWMetrics/Productivity: A classification corresponding to hole productivity for a ZMW. The values are: 0 = Empty, 1 = Sequencing (Good), and 2 = Other (Bad, Multiple occupation).
  ZMWMetrics/ReadScore: A score corresponding to the predicted accuracy of the read within a ZMW.
  ZMWMetrics/RmBasQv: The mean Phred-style quality value over all bases in the read for a ZMW.
  ZMWMetrics/CmDelQv: The mean deletion quality values by base channel over the read for a ZMW.
  ZMWMetrics/CmInsQv: The mean insertion quality values by base channel over the read for a ZMW.
  ZMWMetrics/CmSubQv: The mean substitution quality values by base channel over the read for a ZMW.
  ZMWMetrics/HQRegionSNR: The signal-to-noise ratio (SNR) of the pulses inside the HQRegion, where the HQRegion is a trimmed region predicted to be the high quality subset of the basecalls in the trace.
  ZMWMetrics/LocalBaseRate: An estimate of the local pulse rate for called bases, excluding polymerase, and pauses within a high quality region, for a ZMW.
  ZMWMetrics/RmDelQv: The mean deletion quality value over all bases in the read for a ZMW.
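Because the BaseCalls datasets are flat arrays whose total length is the sum of ZMW/NumEvent, a per-ZMW read can be recovered by turning NumEvent into running offsets. The following sketch assumes that the bases are stored in the same order as the entries of the ZMW datasets, which this guide implies but does not state outright. The function name split_reads and all sample values are hypothetical.

```python
from itertools import accumulate


def split_reads(base_call, num_event, hole_number, hole_status):
    """Map HoleNumber -> read string for ZMWs whose HoleStatus is 0."""
    ends = list(accumulate(num_event))  # running end offset of each ZMW
    starts = [0] + ends[:-1]            # each ZMW starts where the last ended
    reads = {}
    for hole, status, start, end in zip(hole_number, hole_status, starts, ends):
        if status == 0:  # only HoleStatus == 0 can generate a sequence
            reads[hole] = bytes(base_call[start:end]).decode("ascii")
    return reads


# Made-up stand-ins for BaseCall, ZMW/NumEvent, ZMW/HoleNumber, ZMW/HoleStatus.
base_call = b"ACGTTGCAAG"
num_event = [4, 0, 6]
hole_number = [10, 11, 12]
hole_status = [0, 1, 0]

reads = split_reads(base_call, num_event, hole_number, hole_status)
print(reads)  # {10: 'ACGT', 12: 'TGCAAG'}
```

The same offsets can be used to slice QualityValue, PreBaseFrames, WidthInFrames, and the other per-base datasets, since they share the length numbases.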
Base File HDF5 Object (continued): Description
  ZMWMetrics/RmInsQv: The mean insertion quality value over all bases in the read for a ZMW.
  ZMWMetrics/RmSubQv: The mean substitution quality value over all bases in the read for a ZMW.
  ZMWMetrics/HQRegionStartTime: The start time of the HQRegion, measured from the beginning of the movie.
  ZMWMetrics/HQRegionEndTime: The end time of the HQRegion, measured from the beginning of the movie.
  ZMWMetrics/DarkBaseRate: The predicted local base rate when the chip is not illuminated (1/sec). This is a robust estimate of the local polymerization rate in this ZMW when the lasers are off.
/PulseData/Regions: Single Molecule Consensus region information for a ZMW. Each row in this table applies an annotation to a region of basecalls in one trace. These regions are used by downstream secondary analysis algorithms.
    Column 0: The HoleNumber of the ZMW that the annotation is being applied to.
    Column 1: The RegionType index. This value is an index into the 'RegionType' attribute of the Regions dataset.
    Column 2: The start base of the region.
    Column 3: The end base of the region.
    Column 4: The score applied to the region.
/PulseData/ConsensusBaseCalls: Information on Single Molecule Consensus reads produced by the Single Molecule Consensus pipeline stage.
  BaseCall: An ASCII representation of the base calls for every read.
  QualityValue: The Phred-style quality values of the bases.
  DeletionQV: A Phred-style quality value indicating the total probability of a deleted base before the current base.
  DeletionTag: The likely identity of the deleted base (if one exists) in a ZMW.
  InsertionQv: A Phred-style quality value indicating the probability that the current base is an insertion.
  SubstitutionQV: A Phred-style quality value indicating the total probability that the current base call is a substitution error, for bases in a ZMW.
  SubstitutionTag: The ASCII code of the most likely alternative base call at this position for a ZMW.
  Passes: Information from Single Molecule Consensus processing of the raw read.
  Passes/AdapterHitAfter: For each pass, 1 if the pass ended with an adapter hit; 0 if it didn't.
  Passes/AdapterHitBefore: For each pass, 1 if the pass began with an adapter hit; 0 if it didn't.
  Passes/NumPasses: The number of passes detected in a ZMW.
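The five-column layout of /PulseData/Regions can be decoded row by row, as in the sketch below. The region type names in region_types are hypothetical placeholders for the contents of the 'RegionType' attribute, which this guide does not list. The rows are made-up values, and the guide does not say whether the end base is inclusive, so the code does not compute region lengths.

```python
from collections import namedtuple

# Field order follows columns 0-4 of the Regions table.
Region = namedtuple("Region", "hole_number region_type start end score")


def decode_regions(rows, region_types):
    """Turn raw Regions rows into named records with the type index resolved."""
    return [
        Region(row[0], region_types[row[1]], row[2], row[3], row[4])
        for row in rows
    ]


# Hypothetical contents of the 'RegionType' attribute.
region_types = ["Adapter", "Insert", "HQRegion"]

# Made-up rows: HoleNumber, RegionType index, start base, end base, score.
rows = [
    [7, 1, 0, 250, -1],
    [7, 0, 250, 295, 800],
    [7, 2, 0, 540, 850],
]

regions = decode_regions(rows, region_types)
hq_regions = [r for r in regions if r.region_type == "HQRegion"]
print(hq_regions)
```

Resolving the index through the attribute, rather than hard-coding the numbers, keeps the code correct if the order of the region types differs between files.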
Base File HDF5 Object (continued): Description
  Passes/PassDirection: The pass direction per hole. 0 for a forward pass, 1 for a reverse pass.
  Passes/PassNumBases: The number of bases in a circular consensus pass.
  Passes/PassStartBase: The index of the first base in a circular consensus pass.
  ZMW: ZMW identifiers.
  ZMW/HoleNumber: The hole numbers of the ZMWs.
  ZMW/HoleStatus: The hole status per ZMW.
/PulseData/ConsensusBaseCalls/ZMW/HoleXY: The X, Y coordinates of a ZMW.
/PulseData/ConsensusBaseCalls/ZMW/NumPasses: The number of SMC passes detected in each ZMW.
/ScanData: Container for instrument and acquisition metadata.
/ScanData/AcqParams: Acquisition-related information, such as number of lasers, laser intensity, number of frames acquired, and so on.
/ScanData/ChipArray: Chip-related information, such as chip identifier, chip layout, number of non-sequencing dark holes, and so on.
/ScanData/ChipArray/ChipMask DataSet: The binary matrix of a chip indicating which ZMWs have been masked.
/ScanData/DyeSet: The dye set name.
/ScanData/DyeSet/Analog[0]
/ScanData/DyeSet/Analog[0]/
/ScanData/DyeSet/Analog[1]
/ScanData/DyeSet/Analog[1]/
/ScanData/DyeSet/Analog[2]
/ScanData/DyeSet/Analog[2]/
/ScanData/DyeSet/Analog[3]
/ScanData/DyeSet/Analog[3]/
/ScanData/Experiment: Experiment-related information.
/ScanData/RunInfo: Run-related information, such as platform name, run ID, instrument ID, and so on.
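The Passes datasets describe each pass by a start base, a base count, and a direction. The sketch below cuts one ZMW's raw read into its passes. It rests on two assumptions that this guide does not confirm: that PassStartBase is an index into that ZMW's own read, and that a reverse pass should be reverse-complemented to compare it with a forward pass. The function name and the toy read are hypothetical.

```python
# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract_passes(read, pass_start_base, pass_num_bases, pass_direction):
    """Return one sequence per pass, all oriented like a forward pass."""
    passes = []
    for start, length, direction in zip(
        pass_start_base, pass_num_bases, pass_direction
    ):
        segment = read[start:start + length]
        if direction == 1:  # 1 = reverse pass, 0 = forward pass
            segment = segment.translate(COMPLEMENT)[::-1]
        passes.append(segment)
    return passes


# Toy read: a forward pass, a 4-base stand-in adapter, then a reverse pass.
read = "ACGTACGG" + "TTTT" + "CCGTACGT"
pass_start_base = [0, 12]
pass_num_bases = [8, 8]
pass_direction = [0, 1]

passes = extract_passes(read, pass_start_base, pass_num_bases, pass_direction)
print(passes)  # ['ACGTACGG', 'ACGTACGG']
```

In the toy data both passes resolve to the same sequence, which is the situation a consensus step would exploit.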
Pulse File (pls.h5)

The Pulse file is created by the Trace2Pulse primary analysis pipeline step. The following table describes the contents of the pls.h5 file:

Pulse File HDF5 Object: Description
Root (/): Top-level container.
/PulseData: Container for the PulseCalls, BaseCalls, and ConsensusBaseCalls groups.
/PulseData/PulseCalls: Container for pulse call data.
  Channel: The classified channel index (0-based), corresponding to the channel order in the input trace file.
  Chi2: The chi-squared values, Chi2/DOF per dye, integrated over pulse signals for a ZMW.
  ClassifierQV: The quality values of a pulse, as determined by the CRF in the trace-to-pulse portion of the primary analysis pipeline, for a ZMW.
  IsPulse: The classification of a pulse as valid (true, false, or otherwise), as determined by the CRF in the trace-to-pulse portion of the pipeline, for a ZMW.
  MaxSignal: The maximum signal levels over the pulse frames for a ZMW.
  MeanSignal: The mean signal levels over the pulse frames for a ZMW.
  MidSignal: The mean signal levels over the mid-pulse frames, excluding the leading and trailing frame.
  WidthInFrames: The duration of a pulse, in frames, for a ZMW.
  StartFrame: The acquisition frame when the pulse began for a ZMW.
  ZMW: ZMW identifiers.
  ZMW/BaselineLevel: The mean bias of the baseline signal, as estimated by the trace signal processing algorithm, for a ZMW.
  ZMW/BaselineSigma: The standard deviation of the baseline signal for a ZMW; that is, the average noise level.
  ZMW/HoleNumber: The hole numbers of the ZMWs.
  ZMW/HoleStatus: A number specifying the status of a ZMW.
  ZMW/HoleXY: The X, Y coordinates of a ZMW.
  ZMW/NumEvent: The number of pulses per ZMW.
  ZMWMetrics: Container for pulse metrics.
  ZMWMetrics/PulseRate: The global pulse rate (in pulses per second) in the ZMW.
Pulse File HDF5 Object (continued): Description
  ZMWMetrics/PulseWidth: The mean pulse width (in seconds) for a ZMW.
  ZMWMetrics/Snr: The signal-to-noise ratio (SNR) for a ZMW.
/ScanData: Container for instrument and acquisition metadata.
/ScanData/AcqParams: Acquisition-related information, such as number of lasers, laser intensity, number of frames acquired, and so on.
/ScanData/ChipArray: Chip-related information, such as chip identifier, chip layout, number of non-sequencing dark holes, and so on.
/ScanData/ChipArray/ChipMask DataSet: The binary matrix of a chip indicating which ZMWs have been masked.
/ScanData/DyeSet: The dye set name.
/ScanData/DyeSet/Analog[0]
/ScanData/DyeSet/Analog[0]/
/ScanData/DyeSet/Analog[1]
/ScanData/DyeSet/Analog[1]/
/ScanData/DyeSet/Analog[2]
/ScanData/DyeSet/Analog[2]/
/ScanData/DyeSet/Analog[3]
/ScanData/DyeSet/Analog[3]/
/ScanData/Experiment: Experiment-related information.
/ScanData/RunInfo: Run-related information, such as platform name, run ID, instrument ID, and so on.
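The StartFrame and WidthInFrames datasets are enough to derive the gap between consecutive pulses, measured from the end of one pulse to the start of the next. The sketch below does this for the pulses of a single ZMW and converts frames to seconds. The frame rate of 75 frames per second is a made-up value for illustration, the function names are hypothetical, and the input lists are assumed to have already been sliced down to one ZMW using ZMW/NumEvent.

```python
def inter_pulse_frames(start_frame, width_in_frames):
    """Frames between the end of each pulse and the start of the next one."""
    gaps = []
    for i in range(1, len(start_frame)):
        previous_end = start_frame[i - 1] + width_in_frames[i - 1]
        gaps.append(start_frame[i] - previous_end)
    return gaps


def frames_to_seconds(frames, frame_rate_hz):
    """Convert a frame count to seconds for a given acquisition frame rate."""
    return frames / frame_rate_hz


# Made-up pulses for one ZMW.
start_frame = [100, 120, 150]
width_in_frames = [10, 5, 8]
frame_rate_hz = 75.0  # hypothetical; not taken from this guide

gaps = inter_pulse_frames(start_frame, width_in_frames)
gap_seconds = [frames_to_seconds(g, frame_rate_hz) for g in gaps]
print(gaps)  # [10, 25]
print(gap_seconds)
```

The first pulse has no predecessor, so the result has one entry fewer than the input.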
Trace File (trc.h5)

The Trace file is created by the Movie2Trace primary analysis pipeline step. The following table describes the contents of the trc.h5 file:

Trace File HDF5 Object: Description
Root (/): Top-level container.
/TraceData: Container for the trace data.
/TraceData/HoleNumber: The ZMW hole number.
/TraceData/HoleStatus: The ZMW hole status.
/TraceData/HoleXY: The X, Y coordinates for a ZMW location.
/TraceData/ReadVariance: The variance estimates corresponding to the camera read noise contribution for the trace.
/TraceData/Spectra: The spectral distribution for a ZMW.
/TraceData/Traces: 8-bit trace data in a ZMW for all frames.
/TraceData/Variances: Channel variance estimates (post-spatial reduction) that were used to construct the weights for the dye-weighted sum (spectral) reduction, by block increments. The units correspond to decoded trace units.
/TraceData/HolePhase: The time delay, as a fraction of the frame interval, of each ZMW for each camera, where 0 <= HolePhase <= 1.
/TraceData/Codec: Container for encoding method and attributes.
/TraceData/Codec/Decode: Codec look-up table.
/ScanData: Container for instrument and acquisition metadata.
/ScanData/AcqParams: Acquisition-related information, such as number of lasers, laser intensity, number of frames acquired, and so on.
/ScanData/ChipArray: Chip-related information, such as chip identifier, chip layout, number of non-sequencing dark holes, and so on.
/ScanData/ChipArray/ChipMask DataSet: The binary matrix of a chip indicating which ZMWs have been masked.
/ScanData/DyeSet: The dye set name.
/ScanData/DyeSet/Analog[0]
/ScanData/DyeSet/Analog[0]/
/ScanData/DyeSet/Analog[1]
/ScanData/DyeSet/Analog[1]/
/ScanData/DyeSet/Analog[2]
/ScanData/DyeSet/Analog[2]/
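Since the traces are stored as 8-bit values and /TraceData/Codec/Decode is described as a look-up table, decoding can be pictured as a simple table look-up, as sketched below. This is only an interpretation of those two descriptions. The table contents, the function name decode_trace, and the sample codes are all hypothetical, and the real codec may involve attributes that this guide does not describe.

```python
def decode_trace(codes, lut):
    """Map each 8-bit trace code to its decoded value through the table."""
    return [lut[code] for code in codes]


# Hypothetical 256-entry table; a real one would be read from Codec/Decode.
lut = [(i * i) / 16 for i in range(256)]

# Made-up 8-bit codes standing in for a few frames of one trace.
codes = bytes([0, 4, 8, 255])

decoded = decode_trace(codes, lut)
print(decoded)  # [0.0, 1.0, 4.0, 4064.0625]
```

A table with 256 entries covers every possible 8-bit code, so the look-up cannot go out of range for byte input.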
Trace File HDF5 Object (continued): Description
/ScanData/DyeSet/Analog[3]
/ScanData/DyeSet/Analog[3]/
/ScanData/Experiment: Experiment-related information.
/ScanData/RunInfo: Run-related information, such as platform name, run ID, instrument ID, and so on.

For Research Use Only. Not for use in diagnostic procedures. Copyright 2011, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html.

Pacific Biosciences, the Pacific Biosciences logo, SMRT and SMRTbell are trademarks of Pacific Biosciences in the United States and/or certain other countries. All other trademarks are the sole property of their respective owners.

P/N 001-564-663-01