Normalization Methods for Two-Color Microarray Data


1/13/2009
Copyright 2009 Dan Nettleton

What is Normalization?

Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.

Typically, normalization attempts to remove global effects, i.e., effects that can be seen by examining plots that show all the data for a slide or slides.

Normalization does not necessarily have anything to do with the normal distribution that plays a prominent role in statistics.

Sources of Non-Biological Variation

- Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation
- Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (here "channel" refers to a particular slide/dye combination)
- Variation across replicate slides
- Variation across hybridization conditions
- Variation in scanning conditions
- Variation among technicians doing the lab work
- etc.

Side-by-side boxplots show examples of variation across channels.

[Figure: side-by-side boxplots of the Cy3 and Cy5 channels on Slide 1 and Slide 2.]

From top to bottom, each boxplot marks the maximum, Q3 (the 75th percentile), the median, Q1 (the 25th percentile), and the minimum. The interquartile range (IQR) is Q3-Q1. Points more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1 are displayed individually.
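The 1.5*IQR rule can be sketched in R as follows. This is only an illustration: the vector log.signal is simulated data invented here (it is not from the slides), and the quartiles come from quantile(), whose default method may differ slightly from the hinges that boxplot() itself uses.

```r
# Simulated log signals for one channel (hypothetical data, not from the slides),
# with two artificially large values appended.
set.seed(1)
log.signal <- c(rnorm(200, mean = 7, sd = 1), 12.5, 13)

# Quartiles and interquartile range.
q1 <- unname(quantile(log.signal, 0.25))
q3 <- unname(quantile(log.signal, 0.75))
iqr <- q3 - q1

# Points beyond these fences would be displayed individually.
upper.fence <- q3 + 1.5 * iqr
lower.fence <- q1 - 1.5 * iqr
flagged <- log.signal[log.signal > upper.fence | log.signal < lower.fence]
```

The two appended values lie far above the upper fence, so they are among the flagged points.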

The side-by-side boxplots were produced in R using the following commands.

    boxplot(as.data.frame(log(x)), xlab="Channel", ylab="Log Mean Signal", axes=FALSE)
    axis(1, labels=1:ncol(x), at=1:ncol(x))
    axis(2)
    box()

x is a matrix with one column for each channel. Element i,j of the matrix is the signal mean for the i-th gene on the j-th channel. If the matrix x has other columns that you don't want to deal with, you may pick out the columns that you want or delete those you don't want. For example, x[,c(1,2,3,6)] (only work with columns 1, 2, 3 and 6) or x[,-1] (all columns except the first column).

One of the simplest normalization strategies is to align the log signals so that all channels have the same median. The value of the common median is not important for subsequent analyses. A convenient choice is zero, so that positive or negative values reflect signals above or below the median for a particular channel. If negative normalized signal values seem confusing, any positive constant may be added to all values after normalization to zero medians.

Normalization to a median of 0 can be accomplished with the following R commands.

    channel.medians=apply(log(x),2,median)
    normalized.log.x=sweep(log(x),2,channel.medians)

[Figure: boxplots of Log Mean Signal Centered at 0, by channel.]

Note that medians match but variation seems to differ greatly across channels.

Yang et al. (2002, Nucleic Acids Research, 30, 4 e15) recommend scale normalization.*

Consider a matrix X with rows i=1,...,I and columns j=1,...,J. Let $x_{ij}$ denote the entry in row i and column j.
We will apply scale normalization to the matrix of log signal mean values that have already been median centered (each row corresponds to a gene and each column corresponds to a channel).

For each column j, let $m_j = \mathrm{median}(x_{1j}, x_{2j}, \ldots, x_{Ij})$.

For each column j, let $\mathrm{MAD}_j = \mathrm{median}(|x_{1j} - m_j|, |x_{2j} - m_j|, \ldots, |x_{Ij} - m_j|)$.

To scale normalize the columns of X to a constant value C, multiply all the entries in the j-th column by $C/\mathrm{MAD}_j$ for all j=1,...,J.

A common choice for C is the geometric mean of $\mathrm{MAD}_1, \ldots, \mathrm{MAD}_J$:

$$C = \left( \prod_{j=1}^{J} \mathrm{MAD}_j \right)^{1/J}$$

The choice of C will not affect subsequent tests or p-values but will affect fold change calculations.

*Yang et al. recommended scale normalization for log R/G values.

Scale normalization can be accomplished with the following R commands.

    medians=apply(X,2,median)
    Y=sweep(X,2,medians)
    mad=apply(abs(Y),2,median)
    const=prod(mad)^(1/length(mad))
    scale.normalized.X=sweep(X,2,const/mad,"*")

X is a matrix of logged (and usually median-centered) signal mean values. Element i,j of the matrix corresponds to the i-th gene on the j-th channel.

[Figure: boxplots of Log Mean Signal (centered and scaled), titled "Data after Median Centering and Scale Normalizing".]

A Simple Example

Rows are genes 1 to 5; columns are channels 1 to 4.

             ch1   ch2   ch3   ch4
    1          8    15     9    13
    2          7     2     7    15
    3          3     6     5     8
    4          1     5     2     9
    5          9    13     6    11

Determine Channel Medians

    medians    7     6     6    11

Subtract Channel Medians

             ch1   ch2   ch3   ch4
    1          1     9     3     2
    2          0    -4     1     4
    3         -4     0    -1    -3
    4         -6    -1    -4    -2
    5          2     7     0     0

This is the data after median centering.

Find Median Absolute Deviations

    MAD        2     4     1     2
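As a check on the arithmetic, the example can be traced in R. This is a sketch under stated assumptions: the matrix X is the 5-by-4 example data, with rows 1 and 5 filled in from the centered values and channel medians given in the example, and the code carries on through the scaling step defined earlier (multiply column j by C/MAD_j). Unlike the commands in the slides, which assume X is already median centered, the sketch applies the scaling factors to the centered matrix Y.

```r
# Example data: 5 genes (rows) by 4 channels (columns).
X <- matrix(c(8, 15, 9, 13,
              7,  2, 7, 15,
              3,  6, 5,  8,
              1,  5, 2,  9,
              9, 13, 6, 11),
            nrow = 5, byrow = TRUE)

medians <- apply(X, 2, median)         # channel medians
Y <- sweep(X, 2, medians)              # median-centered data
mad <- apply(abs(Y), 2, median)        # median absolute deviations
const <- prod(mad)^(1 / length(mad))   # geometric mean of the MADs

# Multiply column j of the centered data by const / mad[j].
scale.normalized.Y <- sweep(Y, 2, const / mad, "*")
```

After scaling, every column of scale.normalized.Y has median 0 and median absolute deviation equal to const.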

Find Scaling Constant

    MAD        2     4     1     2

    C = (2*4*1*2)^(1/4) = 2

Find Scaling Factors

    C/MAD     2/2   2/4   2/1   2/2

Scale Normalize the Median Centered Data

             ch1   ch2   ch3   ch4
    1          1   4.5     6     2
    2          0  -2.0     2     4
    3         -4   0.0    -2    -3
    4         -6  -0.5    -8    -2
    5          2   3.5     0     0

This is the data after median centering and scale normalizing.

Evidence of intensity-dependent dye bias

[Figure: Slide 1 log signal means after median centering and scaling all channels; Log Red plotted against Log Green.]

[Figure: M vs. A plot of the logged, centered, and scaled Slide 1 data, where M = Log Red - Log Green and A = (Log Green + Log Red) / 2.]

To handle intensity-dependent dye bias, Yang et al. (2002, Nucleic Acids Research, 30, 4 e15) recommend lowess normalization prior to median centering and scale normalizing.

lowess stands for LOcally WEighted Scatterplot Smoothing; it is based on locally weighted polynomial regression. The original reference for lowess is

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. JASA 74 829-836.

[Figure: Slide 1 log signal means; Log Red plotted against Log Green.]

[Figure: M vs. A plot for Slide 1 log signal means, where M = Log Red - Log Green and A = (Log Green + Log Red) / 2.]

[Figure: M vs. A plot for Slide 1 log signal means with lowess fit (f=0.40).]

[Figure: "Adjust M Values"; the same M vs. A plot, with M values adjusted by the lowess fit.]

[Figure: M vs. A plot after adjustment, where M = Adjusted Log Red - Adjusted Log Green and A = (Adjusted Log Green + Adjusted Log Red) / 2.]

[Figure: Slide 1 log signal means after adjustment; Adjusted Log Red plotted against Adjusted Log Green.]

    adjusted log red = log red - adj/2
    adjusted log green = log green + adj/2

where adj = lowess fitted value.
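The adjustment of log red and log green by half the lowess fitted value can be sketched in R with the built-in lowess() function. The data below are simulated for illustration only (the bias curve and noise level are assumptions, not values from the slides), and f = 0.4 matches the fit shown in the plots.

```r
# Simulated log intensities for one slide (hypothetical data, not from the slides).
set.seed(2)
n <- 500
log.green <- runif(n, 5, 12)
# Build in an intensity-dependent dye bias that shrinks as intensity grows.
log.red <- log.green + 2 * exp(-0.3 * (log.green - 5)) + rnorm(n, sd = 0.3)

M <- log.red - log.green
A <- (log.green + log.red) / 2

# lowess() returns fitted values ordered by increasing A,
# so put them back in the original spot order.
fit <- lowess(A, M, f = 0.4)
adj <- numeric(n)
adj[order(A)] <- fit$y

adj.log.red <- log.red - adj / 2
adj.log.green <- log.green + adj / 2

# M moves by the fitted value; A is left unchanged.
M.adj <- adj.log.red - adj.log.green
A.adj <- (adj.log.green + adj.log.red) / 2
```

Splitting adj evenly between the two dyes is what keeps A fixed: the adjusted M equals M - adj, while the average of the two adjusted log intensities is the same as before.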

For spots with A=7, the lowess fitted value is 0.883. Thus the value of adj discussed on the previous slide is 0.883 for spots with A=7. The M value for such spots would be moved down by 0.883. The log red value would be decreased by 0.883/2 and the log green value would be increased by 0.883/2 to obtain adjusted log red and adjusted log green values, respectively.

[Figure: M vs. A plot for Slide 1 log signal means with lowess fit (f=0.40); the fitted value 0.883 is marked at A=7.]

lowess in R

    out=lowess(x,y,f=0.4)
    plot(x,y)
    lines(out$x,out$y,col=2,lwd=2)

out$x will be a vector containing the x values, sorted in increasing order. out$y will contain the lowess fitted values for the values in out$x. f controls the fraction of the data used to obtain each fitted value. f = 0.4 has been recommended for microarray data normalization.

[Figure: boxplots of mean signal after logging, lowess normalization, median centering, and scaling; vertical axis: Normalized Signal.]

After a separate lowess normalization for each slide, the adjusted values can be median centered and (if deemed necessary) scale normalized across all channels using the lowess-normalized data for each channel.

Data from 3 Sectors on a Single Slide

[Figure: Log Red plotted against Log Green for 3 sectors on a single slide.]

A sector represents the set of points spotted by a single pin on a single slide. The entire normalization process described above can be carried out separately for each sector on each channel. It may be necessary to normalize by sector/channel combinations if spatial variability is apparent.
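As a rough sketch of what normalizing by sector might look like, the code below carries out only the median-centering step separately for each sector on one channel. The sector labels and signal values are simulated here for illustration; in practice the sector of each spot would come from the array layout.

```r
# Hypothetical logged signals for one channel, with a sector label for every spot.
set.seed(3)
sector <- rep(1:3, each = 7)
log.signal <- rnorm(length(sector), mean = 8 + 0.5 * sector, sd = 1)

# Subtract each sector's own median from the spots in that sector.
sector.medians <- ave(log.signal, sector, FUN = median)
centered.by.sector <- log.signal - sector.medians

# Median of the centered values within each sector.
centered.medians <- tapply(centered.by.sector, sector, median)
```

The same grouping idea would apply to the lowess and scaling steps, with each sector/channel combination treated as its own data set.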

Bolstad et al. (2003, Bioinformatics 19(2):185-193) propose quantile normalization for microarray data.

- Quantile normalization is most commonly used in normalization of Affymetrix data.
- It can be used for two-color data as well.
- Quantile normalization can force each channel to have the same quantiles.

$x_q$ (for q between 0 and 1) is the q quantile of a data set if the fraction of the data points less than or equal to $x_q$ is at least q, and the fraction of the data points greater than or equal to $x_q$ is at least 1-q.

    median = x_0.5    Q1 = x_0.25    Q3 = x_0.75

[Figure: boxplots of log signal means after quantile normalization.]

[Figure: original Slide 1 log signal means; Log Red plotted against Log Green.]

[Figure: comparison of Slide 1 log signal means after quantile normalization; Log Red plotted against Log Green.]

Details of Quantile Normalization

1. Find the smallest log signal on each channel.
2. Average the values from step 1.
3. Replace each value in step 1 with the average computed in step 2.
4. Repeat steps 1 through 3 for the second smallest values, third smallest values, ..., largest values.

A Simple Example

             ch1   ch2   ch3   ch4
    1          8    15     9    13
    2          7     2     7    15
    3          3     6     5     8
    4          1     5     2     9
    5          9    13     6    11

Find the Smallest Value for Each Channel

    smallest values by channel: 1, 2, 2, 8

Average These Values

    (1+2+2+8)/4 = 3.25

Replace Each Value by the Average

             ch1    ch2    ch3    ch4
    1          8     15      9     13
    2          7   3.25      7     15
    3          3      6      5   3.25
    4       3.25      5   3.25      9
    5          9     13      6     11

Find the Next Smallest Values

    next smallest values by channel: 3, 5, 5, 9

Average These Values

    (3+5+5+9)/4 = 5.5

Replace Each Value by the Average

             ch1    ch2    ch3    ch4
    1          8     15      9     13
    2          7   3.25      7     15
    3       5.50      6   5.50   3.25
    4       3.25   5.50   3.25   5.50
    5          9     13      6     11

Find the Average of the Next Smallest Values

    (7+6+6+11)/4 = 7.5

Replace Each Value by the Average

             ch1    ch2    ch3    ch4
    1          8     15      9     13
    2       7.50   3.25      7     15
    3       5.50   7.50   5.50   3.25
    4       3.25   5.50   3.25   5.50
    5          9     13   7.50   7.50

Find the Average of the Next Smallest Values

    (8+13+7+13)/4 = 10.25

Replace Each Value by the Average

             ch1    ch2    ch3    ch4
    1      10.25     15      9  10.25
    2       7.50   3.25  10.25     15
    3       5.50   7.50   5.50   3.25
    4       3.25   5.50   3.25   5.50
    5          9  10.25   7.50   7.50

Find the Average of the Next Smallest Values

    (9+15+9+15)/4 = 12.00

Replace Each Value by the Average

             ch1    ch2    ch3    ch4
    1      10.25  12.00  12.00  10.25
    2       7.50   3.25  10.25  12.00
    3       5.50   7.50   5.50   3.25
    4       3.25   5.50   3.25   5.50
    5      12.00  10.25   7.50   7.50

This is the data matrix after quantile normalization.
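The rank-by-rank averaging in this example can be written compactly in R. The function name quantile.normalize is a hypothetical helper introduced for this sketch, not a function from the slides or from a package, and breaking ties by order of appearance is an assumption, since the slides do not discuss ties.

```r
quantile.normalize <- function(X) {
  # Sort each column, average across columns at each rank,
  # then put the rank averages back in each column's original order.
  sorted <- apply(X, 2, sort)
  rank.means <- rowMeans(sorted)
  apply(X, 2, function(col) rank.means[rank(col, ties.method = "first")])
}

# The 5-by-4 example data (rows are genes, columns are channels).
X <- matrix(c(8, 15, 9, 13,
              7,  2, 7, 15,
              3,  6, 5,  8,
              1,  5, 2,  9,
              9, 13, 6, 11),
            nrow = 5, byrow = TRUE)

Q <- quantile.normalize(X)
```

Every column of Q contains the same five values (3.25, 5.50, 7.50, 10.25, 12.00), arranged according to the ranks of the original values in that column.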

Miscellaneous Comments on Normalization

- Data presented on previous slides are somewhat extreme. Many microarray data sets will require less normalization.
- We have only scratched the surface in terms of normalization methods. There are many variations on the techniques that were described previously as well as other approaches that we won't discuss at this point in the course.
- Normalization affects the final results, but it is often not clear what normalization strategy is best.
- It would be good to integrate normalization and statistical analysis, but it is difficult to do so. The most common approach is to normalize data and then perform statistical analysis of the normalized data as a separate step in the microarray analysis process.

Normalization for Specialized Arrays

Sometimes researchers will construct an array with a set of probe sequences that represent a specialized set of genes. If the treatment effects are expected to cause changes of expression in the specialized set that are predominantly in one direction, the global normalization strategies that we discussed may remove the treatment effects of interest. One strategy for normalizing in such cases requires a set of control sequences spotted on each slide.

Normalization for Specialized Arrays (ctd.)

For normalization purposes, good control sequences should represent genes that will not change expression in response to treatments of interest.

1. Housekeeping genes are genes involved in basic functions needed for sustenance of a cell. They are always expressed, but are they constant across conditions?
2. Random cDNA sequences can be used as a negative control (a control not expected to give biological signal).
3. cDNA sequences from an unrelated organism can be used as negative controls or positive spike-in controls (identical amounts of complementary labeled cDNAs added to each hybridized sample).
The idea is to determine the adjustment necessary to normalize the control genes and then make that same adjustment to all genes on the array.

Background Correction

Background correction is often the very first step in microarray analysis. Recall that:

- Spot signal, or simply signal, is fluorescence intensity due to target molecules hybridized to probe sequences contained in a spot (what we would like to measure) plus background fluorescence (what we would rather not measure).
- Background is fluorescence that may contribute to spot pixel intensities but is not due to fluorescence from target molecules hybridized to spot probe sequences.

The idea is to remove background fluorescence from the spot signal fluorescence because the spot signal is believed to be a sum of fluorescence due to background and fluorescence due to hybridized target cDNA.

Background Correction Strategies (applied prior to logging signal intensity)

1. Subtract local background, e.g., signal mean - background mean or signal mean - background median. This can increase variation in measurements, especially for low expressing genes. Some believe that local background will overestimate the background contribution to spot fluorescence. Background fluorescence where cDNA has been spotted may be different than background where no cDNA has been spotted.
2. For each spot, find the local background of the spot as well as the local backgrounds of all neighboring spots. Compute the median or mean of these local backgrounds. Subtract that summary of local backgrounds from the spot's signal. This is similar to option 1 but can reduce some variation in background estimation.

Background Correction Strategies (ctd.)

3. Find the median or mean of local backgrounds in a sector. Subtract the sector summary of local backgrounds from each signal in the sector.
4. Subtract the median or mean of blank spot signals or negative control signals in a sector from all other signals in a sector.
5. Estimate the background for each spot by fitting a model to the local background values.

Final Comments on Background Correction

Subtracting background may result in negative or zero adjusted signal values. Such values cannot be logged. One simple approach is to replace all negative values by zero, add one to all values (whether zero or not), and log the resulting values.

As technology improves and labs gain experience in carrying out microarray experiments, using signal with no background correction may be the best choice.
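The simple approach to negative and zero values can be sketched in R, here combined with local background subtraction (strategy 1). The vectors signal.mean and background.median are made-up spot summaries used only for illustration.

```r
# Hypothetical spot summaries on the raw intensity scale (not from the slides).
signal.mean <- c(1200, 310, 95, 80)
background.median <- c(100, 90, 95, 120)

# Strategy 1: subtract the local background from each spot's signal.
# The last two spots come out as zero and negative, respectively.
adjusted <- signal.mean - background.median

# Replace negative values by zero, add one to every value, then log.
log.adjusted <- log(pmax(adjusted, 0) + 1)
```

Adding one before logging means that spots whose adjusted signal is zero or negative all map to a logged value of 0.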