Introduction to multivariate analysis for bacterial GWAS using


Practical course using the R software

Thibaut Jombart (tjombart@imperial.ac.uk)
Imperial College London
MSc Modern Epidemiology / Public Health

Abstract

This practical illustrates how multivariate methods can be used for the analysis of bacterial genomic datasets. Principal Component Analysis (PCA, [7, 2, 3]) is introduced for assessing the diversity between sampled isolates. We also show how hierarchical clustering methods can be applied to principal components to identify groups of genetically related isolates. Discriminant Analysis of Principal Components (DAPC, [5]) is then used for identifying polymorphic sites associated with phenotypic traits such as bacterial resistance. While this tutorial uses simulated data, the procedures described are applicable to a wide range of Genome-Wide Association Studies (GWAS).

Contents

1 Introduction
  1.1 Required packages
  1.2 Getting help
  1.3 The data
2 First assessment of the genetic diversity
3 Identifying SNPs linked to bacterial resistance

1 Introduction

1.1 Required packages

This practical requires a working version of R [8] greater than or equal to 2.15.2. To check which version of R you are using, examine the welcome message displayed when starting R, or type:

> R.version$version.string
[1] "R version 2.15.2 (2012-10-26)"

The practical relies on the package ade4 [1] for classical multivariate analyses (PCA) and on adegenet [4] for the DAPC. Both packages need to be installed, which may be tricky if you do not have administrative rights on your computer. However, most systems still have public areas where any user has read/write access. A simple hack exploits this to install packages without administrative rights. To use it, simply type (while connected to the internet):

> source("http://adegenet.r-forge.r-project.org/files/hacklib/hacklib.r")
> hacklib()

Then, the required packages can be installed using the usual procedure:

> install.packages("ade4", dep=TRUE)
> install.packages("adegenet", dep=TRUE)

and loaded using:

> library(ade4)
> library(adegenet)

1.2 Getting help

There are several ways of getting information about R functions, including some specific documentation sources for adegenet. The function help.search is used to look for help on a given topic. For instance:

> help.search("hardy-weinberg")

replies that there is a function HWE.test.genind in the adegenet package, and other similar functions in genetics and pegas. To get help for a given function, use ?foo where foo is the function of interest. For instance:

> ?spca

will open the manpage of the spatial principal component analysis [6]. At the end of a manpage, an example section often shows how to use a function. This can be copied and pasted to the R console, or directly executed from the console using example. For further searches on R functions, RSiteSearch can be used to perform online searches using keywords in R's archives (mailing lists and manpages).

adegenet has a few extra documentation sources. Information can be found on the adegenet website (http://adegenet.r-forge.r-project.org/), in the documents section, including several tutorials and a manual which compiles all manpages of the package, and a dedicated mailing list with searchable archives. To open the website from R, use:

> adegenetWeb()

Tutorials ("vignettes" in R's terminology) are also distributed with adegenet, and can be accessed using the command vignette. These can be listed using:

> vignette(package="adegenet")

To open a vignette, for instance the tutorial on DAPC, simply use:

> vignette("adegenet-dapc")

Lastly, several mailing lists are available to find different kinds of information on R; to name a few:

R-help: general questions about R.
https://stat.ethz.ch/mailman/listinfo/r-help

R-sig-genetics: genetics in R.
https://stat.ethz.ch/mailman/listinfo/r-sig-genetics

R-sig-phylo: phylogenetics in R.
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo

adegenet forum: adegenet and multivariate analysis of genetic markers.
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

1.3 The data

The simulated data used in this practical are available online from the following address: http://adegenet.r-forge.r-project.org/files/simgwas/simgwas.rdata. The dataset is in R's binary format (extension RData), which uses compression to store data efficiently (the raw csv file would be more than 4MB). R objects can be loaded into R using load. The function url is required to load the data directly from the internet; once the data are loaded, a new object simgwas appears in the R environment:

> load(url("http://adegenet.r-forge.r-project.org/files/simgwas/simgwas.rdata"))
> ls(pattern="sim")
[1] "simgwas"
> class(simgwas)
[1] "list"
> names(simgwas)
[1] "snps" "phen"
> class(simgwas$snps)
[1] "matrix"
> class(simgwas$phen)
[1] "character"
> dim(simgwas$snps)
[1]    95 10000
> simgwas$snps[1:10,1:20]
      1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
ind1  0 0 0 1 0 1 1 0 0  1  1  0  1  1  0  0  1  0  0  1
ind2  0 0 0 1 0 1 1 0 0  1  1  1  1  1  0  1  1  0  0  1
ind3  0 0 0 1 1 1 1 1 0  1  0  1  1  1  1  1  1  1  0  1
ind4  0 1 1 1 0 1 1 0 0  0  1  1  0  1  1  1  0  1  0  0
ind5  1 0 0 1 0 1 0 1 0  1  1  0  1  1  0  1  1  1  0  0
ind6  1 1 0 1 1 1 1 0 0  1  1  1  1  1  0  1  1  1  0  1
ind7  1 1 0 1 1 0 0 1 0  1  0  1  1  1  0  0  1  1  1  0
ind8  0 1 0 0 0 1 0 0 0  1  0  1  1  0  0  1  1  1  1  0
ind9  0 1 0 1 1 1 0 0 0  1  1  1  1  0  1  1  1  1  1  1
ind10 0 0 0 1 1 1 1 0 0  1  1  0  0  0  0  0  1  1  0  1
> print(object.size(simgwas$snps), units="Mb")
7.8 Mb
> length(simgwas$phen)
[1] 95
> simgwas$phen
 [1] "R" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "R" "S" "S" "R" "S" "S" "S" "S"
[20] "S" "S" "S" "R" "S" "S" "S" "S" "S" "S" "S" "R" "R" "S" "S" "S" "S" "R" "S"
[39] "S" "R" "R" "R" "S" "S" "S" "R" "S" "S" "S" "S" "S" "S" "S" "S" "R" "R" "S"
[58] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "R" "R" "R" "S" "S" "R" "S"
[77] "R" "R" "R" "R" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "R" "S" "S" "R" "R"
> table(simgwas$phen)

 R  S
24 71

The object simgwas is a list with two components: $snps is a matrix of Single Nucleotide Polymorphism (SNP) data, and $phen contains the phenotypes of the sampled isolates. The SNP data have a modest size by GWAS standards: only 95 isolates (in rows) and 10000 SNPs (in columns, alleles coded as 0/1). To simplify further commands, we create the new objects snps and phen from simgwas:

> snps <- simgwas$snps
> phen <- factor(simgwas$phen)
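Before running any analysis, it can help to check that the data have the expected layout. The following is a minimal sketch on a toy object with the same structure as simgwas (a 0/1 matrix plus a character vector of phenotypes). The toy object, its dimensions and the names snps.toy, phen.toy, freq and n.poly are invented for illustration and are not part of the tutorial's data.

```r
# Toy stand-in for simgwas: NOT the tutorial's data, only the same structure.
set.seed(1)
n.ind <- 12   # hypothetical number of isolates
n.snp <- 50   # hypothetical number of SNPs
toy <- list(
  snps = matrix(rbinom(n.ind * n.snp, size = 1, prob = 0.4),
                nrow = n.ind,
                dimnames = list(paste0("ind", 1:n.ind),
                                as.character(1:n.snp))),
  phen = sample(c("R", "S"), n.ind, replace = TRUE)
)

snps.toy <- toy$snps
phen.toy <- factor(toy$phen, levels = c("R", "S"))

# Basic checks: one phenotype per isolate, alleles coded 0/1 only
stopifnot(nrow(snps.toy) == length(phen.toy))
stopifnot(all(snps.toy %in% c(0, 1)))

# Frequency of the allele coded 1 at each SNP, and number of polymorphic sites
freq <- colMeans(snps.toy)
n.poly <- sum(freq > 0 & freq < 1)

print(dim(snps.toy))
print(table(phen.toy))
print(summary(freq))
print(n.poly)
```

The same checks can be applied to snps and phen once the real data are loaded.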

2 First assessment of the genetic diversity

Principal Component Analysis (PCA) is a very powerful tool for reducing the diversity contained in massively multivariate data into a few synthetic variables (the principal components, PCs). There are several versions of PCA implemented in R. Here, we use dudi.pca from the ade4 package, specifying that variables should not be scaled to unit variances (scale=FALSE); scaling is only useful when variables have inherently different scales of variation, which is not the case here:

> pca1 <- dudi.pca(snps, scale=FALSE)

[Figure: screeplot (barplot) of the PCA eigenvalues.]

The method displays a screeplot (barplot of eigenvalues) to help the user decide how many PCs should be retained. The general rule is to retain only the largest eigenvalues, after which non-structured variation results in smoothly decreasing eigenvalues. How many PCs would you retain here?

> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = snps, scale = FALSE, scannf = FALSE, nf = 4)

$nf: 4 axis-components saved
$rank: 94
eigen values: 31.76 28.77 28.13 25.72 21.68 ...
  vector length mode    content
1 $cw    10000  numeric column weights
2 $lw    95     numeric row weights
3 $eig   94     numeric eigen values

  data.frame nrow  ncol  content
1 $tab       95    10000 modified array
2 $li        95    4     row coordinates
3 $l1        95    4     row normed scores
4 $co        10000 4     column coordinates
5 $c1        10000 4     column normed scores
other elements: cent norm

The object pca1 contains various pieces of information. Most importantly:

pca1$eig: contains the eigenvalues of the analysis, representing the amount of information contained in each PC.
pca1$li: contains the principal components.
pca1$c1: contains the principal axes (loadings of the variables).

> head(pca1$eig)
[1] 31.75594 28.77240 28.12875 25.72180 21.68303 21.36911
> head(pca1$li)
        Axis1     Axis2     Axis3      Axis4
ind1 3.606420 -2.132999  9.622764  -6.301912
ind2 1.912918 -1.656548  8.734490 -10.006055
ind3 2.316603 -2.564638  9.324818  -7.445660
ind4 2.490536 -2.484711  8.819193  -6.029816
ind5 2.448958 -1.489571  8.576321  -8.775661
ind6 2.938701 -2.693103 10.876804  -3.797021
> head(pca1$c1)
             CS1          CS2          CS3           CS4
X1  1.004273e-02  0.004291539 -0.003509719 -0.0092503284
X2 -5.145732e-03 -0.003539221 -0.001470553  0.0075073374
X3 -3.349998e-05 -0.003362894  0.003797944  0.0013048886
X4 -1.017829e-03  0.002489303 -0.002323418 -0.0007847613
X5  7.047362e-03  0.007801922 -0.003475240  0.0057483659
X6 -1.011989e-02  0.013435213  0.021812199 -0.0321210072

Because of the large number of variables, the usual biplot (function scatter) is useless for visualizing the results (try scatter(pca1) if unsure). We represent only the PCs, using s.label:

> s.label(pca1$li, sub="PCA - PC 1 and 2")
> add.scatter.eig(pca1$eig,4,1,2, ratio=.3, posi="topleft")

[Figure: scatterplot of the isolates (labelled ind1 to ind95) on PCs 1 and 2, grid scale d = 5, with the eigenvalue barplot as inset. Title: "PCA - PC 1 and 2".]

What can you say about the genetic relationships between the isolates? Are there indications of the existence of distinct lineages of bacteria? If so, how many lineages would you count?

For a more quantitative assessment of this clustering, we derive squared Euclidean distances between isolates (function dist) and use hierarchical clustering with complete linkage (hclust) to define tight clusters:

> D <- dist(pca1$li[,1:4])^2
> clust <- hclust(D, method="complete")
> plot(clust, main="Clustering (complete linkage) based on the first 4 PCs", cex=.4)

[Figure: dendrogram "Clustering (complete linkage) based on the first 4 PCs". Leaves: isolates ind1 to ind95; y-axis: Height (0 to 600); computed from D with hclust (*, "complete").]

How many clusters are there in the data? How does this compare to what you would have assessed based on the first two PCs of the PCA? Bonus question: considering that the original data are profiles of binary SNPs, what does the height represent in this dendrogram?

You can define clusters based on the dendrogram clust using cutree:

> pop <- factor(cutree(clust, k=5))
> pop
 ind1  ind2  ind3  ind4  ind5  ind6  ind7  ind8  ind9 ind10 ind11 ind12 ind13
    1     1     1     1     1     1     1     1     1     1     1     1     1
ind14 ind15 ind16 ind17 ind18 ind19 ind20 ind21 ind22 ind23 ind24 ind25 ind26
    1     1     2     2     2     2     2     2     2     2     2     2     2
ind27 ind28 ind29 ind30 ind31 ind32 ind33 ind34 ind35 ind36 ind37 ind38 ind39
    2     2     2     2     3     3     3     3     3     3     3     3     3
ind40 ind41 ind42 ind43 ind44 ind45 ind46 ind47 ind48 ind49 ind50 ind51 ind52
    3     3     3     3     3     3     3     3     3     3     3     4     4
ind53 ind54 ind55 ind56 ind57 ind58 ind59 ind60 ind61 ind62 ind63 ind64 ind65
    4     4     4     4     4     4     4     4     4     4     4     4     4
ind66 ind67 ind68 ind69 ind70 ind71 ind72 ind73 ind74 ind75 ind76 ind77 ind78
    4     4     4     4     4     4     4     4     4     4     4     4     4
ind79 ind80 ind81 ind82 ind83 ind84 ind85 ind86 ind87 ind88 ind89 ind90 ind91
    4     4     5     5     5     5     5     5     5     5     5     5     5
ind92 ind93 ind94 ind95
    5     5     5     5
Levels: 1 2 3 4 5

Now, we can represent these groups on top of the PCs using s.class (clusters are indicated by different colors and ellipses):

> s.class(pca1$li, fac=pop, col=transp(funky(5)), cpoint=2,
+         sub="PCA - axes 1 and 2")

[Figure: isolates on PCA axes 1 and 2, colored by cluster (1 to 5) with ellipses; grid scale d = 5. Title: "PCA - axes 1 and 2".]

We do the same for PCs 3 and 4:

> s.class(pca1$li, xax=3, yax=4, fac=pop, col=transp(funky(5)),
+         cpoint=2, sub="PCA - axes 3 and 4")

[Figure: isolates on PCA axes 3 and 4, colored by cluster (1 to 5) with ellipses; grid scale d = 5. Title: "PCA - axes 3 and 4".]

Are the clusters compatible with the results of the PCA? What is the meaning of the 3rd axis of the PCA? How many dimensions are needed to differentiate the 5 groups?
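The steps of this section (PCA, squared Euclidean distances on the retained PCs, complete-linkage clustering, cutree) can also be sketched with base R only. The example below runs on toy data and is not the tutorial's analysis: the three simulated lineages and all object names are invented. It assumes that, with scale=FALSE, dudi.pca centres each column and weights isolates uniformly by 1/n, so that eigenvalues are computed with a 1/n (not 1/(n-1)) denominator; check ?dudi.pca before relying on this.

```r
# Toy data: 3 hypothetical lineages of 10 isolates each, 200 binary SNPs.
set.seed(42)
n.grp <- 3; n.per <- 10; n.snp <- 200
p <- matrix(runif(n.grp * n.snp), nrow = n.grp)  # allele frequencies per lineage
lineage <- rep(1:n.grp, each = n.per)
X <- matrix(rbinom(length(lineage) * n.snp, size = 1, prob = p[lineage, ]),
            nrow = length(lineage))
n <- nrow(X)

# PCA through a singular value decomposition of the centred (unscaled) data
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)
eig <- sv$d^2 / n        # eigenvalues: variance carried by each PC (1/n weights)
loadings <- sv$v         # principal axes (unit-norm columns), analogous to $c1
scores <- Xc %*% sv$v    # principal components (row coordinates), analogous to $li

# Each PC has variance equal to its eigenvalue; eigenvalues sum to total variance
stopifnot(isTRUE(all.equal(as.numeric(colMeans(scores^2)), eig)))
stopifnot(isTRUE(all.equal(sum(eig), sum(Xc^2) / n)))

# Clustering on the first two PCs, as done on pca1$li in the text
D <- dist(scores[, 1:2])^2
clust <- hclust(D, method = "complete")
grp <- factor(cutree(clust, k = n.grp))

print(round(head(eig), 2))
print(table(cluster = grp, lineage = lineage))
```

Cross-tabulating the inferred clusters against the simulated lineages, as in the last line, is only possible here because the toy lineages are known by construction.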

3 Identifying SNPs linked to bacterial resistance

The data contained in phen indicate whether isolates are susceptible or resistant to a given antibiotic (S/R):

> head(phen,10)
 [1] R S S S S S S S S S
Levels: R S

As we did previously with the genetic clusters, we can represent these two groups on the PCs to assess whether antibiotic resistance correlates with some components of the genetic diversity.

> s.class(pca1$li, fac=phen, col=transp(c("royalblue","red")), cpoint=2,
+         sub="PCA - axes 1 and 2")

[Figure: isolates on PCA axes 1 and 2, colored by phenotype (R/S) with ellipses; grid scale d = 5. Title: "PCA - axes 1 and 2".]

> s.class(pca1$li, xax=3, yax=4, fac=phen, col=transp(c("royalblue","red")),
+         cpoint=2, sub="PCA - axes 3 and 4")

[Figure: isolates on PCA axes 3 and 4, colored by phenotype (R/S) with ellipses; grid scale d = 5. Title: "PCA - axes 3 and 4".]

This visual assessment can be complemented by a standard Chi-square test to check whether there is an association between genetic clusters and resistance:

> table(phen, pop)
    pop
phen  1  2  3  4  5
   R  3  1  7 10  3
   S 12 14 13 20 12
> chisq.test(table(phen, pop), simulate=TRUE)

	Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  table(phen, pop)
X-squared = 5.2267, df = NA, p-value = 0.2419

What do you conclude? Is antibiotic resistance correlated with the main genetic features of these isolates?

It is important to keep in mind that PCA optimizes the representation of the overall genetic diversity, and does not explicitly look for distinctions between predefined groups of isolates. If only a few loci are correlated with bacterial resistance, PCA may well overlook them, especially if stronger structures such as separate lineages or populations are present.

To look for combinations of SNPs correlated with a given partition of individuals, DAPC is much more appropriate. We apply the method using the function dapc, specifying the input data (snps) and the groups of individuals to distinguish (susceptible/resistant, phen):

> dapc1 <- dapc(snps, phen)

The function asks for the number of principal components to retain for the dimension-reduction step (PCA; retain 30 PCs) and for the number of axes to retain in the subsequent discriminant analysis (DA). For the latter, only one axis can be retained (the maximum number of axes in DA is always the number of groups minus 1).

> dapc1
	#################################################
	# Discriminant Analysis of Principal Components #
	#################################################
class: dapc
$call: dapc.data.frame(x = as.data.frame(x), grp = ..1, n.pca = 30, n.da = 1)

$n.pca: 30 first PCs of PCA used
$n.da: 1 discriminant functions saved
$var (proportion of conserved variance): 0.371

$eig (eigenvalues): 116.4

  vector    length content
1 $eig      1      eigenvalues
2 $grp      95     prior group assignment
3 $prior    2      prior group probabilities
4 $assign   95     posterior group assignment
5 $pca.cent 10000  centring vector of PCA
6 $pca.norm 10000  scaling vector of PCA
7 $pca.eig  94     eigenvalues of PCA

  data.frame    nrow  ncol content
1 $tab          95    30   retained PCs of PCA
2 $means        2     30   group means
3 $loadings     30    1    loadings of variables
4 $ind.coord    95    1    coordinates of individuals (principal components)
5 $grp.coord    2     1    coordinates of groups
6 $posterior    95    2    posterior membership probabilities
7 $pca.loadings 10000 30   PCA loadings of original variables
8 $var.contr    10000 1    contribution of original variables

The function scatter can be used to visualize the results of DAPC. It normally produces scatterplots of the individuals on the discriminant axes, using colors and ellipses to indicate groups. However, when only one axis has been retained, scatter plots the density of the individuals along the single discriminant function:

> scatter(dapc1, bg="white", scree.da=FALSE, scree.pca=TRUE,
+         posi.pca="topright", col=c("royalblue","red"),
+         legend=TRUE, posi.leg="topleft")

[Figure: densities of resistant (R) and susceptible (S) isolates along the discriminant function (x-axis: Discriminant function 1; y-axis: Density), with the PCA eigenvalues as inset.]

The contribution of each variable to the separation of the two groups (susceptible/resistant) is stored in dapc1$var.contr; it can be visualized using loadingplot, which displays all contributions as bars and annotates variables with the largest contributions (see argument threshold in ?loadingplot):

> loadingplot(dapc1$var.contr)

[Figure: "Loading plot". x-axis: Variables; y-axis: Loadings (0.000 to 0.006). Variables with the largest contributions are annotated with their SNP index.]
8330 8332 8334 8340 8343 8344 8347 8348 8349 8350 8359 8360 8362 8364 8367 8371 8373 8374 8375 8378 8383 8393 8402 8405 8408 8410 8417 8433 8439 8441 8442 8443 8447 8449 8452 8455 8456 8462 8465 8471 8475 8482 8483 8486 8490 8493 8503 8519 8525 8526 8529 8530 8531 8535 8541 8546 8552 8554 8566 8570 8571 8572 8574 8575 8578 8580 8581 8585 8587 8595 8599 8605 8606 8612 8617 8621 8625 8626 8631 8632 8633 8644 8649 8650 8651 8657 8659 8662 8677 8678 8683 8687 8694 8697 8699 8703 8704 8708 8709 8710 8713 8719 8728 8732 8735 8736 8746 8751 8752 8754 8755 8758 8764 8765 8770 8776 8782 8790 8792 8796 8797 8798 8800 8805 8810 8811 8816 8817 8818 8819 8821 8826 8830 8832 8833 8839 8845 8847 8850 8851 8855 8857 8860 8862 8863 8874 8887 8889 8890 8898 8899 8909 8911 8913 8914 8917 8918 8922 8925 8929 8931 8935 8937 8938 8940 8948 8949 8950 8952 8954 8964 8967 8971 8975 8980 8984 8989 8997 8998 9007 9013 9019 9022 9028 9033 9038 9045 9047 9048 9053 9054 9057 9062 9064 9067 9069 9071 9073 9074 9080 9082 9087 9093 9095 9098 9100 9101 9102 9105 9110 9112 9113 9126 9130 9140 9144 9145 9150 9151 9153 9157 9159 9162 9163 9164 9169 9171 9175 9178 9184 9185 9188 9195 9196 9198 9199 9216 9221 9225 9226 9228 9236 9237 9238 9247 9250 9263 9265 9270 9276 9283 9287 9300 9301 9304 9308 9312 9315 9317 9321 9322 9326 9327 9329 9335 9337 9340 9341 9344 9353 9356 9363 9369 9370 9371 9373 9377 9380 9381 9386 9389 9391 9396 9397 9404 9407 9409 9411 9415 9416 9417 9419 9420 9421 9423 9424 9429 9430 9434 9435 9436 9438 9440 9449 9456 9460 9472 9476 9486 9494 9498 9505 9507 9510 9511 9514 9517 9525 9526 9529 9531 9538 9540 9541 9544 9546 9561 9563 9565 9568 9570 9572 9573 9576 9588 9593 9596 9597 9608 9611 9614 9622 9627 9634 9643 9650 9652 9655 9670 9671 9672 9674 9678 9682 9683 9690 9693 9694 9695 9701 9702 9705 9707 9708 9714 9720 9728 9730 9733 9735 9738 9744 9753 9759 9762 9767 9783 9784 9785 9789 9790 9792 9793 9796 9798 9801 9803 9807 9809 9814 9815 9820 9822 9825 9826 9827 9830 9831 9834 9838 
9841 9846 9848 9852 9857 9858 9859 9860 9862 9864 9872 9874 9875 9877 9882 9883 9886 9890 9891 9893 9894 9902 9906 9911 9914 9915 9922 9923 9924 9927 9928 9929 9932 9933 9936 9939 9943 9945 9946 9947 9951 9953 9956 9958 9962 9971 9975 9978 9982 9983 9984 9985 9986 9990 9993 The function also invisibly returns information on the annotated variables. Recall loadingplot, specifying a higher threshold so that only the few outlying variables are retained, and store this result in an object called sel.snps. 16
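As a rough illustration of what a higher threshold does, the base-R sketch below keeps only the variables whose loading exceeds a cut-off and collects them in a list with the same fields as the object returned in this tutorial (threshold, var.names, var.idx, var.values). The simulated loadings, the names toy.loadings, select.above and toy.sel, and the 0.004 cut-off are assumptions made for the example; this is not adegenet's own implementation of loadingplot.

```r
# Toy loadings: 10,000 variables with small contributions,
# plus five hypothetical outliers (illustrative values only).
set.seed(1)
toy.loadings <- runif(10000, min = 0, max = 0.0005)
names(toy.loadings) <- seq_along(toy.loadings)
toy.loadings[c(7197, 7199, 7202, 7206, 7207)] <- 0.005

# Hypothetical helper: keep the variables whose loading exceeds 'threshold'.
select.above <- function(x, threshold) {
  idx <- which(x > threshold)   # named integer vector of positions
  list(threshold  = threshold,
       var.names  = names(x)[idx],
       var.idx    = idx,
       var.values = x[idx])
}

toy.sel <- select.above(toy.loadings, threshold = 0.004)
toy.sel$var.names
```

Raising the threshold shrinks the selection; lowering it towards the bulk of the loadings quickly returns hundreds of variables, which is why the default plot is so crowded.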

[Figure: loading plot. x-axis: Variables; y-axis: Loadings (0.000 to 0.006). Labelled variables: 7197, 7199, 7202, 7206, 7207.]

The object should look like this:

> sel.snps
$threshold
[1] 0.004

$var.names
[1] "7197" "7199" "7202" "7206" "7207"

$var.idx
7197 7199 7202 7206 7207
7197 7199 7202 7206 7207

$var.values
       7197        7199        7202        7206        7207
0.004943821 0.004943821 0.004943821 0.004943821 0.004943821

Which SNPs are the most strongly correlated with antibiotic resistance? The following command derives the allelic profiles of these SNPs for each isolate:

> sel.profiles <- apply(snps[,sel.snps$var.idx],1,paste,collapse="-")
> head(sel.profiles)
       ind1        ind2        ind3        ind4        ind5        ind6
"1-1-1-1-1" "0-0-0-0-0" "0-0-0-0-0" "0-0-0-0-0" "0-0-0-0-0" "0-0-0-0-0"
> table(sel.profiles)
sel.profiles
0-0-0-0-0 1-1-1-1-1
       71        24
> head(cbind.data.frame(phen,sel.profiles),10)

      phen sel.profiles
ind1     R    1-1-1-1-1
ind2     S    0-0-0-0-0
ind3     S    0-0-0-0-0
ind4     S    0-0-0-0-0
ind5     S    0-0-0-0-0
ind6     S    0-0-0-0-0
ind7     S    0-0-0-0-0
ind8     S    0-0-0-0-0
ind9     S    0-0-0-0-0
ind10    S    0-0-0-0-0
> tail(cbind.data.frame(phen,sel.profiles),10)
      phen sel.profiles
ind86    S    0-0-0-0-0
ind87    S    0-0-0-0-0
ind88    S    0-0-0-0-0
ind89    S    0-0-0-0-0
ind90    S    0-0-0-0-0
ind91    R    1-1-1-1-1
ind92    S    0-0-0-0-0
ind93    S    0-0-0-0-0
ind94    R    1-1-1-1-1
ind95    R    1-1-1-1-1

A contingency table between phenotype and SNP profile can be created using table:

> table(phen,sel.profiles)
    sel.profiles
phen 0-0-0-0-0 1-1-1-1-1
   R         0        24
   S        71         0

What can you conclude about these SNPs? Assuming that their position in the dataset reflects their original position in the genome, do you think that each of these SNPs actually determines antibiotic resistance? How would you address this question?
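One possible starting point for these questions is to quantify the association and to check how much independent information the selected SNPs carry. The sketch below builds a toy dataset with the same counts as the contingency table above (24 resistant and 71 susceptible isolates), runs Fisher's exact test on the table, and computes the pairwise correlations between the five SNPs. The objects toy.phen, toy.snps and toy.profiles are simulated stand-ins rather than the tutorial's data, and Fisher's test is only one of several reasonable choices.

```r
# Simulated stand-ins for 'phen' and the five selected SNP columns.
toy.phen <- factor(rep(c("R", "S"), times = c(24, 71)))
carrier  <- as.integer(toy.phen == "R")   # 1 for resistant isolates, 0 otherwise
toy.snps <- matrix(carrier, nrow = 95, ncol = 5,
                   dimnames = list(paste0("ind", 1:95),
                                   c("7197", "7199", "7202", "7206", "7207")))

# Allelic profiles and contingency table, built as in the tutorial.
toy.profiles <- apply(toy.snps, 1, paste, collapse = "-")
tab <- table(toy.phen, toy.profiles)
tab

# Fisher's exact test of independence between phenotype and profile.
fisher.test(tab)$p.value

# Pairwise correlations between the SNPs: values of 1 mean the SNPs
# carry identical information across isolates.
cor(toy.snps)
```

When every pairwise correlation equals 1, an association test gives the same answer for each SNP, so separating their individual effects would need additional evidence, for example isolates in which the SNPs are not all inherited together.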
