Statistical analysis of oligonucleotide microarray data

Ziad Taib

doi:10.1016/j.crvi.2003.05.003

Biological modelling / Biomodélisation

Statistical analysis of oligonucleotide microarray data
[Analyse statistique de données de puces d'ADN à base d'oligonucléotides]

Présenté par : Pierre Auger

Ziad Taib ¹

¹ Department of Biostatistics, AstraZeneca R&D, Mölndal, 431 83 Mölndal, Sweden

Comptes Rendus. Biologies, Volume 327 (2004) no. 3, pp. 175-180.

Résumés

Anglais
Français

Microchip arrays have become one of the most rapidly growing techniques for monitoring gene expression at the genomic level and thereby gaining valuable insight about various important biological mechanisms. Examples of such mechanisms are: identifying disease-causing genes, genes involved in the regulation of some aspect of the cell cycle, etc. In this article, we discuss the problem of estimating gene expression based on a proper statistical model. More precisely, we show how the model introduced by Li and Wong can be used in its full bivariate generality to provide a new measure of gene expression from high-density oligonucleotide arrays. We also present a second gene expression index based on a new way of reducing the model into a simpler univariate model. In both cases, the gene expression indices are shown to be unbiased and to have lower variance than the established ones. Moreover, we present a bootstrap method aiming at providing non-parametric confidence intervals for the expression index.

Les puces d'ADN sont devenues l'une des techniques les plus utilisées pour observer l'expression génétique à l'échelon génomique et ainsi mieux comprendre certains mécanismes biologiques. Comme exemples de ces mécanismes, citons l'identification de gènes causant certaines maladies héréditaires, ou l'identification de gènes impliqués dans la régulation de certains aspects du cycle cellulaires etc. Dans cet article, nous discutons le problème d'estimation de l'expression génétique à l'aide d'un modèle statistique adéquat. En effet, nous montrons comment le modèle introduit par Li et Wong peut être utilisé dans toute sa généralité multivariée pour obtenir une nouvelle mesure de l'expression génétique dans le cas de puces d'ADN d'oligonucléotide à haute densité. D'autre part, nous présentons une autre mesure de l'expression génétique basée sur un nouveau concept réduisant le modèle en un autre, univarié et plus simple. Nous montrons que les nouvelles mesures sont sans biais et que leurs variances sont plus petites que les variances des mesures existantes. Nous présentons aussi une méthode de bootstrap qui vise à créer des intervalles de confiance non-paramétriques pour les mesures d'expression génétique.

Métadonnées

Reçu le : 2003-04-15
Accepté le : 2003-05-23
Publié le : 2004-03-01

PMID

DOI : 10.1016/j.crvi.2003.05.003

Keywords: gene expression, microchip array, model-based expression index, bootstrapping
Mots-clés : expression génétique, puces d'ADN, modèle statistique de l'expression génétique, bootstrapping

Affiliations des auteurs :

Ziad Taib ¹

¹ Department of Biostatistics, AstraZeneca R&D, Mölndal, 431 83 Mölndal, Sweden

@article{CRBIOL_2004__327_3_175_0,
     author = {Ziad Taib},
     title = {Statistical analysis of oligonucleotide microarray data},
     journal = {Comptes Rendus. Biologies},
     pages = {175--180},
     publisher = {Elsevier},
     volume = {327},
     number = {3},
     year = {2004},
     doi = {10.1016/j.crvi.2003.05.003},
     language = {en},
}

TY  - JOUR
AU  - Ziad Taib
TI  - Statistical analysis of oligonucleotide microarray data
JO  - Comptes Rendus. Biologies
PY  - 2004
SP  - 175
EP  - 180
VL  - 327
IS  - 3
PB  - Elsevier
DO  - 10.1016/j.crvi.2003.05.003
LA  - en
ID  - CRBIOL_2004__327_3_175_0
ER  -

%0 Journal Article
%A Ziad Taib
%T Statistical analysis of oligonucleotide microarray data
%J Comptes Rendus. Biologies
%D 2004
%P 175-180
%V 327
%N 3
%I Elsevier
%R 10.1016/j.crvi.2003.05.003
%G en
%F CRBIOL_2004__327_3_175_0

Ziad Taib. Statistical analysis of oligonucleotide microarray data. Comptes Rendus. Biologies, Volume 327 (2004) no. 3, pp. 175-180. doi : 10.1016/j.crvi.2003.05.003. https://comptes-rendus.academie-sciences.fr/biologies/articles/10.1016/j.crvi.2003.05.003/

Version originale du texte intégral

1 Introduction

In recent years, it has become possible to make simultaneous measurements of the expression levels of tens of thousands of genes using the so-called microarray technology. Through various microarray experiments, scientists try to answer various basic biological questions ranging from understanding some aspect of the cell cycle mechanism of a given organism to the identification of individual genes involved in some biochemical pathway related to some biological function or some disease of interest.

In this article, we discuss statistical modelling of data from high-density oligonucleotide arrays of the type manufactured by Affymetrix. The point of view we adopt is that statistical methods can and should be used in order to achieve a better understanding of experimental data. Although statistical methods can be useful at all the stages of a typical microarray experiment (experimental design, image analysis and artefact detection, normalization, calculation of gene expression, making comparisons between arrays, dimension reduction, clustering, etc.), we will only discuss the fundamental problem of estimating the expression index (cf. [1,2]).

The article is organized as follows. In Section 2, we describe how the data arise in some detail. The bivariate Li–Wong model (cf. [3,4]) is presented in Section 3. This model is used in Section 4 to provide maximum likelihood estimates of various parameters. A new way of reducing the model into a univariate one and the resulting estimates are discussed in Section 5. Comparisons between these two estimates as well as others are discussed in the discussion section.

The main conclusion of the article is that the new estimates seem to be better than previously established ones.

2 The nature of the data

First we take a look at some of the features of oligonucleotide-based microarrays. A single array (1.28 cm×1.28 cm) can contain probe sets for tens of thousands of genes and ESTs. Every probe set consists of 10–20 probe pairs. Every probe pair contains a perfect match (PM) probe and a mismatch probe (MM). The perfect match probe is a small DNA subsequence (25 bases long), which is assumed to be specific for a particular gene, the expression of which we want to measure. The mismatch probe is identical to the perfect match probe, except for the base in the middle (13th) position.

The oligonucleotides in the probes are typically chosen from the so-called 3 prime end of the gene, but have been until recently unknown to the user. The expression level of the gene has to be inferred from the amount of hybridisation of the PM and MM probes. The latter are measured using a scanner, which is sensible to the fluorescence intensities of the probes. Up to now, expression indices have been based on PM–MM differences at the probe level. Contrary to what one expects, such differences are often negative. A trial study reveals that although very few probe sets will only have negative differences, a vast majority will have some. In fact, the median number of negative probes per probe set is around 8 (out of a total of 20). In many cases, the overall average of these differences for a probe set will be negative. Since this means that the corresponding gene has negative expression, it is obvious that such an average difference cannot be used. Various conditions are often imposed to prevent this from happening. These include using PM values only (cf. [1,2,5]), excluding probe pairs with negative differences, using PM−c MM with some suitably chosen constant c (cf. [6]), treating negative differences as missing data and using imputation, etc. All these methods suffer from drawbacks such as being ad hoc, being inconsistent in that not all probe pairs are used in the same way, etc. This obscures the analysis.

At the most basic level, the data is in the form of pixel intensities (36 pixels/cell for the Mu11k mouse chip) for the individual probe cells. It goes without saying that one can (and should) try to model the data already at this level using ideas from statistical image analysis. In these notes, however, we will only consider models on the next level, namely that of PM and MM intensities. Another issue that we will not discuss here is that of background calculation and subtraction.

3 The model

The basic idea behind the Li–Wong model (cf. [3,4]) is that the PM intensities are expected to be higher than the MM intensities. One can thus assume the following model: (1)

where I is the number of arrays and J is the number probe pairs (usually 20), ν stands for the base line intensity of a probe pair due to non-specific hybridisation, α is the rate of increase in MM intensity of a probe pair and φ is the additional rate of increase in the corresponding PM response. The error terms are assumed to have zero mean and some common variance. The most important parameter of this model is the expression index θ.

All these quantities are assumed to be non-negative. Moreover in order to avoid ‘unidentifiability’, we impose some additional conditions like ∑_jφ_j²=J. Although this model was introduced in [3], the authors have only treated the reduced case

Y_{ij} = P_{ij} - M_{ij} = θ_{i} ϕ_{j} + ϵ_{ij}^{P} - ϵ_{ij}^{M}

(2)

The authors of [7] use (1) explicitly, but assume that the PM and MM values are independent; so their model describes the marginal distributions. We propose to augment the model so as to take into account the empirically observed correlation between PM and MM, which is usually rather high. For example, the average correlation coefficient for 30 probe sets was 0.77 with a standard deviation of 0.14. The maximum value was 0.94 and the minimum 0.36 (cf. Fig. 1).

Fig. 1
Shows the distribution of the PM/MM correlation for 30 probe sets (ρ on the x-axis and the frequency on the y-axis).

The rationale for this is that the probe pairs corresponding to the same gene are scattered all over the array, while the two components of the same probe are always adjacent to each other. More precisely, we assume that the error terms in (1) follow a bivariate normal distribution according to:

(\begin{matrix} ϵ_{ij}^{M} \\ ϵ_{ij}^{P} \end{matrix}) \approx N ((\begin{matrix} 0 \\ 0 \end{matrix}), (\begin{matrix} σ^{2} & c \\ c & σ^{2} \end{matrix}))

(3)

where c is the covariance term and σ is the corresponding correlation coefficient.

How do we know that such a model fits the data? It is possible to examine the adequacy of the model by using the residuals to measure the lack of fit. The reader will find more information about this at the end of the next section.

4 The estimates

Given data (P_ij,M_ij), we can find parameter estimates using the maximum-likelihood method. The likelihood function has the form:

L (P_{ij}, M_{ij}, θ_{i}, α_{j}, ρ, ν_{ij}, ϕ_{j}, σ^{2}) = \prod_{i, j} K \exp [- \frac{1}{2} {((\begin{matrix} P_{ij} \\ M_{ij} \end{matrix}) - (\begin{matrix} ν_{ij} + θ_{i} α_{j} + θ_{i} ϕ \\ ν_{ij} + θ_{i} α_{j} \end{matrix}))}^{T} Σ^{- 1} ((\begin{matrix} P_{ij} \\ M_{ij} \end{matrix}) - (\begin{matrix} ν_{ij} + θ_{i} α_{j} + θ_{i} ϕ \\ ν_{ij} + θ_{i} α_{j} \end{matrix}))]

The corresponding log likelihood function is

l = \sum_{i} \sum_{j} \log K - \sum_{i} \sum_{j} \frac{1}{2 σ^{2} (1 - ρ^{2})} ({(P_{ij} - (ν_{j} + θ_{i} α_{j} + θ_{i} ϕ_{j}))}^{2} + {(M_{ij} - (ν_{j} + θ_{i} α_{j}))}^{2} - 2 ρ (M_{ij} - (ν_{j} + θ_{i} α_{j})) (P_{ij} - (ν_{j} + θ_{i} α_{j} + θ_{i} ϕ_{j})))

Taking the partial derivatives with respect to the parameters and setting the resulting expression equal to zero leads to maximum-likelihood estimates of the parameters. (4)

These formulas have to be understood as steps in an iterative procedure that will lead to the final estimates. Nonetheless they are quite useful when it comes to deriving various properties. It is thus easy to see that the estimate, ${\hat{θ}}_{i}$ of the expression index is unbiased, i.e. that $E [\hat{θ}] = θ_{i}$ . The usual expression index (the average difference or ${\bar{Y}}_{i} = \frac{1}{J} \sum_{j} (P_{ij} - M_{ij})$ advocated by Affymetrix) lacks this property, as is seen by the following argument. In terms of the Li and Wong model, we have $E [{\bar{Y}}_{i}] = θ_{i} \frac{1}{J} \sum_{j = 1}^{J} ϕ_{j}$ . From Steiner's theorem, we know that $\frac{1}{J} \sum_{j = 1}^{J} ϕ_{j} ⩽ 1$ so $E [{\bar{Y}}_{i}] = θ_{i} \bar{ϕ} ⩽ θ_{i}$ , i.e. ${\bar{Y}}_{i}$ is unbiased.

To obtain non-parametric confidence intervals for the expression level, we propose the following bootstrap method. First, estimate the parameters using the above maximum likelihood estimates; then, estimate the means:

{\hat{M}}_{ij} = {\hat{ν}}_{ij} + {\hat{θ}}_{i} {\hat{α}}_{j} {\hat{P}}_{ij} = {\hat{ν}}_{ij} + {\hat{θ}}_{i} {\hat{α}}_{j} + {\hat{θ}}_{i} {\hat{ϕ}}_{j}

(5)

It is now possible to estimate the residuals

{\hat{ϵ}}_{ij} = (\begin{matrix} {\hat{ϵ}}_{ij}^{P} \\ {\hat{ϵ}}_{ij}^{M} \end{matrix})

. These can be used to generate new observations by adding (bivariate) bootstrap errors to the expectations in (5). For every new set of observations, the expression level can be estimated, so we end up with a large number of possible expression values, which can then be used to calculate a confidence interval.

Notice also that the common PM/MM variance can be (ML) estimated by:

{\hat{σ}}^{2} = \frac{\sum_{i} \sum_{j} {(P_{ij} - {\hat{P}}_{ij})}^{2} + \sum_{i} \sum_{j} {(M_{ij} - {\hat{M}}_{ij})}^{2}}{2 IJ (1 - {\hat{ρ}}^{2})} - \frac{2 \hat{ρ} \sum_{i} \sum_{j} (P_{ij} - {\hat{P}}_{ij}) (M_{ij} - {\hat{M}}_{ij})}{2 IJ (1 - {\hat{ρ}}^{2})}

and that variants of this can be used for outlier detection, i.e. we can classify a probe pair as an outlier if its residual is much higher than expected. Lack of fit can be measured using the residuals and standard methods of model fitting. In essence, standardized versions of the residuals should behave as independent drawings from a bivariate normal distribution.

5 Reduced models

As mentioned earlier, the Li–Wong model has, until now, been only used in its reduced form (2). This has the advantage of being a much simpler model than the full one. This reduced model has been implemented in dChip (one of the most popular tools for analysing this type of data). Other ways to get reduced models have also been investigated: using PM only values or the average difference. It is possible to formulate a reduced version of the Li–Wong model along the lines proposed in [6], i.e. by considering P_ij−cM_ij for some suitable value of c.

In what follows, we present a new way of reducing the full model. The rationale of this method is that the MM values can, in a sense, be considered as containing help information only. One way to use this help information is to condition on it, i.e. to consider the conditional distribution of the PM values given the MM values. Under the assumption that the bivariate values are normal, the conditional distribution is also normal according to (cf. [8]):

P_{ij} | M_{ij} \approx N (υ_{ij} + α_{j} θ_{i} + ϕ_{j} θ_{i} + ρ (M_{ij} - υ_{ij} - α_{j} θ_{i}), σ^{2} (1 - ρ^{2}))

Again, maximum likelihood estimates of the parameters can be based on this model. The resulting estimate of the expression level is:

{\hat{θ}}_{i} = \frac{\sum_{j} (P_{ij} - ρ M_{ij} - υ_{ij} (1 - ρ)) (α_{j} (1 - ρ) + ϕ_{j})}{\sum_{j} {(α_{j} (1 - ρ) + ϕ_{j})}^{2}} .

It is rather easy to verify that this estimate is unbiased.

6 Discussion

We have thus seen that using the Li–Wong model as a basis, one has the choice of using either the full bivariate model or some reduced univariate version. To make a rational decision as to the choice of a model, one can use both theoretical and empirical criteria as is done in [9]. One such criterion is unbiasedness. But perhaps the most natural of these criteria is the variance, i.e. preferring an estimate with a low variance over one with a large variance. Simple calculations show that according to this criterion the best model is the full bivariate model followed by the conditional and the reduced models in that order (cf. the Appendix). The comparison with the average difference is, however, not straightforward since that estimate has to be transformed to become unbiased. The transformed average difference has the largest variance of all reduced models.

A quite different way of comparing models is to use special experiments where known amounts of mRNA are used. Such data will become more available in the future.

It is interesting that the gene expression estimate based on the conditional model is of the type proposed by [6], i.e. of the form P_ij−cM_ij. In this estimate, the constant c is simply taken as ρ, the correlation coefficient. The conditional estimate has one additional nice feature, namely that it is a weighted average of P_ij−ρM_ij differences. Probe pairs with higher sensitivity (measured by α_j(1−ρ)+φ_j) are given a higher weight. The special cases ρ=0 and ρ=1 give the PM-only case and the reduced Li–Wong model respectively. To gain insight as to why ρ is a good choice of constant in the approach used in [6], one can argue as follows. The variance of each term P_ij−cM_ij is simply σ²(1+c²−2ρc) and is minimized when c=ρ.

To conclude, we have argued that, in the sense of unbiased estimates having low variance, the best estimate is simply the one based on the full bivariate model. Should one choose to use a univariate reduced model, then the estimate based on the conditional distribution is the next best choice. The latter has many desirable properties and for probe sets having high correlation coefficients, it is quite similar to the estimate based on the Li–Wong reduced model.

Appendix

In this appendix we give the variances of the different estimators mentioned in the article.

1. The estimate based on the full model:
$var [{\hat{θ}}_{i}] = \frac{σ^{2} (1 - ρ^{2})}{\sum_{j} ϕ_{j}^{2} + 2 (1 - ρ) \sum_{j} α_{j} (α_{j} + ϕ_{j})}$ (6)
2. The estimate based on the conditional model
$var [{\hat{θ}}_{i}] = \frac{σ^{2} (1 - ρ^{2})}{\sum_{j} (ϕ_{j} + α_{j} (1 - ρ))}$ (7)
3. The estimate based on the reduced Li–Wong model
$var [{\hat{θ}}_{i}] = \frac{2 σ^{2} (1 - ρ)}{\sum_{j} ϕ_{j}^{2}}$ (8)
4. The transformed average difference
$\frac{2 σ^{2} (1 - ρ)}{J \bar{ϕ}}$ (9)

It is now not so difficult to show that (6)⩽(7)⩽(8)⩽(9).

Bibliographie

[1] R.A. Irizarry; B. Hobbs; F. Collin; Y.D. Beazer-Barclay; K.J. Antonellis; U. Scherf; T.P. Speed Exploration, normalization, and summaries of high-density oligonucleotide array probe level data, Biostatistics, Volume 4 (2002) no. 2, pp. 249-264

[2] R.A. Irizarry; B.M. Bolstad; F. Collin; L.M. Cope; B. Hobbs; T.P. Speed Summaries of Affymetrix GeneChip probe-level data, Nucleic Acids Res., Volume 31 (2003) no. 4

[3] C. Li; W.H. Wong Model based analysis of oligonucleotide arrays: expression index computation and outlier detection, Proc. Natl Acad. Sci. USA, Volume 98 (2001), pp. 31-36

[4] C. Li; W.H. Wong Model-based analysis of oligonucleotide arrays (II): model validation, design issues and standard error application, Genome Biol., Volume 2 (2001) no. 8, pp. 1-11

[5] F. Naef; D.A. Lim; N. Patil; M.O. Magnasco From features to expression: high-density oligonucleotide array analysis revisited, 2001 (LANL e-print) | arXiv

[6] B. Efron, R. Tibshirani, J.D. Storey, V. Tusher, Empirical Bayes analysis of a microarray experiment, Technical Report, Division of Biostatistics, Stanford University, 2001, JASA, in press

[7] W.J. Lemon; J.J.T. Palatini; R. Krahe; F.A. Wright Theoretical and experimental comparisons of gene expression indexes for oligonucleotide arrays, Bioinformatics, Volume 18 (2002) no. 11, pp. 1470-1476

[8] R.A. Johnson; D.W. Wichern Applied Multivariate Statistical Analysis, Prentice-Hall, 1998

[9] W.J. Lemon et al., 2001, in preparation

Commentaires - Politique