Transcriptome Platform Genomic Service Biology department Ecole Normale Superieure

Pre-treatment of expression matrices coming from DNA microarray results

Version 1.6

This tutorial can also be downloaded as a PDF file (922.9 Kb).

DNA microarray experiments that share a common reference can be combined into experiment series, in which transcript abundance is followed across various experimental conditions (see table below). These series make it possible to follow transcriptome dynamics and thus to identify genes whose expression is co-regulated. Such gene groups often correspond to proteins that belong to the same functional complex.

          Experiment 1     Experiment 2     Experiment 3     ...  Experiment j
Gene 1    log2(Ratio 1,1)  log2(Ratio 1,2)  log2(Ratio 1,3)  ...  log2(Ratio 1,j)
Gene 2    log2(Ratio 2,1)  log2(Ratio 2,2)  log2(Ratio 2,3)  ...  log2(Ratio 2,j)
Gene 3    log2(Ratio 3,1)  log2(Ratio 3,2)  log2(Ratio 3,3)  ...  log2(Ratio 3,j)
...       ...              ...              ...              ...  ...
Gene i    log2(Ratio i,1)  log2(Ratio i,2)  log2(Ratio i,3)  ...  log2(Ratio i,j)

The expression matrix displayed above with different experiments can also be built using replicates of the same experiment.
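
As an illustration, such a matrix can be held in memory as one list of log2 ratios per gene. The sketch below is only an assumption about a convenient layout: the ratio values are made up, None stands for a missing measurement, and the names `ratios` and `to_log2` are hypothetical.

```python
import math

# Hypothetical raw ratios: one row per gene, one value per experiment.
experiments = ["Experiment 1", "Experiment 2", "Experiment 3"]
ratios = {
    "Gene 1": [2.0, 0.5, None],   # None marks a missing value
    "Gene 2": [1.0, 4.0, 0.25],
}

def to_log2(ratio_matrix):
    """Convert every ratio to log2(ratio), leaving missing values untouched."""
    return {
        gene: [None if r is None else math.log2(r) for r in row]
        for gene, row in ratio_matrix.items()
    }

log_matrix = to_log2(ratios)
```

With log2 values, a two-fold increase (+1) and a two-fold decrease (-1) are symmetric around zero, which is what the centring step relies on.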

Centring and scaling:

In order to compare several DNA microarray experiments, it is necessary to centre and scale the data of each experiment. Centring consists in bringing the mean value of each experiment's distribution to zero. Be careful: this step is meaningful only if the expression of most genes in the experiment is unchanged. Applying a Lowess normalisation step to the data yields a value distribution that is already centred on zero.

You can verify that the data are well centred by performing the following steps:

  1. Open the file containing the experiment series (your expression matrix) in Excel, using the tab character as the column separator.

  2. For one column (corresponding to one DNA microarray experiment), calculate the mean value using the Excel AVERAGE function. Verify that the value obtained is equal to zero.

  3. If it is not, subtract the corresponding mean value from each log2(Ratio) value of the experiment. Be careful with missing values (empty cells): replace empty contents by the NULL or NA string, in order to avoid introducing a zero value into the Excel calculation for this cell. Indeed, a missing value is different from a true zero!

  4. Once this operation has been done, verify that the final mean value is equal to zero, in order to catch Excel handling errors. Be careful with the decimal separator used by your Excel version (dot or comma)!
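
The same centring check can be sketched outside Excel. The following Python example is a minimal sketch, not the platform's own procedure: it assumes one experiment column stored as a list, with None standing for a missing value, and the function name `centre_column` is hypothetical.

```python
from statistics import mean

def centre_column(values):
    """Subtract the column mean from every observed value.

    None marks a missing value: it is ignored when computing the mean
    and stays missing in the result (it is never treated as zero).
    """
    present = [v for v in values if v is not None]
    column_mean = mean(present)
    return [None if v is None else v - column_mean for v in values]

column = [1.0, 2.0, None, 3.0]   # hypothetical log2(Ratio) values
centred = centre_column(column)

# As in step 4: the mean of the centred column should be (almost) zero.
check = mean(v for v in centred if v is not None)
```

Because of floating-point rounding, the check is best done against a small tolerance rather than against exactly zero.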

The scaling step removes the differences in variation between experiments by bringing their standard deviations to the same value. Be careful: this procedure assumes that the majority of genes do not have a modified expression during the experiments. It is very important to keep this restriction in mind when using scaling. The method is usually applied to replicates of the same experiment.

Practically, here is the procedure to follow:

  1. For each column (corresponding to one DNA microarray experiment), calculate the standard deviation. You are now working with the centred column calculated previously (mean subtracted).

  2. Divide each log2(Ratio) value of the experiment by the corresponding standard deviation. Once more, be careful with missing values (empty cells): replace empty contents by the NULL or NA string, so that Excel does not use a zero value in its calculation for this cell.

  3. Once the calculation is over, verify that the standard deviation of the column is equal to one, in order to catch Excel handling errors. To keep the final values from being too different from the real ratios, it is also possible, in addition to the division, to multiply the values by the mean of the experiments' standard deviations.
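
The scaling procedure can be sketched in Python as follows. This is an illustration under stated assumptions: the columns are already centred, None marks a missing value, the sample standard deviation (n - 1 denominator) is used because the tutorial does not say which estimator to take, and the names `observed` and `scale_column` are hypothetical.

```python
from statistics import mean, stdev

def observed(column):
    """Return only the non-missing values of a column."""
    return [v for v in column if v is not None]

def scale_column(column, rescale=1.0):
    """Divide a centred column by its standard deviation.

    `rescale` optionally multiplies the result, for example by the mean
    standard deviation of all experiments (step 3).
    """
    sd = stdev(observed(column))  # sample standard deviation (assumption)
    return [None if v is None else v / sd * rescale for v in column]

# Hypothetical centred columns, one per experiment.
centred = {
    "exp1": [-1.0, 0.0, None, 1.0],
    "exp2": [-2.0, 0.0, 2.0, None],
}

# Plain scaling: every column ends up with a standard deviation of one.
scaled = {name: scale_column(col) for name, col in centred.items()}

# Optional variant: bring every column to the mean standard deviation instead.
mean_sd = mean(stdev(observed(col)) for col in centred.values())
rescaled = {name: scale_column(col, rescale=mean_sd) for name, col in centred.items()}
```

In the second variant all experiments still share the same standard deviation, but the values stay on a scale closer to the original log2 ratios.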

Pre-treatment script usage (GEPAS):

GEPAS is an online tool (http://transcriptome.ens.fr/gepas/ - local mirror) that can perform both the verification and the pre-treatment of an experiment series before the classification of its expression profiles.

The “pre-processing” step of GEPAS gives a report on the experiment series that includes:

  • The number of genes
  • The number of experiments
  • An evaluation of missing values
  • The identification of replicated lines…
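
To give an idea of what such a report contains, here is a minimal Python sketch that computes comparable figures on a small matrix. It does not reproduce the GEPAS report itself: the function name `matrix_report`, the field names and the data are assumptions made for illustration.

```python
from collections import Counter

def matrix_report(rows):
    """Summarise an expression matrix.

    rows: list of (gene_id, values) pairs, where values is a list of
    log2 ratios and None marks a missing value.
    """
    ids = [gene for gene, _ in rows]
    counts = Counter(ids)
    return {
        "lines": len(rows),
        "distinct_genes": len(counts),
        "experiments": len(rows[0][1]) if rows else 0,
        "missing_values": sum(v is None for _, values in rows for v in values),
        "replicated_ids": sorted(g for g, n in counts.items() if n > 1),
    }

# Hypothetical matrix with one replicated gene and two missing values.
rows = [
    ("geneA", [0.1, None, 0.3]),
    ("geneB", [0.2, 0.4, None]),
    ("geneA", [0.0, 0.1, 0.2]),
]
report = matrix_report(rows)
```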

Missing values management:

As you may have noticed during image analysis, the intensities calculated for some spots could not be used. This leads to missing values in the result file. Management of missing values is a delicate problem for which various solutions are proposed.

GEPAS gives you four different ways to handle missing values:

  • Replace them with zero
  • Replace them with the mean of the row
  • Replace them with the median of the row
  • Use the KNNimpute method for estimation
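
The first three options are simple enough to sketch in a few lines of Python. The example below is only an illustration of the idea, not GEPAS code: the function name `fill_missing` and its `method` argument are hypothetical, and None marks a missing value.

```python
from statistics import mean, median

def fill_missing(profile, method="mean"):
    """Replace the missing values (None) of one gene profile.

    method: "zero", "mean" (row mean) or "median" (row median).
    The row statistics are computed on the observed values only.
    """
    present = [v for v in profile if v is not None]
    if method == "zero":
        fill = 0.0
    elif method == "mean":
        fill = mean(present)
    elif method == "median":
        fill = median(present)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in profile]

profile = [1.0, None, 2.0, 6.0]   # hypothetical log2(Ratio) profile
with_zero = fill_missing(profile, "zero")
with_mean = fill_missing(profile, "mean")
with_median = fill_missing(profile, "median")
```

On this profile the row mean (3.0) and the row median (2.0) differ noticeably, which shows that the choice of replacement is not neutral when a profile contains an extreme value.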

This last method (Troyanskaya et al., 2001) currently seems to be the most efficient. To prevent this treatment from having too much impact on the subsequent analyses, genes whose profiles contain more than 20-30% of missing values are usually excluded from the analysis.

Reference: Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001 Jun;17(6):520-5.
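
The general idea of a nearest-neighbour estimation can be sketched as follows. This is a deliberately simplified illustration, not the published KNNimpute algorithm nor the GEPAS implementation: it assumes a Euclidean-type distance computed on the experiments observed in both profiles and a plain (unweighted) average of the neighbours, and all function names are hypothetical.

```python
import math

def missing_fraction(profile):
    """Fraction of missing values (None) in one gene profile."""
    return sum(v is None for v in profile) / len(profile)

def distance(a, b):
    """Root mean squared difference over experiments observed in both profiles."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no experiment in common: profiles cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def knn_impute(matrix, k=15, max_missing=0.3):
    """Drop profiles with too many missing values, then fill the others.

    matrix: dict mapping gene id -> list of values (None = missing).
    """
    kept = {g: p for g, p in matrix.items() if missing_fraction(p) <= max_missing}
    filled = {}
    for gene, profile in kept.items():
        new_profile = list(profile)
        for j, value in enumerate(profile):
            if value is not None:
                continue
            # Neighbours must have an observed value in experiment j.
            candidates = []
            for other, other_profile in kept.items():
                if other == gene or other_profile[j] is None:
                    continue
                d = distance(profile, other_profile)
                if d is not None:
                    candidates.append((d, other_profile[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # otherwise the value stays missing
                new_profile[j] = sum(v for _, v in nearest) / len(nearest)
        filled[gene] = new_profile
    return filled

# Hypothetical matrix: g1 has one missing value, g5 is mostly missing.
matrix = {
    "g1": [1.0, 2.0, None, 0.5],
    "g2": [1.0, 2.0, 3.0, 0.5],
    "g3": [1.1, 2.1, 3.2, 0.6],
    "g4": [-5.0, -4.0, -6.0, -3.0],
    "g5": [None, None, 1.0, None],
}
result = knn_impute(matrix, k=2)
```

Here the missing value of g1 is estimated from its two closest profiles (g2 and g3), while g5 is excluded because 75% of its values are missing.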

Replicated lines management:

When a gene is found in several copies in an expression matrix, it is necessary to merge these lines into one.

GEPAS offers two methods:

  • Merge replicates using mean values
  • Merge replicates using median values
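
Both merging methods can be illustrated with a short Python sketch. The function name `merge_replicates` and the data are hypothetical, and the handling of missing values (a merged cell is computed from the observed replicates only) is an assumption rather than documented GEPAS behaviour.

```python
from collections import defaultdict
from statistics import mean, median

def merge_replicates(rows, statistic=mean):
    """Merge lines that share the same gene id, experiment by experiment.

    rows: list of (gene_id, values) pairs; None marks a missing value.
    statistic: `mean` or `median`, applied to the observed replicates.
    """
    grouped = defaultdict(list)
    for gene, values in rows:
        grouped[gene].append(values)

    merged = {}
    for gene, profiles in grouped.items():
        merged_profile = []
        for column in zip(*profiles):  # one tuple of replicates per experiment
            present = [v for v in column if v is not None]
            merged_profile.append(statistic(present) if present else None)
        merged[gene] = merged_profile
    return merged

# Hypothetical matrix in which geneA appears three times.
rows = [
    ("geneA", [1.0, 2.0, None]),
    ("geneA", [3.0, None, None]),
    ("geneA", [8.0, 4.0, None]),
    ("geneB", [0.5, 0.5, 0.5]),
]
by_mean = merge_replicates(rows, mean)
by_median = merge_replicates(rows, median)
```

The median is less sensitive than the mean to a single outlying replicate, as the first experiment of geneA shows (4.0 with the mean, 3.0 with the median).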

Filter invariant profiles:

Filtering is used to remove genes whose expression does not change significantly across an experiment series. These genes are of little interest for the analysis, and removing them simplifies the subsequent classification.

GEPAS offers various filtering methods that eliminate the genes whose profiles remain within an invariant zone. This zone can be more or less wide, depending on the desired filtering stringency. For more details on the various filtering methods used by GEPAS, you can refer to the online manual.
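
One way to picture an invariant zone is a threshold on a summary of each profile, such as its root mean square (RMS) or its standard deviation. The Python sketch below illustrates this idea only: the threshold value, the function names and the exact definitions are assumptions, and the GEPAS manual remains the reference for what the tool actually computes.

```python
import math
from statistics import stdev

def rms(profile):
    """Root mean square of the observed values of a profile."""
    present = [v for v in profile if v is not None]
    return math.sqrt(sum(v * v for v in present) / len(present))

def spread(profile):
    """Sample standard deviation of the observed values of a profile."""
    return stdev(v for v in profile if v is not None)

def filter_flat(matrix, threshold, measure=rms):
    """Keep only the profiles whose measure reaches the threshold."""
    return {g: p for g, p in matrix.items() if measure(p) >= threshold}

# Hypothetical log2(Ratio) profiles.
matrix = {
    "flat": [0.1, -0.1, 0.05, None],
    "changing": [1.5, -2.0, 0.3, 0.8],
}
kept_rms = filter_flat(matrix, threshold=0.5, measure=rms)
kept_sd = filter_flat(matrix, threshold=0.5, measure=spread)
```

Raising the threshold widens the invariant zone and removes more genes; it is therefore worth noting how many profiles survive each setting.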

GEPAS practical usage:

  1. The GEPAS file format is a classical expression matrix. The # character marks comment lines and the header line. The first cell, before the column/experiment names, must contain the identifier “#NAMES”. Missing values can be left empty or replaced by the NULL or NA string. Only one column (the first one) is used for annotation. The file must be saved as a text (.txt) file with tabs as column separators and the dot as decimal separator. Be careful: if you performed calculations in Excel, keep only the results of these calculations and not the formulas, and then replace any “#VALEUR!” error value (“#VALUE!” in English versions of Excel) with NA.

  2. Once the file is selected in the web interface, launch the “pre-analyse” step (in the “tools” section) of the pre-processing script; it enables GEPAS to read the file and detect possible problems (missing values, replicates…). In our case, GEPAS has correctly detected the header line with the experiment names. It has also detected missing values (NULL) and duplicate lines. Once the pre-analysis is done, GEPAS recommends various pre-processing steps that you can apply to your data.



  3. GEPAS also lets you visualize your data graphically. The “histogram of values” shows the distribution of all ratio values in your matrix. This helps you check that your data are well centred and whether they follow a Gaussian curve (the normality hypothesis used in statistical tests). The “histogram of number of missing values by pattern” is also very helpful, as it helps you decide how many missing values you can tolerate in an expression profile.



  4. The form page lets you choose the modifications to apply to your data. For example, you can select “merge replicates”, “filter missing values” (70 %) and “impute missing values” (keeping the default parameter, K = 15), then launch the pre-processing. GEPAS first merges all lines found several times, using the selected method. It then eliminates profiles containing more than 30% of missing values, that is, profiles in which fewer than 70 % of the values are present (this threshold can be adapted using the previous histogram). For the remaining profiles, it uses the KNNimpute method to estimate missing values. It is always better to perform these operations in several steps rather than all at once, so that you can save the intermediate matrices and verify what the software has done.





  5. Save the resulting file so that you can use it later for the clustering steps. Note the number of genes retained after all the filtering steps. Run the pre-analysis step one last time to make sure that all the options you selected have been correctly applied and that no replicated lines or missing values remain.

  6. For some clustering methods, it is necessary to remove expression profiles considered as invariant. GEPAS lets you apply several “significant” filters (number of peaks, root mean square and standard deviation) to remove invariant profiles. You can refer to the GEPAS manual for details about each filtering method. In practice, you can try the “Filter Flat Pattern” option using the RMS method. You should always work on the pre-processed file obtained at step 4.

  7. Save several result files, one for each filter you apply. These different versions can be used as input files for clustering software.
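
The file format described in step 1 can also be produced by a small script instead of by hand. The following Python sketch is an illustration based only on the rules stated above (a “#NAMES” first cell, tab separators, dot decimal separator, NA for missing values); the function names `parse_cell` and `write_gepas_matrix`, the number of decimals and the data are assumptions.

```python
import io

# Cell contents treated as missing, including Excel error values.
MISSING_MARKERS = {"", "NA", "NULL", "#VALUE!", "#VALEUR!"}

def parse_cell(text):
    """Convert one exported cell to a float, or None if it is missing.

    A decimal comma (some Excel locales) is converted to a dot.
    """
    text = text.strip()
    if text in MISSING_MARKERS:
        return None
    return float(text.replace(",", "."))

def write_gepas_matrix(handle, experiment_names, rows, missing="NA"):
    """Write a tab-separated expression matrix with a #NAMES header line."""
    handle.write("#NAMES\t" + "\t".join(experiment_names) + "\n")
    for gene, values in rows:
        # f-string formatting always uses the dot as decimal separator.
        cells = [missing if v is None else f"{v:.4f}" for v in values]
        handle.write(gene + "\t" + "\t".join(cells) + "\n")

# Hypothetical cells as they might come out of a spreadsheet export.
raw_rows = [
    ("geneA", ["0,5", "#VALEUR!", "-1.25"]),
    ("geneB", ["", "2", "NULL"]),
]
rows = [(gene, [parse_cell(c) for c in cells]) for gene, cells in raw_rows]

buffer = io.StringIO()  # stands in for open("matrix.txt", "w")
write_gepas_matrix(buffer, ["exp1", "exp2", "exp3"], rows)
content = buffer.getvalue()
```

Writing to a real file only requires replacing the `io.StringIO()` buffer with an opened text file; the name `matrix.txt` in the comment is just a placeholder.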

Useful links:

  • GEPAS is an online tool for the analysis, pre-treatment and filtering of expression matrices before applying other methods such as the search for differentially expressed genes or the clustering of co-expressed genes. Website: http://transcriptome.ens.fr/gepas/ - local mirror.




Last page update: 9/7/2011 - 11:28