Pre-treatment of expression matrices coming from DNA microarray results
Version 1.6
This tutorial can be downloaded
as a PDF file (922.9 Kb).
DNA microarray experiments that share a common reference can
be combined into experiment series in which transcript levels are followed
across various experimental conditions (see table below). These series
make it possible to follow transcriptome dynamics and thus to identify genes
whose expression is co-regulated. Such gene groups often correspond to proteins
that belong to the same functional complex.
| | Experiment 1 | Experiment 2 | Experiment 3 | ... | Experiment j |
| --- | --- | --- | --- | --- | --- |
| Gene 1 | log2(Ratio 1,1) | log2(Ratio 1,2) | log2(Ratio 1,3) | ... | log2(Ratio 1,j) |
| Gene 2 | log2(Ratio 2,1) | log2(Ratio 2,2) | log2(Ratio 2,3) | ... | log2(Ratio 2,j) |
| Gene 3 | log2(Ratio 3,1) | log2(Ratio 3,2) | log2(Ratio 3,3) | ... | log2(Ratio 3,j) |
| ... | ... | ... | ... | ... | ... |
| Gene i | log2(Ratio i,1) | log2(Ratio i,2) | log2(Ratio i,3) | ... | log2(Ratio i,j) |
The expression matrix displayed above with different experiments
can also be built using replicates of the same experiment.
Centring and scaling:
In order to compare several DNA microarray experiments, it
is necessary to apply a centring and scaling step to the data of each experiment.
Centring consists in shifting the distribution of each experiment so that its
mean value equals zero. Be careful: this step is meaningful only if the expression
of most genes in the experiment is unchanged. Applying a Lowess normalisation step
to the data yields a distribution of values that is already centred on zero.

You can check that the data are properly centred by performing
the following steps:
- Open the file containing the experiment series (your expression matrix)
in Excel, using the tab character as the column separator.
- For one column (corresponding to one DNA microarray experiment), calculate
the mean value using the Excel AVERAGE function. Check that the value obtained
is equal to zero.
- If it is not, subtract the corresponding mean value from each log2(Ratio)
value of the experiment. Be careful with missing values (empty cells): replace
the empty content with the string NULL or NA, so that Excel does not treat the
cell as a zero in its calculation. A missing value is not the same as a true
zero!
- Once this has been done, check that the final mean value is equal
to zero, in order to catch any Excel handling errors. Also pay attention
to the decimal separator used by your version of Excel (dot or comma)!
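
For readers who prefer a script to a spreadsheet, the centring steps above can be sketched in Python as follows. This is only an illustrative sketch: the function name `centre_columns`, the use of `None` to mark a missing value and the example numbers are assumptions, not part of the original procedure.

```python
from statistics import mean


def centre_columns(matrix):
    """Subtract each column's mean from its values.

    `matrix` is a list of rows (genes); each row is a list of log2(ratio)
    values, one per experiment. None marks a missing value (assumption).
    """
    n_cols = len(matrix[0])
    col_means = []
    for j in range(n_cols):
        # Missing values are left out of the mean instead of counting as zero.
        present = [row[j] for row in matrix if row[j] is not None]
        col_means.append(mean(present))
    return [
        [None if v is None else v - col_means[j] for j, v in enumerate(row)]
        for row in matrix
    ]


# Hypothetical values: three genes (rows), two experiments (columns).
example = [
    [1.0, 0.5],
    [None, -0.5],
    [3.0, 1.5],
]
centred = centre_columns(example)
```

As in the Excel procedure, it is worth recomputing the mean of each centred column afterwards to confirm that it is zero.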
The scaling step removes differences in spread between
experiments by bringing their standard deviations to the same value.
Be careful: this process rests on the hypothesis that the expression of the
majority of genes is not modified during the experiments. It is very
important to keep this restriction on the use of scaling in mind. The method
is usually applied to replicates of the same experiment.

Practically, here is the procedure to follow:
- For each column (corresponding to one DNA microarray experiment), calculate
the standard deviation. You are now working with the centred column
calculated previously (mean subtracted).
- Divide each log2(Ratio) value of the experiment by the corresponding standard
deviation. Be careful once more with missing values (empty cells):
replace the empty content with the string NULL or NA, so that Excel does not
use a zero value in its calculation for this cell.
- Once the calculation is over, check that the standard deviation of
the column is equal to one, in order to catch any Excel handling errors. To
prevent the final ratio values from being too different from the real ones, it
is also possible, in addition to the division, to multiply the values by the
mean of the standard deviations of the experiments.
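
The scaling procedure above can be sketched in the same way. The code below assumes the columns have already been centred, uses the sample standard deviation (`statistics.stdev`, which is an assumption, since the text does not say which estimator to use), and marks missing values with `None`. The `rescale` option corresponds to the optional multiplication by the mean of the standard deviations; the function name and the example values are hypothetical.

```python
from statistics import mean, stdev


def scale_columns(centred, rescale=False):
    """Divide each centred column by its standard deviation.

    If `rescale` is True, values are also multiplied by the mean of the
    column standard deviations, so they stay close to the original range.
    None marks a missing value (assumption). A column whose values are all
    identical has a zero standard deviation and cannot be scaled this way.
    """
    n_cols = len(centred[0])
    sds = [
        stdev([row[j] for row in centred if row[j] is not None])
        for j in range(n_cols)
    ]
    factor = mean(sds) if rescale else 1.0
    return [
        [None if v is None else v / sds[j] * factor for j, v in enumerate(row)]
        for row in centred
    ]


# Hypothetical centred values: three genes (rows), two experiments (columns).
example = [
    [-1.0, -2.0],
    [0.0, None],
    [1.0, 2.0],
]
scaled = scale_columns(example)
```

After scaling without `rescale`, the standard deviation of every column should be one, which is the check recommended in the last step above.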
Pre-treatment script usage (GEPAS):
GEPAS is an online tool (http://transcriptome.ens.fr/gepas/
- local mirror) that can perform both the verification and the pre-treatment of
an experiment series before the classification of its expression profiles.
The “pre-processing” step of GEPAS gives a report
on the experiment series that includes:
- The number of genes
- The number of experiments
- An evaluation of missing values
- The identification of replicated lines…
Missing values management:
As you may have noticed during image analysis, the intensities
calculated for some spots could not be used. This leads to missing values in
the result file. Management of missing values is a delicate problem for which
various solutions are proposed.
GEPAS gives you four different ways to handle missing values:
- Replace them with zero
- Replace them with the mean of the row
- Replace them with the median of the row
- Estimate them with the KNNimpute method
The last of these methods, KNNimpute (Troyanskaya et al., 2001), currently seems
the most efficient. To prevent this treatment from having too much impact on
subsequent analyses, genes whose profiles contain more than 20-30%
of missing values are usually excluded from the analysis.
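
The exclusion rule and the two simplest replacement strategies (row mean and row median) can be illustrated with a short Python sketch. It is not the GEPAS implementation: the function names, the `None` marker for missing values, the 30% threshold chosen within the 20-30% range quoted above, and the example values are all assumptions.

```python
from statistics import mean, median


def missing_fraction(profile):
    """Fraction of missing values (None) in one gene profile."""
    return sum(v is None for v in profile) / len(profile)


def filter_profiles(matrix, max_missing=0.3):
    """Keep only the profiles with at most `max_missing` missing values."""
    return [row for row in matrix if missing_fraction(row) <= max_missing]


def impute_rows(matrix, method="mean"):
    """Replace each missing value by the mean or median of its own row."""
    stat = mean if method == "mean" else median
    filled = []
    for row in matrix:
        present = [v for v in row if v is not None]
        fill = stat(present)  # fails on a row with no values: filter first
        filled.append([fill if v is None else v for v in row])
    return filled


# Hypothetical profiles: three genes, four experiments.
example = [
    [1.0, None, 3.0, 2.0],   # 25% missing: kept
    [None, None, 0.5, 1.5],  # 50% missing: excluded
    [0.2, 0.4, 0.6, 0.8],    # complete
]
kept = filter_profiles(example, max_missing=0.3)
filled = impute_rows(kept, method="median")
```

Filtering before imputing means that no profile is reconstructed mostly from estimated values.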
Reference: Troyanskaya O, Cantor M, Sherlock
G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation
methods for DNA microarrays. Bioinformatics. 2001 Jun;17(6):520-5.
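
To give an idea of how a nearest-neighbour estimation works, here is a deliberately simplified Python sketch. It is neither the published KNNimpute code nor what GEPAS runs: the distance (root mean squared difference over the experiments measured in both genes), the unweighted average of the neighbours, the `None` marker and the function names are all simplifying assumptions.

```python
import math


def _distance(a, b):
    """Root mean squared difference over columns present in both profiles."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(matrix, k=15):
    """Fill each missing value with the average of that column's values
    in the k most similar genes (simplified, unweighted variant)."""
    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other genes measured in experiment j.
            candidates = [
                (_distance(row, other), other[j])
                for h, other in enumerate(matrix)
                if h != i and other[j] is not None
            ]
            candidates = [c for c in candidates if math.isfinite(c[0])]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # otherwise the value stays missing
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical profiles: the first gene lacks its third measurement.
example = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 2.8],
    [-2.0, -1.0, 0.0],
]
result = knn_impute(example, k=2)
```

In this example the two genes most similar to the first one are the second and third, so the missing value is estimated from their third measurements only.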
Replicated lines management:
When a gene appears several times in an expression matrix,
it is necessary to merge these lines into one.
GEPAS offers two methods:
- Merge replicates using mean values
- Merge replicates using median values
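
A merge of this kind can be sketched in Python as below. The data layout (a list of `(gene name, profile)` pairs), the `None` marker for missing values and the function name are assumptions made for illustration; GEPAS itself may handle details such as missing values differently.

```python
from statistics import mean, median


def merge_replicates(rows, method="mean"):
    """Merge lines that share a gene name, column by column.

    `rows` is a list of (gene_name, profile) pairs. Missing values (None)
    are ignored; a column missing in every replicate stays missing.
    """
    stat = mean if method == "mean" else median
    grouped = {}
    for name, profile in rows:
        grouped.setdefault(name, []).append(profile)
    merged = []
    for name, profiles in grouped.items():
        combined = []
        for column in zip(*profiles):
            present = [v for v in column if v is not None]
            combined.append(stat(present) if present else None)
        merged.append((name, combined))
    return merged


# Hypothetical matrix in which GeneA appears twice.
rows = [
    ("GeneA", [1.0, 2.0, None]),
    ("GeneB", [0.5, 0.5, 0.5]),
    ("GeneA", [3.0, None, None]),
]
merged = merge_replicates(rows, method="mean")
```

With only two replicates the mean and the median give the same result; the median becomes useful with three or more replicates, where it is less sensitive to a single outlying spot.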
Filter invariant profiles:
Filtering is used to remove genes whose expression does not change
significantly across an experiment series. These genes are of little interest
for our analysis, and removing them simplifies the subsequent classification.
GEPAS offers various filtering methods that eliminate
genes whose profiles stay within an invariant zone. This zone can be made more
or less wide according to the desired filtering stringency. For more details
on the various filtering methods used by GEPAS, you can refer to the online manual.
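
As an illustration of the idea of an invariant zone, the sketch below removes profiles whose root mean square (RMS) stays under a threshold. The exact definitions and default thresholds used by GEPAS are documented in its manual and are not reproduced here: the RMS formula, the threshold of 0.5, the `None` marker and the function names are assumptions.

```python
import math


def rms(profile):
    """Root mean square of the measured values of a profile (None = missing)."""
    present = [v for v in profile if v is not None]
    return math.sqrt(sum(v * v for v in present) / len(present))


def filter_flat(rows, threshold=0.5):
    """Keep only (gene_name, profile) pairs whose RMS reaches the threshold."""
    return [(name, profile) for name, profile in rows if rms(profile) >= threshold]


# Hypothetical profiles: one nearly flat, one clearly varying.
rows = [
    ("flat", [0.1, -0.1, 0.0]),
    ("varying", [1.0, -1.0, 2.0]),
]
kept = filter_flat(rows, threshold=0.5)
```

Raising the threshold widens the invariant zone and removes more genes; lowering it keeps more of them.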
GEPAS practical usage:
- The GEPAS file format is a classical expression matrix. The # character is used
to mark comment lines and the header. The first cell, before the column/experiment
names, must contain the identifier “#NAMES”. Missing values can
be left empty or replaced by the string NULL or NA. Only one column
(the first one) is used for annotation. The file must be saved as a text (.txt)
file with tabs as column separators and the dot as decimal separator.
If you performed calculations in Excel, take care to keep only the results
of these calculations and not the formulas, and then to replace any “#VALEUR!”
error (“#VALUE!” in English versions of Excel) with NA.
- Once the file has been selected in the web interface, launch the “pre-analyse”
(in the “tools” section) of the pre-processing script. It
enables GEPAS to read the file and detect possible problems (missing values,
replicates…). In our case, GEPAS has correctly detected the header line
with experiment names. It has detected missing values (NULL) and duplicate
lines. Once the pre-analysis is done, GEPAS recommends various pre-processing
steps that you can apply to your data.
- GEPAS also lets you visualize your data graphically.
The “histogram of values” shows the distribution
of all ratio values in your matrix. This helps you check that your data are
well centred and whether they follow a Gaussian curve (the normality hypothesis
used in statistical tests). The “histogram of number of missing values by pattern”
is also very helpful, as it helps you decide how many missing values you
can tolerate in your expression profiles.

- The form page lets you choose the modifications to apply to your data.
For example, you can select “merge replicates”, “filter
missing values” (70 %), “input missing values” (without
changing the default parameter, K = 15) and launch the pre-processing. GEPAS
merges all lines found several times using the selected method. It then eliminates
profiles containing more than 30 % of missing values, that is, profiles with
fewer than 70 % of their values present (this threshold can be adapted
using the previous histogram). For the remaining profiles, it uses the KNN
impute method to estimate the missing values. It is always better to perform
these operations in several steps rather than all at once, so that you can save
intermediate matrices and check how the software has worked at each stage.

- Back up the file obtained so that you can use it later for the clustering steps.
You can write down the number of genes that have been kept after all the
filtering steps. Run the pre-analysis step one last time to make sure
that all the options you selected have been correctly applied and that there
are no replicate lines or missing values left.
- For some clustering methods, it is necessary to remove expression profiles
considered invariant. GEPAS lets you apply several “significant”
filters (number of peaks, root mean square and standard deviation) to remove
invariant profiles. You can refer to the GEPAS manual for details about
each filtering method used in GEPAS. In practice, you can try the “Filter
Flat Pattern” option using the RMS method. You should always work on
the pre-processed file obtained after the merging, filtering and imputation
step described above.
- Back up several result files, one for each filter you apply.
These different versions can be used as input files for clustering software.
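
If your matrix is produced by a script rather than exported from Excel, the file layout described in the first step above (a “#NAMES” header cell, tabs as column separators, the dot as decimal separator, NA for missing values) can be generated with a few lines of Python. This is a sketch based only on that description and has not been validated against GEPAS; the function name, the `None` marker and the example data are assumptions.

```python
def to_gepas_text(experiment_names, rows):
    """Build a tab-separated expression matrix as a single string.

    `rows` is a list of (gene_name, profile) pairs; None marks a missing
    value and is written as NA. Python always formats floats with a dot,
    whatever the regional settings of the computer.
    """
    lines = ["\t".join(["#NAMES"] + list(experiment_names))]
    for name, profile in rows:
        cells = ["NA" if v is None else repr(float(v)) for v in profile]
        lines.append("\t".join([name] + cells))
    return "\n".join(lines) + "\n"


# Hypothetical matrix: two genes, two experiments, one missing value.
text = to_gepas_text(
    ["Exp1", "Exp2"],
    [("GeneA", [0.5, None]), ("GeneB", [-1.25, 2.0])],
)
```

The resulting string can then be written to a .txt file, for example with `open("matrix.txt", "w").write(text)`, where the file name is of course only an example.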
Useful links:
- GEPAS is an online tool for the analysis, pre-treatment and
filtering of expression matrices before applying other methods such as searching
for differentially expressed genes or clustering co-expressed genes. Website:
http://transcriptome.ens.fr/gepas/
- local mirror.