sigReannot: an oligo-set re-annotation pipeline based on similarities with the Ensembl transcripts and Unigene clusters
© Casel et al; licensee BioMed Central Ltd. 2009
Published: 16 July 2009
Microarrays are a powerful technology for monitoring tens of thousands of genes in a single experiment. Most microarrays now use oligo-sets. The design of the oligo-nucleotides is time consuming and error prone. Genome-wide microarray oligo-sets are designed using as large a set of transcripts as possible in order to monitor as many genes as possible. Depending on the state of genome sequencing and assembly, the knowledge of the existing transcripts can be very different. This knowledge evolves with the different genome builds and gene builds. Once the design is done, the microarrays are often used for several years. The biologists working in EADGENE expressed the need for up-to-date annotation files for the oligo-sets they share, including information about the orthologous genes of model species, the Gene Ontology, the corresponding pathways and the chromosomal location.
Compared to the initial annotations, the results of SigReannot on a chicken microarray used in the EADGENE project show that 23% of the oligo-nucleotide gene annotations were not confirmed, 2% were modified and 1% were added. The value of this up-to-date annotation procedure is demonstrated through the re-analysis of previously published real data.
SigReannot uses the oligo-nucleotide design procedure criteria to validate the probe-gene link and the Ensembl transcripts as reference for annotation. It therefore produces a high quality annotation based on reference gene sets.
Our knowledge of genome and transcriptome structures is evolving quickly. The underlying idea of expression microarrays is that each probe of a slide monitors a corresponding biological element of the transcriptome. The design is a key step of the microarray creation process. As presented by Le Brigand et al., several constraints have to be taken into account when choosing an oligo-nucleotide for a given gene. Specificity is the most important factor because it certifies the link with the element to be monitored. The second constraint is technical: the oligo-nucleotide has to be stable during the experiment and must not fold into a stable structure which would be unable to hybridize with the corresponding transcript. The last criterion, used for species for which a large set of expressed sequence tags is available, is the occurrence of the oligo-nucleotide within these tags. This eliminates oligo-nucleotides designed on transcripts which have never or seldom been observed. The design process uses as input a set of unique sequences representing the transcriptome of the studied species. These sequences are usually cleaned of low complexity areas in order to lower the probability of cross-hybridization. The probe selection software then determines the specific sub-parts of each sequence on which the design can be performed. The software produces a number of candidate probes with the corresponding quality values. The set of input sequences has to be chosen carefully in order to represent the transcriptome as fully as possible. The quality of the selected set largely depends on the knowledge available on the studied genome. This knowledge is closely linked to the genome assembly state and to the number and variety of transcript sequences available.
New genome assemblies are produced regularly thanks to the new sequences produced in the finishing process. For each new assembly a new gene build is performed in order to locate the different transcripts on the genome and link them to a given gene. New gene builds are also produced on stable assemblies when enough new annotation is available. Finally, each gene is also annotated by the sub-group of biologists interested in the corresponding function and can gain, lose or have its annotation modified. All these processes can impact the probe annotation: regularly re-annotating the oligo-sets is therefore highly relevant. This is the aim of sigReannot.
The results will be given on the complete oligo-set and also on a subset of 791 oligo-nucleotides which have been observed as over- or under-expressed in an experiment conducted by J.M.J. Rebel from Wageningen University (unpublished).
In Figure 2b, the KEGG annotation version of the EADGENE chicken oligo-nucleotide set was used to re-analyse previously published data (Desert et al.) corresponding to the gene clusters down- or up-regulated after 16 h of fasting compared to the fed state in chicken liver. Three additional KEGG pathways with a minimum of 3 genes associated (see Desert et al. for their selection) were found for these two gene clusters. These are "Glycolysis.Gluconeogenesis", "Galactose.metabolism" and "Pyruvate.metabolism" for the down-regulated gene clusters and "Pentose.phosphate.pathway", "Fructose.and.mannose.metabolism" and "Alanine.and.aspartate.metabolism" for the down- and up-regulated gene clusters respectively.
The pipeline chains three steps. The first step tries to link each oligo-nucleotide to a given gene, the second step retrieves from different sources functional annotation using the gene identifier and the last step formats the data in several files corresponding to the biologists' needs.
Step 1: Linking each oligo-nucleotide to a gene
The aim is to verify whether the design criteria are still met. The specificity of the oligo-nucleotide is verified by aligning it against a set of existing transcripts. Transcript files are produced by the Ensembl and the NCBI teams for most sequenced genomes. SigReannot uses two Ensembl transcript files (available from ftp://ftp.ensembl.org/) containing all known cDNAs and non-coding RNAs.
The association is based simply on similarity criteria which can be calculated from a BLAST output. These criteria have been determined experimentally by correlating similarity values with hybridization results. SigReannot uses the criteria given by Liebich et al., Kane et al. and He et al. Two alignment criteria are taken into account to link an oligo-nucleotide to a transcript. The first one is the longest contiguous stretch. As soon as this stretch is longer than 15 base pairs for 50-mers (20 bp for 70-mers), a low quality link, or noise link, is registered between the oligo-nucleotide and the transcript. The second one is the global identity percentage: if this criterion is higher than 85%, a high quality link, or good hit, is registered between the oligo-nucleotide and the transcript. This percentage is computed by dividing the number of nucleotides matching between the transcript and the oligo-nucleotide by the length of the oligo-nucleotide.
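The two criteria above can be illustrated with a minimal Python sketch. It is not the SigReannot implementation: the function names, the representation of an alignment as two gapped strings of equal length, and the rule that a good hit takes precedence over a noise link are assumptions made for the example. Only the thresholds (15 bp for 50-mers, 20 bp for 70-mers, 85% identity) come from the text.

```python
from typing import Optional

# Thresholds quoted in the text: the longest contiguous stretch must exceed
# 15 bp for 50-mers or 20 bp for 70-mers to register a noise link.
STRETCH_THRESHOLD = {50: 15, 70: 20}
# Global identity percentage above which a good hit is registered.
IDENTITY_THRESHOLD = 85.0


def longest_contiguous_stretch(oligo_aln: str, transcript_aln: str) -> int:
    """Length of the longest run of identical, ungapped aligned positions."""
    best = run = 0
    for a, b in zip(oligo_aln, transcript_aln):
        if a == b and a != "-":
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def identity_percentage(oligo_aln: str, transcript_aln: str, oligo_length: int) -> float:
    """Matching nucleotides divided by the full oligo length, as a percentage."""
    matches = sum(1 for a, b in zip(oligo_aln, transcript_aln) if a == b and a != "-")
    return 100.0 * matches / oligo_length


def classify_link(oligo_aln: str, transcript_aln: str, oligo_length: int) -> Optional[str]:
    """Return 'good hit', 'noise' or None (no link) for one alignment.

    Assumption: a good hit takes precedence over a noise link.
    Only the oligo lengths listed in STRETCH_THRESHOLD are supported.
    """
    if identity_percentage(oligo_aln, transcript_aln, oligo_length) > IDENTITY_THRESHOLD:
        return "good hit"
    if longest_contiguous_stretch(oligo_aln, transcript_aln) > STRETCH_THRESHOLD[oligo_length]:
        return "noise"
    return None
```

For instance, a 50-mer matching a transcript over its first 16 bases only has an identity of 32% and a contiguous stretch of 16 bp, so this sketch would register a noise link; with 15 matching bases no link would be registered.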
To find more oligo-gene links, the pipeline also uses the results of sequence similarity searches against the Unigene clusters. Ensembl uses stringent thresholds in the gene build process and often produces short UTRs. Manual probe re-annotation has shown that in some cases the probe was designed in an area very close to, but outside of, the UTR region selected by Ensembl for the transcript, or in an intron. This comes first from the sequences selected to design the probes, which are often ESTs presenting splice differences, and second from the high weight of the specificity criterion in the probe selection process. In order to maximize the number of probes with annotation, the pipeline checks whether a probe has a similarity with a unique Unigene cluster and whether this cluster can be uniquely linked to an Ensembl gene. In this case the pipeline extracts an extended region (1000 bp up and downstream) around the transcript to locate the probe. If these steps succeed, the oligo-nucleotide is linked to the corresponding gene and its category is updated. All oligo-nucleotides from categories 3 to 6 undergo this processing step (see Figure 1b for the impact). Once each oligo-nucleotide is classified, probes from classes 1 to 4 are functionally annotated.
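The rescue logic described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary mapping Unigene clusters to Ensembl genes, the identifiers in the usage example and the use of 1-based genomic coordinates are hypothetical. In particular, the text does not say how the probe is located in the extended region; simple coordinate containment is used here as a stand-in.

```python
from typing import Dict, Optional, Set, Tuple

FLANK = 1000  # bp added up and downstream of the transcript, as stated in the text


def rescue_gene(probe_clusters: Set[str],
                cluster_to_genes: Dict[str, Set[str]]) -> Optional[str]:
    """Return the Ensembl gene when the probe hits exactly one Unigene cluster
    and that cluster is linked to exactly one Ensembl gene; otherwise None."""
    if len(probe_clusters) != 1:
        return None
    (cluster,) = probe_clusters
    genes = cluster_to_genes.get(cluster, set())
    if len(genes) != 1:
        return None
    (gene,) = genes
    return gene


def extended_region(start: int, end: int, flank: int = FLANK) -> Tuple[int, int]:
    """Extend a transcript interval by `flank` bp on both sides (1-based, clamped at 1)."""
    return max(1, start - flank), end + flank


def probe_in_region(probe_start: int, probe_end: int, region: Tuple[int, int]) -> bool:
    """True when the probe lies entirely inside the region (assumed containment test)."""
    return region[0] <= probe_start and probe_end <= region[1]
```

A probe hitting only the hypothetical cluster `Gga.1`, itself linked only to the hypothetical gene `GENE_A`, would be rescued if it falls within 1000 bp of the transcript, for example at positions 7500 to 7549 for a transcript spanning 5000 to 7000.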
Step 2: retrieving annotation using the Ensembl API (Application Programming Interface)
Once an oligo-nucleotide is linked to a gene, the Ensembl API is used to fetch the corresponding human, mouse and rat orthologous genes (Ensembl gene ID, HGNC and its description), the GO identifiers for each Gene Ontology category (ID, class, evidence code and definition) and the external references (database_name and the xref ID). Then, using the human HGNC ID of the Ensembl gene, other annotations are fetched from the KEGG database:
Pathway ID and description.
KEGG genes: KO ID, EC, gene ID and definition for the annotated species, human, mouse and rat.
This information is stored in a local MySQL database.
Step 3: data formatting
Once all the annotation is stored in the database, the aim is to extract it into a user-friendly format for biologists. At their request, these data are exported to comma-separated files, which are commonly opened in a spreadsheet. Using these data, SigReannot also generates correspondence matrices linking each oligo-nucleotide, in rows, to its GO terms, in columns. The cell at the intersection of a row and a column equals one if the oligo-nucleotide has this annotation, and zero if not.
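As an illustration of such a correspondence matrix, the following Python sketch builds a 0/1 table from a mapping of oligo-nucleotides to GO terms and serializes it as comma-separated text. The function names, the `oligo_id` header and the alphabetical ordering of rows and columns are assumptions made for the example; the actual layout of the SigReannot files may differ.

```python
import csv
import io
from typing import Dict, List, Set


def correspondence_matrix(annotations: Dict[str, Set[str]]) -> List[list]:
    """Build a header row plus one 0/1 row per oligo (rows: oligos, columns: GO terms)."""
    terms = sorted({term for oligo_terms in annotations.values() for term in oligo_terms})
    rows: List[list] = [["oligo_id"] + terms]
    for oligo in sorted(annotations):
        rows.append([oligo] + [1 if term in annotations[oligo] else 0 for term in terms])
    return rows


def to_csv(rows: List[list]) -> str:
    """Serialize the rows as comma-separated text, one line per row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

An oligo-nucleotide without any GO term simply gets a row of zeros, so every probe of the set stays visible in the spreadsheet.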
In the current version of SigReannot eight files are provided.
For the EADGENE oligo-sets, these files can be downloaded from the network website (see the "EADGENE Oligo Set Annotation Files" entry in the reference list).
The oligo-set used in this paper was designed in 2005 by the Roslin Institute and contains 20 460 oligo-nucleotides. The chicken oligo-nucleotides were designed against a mixed panel of ESTs from Genbank/EMBL, Ensembl release 30 genes and transcripts, UMIST chicken ChEST cDNAs, miRBase RNAs and contributed sequences. The initial annotation file can be downloaded from: ftp://ftp.ark-genomics.org/Chicken_oligos/.
One element which has been thoroughly discussed with the users and with other teams working on tools with the same aim is the impact of the alignment strand on the annotation. Some probes of the EADGENE oligo-sets have obviously been designed on the opposite strand of the gene. With the usual transcript extraction protocol these probes should show no signal because the transcripts should not hybridize. However, some of these probes are measured as under- or over-expressed in experiments. This may be the result of hybridization with a part of the genome which is not sequenced yet, or else the result of antisense transcription of these genes. More and more evidence supports the conclusion that quite a lot of transcripts also have antisense expression. Therefore SigReannot annotates probes with the corresponding gene and the strand. Manual analysis of the localization of the probes on the genome showed that some of them were designed in intronic regions. This comes from the fact that the splicing machinery does not always perform in the same way. For these probes it is possible to calculate another quality criterion, which would be the ratio of unspliced ESTs over spliced ESTs at this location. This criterion would express the probability of monitoring the expression of the gene using this probe. With a large number of ESTs from different conditions it would be possible to compute this criterion for each condition. Finally, the binding free energy criterion often mentioned in oligonucleotide design papers is not used in this version of SigReannot.
Because microarray oligo-set design is expensive and time consuming, and because the biologist community is willing to share its results, oligo-sets are often used for several years. During this time, the genome assembly quality and the amount of annotation increase. These elements explain why biologists are interested in updated annotation for existing oligo-sets. The main novelty of sigReannot is to provide biologists with quality criteria about the annotation, letting them decide how to exploit it. Even if the microarray technique is being questioned with the arrival of the new sequencing technologies, pipelines like SigReannot will be relevant infrastructures to link long SAGE tags with the corresponding transcripts.
We thank Pieter Neerincx and Haisheng Nie from Wageningen University. We thank our users for their feedback: LI Jiang, Frédéric Lecerf, Gwenola Tosser and Yannick Faulconnier. We also thank EADGENE, which funded this project.
This article has been published as part of BMC Proceedings Volume 3 Supplement 4, 2009: EADGENE and SABRE Post-analyses Workshop. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S4.
- Le Brigand K, Russell R, Moreilhon C, et al: An open-access long oligonucleotide microarray resource for analysis of the human and mouse transcriptomes. Nucleic Acids Res. 2006, 34: e87. 10.1093/nar/gkl485.
- Desert C, Duclos MJ, Blavy P, Lecerf F, Moreews F, Klopp C, Aubry M, Herault F, Le Roy P, Berri C, Douaire M, Diot C, Lagarrigue S: Transcriptome profiling of the feeding-to-fasting transition in chicken liver. BMC Genomics. 2008, 9: 611. 10.1186/1471-2164-9-611.
- Flicek P, Aken BL, Beal K, et al: Ensembl 2008. Nucleic Acids Res. 2008, 36: D707-D714. 10.1093/nar/gkm988.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Liebich J, Schadt CW, Chong SC, He Z, Rhee S-K, Zhou J: Improvement of oligonucleotide probe design criteria for functional gene microarrays in environmental applications. Appl Environ Microbiol. 2006, 72: 1688-1691. 10.1128/AEM.72.2.1688-1691.2006.
- Kane MD, Jatkoe TA, Stumpf CR, Lu J, Thomas JD, Madore SJ: Assessment of the sensitivity and specificity of oligonucleotide (50 mer) microarrays. Nucleic Acids Res. 2000, 28: 4552-4557. 10.1093/nar/28.22.4552.
- He Z, Wu L, Li X, Fields MW, Zhou J: Empirical establishment of oligonucleotide probe design criteria. Appl Environ Microbiol. 2005, 71: 3753-3760. 10.1128/AEM.71.7.3753-3760.2005.
- Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology. Nucleic Acids Res. 2003, 31: 28-33. 10.1093/nar/gkg033.
- EADGENE Oligo Set Annotation Files. [http://www.eadgene.info/TheProject/Integration/BiologicalresourcesandfacilitiesWP11/EADGENEOligoSetsAnnotationFiles/tabid/324/Default.aspx]
- 't Hoen PAC, Ariyurek Y, Thygesen HH, Vreugdenhil E, Vossen RHAM, de Menezes RX, Boer JM, van Ommen G-JB, den Dunnen JT: Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Res. 2008, 36: e141. 10.1093/nar/gkn705.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.