OligoRAP – an Oligo Re-Annotation Pipeline to improve annotation and estimate target specificity
BMC Proceedings volume 3, Article number: S4 (2009)
High throughput gene expression studies using oligonucleotide microarrays depend on the specificity of each oligonucleotide (oligo or probe) for its target gene. However, target-specific probes can only be designed when a reference genome of the species at hand has been completely sequenced, when this genome has been completely annotated and when the genetic variation of the sampled individuals is completely known. Unfortunately there is not a single species for which such a complete data set is available. Therefore, it is important that probe annotation can be updated frequently for optimal interpretation of microarray experiments.
In this paper we present OligoRAP, a pipeline to automatically update the annotation of oligo libraries and estimate oligo target specificity. OligoRAP uses a reference genome assembly with Ensembl and Entrez Gene annotation supplemented with a set of unmapped transcripts derived from RefSeq and UniGene to handle assembly gaps. OligoRAP produces alignments of each oligo with the reference assembly as well as with unmapped transcripts. These alignments are re-mapped to the annotation sources, which results in a concise, as complete as possible and up-to-date annotation of the oligo library. The building blocks of this pipeline are BioMoby web services creating a highly modular and distributed system with a robust, remote programmatic interface.
OligoRAP was used to update the annotation for a subset of 791 oligos from the ARK-Genomics 20 K chicken array, which were selected as starting material for the oligo annotation session of the EADGENE/SABRE Post-analysis workshop. Based on the updated annotation about one third of these oligos are problematic with regard to target specificity. In addition, for almost half of the oligos the accession numbers or ids they were originally designed for no longer exist in the updated annotation.
As microarrays are designed on incomplete data, it is important to update probe annotation and check target specificity regularly. OligoRAP provides both and, due to its design based on BioMoby web services, it can easily be embedded as an oligo annotation engine in customised applications for microarray data analysis. The dramatic difference in updated annotation and target specificity for the ARK-Genomics 20 K chicken array as compared to the original data emphasises the need for regular updates.
DNA microarray technology has evolved rapidly to become the most popular platform for high throughput gene expression analysis, as it allows biologists to measure the expression of entire transcriptomes at relatively high speed and low cost. This makes microarrays ideal for applications like sample clustering/fingerprinting, genome annotation, detection of differential gene expression, detection of polymorphisms and re-sequencing [1, 2]. Microarrays contain oligonucleotides (probes) that can hybridise with the labelled reverse complement of mRNA. Since the probes are immobilised on the surface of an array and it is known which probes are located where on the array, signal at a certain spot can be used as a measure for gene expression. This requires that probes are unique for their target genes and hence optimal microarray design requires 1) a completely sequenced reference genome, 2) complete annotation for this reference genome to know what parts may be expressed and 3) complete knowledge about the natural variation amongst the sampled individuals.
Unfortunately there is currently not a single species for which such complete information is available. Although some reference genomes are now close to completion, annotation of these reference genomes as well as information on how individuals differ from these reference genomes is far from complete. Hence, microarray design is currently sub-optimal even for species with a rather complete reference genome. Probe design based on incomplete or erroneous data can lead to serious problems like non-specific probes causing cross hybridisation, orphan probes designed for non-existing targets, missing probes and misleading probes due to erroneous annotation.
Therefore, it is important to update the annotation for arrays regularly to improve the functional annotation of the targets as well as the reliability of probe-target assignments. Several tools have been developed for this purpose [3–12], but these provide limited annotation, require complicated local installations with many dependencies, do not scale well or do not support our species of interest. We have developed OligoRAP (Oligo ReAnnotation Pipeline) to overcome these issues.
The pipeline consists of 5 steps: I. Convert oligo library data into BioMoby objects, II. Align oligos with a reference genome assembly and with a set of unmapped transcripts (UMTs), III. Analyse oligo annotation, IV. Analyse oligo quality and V. Make summary charts (see Figure 1). Implementation details are described and illustrated in Additional files 1, 2, 3, 4, 5, 6. In this section we will only focus on the key advantages of OligoRAP.
Firstly, OligoRAP does not rely solely on a reference genome or solely on transcripts (or sequences derived thereof), but uses both where possible. For the genome OligoRAP uses reference assemblies and annotation as provided by the Ensembl project. Ensembl was chosen as primary annotation source, because it is the largest and richest resource of its kind with support for most popular model species in the animal kingdom. In addition to reference assemblies OligoRAP uses a set of unmapped transcripts (UMTs) to get a more complete picture. The UMT set contains RefSeq and UniGene entries, which failed to map to the reference assembly. Where available, annotation derived from Ensembl (for hits on the genome) and from RefSeq or UniGene (for hits on UMTs) can be expanded with links to Entrez Gene and GO. The combination of reference genome supplemented with UMTs provides optimally complete annotation for well-annotated species whilst keeping redundancy at a minimum. At the same time this strategy is flexible enough to support less well-annotated species even if there is no reference assembly available. In that case all of a species' transcripts simply become part of the UMT set.
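The rule for assembling the UMT set can be illustrated with a minimal Python sketch. This is not OligoRAP's own code: the function name `build_umt_set`, its arguments and the transcript identifiers are hypothetical and serve only to show the logic described above, including the fallback for species without a reference assembly.

```python
from typing import Optional, Set


def build_umt_set(transcripts: Set[str], mapped: Optional[Set[str]]) -> Set[str]:
    """Return the unmapped transcripts (UMTs) for a species.

    transcripts: RefSeq/UniGene identifiers known for the species.
    mapped:      identifiers that map to the reference assembly,
                 or None when no reference assembly is available.
    """
    if mapped is None:
        # No assembly: every transcript becomes part of the UMT set.
        return set(transcripts)
    # With an assembly: keep only transcripts that failed to map.
    return transcripts - mapped


# Hypothetical identifiers, for illustration only.
transcripts = {"NM_0001", "NM_0002", "Gga.1234"}
with_assembly = build_umt_set(transcripts, mapped={"NM_0001"})
without_assembly = build_umt_set(transcripts, mapped=None)
print(sorted(with_assembly))
print(sorted(without_assembly))
```

Keeping mapped transcripts out of the UMT set is what limits redundancy: an oligo that hits a mapped transcript is already covered by its hit on the genome.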
Secondly, OligoRAP provides annotation for all hits instead of only for the best hit. This allows OligoRAP to provide not only updated annotation, but also oligo target specificity based on the number and type of hits. OligoRAP can differentiate between primary hits (high hybridisation potential) and secondary hits (low hybridisation potential). Hybridisation potential is determined using three filters, which users can adjust based on their experimental setup. Based on their target specificity, oligos are divided into six target specificity classes (TSCs): 1. Gene-specific probes with maximum signal potential, 2. Gene-specific probes with reduced signal potential, 3. Non-specific probes with maximum signal potential, 4. Non-specific probes with mixed signal potential, 5. Non-specific probes with reduced signal potential and 6. Orphan probes with background signal potential.
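A minimal Python sketch of how hits might be turned into a TSC is given below. The decision rules are an assumption inferred from the class names only; the actual rules and the three hybridisation filters are described in the Additional files and are not reproduced here. The `Hit` type, the `classify` function and the rule that an oligo is gene-specific when all of its hits fall in a single gene are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    gene: str      # gene (or locus) the hit was annotated with
    primary: bool  # True = high hybridisation potential, False = low


def classify(hits: List[Hit]) -> int:
    """Assign a target specificity class (1-6) under the assumed rules."""
    if not hits:
        return 6  # orphan: no hits, background signal only
    genes = {h.gene for h in hits}
    any_primary = any(h.primary for h in hits)
    any_secondary = any(not h.primary for h in hits)
    if len(genes) == 1:
        # Gene-specific: maximum signal if at least one primary hit.
        return 1 if any_primary else 2
    if any_primary and any_secondary:
        return 4  # non-specific, mixed signal potential
    return 3 if any_primary else 5  # non-specific, all primary / all secondary


print(classify([Hit("geneA", True)]))
print(classify([Hit("geneA", True), Hit("geneB", False)]))
print(classify([]))
```

Because the classification only needs the gene label and the primary/secondary flag of each hit, the same function could be applied to transcriptome-based and genome-based hit lists alike.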
Finally, each of the steps is implemented as one or more web services, which were built using the BioMoby framework [17, 18]. These web services provide remote programmatic access and can be glued together using a variety of BioMoby clients like the Taverna Workbench or custom code built with the BioMoby Perl or Java framework. Using web services we created a highly customisable and modular annotation pipeline with a robust interface. This allows OligoRAP to be embedded in microarray data analysis workflows for improved scalability without tedious local installations suffering from complex dependencies.
Results and discussion
OligoRAP was used to update annotation and target specificity for the subset of 791 oligos from the ARK-Genomics 20 K chicken array (see methods in Additional file 1). Figure 2 shows how these oligos are divided over OligoRAP's target specificity classes (TSCs) with transcriptome-based target specificity (TbTS) in Figure 2A and genome-based target specificity (GbTS) in Figure 2B.
Transcriptome-based versus genome-based target specificity
Until recently the transcriptome of higher eukaryotes was thought to comprise a very small subset of the genome. For example, in Ensembl 50 less than 5% of the chicken genome is annotated as exon. Since only potentially expressed sequences can hybridise to probes on a microarray, most oligo design and annotation efforts have focused on known and/or predicted transcripts without taking the rest of the genome into account. Apart from a few structural elements like the centromeres and telomeres it is still not clear what the function of the other 95% or more of DNA is, but evidence is slowly piling up that the size of the transcriptome is vastly underestimated. In particular, the pilot phase of the ENCODE project showed that the human "genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts". It remains unclear whether all these transcripts are biologically functional or whether they just represent noise, but it is clear that all transcripts can potentially hybridise with the oligos on microarrays. Therefore it is probably more appropriate to evaluate target specificity in the context of the entire genome rather than only in the context of what is currently annotated as transcriptome.
Looking at TbTS and GbTS for the 791 ARK-Genomics chicken oligos, the total share of gene-specific oligos differs by only 2.3 percentage points, with 69.5% and 67.2%, respectively. Hence taking the entire genome into account as compared to looking only at the transcriptome does not lead to a dramatic decrease of gene-specific probes. Unfortunately about one third of the probes are not gene-specific. For these problematic probes the TbTS and GbTS pictures look quite different.
For most of the oligos it is extremely difficult to verify their predicted target specificity, except for the orphan oligos of TSC 6. The 791 oligos selected as starting material for this EADGENE/SABRE workshop were picked because they show a high differential signal on the microarrays. Hence these oligos clearly bind labelled cDNA derived from one or more target genes, but OligoRAP classifies 3.5% and 16.1% of the oligos as orphans with GbTS and TbTS, respectively. These numbers indicate that OligoRAP's TSC assignments are currently more an indicator for the relatively immature status of the chicken genome assembly and its annotation than for target specificity.
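The percentages quoted above can be combined in a short worked calculation. The sketch below assumes that the six TSCs partition the 791 oligos, so that the share of non-specific oligos (TSC 3 to 5) is whatever remains after subtracting the gene-specific and orphan shares; that remainder is derived here and is not a figure reported in the text.

```python
# Percentages reported in the text for the 791 oligos.
gene_specific = {"TbTS": 69.5, "GbTS": 67.2}  # TSC 1 + 2
orphans = {"TbTS": 16.1, "GbTS": 3.5}         # TSC 6

# Assumption: the six classes partition the oligo set, so the
# remainder is the share of non-specific oligos (TSC 3 to 5).
non_specific = {
    mode: round(100.0 - gene_specific[mode] - orphans[mode], 1)
    for mode in gene_specific
}
not_gene_specific = {
    mode: round(100.0 - gene_specific[mode], 1) for mode in gene_specific
}
difference = round(gene_specific["TbTS"] - gene_specific["GbTS"], 1)

for mode in ("TbTS", "GbTS"):
    print(mode, "non-specific:", non_specific[mode],
          "not gene-specific:", not_gene_specific[mode])
print("difference in gene-specific share:", difference)
```

Under that assumption, the oligos that are orphans with TbTS largely reappear as non-specific oligos with GbTS, which is one way to read the statement that the two pictures look quite different for the problematic probes.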
Furthermore, for almost half of the oligos, the sequence identifier they were originally designed for is no longer present in their updated annotation, which is indicated with "target changed" in Figure 2. The fact that these identifiers no longer link to these oligos does not necessarily mean that the oligo no longer represents expression of the same gene as before, but it does indicate at least major changes in the annotation. On the other hand, annotation associated with certain identifiers may have received considerable "minor" updates that kept the sequence identifier intact. Hence, the large number of oligos with changed targets is still an underestimate of the total amount of changed annotation.
Although the ENCODE pilot study covered only approximately 1% of the human genome, it is clear that our view on the transcriptome will change dramatically over the next years. This will have a big impact on oligo annotation and target specificity, making it more important than ever to be able to update oligo annotation quickly and regularly. In addition to regular updates of the data, annotation pipelines like OligoRAP will need to be updated too, to adapt the annotation strategies to our changing insights in gene expression.
Microarray probes are designed on incomplete data. Therefore it is important to update probe annotation and estimate target specificity regularly. OligoRAP provides such functionality for Ensembl species and can easily be embedded in customised applications for microarray data analysis due to its design based on BioMoby web services. The rather high number of oligos with changed targets shows the importance of updated annotation and reflects the limited amount and quality of the annotation available at the time the ARK-Genomics 20 K chicken array was designed.
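The "target changed" flag amounts to a set membership test, sketched below in Python. The function name, the data layout and the identifiers are hypothetical, and treating an oligo without any updated identifiers as changed is an assumption of this sketch rather than a documented OligoRAP rule.

```python
from typing import Dict, Set, Tuple


def target_changed(original_id: str, updated_ids: Set[str]) -> bool:
    """True when the identifier an oligo was designed for is absent
    from the identifiers in its updated annotation."""
    return original_id not in updated_ids


# Hypothetical library: oligo name -> (original id, ids in updated annotation).
library: Dict[str, Tuple[str, Set[str]]] = {
    "oligo_1": ("ID_A", {"ID_A", "ID_X"}),
    "oligo_2": ("ID_B", {"ID_Y"}),
    "oligo_3": ("ID_C", set()),  # no hits at all
    "oligo_4": ("ID_D", {"ID_D"}),
}

changed = sorted(
    name for name, (orig, new) in library.items() if target_changed(orig, new)
)
fraction_changed = len(changed) / len(library)
print(changed, fraction_changed)
```

A test like this only detects identifiers that disappeared. It cannot see annotation that was revised under an unchanged identifier, which is why the count of changed targets is a lower bound on the amount of changed annotation.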
ZIP-archive containing the final results of the OligoRAP pipeline run as well as all intermediate results. See included README for details.
Heller MJ: DNA microarray technology: devices, systems, and applications. Annu Rev Biomed Eng. 2002, 4: 129-153. 10.1146/annurev.bioeng.4.020702.153438.
Lee NH, Saeed AI: Microarrays: an overview. Methods Mol Biol. 2007, 353: 265-300.
Gautier L, Moller M, Friis-Hansen L, Knudsen S: Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics. 2004, 5: 111. 10.1186/1471-2105-5-111.
Liu H, Zeeberg BR, Qu G, Koru AG, Ferrucci A, Kahn A, Ryan MC, Nuhanovic A, Munson PJ, Reinhold WC, Kane DW, Weinstein JN: AffyProbeMiner: a web resource for computing or retrieving accurately redefined Affymetrix probe sets. Bioinformatics. 2007, 23: 2385-2390. 10.1093/bioinformatics/btm360.
Dai H, Tian B, Zhao WD, Leung A, Smith SR, Wan JS, Yao X: Dynamic integration of gene annotation and its application to microarray analysis. J Bioinform Comput Biol. 2004, 1: 627-645. 10.1142/S0219720004000387.
Roche FM, Hokamp K, Acab M, Babiuk LA, Hancock RE, Brinkman FS: ProbeLynx: a tool for updating the association of microarray probes to genes. Nucleic Acids Res. 2004, 32: W471-4. 10.1093/nar/gkh452.
Chalifa-Caspi V, Yanai I, Ophir R, Rosen N, Shmoish M, Benjamin-Rodrig H, Shklar M, Stein TI, Shmueli O, Safran M, Lancet D: GeneAnnot: comprehensive two-way linking between oligonucleotide array probesets and GeneCards genes. Bioinformatics. 2004, 20: 1457-1458. 10.1093/bioinformatics/bth081.
Ferrari F, Bortoluzzi S, Coppe A, Sirota A, Safran M, Shmoish M, Ferrari S, Lancet D, Danieli GA, Bicciato S: Novel definition files for human GeneChips based on GeneAnnot. BMC Bioinformatics. 2007, 8: 446. 10.1186/1471-2105-8-446.
Kossenkov A, Manion FJ, Korotkov E, Moloshok TD, Ochs MF: ASAP: automated sequence annotation pipeline for web-based updating of sequence information with a local dynamic database. Bioinformatics. 2003, 19: 675-676. 10.1093/bioinformatics/btg056.
Diehn M, Sherlock G, Binkley G, Jin H, Matese JC, Hernandez-Boussard T, Rees CA, Cherry JM, Botstein D, Brown PO, Alizadeh AA: SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res. 2003, 31: 219-223. 10.1093/nar/gkg014.
Zhang J, Carey V, Gentleman R: An extensible application for assembling annotation for genomic data. Bioinformatics. 2003, 19: 155-156. 10.1093/bioinformatics/19.1.155.
Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, et al: Ensembl 2008. Nucleic Acids Res. 2008, 36: D707-14. 10.1093/nar/gkm988.
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, et al: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008, 36: D13-21. 10.1093/nar/gkm1000.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
Neerincx PB, Leunissen JA: Evolution of web services in bioinformatics. Brief Bioinform. 2005, 6: 178-188. 10.1093/bib/6.2.178.
Wilkinson MD, Links MD: BioMOBY: an open source biological web services proposal. Brief Bioinform. 2002, 3: 331-341. 10.1093/bib/3.4.331.
Wilkinson MD, Senger M, Kawas E, Bruskiewich R, Gouzy J, Noirot C, Bardou P, et al: Interoperability with Moby 1.0 it's better than sharing your toothbrush!. Brief Bioinform. 2008, 9: 220-231. 10.1093/bib/bbn003.
Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A, Li P: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004, 20: 3045-3054. 10.1093/bioinformatics/bth361.
Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.
Stabenau A, McVicker G, Melsopp C, Proctor G, Clamp M, Birney E: The Ensembl core software libraries. Genome Res. 2004, 14: 929-933. 10.1101/gr.1857204.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.
Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, Kober KM, Miller W, Pedersen JS, Pohl A, Raney BJ, Rhead B, Rosenbloom KR, Smith KE, Stanke M, Thakkapallayil A, Trumbower H, Wang T, Zweig AS, Haussler D, Kent WJ: The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 2008, 36: D773-9. 10.1093/nar/gkm966.
The authors would like to thank Harm Nijveen for his feedback on the manuscript, Arun Kommadath for his work on an early prototype and Christophe Klopp, Dennis Pricket, Micheal Watson and Pierrot Casel for fruitful discussions on the topic of updating probe annotation.
This publication was funded by EADGENE http://www.eadgene.info
HR and TMB were funded by i) the Virtual Laboratory e-Science project http://www.vl-e.nl supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and the ICT innovation program of the Ministry of Economic Affairs (EZ); ii) BioRange and BioAssist programs of the Netherlands Bioinformatics Centre (NBIC). PBTN, HN, MAMG and JAML were partially funded by EADGENE.
This article has been published as part of BMC Proceedings Volume 3 Supplement 4, 2009: EADGENE and SABRE Post-analyses Workshop. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S4.
The authors declare that they have no competing interests.
PBTN designed and programmed the pipeline of web services and drafted the manuscript. HR conceived the pipeline, participated in its design and helped to improve the manuscript. HN helped with data analysis for debugging and helped to improve the manuscript. TMB, MAMG and JAML secured funding, managed the project and helped to improve the manuscript. All authors read and approved the final manuscript.