Sanger institute


About the Sanger Institute


The Sanger Institute is a genome research institute primarily funded by the Wellcome Trust. We use large-scale sequencing, informatics and analysis of genetic variation to further our understanding of gene function in health and disease and to generate data and resources of lasting value to biomedical research.

Access the Bioinformatics Software


The following bioinformatics software was developed by the Sanger Institute to analyze finished and submitted production data:

  • Alfresco
    • The aim is to develop a new visualisation tool that allows effective comparative genome sequence analysis. The program will compare multiple sequences from putatively homologous regions in different species. Results from various existing analysis programs, such as gene prediction, protein homology and regulatory sequence prediction programs, will be visualised and used to find corresponding sequence domains.
  • Alien hunter
    • An application for the prediction of putative Horizontal Gene Transfer (HGT) events using Interpolated Variable Order Motifs (IVOMs).
  • Angler
    • A browser of C. elegans embryo development in time and space
  • Artemis
    • DNA sequence viewer and annotation tool
  • Cdna_db
    • cdna_db is a software system designed for quality-control checking of finished cDNA clone sequences and for their computational analysis. The combination of a relational database (MySQL) schema and an object-oriented Perl API makes it easy to implement high-level analyses of these transcript sequences.
  • DAS
    • The Wellcome Trust Sanger Institute provides support for the Distributed Annotation Systems via a range of different projects, websites and applications. This information resource provides an overview of these.
    • The Distributed Annotation System (DAS) is frequently used to openly exchange biological annotations between distributed sites. Data distribution, performed by DAS servers, is separated from visualization, which is done by DAS clients.
  • DECIPHER
    • DECIPHER tracks submicroscopic duplications and deletions of DNA in patients together with phenotypes exhibited by those patients. DECIPHER tallies these genetic abnormalities with genes and other features of interest in the affected areas. The aim of DECIPHER is to provide a research tool to aid clinical diagnosis and treatment of these conditions. DECIPHER makes use of DAS technology to integrate with Ensembl, the world's leading genome browser.
  • Doublescan
    • Doublescan is a program for comparative ab initio prediction of protein coding genes in mouse and human DNA.
  • Eponine
    • A computational method for detecting mammalian transcription start sites
  • Est_db
    • The est_db package is a software suite and database system designed to support expressed sequence tag (EST) sequencing projects.
  • FINEX
    • The FINEX program allows sequence homology searching techniques to be applied where the sequence data is replaced with a fingerprint abstracted from the intron/exon boundary phase and the exon length.
    • Please note FINEX is no longer supported but is available for download.
  • GAZE
    • GAZE is a tool for the integration of gene prediction signal and content sensor information into complete gene structures. It is completely configurable: both the signal and content data themselves and the model of gene structure against which assemblies are validated and scored are external to the system and supplied by the user.
  • Hexamer
    • Hexamer is a program to scan DNA sequences to look for likely coding regions. The principle is to use 6mers, but to avoid deriving any information from base composition. Therefore, the frequencies of each 6mer are normalized by dividing by the total frequency of all 6mers with the same base composition.
  • Illuminus
    • Illuminus is a fast and accurate algorithm for assigning single nucleotide polymorphism (SNP) genotypes to microarray data from the Illumina BeadArray technology.
  • LogoMat-M
    • Profile Hidden Markov Models (pHMMs) are a widely used tool for protein family research. We present a method to visualize all of their central aspects graphically, thus generalizing the concept of sequence logos introduced by Schneider and Stephens. For each emitting state of the pHMM, we display a stack of letters. As for sequence logos, the stack height is determined by the deviation of the position's letter emission frequencies from the background frequencies of the letters. As a new feature, the stack width now visualizes both the probability of reaching the state (the hitting probability) and the expected number of letters the state emits during a pass through the model (the expected contribution).
  • LogoMat-P
    • The problem of profile-profile comparison has a long history but has received a lot of attention recently. This is a result of the growing number of well characterised protein families in databases such as Pfam. By adding additional information about properties of the entire family, it has been shown that profile-profile methods significantly increase sensitivity compared to profile-sequence comparison.
    • The availability of advanced profile-profile comparison tools such as PRC or HHsearch demands sophisticated visualisation tools that are not presently available. We introduce an approach built upon the concept of HMM Logos. The method illustrates the similarities of pairs of protein family profiles in an intuitive way.
  • LookSeq
    • LookSeq is a web-based application for alignment visualization, browsing and analysis of genome sequence data.
    • LookSeq supports multiple sequencing technologies, alignment sources, and viewing modes; low or high-depth read pileups; and easy visualization of putative single nucleotide and structural variation. The visible range, from whole chromosome to single base resolution, can be set manually or by scrolling or zooming the display with fast, on-the-fly rendering from the server-side alignment database. LookSeq uses a universal database for alignments of different sequencing technologies and algorithms. Sequence data from multiple sources can be viewed separately or aligned in a single display, facilitating direct comparison between datasets. LookSeq can also link to relevant external sites such as PubMed and other online analysis tools, via buttons or double-clicking on the displayed sequence annotation.
  • MAPTAG
    • MAPTAG is an informatics tool that annotates batches of unknown sequences to the mouse genome, and assigns gene ID where possible through sequence match to the Ensembl database system. It consists of several linked sequence search modules that are controlled by a management script. Sequences are submitted for analysis in simple FASTA format, and resulting data is generated in tab-delimited text files that can be manually or automatically imported into relational databases or PC-based desktop analysis packages such as Microsoft Excel.
  • Margarita
    • Margarita infers genealogies from population genotype data and uses these to map disease loci.
    • These genealogies take the form of the Ancestral Recombination Graph (ARG). The ARG defines a genealogical tree for each locus, and as one moves along the chromosome the topologies of consecutive trees shift according to the impact of historical recombination events. There are two stages to the analysis. First, we infer plausible ARGs using a heuristic algorithm, which can handle unphased and missing data. Second, we test the genealogical tree at each locus for a clustering of the disease cases beneath a branch. Since the true ARG is unknown, we average this analysis over an ensemble of inferred ARGs.
  • NestedMICA
    • NestedMICA is a method for discovering over-represented short motifs in large sets of strings. Typical applications include finding candidate transcription factor binding sites in DNA sequences.
  • PAjHMMA
    • PAjHMMA is a Parameter Adjustable Java Hidden Markov Model Architecture.
    • This is a toolkit for generalised hidden Markov model (GHMM) decoding. The user specifies a DNA sequence in FASTA format and a hidden Markov model as a text file. The text file contains the parameters for the HMM: the states, the emission frequencies within each state, and the transition probabilities between the states. Certain states may have distinctive length distributions that the user may wish to specify. Each such length distribution should be written to a text file, and the path to that file included in the State specification.
  • PSILC
    • Pseudogene inference from Loss of Constraint
  • Pfam
    • The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
  • ProServer
    • ProServer is a very lightweight DAS server written in Perl. It is simple to install and configure and has existing adaptors for a wide variety of data sources. It is also easily extensible, allowing adaptors to be written for other data sources.
  • Projector
    • Projector is a program for the comparative, homology based prediction of protein coding genes in mouse and human DNA. Projector takes the known genes of one DNA sequence and predicts the corresponding genes in an evolutionarily related DNA sequence.
  • QuickTree
    • QuickTree is a program for the rapid reconstruction of phylogenies by the Neighbor-Joining method.
  • SCOOP
    • The SCOOP software allows the comparison of families of proteins. SCOOP stands for Simple Comparison Of Outputs Program. The software provides an alternative method to profile-profile comparison. The method is conceptually simple yet provides results that are comparable with other state of the art tools.
  • Ssaha
    • SSAHA (Sequence Search and Alignment by Hashing Algorithm) is an algorithm for very fast matching and alignment of DNA sequences. It achieves its fast search speed by encoding sequence information in a perfect hash function.
  • Ssaha2
    • The SSAHA2 package combines the SSAHA searching algorithm with the cross_match sequence alignment program developed by Phil Green at the University of Washington. The SSAHA algorithm is used to identify regions of high similarity, which are then aligned using a banded Smith-Waterman algorithm. Parameters can be tuned via a number of command line options for a wide range of applications:
      • mapping of sequence reads (Solexa, ABI, Sanger) to a genomic reference sequence
      • polymorphism detection
      • cross-species whole genome alignment
      • EST/cDNA alignment
      • BACends placement
      • database searching of reads and fragments
      • mapping of segmental duplications
      • primer design
  • SsahaEST
    • ssahaEST is a software tool for very fast matching and alignment of ESTs/cDNAs to genomic DNA sequences. It uses the same core algorithms as ssaha2 for sequence matching and contains an implementation of the Smith-Waterman sequence alignment code from cross_match, developed by Phil Green at the University of Washington. Hits produced by the alignment algorithm are clustered into potential transcripts, and the coordinates of exons are adjusted using several splice site models to produce spliced alignments.
  • SsahaSNP
    • ssahaSNP is a polymorphism detection tool. It detects homozygous SNPs and indels by aligning shotgun reads to the finished genome sequence. Highly repetitive elements are filtered out by ignoring k-mer words with high occurrence numbers. Less repetitive and non-repetitive reads are placed uniquely on the reference genome sequence; if there are multiple seeded regions, the best alignment is chosen according to the pairwise alignment score. From the best alignment, SNP candidates are screened using the neighbourhood quality standard (NQS), which takes into account the quality value of the variant base as well as the quality values of the neighbouring bases. For insertions/deletions, we check whether the same indel is supported by more than one read, so that indels are reported with high confidence.
  • StrataSplice
    • StrataSplice is a human splice site predictor that combines local GC content with a first-order dependence weight array model to predict human splice sites.
  • Tctool
    • This program allows the curator to visually adjust the gene tree topology and recalculate a score which reflects both how well the topology explains the sequence alignment and (optionally) how closely the topology agrees with the species tree. This score is proportional to the log of the maximum (over all possible branch lengths for the gene tree) of the product of two probability terms: the likelihood of the gene tree given the sequence alignment; and a conditional probability of the gene tree given the species tree, which is derived from a probabilistic model of gene duplication and loss. The second term penalizes gene duplication and loss, thereby allowing the curator to trade off reductions in the number of gene duplications and losses in the tree against decreases in the likelihood term. The curator has the option of curating the tree purely on the basis of the likelihood term, which is equivalent to not penalizing gene duplications or losses. In the scoring step, the curator can allow all branch lengths in the gene tree to be either unconstrained or clock-like; or alternatively can require the gene tree branch lengths to be 'tied' to the species tree, while allowing the species tree branch lengths to be either unconstrained or clock-like. To perform the likelihood calculations, tctool provides the curator with the choice of several common nucleotide, codon, and amino-acid models of evolution.
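
The normalisation described under Hexamer (dividing each 6mer frequency by the total frequency of all 6mers with the same base composition) can be illustrated with a short Python sketch. This is not the Hexamer source code; the function names `composition` and `normalised_kmer_freqs` are hypothetical and chosen only for this example.

```python
from collections import Counter

def composition(kmer):
    """Base composition key: counts of A, C, G and T, ignoring order."""
    return tuple(kmer.count(base) for base in "ACGT")

def normalised_kmer_freqs(seqs, k=6):
    """Count k-mers, then divide each count by the total count of all
    k-mers sharing the same base composition (illustrative sketch)."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip words containing N etc.
                counts[kmer] += 1
    # Total count per base-composition class.
    totals = Counter()
    for kmer, n in counts.items():
        totals[composition(kmer)] += n
    return {kmer: n / totals[composition(kmer)] for kmer, n in counts.items()}
```

Because every word is compared only with words of identical composition, a sequence that is simply GC-rich does not score as coding on that basis alone.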
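
The idea behind SSAHA, encoding fixed-length words so that each maps to a unique table key and then looking up query words in that table, can be sketched as follows. This is a simplified illustration under stated assumptions, not the SSAHA implementation: it indexes every overlapping word of the reference, uses 2 bits per base as the encoding, and the names `encode`, `build_index` and `seed_hits` are hypothetical.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Map a k-mer to a unique integer (2 bits per base).
    Returns None if the word contains a letter other than A, C, G, T."""
    value = 0
    for base in kmer:
        if base not in CODE:
            return None
        value = (value << 2) | CODE[base]
    return value

def build_index(reference, k):
    """Table from encoded k-mer to the list of reference positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        key = encode(reference[pos:pos + k])
        if key is not None:
            index[key].append(pos)
    return index

def seed_hits(query, index, k):
    """Return sorted (diagonal, query_pos, ref_pos) tuples, where
    diagonal = ref_pos - query_pos. Runs of hits on one diagonal
    suggest a region worth aligning in full."""
    hits = []
    for qpos in range(len(query) - k + 1):
        key = encode(query[qpos:qpos + k])
        if key is None:
            continue
        for rpos in index.get(key, ()):
            hits.append((rpos - qpos, qpos, rpos))
    hits.sort()
    return hits
```

For example, `seed_hits("ACGT", build_index("ACGTACGT", 4), 4)` reports one hit on diagonal 0 and one on diagonal 4, corresponding to the two occurrences of the word in the reference.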
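
The neighbourhood quality standard (NQS) screening mentioned under SsahaSNP can be sketched as a simple filter on per-base quality values. The thresholds below (variant base quality at least 20, five flanking bases on each side with quality at least 15) are assumed illustrative defaults and are not taken from the ssahaSNP documentation; the function name `passes_nqs` is likewise hypothetical.

```python
def passes_nqs(quals, pos, min_variant_q=20, min_flank_q=15, flank=5):
    """Return True if the base at `pos` and its neighbours meet the
    quality thresholds (assumed values, for illustration only).

    quals: list of per-base quality values for one read
    pos:   0-based index of the candidate variant base in the read
    """
    # Require a full neighbourhood on both sides of the candidate base.
    if pos - flank < 0 or pos + flank >= len(quals):
        return False
    if quals[pos] < min_variant_q:
        return False
    neighbours = quals[pos - flank:pos] + quals[pos + 1:pos + flank + 1]
    return all(q >= min_flank_q for q in neighbours)
```

A candidate near the end of a read is rejected here because its neighbourhood is incomplete; a real tool might choose to handle read ends differently.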

The website also contains external links to: