Top 5 DNA Sequencing Analysis Tools Used in the Biotech Industry, by Dr. Saptarshi Sinha*
In the era of informatics, one cannot ignore the information hidden inside stretches of biomolecules called nucleotides. The tricky part is properly understanding how such simple sequences carry intricate pieces of information. Sequence analysis first came into the picture in 1970, when Saul B. Needleman and Christian D. Wunsch published their algorithm for aligning amino acid sequences (Needleman et al. 1970). The field took on a much larger scale with the start of the Human Genome Project (Olson 1993).
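To make the alignment idea concrete, here is a minimal sketch of global alignment scoring by dynamic programming, in the spirit of Needleman and Wunsch. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not the scheme from the 1970 paper, and the sketch returns only the score rather than the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b.

    The scoring values are illustrative defaults, not taken from the paper.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

Keeping only two rows of the table is enough when just the score is needed; recovering the alignment itself would require storing the full table and tracing back through it.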
Following Needleman and Wunsch, computer scientists worldwide gradually began working on the sequence-based problems that biologists face to this day. Today, analysis of nucleotide sequences is required at every step of research. From primer design for polynucleotide amplification and gene recognition for functional studies to sequence similarity analysis for comparative genomics, sequence analysis is crucial for every domain of life, as well as for viruses. Here we discuss five DNA sequence analysis tools that revolutionized biology research.
One of the leading problems of genome analysis is recognizing genes along a stretch of nucleotide sequence. In prokaryotes, this problem is easier because their genomes consist mostly of coding regions. In eukaryotes, it is more challenging because genes contain both introns and exons. Several different algorithms exist to address the problem. Broadly, their approaches are based on either statistical properties of DNA sequences or homology-based methods.
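As a rough illustration of why the prokaryotic case is more tractable, the sketch below scans the forward strand for open reading frames (an ATG followed in frame by a stop codon). This is a naive baseline and not how any of the tools discussed here work; the function name, the `min_codons` parameter and the restriction to the forward strand are assumptions made for brevity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of forward-strand ORFs; end is exclusive.

    An ORF here runs from the first ATG in a frame to the next in-frame stop
    codon, and min_codons counts both the start and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

In eukaryotes, a scan like this breaks down because introns interrupt the reading frame, which is one reason statistical and homology-based models are needed.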
Genemark: It was developed in 1993 at the Georgia Institute of Technology by Mark Borodovsky and James McIninch (Borodovsky et al. 1993). It was the first algorithm to use a non-homogeneous Markov model to identify protein-coding and noncoding regions in prokaryotic genomes. With the help of this tool, prokaryotic genes can be classified as typical, highly typical or atypical based on multivariate codon analysis, where the atypical class corresponds to horizontally transferred genes (Médigue et al. 1991).
Genemark uses gene recognition parameters specific to the target organism. It is an open source application and is also part of the NCBI pipeline for prokaryotes.
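The idea of scoring a sequence with a codon-position-dependent (non-homogeneous) Markov model can be sketched as follows. This is a toy first-order model with add-one smoothing, not Genemark's implementation. The function names are hypothetical, input is assumed to be uppercase A/C/G/T only, and coding sequences are assumed to start at the first codon position.

```python
import math

BASES = "ACGT"


def train_markov(seqs, periods=1):
    """Estimate first-order transition probabilities P(next | prev, phase).

    periods=3 gives one table per codon position (a non-homogeneous model);
    periods=1 gives an ordinary homogeneous chain. Add-one pseudocounts are used.
    """
    counts = [{p + n: 1.0 for p in BASES for n in BASES} for _ in range(periods)]
    for seq in seqs:
        for i in range(1, len(seq)):
            counts[i % periods][seq[i - 1:i + 1]] += 1
    model = []
    for table in counts:
        probs = {}
        for p in BASES:
            total = sum(table[p + n] for n in BASES)
            for n in BASES:
                probs[p + n] = table[p + n] / total
        model.append(probs)
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities along seq (the first base is not scored)."""
    periods = len(model)
    return sum(math.log(model[i % periods][seq[i - 1:i + 1]])
               for i in range(1, len(seq)))


def coding_log_odds(seq, coding_model, noncoding_model):
    """Positive values mean seq looks more like the coding training data."""
    return log_likelihood(seq, coding_model) - log_likelihood(seq, noncoding_model)


if __name__ == "__main__":
    coding = train_markov(["GCCGCCGCCGCCGCCGCC"], periods=3)   # toy training data
    noncoding = train_markov(["ATATATATATATATATAT"], periods=1)
    print(coding_log_odds("GCCGCCGCC", coding, noncoding))
```

The organism-specific parameters mentioned above correspond, in this sketch, to the transition tables learned from each organism's training sequences.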
Glimmer: Like Genemark, Gene Locator and Interpolated Markov Modeler (Glimmer) is another tool that transformed prokaryotic functional genomics. This algorithm was developed at Johns Hopkins University by Steven Salzberg (Salzberg et al. 1998). It is also based on Markov models. Here the order of the model increases at each step, with the predictive power of each order estimated separately. Later, it was also adapted for eukaryotic genomes. Glimmer is used at TIGR as a primary gene finder tool.
A later variant of Glimmer is specialised for small eukaryotic genomes such as that of Plasmodium. It is an open source program and is used by the National Institutes of Health for medical research.
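The interpolation idea, blending predictions from contexts of increasing order according to how much evidence each has, can be sketched as below. The weighting rule (a simple count threshold) and all names are assumptions chosen for illustration; this is not Glimmer's actual weighting scheme.

```python
from collections import Counter

BASES = "ACGT"


def train_counts(seqs, max_order):
    """Count each context of length 0..max_order together with the base after it."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq)):
            for k in range(min(i, max_order) + 1):
                counts[(seq[i - k:i], seq[i])] += 1
    return counts


def imm_prob(counts, context, base, max_order, threshold=40):
    """Interpolated estimate of P(base | context) from order 0 up to max_order.

    Each higher order is trusted in proportion to how often its context was
    seen, capped at full trust once it reaches `threshold` observations.
    """
    # Order 0: smoothed overall base frequency.
    total0 = sum(counts[("", b)] for b in BASES)
    prob = (counts[("", base)] + 1) / (total0 + 4)
    for k in range(1, min(len(context), max_order) + 1):
        ctx = context[-k:]
        total = sum(counts[(ctx, b)] for b in BASES)
        if total == 0:
            break  # longer contexts were never observed either
        weight = min(1.0, total / threshold)
        prob = weight * counts[(ctx, base)] / total + (1 - weight) * prob
    return prob


if __name__ == "__main__":
    counts = train_counts(["ACGTACGTACGT"], max_order=2)
    print({b: round(imm_prob(counts, "AC", b, max_order=2), 3) for b in BASES})
```

Because every step is a weighted average of two probability distributions, the estimates for the four bases still sum to one at any order.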
Grail: Eukaryotic genome analysis, on the other hand, was first revolutionised by Gene Recognition and Assembly Internet Link (Grail), which was developed at Oak Ridge National Laboratory by Ed Uberbacher in 1996 (Uberbacher et al. 1996). For the first time, using a homology-based method, the algorithm could identify CpG islands, polyA sites, promoters, exons and frameshift mutations by comparing sequences with the human and mouse genomes.
It is incorporated in the Oak Ridge genome analysis pipeline, which transformed eukaryotic genome analysis. This pipeline also allows results from GrailEXP (a modified version) to be compared with those from Genscan (discussed below) for better understanding.
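For readers unfamiliar with CpG islands, one of the features mentioned above, the sketch below computes the GC fraction and the observed/expected CpG ratio that are commonly used to screen for them. It is a simple statistical check and does not reproduce Grail's method. The function names and the default thresholds (length 200, GC fraction 0.5, ratio 0.6) are assumptions based on commonly quoted criteria.

```python
def cpg_stats(seq):
    """Return (gc_fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on this strand
    gc_fraction = (c + g) / n
    # Expected CpG count if C and G were placed independently: c * g / n.
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the (assumed) length, GC and ratio thresholds to one sequence."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_ratio


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(looks_like_cpg_island("CG" * 100))
```

In practice such a check would be applied over sliding windows along a genome rather than to one whole sequence.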
Genscan: The complexity of a sizeable eukaryotic genome demands a more accurate and time-efficient algorithm that also accounts for a gene’s statistical and structural properties. This need was met by Genscan, developed at Stanford University by Chris Burge and Samuel Karlin (Burge et al. 1997). Based on a complex probabilistic model, this algorithm accounts for gene structure and, more precisely, for the signals involved in transcription, translation and splicing.
Genscan can provide comparative gene models based on GC content and has been used as the primary gene prediction tool in the international human genome project.
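As a small illustration of the GC-content dependence, the sketch below computes GC fraction over sliding windows and assigns each value to a composition class, the kind of binning that lets a predictor switch parameter sets by local GC content. The function names, window sizes and class boundaries are assumptions for illustration and should not be read as Genscan's own settings.

```python
import bisect


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def sliding_gc(seq, window=100, step=50):
    """Return (start, gc_fraction) for each window; a short seq yields one window."""
    last_start = max(len(seq) - window, 0)
    return [(i, gc_fraction(seq[i:i + window]))
            for i in range(0, last_start + 1, step)]


def gc_class(fraction, edges=(0.43, 0.51, 0.57)):
    """Map a GC fraction to a class index 0..len(edges) (boundaries are assumed)."""
    return bisect.bisect_right(edges, fraction)


if __name__ == "__main__":
    for start, frac in sliding_gc("GGGGCCCCAAAATTTT", window=8, step=8):
        print(start, frac, gc_class(frac))
```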
Genebuilder: This algorithm was developed by Milanesi et al. and is based on an ab initio, open source gene prediction method (Milanesi et al. 1999). It considers various parameters, such as CpG islands, splice site data, GC content and repetitive elements, to identify a gene. Moreover, it identifies coding sequences based on the relative frequencies of synonymous and non-synonymous substitutions.
Genebuilder enables us to recognise gene structure based on protein sequences. Also, it helps users to predict the gene structure in an interactive manner by using various parameters.
These algorithms continue to evolve and have led to the development of newer, more efficient techniques. Nowadays, many tools are available that can perform sequence analysis more efficiently and accurately. Sequence analysis has also motivated computer scientists to develop a new branch called DNA computing, an alternative to traditional electronic computation (Paun et al. 2005). Bioinformatics freelancers can help with analyzing complex data and interpreting results.
*Dr. Saptarshi Sinha is an experienced systems biologist on Kolabtree. He has substantial interdisciplinary research experience in microbiology as well as mathematical/computational biology. His research areas include biological network analysis, evolutionary game theory, bioinformatics, Monte Carlo simulation and basic mathematical modelling. He also has expertise in bacterial population dynamics, phage-bacteria interactions, molecular cloning, polymerase chain reaction, fluorescence-activated cell sorting and scanning electron microscopy.
- Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3), 443-453.
- Olson, M. V. (1993). The human genome project. Proceedings of the National Academy of Sciences, 90(10), 4338-4344.
- Borodovsky, M., & McIninch, J. (1993). GENMARK: parallel gene recognition for both DNA strands. Computers & chemistry, 17(2), 123-133.
- Médigue, C., Rouxel, T., Vigier, P., Hénaut, A., & Danchin, A. (1991). Evidence for horizontal gene transfer in Escherichia coli speciation. Journal of molecular biology, 222(4), 851-856.
- Salzberg, S. L., Delcher, A. L., Kasif, S., & White, O. (1998). Microbial gene identification using interpolated Markov models. Nucleic acids research, 26(2), 544-548.
- Uberbacher, E. C., Xu, Y., & Mural, R. J. (1996). Discovering and understanding genes in human DNA sequence using GRAIL. Methods in enzymology, 266, 259-281.
- Burge, C., & Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. Journal of molecular biology, 268(1), 78-94.
- Milanesi, L., D’Angelo, D., & Rogozin, I. B. (1999). GeneBuilder: interactive in silico prediction of gene structure. Bioinformatics (Oxford, England), 15(7), 612-621.
- Paun, G., Rozenberg, G., & Salomaa, A. (2005). DNA computing: new computing paradigms. Springer Science & Business Media.