bioinformatics - jnujprdistance.comjnujprdistance.com/assets/lms/lms jnu/mca/sem vi... ·...

Bioinformatics


Post on 10-Jun-2020


Bioinformatics


Board of Studies

Prof. H. N. Verma, Vice-Chancellor, Jaipur National University, Jaipur
Prof. M. K. Ghadoliya, Director, School of Distance Education and Learning, Jaipur National University, Jaipur
Dr. Rajendra Takale, Prof. and Head Academics, SBPIM, Pune

___________________________________________________________________________________________

Subject Expert Panel

Dr. Ramchandra G. Pawar, Director, SIBACA, Lonavala, Pune
Ashwini Pandit, Subject Matter Expert

___________________________________________________________________________________________

Content Review Panel

Gaurav Modi, Subject Matter Expert
Shubhada Pawar, Subject Matter Expert

___________________________________________________________________________________________

Copyright ©

This book contains the course content for Bioinformatics.

First Edition 2013

Printed by
Universal Training Solutions Private Limited

Address
05th Floor, I-Space, Bavdhan, Pune 411021.

All rights reserved. This book or any portion thereof may not, in any form or by any means including electronic or mechanical or photocopying or recording, be reproduced or distributed or transmitted or stored in a retrieval system or be broadcasted or transmitted.

___________________________________________________________________________________________


Index

I. Content
II. List of Figures
III. Abbreviations
IV. Application
V. Bibliography
VI. Self Assessment Answers

Book at a Glance


Contents

Chapter I
Introduction to Bioinformatics
Aim
Objectives
Learning outcome
1.1 Introduction
1.2 Bioinformatics: The Brain of Biotechnology
1.3 Evolutionary Biology
1.4 Origin & History of Bioinformatics
1.5 Origin of Bioinformatic/Biological Databases
1.6 Importance of Bioinformatics
1.7 Use of Bioinformatics
1.8 Basics of Molecular Biology
1.9 Definitions of Fields Related to Bioinformatics
1.10 Bioinformatics Applications
Summary
References
Recommended Reading
Self Assessment

Chapter II
Biological Databases
Aim
Objectives
Learning outcome
2.1 Introduction
2.2 Categories of Biological Databases
2.3 The Database Industry
2.4 Classification of Biological Databases
2.5 The Creation of Sequence Databases
2.6 Bioinformatics Programs and Tools
2.7 Bioinformatics Tools
2.8 Application of Programmes in Bioinformatics
Summary
References
Recommended Reading
Self Assessment

Chapter III
Genomics & Proteomics
Aim
Objectives
Learning outcome
3.1 DNA, Genes and Genomes
3.2 DNA Sequencing
3.3 Genome Mapping
3.4 Implications of Genomics for Medical Science
3.5 Proteomics
3.6 Application of Proteomics to Medicine
3.7 Difference between Proteomics and Genomics
3.8 Protein Modeling
Summary
References
Recommended Reading
Self Assessment

Chapter IV
Sequence Alignment
Aim
Objectives
Learning outcome
4.1 Introduction
4.2 Pairwise Sequence Alignment
4.3 Multiple Sequence Alignment (MSA)
4.4 Substitution Matrices
4.5 Two Sample Applications
Summary
References
Recommended Reading
Self Assessment

Chapter V
Phylogenetic Analysis
Aim
Objectives
Learning outcome
5.1 Introduction
5.2 Fundamental Elements of Phylogenetic Models
5.3 Tree Interpretation: Importance of Identifying Paralogs and Orthologs
5.4 Phylogenetic Data Analysis
5.4.1 Alignment: Building the Data Model
5.4.2 Determining the Substitution Model
5.4.3 Tree-Building Methods
5.4.4 Tree Evaluation
Summary
References
Recommended Reading
Self Assessment

Chapter VI
Microarray Technology: A Boon to Biological Sciences
Aim
Objectives
Learning outcome
6.1 Introduction to Microarray
6.2 Microarray Technique
6.3 Potential of Microarray Analysis
6.4 Microarray Products
6.5 Microarray: Identifying Interactions
6.6 Applications of Microarrays
Summary
References
Recommended Reading
Self Assessment


Chapter VII
Bioinformatics in Drug Discovery: A Brief Overview
Aim
Learning outcome
7.1 Introduction
7.2 Drug Discovery
7.3 Informatics and Medical Sciences
7.4 Bioinformatics and Medical Sciences
7.5 Bioinformatics in Computer-Aided Drug Design
7.6 Bioinformatics Tools
Summary
References
Recommended Reading
Self Assessment

Chapter VIII
Human Genome Project
Objectives
Learning outcome
8.1 Introduction
8.2 Human Genome Project
8.3 Genome Sequenced in the Public (HGP) and Private Projects
8.4 Funding for Human Genome Sequencing
8.5 DNA Sequencing
8.6 Bioinformatic Analysis: Finding Functions
8.7 Insights Learned from the Human DNA Sequence
8.8 Future Challenges
Summary
References
Recommended Reading
Self Assessment


List of Figures

Fig. 1.1 Genes encode the recipes for proteins
Fig. 2.1 Growth of the GenBank database
Fig. 2.2 GenBank file format
Fig. 2.3 International nucleotide data banks
Fig. 2.4 Application of bioinformatics in medical science
Fig. 4.1 Dot matrix
Fig. 5.1 Clade and node
Fig. 5.2 A phylogenetic tree
Fig. 6.1 Gene expression data
Fig. 6.2 Gene expression over time


Abbreviations

ADMET - Absorption, Distribution, Metabolism, Excretion, Toxicity

ASCII - American Standard Code for Information Interchange

BLAST - Basic Local Alignment Search Tool

BLOSUM - Blocks Substitution Matrix

CADD - Computer-Aided Drug Design

cDNA - Complementary DNA

COPIA - Consensus Pattern Identification and Analysis

DDBJ - DNA Data Bank of Japan

DNA - Deoxyribonucleic Acid

EBI - European Bioinformatics Institute

ELSI - Ethical, Legal and Social Issues

EMBL - European Molecular Biology Laboratory

EMBOSS - European Molecular Biology Open Software Suite

EMR - Electronic Medical Records

ESI - Electro-Spray Ionisation

GIS - Geographic Information System

HGP - Human Genome Project

HMM - Hidden Markov Models

HTML - HyperText Markup Language

JVM - Java Virtual Machine

LC - Liquid Chromatography

MALDI - Matrix-Assisted Laser Desorption Ionisation

MS - Mass Spectrometry

NCBI - National Center for Biotechnology Information

NIH - National Institutes of Health

NMR - Nuclear Magnetic Resonance

OMIM - Online Mendelian Inheritance in Man

ORF - Open Reading Frame

PCR - Polymerase Chain Reaction

PDB - Protein Data Bank

PROSPECT - Protein Structure Prediction and Evaluation Computer ToolKit


PSA - Prostate-Specific Antigen

QSAR - Quantitative Structure Activity Relationships

RNA - Ribonucleic Acid

SCOP - Structural Classification of Proteins

TOF - Time-of-Flight

vHTS - Virtual High-Throughput Screening

W3C - World-Wide Web Consortium

WWW - World Wide Web

XML - Extensible Markup Language


Chapter I

Introduction to Bioinformatics

Aim

The aim of this chapter is to:

• define bioinformatics
• enlist components of bioinformatics
• describe evolutionary biology

Objectives

The objectives of this chapter are to:

• explain history of bioinformatics
• elucidate use of HTML and Java
• describe XML

Learning outcome

At the end of this chapter, you will be able to:

• understand bioinformation infrastructure
• describe CORBA
• explain features of ROSETTA


1.1 Introduction

Bioinformatics is a newly emerged scientific discipline for the computational analysis and storage of biological data. The word bioinformatics is derived from two words: ‘Bio’, meaning biology, and ‘Informatique’ (a French word), meaning ‘data processing’.

Bioinformatics is the combination of biology and information technology. The discipline encompasses any computational tools and methods used to manage, analyse and manipulate large sets of biological data. Essentially, bioinformatics has three components:

• The creation of databases allowing the storage and management of large biological data sets.
• The development of algorithms and statistics to determine relationships among members of large data sets.
• The use of these tools for the analysis and interpretation of various types of biological data, including DNA, RNA and protein sequences, protein structures, gene expression profiles, and biochemical pathways.

The term bioinformatics first came into use in the 1990s and was originally synonymous with the management and analysis of DNA, RNA and protein sequence data. Computational tools for sequence analysis had been available since the 1960s, but this was a minority interest until advances in sequencing technology led to a rapid expansion in the number of sequences stored in databases such as GenBank. Now, the term has expanded to incorporate many other types of biological data, for example protein structures, gene expression profiles and protein interactions. Each of these areas requires its own set of databases, algorithms and statistical methods.

Bioinformatics is largely, although not exclusively, a computer-based discipline. Computers are important in bioinformatics for two reasons:

Firstly, many bioinformatics problems require the same task to be repeated millions of times. Examples include comparing a new sequence to every other sequence stored in a database, or comparing a group of sequences systematically to determine evolutionary relationships. In such cases, the ability of computers to process information and test alternative solutions rapidly is indispensable.
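The repeated comparison described above can be illustrated with a minimal Python sketch. It is not a method taken from this book: the function names, the toy database and the scoring rule (the best fraction of matching positions when the query is slid along each stored sequence, without gaps) are all assumptions chosen for brevity. Practical tools such as BLAST use far more sophisticated scoring and indexing.

```python
def best_ungapped_identity(query: str, target: str) -> float:
    """Slide the query along the target and return the best fraction of
    matching positions (a simplified, gap-free score assumed for illustration)."""
    if not query or len(query) > len(target):
        return 0.0
    best = 0
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        matches = sum(q == t for q, t in zip(query, window))
        best = max(best, matches)
    return best / len(query)


def search_database(query: str, database: dict) -> list:
    """Score the query against every stored sequence; return (name, score), best first."""
    scores = [(name, best_ungapped_identity(query, seq))
              for name, seq in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-entry database; real databases hold millions of sequences.
toy_database = {
    "seq_a": "ATGGCGTACGTTAGC",
    "seq_b": "TTTTTTTTTTTTTTT",
    "seq_c": "CCATGGCGTACCTTA",
}

hits = search_database("GCGTACG", toy_database)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

Even in this toy form, the work grows with the number of stored sequences multiplied by their lengths, which is why such searches are left to computers.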

Secondly, computers are required for their problem-solving power. Typical problems that might be addressed using bioinformatics include solving the folding pathway of a protein given its amino acid sequence, or deducing a biochemical pathway given a collection of RNA expression profiles. Computers can help with such problems, but it is important to note that expert input and robust original data are also required.

Bioinformatics is the field in which biology, computer science and information technology merge into a single discipline for managing and analysing biological data using advanced computing techniques. Bioinformatics has emerged as a full-fledged interdisciplinary subject that interfaces developments in computer science and information technology with the biological sciences. Knowledge of computer science and information technology is applied to the creation and management of databases, data warehousing, data mining and communication networking throughout the world.

Bioinformatics is the application of computer technology to the management and analysis of biological data. The result is that computers are being used to gather, store, analyse and merge biological data. The ultimate goal of bioinformatics is to uncover the wealth of biological information hidden in the mass of data and obtain a clearer insight into the fundamental biology of organisms. This new knowledge could have profound impacts on fields as varied as human health, agriculture, the environment, energy and biotechnology.

Bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organise the information associated with these molecules on a large scale.


The three terms bioinformatics, computational biology and bioinformation infrastructure are very similar and are often used interchangeably. However:

• Bioinformatics refers to database-like activities, involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time.
• Computational biology encompasses the use of algorithmic tools to facilitate biological analyses.
• Bioinformation infrastructure comprises the entire collection of information management systems, analysis tools and communication networks supporting biology. Thus, the latter may be viewed as a computational scaffold of the former two.

The future of bioinformatics is integration. For example, integration of a wide variety of data sources, such as clinical and genomic data, will allow us to use disease symptoms to predict genetic mutations and vice versa. The integration of GIS data, such as maps and weather systems, with crop health and genotype data will allow us to predict successful outcomes of agricultural experiments.

Another future area of research in bioinformatics is large-scale comparative genomics. For example, the development of tools that can do 10-way comparisons of genomes will push forward the discovery rate in this field of bioinformatics. Along these lines, the modelling and visualisation of full networks of complex systems could be used in the future to predict how the system (or cell) reacts to a drug, for example. A technical set of challenges faces bioinformatics and is being addressed by faster computers, technological advances in disk storage space, and increased bandwidth. Finally, a key research question for the future of bioinformatics will be how to computationally compare complex biological observations, such as gene expression patterns and protein networks. Bioinformatics is about converting biological observations to a model that a computer will understand. This is a very challenging task since biology can be very complex. This problem of how to digitise phenotypic data such as behavior, electrocardiograms, and crop health into a computer readable form offers exciting challenges for future bioinformaticians.

A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also knows how to write interfaces for effective use of the tools. A bioinformatician, on the other hand, is a trained individual who knows how to use bioinformatics tools but lacks a deeper understanding of them.

1.2 Bioinformatics: The Brain of Biotechnology

The practical aspect of bioinformatics is to understand the code of life, that is, to decode the information residing in the nucleotide sequence. It is a well-known fact that DNA is the basic molecule of life that directly controls the fundamental biology of nearly all organisms (except those where RNA is the genetic material). The nucleotide sequence constitutes the genes, which in turn are expressed as proteins. Any variations and errors in the nucleotide sequence of the genomic DNA, or mutations, may lead to the development of genetic disorders or other metabolic changes. Therefore, researchers and scientists in the fields of biotechnology or molecular biology need to know the nature of the individual genomes of various prokaryotic and eukaryotic organisms. Many DNA sequencing projects have already been completed and many more are in progress, leading to a huge amount of biological information. This has added to the growth of the science of bioinformatics. Handling and interpreting such enormous amounts of information would not be possible without bioinformatics. Hence, bioinformatics can be called the brain of biotechnology.

1.3 Evolutionary Biology

New insight into the molecular basis of a disease may come from investigating the function of homologs of a disease gene in model organisms. In this case, homology refers to two genes sharing a common evolutionary history. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship.


Equally exciting is the potential for uncovering evolutionary relationships and patterns between different forms of life. With the aid of nucleotide and protein sequences, it should be possible to find the ancestral ties between different organisms. Thus far, experience has taught us that closely related organisms have similar sequences and that more distantly related organisms have more dissimilar sequences. Proteins that show significant sequence conservation, indicating a clear evolutionary relationship, are said to be from the same protein family. By studying protein folds (distinct protein building blocks) and families, scientists are able to reconstruct the evolutionary relationship between two species and to estimate the time of divergence between two organisms since they last shared a common ancestor.
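The idea that closely related organisms have more similar sequences can be illustrated with a very simple similarity measure. The following is a minimal sketch, assuming the sequences have already been aligned to equal length; the function name `percent_identity` and the three toy sequences are invented for illustration and are not taken from any real organism.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two pre-aligned sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical toy sequences (assumptions made for this example only).
reference = "ATGGCCATTGTA"
close_relative = "ATGGCCATTGTT"    # differs from the reference at 1 position
distant_relative = "ATGACCTTAGCT"  # differs from the reference at 5 positions

close_score = percent_identity(reference, close_relative)
distant_score = percent_identity(reference, distant_relative)
```

Under these assumptions the close relative scores higher than the distant one. Real comparisons first have to align the sequences and usually use substitution scores instead of plain identity.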

1.4 Origin & History of Bioinformatics

Over a century ago, bioinformatics history started with an Austrian monk named Gregor Mendel. He is known as the “Father of Genetics”. He cross-fertilised different colours of the same species of flowers. He kept careful records of the colours of flowers that he cross-fertilised and the colour(s) of flowers they produced. Mendel illustrated that the inheritance of traits could be more easily explained if it was controlled by factors passed down from generation to generation.

Since Mendel, bioinformatics and genetic record keeping have come a long way. The understanding of genetics has advanced remarkably in the last thirty years. In 1972, Paul Berg made the first recombinant DNA molecule using ligase. In that same year, Stanley Cohen, Annie Chang and Herbert Boyer produced the first recombinant DNA organism. In 1973, two important things happened in the field of genomics:

• Joseph Sambrook led a team that refined the DNA electrophoresis technique using agarose gel, and
• Herbert Boyer and Stanley Cohen invented DNA cloning.

By 1977, a method for sequencing DNA was discovered and the first genetic engineering company, Genentech, was founded.

By 1981, 579 human genes had been mapped and mapping by in situ hybridisation had become a standard method. Marvin Caruthers and Leroy Hood made a huge leap in bioinformatics when they invented a method for automated DNA sequencing. In 1988, the Human Genome Organisation (HUGO) was founded. This is an international organisation of scientists involved in the Human Genome Project. In 1989, the first complete genome map was published of the bacterium Haemophilus influenzae. The following year, the Human Genome Project was started. By 1991, a total of 1879 human genes had been mapped. In 1993, Genethon, a human genome research centre in France, produced a physical map of the human genome. Three years later, Genethon published the final version of the Human Genetic Map. This concluded the first phase of the Human Genome Project.

In the mid-1970s, it would take a laboratory at least two months to sequence 150 nucleotides. Ten years ago, the only way to track genes was to scour large, well documented family trees of relatively inbred populations, such as the Ashkenazi Jews from Europe. These types of genealogical searches [...] 11 million nucleotides a day for its corporate clients and company research.

Bioinformatics was fuelled by the need to create huge databases, such as GenBank, EMBL and the DNA Data Bank of Japan, to store and compare the DNA sequence data erupting from the human genome and other genome sequencing projects. Today, bioinformatics embraces protein structure analysis, gene and protein functional information, data from patients, pre-clinical and clinical trials, and the metabolic pathways of numerous species.

Origin of internet

The management and, more importantly, accessibility of this data is directly attributable to the development of the Internet, particularly the World Wide Web (WWW). Originally developed for military purposes in the 60’s and expanded by the National Science Foundation in the 80’s, scientific use of the Internet grew dramatically following the release of the WWW by CERN in 1992.


HTML

The WWW is a graphical interface based on hypertext by which text and graphics can be displayed and highlighted. Each highlighted element is a pointer to another document or an element in another document, which can reside on any internet host computer. Page display, hypertext links and other features are coded using a simple, cross-platform HyperText Markup Language (HTML) and viewed on UNIX workstations, PCs and Apple Macs as WWW pages using a browser.

Java

The first graphical WWW browser, Mosaic for X, and the first molecular biology WWW server, ExPASy, were made available in 1993. In 1995, Sun Microsystems released Java, an object-oriented, portable programming language based on C++. In addition to being a standalone programming language in the classic sense, Java provides highly interactive, dynamic content to the Internet and offers a uniform operational level for all types of computers, provided they implement the ‘Java Virtual Machine’ (JVM). Thus, programs can be written, transmitted over the internet and executed on any other type of remote machine running a JVM. Java is also integrated into Netscape and Microsoft browsers, providing both the common interface and programming capability, which is vital in sorting through and interpreting the gigabytes of bioinformatics data now available and increasing at an exponential rate.

XML

The new XML standard is a project of the World-Wide Web Consortium (W3C), which extends the power of the WWW to deliver not only HTML documents but an unlimited range of document types using customised markup. This will enable the bioinformatics community to exchange data objects such as sequence alignments, chemical structures, spectra and so on together with appropriate tools to display them, just as easily as they exchange HTML documents today. Both Microsoft and Netscape support this new technology in their latest browsers.
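To make the idea of exchanging a data object with customised markup concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The tag names (`sequence`, `description`, `residues`), the function names and the sample record are hypothetical, chosen for this example only; they do not correspond to any published bioinformatics schema.

```python
import xml.etree.ElementTree as ET


def sequence_to_xml(entry_id: str, description: str, residues: str) -> str:
    """Serialise one sequence record using made-up, illustrative tag names."""
    root = ET.Element("sequence", attrib={"id": entry_id})
    ET.SubElement(root, "description").text = description
    ET.SubElement(root, "residues").text = residues
    return ET.tostring(root, encoding="unicode")


def sequence_from_xml(xml_text: str) -> tuple:
    """Parse a record produced by sequence_to_xml back into its three fields."""
    root = ET.fromstring(xml_text)
    return root.get("id"), root.findtext("description"), root.findtext("residues")


# Round trip with an invented record: sender serialises, receiver parses.
xml_record = sequence_to_xml("EX1", "toy example record", "ACGTACGT")
received = sequence_from_xml(xml_record)
```

The point of the sketch is only that both sides need to agree on the markup. A real exchange format would be defined by a shared schema, not by ad hoc tag names.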

CORBA

Another new technology, called CORBA, provides a way of bringing together many existing or ‘legacy’ tools and databases with a common interface that can be used to drive them and access data. CORBA frameworks for bioinformatics tools and databases have been developed by, for example, NetGenics and the European Bioinformatics Institute (EBI).

Representatives from industry and the public sector under the umbrella of the Object Management Group are working on open CORBA-based standards for biological information representation. The Internet offers scientists a universal platform on which to share and search for data and the tools to ease data searching, processing, integration and interpretation. The same hardware and software tools are also used by companies and organisations in more private yet still global Intranet networks. One such company, Oxford GlycoSciences (OGS) in the UK, has developed a bioinformatics system as a key part of its proteomics activity.

ROSETTA

ROSETTA focuses on protein expression data and sets out to identify the specific proteins which are up- or down-regulated in a particular disease; characterise these proteins with respect to their primary structure, post-translational modifications and biological function; evaluate them as drug targets and markers of disease; and develop novel drug candidates. OGS uses a technique called fluorescent IPG-PAGE to separate and measure different protein types in a biological sample such as a body fluid or purified cell extract. After separation, each protein is collected and then broken up into many different fragments using controlled techniques. The mass and sequence of these fragments are determined with great accuracy using a technique called mass spectrometry. The sequence of the original protein can then be theoretically reconstructed by fitting these fragments back together in a kind of jigsaw. This reassembly of the protein sequence is a task well-suited to signal processing and statistical methods.

ROSETTA is built on an object-relational database system, which stores demographic and clinical data on sample donors and tracks the processing of samples and analytical results. It also interprets protein sequence data and matches this data with that held in public, client and proprietary protein and gene databases. ROSETTA comprises a suite of linked HTML pages, which allow data to be entered, modified and searched and allow the user easy access to other databases. A high level of intelligence is provided through a sophisticated suite of proprietary search, analytical and computational algorithms. These algorithms facilitate searching through the gigabytes of data generated by the Company’s proteome projects, matching sequence data, carrying out de novo peptide sequencing and correlating results with clinical data. These processing tools are mostly written in C, C++ or Java to run on a variety of computer platforms and use the networking protocol of the internet, TCP/IP, to co-ordinate the activities of a wide range of laboratory instrument computers, reliably identifying samples and collecting data for analysis.

The need to analyse ever increasing numbers of biological samples using increasingly complex analytical techniques is insatiable. Searching for signals and trends in noisy data continues to be a challenging task, requiring great computing power. Fortunately, this power is available with today’s computers, but of key importance is the integration of analytical data, functional data and biostatistics. The protein expression data in ROSETTA forms only part of an elaborate network of the type of data, which can now be brought to bear in biology. The need to integrate different information systems into a collaborative network with a friendly face is bringing together an exciting mixture of talents in the software world and has brought the new science of bioinformatics to life.

1.5 Origin of Bioinformatic/Biological Databases

The first bioinformatic/biological databases were constructed a few years after the first protein sequences began to become available. The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA with 77 bases. Just a year later, Dayhoff gathered all the available sequence data to create the first bioinformatic database. The Protein Data Bank followed in 1972 with a collection of ten X-ray crystallographic protein structures, and the SWISSPROT protein sequence database began in 1987. A huge variety of divergent data resources of different types and sizes are now available either in the public domain or more recently from commercial third parties. All of the original databases were organised in a very simple way with data entries being stored in flat files, either one per entry, or as a single large text file. Later on, lookup indexes were added to allow convenient keyword searching of header information.
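The flat-file-plus-index arrangement described above can be sketched in a few lines of Python. This is an illustration only: the FASTA-like layout (a `>` header line followed by sequence lines), the entry identifiers `E1` to `E3`, the sequences and the function names are all assumptions made for the example, not the format of any particular historical database.

```python
from collections import defaultdict

# Invented toy records: a ">" header line (id plus keywords), then sequence lines.
FLAT_FILE = """\
>E1 bovine insulin
ACDEFG
HIKLMN
>E2 yeast alanine tRNA
ACGUACGU
>E3 human insulin
ACDEFGHI
"""


def parse_flat_file(text):
    """Return {entry_id: (header_words, sequence)} for a single large text file."""
    entries = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current_id = words[0] if words else None
            if current_id is not None:
                entries[current_id] = (words[1:], "")
        elif current_id is not None:
            header_words, sequence = entries[current_id]
            entries[current_id] = (header_words, sequence + line)
    return entries


def build_keyword_index(entries):
    """Map each lower-cased header keyword to the ids of entries that contain it."""
    index = defaultdict(set)
    for entry_id, (header_words, _sequence) in entries.items():
        for word in header_words:
            index[word.lower()].add(entry_id)
    return index


entries = parse_flat_file(FLAT_FILE)
index = build_keyword_index(entries)
insulin_hits = sorted(index["insulin"])  # ids of entries whose header mentions "insulin"
```

With the index built once, a keyword search becomes a dictionary lookup and no longer needs a scan of every header in the file.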

Origin of tools

After the formation of the databases, tools became available to search sequence databases, at first in a very simple way, looking for keyword matches and short sequence words, and then with more sophisticated pattern matching and alignment based methods. The rapid but less rigorous BLAST algorithm has been the mainstay of sequence database searching since its introduction a decade ago, complemented by the more rigorous and slower FASTA and Smith Waterman algorithms. Suites of analysis algorithms, written by leading academic researchers at Stanford, CA, Cambridge, UK and Madison, WI for their in-house projects, began to become more widely available for basic sequence analysis. These algorithms were typically single function black boxes that took input and produced output in the form of formatted files. UNIX style commands were used to operate the algorithms, with some suites having hundreds of possible commands, each taking different command options and input formats. Since these early efforts, significant advances have been made in automating the collection of sequence information.
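As a rough illustration of what a rigorous alignment method computes, the sketch below implements the scoring pass of Smith Waterman local alignment in Python. The scoring values (match +2, mismatch -1, gap -2 with a simple linear gap penalty) and the function name are assumptions chosen for the example; production tools use substitution matrices and more elaborate gap models, and also recover the alignment itself, not only its score.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    previous_row = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current_row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current_row[j] = max(
                0,                                   # start a fresh local alignment
                previous_row[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous_row[j] + gap,               # gap in b
                current_row[j - 1] + gap,            # gap in a
            )
            best = max(best, current_row[j])
        previous_row = current_row
    return best


# Two invented sequences that share the local region "ACGT".
shared_region_score = smith_waterman_score("TTACGTTT", "GGACGTGG")
```

Every cell of the table is filled in, which is why the method is thorough but slow on large databases. Heuristic tools such as BLAST trade some of that rigour for speed.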

Rapid innovation in biochemistry and instrumentation has brought us to the point where the entire genomic sequences of at least 20 organisms, mainly microbial pathogens, are known and projects to elucidate at least 100 more prokaryotic and eukaryotic genomes are currently under way. Groups are now even competing to finish the sequence of the entire human genome. With new technologies we can directly examine the changes in expression levels of both mRNA and proteins in living cells, both in a disease state or following an external challenge. We can go on to identify patterns of response in cells that lead us to an understanding of the mechanism of action of an agent on a tissue. The volume of data arising from projects of this nature is unprecedented in the pharmaceutical industry, and will have a profound effect on the ways in which data are used and experiments performed in drug discovery and development projects. This is true not least because, with much of the available interesting data being in the hands of commercial genomics companies, pharmaceutical companies are unable to get exclusive access to many gene sequences or their expression profiles.


The competition between co-licensees of a genomic database is effectively a race to establish a mechanistic role or other utility for a gene in a disease state in order to secure a patent position on that gene. Much of this work is carried out by informatics tools. Despite the huge progress in sequencing and expression analysis technologies, and the correspondingly greater volume of data held in public, private and commercial databases, the tools used for storage, retrieval, analysis and dissemination of data in bioinformatics are still very similar to the original systems gathered together by researchers 15-20 years ago.

Many are simple extensions of the original academic systems, which have served the needs of both academic and commercial users for many years. These systems are now beginning to fall behind as they struggle to keep up with the pace of change in the pharmaceutical industry. Databases are still gathered, organised, disseminated and searched using flat files. Relational databases are still few and far between, and object-relational or fully object oriented systems are rarer still in mainstream applications. Interfaces still rely on command lines, fat client interfaces, which must be installed on every desktop, or HTML/CGI forms. Whilst the tools were in the hands of bioinformatics specialists, pharmaceutical companies were relatively undemanding of them. Now that the problems have expanded to cover the mainstream discovery process, much more flexible and scalable solutions are needed to serve pharmaceutical R&D informatics requirements.

There are different views of the origin of bioinformatics.

From T K Attwood and D J Parry-Smith’s “Introduction to Bioinformatics”, Prentice-Hall 1999 [Longman Higher Education; ISBN 0582327881]: “The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data.”

From Mark S. Boguski’s article in the “Trends Guide to Bioinformatics”, Elsevier, Trends Supplement 1998 p1: “The term “bioinformatics” is a relatively recent invention, not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing. The National Center for Biotechnology Information (NCBI), is celebrating its 10th anniversary this year, having been written into existence by US Congressman Claude Pepper and President Ronald Reagan in 1988. So, bioinformatics has, in fact, been in existence for more than 30 years and is now middle-aged.”

1.6 Importance of Bioinformatics

The greatest challenge facing the molecular biology community today is to make sense of the wealth of data that has been produced by the genome sequencing projects. Traditionally, molecular biology research was carried out entirely at the experimental laboratory bench but the huge increase in the scale of data being produced in this genomic era has seen a need to incorporate computers into this research process.

Sequence generation, and its subsequent storage, interpretation and analysis are entirely computer dependent tasks. However, the molecular biology of an organism is a very complex issue with research being carried out at different levels including the genome, proteome, transcriptome and metabolome levels. Following on from the explosion in volume of genomic data, similar increases in data have been observed in the fields of proteomics, transcriptomics and metabolomics.

The first challenge facing the bioinformatics community today is the intelligent and efficient storage of this mass of data. It is then their responsibility to provide easy and reliable access to this data. The data itself is meaningless before analysis and the sheer volume present makes it impossible for even a trained biologist to begin to interpret it manually. Therefore, incisive computer tools must be developed to allow the extraction of meaningful biological information. There are three central biological processes around which bioinformatics tools must be developed:

• DNA sequence determines protein sequence
• Protein sequence determines protein structure
• Protein structure determines protein function
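The first of these processes, DNA sequence determining protein sequence, is the easiest to show in code. The sketch below is a minimal Python illustration that reads a coding sequence three bases (one codon) at a time using the standard genetic code. The names `CODON_TABLE` and `translate` and the input strings are assumptions made for this example, and the code ignores complications such as alternative genetic codes, introns and ambiguous bases.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon.
    A trailing partial codon is ignored; unknown letters raise KeyError.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented example: ATG (M), GCC (A), TTT (F), then the stop codon TAA.
peptide = translate("ATGGCCTTTTAA")  # "MAF"
```

The other two steps, predicting structure from sequence and function from structure, have no comparably simple lookup table, which is a large part of why they remain active research problems.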

The integration of information learned about these key biological processes should allow us to achieve the long term goal of the complete understanding of the biology of organisms.


1.7 Use of Bioinformatics

Bioinformatics is used to:

• Store/retrieve biological information (databases)
• Retrieve/compare gene sequences
• Predict function of unknown genes/proteins
• Search for previously known functions of a gene
• Compare data with other researchers
• Compile/distribute data for other researchers

Due to the spectacular growth of biotechnology and molecular biology, a tremendous amount of nucleotide and protein sequence data is being produced. This is where bioinformatics comes in:

• To uncover the wealth of biological information hidden in the mass of nucleotide sequence.
• To determine the amino acid sequence on the basis of nucleotide sequences.
• To determine the structure of proteins on the basis of amino acid sequences.
• To predict functional aspects of proteins on the basis of their structure.

Besides these, the other aims of bioinformatics are:

• To provide biological data, information and other related literature on the Internet.
• To obtain a clearer insight into the fundamental biology of organisms.
• To use this information for the welfare of mankind.

Therefore, it is clear that bioinformatics is not merely limited to the computation of data; in reality it can be used to solve many biological problems and can be applied to understanding how living things work.

The major applications of bioinformatics are to access, search, visualise and retrieve information from sequence databases, as well as to understand the structural information of biomolecules, proteome analysis and so on. Other applications include cell metabolism, biodiversity, downstream processing in chemical engineering, and drug and vaccine design. These are the areas in which bioinformatics is an integral component. Current efforts in molecular biology (for example, genome projects) are producing a large quantity of data that is not only providing exciting opportunities for knowledge discovery, but also increasing the problem of information overload. Bioinformatics also concerns the development of new tools for the analysis of genomic and molecular biological data. This can be applied to all fields of biological science, such as agricultural science, environmental science, pharmaceutical science, chemical science and medical science.


Fig. 1.1 Genes encode the recipes for proteins
(Source: http://birg.cs.wright.edu/text/Ch2.ppt)

1.8 Basics of Molecular Biology

The key concepts are:

Cell

Our body consists of a number of organs. Each organ is composed of a number of tissues, and each tissue is composed of cells of the same type. The individual cell is the minimal self-reproducing unit in all living species. It performs two types of functions: it carries out the chemical reactions necessary to maintain life, and it passes the information for maintaining life to the next generation. Since the cell is the vehicle for transmission of the genetic information in all living species, it needs to store the genetic information in the form of double-stranded DNA. The cell replicates its information by separating the paired DNA strands and using each as a template for polymerisation to make a new DNA strand with a complementary sequence of nucleotides. The same strategy is used to transcribe portions of the information from DNA into molecules of the closely related polymer, RNA. RNA is the intermediate between DNA and protein and it guides the synthesis of protein molecules by the complex machinery of translation, the ribosome. The resultant proteins are the main catalysts for almost all the chemical reactions in the cell. In addition to acting as catalysts, proteins also serve as building blocks and carry out transport, signalling, and so on.

Proteins: Molecular Machines

Proteins constitute most of a cell’s dry mass. They are not only the building blocks from which cells are built; they also execute nearly all cell functions. Understanding proteins can guide us to understand how our bodies function and how other biological processes work. A protein is made from a long chain of amino acids, each linked to its neighbour through a covalent peptide bond. Proteins are therefore also known as polypeptides. There are 20 types of amino acids in proteins, and each amino acid carries different chemical properties. The length of proteins ranges from 20 to more than 5000 amino acids; on average, a protein contains around 350 amino acids. In order to perform its chemical function, a protein needs to fold into a certain three-dimensional shape. Several interactions cause proteins to fold, such as the sets of weak non-covalent bonds that form between one part of the chain and another. These weak bonds are of three types: hydrogen bonds, ionic bonds, and Van der Waals attractions. In addition to these three weak bonds, a fourth weak force, the hydrophobic interaction, also has a central role in determining the shape of a protein. The correct shape is vital to a protein’s functionality.


Proteins have a variety of roles that they must fulfill:

• They are the enzymes that rearrange chemical bonds.
• They carry signals to and from the outside of the cell, and within the cell.
• They transport small molecules.
• They form many of the cellular structures.
• They regulate cell processes, turning them on and off and controlling their rates.

This variety of roles is accomplished by the variety of proteins, which collectively can assume a variety of three-dimensional shapes.

A protein’s three-dimensional shape, in turn, is determined by the particular one-dimensional composition of the protein. Each protein is a linear sequence made of smaller constituent molecules called amino acids. The constituent amino acids are joined by a “backbone” composed of a regularly repeating sequence of bonds. There are 20 different types of amino acids. The three-dimensional shape assumed by the protein is determined by the specific linear sequence of amino acids from N-terminus to C-terminus. Different sequences of amino acids fold into different three-dimensional shapes.

• Proteins in your muscles allow you to move (myosin and actin)
• Enzymes (digestion, catalysis)
• Structure (collagen)
• Signaling (hormones, kinases)
• Transport (energy, oxygen)

DNA

DNA is the genetic material in all organisms (with certain viruses being the exception) and it stores the instructions needed by the cell to carry out its daily functions. DNA can be thought of as a large cookbook with recipes for making every protein in the cell. The information in DNA is used like a library. Library books can be read and reread many times. Similarly, the information in the genes is read, perhaps millions of times in the life of an organism, but the DNA itself is never used up. DNA consists of two long interwoven strands that form the famous “double helix”. Each strand is a chain built from a small set of constituent molecules called nucleotides.

DNA Structure

DNA is a double helix: it consists of two strands which are interwoven to resemble a twisted ladder. Looked at in detail, the rungs consist of chemical compounds called bases, while the sides of the ladder are made of the sugar (deoxyribose) and phosphate molecules. These three parts, base, sugar and phosphate, form the small molecules known as nucleotides. There are 4 types of bases that form the rungs of the DNA double helix; these are the 4 letters of the genetic code (A/Adenine, G/Guanine, C/Cytosine, and T/Thymine). The correct structure of DNA was first deduced by J. D. Watson and F. H. C. Crick in 1953.
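Because each rung of the ladder pairs A with T and G with C, and the two strands run in opposite directions, one strand fully determines the other. A minimal Python sketch of that relationship follows; the function name `reverse_complement` and the example strands are chosen for illustration, and the code assumes the input contains only the four letters A, C, G and T.

```python
# Pairing rule for the rungs of the ladder: A with T, G with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


partner = reverse_complement("ATGC")  # "GCAT"
```

Applying the function twice returns the original strand, which mirrors how each strand can serve as a template for rebuilding the other during replication.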

RNA

Chemically, RNA is very similar to DNA. There are two main differences:

• RNA uses the sugar ribose instead of deoxyribose in its backbone (from which RNA, RiboNucleic Acid, gets its name).
• RNA uses the base uracil (U) instead of thymine (T). U is chemically similar to T, and in particular is also complementary to A.
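In sequence terms, the second difference means that an RNA copy of a gene reads like the DNA coding strand with every T replaced by U. The following is a minimal Python sketch under that simplification; the function names are chosen for this example, and real transcription works from the template strand and involves processing steps not modelled here.

```python
def transcribe(coding_strand: str) -> str:
    """Write a DNA coding strand as the corresponding RNA sequence (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def back_transcribe(rna: str) -> str:
    """Write an RNA sequence as the corresponding DNA sequence (U becomes T)."""
    return rna.upper().replace("U", "T")


messenger = transcribe("ATGCTT")  # "AUGCUU"
```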

RNA has two properties important for our purposes. First, it tends to be single-stranded in its “normal” cellular state. Secondly, because RNA (like DNA) has base-pairing capability, it often forms intramolecular hydrogen bonds, partially hybridising to itself. Because of this, RNA, like proteins, can fold into complex three-dimensional shapes.


RNA has some of the properties of both DNA and proteins. It has the same information storage capability as DNA due to its sequence of nucleotides. But its ability to form three-dimensional structures allows it to have enzymatic properties like those of proteins. Because of this dual functionality of RNA, it has been conjectured that life may have originated from RNA alone, DNA and proteins having evolved later.

Genome

The genome of an organism is its complete set of DNA. All the genetic information in an organism is referred to collectively as a “genome”. Genomes vary widely in size: the smallest known genome for a free-living organism (a bacterium of the genus Mycoplasma, such as Mycoplasma genitalium) contains about 600,000 DNA base pairs, while human and mouse genomes have some 3 billion. Except for mature red blood cells, all human cells contain a complete genome.

Chromosome

The 3 billion bases of the human genome are not all in one continuous strand of DNA. Rather, the human genome is divided into 23 pairs of separate DNA molecules, called chromosomes. Chromosomes are structures within the cell nucleus that carry genes.

Gene

A gene is a DNA sequence that encodes a protein or an RNA molecule. Each chromosome contains many genes, which are the basic physical and functional units of heredity. Each gene occupies a particular position on a particular chromosome. The human genome is expected to contain 30,000 - 35,000 genes.

1.9 Definitions of Fields Related to Bioinformatics

Computational Biology

Computational biologists interest themselves more with evolutionary, population and theoretical biology rather than cell and molecular biomedicine. It is inevitable that molecular biology is profoundly important in computational biology, but it is certainly not what computational biology is all about. In these areas of computational biology it seems that computational biologists have tended to prefer statistical models for biological phenomena over physico-chemical ones.

One computational biologist (Paul J Schulte) did object to the above and makes the entirely valid point that this definition derives from a popular use of the term, rather than a correct one. Paul works on water flow in plant cells. He points out that biological fluid dynamics is a field of computational biology in itself. He argues that this, and any application of computing to biology, can be described as “computational biology”. Computational biology is not a “field”, but an “approach” involving the use of computers to study biological processes and hence it is an area as diverse as biology itself.

Genomics
Genomics is a field that existed before the completion of genome sequences, but only in the crudest of forms. For example, the often-referenced estimate of 100,000 genes in the human genome derived from a famous piece of “back of an envelope” genomics: guessing the weight of chromosomes and the density of the genes they bear. Genomics is any attempt to analyse or compare the entire genetic complement of one or more species. It is, of course, possible to compare genomes by comparing more-or-less representative subsets of genes within genomes.

Proteomics
Michael J. Dunn, the Editor-in-Chief of Proteomics, defines the “proteome” as “the protein complement of the genome” and proteomics as being concerned with “qualitative and quantitative studies of gene expression at the level of the functional proteins themselves”, that is, “an interface between protein biochemistry and molecular biology”. Characterising the many tens of thousands of proteins expressed in a given cell type at a given time, whether by measuring their molecular weights or isoelectric points, identifying their ligands or determining their structures, involves the storage and comparison of vast amounts of data. Inevitably, this requires bioinformatics.


Pharmacogenomics
Pharmacogenomics is the application of genomic approaches and technologies to the identification of drug targets. Examples include trawling entire genomes for potential receptors by bioinformatics means, investigating patterns of gene expression in both pathogens and hosts during infection, and examining the characteristic expression patterns found in tumours or patients’ samples for diagnostic purposes (or in the pursuit of potential cancer therapy targets).

Pharmacogenetics
Individuals respond differently to drug treatments: some respond positively, others with little obvious change in their condition, and yet others with side effects or allergic reactions. Much of this variation is known to have a genetic basis. Pharmacogenetics is a subset of pharmacogenomics that uses genomic/bioinformatics methods to identify genomic correlates, for example SNPs (Single Nucleotide Polymorphisms), characteristic of particular patient response profiles, and uses those markers to inform the administration and development of therapies. Strikingly, such approaches have been used to “resurrect” drugs previously thought to be ineffective but subsequently found to work within a subset of patients, and to optimise the doses of chemotherapy for particular patients.
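As a rough illustration of what a SNP looks like computationally, the sketch below compares two DNA sequences that are assumed to be already aligned and of equal length, and reports the positions where a single base differs. The function name `find_snps` and both sequences are hypothetical, invented for this example; real SNP detection works on many reads and relies on statistical models.

```python
from __future__ import annotations


def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumes the two sequences are already aligned and of equal length.
    Positions are 1-based, as is conventional for sequence coordinates.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, ref_base, smp_base)
        for i, (ref_base, smp_base) in enumerate(zip(reference, sample))
        if ref_base != smp_base
    ]


# Hypothetical sequences invented for this example.
reference = "ATGGCCATTGTA"
patient = "ATGGCTATTGTA"
print(find_snps(reference, patient))  # [(6, 'C', 'T')]
```

A marker list of this kind is the sort of input that could then be correlated with patient response profiles.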

Cheminformatics
The Web advertisement for Cambridge Healthtech Institute’s Sixth Annual Cheminformatics conference describes the field thus: “the combination of chemical synthesis, biological screening, and data-mining approaches used to guide drug discovery and development”. This, again, sounds more like a field being identified by some of its most popular (and lucrative) activities than by all the diverse studies that come under its general heading.

The story of one of the most successful drugs of all time, penicillin, seems bizarre, but the way we discover and develop drugs even now has similarities, being the result of chance, observation and a lot of slow, intensive chemistry. Until recently, drug design seemed doomed to remain a labour-intensive, trial-and-error process. The possibility of using information technology to plan intelligently and to automate processes related to the chemical synthesis of possible therapeutic compounds is very exciting for chemists and biochemists. The rewards for bringing a drug to market more rapidly are huge, so naturally this is what a lot of cheminformatics work is about. The span of academic cheminformatics is wide and is exemplified by the interests of the cheminformatics groups at the Centre for Molecular and Biomolecular Informatics at the University of Nijmegen in the Netherlands.

These interests include synthesis planning, reaction and structure retrieval, 3-D structure retrieval, modelling, computational chemistry, and visualisation tools and utilities. Trinity University’s Cheminformatics Web page, for another example, concerns itself with cheminformatics as the use of the Internet in chemistry.

Medical Informatics
“Biomedical Informatics is an emerging discipline that has been defined as the study, invention, and implementation of structures and algorithms to improve communication, understanding and management of medical information.” Medical informatics is more concerned with structures and algorithms for the manipulation of medical data than with the data itself. This suggests that one difference between bioinformatics and medical informatics as disciplines lies in their approaches to the data: there are bioinformaticists interested in the theory behind the manipulation of that data, and there are bioinformatics scientists concerned with the data itself and its biological implications. Medical informatics, for practical reasons, is more likely to deal with data obtained at “grosser” biological levels, that is, information from super-cellular systems right up to the population level, while most bioinformatics is concerned with information about cellular and biomolecular structures and systems.


1.10 Bioinformatics Applications

Molecular medicine
The human genome will have profound effects on the fields of biomedical research and clinical medicine. Every disease has a genetic component. This may be inherited (as is the case with an estimated 3000-4000 hereditary diseases, including cystic fibrosis and Huntington’s disease) or a result of the body’s response to an environmental stress that causes alterations in the genome (for example, cancers, heart disease and diabetes).

The completion of the human genome means that we can search for the genes directly associated with different diseases and begin to understand the molecular basis of these diseases more clearly. This new knowledge of the molecular mechanisms of disease will enable better treatments, cures and even preventative tests to be developed.

Personalised medicine
Clinical medicine will become more personalised with the development of the field of pharmacogenomics. This is the study of how an individual’s genetic inheritance affects the body’s response to drugs. At present, some drugs fail to make it to the market because a small percentage of the clinical patient population show adverse effects to a drug due to sequence variants in their DNA. As a result, potentially life-saving drugs never make it to the marketplace. Today, doctors have to use trial and error to find the best drug to treat a particular patient, as those with the same clinical symptoms can show a wide range of responses to the same treatment. In the future, doctors will be able to analyse a patient’s genetic profile and prescribe the best available drug therapy and dosage from the beginning.

Preventative medicine
With the specific details of the genetic mechanisms of diseases being unravelled, the development of diagnostic tests to measure a person’s susceptibility to different diseases may become a distinct reality. Preventative actions, such as a change of lifestyle or treatment at the earliest possible stages when it is more likely to be successful, could result in huge advances in our struggle to conquer disease.

Gene therapy
In the not too distant future, the potential for using genes themselves to treat disease may become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person’s genes. Currently, this field is in its infancy, with clinical trials for many different types of cancer and other diseases ongoing.

Drug development
At present all drugs on the market target only about 500 proteins. With an improved understanding of disease mechanisms, and using computational tools to identify and validate new drug targets, more specific medicines that act on the cause, not merely the symptoms, of the disease can be developed. These highly specific drugs promise to have fewer side effects than many of today’s medicines.

Microbial genome applications
Microorganisms are ubiquitous, that is, they are found everywhere. They have been found surviving and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are present in the environment, our bodies, the air, food and water. Traditionally, use has been made of a variety of microbial properties in the baking, brewing and food industries. The arrival of complete genome sequences, and their potential to provide greater insight into the microbial world and its capacities, could have broad and far-reaching implications for environmental, health, energy and industrial applications. For these reasons, in 1994 the US Department of Energy (DOE) initiated the MGP (Microbial Genome Project) to sequence genomes of bacteria useful in energy production, environmental cleanup, industrial processing and toxic waste reduction. By studying the genetic material of these organisms, scientists can begin to understand these microbes at a very fundamental level and isolate the genes that give them their unique abilities to survive under extreme conditions.


Waste cleanup
Deinococcus radiodurans, often called the world’s toughest bacterium, is the most radiation-resistant organism known. Scientists are interested in this organism because of its potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals.

Climate change studies
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels for energy, are thought to contribute to global climate change. Recently, the DOE (Department of Energy, USA) launched a program to decrease atmospheric carbon dioxide levels. One method of doing so is to study the genomes of microbes that use carbon dioxide as their sole carbon source.

Alternative energy sources
Scientists are studying the genomes of organisms that have an unusual capacity for generating energy from light.

Biotechnology
Some archaea and bacteria have potential for practical applications in industry and government-funded environmental remediation. These microorganisms thrive in water temperatures above the boiling point and therefore may provide the DOE, the Department of Defence and private companies with heat-stable enzymes suitable for use in industrial processes.

Other microbes are of high industrial interest as research objects because they are used by the chemical industry for the biotechnological production of the amino acid lysine, one of the essential amino acids in animal nutrition. Biotechnologically produced lysine is added to feed concentrates as a source of protein, and is an alternative to soybeans or meat and bone meal.

Micro-organisms are useful in the dairy industry for manufacturing dairy products such as buttermilk, yogurt and cheese. They are also used to prepare pickled vegetables, beer, wine, some breads and sausages, and other fermented foods. Researchers anticipate that understanding the physiology and genetic make-up of these bacteria will prove invaluable for food manufacturers, as well as for the pharmaceutical industry, which could use them as vehicles for delivering drugs.

Antibiotic resistance
Scientists have been examining the genome of a gut bacterium. They have discovered a virulence region made up of a number of antibiotic-resistance genes that may contribute to the bacterium’s transformation from a harmless gut inhabitant to a menacing invader. The discovery of the region, known as a pathogenicity island, could provide useful markers for detecting pathogenic strains and help to establish controls to prevent the spread of infection in hospital wards.

Forensic analysis of microbes
Scientists used their genomic tools to help distinguish the strain of Bacillus anthracis, a rod-shaped bacterium, used in the 2001 terrorist attack in Florida from closely related anthrax strains.

The reality of bioweapon creation
Scientists have recently built the poliovirus, the virus that causes poliomyelitis, using entirely artificial means. They did this using genomic data available on the Internet and materials from a mail-order chemical supply. The research was financed by the US Department of Defense as part of a biowarfare response program to prove to the world the reality of bioweapons. The researchers also hope their work will discourage officials from ever relaxing programs of immunisation. This project has been met with very mixed feelings.

Evolutionary studies
The sequencing of genomes from all three domains of life (eukaryota, bacteria and archaea) means that evolutionary studies can be performed in a quest to determine the tree of life and the last universal common ancestor.


Crop improvement
Comparative genetics of plant genomes has shown that the organisation of their genes has remained more conserved over evolutionary time than was previously believed. These findings suggest that information obtained from model crop systems can be used to suggest improvements to other food crops. At present the complete genomes of Arabidopsis thaliana (thale cress) and rice are available.

Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been successfully transferred to cotton, maize and potatoes. This new ability of the plants to resist insect attack means that the amount of insecticides being used can be reduced and hence the nutritional quality of the crops is increased.

Improve nutritional quality
Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin A, iron and other micronutrients. This work could have a profound impact in reducing occurrences of blindness and anaemia caused by deficiencies in Vitamin A and iron respectively. Scientists have also inserted a gene from yeast into the tomato, and the result is a plant whose fruit stays longer on the vine and has an extended shelf life.

Development of drought-resistant varieties
Progress has been made in developing cereal varieties that have a greater tolerance for soil alkalinity, free aluminium and iron toxicities. These varieties will allow agriculture to succeed in poorer soil areas, thus adding more land to the global production base. Research is also in progress to produce crop varieties capable of tolerating reduced water conditions.

Veterinary science
Sequencing projects for many farm animals, including cows, pigs and sheep, are now well under way in the hope that a better understanding of the biology of these organisms will have huge impacts on improving the production and health of livestock and ultimately have benefits for human nutrition.

Comparative studies
Analysing and comparing the genetic material of different species is an important method for studying the functions of genes, the mechanisms of inherited diseases and species evolution. Bioinformatics tools can be used to make comparisons between the numbers, locations and biochemical functions of genes in different organisms. Organisms that are suitable for use in experimental research are termed model organisms.

They have a number of properties that make them ideal for research purposes: they have short life spans, reproduce rapidly, are easy to handle and inexpensive, and can be manipulated at the genetic level.

An example of a model organism for human biology is the mouse. Mouse and human are very closely related (>98%) and for the most part we see a one-to-one correspondence between genes in the two species. Manipulation of the mouse at the molecular level and genome comparisons between the two species are revealing detailed information on the functions of human genes, the evolutionary relationship between the two species and the molecular mechanisms of many human diseases.


Summary
• Bioinformatics is a newly emerged scientific discipline for the computational analysis and storage of biological data.
• Bioinformatics is the combination of biology and information technology. The discipline encompasses any computational tools and methods used to manage, analyse and manipulate large sets of biological data.
• Bioinformatics is the field in which biology, computer science and information technology merge into a single discipline for managing and analysing biological data using advanced computing techniques.
• The three terms bioinformatics, computational biology and bioinformation infrastructure are very similar and are used interchangeably most of the time.
• Bioinformatics refers to database-like activities, involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time.
• Computational biology encompasses the use of algorithmic tools to facilitate biological analyses.
• Bioinformation infrastructure comprises the entire collection of information management systems, analysis tools and communication networks supporting biology. Thus, the latter may be viewed as a computational scaffold of the former two.
• A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also knows how to write interfaces for effective use of the tools.
• A bioinformatician, on the other hand, is a trained individual who only knows how to use bioinformatics tools, without a deeper understanding.
• Page display, hypertext links and other features are coded using a simple, cross-platform HyperText Markup Language (HTML) and viewed on UNIX workstations, PCs and Apple Macs as WWW pages using a browser.
• The first graphical WWW browser (Mosaic for X) and the first molecular biology WWW server (ExPASy) were made available in 1993.
• CORBA frameworks for bioinformatics tools and databases have been developed by, for example, NetGenics and the European Bioinformatics Institute (EBI).
• ROSETTA is built on an object-relational database system, which stores demographic and clinical data on sample donors and tracks the processing of samples and analytical results.
• Sequence generation, and its subsequent storage, interpretation and analysis, are entirely computer-dependent tasks.
• Due to the spectacular growth of biotechnology and molecular biology, tremendous amounts of nucleotide and protein sequence data are being produced.
• DNA is the genetic material in all organisms (with certain viruses being the exception) and it stores the instructions needed by the cell to perform its daily life functions.
• A gene is a DNA sequence that encodes a protein or an RNA molecule.



Recommended Reading
• Ramsden, J., 2009. Bioinformatics: An introduction, 2nd ed., Springer.
• Polański, A. & Kimmel, M., 2007. Bioinformatics, Springer.
• Letovsky, S. Bioinformatics: Databases and Systems, O’REILLY.


Self Assessment

1. The word ‘bio’ refers to _________.
a. biology
b. data mining
c. data warehousing
d. analysis

2. _________ encompasses the use of algorithmic tools to facilitate biological analyses.
a. Computational biology
b. Bioinformation infrastructure
c. Bioinformatics
d. Biology

3. What refers to database-like activities, involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time?
a. Computational biology
b. Bioinformation infrastructure
c. Bioinformatics
d. Biology

4. What comprises the entire collection of information management systems, analysis tools and communication networks supporting biology?
a. Computational biology
b. Bioinformation infrastructure
c. Bioinformatics
d. Biology

5. ___________ refers to two genes sharing a common evolutionary history.
a. Homology
b. Evolutionary biology
c. Biology
d. Bioinformatics

6. The _______ is a graphical interface based on hypertext by which text and graphics can be displayed and highlighted.
a. WWW
b. HTML
c. UNIX
d. JVM


7. Which of the following statements is false?
a. Proteins that show significant sequence conservation, indicating a clear evolutionary relationship, are said to be from the same protein family.
b. Protein is the basic molecule of life that directly controls the fundamental biology of nearly all organisms (except those where RNA is genetic material).
c. Any variations and errors in the nucleotide sequence of the genomic DNA or mutations may lead to development of genetic disorders or other metabolic changes.
d. A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also knows how to write interfaces for effective use of the tools.

8. Which of the following statements is false?
a. Data mining refers to database-like activities, involving persistent sets of data that are maintained in a consistent state over essentially indefinite periods of time.
b. Computational biology encompasses the use of algorithmic tools to facilitate biological analyses.
c. Bioinformation infrastructure comprises the entire collection of information management systems, analysis tools and communication networks supporting biology. Thus, the latter may be viewed as a computational scaffold of the former two.
d. The ultimate goal of bioinformatics is to uncover the wealth of biological information hidden in the mass of data and obtain a clearer insight into the fundamental biology of organisms.

9. Which of the following statements is false?
a. A bioinformaticist is a trained individual who only knows how to use bioinformatics tools without a deeper understanding.
b. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship.
c. Homology refers to two genes sharing a common evolutionary history.
d. Proteins that show significant sequence conservation, indicating a clear evolutionary relationship, are said to be from the same protein family.

10. Who is known as the father of genetics?
a. Stanley Cohen
b. Gregor Mendel
c. Paul Berg
d. Herbert Boyer


Chapter II

Biological Databases

Aim

The aim of this chapter is to:

• define biological database
• describe four major sequence databases
• understand the GenBank file format

Objectives

The objectives of this chapter are to:

• describe nucleotide databases
• elucidate the principal requirements on the public data services
• classify biological databases

Learning outcome

At the end of this chapter, you will be able to:

• understand the specific features of biological databases
• enumerate the categories of biological databases
• differentiate between primary and secondary databases


2.1 Introduction
Modern genomic research generates huge amounts of raw sequence data. As the volume of genomic data grows, sophisticated computational methodologies are required to manage it. The challenge in the genomics era is to store and handle this volume of information through the establishment and use of computer databases. Thus, the development of databases to handle the vast amount of molecular biological data is a fundamental task of bioinformatics.

A biological database is a large, organised body of persistent data, usually associated with computerised software designed to update, query and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each including the same set of information. A record in a nucleotide sequence database contains information such as a contact name, the input sequence with a description of the type of molecule, the scientific name of the source organism from which it was isolated, and literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met:
• Easy access to the information
• A method for extracting only that information needed to answer a specific biological question
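The record layout and the two requirements above can be sketched in a few lines. This is a minimal illustration only: the class `SequenceRecord`, its field names and the helper `records_for_organism` are hypothetical, chosen to mirror the fields listed in the text rather than the schema of any real database, and all entries are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One entry in a toy nucleotide sequence database (hypothetical schema)."""
    contact_name: str
    sequence: str
    molecule_type: str   # for example "DNA" or "mRNA"
    organism: str        # scientific name of the source organism
    citations: list = field(default_factory=list)


def records_for_organism(database, organism):
    """Extract only the records needed for one question: a given organism."""
    return [rec for rec in database if rec.organism == organism]


# Invented entries; names, sequences and citations are placeholders.
database = [
    SequenceRecord("A. Researcher", "ATGGCC", "DNA", "Homo sapiens", ["Ref 1"]),
    SequenceRecord("B. Researcher", "AUGGCC", "mRNA", "Mus musculus"),
    SequenceRecord("C. Researcher", "ATGTTT", "DNA", "Homo sapiens"),
]

human = records_for_organism(database, "Homo sapiens")
print([rec.sequence for rec in human])  # ['ATGGCC', 'ATGTTT']
```

Every record carries the same set of fields, which is what makes a simple filter like this possible.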

Currently, a lot of bioinformatics work is concerned with the technology of databases. These databases include both ‘public’ repositories of gene data, such as GenBank or the Protein Data Bank (PDB), and private databases, such as those used by research groups involved in gene mapping projects or those held by biotech companies. Making such databases accessible through open standards (such as the Web) is very important, since consumers of bioinformatics data use a range of computer platforms, from the more powerful and forbidding UNIX boxes favoured by the developers and curators to the far friendlier Macs often found populating the labs of computer-wary biologists.

DNA and RNA are the nucleic acids that store hereditary information about an organism. These macromolecules have a fixed structure, which biologists analyse with the help of bioinformatics tools and databases. A few popular databases are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of Bioinformatics and PIR (the Protein Information Resource).

History of biological databases
• 1965: Margaret Dayhoff et al. published the ‘Atlas of Protein Sequence and Structure’.
• 1980: Only 80 genes had been fully sequenced. The introduction of PCR techniques in 1983 led to a tremendous increase in nucleotide sequence data.
• 1982: EMBL initiated its DNA sequence database, followed within a year by GenBank and in 1984 by the DNA Data Bank of Japan (DDBJ).
• 1988: EMBL, GenBank and DDBJ agreed on a common format for data elements.

The specific features of biological databases are:
• Sub-class of scientific databases
• Autonomous: many independent maintainers
• Heterogeneous data formats (for example, various data formats for the same data entities; various types of biological data: genomic, microarray, proteomic)
• Dynamic: frequent and continuous changes in data content (and in data schema)
• Broad domain knowledge
• Workflow-oriented: databases and a rich set of analysis tools
• Information integration is essential: data aggregation from several databases


Depending on the research project, biological data comes in many different flavours. Most researchers work with a number of different formats, even though they may not realise this at first hand. Some of the data that can be found when researching any biological question is briefly listed below. Some of the data in these databases partly overlap and refer to each other.

• Text: Examples of text databases are PubMed and OMIM, which contain textual information and references related to biological data.
• Sequence data: GenBank and UniProt exemplify biological databases containing DNA and protein sequences, respectively.
• Protein structure: You can also find databases specifically related to protein structure files (for example, the PDB, SCOP and CATH databases).
• Links: Most databases contain information on sequence data within a specific field or subject. A different type of database is, for example, the InterPro database, which consists of a collection of links from protein domains and families to other databases providing related resources.
• Images: In the field of 2D gel and microscopic images you can also find various databases containing data, for example, identified on reference gel images.
• Numerical data: Gene expression data as well as other microarray data are also accessible from a number of databases. An example is the ArrayExpress database of the European Bioinformatics Institute (EBI).
• Biological matter: Frozen bacterial strains, vectors and so on are also to be found in databases collecting information on each of these specific biological matters, for example, the UniVec database hosted by NCBI.

2.2 Categories of Biological Databases
These include:
• Nucleotide sequences
• Genomics (information on gene chromosomal location and nomenclature, with links to sequence databases)
• Mutation/polymorphism (sequence variations, whether or not linked to genetic diseases)
• Protein sequences
• Protein domain/family
• Proteomics
• Microarray (high-dimensional data: profiles of thousands of genes under hundreds or thousands of different conditions)
• Organism-specific
• 3D structure
• Metabolism (for example, metabolic pathways, represented as graph data)
• Bibliography

2.3 The Database Industry
Public databases have become the major medium through which genome sequence data are published. This is because of the high rate of data production and the need for researchers to have rapid access to new data. Public databases and the data services that support them are important resources in bioinformatics. However, successful public data services suffer from continually escalating demands from the biological community.

EMBL and GenBank are the two major nucleotide databases. EMBL is the European version and GenBank is the American one. The two collaborate and synchronise their databases so that they contain the same information. The growth of DNA databases has followed an exponential trend, with a doubling time now estimated at 9 to 12 months. In January 1998, EMBL contained more than a million entries, representing more than 15,500 species, although most data is from model organisms. These databases are updated daily.
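To make the quoted doubling time concrete, the following sketch applies the standard exponential growth formula N(t) = N0 * 2^(t/T), where T is the doubling time. The function name `projected_entries` and the three-year horizon are assumptions made for illustration; the starting figure of one million entries is simply the round number quoted above, and the result is a projection under constant doubling, not a record of actual database sizes.

```python
def projected_entries(initial: float, months: float, doubling_months: float) -> float:
    """Exponential growth: N(t) = N0 * 2 ** (t / T), with T the doubling time."""
    return initial * 2 ** (months / doubling_months)


# Starting from one million entries, project three years (36 months) ahead
# for the two ends of the quoted 9 to 12 month doubling range.
for doubling in (9, 12):
    print(doubling, projected_entries(1_000_000, 36, doubling))
```

Under these assumptions, a 9-month doubling time gives a 16-fold increase over three years, while a 12-month doubling time gives an 8-fold increase, which shows how sensitive storage planning is to the doubling estimate.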


The principal requirements on the public data services are:
• Data quality: Data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
• Supporting data: Database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.
• Deep annotation: Deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
• Timeliness: The basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
• Integration: Each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.
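The integration requirement above, following cross-references from one database to another, can be sketched with plain dictionaries. Everything here is hypothetical: the three toy databases, the identifiers such as `NUC0001`, the `xrefs` field and the function `follow_links` are invented for illustration and do not correspond to real accession numbers or to any data service's actual interface.

```python
# Three toy "databases" keyed by identifier. All identifiers and field
# names are invented placeholders, not real accession numbers.
nucleotide_db = {
    "NUC0001": {"description": "example gene", "xrefs": [("protein_db", "PRO0001")]},
}
protein_db = {
    "PRO0001": {"description": "example protein", "xrefs": [("structure_db", "STR0001")]},
}
structure_db = {
    "STR0001": {"description": "example 3D structure", "xrefs": []},
}
databases = {
    "nucleotide_db": nucleotide_db,
    "protein_db": protein_db,
    "structure_db": structure_db,
}


def follow_links(db_name, identifier, seen=None):
    """Collect every entry reachable from one entry via cross-references."""
    seen = set() if seen is None else seen
    key = (db_name, identifier)
    if key in seen:  # guard against circular cross-references
        return []
    seen.add(key)
    entry = databases[db_name][identifier]
    reached = [(db_name, identifier, entry["description"])]
    for target_db, target_id in entry["xrefs"]:
        reached.extend(follow_links(target_db, target_id, seen))
    return reached


for hit in follow_links("nucleotide_db", "NUC0001"):
    print(hit)
```

Starting from a single nucleotide entry, the traversal reaches the related protein and structure entries, which is the behaviour the integration requirement asks data services to support.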

2.4 Classification of Biological Databases

Biological databases can be broadly classified into sequence and structure databases:

Sequence databases

With the current speed of sequencing projects, a lot of work is needed to store and organise sequence data. Most sequence databases store additional information along with the sequence. This could be references to the original research papers stored in PubMed, information about annotated regions, regions where conflicting residues have been published, information on species and much more. So far, a common standard for handling all of this information has not been created. Thus, every database has its own standard on how to store the data. However, most data is stored in a plain text format (flat file) and can thus be opened in standard software such as Word, Notepad and so on. Large amounts of plain text may not be easily comprehensible, however. Another problem with storing data in a flat file format is the size of the database. Databases with long sequence entries may become too large for most users to handle on a normal PC.

Sequence databases are applicable to both nucleic acid sequences and protein sequences, whereas structure databases are applicable only to proteins. The first database was created within a short period after the insulin protein sequence was made available in 1956. Incidentally, insulin was the first protein to be sequenced; its sequence consists of just 51 residues. An alternative approach to flat files, used by most websites with large databases, is to store all the information in a relational database.

Relational databases have connections or pointers to additional data in other databases or tables. Thus, one can quickly and easily retrieve a large amount of information on one particular sequence.
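The idea of following pointers between tables can be sketched with Python's built-in `sqlite3` module. The table layout, the accession `DEMO0001` and the citations below are hypothetical and chosen only for illustration; real sequence databases use far richer schemas.

```python
import sqlite3

# In-memory database with two linked tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sequence (
    accession TEXT PRIMARY KEY,
    organism  TEXT,
    residues  TEXT
);
CREATE TABLE reference (
    accession TEXT REFERENCES sequence(accession),
    citation  TEXT
);
""")
conn.execute("INSERT INTO sequence VALUES (?, ?, ?)",
             ("DEMO0001", "Example organism", "ATGGCCATTG"))
conn.executemany("INSERT INTO reference VALUES (?, ?)",
                 [("DEMO0001", "Hypothetical paper A"),
                  ("DEMO0001", "Hypothetical paper B")])


def lookup(accession):
    """Return (organism, residues, citation) rows for one accession."""
    query = """
        SELECT s.organism, s.residues, r.citation
        FROM sequence AS s
        JOIN reference AS r ON r.accession = s.accession
        WHERE s.accession = ?
        ORDER BY r.citation
    """
    return conn.execute(query, (accession,)).fetchall()


for row in lookup("DEMO0001"):
    print(row)
```

A single query with a join gathers the sequence and all of its linked references, which is the kind of retrieval that would require scanning and cross-matching text in a flat file.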

One of the characteristics of these databases is that they are maintained and kept up to date on a regular basis. The four major sequence databases are:

• GenBank: A US-based comprehensive collection of various biological data.
• EMBL: The main European resource of nucleotide sequence data.
• DDBJ: The DNA Data Bank of Japan.
• UniProt: The universal protein resource.


The four databases are further described in detail as follows:

GenBank at NCBI

GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known genetic sequences. It has a flat file structure, which is an ASCII text file, readable by both humans and computers. In addition to sequence data, GenBank files contain information like accession numbers and gene names, phylogenetic classification and references to published literature. There were approximately 191,400,000 bases and 183,000 sequences as of June 1994.

The National Center for Biotechnology Information (NCBI), part of the US National Institutes of Health and hosted at http://www.ncbi.nlm.nih.gov/, has achieved a strong position in collecting biological data of almost any kind. In addition to storing sequence data, NCBI stores almost all kinds of data related to biological sequences. PubMed is probably the most used service that NCBI offers to its users, together with BLAST, a tool for searching for homologous sequences in the entire database. The NCBI staff also provide software tools for handling sequence data.

[Figure: bases in GenBank, in billions (axis 0 to 90), plotted from Jan-83 to Jan-07]

Fig. 2.1 Growth of the GenBank database (Source: http://www.clcbio.com/sciencearticles/BE-biodatabase.pdf)


Fig. 2.2 GenBank file format (Source: www.bioinf.org.uk/teaching/c40/ppt/intro_databases.ppt)
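Because a flat file is plain ASCII text, it can be read with a few lines of code. The sketch below parses a made-up record that is only shaped like a GenBank entry (keyword lines, an ORIGIN block and a closing `//`); the record content, the accession `DEMO0001` and the function name `parse_flat_file` are assumptions for illustration, and real entries carry many more fields.

```python
import re

# Hypothetical record laid out like a GenBank flat-file entry.
RECORD = """\
LOCUS       DEMO0001                  20 bp    DNA     linear   UNK 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   DEMO0001
ORIGIN
        1 atggccattg taatgggccg
//
"""


def parse_flat_file(text):
    """Collect keyword fields and the sequence from one flat-file record."""
    fields = {}
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if in_origin:
            # Sequence lines mix position numbers, spaces and letters.
            sequence_parts.append(re.sub(r"[^a-zA-Z]", "", line))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].strip():               # keyword starts in column 1
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["SEQUENCE"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_file(RECORD)
print(record["ACCESSION"], len(record["SEQUENCE"]), record["SEQUENCE"])
```

The same file is readable by a person in a text editor and by a program, which is the property the flat file format is valued for.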


EMBL

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted from researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). The database currently doubles in size every 18 months and currently (June 1994) contains nearly 2 million bases from 182,615 sequence entries.

The EMBL Nucleotide Sequence Database is hosted by the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL), at http://www.ebi.ac.uk/embl/. DNA and RNA sequences are submitted directly to the database by individual researchers and also by genome sequencing projects and patent applications, and the database is produced and maintained in collaboration with both GenBank and the DNA Data Bank of Japan (DDBJ). The international collection of sequence data is exchanged between EMBL, GenBank and DDBJ on a daily basis, so global sequence information can be retrieved from any of the three databases.

DNA Data Bank of Japan (DDBJ)

DDBJ (DNA Data Bank of Japan) is a nucleotide database hosted in Japan that accepts DNA submissions mainly from Japanese researchers. It works in close collaboration with GenBank and EMBL, and the three databases store almost identical data. DDBJ also provides various search and analysis tools through the website http://www.ddbj.nig.ac.jp/.

UniProt

UniProt is the universal protein resource, and as stated on its website the database intends to be both comprehensive and of high quality. At the UniProt website, http://www.uniprot.org/, data has been divided into three classifications:

• Core data
• Supporting data
• Information

You can search for information on your sequence in these three categories. You can also run BLAST searches and create alignments; a couple of other services are also provided. The UniProt Knowledgebase, UniProtKB, contains translations of the coding sequences submitted to EMBL, GenBank and DDBJ, and it contains all publicly available protein sequences.


[Figure: the three collaborating data banks, linked by an International Advisory Meeting and a Collaborative Meeting: EMBL (Europe; EMBL, EBI), DDBJ (Japan; NIG, CIB) and GenBank (USA; NLM, NCBI); TrEMBL and NRDB are also shown]

Fig. 2.3 International nucleotide data banks (Source: www.bioinf.org.uk/teaching/c40/ppt/intro_databases.ppt)

Other valuable databases and resources

SwissProt

This is a protein sequence database that provides a high level of integration with other databases and also has a very low level of redundancy (that is, few identical sequences are present in the database).

PubMed

PubMed gives biological data in text format. This service, provided by the U.S. National Library of Medicine, links to more than 17 million resources from different journals within the field of life science. A relatively new functionality at the NCBI website is the possibility to sign up for an account at My NCBI, a service offering customised and automated PubMed updates. After registration at My NCBI, you can save your searches and set up automated searches alerting you by e-mail. You can also customise, for example, filtering options on the searches. PubMed can be accessed at http://www.ncbi.nlm.nih.gov/pubmed/.

Ensembl

Ensembl is a project developing software for automatic annotation of eukaryotic genomes. EMBL-EBI and the Sanger Institute are behind the project, and at the website, http://www.ensembl.org/index.html, you can search within all the data from the Ensembl project, divided by species.

EBI

One of the larger European bioinformatic centers, the European Bioinformatics Institute (EBI), hosts a number of databases and many methods to help analyse all this data. The EBI website, http://www.ebi.ac.uk/, also hosts the EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl/).


InterPro

Most of the databases mentioned above also provide links to related information in other databases. The InterPro database tries, to a larger extent, to link from one protein domain or family to a number of different databases, which individually contain a lot of relevant information. The InterPro database does not contain any sequence information but is largely a mesh of hyperlinks to various other resources. The link to InterPro is http://www.ebi.ac.uk/interpro/

Pfam

A very useful database for finding protein domains is the Pfam database. Pfam currently stores information on more than 9000 protein families. When working on an unknown protein, it is often very valuable to retrieve information on the actual protein family, as identification of functional domains within a protein sequence can benefit your knowledge about the role and function of the protein. You can access the Pfam database at http://pfam.janelia.org/.

Structure databases

Information about protein structure is not developing as fast as sequence data information, due to the slower pace in solving 3D structures of proteins. The RCSB Protein Data Bank, hosted at http://www.pdb.org, holds slightly more than 48000 structures. At the website, you can download structure files and you are provided with a number of tools for structure studies.

• SCOP: Structural Classification of Proteins is accessible at http://scop.berkeley.edu/. The SCOP database describes structural and evolutionary relationships between all known protein structures and also provides a number of links to other on-line resources related to protein structure and to sequence databases in general.
• CATH Protein Structure Classification: The CATH database, hosted at http://www.cathdb.info/, classifies protein structures from the PDB according to a four-level hierarchy.

Species-specific databases

A number of species-specific databases hold very detailed information about only one particular species. During this period, the 3D structures of proteins were studied and the well-known PDB was developed as the first protein structure database, with only 10 entries in 1972; it has now grown into a large database with over 10,000 entries. While the initial databases of protein sequences were maintained at individual laboratories, the development of a consolidated formal database known as the SWISS-PROT protein sequence database was initiated in 1986; it recently held about 70,000 protein sequences from more than 5000 model organisms, a small fraction of all known organisms. This huge variety of data resources is now available for study and research by both academic institutions and industries. They are made available as public domain information in the larger interest of the research community through the Internet (www.ncbi.nlm.nih.gov) and CD-ROMs (on request from www.rcsb.org).

Databases can be classified into:

Primary databases

A primary database contains information of the sequence or structure alone. Examples of these include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for genome sequences and the Protein Data Bank for protein structures. Biological databases are archives of consistent data stored in an efficient manner. These databases contain data from a wide spectrum of molecular biology areas. Primary or archived databases contain information and annotation of DNA and protein sequences, DNA and protein structures and DNA and protein expression profiles.

Secondary databases

A secondary database contains information derived from the primary database. A secondary sequence database contains information such as the conserved sequence, signature sequence and active site residues of the protein families, arrived at by multiple sequence alignment of a set of related proteins.
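How a conserved sequence can be derived from a multiple sequence alignment is sketched below. The four aligned sequences, the function name `consensus` and the majority threshold are hypothetical choices for illustration; secondary databases use more elaborate methods.

```python
from collections import Counter


def consensus(aligned, threshold=0.5):
    """Return a consensus string for equal-length aligned sequences.

    A column contributes its most common residue when that residue is not
    a gap and occurs in more than `threshold` of the sequences; otherwise
    the column is written as 'x' (no clear consensus).
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        conserved = residue != "-" and count / len(aligned) > threshold
        result.append(residue if conserved else "x")
    return "".join(result)


# Hypothetical alignment of four related protein fragments.
alignment = ["GDSGGP", "GDSGGA", "GDSGGP", "GESGGP"]
print(consensus(alignment))                  # majority rule
print(consensus(alignment, threshold=0.8))   # stricter rule
```

Raising the threshold keeps only the columns that are almost invariant, which is closer to what a signature sequence for a protein family records.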


A secondary structure database contains entries of the PDB in an organised way. These contain entries classified according to their structure, such as all alpha proteins, all beta proteins, and so on. These also contain information on conserved secondary structure motifs of a particular protein. Some of the secondary databases created and hosted by various researchers at their individual laboratories include SCOP, developed at Cambridge University; CATH, developed at University College London; PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.

Secondary or derived databases contain the results of analysis on the primary resources including information on sequence patterns or motifs, variants and mutations and evolutionary relationships. Information from the literature is contained in bibliographic databases (Medline). These databases should be easily accessible, and an intuitive query system should be provided to allow researchers to obtain very specific information on a particular biological subject. The data should be provided in a clear, consistent manner with some visualisation tools for biological interpretation. Specialist databases for particular subjects have been set up, for example the EMBL database for nucleotide sequence data, the UniProtKB/Swiss-Prot protein database and PDB (a 3D protein structure database).

Scientists also need to be able to integrate the information obtained from the underlying heterogeneous databases in a sensible manner in order to have an overview of their biological subject. The Sequence Retrieval System (SRS) is a powerful querying tool provided by EBI that links information from more than 150 heterogeneous resources.

Composite databases

A composite database amalgamates different primary database sources, which obviates the need to search multiple resources. Different composite databases use different primary databases and different criteria in their search algorithms. Various options for search have also been incorporated in the composite databases. NCBI hosts these nucleotide and protein databases on a large, highly available, redundant array of computer servers and provides free access to those involved in research.

It also links to OMIM (Online Mendelian Inheritance in Man), which contains information about the proteins involved in genetic diseases. The growth of the primary databases gave rise to questions on the format of sequences and on the reliability and comprehensiveness of databases. To address the format issues, in-house software solutions have been developed to convert the format of one database to another. A public domain program (FORCON) can also be used. Newer analysis tools are designed to accept data in multiple formats.

The problem in data reliability is the possibility of misannotations. Misannotations are sometimes introduced by the automation of the annotation process carried out with computers. A misannotation, once introduced, multiplies in subsequent additions and may accumulate to an unbelievable extent and create confusion. A possible solution to prevent this from happening is to flag any protein sequence which has been annotated by sequence comparison but whose function has not been validated by experimental methods.

2.5 The Creation of Sequence Databases

Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, and so on). Each sequence of nucleotides or amino acids represents a particular gene or protein, respectively. Sequences are represented in shorthand, using single-letter designations. This decreases the space necessary to store information and increases processing speed for analysis.
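The single-letter shorthand can be illustrated with the standard one-letter codes for the twenty common amino acids. The helper name `to_one_letter` and the three-residue example are assumptions made for this sketch.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert a list of three-letter residue names to a one-letter string."""
    return "".join(THREE_TO_ONE[name] for name in residues)


peptide = ["Thr", "Ser", "Gly"]   # the residues named in the text
short = to_one_letter(peptide)
print(short, len("".join(peptide)), "characters become", len(short))
```

Storing one character per residue cuts the text to a third of its three-letter form, and string operations on the shorter form are correspondingly cheaper.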

While most biological databases contain nucleotide and protein sequence information, there are also databases that include taxonomic information such as the structural and biochemical characteristics of organisms. However, the power and ease of using sequence information has made it the method of choice in modern analysis.

Contributions from the fields of biology and chemistry have facilitated an increase in the speed of sequencing genes and proteins. The advent of cloning technology allowed foreign DNA sequences to be easily introduced into bacteria. In this way, rapid mass production of particular DNA sequences became possible. Oligonucleotide synthesis provided researchers with the ability to construct short fragments of DNA with sequences of their own choosing. These oligonucleotides could then be used in probing vast libraries of DNA to extract genes containing that sequence.


Alternatively, these DNA fragments could also be used in polymerase chain reactions to amplify existing DNA sequences or to modify these sequences. With these techniques in place, progress in biological research increased exponentially. However, for researchers to benefit from all this information, two additional things were required:

• Ready access to the collected pool of sequence information and
• A way to extract from this pool only those sequences of interest to a given researcher

Collecting all necessary sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organisation and analysis of this data still remained. It could take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins. Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organise sequence information into databases, but they can also be used to analyse sequence data rapidly.

The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created. Theoretical scientists have derived new and sophisticated algorithms, which allow sequences to be readily compared using probability theories. These comparisons become the basis for determining gene function, developing phylogenetic relationships and simulating protein models.

The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyse it. Databases of existing sequencing data can be used to identify homologues of new molecules that have been amplified and sequenced in the lab. The property of sharing a common ancestor, homology, can be a very powerful indicator in bioinformatics.

Acquisition of sequence data

Bioinformatics tools can be used to obtain sequences of genes or proteins of interest, either from material obtained, labelled, prepared and examined in electric fields by individual researchers/groups or from repositories of sequences from previously investigated material.

Analysis of data

Both types of sequence can then be analysed in many ways with bioinformatics tools. They can be assembled. Note that this is one of the occasions when the meaning of a biological term differs markedly from a computational one. Computer scientists, banish from your mind any thought of assembly language. Sequencing can only be performed for relatively short stretches of a biomolecule, and finished sequences are therefore prepared by arranging overlapping ‘reads’ of monomers (single beads on a molecular chain) into a single continuous passage of ‘code’.

This is the bioinformatics sense of assembly. They can be mapped (that is, their sequences can be parsed to find sites where so-called ‘restriction enzymes’ will cut them). They can be compared, usually by aligning corresponding segments and looking for matching and mismatching letters in their sequences. Genes or proteins which are sufficiently similar are likely to be related and are therefore said to be ‘homologous’ to each other, although the whole truth is rather more complicated than this. Such cousins are called ‘homologues’. If a homologue (a related molecule) exists, then a newly discovered protein may be modelled; that is, the three-dimensional structure of the gene product can be predicted without doing laboratory experiments.
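Assembly and mapping can both be sketched in a few lines of Python. The greedy overlap merge below is a deliberately simple stand-in for real assemblers, the three reads are invented, and the function names are hypothetical; the only outside fact used is that the restriction enzyme EcoRI recognises the site GAATTC.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def assemble(reads, min_len=3):
    """Greedy assembly: repeatedly merge the pair with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:                 # nothing left that overlaps
            break
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)]
        contigs.append(merged)
    return contigs


def find_sites(sequence, site="GAATTC"):
    """Return 0-based start positions of a recognition site (default EcoRI)."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


reads = ["ATGGCCATT", "CATTGTAAT", "TAATGGGCC"]   # invented overlapping reads
print(assemble(reads))
print(find_sites("AAGAATTCGGGAATTC"))
```

Real sequence data contain errors and repeats, so production assemblers use far more careful overlap scoring than this exact-match sketch.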

Bioinformatics is used in primer design. Primers are short sequences needed to make many copies of (amplify) a piece of DNA, as used in PCR (the Polymerase Chain Reaction). Bioinformatics is used to attempt to predict the function of actual gene products. Information about the similarity, and, by implication, the relatedness of proteins is used to trace the ‘family trees’ of different molecules through evolutionary time.
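A very reduced view of primer design is sketched below: take the two ends of a template, reverse-complement the far end, and estimate a melting temperature. The template, the primer length and the function names are assumptions, and the melting temperature uses the rough rule of thumb Tm = 2(A+T) + 4(G+C) in degrees Celsius, which is only a first approximation for short primers.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def rule_of_thumb_tm(primer):
    """Approximate melting temperature: 2 * (A + T) + 4 * (G + C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def primer_pair(template, length=18):
    """Pick a forward primer from the start and a reverse primer from the end."""
    forward = template[:length].upper()
    reverse = reverse_complement(template[-length:])
    return forward, reverse


# Invented template, used only to exercise the functions.
template = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAGTTACCGGATTACGA"
forward, reverse = primer_pair(template)
print(forward, rule_of_thumb_tm(forward))
print(reverse, rule_of_thumb_tm(reverse))
```

Practical primer design tools also check for secondary structure, primer dimers and uniqueness within the genome, none of which is attempted here.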


There are various other applications of computer analysis to sequence data, but, with so much raw data being generated by the Human Genome Project and other initiatives in biology, computers are presently essential for many biologists just to manage their day-to-day results. Molecular modelling/structural biology is a growing field, which can be considered part of bioinformatics. There are, for example, tools which allow users (often via the Net) to make pretty good predictions of the secondary structure of proteins arising from a given amino acid sequence, often based on known ‘solved’ structures and other sequenced molecules acquired by structural biologists. Structural biologists use ‘bioinformatics’ to handle the vast and complex data from X-ray crystallography, nuclear magnetic resonance (NMR) and electron microscopy investigations and create the 3-D models of molecules that seem to be everywhere in the media.

2.6 Bioinformatics Programs and Tools

Bioinformatic tools are software programs that are designed for extracting meaningful information from the mass of data and for carrying out this analysis step.

Factors that must be taken into consideration when designing these tools are:
• The end user (the biologist) may not be a frequent user of computer technology
• These software tools must be made available over the internet given the global distribution of the scientific research community

Major categories of bioinformatics tools

There are both standard and customised products to meet the requirements of particular projects. There is data-mining software that retrieves data from genomic sequence databases, and there are visualisation tools to analyse and retrieve information from proteomic databases. These can be classified as homology and similarity tools, protein functional analysis tools, sequence analysis tools and miscellaneous tools. Here is a brief description of a few of these. Everyday bioinformatics is done with sequence search programs like BLAST, sequence analysis programs like the EMBOSS and Staden packages, structure prediction programs like THREADER or PHD, or molecular imaging/modelling programs like RasMol and WHATIF.

Structural analysis

These sets of tools allow you to compare structures with the known structure databases. The function of a protein is more directly a consequence of its structure than of its sequence, with structural homologs tending to share functions. The determination of a protein’s 2D/3D structure is crucial in the study of its function.

Homology and similarity tools

Homologous sequences are sequences that are related by divergence from a common ancestor. Thus, the degree of similarity between two sequences can be measured, while their homology is a case of being either true or false. This set of tools can be used to identify similarities between novel query sequences of unknown structure and function and database sequences whose structure and function have been elucidated.
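The distinction can be made concrete: similarity is a number that can be computed, whereas homology is a yes-or-no conclusion drawn from it. The sketch below computes percent identity for two sequences that are assumed to be already aligned (gaps written as "-"); the sequences and the function name are illustrative.

```python
def percent_identity(a, b):
    """Percent of identical positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are not counted.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


print(percent_identity("ACGT-A", "ACCTGA"))   # 4 of 5 compared columns match
```

A high percent identity makes common ancestry more plausible, but the number itself is a measurement and never a degree of homology.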

Protein function analysis

This group of programs allows you to compare your protein sequence to the secondary (or derived) protein databases that contain information on motifs, signatures and protein domains. Highly significant hits against these different pattern databases allow you to approximate the biochemical function of your query protein.
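Matching a query against a motif can be sketched by turning a PROSITE-style pattern into a regular expression. The converter below handles only the basic syntax (x for any residue, square brackets for allowed residues, curly brackets for excluded residues, and repeat counts in parentheses); the function names and the query sequence are hypothetical, and the example pattern is the widely cited N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate basic PROSITE pattern syntax into a Python regex string."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                      # element separators
    regex = regex.replace("x", ".")                     # any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # excluded residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")   # terminal anchors
    return regex


def find_motif(sequence, pattern):
    """Return 0-based start positions of non-overlapping motif matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in compiled.finditer(sequence)]


query = "MKNASAPNPTQ"   # invented query sequence
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif(query, "N-{P}-[ST]-{P}"))
```

In the query above, the asparagine at position 2 matches, while the one at position 7 does not because it is followed by proline.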

Protein sequence analysis

Apart from maintaining the large databases, mining useful information from these sets of primary and secondary databases is very important. Many efficient algorithms have been developed for data mining and knowledge discovery. These are computation intensive and need fast and parallel computing facilities for handling multiple queries simultaneously. It is these search tools that connect the user and the databases. One of the most widely used search programs is BLAST (Basic Local Alignment Search Tool).


2.7 Bioinformatics Tools

BLAST

Sequence data are compared with one another using the Basic Local Alignment Search Tool or BLAST (Altschul et al., 1990). This algorithm attempts to find “high-scoring segment pairs” (HSPs), which are pairs of sequence segments that can be aligned with one another and, when aligned, meet certain scoring and statistical criteria.

BLAST is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. BLAST uses a heuristic algorithm, which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity. This is a primary criterion in sequence analysis. Other tools available include CLUSTALW for multiple sequence alignment.

BLAST (Basic Local Alignment Search Tool) comes under the category of homology and similarity tools. It is a set of search programs, available for Windows and other platforms, used to perform fast similarity searches regardless of whether the query is for protein or DNA. Nucleotide sequences in a database can be compared, and a protein database can be searched to find a match against the queried protein sequence. NCBI has also introduced a new queuing system for BLAST (Q BLAST) that allows users to retrieve results at their convenience and format their results multiple times with different formatting options.

Depending on the type of sequences to compare, there are different programs:
• blastp compares an amino acid query sequence against a protein sequence database
• blastn compares a nucleotide query sequence against a nucleotide sequence database
• blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence database
• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
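What "translated in all reading frames" means can be shown with a short sketch that produces the six conceptual translations of a nucleotide sequence: three offsets on the given strand and three on its reverse complement. The function names and the nine-base example are assumptions; the table is the standard genetic code, with "*" marking stop codons. This illustrates the idea only and is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Programs such as blastx and tblastx compare each of these translations as a protein sequence, which is why they can find protein-level similarity in uncharacterised nucleotide data.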

Databases available for BLAST search include the following:

Protein sequence databases
• nr: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
• month: All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30 days.
• Swissprot: Last major release of the SWISS-PROT protein sequence database (no updates)
• Drosophila genome: Drosophila genome proteins provided by Celera and Berkeley Drosophila Genome Project (BDGP) (www.fruitfly.org)
• Yeast: Yeast (Saccharomyces cerevisiae) genomic CDS translations
• ecoli: Escherichia coli genomic CDS translations
• pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank (www.pdb.org)
• kabat: Kabat’s database of sequences of immunological interest (http://immuno.bme.nwu.edu)
• alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.


Nucleotide sequence databases
• nr: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer ‘non-redundant’.
• month: All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
• Drosophila genome: Drosophila genome provided by Celera and Berkeley Drosophila Genome Project
• dbest: Database of GenBank+EMBL+DDBJ sequences from EST Divisions
• dbsts: Database of GenBank+EMBL+DDBJ sequences from STS Divisions
• htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2
• gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
• Yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
• E. coli: Escherichia coli genomic nucleotide sequences
• pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
• kabat: Kabat’s database of sequences of immunological interest
• vector: Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/
• mito: Database of mitochondrial sequences
• alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory).
• Epd: Eukaryotic Promoter Database found on the web at http://www.genome.ad.jp/dbgetbin/www_bfind?epd


Fig. 2.4 Application of bioinformatics in medical science (Source: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pdfs/biodbseq.pdf)

FASTA

FASTA is an alignment program for protein sequences created by Pearson and Lipman in 1988. The program is one of the many heuristic algorithms proposed to speed up sequence comparison. The basic idea is to add a fast pre-screen step to locate the highly matching segments between two sequences, and then extend these matching segments to local alignments using more rigorous algorithms such as Smith-Waterman.
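The two-stage idea can be sketched as a cheap shared-word test followed by a rigorous local alignment score. The code below is a simplified illustration and not FASTA's actual procedure: the pre-screen only checks for any shared word of length k, the second stage is a plain Smith-Waterman score with a linear gap penalty, and the scoring values and function names are assumptions.

```python
def shares_word(a, b, k=3):
    """Cheap pre-screen: do the two sequences share any word of length k?"""
    words = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[j:j + k] in words for j in range(len(b) - k + 1))


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,                       # start a new local alignment
                        previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                        previous[j] + gap,       # gap in b
                        current[j - 1] + gap)    # gap in a
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def screened_score(a, b, k=3):
    """Run the rigorous step only when the pre-screen finds a shared word."""
    return smith_waterman_score(a, b) if shares_word(a, b, k) else 0


print(screened_score("TTTACGTTT", "GGGACGGGG"))   # shared word ACG
print(screened_score("AAAA", "TTTT"))             # rejected by the pre-screen
```

The pre-screen is linear in the sequence lengths, while the dynamic programming step is quadratic, so skipping unpromising pairs is where the speed-up comes from.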

EMBOSS

EMBOSS (European Molecular Biology Open Software Suite) is a software-analysis package. It can work with data in a range of formats and also retrieve sequence data transparently from the Web. Extensive libraries are also provided with this package, allowing other scientists to release their software as open source. It provides a set of sequence-analysis programs, and also supports all UNIX platforms.

ClustalW

It is a fully automated sequence alignment tool for DNA and protein sequences. It returns the best match over the total length of the input sequences, be they protein or nucleic acid.


RasMol

It is a powerful research tool to display the structure of DNA, proteins, and smaller molecules. Protein Explorer, a derivative of RasMol, is an easier-to-use program.

PROSPECT

PROSPECT (Protein Structure Prediction and Evaluation Computer ToolKit) is a protein structure prediction system that employs a computational technique called protein threading to construct a protein’s 3-D model.

PatternHunter

PatternHunter, based on Java, can identify all approximate repeats in a complete genome in a short time using little memory on a desktop computer. Its features are its advanced patented algorithm and data structures, and the Java language used to create it. The Java language version of PatternHunter is just 40 KB, only 1% the size of Blast, while offering a large portion of its functionality.

COPIA
COPIA (Consensus Pattern Identification and Analysis) is a protein structure analysis tool for discovering motifs (conserved regions) in a family of protein sequences. Such motifs can then be used to determine membership of the family for new protein sequences, to predict the secondary and tertiary structure and function of proteins, and to study the evolutionary history of the sequences.

2.8 Application of Programmes in Bioinformatics
The applications include:

Java in bioinformatics
Since research centers are scattered all around the globe, ranging from private to academic settings, and a range of hardware and operating systems is in use, Java is emerging as a key player in bioinformatics. Physiome Sciences’ computer-based biological simulation technologies and Bioinformatics Solutions’ PatternHunter are two examples of the growing adoption of Java in bioinformatics.

Perl in bioinformatics
String manipulation, regular expression matching, file parsing, data format inter-conversion and so on are the common text-processing tasks performed in bioinformatics. Perl excels in such tasks and is being used by many developers. There are no standard Perl modules designed specifically for the field of bioinformatics. However, developers have designed several individual modules of their own for the purpose, which have become quite popular and are coordinated by the BioPerl project.
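
The text-processing tasks named above can be sketched briefly. The example below is written in Python rather than Perl and does not use BioPerl. The helper names `parse_fasta` and `find_motif`, the sample records and the motif `GAATTC` are all illustrative assumptions.

```python
import re


def parse_fasta(text):
    """Parse FASTA-formatted text into a {record_name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after the '>' marker.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def find_motif(sequence, pattern):
    """Return the 0-based start positions of every regex match in sequence."""
    return [m.start() for m in re.finditer(pattern, sequence)]


# Hypothetical input, made up for this sketch.
example = """>seq1 hypothetical record
ATGGAATTCC
GGAATTCTAA
>seq2
ttgacc
"""

records = parse_fasta(example)
print(records["seq2"])                        # prints TTGACC
print(find_motif(records["seq1"], "GAATTC"))  # prints [3, 11]
```

Parsing, case normalisation and pattern matching here correspond to the file parsing, string manipulation and regular expression matching mentioned in the paragraph.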


Summary
• A biological database is a large, organised body of persistent data, usually associated with computerised software designed to update, query and retrieve components of the data stored within the system.
• RNA and DNA are the nucleic acids that store hereditary information about an organism.
• Biological databases are archives of consistent data stored in an efficient manner.
• Primary or archived databases contain information and annotation of DNA and protein sequences, DNA and protein structures and DNA and protein expression profiles.
• Secondary or derived databases contain the results of analysis on the primary resources including information on sequence patterns or motifs, variants and mutations and evolutionary relationships.
• EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted from researchers and sequencing groups.
• DDBJ (DNA Data Bank of Japan) is a nucleotide database hosted in Japan that accepts DNA submissions mainly from Japanese researchers.
• UniProt is the universal protein resource, and as stated on its website the database intends to be both comprehensive and of high quality.
• Ensembl is a project developing software for automatic annotation of eukaryotic genomes.
• The SCOP database describes structural and evolutionary relationships between all known protein structures and also provides a number of links to other on-line resources related to protein structure and to sequence databases in general.
• The CATH database hosted at http://www.cathdb.info/ classifies protein structures from the PDB according to a four-level hierarchy.
• A composite database amalgamates different primary database sources, which obviates the need to search multiple resources.
• Bioinformatic tools are software programs that are designed for extracting meaningful information from the mass of data and for carrying out this analysis step.
• Homologous sequences are sequences that are related by divergence from a common ancestor.
• BLAST is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.
• BLAST (Basic Local Alignment Search Tool) comes under the category of homology and similarity tools.
• COPIA (Consensus Pattern Identification and Analysis) is a protein structure analysis tool for discovering motifs (conserved regions) in a family of protein sequences.
• PatternHunter, based on Java, can identify all approximate repeats in a complete genome in a short time using little memory on a desktop computer.
• EMBOSS (European Molecular Biology Open Software Suite) is a software analysis package.


Self Assessment
1. RNA and DNA are the ________ that store hereditary information about an organism.
a. proteins
b. nucleotides
c. biological databases
d. software programs

2. ______ is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.
a. BLAST
b. COPIA
c. CATH
d. EMBOSS

3. _________ is a software analysis package.
a. BLAST
b. COPIA
c. CATH
d. EMBOSS

4. ________ can identify all approximate repeats in a complete genome in a short time using little memory on a desktop computer.
a. PatternHunter
b. BLAST
c. COPIA
d. CATH

5. _____ is the universal protein resource, and as stated on its website the database intends to be both comprehensive and of high quality.
a. PatternHunter
b. UniProt
c. Ensembl
d. COPIA

6. ________ is a project developing software for automatic annotation of eukaryotic genomes.
a. PatternHunter
b. UniProt
c. Ensembl
d. COPIA

7. Which is a protein structure analysis tool for discovering motifs (conserved regions) in a family of protein sequences?
a. PatternHunter
b. UniProt
c. Ensembl
d. COPIA

8. Which of the following statements is false?
a. String manipulation, regular expression matching, file parsing, data format inter-conversion and so on are the common text-processing tasks performed in bioinformatics.
b. The Java language version of PatternHunter is just 40 KB, only 1% the size of Blast, while offering a large portion of its functionality.
c. EMBOSS is a protein structure prediction system that employs a computational technique called protein threading to construct a protein’s 3-D model.
d. Protein Explorer, a derivative of RasMol, is an easier to use program.

9. Which of the following statements is false?
a. FASTA is an alignment program for protein sequences created by Pearson and Lipman in 1988.
b. EMBOSS (European Molecular Biology Open Software Suite) is a software analysis package.
c. blastx compares an amino acid query sequence against a protein sequence database.
d. blastn compares a nucleotide query sequence against a nucleotide sequence database.

10. Which of the following comes under the category of homology and similarity tools?
a. BLAST
b. FASTA
c. EMBOSS
d. COPIA


Chapter III

Genomics & Proteomics

Aim

The aim of this chapter is to:

• define proteomics
• describe application of proteomics to medicine
• explain genome mapping

Objectives

The objectives of this chapter are to:

• describe genomics
• define proteome
• elucidate the implications of genomics for medical science

Learning outcome

At the end of this chapter, you will be able to:

• identify proteomics
• understand DNA sequencing
• differentiate between proteomics and genomics


3.1 DNA, Genes and Genomes
Deoxyribonucleic acid (DNA) is the chemical compound that contains the instructions needed to develop and direct the activities of nearly all living organisms. DNA molecules are made of two twisting, paired strands, often referred to as a double helix.

Each DNA strand is made of four chemical units, called nucleotide bases, which comprise the genetic “alphabet.” The bases are adenine (A), thymine (T), guanine (G), and cytosine (C). Bases on opposite strands pair specifically: an A always pairs with a T; a C always pairs with a G. The order of the As, Ts, Cs, and Gs determines the meaning of the information encoded in that part of the DNA molecule just as the order of letters determines the meaning of a word.
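
The pairing rule can be written down directly. The following Python sketch (the names `COMPLEMENT`, `complement` and `reverse_complement` are chosen for illustration) derives the opposite strand from a given strand. The reversed form is included on the assumption that the result is to be read in the conventional direction of the opposite strand.

```python
# Base-pairing rule from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base that pairs with each position of the strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the paired strand written in its own reading direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # prints TACG
print(reverse_complement("ATGC"))  # prints GCAT
```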

An organism’s complete set of DNA is called its genome. Virtually every single cell in the body contains a complete copy of the approximately 3 billion DNA base pairs, or letters, that make up the human genome.

With its four-letter language, DNA contains the information needed to build the entire human body. A gene is the unit of DNA that carries the instructions for making a specific protein or set of proteins. Each of the estimated 20,000 to 25,000 genes in the human genome codes for an average of three proteins.

Located on 23 pairs of chromosomes packed into the nucleus of a human cell, genes direct the production of proteins with the assistance of enzymes and messenger molecules. Specifically, an enzyme copies the information in a gene’s DNA into a molecule called messenger ribonucleic acid (mRNA). The mRNA travels out of the nucleus and into the cell’s cytoplasm, where the mRNA is read by a tiny molecular machine called a ribosome, and the information is used to link together small molecules called amino acids in the right order to form a specific protein.
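
The flow from DNA to mRNA to protein can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is taken to be the coding strand, so transcription reduces to replacing T with U; only a small subset of the standard genetic code is included in `CODON_TABLE`; and the function names are hypothetical.

```python
# A small subset of the standard genetic code, enough for the example below.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GCU": "Ala", "GCC": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_strand):
    """Simplified transcription: write the coding strand as mRNA (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first AUG until a stop codon, as a ribosome would."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        # Codons missing from the partial table are reported as "Xaa" (unknown).
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "Xaa")
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGTTTGGCAAATAA")
print(mrna)             # prints AUGUUUGGCAAAUAA
print(translate(mrna))  # prints ['Met', 'Phe', 'Gly', 'Lys']
```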

Proteins make up body structures like organs and tissue, as well as control chemical reactions and carry signals between cells. If a cell’s DNA is mutated, an abnormal protein may be produced, which can disrupt the body’s usual processes and lead to a disease, such as cancer.

Genomics is the study of genes and non-coding sequences of DNA in organisms.
• It is the study of sequences, gene organisation and mutations at the DNA level.
• It is the study of information flow within a cell.

Proteomics is the large-scale study of proteins, particularly their structures and functions. The term was coined by analogy with genomics, and proteomics is often viewed as the “next step”, but it is much more complicated than genomics. Most importantly, while the genome is a rather constant entity, the proteome is constantly changing through its biochemical interactions with the genome. One organism will have radically different protein expression in different parts of its body and in different stages of its life cycle. The entirety of proteins in existence in an organism is referred to as the proteome.

3.2 DNA Sequencing
Sequencing simply means determining the exact order of the bases in a strand of DNA. Because bases exist as pairs, and the identity of one of the bases in the pair determines the other member of the pair, researchers do not have to report both bases of the pair.

In the most common type of sequencing used today, called the chain termination method, a DNA strand is treated with a variety of nucleotides, a set of enzymes, and a specific primer to generate a collection of smaller DNA fragments. Four fluorescent tags, each specific for a given base, are part of the mixture. Each of the fragments differs in length by one base and is marked with a fluorescent tag that identifies the last base of the fragment. The fragments are then separated according to size and passed by a detector that reads the fluorescent tag. Then, a computer reconstructs the entire sequence of the long DNA strand by identifying the base at each position from the size of each fragment and the particular fluorescent signal at its end.
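
The reconstruction step described above amounts to sorting the fragments by size and reading off the tag on each one. The sketch below is a toy model: each fragment is represented as a hypothetical `(length, terminal_base)` pair, and real base-calling from detector signals is far more involved.

```python
def read_sequence(fragments):
    """Rebuild a base sequence from (length, terminal_base) fragment records."""
    if not fragments:
        return ""
    # Separating fragments by size corresponds to sorting them by length.
    ordered = sorted(fragments, key=lambda frag: frag[0])
    lengths = [length for length, _ in ordered]
    expected = list(range(lengths[0], lengths[0] + len(lengths)))
    if lengths != expected:
        raise ValueError("fragment lengths must differ by exactly one base")
    # The tag on each fragment names the base at that position.
    return "".join(base for _, base in ordered)


fragments = [(3, "G"), (1, "A"), (4, "C"), (2, "T")]
print(read_sequence(fragments))  # prints ATGC
```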


At present, this technology can determine the order of only up to 800 base pairs of DNA at a time. So, to assemble the sequence of all the bases in a large piece of DNA, such as a gene, researchers need to read the sequence of overlapping segments. This allows the longer sequence to be assembled from shorter pieces, somewhat like putting together a linear jigsaw puzzle. In this process, each base has to be read not just once, but at least several times in the overlapping segments to ensure accuracy.
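
The jigsaw analogy can be made concrete with a small greedy merge of overlapping reads. This is a teaching sketch, not a production assembler: it assumes error-free reads from one strand, and the function names and the minimum overlap of 3 bases are illustrative choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining reads overlap; leave them as separate pieces
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # prints ['ATGGCGTACCTGA']
```

Reading each base several times, as the text requires, corresponds to having more reads covering every position, which gives the merge step more overlaps to work with.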

Researchers can use DNA sequencing to search for genetic variations and/or mutations that may play a role in the development or progression of a disease. The disease-causing change may be as small as the substitution, deletion, or addition of a single base pair or as large as a deletion of thousands of bases.
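
For the simplest case named above, the substitution of a single base, a search for variation reduces to a position-by-position comparison. The sketch below assumes two sequences of equal length that are already aligned. The function name and the example sequences are hypothetical, and insertions or deletions would need a proper alignment step first.

```python
def substitutions(reference, sample):
    """List (position, reference_base, sample_base) for each differing position.

    Positions are 1-based. Both sequences must have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


print(substitutions("ATGGAG", "ATGGTG"))  # prints [(5, 'A', 'T')]
```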

The Human Genome Project, which was led at the National Institutes of Health (NIH) by the National Human Genome Research Institute, produced a very high-quality version of the human genome sequence that is freely available in public databases. This international project was successfully completed in April 2003, under budget and more than two years ahead of schedule.

The sequence is not that of one person, but is a composite derived from several individuals. Therefore, it is a “representative” or generic sequence. To ensure anonymity of the DNA donors, more blood samples (nearly 100) were collected from volunteers than were used, and no names were attached to the samples that were analysed. Thus, not even the donors knew whether their samples were actually used.

The Human Genome Project was designed to generate a resource that could be used for a broad range of biomedical studies. One such use is to look for the genetic variations that increase risk of specific diseases, such as cancer, or to look for the type of genetic mutations frequently seen in cancerous cells. More research can then be done to fully understand how the genome functions and to discover the genetic basis for health and disease.

The International HapMap Project, in which NIH also played a leading role, represents a major step in that direction. In October 2005, the project published a comprehensive map of human genetic variation that is already speeding the search for genes involved in common, complex diseases, such as heart disease, diabetes, blindness, and cancer.

Another initiative that builds upon the tools and technologies created by the Human Genome Project is The Cancer Genome Atlas pilot project. This three-year pilot, which was launched in December 2005, will develop and test strategies for a comprehensive exploration of the universe of genetic factors involved in cancer.

3.3 Genome Mapping
Genomic maps serve as a scaffold for orienting sequence information. A few years ago, a researcher wanting to localise a gene, or nucleotide sequence, was forced to manually map the genomic region of interest, a time-consuming and often painstaking process. Today, thanks to new technologies and the influx of sequence data, a number of high-quality, genome-wide maps are available to the scientific community for use in their research.

Computerised maps make gene hunting faster, cheaper, and more practical for almost any scientist. In a nutshell, scientists would first use a genetic map to assign a gene to a relatively small area of a chromosome. They would then use a physical map to examine the region of interest close up, to determine a gene’s precise location. In light of these advances, a researcher’s burden has shifted from mapping a genome or genomic region of interest to navigating a vast number of Web sites and databases.

3.4 Implications of Genomics for Medical Science
Virtually every human ailment, except perhaps trauma, has some basis in our genes. Until recently, doctors were able to take the study of genes, or genetics, into consideration only in cases of birth defects and a limited set of other diseases. These were conditions, such as sickle cell anemia, which have very simple, predictable inheritance patterns because each is caused by a change in a single gene.


With the vast trove of data about human DNA generated by the Human Genome Project and the HapMap Project, scientists and clinicians have much more powerful tools to study the role that genetic factors play in much more complex diseases, such as cancer, diabetes, and cardiovascular disease, that constitute the majority of health problems in the United States. Genome-based research is already enabling medical researchers to develop more effective diagnostic tools, to better understand the health needs of people based on their individual genetic make-ups, and to design new treatments for disease. Thus, the role of genetics in health care is starting to change profoundly and the first examples of the era of personalised medicine are on the horizon.

It is important to realise, however, that it often takes considerable time, effort, and funding to move discoveries from the scientific laboratory into the medical clinic. Most new drugs based on genome-based research are estimated to be at least 10 to 15 years away. According to biotechnology experts, it usually takes more than a decade for a company to conduct the kinds of clinical studies needed to receive approval from the Food and Drug Administration.

Screening and diagnostic tests, however, are expected to arrive more quickly. Rapid progress is also anticipated in the emerging field of pharmacogenomics, which involves using information about a patient’s genetic make-up to better tailor drug therapy to their individual needs.

Clearly, genetics remains just one of several factors that contribute to people’s risk of developing most common diseases. Diet, lifestyle, and environmental exposures also come into play for many conditions, including many types of cancer. Still, a deeper understanding of genetics will shed light on more than just hereditary risks by revealing the basic components of cells and, ultimately, explaining how all the various elements work together to affect the human body in both health and disease.

3.5 Proteomics
Proteomics studies the structure and function of proteins, the principal constituents of the protoplasm of all cells.

Proteome
The word “proteome” is derived from proteins expressed by a genome, and it refers to all the proteins produced by an organism, much like the genome is the entire set of genes. The human body may contain more than 2 million different proteins, each having different functions. As the main components of the physiological pathways of the cells, proteins serve vital functions in the body such as:

• catalysing various biochemical reactions, for example, enzymes
• acting as messengers, for example, neurotransmitters
• acting as control elements that regulate cell reproduction
• influencing growth and development of various tissues, for example, trophic factors
• transporting oxygen in the blood, for example, hemoglobin
• defending the body against disease, for example, antibodies

Proteins are fairly large molecules made up of strings of amino acids linked like a chain. While there are only 20 amino acids, they combine in different ways to form tens of thousands of proteins, each with a unique, genetically defined sequence that determines the protein’s specific shape and function. In addition, each protein can undergo a variety of post-translational modifications that further influence its shape and function. Researchers and scientists are working on developing a map of the human proteome, much like that of the human genome, that identifies novel protein families, protein interactions and signaling pathways.


Proteomics is “the analysis of complete complements of proteins”. Proteomics includes not only the identification and quantification of proteins, but also the determination of their localisation, modifications, interactions, activities, and, ultimately, their function. Initially encompassing protein separation and identification, proteomics now refers to any procedure that characterises large sets of proteins. The explosive growth of this field is driven by multiple forces: genomics and its revelation of more and more new proteins; powerful protein technologies such as newly developed mass spectrometry approaches; and innovative computational tools and methods to process, analyse, and interpret prodigious amounts of data.

There are many different subdivisions of proteomics, including:
• Structural proteomics: In-depth analysis of protein structure
• Expression proteomics: Analysis of expression and differential expression of proteins
• Interaction proteomics: Analysis of interactions between proteins to characterise complexes and determine function

The theme of molecular biology research, in the past, has been oriented around the gene rather than the protein. This is not to say that researchers have neglected to study proteins, but rather that the approaches and techniques most commonly used have looked primarily at the nucleic acids and then later at the protein(s) implicated. The main reason for this has been that the technologies available, and the inherent characteristics of nucleic acids, have made the genes the low-hanging fruit. This situation has changed recently and continues to change as larger-scale, higher-throughput methods are developed for both nucleic acids and proteins. The majority of processes that take place in a cell are not performed by the genes themselves, but rather by the proteins that they code for.

A disease can arise when a gene/protein is over or under expressed, or when a mutation in a gene results in a malformed protein, or when post-translational modifications alter a protein’s function. Thus, to truly understand a biological process, the relevant proteins must be studied directly. But there are more challenges while studying proteins compared to studying genes, due to their complex 3-D structure, which is related to the function, analogous to a machine.

Proteomics is defined as the systematic large-scale analysis of protein expression under normal and perturbed (stressed, diseased, and/or drugged) states, and generally involves the separation, identification, and characterisation of all of the proteins in a cell or tissue sample. The meaning of the term has also been expanded, and is now used loosely to refer to the approach of analysing which proteins a particular type of cell synthesises, how much the cell synthesises, how cells modify proteins after synthesis, and how all of those proteins interact. There are orders of magnitude more proteins than genes in an organism: based on alternative splicing (several per gene) and post-translational modifications (over 100 known), there are estimated to be a million or more.

Fortunately, there are features such as folds and motifs, which allow proteins to be categorised into groups and families, making the task of studying them more tractable. There is a broad range of technologies used in proteomics, but the central paradigm has been the use of 2-D gel electrophoresis (2D-GE) followed by mass spectrometry (MS). 2D-GE is used to first separate the proteins by isoelectric point and then by size.

The individual proteins are subsequently removed from the gel and prepared, then analysed by MS to determine their identity and characteristics. There are various types of mass analysers used in proteomics MS including quadrupole, time-of-flight (TOF), and ion trap, and each has its own particular capabilities. Tandem arrangements are often used, such as quadrupole-TOF, to provide more analytical power. The recent development of soft ionisation techniques, namely matrix-assisted laser desorption ionisation (MALDI) and electro-spray ionisation (ESI), has allowed large biomolecules to be introduced into the mass analyser without completely decomposing their structures, or even without breaking them at all, depending on the design of the experiment.

There are techniques that incorporate liquid chromatography (LC) with MS, and others that use LC by itself. Robotics has been applied to automate several steps in the 2D-GE-MS process such as spot excision and enzyme digests. To determine a protein’s structure, X-ray diffraction (XRD) and nuclear magnetic resonance (NMR) techniques are being improved to reach higher throughput and better performance. For example, automated high-throughput crystallisation methods are being used upstream of XRD to alleviate that bottleneck. For NMR, cryo-probes and flow probes shorten analysis time and decrease sample volume requirements. The hope is that determining about 10,000 protein structures will be enough to characterise the estimated 5,000 or so folds, which will feed into more reliable in silico structural prediction methods.

Structure by itself does not provide all of the desired information, but is a major step in the right direction. Protein chips are being developed for many of the processes in proteomics. For example, researchers are developing protocols for protein microarrays at institutions such as Harvard and Stanford as well as at several companies. These chips - grids of attached peptide fragments, attached antibodies, or gel “pads” with proteins suspended inside - will be used for various experiments such as protein-protein interaction studies and differential expression analysis.

They can also be used to filter out high abundance proteins before further experiments; one of the major challenges in proteomics is isolating and analysing the low abundance proteins, which are thought to be the most important. There are many other types of protein chips, and the number will continue to grow. For example, microfluidics chips can combine the sample preparation steps prior to MS, such as enzyme digests, with nanoelectrospray ionisation, all on the one chip. Or, the samples can be ionised directly off the surface of the chip, similar to a MALDI target. Microfluidics chips are also being combined with NMR.

In the next few years, various protein chips will be used increasingly in diagnostic applications as well. The bioinformatics side of proteomics includes both databases and analysis software. There are many public and private databases containing protein data ranging from sequences, to functions, to post-translational modifications. Typically, a researcher will first perform 2D-GE followed by MS; this will result in a fingerprint, molecular weight, or even sequence for each protein of interest, which can then be used to query databases for similarities or other information.

Swiss-Prot and TrEMBL, developed in collaboration between the Swiss Institute of Bioinformatics and the European Bioinformatics Institute, are currently the major databases dedicated to cataloging protein data, but there are dozens of more specialised databases and tools. New bioinformatics approaches are constantly being introduced. Recent customised versions of PSI-BLAST can, for example, utilise not only the curated protein entries in Swiss-Prot but also linguistic analyses of biomedical journal articles to help determine protein family relationships. Publicly available databases and tools are popular, but there are also several companies offering subscriptions to proprietary databases, which often include protein-protein interaction maps generated.

The proteomics market is comprised of instrument manufacturers, bioinformatics companies, laboratory product suppliers, service providers, and other biotech related companies, which can defy categorisation. A given company can often overlap more than one of these areas. Many of the companies involved in the proteomics market are actually doing drug discovery as their major focus, while partnering, or providing services or subscriptions, to other companies to generate short term revenues. The market for proteomics products and services was estimated to be $1.0B in 2000, growing at a CAGR of 42% to about $5.8B in 2005. The major drivers will continue to be the biopharmaceutical industry’s pursuit of blockbuster drugs and the recent technological advances, which have allowed large-scale studies of genes and proteins. Alliances are becoming increasingly important in this field, because it is challenging for companies to find all of the necessary expertise to cover the different activities involved in proteomics.
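
The market figures quoted above can be checked with the usual compound annual growth rate (CAGR) arithmetic. The short Python sketch below uses only the numbers in the paragraph ($1.0B in 2000, 42% per year, five years to 2005); the function names are illustrative.

```python
def project(value, rate, years):
    """Value after compounding at a fixed annual growth rate."""
    return value * (1 + rate) ** years


def cagr(start, end, years):
    """Annual growth rate that turns start into end over the given years."""
    return (end / start) ** (1 / years) - 1


print(round(project(1.0, 0.42, 5), 2))  # prints 5.77
print(round(cagr(1.0, 5.8, 5), 3))
```

Compounding $1.0B at 42% for five years gives about $5.77B, which agrees with the figure of about $5.8B given in the text.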

Synergies must be created by combining forces. For example, many companies working with mass spectrometry, both the manufacturers and end user labs, are collaborating with protein chip related companies. There are many combinations of diagnostics, instrumentation, chip, and bioinformatics companies, which create effective partnerships.

In general, proteomics appears to hold great promise in the pursuit of biological knowledge. There has been a general realisation that the large-scale approach to biology, as opposed to the strictly hypothesis-driven approach, will rapidly generate much more useful information.


The two approaches are not mutually exclusive, and the happy medium seems to be the formation of broad hypotheses, which are subsequently investigated by designing large-scale experiments and selecting the appropriate data. Proteomics and genomics, and other varieties of ‘omics’, will all continue to complement each other in providing the tools and information for this type of research.

3.6 Application of Proteomics to Medicine
Proteomic technologies will play an important role in drug discovery, diagnostics and molecular medicine because proteomics is the link between genes, proteins and disease. As researchers study defective proteins that cause particular diseases, their findings will help develop new drugs that either alter the shape of a defective protein or mimic a missing one.

Already, many of the best-selling drugs today either act by targeting proteins or are proteins themselves. Advances in proteomics may help scientists eventually create medications that are “personalised” for different individuals to be more effective and have fewer side effects. Current research is looking at protein families linked to diseases including cancer, diabetes and heart disease.

Identifying unique patterns of protein expression, or biomarkers, associated with specific diseases is one of the most promising areas of clinical proteomics. One of the first biomarkers used in disease diagnosis was prostate-specific antigen (PSA). Today, serum PSA levels are commonly used in diagnosing prostate cancer in men. Unfortunately, many single protein biomarkers have proven to be unreliable. Researchers are now developing diagnostic tests that simultaneously analyse the expression of multiple proteins in hopes of improving the specificity and sensitivity of these types of assays.
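
Sensitivity and specificity, the two measures named above, have standard definitions that are easy to compute from test results. The counts in the example below are invented purely for illustration and do not describe PSA or any real assay.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of people with the disease whom the test correctly flags."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of people without the disease whom the test correctly clears."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical trial: 100 patients with the disease, 100 without.
print(sensitivity(true_positives=80, false_negatives=20))  # prints 0.8
print(specificity(true_negatives=90, false_positives=10))  # prints 0.9
```

A test that combines several protein markers aims to raise both numbers at once, which a single marker often cannot do.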

3.7 Difference between Proteomics and Genomics
Unlike the genome, which is relatively static, the proteome changes constantly in response to tens of thousands of intra- and extracellular environmental signals. The proteome varies with health or disease, the nature of each tissue, the stage of cell development, and effects of drug treatments. As such, the proteome often is defined as “the proteins present in one sample (tissue, organism, cell culture) at a certain point in time.”

In many ways, proteomics runs parallel to genomics: genomics starts with the gene and makes inferences about its products(proteins),whereasproteomicsbeginswiththefunctionallymodifiedproteinandworksbacktothegeneresponsible for its production.

The sequencing of the human genome has increased interest in proteomics because while DNA sequence information provides a static snapshot of the various ways in which the cell might use its proteins, the life of the cell is a dynamic process. This new data set holds great new promise for proteomic applications in science, medicine, and most notably – pharmaceuticals.

3.8 Protein Modeling
The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. In the absence of a protein structure that has been determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular modeling. This method uses experimentally determined protein structures (templates) to predict the structure of another protein that has a similar amino acid sequence (target).

Although molecular modeling may not be as accurate at determining a protein’s structure as experimental methods, it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy. As the different genome projects are producing more sequences and as novel protein folds and families are being determined, protein modeling will become an increasingly important tool for scientists working to understand normal and disease-related processes in living organisms.


The four steps of protein modeling are:
• Identify the proteins with known three-dimensional structures that are related to the target sequence.
• Align the related three-dimensional structures with the target sequence and determine those structures that will be used as templates.
• Construct a model for the target sequence based on its alignment with the template structure(s).
• Evaluate the model against a variety of criteria to determine if it is satisfactory.


Summary
• Deoxyribonucleic acid (DNA) is the chemical compound that contains the instructions needed to develop and direct the activities of nearly all living organisms.
• Genomics is the study of genes and non-coding sequences of DNA in organisms.
• Sequencing simply means determining the exact order of the bases in a strand of DNA.
• In the most common type of sequencing used today, called the chain termination method, a DNA strand is treated with a variety of nucleotides, a set of enzymes, and a specific primer to generate a collection of smaller DNA fragments.
• The Human Genome Project, which was led at the National Institutes of Health (NIH) by the National Human Genome Research Institute, produced a very high-quality version of the human genome sequence that is freely available in public databases.
• The International HapMap Project, in which NIH also played a leading role, represents a major step in that direction.
• Genomic maps serve as a scaffold for orienting sequence information.
• Proteomics studies the structure and function of proteins, the principal constituents of the protoplasm of all cells.
• The word “proteome” is derived from proteins expressed by a genome, and it refers to all the proteins produced by an organism, much like the genome is the entire set of genes.
• Proteins are fairly large molecules made up of strings of amino acids linked like a chain.
• Proteomics is “the analysis of complete complements of proteins”.
• Proteomics is defined as the systematic large-scale analysis of protein expression under normal and perturbed (stressed, diseased, and/or drugged) states, and generally involves the separation, identification, and characterisation of all of the proteins in a cell or tissue sample.


Recommended Reading
• Patthy, L., 1999. Protein Evolution, Blackwell Science.
• Shanmughavel, P., 2005. Principles of Bioinformatics, Pointer Publishers, Jaipur, India.
• Branden, C. & Tooze, J., 1999. Introduction to Protein Structure, Garland Publishing, New York.


Self Assessment
1. _______ is the chemical compound that contains the instructions needed to develop and direct the activities of nearly all living organisms.
a. DNA
b. RNA
c. Protein
d. Proteome

2. Sequencing simply means determining the exact order of the bases in a strand of ___________.
a. RNA
b. DNA
c. protein
d. proteome

3. _______ serve as a scaffold for orienting sequence information.
a. Genomic maps
b. Proteins
c. Proteome
d. Sequencing

4. _______ is defined as “the proteins present in one sample (tissue, organism, cell culture) at a certain point in time.”
a. Genome
b. Proteome
c. Databases
d. Nucleotides

5. _____________ molecules are made of two twisting, paired strands, often referred to as a double helix.
a. RNA
b. DNA
c. protein
d. proteome

6. An organism’s complete set of DNA is called its _________.
a. genome
b. proteome
c. genome map
d. protein model

7. Which of the following statements is false?
a. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy.
b. Genomics starts with the gene and makes inferences about its products (proteins).
c. Proteomics begins with the functionally modified protein and works back to the gene responsible for its production.
d. The genome changes constantly in response to tens of thousands of intra- and extracellular environmental signals.


8. Which of the following statements is false?
a. Proteomic technologies will play an important role in drug discovery, diagnostics and molecular medicine because proteomics provides the link between genes, proteins and disease.
b. Genomics is “the analysis of complete complements of proteins”.
c. Proteomics includes not only the identification and quantification of proteins, but also the determination of their localisation, modifications, interactions, activities, and, ultimately, their function.
d. Proteins are fairly large molecules made up of strings of amino acids linked like a chain.

9. Which of the following statements is true?
a. The word “proteome” is derived from proteins expressed by a genome.
b. Genome refers to all the proteins produced by an organism, much like the genome is the entire set of genes.
c. The human body may contain more than 2 million different proteins, each having the same functions.
d. Proteomics studies the structure and function of nucleotides, the principal constituents of the protoplasm of all cells.

10. What refers to the analysis of expression and differential expression of proteins?
a. Structural proteomics
b. Expression proteomics
c. Interaction proteomics
d. Genomics


Chapter IV

Sequence Alignment

Aim

The aim of this chapter is to:

• define sequence alignment

• describe pairwise sequence alignment

• explain global alignment

Objectives

The objectives of this chapter are to:

• explain the Needleman-Wunsch algorithm

• elucidate the alignment scoring function

• describe local alignment

Learning outcome

At the end of this chapter, you will be able to:

• describe the identity matrix

• know the Smith-Waterman algorithm

• understand features of substitution matrices


4.1 Introduction
Once a genome is completely sequenced, many sorts of analyses are performed on it. Some of the goals of sequence analysis are the following:

• Identify the genes.
• Determine the function of each gene. One way to hypothesise the function is to find another gene (possibly from another organism) whose function is known and to which the new gene has high sequence similarity. This assumes that sequence similarity implies functional similarity, which may or may not be true.
• Identify the proteins involved in the regulation of gene expression.
• Identify sequence repeats.
• Identify other functional regions.

Many of these tasks are computational in nature. Given the incredible rate at which sequence data is being produced, the integration of computer science, mathematics, and biology will be integral to analysing those sequences.

Sequence alignment in bioinformatics is a field of research focused on developing tools for comparing and finding similar sequences of amino acids or DNA base pairs with the aid of computers. The sequence similarity is used to assess gene and protein homology, classify genes and proteins, predict biological function, secondary and tertiary protein structure, detect point mutations, construct evolutionary trees, and so on. There are two main areas of sequence alignment: pairwise sequence alignment and multiple sequence alignment.

Sequence alignment is an arrangement of two or more sequences, highlighting their similarity. The sequences are padded with gaps (dashes) so that wherever possible, columns contain identical characters from the sequences involved.

tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt

4.2 Pairwise Sequence Alignment
Pairwise sequence alignment is concerned with comparing two DNA or amino acid sequences, finding the global and local “optimum alignment” of the two sequences. Based on differences between the two sequences, one can calculate the “cost” of aligning the two sequences by using replacements, deletions and insertions, and assign a similarity score.

The problem has tractable solutions by means of dynamic programming and Hidden Markov Models and is the basis of popular heuristic search methods such as FASTA or BLAST. Needleman and Wunsch (1970) were the first to present a dynamic programming algorithm that could find the global alignment between two amino acid sequences. Smith and Waterman (1981) introduced a new algorithm with a different method of scoring similarity aimed at finding optimum local alignment sub-sequences, at the expense of the global score. Global algorithms are generally not sensitive for highly diverged sequences with some localised similarities within them.

A particular application of pairwise sequence alignment is quickly searching large DNA and protein databases for matches to a query sequence. Popular heuristic algorithms, such as those from the FASTA (Pearson and Lipman 1985, 1988) or BLAST (Altschul et al 1990, 1997) families are much faster than algorithms based on dynamic programming.

Pairwise sequence alignment methods are concerned with finding the best-matching piecewise local or global alignments of protein (amino acid) or DNA (nucleic acid) sequences. Typically, the purpose of this is to find homologues (relatives) of a gene or gene-product in a database of known examples.


This information is useful for answering a variety of biological questions:
• The identification of sequences of unknown structure or function.
• The study of molecular evolution.

Global alignment
A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment. Global alignments are useful mostly for finding closely-related sequences. As these sequences are also easily identified by local alignment methods, global alignment is now somewhat deprecated as a technique. Further, there are several complications to molecular evolution (such as domain shuffling), which prevent these methods from being useful. The aim of a global alignment is to find the global best fit between two sequences.

Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:

A(s,t) = V I V A L A S V E G A S
         | | | |   |   |       |
         V I V A D A - V - - I S

The Needleman-Wunsch algorithm
The Needleman-Wunsch algorithm performs a global alignment on two sequences (s and t) and is applied to align protein or nucleotide sequences. The Needleman-Wunsch algorithm is an example of dynamic programming, and is guaranteed to find the alignment with the maximum score. This works for both DNA sequences and protein sequences.

Alignment scoring function
The cost of aligning two symbols x_i and y_j is given by the scoring function σ(x_i, y_j).

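Alignment cost
The cost of the entire alignment is the sum of the scores of its columns:

$$M = \sum_{i=1}^{c} \sigma(x_i, y_i)$$

where c is the number of columns in the alignment.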

A simple scoring function
σ(-,a) = σ(a,-) = -1
σ(a,b) = -1 if a ≠ b
σ(a,b) = 1 if a = b

The substitution matrix
A more realistic scoring function is given by the biologically inspired substitution matrix:

      A    G    C    T
A    10   -1   -3   -4
G    -1    7   -5   -3
C    -3   -5    9    0
T    -4   -3    0    8
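A substitution matrix is used as a lookup table for σ. The sketch below stores the matrix given above as a nested dictionary and scores two sequences of equal length column by column. The names SUBSTITUTION, sigma and ungapped_score are chosen for illustration only, and gaps are left out because the matrix itself does not define a gap score.

```python
# The DNA substitution matrix from the text, stored as a nested dictionary.
SUBSTITUTION = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}


def sigma(a, b):
    """Score for aligning base a with base b."""
    return SUBSTITUTION[a][b]


def ungapped_score(x, y):
    """Sum of the column scores for two sequences of equal length (no gaps)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(sigma(a, b) for a, b in zip(x, y))


print(ungapped_score("AGCT", "AGCT"))  # 10 + 7 + 9 + 8
print(ungapped_score("AGCT", "AGTT"))  # 10 + 7 + 0 + 8
```

Note that the matrix is symmetric, so the score does not depend on which sequence is written first.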

Scoring function
The cost for aligning the two sequences s = VIVALASVEGAS and t = VIVADAVIS

A(s,t) = V I V A L A S V E G A S
         | | | |   |   |       |
         V I V A D A - V - - I S

is: M(A) = 7 matches + 2 mismatches + 3 gaps = 7 - 2 - 3 = 2
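The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch that assumes the simple scoring function given earlier (+1 for a match, -1 for a mismatch, -1 for a gap); the function names are illustrative.

```python
GAP = "-"


def sigma(a, b):
    """Simple scoring function: +1 for a match, -1 for a mismatch or a gap."""
    if a == GAP or b == GAP:
        return -1
    return 1 if a == b else -1


def alignment_score(row_s, row_t):
    """Total score M(A): the sum of sigma over all columns of the alignment."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(sigma(a, b) for a, b in zip(row_s, row_t))


# The two rows of the alignment A(s,t) from the worked example.
print(alignment_score("VIVALASVEGAS", "VIVADA-V--IS"))  # 7 - 2 - 3 = 2
```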

Optimal global alignment
The optimal global alignment A* between two sequences s and t is the alignment A(s,t) that maximises the total alignment score M(A) over all possible alignments.
A* = argmax M(A)

Finding the optimal alignment A* looks like a combinatorial optimisation problem:
• generate all possible alignments
• compute the score M
• select the alignment A* with the maximum score M*
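Generating every possible alignment quickly becomes infeasible, because the number of alignments grows very rapidly with sequence length. The dynamic programming of Needleman and Wunsch avoids the enumeration by filling a table in which each cell holds the best score for aligning two prefixes. The following is a minimal sketch of that idea, assuming the simple scoring function (+1 match, -1 mismatch, -1 per gap position); the function name and parameter names are illustrative and this is not the original 1970 formulation.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    # F[i][j] is the best score for aligning the prefixes s[:i] and t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap  # s[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap  # t[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,      # s[i-1] against a gap
                          F[i][j - 1] + gap)      # t[j-1] against a gap

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


score, row_s, row_t = needleman_wunsch("VIVALASVEGAS", "VIVADAVIS")
print(score)
print(row_s)
print(row_t)
```

Several alignments can share the maximum score, so the rows printed here may differ from the alignment shown in the text even though the score is the same. The table has (n+1)(m+1) cells, so time and memory both grow with the product of the two sequence lengths.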

Local alignment
Local alignment methods find related regions within sequences; they can consist of a subset of the characters within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B. This is a more flexible technique than global alignment and has the advantage that related regions, which appear in a different order in the two proteins (which is known as domain shuffling) can be identified as being related. This is not possible with global alignment methods.

The Smith-Waterman algorithm
The Smith-Waterman algorithm (1981) is for determining similar regions between two nucleotide or protein sequences. Smith-Waterman is also a dynamic programming algorithm and improves on Needleman-Wunsch. As such, it has the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system being used (which includes the substitution matrix and the gap-scoring scheme). However, the Smith-Waterman algorithm is demanding of time and memory resources: in order to align two sequences of lengths m and n, O(mn) time and space are required.

As a result, it has largely been replaced in practical use by the BLAST algorithm; although not guaranteed to find optimal alignments, BLAST is much more efficient.
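The local variant differs from the global dynamic programming table in two ways: a cell is never allowed to drop below zero, and the traceback starts at the highest-scoring cell instead of the bottom-right corner. The sketch below illustrates this under the assumption of a simple scoring scheme (+1 match, -1 mismatch, -1 per gap position); the function name, parameters and test sequences are made up for illustration.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, local_s, local_t) for one optimal local alignment."""
    n, m = len(s), len(t)
    # H[i][j] is the best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                      # start a fresh local alignment
                          H[i - 1][j - 1] + sub,  # extend along the diagonal
                          H[i - 1][j] + gap,      # gap in t
                          H[i][j - 1] + gap)      # gap in s
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


# A shared region (GATTACA) embedded in otherwise unrelated flanking letters.
print(smith_waterman("XXXXGATTACAYYYY", "ZZGATTACAZZ"))
```

The table H has (n+1)(m+1) cells, which is the O(mn) time and space requirement mentioned above.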

Sequence similarities and scoring
Given two sequences: how similar are they? This question cannot be answered in general because the answer depends on the context. Perhaps the sequences must have the same trend (stock market), contain the same pattern (text), or have the same frequencies (speech) and so on to be similar to one another.

Identity matrix
For biological sequences it is known how one sequence can mutate into another. First, there are point mutations, in which one nucleotide or amino acid is changed into another. Secondly, there are deletions, in which one element (nucleotide or amino acid) or a whole subsequence of elements is deleted from the sequence. Thirdly, there are insertions, in which one element or a subsequence is inserted into the sequence. As a first approach, the similarity of two biological sequences can be expressed through the minimal number of mutations needed to transform one sequence into the other. Not all mutations are equally likely. Point mutations are more likely because an amino acid can be replaced by an amino acid with similar chemical properties without changing the function. Deletions and insertions are more prone to destroying the function of the protein, and their length must be taken into account. For simplicity we can count the length of insertions and deletions. Finally, we are left with simply counting the number of amino acids that match in the two sequences (this is the sum of the lengths of both sequences, minus the insertions, the deletions and two times the mismatches, all divided by two).


Here is an example:
BIOINFORMATICS        →  BIOI--N-FORMATICS
BOILING FOR MANICS    →  B-OILINGFORMANICS

The hit count gives 12 identical letters out of the 14 letters of BIOINFORMATICS. The mutations would be:
• delete I: BOINFORMATICS
• insert LI: BOILINFORMATICS
• insert G: BOILINGFORMATICS
• change T into N: BOILINGFORMANICS

These two texts seem to be very similar. Note that insertions and deletions cannot be distinguished when only the two sequences are presented (is I deleted from the first string or inserted in the second?). Therefore, both are denoted by a “-” (note, two “-” are not matched to one another). The task for bioinformatics algorithms is to find from the two strings (left hand side in the above example) the optimal alignment (right hand side in the above example). The optimal alignment is the arrangement of the two strings in a way that the number of mutations is minimal. The optimality criterion scores matches (the same amino acid) with 1 and mismatches (different amino acids) with 0. If these scores for pairs of amino acids are written in matrix form, then the identity matrix is obtained. The number of mutations is one criterion for optimality, but others exist. In general, an alignment algorithm searches for the arrangement of two sequences such that a criterion is optimised. The sequences can be arranged by inserting “-” into the strings and moving them horizontally against each other. For long sequences the search for an optimal alignment can be very difficult.
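The match count and the counting formula can be checked against each other. Every match or mismatch column uses one letter from each string and every insertion or deletion column uses one letter in total, so length1 + length2 = 2 × matches + 2 × mismatches + indels. The sketch below applies this to the example. The two gapped rows are an assumption: they are reconstructed from the list of mutations (delete I, insert LI, insert G, change T into N), and the function names are illustrative.

```python
def column_counts(row_a, row_b):
    """Count matches, mismatches and insertion/deletion columns of an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows of an alignment must have the same length")
    matches = mismatches = indels = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            indels += 1          # scored 0 by the identity criterion
        elif a == b:
            matches += 1         # scored 1 by the identity criterion
        else:
            mismatches += 1      # scored 0 by the identity criterion
    return matches, mismatches, indels


def matches_from_lengths(len_a, len_b, indels, mismatches):
    """The counting formula from the text."""
    return (len_a + len_b - indels - 2 * mismatches) // 2


# Gapped rows reconstructed from the listed mutations (an assumption).
row_a = "BIOI--N-FORMATICS"
row_b = "B-OILINGFORMANICS"

matches, mismatches, indels = column_counts(row_a, row_b)
print(matches, mismatches, indels)
print(matches_from_lengths(len("BIOINFORMATICS"), len("BOILINGFORMANICS"),
                           indels, mismatches))
```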

One tool for representing alignments is the dot matrix, where one sequence is written horizontally on the top and the other one vertically on the left. This gives a matrix where each letter of the first sequence is paired with each letter of the second sequence. For each matching of letters, a dot is written in the according position in the matrix. Which pairs appear in the optimal alignment? We will see later that each path through the dot matrix corresponds to an alignment. The dots on diagonals correspond to matching regions.
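A dot matrix is straightforward to compute. The following minimal sketch builds the matrix as a list of rows of booleans and prints it with “*” for a dot; the function names and the choice of symbols are illustrative.

```python
def dot_matrix(top, left):
    """Rows follow the letters of `left`, columns follow the letters of `top`.

    A cell is True where the two letters are identical.
    """
    return [[a == b for a in top] for b in left]


def render(top, left):
    """Plain-text picture of the dot matrix: '*' marks a dot, '.' an empty cell."""
    lines = ["  " + " ".join(top)]
    for letter, row in zip(left, dot_matrix(top, left)):
        lines.append(letter + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("BIOINFORMATICS", "BOILINGFORMANICS"))
```

In the printed matrix, the long diagonal run of dots for FORMA and ICS shows the region where the two strings agree.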

Fig. 4.1 Dot matrix of BIOINFORMATICS (written horizontally) against BOILINGFORMANICS (written vertically)
(Source: http://www.master-bioinformatik.at/curriculum/BioInf_I_Notes.pdf)

4.3 Multiple Sequence Alignment (MSA)
Multiple sequence alignment aims to find similarities between many sequences. MSA is hard and less tractable than pairwise alignment. Dynamic programming is impractical for a large number of sequences. The most successful MSA solutions are heuristic algorithms with approximate approaches, such as the CLUSTAL family of programs created by Higgins, which use a progressive algorithm (Feng and Doolittle 1987): CLUSTAL (1988), ClustalV (1992), ClustalW (1994), ClustalX (1998). Profile Hidden Markov Models (HMMs) provide another successful solution to the problem of MSA. They were introduced by Krogh and colleagues in 1994.

4.4 Substitution Matrices
Both pairwise and multiple sequence alignment algorithms use substitution matrices to score the sequence alignment. In substitution matrices each possible residue substitution is given a score reflecting the probability of such a change. There are two popular protein substitution matrix models: Percent Accepted Mutation (PAM - Dayhoff 1978) and Blocks Substitution Matrix (BLOSUM - Henikoff and Henikoff 1992).

4.5 Two Sample Applications
Sequence alignment algorithms are often used to characterise newly sequenced genes or gene products. For example, the sequenced genome of the SARS virus was investigated by using BLAST, FASTA, Pfam, and ClustalX to find proteins with sequences similar to those expected to be produced by the SARS virus ORFs (Marra 2003, Rota 2003). Biological function and structure were then predicted for the SARS proteins based on the information available for the homologous proteins. Another application of sequence alignment tools is the study of phylogenetics. Phylogenetics is a field of molecular evolution that correlates mutations in DNA and protein sequences with evolutionary divergence.

Molecular distances of evolution between species can be calculated using various metrics based on DNA or protein sequence difference. The smaller the number of differences in the DNA and/or protein sequences of similar genes from two related organisms, the less they have evolutionarily diverged from each other.
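One of the simplest such metrics is the proportion of aligned sites at which two sequences differ, often called the p-distance. For DNA, the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3) is a common correction for multiple changes at the same site. Both are shown in the sketch below as examples of a distance metric; the text does not prescribe a particular one, and the function names and the short test sequences are illustrative.

```python
import math


def p_distance(row_a, row_b):
    """Proportion of differing sites between two aligned sequences.

    Columns that contain a gap ('-') in either row are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differences = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for DNA; only defined for p < 0.75."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGT", "ACGTTCGA")  # 2 differences in 8 sites
print(p)
print(jukes_cantor(p))
```

The corrected distance is always at least as large as the raw proportion, reflecting substitutions that are hidden by later changes at the same position.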


Summary
• Sequence alignment in bioinformatics is a field of research focused on developing tools for comparing and finding similar sequences of amino acids or DNA base pairs with the aid of computers.
• Sequence alignment is an arrangement of two or more sequences, highlighting their similarity.
• Pairwise sequence alignment is concerned with comparing two DNA or amino acid sequences, finding the global and local “optimum alignment” of the two sequences.
• Needleman and Wunsch (1970) were the first to present a dynamic programming algorithm that could find the global alignment between two amino acid sequences.
• Smith and Waterman (1981) introduced a new algorithm with a different method of scoring similarity aimed at finding optimum local alignment sub-sequences, at the expense of the global score.
• Global algorithms are generally not sensitive for highly diverged sequences with some localised similarities within them.
• A global alignment between two sequences is an alignment in which all the characters in both sequences participate in the alignment.
• The Needleman-Wunsch algorithm performs a global alignment on two sequences (s and t) and is applied to align protein or nucleotide sequences.
• Local alignment methods find related regions within sequences. They can consist of a subset of the characters within each sequence.
• The Smith-Waterman algorithm (1981) is for determining similar regions between two nucleotide or protein sequences.
• Multiple sequence alignment aims to find similarities between many sequences. MSA is hard and less tractable than pairwise alignment.
• Both pairwise and multiple sequence alignment algorithms use substitution matrices to score the sequence alignment.
• Sequence alignment algorithms are often used to characterise newly sequenced genes or gene products.


Recommended Reading
• Livingstone & Barton, 1993. Protein Sequence Alignments: a Strategy for the Hierarchical Analysis of Residue Conservation, Computer Applications in the Biosciences.
• Shanmughavel, P., 2005. Principles of Bioinformatics, Pointer Publishers, Jaipur, India.
• International Human Genome Sequencing Consortium, 2001. Initial Sequencing and Analysis of the Human Genome, Nature.


Self Assessment
1. There are _____ main areas of sequence alignment.
a. two
b. three
c. four
d. five

2. ________ is an arrangement of two or more sequences, highlighting their similarity.
a. Sequence alignment
b. FASTA
c. BLAST
d. Pfam

3. __________ aims to find similarities between many sequences.
a. Multiple sequence alignment
b. Global alignment
c. Local alignment
d. Pairwise alignment

4. ________ is a field of molecular evolution that correlates mutations in DNA and protein sequences with evolutionary divergence.
a. Bioinformatics
b. Phylogenetics
c. Genomics
d. Proteomics

5. _________ methods find related regions within sequences - they can consist of a subset of the characters within each sequence.
a. Multiple sequence alignment
b. Global alignment
c. Local alignment
d. Pairwise alignment

6. A __________ between two sequences is an alignment in which all the characters in both sequences participate in the alignment.
a. Multiple sequence alignment
b. Global alignment
c. Local alignment
d. Pairwise alignment

7. _________ is useful mostly for finding closely-related sequences.
a. Multiple sequence alignment
b. Global alignment
c. Local alignment
d. Pairwise alignment


8. Which of the following statements is false?
a. The Smith-Waterman algorithm performs a global alignment on two sequences (s and t) and is applied to align protein or nucleotide sequences.
b. The Needleman-Wunsch algorithm is an example of dynamic programming, and is guaranteed to find the alignment with the maximum score.
c. The Smith-Waterman algorithm (1981) is for determining similar regions between two nucleotide or protein sequences.
d. Smith-Waterman is also a dynamic programming algorithm and improves on Needleman-Wunsch.

9. Which of the following statements is false?
a. In point mutation, one nucleotide or amino acid is changed into another one.
b. In deletion, one element (nucleotide or amino acid) or a whole subsequence of elements is deleted from the sequence.
c. In insertion, one element or a subsequence is inserted into the sequence.
d. Alignments are more prone to destroying the function of the protein, where the length of deletions and insertions must be taken into account.

10. Which of the following statements is false?
a. The optimal alignment is the arrangement of the two strings in a way that the number of mutations is minimal.
b. In general, an alignment algorithm searches for the arrangement of two sequences such that a criterion is optimised.
c. The sequences can be arranged by inserting “@” into the strings and moving them horizontally against each other.
d. One tool for representing alignments is the dot matrix, where one sequence is written horizontally on the top and the other one vertically on the left.


Chapter V

Phylogenetic Analysis

Aim

The aim of this chapter is to:

• define phylogenetics

• describe phylogenetic analysis

• explain fundamental elements of phylogenetic models

Objectives

The objectives of this chapter are to:

• define bootstrapping

• describe tree evaluation

• elucidate the tree-building methods

Learning outcome

At the end of this chapter, you will be able to:

• differentiate between paralogs and orthologs

• understand tree interpretation

• enumerate the steps of phylogenetic data analysis


5.1 Introduction
Phylogenetic analysis is the process used to determine the evolutionary relationships between organisms. The results of an analysis can be drawn in a hierarchical diagram called a cladogram or phylogram (phylogenetic tree). The branches in a tree are based on the hypothesised evolutionary relationships (phylogeny) between organisms. Each member in a branch, also known as a monophyletic group, is assumed to be descended from a common ancestor. Originally, phylogenetic trees were created using morphology, but now, determining evolutionary relationships includes matching patterns in nucleic acid and protein sequences.

Phylogenetics is the study of evolutionary relationships. Phylogenetic analysis is the means of inferring or estimating these relationships. The evolutionary history inferred from phylogenetic analysis is usually depicted as branching, treelike diagrams that represent an estimated pedigree of the inherited relationships among molecules (‘gene trees’), organisms, or both. Phylogenetics is sometimes called cladistics because the word ‘clade,’ a set of descendants from a single ancestor, is derived from the Greek word for branch. However, cladistics is a particular method of hypothesising about evolutionary relationships.

The basic tenet behind cladistics is that members of a group or clade share a common evolutionary history and are more related to each other than to members of another group. A given group is recognised by sharing unique features that were not present in distant ancestors. These shared, derived characteristics can be anything that can be observed and described, from two organisms having developed a spine to two sequences having developed a mutation at a certain base pair of a gene. Usually, cladistic analysis is performed by comparing multiple characteristics or ‘characters’ at once, either multiple phenotypic characters or multiple base pairs or amino acids in a sequence.

There are three basic assumptions in cladistics:
• Any group of organisms is related by descent from a common ancestor (fundamental tenet of evolutionary theory).
• There is a bifurcating pattern of cladogenesis. This assumption is controversial.
• Change in characteristics occurs in lineages over time. This is a necessary condition for cladistics to work.

The resulting relationships from cladistic analysis are most commonly represented by a phylogenetic tree:

Fig. 5.1 Clade and node (tree of Human, Mouse and Fly with one node and one clade labelled)
(Source: http://www.bioon.com/book/biology/bioinformatics/chapter-14.pdf)

Even with this simple tree, a number of terms that are used frequently in phylogenetic analysis can be introduced:
• A clade is a monophyletic taxon. Clades are groups of organisms or genes that include the most recent common ancestor of all of their members and all of the descendants of that most recent common ancestor. Clade is derived from the Greek word ‘klados,’ meaning branch or twig.
• A taxon is any named group of organisms but not necessarily a clade.
• In some analyses, branch lengths correspond to divergence (for example, in the tree above, mouse is slightly more related to fly than human is to fly).
• A node is a bifurcating branch point.
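As a rough illustration of these terms, the three-taxon tree can be written as nested tuples, where every tuple is a node and the set of leaves beneath a node is a clade. The tuple representation, the function names `leaves` and `clades`, and the topology (human and mouse grouped together, fly outside) are assumptions made for this sketch, not something the source specifies.

```python
# A tree as nested tuples: a string is a leaf (taxon), a tuple is a node.
# The topology below is an assumed reading of Fig. 5.1.
tree = (("Human", "Mouse"), "Fly")


def leaves(node):
    """Return the set of leaf names beneath a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def clades(node):
    """List the leaf set of every internal node (each one is a clade)."""
    if isinstance(node, str):
        return []
    found = [leaves(node)]
    for child in node:
        found.extend(clades(child))
    return found


all_clades = clades(tree)
print(all_clades)
```

Under this assumed topology, human and mouse form a clade, whereas mouse and fly on their own do not, because their most recent common ancestor is also an ancestor of human.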

Macromolecules, especially sequences, have surpassed morphological and other organism characters as the most popular form of data for phylogenetic or cladistic analysis. Although numerous phylogenetic algorithms, procedures, and computer programs have been devised, their reliability and practicality are, in all cases, dependent on the structure and size of the data.

The danger of generating incorrect results is inherently greater in computational phylogenetics than in many other fields of science. The events yielding a phylogeny happened in the past and can only be inferred or estimated. Despite the well-documented limitations of available phylogenetic procedures, current biological literature is replete with examples of conclusions derived from the results of analyses in which data had been simply run through one or another phylogeny program. Often, the limiting factor in phylogenetic analysis is not so much the computational method used as the users’ understanding of what the method is actually doing with the data.

5.2 Fundamental Elements of Phylogenetic Models

Phylogenetic tree-building methods presume particular evolutionary models. For a given data set, these models can be violated because of occurrences such as the transfer of genetic material between organisms. Thus, when interpreting a given analysis, one should always consider the model used and its assumptions and entertain other possible explanations for the observed results. As an example, consider the tree in Fig. 5.2. An investigation of organismal relationships in the tree suggests that eukaryote 1 is more related to the bacteria than to the other eukaryotes. Because the vast majority of other cladistic analyses, including those based on morphological features, suggest that eukaryote 1 is more related to the other eukaryotes than to bacteria, we suspect that for this analysis the assumption of a bifurcating pattern of evolution is incorrect. We suspect that horizontal gene transfer from an ancestor of bacteria 1, 2, and 3 to the ancestor of eukaryote 1 occurred because this would most simply explain the results.

Models inherent in phylogenetics methods make additional ‘default’ assumptions:
• The sequence is correct and originates from the specified source.
• The sequences are homologous (that is, are all descended in some way from a shared ancestral sequence).
• Each position in a sequence alignment is homologous across all of the sequences in that alignment.
• Each of the multiple sequences included in a common analysis has a common phylogenetic history with the others (for example, there are no mixtures of nuclear and organellar sequences).
• The sampling of taxa is adequate to resolve the problem of interest.
• Sequence variation among the samples is representative of the broader group of interest.
• The sequence variability in the sample contains phylogenetic signal adequate to resolve the problem of interest.

Fig. 5.2 A phylogenetic tree (Source: http://www.bioon.com/book/biology/bioinformatics/chapter-14.pdf)

Example of a phylogenetic tree based on genes that do not match organismal phylogeny, suggesting horizontal gene transfer has occurred. The ancestor of protozoan eukaryote 1 (underlined and marked with an arrow) appears to have obtained the gene from the ancestor of Bacteria 1, 2, and 3, as this is the simplest explanation for the results. This unexpected result is not without precedent: there have been a number of reported phylogenetic analyses that suggest that protozoa have taken up genes from bacteria, most likely from bacteria that they have ingested.

There are additional assumptions that are defaults in some methods but can be at least partially corrected for in others:

• The sequences in the sample evolved according to a single stochastic process.
• All positions in the sequence evolved according to the same stochastic process.
• Each position in the sequence evolved independently.

Errors in published phylogenetic analyses can often be attributed to violations of one or more of the foregoing assumptions. Every sequence data set must be evaluated against these assumptions, with other possible explanations for the observed results considered.

5.3 Tree Interpretation: Importance of Identifying Paralogs and Orthologs

As more genomes are sequenced, we are becoming more interested in learning about protein or gene evolution (that is, investigating gene phylogeny rather than organismal phylogeny). This can aid our understanding of the function of proteins and genes.

Studies of protein and gene evolution involve the comparison of homologs: sequences that have common origins but may or may not have common activity. Sequences that share an arbitrary, threshold level of similarity determined by alignment of matching bases are termed homologous. They are inherited from a common ancestor that possessed similar structure, although the structure of the ancestor may be difficult to determine because it has been modified through descent.

Homologs are most commonly orthologs, paralogs, or xenologs.
• Orthologs are homologs produced by speciation. They represent genes derived from a common ancestor that diverged due to divergence of the organisms they are associated with. They tend to have similar function.
• Paralogs are homologs produced by gene duplication. They represent genes derived from a common ancestral gene that duplicated within an organism and then subsequently diverged. They tend to have different functions.
• Xenologs are homologs resulting from horizontal gene transfer between two organisms. The determination of whether a gene of interest was recently transferred into the current host by horizontal gene transfer is often difficult.

Occasionally, the %(G + C) content may be so vastly different from the average gene in the current host that a conclusion of external origin is nearly inescapable; often, however, it is unclear whether a gene has horizontal origins. Function of xenologs can be variable depending on how significant the change in context was for the horizontally moving gene; however, in general, the function tends to be similar.
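The %(G + C) comparison can be sketched in a few lines. This is only an illustration: the sequences are invented, the function name `gc_percent` is hypothetical, and the source does not say how large a difference should count as ‘vastly different’, so no cut-off is applied here.

```python
def gc_percent(seq):
    """Percentage of G and C among the A, C, G, T bases of a DNA sequence."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T")
    gc = sum(1 for b in counted if b in "GC")
    return 100.0 * gc / len(counted)


# Hypothetical sequences, for illustration only.
host_genes = ["ATGAAATTTGCA", "ATGTTAACATAA", "ATGCATTTAAAT"]
candidate = "ATGGCGCCGGGCTGA"

host_mean = sum(gc_percent(g) for g in host_genes) / len(host_genes)
difference = gc_percent(candidate) - host_mean
print(round(host_mean, 1), round(gc_percent(candidate), 1), round(difference, 1))
```

A large difference is only a hint of external origin; as noted above, in many cases the comparison is inconclusive.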

5.4 Phylogenetic Data Analysis

A straightforward phylogenetic analysis consists of four steps:
• Alignment (both building the data model and extracting a phylogenetic dataset)
• Determining the substitution model
• Tree building
• Tree evaluation

Each step is critical for the analysis and should be handled accordingly. For example, trees are only as good as the alignment they are based on. When performing a phylogenetic analysis, it is often insightful to build trees based on different modifications of the alignment to see how the alignment proposed influences the resulting tree.

5.4.1 Alignment: Building the Data Model

Phylogenetic sequence data usually consist of multiple sequence alignments; the individual, aligned-base positions are commonly referred to as ‘sites.’ These sites are equivalent to ‘characters’ in theoretical phylogenetic discussions, and the actual base (or gap) occupying a site is the ‘character state.’

Aligned sequence positions subjected to phylogenetic analysis represent a priori phylogenetic conclusions because the sites themselves (not the actual bases) are effectively assumed to be genealogically related, or homologous. Sites at which one is confident of homology and that contain changes in character states useful for the given phylogenetic analysis are often referred to as ‘informative sites.’
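One way to make ‘informative sites’ concrete is the criterion often used in parsimony analysis: a site counts as informative when at least two different character states each occur in at least two sequences. That criterion is an assumption brought in for this sketch (the text above defines informative sites more loosely), and the alignment and the function name `informative_sites` are hypothetical.

```python
from collections import Counter


def informative_sites(alignment):
    """Return indices of columns with >= 2 states that each occur >= 2 times.

    Gap characters ('-') are ignored when counting states.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    sites = []
    for i in range(length):
        counts = Counter(row[i] for row in alignment if row[i] != "-")
        repeated = [state for state, n in counts.items() if n >= 2]
        if len(repeated) >= 2:
            sites.append(i)
    return sites


# Hypothetical four-sequence alignment.
alignment = [
    "ACGTA",
    "ACGTT",
    "ATGCA",
    "ATG-C",
]
print(informative_sites(alignment))
```

Here only the second column (index 1) qualifies: it splits the sequences two against two, whereas the other columns are either constant or contain states seen only once.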

Steps in building the alignment include selection of the alignment procedure(s) and extraction of a phylogenetic data set from the alignment. The latter procedure requires determination of how ambiguously aligned regions and insertion/deletions (referred to as indels, or gaps) will be treated in the tree-building procedure.

A typical alignment procedure involves the application of a program such as CLUSTAL W, followed by manual alignment editing and submission to a tree building program. This procedure should be performed with the following questions and considerations in mind.

5.4.2 Determining the Substitution Model

The substitution model should be given the same emphasis as alignment and tree building. As implied in the preceding section, the substitution model influences both alignment and tree building; hence, a recursive approach is warranted. At the present time, two elements of the substitution model can be computationally assessed for nucleotide data but not for amino acid or codon data. One element is the model of substitution between particular bases; the other is the relative rate of overall substitution among different sites in the sequence. Simple computational procedures have not been developed for assessing more complex variables (for example, site- or lineage-specific substitution models).
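To show what a model of substitution between bases does in practice, the sketch below applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), to the observed proportion p of differing sites. The text does not name a specific model; Jukes-Cantor is used here only because it is the simplest one (all substitutions equally likely), and the sequences are invented.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTTCGTAA")  # hypothetical sequences
d = jukes_cantor(p)
print(p, d)
```

The corrected distance is larger than the raw proportion because the model allows for sites that changed more than once; a different substitution model would give a different correction.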

5.4.3 Tree-Building Methods

Tree building methods can be sorted into distance-based vs. character-based methods. Much of the discussion in molecular phylogenetics dwells on the utility of distance and character-based methods (example, Saitou, 1996; Li, 1997). Distance methods compute pairwise distances according to some measure and then discard the actual data, using only the fixed distances to derive trees. Character-based methods derive trees that optimise the distribution of the actual data patterns for each character. Pairwise distances are, therefore, not fixed, as they are determined by the tree topology. The most commonly applied distance-based methods include neighbor-joining, and the most common character-based methods include maximum parsimony and maximum likelihood.

• Distance-based: Transform the data into pairwise distances (dissimilarities), and then use a matrix during tree building.
• Character-based: Use the aligned characters, such as DNA or protein sequences, directly during tree inference, based on substitutions.
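The first stage of any distance-based method, turning an alignment into a matrix of pairwise distances, can be sketched as follows. This is not neighbor-joining itself: it only builds the matrix and reports the closest pair. The sequences, the function names, and the use of the uncorrected proportion of differences as the distance measure are all assumptions made for illustration.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Map each unordered pair of names to its pairwise distance."""
    return {
        (n1, n2): hamming_fraction(seqs[n1], seqs[n2])
        for n1, n2 in combinations(sorted(seqs), 2)
    }


# Hypothetical aligned sequences.
seqs = {
    "human": "ACGTACGT",
    "mouse": "ACGTACGA",
    "fly": "ATGAACTA",
}
matrix = distance_matrix(seqs)
closest = min(matrix, key=matrix.get)
print(matrix, closest)
```

Once the matrix exists, the sequences themselves are no longer consulted, which is exactly the sense in which distance methods ‘discard the actual data’.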

5.4.4 Tree Evaluation

Several procedures are available that evaluate the phylogenetic signal in the data and the robustness of trees (Swofford et al., 1996; Li, 1997). The most popular of the former class are tests of data signal versus randomised data (skewness and permutation tests). The latter class includes tests of tree support from resampling of observed data (nonparametric bootstrap). The likelihood ratio test provides a means of evaluating both the substitution model and the tree.

Bootstrap

Bootstrapping is a resampling tree evaluation method that works with distance, parsimony, likelihood, and just about any other tree derivation method. It was invented in 1979 (Efron, 1979) and introduced as a tree evaluation method in phylogenetic analysis by Felsenstein (1985). The result of bootstrap analysis is typically a number associated with a particular branch in the phylogenetic tree that gives the proportion of bootstrap replicates that supports the monophyly of the clade.

Bootstrapping can be considered a two-step process comprising the generation of (many) new data sets from the original set and the computation of a number that gives the proportion of times that a particular branch (for example, a taxon) appeared in the tree. That number is commonly referred to as the bootstrap value. New data sets are created from the original data set by sampling columns of characters at random from the original data set with replacement. ‘With replacement’ means that each site can be sampled again with the same probability as any of the other sites. As a consequence, each of the newly created data sets has the same number of total positions as the original data set, but some positions are duplicated or triplicated and others are missing. It is therefore possible that some of the newly created data sets are completely identical to the original set or, at the other extreme, that for an original data set of 500 positions only one of the sites is replicated 500 times, whereas the remaining 499 positions are dropped.
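The two steps can be sketched as below. Column resampling follows the description above directly. The second step would normally rebuild a tree from every replicate; to keep the example short, a crude stand-in (`first_two_group`, which only asks whether the first two sequences are the closest pair) takes the place of tree building. All names and sequences are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; return a new alignment."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(row[i] for i in columns) for row in alignment]


def bootstrap_value(alignment, supports_clade, replicates=100, seed=0):
    """Proportion of replicates for which `supports_clade` returns True."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(replicates)
        if supports_clade(bootstrap_alignment(alignment, rng))
    )
    return hits / replicates


def differences(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def first_two_group(replicate):
    """Stand-in for tree building: are rows 0 and 1 the closest pair?"""
    d01 = differences(replicate[0], replicate[1])
    d02 = differences(replicate[0], replicate[2])
    d12 = differences(replicate[1], replicate[2])
    return d01 < d02 and d01 < d12


alignment = ["ACGTACGTAC", "ACGTACGTTC", "ATGAACTTAG"]  # hypothetical
value = bootstrap_value(alignment, first_two_group, replicates=200, seed=1)
print(value)
```

Every replicate has the same number of rows and sites as the original alignment; only the choice of columns changes, with some drawn several times and others not at all.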

Although it has become common practice to include bootstrapping as part of a thorough phylogenetic analysis, there is some discussion on what exactly is measured by this method. It was originally suggested that the bootstrap value is a measure of repeatability (Felsenstein, 1985). In more recent interpretations, it has been considered to be a measure of accuracy, a biologically more relevant parameter that gives the probability that the true phylogeny has been recovered. On the basis of simulation studies, it has been suggested that, under favourable conditions (roughly equal rates of change, symmetric branches), bootstrap values greater than 70% correspond to a probability of greater than 95% that the true phylogeny has been found (Hillis and Bull, 1993). By the same token, under less favourable conditions, bootstrap values greater than 50% will be overestimates of accuracy (Hillis and Bull, 1993). Simply put, under certain conditions, high bootstrap values can make the wrong phylogeny look good; therefore, the conditions of the analysis must be considered. Bootstrapping can be used in experiments in which trees are recomputed after internal branches are deleted one at a time. The results provide information on branching orders that are ambiguous in the full data set (cf. Leipe et al., 1994).

Summary

• Bootstrapping is a resampling tree evaluation method that works with distance, parsimony, likelihood, and just about any other tree derivation method.
• Distance methods compute pairwise distances according to some measure and then discard the actual data, using only the fixed distances to derive trees.
• Character-based methods derive trees that optimise the distribution of the actual data patterns for each character.
• The substitution model should be given the same emphasis as alignment and tree building.
• Phylogenetic sequence data usually consist of multiple sequence alignments; the individual, aligned-base positions are commonly referred to as ‘sites.’
• Phylogenetic tree-building methods presume particular evolutionary models.
• Phylogenetics is the study of evolutionary relationships.
• Phylogenetic analysis is the means of inferring or estimating these relationships. It is the process you use to determine the evolutionary relationships between organisms.
• Clades are groups of organisms or genes that include the most recent common ancestor of all of their members and all of the descendants of that most recent common ancestor.

References

• Baxevanis, A. D. & Ouellette, B. F., 2001. Bioinformatics: a practical guide to the analysis of genes and proteins, John Wiley and Sons.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI Learning Pvt. Ltd., p.456.
• Brinkman, F. S. L., 2001. Phylogenetic Analysis [pdf] Available at: <http://www.bioon.com/book/biology/bioinformatics/chapter-14.pdf> [Accessed 28 February 2012].
• NCBI. Systematics and Molecular Phylogenetics [Online] Available at: <http://www.ncbi.nlm.nih.gov/About/primer/phylo.html> [Accessed 28 February 2012].
• Thermy33, 2011. Understanding Phylogenetic Trees [Video Online] Available at: <http://www.youtube.com/watch?v=xwuhmMIIspo> [Accessed 28 February 2012].
• UCBerkeley, 2010. Biology 1B - Lecture 24: Phylogenetics [Video Online] Available at: <http://www.youtube.com/watch?v=vrGfDPteKqU> [Accessed 28 February 2012].

Recommended Reading

• Livingstone & Barton, 1993. Protein Sequence Alignments: a Strategy for the Hierarchical Analysis of Residue Conservation, Computer Applications in the Biosciences.
• Steel, M. A., 2003. Phylogenetics, Oxford University Press.
• Jogota, A., 2005. Computational Methods in Phylogenetic Analysis.

Self Assessment

1. ________ is a resampling tree evaluation method that works with distance, parsimony, likelihood, and just about any other tree derivation method.
a. Phylogenetics
b. Bootstrapping
c. Phylogenetic analysis
d. Cladistics

2. Which of the following statements is false?
a. Distance methods compute pairwise distances according to some measure and then discard the actual data, using only the fixed distances to derive trees.
b. Distance-based methods derive trees that optimise the distribution of the actual data patterns for each character.
c. The substitution model should be given the same emphasis as alignment and tree building.
d. Phylogenetic sequence data usually consist of multiple sequence alignments; the individual, aligned-base positions are commonly referred to as ‘sites.’

3. Which of the following statements is false?
a. Phylogenetic tree-building methods presume particular evolutionary models.
b. Cladistics is the study of evolutionary relationships.
c. Phylogenetic analysis is the means of inferring or estimating these relationships.
d. Phylogenetic analysis is the process you use to determine the evolutionary relationships between organisms.

4. Bootstrapping can be considered a ______ process.
a. two-step
b. three-step
c. four-step
d. six-step

5. Which of these is not a step of phylogenetic analysis?
a. Alignment
b. Determining the substitution model
c. Tree building
d. Tree prediction

6. ______ are homologs produced by speciation.
a. Orthologs
b. Paralogs
c. Xenologs
d. Analogs

7. _______ are homologs produced by gene duplication.
a. Orthologs
b. Paralogs
c. Xenologs
d. Analogs

8. _______ are homologs resulting from horizontal gene transfer between two organisms.
a. Orthologs
b. Paralogs
c. Xenologs
d. Analogs

9. Which of the following statements is false?
a. Phylogenetic tree-building methods presume particular evolutionary models.
b. Macromolecules, especially sequences, have surpassed morphological and other organismal characters as the most popular form of data for phylogenetic or cladistic analysis.
c. Clade is derived from the Greek word ‘klados,’ meaning branch or twig.
d. A node is any named group of organisms but not necessarily a clade.

10. Which of the following statements is false?
a. A clade is a bifurcating branch point.
b. In some analyses, branch lengths correspond to divergence.
c. There is a bifurcating pattern of cladogenesis. This assumption is controversial.
d. Change in characteristics occurs in lineages over time.

Chapter VI

Microarray Technology: A Boon to Biological Sciences

Aim

The aim of this chapter is to:

• define microarray
• describe the microarray technique
• enumerate characteristics of microarrays

Objectives

The objectives of this chapter are to:

• define hybridisation
• describe potential of microarray analysis
• elucidate the microarray products

Learning outcome

At the end of this chapter, you will be able to:

• understand hybridisation technique
• enumerate the applications of microarrays
• know gene discovery with bioinformatics

6.1 Introduction to Microarray

Molecular biology research evolves through the development of the technologies used for carrying it out. It is not possible to study a large number of genes using traditional methods. DNA microarray is one such technology, which enables researchers to investigate and address issues that were once thought to be non-traceable. One can analyse the expression of many genes in a single reaction quickly and in an efficient manner. DNA microarray technology has empowered the scientific community to understand the fundamental aspects underlying the growth and development of life as well as to explore the genetic causes of anomalies occurring in the functioning of the human body.

Although all of the cells in the human body contain identical genetic material, the same genes are not active in every cell. Studying active genes and inactive genes in different cell types helps scientists to understand both how these cells function normally and how they are affected when various genes do not perform properly. In the past, scientists have only been able to conduct these genetic analyses on a few genes at once. With the development of DNA microarray technology, however, scientists can now examine how active thousands of genes are at any given time.

All living organisms contain DNA, a molecule that encodes all the information required for the development and functioning of an organism. Finding and deciphering the information encoded in DNA, and understanding how such a simple molecule can give rise to the amazing biological diversity of life, is a goal shared in some way by all life scientists. Microarrays provide an unprecedented view into the biology of DNA, and thus a rich way to examine living systems. DNA is a physical molecule that is able to encode information in a linear structure. Cells express information from different parts of this structure in a context-dependent fashion. DNA encodes for genes, and regulatory elements control whether genes are on or off. For instance, all the cells of the human body contain the same DNA, yet there are hundreds of different types of cells, each expressing a unique configuration of genes from the DNA. In this regard, DNA could be described as existing in some number of states. Microarrays are a tool used to read the states of DNA. Microarrays have had a transforming effect on the biological sciences. In the past, biologists had to work very hard to generate small amounts of data that could be used to explore a hypothesis with one observation at a time.

With the advent of microarrays, individual experiments generate thousands of data points or observations. This turns the experiment from a hypothesis-driven endeavour to a hypothesis generating endeavour because every experiment sheds light across an entire terrain of gene expression, letting relevant genes reveal themselves, often in surprising ways.

The highly parallel nature of microarrays that are used to make biological observations signifies that most experiments generate more information than the experimenter could possibly interpret. Indeed, from a statistical point of view, every gene measured on a microarray is an independent variable in a highly parallel experiment. The number of hypotheses to which the data may or may not lend support cannot be known in advance. To take advantage of the excess information in microarray data, repositories have been set up in which people can deposit their experiments, thus making them available to a wide community of researchers with questions to explore.

A typical microarray experiment involves the hybridisation of an mRNA molecule to the DNA template from which it originated. Many DNA samples are used to construct an array. The amount of mRNA bound to each site on the array indicates the expression level of the various genes. The number of sites may run into thousands. All the data is collected and a profile is generated for gene expression in the cell.

6.2 Microarray Technique

An array is an orderly arrangement of samples where matching of known and unknown DNA samples is done based on base pairing rules. An array experiment makes use of common assay systems such as microplates or standard blotting membranes. The sample spot sizes are typically less than 200 microns in diameter, and an array usually contains thousands of spots.

Thousands of spotted samples (DNA) known as probes (with known identity) are immobilised on a solid support (microscope glass slides or silicon chips or nylon membrane). These are used to determine complementary binding of the unknown sequences, thus allowing parallel analysis for gene expression and gene discovery. An experiment with a single DNA chip can provide information on thousands of genes simultaneously. An orderly arrangement of the probes on the support is important as the location of each spot on the array is used for the identification of a gene.
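The role of spot location can be pictured as a lookup table: the array layout records which probe sits at each position, so a signal measured at a position can be assigned to a gene. The sketch below assumes a tiny 2 x 2 layout with made-up gene names and signal values; none of these come from the source.

```python
# Hypothetical array layout: (row, column) -> name of the gene whose probe
# was spotted there.
layout = {
    (0, 0): "geneA",
    (0, 1): "geneB",
    (1, 0): "geneC",
    (1, 1): "geneD",
}

# Hypothetical measured signal at each spot after hybridisation.
signal = {
    (0, 0): 120.0,
    (0, 1): 15.0,
    (1, 0): 980.0,
    (1, 1): 40.0,
}


def expression_profile(layout, signal):
    """Join spot signals to gene names using the spot coordinates."""
    return {gene: signal[spot] for spot, gene in layout.items()}


profile = expression_profile(layout, signal)
strongest = max(profile, key=profile.get)
print(profile, strongest)
```

If the layout were scrambled, the same signals would be attributed to the wrong genes, which is why the orderly arrangement matters.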

The genome is an information scaffold

Microarrays measure events in the genome. An event may be the transcription of a gene, the binding of a protein to a segment of the DNA, the presence or absence of a mutation, a change in the copy number of a locus, a change in the methylation state of the DNA, or any of a number of states or activities that are associated with DNA or RNA molecules. As a genomic readout, microarrays identify where these events occur. The idea that one can accurately describe the genome, let alone measure its activity in a comprehensive way, is a relatively novel concept. Several factors have led to the recent enhancement and blending of molecular biology into a field called genomics. The first is genome-sequencing projects.

Today, sequencing a genome is considered a routine activity. However, in the late 1980s, when sequencing the human genome was first suggested as a serious endeavour, the community was divided. Given the sequencing technology available at the time, the project looked as if it would consume colossal resources over a long time frame that many thought could be put to better use on more practical projects. However, visionaries were banking on two precepts. The first was that, once given the mandate, the technology would transform itself and new sequencing methods would be invented that would increase the rate of sequence accumulation. The second was that the finished project, full genome sequences, would be a public gold mine of a resource that would pay off for all biologists. Both of these assumptions have come to fruition. Genome sequences accumulate at rates few imagined possible. Biologists can expect the sequence of their model organism to exist in GenBank or to be in someone’s sequencing pipeline. More important, having a map of the full genomic sequence of an organism has transformed the way biology is studied.

DNA gives rise to the organism and so is a scaffold for information. The genomic map is like a landscape of code, openly visible to all and for anyone to figure out. Through experimentation, often involving microarrays, DNA is annotated with functional information. In addition, the large-scale sequencing effort served as a kind of space program for biology, whereby the genome was a new frontier. It made possible previously unforeseen possibilities and conceptually paved the way for a host of parallel analysis methods. The unveiling of a unified map begged the creation of microarrays, as well as other large genome-sized projects, such as the systematic deletion of every yeast gene, the systematic fusion of every yeast promoter to a reporter gene and many other similar projects. As the invention of the telescope changed how we view the universe, microarrays have changed the way we view the genome.

Gene expression is detected by hybridisation

The purpose of a microarray is to examine expression of multiple genes simultaneously in response to some biological perturbation. More generally, a microarray serves to interrogate the concentrations of molecules in a complex mixture and thus can serve as a powerful analytical tool for many kinds of experiments. To understand how this occurs, it may be useful to review the structure of DNA and examine how the unique structure of this molecule plays a role in identifying itself. Although DNA is remarkably informationally complex, the general structure of the molecule is really quite simple.

DNA is made up of four chemical building blocks called bases: adenine (A), cytosine (C), guanine (G), and thymine (T). As individual subunits, together with the sugar and phosphate they are attached to, these building blocks are also referred to as nucleotides. A strand of DNA consists of a sugar phosphate backbone to which these bases are covalently linked such that they form a series. Because these four bases can form sequences, it is possible to use them to encode information based on their patterns of occurrence. Indeed, from an information point of view, DNA has a potential data density of 145 million bits per inch and has been considered as a substrate for computation whereby the sequences are referred to as software.

Like strings of text in a book, the sequences that make up a strand of DNA have directionality such that information can be encoded in a given direction. The amount of DNA, and thus the amount of sequence, varies from organism to organism. For instance, the microorganism Escherichia coli has 4.5 million bases of sequence, whereas human cells have about 3 billion bases. Exactly how much biological information is encoded in these sequences is unknown, representing one of the deepest mysteries of biology, but microarrays provide a way to gain clues. Cellular DNA most often consists not just of one strand but of two strands anti-parallel to each other. The two strands are hydrogen bonded together by interactions between the bases, forming a structure in the cell. The structure is helical, similar to a spiral staircase in which the bases are attached to each side and interact in a plane to form the steps of the staircase. Besides the hydrogen bonds between the bases of opposite strands, the overlapping and proximity of the bases to each other lead to a second kind of non-covalent force called a stacking interaction that contributes to the stability of the double-stranded structure.

The bases of one strand interact with the bases of the other strand according to a set of pairing rules, such that A pairs with T and C pairs with G. Thus, if one knows the sequence of one strand, by definition, then one knows the sequence of the opposite strand. This property has profound consequences in the study of biology. It is also what the cell uses to replicate itself. As the interaction between the bases is non-covalent, consisting only of hydrogen bonds, the strands can essentially be melted apart and separated, thus opening the way for a copying mechanism to read each single strand and re-create the second complementary strand for each half of the pair, resulting in a new double-stranded molecule for each cell. This is also the mechanism by which cells express genes. The strands are opened by the gene expression machinery so that some number of RNA copies of a gene can be synthesised.
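The pairing rules translate directly into code. The sketch below derives the opposite strand from a given strand; because the two strands are anti-parallel, the result is reversed so that it reads in its own direction. The function name and the example sequence are illustrative only.

```python
# Watson-Crick pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand):
    """Return the opposite strand, read in its own direction (reversed)."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


print(complementary_strand("ATGCCG"))
```

Applying the function twice returns the original sequence, which is another way of saying that each strand fully determines the other.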

The RNA transcript has the same sequence as the gene, except that uracil (U) replaces T; the hybridisation pairing rules remain the same (U and T can both pair with A). This property of complementarity is also what is used for measuring gene expression on microarrays.
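The pairing rules and the T-to-U substitution described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are made up for this sketch, and it assumes sequences written 5' to 3' using only the letters A, C, G and T.

```python
# Watson-Crick pairing rules for DNA: A pairs with T, C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    # The two strands are anti-parallel, so the partner strand is read backwards.
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand.upper()))


def transcribe(gene: str) -> str:
    """Return an RNA copy with the same sequence as the gene, U replacing T."""
    return gene.upper().replace("T", "U")


gene = "ATGGCCTTA"                 # hypothetical example sequence
print(reverse_complement(gene))    # TAAGGCCAT
print(transcribe(gene))            # AUGGCCUUA
```

Applying `reverse_complement` twice returns the original sequence, which is the computational counterpart of the statement that knowing one strand means knowing the other.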

Just as energy can melt strands apart and separate them into single molecules, the process is reversible: single strands that are complementary to each other can come together and reanneal to form a double-stranded complex. This process is called hybridisation and is the basis for many assays or experiments in molecular biology. In the cell, hybridisation is at the centre of several biological processes, whereas in the lab complementarity serves as a test of identity, and thus hybridisation is at the centre of many in vitro reactions and analytical techniques. The molecules can come from completely different sources, but if they match, they will hybridise.
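As a rough illustration of "if they match, they will hybridise", the sketch below tests whether a probe is perfectly complementary to some stretch of a target strand. It is a deliberate simplification: real hybridisation tolerates some mismatches and depends on temperature and salt concentration, none of which is modelled here. The function names and sequences are hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Sequence of the strand that pairs with seq, read 5'->3'."""
    return "".join(PAIRS[base] for base in reversed(seq))


def can_hybridise(probe: str, target: str) -> bool:
    """True if some stretch of target is perfectly complementary to probe."""
    return reverse_complement(probe) in target


probe = "GGCCAT"                             # hypothetical probe on the array
print(can_hybridise(probe, "TTATGGCCAA"))    # True: the target contains ATGGCC
print(can_hybridise(probe, "TTTTTTTTTT"))    # False: no complementary stretch
```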

6.3 Potential of Microarray Analysis

The academic research community stands to benefit from microarray technology just as much as the pharmaceutical industry. The ability to use it in place of existing technology will allow researchers to perform experiments faster and more cheaply, and will enable them to concentrate on analysing the results of microarray experiments rather than simply performing the experiments. This research could then lead to a better understanding of the disease process, which will require many different levels of research. While the field of expression has received most attention so far, looking at the gene copy level and protein level is just as important. Microarray technology has potential applications in each of these three levels.

Identifying drug targets provided the initial market for the microarrays. A good drug target has extraordinary value for developing pharmaceuticals. By comparing the ways in which genes are expressed in a normal and diseased heart, for example, scientists might be able to identify the genes and hence the associated proteins that are part of the disease process. Researchers could then use that information to synthesise drugs that interact with these proteins, thus reducing the disease’s effect on the body.

Gene sequences can be measured simultaneously and calculated instantly when an ordered set of DNA molecules of known sequence, a microarray, is used. Consequently, scientists can evaluate an entire set of genes at once, rather than looking at physiological changes one gene at a time. For example, Genetics Institute, a biotechnology company in Cambridge, Massachusetts, built an array consisting of genes for cytokines, which are proteins that affect cell physiology during the inflammatory response, among other effects. The full set of DNA molecules contained more than 250 genes. While that number was not large by current standards of microarrays, it vastly outnumbered the one or two genes examined in typical pre-microarray experiments. The Genetics Institute scientists used the array to study how changes experienced by cells in the immune system during the inflammatory response are reflected in the behaviour of all 250 genes at the same time. This experiment established the potential for using the patterns of response to help locate points in the body at which drugs could prove most effective.

6.4 Microarray Products

Within that basic technological foundation, microarray companies have created a variety of products and services. They range in price, and involve several different technical approaches. A kit containing a simple array with limited density can cost as little as $1,100, while a versatile system favoured by R&D laboratories in pharmaceutical and biotechnology companies costs more than $200,000. The differences among products lie in the basic components and the precise nature of the DNA on the arrays.

The type of molecule placed on the array units also varies according to circumstances. The most commonly used molecule is complementary DNA (cDNA). Since each cDNA is derived from a distinct messenger RNA, each feature represents an expressed gene.

6.5 Microarray: Identifying Interactions

To detect interactions at microarray features, scientists must label the test sample in such a way that an appropriate instrument can recognise it. Since the minute size of microarray features limits the amount of material that can be located at any feature, detection methods must be extremely sensitive. Other than a few low-end systems that use radioactive or chemiluminescent tagging, most microarrays use fluorescent tags as their means of identification. These labels can be delivered to the DNA units in several different ways. The simplest way of doing so has low sensitivity because it delivers only one unit of label per interaction. Technologists can achieve more sensitivity by multiplexing the labelled entity, that is, delivering more than one unit of label per interaction.

6.6 Applications of Microarrays

Microarray technology will help researchers to learn more about many different diseases, including heart disease, mental illness and infectious diseases, to name only a few. One intense area of microarray research at the National Institutes of Health (NIH) is the study of cancer. In the past, scientists have classified different types of cancers based on the organs in which the tumours develop. With the help of microarray technology, however, they will be able to further classify these types of cancers based on the patterns of gene activity in the tumour cells. Researchers will then be able to design treatment strategies targeted directly to each specific type of cancer. Additionally, by examining the differences in gene activity between untreated and treated tumour cells (for example, those that are irradiated or oxygen-starved), scientists will understand exactly how different therapies affect tumours and be able to develop more effective treatments.

Gene discovery: DNA microarray technology helps in the identification of new genes and in learning about their functioning and expression levels under different conditions.

Disease diagnosis: DNA microarray technology helps researchers learn more about different diseases such as heart diseases, mental illness and infectious disease, and especially about cancer. Until recently, different types of cancer have been classified on the basis of the organs in which the tumours develop. Now, with the evolution of microarray technology, it will be possible for researchers to further classify the types of cancer on the basis of the patterns of gene activity in the tumour cells. This will tremendously help the pharmaceutical community to develop more effective drugs, as the treatment strategies will be targeted directly to the specific type of cancer.

Drug discovery: Microarray technology has an extensive application in pharmacogenomics. Pharmacogenomics is the study of correlations between therapeutic responses to drugs and the genetic profiles of the patients. Comparative analysis of the genes from a diseased and a normal cell will help the identification of the biochemical constitution of the proteins synthesised by the diseased genes. Researchers can use this information to synthesise drugs that act against these proteins and reduce their effect.


Toxicological research: Microarray technology provides a robust platform for research into the impact of toxins on cells and their passing on to the progeny. Toxicogenomics establishes correlations between responses to toxicants and the changes in the genetic profiles of the cells exposed to such toxicants.

The characteristics of microarrays include:
• They allow simultaneous measurement of gene expression.
• They reveal differential expression and changes over time.
• A single microarray can test about 10,000 genes.
• Data are obtained faster than they can be processed.
• They can find genes that behave similarly.
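One common way to make "genes that behave similarly" concrete is to compute the Pearson correlation between two expression profiles measured over the same time points. The sketch below is a minimal plain-Python version; the gene names and values are invented for illustration, and correlation is only one of several similarity measures that could be used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical expression levels at four time points.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.1, 3.9, 6.2, 8.0],   # rises together with gene1
    "gene3": [4.0, 3.1, 2.0, 0.9],   # falls while gene1 rises
}

for other in ("gene2", "gene3"):
    r = pearson(profiles["gene1"], profiles[other])
    print(f"gene1 vs {other}: r = {r:.3f}")
```

A value close to +1 suggests the two genes rise and fall together, while a value close to -1 suggests opposite behaviour.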

Fig. 6.1 Gene expression data (Source: http://www.science.co.il/enuka/essays/microarray-review.pdf)

Each spot represents the expression level of a gene in two different experiments. Yellow or red spots indicate that the gene is expressed in one experiment. Green spots show that the gene is expressed at the same level in both experiments. Each box represents one gene’s expression over time. Tracking a sample over a period of time shows how gene expression changes over time; tracking two different samples under the same conditions shows the differences in gene expression between them.
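Comparing a gene's level in two experiments is often summarised as a log ratio of the two measured intensities. The sketch below assumes each spot yields one intensity per experiment; the intensities, the floor value and the two-fold threshold are illustrative assumptions, not values taken from the figure.

```python
import math


def log2_ratio(intensity_a, intensity_b, floor=1.0):
    """log2(A/B); intensities below `floor` are raised to it to avoid log(0)."""
    return math.log2(max(intensity_a, floor) / max(intensity_b, floor))


def classify(ratio, threshold=1.0):
    """Label a spot using a two-fold cut-off (|log2 ratio| >= 1)."""
    if ratio >= threshold:
        return "higher in experiment A"
    if ratio <= -threshold:
        return "higher in experiment B"
    return "similar in both"


# Hypothetical (experiment A, experiment B) intensities for three spots.
spots = {
    "gene1": (4000.0, 1000.0),
    "gene2": (900.0, 1000.0),
    "gene3": (250.0, 2000.0),
}

for gene, (a, b) in spots.items():
    ratio = log2_ratio(a, b)
    print(f"{gene}: log2 ratio = {ratio:+.2f} ({classify(ratio)})")
```

Working on a log scale makes a four-fold increase (+2) and a four-fold decrease (-2) symmetric around zero, which is why it is a popular convention.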


Fig. 6.2 Gene expression over time (Source: http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt)


Summary
• Molecular biology research evolves through the development of the technologies used for carrying it out.
• DNA is a physical molecule that is able to encode information in a linear structure.
• DNA encodes for genes, and regulatory elements control whether genes are on or off.
• Microarrays are a tool used to read the states of DNA. Microarrays have had a transforming effect on the biological sciences.
• A typical microarray experiment involves the hybridisation of an mRNA molecule to the DNA template from which it originated.
• An array is an orderly arrangement of samples where matching of known and unknown DNA samples is done based on base pairing rules.
• An array experiment makes use of common assay systems such as micro plates or standard blotting membranes.
• In a microarray, the sample spot sizes are typically less than 200 microns in diameter, and an array usually contains thousands of spots.
• Microarrays measure events in the genome.
• Several factors have led to the recent enhancement and blending of molecular biology into a field called genomics.
• The purpose of a microarray is to examine expression of multiple genes simultaneously in response to some biological perturbation.
• DNA is made up of four chemical building blocks called bases: adenine (A), cytosine (C), guanine (G), and thymine (T).
• Cellular DNA most often consists not just of one strand but of two strands anti-parallel to each other.
• Identifying drug targets provided the initial market for the microarrays.
• Gene sequences can be measured simultaneously and calculated instantly when an ordered set of DNA molecules of known sequence, a microarray, is used.
• Microarray technology will help researchers to learn more about many different diseases, including heart disease, mental illness and infectious diseases, to name only a few.
• DNA microarray technology helps in the identification of new genes and in learning about their functioning and expression levels under different conditions.
• Microarray technology has an extensive application in pharmacogenomics.
• Microarray technology provides a robust platform for research into the impact of toxins on cells and their passing on to the progeny.

References
• Baxevanis, A. D. & Ouellette, B. F., 2001. Bioinformatics: a practical guide to the analysis of genes and proteins, John Wiley and Sons.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI Learning Pvt. Ltd., p.456.
• Korol, A. B., 2001. Microarray cluster analysis and applications [pdf] Available at: <http://www.science.co.il/enuka/essays/microarray-review.pdf> [Accessed 28 February 2012].
• Clustering [Online] Available at: <http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt> [Accessed 28 February 2012].
• HaroonBBT, 2011. Microarray [Video Online] Available at: <http://www.youtube.com/watch?v=wKcQZVeIK-k&feature=related> [Accessed 28 February 2012].
• wenl888, 2012. Easy to use microarray data analysis tool - No training needed: Goober [Video Online] Available at: <http://www.youtube.com/watch?v=nSlhCaJKhjY> [Accessed 28 February 2012].


Recommended Reading
• Stekel, D., 2003. Microarray bioinformatics, Cambridge University Press, p.263.
• Borlak, 2005. Handbook of toxicogenomics: Strategies and applications, Wiley-VCH.
• Zelikovsky, A., 2008. Bioinformatics algorithms: techniques and applications, John Wiley & Sons.


Self Assessment

1. All living organisms contain _____, a molecule that encodes all the information required for the development and functioning of an organism.
a. DNA
b. RNA
c. protein
d. nucleotide

2. _________ is an orderly arrangement of samples where matching of known and unknown DNA samples is done based on base pairing rules.
a. Microplate
b. Array
c. Blotting membrane
d. Probe

3. Thousands of spotted samples (DNA) known as __________ (with known identity) are immobilised on a solid support.
a. microplates
b. arrays
c. blotting membranes
d. probes

4. Which of these is not included as a solid support for microarray techniques?
a. Microscope glass slides
b. Silicon chips
c. Nylon membrane
d. Copper membrane

5. Microarrays measure events in the _________.
a. genome
b. proteome
c. DNA
d. RNA

6. The _______ is like a landscape of code, openly visible to all and for anyone to figure out.
a. genomic map
b. proteome
c. microarray
d. probe

7. Through experimentation, often involving microarrays, DNA is ______ with functional information.
a. annotated
b. marked
c. tagged
d. methylated


8. DNA is made up of four chemical building blocks called ___________.
a. bases
b. arrays
c. sugar phosphate backbones
d. bonds

9. Which of the following statements is false?
a. A good drug target has extraordinary value for developing pharmaceuticals.
b. Gene sequences can be measured simultaneously and calculated instantly when an ordered set of protein molecules of known sequence, a microarray, is used.
c. Since the minute size of microarray features limits the amount of material that can be located at any feature, detection methods must be extremely sensitive.
d. Microarray technology will help researchers to learn more about many different diseases, including heart disease, mental illness and infectious diseases.

10. Which of the following statements is false?
a. DNA microarray technology helps in the identification of new genes and in learning about their functioning and expression levels under different conditions.
b. DNA microarray technology helps researchers learn more about different diseases such as heart diseases, mental illness, infectious disease and especially the study of cancer.
c. Different types of cancer have been classified on the basis of the organs in which the tumours develop.
d. Hybridisation is the study of correlations between therapeutic responses to drugs and the genetic profiles of the patients.


Chapter VII

Bioinformatics in Drug Discovery: A Brief Overview

Aim

The aim of this chapter is to:

• define drug discovery
• explain the concept of electronic medical records
• describe the impact of bioinformatics in medical sciences

Objectives

The objectives of this chapter are to:

• define drug-likeness
• describe the potential of pharmacogenomics
• elucidate the drug discovery process

Learning outcome

At the end of this chapter, you will be able to:

• understand the application of bioinformatics in computer-aided drug design
• enumerate the benefits of CADD
• know bioinformatics tools


7.1 Introduction

In recent years, we have seen an explosion in the amount of biological information that is available. Various databases are doubling in size every 15 months and we now have the complete genome sequences of more than 100 organisms. It appears that the ability to generate vast quantities of data has surpassed the ability to use this data meaningfully. The pharmaceutical industry has embraced genomics as a source of drug targets. It also recognises that the field of bioinformatics is crucial for validating these potential drug targets and for determining which ones are the most suitable for entering the drug development pipeline.

Recently, there has been a change in the way that medicines are being developed due to our increased understanding of molecular biology. In the past, new synthetic organic molecules were tested in animals or in whole organ preparations. This has been replaced with a molecular target approach in which in-vitro screening of compounds against purified, recombinant proteins or genetically modified cell lines is carried out with a high throughput. This change has come about as a consequence of better and ever improving knowledge of the molecular basis of disease. All marketed drugs today target only about 500 gene products. The elucidation of the human genome, which has an estimated 30,000 to 40,000 genes, presents immense new opportunities for drug discovery and simultaneously creates a potential bottleneck regarding the choice of targets to support the drug discovery pipeline. The major advances in genomics and sequencing mean that finding an attractive target is no longer a problem, but finding the targets that are most likely to succeed has become the challenge. The focus of bioinformatics in the drug discovery process has therefore shifted from target identification to target validation.

Many factors, drawn from a multitude of heterogeneous resources, need to be taken into account when assessing a candidate target. The types of information that one needs to gather about potential targets include nucleotide and protein sequence information, homologues, mapping information, function prediction, pathway information, disease associations, variants, structural information, gene and protein expression data and species/taxonomic distribution, among others. Different bioinformatics tools can be used to gather this information. The accumulation of this information into databases about potential targets means that pharmaceutical companies can save themselves much of the time, effort and expense that would otherwise go into bench work on targets that will ultimately fail. The information that is gathered helps to characterise the different targets into families and subfamilies. It also classifies the behaviour of the different molecules in a biochemical and cellular context.

Decisions about which families provide the best potential targets are guided by a number of criteria. It is important that the potential target has a suitable structure for interacting with drug molecules. Structural genomics helps to prioritise the families in terms of their 3D structures. Sometimes we want to develop broad spectrum drugs that are effective against a wide range of pathogenic species, while at other times we want to develop narrow spectrum drugs that are highly specific to a particular organism. Comparative genomics helps to find protein families that are widely taxonomically dispersed and those that are unique to a particular organism. For example, when we want to develop a broad spectrum antibiotic, we are looking for targets that are present in a large number of bacteria yet have no close homologues in humans. This means that the antibiotic will be effective against many bacteria, killing them while causing no harm to the human. In order to determine the role our potential drug target plays in a particular disease mechanism, we use DNA and protein chips. These chips can measure the amount of transcript or protein expressed by a cell at different times or in different states (healthy versus diseased).
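The broad versus narrow spectrum reasoning above amounts to set operations on a table recording which genomes contain a homologue of each protein family. The sketch below assumes such a table has already been built by sequence comparison; the family names, the presence/absence data and the 80% coverage cut-off are all hypothetical.

```python
# Hypothetical table: protein family -> genomes in which a homologue was found.
family_genomes = {
    "famA": {"E. coli", "S. aureus", "M. tuberculosis", "H. sapiens"},
    "famB": {"E. coli", "S. aureus", "M. tuberculosis"},
    "famC": {"S. aureus"},
}
BACTERIA = {"E. coli", "S. aureus", "M. tuberculosis"}
HOST = "H. sapiens"


def broad_spectrum_candidates(table, pathogens, host, min_fraction=0.8):
    """Families found in most pathogens but with no homologue in the host."""
    picks = []
    for family, genomes in sorted(table.items()):
        if host in genomes:
            continue  # a host homologue risks harming the patient
        coverage = len(genomes & pathogens) / len(pathogens)
        if coverage >= min_fraction:
            picks.append(family)
    return picks


def narrow_spectrum_candidates(table, pathogen):
    """Families found in exactly one organism, the chosen pathogen."""
    return [family for family, genomes in sorted(table.items())
            if genomes == {pathogen}]


print(broad_spectrum_candidates(family_genomes, BACTERIA, HOST))   # ['famB']
print(narrow_spectrum_candidates(family_genomes, "S. aureus"))     # ['famC']
```

In this toy table, famA is rejected because it also occurs in the host, which mirrors the requirement that an antibiotic target should have no close human homologue.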

Clustering algorithms are used to organise this expression data into different biologically relevant clusters. We can then compare the expression profiles from the diseased and healthy cells to help us understand the role our gene or protein plays in a disease process. All of these computational tools can help to compose a detailed picture about a protein family, its involvement in a disease process and its potential as a possible drug target.
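The text does not name a particular clustering algorithm. As one widely used example, the sketch below applies k-means to a handful of expression profiles in plain Python. The gene names, the values and the choice of two starting centres are invented for illustration; real analyses normally rely on an established statistics library and far larger data sets.

```python
import math


def distance(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_profile(rows):
    """Column-wise mean of a list of profiles."""
    return [sum(column) / len(column) for column in zip(*rows)]


def k_means(profiles, centres, rounds=10):
    """Group genes around the given starting centres.

    profiles: dict mapping gene name -> list of expression values
    centres:  list of starting profiles, one per cluster
    """
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in centres]
        for gene, values in profiles.items():
            nearest = min(range(len(centres)),
                          key=lambda i: distance(values, centres[i]))
            clusters[nearest].append(gene)
        # Move each centre to the mean of its members (keep it if empty).
        centres = [
            mean_profile([profiles[g] for g in members]) if members else centre
            for members, centre in zip(clusters, centres)
        ]
    return clusters


# Hypothetical expression values at four time points.
profiles = {
    "geneA": [0.1, 1.2, 2.3, 3.1],
    "geneB": [0.0, 1.0, 2.1, 3.0],
    "geneC": [3.0, 2.2, 1.1, 0.2],
    "geneD": [2.9, 2.0, 0.9, 0.0],
}
clusters = k_means(profiles, centres=[profiles["geneA"], profiles["geneC"]])
print(clusters)  # [['geneA', 'geneB'], ['geneC', 'geneD']]
```

Here the rising genes and the falling genes end up in separate clusters, the kind of grouping that can then be compared between diseased and healthy cells.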

Following on from the genomics explosion and the huge increase in the number of potential drug targets, there has been a move from the classical linear approach of drug discovery to a non-linear and high throughput approach. The field of bioinformatics has become a major part of the drug discovery pipeline, playing a key role in validating drug targets. By integrating data from many inter-related yet heterogeneous resources, bioinformatics can help in our understanding of complex biological processes and help improve drug discovery.


7.2 Drug Discovery

Drug discovery is the process of discovering and designing drugs, which includes target identification, target validation, lead identification, lead optimisation and introduction of the new drugs to the public. This process is very important, involving analysing the causes of the diseases and finding ways to tackle them.

Bioinformatics, a term coined for the application of computer science in biology, is now emerging as a major element in contemporary biology and biomedical research. There is a paradigm shift in biological research towards using computers, software tools and computational models on a large scale. Walter Gilbert, a renowned scientist, described this shift in biology as follows:

‘The new paradigm, now emerging, is that all of the ‘genes’ will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture only then turning to experiment to follow or test that hypothesis.’

Bioinformatics deals with the exponential growth in biological data, which has led to the development of primary and secondary databases of nucleic acid sequences, protein sequences and structures. Some of the well-known databases include GenBank, SWISS-PROT, PDB, PIR, SCOP, CATH and so on. These databases are available as public domain information and are hosted on various Internet servers across the world. Basic research and modelling are done using these databases with the help of sequence analysis tools like BLAST, FASTA, CLUSTALW and so on, and the modelled structures are visualised using visualisation tools such as WebLab, MOLMOL, Rasmol and so on.

Bioinformatics plays an important role in integrating the broad disciplines of biology to understand the complex mechanisms of the cell. Bioinformatics also aids the way in which biomedical investigators use the information in their testing. The complete process, from data collection to analysis of the results of such tests, may be categorised under a separate area named ‘Clinical Informatics’.

7.3 Informatics and Medical Sciences

It is often said that many doctors are averse to computers. To overcome this problem, one of the solutions proposed, after an extensive survey of 1,500 doctors from different cities, is to introduce palmtops specially tailored for physicians. These palmtops are of a size that easily fits into the pocket of a lab coat. This helps the doctor to enter, in a sequential manner, the medical data collected when moving from ward to ward. This addresses the basic need of any medical analysis: data capture and the creation of Electronic Medical Records (EMR), which eventually develop into a database for reference and analysis.

The major advantage of introducing Electronic Medical Records (EMR) is that the information can be easily accessed and shared in comparison to traditional medical records. EMR also drastically reduces the possibility of errors being introduced, due to frustration and other psychological disturbances, during manual data entry after the necessary information has been collected on paper. It also helps to eliminate the manual task of extracting data from charts or filling out specialised datasheets. The data required for a study can be obtained directly from the electronic record, thus making research data collection for analysis a by-product of routine clinical record keeping. The record environment can help to assure compliance with a research protocol, pointing out to a clinician when a patient is eligible for a study, or when the protocol for a study calls for a specific management plan given the currently available data about that patient.
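The idea of the record environment flagging eligible patients can be sketched as a simple filter over structured records. Everything in the example below is hypothetical: the record fields, the patient data and the study criteria are invented, and a real EMR system would use a standardised schema and coded vocabularies.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """A deliberately tiny stand-in for an electronic medical record."""
    patient_id: str
    age: int
    diagnoses: set = field(default_factory=set)
    medications: set = field(default_factory=set)


def eligible(record, *, required_diagnosis, min_age, max_age,
             excluded_medications):
    """True if the record satisfies every criterion of the study protocol."""
    return (
        required_diagnosis in record.diagnoses
        and min_age <= record.age <= max_age
        and not (record.medications & excluded_medications)
    )


records = [
    PatientRecord("P001", 54, {"type 2 diabetes"}, {"metformin"}),
    PatientRecord("P002", 71, {"type 2 diabetes"}, {"warfarin"}),
    PatientRecord("P003", 47, {"asthma"}),
]

# Hypothetical inclusion and exclusion criteria for a study.
criteria = dict(required_diagnosis="type 2 diabetes", min_age=18, max_age=75,
                excluded_medications={"warfarin"})

matches = [r.patient_id for r in records if eligible(r, **criteria)]
print(matches)  # ['P001']
```

Because the check runs on data already captured during routine care, no separate chart review is needed to find candidates for the study.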

In the near future one can see a situation where the complete information on the patient can be accessed from the EMR. This information can be of any type, ranging from drug trial data to the various tests performed on that patient and the outcome of such tests. The challenge in such cases will be to organise and integrate this heterogeneous information into a comprehensive, knowledge-based database from which an individual can access the necessary portion of the record for any research analysis.


7.4 Bioinformatics and Medical Sciences

Bioinformatics has a profound impact on medical sciences. Biological databases are helping physicians to diagnose disease and develop strategies for its therapy. Consider a situation where a patient with a genetic form of haemophilia meets a physician. The physician is not sure about the symptoms of the disease, and the only clue is that the patient’s family has suffered from haemophilia earlier. The physician could search the web for information on the disease by checking the OMIM (Online Mendelian Inheritance in Man) resources available at http://www.ncbi.nlm.nih.gov/omim/, which provide detailed information on genetic disorders. A focussed search for haemophilia would reveal multiple disorders, including Von Willebrand disease, and also provide the information that the primary defect in this disorder is a low level of anti-haemophilic globulin (AHG; factor VIII). Further, a search on ‘Factor VIII’ in the protein sequence database would return a match for human Factor VIII with the complete cDNA and corresponding protein sequence. The gene is linked to its DNA sequence, protein sequence and a set of references in the MEDLINE literature database. By following the MEDLINE links, the original research article (where the association of factor VIII with haemophilia is discussed) is obtained.

By following the link to the protein sequence, detailed information is obtained from the SWISS-PROT database and the Protein Information Resource (PIR). Information on the crystal structure can be obtained by following the link to the Protein Data Bank (PDB) provided in the SWISS-PROT database. Following the link to the DNA sequence in the genetic database, GenBank, the nucleotide sequence of the gene is obtained along with records of gene irregularities. Thus, the physician uses a number of databases to collect information about the disease, which helps in diagnosing it and devising strategies for therapy.

Infectious diseases are now the world’s biggest killers of children and young adults. They account for more than 13 million deaths a year, one in two deaths in developing countries, as stated by the WHO. Most deaths from infectious diseases occur in developing countries. The cause for this has been attributed to the unavailability of efficient drugs and, if they are available at all, the high cost associated with those drugs. Development of cheap and efficient drugs for a disease is one of the major problems faced by mankind. The solution to this problem could come from rational drug design using bioinformatics.

The focus of the pharmaceutical industry has shifted from the trial and error process of drug discovery to a rational, structure-based drug design. A successful and reliable drug design process could reduce the time and cost of developing useful pharmacological agents. Computational methods are used for the prediction of drug-likeness, which is the identification and elimination of candidate molecules that are unlikely to survive the later stages of discovery and development. Drug-likeness could be predicted by genetic algorithm and neural network based approaches.
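The genetic algorithm and neural network approaches mentioned above are beyond a short example. As a much simpler stand-in for the same idea of filtering out unpromising candidates early, the sketch below applies Lipinski's rule of five, a well-known rule-based drug-likeness filter for orally active compounds. It is a different technique from those named in the text, and the compound names and property values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    name: str
    mol_weight: float   # molecular weight in daltons
    log_p: float        # octanol-water partition coefficient (logP)
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def rule_of_five_violations(c: Compound) -> int:
    """Count how many of Lipinski's four limits the compound exceeds."""
    checks = [
        c.mol_weight > 500,
        c.log_p > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ]
    return sum(checks)


def is_drug_like(c: Compound, max_violations: int = 1) -> bool:
    """Treat a compound as drug-like if it breaks at most one limit."""
    return rule_of_five_violations(c) <= max_violations


# Hypothetical candidates with made-up property values.
library = [
    Compound("candidate_1", 320.4, 2.1, 2, 5),
    Compound("candidate_2", 712.9, 6.3, 7, 12),
]

for compound in library:
    verdict = "keep" if is_drug_like(compound) else "discard"
    print(compound.name, rule_of_five_violations(compound), verdict)
```

A filter of this kind is cheap enough to run over an entire compound library before any more expensive modelling is attempted.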

People have been working on constructing efficient algorithms and better energy functions to predict protein structures and the interaction of small molecules with them. The technical barrier to these approaches is that they are computation intensive and we do not have the computational power to handle such massive requirements. Realising the amount of raw computational power needed for such problems, IBM recently announced a new $100 million exploratory research initiative to build a supercomputer that is 500 times more powerful than the world’s fastest existing computer and 2 million times faster than today’s fastest desktop PC. This new computer, nicknamed ‘Blue Gene’ by IBM researchers, will be capable of performing close to one petaflop (10^15 operations per second).
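Taken at face value, the figures quoted above also fix the approximate speeds of the machines being compared. The short calculation below is only a back-of-the-envelope reading of those figures; the symbol R, standing for operations per second, is introduced here for illustration.

```latex
\begin{align*}
R_{\text{Blue Gene}} &\approx 10^{15}\ \text{operations per second} \\
R_{\text{fastest existing}} &\approx \frac{10^{15}}{500} = 2 \times 10^{12}\ \text{operations per second} \\
R_{\text{desktop}} &\approx \frac{10^{15}}{2 \times 10^{6}} = 5 \times 10^{8}\ \text{operations per second}
\end{align*}
```

In other words, the comparison implies a fastest existing computer of about two teraflops and a desktop PC of about half a gigaflop at the time of the announcement.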

As stated earlier, from the pharmaceutical industry’s point of view, bioinformatics is the key to rational drug design. It reduces the number of trials in the screening of drug compounds and in identifying potential drug targets for a particular disease, using high power computing workstations and software like Insight. This profound application of bioinformatics to genome sequences has led to a new area in pharmacology, pharmacogenomics, which is the study of the genetic basis for the differences between individuals in response to drugs. These differences are mainly due to Single Nucleotide Polymorphisms (SNPs). In order to develop innovative and safe drugs, pharmacogenomics needs to be integrated into the drug development process. Knowing the importance of SNPs, major pharmaceutical companies have formed an international consortium, of which IBM is also a member, to produce a map of human SNPs (which could aid pharmacogenomics). In future, drug design is going to rely on the variation in SNPs. In fact, SNPs together with combinatorial chemistry can speed up the process of drug discovery and may also result in identifying a new set of target proteins that cross-react with drugs in the preliminary clinical trials.


Taking into account all the above-mentioned factors that go into developing effective drugs, there has been a strong urge to start the Human Proteomics Initiative. This initiative aims at identifying the functions and polymorphisms of all the proteins coded in the human genome and at predicting their structures, or solving them where possible, so that these proteins could be used as potential targets for developing drugs.

Need for Integration

Rapid advances in the field of computers, coupled with increasing computer literacy among professionals, favour the implementation of computer applications in medical practice. Further, the availability of numerous databases on the Internet has revolutionised the way in which a physician devises a strategy for treatment. Projects like the Human Proteomics Initiative are classic examples of the necessity of integrating Bioinformatics (to predict structures and functions of proteins), Medical Sciences (to identify proteins that are important in metabolic or other disorders) and Pharmacology, or drug discovery (to identify novel drugs against the predicted targets). Thus, it is apt to conclude that all three areas must work in concert to achieve the ultimate goal of understanding the basis of life processes and applying it for the betterment of human lives.

Fig. 7.1 Drug discovery process

(Source: http://www.vls3d.com/courses_talk/Villoutreix_intro_drug_design.pdf)

7.5 Bioinformatics in Computer-Aided Drug Design

Computer-Aided Drug Design (CADD) is a specialised discipline that uses computational methods to simulate drug-receptor interactions. CADD methods are heavily dependent on bioinformatics tools, applications and databases. As such, there is considerable overlap between CADD research and bioinformatics.

Bioinformatics hub

Bioinformatics can be thought of as a central hub that unites several disciplines and methodologies. On the support side of the hub, Information Technology, Information Management, software applications, databases and computational resources all provide the infrastructure for bioinformatics. On the scientific side of the hub, bioinformatic methods are used extensively in molecular biology, genomics, proteomics, other emerging areas (for example, metabolomics and transcriptomics) and in CADD research.

[Figure: hub diagram with Bioinformatics at the centre, linked to Information Technology/Information Management, Applications/Databases and Computational Resources, and to Molecular Biology, Genomics/Proteomics/x-omics and CADD (Computer Aided Drug Design)]

Fig. 7.2 Bioinformatics hub

(Source: http://www.b-eye-network.com/view/852)

There are several key areas where bioinformatics supports CADD research.

Virtual High-Throughput Screening (vHTS): Pharmaceutical companies are always searching for new leads to develop into drug compounds. One search method is virtual high-throughput screening. In vHTS, protein targets are screened against databases of small-molecule compounds to see which molecules bind strongly to the target. If there is a ‘hit’ with a particular compound, it can be extracted from the database for further testing. With today’s computational resources, several million compounds can be screened in a few days on sufficiently large clustered computers. Pursuing a handful of promising leads for further development can save researchers considerable time and expense. ZINC is a good example of a vHTS compound library.

Sequence Analysis: In CADD research, one often knows the genetic sequence of multiple organisms or the amino acid sequence of proteins from several species. It is very useful to determine how similar or dissimilar the organisms are based on gene or protein sequences. With this information one can infer the evolutionary relationships of the organisms, search for similar sequences in bioinformatic databases and find related species to those under investigation. There are many bioinformatic sequence analysis tools that can be used to determine the level of sequence similarity.

Homology Modeling: Another common challenge in CADD research is determining the 3-D structure of proteins. Most drug targets are proteins, so it is important to know their 3-D structure in detail. It is estimated that the human body has 500,000 to 1 million proteins. However, the 3-D structure is known for only a small fraction of these. Homology modeling is one method used to predict 3-D structure. In homology modeling, the amino acid sequence of a specific protein (target) is known, and the 3-D structures of proteins related to the target (templates) are known. Bioinformatics software tools are then used to predict the 3-D structure of the target based on the known 3-D structures of the templates. MODELLER is a well-known tool in homology modeling, and the SWISS-MODEL Repository is a database of protein structures created with homology modeling.

Similarity Searches: A common activity in biopharmaceutical companies is the search for drug analogues. Starting with a promising drug molecule, one can search for chemical compounds with similar structure or properties to a known compound. There are a variety of methods used in these searches, including sequence similarity, 2D and 3D shape similarity, substructure similarity, electrostatic similarity and others. A variety of bioinformatic tools and search engines are available for this work.
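As a minimal sketch of a 2D similarity search, assume each compound has already been reduced to a set of fingerprint bit positions. The text does not name a particular similarity measure; the Tanimoto coefficient used here is one widely used choice, and the fingerprints and compound names are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of fingerprint bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)


def rank_analogues(query_fp, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each integer is the index of a bit that is set.
query = {1, 4, 7, 9, 12}
library = {
    "analogue_A": {1, 4, 7, 9, 13},
    "analogue_B": {2, 5, 8},
    "analogue_C": {1, 4, 7, 9, 12},
}

for name, score in rank_analogues(query, library):
    print(f"{name}: {score:.2f}")
```

A score of 1.0 means the two fingerprints are identical and 0.0 means they share no bits; real tools compute the fingerprints from the chemical structure, which this sketch does not attempt.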

Drug Lead Optimisation: When a promising lead candidate has been found in a drug discovery program, the next step (a very long and expensive step) is to optimise the structure and properties of the potential drug. This usually involves a series of modifications to the primary structure (scaffold) and secondary structure (moieties) of the compound. This process can be enhanced using software tools that explore related compounds (bioisosteres) to the lead candidate. OpenEye’s WABE is one such tool. Lead optimisation tools such as WABE offer a rational approach to drug design that can reduce the time and expense of searching for related compounds.

Physicochemical Modeling: Drug-receptor interactions occur on atomic scales. To form a deep understanding of how and why drug compounds bind to protein targets, we must consider the biochemical and biophysical properties of both the drug itself and its target at an atomic level. Swiss-PDB is an excellent tool for doing this. Swiss-PDB can predict key physicochemical properties, such as hydrophobicity and polarity, that have a profound influence on how drugs bind to proteins.

Drug Bioavailability and Bioactivity: Most drug candidates fail in Phase III clinical trials after many years of research and millions of dollars have been spent on them, and most fail because of toxicity or problems with metabolism. The key characteristics for drugs are Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) and efficacy, in other words bioavailability and bioactivity. Although these properties are usually measured in the lab, they can also be predicted in advance with bioinformatics software.
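To illustrate what a simple in-advance prediction can look like, the sketch below applies Lipinski’s rule of five, a well-known heuristic for oral bioavailability. The rule is not named in the text and is used here only as a stand-in; the descriptor values are assumed to have been computed elsewhere, and the example molecule is invented.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties of a candidate molecule (illustrative only)."""
    mol_weight: float       # molecular weight in daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d):
    """Count how many of Lipinski's four criteria the molecule breaks."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d):
    """Common reading of the rule: at most one violation is tolerated."""
    return rule_of_five_violations(d) <= 1


# Hypothetical candidate, not a real compound.
candidate = Descriptors(mol_weight=350.4, log_p=2.1,
                        h_bond_donors=2, h_bond_acceptors=5)
print(rule_of_five_violations(candidate), passes_rule_of_five(candidate))
```

A filter like this only flags likely absorption problems; it says nothing about toxicity or efficacy, which need separate models or laboratory measurement.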

Benefits of CADD

CADD methods and bioinformatics tools offer significant benefits for drug discovery programs.

Cost Savings: The Tufts Report suggests that the cost of drug discovery and development has reached $800 million for each drug successfully brought to market. Many biopharmaceutical companies now use computational methods and bioinformatics tools to reduce this cost burden. Virtual screening, lead optimisation and predictions of bioavailability and bioactivity can help guide experimental research. Only the most promising experimental lines of inquiry need be followed, and experimental dead-ends can be avoided early based on the results of CADD simulations.

Time-to-Market: The predictive power of CADD can help drug research programs to choose only the most promising drug candidates. By focusing drug research on specific lead candidates and avoiding potential ‘dead-end’ compounds, biopharmaceutical companies can get drugs to market more quickly.

Insight: One of the non-quantifiable benefits of CADD and the use of bioinformatics tools is the deep insight that researchers acquire about drug-receptor interactions. Molecular models of drug compounds can reveal intricate, atomic-scale binding properties that are difficult to envision in any other way. When we show researchers new molecular models of their putative drug compounds, their protein targets and how the two bind together, they often come up with new ideas on how to modify the drug compounds for improved fit. This is an intangible benefit that can help design research programs.

CADD and bioinformatics together are a powerful combination in drug research and development. An important challenge for us going forward is finding skilled, experienced people to manage all the bioinformatics tools available to us.

7.6 Bioinformatics Tools

The process of designing a new drug using bioinformatics tools has opened a new area of research. Computational techniques assist in searching for drug targets and in designing drugs in silico, but the process still takes considerable time and money. To design a new drug, one needs to follow the path described below.

Identify target disease

One needs to know all about the disease and its existing or traditional remedies. It is also important to look at very similar afflictions and their known treatments. Target identification alone is not sufficient to achieve a successful treatment of a disease. A real drug needs to be developed. This drug must influence the target protein in such a way that it does not interfere with normal metabolism. Bioinformatics methods have been developed to virtually screen the target for compounds that bind and inhibit the protein.

Study interesting compounds

One needs to identify and study the lead compounds that have some activity against a disease. These may be only marginally useful and may have severe side effects. These compounds provide a starting point for refinement of the chemical structures.

Detect the molecular basis of disease

If it is known that a drug must bind to a particular spot on a particular protein or nucleotide, then a drug can be tailor-made to bind at that site. This is often modeled computationally using any of several different techniques. Traditionally, the choice of which compounds to test computationally was guided by the researchers’ understanding of molecular interactions. A second method is the brute-force testing of large numbers of compounds from a database of available structures.

Rational drug design techniques

These techniques attempt to reproduce the researchers’ understanding of how to choose likely compounds by building it into a software package that is capable of modeling a very large number of compounds in an automated way. Many different algorithms have been used for this type of testing, many of which were adapted from artificial intelligence applications. The complexity of biological systems makes it very difficult to determine the structures of large biomolecules. Ideally, an experimentally determined (X-ray or NMR) structure is desired, but biomolecules are very difficult to crystallise.

Refinement of compounds

Once a number of lead compounds have been found, computational and laboratory techniques have been very successful in refining the molecular structures to give greater drug activity and fewer side effects. This is done both in the laboratory and computationally, by examining the molecular structures to determine which aspects are responsible for both the drug activity and the side effects.

Quantitative Structure Activity Relationships (QSAR)

Computational techniques should be used to detect the functional groups in the compound in order to refine the drug. QSAR consists of computing every possible number that can describe a molecule, then doing an enormous curve fit to find out which aspects of the molecule correlate well with the drug activity or side-effect severity. This information can then be used to suggest new chemical modifications for synthesis and testing.
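In its simplest form, the curve fit in QSAR is an ordinary least-squares line relating one molecular descriptor to measured activity. The sketch below shows that single-descriptor case only; the descriptor and activity values are invented for illustration, and real QSAR studies fit many descriptors at once.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for activity = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("descriptor values are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in activity explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: one descriptor value and one activity value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept = fit_line(descriptor, activity)
print(f"slope={slope:.2f} intercept={intercept:.2f} "
      f"R^2={r_squared(descriptor, activity, slope, intercept):.3f}")
```

A descriptor whose fit gives an R² close to 1 correlates well with activity and is a candidate for guiding the next round of chemical modifications; one close to 0 can be set aside.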

Solubility of molecule

One needs to check whether the molecule is water soluble or readily soluble in fatty tissue, since this affects which part of the body it becomes concentrated in. The ability to get a drug to the correct part of the body is an important factor in its potency. Ideally, there is a continual exchange of information between the researchers doing QSAR studies, synthesis and testing.

These techniques are frequently used and often very successful since they do not rely on knowing the biological basis of the disease, which can be very difficult to determine.

Drug testing

Once a drug has been shown to be effective by an initial assay technique, much more testing must be done before it can be given to human patients. Animal testing is the primary type of testing at this stage. Eventually, the compounds that are deemed suitable at this stage are sent on to clinical trials. In the clinical trials, additional side effects may be found and human dosages are determined.

Summary

• Clustering algorithms are used to organise expression data into different biologically relevant clusters.
• Drug discovery is the process of discovering and designing drugs, which includes target identification, target validation, lead identification, lead optimisation and introduction of the new drugs to the public.
• Bioinformatics deals with the exponential growth in biological data, which led to the development of primary and secondary databases of nucleic acid sequences, protein sequences and structures.
• Bioinformatics plays an important role in the integration of broad disciplines of biology to understand the complex mechanisms of the cell.
• The complete process, from data collection to analysis of the results of such tests, may be categorised under a separate area named ‘Clinical Informatics’.
• EMR also drastically reduces the possibility of introducing errors due to frustration and other psychological disturbances during the manual data entry process after collecting the necessary information on paper.
• Bioinformatics has a profound impact in medical sciences. The biological databases are helping physicians to diagnose the disease and develop strategies for its therapy.
• Computational methods are used for the prediction of ‘drug-likeness’, which is the identification and elimination of candidate molecules that are unlikely to survive the later stages of discovery and development.
• Pharmacogenomics is the study of the genetic basis for the differences between individuals in response to drugs.
• Computer-Aided Drug Design (CADD) is a specialised discipline that uses computational methods to simulate drug-receptor interactions.
• In CADD research, one often knows the genetic sequence of multiple organisms or the amino acid sequence of proteins from several species.
• MODELLER is a well-known tool in homology modeling, and the SWISS-MODEL Repository is a database of protein structures created with homology modeling.
• QSAR consists of computing every possible number that can describe a molecule, then doing an enormous curve fit to find out which aspects of the molecule correlate well with the drug activity or side-effect severity.

Recommended Reading

• Larson, S., 2005. Bioinformatics and Drug Discovery, Humana Press, p.444.
• Borlak, 2005. Handbook of toxicogenomics: Strategies and applications, Wiley-VCH.
• Barnes, R., 2007. Bioinformatics for geneticists: A bioinformatics primer for the analysis of genetic data, John Wiley & Sons.

Self Assessment

1. The focus of bioinformatics in the drug discovery process has shifted from target identification to ________.
a. target validation
b. target evaluation
c. target prediction
d. target mapping

2. What helps to prioritise the families in terms of their 3D structures?
a. Bioinformatics
b. Pharmacogenomics
c. Structural genomics
d. Proteomics

3. Which of these is not a well-known database?
a. GenBank
b. SWISS-PROT
c. PDB
d. EMR

4. _________ eventually develops into a database for reference and analysis.
a. GenBank
b. PIR
c. PDB
d. EMR

5. The biological databases are helping physicians to _____ the disease and develop strategies for its therapy.
a. diagnose
b. treat
c. predict
d. clinically evaluate

6. Drug-likeness could be predicted by _________ and neural network based approaches.
a. computational method
b. genetic algorithm
c. drug discovery
d. data capturing

7. Which of the following statements is true?
a. The full form of OMIM is Online Morey Inheritance of Man.
b. The full form of PIR is Protein Information Resource.
c. The full form of PDB is Proteome Data Bank.
d. The full form of EMR is Electronic Medical Resource.

8. Which of the following statements is false?
a. In the near future one can see a situation where the complete information on the patient can be accessed from the EMR.
b. From the pharmaceutical industry point of view, bioinformatics is the key to rational drug design.
c. Proteomics is the study of genetic basis for the differences between individuals in response to drugs.
d. Computer-Aided Drug Design (CADD) is a specialised discipline that uses computational methods to simulate drug-receptor interactions.

9. Which of the following statements is false?
a. Bioinformatic methods are used extensively in molecular biology, genomics, and proteomics and in CADD research.
b. IT, Information management, software applications, databases and computational resources all provide the infrastructure for bioinformatics.
c. Pharmaceutical companies are always searching for new leads to develop into drug compounds.
d. In sequence analysis, protein targets are screened against databases of small-molecule compounds to see which molecules bind strongly to the target.

10. Which is a well-known tool in homology modeling?
a. MODELLER
b. SWISS-MODEL
c. OpenEye’s WABE
d. ZINC

Chapter VIII

Human Genome Project

Aim

The aim of this chapter is to:

• define genome
• list the characteristics of the HGP
• describe the project goals of the HGP

Objectives

The objectives of this chapter are to:

• explain the genomes sequenced in the public (HGP) and private projects
• describe the funding organisations for human genome sequencing
• describe DNA sequencing

Learning outcome

At the end of this chapter, you will be able to:

• compare draft sequence and finished sequence
• understand bioinformatic analysis
• know the features of BLAST

8.1 Introduction

Bioinformatics is the branch of biology concerned with the acquisition, storage and analysis of the information found in nucleic acid and protein sequence data. Computers and bioinformatics software are the tools of the trade.

When the Human Genome Project began in 1990, it was understood that to meet the project’s goals, the speed of DNA sequencing would have to increase and the cost would have to come down. Over the life of the project, virtually every aspect of DNA sequencing was improved. It took the project approximately four years to sequence its first one billion bases but just four months to sequence the second billion bases.

During the month of January 2003, 1.5 billion bases were sequenced. As the speed of DNA sequencing increased, the cost decreased from 10 dollars per base in 1990 to 10 cents per base at the conclusion of the project in April 2003. Although the Human Genome Project is officially over, improvements in DNA sequencing continue to be made. Researchers are experimenting with new methods for sequencing DNA that have the potential to sequence a human genome in just a matter of weeks for a few thousand dollars.

DNA sequencing performed on an industrial scale has produced a vast amount of data to analyse. In August 2005, it was announced that the three largest public collections of DNA and RNA sequences together stored one hundred billion bases, representing over 165,000 different organisms. As sequence data began to pile up, the need for new and better methods of sequence analysis became critical.

Genetic data represent a treasure trove for researchers and companies interested in how genes contribute to our health and wellbeing. Almost half of the genes identified by the Human Genome Project have no known function. Researchers are using bioinformatics to identify genes, establish their functions, and develop gene-based strategies for preventing, diagnosing, and treating disease.

8.2 Human Genome Project

Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but rapid technological advances accelerated the completion date to 2003. The project goals were to:

• identify all the approximately 20,000-25,000 genes in human DNA
• determine the sequences of the 3 billion chemical base pairs that make up human DNA
• store this information in databases
• improve tools for data analysis
• transfer related technologies to the private sector and
• address the ethical, legal, and social issues (ELSI) that may arise from the project.

DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks called bases (abbreviated as A, T, C, and G) that make up the DNA of the 24 different human chromosomes, was the greatest technical challenge in the Human Genome Project. Achieving this goal has helped reveal the estimated 20,000-25,000 human genes within our DNA as well as their controlling regions. The resulting DNA sequence maps are being used by the scientists to explore human biology and other complex phenomena. To meet the Human Genome Project sequencing goals by 2003 required continual improvements in sequencing speed, reliability and costs.

8.3 Genome Sequenced in the Public (HGP) and Private Projects

The human genome reference sequences do not represent any one person’s genome. Rather, they serve as a starting point for broad comparisons across humanity. The knowledge obtained from the sequences applies to everyone because all humans share the same basic set of genes and genomic regulatory regions that control the development and maintenance of their biological structures and processes.

In the international public-sector Human Genome Project (HGP), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few samples were processed as DNA resources. Thus, donors’ identities were protected so neither they nor scientists could know whose DNA was sequenced. DNA clones from many libraries were used in the overall project.

In the Celera Genomics private-sector project, DNA from a few different genomes was mixed and processed for sequencing. DNA for these studies came from anonymous donors of European, African, American (North, Central, South), and Asian ancestry. The lead scientist of Celera Genomics at that time, Craig Venter, has since acknowledged that his DNA was among those sequenced.

8.4 Funding for Human Genome Sequencing

Human Genome Project research was funded at many laboratories across the U.S. by the Department of Energy (DOE), the National Institutes of Health (NIH), or both. Other researchers at numerous colleges, universities, and laboratories throughout the U.S. also received DOE and NIH funding for human genome research. At any given time, the DOE Human Genome Project funded about 100 principal investigators. Also, many large and small private U.S. companies are conducting genome research. At least 18 other countries have participated in the Human Genome Project.

8.5 DNA Sequencing

A DNA sequencing reaction produces a sequence that is several hundred bases long. Gene sequences typically run for thousands of bases. To study genes, scientists first assemble long DNA sequences from series of shorter overlapping sequences.

• Chromosomes ranging in size from 50 million to 250 million bases must first be broken into much shorter pieces.
• Each short piece is used as a template to generate a set of fragments that differ in length from each other by a single base that will be identified in a later step.
• The fragments in a set are separated by a technique called gel electrophoresis. New fluorescent dyes allow separation of all four fragments in a single lane on the gel.
• The final base at the end of each fragment is identified. This process recreates the original sequence of As, Ts, Cs and Gs for each short piece generated in the first step.
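The last two steps of this procedure can be sketched in a few lines: if every fragment is recorded as its length and its final base, sorting by length (the job done by the gel) and reading the final bases in order recovers the sequence. This is a deliberately simplified model that ignores strand complementarity and read errors; the function names and the short piece are hypothetical.

```python
import random


def make_fragments(sequence):
    """Model one fragment per position as (fragment length, final base)."""
    return [(i + 1, base) for i, base in enumerate(sequence)]


def read_sequence(fragments):
    """Sort fragments by length, as the gel does, then read each final base."""
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(base for _, base in ordered)


piece = "GATTACAGGC"            # hypothetical short piece
fragments = make_fragments(piece)
random.shuffle(fragments)       # fragments start out mixed together
print(read_sequence(fragments))
```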

After the bases are ‘read,’ computers are used to assemble the short sequences (in blocks of about 500 bases each, called the read length) into long continuous stretches that are analysed for errors, gene-coding regions, and other characteristics. Finished sequences are submitted to major public sequence databases, such as GenBank. Thus, Human Genome Project sequence data are freely available to anyone around the world.
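The assembly step can be illustrated with a toy greedy algorithm that repeatedly merges the two reads sharing the longest suffix-to-prefix overlap. This is a minimal sketch, not the method used by the Human Genome Project: it assumes error-free reads from a single strand, and the reads below are invented and far shorter than the 500-base read length mentioned above.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge; contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical overlapping reads taken from one short stretch of DNA.
reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
print(greedy_assemble(reads))
```

Greedy merging can be misled by repeated sequences, which is one reason real assemblers use more elaborate strategies and higher coverage.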

Scientists enter their assembled sequences into genetic databases so that other scientists may use the data. Since the sequences of the two DNA strands are complementary, it is only necessary to enter the sequence of one DNA strand into a database. By selecting an appropriate computer program, scientists can use sequence data to look for genes, get clues to gene functions, examine genetic variation, and explore evolutionary relationships. New bioinformatics software is being developed while existing software is continually updated.
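Because A pairs with T and C pairs with G, the second strand can always be derived from the first, which is why only one strand needs to be stored. A minimal sketch (the function name is an assumption, and only the four unambiguous bases are handled):

```python
# Translation table mapping each base to its complementary base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```

Applying the function twice returns the original sequence, which is a quick check that no information is lost by storing a single strand.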

Difference between draft sequence and finished sequence

In generating the draft sequence (released in June 2000), scientists determined the order of base pairs in each chromosomal area at least 4 to 5 times to ensure data accuracy and to help with reassembling DNA fragments in their original order. This repeated sequencing is known as genome ‘depth of coverage.’ Draft sequence data are mostly in the form of 10,000 base pair-sized fragments whose approximate chromosomal locations are known.

To generate a high-quality reference sequence (completed in April 2003), additional sequencing was done to close gaps and allow for only a single error every 10,000 bases, the agreed-upon standard for the HGP. Investigators believe a high-quality sequence is critical for recognising gene-regulatory components important in understanding human biology and disorders such as heart disease, cancer, and diabetes. The finished version provides an estimated 8x to 9x coverage of each chromosome.
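The figures quoted for coverage and accuracy can be related by simple arithmetic. The sketch below uses the idealised relation coverage = (number of reads × read length) / genome size, which assumes reads are spread evenly over the genome; the 500-base read length and 3-billion-base genome size are the round figures used in this chapter, and the function names are illustrative.

```python
import math


def depth_of_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def expected_errors(genome_size, bases_per_error):
    """Errors expected at a rate of one error per `bases_per_error` bases."""
    return genome_size / bases_per_error


GENOME_SIZE = 3_000_000_000   # approximate number of bases in human DNA
READ_LENGTH = 500             # approximate read length

print(reads_needed(8, READ_LENGTH, GENOME_SIZE))   # reads for 8x coverage
print(expected_errors(GENOME_SIZE, 10_000))        # errors at the HGP standard
```

Under these assumptions, 8x coverage corresponds to 48 million reads, and an error rate of one in 10,000 still leaves about 300,000 expected errors across the whole genome.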

Completely sequenced genomes

The small genomes of several viruses and bacteria and the much larger genomes of three higher organisms have been completely sequenced; they are bakers’ or brewers’ yeast, the roundworm and the fruit fly. In October 2001, the draft sequence of the pufferfish, the first vertebrate after the human, was completed; and scientists finished the first genetic sequence of a plant, that of the weed Arabidopsis sp., in December 2000. Many more genome sequences have been completed since then.

8.6 Bioinformatic Analysis: Finding Functions

One of the most important aspects of bioinformatics is identifying genes within a long DNA sequence. Once a nucleic acid or amino acid sequence has been assembled, bioinformatic analysis can be used to determine if the sequence is similar to that of a known gene. This is where sequences from model organisms are helpful. Suppose a bioinformatic analysis finds a similar sequence from mouse that is associated with a gene that codes for a membrane protein that regulates salt balance. It is then a good bet that the human sequence also is part of a gene that codes for a membrane protein that regulates salt balance. Determining the similarity of two sequences is not as easy as you might think. For example, it was recently reported that the genomes of humans and chimpanzees are 96% similar. Consider the following two sequences:

Each sequence consists of 20 bases. There is just one base difference between them. Because the two sequences match at 19 out of 20 bases, we can say that the two sequences are 95% similar. Now, consider the following two DNA sequences:

Now, 16 out of 20 bases match. We can say that the two sequences are 80% the same. Careful inspection, however, reveals another sort of similarity between Sequences 3 and 4. If we align the sequences as below, it is seen that the two sequences differ by just a missing base in Sequence 4 (or an added base to Sequence 3).

The deletion (or insertion) of a single base is not equal to four base substitutions as suggested in this example. While comparing sequences, we must be concerned not only with the quantity of the differences but the type as well.
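The difference between counting position-by-position matches and aligning with gaps can be sketched with a small global alignment routine. The code below is a basic Needleman-Wunsch dynamic programme; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, and the two sequences are hypothetical stand-ins in which the second is the first with one base deleted.

```python
def ungapped_matches(a, b):
    """Count matching bases when the sequences are compared position by position."""
    return sum(1 for x, y in zip(a, b) if x == y)


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences: the second is the first with the base at index 5 removed.
seq_long = "ATGCCGTAGCTAGGCTTAAC"
seq_short = seq_long[:5] + seq_long[6:]

print(ungapped_matches(seq_long, seq_short), "matches without gaps")
total, top, bottom = needleman_wunsch(seq_long, seq_short)
print(top)
print(bottom)
```

Without gaps, everything after the deletion is shifted by one position and most bases appear to mismatch; the aligned output places a single "-" at the deleted position and every other base lines up.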

Scientists have written computer programs that can be used to see if a particular DNA sequence is similar to any others that are stored in a sequence database. One of the most popular programs is called BLAST (Basic Local Alignment Search Tool). Using this program is somewhat like using a search engine on the Internet. The user provides the program with a biological sequence (when using BLAST) or a subject (when using a search engine). In each case, the program compares the input information to the information found in the database. The results are given with the most closely matching items (or sequences) listed first, followed by items (or sequences) that match less well.
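The search-engine analogy can be made concrete with a toy database search that ranks sequences by how many short words (k-mers) they share with the query. This is only an illustration of ranked sequence search and is not the BLAST algorithm, which extends word matches into local alignments and scores them statistically; the database entries and names are invented.

```python
def kmers(seq, k):
    """Set of all overlapping words of length k in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search(query, database, k=4):
    """Rank database entries by the number of k-mers shared with the query."""
    query_words = kmers(query, k)
    hits = []
    for name, subject in database.items():
        shared = len(query_words & kmers(subject, k))
        if shared:
            hits.append((name, shared))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical miniature sequence database.
database = {
    "seq_A": "TTGACCGTAGCTAGGA",
    "seq_B": "CCCCCCCCCCCC",
    "seq_C": "GACCGTAG",
}

for name, shared in search("GACCGTAGCTAG", database):
    print(name, shared)
```

The best-matching entry is listed first, and entries with no shared words are left out of the results altogether.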

Consider an example of a BLAST search. The input sequence that is being compared to others in the database is called the query sequence. In our example, the query is the short human DNA sequence listed below.

Once the query sequence is submitted, the BLAST program compares it, one-at-a-time, to every sequence in its database. Typically, the search results are displayed so that the query sequence is shown at the top and the matching sequences are listed below it. The listed sequence ‘hits’ also may include links to relevant bibliographic information. The results from this search are shown below.

Examining variation

The BLAST program compares a single input sequence, one at a time, to others in a sequence database. The results can provide clues as to the identity and function of the input sequence. Sometimes you may want to compare a number of different sequences, all at the same time, to see where they are alike and where they are different. The CLUSTAL program was developed to produce such multiple alignments. CLUSTAL gets its name because it deals with clusters of sequences.

CLUSTAL alignments are sometimes used by scientists examining genetic variation within a population. For example, once a gene has been associated with a disease, scientists can use CLUSTAL to examine how the gene sequence varies among people with and without the disease. The example below shows a CLUSTAL alignment of DNA sequences from a portion of the gene associated with cystic fibrosis. The person affected by the disease is seen to be missing a three-base DNA sequence.

Multiple sequence alignments are also useful to scientists investigating the evolutionary relationships among species. For example, the CLUSTAL program can be used to align a series of related sequences from different species. Once the program has produced the best alignment for the sequences, another program can calculate the evolutionary relationships between them. These data can be used to construct a tree diagram showing the evolutionary relationships for that sequence among the various species.

8.7 Insights Learned from the Human DNA Sequence

The human genome contains 3.2 billion chemical nucleotide base pairs (A, C, T, and G). The average gene consists of 3,000 base pairs, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million base pairs. The total number of genes is estimated at 25,000, much lower than previous estimates of 80,000 to 140,000 that had been based on extrapolations from gene-rich areas as opposed to a composite of gene-rich and gene-poor areas.

The human genome sequence is almost exactly the same (99.9%) in all people. Functions are unknown for more than 50% of discovered genes. About 2% of the genome encodes instructions for the synthesis of proteins. Repeat sequences that do not code for proteins make up at least 50% of the human genome.

Repeat sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. Over time, these repeats reshape the genome by rearranging it, thereby creating entirely new genes or modifying and reshuffling existing genes. During the past 50 million years, a dramatic decrease seems to have occurred in the rate of accumulation of repeats in the human genome.

The human genome’s gene-dense ‘urban centers’ are predominantly composed of the DNA building blocks G and C. In contrast, the gene-poor ‘deserts’ are rich in the DNA building blocks A and T. GC- and AT-rich regions usually can be seen through a microscope as light and dark bands on chromosomes.
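The contrast between GC-rich and AT-rich regions can be measured directly from sequence data. A minimal sketch follows, assuming a plain string of bases; the function names, the window size, and the toy sequence are illustrative choices, not part of any standard tool.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window):
    """GC fraction for each consecutive, non-overlapping window of seq."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, window)]


# Toy sequence: a GC-rich stretch followed by an AT-rich stretch.
profile = windowed_gc("GGCCGCGCATATTAAT", 8)
print(profile)
```

Scanning a chromosome in windows like this gives a profile in which high values would correspond to the gene-dense regions described above and low values to the gene-poor ones.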

Genes appear to be concentrated in random areas along the genome, with vast expanses of non-coding DNA between. Particular gene sequences have been associated with numerous diseases and disorders, including breast cancer, muscle disease, deafness, and blindness. Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the ‘junk DNA.’

Comparison of human genome with other organisms

• Unlike the human’s seemingly random distribution of gene-rich areas, many other organisms’ genomes are more uniform, with genes evenly spaced throughout.
• Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript ‘alternative splicing’ and chemical modifications to the proteins. This process can yield different protein products from the same gene.
• Humans share most of the same protein families with worms, flies, and plants, but the number of gene family members has expanded in humans, especially in proteins involved in development and immunity.
• The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%).
• Over 40% of predicted human proteins share similarity with fruit-fly or worm proteins.
• Although humans appear to have stopped accumulating repeated DNA over 50 million years ago, there seems to be no such decline in rodents. This may account for some of the fundamental differences between hominids and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes, inbreeding, and genetic drift.
• Variations and mutations: Scientists have identified millions of locations where single-base DNA differences occur in humans. This information promises to revolutionise the processes of finding DNA sequences associated with some common diseases.

8.8 Future Challenges

The working-draft DNA sequence and the more polished 2003 version represent an enormous achievement, akin in scientific importance, some say, to developing the periodic table of elements. And, as in most major scientific advances, much work remains to realise the full potential of the accomplishment.

Early explorations of the human genome, now joined by projects on the genomes of several other organisms, are generating data whose volume and complex analyses are unprecedented in biology. Genomic-scale technologies will be needed to study and compare entire genomes, sets of expressed RNAs or proteins, gene families from a large number of species, variation among individuals, and the classes of gene regulatory elements.

Deriving meaningful knowledge from DNA sequences will define biological research through the coming decades and require the expertise and creativity of teams of biologists, chemists, engineers, and computational scientists, among others. A sampling follows of some research challenges in genetics: what we still don’t know, even with the full human DNA sequence in hand.

The draft sequence is already having an impact on finding genes associated with disease. One of the greatest impacts of having the sequence may be in enabling an entirely new approach to biological research. In the past, researchers studied one or a few genes at a time. With whole-genome sequences and new high-throughput technologies, they can approach questions systematically and on a grand scale. They can study all the genes in a genome, for example, or all the transcripts in a particular tissue or organ or tumour, or how tens of thousands of genes and proteins work together in interconnected networks to orchestrate the chemistry of life.

Post-sequencing projects are well under way worldwide. These explorations will result in a profound, new, and more comprehensive understanding of complex living systems, with applications to agriculture, human health, energy, global climate change, and environmental remediation, among others.

The checklist for future research includes:

• Gene number, exact locations, and functions
• Gene regulation
• DNA sequence organisation
• Chromosomal structure and organisation
• Non-coding DNA types, amount, distribution, information content, and functions
• Coordination of gene expression, protein synthesis, and post-translational events
• Interaction of proteins in complex molecular machines
• Predicted vs experimentally determined gene function
• Evolutionary conservation among organisms
• Protein conservation (structure and function)
• Proteomes (total protein content and function) in organisms
• Correlation of SNPs (single-base DNA variations among individuals) with health and disease
• Disease-susceptibility prediction based on gene sequence variation
• Genes involved in complex traits and multigene diseases
• Complex systems biology, including microbial consortia useful for environmental restoration
• Developmental genetics, genomics

Summary

• Bioinformatics is the branch of biology concerned with the acquisition, storage and analysis of the information found in nucleic acid and protein sequence data.
• Researchers are using bioinformatics to identify genes, establish their functions, and develop gene-based strategies for preventing, diagnosing, and treating disease.
• Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health.
• DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks called bases (abbreviated as A, T, C, and G) that make up the DNA of the 24 different human chromosomes, was the greatest technical challenge in the Human Genome Project.
• In the Celera Genomics private-sector project, DNA from a few different genomes was mixed and processed for sequencing.
• Human Genome Project research was funded at many laboratories across the U.S. by the Department of Energy (DOE), the National Institutes of Health (NIH), or both.
• A DNA sequencing reaction produces a sequence that is several hundred bases long.
• Gene sequences typically run for thousands of bases.
• To study genes, scientists first assemble long DNA sequences from series of shorter overlapping sequences.
• Since the sequences of the two DNA strands are complementary, it is only necessary to enter the sequence of one DNA strand into a database.
• By selecting an appropriate computer program, scientists can use sequence data to look for genes, get clues to gene functions, examine genetic variation, and explore evolutionary relationships.
• Once a nucleic acid or amino acid sequence has been assembled, bioinformatic analysis can be used to determine if the sequence is similar to that of a known gene.
• Scientists have written computer programs that can be used to see if a particular DNA sequence is similar to any others that are stored in a sequence database. One of the most popular programs is called BLAST (Basic Local Alignment Search Tool).
• The BLAST program compares a single input sequence, one at a time, to others in a sequence database.
• CLUSTAL alignments are sometimes used by scientists examining genetic variation within a population.
• Multiple sequence alignments are also useful to scientists investigating the evolutionary relationships among species.
• The human genome contains 3.2 billion chemical nucleotide base pairs (A, C, T, and G).
• The human genome sequence is almost exactly the same (99.9%) in all people.
• Repeat sequences that do not code for proteins make up at least 50% of the human genome.
• The human genome’s gene-dense ‘urban centers’ are predominantly composed of the DNA building blocks G and C.
• Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the ‘junk DNA.’

References

• Toriello, J., 2003. The Human Genome Project, The Rosen Publishing Group.
• Cooper, G., 1994. The Human Genome Project: deciphering the blueprint of heredity, University Science Books.
• NHGRI, Bioinformatics: Examining Variation [Online] Available at: <http://www.genome.gov/25020003> [Accessed 28 February 2012].
• Insights Learned from the Human DNA Sequence [Online] Available at: <http://www.ornl.gov/sci/techresources/Human_Genome/project/journals/insights.shtml> [Accessed 28 February 2012].

• bimaticsblog, 2009. Human genome sequencing-Animated tutorial [Video Online] Available at: <http://www.youtube.com/watch?v=-gVh3z6MwdU> [Accessed 28 February 2012].
• norman466, 2011. Biology Project: Bioinformatics [Video Online] Available at: <http://www.youtube.com/watch?v=saW1oEbboUM> [Accessed 28 February 2012].

Recommended Reading

• Ramsden, J., 2009. Bioinformatics: An introduction, 2nd ed., Springer.
• Polański, A. & Kimmel, M., 2007. Bioinformatics, Springer.
• Boon, K. The human genome project: What does decoding DNA mean for us?, Enslow Publishers.

Self Assessment

1. DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks called _________.
a. bases
b. nucleotides
c. arrays
d. amino acids

2. A DNA sequencing reaction produces a sequence that is several ______ bases long.
a. hundred
b. thousand
c. million
d. billion

3. What are mostly in the form of 10,000 base pair-sized fragments whose approximate chromosomal locations are known?
a. High-quality reference sequence data
b. Finished sequence data
c. Draft sequence data
d. Analytical data

4. One of the most important aspects of bioinformatics is ________.
a. identifying genes within a long DNA sequence
b. finding similar sequence
c. assembling sequence
d. sequencing DNA

5. Which of the following statements is false?
a. Once a nucleic acid or amino acid sequence has been assembled, bioinformatic analysis can be used to determine if the sequence is similar to that of a known gene.
b. Scientists have written computer programs that can be used to see if a particular DNA sequence is similar to any others that are stored in a sequence database.
c. Once the query sequence is submitted, the BLAST program compares it, one-at-a-time, to every sequence in its database.
d. The CLUSTAL program compares a single input sequence, one at a time, to others in a sequence database.

6. Which of the following statements is false?
a. CLUSTAL alignments are sometimes used by scientists examining genetic variation within a population.
b. The BLAST program was developed to produce such multiple alignments.
c. CLUSTAL gets its name because it deals with clusters of sequences.
d. The scientists can use CLUSTAL to examine how the gene sequence varies among people with and without the disease.

7. The human genome contains 3.2 billion chemical __________.
a. nucleotide base pairs
b. amino acids
c. nucleotides
d. genes

8. About 2% of the genome encodes instructions for the synthesis of _________.
a. proteins
b. amino acids
c. nucleotides
d. DNA

9. Repeat sequences that do not code for proteins make up at least ______ of the human genome.
a. 10%
b. 50%
c. 100%
d. 80%

10. Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the ________.
a. junk DNA
b. urban centers
c. gene-dense
d. rich DNA

Application I

Limitations of Current Multicore Architectures for Bioinformatics Applications – A Case Study on Cell BE

Abstract

The fast growth of biological databases has attracted the attention of computer scientists calling for greater efforts to improve computational performance. From a computer architecture point of view, we intend to investigate how bioinformatics applications can benefit from future multicore processors. Here, we present a preliminary study of the Cell BE limitations when executing a representative bioinformatics application performing multiple sequence alignment (that is, ClustalW). The inherent large parallelism of the core algorithm used makes it ideal for architectures supporting multiple dimensions of parallelism. However, in the case of Cell BE we identified several architectural limitations that need a careful study.

Introduction

Bioinformatics is a rapidly growing field that requires High Performance Computing (HPC) systems in order to cope with the fast increase of biological databases. One of the most important tasks in bioinformatics is the alignment of biological sequences (DNA, proteins, RNA). Popular alignment algorithms like Needleman-Wunsch (NW) use dynamic programming techniques and are in most cases extremely computationally intensive. ClustalW is a widely used application that features NW as its main hot-spot kernel. The inherent multi-dimensional parallelism present in this type of application makes it ideal to be mapped on a multicore platform where both thread-level and data-level parallelism can be exploited. We have used the Cell BE processor as an example of a modern multicore processor.

With this research, we aim at identifying the architectural and micro-architectural limitations that Cell BE exhibits when targeting a representative multiple sequence alignment application such as ClustalW. We present different optimisation and parallelisation strategies and analyse the factors that limit the performance. Recent publications have mapped bioinformatics applications on Cell BE with a focus on software optimisation. Our work aims at identifying limitations of current architectures in order to guide the design of future multicore systems.

ClustalW Implementation on Cell BE

ClustalW performs the multiple alignment of a set of sequences in three main steps:

• all-to-all pairwise alignment
• creation of a phylogenetic tree
• final multiple alignment

Profiling experiments reveal that the core function of the first step (that is, forward_pass) consumes about 70% of the total execution time. This function is called n(n-1)/2 times to calculate a similarity score between two sequences, implementing a modified version of NW. Not only does the independence among calls make parallelisation appealing, but vectorisation of the inner loop is also possible. We ported the forward_pass function to the SPU ISA and implemented a number of optimisations. DMA transfers are used to exchange data between main memory and the SPUs’ LS. Saturated addition and maximum instructions not present in the SPUs were emulated with 9 and 2 SPU instructions respectively. The first optimisation uses 16-bit vector elements instead of 32-bit, allowing a theoretical double throughput but requiring the implementation of an overflow check in software. The inner loop contains random scalar memory accesses and a complex branch for checking boundary conditions. This type of operation is very inefficient in the SPUs. We have unrolled this loop and manually evaluated the boundary conditions outside the inner loop. In the multi-SPU versions, the PPU distributes pairs of sequences for each SPU to be processed independently. Such a version was first implemented using a simple round-robin strategy but the load distribution was not efficient. A second strategy uses a table of flags that SPUs can raise to indicate idleness. This way the PPU can take better decisions on where to allocate the tasks.
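To make the all-to-all step concrete, here is a minimal Python sketch of Needleman-Wunsch scoring applied to every pair of sequences. It is a plain textbook version, not the modified forward_pass kernel of ClustalW: the scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration, and real ClustalW scoring is more elaborate.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b by dynamic programming.

    Only the previous row of the score matrix is kept, since the
    score (not the alignment itself) is all that is needed here.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_pairs_scores(seqs):
    """Score every unordered pair: n(n-1)/2 independent calls."""
    return {(i, j): nw_score(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}


scores = all_pairs_scores(["ACGT", "ACT", "AGT", "CGT"])
print(len(scores))  # 4 sequences give 4*3/2 = 6 pairs
```

Because no call depends on the result of another, the pairs can be handed to different cores in any order, which is the property the parallelisation described above relies on.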

Results and Analysis

Fig. 1 shows a comparison of ClustalW running on various single-core platforms as compared to different versions using a single SPU. Since the clock frequency of the G5 is less than half that of the Cell, it is clear that in terms of cycles it outperforms any Cell 1SPU version. The G5 platform contains a powerful out-of-order superscalar PowerPC970 that runs scalar code very efficiently while the PPU has more limited capabilities (fewer functional units and registers, in-order execution, and so on). The fourth bar shows the straightforward SPU implementation of ClustalW without optimisations. The fifth bar shows a significant speedup (1.7×) when using the 16-bit data type. This doubled vector parallelism is achievable most of the time, but the program should always check for overflow and go back to the 32-bit version if needed. Since the SPUs do not provide support for overflow checking (unlike the PPU), this had to be implemented in software, which consequently affects the performance. The next two bars show results for unrolling a small loop located within the inner loop of the kernel, allowing us to achieve an accumulative 2.6× speedup. The last two versions removed boundary conditions involving a scalar branch and handled them explicitly outside the loop. This final (accumulative) optimisation provided about 4.2× speedup with respect to the initial version.

Fig. 2 shows the scalability of the ClustalW kernel when using multiple SPUs. The black part of the bars reveals a perfect scalability (8× for 8 SPUs). This is due to the relatively low amount of data transferred and the independence between every instance of the kernel. In future experiments, it will be interesting to see how far this perfect scalability will continue.

After the successful reduction of the execution time for forward_pass, significant application speedups are only possible by accelerating other parts of the program. The progressive alignment phase is now the portion consuming most of the time. This issue is currently being studied.
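The reason further gains must come from other parts of the program follows from Amdahl's law. The sketch below works through the arithmetic using two figures quoted in the text (a kernel taking about 70% of the run time and a 4.2× kernel speedup); combining them this way is an illustrative assumption, since the study does not report an overall application speedup, and the function names are hypothetical.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """Overall speedup when `fraction` of the run time is made
    `kernel_speedup` times faster and the rest is left unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def max_speedup(fraction):
    """Upper bound on overall speedup as the kernel time approaches zero."""
    return 1.0 / (1.0 - fraction)


overall = amdahl_speedup(0.70, 4.2)   # roughly 2.14
ceiling = max_speedup(0.70)           # roughly 3.33
print(round(overall, 2), round(ceiling, 2))
```

Under these assumptions, even an infinitely fast kernel could not make the whole application more than about 3.3 times faster, because the remaining 30% of the work is untouched.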

Fig. 1. ClustalW performance for different platforms and optimisations

Fig. 2. ClustalW speedup using multiple SPUs

The following is a list of the limitations we have found in the experiments:

Unaligned data accesses: The lack of hardware support for unaligned data accesses is one of the issues that can limit the performance the most. When the application needs to do unaligned loads or stores, the compiler must introduce extra code that contains additional memory accesses plus instructions for data reorganisation. In ClustalW, this situation appears in critical parts of the code and then performance is affected.

Scalar operations: Given the SIMD-only nature of the SPUs’ ISA and the lack of unaligned access support, scalar instructions may cause performance degradation too. Since there are only vector instructions, scalar operations must be performed employing vectors with only one useful element. Apart from power inefficiency issues, this works well only if the scalars are in the appropriate position within the vector. If not, the compiler has to introduce some extra instructions to make the scalar operands aligned and perform the instruction. This limitation is responsible for a significant efficiency reduction.

Saturated arithmetic: These frequently executed operations are present in Altivec but not in the SPU ISA. They are used to compute partial scores while preventing them from being zeroed when overflow occurs with unsigned addition. This limitation may become expensive depending on the data types. For signed short, 9 additional SPU instructions are required.
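The difference between ordinary (wrapping) and saturated addition on 16-bit signed values can be shown with a small Python model. This is only a behavioural sketch of the arithmetic, not SPU or Altivec code, and the function names are made up for the example.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap_add_i16(a, b):
    """16-bit two's-complement addition: the result wraps on overflow."""
    return ((a + b + 32768) & 0xFFFF) - 32768


def sat_add_i16(a, b):
    """16-bit saturated addition: the result is clamped to the
    representable range instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


# A score near the top of the range plus a small increment:
print(wrap_add_i16(32767, 1))  # wraps to the most negative value
print(sat_add_i16(32767, 1))   # stays at the maximum
```

With wrapping arithmetic a large alignment score can suddenly turn into a very small one, which is why the kernel needs saturation (or an explicit overflow check) when it uses narrow data types.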

Max instruction: One of the most important and frequent operations in both applications is the computation of a maximum between two or more values. Since the SPU ISA does not provide such an instruction, it is necessary to use two SPU instructions.

Overflow flag: This flag is not available in the SPUs and has to be implemented in software, adding overhead.

Branch prediction: The SPUs do not handle branches efficiently and the penalty of a mispredicted branch is about 18 cycles. The kernel of ClustalW has several branches that, when mispredicted, reduce the application execution speed.

Conclusions and future work

We have described the mapping and some optimisation alternatives of a representative bioinformatics application targeting Cell BE. Our study revealed various architectural aspects that negatively impact Cell BE performance for bioinformatics workloads. More precisely, the missing HW support for unaligned memory accesses and the lack of saturating arithmetic instructions appear to be the most critical. We are performing additional experiments in order to have a quantitative measure of all these aspects.

Additionally, we intend to explore solutions to the issues found and validate them with simulations using UNISIM. We are using this research as guidance for the architecture design of future multicore systems incorporating domain-specific accelerators for bioinformatics. We intend to widen our study to other applications of the same field. We believe that heterogeneous multicore architectures able to exploit specialisation and multiple dimensions of parallelism will bring great performance improvements for bioinformatics workloads.

(Source: Isaza, S. & Gaydadjiev, G., Limitations of Current Multicore Architectures for Bioinformatics Applications – A Case Study on Cell BE, Computer Engineering Lab, Delft University of Technology, The Netherlands [pdf] Available at: http://ce.et.tudelft.nl/publicationfiles/1783_742_acaces_abstract_word.pdf)

Questions

1. Enumerate the Cell BE limitations when executing a representative bioinformatics application performing multiple sequence alignment.
Answer: The limitations of Cell BE include:
• Unaligned data accesses
• Scalar operations
• Saturated arithmetic
• Max instruction
• Overflow flag
• Branch prediction

2. What was the conclusion drawn in the preliminary study of the Cell BE?
Answer: We have described the mapping and some optimisation alternatives of a representative bioinformatics application targeting Cell BE. Our study revealed various architectural aspects that negatively impact Cell BE performance for bioinformatics workloads. More precisely, the missing HW support for unaligned memory accesses and the lack of saturating arithmetic instructions appear to be the most critical. We are performing additional experiments in order to have a quantitative measure of all these aspects.

3. What are the future works that need to be carried out to improve the performance for bioinformatics workloads?
Answer: Additionally, we intend to explore solutions to the issues found and validate them with simulations using UNISIM. We are using this research as guidance for the architecture design of future multicore systems incorporating domain-specific accelerators for bioinformatics. We intend to widen our study to other applications of the same field. We believe that heterogeneous multicore architectures able to exploit specialisation and multiple dimensions of parallelism will bring great performance improvements for bioinformatics workloads.

Application II

DNA Sequencing and Personal Genomics Case Study for Intro Biology

The rapid advances in DNA sequencing technology are beginning to affect how human diseases are diagnosed, and will soon affect significant numbers of people in the developed world. Because this technology will fundamentally alter many fields of biological research, students even in freshman biology courses should become aware of the technology and its potential impact. I think stories of children with mystery diseases, who are diagnosed by genome sequencing and successfully treated as a result, will make a compelling learning experience and lead students to questions that address most aspects of genomics appropriate for a college-level introductory biology course.

Isn’t sequencing a human genome prohibitively expensive and time consuming?

The graph below from the National Human Genome Research Institute shows that the cost of DNA sequencing has plummeted in recent years. The $1,000 human genome sequence is within sight. The figure below shows the advances in reducing the cost of sequencing a million bases of DNA, compared with Moore’s Law for advances in computing power.

The rapid decline in the cost of sequencing resulted from the advent of next-generation sequencing platforms such as Roche 454 (a YouTube playlist covering a number of different sequencing technologies is available at http://www.youtube.com/view_play_list?p=1B2FEA81FFAD1748).

The development of massively parallel high-throughput sequencing technologies, coupled with single-molecule sequencing (the so-called third generation), in a highly competitive marketplace, continues to lower the cost of obtaining a whole human genome sequence. Illumina announced in a June 8, 2011 press release a huge price drop for a personal whole genome sequence, from $19,500 to $9,500 (Illumina 6/08/2011), along with the release of a personal genome browser app for the iPad.

What does this mean for ordinary people?

It means that the era of personalised genomic medicine has arrived. Instead of individual genetic tests, it will become cost-effective for each person to have his or her own genome sequence. A series of excellent articles in the Milwaukee Journal Sentinel describes the first published use of genome sequencing to diagnose and identify a cure for a boy, Nicholas Volker, suffering from a previously unknown disease.

In this case, rather than sequencing the entire genome, the researchers sequenced the boy’s exome, the 2% of the genome that encodes proteins. Their paper was published in March 2011 in Genetics in Medicine.

So what do you get when your DNA is sequenced?

Too much information? A bunch of As, Gs, Cs and Ts, in strings of 200-400 letters. Your DNA sample is shredded and random fragments are sequenced. To get 99% of the target DNA sequenced at least once, the researchers sequenced Nick’s exome to an average of 34-fold coverage. Individual sequence strings are matched against the human reference genome and differences noted. For Nick Volker’s exome, Worthey et al. found more than 16,000 differences from the reference human sequence. Which of these, if any, is causing the boy’s disease? The paper by Worthey et al. describes the process of sifting through the chaff to identify candidate gene mutations.
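The two ideas in this passage, average coverage and differences from the reference, can be illustrated with a deliberately simplified Python sketch. It assumes each read has already been placed at a known position on the reference with no gaps, and it takes the most frequently observed base at each position; real analyses use dedicated alignment and variant-calling software. The reference, the reads, and the function name are toy values invented for this example.

```python
from collections import Counter


def call_differences(reference, placed_reads):
    """Compare placed reads with a reference sequence.

    placed_reads is a list of (start, read) pairs, where start is the
    0-based position of the read's first base on the reference.
    Returns (mean_depth, differences); differences maps a position to
    (reference_base, most_common_observed_base).
    """
    depth = [0] * len(reference)
    observed = [Counter() for _ in reference]
    for start, read in placed_reads:
        for offset, base in enumerate(read):
            pos = start + offset
            depth[pos] += 1
            observed[pos][base] += 1

    differences = {}
    for pos, counts in enumerate(observed):
        if counts:  # skip positions that no read covers
            base = counts.most_common(1)[0][0]
            if base != reference[pos]:
                differences[pos] = (reference[pos], base)
    return sum(depth) / len(reference), differences


toy_reference = "ACGTACGTAC"
toy_reads = [(0, "ACGTT"), (2, "GTTCG"), (5, "CGTAC")]
mean_depth, differences = call_differences(toy_reference, toy_reads)
print(mean_depth, differences)
```

In this toy case 15 sequenced bases cover a 10-base reference, so the average coverage is 1.5-fold, and one position differs from the reference. At 34-fold coverage the same bookkeeping is done over tens of millions of positions, which is how a list of thousands of differences arises.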

(Source: DNA Sequencing and Personal Genomics Case Study for Intro Biology [Online] Available at: <http://jchoigt.wordpress.com/2011/06/10/dna-sequencing-and-personal-genomics-case-study-for-intro-biology/> Accessed 7 March 2011.)

Questions

1. Is sequencing a human genome expensive and time consuming? Justify.
2. How was the rapid decline in the cost of sequencing considered a benefit?
3. According to you, what could be the possible ethical, legal and social considerations for human genome sequencing?

Application III

Ivy Genomics-Based Medicine Project

Client

The Ben and Catherine Ivy Foundation, and The Translational Genomics Research Institute (TGen).

Overview

Funded by the Ben and Catherine Ivy Foundation and guided by TGen, the Ivy Genomics-Based Medicine Project is an active collaboration among nine US institutions working together to better understand how the genetic differences in individual brain tumours can potentially inform the prediction of what will be the most effective treatment option for each patient.

This project will categorise tumours by molecular profiling and, for the first time in brain cancer research, test each tumour against a wide spectrum of treatments to match differences in response with the profiles. It challenges not only many of the traditional boundaries of IT but the business processes that support the anticipated throughput of collaborative science and the consortia model.

Problem

The Ivy GBM project challenged not only many of the traditional boundaries of IT but the business processes that support the anticipated throughput of collaborative science and the consortia model. The vision was to provide:

• First access to the chemovulnerability- and genomic-profiling data
• Full access to any GBM models used in the consortium
• Use of consortium resources for independent research and/or follow-on sustained research projects
• Demonstration and practice of profile-guided management
• Synergy of collaborative, interdisciplinary, multi-institutional research
• Positioning to participate in stage II clinical trial of profile-guided treatments

Solution

5AM Solutions created a highly collaborative environment for participating institutions across the country to share study information before, during and after the study. We developed customised workflows, inventory tracking, and role-based information collection and sharing, supporting subject enrolment, clinical data collection, specimen creation and tracking, data export and online analysis. 5AM Solutions launched an initial study in 4 weeks to meet the demanding timeline set by the consortium. 5AM effectively balanced the needs for speed, reliability, accuracy and the exposure of progress.

Benefits
A key benefit occurred up front through the study definition and elicitation process. This series of activities sharpened the direction of the study (not just the software), forced an analysis of how the research would be run from a variety of perspectives and allowed us to meet the needs incrementally as they were derived. 5AM's hosting of the software eliminated concerns about HIPAA, data backup, encryption, in-house maintenance, and collaborator/customer service and support.

For the collaborators, the project portal provided an unprecedented ability to share study-related documents and SOPs. User adoption was quick, as new collaborators were able to contribute within an hour of supervised training.

(Source: Isaza, S. & Gaydadjiev, G., Limitations of Current Multicore Architectures for Bioinformatics Applications – A Case Study on Cell BE, Computer Engineering Lab, Delft University of Technology, The Netherlands [Online] Available at: <http://www.5amsolutions.com/resources/casestudies/ivy_gbm.php>)

Questions
1. What was the vision of the Ivy GBM project?
2. What were the efforts of 5AM Solutions in the Ivy GBM project?
3. What were the benefits provided by 5AM Solutions?

Self Assessment Answers

Chapter I
1. a
2. a
3. c
4. b
5. a
6. a
7. b
8. a
9. a
10. b

Chapter II
1. a
2. a
3. d
4. a
5. b
6. c
7. d
8. c
9. c
10. a

Chapter III
1. a
2. b
3. a
4. b
5. b
6. a
7. d
8. b
9. a
10. b

Chapter IV
1. a
2. a
3. a
4. b
5. c
6. b
7. b
8. a
9. d
10. c

Chapter V
1. b
2. b
3. b
4. a
5. d
6. a
7. b
8. c
9. d
10. a

Chapter VI
1. a
2. b
3. d
4. d
5. a
6. a
7. a
8. a
9. b
10. d

Chapter VII
1. a
2. c
3. d
4. d
5. a
6. b
7. b
8. c
9. d
10. a

Chapter VIII
1. a
2. a
3. c
4. a
5. d
6. b
7. a
8. a
9. b
10. a