bioinformatics review - january 2016 issue
Post on 26-Jul-2016
JANUARY 2016 VOL 2 ISSUE 1
MUSCLE v/s T-COFFEE :
An overview and different aspects
Genetic Algorithm: Explanation and Perl Code
“The greatest leap in
bioinformatics is to
predict secondary
structure of protein”
- Charles Wins
Contents
January 2016
Topics
Editorial
Programming: HTSeq: A Python framework to analyze high throughput sequencing data
Algorithms: Genetic Algorithm: Explanation and Perl Code
Tools: MUSCLE v/s T-COFFEE: An overview and different aspects
CADD: Active learning in drug-target interactions
CHIEF EDITOR
Dr. PRASHANT PANT
EDITORIAL
SECTION EDITORS
TARIQ ABDULLAH ALTAF ABDUL KALAM
MANISH KUMAR MISHRA SANJAY KUMAR PRAKASH JHA NABAJIT DAS
REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail
requests to info@bioinformaticsreview.com. Please include contact details in your message.
BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com
at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery,
subject to availability. Pre-payment is required.
CONTACT
PHONE +91. 991 1942-428 / 852 7572-667
MAIL Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025
STAFF ADDRESS To contact any Bioinformatics Review staff member, simply format the address as firstname@bioinformaticsreview.com
PUBLICATION INFORMATION
Volume 1, Number 1. Bioinformatics Review™ is published monthly for one year (12 issues) by Social
and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015
Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used
under licence by SEWA trust. Published in India.
EXECUTIVE EDITOR: FOZAIL AHMAD
FOUNDING EDITOR: MUNIBA FAIZA
EDITORIAL: Welcoming BiR in its 2nd year
Bioinformatics, one of the most promising fields in terms of future prospects, lacks one thing: a news source. Although many journals publish a large amount of quality research on topics such as genome analysis, algorithms and sequence analysis, this work hardly gets any notice in the popular press.
One reason behind this rather disturbing trend is that very few people can read a research paper and turn it into a news story. In addition, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics.
Although there are a number of science reporting websites and portals, very few accept entries from their audience, which is expected to have expertise in one field or another.
Bioinformatics Review has been conceptualized to address all these concerns. We will provide an insight into bioinformatics, both as an industry and as a research discipline. We will post new developments and the latest research in bioinformatics.
We will also accept entries from our audience and, if possible, award them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors and industries. We will also provide a free job listing service for anyone who can benefit from it.
Dr. Prashant Pant
Editor
Letters and responses:
info@bioinformaticsreview.com
Bioinformatics Review | 6
HTSeq: A Python framework to analyze high throughput sequencing data
Muniba Faiza
Image Credit: Google Images
“HTSeq is a Python library which easily develops the scripts required to fulfill a particular task on the HT data.”
High throughput sequencing (HTS) is widely used because it saves a lot of time and provides good results, but it produces a huge amount of data that is difficult to manage, and the tasks and operations performed on that data are difficult as well. To ease this, Simon Anders and his team members introduced a Python framework known as “HTSeq”. HTSeq is a Python library that makes it easy to develop the scripts required to carry out a particular task on HTS data. Basically, HTSeq reads various file formats and breaks them down into recognized strings of characters for further analysis. It also provides classes for genomic coordinates, sequences, sequencing reads, alignments, gene model information, etc.
Two stand-alone applications have
also been developed along with
HTSeq, namely, htseq-qa for read
quality assessment and htseq-count
for preprocessing RNA-Seq
alignments for analyzing differential
expression.
HTSeq can read various formats such as FASTA, FASTQ (short reads) and SAM/BAM (short-read alignments). Wherever appropriate, different parsers yield the same type of record object. For example, the record class SequenceWithQualities is used whenever a sequencing read with base-call qualities needs to be represented; it is therefore yielded by the FastqParser class and is also present as a field in the SAM_Alignment objects yielded by the SAM_Reader or BAM_Reader parser objects (Fig. 1). There are also specific classes to represent genomic positions and genomic intervals. To achieve good performance, various parts of HTSeq are written in Cython, a tool that translates Python code augmented with type information into C.
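The idea that a parser yields a shared record type can be sketched in plain Python. This is only an illustration and not HTSeq's implementation: the `ReadWithQualities` class and the `parse_fastq` function are hypothetical stand-ins, the input is assumed to be well-formed four-line FASTQ, and Phred+33 quality encoding is assumed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class ReadWithQualities:
    """Hypothetical stand-in for a 'sequence with base-call qualities' record."""
    name: str
    seq: str
    qual: List[int]  # Phred scores, one per base


def parse_fastq(lines: Iterable[str]) -> Iterator[ReadWithQualities]:
    """Yield one record per four-line FASTQ entry (Phred+33 encoding assumed)."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between entries
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual_line = next(it).strip()
        yield ReadWithQualities(
            name=header[1:],  # drop the leading '@'
            seq=seq,
            qual=[ord(c) - 33 for c in qual_line],
        )


fastq_text = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCA", "+", "!!5I"]
records = list(parse_fastq(fastq_text))
```

Any other parser that produces the same record type (for example one reading alignments) could then feed the same downstream code without changes.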
BIOINFORMATICS PROGRAMMING
Fig. 1. (a) The SAM_Alignment class as an example of an HTSeq data record: subsets of the content are bundled in object-valued fields, using classes (here SequenceWithQualities and GenomicInterval) that are also used in other data records to provide a common view on diverse data types. (b) The cigar field in a SAM_Alignment object presents the detailed structure of a read alignment as a list of CigarOperation objects. This allows for convenient downstream processing of complicated alignment structures, such as the one given by the cigar string on top and illustrated in the middle. Five CigarOperation objects, with slots for the columns of the table (bottom), provide the data from the cigar string, along with the inferred coordinates of the affected regions in the read (‘query’) and the reference.
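How a cigar string can be expanded into operations with inferred coordinates is sketched below in plain Python. The `CigarOp` record and `parse_cigar` function are hypothetical names, not HTSeq's classes, and the cigar string is a made-up example with five operations rather than the one shown in the figure. Which operations advance along the read and along the reference follows the SAM format specification.

```python
import re
from typing import List, NamedTuple


class CigarOp(NamedTuple):
    """Hypothetical record for one CIGAR operation with inferred coordinates."""
    op: str
    length: int
    ref_start: int
    ref_end: int
    query_start: int
    query_end: int


# Operations that advance along the reference and along the read (SAM specification).
CONSUMES_REF = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar: str, ref_pos: int) -> List[CigarOp]:
    """Split a CIGAR string and track half-open coordinates in read and reference."""
    ops = []
    query_pos = 0
    for length_str, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length_str)
        ref_len = length if op in CONSUMES_REF else 0
        query_len = length if op in CONSUMES_QUERY else 0
        ops.append(CigarOp(op, length, ref_pos, ref_pos + ref_len,
                           query_pos, query_pos + query_len))
        ref_pos += ref_len
        query_pos += query_len
    return ops


# Made-up alignment starting at reference position 1000 (0-based).
ops = parse_cigar("20M6I10M300N40M", ref_pos=1000)
for op in ops:
    print(op)
```

In this example the insertion (6I) covers read positions 20 to 26 but no reference bases, while the skipped region (300N) covers 300 reference bases but no read bases.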
HTSeq also provides a class, SAM_Alignment, that deals with gapped alignments, multiple alignments and paired-end data. The function pair_SAM_alignments_with_buffer pairs up the alignment records by keeping a buffer of reads whose mate has not yet been found, and so facilitates processing the data at the level of sequenced fragments rather than reads. HTSeq also facilitates the storage of genome-position-dependent data: each base pair position on the genome can be given a value, which can then be easily stored and retrieved by its position.
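The buffering idea behind mate pairing can be illustrated with a few lines of plain Python. This is a simplified sketch and not the HTSeq function itself: the `pair_with_buffer` name and the `(fragment name, description)` tuples are hypothetical, and mates are matched by name only, whereas a real implementation also has to check mate positions and handle multiple alignments.

```python
from typing import Dict, Iterable, Iterator, Tuple

Read = Tuple[str, str]  # (fragment name, alignment description); hypothetical record


def pair_with_buffer(reads: Iterable[Read]) -> Iterator[Tuple[Read, Read]]:
    """Yield mates as pairs, buffering reads whose mate has not been seen yet."""
    waiting: Dict[str, Read] = {}
    for read in reads:
        name = read[0]
        if name in waiting:
            yield waiting.pop(name), read  # mate found: emit the fragment
        else:
            waiting[name] = read  # keep until the mate shows up
    # Reads still in `waiting` never found a mate; this sketch simply drops them.


reads = [("fragA", "chr1:100+"), ("fragB", "chr2:500+"),
         ("fragB", "chr2:650-"), ("fragA", "chr1:260-"), ("fragC", "chr3:10+")]
pairs = list(pair_with_buffer(reads))
```

Because the buffer only holds reads that are still waiting for a mate, the input does not need to be sorted by read name.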
The script htseq-qa is a simple tool for the initial quality assessment of sequencing runs. It produces plots that summarize the nucleotide composition at each position in the read and the base-call qualities. As mentioned earlier in this article, htseq-count is a tool for RNA-Seq data analysis. For each gene, it counts how many aligned reads overlap the gene's exons. Since it is designed specifically for differential expression analysis, only reads mapping unambiguously to a single gene are counted; reads aligned to multiple positions or overlapping more than one gene are discarded. In the case of paired-end data, htseq-count counts fragments rather than reads, because the two mates originating from the same fragment provide evidence for only one cDNA fragment and should hence be counted only once.
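The counting rule described above can be sketched as follows. This is a toy illustration, not htseq-count's code: intervals are assumed to be half-open `(chromosome, start, end)` tuples, the gene names and coordinates are invented, the category labels `no_feature` and `ambiguous` are chosen for this sketch, and reads aligned to multiple positions are not modelled.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals lie on the same chromosome and share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_fragments(fragments: List[List[Interval]],
                    exons: Dict[str, List[Interval]]) -> Counter:
    """Count each fragment once, and only if it hits exactly one gene."""
    counts: Counter = Counter()
    for fragment in fragments:  # one fragment = intervals of one read or of both mates
        genes = {gene for gene, gene_exons in exons.items()
                 for exon in gene_exons
                 for part in fragment
                 if overlaps(part, exon)}
        if len(genes) == 1:
            counts[genes.pop()] += 1
        elif not genes:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


exons = {"geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
         "geneB": [("chr1", 380, 500)]}
fragments = [
    [("chr1", 150, 180)],                      # single read inside geneA
    [("chr1", 120, 170), ("chr1", 310, 360)],  # two mates, both in geneA: one count
    [("chr1", 385, 395)],                      # overlaps geneA and geneB: ambiguous
    [("chr1", 600, 650)],                      # overlaps no exon
]
counts = count_fragments(fragments, exons)
```

The second fragment contributes a single count to geneA even though both of its mates overlap the gene, which is the behaviour the paragraph above describes for paired-end data.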
In this way, HTSeq offers a comprehensive solution to facilitate a wide range of programming tasks in HTS data analysis.
Note:
An exhaustive list of references for this article is available from the author on personal request. For more details, write to muniba@bioinformaticsreview.com
Genetic Algorithm: Explanation and Perl Code
Tariq Abdullah
Image Credit: Stock Photos
“Genetic Algorithm was developed by John Holland. It uses the concepts of Natural Selection and Genetic Inheritance and tries to mimic biological evolution. It falls under the category of algorithms known as Evolutionary Algorithms.”
When it comes to bioinformatics algorithms, genetic algorithms top the list of the most used and talked about. Understanding the genetic algorithm is important not only because it helps you reduce the computational time needed to get a result, but also because it is inspired by how nature works.
In this article, you will learn how the genetic algorithm works and the basic concept behind it, and we will also write a program to illustrate the concepts. You can skip the explanation if you already know the basic concepts of the genetic algorithm.
The genetic algorithm was developed by John Holland. It uses the concepts of natural selection and genetic inheritance and tries to mimic biological evolution. It falls under the category of algorithms known as evolutionary algorithms. It can be used to find solutions to hard problems where we don’t know much about the search space.
Let us understand how the genetic algorithm works. For this, let us consider a cancer-associated gene expression matrix. This matrix contains all the known genes found in human beings and their levels of expression.
For a given problem, the genetic algorithm works by maintaining a set of candidate solutions and applying three operators to them: selection, recombination and mutation, which are collectively known as stochastic operators.
Selection: In nature, if an organism is adapted to its environment, its population grows relative to its quality of adaptation. This is referred to as selection. In a GA, it means that if a solution meets the constraints, it is replicated at a rate proportional to its relative quality.
Recombination: In nature, two similar chromosomes of the surviving individual exchange genes during sexual reproduction in a process known as crossing over. In a GA, we decompose two distinct solutions and randomly mix their parts to form novel solutions.
Mutation: Random changes in an existing chromosome may lead to a fitter individual. This concept is used to randomly perturb a candidate solution.
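Selection at a rate proportional to relative quality is often implemented as fitness-proportional ("roulette wheel") sampling. The sketch below shows that idea in Python and is not taken from the article's Perl program: the function name `select_parents` is hypothetical, fitness is assumed to be the sum of a candidate's values, and fitness values are assumed to be non-negative.

```python
import random
from typing import List, Sequence


def select_parents(population: Sequence[List[int]], k: int,
                   rng: random.Random) -> List[List[int]]:
    """Pick k parents with probability proportional to fitness (here: the sum)."""
    weights = [sum(individual) for individual in population]
    if sum(weights) == 0:
        # Every candidate has zero fitness: fall back to uniform sampling.
        return [list(rng.choice(population)) for _ in range(k)]
    return [list(ind) for ind in rng.choices(population, weights=weights, k=k)]


rng = random.Random(0)  # fixed seed so the run is repeatable
population = [[1, 2, 3], [40, 50, 60], [0, 0, 0]]
parents = select_parents(population, k=1000, rng=rng)
share_fit = sum(p == [40, 50, 60] for p in parents) / len(parents)
print(f"share of draws taken by the fittest candidate: {share_fit:.2f}")
```

With fitness values 6, 150 and 0, the fittest candidate is expected to take roughly 150/156 of the draws, and the zero-fitness candidate is never selected.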
1. produce an initial population of individuals
2. evaluate the fitness of all individuals
3. while termination condition not met do
4.     select fitter individuals for reproduction
5.     recombine between individuals
6.     mutate individuals
7.     evaluate the fitness of the modified individuals
8.     generate a new population
9. end while
Have a look at the Genetic Algorithm illustrated in the diagram below to understand it more clearly.
The program
We are going to implement the genetic algorithm by writing a program in Perl. Although it is not directly applicable to a real-life problem, it should be sufficient to familiarize you with the genetic algorithm.
Suppose that you have a set of gene expression data covering all 25000 genes in the human genome, and you want to find the five values among all 25000 whose sum is the highest.
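For a problem this simple the exact answer can be computed directly, which gives a useful baseline to compare any heuristic against. The Python sketch below uses the 36 sample values from this article's program listing and assumes that the five values must come from distinct positions in the data.

```python
import heapq

# Sample expression values from the article's program listing.
expression = [1, 3, 8, 5, 2, 4, 46, 6, 78, 7, 9, 9, 0, 1, 1, 1, 5, 59, 9, 97,
              7, 6, 5, 45, 4, 3, 23, 2, 22, 2, 2, 4, 5, 5, 6, 54]

top_five = heapq.nlargest(5, expression)  # the five largest values
best_sum = sum(top_five)
print(top_five, best_sum)  # [97, 78, 59, 54, 46] 334
```

A genetic algorithm is not needed here, of course; the toy problem is chosen only so that the mechanics of the algorithm are easy to follow and its output is easy to check.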
For the purpose of this program we will require four subroutines:
Generate: It generates chromosomes containing 5 values (specified in the variable $GeneNumberConstraint) selected from random positions in the data.
Mutate: It mutates a chromosome at a random position with a random value less than the one specified in $HighestMutationValue.
Survival Check: It checks whether the newly formed chromosome is viable, i.e. whether its value meets a minimum specification (checking for fitness).
Recombine: It forms new combinations from existing chromosomes by crossing them over with each other.
The Code
If you wish, you can download the Perl code on GitHub: https://github.com/bioinformaticsreview/geneticalgorithm
So here is the final code implementing the genetic algorithm in Perl:
#!/usr/bin/perl
use strict;
use warnings;

# Toy expression values; the goal is to find five values whose sum is highest.
my @GeneExpressionData = (1,3,8,5,2,4,46,6,78,7,9,9,0,1,1,1,5,59,9,97,7,6,5,45,
                          4,3,23,2,22,2,2,4,5,5,6,54);
my @SolutionSpace        = ();     # surviving chromosomes, stored as array references
my $CurrentHighest       = 0;      # best sum seen so far
my $HighestMutationValue = 110;    # mutated values are drawn from 0..109
my $GeneNumberConstraint = 5;      # number of values per chromosome
my $genes                = scalar @GeneExpressionData;
my @chromosome           = ();     # chromosome currently being evaluated
my $steps                = 10;

print "The Total Genes are: $genes\n";

for my $p (0 .. $steps) {
    generate();
    SurvivalCheck();
    mutate();
    SurvivalCheck();
    recombine();
    SurvivalCheck();
}

print "\n\n Genetic Algorithm Result\n\n\n\t\tHighest Detected: $CurrentHighest in $steps Steps\n\n";

# Replace one randomly chosen position of the current chromosome with a random value.
sub mutate {
    my $randpos = int(rand(scalar @chromosome));
    $chromosome[$randpos] = int(rand($HighestMutationValue));
    print "\n Mutation Took Place in Chromosome @chromosome ";
}

# Cross over two randomly chosen survivors; the first offspring becomes the current chromosome.
sub recombine {
    return unless @SolutionSpace;    # nothing to recombine yet
    print "\nRecombining\n\n";
    my @chromosome1 = @{ $SolutionSpace[int(rand(scalar @SolutionSpace))] };
    my @chromosome2 = @{ $SolutionSpace[int(rand(scalar @SolutionSpace))] };
    print "Random Sequence Chromosome from Solution Space: @chromosome1 and @chromosome2";
    for my $i (0 .. int($GeneNumberConstraint / 2)) {
        my $pos1 = int(rand($GeneNumberConstraint));
        my $pos2 = int(rand($GeneNumberConstraint));
        # Swap one value between the two parents.
        ($chromosome1[$pos1], $chromosome2[$pos2]) = ($chromosome2[$pos2], $chromosome1[$pos1]);
    }
    @chromosome = @chromosome1;
    print "The Recombination led to @chromosome";
}

# Keep the current chromosome if its sum beats the best sum seen so far.
sub SurvivalCheck {
    my $sum = 0;
    $sum += $_ for @chromosome;
    if ($sum > $CurrentHighest) {
        $CurrentHighest = $sum;
        push @SolutionSpace, [@chromosome];    # store a copy as one chromosome
        print "\nIndividual is alive! \nCurrent Highest Expression: $CurrentHighest";
        return 1;
    }
    else {
        print "\nSpecies Didn't Survive! \n";
        return 0;
    }
}

# Build a new chromosome from values at random positions in the data.
sub generate {
    @chromosome = ();
    for my $i (1 .. $GeneNumberConstraint) {
        my $n = int(rand($genes));
        push @chromosome, $GeneExpressionData[$n];
    }
    print "\n\n\nGenerated Chromosome: @chromosome \n";
}
That's all! Feel free to comment and discuss if anything is unclear.
MUSCLE v/s T-COFFEE: An overview and different aspects
Muniba Faiza
Image Credit: Google Images
“MUSCLE and T-COFFEE are both multiple sequence alignment tools and also help to study the evolutionary relationships among species.”
As I have discussed in my earlier articles, MUSCLE and T-COFFEE are multiple sequence alignment (MSA) tools. In this article, we will discuss different aspects of these tools and which one is preferred over the other. Both tools also help to study the evolutionary relationships among species. I have already explained the algorithms involved in both tools, which are comparable. During alignment, MUSCLE uses the UPGMA tree construction method, which assumes that mutations occur at a constant rate. This may be what makes it different from other tools.
On the positive side, MUSCLE is known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB). It is much faster than other MSA tools. MUSCLE also uses a progressive alignment, which is iterated as long as it yields a better SP (sum-of-pairs) score (explained in the “Basic concept of MSA” article).
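As a reminder of what the SP score measures, here is a minimal Python sketch that scores an alignment by summing pairwise scores over every column. The scoring scheme (match +1, mismatch 0, residue against gap -1, gap against gap 0) is an assumption made for this illustration; real tools use substitution matrices and affine gap penalties, and the sequences are invented.

```python
from itertools import combinations
from typing import List


def pair_score(a: str, b: str) -> int:
    """Assumed toy scheme: match +1, mismatch 0, residue vs gap -1, gap vs gap 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else 0


def sp_score(alignment: List[str]) -> int:
    """Sum the pairwise scores over every column and every pair of sequences."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(pair_score(a, b)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))


alignment = ["ACG-T",
             "ACGAT",
             "A-GAT"]
score = sp_score(alignment)
print(score)  # 7
```

An iterative refinement step in this spirit would realign part of the alignment and keep the change only if the SP score increases.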
T-COFFEE has an advantage over MUSCLE in that it combines both global and local alignments, which provides better results, and it also performs well on the four benchmark tests. The second thing that makes it better than other tools is that it uses an optimization method to find the multiple alignment that best fits the input library. T-COFFEE also uses a progressive alignment strategy similar to MUSCLE but, unlike MUSCLE, it uses the Neighbor Joining tree construction method during alignment, which relaxes the assumption of the UPGMA method: it does not assume that mutations occur at a constant rate.
Let us take protein sequences of the ‘Keratin’ protein from a few species, align them using both tools and construct the respective phylogenetic trees. In this example, I have taken the FASTA sequences of Homo sapiens (GI: 7717238), Paralichthys olivaceus (GI: 10716084), Pseudomonas viridiflava (GI: 934022154) and Pseudomonas aeruginosa (GI: 856785229). The results are as follows:
As we can see, the two trees are slightly different. In the tree constructed by MUSCLE, the sequence of Paralichthys olivaceus is placed below that of Homo sapiens, whereas in the tree constructed by T-COFFEE it is placed above it. The same is the case with the other two species. This is how MUSCLE and T-COFFEE differ from each other.
TOOLS
T-COFFEE is preferred over MUSCLE for aligning both closely and distantly related species, whereas MUSCLE is more suitable for aligning distantly related species, since it uses global alignment only while T-COFFEE uses both global and local alignment.
Note:
An exhaustive list of references for this article is available from the author on personal request. For more details, write to muniba@bioinformaticsreview.com.
Fig 1. Tree constructed using MUSCLE. Fig 2. Tree constructed using T-COFFEE.
Active learning in drug-target interactions
Muniba Faiza
Image Credit: Google Images
“Active learning is a powerful tool for drug discovery and development where it reduces the tedious process of performing a number of experiments which are required to produce significant high-confidence predictions.”
Active learning is a kind of machine learning. In active learning, the learning algorithm itself chooses which experiments to perform next in order to produce the desired output.
Active learning is a powerful tool for drug discovery and development, where it reduces the tedious process of performing the large number of experiments required to produce significant high-confidence predictions. In practice, however, it is difficult to decide when to stop the experimentation process. Applying a reliable stopping criterion to the algorithm therefore reduces both the time and the cost of the experimentation process.
The basis of active learning is having good predictive models to guide experimentation.
Active learning iteratively builds a model for drug-target interactions. Instead of relying on a large, manually assembled training data set, the active learning procedure increases the training set stepwise. Thus, the time and experimental cost are reduced and are spent only on improving the model, rather than on verifying a specific model which may not even give the desired outcome or suit the specifications under consideration.
How does active learning work?
Active learning is an iterative
process and is completed in four
steps:
1. Initialization
2. Model
3. Active learning algorithm
4. Accuracy measure of the predicted output
The active learning strategy starts with an initialization step in which an interaction matrix for drugs and targets is formed. Along with this matrix, a subset of known labels and the drug and target kernels, Kd and Kt respectively, are provided.
CADD
The model predicts the drug-target interactions. Based on the obtained predictions, the active learning algorithm is applied to find new experiments (labels) that will improve the model according to the requirements. Here, batchwise learning is applied: a fixed number of experiments is queried in each training round, so the set of performed experiments (labels) grows with every round.
Each training round corresponds to a specific time point, measured by the number of experiments performed so far. At each time point, the accuracy of the model is estimated using various methods. The process is stopped when some condition is met, for example when a certain budget for performing experiments is reached or when the estimated accuracy of the model is high enough.
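The batchwise loop with its two stopping conditions can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the `TRUTH` matrix stands in for wet-lab experiments, the `predict` function is a deliberately naive model (it does not use the kernels Kd and Kt), the query rule is simple uncertainty sampling, and accuracy is estimated only on the batch about to be queried, which is a crude and pessimistic estimate.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (drug index, target index)

# Toy ground truth standing in for wet-lab experiments (1 = interaction).
TRUTH = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]


def predict(pair: Pair, known: Dict[Pair, int]) -> float:
    """Naive model: mean of known labels sharing the drug or the target (0.5 if none)."""
    d, t = pair
    related = [y for (i, j), y in known.items() if i == d or j == t]
    return sum(related) / len(related) if related else 0.5


def active_learning(known: Dict[Pair, int], batch_size: int, budget: int,
                    target_accuracy: float) -> Tuple[Dict[Pair, int], List[float]]:
    """Query batches until the label budget or the target accuracy is reached."""
    known = dict(known)
    history: List[float] = []
    all_pairs = [(d, t) for d in range(len(TRUTH)) for t in range(len(TRUTH[0]))]
    while len(known) + batch_size <= budget:
        unlabeled = [p for p in all_pairs if p not in known]
        if not unlabeled:
            break
        # Query the pairs the model is least certain about (closest to 0.5).
        unlabeled.sort(key=lambda p: abs(predict(p, known) - 0.5))
        batch = unlabeled[:batch_size]
        # Estimate accuracy on the batch before its labels are revealed.
        hits = sum((predict(p, known) >= 0.5) == bool(TRUTH[p[0]][p[1]]) for p in batch)
        history.append(hits / len(batch))
        for d, t in batch:
            known[(d, t)] = TRUTH[d][t]  # "perform the experiment"
        if history[-1] >= target_accuracy:
            break
    return known, history


initial = {(0, 0): 1, (2, 2): 1, (0, 2): 0}
labels, accuracy_per_round = active_learning(initial, batch_size=2, budget=11,
                                             target_accuracy=1.0)
print(len(labels), accuracy_per_round)
```

Here the budget counts all labels, including the initial ones, so with three starting labels and batches of two the loop can run for at most four rounds.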
This is the basic idea of active learning applied to drug-target predictions. It saves a lot of the time and cost involved in performing experiments in vitro.
Note:
An exhaustive list of references for this article is available from the author on personal request. For more details, write to muniba@bioinformaticsreview.com.
Subscribe to Bioinformatics Review newsletter to get the latest post in your mailbox and
never miss out on any of your favorite topics.
Log on to
www.bioinformaticsreview.com