Discrete and Genetic Algorithms in Bioinformatics, Wen-Lian Hsu (許聞廉), Institute of Information Science, Academia Sinica


Post on 16-Jan-2016


  • Discrete and Genetic Algorithms in Bioinformatics

  • Discrete Algorithms

    Discrete mathematics lies at the foundation of modern computer science.

    Most algorithms we have learned in computer science are discrete.

    Discrete algorithms emphasize worst-case analysis.

    Many sequence manipulation algorithms in bioinformatics are discrete.

  • Natural Problems (1)

    Natural problems: problems arising from nature, which are guaranteed to have feasible solutions if the data are collected accurately. Because of noise in the sampled data, however, such solutions are hard to come by.

    To tackle these problems, one should focus on real data rather than worst-case analysis.

  • Natural Problems (2)

    Techniques that take advantage of the natural constraints of these problems do not necessarily work for general data (especially the worst case), but they can perform very well on such well-structured problems.

    Examples: many computational problems arising from biology, speech recognition, and image processing.

  • Constraints with Errors

    In ordinary constraint optimization problems, one naturally assumes that the constraints are correct.

    What if these constraints are inconsistent? Then there is no feasible solution satisfying them.

    What if every constraint is only partially correct?

  • Explicit Solution Candidates

    In ordinary optimization problems, most algorithms do not generate plausible solutions in the interim.

    However, there are advantages to having some solution candidates when there are errors in the constraints.

  • Plausible Solution Candidates

    For some optimization problems, machine learning approaches generate plausible solutions in the interim.

    Solutions get better as the machine learning approach refines solution patterns iteratively.

    A better solution emerges from the cooperation of plausible solution candidates.

  • Fitness Landscape

    Each solution candidate has a fitness score for the optimization problem.

    A fitness landscape shows the distribution of fitness over the whole search space.

    Solution candidates are ranked by their fitness.

  • Genetic Algorithm

    A search technique for finding exact or approximate solutions to optimization problems.

    It is based on the principle of evolution: survival of the fittest in natural selection.

    Two basic processes from evolution:

    Inheritance (passing of features from one generation to the next)

    Competition (survival of the fittest)

  • Basic Description of GA

    The algorithm starts with a set of solutions (represented by chromosomes) called the population.

    Solutions from one population are taken and used to form a new population.

    The hope is that the new population (offspring) will be better than the old one (parents).

    Solutions are selected to form new solutions according to their fitness: the more suitable they are, the more chances they have to reproduce.

  • GA in Pseudo-code

    Choose initial population
    Evaluate the fitness of each individual in the population
    Repeat
        Select best-ranking individuals to reproduce
        Breed new generation through crossover and mutation (genetic operations) and give birth to offspring
        Evaluate the individual fitness of the offspring
        Replace worst-ranked part of population with offspring
    Until termination
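
The pseudo-code above can be sketched in Python roughly as follows. This is a minimal illustration, not the speaker's implementation: the function name `run_ga`, its parameters, the fixed generation budget used as the termination test, and the toy bit-counting problem are all assumptions made for the example.

```python
import random

def run_ga(fitness, random_individual, crossover, mutate,
           pop_size=30, n_parents=10, n_offspring=10,
           generations=50, seed=0):
    """Generic GA loop following the pseudo-code (all parameters are illustrative)."""
    rng = random.Random(seed)
    # Choose the initial population and rank it by fitness (best first).
    population = [random_individual(rng) for _ in range(pop_size)]
    population.sort(key=fitness, reverse=True)
    history = [fitness(population[0])]
    for _ in range(generations):  # "Until termination": a fixed budget here
        # Select the best-ranking individuals to reproduce.
        parents = population[:n_parents]
        # Breed offspring through crossover and mutation.
        offspring = []
        for _ in range(n_offspring):
            p1, p2 = rng.sample(parents, 2)
            offspring.append(mutate(crossover(p1, p2, rng), rng))
        # Replace the worst-ranked part of the population with the offspring,
        # then re-evaluate and re-rank.
        population = population[:pop_size - n_offspring] + offspring
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
    return population[0], history

# Toy problem (an assumption for illustration): maximize the number of 1s.
def random_bits(rng, n=20):
    return [rng.randint(0, 1) for _ in range(n)]

def one_point_crossover(p1, p2, rng):
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def flip_one_bit(ind, rng):
    i = rng.randrange(len(ind))
    return ind[:i] + [1 - ind[i]] + ind[i + 1:]

best, history = run_ga(sum, random_bits, one_point_crossover, flip_one_bit)
print(sum(best), history[0], history[-1])
```

Because the top-ranked individuals are always carried over, the best fitness in this sketch never decreases from one generation to the next.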

  • Building Block Hypothesis

    Building block: a short, highly fit schema that benefits the solution.

    The globally optimal solution is made up of building blocks.

    Identify, recombine, and resample small building blocks to form a new solution with potentially higher fitness.

    By working with these particular building blocks, we reduce the complexity of the problem.

  • The Fitness Function

    Plays the role of a judge.

    Gives a higher score to an individual that contains more building blocks.

    Refine the fitness function based on the evolution results.

  • Physical Mapping

  • Cutting and Reassembling a DNA Sequence

    Cut a DNA sequence into small pieces in different ways and reassemble them.

    The small pieces (called clones) are still too large to be sequenced completely by biological means, so probes are used to mark the clones.

    Each probe can mark several clones, and each clone can contain several probes.

  • The Physical Mapping Problem with Noisy Genomic Data

    Journal of Computational Biology 10(5), 709-735, 2003

    Each row represents a clone; each column represents a probe.

    Diagram on the left: input clone-probe matrix. Diagram on the right: after probe arrangement, the clones are put in their correct positions.
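
To make the "probe arrangement" idea concrete, here is a small sketch that checks whether a given column (probe) order makes the 1s in every row (clone) contiguous. The function name `has_consecutive_ones` and the 3-clone by 4-probe matrix are hypothetical and chosen only for illustration; they are not data from the paper.

```python
def has_consecutive_ones(matrix, probe_order):
    """Return True if every row's 1s are contiguous once the columns
    are arranged in probe_order (a permutation of column indices)."""
    for row in matrix:
        # Positions (in the new arrangement) of the probes this clone contains.
        positions = [pos for pos, probe in enumerate(probe_order)
                     if row[probe] == 1]
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True

# Hypothetical clone-probe matrix: rows are clones, columns are probes.
clone_probe = [
    [1, 0, 1, 0],  # clone 0 contains probes 0 and 2
    [0, 1, 1, 0],  # clone 1 contains probes 1 and 2
    [0, 1, 0, 1],  # clone 2 contains probes 1 and 3
]

print(has_consecutive_ones(clone_probe, [0, 1, 2, 3]))  # input order
print(has_consecutive_ones(clone_probe, [0, 2, 1, 3]))  # rearranged order
```

In this toy matrix the input order leaves a gap in the first row, while the order 0, 2, 1, 3 places every clone's probes side by side.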

  • Consecutive Ones with Errors

  • False Positives and False Negatives
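
A short sketch of how the two kinds of error disturb a row, assuming the usual reading that a false negative is a missing 1 and a false positive is a spurious 1. Counting blocks of 1s is used here only as an illustrative error measure; the helper `count_one_blocks` and the example rows are hypothetical, and this is not the fitness function of the paper.

```python
def count_one_blocks(row):
    """Number of maximal runs of 1s in a row; an error-free row in the
    correct probe order has at most one such run."""
    blocks, prev = 0, 0
    for value in row:
        if value == 1 and prev == 0:
            blocks += 1
        prev = value
    return blocks

clean = [0, 1, 1, 1, 1, 0, 0, 0]           # one clone, probes in correct order

false_negative = list(clean)
false_negative[2] = 0                      # a probe inside the clone is missed

false_positive = list(clean)
false_positive[6] = 1                      # a probe outside the clone is reported

print(count_one_blocks(clean))
print(count_one_blocks(false_negative))
print(count_one_blocks(false_positive))
```

Either error turns a single run of 1s into two, so no column order can restore strict consecutive ones for that row without tolerating or correcting the error.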

  • A Genetic Algorithm for Physical Mapping

    A two-stage genetic algorithm

    First stage: generate the neighborhood information among probes

    Second stage: generate the longest sequence of connected probes

  • The first stage of GA (GA1)

    Purpose: find a probe ordering with the highest fitness score for each clone.

    Pseudo Code

    Randomly generate a population of probe permutations
    Evaluate the fitness of each individual in the population
    Repeat
        Select best-ranking individuals to reproduce
        Breed new generation through crossover and mutation (genetic operations) and give birth to offspring
        Evaluate the individual fitness of the offspring
        Replace worst-ranked part of population with offspring
    Until termination

  • The first stage of GA (GA1)

    Two building blocks that make partial consecutive ones

    4 1 2 3 5 8 6 9 11 12 13 14 15 17 18

  • Crossover Operation

    P1, P2, Child

    2 3 6 8 1 9 10 12 13 5 11 14 15 17 18

    2 3 6 8 1 9 10 11 12 13 14 18 17 5 15

    9 10 11 12 13 14 8 18 17 6 5 3 2 1 15

    2 3 6 8 1 9

    2 3 6 8 1 9 10

    2 3 6 8 1 9 10 11

    2 3 6 8 1 9 10 11 12
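
One reading that reproduces the rows on this slide is a prefix-plus-fill crossover on permutations: the child copies a leading segment of one parent and then takes the remaining probes in the order they appear in the other parent. The sketch below assumes that reading, with the first row as P1, the third row as P2, and the second row as the child. The slide does not name the operator, so the function `prefix_fill_crossover` and the cut point of 6 are assumptions.

```python
def prefix_fill_crossover(p1, p2, cut):
    """Copy p1[:cut], then append the probes not yet used in p2's order."""
    head = list(p1[:cut])
    used = set(head)
    tail = [probe for probe in p2 if probe not in used]
    return head + tail

# Probe orderings read off the slide (tokenization of the digits is assumed).
p1 = [2, 3, 6, 8, 1, 9, 10, 12, 13, 5, 11, 14, 15, 17, 18]
p2 = [9, 10, 11, 12, 13, 14, 8, 18, 17, 6, 5, 3, 2, 1, 15]

child = prefix_fill_crossover(p1, p2, cut=6)
print(child)
# [2, 3, 6, 8, 1, 9, 10, 11, 12, 13, 14, 18, 17, 5, 15]
```

The child is always a valid permutation, because every probe is taken exactly once: either from the copied prefix or from the filtered second parent.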

  • Mutations

    2 3 6 8 5 9 10 12 13 1 11 12 15 17 18

    2 3 6 8 1 9 10 12 13 5 11 12 15 17 18
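
The two rows on this slide differ only in that probes 1 and 5 have exchanged places, which suggests a swap mutation. The sketch below assumes that reading; the function names and the example ordering are illustrative and not taken from the talk.

```python
import random

def swap_positions(perm, i, j):
    """Return a copy of perm with the probes at positions i and j exchanged."""
    mutated = list(perm)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated

def swap_mutation(perm, rng):
    """Exchange two distinct, randomly chosen positions."""
    i, j = rng.sample(range(len(perm)), 2)
    return swap_positions(perm, i, j)

# Illustrative probe ordering (an assumption, not the slide's exact row).
order = [2, 3, 6, 8, 1, 9, 10, 12, 13, 5, 11, 14, 15, 17, 18]

# Swapping positions 4 and 9 exchanges probes 1 and 5.
print(swap_positions(order, 4, 9))

# A random swap keeps the ordering a permutation of the same probes.
mutated = swap_mutation(order, random.Random(0))
print(mutated)
```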

  • Detection of False Negatives

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  • The first stage of GA (GA1)

    Construct the probe neighboring information according to the GA1 results.

    Probe ordering result for probe segment 1: 1 2 3 5 6 8 9 10 11 12 13 14 15 17 18

    A neighboring probe list: 5: {3, 6}, 6: {5, 8}, 8: {6, 9}, 18: {17}

    Probe ordering result for probe segment 2: 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20

    A neighboring probe list: 5: {6}, 6: {5, 7}, 7: {6, 8}, 20: {19}

    Probe ordering result for probe segment 20: 83 85 86 87 88 89 90 91 92 93 95 96 97 98 99

    Probe neighboring information (the lists combined, "+"): 5: {3, 6}, 6: {5, 7, 8}, 7: {6, 8, 9}, 20: {19}
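
A minimal sketch of how neighbor lists could be built from probe orderings: each probe's neighbors are the probes directly before and after it in an ordering, and lists from several segments are merged. The slide does not say how the merge is done, so the plain set union used here, and the function name `neighbor_lists`, are assumptions.

```python
from collections import defaultdict

def neighbor_lists(orderings):
    """Map each probe to the sorted union of its adjacent probes
    over all given orderings."""
    neighbors = defaultdict(set)
    for order in orderings:
        for left, right in zip(order, order[1:]):
            neighbors[left].add(right)
            neighbors[right].add(left)
    return {probe: sorted(adjacent) for probe, adjacent in neighbors.items()}

# Orderings for probe segments 1 and 2 as read off the slide.
segment_1 = [1, 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 17, 18]
segment_2 = [5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20]

per_segment = neighbor_lists([segment_1])
merged = neighbor_lists([segment_1, segment_2])

print(per_segment[5], per_segment[6], per_segment[8], per_segment[18])
print(merged[5], merged[6], merged[20])
```

With only these two segments, probe 7 gets the neighbors 6 and 8; the slide's combined entry for probe 7 also contains 9, which presumably comes from segments not shown here.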

  • The second stage of GA (GA2)

    Purpose: find the longest sequence of connected probes according to the probe neighboring information.

    Pseudo Code

    Randomly generate a population of probe permutations
    Evaluate the fitness of each individual in the population
    Repeat
        Select best-ranking individuals to reproduce
        Breed new generation through crossover and mutation (genetic operations) and give birth to offspring
        Evaluate the individual fitness of the offspring
        Replace worst-ranked part of population with offspring
    Until termination

  • The second stage of GA (GA2)

    Generate a probe ordering according to the probe neighboring information.

    1: {2}, 2: {1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4, 6}, 6: {5, 7, 8}, 7: {6, 8, 9}, 99: {97, 98}

    1 2 3 4 5 6 7 93 94 95 96 97 98 99

    2 3 5 4 71 72 73 55 56 57 99 98 97 96

    6, 3, 2, 1 will disappear from the latter half.

    row, consecutive ones, false negative
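
As an illustration of what "longest connecting probe sequence" could mean for a candidate permutation in GA2, the sketch below scores an ordering by the length of its longest run in which every adjacent pair appears in the probe neighboring information. This scoring rule and the function name `longest_connected_run` are assumptions; the talk does not spell out the GA2 fitness function.

```python
def longest_connected_run(perm, neighbors):
    """Length of the longest stretch of perm in which each probe is a
    listed neighbor of the probe before it."""
    best = run = 1 if perm else 0
    for left, right in zip(perm, perm[1:]):
        if right in neighbors.get(left, ()):
            run += 1
        else:
            run = 1
        best = max(best, run)
    return best

# The first neighbor lists shown on the GA2 slide.
neighbors = {
    1: {2}, 2: {1, 3}, 3: {2, 4, 5}, 4: {3, 5},
    5: {3, 4, 6}, 6: {5, 7, 8}, 7: {6, 8, 9},
}

print(longest_connected_run([1, 2, 3, 4, 5, 6, 7], neighbors))
print(longest_connected_run([2, 3, 5, 4, 7, 6, 1], neighbors))
```

The first ordering is connected from end to end, while the second breaks after the fragment 2, 3, 5, 4, so it scores lower.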