
Pipelined Instruction Processing

In

Computer Organization and Architecture

Sumit Gupta

Reg. No = 3050060107

99883-80416 (M)

(0181) 4639-871 (Home)

[email protected]

Term Paper for CSE-211

Computer Arithmetic

Fifth Term 2008

ABSTRACT

Technology is advancing day by day in every field, such as science, medicine, and defence. The topic I was given for research, “Pipelined Instruction Processing”, is a technique that helps increase instruction processing throughput by using a pipelined method. In this paper, I present a detailed description of the topic in the field of computer arithmetic. The procedure starts with the important step of understanding the paper topic and continues by finding the relevant aspects that come under it. I describe the principle, the problems and the solutions for each particular problem, and the advantages and disadvantages. After that, I give examples which help to understand more about the topic, and then the implementation. At last, we come to the developments in the pipelined method, because technology is changing, which leads to development.

Keywords: Principle, Problems and Solutions, Advantages & Disadvantages, Examples, Implementation, Development, Conclusion, References

Introduction:-

This paper presents an instruction processor with a pipelined structure, which differs from various other processor technologies. The main processor families are CISC, RISC, superscalar, VLIW, super pipelined, and symbolic processors. These families can be mapped onto a coordinate space of clock rate versus cycles per instruction (CPI), as shown in the figure. As implementation technology evolves rapidly, the clock rates of various processors are gradually moving from low to higher speeds, toward the right of the design space. Another trend is that processor manufacturers are trying to lower the CPI using hardware and software approaches.

[Figure: Design space of processor families, plotting cycles per instruction (0.1 to 20) against clock rate (5 to 1000 MHz). Labelled regions include Scalar CISC, Scalar RISC, Super pipelined, and Vector Supercomputer; an arrow marks the most likely future processor speed.]

Conventional processors like the Intel i486, Motorola MC68040, and IBM 390 fall into the family known as complex-instruction-set computing (CISC) architecture. The typical clock rate of today’s CISC processors ranges from 33 to 50 MHz. On the other hand, today’s reduced-instruction-set computing (RISC) processors, such as the Intel i860, SPARC, and IBM RS/6000, have faster clock rates ranging from 20 to 120 MHz, determined by the implementation technology employed. With the use of hardwired control, the CPI of most RISC instructions has been reduced to one to two cycles. The processors in vector supercomputers are mostly super pipelined and use multiple functional units for concurrent scalar and vector operations.

Now we come to the main topic of this paper, “Pipelined Instruction Processing”. The idea is to divide the logic into stages and to work on different data within each stage. An often-used real-world analogy involves doing the laundry: if you have two loads of laundry to do, you can either wash the first load and then dry it before moving on to the next, or you can wash the first load and, when you put it in to dry, put the next load in to wash. If each step takes 20 minutes, you will finish in 60 minutes instead of 80.
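The laundry arithmetic can be sketched directly. The formulas below are a simple illustration with function names of my own, assuming every stage takes the same fixed time:

```python
# Completion time for n tasks through an s-stage pipeline, assuming every
# stage takes the same step_time (as in the 20-minute laundry example).
def sequential_time(n_tasks, n_stages, step_time):
    # Without pipelining, each task occupies all stages before the next starts.
    return n_tasks * n_stages * step_time

def pipelined_time(n_tasks, n_stages, step_time):
    # The first task passes through every stage; each later task finishes
    # one step after the previous one.
    return (n_stages + n_tasks - 1) * step_time

print(sequential_time(2, 2, 20))  # 80 minutes: wash and dry each load in turn
print(pipelined_time(2, 2, 20))   # 60 minutes: overlap washing and drying
```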

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). The fundamental concept is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all the steps at once. The term pipeline refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the links of a pipe). The origin of pipelining is traced to the IBM Stretch project, which proposed the terms "Fetch, Decode, and Execute" that became common usage.

Non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely eliminate idle time in a CPU, but making those modules work in parallel improves program execution significantly. Processors with pipelining are organized internally into stages which can work semi-independently on separate jobs. Each stage is organized and linked into a 'chain', so each stage's output is fed to the next stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.

Take another example, in which, in a CPU or other circuit, earlier data may affect later data: for instance, if a CPU is processing C = A + B, followed by E = C + D, the value of C must finish being calculated before it can be used in the second instruction. This type of problem is called a data dependency conflict. In order to resolve these conflicts, additional logic must be added to stall or otherwise deal with the incoming data. A significant part of the effort in modern CPU design goes into resolving these sorts of dependencies.
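The C = A + B followed by E = C + D conflict can be detected mechanically by comparing a later instruction's source operands against the destinations of instructions still in flight. A minimal sketch, where the tuple encoding and register names are my own illustration:

```python
# Detect a read-after-write (RAW) dependency between two instructions.
# Each instruction is a (destination, source1, source2) tuple; the encoding
# and register names are hypothetical, chosen only for this illustration.
def raw_hazard(earlier, later):
    dest, _, _ = earlier
    _, src1, src2 = later
    # The later instruction reads a register the earlier one has yet to write.
    return dest in (src1, src2)

i1 = ("C", "A", "B")  # C = A + B
i2 = ("E", "C", "D")  # E = C + D
print(raw_hazard(i1, i2))  # True: E = C + D must wait for C to be computed
```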

An instruction has a number of stages. The various stages can be worked on simultaneously, like the blocks of a production line. This is a pipeline, and the process is also referred to as instruction pipelining. The figure shows a pipeline of two independent stages: fetch instruction and execute instruction. The first stage fetches an instruction and buffers it. While the second stage is executing the instruction, the first stage takes advantage of any unused memory cycles to fetch and buffer the next instruction. This process speeds up instruction execution.

Two-Stage Instruction Pipeline

Pipelined Instruction Principle:-

In order to speed up the operation of a computer system beyond what is possible with

sequential execution, methods must be found to perform more than one task at a time. One

method for gaining significant speedup with modest hardware cost is the technique of

pipelining. In this technique, a task is broken down into multiple steps, and independent

processing units are assigned to each step. Once a task has completed its initial step, another

task may enter that step while the original task moves on to the following step.

Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory

(flip flops). When the clock signal arrives, the flip flops take their new value and the logic

then requires a period of time to decode the new values. Then the next clock pulse arrives and

the flip flops again take their new values, and so on. By breaking the logic into smaller pieces

and inserting flip flops between the pieces of logic, the delay before the logic gives valid

outputs is reduced. In this way the clock period can be reduced. For example, the RISC

pipeline is broken into five stages with a set of flip flops between each stage.

1. Instruction fetch

2. Instruction decode and register fetch

3. Execute

4. Memory access

5. Register write back

Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode,

EX = Execute, MEM = Memory access, WB = Register write back). The vertical axis is

successive instructions, the horizontal axis is time. So in the green column, the earliest

instruction is in WB stage, and the latest instruction is undergoing instruction fetch.
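The stage occupancy just described can be printed with a short script. This is an illustrative sketch of the ideal, stall-free case, not any particular machine:

```python
# Print which stage each instruction occupies in each cycle of an ideal
# five-stage RISC pipeline with no stalls. Rows are successive instructions,
# columns are successive clock cycles, as in the description above.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_table(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = []
        for cycle in range(n_cycles):
            stage = cycle - i  # instruction i enters the pipe at cycle i
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "..")
        rows.append(row)
    return rows

for row in pipeline_table(4):
    print(" ".join(f"{s:>3}" for s in row))
```

In any one column (cycle), the earliest instruction is in the latest stage and the newest instruction is just being fetched, which is exactly the picture the text describes.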

Problems in Instruction Pipelining

Several difficulties prevent instruction pipelining from being as simple as the above

description suggests. The principal problems are:

TIMING VARIATIONS:

Not all stages take the same amount of time. This means that the speed gain of a pipeline will

be determined by its slowest stage. This problem is particularly acute in instruction

processing, since different instructions have different operand requirements and sometimes

vastly different processing time. Moreover, synchronization mechanisms are required to

ensure that data is passed from stage to stage only when both stages are ready.

DATA HAZARDS:

When several instructions are in partial execution, a problem arises if they reference the same

data. We must ensure that a later instruction does not attempt to access data sooner than a

preceding instruction, if this will lead to incorrect results. For example, instruction N+1 must

not be permitted to fetch an operand that is yet to be stored into by instruction N.

BRANCHING:

In order to fetch the "next" instruction, we must know which one is required. If the present

instruction is a conditional branch, the next instruction may not be known until the current

one is processed.

INTERRUPTS:

Interrupts insert unplanned "extra" instructions into the instruction stream. The interrupt must

take effect between instructions, that is, when one instruction has completed and the next has

not yet begun. With pipelining, the next instruction has usually begun before the current one

has completed.

All of these problems must be solved in the context of our need for high speed performance.

If we cannot achieve sufficient speed gain, pipelining may not be worth the cost.

Solutions

Possible solutions to the problems described above include the following strategies:

Timing Variations

To maximize the speed gain, stages must first be chosen to be as uniform as possible in

timing requirements. However, a timing mechanism is needed. A synchronous method could

be used, in which a stage is assumed to be complete in a definite number of clock cycles.

However, asynchronous techniques are generally more efficient. A flag bit or signal line is

passed forward to the next stage indicating when valid data is available. A signal must also be

passed back from the next stage when the data has been accepted.

In all cases there must be a buffer register between stages to hold the data; sometimes this

buffer is expanded to a memory which can hold several data items. Each stage must take care

not to accept input data until it is valid, and not to produce output data until there is room in

its output buffer.

Data Hazards

To guard against data hazards it is necessary for each stage to be aware of the operands in use

by stages further down the pipeline. The type of use must also be known, since two

successive reads do not conflict and should not be cause to slow the pipeline. Only when

writing is involved is there a possible conflict.

The pipeline is typically equipped with a small associative check memory which can store the

address and operation type (read or write) for each instruction currently in the pipe. The

concept of "address" must be extended to identify registers as well. Each instruction can

affect only a small number of operands, but indirect effects of addressing must not be

neglected.

As each instruction prepares to enter the pipe, its operand addresses are compared with those

already stored. If there is a conflict, the instruction (and usually those behind it) must wait.

When there is no conflict, the instruction enters the pipe and its operand addresses are stored in the check memory. When the instruction completes, these addresses are removed. The memory must be associative to handle the high-speed lookups required.
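As a rough illustration of this check memory, the sketch below uses a plain Python list in place of real associative hardware; the class and method names are my own:

```python
# A toy stand-in for the associative check memory described above: it records
# (address, access type) for each in-flight instruction and reports a conflict
# only when writing is involved, since two reads never conflict.
class CheckMemory:
    def __init__(self):
        self.in_flight = []  # list of (address, "read" | "write") entries

    def conflicts(self, address, access):
        # A conflict exists if any in-flight entry touches the same address
        # and at least one side of the pair is a write.
        return any(addr == address and "write" in (kind, access)
                   for addr, kind in self.in_flight)

    def enter(self, address, access):
        self.in_flight.append((address, access))    # instruction enters the pipe

    def complete(self, address, access):
        self.in_flight.remove((address, access))    # instruction leaves the pipe

cm = CheckMemory()
cm.enter("R1", "write")            # instruction N will write R1
print(cm.conflicts("R1", "read"))  # True: instruction N+1 must wait to read R1
print(cm.conflicts("R2", "read"))  # False: different operand, no conflict
```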

Branching

The problem in branching is that the pipeline may be slowed down by a branch instruction

because we do not know which branch to follow. In the absence of any special help in this

area, it would be necessary to delay processing of further instructions until the branch

destination is resolved. Since branches are extremely frequent, this delay would be

unacceptable.

One solution which is widely used, especially in RISC architectures, is deferred branching.

In this method, the instruction set is designed so that after a conditional branch instruction,

the next instruction in sequence is always executed, and then the branch is taken. Thus every

branch must be followed by one instruction which logically precedes it and is to be executed

in all cases. This gives the pipeline some breathing room. If necessary this instruction can be

a no-op, but frequent use of no-ops would destroy the speed benefit.

Use of this technique requires a coding method which is confusing for programmers but not

too difficult for compiler code generators. A widely-used strategy in many current

architectures is some type of branch prediction. This may be based on information provided

by the compiler or on statistics collected by the hardware. The goal in any case is to make the

best guess as to whether or not a particular branch will be taken, and to use this guess to

continue the pipeline.
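One common hardware form of such branch prediction is a two-bit saturating counter, which must be wrong twice in a row before it changes its guess. The sketch below is illustrative only; the initial state and the outcome sequence are my assumptions:

```python
# A two-bit saturating-counter branch predictor. States 0-1 predict
# "not taken", states 2-3 predict "taken"; each actual outcome nudges the
# counter one step toward the corresponding end.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start in "weakly taken" (an assumed initial state)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, False, True]  # e.g. a loop branch, mostly taken
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, "of", len(outcomes), "predicted correctly")  # 3 of 4
```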

A more costly solution occasionally used is to split the pipeline and begin processing both

branches. This idea is receiving new attention in some of the newest processors.

Interrupts

The fastest but most costly solution to the interrupt problem would be to include as part of the

saved "hardware state" of the CPU the complete contents of the pipeline, so that all

instructions may be restored to their original state in the pipeline. This strategy is too

expensive in other ways and is not practical.

The simplest solution is to wait until all instructions in the pipeline complete, that is, flush the

pipeline from the starting point, before admitting the interrupt sequence. If interrupts are

frequent, this would greatly slow down the pipeline; moreover, critical interrupts would be

delayed.

A compromise solution identifies a "point of no return," the point in the pipe at which

instructions may first perform an irreversible action such as storing operands. Instructions

which have passed this point are allowed to complete, while instructions that have not

reached this point are cancelled.

Advantages and Disadvantages:-

Pipelining does not help in all cases. There are several possible disadvantages. An instruction

pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A

pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.

Advantages of Pipelining:

1. The cycle time of the processor is reduced, thus increasing instruction issue-rate in

most cases.

2. Some combinatorial circuits such as adders or multipliers can be made faster by

adding more circuitry. If pipelining is used instead, it can save circuitry vs. a more

complex combinatorial circuit.

Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This prevents

branch delays (in effect, every branch is delayed) and problems with serial

instructions being executed concurrently. Consequently the design is simpler and

cheaper to manufacture.

2. The instruction latency in a non-pipelined processor is slightly lower than in a

pipelined equivalent. This is due to the fact that extra flip flops must be added to the

data path of a pipelined processor.

3. A non-pipelined processor will have a stable instruction bandwidth. The performance

of a pipelined processor is much harder to predict and may vary more widely between

different programs.

4. One of the major problems in designing an instruction pipeline is assuring a steady

flow of instructions to initial stages of the pipeline. However, 15-20% of instructions

in an assembly-level stream are (conditional) branches. Of these, 60-70% take the

branch to a target address. Until the instruction is actually executed, it is impossible to

determine whether the branch will be taken or not.
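Those statistics give a feel for how costly unresolved branches are. In the back-of-the-envelope sketch below, the 3-cycle penalty per taken branch is an assumed figure chosen only for illustration; the branch and taken fractions are from the text:

```python
# Expected extra cycles per instruction lost to taken branches, assuming a
# fixed penalty (in cycles) is paid each time a branch is taken.
def expected_penalty(branch_frac, taken_frac, penalty_cycles):
    return branch_frac * taken_frac * penalty_cycles

low  = expected_penalty(0.15, 0.60, 3)  # 15% branches, 60% taken
high = expected_penalty(0.20, 0.70, 3)  # 20% branches, 70% taken
print(f"{low:.2f} to {high:.2f} extra cycles per instruction")
```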

Example:-

1) Generic pipeline

Generic 4-stage pipeline; the colored boxes represent instructions independent of each other. The figure shows a generic pipeline with four stages:

1. Fetch

2. Decode

3. Execute

4. Write-back

The top gray box is the list of instructions waiting to be executed; the bottom gray box is the

list of instructions that have been completed; and the middle white box is the pipeline.

Execution is as follows:-

Time 0: Four instructions are waiting to be executed.
Time 1: The green instruction is fetched from memory.
Time 2: The green instruction is decoded; the purple instruction is fetched from memory.
Time 3: The green instruction is executed (the actual operation is performed); the purple instruction is decoded; the blue instruction is fetched.
Time 4: The green instruction's results are written back to the register file or memory; the purple instruction is executed; the blue instruction is decoded; the red instruction is fetched.
Time 5: The green instruction is completed; the purple instruction is written back; the blue instruction is executed; the red instruction is decoded.
Time 6: The purple instruction is completed; the blue instruction is written back; the red instruction is executed.
Time 7: The blue instruction is completed; the red instruction is written back.
Time 8: The red instruction is completed.
Time 9: All instructions are executed.

2) Bubble

A bubble in cycle 3 delays execution

When a "hiccup" (difficulty) in execution occurs, a "bubble" is created in the pipeline in

which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed and

the decoding stage in cycle 3 now contains a bubble. Everything "behind" the purple

instruction is delayed as well but everything "ahead" of the purple instruction continues with

execution.

Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7. Bubbles are like stalls, in which nothing useful happens in the fetch, decode, execute, and write-back stages.
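Under the same ideal assumptions as before, each bubble simply adds one cycle to the total, which reproduces the 8-versus-7 comparison. The formula is an illustrative simplification of my own:

```python
# Total cycles for n instructions through an s-stage pipeline, where each
# bubble (stall cycle) adds one cycle: everything behind the bubble slips.
def total_cycles(n_instructions, n_stages, n_bubbles=0):
    return n_stages + n_instructions - 1 + n_bubbles

print(total_cycles(4, 4))     # 7 cycles: four instructions, no bubble
print(total_cycles(4, 4, 1))  # 8 cycles: one bubble delays all later work
```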

Implementations:-

Buffered, Synchronous pipelines

Conventional microprocessors are synchronous circuits that use buffered, synchronous pipelines. In the synchronous method, one timing signal causes all outputs of the units to be transferred to the succeeding units. The timing signal occurs at fixed intervals, taking into account the slowest unit. Instruction and arithmetic pipelines use the synchronous method. There is a staging register between each pair of units, and the clock signal activates all the staging registers simultaneously; the staging registers hold the information passed between the stages.

Buffered, Asynchronous pipelines

Asynchronous pipelines are used in asynchronous circuits, and have their pipeline registers

clocked asynchronously. In the asynchronous method, a pair of “handshaking” signals is used

between each unit and the next unit.

- A ready signal

- An acknowledge signal

The ready signal informs the next unit that it has finished the present operation and is ready

to pass the task and any results onwards. The acknowledge signal is returned when the

receiving unit has accepted the task and results.
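A minimal model of this ready/acknowledge handshake is sketched below; the class and function names are my own, and real asynchronous hardware would of course use signal lines rather than method calls:

```python
# Model of the handshake between two pipeline units: the sender raises
# "ready" when it has finished its operation, and the receiver returns
# "acknowledge" only once it has accepted the task into its staging buffer.
class ReceivingUnit:
    def __init__(self):
        self.buffer = None  # staging buffer between the two units

    def accept(self, task):
        if self.buffer is None:   # room to accept the incoming task?
            self.buffer = task
            return True           # acknowledge
        return False              # still busy: no acknowledge yet

def hand_off(receiver, task):
    ready = True                  # sender signals it is ready to pass the task
    ack = receiver.accept(task)   # receiver acknowledges on acceptance
    return ready and ack

receiver = ReceivingUnit()
print(hand_off(receiver, "instr1"))  # True: task accepted
print(hand_off(receiver, "instr2"))  # False: buffer not yet drained
```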

The AMULET microprocessor is an example of a microprocessor that uses buffered,

asynchronous pipelines.

Unbuffered pipelines

Unbuffered pipelines, called "wave pipelines", do not have registers in-between pipeline

stages. Instead, the delays in the pipeline are "balanced" so that, for each stage, the difference

between the first stabilized output data and the last is minimized. Thus, data flows in "waves"

through the pipeline, and each wave is kept as short (synchronous) as possible.

The maximum rate that data can be fed into a wave pipeline is determined by the maximum

difference in delay between the first piece of data coming out of the pipe and the last piece of

data, for any given wave. If data is fed in faster than this, it is possible for waves of data to

interfere with each other.

Pipelining Developments:-

In order to make processors even faster, various methods of optimizing pipelines have been

devised.

Super pipelining refers to dividing the pipeline into more steps. The more pipe stages there

are, the faster the pipeline is because each stage is then shorter. Ideally, a pipeline with five

stages should be five times faster than a non-pipelined processor (or rather, a pipeline with

one stage). The instructions are executed at the speed at which each stage is completed, and

each stage takes one fifth of the amount of time that the non-pipelined instruction takes.

Thus, a processor with an 8-step pipeline (the MIPS R4000) will be even faster than its 5-step

counterpart. The MIPS R4000 chops its pipeline into more pieces by dividing some steps into

two. Instruction fetching, for example, is now done in two stages rather than one. The stages

are as shown:

1. Instruction Fetch (First Half)

2. Instruction Fetch (Second Half)

3. Register Fetch

4. Instruction Execute

5. Data Cache Access (First Half)

6. Data Cache Access (Second Half)

7. Tag Check

8. Write Back

Superscalar pipelining involves multiple pipelines in parallel. Internal components of the

processor are replicated so it can launch multiple instructions in some or all of its pipeline

stages. The RISC System/6000 has a forked pipeline with different paths for floating-point

and integer instructions. If there is a mixture of both types in a program, the processor can

keep both forks running simultaneously. Both types of instructions share two initial stages

(Instruction Fetch and Instruction Dispatch) before they fork. Often, however, superscalar

pipelining refers to multiple copies of all pipeline stages (In terms of laundry, this would

mean four washers, four dryers, and four people who fold clothes). Many of today's machines

attempt to find two to six instructions that they can execute in every pipeline stage. If some of

the instructions are dependent, however, only the first instruction or instructions are issued.

Dynamic pipelines have the capability to schedule around stalls. A dynamic pipeline is divided into three units: the instruction fetch and decode unit, five to ten execute or functional units, and a commit unit. Each execute unit has reservation stations, which act as buffers and hold the operands and operations.

While the functional units have the freedom to execute out of order, the instruction

fetch/decode and commit units must operate in-order to maintain simple pipeline behavior.

When the instruction is executed and the result is calculated, the commit unit decides when it

is safe to store the result. If a stall occurs, the processor can schedule other instructions to be

executed until the stall is resolved. This, coupled with the efficiency of multiple units

executing instructions simultaneously, makes a dynamic pipeline an attractive alternative.

Conclusion: -

After completing the paper, I conclude that the more pipe stages there are, the faster the pipeline is, because each stage is then shorter. Ideally, a pipeline with five stages should be five times faster than a non-pipelined processor. The super pipelined design is one example in which the number of pipe stages is increased over the previous pipelined structure. This statement is also supported by the example I have described in the paper. However, I also find that the pipelined structure has some disadvantages. Certain problems occur in the pipelined structure, but they can be removed by applying their respective possible solutions.

References:-

http://en.wikipedia.org/wiki/Instruction_pipeline

www.freepatentsonline.com/5333280.html

www.cs.princeton.edu/courses/archive/spr02/cs217/lectures/pipeline.pdf

http://alexandria.tue.nl/extra1/wskrap/publichtml/200612.pdf

www.csee.wvu.edu/~jdm/classes/cs455/notes/tech/instrpipe.html

http://www.wipo.int/pctdb/en/wo.jsp?wo=2004084065

Kai Hwang, “Advanced Computer Architecture: Parallelism, Scalability, Programmability”, McGraw-Hill