Computer Architecture/Inside the Machine

PowerPC Processors: 600 Series, 700 Series, and 7400

Tony Lim 2021. 1. 5. 19:11

The 601 has a classic 4-stage RISC integer pipeline: fetch, decode/dispatch, execute, write-back.

Instruction queue == used mainly for detecting and dealing with branches. If the branch unit has enough information to resolve a branch immediately, the branch is simply deleted from the instruction queue and replaced with the instruction located at the branch target (branch folding).

 

When a branch is taken, the CPU's program counter is set to the argument of the jump instruction, so the next instruction is the one at that address in memory. The flow of control therefore changes.

When a branch is not taken, the CPU's program counter simply advances to the next sequential instruction, so the flow of control is unchanged.

fall-through == act of allowing not-taken branches to fall out of the instruction queue
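The two outcomes above, branch folding and fall-through, can be sketched as operations on a plain list standing in for the instruction queue. This is only an illustration: the function name `resolve_branch`, the string instructions, and the choice to drop the younger sequential instructions when a branch is taken are assumptions, not details from the book.

```python
def resolve_branch(queue, index, taken, target_instr):
    """Return the instruction queue after the branch at `index` is resolved early."""
    older = queue[:index]  # instructions ahead of the branch are untouched
    if taken:
        # Branch folding: the branch is deleted and replaced by the target instruction.
        # Assumption: younger sequential instructions are on the wrong path and are dropped.
        return older + [target_instr]
    # Fall-through: the not-taken branch simply drops out of the queue.
    return older + queue[index + 1:]


queue = ["add r1,r2,r3", "b target", "sub r4,r5,r6"]
print(resolve_branch(queue, 1, True, "mul r7,r8,r9"))   # ['add r1,r2,r3', 'mul r7,r8,r9']
print(resolve_branch(queue, 1, False, "mul r7,r8,r9"))  # ['add r1,r2,r3', 'sub r4,r5,r6']
```

In both cases the branch itself never occupies a slot in the back end.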

 

601's Back End

Integer Unit == fixed-point ALU for integer math including address calculations on the chip. While x86 designs, like  the original 

Load-address calculations, store-address calculations, and load-data operations are all crammed into the 601's single integer ALU.

Multi-cycle integer instructions (integer multiplies and divides) are not fully pipelined.

Floating-Point Unit == 6 stages long: the 4 basic stages plus an extra decode stage and an extra execute stage.

Single-precision and most double-precision (64-bit) floating-point operations are pipelined. The 601's floating-point hardware can turn out 1 instruction per cycle with a two-cycle latency.
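The difference between latency and throughput is easy to see with a little arithmetic. The sketch below assumes a stream of independent instructions with no stalls. The function name and the instruction counts are made up for illustration; only the two-cycle latency and one-per-cycle throughput come from the notes.

```python
def pipelined_cycles(n_instructions, latency, cycles_per_issue=1):
    """Cycles needed to finish n independent instructions in a pipelined unit."""
    if n_instructions == 0:
        return 0
    # The first result appears after `latency` cycles; every later
    # instruction finishes `cycles_per_issue` cycles after the previous one.
    return latency + (n_instructions - 1) * cycles_per_issue


print(pipelined_cycles(1, latency=2))    # 2
print(pipelined_cycles(100, latency=2))  # 101
```

For long runs of independent operations the cost per instruction approaches one cycle, which is why throughput matters more than latency here.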

It can do single-precision fused multiply-add (fmadd) instructions with single-cycle throughput. fmadd is a core digital signal processing and scientific computing function.
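To see why multiply-add is so central to DSP and scientific code, here is a minimal sketch of a dot product written as a chain of multiply-adds. The helper names `fmadd` and `dot` are illustrative. Note that plain Python rounds after the multiply and again after the add, so this does not reproduce the single rounding of a real fused operation.

```python
def fmadd(a, c, b):
    """Multiply-add: a * c + b (not truly fused in this Python sketch)."""
    return a * c + b


def dot(xs, ys):
    """Dot product as one multiply-add per pair of elements."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fmadd(x, y, acc)
    return acc


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Filters, matrix multiplies, and transforms all reduce to loops of this shape, so a unit that retires one multiply-add per cycle speeds up the inner loop directly.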

Because the integer unit handles all of the memory traffic, for floating-point code the IU acts like a dedicated load-store unit (LSU) whose sole purpose is to keep the FPU fed with data. This FPU+LSU combination performs well for 2 reasons:

1. Integer and floating-point code are rarely mixed, so it doesn't matter for performance if the integer unit is tied up with floating-point-related memory traffic.

2. Floating-point code is often data-intensive, with lots of loads and stores, and thus has high levels of memory traffic to keep a dedicated LSU busy.

Sequencer Unit == executes some legacy instructions particular to the older RSC. If the design team had had more time, it would have thrown this unit out.

 

A division instruction sits in IU1's integer pipeline for 19 cycles, and no other instruction can execute in IU1 during that time.
The FPU has 3 pipelines; each number represents the time spent in each stage. Notice that fdiv takes 32 cycles, and during that time no other instruction can enter.
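The cost of a long instruction that blocks its unit can be sketched with a running total. The 19-cycle divide comes from the notes above; the 1-cycle latency for the surrounding adds and the function name are assumptions for illustration.

```python
def finish_cycles(latencies):
    """Cycle at which each instruction finishes in a unit that is not pipelined."""
    finished = []
    cycle = 0
    for latency in latencies:
        # The next instruction cannot start until this one has left the unit.
        cycle += latency
        finished.append(cycle)
    return finished


# add (1 cycle, assumed), divide (19 cycles), add (1 cycle, assumed)
print(finish_cycles([1, 19, 1]))  # [1, 20, 21]
```

The second add is ready to run at cycle 1 but only finishes at cycle 21, because the divide holds the unit for the whole 19 cycles.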

 

The PowerPC603 and 603e

The 603 couldn't perform well on legacy 68K code. The 603e, with a 16KB L1 cache size, was better at legacy code.

603e's Backend

Branches that aren't folded or don't fall through are dispatched from the instruction queue to the branch unit over a dispatch bus that isn't connected to any of the other execution units.

non branch instructions can dispatch at a rate of up to 2 instructions per cycle to the backend.

Load-store unit (LSU) == handles address calculation and executes store-data operations. The integer unit is thereby freed from having to handle memory traffic and can focus solely on integer arithmetic.

System unit == handles updates to the PowerPC condition register. It also contains a limited integer adder, which can take some of the burden off the integer ALU.

The 603's basic floating-point operations have a 3-cycle latency (one-cycle throughput). It can execute 3 instructions every 4 cycles, leaving a pipeline bubble.

It is not fully pipelined for double-precision multiply operations.

Despite these shortcomings, the 603e's strong fmadd performance makes up for them in DSP, scientific, and media applications.

Commit unit == contains a 5-entry completion queue for keeping track of the program order of in-flight instructions.

The 603 would have fared better without Apple's legacy 68K code base. The 603e's larger cache size helped somewhat with the legacy problems, but the updated chips still played second fiddle in Apple's product line to the larger, much more powerful 604.

 

PowerPC 604

The 604's Pipeline and Back end

The 604's pipeline is deeper, with 6 stages: Fetch, Decode, Dispatch (ROB and rename), Execute, Complete, Write-back. This enables higher clock speeds than its predecessors: because each pipeline stage is simpler, it takes less time to complete, which means that the CPU's clock cycle time can be shortened.
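The link between pipeline depth and clock speed can be put into rough numbers. Everything in this sketch is hypothetical: the 12 ns of total logic delay, the 0.5 ns of per-stage latch overhead, and the assumption that work divides evenly across stages are chosen only to make the arithmetic clean, and are not measurements of any PowerPC chip.

```python
def cycle_time_ns(total_logic_ns, stages, latch_overhead_ns):
    """Clock period if the logic is split evenly across `stages` pipeline stages."""
    return total_logic_ns / stages + latch_overhead_ns


def clock_mhz(cycle_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / cycle_ns


for stages in (4, 6):
    period = cycle_time_ns(12.0, stages, 0.5)  # hypothetical delays
    print(stages, "stages:", period, "ns per cycle")

print(clock_mhz(cycle_time_ns(12.0, 6, 0.5)))  # 400.0
```

With these made-up numbers the period drops from 3.5 ns to 2.5 ns when going from 4 to 6 stages. The fixed latch overhead is why the gain is smaller than the 1.5x increase in stage count.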

 

Branch unit(BU) / condition register unit (CRU) , Load-store unit(LSU) , Floating-point unit (FPU) , 2 simple integer units (SIUs) , complex integer unit (CIU)

Any integer instruction that takes only a single cycle to execute can pass through either of the two SIUs, but multi-cycle instructions, like integer divides, go to the CIU.

Register renaming == a 12-entry register rename file is attached to the 32-entry general-purpose register file. This allows the 604 to avoid false dependencies and register-related stalls.
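A small sketch shows how renaming removes false dependencies. The instruction format `(dest, src1, src2)`, the `rr0`, `rr1`, ... names, and the function itself are illustrative assumptions. The sketch also never frees a rename register, whereas real hardware releases one when its instruction completes.

```python
def rename(instructions, num_rename=12):
    """Rewrite (dest, src...) tuples so every result gets its own rename register."""
    mapping = {}  # architectural register -> rename register holding its newest value
    free = [f"rr{i}" for i in range(num_rename)]
    renamed = []
    for dest, *srcs in instructions:
        # Sources read the newest in-flight value if one exists.
        new_srcs = [mapping.get(s, s) for s in srcs]
        if not free:
            raise RuntimeError("no rename register available: dispatch would stall")
        tag = free.pop(0)
        mapping[dest] = tag
        renamed.append((tag, *new_srcs))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r5"),  # r4 = r1 op r5  (true dependency on the first r1)
    ("r1", "r6", "r7"),  # r1 = r6 op r7  (reuses r1: a false dependency)
]
print(rename(program))
# [('rr0', 'r2', 'r3'), ('rr1', 'rr0', 'r5'), ('rr2', 'r6', 'r7')]
```

After renaming, the third instruction writes `rr2` instead of `r1`, so it no longer has to wait for the first two. The true dependency of the second instruction on `rr0` is preserved.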

Unlike the 603e, the 604's floating-point unit is fully pipelined for double-precision multiplies.

The 604's branch unit also features a dynamic branch prediction scheme that's a vast improvement over the 603e's static branch predictor: a larger branch history table (BHT) and a 64-entry branch target address cache (BTAC), which is the same thing as a BTB.

The condition register unit (CRU) shares a dispatch bus and some other resources with the branch execution unit, so it's not a fully independent unit. Later, on the 604e, there is a separate unit in the backend that handles CR logical operations.

The 604's Frontend and instruction window

The 604's dispatch logic can dispatch up to 4 instructions per cycle from the bottom 4 entries of the instruction queue to the backend's execution units.

Note that the 604 can dispatch at most 1 instruction to each execution unit per cycle. An instruction can't dispatch if the execution unit that it needs is not available.
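The dispatch limits described here can be sketched as a loop over the oldest four queue entries. The `(instruction, unit)` pairs, the unit names, and the idea that each instruction is already tagged with one specific unit are simplifying assumptions. The early `break` relies on the in-order dispatch rule from these notes.

```python
def dispatch_cycle(queue, busy_units=(), width=4):
    """Return the instructions that dispatch this cycle.

    queue: list of (instruction, unit) pairs, oldest first.
    """
    claimed = set(busy_units)
    dispatched = []
    for instr, unit in queue[:width]:
        if unit in claimed:
            # Unit unavailable: this instruction waits, and because dispatch
            # is in program order, so does everything behind it.
            break
        claimed.add(unit)  # at most one instruction per unit per cycle
        dispatched.append(instr)
    return dispatched


queue = [("add", "SIU1"), ("lwz", "LSU"), ("sub", "SIU1"), ("fadd", "FPU")]
print(dispatch_cycle(queue))  # ['add', 'lwz']
```

Here `sub` needs a unit that `add` already claimed, so it stalls, and `fadd` waits behind it even though the FPU is free.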

reservation stations == they allow the instructions assigned to one execution unit to issue out of program order with respect to the instructions that are assigned to the other execution units.

 

The 4 rules of instruction dispatch

The inorder dispatch rule == instructions dispatch from the instruction queue in program order.

The issue buffer / execution unit availability rule 

1. reservation station must have entry available.

2. If an instruction doesn't need to go to a reservation station because its inputs are available at the time of dispatch, the required execution unit must have a pipeline slot available, and the unit's reservation station must be empty (no older instructions waiting to execute) before the instruction can be sent to the execution unit.

The completion buffer availability rule == for an instruction to dispatch , there must be space available in the completion queue so that a new entry can be created for the instruction. 

The rename register availability rule == there must be enough rename registers available to temporarily store the results for each register that the instruction will modify.
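The four rules above can be collected into a single check. This is a sketch under stated assumptions: the `UnitState` fields and all parameter names are invented for illustration, and rule 2 is read literally (an instruction with ready inputs needs a free pipeline slot and an empty reservation station, otherwise it needs a free reservation station entry), which may not match every detail of the real hardware.

```python
from dataclasses import dataclass


@dataclass
class UnitState:
    free_rs_entries: int      # open reservation station entries
    waiting_in_rs: int        # older instructions still waiting in the station
    pipeline_slot_free: bool  # can the unit accept an instruction this cycle?


def can_dispatch(is_next_in_order, inputs_ready, unit,
                 free_completion_entries, free_rename_regs, regs_written):
    # Rule 1: in-order dispatch.
    if not is_next_in_order:
        return False
    # Rule 2: issue buffer / execution unit availability.
    if inputs_ready:
        # Going straight to the unit: needs a slot and an empty station.
        if not (unit.pipeline_slot_free and unit.waiting_in_rs == 0):
            return False
    elif unit.free_rs_entries == 0:
        return False
    # Rule 3: completion buffer availability.
    if free_completion_entries == 0:
        return False
    # Rule 4: rename register availability.
    return free_rename_regs >= regs_written


idle_unit = UnitState(free_rs_entries=2, waiting_in_rs=0, pipeline_slot_free=True)
print(can_dispatch(True, True, idle_unit, 3, 4, 1))  # True
print(can_dispatch(True, True, idle_unit, 0, 4, 1))  # False: completion queue is full
```

An instruction dispatches only when all four checks pass in the same cycle, so any one full structure is enough to stall the front end.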

 

The PowerPC 604e

With the 604e, condition register logical operations got an execution unit of their own (the CRU, condition register unit).

The PowerPC 750 (G3)

Instead of storing only the target addresses of recently taken branches, as a BTB does, the 750's 64-entry branch target instruction cache (BTIC) stores the instruction that is located at the branch's target address.
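The difference between the two structures can be sketched with dictionaries. The addresses, the instruction string, and the function names are hypothetical, and a real BTIC is a small hardware cache rather than a lookup table like this.

```python
# Hypothetical contents: a branch at 0x100 whose target is 0x200.
memory = {0x200: "add r1,r2,r3"}

btb = {0x100: 0x200}             # BTB/BTAC: branch address -> target address
btic = {0x100: "add r1,r2,r3"}   # BTIC: branch address -> target instruction


def next_instr_with_btb(branch_addr):
    target_addr = btb[branch_addr]  # step 1: look up the target address
    return memory[target_addr]      # step 2: still have to fetch the instruction


def next_instr_with_btic(branch_addr):
    return btic[branch_addr]        # one step: the instruction is already cached


print(next_instr_with_btb(0x100))   # add r1,r2,r3
print(next_instr_with_btic(0x100))  # add r1,r2,r3
```

Both paths end with the same instruction, but the BTIC skips the separate fetch from the target address.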

The PowerPC 7400 (G4)

The G4's FPU is a full-blown double-precision FPU, and it does single- and double-precision floating-point operations, including multiply and multiply-add, in 3 fully pipelined cycles.

SIMD == single instruction, multiple data

vector ALU (VALU) == performs vector arithmetic and logical operations.

vector permute unit (VPU) == performs permute and shift operations on vectors.
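The kinds of work done by the two vector units can be sketched with plain lists. The four-element vectors and the function names are illustrative assumptions, and `vec_permute` is a simplified single-source reordering rather than a model of the actual permute instruction.

```python
def vec_add(a, b):
    """VALU-style operation: one 'instruction' adds every pair of elements."""
    return [x + y for x, y in zip(a, b)]


def vec_permute(v, order):
    """VPU-style operation: rearrange elements according to an index pattern."""
    return [v[i] for i in order]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(vec_add(a, b))                  # [11, 22, 33, 44]
print(vec_permute(a, [3, 2, 1, 0]))   # [4, 3, 2, 1]
```

In hardware each of these calls would be a single instruction operating on all elements at once, which is the point of SIMD.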

 
