Computer Architecture/Inside the Machine

The Intel Pentium and Pentium Pro

Tony Lim 2020. 12. 31. 22:24

 

The Pentium's pipelines share 4 stages in common: Fetch, Decode-1, Decode-2, and Write-back.

When an instruction reaches the execute phase, it enters a more specialized pipeline. The depth of that pipeline can differ from one execution unit to another.

 

The Pentium's basic integer pipeline has 5 stages:

1. Prefetch/Fetch == instructions are fetched from the instruction cache and aligned in prefetch buffers for decoding.

2. Decode-1 == instructions are decoded based on a set of hardware-based rules. Branch prediction happens here.

3. Decode-2 == instructions that require the microcode ROM are decoded here. Address computations also happen here.

4. Execute == the integer hardware ALU executes the instruction.

5. Write-back == the result is written to the register file.
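The five stages above can be sketched as a cycle-by-cycle schedule. This is a minimal Python sketch that assumes an ideal pipeline with no stalls and one instruction entering per cycle; the function names and the instruction labels ("i1", "i2", "i3") are made up for illustration.

```python
STAGES = ["Fetch", "Decode-1", "Decode-2", "Execute", "Write-back"]

def pipeline_schedule(instructions):
    """Return {cycle: {stage: instruction}} for an ideal, stall-free pipeline."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            cycle = i + s + 1  # instruction i enters Fetch on cycle i + 1
            schedule.setdefault(cycle, {})[stage] = instr
    return schedule

def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to push n instructions through an ideal pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one
    # finishes exactly one cycle after its predecessor.
    return n_stages + n_instructions - 1

sched = pipeline_schedule(["i1", "i2", "i3"])
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

Once the pipeline is full, one instruction completes per cycle, which is why three instructions need 7 cycles here rather than 15.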

 

Branch Prediction

speculative execution == used to keep the delays associated with evaluating branches from introducing bubbles into the pipeline.

If the prediction is wrong, the pipeline needs to be flushed, and it takes a long time to fill it again.

static branch prediction == simple; relies on the assumption that a backward-pointing branch belongs to a loop, so it is predicted as taken.

dynamic branch prediction == based on the program's past behavior. Uses 2 tables:

branch history table (BHT) == holds an entry for each conditional branch that the branch unit (BU) has encountered on its last few cycles. Each entry also includes some bits that indicate the likelihood that the branch will be taken, based on its past history.

branch target buffer (BTB) == stores the branch target.

If a branch has no entry in the BHT, static branch prediction is used instead.
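The interplay of the two tables and the static fallback can be sketched in Python. This is a toy model, not the Pentium's actual design: the 2-bit saturating counter is an assumption (the notes only say "some bits"), and the class name, method names, and addresses are hypothetical.

```python
class BranchPredictor:
    """Toy predictor: BHT of 2-bit counters, a BTB, and a static fallback."""

    def __init__(self):
        self.bht = {}  # branch address -> counter 0..3 (>= 2 means "taken")
        self.btb = {}  # branch address -> last taken target address

    def predict(self, branch_addr, static_target):
        if branch_addr not in self.bht:
            # Static fallback: a target below the branch address is a
            # backward branch, assumed to be a loop and predicted taken.
            return static_target < branch_addr
        return self.bht[branch_addr] >= 2

    def update(self, branch_addr, taken, target):
        if branch_addr not in self.bht:
            # First encounter: start weakly taken or weakly not taken.
            self.bht[branch_addr] = 2 if taken else 1
        elif taken:
            self.bht[branch_addr] = min(3, self.bht[branch_addr] + 1)
        else:
            self.bht[branch_addr] = max(0, self.bht[branch_addr] - 1)
        if taken:
            self.btb[branch_addr] = target

predictor = BranchPredictor()
print(predictor.predict(0x100, 0x80))   # no BHT entry yet: static rule
predictor.update(0x100, True, 0x80)
print(predictor.predict(0x100, 0x80))   # now answered from the BHT
```

With a saturating counter, a single not-taken outcome (such as a loop exit) does not immediately flip a strongly taken prediction.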

 

The Floating-Point ALU

Any register in the floating-point register stack can be accessed, not just the top.

To operate on 2 floating-point numbers, one of the numbers must be at the stack top and the other can be in any of the other registers.

fadd ST, ST(5)  ==  ST = ST + ST(5)

The compiler alone couldn't overcome the two-operand limit and the stack-based limit.

So a microarchitectural hack comes in: "fxch" is almost free, and what it does is swap any element of the stack with the stack top.
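The stack-top restriction and the role of fxch can be modeled with a small Python sketch. The class and its register values are hypothetical; it only mimics the two operations described above and ignores pushes, pops, and everything else the real floating-point unit does.

```python
class X87Stack:
    """Toy model of the floating-point register stack."""

    def __init__(self, values):
        # regs[0] is ST (the stack top); regs[i] is ST(i).
        self.regs = list(values)

    def fadd(self, i):
        """fadd ST, ST(i): ST = ST + ST(i). The destination is always ST."""
        self.regs[0] = self.regs[0] + self.regs[i]

    def fxch(self, i):
        """fxch ST(i): swap the stack top with ST(i)."""
        self.regs[0], self.regs[i] = self.regs[i], self.regs[0]

fpu = X87Stack([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
fpu.fadd(5)   # ST = ST + ST(5)
fpu.fxch(2)   # move the old ST(2) to the top so it can be operated on
fpu.fadd(1)   # ST = ST + ST(1), now using the value that was in ST(2)
print(fpu.regs)
```

Because every arithmetic result lands in ST, a cheap swap is what lets code treat the stack almost like a flat register file.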

 

The Pentium couldn't perform as well as it might have because 30% of its transistors were spent on legacy support.

 

The Intel P6 Microarchitecture: The Pentium Pro

The original Pentium's static execution has 2 major drawbacks:

1. It adapts poorly to the dynamic and ever-changing code stream.

2. It makes poor use of wider superscalar hardware.

The makeup of the code stream changes from application to application, but the rules for scheduling execution on the Pentium's back end are fixed.

 

 

Unlike the original Pentium, the Pentium Pro has a reorder buffer. When the buffer is adequately full, the dynamic scheduling logic can be put to use.

Issue phase == instructions wait in a buffer for a moment and then go to the execute phase, possibly out of program order. The buffer can also eliminate bubbles coming from the front end, but only if the front end is faster than the back end.

Completion phase == instructions that have finished executing wait in a second buffer to have their results written back to the register file in program order.

Reservation station == where newly decoded instructions go. Each one waits there until all of its execution requirements are met. Thanks to this buffer, 3 instructions can be dispatched per cycle, and up to 5 is also possible, so dispatch is flexible.

Reorder buffer == its job is to ensure that the finished instructions get put back in program order.
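How the reservation station and the reorder buffer cooperate can be sketched as a tiny Python simulation. It is a simplification under stated assumptions: unlimited execution units, made-up latencies, and hypothetical instruction names. It shows only that instructions may finish out of program order while still retiring in program order.

```python
def simulate(program):
    """program: list of (name, latency, deps) in program order.

    deps must name earlier instructions in the same program.
    Returns (finish_order, retire_order).
    """
    waiting = list(program)                 # reservation station contents
    rob = [name for name, _, _ in program]  # reorder buffer, program order
    executing = {}                          # name -> cycle when it finishes
    finished = set()
    finish_order, retire_order = [], []
    cycle = 0
    while rob:
        cycle += 1
        # Completion: collect instructions whose latency has elapsed.
        for name in [n for n, c in executing.items() if c == cycle]:
            del executing[name]
            finished.add(name)
            finish_order.append(name)
        # Retirement: only the oldest instruction may leave the buffer.
        while rob and rob[0] in finished:
            retire_order.append(rob.pop(0))
        # Dispatch: any instruction whose inputs are ready may start,
        # regardless of its position in program order.
        still_waiting = []
        for name, latency, deps in waiting:
            if all(d in finished for d in deps):
                executing[name] = cycle + latency
            else:
                still_waiting.append((name, latency, deps))
        waiting = still_waiting
    return finish_order, retire_order

program = [
    ("load", 3, []),        # slow instruction
    ("add", 1, ["load"]),   # must wait for the load
    ("mul", 1, []),         # independent, can run ahead
]
finish_order, retire_order = simulate(program)
print("finished:", finish_order)
print("retired: ", retire_order)
```

The independent "mul" finishes first, but it leaves the reorder buffer last because the two older instructions must retire before it.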

 

The instruction window == the processor's ability to see some way ahead in the code stream.

The P6's long pipeline has 2 effects:

1. Since each of the stages is shorter and simpler and can be completed quicker, Intel can crank up the processor's clock speed.

2. It allows the processor to hide hiccups in the fetch and decode stages. On the downside, if something goes wrong, the whole thing needs to be flushed.

Up to five instructions per cycle can pass from the reservation station through the issue ports and into the execution units.

The use of register-to-memory and memory-to-memory instruction formats lets the programmer focus more on coding, because the processor takes care of scheduling memory traffic.

In the case of a RISC ISA, the compiler rather than the processor takes care of memory accesses and other types of code, so the processor can be faster and more efficient.

The P6's three decoders are capable of producing up to six decoded micro-ops per cycle.

The P6's complex decoder works in conjunction with the microcode ROM to handle the really complex legacy instructions, which are translated into sequences of micro-ops that are read directly from the ROM.
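The six-micro-ops figure can be illustrated with a rough Python sketch of how instructions might be grouped per decode cycle. It assumes a split that is not stated in these notes: one complex decoder that emits up to 4 micro-ops and two simple decoders that emit 1 micro-op each. Instructions needing more than 4 micro-ops (the microcode ROM case) are not modeled, and the function name is hypothetical.

```python
def decode_cycles(uop_counts):
    """Group instructions into decode cycles.

    uop_counts: micro-ops per instruction, in program order (each 1..4).
    Returns a list of per-cycle groups of micro-op counts.
    """
    cycles = []
    i = 0
    while i < len(uop_counts):
        if not 1 <= uop_counts[i] <= 4:
            raise ValueError("only instructions of 1..4 micro-ops are modeled")
        # The complex decoder takes the oldest undecoded instruction.
        group = [uop_counts[i]]
        i += 1
        # Each simple decoder takes the next instruction only if it
        # decodes to a single micro-op; otherwise decoding stops for
        # this cycle and that instruction waits for the complex decoder.
        while len(group) < 3 and i < len(uop_counts) and uop_counts[i] == 1:
            group.append(uop_counts[i])
            i += 1
        cycles.append(group)
    return cycles

print(decode_cycles([4, 1, 1]))     # best case: 4 + 1 + 1 micro-ops in one cycle
print(decode_cycles([1, 2, 1, 1]))  # a 2-micro-op instruction must wait a cycle
```

Under these assumptions the peak of six micro-ops per cycle is reached only when a complex instruction is followed by two simple ones.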