Pentium 4's speed addiction
The P4 had 20 stages in its basic pipeline, which let it run at a relatively high clock frequency. But there are limits to how deeply you can pipeline an architecture before you reach a point of diminishing returns.
1. a deeper pipeline means an increase in per-instruction execution latency
2. a processor's clock speed must increase in proportion to its pipeline depth, or the deeper pipeline actually hurts
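The tradeoff behind these two points can be sketched numerically. This is a minimal, hypothetical model (the stage counts echo the G4/P4, but the latency numbers are made up for illustration): deeper pipelining shortens the cycle time, but a branch misprediction flushes the whole pipeline, so the flush penalty in absolute time eventually grows with depth.

```python
# Illustrative model of the pipelining tradeoff.
# The latency numbers are hypothetical, not measured P4/G4e figures.

def misprediction_penalty_ns(stages, base_work_ns=10.0, latch_overhead_ns=0.1):
    """A mispredicted branch flushes the pipeline, costing ~`stages` cycles.

    base_work_ns: total logic delay of the unpipelined datapath.
    latch_overhead_ns: fixed latch/clock-skew cost added per stage.
    """
    cycle_time = base_work_ns / stages + latch_overhead_ns
    return stages * cycle_time  # flush penalty in nanoseconds

# A shallow (G4-like, 4-stage) vs. a deep (P4-like, 20-stage) pipeline:
shallow = misprediction_penalty_ns(4)
deep = misprediction_penalty_ns(20)
print(shallow, deep)  # the deep pipeline's flush costs more absolute time
```

The per-stage latch overhead is why the penalty stops shrinking: past a certain depth, each extra stage adds a full cycle of flush cost while barely reducing the cycle time.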
Regardless of the drawbacks of the NetBurst architecture (high power consumption, and a reliance on steady clock-speed increases for performance gains) and its long-term prospects, it has been successful from both commercial and performance standpoints.
The general approaches and design philosophies of the P4 and G4e
The G4e's approach to performance can be summarized as "wide and shallow": by adding more functional units to the backend, the G4e focuses on getting a small number of instructions onto the chip at once, spreading them out widely to execute in parallel, and then getting them off the chip in as few cycles as possible.
The P4 takes a "narrow and deep" approach. Instead of spreading instructions out widely to execute in parallel, it pushes them through its narrower backend at a higher rate. Its instruction window is very large, so the processor can have many more instructions on chip for the out-of-order (OOO) execution logic to examine for dependencies and then rearrange to be rapidly fired to the execution units.
G4e's architecture and pipeline
The G4e breaks the G4's 4-stage pipeline into seven, shorter stages.
fetch == grab an instruction from the L1 cache.
decode/dispatch == the fetched instruction enters a 12-entry instruction queue to be decoded, then is dispatched to the proper issue queue.
issue == eliminates the potential dispatch stall condition by placing a set of buffers (the issue queues) between the dispatch stage and the reservation stations. Even when the execution units are busy, the G4e's dispatcher can keep dispatching instructions and clearing the instruction queue.
complete/write-back == instructions enter the completion queue to be put back into program order, and their results are written back to the register file.
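The complete/write-back stage's reordering can be sketched with a tiny completion-queue model (the instruction records here are hypothetical): instructions may finish executing out of order, but they retire strictly in program order.

```python
# Toy completion-queue model: instructions can *finish* out of order,
# but results are written back strictly in program order.
# Instruction names are made up for illustration.

def retire_in_order(finished, program_order):
    """Return the prefix of program_order whose results can retire now."""
    retired = []
    for instr in program_order:
        if instr not in finished:
            break                 # oldest instruction not done: retirement stalls
        retired.append(instr)     # safe to write back to the register file
    return retired

program_order = ["i1", "i2", "i3", "i4"]
finished = {"i1", "i3"}           # i3 finished early, out of order
print(retire_in_order(finished, program_order))  # only ['i1'] can retire
```

Even though i3 is done, it must wait behind i2; this is what keeps speculative, out-of-order execution invisible to the programmer.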
Because the G4e and P4 have longer pipelines, they lean more heavily on branch prediction: a misprediction costs them many more wasted cycles.
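As a concrete illustration of what the branch prediction hardware does, here is the classic two-bit saturating-counter predictor (a generic textbook scheme, not the actual P4 or G4e predictor):

```python
# Two-bit saturating-counter branch predictor.
# A generic textbook scheme, not the actual P4/G4e hardware.

class TwoBitPredictor:
    def __init__(self):
        # States 0,1 = predict not-taken; states 2,3 = predict taken.
        self.state = 0

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Saturate at 0 and 3, so a single anomalous outcome
        # can't immediately flip a strong prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, True]:  # branch taken 3 times: "strongly taken"
    p.update(outcome)
p.update(False)                     # one anomalous not-taken iteration
print(p.predict())                  # True: still predicts taken
```

The saturation is the whole point: a loop branch that is almost always taken stays predicted-taken across the one iteration where the loop exits.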
Pentium 4's architecture
The L1 instruction cache (the trace cache) sits in the frontend.
The reservation stations' decoupling of the frontend from the backend wasn't enough, since the P4 has a much longer pipeline: performance suffered whenever the frontend could not keep up with the backend's speed.
The Trace Cache
Modern x86 chips convert complex x86 instructions into simple internal instructions (micro-operations, or micro-ops).
The older P6 fetched instructions from the L1 cache and translated them into micro-ops before passing them to the reservation stations.
The P4 fetches instructions from the L2 cache, decodes them into strings of micro-ops called traces, and then places these traces into the L1 instruction cache. Since the traces are already decoded, only a fetch is needed when they are executed again.
Only when there's a trace cache miss does the top part of the frontend kick in, to fetch and decode instructions from the L2 cache and build a new trace segment.
When building a trace segment, the branch prediction hardware splices code from the branch that it speculates the program will take right into the trace, behind the code that it knows the program will take.
Doing speculative fetching through the trace cache gives two advantages:
1. On a normal machine, the BPU takes at least one cycle to make a prediction, which can end up as a bubble in the pipeline. With a trace, the predicted path is already spliced in.
2. A normal machine stops fetching when it meets a branch instruction, wasting the rest of the fetched line. Trace cache lines can contain both branch instructions and the speculative code that follows them.
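The splicing idea can be sketched with a toy model (entirely hypothetical: the real trace cache stores decoded micro-ops and has many more rules, but the addresses, block contents, and predictor below are made up for illustration):

```python
# Toy sketch of trace-segment building across predicted branches.
# Hypothetical program: basic blocks keyed by address, each ending in a
# branch with a taken and a not-taken target (None = no branch).
blocks = {
    0x00: (["load", "add", "branch"], {"taken": 0x40, "not_taken": 0x80}),
    0x40: (["mul", "store", "branch"], {"taken": 0x00, "not_taken": 0x80}),
    0x80: (["sub", "ret"], None),
}

def build_trace(start, predict_taken, max_blocks=3):
    """Splice predicted-path blocks into one contiguous trace segment."""
    trace, addr = [], start
    for _ in range(max_blocks):
        ops, targets = blocks[addr]
        trace.extend(ops)             # code we *know* runs up to the branch
        if targets is None:
            break
        # Speculatively follow the predicted direction and splice the target
        # block right behind the branch -- no separate fetch, no bubble.
        addr = targets["taken"] if predict_taken(addr) else targets["not_taken"]
    return trace

# Predicting every branch taken splices 0x00 -> 0x40 -> 0x00 into one trace:
trace = build_trace(0x00, predict_taken=lambda addr: True)
print(trace)
```

Note how both of the advantages above show up: the prediction happens once at build time, and the trace freely mixes branches with the speculative code behind them.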
The P4 has a microcode ROM for long x86 instructions. Whenever the trace cache is building a trace segment and encounters one of these long instructions, it inserts a tag into the trace segment instead of the instruction's micro-ops. Later, when the trace cache is streaming instructions out to the backend and encounters the tag, it passes control to the microcode ROM, which spits out the appropriate sequence of micro-ops into the stream. To the backend, the stream looks seamless.
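The tag mechanism can be sketched as a simple expansion pass (the instruction names and micro-op sequences here are made up for illustration):

```python
# Sketch of microcode-ROM tag expansion during trace streaming.
# Instruction names and micro-op sequences are hypothetical.

# Long x86 instructions are stored in the trace as a ("ucode", key) tag
# instead of their (possibly very long) micro-op sequences.
MICROCODE_ROM = {
    "rep_movs": ["load", "store", "dec_count", "loop_check"],
}

trace_segment = ["add", ("ucode", "rep_movs"), "sub"]

def stream_to_backend(trace):
    """Expand microcode tags in-line; the backend sees one seamless stream."""
    out = []
    for entry in trace:
        if isinstance(entry, tuple) and entry[0] == "ucode":
            out.extend(MICROCODE_ROM[entry[1]])  # ROM supplies the micro-ops
        else:
            out.append(entry)
    return out

print(stream_to_backend(trace_segment))
```

Storing a short tag rather than the full micro-op sequence keeps rare, long instructions from crowding useful traces out of the cache.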
Pentium 4's pipeline
stage 5 Drive == needed because the P4 runs so fast that signals take a full cycle to travel across the chip. At the end of the first five stages, the P4's trace cache sends up to three micro-ops per cycle into the micro-op queue, which smooths out the flow of instructions to the backend by squeezing out any fetch- or decode-related bubbles.
stage 9 Queue == there are two queues, a memory micro-op queue and an arithmetic micro-op queue. Each instruction passes into and out of its micro-op queue in program order with respect to the other instructions in its own queue, but it can exit the bottom of its queue out of program order with respect to instructions in the other queue.
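The stage 9 behavior can be modeled with two independent FIFOs (the micro-op names are made up): strictly in-order within each queue, but out of order across queues.

```python
# Toy model of stage 9's two micro-op queues (made-up micro-ops).
from collections import deque

mem_queue = deque()    # memory micro-op queue: FIFO, in-order internally
arith_queue = deque()  # arithmetic micro-op queue: FIFO, in-order internally

program = [("arith", "add1"), ("mem", "load1"), ("arith", "add2"), ("mem", "load2")]
for kind, uop in program:
    (mem_queue if kind == "mem" else arith_queue).append(uop)

# Suppose the memory scheduler drains faster this cycle: both loads leave
# before add2 -- out of program order *across* the two queues...
issued = [mem_queue.popleft(), mem_queue.popleft(),
          arith_queue.popleft(), arith_queue.popleft()]
# ...but each queue itself stays FIFO: load1 before load2, add1 before add2.
print(issued)
```

This lets slow memory operations and fast arithmetic operations flow to their schedulers at different rates without blocking each other.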
stage 10~12 Schedule == micro-ops are scheduled for execution, and the schedulers determine which ones should be passed on. There are four: the memory scheduler, the fast IU scheduler, the slow IU/general FPU scheduler, and the simple FP scheduler. These schedulers feed micro-ops through the four dispatch ports in the next stage.
stage 13,14 Issue == of the four ports, two are memory ports and the other two are execution ports. Six micro-ops per cycle can move through the four ports, because the two execution ports are double-speed.
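The six-per-cycle figure is simple arithmetic: two single-speed memory ports plus two double-speed execution ports.

```python
# Peak dispatch bandwidth through the P4's four issue ports.
memory_ports = 2     # each issues at most 1 micro-op per clock cycle
execution_ports = 2  # double-speed: each issues 2 micro-ops per clock cycle

peak_uops_per_cycle = memory_ports * 1 + execution_ports * 2
print(peak_uops_per_cycle)  # 6
```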