Computer Architecture/Inside the Machine

Intel's Pentium 4 vs Motorola's G4e: The backend

Tony Lim 2021. 1. 17. 13:31

The Integer Execution Units

simple / fast integer instructions == instructions like add and sub that require very few steps. These make up the major portion of an average program.

complex / slow integer instructions == multiply and divide. These make up a small portion compared to the simple ones.

 

The G4e's IUs : Making the common case fast

3 simple / fast integer execution units - SIUa, SIUb, SIUc == each of the 3 fast IUs is fed by a single-entry reservation station.

1 complex / slow integer execution unit - CIU == handles complex integer instructions, including condition register logical operations.

By focusing on the common case, the G4e was able to achieve more speed. However, the extra "finish" stage on the CIU adds an extra cycle of latency.
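A rough way to see why the common case matters is to weight each instruction class by how often it occurs. The sketch below is only illustrative: the 90/10 instruction mix and the 1-cycle and 4-cycle latencies are made-up numbers, not G4e measurements, and `average_latency` is a hypothetical helper.

```python
def average_latency(mix):
    """Weighted average latency in cycles.

    mix is a list of (fraction_of_instructions, latency_in_cycles) pairs.
    """
    return sum(fraction * latency for fraction, latency in mix)


# Hypothetical mix: 90% simple ops at 1 cycle, 10% complex ops at 4 cycles.
baseline = average_latency([(0.9, 1.0), (0.1, 4.0)])

# Halve the latency of the common (simple) case only.
faster_simple = average_latency([(0.9, 0.5), (0.1, 4.0)])

# Halve the latency of the rare (complex) case only.
faster_complex = average_latency([(0.9, 1.0), (0.1, 2.0)])

print(baseline, faster_simple, faster_complex)
```

With these assumed numbers, speeding up the simple instructions lowers the average more than speeding up the complex ones by the same factor.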

 

The Pentium 4's IUs : Making the common case twice as fast

2 simple / fast integer execution units == run at twice the core clock speed. However, because of the "narrow and deep" philosophy discussed earlier, branch mispredictions combined with cache latencies can kill integer performance.

Intel sold a version of the Pentium 4 called Xeon, which had a larger cache, because the server market needed more integer performance.
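The cost of a misprediction grows with pipeline depth, which the usual back-of-the-envelope formula makes visible. In this sketch the base CPI, branch fraction, misprediction rate and flush penalties are all assumed values chosen for illustration, and `effective_cpi` is a hypothetical helper rather than anything from the book.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction once misprediction flushes are included.

    Every mispredicted branch is assumed to cost `flush_penalty` extra cycles.
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Same assumed workload on two hypothetical pipelines that differ only in
# how many cycles a misprediction throws away.
shallow = effective_cpi(base_cpi=0.5, branch_fraction=0.2,
                        mispredict_rate=0.05, flush_penalty=7)
deep = effective_cpi(base_cpi=0.5, branch_fraction=0.2,
                     mispredict_rate=0.05, flush_penalty=20)

print(shallow, deep)
```

The deeper pipeline pays more per mispredicted branch, so it needs better prediction (or a higher clock) to come out ahead.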

 

The Floating-Point Units (FPUs)

Extremely floating-point-intensive applications were emerging.

Branches in floating-point code are few and extremely predictable.

Integer programs need good branch prediction and caching to keep the IUs fed with instructions.

Floating-point programs need good memory bandwidth to keep the FPUs fed with data.

2 features that differ from integer code

1. excellent locality of reference with respect to the instruction cache (small loops that fit in one of the processor's caches)
2. operates on large data files that are streamed from main memory (hence the need for large bandwidth)
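The bandwidth point can be put as a simple bound: a streaming loop cannot run faster than memory can deliver its operands. The sketch below assumes one 8-byte double is fetched per floating-point operation; the peak rate and bandwidth figures are invented for illustration, and `sustained_flops` is a hypothetical helper.

```python
def sustained_flops(peak_flops, bandwidth_bytes_per_s, bytes_per_flop):
    """Upper bound on the floating-point rate of a streaming loop.

    The loop is limited either by the FPU itself or by how fast memory
    can supply operands, whichever is lower.
    """
    memory_bound = bandwidth_bytes_per_s / bytes_per_flop
    return min(peak_flops, memory_bound)


# Assumed machine: 1 GFLOP/s peak, 2 GB/s memory bandwidth, 8 bytes per op.
rate = sustained_flops(peak_flops=1e9, bandwidth_bytes_per_s=2e9, bytes_per_flop=8)
print(rate)
```

Under these assumptions the FPU would sit idle most of the time, which is why bandwidth matters more than branch prediction for this kind of code.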

 

G4e's FPU

It is not fully pipelined: it can't have 5 different instructions in 5 different stages of execution simultaneously.

The G4e has the cleaner PPC floating-point specification, which has 4 operand formats.
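To get a feel for what "not fully pipelined" costs, the following toy simulation compares a 5-stage pipeline that may hold 5 instructions at once with one capped at 4. The cap of 4 is one reading of the sentence above and is an assumption, as is the idea that every instruction spends exactly one cycle per stage; `cycles_to_finish` is a hypothetical helper.

```python
def cycles_to_finish(n_instructions, stages=5, max_in_flight=5):
    """Cycles needed to push n independent instructions through the pipeline.

    At most one instruction issues per cycle, and only while fewer than
    `max_in_flight` instructions occupy the pipeline.
    """
    in_flight = []  # current stage (1-based) of each instruction in the pipe
    issued = 0
    completed = 0
    cycles = 0
    while completed < n_instructions:
        cycles += 1
        # Issue a new instruction into stage 1 if the pipeline has room.
        if issued < n_instructions and len(in_flight) < max_in_flight:
            in_flight.append(1)
            issued += 1
        # End of cycle: instructions in the last stage complete,
        # everything else advances by one stage.
        completed += in_flight.count(stages)
        in_flight = [stage + 1 for stage in in_flight if stage < stages]
    return cycles


fully_pipelined = cycles_to_finish(20, stages=5, max_in_flight=5)
partially_pipelined = cycles_to_finish(20, stages=5, max_in_flight=4)
print(fully_pipelined, partially_pipelined)
```

In this model the capped pipeline stalls issue once every fifth cycle, so the same stream of instructions takes longer to drain.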

 

Pentium 4's FPU

2 independent FPU pipelines, one of which is strictly for memory operations and the other for arithmetic operations.

The Pentium 4's clock speed is much higher than the G4e's, which compensates for the drawbacks of x87.

 

Although the G4e has fine floating-point hardware, the influence of the PPC ISA (multiple operands) is much larger.

The Pentium 4 had better hardware, but its x87 legacy was a weakness.

 

The vector execution units

Applying a single instruction to multiple data elements at once is quite effective in image processing and 3D rendering.

 

The AltiVec instruction set

Each 128-bit AltiVec vector can hold one of the following:

16 elements, each element is an 8-bit signed or unsigned integer or an 8-bit character

8 elements, each element is a 16-bit signed or unsigned integer

4 elements, each element is either a 32-bit signed or unsigned integer, or a single-precision (32-bit) IEEE floating-point number
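The three layouts are just different ways of slicing the same 16 bytes. As a small sketch, Python's `struct` module can reinterpret one 16-byte value in each format; the byte values are arbitrary, big-endian order is assumed, and `vector_views` is a hypothetical helper.

```python
import struct


def vector_views(raw):
    """Reinterpret 16 raw bytes (one 128-bit vector) in each element format."""
    if len(raw) != 16:
        raise ValueError("a 128-bit vector is exactly 16 bytes")
    return {
        "16 x 8-bit unsigned": struct.unpack(">16B", raw),
        "8 x 16-bit unsigned": struct.unpack(">8H", raw),
        "4 x 32-bit unsigned": struct.unpack(">4I", raw),
        "4 x 32-bit float": struct.unpack(">4f", raw),
    }


views = vector_views(bytes(range(16)))  # bytes 0x00 .. 0x0F
for name, elements in views.items():
    print(name, elements)
```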

 

AltiVec vector operations

AltiVec_instruction source1, source2, filter/mod, destination

filter/mod is also called a control operand or control vector.

4 kinds of AltiVec instructions

intra element arithmetic == operates in parallel on corresponding elements of the source vectors, e.g. summing each pair of elements

intra element non arithmetic == same as above but with logical operations like AND, OR, XOR

inter element arithmetic operations == happen between the elements in a single vector

inter element non arithmetic operations == vector permutation, rearranging the order of elements in an individual vector

For a permute, VC contains the control vector mentioned above, which tells AltiVec which elements to put where, and VT is the destination register.
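The four kinds of operation can be mimicked with plain Python lists. This is a sketch, not AltiVec code: the function names are made up, integers do not wrap or saturate as real vector elements would, and `permute` is modeled on a byte-select permute in which the low 5 bits of each control byte pick one of the 32 bytes of the two source vectors.

```python
def intra_add(va, vb):
    """Intra-element arithmetic: add corresponding elements of two vectors."""
    return [a + b for a, b in zip(va, vb)]


def intra_xor(va, vb):
    """Intra-element non-arithmetic: XOR corresponding elements."""
    return [a ^ b for a, b in zip(va, vb)]


def inter_sum(va):
    """Inter-element arithmetic: combine the elements of a single vector."""
    return sum(va)


def permute(va, vb, vc):
    """Inter-element non-arithmetic: build VT by selecting bytes via VC.

    Each control byte indexes into the 32 bytes formed by va followed by vb.
    """
    pool = list(va) + list(vb)
    return [pool[c & 0x1F] for c in vc]


va = list(range(16))        # bytes 0..15
vb = list(range(16, 32))    # bytes 16..31
reverse_control = list(range(15, -1, -1))

print(intra_add([1, 2, 3, 4], [10, 20, 30, 40]))
print(intra_xor([0b1100, 0b1010], [0b1010, 0b1010]))
print(inter_sum([1, 2, 3, 4]))
print(permute(va, vb, reverse_control))
```

Changing only the control vector changes the result: indices 0 to 15 pull from the first source and 16 to 31 from the second, so one instruction covers reversal, interleaving and similar shuffles.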

 

The G4e's VU: SIMD Done Right

The processor has 4 independent AltiVec units:

Vector permute unit == handles rearranging the operands within a vector: permute, merge, splat, pack, unpack

Vector simple integer unit == handles fast, simple vector integer instructions. Has 1 pipeline stage

Vector complex integer unit == handles slower vector instructions: multiply, multiply-add...

Vector floating point unit == handles all vector floating-point instructions

 

Intel's MMX

integer-only SIMD solution

Vectors are stored in 8 MMX registers, but these are aliased onto the x87's stack-based floating-point registers FP0 to FP7. Because there is no mode bit for toggling the register file between MMX and floating-point usage, the programmer can accidentally overwrite one kind of data with the other.

MMX_instruction mmreg1, mmreg2

mmreg1 is both source and destination, meaning that mmreg1 gets overwritten by the result of the calculation.
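Both issues, the shared register file and the destructive two-operand format, can be shown with a toy model. Everything here is a simplification for illustration: the class and function names are invented, the x87 stack addressing is ignored, and the add is a wraparound add on four 16-bit lanes of a 64-bit register.

```python
class SharedRegisterFile:
    """Toy model: 8 physical slots used for both FP and MMX values."""

    def __init__(self):
        self.slots = [None] * 8

    def write_fp(self, index, value):
        self.slots[index] = ("fp", value)

    def write_mmx(self, index, lanes):
        self.slots[index] = ("mmx", tuple(lanes))

    def read(self, index):
        return self.slots[index]


def mmx_add(regs, mmreg1, mmreg2):
    """Two-operand form: mmreg1 = mmreg1 + mmreg2, old mmreg1 is lost."""
    _, a = regs.read(mmreg1)
    _, b = regs.read(mmreg2)
    regs.write_mmx(mmreg1, [(x + y) & 0xFFFF for x, y in zip(a, b)])


regs = SharedRegisterFile()
regs.write_fp(0, 3.14)               # a floating-point value in slot 0
regs.write_mmx(0, [1, 2, 3, 4])      # MMX data lands in the same slot
regs.write_mmx(1, [10, 20, 30, 0xFFFF])
mmx_add(regs, 0, 1)                  # slot 0 is source and destination
print(regs.read(0), regs.read(1))
```

After the MMX write the floating-point value in slot 0 is gone, and after the add the original MMX operand in slot 0 is gone as well; code that still needs either value has to copy it somewhere first.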

 

SSE and SSE2

Intel's goal with SSE (Streaming SIMD Extensions, MMX2) was to add 4-way 128-bit SIMD single-precision floating-point computation to the x86 ISA. Its shortcoming was that vector floating-point operations were limited to single precision and vector integer operations were still limited to 64 bits.

SSE2 supports 128-bit SIMD integer operations and double-precision SIMD floating-point operations.
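The "N-way" figures follow directly from dividing the register width by the element width. A minimal sketch, with `lanes` as a hypothetical helper:

```python
def lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in one SIMD register."""
    return register_bits // element_bits


print("64-bit register, 16-bit integers:", lanes(64, 16))
print("128-bit register, single-precision floats:", lanes(128, 32))
print("128-bit register, double-precision floats:", lanes(128, 64))
print("128-bit register, 16-bit integers:", lanes(128, 16))
```

Widening integer operations from 64 to 128 bits doubles the elements handled per instruction, while moving from single to double precision halves them.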

 

The Pentium 4's Vector Unit

The Pentium 4 is able to offer competitive SIMD performance based on a combination of 3 factors:

1. relatively low instruction latencies

2. extremely high clock speeds

3. high bandwidth caching and memory subsystem

The Pentium 4's vector floating-point operations were 3 times faster than the G4e's.