The Integer Execution Units
simple / fast integer instructions == instructions like add and sub, which require very few steps. these make up the major portion of an average program
complex / slow integer instructions == multiply, divide. these make up a small portion compared to the simple ones.
The G4e's IUs : Making the common case fast
3 simple / fast integer execution units - SIUa, SIUb, SIUc == each of the 3 fast IUs is fed by a single-entry reservation station.
1 complex / slow integer execution unit - CIU == takes complex integer instructions, including condition register logical operations.
by focusing on the common case the G4e was able to achieve more speed, but the extra "finish" stage on the CIU added an extra cycle of latency
The Pentium4's IUs : Make the Common case twice as fast
2 simple / fast integer execution units == run at twice the core clock speed. but due to the "narrow and deep" philosophy discussed before, branch mispredictions in conjunction with cache latencies can kill integer performance
Intel sold a version of the Pentium 4 called Xeon, which had a larger cache, because the server market needed higher integer performance
The Floating-Point Units (FPUs)
extremely floating-point intensive applications were emerging.
branches in floating-point code are few and extremely predictable.
Integer programs need good branch prediction and caching to keep the IUs fed with instructions.
floating-point programs need good memory bandwidth to keep the FPUs fed with data.
2 features that differ from integer code
1. excellent locality of reference with respect to the instruction cache. (small loops which fit in one of the processor's caches)
2. operate on large data files that are streamed from main memory. (need for large bandwidth)
G4e's FPU
it is not fully pipelined, so it can't have 5 different instructions in 5 different stages of execution simultaneously.
the G4e has the cleaner PPC floating-point specification, which has 4 operand formats.
Pentium 4's FPU
2 independent FPU pipelines, one strictly for memory operations and the other for arithmetic operations.
the Pentium 4's clock speed is much higher than the G4e's. this compensates for the drawbacks of x87.
although the G4e has fine floating-point hardware, the influence of the PPC ISA (multiple operands) is much larger.
the Pentium 4 had better hardware, but the x87 legacy was a weakness.
The vector execution units
applying a single instruction to multiple data elements at once is quite effective in image processing and 3D rendering.
The AltiVec instruction set
16 elements, each element is an 8 bit signed or unsigned integer or an 8 bit character
8 elements, each element is a 16 bit signed or unsigned integer
4 elements, each element is either a 32 bit signed or unsigned integer, or a single precision (32 bit) IEEE floating-point number
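The three layouts above are different ways of slicing the same 128-bit AltiVec register. A minimal sketch of that idea, assuming a register modeled as 16 raw bytes with illustrative values (the `raw` contents and the variable names are made up for this example, and big-endian byte order is assumed):

```python
import struct

# One 128-bit vector register modeled as 16 raw bytes (illustrative values).
raw = bytes(range(16))

# The same 16 bytes viewed under each element format (">" = big-endian).
as_u8 = struct.unpack(">16B", raw)   # 16 elements x 8 bit, unsigned
as_s8 = struct.unpack(">16b", raw)   # 16 elements x 8 bit, signed
as_u16 = struct.unpack(">8H", raw)   # 8 elements x 16 bit, unsigned
as_u32 = struct.unpack(">4I", raw)   # 4 elements x 32 bit, unsigned
as_f32 = struct.unpack(">4f", raw)   # 4 elements x 32 bit, single precision

for name, view in [("8 bit", as_u8), ("16 bit", as_u16),
                   ("32 bit", as_u32), ("float", as_f32)]:
    print(name, "->", len(view), "elements")
```

In every view the element count times the element width comes out to the same 128 bits; only the interpretation changes.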
AltiVec vector operations
AltiVec_instruction source1, source2 , filter/mod , destination
filter/mod is also called a control operand or control vector.
4 kinds of AltiVec instructions
intra element arithmetic == just adding up each pair of corresponding elements in parallel
intra element non arithmetic == same as above but with logical operations like AND, OR, XOR
inter element arithmetic operations == happen between the elements in a single vector.
inter element non arithmetic operations == vector permutation , rearranging the order of elements in an individual vector
VC contains the control vector mentioned above, which tells AltiVec which elements it should put where, and VT is the destination register
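The four kinds of operation can be sketched with plain Python lists standing in for vectors. This is a simplified model, not AltiVec's actual semantics: the function names are hypothetical, vectors are shortened to 4 elements for readability, and the permute is assumed to pick elements from the two sources laid end to end, with VC holding the indices.

```python
def intra_add(va, vb, bits=8):
    """Intra element arithmetic: add matching elements, wrapping at `bits`."""
    mask = (1 << bits) - 1
    return [(a + b) & mask for a, b in zip(va, vb)]

def intra_and(va, vb):
    """Intra element non arithmetic: bitwise AND of matching elements."""
    return [a & b for a, b in zip(va, vb)]

def inter_sum(va):
    """Inter element arithmetic: combine the elements of a single vector."""
    return sum(va)

def permute(va, vb, vc):
    """Inter element non arithmetic: VC says which element goes where."""
    pool = va + vb                      # sources laid end to end (assumed)
    return [pool[i % len(pool)] for i in vc]

va = [1, 2, 3, 4]
vb = [10, 20, 30, 40]

print(intra_add(va, vb))                # element-wise sums
print(intra_and([12, 10], [10, 6]))     # element-wise AND
print(inter_sum(va))                    # sum across one vector
print(permute(va, vb, [3, 2, 1, 0]))    # reverse va
print(permute(va, vb, [0, 4, 1, 5]))    # interleave va and vb
```

The last two calls use the same function with different control vectors, which is the point of a permute: the data movement is chosen by the contents of VC rather than by the instruction itself.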
The G4e's VU: SIMD Done Right
The processor has 4 independent AltiVec units
Vector permute unit == handles rearranging the operands within a vector: permute, merge, splat, pack, unpack
Vector simple integer unit == handles fast, simple vector integer instructions. has 1 pipeline stage
Vector complex integer unit == handles slower vector instructions: multiply, multiply-add...
Vector floating point unit == handles all vector floating-point instructions
Intel's MMX
integer-only SIMD solution
vectors are stored in 8 MMX registers, but these are aliased onto the x87's stack-based floating-point registers FP0 to FP7. this can lead to the programmer overwriting data, since there is no mode bit for toggling the register file between MMX and floating-point usage.
MMX_instruction mmreg1 , mmreg2
mmreg1 is both source and destination, meaning that mmreg1 gets overwritten by the result of the calculation.
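The effect of the two-operand format can be sketched as follows. This is a toy model under stated assumptions: the `regs` dictionary and the `paddb` helper are hypothetical stand-ins for the MMX register file and a packed byte add, with each 64-bit register modeled as a list of 8 bytes.

```python
def paddb(regs, dst, src):
    """Two-operand form: dst = dst + src per byte, wrapping at 8 bits.

    The old contents of dst are lost, because dst is also a source.
    """
    regs[dst] = [(a + b) & 0xFF for a, b in zip(regs[dst], regs[src])]

# Toy register file: each 64-bit MMX register holds 8 bytes.
regs = {
    "mm0": [1, 2, 3, 4, 5, 6, 7, 8],
    "mm1": [10] * 8,
}

saved = list(regs["mm0"])   # extra copy needed to keep the original mm0
paddb(regs, "mm0", "mm1")

print(regs["mm0"])          # the sums, written over the first source
print(regs["mm1"])          # the second source is untouched
print(saved)                # the original mm0 survives only via the copy
```

With a separate destination operand, as in the AltiVec format above, the copy into `saved` would not be needed.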
SSE and SSE2
Intel's goal with SSE (Streaming SIMD Extensions, MMX2) was to add 4-way 128 bit SIMD single-precision floating-point computation to the x86 ISA. its shortcoming was that vector floating-point operations were limited to single precision and vector integer operations were still limited to 64 bits.
SSE2 supports 128 bit SIMD integer operations and double-precision SIMD floating-point operations
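The "4-way" figure and the SSE2 improvements fall out of dividing the register width by the element width. A small worked sketch, where the `lanes` helper is hypothetical and 16 bit integers are chosen only as an illustrative element size:

```python
def lanes(register_bits, element_bits):
    """Number of elements processed by one SIMD instruction."""
    return register_bits // element_bits

mmx_int16 = lanes(64, 16)       # integer SIMD limited to 64 bits
sse_float32 = lanes(128, 32)    # SSE: single precision in 128 bits
sse2_float64 = lanes(128, 64)   # SSE2: double precision in 128 bits
sse2_int16 = lanes(128, 16)     # SSE2: integer SIMD widened to 128 bits

print("64 bit integer, 16 bit elements:", mmx_int16)
print("SSE single precision:", sse_float32)
print("SSE2 double precision:", sse2_float64)
print("SSE2 integer, 16 bit elements:", sse2_int16)
```

Widening integer operations from 64 to 128 bits doubles the number of elements handled per instruction for any given element size.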
The Pentium 4's Vector Unit
the Pentium 4 is able to offer competitive SIMD performance based on a combination of 3 factors:
1. relatively low instruction latencies
2. extremely high clock speeds
3. high bandwidth caching and memory subsystem
the Pentium 4's vector floating-point operations were 3 times faster than the G4e's