Computer Architecture/Inside the Machine

Superscalar Execution

Tony Lim 2020. 12. 29. 16:56

Now we have a dispatch phase, which determines whether or not 2 instructions can be executed in parallel.

Even though we have 2 ALUs working in parallel, the interface still represents a sequential execution machine, so main memory still sees one sequential code stream, one data stream, and one results stream.

DLW-2 fetches instructions from memory in groups of 2 per clock cycle. But if the first instruction is a branch, the second instruction has to be discarded, which puts a bubble into the pipeline.

2 instructions are added to the Completed Instructions box on each cycle. It looks like more ALUs means better performance, but there are practical limits.

 

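The data stream consists of 4 types of numbers: scalar integers, scalar floating-point numbers, vector integers, and vector floating-point numbers. Memory addresses fall into the scalar integer category.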

 

There are 2 kinds of operations that can be performed on those 4 types of numbers.

1. Arithmetic operations == addition, subtraction, multiplication, division

2. Logical operations == Boolean operations; these can also be performed on the processor status word (PSW, the branch-related register)

 

Simple Integer Unit (SIU), Complex Integer Unit (CIU), Floating-Point execution Unit (FPU), Integer execution Unit (IU), Load-Store Unit (LSU)

 

Memory Access Units

LSU (load-store unit) == responsible for the execution of load and store instructions and for address generation. The LSU has a small piece of integer addition hardware that can quickly perform the addition required to compute an address.

BEU (branch execution unit) == responsible for executing conditional and unconditional branch instructions. It also often has its own address generation unit for performing quick address calculations as needed.

 

Instruction Set Architecture (ISA)

Programming model == a single, integer-only ALU, 4 general-purpose registers, PC, instruction register, processor status word (PSW), control unit
+
Instruction set == arithmetic instructions, load and store instructions, branch instructions

Even though DLW-2 has 2 ALUs, the instruction set and programming model remain unchanged (the interface is the same), but the hardware implementation of this ISA is different since it is superscalar.

hardware implementation of an ISA == processor's microarchitecture

DLW-1 and DLW-2 share the same DLW ISA but have different microarchitectures. Intel's x86 hardware followed the same sort of evolution, with each successive generation becoming more complex while the ISA stayed largely unchanged.
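
To make the "same ISA, different microarchitecture" idea concrete, here is a minimal sketch in Python. It is not the book's DLW definition: the tuple format, the `run_scalar` / `run_superscalar` names, and the conservative register-conflict check are all assumptions made for illustration, and pipeline depth is ignored. Both models run the same program and must end with the same register contents; only the cycle count differs.

```python
# Hypothetical mini-model of a DLW-like ISA: 4 registers, 3-operand arithmetic.
# An instruction is (op, src1, src2, dest), so ("add", "A", "B", "C") means C = A + B.
OPS = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}

def run_scalar(program, regs):
    """One ALU: execute one instruction per cycle, in order."""
    regs = dict(regs)
    for op, s1, s2, dest in program:
        regs[dest] = OPS[op](regs[s1], regs[s2])
    return regs, len(program)

def independent(first, second):
    """Conservative check: True if the pair shares no conflicting register."""
    _, a1, a2, ad = first
    _, b1, b2, bd = second
    return ad not in (b1, b2) and bd not in (a1, a2) and ad != bd

def run_superscalar(program, regs):
    """Two ALUs: dispatch a pair in one cycle when the pair is independent."""
    regs = dict(regs)
    cycles = 0
    i = 0
    while i < len(program):
        group = [program[i]]
        if i + 1 < len(program) and independent(program[i], program[i + 1]):
            group.append(program[i + 1])
        # Both ALUs read their operands before either result is written back.
        results = [(dest, OPS[op](regs[s1], regs[s2])) for op, s1, s2, dest in group]
        for dest, value in results:
            regs[dest] = value
        i += len(group)
        cycles += 1
    return regs, cycles

program = [
    ("add", "A", "B", "C"),  # C = A + B
    ("sub", "A", "B", "D"),  # D = A - B, independent of the first add
    ("add", "C", "D", "A"),  # A = C + D, needs both earlier results
]
start = {"A": 5, "B": 3, "C": 0, "D": 0}

scalar_regs, scalar_cycles = run_scalar(program, start)
super_regs, super_cycles = run_superscalar(program, start)
```

From the programmer's point of view the two runs are indistinguishable: the final registers match, and the only visible difference is that the 2-ALU model needs fewer cycles when neighbouring instructions happen to be independent.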

 

Adding new Functionality (x86 floating-point, ISA extension)

Regarding the Pentium's inclusion of floating-point hardware, we might wonder how the programmer was able to use the floating-point hardware, since the original x86 ISA didn't include any floating-point operations or registers.

1. They had to modify the programming model by adding an FPU and floating-point-specific registers.

2. They had to extend the instruction set by adding a new group of floating-point arithmetic instructions.

The ISA works as a kind of middleware between the programmer and the hardware.

 

Microcode

microcode engine == microcode ROM (holds microcode programs) + tiny storage, sort of like a CPU within a CPU

When a System/360 instruction is executed,

1. the microcode unit reads the instruction in,

2. accesses the ROM and finds the corresponding microcode program,

3. produces a sequence of machine instructions.

By decoding instructions this way, all programs are effectively running in emulation. If the hardware implementation changes, all the designers have to do is rewrite the microcode, so the programmer never has to be aware of the HW difference since the ISA hasn't changed.
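
The lookup steps above can be sketched as a table from ISA-level instructions to sequences of internal operations. This is only an illustration: the instruction names, the internal operation names, and the two ROM tables are invented here and are not real System/360 microcode.

```python
# Hypothetical microcode ROMs: each ISA-level instruction maps to a sequence
# of simpler internal operations. All names are invented for illustration.
MICROCODE_ROM_V1 = {
    "ADD_MEM": ["load_operand", "alu_add", "store_result"],
    "MOVE": ["read_reg", "write_reg"],
}

# A later hardware generation with a different internal implementation.
MICROCODE_ROM_V2 = {
    "ADD_MEM": ["fused_load_add", "store_result"],
    "MOVE": ["copy_reg"],
}

def decode(instruction, rom):
    """Find the microcode program that corresponds to one ISA instruction."""
    return rom[instruction]

def translate(program, rom):
    """Expand a whole ISA-level program into internal operations."""
    internal_ops = []
    for instruction in program:
        internal_ops.extend(decode(instruction, rom))
    return internal_ops

# The programmer writes this once, against the ISA.
program = ["MOVE", "ADD_MEM"]

v1_ops = translate(program, MICROCODE_ROM_V1)
v2_ops = translate(program, MICROCODE_ROM_V2)
```

The same `program` runs on both generations; only the ROM contents, and therefore the internal operation sequence, change.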

 

RISC ( reduced instruction set computing ) ISA

Microcode had allowed ISA designers to get elaborate with instruction sets, adding all sorts of complex and specialized instructions that hardly anyone used, which was a waste of resources.

RISC reduced the number of instructions in instruction set. 

 

Challenges to Pipelining and Superscalar Design (3 Hazards)

1. Data Hazards ==
e.g) add A B C
     add C D D   // the 2nd instruction reads C, which the 1st writes, so the two cannot execute simultaneously (read-after-write).
e.g) add A B C
     add D B A   // the 2nd add writes a new value into A, which the 1st add still needs to read as an input (write-after-read).

Each can be alleviated with one of 2 tricks.

Forwarding == the processor takes the result of the first add from the ALU's output port and feeds it directly back into the ALU's input port, bypassing the register-file-write stage. The 2nd add still has to wait, but for less time.

Register renaming == even though the programmer thinks there are only 4 registers, the actual HW has more (e.g. 16), which gives the HW room for superscalar execution. This solves the 2nd example: the 2nd add writes its result into a temporary register, and that result is written back to the A register after the 1st add has finished executing.
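
The two data hazard examples and the renaming trick can be sketched in a few lines of Python. The tuple format `(op, src1, src2, dest)`, the function names, and the physical register name `P5` are assumptions for illustration. The sketch only shows the renaming step, not the later write-back into A.

```python
def classify(first, second):
    """Name the data hazards between two 3-operand instructions.

    Each instruction is (op, src1, src2, dest), e.g. ("add", "A", "B", "C")
    means C = A + B.
    """
    _, a1, a2, ad = first
    _, b1, b2, bd = second
    hazards = []
    if ad in (b1, b2):
        hazards.append("read-after-write")   # second reads what first writes
    if bd in (a1, a2):
        hazards.append("write-after-read")   # second overwrites a source of first
    return hazards

def rename_dest(instruction, free_physical_reg):
    """Redirect an instruction's result to a spare physical register."""
    op, s1, s2, _ = instruction
    return (op, s1, s2, free_physical_reg)

# 1st example: add A B C / add C D D
example1 = classify(("add", "A", "B", "C"), ("add", "C", "D", "D"))

# 2nd example: add A B C / add D B A
first = ("add", "A", "B", "C")
second = ("add", "D", "B", "A")
example2 = classify(first, second)

# Renaming the 2nd add's destination (hypothetical spare register "P5")
# removes the conflict on A, so the pair no longer blocks each other.
renamed_second = rename_dest(second, "P5")
example2_renamed = classify(first, renamed_second)
```

Note that renaming does nothing for the 1st example: the 2nd add truly needs the value the 1st add produces, which is why forwarding is the trick used there.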

 

Structural Hazards == happen when the processor doesn't have enough resources to execute both instructions at once.
e.g) add A B B
     add C D D   // in this example we are assuming A==C, B==D

Another example is trying to write to memory and fetch in the same cycle.

Register file == the group of the CPU's registers, collected in one place to avoid messy wiring. It is accessed through an interface that consists of a data bus and 2 types of ports, read and write. In order to read a value from a single register in the register file, the ALU accesses the register file's read port and requests that the data from a specific register be placed on the special internal data bus that the register file shares with the ALU. Writing works the same way through the write port.

Each ALU needs 2 read ports (for its two source operands) and 1 write port (for its result), so a superscalar design with 2 ALUs needs 4 read ports and 2 write ports on the register file.
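
The port arithmetic can be checked with a small sketch. It assumes every instruction is a 3-operand add that reads two registers and writes one; the function names are made up for this example.

```python
# Assumption: every instruction is a 3-operand add (2 register reads, 1 write).
READS_PER_INSTRUCTION = 2
WRITES_PER_INSTRUCTION = 1

def ports_needed(num_alus):
    """Register file ports required to keep every ALU busy each cycle."""
    return (num_alus * READS_PER_INSTRUCTION, num_alus * WRITES_PER_INSTRUCTION)

def has_structural_hazard(num_issued, read_ports, write_ports):
    """True if the instructions issued this cycle need more ports than exist."""
    return (num_issued * READS_PER_INSTRUCTION > read_ports
            or num_issued * WRITES_PER_INSTRUCTION > write_ports)

single_alu_ports = ports_needed(1)  # (read ports, write ports) for one ALU
dual_alu_ports = ports_needed(2)    # (read ports, write ports) for two ALUs

# Two adds issued together against a register file sized for one ALU...
hazard_on_small_file = has_structural_hazard(2, *single_alu_ports)
# ...and against a register file sized for two ALUs.
hazard_on_big_file = has_structural_hazard(2, *dual_alu_ports)
```

With a register file sized for one ALU, issuing two adds in the same cycle runs out of ports, which is the structural hazard; sizing the file for two ALUs removes it.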

 

Control Hazard (branch hazard) == arises when the processor arrives at a conditional branch and has to decide which instruction to fetch next. The pipeline waits while the processor finds the branch target instruction.