Computer Architecture/Cornell ECE 5545

ML HW & Systems. Lecture 5: Microarchitecture

Tony Lim 2024. 7. 6. 16:13
from the big picture (architecture) to the actual unit design (microarchitecture); circuits will not be covered

 

accumulator = an adder that keeps a running sum (it needs to hold a number, i.e. state); a plain adder doesn't need to hold state

this takes O(n) cycles to compute the dot product of length-n vectors, e.g. 8 cycles if n is 8
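a minimal Python sketch of the single multiply-accumulate unit described above, assuming one multiply-add per cycle. the function name `mac_dot_product` and the cycle counter are illustrative assumptions, not something from the lecture.

```python
def mac_dot_product(a, b):
    """Dot product on a single multiply-accumulate (MAC) unit.

    Returns (result, cycles), assuming one multiply-add per cycle.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0      # accumulator register: the only state kept between cycles
    cycles = 0
    for x, y in zip(a, b):
        acc += x * y   # one multiply and one add per cycle
        cycles += 1
    return acc, cycles


result, cycles = mac_dot_product([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
print(result, cycles)  # 36 8
```

the cycle count grows linearly with the vector length, which is the O(n) behaviour noted above.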

 

using multiple multipliers will be faster, but how do we merge the results? we need an adder tree

now it takes 2 cycles to compute the dot product of length-8 vectors (i.e. 4 multiplications per cycle).
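the parallel version can be sketched the same way. this assumes 4 multipliers (`lanes=4`) whose products are merged by an adder tree and then added to an accumulator, all within one cycle; the lane count and function names are assumptions chosen to match the 2-cycle figure for length-8 vectors.

```python
def adder_tree(values):
    """Sum a list by pairwise reduction, like a tree of stateless adders."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:       # pad odd-sized levels with a zero input
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def parallel_dot_product(a, b, lanes=4):
    """Dot product with `lanes` multipliers feeding an adder tree.

    Returns (result, cycles), assuming one group of `lanes` products
    is multiplied, reduced, and accumulated per cycle.
    """
    acc = 0
    cycles = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        products = [x * y for x, y in zip(chunk_a, chunk_b)]  # parallel multipliers
        acc += adder_tree(products)                           # merge, then accumulate
        cycles += 1
    return acc, cycles


print(parallel_dot_product([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))  # (36, 2)
```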

 

INT8 is less precise than other formats, and the accumulator has to hold the sum of many products, so the accumulator definitely needs more precision than the inputs
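a small sketch of why the accumulator needs to be wider than the inputs. it assumes signed two's-complement integers that wrap on overflow; `accumulator_bits` gives a sufficient (conservative) width rather than a tight minimum, and all names here are illustrative.

```python
def accumulator_bits(input_bits, num_terms):
    """Width that is sufficient to sum `num_terms` products of signed
    `input_bits`-bit integers without overflow (conservative bound)."""
    product_bits = 2 * input_bits                 # int8 * int8 fits in 16 bits
    growth = (num_terms - 1).bit_length()         # ceil(log2(num_terms))
    return product_bits + growth


def wrap_signed(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def accumulate(a, b, acc_bits):
    """Dot product where the accumulator register is only `acc_bits` wide."""
    acc = 0
    for x, y in zip(a, b):
        acc = wrap_signed(acc + x * y, acc_bits)
    return acc


a = [100] * 8
b = [100] * 8
print(accumulator_bits(8, 1024))   # 26
print(accumulate(a, b, 8))         # -128 (overflowed)
print(accumulate(a, b, 32))        # 80000 (correct)
```

with int8 inputs, even 1024 terms fit comfortably in a 32-bit accumulator, while an 8-bit accumulator is wrong after a handful of products.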

cannot do better than this

most circuits take more than 1 cycle per op

the pipelining concept is to break the computation down into finer-grained parts and put registers in between

so latency isn't an issue

having a data dependency reduces throughput

we interleave operands from independent computations, e.g. a + b + c and x + y + z

since there are no stalls in the data stream, we now have a throughput of 1 op/cycle.
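the effect of interleaving can be checked with a toy model. this assumes an in-order pipelined adder with a latency of 2 cycles that can issue one op per cycle, and stalls when an op needs a result that is not ready yet; the model and the op names (`t1`, `u1`, ...) are assumptions for illustration.

```python
def cycles_to_finish(ops, latency):
    """Cycles for an in-order pipelined adder to finish all `ops`.

    Each op is (name, dependency): `dependency` is the name of an earlier
    op whose result is needed, or None. One op can issue per cycle, and a
    result becomes available `latency` cycles after its op issues.
    """
    ready_at = {}   # op name -> cycle at which its result is available
    next_slot = 0   # earliest cycle at which the next op may issue
    for name, dep in ops:
        issue = next_slot
        if dep is not None:
            issue = max(issue, ready_at[dep])   # stall until the operand is ready
        ready_at[name] = issue + latency
        next_slot = issue + 1
    return max(ready_at.values())


# a + b + c is two adds: t1 = a + b, t2 = t1 + c (same for x + y + z)
back_to_back = [("t1", None), ("t2", "t1"), ("u1", None), ("u2", "u1")]
interleaved = [("t1", None), ("u1", None), ("t2", "t1"), ("u2", "u1")]

print(cycles_to_finish(back_to_back, latency=2))  # 7
print(cycles_to_finish(interleaved, latency=2))   # 5
```

in the interleaved order the four adds issue in cycles 0, 1, 2, 3 with no stall, which is the 1 op/cycle case.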

block floating point: a block of values shares one exponent, and each value keeps only its own sign + mantissa
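a rough sketch of the shared-exponent idea, assuming the exponent is picked from the largest magnitude in the block and each value is stored as a signed 8-bit integer mantissa (sign and mantissa folded into one integer). the encoding details and function names are assumptions; real block floating point formats differ in block size and rounding.

```python
import math


def bfp_encode(values, mantissa_bits=8):
    """Encode a block as (shared_exponent, list of signed integer mantissas)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 0, [0] * len(values)
    # math.frexp returns (m, e) with max_abs == m * 2**e and 0.5 <= m < 1
    _, exp = math.frexp(max_abs)
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    limit = (1 << (mantissa_bits - 1)) - 1      # clamp to the mantissa range
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return exp, mantissas


def bfp_decode(exp, mantissas, mantissa_bits=8):
    """Reconstruct approximate values from the shared exponent and mantissas."""
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


exp, mantissas = bfp_encode([1.0, 0.5, -0.25, 0.75])
print(exp, mantissas)              # 1 [64, 32, -16, 48]
print(bfp_decode(exp, mantissas))  # [1.0, 0.5, -0.25, 0.75]
```

the trade-off is that values much smaller than the largest one in the block lose precision, e.g. 0.001 next to 1.0 is rounded to 0 with these settings.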

 

there can be 1 physical port used as 2 logical ports (read, write)
or there can be 2 actual physical ports, one for read and one for write

the data width has nothing to do with the bit-width of the address; the address width depends only on the number of entries
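a quick check of that relationship. the helper names are illustrative: the address width is ceil(log2(entries)), while the data width only affects the total capacity.

```python
def address_bits(num_entries):
    """Address width needed to index `num_entries` memory entries."""
    if num_entries < 1:
        raise ValueError("memory needs at least one entry")
    return (num_entries - 1).bit_length()   # ceil(log2(num_entries))


def capacity_bits(num_entries, data_width):
    """Total storage in bits; this is where the data width matters."""
    return num_entries * data_width


# same number of entries, different data widths: same address width
print(address_bits(1024), capacity_bits(1024, 8))    # 10 8192
print(address_bits(1024), capacity_bits(1024, 64))   # 10 65536
```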

multiple read ports can have concurrency issues (waiting for a lock), and write ports can have the same issue

both options have pros and cons

the left one is complicated and has concurrency issues; the right one needs more actual physical RAM

 

arbitration and a crossbar are used to route the read and write ports' data to the proper bank
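a toy model of what the arbiter has to deal with. it assumes low-order interleaving (bank = address % number of banks) and that each bank serves one request per cycle, so requests that land on the same bank are serialized; both the mapping and the function name are assumptions.

```python
def cycles_for_requests(addresses, num_banks):
    """Cycles needed to serve a set of simultaneous requests on banked memory.

    Assumes bank = address % num_banks and one request per bank per cycle.
    """
    per_bank = [0] * num_banks
    for addr in addresses:
        per_bank[addr % num_banks] += 1   # the crossbar routes the request to this bank
    return max(per_bank)                  # the busiest bank sets the total time


print(cycles_for_requests([0, 1, 2, 3], num_banks=4))    # 1 (no conflicts)
print(cycles_for_requests([0, 4, 8, 12], num_banks=4))   # 4 (all hit bank 0)
```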

recently, instead of relying on the crossbar, people have been thinking about how to read and write to each bank.

 

 

 

 

 

 
