from the big picture (Architecture) to actual unit design (Microarchitecture); circuit-level design is not covered
accumulator = adder that keeps a running sum, so it needs to hold state; a plain adder doesn't need to hold state
with one multiplier feeding the accumulator, the dot product of length-n vectors takes O(n) cycles, e.g. 8 cycles if n is 8
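The O(n) cost can be sketched with a tiny cycle-counting model. This is only an illustration: `mac_dot` is a hypothetical name, and it assumes one multiply and one accumulate per cycle.

```python
def mac_dot(a, b):
    """Dot product on a single multiply-accumulate unit (assumed: 1 MAC per cycle)."""
    acc = 0      # accumulator register: the state kept between cycles
    cycles = 0
    for x, y in zip(a, b):
        acc += x * y   # one multiply and one add into the accumulator
        cycles += 1    # each element pair costs one cycle
    return acc, cycles


result, cycles = mac_dot([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
print(result, cycles)  # 36 8
```

The cycle count grows linearly with the vector length because every product has to pass through the same accumulator.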
using multiple multipliers is faster, but their results have to be merged, which needs an adder tree
now it takes 2 cycles to compute the dot product of length-8 vectors.
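A rough model of the multiplier-plus-adder-tree design, under two assumptions that are not stated in the notes: 4 multipliers work in parallel, and the whole adder tree finishes within one cycle. `adder_tree` and `parallel_dot` are hypothetical names.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, the way a hardware adder tree does."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:   # pad an odd level with a zero input
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def parallel_dot(a, b, num_multipliers=4):
    """Dot product with several multipliers, an adder tree, and one accumulator."""
    acc = 0
    cycles = 0
    for start in range(0, len(a), num_multipliers):
        chunk_a = a[start:start + num_multipliers]
        chunk_b = b[start:start + num_multipliers]
        products = [x * y for x, y in zip(chunk_a, chunk_b)]  # all multipliers fire together
        acc += adder_tree(products)                           # merge, then accumulate
        cycles += 1
    return acc, cycles


print(parallel_dot([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))  # (36, 2)
```

With 4 multipliers, a length-8 vector is consumed in two chunks, which matches the 2-cycle figure.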
INT8 is considered less accurate than other precisions, and the accumulator has to hold the sum of many products, so it definitely needs a wider (higher-precision) type than the inputs
cannot do better than this
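The accumulator-precision point above can be checked with a small sketch. It assumes signed two's-complement wraparound on overflow; the 8-bit and 32-bit accumulator widths are illustrative choices, and `wrap_signed` and `dot_with_accumulator` are hypothetical helpers.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    if value >= (1 << (bits - 1)):
        value -= 1 << bits
    return value


def dot_with_accumulator(a, b, acc_bits):
    """Dot product where the accumulator register is only acc_bits wide."""
    acc = 0
    for x, y in zip(a, b):
        acc = wrap_signed(acc + x * y, acc_bits)  # overflow wraps around
    return acc


a = [127] * 8   # largest positive INT8 value
b = [127] * 8
exact = sum(x * y for x, y in zip(a, b))
print(exact)                              # 129032
print(dot_with_accumulator(a, b, 32))     # 129032
print(dot_with_accumulator(a, b, 8))      # 8
```

A single product of two INT8 values already needs up to 16 bits, so an accumulator as narrow as the inputs overflows almost immediately.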
most circuits take more than 1 cycle per op
pipelining breaks the computation into finer-grained stages and puts registers in between
so latency isn't an issue
a data dependency lowers throughput
so we interleave the operands of independent computations, e.g. a + b + c and x + y + z
since there are no stalls in the data stream, we now have a throughput of 1 op/cycle.
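The effect of interleaving can be sketched with a toy in-order scheduler. The 2-cycle adder latency is an assumption chosen for illustration, and `schedule` is a hypothetical helper: each entry in `deps` is the index of the earlier op whose result is needed, or `None`.

```python
def schedule(deps, latency):
    """Issue ops in order into one pipelined unit (at most one issue per cycle).

    Returns (issue_cycles, cycle_when_last_result_is_ready).
    """
    issue_cycles = []
    finish = []
    next_slot = 0
    for dep in deps:
        ready = finish[dep] if dep is not None else 0
        issue = max(next_slot, ready)   # stall until the operand is available
        issue_cycles.append(issue)
        finish.append(issue + latency)
        next_slot = issue + 1
    return issue_cycles, max(finish)


# (a + b) + c, then (x + y) + z: each second add waits on the first
print(schedule([None, 0, None, 2], latency=2))   # ([0, 2, 3, 5], 7)

# interleaved: a + b, x + y, (a + b) + c, (x + y) + z
print(schedule([None, None, 0, 1], latency=2))   # ([0, 1, 2, 3], 5)
```

In the interleaved order an op issues every cycle, so the unit sustains 1 op/cycle even though each individual add still takes 2 cycles.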
block floating point: the floats in a block share one exponent, and each value keeps its own sign + mantissa
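A minimal sketch of the shared-exponent idea, assuming an 8-bit signed mantissa per value and an exponent picked from the largest magnitude in the block. Real block floating point formats differ in these details, and `bfp_encode` and `bfp_decode` are hypothetical names.

```python
import math


def bfp_encode(block, mantissa_bits=8):
    """Encode a block of floats as one shared exponent plus signed integer mantissas."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0:
        return 0, [0] * len(block)
    exp = math.frexp(max_abs)[1]                  # max_abs = m * 2**exp, 0.5 <= m < 1
    scale = 2.0 ** (exp - (mantissa_bits - 1))    # value of one mantissa step
    limit = (1 << (mantissa_bits - 1)) - 1        # largest signed mantissa
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return exp, mantissas


def bfp_decode(exp, mantissas, mantissa_bits=8):
    """Rebuild approximate floats from the shared exponent and the mantissas."""
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


exp, mantissas = bfp_encode([1.0, 0.5, -0.25, 0.75])
print(exp, mantissas)              # 1 [64, 32, -16, 48]
print(bfp_decode(exp, mantissas))  # [1.0, 0.5, -0.25, 0.75]
```

Only one exponent is stored per block, so values much smaller than the block maximum lose precision; that is the cost of the cheaper storage and arithmetic.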
there can be 1 physical port used as 2 logical ports (read and write)
or there can be 2 actual physical ports, one for read and one for write
the bit-width of the address has nothing to do with the data size; it depends only on the number of entries
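The address-width rule can be written as a one-line calculation. `address_bits` is a hypothetical helper, and the sizes below are only examples.

```python
def address_bits(num_entries):
    """Bits needed to address num_entries distinct locations."""
    return max(1, (num_entries - 1).bit_length())


# same number of entries, different data widths: the address width does not change
for data_width in (8, 64):
    entries = 1024
    capacity_bits = entries * data_width
    print(data_width, address_bits(entries), capacity_bits)
# 8 10 8192
# 64 10 65536
```

Widening the data word increases capacity but leaves the address at 10 bits; only adding entries makes the address wider.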
multiple read ports can have concurrency issues (waiting for a lock), and write ports can have the same issue
both options have pros and cons
the left one is complicated and has concurrency issues; the right one needs more actual physical RAM
arbitration and the crossbar route each read and write port's data to the proper bank
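A toy model of why arbitration is needed in front of the banks. It assumes low-order address interleaving (bank = address mod number of banks) and one access per bank per cycle; neither detail comes from the notes, and `cycles_to_serve` is a hypothetical helper.

```python
from collections import defaultdict


def cycles_to_serve(addresses, num_banks):
    """Cycles needed when every bank can serve only one request per cycle."""
    per_bank = defaultdict(int)
    for addr in addresses:
        per_bank[addr % num_banks] += 1   # the crossbar routes the request to its bank
    # the most heavily requested bank decides how long the whole batch takes
    return max(per_bank.values(), default=0)


print(cycles_to_serve([0, 1, 2, 3], num_banks=4))  # 1
print(cycles_to_serve([0, 4, 8, 3], num_banks=4))  # 3
```

When requests land in different banks they are all served in one cycle; when they collide on the same bank, the arbiter has to serialize them.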
recently, instead of relying on the crossbar, people have been thinking about how to read and write to each bank.