Lec10) Question Answering
SQuAD evaluation
F1 measure is the primary metric; exact match (EM) is 1/0 accuracy on whether the prediction matches one of the 3 reference answers
F1 = 2*precision*recall / (precision + recall)
precision = tp/(tp+fp) , recall = tp/(tp+fn)
tp = true positive: true (the model's prediction is correct), positive (the model said positive)
number of tokens that are shared between the correct answer and the prediction
fp = false positive: false (the model's prediction is wrong), positive (the model said positive)
number of tokens that are in the prediction but not in the correct answer
fn = false negative: false (the model's prediction is wrong), negative (the model said negative)
number of tokens that are in the correct answer but not in the prediction
tn = true negative: true (the model's prediction is correct), negative (the model said negative)
not meaningful in this setting (it would be the number of tokens that are in neither the correct answer nor the prediction), and F1 does not use it
F1 depends less than exact match on choosing exactly the same span that the human annotators chose; the exact span choice is susceptible to various effects, including line breaks
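A minimal sketch of the token-level counts and formulas above, to make the tp/fp/fn definitions concrete. This is not the official SQuAD evaluation script: the function names (tokens, exact_match, token_f1, best_f1) are made up for illustration, the normalization is assumed to be just lowercasing and whitespace splitting, and taking the best score over the reference answers is an assumption about how the 3 answers are combined.

```python
from collections import Counter


def tokens(text):
    # Assumed, simplified normalization: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction, gold_answers):
    # 1 if the prediction equals any reference answer token for token, else 0.
    return int(any(tokens(prediction) == tokens(g) for g in gold_answers))


def token_f1(prediction, gold):
    pred, ref = tokens(prediction), tokens(gold)
    # tp: tokens shared between prediction and correct answer (with multiplicity).
    tp = sum((Counter(pred) & Counter(ref)).values())
    if tp == 0:
        return 0.0
    fp = len(pred) - tp  # tokens in the prediction but not in the correct answer
    fn = len(ref) - tp   # tokens in the correct answer but not in the prediction
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, gold_answers):
    # Assumption: score against each reference answer and keep the best.
    return max(token_f1(prediction, g) for g in gold_answers)


gold_answers = ["in Paris", "Paris", "Paris France"]
print(exact_match("in Paris France", gold_answers))
print(best_f1("in Paris France", gold_answers))
```

With the made-up example above, the prediction "in Paris France" matches none of the references exactly, so EM is 0. Against "in Paris" it has tp = 2, fp = 1, fn = 0, so precision = 2/3, recall = 1, and F1 is about 0.8. This shows how F1 gives partial credit where EM gives none.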
SQuAD limitations
the model is not actually understanding the paragraph; it is mostly solving the task by matching
answers are span-based only (no yes/no, counting, or implicit "why" questions)