algorithm - Ruby implementation for ROC curve


I'm trying to implement the calculation of a ROC curve in Ruby. I tried to transform the pseudocode from http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf (see the 6th page, chapter 5, Algorithm 1 "Efficient method for generating ROC points") into Ruby code.

I worked out a simple example, but I'm always getting values above 1.0 for recall. I think I misunderstood something, or made a mistake in the programming. Here is what I've got so far:

# results from the classifier
# index 0: users voting
# index 1: estimate from the system
results = [[5.0,4.8],[4.6,4.2],[4.3,2.2],[3.1,4.9],[1.3,2.6],[3.9,4.3],[1.9,2.4],[2.6,2.3]]

# on a score of 2.5 an item is a positive one
threshold = 2.5

# sort by index 1, the estimate
l_sorted = results.sort { |a,b| b[1] <=> a[1] }

# count the real positives and negatives
positives, negatives = 0, 0
l_sorted.each do |item|
  if item[0] >= threshold
    positives += 1
  else
    negatives += 1
  end
end

fp, tp = 0, 0
# the array that holds the points
r = []
f_prev = -Float::INFINITY

# iterate over all items
l_sorted.each do |item|
  # if the score of the former iteration is different,
  # add another point to r
  if item[1]!=f_prev
    r.push [fp/negatives.to_f,tp/positives.to_f]
    f_prev = item[1]
  end
  # if the current item is a real positive
  # (user likes the item indeed, and the estimator was also correct)
  # add a true positive, otherwise, add a false positive
  if item[0] >= threshold && item[1] >= threshold
    tp += 1
  else
    fp += 1
  end
end

# push the last point (1,1) to the array
r.push [fp/negatives.to_f,tp/positives.to_f]

r.each do |point|
  puts "(#{point[0].round(3)},#{point[1].round(3)})"
end

Based on a results array of arrays, the code tries to calculate the points. I'm not sure what f_prev is all about. Is the score of the classifier stored in f_prev, or only whether it's true or false?

It would be awesome if someone could have a quick look at my code and help me find my mistake. Thanks!

My second answer is an analysis of your code, pointing out where I think you have made mistakes or got confused. I am assuming that you want to reproduce a graph similar to the one seen on page 864 of your linked PDF.

An ROC plot like the one on p864 is a graph showing the available compromises in your predictive model between false positive and true positive rates. To see all possible compromises, you need to visit all the data points where a threshold would make a difference, and plot their false positive vs true positive rate.

Your first point of confusion seems to be that you have a "users voting" float score instead of a true/false category. The example in the PDF has P/N cases already determined for plotting the ROC.

# results from the classifier
# index 0: users voting
# index 1: estimate from the system
results = [[5.0,4.8],[4.6,4.2],[4.3,2.2],[3.1,4.9],[1.3,2.6],[3.9,4.3],[1.9,2.4],[2.6,2.3]]

So I think you would be better off having

results = [[true,4.8],[true,4.2],[true,2.2],[true,4.9],[false,2.6],[true,4.3],[false,2.4],[true,2.3]] 

before you start to plot the ROC. It would be fine to do this conversion inline, but you need to separate the concern of how you generate your test data from your ROC plot - for instance, the fact that your user scores and machine estimate scores are on the same scale is irrelevant.
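
As a minimal sketch of doing that conversion as its own step, assuming a cutoff of 2.5 on the user votes: the names `label_votes` and `vote_threshold` are hypothetical and not part of the original code.

```ruby
# Hypothetical helper: turn [user_vote, estimate] pairs into
# [actual_positive?, estimate] pairs. The cutoff is applied to the
# user vote only; the estimate is passed through untouched.
def label_votes(results, vote_threshold)
  results.map { |vote, estimate| [vote >= vote_threshold, estimate] }
end

results = [[5.0, 4.8], [4.6, 4.2], [4.3, 2.2], [3.1, 4.9],
           [1.3, 2.6], [3.9, 4.3], [1.9, 2.4], [2.6, 2.3]]

labelled = label_votes(results, 2.5)
p labelled
```

After this step the ROC code only ever sees true/false labels and estimates, so it has no reason to know about the vote cutoff.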

Which leads to the threshold variable. You can use e.g. 2.5 to convert your user data, but that has no bearing on your ROC plot. In fact, to get a full ROC plot you need to test multiple values of the threshold to see how they affect the true and false positive rates.

# on a score of 2.5 an item is a positive one
threshold = 2.5

This sorts the values into reverse order, with the highest-scoring items first. You could do it either way, but to me this means you want to start at a high threshold (where all your scores predict false), and at position [0.0,0.0] on the graph

# sort by index 1, the estimate
l_sorted = results.sort { |a,b| b[1] <=> a[1] }

The following code looks accurate enough, summing the test positives and negatives, but it shouldn't be mixed up with the concept of threshold:
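
To illustrate what "testing multiple values of the threshold" means, here is a small sketch that computes one ROC point for one explicit cutoff on the estimates. The helper name `rates_at` is made up for this example, and it assumes the data contains at least one real positive and one real negative (otherwise the division is undefined).

```ruby
# Hypothetical helper: false positive rate and true positive rate when
# every item whose estimate is >= cutoff is predicted positive.
# labelled is an array of [actual_positive?, estimate] pairs.
def rates_at(labelled, cutoff)
  positives = labelled.count { |actual, _| actual }
  negatives = labelled.size - positives
  predicted = labelled.select { |_, estimate| estimate >= cutoff }
  tp = predicted.count { |actual, _| actual }
  fp = predicted.size - tp
  [fp / negatives.to_f, tp / positives.to_f]
end

labelled = [[true, 4.8], [true, 4.2], [true, 2.2], [true, 4.9],
            [false, 2.6], [true, 4.3], [false, 2.4], [true, 2.3]]

# A cutoff above every estimate predicts nothing positive,
# a cutoff below every estimate predicts everything positive.
p rates_at(labelled, 5.0) # => [0.0, 0.0]
p rates_at(labelled, 2.0) # => [1.0, 1.0]
```

Calling this once per distinct estimate would trace the whole curve, but it rescans the data every time; the single pass over the sorted list discussed in this answer gets the same points more cheaply.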

# count the real positives and negatives
positives, negatives = 0, 0
l_sorted.each do |item|
  if item[0] >= threshold
    positives += 1
  else
    negatives += 1
  end
end

A nicer Ruby way of putting the same logic, assuming you replace the user scores with true/false values somewhere else, might be

positives = l_sorted.select { |item| item[0] }.count
negatives = l_sorted.count - positives

This looks OK, you do indeed start at [0.0,0.0] with

fp, tp = 0, 0
# the array that holds the points
r = []

However, this looks like the starting threshold

f_prev = -Float::INFINITY

So it would logically be positive Float::INFINITY in my opinion, such that all your predictions are false (hence fp and tp logically have to be 0, because there are no P allowed at all). It doesn't matter in practice though, since the value is only ever compared for inequality against the first score.


Inside the loop, what is going on is that the code is tracking what the total false positives and true positives would be if the threshold was set just above the current item. As you lower this bar past groups of items with the same score, they will all predict positive values (there is no need to test this versus the threshold variable, which was confusing you). You just have to sort those positive values into tp or fp counts. The check versus f_prev is just helping to group similar items: you plot only 1 point if 3 predictions have the same score.

# iterate over all items
l_sorted.each do |item|
  if item[1]!=f_prev
    # plot a point, assuming all predictions with a score equal to or lower than
    # the current item are thresholded out as negative.
    r.push [fp/negatives.to_f,tp/positives.to_f]
    f_prev = item[1]
  end
  # assume the current prediction is now positive, and calculate how that affects the curve
  # if the current test item is a real positive, add to the true positives,
  # otherwise it has become a false positive
  if item[0]
    tp += 1
  else
    fp += 1
  end
end

# push the last point (1,1) to the array
r.push [fp/negatives.to_f,tp/positives.to_f]

As well as altering the test, I removed an inaccurate comment ("the estimator was correct") - you are not judging in this code whether the estimator is "correct" or not for a single value, just seeing how well it scores fp vs tp at a particular cutoff point. The single-pass process over the sorted list relies on the fact that each step is a small incremental change from the last point plotted, based on changes to the fp and tp counts.

This should now go from [0.0,0.0] to [1.0,1.0]

r.each do |point|
  puts "(#{point[0].round(3)},#{point[1].round(3)})"
end
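
Putting the pieces together, the whole thing might look like the sketch below. It is one possible arrangement rather than a definitive implementation: the method name `roc_points` is made up for this example, the input is assumed to be already converted to [true/false, estimate] pairs, and it assumes at least one real positive and one real negative.

```ruby
# Hypothetical wrapper around the single-pass loop discussed above.
# labelled is an array of [actual_positive?, estimate] pairs.
# Returns an array of [false_positive_rate, true_positive_rate] points.
def roc_points(labelled)
  # Highest estimates first, so the implied cutoff moves from high to low.
  sorted = labelled.sort_by { |_, estimate| -estimate }
  positives = sorted.count { |actual, _| actual }
  negatives = sorted.size - positives

  fp = 0
  tp = 0
  points = []
  f_prev = Float::INFINITY # cutoff above every score: nothing predicted positive

  sorted.each do |actual, estimate|
    # Emit one point per distinct score, before counting items at that score.
    if estimate != f_prev
      points << [fp / negatives.to_f, tp / positives.to_f]
      f_prev = estimate
    end
    # The item is now predicted positive; sort it into tp or fp.
    if actual
      tp += 1
    else
      fp += 1
    end
  end

  # Final point, with everything predicted positive.
  points << [fp / negatives.to_f, tp / positives.to_f]
  points
end

labelled = [[true, 4.8], [true, 4.2], [true, 2.2], [true, 4.9],
            [false, 2.6], [true, 4.3], [false, 2.4], [true, 2.3]]

roc_points(labelled).each do |fpr, tpr|
  puts "(#{fpr.round(3)},#{tpr.round(3)})"
end
```

With the example data there are 6 real positives and 2 real negatives, so the first printed point is (0.0,0.0) and the last is (1.0,1.0), and neither rate can exceed 1.0.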
