classification - Python: how to train a Naive Bayes classifier
I need a classifier that classifies reviews as positive or negative. For each document I have done stopword filtering and lemmatization, computed the tf-idf of each term, and stored the (term, score) pairs in a doc_bow list for each doc:

doc_bow.append((term, tfidf))
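For concreteness, a minimal sketch of how such (term, tfidf) pairs might be built. The toy corpus and the exact tf/idf formulas here are assumptions, since the question does not show them:

```python
import math

# Hypothetical toy corpus standing in for the preprocessed
# (stopword-filtered, lemmatized) review documents.
docs = [
    ["movie", "great", "acting", "great"],
    ["movie", "boring", "plot"],
]

n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

doc_bows = []
for doc in docs:
    doc_bow = []
    for term in set(doc):
        tf = doc.count(term) / len(doc)    # term frequency in this doc
        idf = math.log(n_docs / df[term])  # inverse document frequency
        doc_bow.append((term, tf * idf))   # (term, tfidf) as in the question
    doc_bows.append(doc_bow)
```

Note that with this plain log idf, a term appearing in every document (like "movie" here) gets a tf-idf of zero.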
Now I want to train a classifier, but I have no idea how. I found the example at http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/, but I still can't get it working. How are the tf-idf scores used, or how do they affect the classifier?
I know only a little in this area, but I can share what I understand; please correct me if I am wrong. Looking at the link, there is no reference to using tf-idf scores for classification. You should look at the link to understand how to use a Naive Bayes classifier. In general, the code looks like this (I took this code segment from the link):
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

negcutoff = len(negfeats) * 3 / 4
poscutoff = len(posfeats) * 3 / 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

classifier = NaiveBayesClassifier.train(trainfeats)
Each training instance is a tuple of a dictionary of features and a class label, for instance: ({"sucks": True, "bad": True, "boring": True}, "negative")
As for numeric attributes, I think one common approach is to bin them into categories, e.g. low/medium/high.
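A minimal sketch of that binning idea, turning each term's numeric tf-idf score into a categorical low/medium/high feature. The cut-off values here are arbitrary placeholders, not recommendations:

```python
def bin_score(tfidf, low=0.1, high=0.5):
    # Map a numeric tf-idf score into one of three categories.
    if tfidf < low:
        return "low"
    elif tfidf < high:
        return "medium"
    return "high"

def binned_feats(doc_bow):
    # doc_bow is a list of (term, tfidf) tuples, as built in the question.
    return {term: bin_score(score) for term, score in doc_bow}

feats = binned_feats([("sucks", 0.8), ("boring", 0.3), ("movie", 0.05)])
# feats is {"sucks": "high", "boring": "medium", "movie": "low"}
```

The resulting dictionary has the same shape as the word_feats output above, so it can be paired with a class label and passed to NaiveBayesClassifier.train the same way.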
With regards to the tf-idf scores, I am not certain. I think one approach is to use them for feature selection; for example, if the number of features is large, you may take only the top N words by tf-idf as features.
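A sketch of that feature-selection idea, keeping only the top-N terms per document by tf-idf and building boolean features from them (the helper name and the example scores are made up for illustration):

```python
def top_n_feats(doc_bow, n=2):
    # doc_bow is a list of (term, tfidf) tuples; keep the n highest-scoring
    # terms and turn them into boolean features for the classifier.
    top = sorted(doc_bow, key=lambda pair: pair[1], reverse=True)[:n]
    return {term: True for term, _ in top}

feats = top_n_feats([("great", 0.7), ("movie", 0.1), ("acting", 0.4)], n=2)
# keeps the two highest-scoring terms: "great" and "acting"
```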