python - Special characters in CountVectorizer (scikit-learn)


Consider this runnable example:

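    # coding: utf-8
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    # Three short documents containing the Swedish letters å, ä and ö.
    corpus = ['öåa hej ho', 'åter aba na', 'äs äp äl']
    x = vectorizer.fit_transform(corpus)
    l = vectorizer.get_feature_names()

    # Print each extracted feature (token) on its own line.
    for u in l:
        print(u)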

The output is:

    aba
    hej
    ho
    na
    ter

Why are å, ä and ö removed? Note that the vectorizer has strip_accents=None by default. I would be grateful if someone could help me with this.

This is an intentional way to reduce dimensionality while making the vectorizer tolerant to inputs whose authors are not consistent in their use of accented characters.

If you want to disable this feature, pass strip_accents=None to CountVectorizer, as explained in the documentation of the class.

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> CountVectorizer(strip_accents='ascii').build_analyzer()(u'\xe9t\xe9')
    [u'ete']
    >>> CountVectorizer(strip_accents=False).build_analyzer()(u'\xe9t\xe9')
    [u'\xe9t\xe9']
    >>> CountVectorizer(strip_accents=None).build_analyzer()(u'\xe9t\xe9')
    [u'\xe9t\xe9']
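
To make the idea of accent folding concrete, here is a minimal sketch using only the Python 3 standard library. It is not scikit-learn's own code; the helper name fold_to_ascii is hypothetical, and the approach (Unicode decomposition followed by dropping non-ASCII characters) is only an assumption about what an 'ascii' style of stripping amounts to. It shows how accented and unaccented spellings of a word end up as one token, which is the dimensionality reduction described in the answer.

```python
import unicodedata


def fold_to_ascii(text):
    """Hypothetical helper: approximate ASCII accent folding."""
    # NFKD splits 'é' into a plain 'e' plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards the combining marks and any
    # other character that has no ASCII representation.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Two spellings of each word collapse to a single token after folding.
tokens = ["\xe9t\xe9", "ete", "caf\xe9", "cafe"]
folded = sorted({fold_to_ascii(t) for t in tokens})
print(folded)  # ['cafe', 'ete']
```

Four input spellings become two distinct tokens, so a vocabulary built after folding is smaller than one built from the raw text.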
