python - Special characters in CountVectorizer (scikit-learn)
Consider this runnable example:
    # coding: utf-8
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    corpus = ['öåa hej ho', 'åter aba na', 'äs äp äl']
    x = vectorizer.fit_transform(corpus)

    # Print every token that ended up in the vocabulary.
    l = vectorizer.get_feature_names()
    for u in l:
        print u
The output is:

    aba
    hej
    ho
    na
    ter
Why are å, ä and ö removed? Note that strip_accents=None is the vectorizer's default. I would be grateful if someone could help me with this.
This is an intentional way to reduce dimensionality while making the vectorizer tolerant of inputs whose authors are not always consistent in their use of accented characters.
If you want to disable that feature, pass strip_accents=None to CountVectorizer, as explained in the documentation of this class.
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> CountVectorizer(strip_accents='ascii').build_analyzer()(u'\xe9t\xe9')
    [u'ete']
    >>> CountVectorizer(strip_accents=False).build_analyzer()(u'\xe9t\xe9')
    [u'\xe9t\xe9']
    >>> CountVectorizer(strip_accents=None).build_analyzer()(u'\xe9t\xe9')
    [u'\xe9t\xe9']
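
To make the dimensionality argument concrete, here is a minimal sketch of how folding accents collapses spelling variants into one vocabulary entry. It uses only the standard library and is written for Python 3, unlike the Python 2 snippets above. The helper fold_to_ascii and the token list are made up for illustration; the helper only approximates what strip_accents='ascii' appears to do and is not scikit-learn's own code.

```python
import unicodedata


def fold_to_ascii(text):
    """Hypothetical helper: decompose accented characters (NFKD), then drop
    everything that cannot be encoded as ASCII, e.g. combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Made-up tokens in which accents are used inconsistently.
tokens = ["café", "cafe", "résumé", "resume", "résume"]

# Without folding, every spelling variant is its own feature.
raw_vocabulary = sorted(set(tokens))

# With folding, variants that differ only by accents share one feature.
folded_vocabulary = sorted(set(fold_to_ascii(t) for t in tokens))

print(len(raw_vocabulary), raw_vocabulary)
print(len(folded_vocabulary), folded_vocabulary)
```

Five distinct spellings shrink to two features once the accents are folded away, which is the trade-off described above: a smaller vocabulary, at the cost of no longer telling accented and unaccented spellings apart.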