How can I pass a preprocessor to TfidfVectorizer? - sklearn - python-scikit-learn

Accepted answer
Score: 31

You simply define a function that takes a string as input and returns the preprocessed string. For example, a trivial function that uppercases strings would look like this:

def preProcess(s):
    return s.upper()
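
The same pattern works for less trivial cleaning. As a rough sketch (the function name `clean_text` and the choice to drop digits are assumptions made for illustration, not part of the original question), a preprocessor that lowercases the text and blanks out digits might look like this:

```python
import re

def clean_text(s):
    """Hypothetical preprocessor: lowercase the text and blank out digits."""
    s = s.lower()
    # Replace each run of digits with a single space so neighbouring words stay separate.
    s = re.sub(r"\d+", " ", s)
    return s

print(clean_text("Order 66 was the FIRST document."))
```

Any callable that maps one string to one string can be used in the same way.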

Once you have defined your function, you just pass it to your TfidfVectorizer object through the preprocessor argument. For example:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?'
     ]

X = TfidfVectorizer(preprocessor=preProcess)
X.fit(corpus)
# On scikit-learn < 1.0 use X.get_feature_names(); it was removed in 1.2.
list(X.get_feature_names_out())

Results in:

['AND', 'DOCUMENT', 'FIRST', 'IS', 'ONE', 'SECOND', 'THE', 'THIRD', 'THIS']

This indirectly answers your follow-up question: even though lowercase is set to True, the uppercasing preprocessor overrides it. This is also mentioned in the documentation:

preprocessor : callable or None (default) Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
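
To see the override in action, here is a small sketch that compares a vectorizer using a custom preprocessor with a default one. The function name `to_upper` and the shortened corpus are assumptions made for this illustration. It inspects the fitted `vocabulary_` attribute rather than the feature-name methods, since the latter differ between scikit-learn versions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'And the third one.',
]

def to_upper(s):
    # Hypothetical preprocessor, used only for this illustration.
    return s.upper()

# lowercase=True is the default, but it has no effect once a preprocessor is supplied.
# ngram_range still applies, because tokenizing and n-gram generation are preserved.
custom = TfidfVectorizer(preprocessor=to_upper, lowercase=True, ngram_range=(1, 2))
custom.fit(corpus)

# Default behaviour for comparison: built-in preprocessing lowercases the text.
default = TfidfVectorizer()
default.fit(corpus)

print(sorted(custom.vocabulary_))   # upper-case unigrams and bigrams
print(sorted(default.vocabulary_))  # lower-case unigrams
```

If you still want lowercasing (or accent stripping) alongside your own cleaning, do it inside the preprocessor function itself.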
