How can I pass a preprocessor to TfidfVectorizer?
You simply define a function that takes a string as input and returns the preprocessed string. For example, a trivial function that uppercases strings would look like this:
def preProcess(s):
    return s.upper()
Once you have your function, just pass it to your TfidfVectorizer object. For example:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?'
]
X = TfidfVectorizer(preprocessor=preProcess)
X.fit(corpus)
list(X.get_feature_names_out())  # use X.get_feature_names() on scikit-learn older than 1.0
Results in:
['AND', 'DOCUMENT', 'FIRST', 'IS', 'ONE', 'SECOND', 'THE', 'THIRD', 'THIS']
This indirectly answers your follow-up question: even though lowercase is set to True, the uppercasing preprocess function overrides it. This is also mentioned in the documentation:
preprocessor : callable or None (default)
    Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps.
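Since a custom preprocessor replaces the whole preprocessing stage, lowercasing has to be done inside your own function if you still want it. Here is a minimal sketch of that idea. The function names upper_preprocess and lower_strip_digits and the small corpora are made up for illustration, and the learned vocabulary is read from the fitted vocabulary_ attribute so the example does not depend on which feature-name method your scikit-learn version provides.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def upper_preprocess(s):
    # Hypothetical preprocessor: uppercases the document and does nothing else.
    return s.upper()


def lower_strip_digits(s):
    # Hypothetical preprocessor: lowercases explicitly (lowercase=True is not
    # applied once a custom preprocessor is supplied), then replaces runs of
    # digits with a space.
    return re.sub(r'\d+', ' ', s.lower())


# lowercase=True is the default, but the custom preprocessor takes over the
# stage that would have applied it, so the tokens stay uppercase.
upper_vec = TfidfVectorizer(preprocessor=upper_preprocess, lowercase=True)
upper_vec.fit(['This is the first document.', 'And the third one.'])
vocab_upper = sorted(upper_vec.vocabulary_)

# Doing the lowercasing inside the preprocessor keeps the usual behaviour.
lower_vec = TfidfVectorizer(preprocessor=lower_strip_digits)
lower_vec.fit(['Doc 1 has 22 words', 'Doc 2'])
vocab_lower = sorted(lower_vec.vocabulary_)

print(vocab_upper)
print(vocab_lower)
```

The tokenizer and n-gram steps are untouched in both cases; only the string transformation applied before tokenizing changes.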