ColumnTransformer Fails With CountVectorizer in a Pipeline

I'm trying to transform text using sklearn's CountVectorizer within pipelines combined with ColumnTransformer. However, the pipeline returns an incorrect array. Why is my pipeline

Solution 1:

You can use make_column_transformer as shown below. The remainder parameter controls what happens to the columns that are not assigned to any transformer. By default, remainder is 'drop', which means those columns are discarded; 'passthrough' keeps them unchanged, and you can also pass an estimator to transform them:

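from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

preprocess = make_column_transformer((CountVectorizer(), 'text_feat'),
                                     remainder='passthrough')
make_pipeline(preprocess).fit_transform(X)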
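To make the effect of remainder concrete, here is a minimal sketch on a toy DataFrame. The data and the num_feat column are hypothetical (they are not from the question); only text_feat matches the original code.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: one text column and one extra numeric column.
X = pd.DataFrame({
    "text_feat": ["red apple", "green apple", "red car"],
    "num_feat": [1.0, 2.0, 3.0],
})

# Default remainder='drop': only the vectorized text columns are returned.
drop_rest = make_column_transformer((CountVectorizer(), "text_feat"))

# remainder='passthrough': num_feat is appended unchanged after the text counts.
keep_rest = make_column_transformer((CountVectorizer(), "text_feat"),
                                    remainder="passthrough")

dropped = drop_rest.fit_transform(X)
kept = keep_rest.fit_transform(X)

# The vocabulary has four tokens (apple, car, green, red),
# so passthrough adds exactly one extra column.
print(dropped.shape, kept.shape)
```

The result may be a dense array or a sparse matrix depending on its density (see the sparse_threshold parameter), so comparing .shape is the safest way to check it.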


The following blog goes into more details: https://jorisvandenbossche.github.io/blog/2018/05/28/scikit-learn-columntransformer/

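A few tips on your code: when transforming features, you do not need to (read: shouldn't) pass y (i.e. the target). The issue in your code is that you pass a list of text features instead of the name of the column. CountVectorizer expects a 1-D sequence of documents: a single string column name makes ColumnTransformer hand it that column as 1-D, while a list of names hands it a 2-D selection. If you change your code slightly, you should get the same results as above.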

preprocessor = ColumnTransformer(
        transformers=[('text', text_transformer, 'text_feat')])
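The difference between a string and a list selector can be seen with plain pandas. This is a rough sketch on hypothetical toy data, assuming X is a DataFrame with a text_feat column as in the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the question's X.
X = pd.DataFrame({"text_feat": ["red apple", "green apple", "red car"]})

as_series = X["text_feat"]    # 1-D: what a string selector passes on
as_frame = X[["text_feat"]]   # 2-D: what a list selector passes on

# CountVectorizer iterates over its input to get documents.
docs_from_series = list(as_series)  # the three text documents
docs_from_frame = list(as_frame)    # iterating a DataFrame yields column names

# With the string selector, each row is treated as one document.
preprocessor = ColumnTransformer(
    transformers=[("text", CountVectorizer(), "text_feat")]
)
counts = preprocessor.fit_transform(X)
n_rows = counts.shape[0]

print(docs_from_series)
print(docs_from_frame)
print(n_rows)
```

With the list selector, the vectorizer would see the column name as its only "document", which is consistent with the incorrect array described in the question.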

Solution 2:

# wrap in ColumnTransformer
preprocessor = ColumnTransformer(transformers=[('text', CountVectorizer(), 'text_feat')])

# second pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

X_test = pipeline.fit_transform(X)

This works and seems the simplest to me.
