I have created the following subdirectory with a set of items: http://www.sc.ehu.es/ccwbayes/master/selected-dbs/nlp-naturallanguageprocessing/moviePolarity10662Reviews-TextPreprocessed/

All of them are .csv files that result from preprocessing the "movie-polarity-10662reviews.csv" corpus in different ways. The original file can be found here: https://www.kaggle.com/datasets/mrbaloglu/rotten-tomatoes-reviews-dataset

The preprocessing steps can be found on slide 22 of the outlierDetection topic: stopword removal, stemming, keeping only the terms that appear in at least 1% of the documents of the corpus (hence the 0.99 sparseness threshold), etc.

- "movie10662ReviewsSparse099.csv" is the result of the cleaning described above.
- "movie1000RandomReviews.csv" is a reduced version of the previous one: 1000 documents selected at random (you can use it to reduce computing times).
- "movieReviews10662Positive.csv" keeps only the "positive polarity" reviews: you can use it to train AutoEncoders, as they demand one-class training corpora.
- "movie1000RandomPositiveReviews.csv" is a reduced version of the previous one: 1000 randomly selected positive-polarity documents.

You can use these files in case you have difficulties generating a clean, preprocessed version of any other corpus you like on kaggle.com: machine learning and multivariate outlier detection techniques can be applied directly over them.

Try first to do this cleaning process yourself with Python's nltk library, and with scikit-learn's CountVectorizer function to create the bag-of-words. The result should be a similar, cleaned .csv corpus in document-term-matrix form, ready to apply ML and multivariate outlier detection techniques over it.