I have created the following subdirectory with a set of items: http://www.sc.ehu.es/ccwbayes/master/selected-dbs/nlp-naturallanguageprocessing/moviePolarity10662Reviews-TextPreprocessed/

All of them are .csv files that result from preprocessing the "movie-polarity-10662reviews.csv" corpus in different ways. The original file can be found here: https://www.kaggle.com/datasets/mrbaloglu/rotten-tomatoes-reviews-dataset

The preprocessing steps can be found on slide 22 of the outlierDetection topic: stopword removal, stemming, keeping only the terms that appear in at least 1% of the documents of the corpus (hence the 0.99 sparseness threshold), etc.

- "movie10662ReviewsSparse099.csv" is the result of the cleaning described above.
- "movie1000RandomReviews.csv" is a reduced version of the previous one: 1000 documents selected at random (you can use it to reduce computing times).
- "movieReviews10662Positive.csv" keeps only the "positive polarity" reviews: you can use it to train AutoEncoders, as they demand one-class training corpora.
- "movie1000RandomPositiveReviews.csv" is a reduced version of the previous one: 1000 randomly selected positive-polarity documents.

You can use these files in case you have difficulties generating a clean, preprocessed version of any other corpus you like on kaggle.com: machine learning and multivariate outlier detection techniques can be applied directly over them.

Try first to do this cleaning process yourself with Python's nltk library, and with scikit-learn's CountVectorizer function to create the bag-of-words. The result should be a similar, cleaned .csv corpus in document-term-matrix form, ready to apply ML and multivariate outlier detection techniques over it.