Session: Classification, Clustering, Data Analysis and Data Mining I (06/07, 14:30-16:30, Room 10A)

Vector Representation of Non-standard Spellings Using Dynamic Time Warping and a Denoising Autoencoder

Mehdi Ben Lazreg, Morten Goodwin and Ole-Christoffer Granmo

University of Agder, Norway

Non-standard spellings on Twitter pose challenges for many natural language processing tasks. Traditional approaches mainly treat the problem as one of translation, spell checking, or speech recognition. This paper proposes a method that represents the stochastic relationship between words and their non-standard versions as real-valued vectors. The method uses dynamic time warping to preprocess the non-standard spellings and a denoising autoencoder to derive the vector representation. The derived vectors encode word patterns, and the Euclidean distance between them defines a distance in the word space that challenges the prevailing edit distance. After training the autoencoder on 1051 different words and their non-standard versions, the results show that the new distance places the correct standard word among the five closest words in 89.53% of cases, compared with only 68.22% for the edit distance.
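
As a rough illustration of the two string comparisons named in the abstract, the following Python sketch contrasts the classical edit distance with a dynamic time warping (DTW) alignment over characters. It is not the authors' implementation: the 0/1 character cost, the function names, and the toy vocabulary are assumptions made for this example, and the autoencoder stage is omitted entirely.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def dtw_distance(a: str, b: str) -> float:
    """DTW cost between two character sequences.

    Assumption: the local cost is 0 for identical characters and 1 otherwise.
    Unlike edit distance, one character may align with several characters of
    the other string, so repeated letters add no cost.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def top_k(query: str, vocabulary: list[str], k: int = 5, distance=edit_distance) -> list[str]:
    """Return the k vocabulary words closest to the query under the given distance."""
    return sorted(vocabulary, key=lambda w: (distance(query, w), w))[:k]


# Hypothetical toy vocabulary, for illustration only.
vocab = ["cool", "cold", "pool", "school"]
print(top_k("coool", vocab, k=2))
print(edit_distance("cool", "coool"), dtw_distance("cool", "coool"))
```

In this sketch the stretched spelling "coool" is one edit away from "cool" but has zero DTW cost, which hints at why a warping-based alignment may suit elongated spellings; how the paper combines the alignment with the autoencoder is not shown here.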