{"id":2200,"date":"2018-08-28T09:21:11","date_gmt":"2018-08-28T09:21:11","guid":{"rendered":"https:\/\/procogia.com\/tweet-clustering-with-word2vec-and-k-means\/"},"modified":"2024-04-04T12:21:35","modified_gmt":"2024-04-04T12:21:35","slug":"tweet-clustering-with-word2vec-and-k-means","status":"publish","type":"post","link":"https:\/\/procogia.com\/tweet-clustering-with-word2vec-and-k-means\/","title":{"rendered":"Tweet clustering with word2vec and k-means"},"content":{"rendered":"\r\n

Most of the data we encounter in the real world is unstructured. Text is a perfect example: it contains a vast amount of information that isn’t structured in a way a computer can easily process. Translating unstructured data into something a computer can act upon is a major focus of artificial intelligence. The field of natural language processing (NLP) has grown tremendously in the past several years, and products such as smart assistants have developed out of this work. With our clients at ProCogia, we find that an important application of data science is bringing structure to large unstructured datasets like tweets, customer reviews, or survey responses.

Let’s look at some real-world textual data and see how applying machine learning can lead to potentially valuable insights.

For this short analysis we’ll look at tweets. FiveThirtyEight, the online news organization best known for political polling analysis, published a dataset of tweets linked to Russian trolls. We’ll explore this dataset and use k-means, a relatively simple machine learning algorithm, to extract topics from similar tweets. Finally, we’ll look at when some of these topics were popular in relation to news stories during the 2016 election.

The published data is available in 13 CSV files and amounts to nearly three million tweets. For each tweet we know the author’s Twitter handle, tweet content, published date, language, and region. To help focus our attention, let’s examine only English-language tweets in the United States. Applying these filters to the dataset leaves just under two million tweets. To get started, let’s plot the number of tweets per month from December 2015 to May 2018.
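A minimal sketch of that loading and filtering step in Python with pandas might look like the following. The file pattern and column names (language, region, publish_date, content) follow FiveThirtyEight’s published schema, but treat them as assumptions if your copy of the data differs.

```python
# Sketch: load the published CSVs and keep English-language US tweets.
# File pattern and column names are assumptions based on the published data.
import glob
import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("data/IRAhandle_tweets_*.csv")]
tweets = pd.concat(frames, ignore_index=True)

# Keep only English-language tweets from the United States.
tweets = tweets[(tweets["language"] == "English") &
                (tweets["region"] == "United States")]

# Parse dates so we can count tweets per month.
tweets["publish_date"] = pd.to_datetime(tweets["publish_date"])
monthly_counts = tweets.set_index("publish_date").resample("M").size()
```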

\"\"<\/p>\r\n

Right away, we can see clear patterns between tweet activity and major developments in the presidential campaigns.

Next, let’s clean up the text of the tweets a bit. I remove all punctuation, make all words lowercase, and strip out stop words (common words like “the”) as well as non-words like internet links and emojis. Here’s an example of the tweet cleaning process:

“Veteran strategist Paul Manafort becomes Trump’s campaign chairman https://t.co/8c9twG9smG”

Cleaned up, this tweet is:

“veteran strategist paul manafort becomes trumps campaign chairman”
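A rough sketch of such a cleaning function, continuing from the loading step above, is shown below. The regular expressions and the NLTK stop-word list are my assumptions, not necessarily the exact pipeline used here.

```python
# Sketch of the cleaning step: lowercase, drop links, punctuation,
# digits, emojis, and stop words. Requires nltk.download("stopwords") once.
import re
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"http\S+", "", text)   # remove links
    text = re.sub(r"[^a-z\s]", "", text)  # remove punctuation, digits, emojis
    tokens = [word for word in text.split() if word not in STOP_WORDS]
    return " ".join(tokens)

tweets["clean"] = tweets["content"].apply(clean_tweet)
```

Applied to the Manafort tweet above, this function produces exactly the cleaned version shown.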

Although the cleaning process goes a long way toward rendering the text understandable to a computer, at the end of the day the computer must deal with numbers, not words. The process of converting words to numbers is called “vectorizing,” and there are many techniques available to accomplish this task. To vectorize the tweet words, I use a Python library called Gensim to train a shallow neural network according to the word2vec algorithm developed by researchers at Google.
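Training such a model with Gensim can be as short as the sketch below. The hyperparameters shown (100-dimensional vectors, a context window of 5, a minimum word count of 5) are illustrative defaults rather than the values behind this analysis.

```python
# Sketch: train word2vec on the cleaned tweets with Gensim.
# Note: the parameter is vector_size in Gensim 4.x (it was size in 3.x).
from gensim.models import Word2Vec

sentences = [tweet.split() for tweet in tweets["clean"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
```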

The nice thing about word2vec is that similar words, or words that are used in sentences in similar ways, are close to each other in vector space. For example, if we ask the word2vec model for the words most similar to “america,” we get “country,” “usa,” and “nation,” among others. Conveniently, the word2vec model has already grouped similar words for us, but we want to group whole tweets. To accomplish this, we need to translate the collection of words in an individual tweet into a single vector representation. I have chosen to simply take the average of each word vector in the tweet.
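In code, the similarity query and the averaging step might look like this sketch, where I assume that words missing from the vocabulary (for example, those below the min_count threshold) are simply skipped:

```python
import numpy as np

# Words most similar to "america" in the learned vector space.
print(model.wv.most_similar("america", topn=5))

def tweet_vector(tokens, model):
    """Average the word vectors of a tweet, skipping out-of-vocabulary words."""
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)  # tweet emptied by cleaning
    return np.mean(vectors, axis=0)

tweet_vectors = np.array([tweet_vector(t.split(), model) for t in tweets["clean"]])
```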

With each tweet now represented by an average word vector, we can use any unsupervised clustering technique to group similar tweets. For simplicity, I have used k-means, an algorithm that iteratively updates a predetermined number of cluster centers based on the Euclidean distance between the centers and the data points nearest them. In the end, any single tweet falls into one of k clusters, where k is the user-defined number of expected clusters. Best practices exist for determining the optimal value of k, but in this case I have simply chosen a large number: 50. I then inspected the clusters manually to merge similar clusters and identify the most distinctive ones.
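With scikit-learn, the clustering step is only a couple of lines. This is a sketch: the random_state and n_init values are my own choices rather than anything specified in the analysis.

```python
# Sketch: cluster the averaged tweet vectors into k = 50 groups.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=50, n_init=10, random_state=42)
tweets["cluster"] = kmeans.fit_predict(tweet_vectors)
```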

Here is the breakdown of cluster size (it’s worth noting that the cluster label numbers are completely arbitrary; there is no larger meaning behind any pattern that may arise in such a distribution):

\"\"<\/p>\r\n

We see that some clusters have many more tweets than others. This could mean one of two things: either there simply were many tweets posted on a similar topic, or those clusters are defined by vague words and are not distinctive. The latter seems to be true of cluster 8, while the former is true of cluster 37, where a couple of popular, generic quotes were widely shared. By inspecting a few randomly chosen tweets in each cluster, I found the following interesting groups: