In part 1 I graphed the tweets sent during the broadcast of the People’s Debate in Killarney, and the graph hinted at some of the clusters I expected. Text clustering involves grouping similar documents together based on their content, and there are a number of ways this can be done. I primarily followed this tutorial by Brandon Rose, which clustered the top 100 movies of all time from IMDb.
First I cleaned each tweet following an example taken from a Word2Vec tutorial.* All punctuation and URLs are removed, as is the vinb hashtag contained in every tweet, since tweets could otherwise cluster around it. Stop words such as “the” and “and” are also removed and the whole tweet is converted to lowercase. This is all done to make the clustering easier: “Ferris”, “ferris” and “ferris,” are now all converted to the same word instead of showing up as three separate words. The cleaned-up tweets are then reassembled and fed into a TfidfVectorizer.
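As a rough sketch, the cleaning step described above might look like the following. The function name `clean_tweet`, the tiny `STOP_WORDS` set and the letters-only regular expression are assumptions made for illustration. The script behind this post may differ, and a real run would use a much fuller stop-word list.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the post).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "is", "in", "on", "for", "that", "it"}


def clean_tweet(text):
    """Return a lowercase, space-separated version of a tweet with noise removed."""
    # Drop URLs first so their punctuation does not leave stray fragments behind.
    text = re.sub(r"https?://\S+", " ", text)
    # Drop the hashtag shared by every tweet, whatever its capitalisation.
    text = re.sub(r"#vinb\b", " ", text, flags=re.IGNORECASE)
    # Replace anything that is not an ASCII letter with a space.
    # Note: this also removes digits and splits accented words such as "Dáil".
    text = re.sub(r"[^a-zA-Z]", " ", text)
    words = text.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)


cleaned = clean_tweet("Ferris, ferris and FERRIS! #vinb http://t.co/abc")
print(cleaned)  # ferris ferris ferris
```

All three spellings of the name collapse to a single token, which is the point of the cleaning step.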
So I want to cluster the tweets based on their content. What exactly does that mean? There are a lot of good tutorials out there on natural language processing; the Kaggle Bag of Words tutorial is a good place to start, and this article explains some of the points behind tf-idf.
Basically the term frequency part measures how often a word appears in each document, in this case each tweet. So in our first tweet, if we have 20 words and “griffin” appears once, this gives us 1/20 = 0.05. The inverse document frequency part measures how important the term is across the overall list of documents, or tweets. We have 887 tweets and if “griffin” appears in 30 of them our idf is log10(887/30) = 1.47. Therefore, the weight for the word “griffin” in our first tweet is 0.05 * 1.47 = 0.0735.
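The arithmetic above can be checked in a few lines. This is only the textbook formula, with a base-10 logarithm (the base that reproduces the 1.47 figure); the function names are hypothetical. scikit-learn's TfidfVectorizer uses a slightly different variant by default (raw counts, a smoothed natural-log idf and L2 normalisation of each row), so the weights it produces will not match these numbers exactly.

```python
import math


def term_frequency(term_count, words_in_tweet):
    """Share of the tweet's words taken up by the term."""
    return term_count / words_in_tweet


def inverse_document_frequency(total_tweets, tweets_with_term):
    """Base-10 log of how rare the term is across all tweets."""
    return math.log10(total_tweets / tweets_with_term)


tf = term_frequency(1, 20)                 # "griffin" once in a 20-word tweet
idf = inverse_document_frequency(887, 30)  # "griffin" in 30 of 887 tweets
weight = tf * idf

print(round(tf, 2), round(idf, 2), round(weight, 4))  # 0.05 1.47 0.0735
```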
So after doing this for every tweet we end up with a matrix of 887 rows, one for each tweet, and around 7500 columns, one for each unique term that appears anywhere in the tweets. The max_features parameter in the TfidfVectorizer can be used to limit this, so max_features=1000 would result in an 887 x 1000 matrix. Each column contains the tf-idf weight for that term in each tweet. For a tweet that doesn’t contain “griffin” the weight will have a value of 0 * 1.47 = 0. With an 887 x 1000 matrix we can now in effect plot each tweet in 1000 dimensions and run kmeans on them to find which ones are closest to each other. Tweets that end up close together should have similar content.
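A minimal sketch of the vectorise-then-cluster step, using scikit-learn. The six "tweets" are invented stand-ins for cleaned tweets, and the values chosen for `n_clusters`, `n_init`, `random_state` and `max_features` are assumptions picked to suit the toy data, not the settings behind the results in this post.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for cleaned tweets.
cleaned_tweets = [
    "martin ferris closing remarks",
    "martin ferris wants more money",
    "michael healy rae cap",
    "healy rae fooling nobody",
    "post offices closing kerry",
    "grant thornton report post offices",
]

# One row per tweet, one column per term (single words plus 2- and 3-word phrases).
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = vectorizer.fit_transform(cleaned_tweets)
n_tweets, n_terms = tfidf_matrix.shape

# max_features caps the number of columns, keeping the most frequent terms.
capped_vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=10)
capped_matrix = capped_vectorizer.fit_transform(cleaned_tweets)

# Each tweet is a point in n_terms dimensions; k-means groups nearby points.
# Fixing random_state makes reruns repeatable; leaving it unset gives
# different clusters from run to run.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(tfidf_matrix).tolist()

print(tfidf_matrix.shape, capped_matrix.shape)
print(labels)
```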
I did not include max_features, max_df or min_df in the TfidfVectorizer because there are too many words that occur rarely or only once and too few that occur regularly. Adding and changing these numbers gave very different results, but I am happy with the clusters generated below.
I tried a number of different cluster counts between 5 and 30 and was happiest with 15. Selecting a number of clusters is a dark art because there is no such thing as a right or wrong number of clusters. (If you are unsure about kmeans this is a helpful video, especially now that you can see the tweets in terms of points instead of as a group of words.)
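One way to compare different cluster counts is to rerun kmeans for each count and look at how the tweets are shared out. The sketch below does this on invented toy data with small counts; the helper name `cluster_sizes` and the choice of looking at the largest cluster's share are assumptions for illustration, not the procedure used to settle on 15.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for cleaned tweets.
cleaned_tweets = [
    "martin ferris closing remarks",
    "martin ferris wants more money",
    "michael healy rae cap",
    "healy rae fooling nobody",
    "post offices closing kerry",
    "grant thornton report post offices",
    "mental health service cuts",
    "health service tralee hospital",
]
matrix = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(cleaned_tweets)


def cluster_sizes(matrix, k, seed=0):
    """Fit k-means with k clusters and return the number of tweets in each."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(matrix)
    return np.bincount(labels, minlength=k).tolist()


sizes_by_k = {k: cluster_sizes(matrix, k) for k in (2, 3, 4)}
for k, sizes in sizes_by_k.items():
    largest_share = max(sizes) / sum(sizes)
    print(k, sizes, round(largest_share, 2))
```

A cluster that keeps swallowing a large share of the tweets at every count is worth a closer look.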
A description of all the clusters can be found here. If you run the code yourself it will generate different clusters with different keywords, but there are some that are very common, such as:
Grant Thornton gets a cluster all to himself after being mentioned as the author of a report into post office closures (I believe it is the name of a company). The word fella is also an important keyword in this cluster.
Cluster 1 words: thornton, fella, grant, grant thornton, name,
Who's the joe Duffy fella? #vinb
That fella can't pronounce 'billion' properly. #vinb
A fella by the name of Grant Thornton.ha! . #VinB
A cluster which heavily features a number of water-related hashtags.
Cluster 3 words: pipe, peoplesdebate pipe, peoplesdebate, pipe peoplesdebate, irishwater,
#inequality #Protest #austerity #LoneParents #not1pipe #PeoplesDebate #VinB #corruption #irishwater #government http://t.co/SXuiSPWFEe
#vinb #PeoplesDebate #not1pipe #WaterCharges #government #Burtoncut #Lab #FG #LoneParents #cuts #austerity #Protest http://t.co/CIeKgVLVfb
Plenty of money to throw at #irishwater corruption but pittance to mental health #not1pipe #PeoplesDebate #vinb https://t.co/OwOpuR0uJa
What's interesting is that there are two Healy-Rae clusters, one clustering around Michael and Healy-Rae singular, the other clustering around Healy-Raes plural.
Cluster 4 words: rae, healy rae, healy, michael, michael healy rae,
50 shades of Rae is out now on dvd #vinb
Goota love the Healy -Rae's....every time they speak, tis like listening to a Munster Final on the radio.... #vinb
Healy Rae is fooling nobody. His aul boy was a Fianna Failer and no amount of cap wearing can hide that #vinb

Cluster 10 words: healy raes, raes, healy, may, il,
I fear there may be Healy Raes in the Dáil for as long as Kerry is attached the island. #vinb
Why do the Healy Rays exaggerate the Kerry accent when they have a microphone in front of them? Nobody else down there talks like that #vinb
Methinks the 'Healy-Raes' are putting on the accent with a plough. #VinB
Tweets relating mainly to Martin Ferris, many making mention of the lack of a scrap between himself and Vincent Browne.
Cluster 6 words: ferris, martin, martin ferris, man, well,
Though not relevant to the peoples debate, I think vincents attack on Martin Ferris should,nt have been edited. Just the show the bias #vinb
#vinb @vincentbrowne what happened your questioning of Martin Ferris. Cannot believe it was edited out. Going soft. Best part of the night
Martin Ferris on #vinb wants more money in everyone's pockets. Doesn't say whose money tho. Populist shite.
Martin Ferris using his closing remarks to hit every single populist point he could think of. Ugh. #vinb
Post offices got a cluster of their own, as did health, mental health and the health service. Tralee got a small cluster, referencing the (controversial, what with it being Killarney and all) singing of “The Rose of Tralee” at the start of the night, and there was a large cluster based around Jackie Crowe, a woman who spoke during the night about her experience of cancer and treatment at Tralee hospital. This is the full list of clusters:
Top terms per cluster:
Cluster 0 words: one, ireland, would, tv, fuck, Length: 53
Cluster 1 words: thornton, fella, grant, grant thornton, name, Length: 24
Cluster 2 words: peoplesdebate, people, great, kerry, day, Length: 97
Cluster 3 words: pipe, peoplesdebate pipe, peoplesdebate, pipe peoplesdebate, irishwater, Length: 18
Cluster 4 words: rae, healy rae, healy, michael, michael healy rae, Length: 51
Cluster 5 words: people, labour, kerry, people kerry, kerry people, Length: 51
Cluster 6 words: ferris, martin, martin ferris, man, well, Length: 48
Cluster 7 words: post, post offices, offices, good, re, Length: 60
Cluster 8 words: tralee, rose, rose tralee, way, brogue tralee, Length: 18
Cluster 9 words: get, jackie, jackie crowe, crowe, right, Length: 85
Cluster 10 words: healy raes, raes, healy, may, il, Length: 19
Cluster 11 words: debate, back, tonight, back door, debate tonight, Length: 22
Cluster 12 words: health, mental, mental health, ff, health service, Length: 36
Cluster 13 words: cap, let, party, please, song, Length: 249
Cluster 14 words: kerry, woman, see, tonight, brave, Length: 56
Of special note is cluster 13. While it would be great to think of 249 tweets about Michael Healy-Rae’s cap, this is what I termed an uber-cluster. These continued to exist even at 30 clusters, with one cluster taking in up to 33% of the total tweets while some clusters held only a single tweet. Again, this has to do with the sheer divergence in topics being discussed on the night. References are made to the Berkeley tragedy, the Greek crisis, Joan Burton, a Transformers movie and a wealth of hashtags. There are about 7500 unique words in all these tweets put together when stop words are removed, which is very small considering that very few of those words repeat.
Continuously rerunning the script generates interesting results. As long as they are not sucked into the uber-cluster, clusters around health, post offices, Martin Ferris, water, Brendan Griffin and at least one Healy-Rae cluster appear consistently. Others are a “johnny, healy rae, subtitles” cluster, a “healy rae, hat” cluster and the Cluster of Kerry, which contained lots of mentions of the county.
There are a number of problems too, such as the “vote, orgasm vote” cluster: a number of tweets containing the word “vote” cluster around a tweet that says “I had an orgasm vote for me #Vinb”. Weak clusters like the Orgasm Vote cluster disappear when the ngram_range in the TfidfVectorizer is removed, which defaults it to (1, 1), meaning single words only. This, however, then fails to pick up connections between words such as michael healy rae, fine gael, jackie crowe etc. Again, there are no right or wrong answers with clustering and I am pretty happy with the clusters I get.
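The effect of ngram_range is easiest to see in the vocabulary the vectoriser builds. This small sketch, on three invented cleaned tweets, compares the default (single words only) with ngram_range=(1, 3); the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for cleaned tweets.
cleaned_tweets = [
    "michael healy rae cap",
    "jackie crowe brave woman",
    "fine gael vote",
]

# Default ngram_range is (1, 1): every column is a single word.
unigrams = TfidfVectorizer()
unigrams.fit(cleaned_tweets)

# (1, 3) adds every 2-word and 3-word phrase as its own column.
with_phrases = TfidfVectorizer(ngram_range=(1, 3))
with_phrases.fit(cleaned_tweets)

print("healy rae" in unigrams.vocabulary_)
print("michael healy rae" in with_phrases.vocabulary_)
print(len(unigrams.vocabulary_), len(with_phrases.vocabulary_))
```

The phrase columns are what let names like michael healy rae act as a single feature, at the cost of a much wider and sparser matrix in which one-off phrases can seed weak clusters.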
I am happy that I was consistently able to get clusters covering the major topics on the night.
Cleaning the tweets was very important. A closer look at the tweets may indicate more cleaning that could be done, such as catching the spelling “Healy-Rays” in place of “Healy-Raes”. Adding ngram_range=(1, 3) was a massive improvement, and further tweaking of the TfidfVectorizer hyperparameters, especially max_features, may provide better results.
900 or so documents of at most 140 characters each are hard to cluster well, especially when there are so many topics and so many words which only appear once or twice. Even so, I am pleased with the ability to pick out two Healy-Rae clusters, a post office cluster, a health cluster and a Martin Ferris cluster, as these were some of the big issues on the night of the debate.
The twitter machine is dark and full of terrors.
*I ran a brief Word2Vec script but there were too few commonly occurring words to get any solid results. There was a strong correlation between “Kerry” and “healy” and there was also a strong connection between “rae” and, you guessed it, “hat”.