Pages

TwitterMining

This is an austere implementation of mining tweets using R. I will work on making it better sometime in the future (once I engulf R data structures more).

Connecting to Twitter API

The first step was to get consumer key and consumer secret from Twitter. This is used for authentication purposes. To get these, I just created a new app on Twitter.
Apparently, on Windows system the authentication requires an extra step. Thanks to this blog post for clarifying that. See below.

library("ROAuth")
library("twitteR")

#necessary step for Windows
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

#to get your consumerKey and consumerSecret see the twitteR documentation for instructions
credentials <- OAuthFactory$new
                        (consumerKey='#################...',
                         consumerSecret='#################...',
                         requestURL='https://api.twitter.com/oauth/request_token',
                         accessURL='https://api.twitter.com/oauth/access_token',
                         authURL='https://api.twitter.com/oauth/authorize')

credentials$handshake(cainfo="cacert.pem")
save(credentials, file="twitter authentication.Rdata")
registerTwitterOAuth(credentials)

After following the link provided, allowing access to the Twitter app and copying the security key produced back in R, everything was all set. To check:

> registerTwitterOAuth(credentials)
[1] TRUE

Tweet Cloud

library(tm)
library(wordcloud)

tag<- searchTwitter("#AFC", n=100, cainfo="cacert.pem")
df <- do.call("rbind", lapply(tag, as.data.frame))
twitterCorpus <- Corpus(VectorSource(df$text))
twitterCorpus <- tm_map(twitterCorpus, tolower)
# remove punctuation
twitterCorpus <- tm_map(twitterCorpus, removePunctuation)
# remove numbers
twitterCorpus <- tm_map(twitterCorpus, removeNumbers)
tdm <- TermDocumentMatrix(twitterCorpus, control = list(minWordLength = 1))
m <- as.matrix(tdm)
# calculate the frequency of words
v <- sort(rowSums(m), decreasing=TRUE)
words <- names(v)
d <- data.frame(word=words, freq=v)
wordcloud(d$word, d$freq, min.freq=3)

Result

#AFC

Although it generates the cloud, at this point, it is not effective enough for analysis. One way to make it better is to get rid of common English words such as "and", "but", etc. Secondly, getting rid of Unicode and non-English languages would be an improvement also. And, there is MORE that needs to be done.
Hopefully soon.

Soon

I just learnt that the common words mentioned above are referred as stop-words. And, it is pretty easy to handle them. After removing them and adding some colors the clouds look more beautiful. I also got rid of all words with substring "http" (maybe not in the most efficient way.. I will have to look into it).

# First of all make sure the connection to twitter is authenticated.
# see file twiiterconnection.R
# 
# if that has been done previously and this is a new session, open .Rdata
# that has stored credentials
# 
#if registerTwitterOAuth(credentials) is TRUE, you are good to go

library(tm)
library(wordcloud)
library(ROAuth)
library(twitteR)

# check connection
registerTwitterOAuth(credentials)

tag<- searchTwitter("#MH370", n=700, lang= "en", cainfo="cacert.pem")
df <- do.call("rbind", lapply(tag, as.data.frame))

twitterCorpus <- Corpus(VectorSource(df$text))
tdm <- TermDocumentMatrix(
  twitterCorpus, control = list(minWordLength = 1,
                                removePunctuation = TRUE, 
                                removeNumbers = TRUE,
                                stopwords = TRUE,
                                tolower=TRUE))

m <- as.matrix(tdm)

# frequency of words
frequency <- sort(rowSums(m), decreasing=TRUE)
words <- names(frequency)

# get rid of words that contains "http"
# hopefully there is a better way to do it
words[which(grepl("http",names(frequency)))] <- ""

d <- data.frame(word=words, freq=frequency)

#save the image in png format
png("mh370.png", width=12, height=12, units="in", res=300)
wordcloud(d$word, d$freq,scale=c(10,.4),min.freq=10,
          max.words=Inf, random.order=FALSE, rot.per=.3, colors=brewer.pal(8, "Dark2"))
dev.off()


Result

#MH370

#AFC

wordcloud(d$word, d$freq,scale=c(8,.1),min.freq=7,
          max.words=Inf, random.order=FALSE, 
          rot.per=.3, vfont=c("gothic english","plain"),
          colors=brewer.pal(8, "Dark2"))

After changing the font, inverting the color and messing around in Photoshop:





Ayush Subedi

Coffee Connoisseur