Connecting to Twitter APIThe first step was to get consumer key and consumer secret from Twitter. This is used for authentication purposes. To get these, I just created a new app on Twitter.
Apparently, on Windows system the authentication requires an extra step. Thanks to this blog post for clarifying that. See below.
library("ROAuth") library("twitteR") #necessary step for Windows download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem") #to get your consumerKey and consumerSecret see the twitteR documentation for instructions credentials <- OAuthFactory$new (consumerKey='#################...', consumerSecret='#################...', requestURL='https://api.twitter.com/oauth/request_token', accessURL='https://api.twitter.com/oauth/access_token', authURL='https://api.twitter.com/oauth/authorize') credentials$handshake(cainfo="cacert.pem") save(credentials, file="twitter authentication.Rdata") registerTwitterOAuth(credentials)
After following the link provided, allowing access to the Twitter app and copying the security key produced back in R, everything was all set. To check:
> registerTwitterOAuth(credentials)  TRUE
library(tm) library(wordcloud) tag<- searchTwitter("#AFC", n=100, cainfo="cacert.pem") df <- do.call("rbind", lapply(tag, as.data.frame)) twitterCorpus <- Corpus(VectorSource(df$text)) twitterCorpus <- tm_map(twitterCorpus, tolower) # remove punctuation twitterCorpus <- tm_map(twitterCorpus, removePunctuation) # remove numbers twitterCorpus <- tm_map(twitterCorpus, removeNumbers) tdm <- TermDocumentMatrix(twitterCorpus, control = list(minWordLength = 1)) m <- as.matrix(tdm) # calculate the frequency of words v <- sort(rowSums(m), decreasing=TRUE) words <- names(v) d <- data.frame(word=words, freq=v) wordcloud(d$word, d$freq, min.freq=3)
Although it generates the cloud, at this point, it is not effective enough for analysis. One way to make it better is to get rid of common English words such as "and", "but", etc. Secondly, getting rid of Unicode and non-English languages would be an improvement also. And, there is MORE that needs to be done.
I just learnt that the common words mentioned above are referred as stop-words. And, it is pretty easy to handle them. After removing them and adding some colors the clouds look more beautiful. I also got rid of all words with substring "http" (maybe not in the most efficient way.. I will have to look into it).
# First of all make sure the connection to twitter is authenticated. # see file twiiterconnection.R # # if that has been done previously and this is a new session, open .Rdata # that has stored credentials # #if registerTwitterOAuth(credentials) is TRUE, you are good to go library(tm) library(wordcloud) library(ROAuth) library(twitteR) # check connection registerTwitterOAuth(credentials) tag<- searchTwitter("#MH370", n=700, lang= "en", cainfo="cacert.pem") df <- do.call("rbind", lapply(tag, as.data.frame)) twitterCorpus <- Corpus(VectorSource(df$text)) tdm <- TermDocumentMatrix( twitterCorpus, control = list(minWordLength = 1, removePunctuation = TRUE, removeNumbers = TRUE, stopwords = TRUE, tolower=TRUE)) m <- as.matrix(tdm) # frequency of words frequency <- sort(rowSums(m), decreasing=TRUE) words <- names(frequency) # get rid of words that contains "http" # hopefully there is a better way to do it words[which(grepl("http",names(frequency)))] <- "" d <- data.frame(word=words, freq=frequency) #save the image in png format png("mh370.png", width=12, height=12, units="in", res=300) wordcloud(d$word, d$freq,scale=c(10,.4),min.freq=10, max.words=Inf, random.order=FALSE, rot.per=.3, colors=brewer.pal(8, "Dark2")) dev.off()
wordcloud(d$word, d$freq,scale=c(8,.1),min.freq=7, max.words=Inf, random.order=FALSE, rot.per=.3, vfont=c("gothic english","plain"), colors=brewer.pal(8, "Dark2"))
After changing the font, inverting the color and messing around in Photoshop: