Skip to main content

TwitterMining

This is an austere implementation of mining tweets using R. I will work on making it better sometime in the future (once I engulf R data structures more).

Connecting to Twitter API

The first step was to get consumer key and consumer secret from Twitter. This is used for authentication purposes. To get these, I just created a new app on Twitter.
Apparently, on Windows system the authentication requires an extra step. Thanks to this blog post for clarifying that. See below.

library("ROAuth")
library("twitteR")

#necessary step for Windows
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

#to get your consumerKey and consumerSecret see the twitteR documentation for instructions
credentials <- OAuthFactory$new
                        (consumerKey='#################...',
                         consumerSecret='#################...',
                         requestURL='https://api.twitter.com/oauth/request_token',
                         accessURL='https://api.twitter.com/oauth/access_token',
                         authURL='https://api.twitter.com/oauth/authorize')

credentials$handshake(cainfo="cacert.pem")
save(credentials, file="twitter authentication.Rdata")
registerTwitterOAuth(credentials)

After following the link provided, allowing access to the Twitter app and copying the security key produced back in R, everything was all set. To check:

> registerTwitterOAuth(credentials)
[1] TRUE

Tweet Cloud

library(tm)
library(wordcloud)

tag<- searchTwitter("#AFC", n=100, cainfo="cacert.pem")
df <- do.call("rbind", lapply(tag, as.data.frame))
twitterCorpus <- Corpus(VectorSource(df$text))
twitterCorpus <- tm_map(twitterCorpus, tolower)
# remove punctuation
twitterCorpus <- tm_map(twitterCorpus, removePunctuation)
# remove numbers
twitterCorpus <- tm_map(twitterCorpus, removeNumbers)
tdm <- TermDocumentMatrix(twitterCorpus, control = list(minWordLength = 1))
m <- as.matrix(tdm)
# calculate the frequency of words
v <- sort(rowSums(m), decreasing=TRUE)
words <- names(v)
d <- data.frame(word=words, freq=v)
wordcloud(d$word, d$freq, min.freq=3)

Result

#AFC

Although it generates the cloud, at this point, it is not effective enough for analysis. One way to make it better is to get rid of common English words such as "and", "but", etc. Secondly, getting rid of Unicode and non-English languages would be an improvement also. And, there is MORE that needs to be done.
Hopefully soon.

Soon

I just learnt that the common words mentioned above are referred as stop-words. And, it is pretty easy to handle them. After removing them and adding some colors the clouds look more beautiful. I also got rid of all words with substring "http" (maybe not in the most efficient way.. I will have to look into it).

# First of all make sure the connection to twitter is authenticated.
# see file twiiterconnection.R
# 
# if that has been done previously and this is a new session, open .Rdata
# that has stored credentials
# 
#if registerTwitterOAuth(credentials) is TRUE, you are good to go

library(tm)
library(wordcloud)
library(ROAuth)
library(twitteR)

# check connection
registerTwitterOAuth(credentials)

tag<- searchTwitter("#MH370", n=700, lang= "en", cainfo="cacert.pem")
df <- do.call("rbind", lapply(tag, as.data.frame))

twitterCorpus <- Corpus(VectorSource(df$text))
tdm <- TermDocumentMatrix(
  twitterCorpus, control = list(minWordLength = 1,
                                removePunctuation = TRUE, 
                                removeNumbers = TRUE,
                                stopwords = TRUE,
                                tolower=TRUE))

m <- as.matrix(tdm)

# frequency of words
frequency <- sort(rowSums(m), decreasing=TRUE)
words <- names(frequency)

# get rid of words that contains "http"
# hopefully there is a better way to do it
words[which(grepl("http",names(frequency)))] <- ""

d <- data.frame(word=words, freq=frequency)

#save the image in png format
png("mh370.png", width=12, height=12, units="in", res=300)
wordcloud(d$word, d$freq,scale=c(10,.4),min.freq=10,
          max.words=Inf, random.order=FALSE, rot.per=.3, colors=brewer.pal(8, "Dark2"))
dev.off()


Result

#MH370

#AFC

wordcloud(d$word, d$freq,scale=c(8,.1),min.freq=7,
          max.words=Inf, random.order=FALSE, 
          rot.per=.3, vfont=c("gothic english","plain"),
          colors=brewer.pal(8, "Dark2"))

After changing the font, inverting the color and messing around in Photoshop:





Popular posts from this blog

Zorgania

The Zorganian Republic has some very strange customs. Couples only wish to have female children as only females can inherit the family's wealth, so if they have a male child they keep having more children until they have a girl. If they have a girl, they stop having children. What is the ratio of girls to boys in Zorgania?
The ratio of girls to boys in Zorgania is 1:1. This might be a little counter-intuitive at first. Here are some ways of tackling this problem. 1. Monte Carlo Simulation: Although, Monte Carlo simulation does not necessarily show why the result is 1:1, it is appropriate because of the very counter-intuitive nature of the problem. At the very least, it helps us see that the result is indeed 1:1. Therefore, this is a good start.
The following R code estimates the probability of a child being a boy in Zorgania. 
couples <-100000 boycount <-0for (i in1:couples){ # 0: boywhile (sample(c(0,1),1) ==0) { boycount=boycount+1 } } probability <- boycount/(co…

SalsaNight

Simple Launcher

A simple minimal launcher application for Android devices that shows battery percentage using lzyzsd's CirclProgress library (ArchProgress used in this case) and BroadcastReciever for battery state, Android's clock widgets, a built-in flash light switch and an app list view that can be toggled. Currently, the toggle simply filters all the app that I am working on at present. Future implementation can allow users to select their favorite apps or populate second toggle based on the most used applications.