Sentiment Analysis on Donald Trump using R and Tableau

By Fisseha Berhane


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Recently, the presidential candidate Donald Trump has become controversial. In particular, he has faced strong criticism for his provocative call to temporarily bar Muslims from entering the US.
One of the many uses of social media analytics is sentiment analysis, where we evaluate whether posts on a specific issue are positive or negative.
We can integrate R and Tableau for text data mining in social media analytics, machine learning, predictive modeling, etc., by taking advantage of the numerous R packages and compelling Tableau visualizations.

In this post, let’s mine tweets and analyze their sentiment using R, and use Tableau to visualize the results. We will see the spatial and temporal distribution of the tweets, the cities and states with the largest numbers of tweets, and we will also map the sentiment of the tweets. This will help us see in which areas his comments are received positively and where they are perceived negatively.

Load required packages:

library(twitteR)
library(ROAuth)
require(RCurl)
library(stringr)
library(tm)
library(ggmap)
library(dplyr)
library(plyr)
library(wordcloud)

Get Twitter authentication

All the information below is obtained from a Twitter developer account. We will set the working directory to save our authentication.

key="hidden"
secret="hidden"
setwd("/text_mining_and_web_scraping")

download.file(url="http://curl.haxx.se/ca/cacert.pem",
              destfile="/text_mining_and_web_scraping/cacert.pem",
              method="auto")
authenticate <- OAuthFactory$new(consumerKey=key,
                                 consumerSecret=secret,
                                 requestURL="https://api.twitter.com/oauth/request_token",
                                 accessURL="https://api.twitter.com/oauth/access_token",
                                 authURL="https://api.twitter.com/oauth/authorize")
setup_twitter_oauth(key, secret)
save(authenticate, file="twitter authentication.Rdata")

Get sample tweets from various cities

Let’s scrape the most recent tweets from various cities across the US, requesting 2000 tweets per city. We will need the latitude and longitude of each city.

N=2000  # tweets to request from each query
S=200  # radius in miles
lats=c(38.9,40.7,37.8,39,37.4,28,30,42.4,48,36,32.3,33.5,34.7,33.8,37.2,41.2,46.8,
       46.6,37.2,43,42.7,40.8,36.2,38.6,35.8,40.3,43.6,40.8,44.9,44.9)

lons=c(-77,-74,-122,-105.5,-122,-82.5,-98,-71,-122,-115,-86.3,-112,-92.3,-84.4,-93.3,
       -104.8,-100.8,-112, -93.3,-89,-84.5,-111.8,-86.8,-92.2,-78.6,-76.8,-116.2,-98.7,-123,-93)

#cities=DC,New York,San Francisco,Colorado,Mountain View,Tampa,Austin,Boston,
#       Seattle,Vegas,Montgomery,Phoenix,Little Rock,Atlanta,Springfield,
#       Cheyenne,Bismarck,Helena,Springfield,Madison,Lansing,Salt Lake City,Nashville,
#       Jefferson City,Raleigh,Harrisburg,Boise,Lincoln,Salem,St. Paul

donald=do.call(rbind,lapply(1:length(lats), function(i) searchTwitter('Donald+Trump',
              lang="en",n=N,resultType="recent",
              geocode=paste(lats[i],lons[i],paste0(S,"mi"),sep=","))))

Let’s get the latitude and longitude of each tweet, the tweet text itself, how many times it was retweeted and favorited, and the date and time it was tweeted.

donaldlat=sapply(donald, function(x) as.numeric(x$getLatitude()))
donaldlat=sapply(donaldlat, function(z) ifelse(length(z)==0,NA,z))  

donaldlon=sapply(donald, function(x) as.numeric(x$getLongitude()))
donaldlon=sapply(donaldlon, function(z) ifelse(length(z)==0,NA,z))  

donalddate=lapply(donald, function(x) x$getCreated())
donalddate=sapply(donalddate,function(x) strftime(x, format="%Y-%m-%d %H:%M:%S",tz = "UTC"))

donaldtext=sapply(donald, function(x) x$getText())
donaldtext=unlist(donaldtext)

isretweet=sapply(donald, function(x) x$getIsRetweet())
retweeted=sapply(donald, function(x) x$getRetweeted())
retweetcount=sapply(donald, function(x) x$getRetweetCount())

favoritecount=sapply(donald, function(x) x$getFavoriteCount())
favorited=sapply(donald, function(x) x$getFavorited())

data=as.data.frame(cbind(tweet=donaldtext, date=donalddate, lat=donaldlat, lon=donaldlon,
                         isretweet=isretweet, retweeted=retweeted, retweetcount=retweetcount,
                         favoritecount=favoritecount, favorited=favorited))

First, let’s create a word cloud of the tweets. A word cloud helps us visualize the most common words in the tweets and get a general feel for their content.

# Create corpus
corpus=Corpus(VectorSource(data$tweet))

# Convert to lower-case
corpus=tm_map(corpus,tolower)

# Remove stopwords
corpus=tm_map(corpus,function(x) removeWords(x,stopwords()))

# convert corpus to a Plain Text Document
corpus=tm_map(corpus,PlainTextDocument)

col=brewer.pal(6,"Dark2")
wordcloud(corpus, min.freq=25, scale=c(5,2), rot.per=0.25,
          random.color=T, max.words=45, random.order=F, colors=col)

Here is the word cloud:

We see from the word cloud that among the most frequent words in the tweets are ‘muslim’, ‘muslims’, ‘ban’. This suggests that most tweets were on Trump’s recent idea of temporarily banning Muslims from entering the US.

The dashboard below shows a time series of the number of tweets scraped. We can switch the time unit between hour and day, and the dashboard will update based on the selected unit. The pattern of tweet counts over time helps us drill in and see how particular activities or campaign events are being perceived.
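Before moving to Tableau, we can also tabulate the tweets per hour directly in R. The snippet below is my own quick sketch (not part of the original workflow); it assumes the data frame built above, whose date column may be stored as character after the cbind() call.

library(dplyr)  # already loaded above
data %>%
  mutate(hour = format(as.POSIXct(as.character(date), tz = "UTC"), "%Y-%m-%d %H:00")) %>%
  count(hour) %>%          # number of tweets in each hour
  arrange(hour)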

Here is the screenshot. (View it live in this link)

Getting address of tweets

Since some tweets do not have lat/lon values, we will remove them because we want geographic information to show the tweets and their attributes by state, city and zip code.

data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)

Let’s get the full address of each tweet location using the Google Maps API. The ggmap package enables us to get the street address, city, zip code and state of each tweet from its longitude and latitude. Since the Google Maps API does not allow more than 2,500 queries per day, I used a couple of machines to reverse geocode the latitude/longitude pairs into full addresses. Even so, I was not able to reverse geocode all of the tweets I scraped, so the following visualizations show only the portion of the scraped tweets that I was able to reverse geocode.

result <- do.call(rbind, lapply(1:nrow(lonlat),
                     function(i) revgeocode(as.numeric(lonlat[i,1:2]))))

If we see some of the values of result, we see that it contains the full address of the locations where the tweets were posted.

result[1:5,]
     [,1]                                              
[1,] "1778 Woodglo Dr, Asheboro, NC 27205, USA"        
[2,] "1550 Missouri Valley Rd, Riverton, WY 82501, USA"
[3,] "118 S Main St, Ann Arbor, MI 48104, USA"         
[4,] "322 W 101st St, New York, NY 10025, USA"         
[5,] "322 W 101st St, New York, NY 10025, USA"

So, we will apply some regular expressions and string manipulation to separate the city, zip code and state into different columns.

data2=lapply(result,  function(x) unlist(strsplit(x,",")))
address=sapply(data2,function(x) paste(x[1:3],collapse=''))
city=sapply(data2,function(x) x[2])
stzip=sapply(data2,function(x) x[3])
zipcode = as.numeric(str_extract(stzip,"[0-9]{5}"))   
state=str_extract(stzip,"[[:alpha:]]{2}")   # two-letter state abbreviation
data2=as.data.frame(list(address=address,city=city,zipcode=zipcode,state=state))

Concatenate data2 to data:

data=cbind(data,data2)

Some text cleaning:

tweet=data$tweet
tweet_list=lapply(tweet, function(x) iconv(x, "latin1", "ASCII", sub=""))   # drop non-ASCII characters
tweet_list=lapply(tweet_list, function(x) gsub("htt.*",' ',x))              # drop URLs
tweet=unlist(tweet_list)
data$tweet=tweet

We will use lexicon-based sentiment analysis. A list of positive and negative opinion words (sentiment words) for English was downloaded from here.

positives= readLines("positivewords.txt")
negatives = readLines("negativewords.txt")

First, let’s have a wrapper function that calculates sentiment scores.

sentiment_scores = function(tweets, positive_words, negative_words, .progress='none'){
  scores = laply(tweets,
                 function(tweet, positive_words, negative_words){
                   tweet = gsub("[[:punct:]]", "", tweet)   # remove punctuation
                   tweet = gsub("[[:cntrl:]]", "", tweet)   # remove control characters
                   tweet = gsub('\\d+', '', tweet)          # remove digits

                   # Error-handling wrapper around tolower()
                   tryTolower = function(x){
                     # create missing value
                     y = NA
                     # tryCatch error
                     try_error = tryCatch(tolower(x), error=function(e) e)
                     # if not an error
                     if (!inherits(try_error, "error"))
                       y = tolower(x)
                     # result
                     return(y)
                   }
                   # use tryTolower with sapply
                   tweet = sapply(tweet, tryTolower)
                   # split sentence into words with str_split function from stringr package
                   word_list = str_split(tweet, "\\s+")
                   words = unlist(word_list)
                   # compare words to the dictionaries of positive & negative terms
                   positive.matches = match(words, positive_words)
                   negative.matches = match(words, negative_words)
                   # match() returns the position of the matched term or NA;
                   # we just want a TRUE/FALSE
                   positive_matches = !is.na(positive.matches)
                   negative_matches = !is.na(negative.matches)
                   # final score
                   score = sum(positive_matches) - sum(negative_matches)
                   return(score)
                 }, positive_words, negative_words, .progress=.progress )
  return(scores)
}

score = sentiment_scores(tweet, positives, negatives, .progress='text')
data$score=score

Let’s plot a histogram of the sentiment score:

hist(score, xlab=" ", main="Sentiment of sample tweets\nthat have Donald Trump in them",
     border="black", col="skyblue")

Here is the plot:

We see from the histogram that the sentiment is slightly positive. Using Tableau, we will see the spatial distribution of the sentiment scores.

Save the data as a csv file and import it into Tableau
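In R that is a one-liner (a minimal sketch; the file name is just an example):

# Write the tweets, addresses and sentiment scores to a csv file for Tableau
write.csv(data, file = "trump_tweets_sentiment.csv", row.names = FALSE)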

The map below shows the tweets that I was able to reverse geocode. The size of each circle is proportional to the number of favorites the tweet got. In the interactive map, we can hover over each circle and read the tweet, the address it was tweeted from, and the date and time it was posted.

Here is the screenshot (View it live in this link)

Similarly, the dashboard below shows the tweets and the size is proportional to the number of times each tweet was retweeted.
Here is the screenshot (View it live in this link)

In the following three visualizations, top zip codes, cities and states by the number of tweets are shown. In the interactive map, we can change the number of zip codes, cities and states to display by using the scrollbars shown in each viz. These visualizations help us to see the distribution of the tweets by zip code, city and state.

By zip code
Here is the screenshot (View it live in this link)

By city
Here is the screenshot (View it live in this link)

By state
Here is the screenshot (View it live in this link)

Sentiment of tweets

Sentiment analysis has myriad uses. For example, a company may investigate what customers like most about its product and which issues they are not satisfied with. When a company releases a new product, has it been perceived positively or negatively? How does the sentiment of customers vary across space and time? In this post, we are evaluating the sentiment of the tweets we scraped about Donald Trump.

The viz below shows the sentiment score of the reverse geocoded tweets by state. We see that the tweets have the highest positive sentiment in NY, NC and TX.
Here is the screenshot (View it live in this link)

Summary

In this post, we saw how to integrate R and Tableau for text mining, sentiment analysis and visualization. Using these tools together enables us to answer detailed questions.

We used a sample of the most recent tweets that contain Donald Trump. Since I was not able to reverse geocode all of the tweets I scraped, because of the query limit imposed by the Google Maps API, we used only about 6000 tweets. The average sentiment is slightly above zero, and some states show strong positive sentiment. However, statistically speaking, mining a sample of ample size is important for drawing robust conclusions.

The accuracy of our sentiment analysis depends on how fully the words in the tweets are covered by the lexicon. Moreover, since tweets may contain slang, jargon and colloquial words that may not be included in the lexicon, sentiment analysis needs careful evaluation.

This is enough for today. I hope you enjoyed it! If you have any questions or feedback, feel free to leave a comment.


Explorable, multi-tabbed reports in R and Shiny

By Thomas Hopper

(This article was first published on Tom Hopper » RStats, and kindly contributed to R-bloggers)

Matt Parker recently showed us how to create multi-tab reports with R and jQuery UI. His example was absurdly easy to reproduce; it was a great blog post.

I have been teaching myself Shiny in fits and starts, and I decided to attempt to reproduce Matt’s jQuery UI example in Shiny. You can play with the app on shinyapps.io, and the complete project is up on Github. The rest of this post walks through how I built the Shiny app.

It’s a demo

The result demonstrates a few Shiny and ggplot2 techniques that will be useful in other projects, including:

  • Creating tabbed reports in Shiny, with different interactive controls or widgets associated with each tab;
  • Combining different ggplot2 scale changes in a single legend;
  • Sorting a data frame so that categorical labels in a legend are ordered to match the position of numerical data on a plot;
  • Borrowing from Matt’s work,
    • Summarizing and plotting data using dplyr and ggplot2;
    • Limiting display of categories in a graph legend to the top n (selectable by the user), with remaining values listed as “other;”
    • Coloring only the top n categories on a graph, and making all other categories gray;
  • Changing line weight for the top n categories on a graph, and making all other lines thinner.

Obtaining the data

As with Matt’s original report, the data can be downloaded from the CDC WONDER database by selecting “Data Request” under “current cases.”

To get the same data that I’ve used, group results by “state” and by “year,” check “incidence rate per 100,000” and, near the bottom, “export results.” Uncheck “show totals,” then submit the request. This will download a .txt tab-delimited data file, which in this app I read in using read_tsv() from the readr package.

Looking at Matt’s example, his “top 5” states look suspiciously like the most populous states. He’s used total count of cases, which will be biased toward more populous states and doesn’t tell us anything interesting. When examining occurrences—whether disease, crime or defects—we have to look at the rates rather than total counts; we can only make meaningful comparisons and make useful decisions from an examination of the rates.

Setup

As always, we need to load our libraries into R. For this example, I use readr, dplyr, ggplot2 and RColorBrewer.
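In code, that amounts to something like the following (a minimal sketch; in the actual app these calls would sit at the top of server.R or a shared global.R):

library(readr)         # read_tsv() for the CDC WONDER export
library(dplyr)         # grouping and summarising
library(ggplot2)       # plotting
library(RColorBrewer)  # brewer.pal() palettes for the top-n lines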

The UI

The app generates three graphs: a national total that calculates national rates from the state values; a combined state graph that highlights the top n states, where the user chooses n; and a graph that displays individual state data, where the user can select the state to view. Each goes on its own tab.

ui.R contains the code to create a tabset panel with three tab panels.

tabsetPanel(
  tabPanel("National", fluidRow(plotOutput("nationPlot"))),
  tabPanel("By State",
           fluidRow(plotOutput("statePlot"),
                    wellPanel(
                      sliderInput(inputId = "nlabels",
                                  label = "Top n States:",
                                  min = 1,
                                  max = 10,
                                  value = 6,
                                  step = 1)
                    )
           )
  ),
  tabPanel("State Lookup",
           fluidRow(plotOutput("iStatePlot"),
                    wellPanel(
                      htmlOutput("selectState"))
           )
  )
)

Each panel contains a fluidRow element to ensure consistent alignment of graphs across tabs, and on tabs where I want both a graph and controls, fluidRow() is used to add the controls below the graph. The controls are placed inside a wellPanel() so that they are visually distinct from the graph.

Because I wanted to populate a selection menu (selectInput()) from the data frame, I created the selection menu in server.R and then displayed it in the third tab panel set using the htmlOutput() function.

The graphs

The first two graphs are very similar to Matt’s example. For the national rates, the only change is the use of rates rather than counts.

df_tb <- read_tsv("../data/OTIS 2013 TB Data.txt", n_max = 1069, col_types = "-ciiii?di")

df_tb %>%
  group_by(Year) %>%
  summarise(n_cases = sum(Count), pop = sum(Population), us_rate = (n_cases / pop * 100000)) %>%
  ggplot(aes(x = Year, y = us_rate)) +
  geom_line() +
  labs(x = "Year Reported",
       y = "TB Cases per 100,000 residents",
       title = "Reported Active Tuberculosis Cases in the U.S.") +
  theme_minimal()

The main trick, here, is the use of dplyr to summarize the data across states. Since we can’t just sum or average rates to get the combined rate, we have to sum all of the state counts and populations for each year, and add another column for the calculated national rate.

To create a graph that highlights the top n states, we generate a data frame with one variable, State, that contains the top n states. This is, again, almost a direct copy of Matt’s code with changes to make the graph interactive within Shiny. This code goes inside of the shinyServer() block so that it will update when the user selects a different value for n. Instead of hard-coding n, there’s a Shiny input slider named nlabels. With a list of the top n states ordered by rate of TB cases, df_tb is updated with a new field containing the top n state names and “Other” for all other states.

top_states <- df_tb %>%
      filter(Year == 2013) %>%
      arrange(desc(Rate)) %>%
      slice(1:input$nlabels) %>%
      select(State)

df_tb$top_state <- factor(df_tb$State, levels = c(top_states$State, "Other"))
df_tb$top_state[is.na(df_tb$top_state)] <- "Other"

The plot is generated from the newly-organized data frame. Where Matt’s example has separate legends for line weight (size) and color, I’ve had ggplot2 combine these into a single legend by passing the same value to the “guide =” argument in the scale_XXX_manual() calls. The colors and line sizes also have to be updated dynamically for the selected n.

    df_tb %>%
      ggplot() +
      labs(x = "Year reported",
           y = "TB Cases per 100,000 residents",
           title = "Reported Active Tuberculosis Cases in the U.S.") +
      theme_minimal() +
      geom_line(aes(x = Year, y = Rate, group = State, colour = top_state, size = top_state)) +
      scale_colour_manual(values = c(brewer.pal(n = input$nlabels, "Paired"), "grey"), guide = guide_legend(title = "State")) +
      scale_size_manual(values = c(rep(1,input$nlabels), 0.5), guide = guide_legend(title = "State"))
  })  # closes the renderPlot() call for this tab's plot

})  # closes the shinyServer() block

The last graph is nearly a copy of the national totals graph, except that it is filtered for the state selected in the drop-down menu. The menu is a selectInput() control.

renderUI({
  selectInput(inputId = "state", label = "Which state?",
              choices = unique(df_tb$State), selected = "Alabama", multiple = FALSE)
})

With a state selected, the data is filtered by the selected state and TB rates are plotted.

df_tb %>%
  filter(State == input$state) %>%
  ggplot() +
  labs(x = "Year reported",
       y = "TB Cases per 100,000 residents",
       title = "Reported Active Tuberculosis Cases in the U.S.") +
  theme_minimal() +
  geom_line(aes(x = Year, y = Rate))

Wrap up

I want to thank Matt Parker for his original example. It was well-written, clear and easy to reproduce.


The Power of Decision Stumps

By statcompute

(This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers)

A decision stump is a weak classification model with a simple tree structure consisting of a single split, so it can also be considered a one-level decision tree. Due to its simplicity, the stump often demonstrates low predictive performance. As shown in the example below, the AUC of a stump is even lower than that of a single attribute in a separate testing dataset.

pkgs <- c('pROC', 'RWeka')
lapply(pkgs, require, character.only = T)
df1 <- read.csv("credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, ]
set.seed(2016)
n <- nrow(df2)
sample <- sample(seq(n), size = n / 2, replace = FALSE)
train <- df2[sample, ]
test <- df2[-sample, ]
x <- paste("AGE + ACADMOS + ADEPCNT + MAJORDRG + MINORDRG + OWNRENT + INCOME + SELFEMPL + INCPER + EXP_INC")
fml <- as.formula(paste("as.factor(DEFAULT) ~ ", x))

### IDENTIFY THE MOST PREDICTIVE ATTRIBUTE ###
imp <- InfoGainAttributeEval(fml, data = train)
imp_x <- test[, names(imp[imp == max(imp)])]
roc(as.factor(test$DEFAULT), imp_x)
# Area under the curve: 0.6243

### CONSTRUCT A WEAK CLASSIFIER OF DECISION STUMP ###
stump <- DecisionStump(fml, data = train)
print(stump)
roc(as.factor(test$DEFAULT), predict(stump, newdata = test, type = "probability")[, 2])
# Area under the curve: 0.5953

Albeit weak by itself, the decision stump can be used as a base model in many machine learning ensemble methods, such as bagging and boosting. For instance, the bagging classifier with 1,000 stumps combined outperforms the single stump by ~7% in terms of AUC (0.6346 vs. 0.5953). Moreover, AdaBoost with stumps can further improve the predictive performance over the single stump by ~11% (0.6585 vs. 0.5953) and also over the logistic regression benchmark by ~2% (0.6585 vs. 0.6473).

### BUILD A BAGGING CLASSIFIER WITH 1,000 STUMPS IN PARALLEL ###
bagging <- Bagging(fml, data = train, control = Weka_control("num-slots" = 0, I = 1000, W = "DecisionStump", S = 2016, P = 50))
roc(as.factor(test$DEFAULT), predict(bagging, newdata = test, type = "probability")[, 2])
# Area under the curve: 0.6346

### BUILD A BOOSTING CLASSIFIER WITH STUMPS ###
boosting <- AdaBoostM1(fml, data = train, control = Weka_control(I = 100, W = "DecisionStump", S = 2016))
roc(as.factor(test$DEFAULT), predict(boosting, newdata = test, type = "probability")[, 2])
# Area under the curve: 0.6585
 
### DEVELOP A LOGIT MODEL FOR THE BENCHMARK ###
logit <- Logistic(fml, data = train)
roc(as.factor(test$DEFAULT), predict(logit, newdata = test, type = "probability")[, 2])
# Area under the curve: 0.6473


Some programming language theory in R

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”).

Our reasoning is: if you want to work as a modern data scientist you have to program (this is not optional for reasons of documentation, sharing and scientific repeatability). If you do program you are going to have to eventually think a bit about programming theory (hopefully not too early in your studies, but it will happen). Let’s use R’s powerful programming language (and implementation) to dive into some deep issues in programming language theory:

  • References versus values
  • Function abstraction
  • Equational reasoning
  • Recursion
  • Substitution and evaluation
  • Fixed point theory

To do this we will translate some common ideas from a theory called “the lambda calculus” into R (where we can actually execute them). This translation largely involves changing the word “lambda” to “function” and introducing some parenthesis (which I think greatly improve readability, part of the mystery of the lambda calculus is how unreadable its preferred notation actually is).


Recursive Opus (on a Hyperbolic disk)

Lots of ink is spilled on the idea of “functional languages being based on the lambda calculus.” This misses the point of the lambda calculus. Most functional programming languages deliberately include powerful features not found in the lambda calculus (such as being able to efficiently name and re-use variables and values). The interesting fact about the lambda calculus is that a re-writing system as seemingly weak as the one presented can already simulate the features that are directly implemented in typical functional languages.

Typical functional languages have more features than the lambda calculus, so in principle may be harder to reason about than the lambda calculus. In practice the lambda calculus is itself fairly hard to reason about due to fundamental issues of recursion, lack of helpful type annotations, and illegibility due to overuse of associative notation (the shunning of helpful parenthesis).

But let’s play with this in R a bit as it actually does help our “functional thinking” to see how equational reasoning, function creation, function application, and recursion are related.

The usual first example of a recursive function is the factorial function. Factorial is defined over the non-negative integers as:

  • factorial(0) = 1
  • factorial(x) = x*factorial(x-1)

It is usually written as x! (so x! is shorthand for factorial(x)). Now we are not really interested in the factorial() function as:

  • It is already implemented in R as factorial() (see help(‘factorial’)).
  • Any recursive implementation of it is going to be very slow compared to a more standard special function (https://en.wikipedia.org/wiki/Special_functions) implementation.
  • You can’t even compute factorial(200) over double precision floating point (see the quick check below), so a small look-up table could do all the work.
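A quick check (mine, not from the original post):

factorial(200)  # exceeds the largest representable double, so R returns Inf (with a warning)
## [1] Inf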

But it is a somewhat familiar function that is a one of the simplest examples of recursion. Here is a recursive implementation written in R:

# typical example recursive function
# function uses the fact it knows its own name to call itself
fact <- function(x) {
  if(x<=0) {
       return(1)
    } else {
       return(x*fact(x-1))
    }
}
fact(5)
## [1] 120

Now suppose your programming language was weaker; maybe functions don’t have names (as in the lambda calculus), or you don’t trust calling yourself directly (like subroutines in early FORTRAN). You could try the following: build a function returning a function (actually a very powerful ability). Notice the new function may call its supplied argument (as a function) at most once.

factstep <- function(h) {
  function(x) {
    if(x<=0) {
       return(1)
    } else {
       return(x*h(x-1))
    }
  }
}

factstep(NULL)(0)
## [1] 1

Compose this beast with itself a few times. We are using the fact that each returned function is a new anonymous function (and thus has separate state) to make sure calls don’t interfere. The idea is: this would work even in FORTRAN if FORTRAN allowed us to create anonymous subroutines at run time. What we are showing is that the bookkeeping involved in function definition is already strong enough to support limited recursion (we don’t need a separate facility for that, though R does have one).

f5 <- factstep(factstep(factstep(factstep(factstep(factstep(NULL))))))
f5(3)
## [1] 6
f5(4)
## [1] 24

But f5 isn’t the best implementation of factorial. It doesn’t work for long.

tryCatch(
  f5(6),
  error=function(e) print(e))
## <simpleError in h(x - 1): could not find function "h">

One way to solve this is to introduce what is called a fixed point function. We use the fact that we know our own function name (i.e. fix can call itself) to define a function called “fix” that calculates fixed points of other functions. By definition fix(f) is f(fix(f)). The idea is that fix(factstep) will be a function that knows how to recurse, or call itself, enough times. The function fix() should obey equations like the following:

  • fix(f) = f(fix(f))

So if x = fix(f) we have (by substitution) x = f(x), i.e. x is a fixed point of f() (f being an arbitrary function). In our case we will write fact1 = fix(factstep) and have fact1 = factstep(fact1), which means we don’t alter fact1 by one more application of factstep. fact1 is a function that seems to have an arbitrary number of applications of factstep rolled into it. So it is natural to suppose that fact1(x) = factorial(x), as both functions seem to obey the same recurrence relations.

This sort of direct algebra over functions still uses the fact that we have explicit names on the functions (which we are also using as variable names). It does require the power to call yourself, but what we are showing is that we can bootstrap one self-recursive function into a source of self-recursive functions.

Here is fix written in R:

fix <- function(f) { f(fix(f)) }

This gives us enough power to implement recursion/iteration for any function. This ability to write a function in terms of an expression or equation involving itself by name is considered ground breaking. Alan Kay called such an expression in the LISP 1.5 Programmer’s Manual “Maxwell’s Equations of Software!” (see here for some details). It is somewhat mind-blowing that both

  • You can write fix in terms of itself
  • The language implementation then essentially solves the above recursive equation (by mere substitution), and you can use this to build recursive functions at will.

For example:

fact1 <- fix(factstep)
fact1(5)
## [1] 120
fact1(3)
## [1] 6
(factstep(fact1))(3)  # same answer
## [1] 6
fact1(10)
## [1] 3628800

Fixed points are a bit less mysterious when applied to simpler functions, such as constants (which are their own fixed points as they ignore their argument).

# We can also apply this to a simple function like the following
g <- function(x) { 9 }

# applying the fixing operator is the same as evaluating such a simple function
fix(g)
## [1] 9
g(fix(g))  # same answer
## [1] 9

Note: fix(g) is not a fixed point of fix(). We do not have fix(g) = fix(fix(g)), and fix() cannot find its own fixed point: fix(fix) fails to terminate even under R’s lazy argument semantics. Also, this fix() doesn’t work in languages that don’t have lazy argument semantics (which is most languages other than R and Haskell).

# Bonus question: what is fix(fix)?
tryCatch(
  fix(fix),
  error=function(e) print(e))
## <simpleError: evaluation nested too deeply: infinite recursion / options(expressions=)?>

Why did that blow up even under R’s lazy evaluation semantics? Because fix() always tries to use its argument, which causes the argument to be instantiated and in turn triggers unbounded recursion. Notice factstep() doesn’t always use its argument, so it doesn’t necessarily recurse forever.

The above method depended on R’s lazy evaluation semantics. What if you are working with a language that has the more standard eager evaluation semantics (arguments are evaluated before calling a function, meaning they are evaluated even if they don’t end up getting used)? You can implement a fixed point operator by explicitly hiding your argument inside a function construction (abstraction) as follows:

# Fixed point function, two argument version (works under most semantics, including Python)
fix2 <- function(f) { function(x) f(fix2(f))(x) }

fact2 <- fix2(factstep)
fact2(5)
## [1] 120
fact2(3)
## [1] 6

The idea being: with function introduction you can implement your own lazy evaluation scheme.
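As a tiny illustration of that point (my sketch, not from the original post): wrapping a computation in a zero-argument function, a “thunk”, delays the work until the function is actually called, which is exactly what fix2() does to its recursive argument.

slow_value <- function() { Sys.sleep(1); 9 }   # stands in for an expensive computation
thunk <- function() slow_value()               # nothing is evaluated yet
thunk()                                        # the work happens only here
## [1] 9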

The main feature we have used up until now is the fact that we know the name of the function we are working with (and can thus conveniently evaluate it more than once and pass it around). A natural question is: if you didn’t have this naming convenience (as you do not in the pure lambda calculus, which works only over values without variables), can you still implement recursion on your own? The answer is yes, and one of the methods is an expression called “the Y combinator.”

Here is the Y combinator version of implementing recursion (a version that does not use its own name). Again, you don’t have to do this if you have any of the following features already available:

  • Variable names
  • Language supplied recursion
  • A fixed point operator

The Y combinator implements a fixed point function using only function creation (eta-abstraction) and application (beta-reduction), and NOT using its own name (in this case Y). The trick is that we can use any value passed in twice. So even though we may not have variables and names, we can give some values access to related values (enough to implement fixed point calculations and recursion ourselves).

Here is the Y combinator written in R:

# Y combinator written in R
Y <-
  function(f) {
    (function(z) {
      f(function(v) {
        z(z)(v)
      })
    })(function(z) {
      f(function(v) {
        z(z)(v)
      })
    })
  }

Y has the property that for any function g(): Y(g) = g(Y(g)), exactly the equation fix() obeyed. This is often written with minimal parenthesis as Y g = g (Y g). Either way, Y is a solution to an equation over functions. What we wrote above isn’t the equation Y is the solution of, but an explicit solution for Y that does not involve Y referring to itself. Notice it involves function introduction. The traditional way of writing the above function in lambda calculus notation is:

  • Y = lambda f . ( lambda x. f(x x)) ( lambda x.f(x x) )

Understand that the earlier R code is the same using R’s function introduction notation (“function” instead of “lambda”) and explicit introduction of more parenthesis to show the preferred association of operations.

Let’s prove that what we have is a solution to the claimed recurrence equation. We are working over the lambda calculus, which is a theory of transformations of values. The main inference step is function application, or beta-reduction:

  • function(x) { … } xvalue -> {…} with x replaced by xvalue throughout

The other step is eta-conversion which says the following two forms should be considered equivalent:

  • function(x) g(x)  ≡  g

Parenthesis and variable names are added and removed for convenience. Function application can be written as f(x) or as f x. (Notation associates to the left so z(z)(v) is equivalent to (z(z))(v).)

Proof:

We are going to show one pattern of substitution that converts Y(g) to g(Y(g)).

The rules of the lambda calculus are that if one form can be converted to another then they are considered to be equivalent, and so we are done.

It is a non-trivial fact that the lambda-calculus is consistent. This is the Church-Rosser theorem and says if P -> Q (we can transform P to Q) and P -> R (we can transform P to R) then there is a form S such that Q->S and R->S. The idea is convertible statements form equivalence classes, so it makes sense to call them equivalent. So if we can find a conversion in any order of operations we are done (i.e. we are free to try operations in the order we want).

For clarity let’s introduce a substitution

T = function(f) ( function(z) { f(function(v) { z(z)(v) }) })
 so Y = function(f) { T(f)(T(f)) }
 Y(g) = (function(f) { T(f)(T(f)) })(g)                    # by definition
     = T(g)(T(g))                                          # function application, call this equation 2
     = ( function(z) { g(function(v) { z(z)(v) }) })(T(g)) # expanding left T(g)
     = g(function(v) { T(g)(T(g))(v) })                    # function application
     = g(T(g)(T(g)))                                       # eta conversion on inside g() argument
     = g(Y(g))                                             # substitute in equation 2

Let’s try it:

fact3 <- Y(factstep)
fact3(5)
## [1] 120
fact3(3)
## [1] 6
# Bonus question: what is Y(Y)?
tryCatch(
  Y(Y),
  error=function(e) print(e))
## <simpleError: evaluation nested too deeply: infinite recursion / options(expressions=)?>

Remember, in R we don’t actually need the Y combinator, as we have direct access to a number of additional powerful constructs deliberately not found in the lambda calculus (where the Y combinator is best known):

  • Named variables (allowing re-use of values)
  • Explicit recursion (allowing direct definition of recursive functions)
  • Lazy evaluation (allowing recursion through mere substitution and evaluation)

R can also implement the Y combinator because we have all of the essential lambda calculus tools directly available in R:

  • Evaluation (reductions)
  • Function introduction (abstraction)

And this concludes our excursion into programming language theory.

This complete article (including all code) as an R knitr worksheet can be found here.


Simpson’s Paradox

By Cory Lesmeister

(This article was first published on Fear and Loathing in Data Science, and kindly contributed to R-bloggers)
“Statistics are used much like a drunk uses a lamppost: for support, not illumination.”

A.E. Housman (commonly attributed to Andrew Lang)

Recently at work I stumbled into a classic example of Simpson’s Paradox (the Yule-Simpson effect). Many people are not familiar with this phenomenon, so I thought I would provide a brief example to help shed some light on it. If you are not aware of the problem of confounding or “lurking” variables in your analysis, you can box yourself into an analytical corner. To further understand what is happening in a Simpson’s Paradox, check out the following links:




In short, what is happening is that a trend or association that appears between two variables reverses when a third variable is included. You will stumble into, or at least need to be on the lookout for, this effect of spurious correlation when you have unbalanced group sizes, as you often do with observational data. This is what happened in my recent discovery.

In the example below, let’s have a look at the UC Berkeley data, or at least the portion of it by Department, as provided in a Wikipedia article. https://en.wikipedia.org/wiki/Simpson%27s_paradox#cite_note-Bickel-11


The data we explore contains the number of applications and admissions by gender to six different graduate departments. Again, this is just a portion of the data, focusing on the largest departments.

dpt = c("A","B","C","D","E","F","A","B","C","D","E","F")
app = c(825,560,325,417,191,272,108,25,593,375,393,341)
adm = c(512,353,120,138,53,16,89,17,202,131,94,24)
gen = c("m","m","m","m","m","m","f","f","f","f","f","f")
df = cbind.data.frame(dpt,app,adm,gen)
str(df)

## 'data.frame': 12 obs. of 4 variables:
## $ dpt: Factor w/ 6 levels "A","B","C","D",..: 1 2 3 4 5 6 1 2 3 4 ...
## $ app: num 825 560 325 417 191 272 108 25 593 375 ...
## $ adm: num 512 353 120 138 53 16 89 17 202 131 ...
## $ gen: Factor w/ 2 levels "f","m": 2 2 2 2 2 2 1 1 1 1 ...


There are a number of ways to demonstrate the effect, but I thought I would give it a go using dplyr. First, let’s group the data by gender (gen) then look at their overall admission rate.

library(dplyr)
by_gen = group_by(df, gen)
summarize(by_gen, adm_rate=sum(adm)/sum(app))

## Source: local data frame [2 x 2]
##
## gen adm_rate
## 1 f 0.3035422
## 2 m 0.4602317


Clearly there is a huge gender bias problem in graduate school admissions at UC Berkeley and it is time to round up a herd of lawyers. On a side note, what is the best way to save a drowning lawyer? It’s easy, just take your foot off their head.


We can even provide a statistical test to strengthen the assertion of bias. In R, a proportions test can be done via the prop.test() function, inputting the number of “hits” and the number of “trials”.

summarize(by_gen, sum(adm))

## Source: local data frame [2 x 2]
##
## gen sum(adm)
## 1 f 557
## 2 m 1192


summarize(by_gen, sum(app))

## Source: local data frame [2 x 2]
##
## gen sum(app)
## 1 f 1835
## 2 m 2590


prop.test(x=c(557,1192), n=c(1835,2590))

##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(557, 1192) out of c(1835, 2590)
## X-squared = 109.67, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.1856333 -0.1277456
## sample estimates:
## prop 1 prop 2
## 0.3035422 0.4602317


We see in the output a high level of statistical significance and a confidence interval with a difference of roughly 13 to 19 percent. However, beware of the analyst proclaiming truths using a 2×2 table with observational data! In this instance, the confounding variable is the department (dpt).

In order to have a look at the data by gender and by department, we will once again use the group_by() function, then calculate the admission rates and finally turn it from “long” to “wide” format.


by_dpt = group_by(df, dpt,gen)
df2 = as.data.frame(summarize(by_dpt, adm_rate=sum(adm)/sum(app)))
df2

## dpt gen adm_rate
## 1 A f 0.82407407
## 2 A m 0.62060606
## 3 B f 0.68000000
## 4 B m 0.63035714
## 5 C f 0.34064081
## 6 C m 0.36923077
## 7 D f 0.34933333
## 8 D m 0.33093525
## 9 E f 0.23918575
## 10 E m 0.27748691
## 11 F f 0.07038123
## 12 F m 0.05882353


library(reshape2)
df2_wide = dcast(df2, dpt~gen, value.var="adm_rate")
df2_wide

## dpt f m
## 1 A 0.82407407 0.62060606
## 2 B 0.68000000 0.63035714
## 3 C 0.34064081 0.36923077
## 4 D 0.34933333 0.33093525
## 5 E 0.23918575 0.27748691
## 6 F 0.07038123 0.05882353
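Before spelling out why, it helps to look at where each gender’s applications went. The check below is my own sketch (not part of the original post), reusing the df built above:

# Share of each gender's applications that went to each department
df %>%
  group_by(gen) %>%
  mutate(share_of_apps = round(app / sum(app), 2)) %>%
  arrange(dpt, gen) %>%
  select(dpt, gen, app, share_of_apps) %>%
  as.data.frame()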

With the inclusion of department we now see that females actually had higher admission rates in four of the six departments (A, B, D, F). How can this be? It comes down to the departments’ differing admission rates and the volume of applications each received: women applied mostly to the more selective departments, and again, the groups were highly unbalanced. Alright, enough of this, as it is now time to focus on something important: Ohio State is squaring off against Notre Dame, and the NHL Winter Classic is on!


Google scholar scraping with rvest package

By Fisseha Berhane


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In this post, I will show how to scrape Google Scholar. In particular, we will use the 'rvest' R package to scrape the Google Scholar account of my PhD advisor. We will see his coauthors, how many times they have been cited, and their affiliations. “rvest, inspired by libraries like beautiful soup, makes it easy to scrape (or harvest) data from html web pages,” wrote Hadley Wickham on the RStudio blog. Since it is designed to work with magrittr, we can express complex operations as elegant pipelines composed of simple and easily understood pieces of code.

Load required libraries:

We will use ggplot2 to create plots.

library(rvest)
library(ggplot2)

How many times have his papers been cited

Let’s use SelectorGadget to find out which css selector matches the “cited by” column.

page <- read_html("https://scholar.google.com/citations?user=sTR9SIQAAAAJ&hl=en&oi=ao")

Specify the css selector in html_nodes() and extract the text with html_text(). Finally, change the string to numeric using as.numeric().

citations <- page %>% html_nodes ("#gsc_a_b .gsc_a_c") %>% html_text()%>%as.numeric()

See the number of citations:

citations 
148 96 79 64 57 57 57 55 52 50 48 37 34 33 30 28 26 25 23 22 

Create a barplot of the number of citation:

barplot(citations, main="How many times has each paper been cited?", ylab='Number of citations', col="skyblue", xlab="")

Here is the plot:

Coauthors, their affiliations, and how many times they have been cited

My PhD advisor, Ben Zaitchik, is a really smart scientist. He not only has the skills to build networks and collaborate with other scientists, but also intelligence and patience.
Next, let’s see his coauthors, their affiliations and how many times they have been cited.
Similarly, we will use SelectorGadget to find out which css selector matches the co-authors.

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
Coauthors = page%>% html_nodes(css=".gsc_1usr_name a") %>% html_text()
Coauthors = as.data.frame(Coauthors)
names(Coauthors)='Coauthors'

Now let’s explore Coauthors:

head(Coauthors) 
                  Coauthors
1               Jason Evans
2             Mutlu Ozdogan
3            Rasmus Houborg
4          M. Tugrul Yilmaz
5 Joseph A. Santanello, Jr.
6              Seth Guikema

dim(Coauthors) 
[1] 27  1

As of today, he has published with 27 people.

How many times have his coauthors been cited?

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
citations = page%>% html_nodes(css = ".gsc_1usr_cby")%>%html_text()

citations 
 [1] "Cited by 2231"  "Cited by 1273"  "Cited by 816"   "Cited by 395"   "Cited by 652"   "Cited by 1531" 
 [7] "Cited by 674"   "Cited by 467"   "Cited by 7967"  "Cited by 3968"  "Cited by 2603"  "Cited by 3468" 
[13] "Cited by 3175"  "Cited by 121"   "Cited by 32"    "Cited by 469"   "Cited by 50"    "Cited by 11"   
[19] "Cited by 1187"  "Cited by 1450"  "Cited by 12407" "Cited by 1939"  "Cited by 9"     "Cited by 706"  
[25] "Cited by 336"   "Cited by 186"   "Cited by 192" 

Let’s extract the numeric characters only, using gsub for a global substitution.

citations = gsub('Cited by','', citations)

citations
 [1] " 2231"  " 1273"  " 816"   " 395"   " 652"   " 1531"  " 674"   " 467"   " 7967"  " 3968"  " 2603"  " 3468"  " 3175" 
[14] " 121"   " 32"    " 469"   " 50"    " 11"    " 1187"  " 1450"  " 12407" " 1939"  " 9"     " 706"   " 336"   " 186"  
[27] " 192"  

Change the strings to numeric and then to a data frame to make them easy to use with ggplot2:

citations = as.numeric(citations)
citations = as.data.frame(citations)

Affiliations of coauthors

page <- read_html("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=sTR9SIQAAAAJ")
affilation = page %>% html_nodes(css = ".gsc_1usr_aff")%>%html_text()
affilation = as.data.frame(affilation)
names(affilation)='Affilation'

Now, let’s create a data frame that consists of coauthors, citations and affiliations:

cauthors=cbind(Coauthors, citations, affilation)

cauthors 
                             Coauthors citations                                                                                  Affilation
1                          Jason Evans      2231                                                               University of New South Wales
2                        Mutlu Ozdogan      1273    Assistant Professor of Environmental Science and Forest Ecology, University of Wisconsin
3                       Rasmus Houborg       816                    Research Scientist at King Abdullah University of Science and Technology
4                     M. Tugrul Yilmaz       395 Assistant Professor, Civil Engineering Department, Middle East Technical University, Turkey
5            Joseph A. Santanello, Jr.       652                                                  NASA-GSFC Hydrological Sciences Laboratory
.....

Re-order coauthors based on their citations

Let’s re-order the coauthors based on their citations so that our plot is in decreasing order.

cauthors$Coauthors <- factor(cauthors$Coauthors, levels = cauthors$Coauthors[order(cauthors$citations, decreasing=F)])

ggplot(cauthors, aes(Coauthors, citations)) +
  geom_bar(stat="identity", fill="#ff8c1a", size=5) +
  theme(axis.title.y = element_blank()) + ylab("# of citations") +
  theme(plot.title=element_text(size=18, colour="blue"),
        axis.text.y=element_text(colour="grey20", size=12)) +
  ggtitle('Citations of his coauthors') + coord_flip()

Here is the plot:

He has published with scientists who have been cited more than 12000 times and with students like me who are just toddling.

Summary

In this post, we saw how to scrape Google Scholar. We scraped the account of my advisor and got data on the citations of his papers and on his coauthors, with their affiliations and how many times they have been cited.

As we have seen in this post, it is easy to scrape an html page using the rvest R package. It is also important to note that SelectorGadget is useful to find out which css selector matches the data of our interest.

If you have any question feel free to post a comment below.


Happy New Year! Top posts of 2015

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Happy New Year everyone! It’s hard to believe that this blog has now been going since 2008: our anniversary was on December 9. Thanks to everyone who has supported this blog over the past 7 years by reading, sharing and commenting on our posts, and an extra special thanks to my co-bloggers Joe Rickert and Andrie de Vries and all the guest bloggers from Microsoft and elsewhere that have contributed this year.

2015 was a busy year for the blog, with an 8% increase in users and a 13% increase in page views compared to 2014. The most popular posts of the year, starting with the most popular, were:

That’s all from us from the team here at Revolutions for this week, and indeed for this year! We’ll be back in the New Year with more news, tips and tricks about R, but in the meantime we’ll let R have the last word thanks to some careful seed selection by Berry Boessenkool:

> set.seed(31612310)
> paste0(sample(letters,5,T))
[1] "h" "a" "p" "p" "y"
> set.seed(12353)
> sample(0:9,4,T)
[1] 2 0 1 6


Error Control in Exploratory ANOVA’s: The How and the Why

By Daniel Lakens

(This article was first published on The 20% Statistician, and kindly contributed to R-bloggers)



In a 2x2x2 design, there are three main effects, three two-way interactions, and one three-way interaction to test. That’s 7 statistical tests. The probability of making at least one Type 1 error in a single ANOVA is 1-(0.95)^7 = 30%.
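The arithmetic is easy to check in R:

1 - (1 - 0.05)^7    # 7 tests in a 2x2x2 ANOVA:   0.3016627, about 30%
1 - (1 - 0.05)^15   # 15 tests in a 2x2x2x2 ANOVA: 0.5367088, about 54%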

There are earlier blog posts on this, but my eyes were not opened until I read this paper by Angelique Cramer and colleagues (put it on your reading list, if you haven’t read it yet). Because I prefer to provide solutions to problems, I want to show how to control Type 1 error rates in ANOVA’s in R, and repeat why it’s necessary if you don’t want to fool yourself. Please be aware that if you continue reading, you will lose the bliss of ignorance if you hadn’t thought about this issue before now, and it will reduce the number of p < 0.05 results you’ll find in exploratory ANOVA’s.

Simulating Type 1 errors in 3-way ANOVA’s

Let’s simulate 250,000 2x2x2 ANOVA’s where all factors are manipulated between individuals, with 50 participants in each condition, and without any true effect (all group means are equal). The R code is at the bottom of the original post. We store the p-values of the 7 tests. The total p-value distribution has the by now familiar uniform shape we see if the null hypothesis is true.
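Here is a minimal sketch of the idea (my own condensed version, not the author’s script, and with far fewer iterations so it runs quickly):

# One between-subjects 2x2x2 ANOVA on pure noise, returning the 7 p-values
sim_once <- function(n_per_cell = 50) {
  d <- expand.grid(A = factor(1:2), B = factor(1:2), C = factor(1:2))
  d <- d[rep(1:8, each = n_per_cell), ]
  d$y <- rnorm(nrow(d))                        # all group means equal: every null is true
  summary(aov(y ~ A * B * C, data = d))[[1]][["Pr(>F)"]][1:7]
}
set.seed(1)
p <- replicate(2000, sim_once())               # 2000 runs instead of 250,000
mean(apply(p < .05, 2, any))                   # familywise Type 1 error rate, ~0.30
hist(as.vector(p), breaks = 20)                # roughly uniform p-value distribution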

If we count the number of significant findings (even though there is no real effect), we see that from 250,000 2x2x2 ANOVA’s, approximately 87,500 p-values were smaller than 0.05 (the left-most bar in the Figure). This equals 250,000 ANOVA’s x 0.05 Type 1 errors x 7 tests. If we split up the p-values for each of the 7 tests, we see in the table below that, as expected, each test has its own 5% error rate, which together adds up to a 30% error rate due to multiple testing. With a 2x2x2x2 ANOVA, the Type 1 error rate you’ll find is a massive 54%, making you about as accurate a scientist as a coin-flipping toddler.

Let’s fix this. We need to adjust the error rate. The Bonferroni correction (divide your alpha level by the number of tests, so for 7 tests and alpha = 0.05 use 0.05/7 = 0.007 for each test) communicates the basic idea very well, but the Holm-Bonferroni correction is slightly better. In fields outside of psychology (e.g., economics, gene discovery) work on optimal Type 1 error control procedures continues. I’ve used the mutoss package in R in my simulations to check a wide range of corrections, and came to the conclusion that unless the number of tests is huge, we don’t need anything more fancy than the Holm-Bonferroni (or sequential Bonferroni) correction (please correct me if I’m wrong in the comments!). It orders p-values from lowest to highest, and tests them sequentially against an increasingly more lenient alpha level. If you prefer a spreadsheet, go here.
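If you want to stay in base R, p.adjust() implements the same correction; here is a small sketch with hypothetical p-values:

pvals <- c(0.004, 0.010, 0.030, 0.045, 0.120, 0.300, 0.700)  # 7 hypothetical tests
p.adjust(pvals, method = "holm")        # Holm-adjusted p-values
p.adjust(pvals, method = "holm") < .05  # which tests survive at alpha = .05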

In a 2x2x2 ANOVA, we can test three main effects, three 2-way interactions, and one 3-way interaction. The table below shows that the error rate for each of these 7 tests is 5% (for a total of 1-0.95^7 = 30%), but after the Holm-Bonferroni correction, the Type 1 error rate is nicely controlled.



However, another challenge is to not let Type 1 error control inflate the Type 2 error rate too much. To examine this, I’ve simulated 2x2x2 ANOVAs where there is a true effect. One of the eight cells has a small positive difference, and one has a small negative difference. As a consequence, with sufficient power, we should find 4 significant effects (one main effect, two 2-way interactions, and the 3-way interaction).
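
The text above doesn’t specify which two cells carry the effect or how large it is, so the sketch below makes an assumption: placing the positive and negative difference in two cells that differ only in the level of B makes exactly the B main effect, the A:B and B:C interactions, and the three-way interaction the true effects.

set.seed(2)
delta <- 0.5                              # assumed raw difference (sd = 1); not stated in the post
d <- expand.grid(A = factor(c("a1", "a2")),
                 B = factor(c("b1", "b2")),
                 C = factor(c("c1", "c2")))
d <- d[rep(1:8, each = 50), ]
mu <- ifelse(d$A == "a1" & d$B == "b1" & d$C == "c1",  delta,
      ifelse(d$A == "a1" & d$B == "b2" & d$C == "c1", -delta, 0))
d$y <- rnorm(nrow(d), mean = mu, sd = 1)
summary(aov(y ~ A * B * C, data = d))     # true effects: B, A:B, B:C, A:B:C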

Let’s first look at the p-value distribution. I’ve added a horizontal and a vertical line. The horizontal line indicates the null distribution produced by the three null effects. The vertical line indicates the significance level of 0.05. The two lines create four quarters: top left are the true positives, bottom left are the false positives, top right are the false negatives (not significant due to a lack of power), and bottom right are the true negatives.

Now let’s plot the adjusted p-values using Holm’s correction (instead of changing the alpha level for each test, we can also keep the alpha fixed, but adjust the p-value).
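
The mechanics are the same whichever simulation the p-values come from; given a matrix of p-values with one row per ANOVA (like pvals in the sketch above), the Holm adjustment and the histogram are one line each:

p_holm <- t(apply(pvals, 1, p.adjust, method = "holm"))   # adjust within each ANOVA
hist(p_holm, breaks = 20)    # many adjusted p-values are pushed up to 1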


We see a substantial drop in the leftmost bar, and this drop is larger than can be explained by the removal of false positives alone. We also see a peculiarly high bar on the right, caused by the Holm correction adjusting a large number of p-values to 1. We can see this drop in power in the Table below as well. It’s substantial: from 87% power to 68% power.

If you perform a 2x2x2 ANOVA, we might expect that you are not really interested in the main effects (if you were, a simple t-test would have sufficed). The power cost is already much lower if the exploratory analysis focuses on only four tests: the three 2-way interactions and the 3-way interaction (see the third row in the Table below). Even exploratory 2x2x2 ANOVAs are typically not 100% exploratory. If so, preregistering the subset of tests you are interested in, and controlling the error rate for only that subset, provides an important boost in power.
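
Restricting the correction to a preregistered subset is simply a matter of which p-values you hand to p.adjust(); the values below are hypothetical:

p_all <- c(A = 0.20, B = 0.03, C = 0.55,
           "A:B" = 0.011, "A:C" = 0.04, "B:C" = 0.26, "A:B:C" = 0.009)
interactions <- c("A:B", "A:C", "B:C", "A:B:C")
p.adjust(p_all[interactions], method = "holm")   # alpha is now spread over 4 tests instead of 7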




Oh come on you silly methodological fetishist!

If you think Type 1 error control should not endanger the discovery of true effects, here’s what you should not do. You should not wave your hands at controlling Type 1 error rates, saying it is ‘methodological fetishism’ (Ellemers, 2013). It ain’t gonna work. If you choose to report p-values (by all means, don’t), and want to do quantitative science (by all means, don’t), then the formal logic you are following (even if you don’t realize this) is the Neyman-Pearson approach. It allows you to say: ‘In the long run, I’m not saying there’s something, when there is nothing, more than X% of the time’. If you don’t control error rates, your epistemic foundation for making such statements reduces to ‘In the long run, I’m not saying there’s something, when there is nothing, more than … uhm … nice weather for the time of the year, isn’t it?’.

Now, just because you need to control error rates doesn’t mean you need to use a Type 1 error rate of 5%. If you plan to replicate any effect you find in an exploratory study, and you set the alpha to 0.2, the probability of making a Type 1 error twice in a row is 0.2*0.2 = 0.04. If you want to explore four different interactions in a 2x2x2 ANOVA you intend to replicate in any case, setting your overall Type 1 error rate across two studies to 0.2, and then using an alpha of 0.05 for each of the 4 tests, might be a good idea. If some effects would be costlier to miss, but others less costly, you can use an alpha of 0.08 for two effects, and an alpha of 0.02 for the other two (the four alphas still sum to the overall 0.2). This is just one example. It’s your party. You can easily pre-register the choices you make on the OSF or AsPredicted to transparently communicate them.
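
The arithmetic behind these choices is trivial, but writing it out makes the bookkeeping explicit:

0.2 * 0.2                            # a Type 1 error twice in a row at alpha = 0.2: 0.04
alphas <- c(0.08, 0.08, 0.02, 0.02)  # unequal split of the overall alpha over four tests
sum(alphas)                          # still adds up to 0.2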

You can also throw error control out of the window. There are approximately 1,950,000 hits in Google Scholar when I search for ‘An Exploratory Analysis Of’. Put these words in the title, list all your DVs in the main text (e.g., in a table), add Bayesian statistics and effect sizes with their confidence intervals, and don’t draw strong conclusions (Bender & Lange, 2001).

Obviously, the tricky thing is always what to do if your prediction was not confirmed. I think you enter a Lakatosian degenerative research line (as opposed to the progressive research line you’d be in if your predictions were confirmed). With some luck, there’s an easy fix. The same study with a larger sample (or, if you designed the study using sequential analyses, simply continuing data collection after the first look at the data; Lakens, 2014) might get you back into a progressive research line after an update of the predicted effect size. Or try again with a better manipulation or dependent variable. Giving up on a research idea after a single failed confirmation is not, in general, how science works. Statistical inferences tell you how to interpret the data without fooling yourself. Type 1 error control matters, and in most psychology experiments it is relatively easy to do. But it’s only one of the things you take into account when you decide which research you want to do.

My main point here is that there are many possible solutions, and all you have to do is choose the one that best fits your goals. Since your goal is very unlikely to be a 30% Type 1 error rate in a single study that you interpret as a 5% Type 1 error rate, you have to do something. There’s a lot of room between 100% exploratory and 100% confirmatory research, and there are many reasonable ideas about what the ‘family’ of errors is that you want to control (for a good discussion on this, see Bender & Lange, 2001). I fully support their conclusion (p. 344): “Whatever the decision is, it should clearly be stated why and how the chosen analyses are performed, and which error rate is controlled for”. Clear words, no hand waving.


Thanks to @RogierK for correcting an error in an earlier version of this blog post.


Bender, R., & Lange, S. (2001). Adjusting for multiple testing—when and how? Journal of Clinical Epidemiology, 54(4), 343–349.


Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., … Wagenmakers, E.-J. (2014). Hidden multiplicity in multiway ANOVA: Prevalence, consequences, and remedies. arXiv Preprint arXiv:1412.3416. Retrieved from http://arxiv.org/abs/1412.3416

Ellemers, N. (2013). Connecting the dots: Mobilizing theory to reveal the big picture in social psychology (and why we should do this): The big picture in social psychology. European Journal of Social Psychology, 43(1), 1–8. http://doi.org/10.1002/ejsp.1932

Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. http://doi.org/10.1002/ejsp.2023






To leave a comment for the author, please follow the link and comment on their blog: The 20% Statistician.

