Presidential Election Predictions 2016 (an ASA competition)

By Tal Galili

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

Guest post by Jo Hardin, professor of mathematics, Pomona College.

ASA’s Prediction Competition

In this election year, the American Statistical Association (ASA) has put together a competition for students to predict the exact percentages for the winner of the 2016 presidential election. They are offering cash prizes for the entry that gets closest to the national vote percentage and that best predicts the winners for each state and the District of Columbia. For more details see:

http://thisisstatistics.org/electionprediction2016/

To get you started, I’ve written an analysis of data scraped from fivethirtyeight.com. The analysis uses weighted means and a formula for the standard error (SE) of a weighted mean. For your analysis, you might consider a similar analysis on the state data (what assumptions would you make for a new weight function?). Or you might try some kind of model – either a generalized linear model or a Bayesian analysis with an informed prior. The world is your oyster!

Getting the Data

Thanks to the Internet, there is a lot of polling data which is publicly accessible. For the competition, you are welcome to get your data from anywhere. However, I’m going to take mine from 538. http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/ (Other good sources of data are http://www.realclearpolitics.com/epolls/latest_polls/ and http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton and http://www.gallup.com/products/170987/gallup-analytics.aspx)

Note the date indicated above as to when this R Markdown file was written. That’s the day the data were scraped from 538. If you run the Markdown file on a different day, you are likely to get different results as the polls are constantly being updated.

Because the original data were scraped as a JSON file, it gets pulled into R as a list of lists. The data wrangling used to convert it into a tidy format is available from the source code at https://github.com/hardin47/prediction2016/blob/master/predblog.Rmd.

library(XML)   # for htmlParse() and xpathSApply()
# fromJSON() below comes from a JSON package (e.g. rjson or jsonlite),
# which is loaded in the full source Rmd

url = "http://projects.fivethirtyeight.com/2016-election-forecast/national-polls/"
doc <- htmlParse(url, useInternalNodes = TRUE)

sc = xpathSApply(doc, "//script[contains(., 'race.model')]",
                 function(x) c(xmlValue(x), xmlAttrs(x)[["href"]]))

jsobj = gsub(".*race.stateData = (.*);race.pathPrefix.*", "\\1", sc)

data = fromJSON(jsobj)
allpolls <- data$polls

# unlisting the whole thing
indx <- sapply(allpolls, length)
pollsdf <- as.data.frame(do.call(rbind, lapply(allpolls, `length<-`, max(indx))))

A Quick Visualization

Before coming up with a prediction for the vote percentages for the 2016 US Presidential Race, it is worth trying to look at the data. The data are in a tidy form, so ggplot2 will be the right tool for visualizing the data.

ggplot(subset(allpolldata, ((polltypeA == "now") & (endDate > ymd("2016-08-01")))),
       aes(y = adj_pct, x = endDate, color = choice)) +
  geom_line() + geom_point(aes(size = wtnow)) +
  labs(title = "Vote percentage by date and poll weight\n",
       y = "Percent Vote if Election Today", x = "Poll Date",
       color = "Candidate", size = "538 Poll\nWeight")

A Quick Analysis

Let’s try to think about the percentage of votes that each candidate will get based on the now cast polling percentages. We’d like to weight the votes based on what 538 thinks (hey, they’ve been doing this longer than I have!), the sample size, and the number of days since the poll closed.

Using my weight, I’ll calculate a weighted average and a weighted SE for the predicted percent of votes. (The SE of the weighted mean is taken from Cochran (1977) and cited in Gatz and Smith (1995).) The weights can be used to calculate the average or the running average for the now cast polling percentages.

allpolldata2 <- allpolldata %>%
  filter(wtnow > 0) %>%
  filter(polltypeA == "now") %>%
  mutate(dayssince = as.numeric(today() - endDate)) %>%
  mutate(wt = wtnow * sqrt(sampleSize) / dayssince) %>%
  mutate(votewt = wt * pct) %>%
  group_by(choice) %>%
  arrange(choice, -dayssince) %>%
  mutate(cum.mean.wt = cumsum(votewt) / cumsum(wt)) %>%
  mutate(cum.mean = cummean(pct))

Plotting the Cumulative Mean / Weighted Mean

In tidy format, the data are ready to plot. Note that the cumulative mean is much smoother than the cumulative weighted mean because the weights are much heavier toward the later polls.

ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
       aes(y = cum.mean, x = endDate, color = choice)) +
  geom_line() + geom_point(aes(size = wt)) +
  labs(title = "Cumulative Mean Vote Percentage\n",
       y = "Cumulative Percent Vote if Election Today", x = "Poll Date",
       color = "Candidate", size = "Calculated Weight")

ggplot(subset(allpolldata2, (endDate > ymd("2016-01-01"))),
       aes(y = cum.mean.wt, x = endDate, color = choice)) +
  geom_line() + geom_point(aes(size = wt)) +
  labs(title = "Cumulative Weighted Mean Vote Percentage\n",
       y = "Cumulative Weighted Percent Vote if Election Today", x = "Poll Date",
       color = "Candidate", size = "Calculated Weight")

Additionally, the weighted average and the SE of the average (given by Cochran (1977)) can be computed for each candidate. Using the formula, we have our prediction of the final percentage of the popular vote for each major candidate!
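
The pollsummary calculation below relies on a helper function, weighted.var.se(), which is defined in the full source Rmd linked above. A minimal sketch of such a function, following the Gatz and Smith (1995) presentation of Cochran’s (1977) approximation (the exact implementation in the original file may differ), could look like this:

weighted.var.se <- function(x, w, na.rm = FALSE) {
  # Approximate variance of a weighted mean (Cochran 1977, via Gatz & Smith 1995);
  # illustrative sketch only
  if (na.rm) {
    keep <- !is.na(x) & !is.na(w)
    x <- x[keep]; w <- w[keep]
  }
  n     <- length(w)
  xWbar <- weighted.mean(x, w)
  wbar  <- mean(w)
  n / ((n - 1) * sum(w)^2) *
    (sum((w * x - wbar * xWbar)^2) -
       2 * xWbar * sum((w - wbar) * (w * x - wbar * xWbar)) +
       xWbar^2 * sum((w - wbar)^2))
}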

pollsummary <- allpolldata2 %>%
select(choice, pct, wt, votewt, sampleSize, dayssince) %>%
group_by(choice) %>%
summarise(mean.vote = weighted.mean(pct, wt, na.rm=TRUE),
std.vote = sqrt(weighted.var.se(pct, wt, na.rm=TRUE)))

pollsummary

## # A tibble: 2 x 3
##    choice mean.vote  std.vote
##     <chr>     <dbl>     <dbl>
## 1 Clinton  43.64687 0.5073492
## 2   Trump  39.32071 1.1792667

Other people’s advice

Prediction is very difficult, especially about the future. – Niels Bohr

Along with good data sources, you should also be able to find information about prediction and modeling. I’ve provided a few resources to get you started.

Andrew Gelman: http://andrewgelman.com/2016/08/17/29654/

Sam Wang: http://election.princeton.edu/2016/08/21/sharpening-the-forecast/

Fivethirtyeight: http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/

Christensen and Florence: http://www.amstat.org/misc/tasarticle.pdf and https://tofu.byu.edu/electionpollproject/

To leave a comment for the author, please follow the link and comment on their blog: R – R-statistics blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Tickets now available for EARL2016 Boston

By Angela Roberts


(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

We are pleased to announce that registration has now opened for this year’s Boston EARL Conference, which will be held at the Boston Science Museum on 7–9 November.

The call for abstracts is still open, so it’s not too late to submit a presentation should you wish to do so. Presentations are sought around the business usage and commercial application of R. The full conference agenda will be announced in September. Pre-conference workshops will be held on 7 November, with two full conference days following.

For more information please visit the website or contact earl-team@mango-solutions.com

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Analysis of the #7FavPackages hashtag

By David Robinson


(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

Twitter has seen a recent trend of “first 7” and “favorite 7” hashtags, like #7FirstJobs and #7FavFilms. Last week I added one to the mix, about my 7 favorite R packages:

devtools
dplyr
ggplot2
knitr
Rcpp
rmarkdown
shiny#7FavPackages #rstats

— David Robinson (@drob) August 16, 2016

Hadley Wickham agreed to share his own, but on one condition:

@drob I’ll do it if you write a script to scrape the tweets, plot overall most common, and common co-occurences

— Hadley Wickham (@hadleywickham) August 16, 2016

Hadley followed through, so now it’s my turn.

Setup

We can use the same twitteR package that I used in my analysis of Trump’s Twitter account:

library(twitteR)
library(purrr)
library(dplyr)
library(stringr)

# You'd need to set up authentication before running this
# See help(setup_twitter_oauth)
tweets <- searchTwitter("#7FavPackages", n = 3200) %>%
  map_df(as.data.frame)

# Grab only the first for each user (some had followups), and ignore retweets
tweets <- tweets %>%
  filter(!str_detect(text, "^RT ")) %>%
  arrange(created) %>%
  distinct(screenName, .keep_all = TRUE)

There were 116 (unique) tweets in this hashtag. I can use the tidytext package to analyze them, using a custom regular expression.

library(BiocInstaller)

# to avoid non-package words
built_in <- tolower(sessionInfo()$basePkgs)
cran_pkgs <- tolower(rownames(available.packages()))
bioc_pkgs <- tolower(rownames(available.packages(repos = biocinstallRepos()[1:3])))
blacklist <- c("all")

library(tidytext)

spl_re <- "[^a-zA-Zd@#.]"
link_re <- "https://t.co/[A-Za-zd]+|&amp;"

packages <- tweets %>%
  mutate(text = str_replace_all(text, link_re, "")) %>%
  unnest_tokens(package, text, token = "regex", pattern = spl_re) %>%
  filter(package %in% c(cran_pkgs, bioc_pkgs, built_in)) %>%
  distinct(id, package) %>%
  filter(!package %in% blacklist)

pkg_counts <- packages %>%
  count(package, sort = TRUE)

Note that since a lot of non-package words got mixed in with these tweets, I filtered for only packages in CRAN and Bioconductor (so packages that are only on GitHub or elsewhere won’t be included, though anecdotally I didn’t notice any among the tweets). Tweeters were sometimes inconsistent about case as well, so I kept all packages lowercase throughout this analysis.

General results

There were 700 occurrences of 184 packages in these tweets. What were the most common?

Some observations:

  • ggplot2 and dplyr were the most popular packages, each mentioned by more than half the tweets, and other packages by Hadley like tidyr, devtools, purrr and stringr weren’t far behind. This isn’t too surprising, since much of the attention to the hashtag came with Hadley’s tweet.

@drob @JaySun_Bee @ma_salmon HOW IS THAT BIASED?

— Hadley Wickham (@hadleywickham) August 16, 2016

  • The next most popular packages involved reproducible research (rmarkdown and knitr), along with other RStudio tools like shiny. What if I excluded packages maintained by RStudio (or RStudio employees like Hadley and Yihui)?


  • The vast majority of packages people listed as their favorite were CRAN packages: only 7 Bioconductor packages were mentioned (though it’s worth noting they occurred across four different tweets):
packages %>%
  filter(package %in% bioc_pkgs)
## # A tibble: 7 x 2
##                   id       package
##                <chr>         <chr>
## 1 765622260556238848          xcms
## 2 765637117976387584         edger
## 3 765637117976387584         limma
## 4 765669197284245504       biomart
## 5 765669197284245504 genomicranges
## 6 765669197284245504        deseq2
## 7 765630231948308481    genefilter
  • There were 109 CRAN packages that were mentioned only once, and those showed a rather large variety. A random sample of 10:
set.seed(2016)
pkg_counts %>%
  filter(n == 1, !package %in% bioc_pkgs) %>%
  sample_n(10)
## # A tibble: 10 x 2
##       package     n
##         <chr> <int>
## 1      fclust     1
## 2          dt     1
## 3   shinystan     1
## 4        domc     1
## 5     mapview     1
## 6        daff     1
## 7     pbapply     1
## 8  visnetwork     1
## 9          af     1
## 10        arm     1

Correlations

What packages tend to be “co-favorited”, that is, listed by the same people? Here I’m using my in-development widyr package, which makes it easy to calculate pairwise correlations in a tidy data frame.

# install with devtools::install_github("dgrtwo/widyr")
library(widyr)

# use only packages with at least 4 mentions, to reduce noise
pkg_counts <- packages %>%
  count(package) %>%
  filter(n >= 4)

pkg_correlations <- packages %>%
  semi_join(pkg_counts) %>%
  pairwise_cor(package, id, sort = TRUE, upper = FALSE)

pkg_correlations
## # A tibble: 465 x 3
##         item1       item2 correlation
##         <chr>       <chr>       <dbl>
## 1        base    graphics   0.5813129
## 2    graphics       stats   0.4716609
## 3        base       stats   0.4495913
## 4       dplyr       tidyr   0.3822511
## 5       rstan    rstanarm   0.3791074
## 6       dplyr     ggplot2   0.3483315
## 7     ggplot2       knitr   0.3032979
## 8       dplyr       shiny   0.3027743
## 9  data.table htmlwidgets   0.2937083
## 10    ggplot2       tidyr   0.2811096
## # ... with 455 more rows

For instance, this shows that the greatest correlations (technically phi coefficients) were among the base, graphics, and stats packages, driven by people showing loyalty to built-in packages.
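
As a quick sanity check, note that a phi coefficient is simply the Pearson correlation of two 0/1 indicator vectors, one entry per tweet. A hedged sketch, reusing the packages and pkg_counts objects from above (base and graphics both survive the n >= 4 filter, as the output shows):

library(dplyr)
library(tidyr)

# One row per tweet, one 0/1 column per (sufficiently common) package
pkg_matrix <- packages %>%
  semi_join(pkg_counts, by = "package") %>%
  mutate(present = 1L) %>%
  spread(package, present, fill = 0L)

# Phi coefficient for base and graphics: correlate the indicator columns;
# this should roughly match the base/graphics value from pairwise_cor above
cor(pkg_matrix$base, pkg_matrix$graphics)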

I like using the ggraph package to visualize these relationships:

library(ggraph)
library(igraph)

set.seed(2016)

# we set an arbitrary threshold of connectivity
pkg_correlations %>%
  filter(correlation > .2) %>%
  graph_from_data_frame(vertices = pkg_counts) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point(aes(size = n), color = "lightblue") +
  theme_void() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme(legend.position = "none")


You can recognize most of RStudio’s packages (ggplot2, dplyr, tidyr, knitr, shiny) in the cluster on the bottom left of the graph. At the bottom right you can see the “base” cluster (stats, base, utils, grid, graphics), with people who showed their loyalty to base packages.

Beyond that, the relationships are a bit harder to parse (outside of some expected combinations like rstan and rstanarm): we may just not have enough data to create reliable correlations.

Compared to CRAN dependencies

This isn’t a particularly scientific survey, to say the least. So how does it compare to another metric of a package’s popularity: the number of packages that Depend, Import, or Suggest it on CRAN? (You could also compare to # of CRAN downloads using the cranlogs package, but since most downloads are due to dependencies, the two metrics give rather similar results).
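
(If you did want the download comparison, a minimal sketch using the cranlogs package might look like the following; it isn’t used further here.)

library(cranlogs)
library(dplyr)

# Total downloads over the last month for a few packages of interest
cran_downloads(packages = c("ggplot2", "dplyr", "plyr", "purrr"),
               when = "last-month") %>%
  group_by(package) %>%
  summarize(total = sum(count))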

We can discover this using the available.packages() function, along with some processing.

library(tidyr)

pkgs <- available.packages() %>%
  as.data.frame() %>%
  tbl_df()

requirements <- pkgs %>%
  unite(Requires, Depends, Imports, Suggests, sep = ",") %>%
  transmute(Package = as.character(Package),
            Requires = as.character(Requires)) %>%
  unnest(Requires = str_split(Requires, ",")) %>%
  mutate(Requires = str_replace(Requires, "\n", "")) %>%
  mutate(Requires = str_trim(str_replace(Requires, "\\(.*", ""))) %>%
  filter(!(Requires %in% c("R", "NA", "", built_in)))

requirements
## # A tibble: 34,781 x 2
##    Package     Requires
##      <chr>        <chr>
## 1       A3       xtable
## 2       A3      pbapply
## 3       A3 randomForest
## 4       A3        e1071
## 5   abbyyR         httr
## 6   abbyyR          XML
## 7   abbyyR         curl
## 8   abbyyR        readr
## 9   abbyyR     progress
## 10  abbyyR     testthat
## # ... with 34,771 more rows
package_info <- requirements %>%
  count(Package = Requires) %>%
  rename(NRequiredBy = n) %>%
  left_join(count(requirements, Package)) %>%
  rename(NRequires = n) %>%
  replace_na(list(NRequires = 0))

package_info
## # A tibble: 2,925 x 3
##               Package NRequiredBy NRequires
##                 <chr>       <int>     <dbl>
## 1              a4Base           1         0
## 2              a4Core           1         0
## 3                 abc           3         5
## 4            abc.data           2         0
## 5                 abd           1        12
## 6               abind          95         0
## 7  AcceptanceSampling           1         0
## 8             acepack           2         0
## 9                 acp           1         2
## 10                acs           3         4
## # ... with 2,915 more rows

We can compare the number of mentions in the hashtag to the number of packages that require each one:


Some packages, like dplyr, ggplot2, and knitr, are popular both within the hashtag and as CRAN dependencies. Some relatively new packages like purrr are popular on Twitter but haven’t yet built up as many packages needing them, while others like plyr and foreach are common dependencies but are barely mentioned. (This isn’t even counting the many packages never mentioned in the hashtag).

Since we have this dependency data, I can’t resist looking for correlations just like we did with the hashtag data. What packages tend to be depended on together?

## # A tibble: 64,770 x 3
##            item1         item2 correlation
##            <chr>         <chr>       <dbl>
## 1           R.oo   R.methodsS3   0.9031741
## 2    R.methodsS3          R.oo   0.9031741
## 3     doParallel       foreach   0.7341718
## 4        foreach    doParallel   0.7341718
## 5       timeDate    timeSeries   0.7084369
## 6     timeSeries      timeDate   0.7084369
## 7  gWidgetsRGtk2      gWidgets   0.6974373
## 8       gWidgets gWidgetsRGtk2   0.6974373
## 9           ergm       network   0.6336521
## 10       network          ergm   0.6336521
## # ... with 64,760 more rows


(I skipped the code for these, but you can find it all here).

Some observations from the full network (while it’s not related to the hashtag, still quite interesting):

  • The RStudio cluster is prominent in the lower left, with ggplot2, knitr and testthat serving as the core anchors. A lot of packages depend on these in combination.
  • You can spot a tight cluster of spatial statistics packages in the upper left (around “sp”) and of machine learning packages near the bottom right (around caret, rpart, and nnet)
  • Smaller clusters include parallelization on the left (parallel, doParallel), time series forecasting on the upper right (zoo, xts, forecast), and parsing API data on top (RCurl, rjson, XML)

One thing I like about this 2D layout (much as I’ve done with programming languages using Stack Overflow data) is that we can bring in our hashtag information, and spot visually what types of packages tended to be favorited.


This confirms our observation that the favorited packages are slanted towards the tidyverse/RStudio cluster.

The #7First and #7Fav hashtags have been dying down a bit, but it may still be interesting to try this analysis for others, especially ones with more activity. Maëlle Salmon is working on a great analysis of #7FirstJobs and I’m sure others would be informative.

To leave a comment for the author, please follow the link and comment on their blog: Variance Explained.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R with Power BI: Import, Transform, Visualize and Share

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Power BI, Microsoft’s data visualization and reporting platform, has made great strides in the past year integrating the R language. This Computerworld article describes the recent advances with Power BI and R. In short, you can use R to import, transform, and visualize data, and then share the results, all within Power BI.

Power BI desktop is completely free to download and use, and includes all the features you need to create visualizations, reports and dashboards. (Publishing to Power BI online requires a subscription, though.) Power BI desktop and R are both included in the Data Science Virtual Machine, so that’s another easy way to get started.

Sharon Laivand from the Power BI engineering team recently gave a webinar showcasing the capabilities of Power BI and R. Fast-forward to the 29-minute mark to see how to create a report incorporating R-based calculations and graphics, and then share it with others (even people who don’t have R installed!) using Power BI online.

For more information about Power BI, visit powerbi.microsoft.com.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Five ways to calculate internal consistency

By Simon Jackson


(This article was first published on blogR, and kindly contributed to R-bloggers)

Let’s get psychometric and learn a range of ways to compute the internal consistency of a test or questionnaire in R. We’ll be covering:

  • Average inter-item correlation
  • Average item-total correlation
  • Cronbach’s alpha
  • Split-half reliability (adjusted using the Spearman–Brown prophecy formula)
  • Composite reliability

If you’re unfamiliar with any of these, here are some resources to get you up to speed:

The data

For this post, we’ll be using data on a Big 5 measure of personality that is freely available from Personality Tests. You can download the data yourself HERE, or running the following code will handle the downloading and save the data as an object called d:

temp <- tempfile()
download.file("http://personality-testing.info/_rawdata/BIG5.zip", temp, mode="wb")
d <- read.table(unz(temp, "BIG5/data.csv"), header = TRUE, sep="\t")
unlink(temp); rm(temp)

At the time this post was written, this data set contained data for 19719 people, starting with some demographic information and then their responses on 50 items: 10 for each Big 5 dimension. This is a bit much, so let’s cut it down to work on the first 500 participants and the Extraversion items (E1 to E10):

d <- d[1:500, paste0("E", 1:10)]
str(d)
#> 'data.frame':    500 obs. of  10 variables:
#>  $ E1 : int  4 2 5 2 3 1 5 4 3 1 ...
#>  $ E2 : int  2 2 1 5 1 5 1 3 1 4 ...
#>  $ E3 : int  5 3 1 2 3 2 5 5 5 2 ...
#>  $ E4 : int  2 3 4 4 3 4 1 3 1 5 ...
#>  $ E5 : int  5 3 5 3 3 1 5 5 5 2 ...
#>  $ E6 : int  1 3 1 4 1 3 1 1 1 4 ...
#>  $ E7 : int  4 1 1 3 3 2 5 4 5 1 ...
#>  $ E8 : int  3 5 5 4 1 4 4 3 2 4 ...
#>  $ E9 : int  5 1 5 4 3 1 4 4 5 1 ...
#>  $ E10: int  1 5 1 5 5 5 1 3 3 5 ...

Here is a list of the extraversion items that people are rating from 1 = Disagree to 5 = Agree:

  • E1 I am the life of the party.
  • E2 I don’t talk a lot.
  • E3 I feel comfortable around people.
  • E4 I keep in the background.
  • E5 I start conversations.
  • E6 I have little to say.
  • E7 I talk to a lot of different people at parties.
  • E8 I don’t like to draw attention to myself.
  • E9 I don’t mind being the center of attention.
  • E10 I am quiet around strangers.

You can see that there are five items that need to be reverse scored (E2, E4, E6, E8, E10). Because ratings range from 1 to 5, we can do the following:

d[, paste0("E", c(2, 4, 6, 8, 10))] <- 6 - d[, paste0("E", c(2, 4, 6, 8, 10))]

We’ve now got a data frame of responses with each column being an item (scored in the correct direction) and each row being a participant. Let’s get started!

Average inter-item correlation

The average inter-item correlation is an easy place to start. To calculate this statistic, we need the correlations between all items, and then to average them. Let’s use my corrr package to get these correlations as follows (no bias here!):

library(corrr)
d %>% correlate()
#> # A tibble: 10 x 11
#>    rowname        E1        E2        E3        E4        E5        E6
#>      <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#> 1       E1        NA 0.4528892 0.5002331 0.5237516 0.5378563 0.3657830
#> 2       E2 0.4528892        NA 0.4792026 0.5549109 0.5916995 0.5694593
#> 3       E3 0.5002331 0.4792026        NA 0.4930294 0.6161848 0.3296186
#> 4       E4 0.5237516 0.5549109 0.4930294        NA 0.5123502 0.4714739
#> 5       E5 0.5378563 0.5916995 0.6161848 0.5123502        NA 0.4996636
#> 6       E6 0.3657830 0.5694593 0.3296186 0.4714739 0.4996636        NA
#> 7       E7 0.6360617 0.4731673 0.5684515 0.4999339 0.6205428 0.3725773
#> 8       E8 0.4498123 0.3791802 0.4177098 0.4505763 0.3850780 0.3310247
#> 9       E9 0.5280366 0.3958574 0.4753085 0.4631383 0.4851778 0.3280024
#> 10     E10 0.4908861 0.4487868 0.5000737 0.5234228 0.5525188 0.4137737
#> # ... with 4 more variables: E7 <dbl>, E8 <dbl>, E9 <dbl>, E10 <dbl>

Because the diagonal is already set to NA, we can obtain the average correlation of each item with all others by computing the means for each column (excluding the rowname column):

inter_item <- d %>% correlate() %>% select(-rowname) %>% colMeans(na.rm = TRUE)
inter_item
#>        E1        E2        E3        E4        E5        E6        E7 
#> 0.4983678 0.4827948 0.4866458 0.4991764 0.5334524 0.4090418 0.5133091 
#>        E8        E9       E10 
#> 0.4270190 0.4731896 0.4814496

As an aside, note that select() comes from the dplyr package, which is imported when you use corrr.

We can see that E5 and E7 are more strongly correlated with the other items on average than E8. However, most items correlate with the others in a reasonably restricted range around .4 to .5.

To obtain the overall average inter-item correlation, we calculate the mean() of these values:

mean(inter_item)
#> [1] 0.4804446

However, with these values, we can explore a range of attributes about the relationships between the items. For example, we can visualise them in a histogram and highlight the mean as follows:

library(ggplot2)

data.frame(inter_item) %>% 
  ggplot(aes(x = inter_item)) +
    geom_histogram(bins = 10, alpha = .5) +
    geom_vline(xintercept = mean(inter_item), color = "red") +
    xlab("Mean inter-item correlation") +
    theme_bw()

Average item-total correlation

We can investigate the average item-total correlation in a similar way to the inter-item correlations. The first thing we need to do is calculate the total score. Let’s say that a person’s score is the mean of their responses to all ten items:

d$score <- rowMeans(d)
head(d)
#>   E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 score
#> 1  4  4  5  4  5  5  4  3  5   5   4.4
#> 2  2  4  3  3  3  3  1  1  1   1   2.2
#> 3  5  5  1  2  5  5  1  1  5   5   3.5
#> 4  2  1  2  2  3  2  3  2  4   1   2.2
#> 5  3  5  3  3  3  5  3  5  3   1   3.4
#> 6  1  1  2  2  1  3  2  2  1   1   1.6

Now, we’ll correlate() everything again, but this time focus() on the correlations of the score with the items:

item_total <- d %>% correlate() %>% focus(score)
item_total
#> # A tibble: 10 x 2
#>    rowname     score
#>      <chr>     <dbl>
#> 1       E1 0.7520849
#> 2       E2 0.7297506
#> 3       E3 0.7350644
#> 4       E4 0.7485092
#> 5       E5 0.7945943
#> 6       E6 0.6367799
#> 7       E7 0.7768180
#> 8       E8 0.6640914
#> 9       E9 0.7273987
#> 10     E10 0.7306038

Again, we can calculate their mean as:

mean(item_total$score)
#> [1] 0.7295695

And we can plot the results:

item_total %>% 
  ggplot(aes(x = score)) +
    geom_histogram(bins = 10, alpha = .5) +
    geom_vline(xintercept = mean(item_total$score), color = "red") +
    xlab("Mean item-total correlation") +
    theme_bw()

Cronbach’s alpha

Cronbach’s alpha is one of the most widely reported measures of internal consistency. Although it’s possible to implement the maths behind it, I’m lazy and like to use the alpha() function from the psych package. This function takes a data frame or matrix of data in the structure that we’re using: each column is a test/questionnaire item, each row is a person. Let’s test it out below. Note that alpha() is also a function from the ggplot2 package, and this creates a conflict. To specify that we want alpha() from the psych package, we will use psych::alpha()

d$score <- NULL  # delete the score column we made earlier

psych::alpha(d)
#> 
#> Reliability analysis   
#> Call: psych::alpha(x = d)
#> 
#>   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd
#>        0.9       0.9     0.9      0.48 9.2 0.0065    3 0.98
#> 
#>  lower alpha upper     95% confidence boundaries
#> 0.89 0.9 0.91 
#> 
#>  Reliability if an item is dropped:
#>     raw_alpha std.alpha G6(smc) average_r S/N alpha se
#> E1       0.89      0.89    0.89      0.48 8.2   0.0073
#> E2       0.89      0.89    0.89      0.48 8.3   0.0072
#> E3       0.89      0.89    0.89      0.48 8.3   0.0073
#> E4       0.89      0.89    0.89      0.48 8.2   0.0073
#> E5       0.89      0.89    0.89      0.47 7.9   0.0075
#> E6       0.90      0.90    0.90      0.50 8.9   0.0068
#> E7       0.89      0.89    0.89      0.47 8.1   0.0075
#> E8       0.90      0.90    0.90      0.49 8.8   0.0069
#> E9       0.89      0.89    0.89      0.48 8.4   0.0071
#> E10      0.89      0.89    0.89      0.48 8.3   0.0072
#> 
#>  Item statistics 
#>       n raw.r std.r r.cor r.drop mean  sd
#> E1  500  0.75  0.75  0.72   0.68  2.6 1.3
#> E2  500  0.73  0.73  0.70   0.66  3.2 1.3
#> E3  500  0.74  0.74  0.70   0.67  3.3 1.3
#> E4  500  0.75  0.75  0.72   0.68  2.8 1.3
#> E5  500  0.79  0.80  0.78   0.74  3.3 1.3
#> E6  500  0.64  0.64  0.59   0.55  3.6 1.3
#> E7  500  0.78  0.77  0.75   0.70  2.8 1.5
#> E8  500  0.66  0.66  0.61   0.58  2.7 1.3
#> E9  500  0.73  0.72  0.68   0.64  3.1 1.5
#> E10 500  0.73  0.73  0.69   0.66  2.3 1.3
#> 
#> Non missing response frequency for each item
#>        1    2    3    4    5 miss
#> E1  0.28 0.23 0.23 0.16 0.10    0
#> E2  0.12 0.21 0.22 0.21 0.23    0
#> E3  0.09 0.20 0.25 0.22 0.24    0
#> E4  0.19 0.26 0.23 0.21 0.12    0
#> E5  0.11 0.21 0.20 0.23 0.25    0
#> E6  0.09 0.15 0.17 0.28 0.30    0
#> E7  0.28 0.20 0.17 0.16 0.20    0
#> E8  0.23 0.27 0.19 0.19 0.12    0
#> E9  0.19 0.21 0.17 0.19 0.25    0
#> E10 0.36 0.27 0.12 0.15 0.09    0

This function provides a range of output, and generally what we’re interested in is std.alpha, which is “the standardised alpha based upon the correlations”. Also note that we get “the average interitem correlation”, average_r, and various versions of “the correlation of each item with the total score” such as raw.r, whose values match our earlier calculations.

If you’d like to access the alpha value itself, you can do the following:

psych::alpha(d)$total$std.alpha
#> [1] 0.9024126
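
As a rough hand check, the standardised alpha can be approximately reproduced from the average inter-item correlation computed earlier. This assumes the usual formula alpha = k * r_bar / (1 + (k - 1) * r_bar), with k items and average inter-item correlation r_bar; psych’s calculation may differ slightly in the details.

# Hand check of the standardised alpha (illustrative; assumes the formula above)
k <- ncol(d)               # 10 extraversion items (the score column was removed)
r_bar <- mean(inter_item)  # from the average inter-item correlation section
k * r_bar / (1 + (k - 1) * r_bar)
#> roughly 0.90, in line with std.alpha above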

Split-half reliability (adjusted using the Spearman–Brown prophecy formula)

There are times when we can’t calculate internal consistency using item responses. For example, I often work with a decision-making variable called recklessness. This variable is calculated after people answer questions (e.g., “What is the longest river in Asia?”), and then decide whether or not to bet on their answer being correct. Recklessness is calculated as the proportion of incorrect answers that a person bets on.

If you think about it, it’s not possible to calculate internal consistency for this variable using any of the above measures. The reason for this is that the items that contribute to two people’s recklessness scores could be completely different. One person could give incorrect answers on questions 1 to 5 (thus these questions go into calculating their score), while another person might incorrectly respond to questions 6 to 10. Thus, calculating recklessness for many individuals isn’t as simple as summing across items. Instead, we need an item pool from which to pull different combinations of questions for each person.

To overcome this sort of issue, an appropriate method for calculating internal consistency is to use a split-half reliability. This entails splitting your test items in half (e.g., into odd and even) and calculating your variable for each person with each half. For example, I typically calculate recklessness for each participant from odd items and then from even items. These scores are then correlated and adjusted using the Spearman-Brown prophecy/prediction formula (for examples, see some of my publications such as this or this). Similar to Cronbach’s alpha, a value closer to 1 and further from zero indicates greater internal consistency.

We can still calculate split-half reliability for variables that do not have this problem! So let’s do this with our extraversion data as follows:

# Calculating total score...
score_e <- rowMeans(d[, c(FALSE, TRUE)])  # with even-numbered items (E2, E4, ...)
score_o <- rowMeans(d[, c(TRUE, FALSE)])  # with odd-numbered items (E1, E3, ...)

# Correlating scores from even and odd items
r <- cor(score_e, score_o)
r
#> [1] 0.7681034

# Adjusting with the Spearman-Brown prophecy formula
(2 * r) / (1 + r)
#> [1] 0.8688445

Thus, in this case, the split-half reliability approach yields an internal consistency estimate of .87.

Composite reliability

The final method for calculating internal consistency that we’ll cover is composite reliability. Where possible, my personal preference is to use this approach. Although it’s not perfect, it takes care of many inappropriate assumptions that measures like Cronbach’s alpha make. If the specificities interest you, I suggest reading this post.

Composite reliability is based on the factor loadings in a confirmatory factor analysis (CFA). In the case of a unidimensional scale (like extraversion here), we define a one-factor CFA, and then use the factor loadings to compute our internal consistency estimate. I won’t go into the detail, but we can interpret a composite reliability score similarly to any of the other metrics covered here (closer to one indicates better internal consistency). We’ll fit our CFA model using the lavaan package as follows:

library(lavaan)

# Define the model
items <- paste(names(d), collapse = "+")
model <- paste("extraversion", items, sep = "=~")
model
#> [1] "extraversion=~E1+E2+E3+E4+E5+E6+E7+E8+E9+E10"

# Fit the model
fit <- cfa(model, data = d)

There are various ways to get to the composite reliability from this model. We’ll extract the standardized factor loadings and work with those:

sl <- standardizedSolution(fit)
sl <- sl$est.std[sl$op == "=~"]
names(sl) <- names(d)
sl  # These are the standardized factor loadings for each item
#>        E1        E2        E3        E4        E5        E6        E7 
#> 0.7266721 0.6903172 0.7154705 0.7118762 0.7841427 0.5791526 0.7589250 
#>        E8        E9       E10 
#> 0.5962855 0.6726232 0.6940120

We then obtain the composite reliability via the following:

# Compute residual variance of each item
re <- 1 - sl^2

# Compute composite reliability
sum(sl)^2 / (sum(sl)^2 + sum(re))
#> [1] 0.9029523

There you have it. The composite reliability for the extraversion factor is .90.

One appealing aspect of composite reliability is that we can calculate it for multiple factors in the same model. For example, had we included all personality items in a CFA with five factors, we could do the above calculations separately for each factor and obtain their composite reliabilities.

Just to finish off, I’ll mention that you can use the standardised factor loadings to visualise more information like we did earlier with the correlations. I’ll leave this part up to you!

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: blogR.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R Markdown: How to format tables and figures in .docx files

By Norbert Köhler

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In research, we usually publish the most important findings in tables and figures. When writing research papers using Rmarkdown (*.Rmd), we have several options to format the output of the final MS Word document (.docx).

Tables can be formatted using either the knitr package’s kable() function or several functions of the pander package.
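
For instance, a minimal chunk using kable() might look like this (any data frame will do; mtcars is just an illustration):

knitr::kable(head(mtcars[, 1:4]), caption = "First rows of mtcars")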

Figure sizes can be determined in the chunk options, e.g. {r name_of_chunk, fig.height=8, fig.width=12}.

However, options for customizing tables and figures are rather limited in Rmarkdown. Thus, I usually customize tables and figures in the final MS Word document.

In this blog post, I show how to quickly format tables and figures in the final MS Word document using a macro. MS Word macros are written in VBA (Visual Basic for Applications), can be accessed from a menu list or from the tool bar, and run by simply clicking. There are loads of tutorials explaining how to write a macro for MS Word, e.g. Usman Javaid’s Create Macros In Word 2010.

The following two macros are very helpful for formatting drafts. Since I want drafts to be as compact as possible, tables and figures should not be too space-consuming.

The first macro, called FormatTables, customizes the format of all tables in the active MS Word document. With wdTableFormatGrid2, we use a table style predefined in MS Word. A list of other table styles can be found under the following link. Furthermore, we define the font name (Arial) and font size (8 pt), as well as the space before (6 pt) and after (10 pt) the table. Finally, the row height is set to exactly 18 pt.

Sub FormatTables()

 Dim tbl As Table
    For Each tbl In ActiveDocument.Tables
         tbl.AutoFormat wdTableFormatGrid2
         tbl.Range.Font.Name = "Arial"
         tbl.Range.Font.Size = 8
         tbl.Range.ParagraphFormat.SpaceBefore = 6
         tbl.Range.ParagraphFormat.SpaceAfter = 10
         tbl.Range.Cells.SetHeight RowHeight:=18, HeightRule:=wdRowHeightExactly

    Next

End Sub

The second macro, called FormatFigures, simply reduces the size of all figures in the active MS Word document to 45% of their original size.

Sub FormatFigures()

Dim shp As InlineShape


For Each shp In ActiveDocument.InlineShapes
    shp.ScaleHeight = 45
    shp.ScaleWidth = 45
Next

End Sub

I hope you find this post useful, and if you have any questions please post a comment below. You are welcome to visit my personal blog Scripts and Statistics for more R tutorials.

    Related Post

    1. R Markdown: How to insert page breaks in a MS Word document
    2. Working on Data-Warehouse (SQL) with R
    3. Implementing Apriori Algorithm in R
    4. Handling missing data with MICE package; a simple approach
    5. Best packages for data manipulation in R
    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Extending sparklyr to Compute Cost for K-means on YARN Cluster with Spark ML Library

    By Marcin Kosiński

    (This article was first published on http://r-addict.com, and kindly contributed to R-bloggers)

    Machine and statistical learning wizards are becoming more eager to perform analysis with the Spark ML library, if this is only possible. It’s trendy, posh, spicy and gives the feeling of doing state of the art machine learning and being up to date with the newest computational trends. It is even more sexy and powerful when computations can be performed on an extraordinarily enormous computation cluster – let’s say 100 machines on a YARN hadoop cluster makes you the real data cruncher! In this post I present the sparklyr package (by RStudio), the connector that will transform you from a regular R user to the supa! data scientist that can invoke Scala code to perform machine learning algorithms on a YARN cluster just from RStudio! Moreover, I present how I have extended the interface to the K-means procedure, so that now it is also possible to compute the cost for that model, which might be beneficial in determining the number of clusters in segmentation problems. Thought about learning Scala? Leave it – use sparklyr!

    If you don’t know much about Spark yet, you can read my April post Answers to FAQ about SparkR for R users – where I explained how could we use SparkR package that is distributed with Spark. Many things (code) might have changed since that time, due to the rapid development caused by great popularity of Spark. Now we can use version 2.0.0 of Spark. If you are migrating from previous versions I suggest you should look at Migration Guide – Upgrading From SparkR 1.6.x to 2.0.

    sparklyr basics

    This package is based on the sparkapi package, which enables running Spark applications locally or on a YARN cluster directly from R. It translates R code into bash invocations of spark-shell. Its biggest advantage is the dplyr interface for working with Spark DataFrames (which might be Hive tables) and the possibility to invoke algorithms from the Spark ML library.

    Installation of sparklyr, then Spark itself, and a simple application initiation are described by this code

    library(devtools)
    install_github('rstudio/sparklyr')
    library(sparklyr)
    spark_install(version = "2.0.0")
    sc <- 
    spark_connect(master="yarn",
       config = list(
         default = list(
           spark.submit.deployMode= "client",
           spark.executor.instances= 20, 
           spark.executor.memory= "2G",
           spark.executor.cores= 4,
           spark.driver.memory= "4G")))

    One doesn’t have to specify the config at all, but if this is desired then remember that you could also specify parameters for the Spark application with config.yml files, so that you can benefit from many profiles (development, production). In version 2.0.0 it is preferred to name the master yarn instead of yarn-client and to pass the deployMode parameter, which is different from version 1.6.x. All available parameters can be found on the Running Spark on YARN documentation page.

    dplyr and DBI interface on Spark

    When connecting to YARN, it is most probable that you would like to use data tables that are stored on Hive. Remember that

    Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

    where conf/ is set as HADOOP_CONF_DIR. Read more about using Hive tables from Spark

    If everything is set up and the application runs properly, you can use the dplyr interface to provide lazy evaluation for data manipulations. Data are stored on Hive, the Spark application runs on the YARN cluster, and the code is invoked from R in the simple language of data transformations (dplyr) – everything thanks to the sparklyr team’s great job! An easy example is below

    library(dplyr)
    # give the list of tables
    src_tbls(sc) 
    # copies iris from R to Hive
    iris_tbl <- copy_to(sc, iris, "iris") 
    # create a hook for data stored on Hive
    data_tbl <- tbl(sc, "table_name") 
    data_tbl2 <- tbl(sc, sql("SELECT * from table_name"))

    You can also perform any operation on datasets used by Spark

    iris_tbl %>%
       select(Petal_Length, Petal_Width) %>%
       top_n(40, Petal_Width) %>%
       arrange(Petal_Length)

    Note that the original dots in the iris column names have been translated to _.

    This package also provides interface for functions defined in DBI package

    library(DBI)
    dbListTables(sc)
    dbGetQuery(sc, "use database_name")
    data_tbl3 <- dbGetQuery(sc, "SELECT * from table_name")
    dbListFields(sc, data_tbl3)

    Running Spark ML Machine Learning K-means Algorithm from R

    The basic example of how sparklyr invokes Scala code from Spark ML will be presented with the K-means algorithm.
    If you check the code of the sparklyr::ml_kmeans function, you will see that for an input tbl_spark object named x and a character vector containing the features’ names (features), it runs

    envir <- new.env(parent = emptyenv())
    df <- spark_dataframe(x)
    sc <- spark_connection(df)
    df <- ml_prepare_features(df, features)
    tdf <- ml_prepare_dataframe(df, features, ml.options = ml.options, envir = envir)

    sparklyr ensures that you have a proper connection to the Spark data frame and prepares the features in a convenient form and naming convention. At the end it prepares a Spark DataFrame for the Spark ML routines.

    This is done in a new environment, so that we can store the arguments for the future ML algorithm, and the model itself, in their own environment. This is a safe and clean solution. You can construct a simple model by calling a Spark ML class like this

    envir$model <- "org.apache.spark.ml.clustering.KMeans"
    kmeans <- invoke_new(sc, envir$model)

    which creates a new object of the KMeans class, on which we can invoke parameter setters to change the default parameters, like this

    model <- kmeans %>%
        invoke("setK", centers) %>%
        invoke("setMaxIter", iter.max) %>%
        invoke("setTol", tolerance) %>%
        invoke("setFeaturesCol", envir$features)
    # features were set in ml_prepare_dataframe

    For an existing object of the KMeans class, we can invoke its method called fit, which is responsible for starting the K-means clustering algorithm

    fit <- model %>%
    invoke("fit", tdf)

    which returns a new object, from which we can compute, e.g., the centers of the resulting clustering

    kmmCenters <- invoke(fit, "clusterCenters")

    or the Within Set Sum of Squared Errors (called Cost), which is my small contribution (#173)

    kmmCost <- invoke(fit, "computeCost", tdf)

    This sometimes helps to decide how many clusters we should specify for the clustering problem,

    and it is presented in the print method for the ml_model_kmeans object

    iris_tbl %>%
       select(Petal_Width, Petal_Length) %>%
       ml_kmeans(centers = 3, compute.cost = TRUE) %>%
       print()
    
    K-means clustering with 3 clusters
    
    Cluster centers:
      Petal_Width Petal_Length
    1    1.359259     4.292593
    2    2.047826     5.626087
    3    0.246000     1.462000
    
    Within Set Sum of Squared Errors =  31.41289

    All that can be better understood if we have a look at the Spark ML documentation for KMeans (be careful not to confuse it with Spark MLlib, where methods and parameters have different names than those in Spark ML). This enabled me to provide a simple update for ml_kmeans() (#179), so that we can specify the tol (tolerance) parameter in ml_kmeans() to support the tolerance of convergence.
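
    To actually use the cost for choosing the number of clusters, a hedged sketch along the following lines could work. It assumes the fitted ml_model_kmeans object exposes the computed cost as an element named cost, which is a guess; check the object returned by your version of sparklyr.

    # Hedged sketch: compute the K-means cost for several k and eyeball the "elbow".
    # The $cost accessor is an assumption about where ml_kmeans() stores the value.
    ks <- 2:8
    costs <- sapply(ks, function(k) {
      iris_tbl %>%
        select(Petal_Width, Petal_Length) %>%
        ml_kmeans(centers = k, compute.cost = TRUE) %>%
        .$cost
    })

    plot(ks, costs, type = "b",
         xlab = "Number of clusters", ylab = "Within Set Sum of Squared Errors")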

    To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Python style logging in R

    By Jonathan Callahan


    (This article was first published on R – Working With Data, and kindly contributed to R-bloggers)

    We are increasingly using R in “operational” settings that require robust error handling and logging. In this post we describe a quick-and-dirty way to get python style multi-level log files by wrapping the futile.logger package.

    Making Messy Data Pretty

    Our real world scenario involves R scripts that process raw smoke monitoring data that is updated hourly. The raw data comes from various different instruments, set up by different agencies and transmitted over at least two satellites before eventually arriving on our computers.

    Data can be missing, delayed or corrupted for a variety of reasons before it gets to us. And then our R scripts perform QC based on various columns available in the raw (aka “engineering level”) data.

    Ultimately, the cleaned up data gets displayed in a user interface here.

    No one needs to know all of the hard work that goes into cleaning and QC’ing the data that goes into this map. Our goal, after all, is to make it easier for stakeholders to make decisions without having to worry about the fussy details of raw data.

    Logging When Things Go Wrong

    However, neither the incoming data nor the R scripts we have written to process it are perfect. Things occasionally go wrong in a variety of ways and, when decision makers depend on the data being available, phone calls and text messages are next.

    Buddhist teacher Pema Chödrön’s book When Things Fall Apart has several advice categories that actually apply in this situation. We will focus on one:

    • Using painful emotions to cultivate wisdom, compassion, and courage
    • Communicating so as to encourage others to open up rather than shut down
    • Practices for reversing habitual patterns
    • Methods for working with chaotic situations
    • Ways for creating effective social action

    One of the methods for working with chaotic situations in operational software is to have lots and Lots and LOTS of logging.

    Python has the brilliant “logging” module that allows us to quickly set up separate output files for log statements at different levels:

    DEBUG, INFO, WARNING, ERROR, CRITICAL

    Inside our operational code we only need to issue single calls like:

    logger.info('Ingesting data file: %s', data_file)

    and this information will automatically be written out to both the DEBUG and the INFO log files if we have set them up.

    In our situation it makes sense to write out three log files for quick perusal by authorized personnel:

    • ~_DEBUG.log — for programmers needing to debug what went wrong
    • ~_INFO.log — for managers trying to understand the overall situation
    • ~_ERROR.log — (hopefully zero length!) for programmers and managers

    These files get written to a uniform directory every time a data processing script runs with the ‘~’ replaced by the name of the processing script.
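
    For orientation, the DEBUG_LOG, INFO_LOG and ERROR_LOG paths used in the next section might be assembled like this (the directory and script name here are hypothetical):

    # Hypothetical example of how the per-script log file paths are built
    LOG_DIR    <- "/home/logs"
    scriptName <- "process_data"

    DEBUG_LOG <- file.path(LOG_DIR, paste0(scriptName, "_DEBUG.log"))
    INFO_LOG  <- file.path(LOG_DIR, paste0(scriptName, "_INFO.log"))
    ERROR_LOG <- file.path(LOG_DIR, paste0(scriptName, "_ERROR.log"))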

    Wrapping futile.logger

    Luckily, most of the functionality we need is provided by the futile.logger package (CRAN or github). Unfortunately, this package does not currently support multiple “appenders” per “logger”, so we found ourselves initializing multiple named loggers to mimic the natural behavior of Python’s logging module:

    # Only FATAL messages go to the console
    dummy <- flog.threshold(FATAL)
    
    # Set up new log files
    if ( file.exists(DEBUG_LOG) ) dummy <- file.remove(DEBUG_LOG)
    dummy <- flog.logger("debug", DEBUG, appender.file(DEBUG_LOG))
    
    if ( file.exists(INFO_LOG) ) dummy <- file.remove(INFO_LOG)
    dummy <- flog.logger("info", INFO, appender.file(INFO_LOG))
    
    if ( file.exists(ERROR_LOG) ) dummy <- file.remove(ERROR_LOG)
    dummy <- flog.logger("error", ERROR, appender.file(ERROR_LOG))
    
    # Silence other warning messages
    options(warn=-1) # -1=ignore, 0=save/print, 1=print, 2=error

    Then, inside the code, we created logging and error handling blocks like this:

    if ( class(result)[1] == "try-error" ) {
        err_msg <- geterrmessage()
        flog.error('Error creating SpatialPointsDataFrame: %s', err_msg, name='debug')
        flog.error('Error creating SpatialPointsDataFrame: %s', err_msg, name='info')
        flog.error('Error creating SpatialPointsDataFrame: %s', err_msg, name='error')
        stop(err_msg)
      }
      ...
      flog.info('Using rgdal::writeOGR to create geojson file %s', filename, name='debug')
      flog.info('Using rgdal::writeOGR to create geojson file %s', filename, name='info')

    Yuck!

    After writing just a few lines like this we tried a different approach and came up with a simple py_logging.R script that can be used to wrap futile.logger calls and give us the python style behavior we want. Setup is a cinch:

    source('py_logging.R') # python style logging
    
    # Set up logging
    logger.setup(debugLog, infoLog, errorLog)
    
    # Silence other warning messages
    options(warn=-1) # -1=ignore, 0=save/print, 1=print, 2=error

    And logging is much more natural:

    if ( class(result)[1] == "try-error" ) {
        err_msg <- geterrmessage()
        logger.error('Error creating SpatialPointsDataFrame: %s', err_msg)
        stop(err_msg)
      }
      ...
      logger.info('Using rgdal::writeOGR to create geojson file %s', filename)

    Output from our logger looks great!

    INFO [2016-08-24 10:11:54] Running process_data.R version 0.1.5
    
    INFO [2016-08-24 10:11:54] Working on /home/data/location/datafile_1.RData
    INFO [2016-08-24 10:11:55] Found 1232 data columns
    INFO [2016-08-24 10:12:03] Working on /home/data/location/datafile_2.RData
    INFO [2016-08-24 10:12:03] Found 103 data columns
    INFO [2016-08-24 10:12:03] Working on /home/data/location/datafile_3.RData
    INFO [2016-08-24 10:12:03] Found 44 data columns
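
    For reference, a minimal sketch of what such a wrapper script might contain is shown below. The function names mirror the calls used above, but the real py_logging.R may well differ.

    # Minimal sketch of python-style wrappers around futile.logger
    # (illustrative only; the actual py_logging.R may differ)
    library(futile.logger)

    logger.setup <- function(debugLog, infoLog, errorLog) {
      # Console only shows FATAL messages
      invisible(flog.threshold(FATAL))
      # One named logger per level, each with its own file appender
      invisible(flog.logger("debug", DEBUG, appender.file(debugLog)))
      invisible(flog.logger("info", INFO, appender.file(infoLog)))
      invisible(flog.logger("error", ERROR, appender.file(errorLog)))
    }

    logger.debug <- function(msg, ...) {
      flog.debug(msg, ..., name = "debug")
    }

    logger.info <- function(msg, ...) {
      flog.info(msg, ..., name = "debug")
      flog.info(msg, ..., name = "info")
    }

    logger.error <- function(msg, ...) {
      flog.error(msg, ..., name = "debug")
      flog.error(msg, ..., name = "info")
      flog.error(msg, ..., name = "error")
    }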

    At the end of the day, we can only hope we have succeeded at one or another of:

    • Using painful emotions to cultivate wisdom, compassion, and courage
    • Communicating so as to encourage others to open up rather than shut down

    Best wishes for excellent logging!

    To leave a comment for the author, please follow the link and comment on their blog: R – Working With Data.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Using MANOVA to Analyse a Banking Crisis Exercises

    By Miodrag Sljukic


    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    In this set of exercises we will practice multivariate analysis of variance – MANOVA.

    We shall try to find out if there is a difference in the combination of export and bank reserves depending on the status of the banking sector (whether or not there is a crisis). The data set is fictitious and serves for education purposes only. It consists of the variable crisis, a factor indicating whether or not a banking crisis exists, and the variables export and reserves, in billions of currency units. You can download it here.

    In this set of exercises we use two packages: MVN and heplots. If you haven’t already installed them, do it using the following code:


    install.packages(c("MVN", "heplots"))

    and load them into the session using the following code:


    library("MVN")
    library("heplots")

    before proceeding.

    Answers to the exercises are available here.

    If you have a different solution, feel free to post it.
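
    As a reminder of the core tool, the MANOVA itself can be fitted with base R’s manova() function. A hedged sketch follows; the file name and data frame name are assumptions, so adapt them to the downloaded data.

    # Hedged sketch of the basic MANOVA call (file and object names are assumed)
    banking <- read.csv("banking_crisis.csv")

    fit <- manova(cbind(export, reserves) ~ crisis, data = banking)
    summary(fit, test = "Pillai")  # multivariate test of the crisis effect
    summary.aov(fit)               # follow-up univariate ANOVAs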

    Exercise 1

    Is the sample size large enough for conducting MANOVA? (Tip: You should have at least 2 cases for each cell.)

    1. Yes
    2. No

    Exercise 2

    Are there univariate and multivariate outliers?

    1. There are univariate, but not multivariate outliers
    2. There doesn’t exist a univariate outlier, but there are multivariate outliers
    3. There exist both univariate and multivariate outliers

    Exercise 3

    How do you estimate univariate and multivariate normality of dependent variables?

    1. Both variables are univariate normal, but they are not multivariate normally distributed
    2. None of the variables is univariate normal, and hence there doesn’t exist multivariate normality
    3. Both variables are univariate normal and the data is multivariate normally distributed

    Exercise 4

    Using the matrix of scatter plots, check for the linearity between dependent variables export and reserves for each category of independent variable.

    Exercise 5

    Calculate the correlation between dependent variables export and reserves. Is it appropriate to justify conducting MANOVA?

    1. Yes
    2. No

    Exercise 6

    Is there equality of covariances of the dependent variables export and reserves across the groups? (Tip: You should perform Box’s M test of equality of covariance matrices.)

    1. Yes
    2. No

    Exercise 7

    Is there equality of variances of the dependent variables export and reserves across groups? (Tip: Use Levene’s test of error variances.)

    1. Yes
    2. No

    Exercise 8

    At the 0.05 level of significance, is there an effect of the banking crisis on the combination of export and bank reserves?

    1. Yes
    2. No

    Exercise 9

    How much of the variance in the dependent variables export and reserves is explained by banking crisis?

    Exercise 10

    Does export differ when the banking sector is in crisis compared to when it is not? What about reserves?

    1. Only export differ
    2. Only reserves differ
    3. Both export and reserves differ
    4. None of them differ
    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    stringr 1.1.0

    By hadleywickham

    (This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

    I’m pleased to announce version 1.1.0 of stringr. stringr makes string manipulation easier by using consistent function and argument names, and eliminating options that you don’t need 95% of the time. To get started with stringr, check out the strings chapter in R for data science. Install it with:

    install.packages("stringr")

    This release is mostly bug fixes, but there are a couple of new features you might care about.

    • There are three new datasets, fruit, words and sentences, to help you practice your regular expression skills:

      str_subset(fruit, "(..)\\1")
      #> [1] "banana"      "coconut"     "cucumber"    "jujube"      "papaya"     
      #> [6] "salal berry"
      head(words)
      #> [1] "a"        "able"     "about"    "absolute" "accept"   "account"
      sentences[1]
      #> [1] "The birch canoe slid on the smooth planks."
    • More functions work with boundary(): str_detect() and str_subset() can detect boundaries, and str_extract() and str_extract_all() pull out the components between boundaries. This is particularly useful if you want to extract logical constructs like words or sentences.
      x <- "This is harder than you might expect, e.g. punctuation!"
      x %>% str_extract_all(boundary("word")) %>% .[[1]]
      #> [1] "This"        "is"          "harder"      "than"        "you"        
      #> [6] "might"       "expect"      "e.g"         "punctuation"
      x %>% str_extract(boundary("sentence"))
      #> [1] "This is harder than you might expect, e.g. punctuation!"
    • str_view() and str_view_all() create HTML widgets that display regular expression matches. This is particularly useful for teaching.
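      For example, a quick illustrative call (the output renders as an HTML widget in the RStudio viewer) highlights the doubled pair of letters in each string:

      str_view(c("banana", "coconut", "papaya"), "(..)\\1")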

    For a complete list of changes, please see the release notes.

    To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News