Track changes in data with the lumberjack %>>%

By mark

(This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

So you are using a pipeline to have your data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R:

> data(retailers, package="validate")
> head(retailers, 3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA        NA      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607        NA      1607         131        1544     63  NA
3  sc3      0.14    NA     6886       -33      6919         324        6493    426  NA

This data is dirty with missings and full of errors. Let us do some imputations with simputation.

> out <- retailers %>%
+   impute_lm(other.rev ~ turnover) %>%
+   impute_median(other.rev ~ size)
> 
> head(out,3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA  6114.775      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607  5427.113      1607         131        1544     63  NA
3  sc3      0.14    NA     6886   -33.000      6919         324        6493    426  NA
> 

Ok, cool, we know all that. But what if you’d like to know what value was imputed with which method? That’s where the lumberjack comes in.

The lumberjack operator is a ‘pipe’[1] operator that allows you to track changes in data.

> library(lumberjack)
> retailers$id <- seq_len(nrow(retailers))
> out <- retailers %>>%
+   start_log(log=cellwise$new(key="id")) %>>%
+   impute_lm(other.rev ~ turnover) %>>%
+   impute_median(other.rev ~ size) %>>%
+   dump_log(stop=TRUE)
Dumped a log at cellwise.csv
> 
> read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
  step                     time                      expression key  variable old      new
1    2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size)   1 other.rev  NA 6114.775
2    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   2 other.rev  NA 5427.113
3    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   6 other.rev  NA 6341.683
> 

So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() function calls to the data pipeline. (To be sure: it works with any function, not only with simputation.) The package is on CRAN now; please see the introductory vignette for more examples and ways to customize it.

There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.
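For a flavour of what a custom logger might look like, here is a minimal sketch. It is an illustration only: the class name, the fields it records and the output file are made up, and the add()/dump() methods follow the pattern described in the extending-lumberjack vignette, which remains the authoritative reference.

library(R6)

# Hypothetical logger that only records how many rows go in and out of each step.
rowcount <- R6Class("rowcount",
  public = list(
    log = NULL,
    initialize = function() {
      self$log <- data.frame(step = integer(0), rows_in = integer(0), rows_out = integer(0))
    },
    add = function(meta, input, output) {
      # called after every step in the pipeline
      self$log <- rbind(self$log, data.frame(step = nrow(self$log) + 1L,
                                             rows_in = nrow(input),
                                             rows_out = nrow(output)))
    },
    dump = function(...) {
      write.csv(self$log, "rowcount.csv", row.names = FALSE)
    }
  )
)

With something like this in place, usage would mirror the cellwise example above, e.g. start_log(log = rowcount$new()).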

If this post got you interested, please install the package using

install.packages('lumberjack')

You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.

As always, I am open to suggestions and comments, for example through the package’s GitHub page.

Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.

And finally, here’s a picture of a lumberjack smoking a pipe.

[1] It really should be called a function composition operator, but potetoes/potatoes.

To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.



The R community is one of R’s best features

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a community to be successful. And that’s an area where R really shines, as Shannon Ellis explains in this lovely ROpenSci blog post. For software, a thriving community offers developers, expertise, collaborators, writers and documentation, testers, agitators (to keep the community and software on track!), and so much more. Shannon provides links where you can find all of this in the R community:

  • #rstats hashtag — a responsive, welcoming, and inclusive community of R users to interact with on Twitter
  • R-Ladies — a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
  • Local R meetup groups — a google search may show that there’s one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable
  • Rweekly — an incredible weekly recap of all things R
  • R-bloggers — an awesome resource to find posts from many different bloggers using R
  • DataCarpentry and Software Carpentry — a resource of openly available lessons that promote and model reproducible research
  • Stack Overflow — chances are your R question has already been answered here (with additional resources for people looking for jobs)

I’ll add a couple of others as well:

  • R Conferences — The annual useR! conference is the major community event of the year, but there are many smaller community-led events on various topics.
  • Github — there’s a fantastic community of R developers on Github. There’s no directory, but the list of trending R developers is a good place to start.
  • The R Consortium — proposing or getting involved with an R Consortium project is a great way to get involved with the community

As I’ve said before, the R community is one of the greatest assets of R, and is an essential component of what makes R useful, easy, and fun to use. And you couldn’t find a nicer and more welcoming group of people to be a part of.

To learn more about the R community, be sure to check out Shannon’s blog post linked below.

ROpenSci Blog: Hey! You there! You are welcome here

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.



Logarithmic Scale Explained with U.S. Trade Balance

By Gregory Kanevsky

(This article was first published on novyden, and kindly contributed to R-bloggers)
Skewed data prevail in real life. Unless you observe trivial or near-constant processes, data is skewed one way or another due to outliers, long tails, errors or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest.
Consider the U.S. 2016 merchandise trade partner balances data set, where each point is a country with two features: U.S. imports from and exports to it:
Suppose we decided to visualize the top 30 U.S. trading partners using a bubble chart, which is simply a 2D scatter plot with a third dimension expressed through point size. Then U.S. trade partners become disks with imports and exports for x-y coordinates and trade balance (abs(export – import)) for size:
China, Canada, and Mexico run far larger balances than the other 27 countries, which causes most data points to collapse into a crowded lower-left corner. One way to “solve” this problem is to eliminate the 3 outliers from the picture:

While this plot does look better, it no longer serves its original purpose of displaying all top trading partners. And the undesirable effect of outliers, though reduced, still presents itself with new ones: Japan, Germany, and the U.K. So let us bring all countries back into the mix by trying a logarithmic scale.
A quick refresher from algebra. The log function (in this example log base 10, but the same applies to the natural log or log base 2) is commonly used to transform positive real numbers, because of its property of mapping multiplicative relationships into additive ones. Indeed, given numbers A, B, and C such that

`A*B=C and A,B,C > 0`

applying log results in additive relationship:

`log(A) + log(B) = log(C)`
For example, let A=100, B=1000, and C=100000 then

`100 * 1000 = 100000`

so that after transformation it becomes

`log(100) + log(1000) = log(100000)` or `2 + 3 = 5`
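The same identity is easy to check directly in R:

> log10(100) + log10(1000)
[1] 5
> log10(100 * 1000)
[1] 5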

Observe this on 1D plane:

A logarithmic scale is simply a log transformation applied to all of a feature’s values before plotting them. In our example we used it on both of the trading partners’ features – imports and exports – which gives the bubble chart a new look:

The same data displayed on a logarithmic scale appear almost uniform, but do not forget that the farther points are from 0, the more orders of magnitude apart they are on the actual scale (observe this by scrolling back to the original plot). The main advantage of using a log scale in this plot is the ability to observe relationships between all top 30 countries without losing the whole picture and without collapsing the smaller points together.
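In ggplot2 the switch is one line per axis. The sketch below assumes a hypothetical data frame trade with columns imports, exports and balance (the post’s actual data set is not reproduced here), so treat it as an illustration of scale_x_log10()/scale_y_log10() rather than the exact code behind the plots above.

library(ggplot2)

# 'trade' (columns: imports, exports, balance) is a placeholder data frame
ggplot(trade, aes(x = imports, y = exports, size = balance)) +
  geom_point(alpha = 0.5) +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma) +
  labs(title = "Top 30 U.S. trading partners, 2016",
       x = "U.S. imports (log scale)", y = "U.S. exports (log scale)")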
For a more detailed discussion of logarithmic scales, refer to When Should I Use Logarithmic Scales in My Charts and Graphs? Oh, and how about that trade deficit with China?
This is a re-post from the original blog on LinkedIn.
To leave a comment for the author, please follow the link and comment on their blog: novyden.



State-space modelling of the Australian 2007 federal election

By Peter's stats stuff – R

(This article was first published on Peter’s stats stuff – R, and kindly contributed to R-bloggers)

Pooling the polls with Bayesian statistics

In an important 2005 article in the Australian Journal of Political Science, Simon Jackman set out a statistically-based approach to pooling polls in an election campaign. He describes the sensible intuitive approach of modelling a latent, unobserved voting intention (unobserved except on the day of the actual election) and treats each poll as a random observation based on that latent state space. Uncertainty associated with each measurement comes from sample size and bias coming from the average effect of the firm conducting the poll, as well as of course uncertainty about the state of the unobserved voting intention. This approach allows house effects and the latent state space to be estimated simultaneously, quantifies the uncertainty associated with both, and in general gives a much more satisfying method of pooling polls than any kind of weighted average.

Jackman gives a worked example of the approach in his excellent book Bayesian Analysis for the Social Sciences, using voting intention for the Australian Labor Party (ALP) in the 2007 Australian federal election for data. He provides JAGS code for fitting the model, but notes that with over 1,000 parameters to estimate (most of those parameters are the estimated voting intention for each day between the 2004 and 2007 elections) it is painfully slow to fit in general purpose MCMC-based Bayesian tools such as WinBUGS or JAGS – several days of CPU time on a fast computer in 2009. Jackman estimated his model with Gibbs sampling implemented directly in R.

Down the track, I want to implement Jackman’s method of polling aggregation myself, to estimate latent voting intention for New Zealand to provide an alternative method for my election forecasts. I set myself the familiarisation task of reproducing his results for the Australian 2007 election. New Zealand’s elections are a little complex to model because of the multiple parties in the proportional representation system, so I wanted to use a general Bayesian tool for the purpose to simplify my model specification when I came to it. I use Stan because its Hamiltonian Monte Carlo method of exploring the parameter space works well when there are many parameters – as in this case, with well over 1,000 parameters to estimate.

Stan describes itself as “a state-of-the-art platform for statistical modeling and high-performance statistical computation. Thousands of users rely on Stan for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.” It lets the programmer specify a complex statistical model, and given a set of data will return a range of parameter estimates that were most likely to produce the observed data. Stan isn’t something you use as an end-to-end workbench – it’s assumed that data manipulation and presentation is done with another tool such as R, Matlab or Python. Stan focuses on doing one thing well – using Hamiltonian Monte Carlo to estimate complex statistical models, potentially with many thousands of hierarchical parameters, with arbitrarily set prior distributions.

Caveat! – I’m fairly new to Stan and I’m pretty sure my Stan programs that follow aren’t best practice, even though I am confident they work. Use at your own risk!

Basic approach – estimated voting intention in the absence of polls

I approached the problem in stages, gradually making my model more realistic. First, I set myself the task of modelling latent first-preference support for the ALP in the absence of polling data. If all we had were the 2004 and 2007 election results, where might we have thought ALP support went between those two points? Here are my results:

For this first analysis, I specified that support for the ALP had to be a random walk that changed by a normally distributed variable with standard deviation of 0.25 percentage points for each daily change. Why 0.25? Just because Jim Savage used it in his rough application of this approach to the US Presidential election in 2016. I’ll be relaxing this assumption later.

Here’s the R code that sets up the session, brings in the data from Jackman’s pscl R package, and defines a graphics function that I’ll be using for each model I create.

library(tidyverse)
library(scales)
library(pscl)
library(forcats)
library(rstan)

rstan_options(auto_write = TRUE)
options(mc.cores = 7)

#=========2004 election to 2007 election==============
data(AustralianElectionPolling)
data(AustralianElections)

days_between_elections <- as.integer(diff(as.Date(c("2004-10-09", "2007-11-24")))) + 1

#' Function to plot time series extracted from a stan fit of latent state space model of 2007 Australian election
plot_results <- function(stan_m){
   if(class(stan_m) != "stanfit"){
      stop("stan_m must be an object of class stanfit, with parameters mu representing latent vote at a point in time")
   }
   ex <- as.data.frame(rstan::extract(stan_m, "mu"))
   names(ex) <- 1:d1$n_days
   
   p <- ex %>%
      gather(day, value) %>%
      mutate(day = as.numeric(day),
             day = as.Date(day, origin = "2004-10-08")) %>%
      group_by(day) %>%
      summarise(mean = mean(value),
                upper = quantile(value, 0.975),
                lower = quantile(value, 0.025)) %>%
      ggplot(aes(x = day)) +
      labs(x = "Shaded region shows a pointwise 95% credible interval.", 
           y = "Voting intention for the ALP (%)",
           caption = "Source: Jackman's pscl R package; analysis at https://ellisp.github.io") +
      geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
      geom_line(aes(y = mean)) +
      scale_y_continuous(breaks = 31:54, sec.axis = dup_axis(name = "")) +
      theme(panel.grid.minor = element_blank())
   
   return(p)
}

Here’s the Stan program that specifies this super simple model of changing ALP support from 2004 to 2007:

// oz-polls-1.stan

data {
  int n_days;           // number of days
  real mu_start;                 // value at starting election
  real mu_finish;                // value at final election
}
parameters {
  real mu[n_days];               // underlying state of vote intention
}

model {
  
  // state model
  mu[1] ~ normal(mu_start, 0.01);
  for (i in 2:n_days) 
      mu[i] ~ normal(mu[i - 1], 0.25);
      
  // measurement model
  // 1. Election result
  mu[n_days] ~ normal(mu_finish, 0.01);
  
}

And here’s the R code that calls that Stan program and draws the resulting summary graphic. Stan works by compiling a program in C++ that is based on the statistical model specified in the *.stan file. Then the C++ program zooms around the high-dimensional parameter space, moving slower around the combinations of parameters that seem more likely given the data and the specified prior distributions. It can use multiple processors on your machine and works super fast given the complexity of what it’s doing.

#----------------no polls inbetween the elections------------
d1 <- list(mu_start = 37.64, mu_finish = 43.38, n_days = days_between_elections)

# returns some warnings first time it compiles; see
# http://mc-stan.org/misc/warnings.html suggests most compiler
# warnings can be just ignored.
system.time({
  stan_mod1 <- stan(file = 'oz-polls-1.stan', data = d1,
  control = list(max_treedepth = 20))
  }) # 1800 seconds

plot_results(stan_mod1) +
   ggtitle("Voting intention for the ALP between the 2004 and 2007 Australian elections",
           "Latent variable estimated with no use of polling data")

Adding in one polling firm

Next I wanted to add a single polling firm. I chose Nielsen’s 42 polls because Jackman found they had a fairly low bias, which removed one complication for me as I built up my familiarity with the approach. Here’s the result:

That model was specified in Stan as set out below. The Stan program is more complex now; I’ve had to specify how many polls I have (y_n), the values for each poll (y_values), and the days since the last election each poll was taken (y_days). This way I only have to specify 42 measurement errors as part of the probability model – other implementations I’ve seen of this approach ask for an estimate of measurement error for each poll on each day, treating the days with no polls as missing values to be estimated. That obviously adds a huge computational load I wanted to avoid.

In this program, I haven’t yet added in the notion of a house effect for Nielsen. Each measurement Nielsen made is assumed to have been an unbiased one. Again, I’ll be relaxing this later. The state model is also the same as before ie standard deviation of the day to day innovations is still hard coded as 0.25 percentage points.

// oz-polls-2.stan

data {
  int n_days;            // number of days
  real mu_start;                  // value at starting election
  real mu_finish;                 // value at final election
  int y_n;               // number of polls
  real y_values[y_n];             // actual values in polls
  int y_days[y_n];       // the number of days since starting election each poll was taken
  real y_se[y_n];
}
parameters {
  real mu[n_days];               // underlying state of vote intention
}

model {
  
  // state model
  mu[1] ~ normal(mu_start, 0.01);
  for (i in 2:n_days) 
      mu[i] ~ normal(mu[i - 1], 0.25);
      
  // measurement model
  // 1. Election result
  mu[n_days] ~ normal(mu_finish, 0.01);
  
  // 2. Polls
  for(t in 1:y_n)
      y_values[t] ~ normal(mu[y_days[t]], y_se[t]);
  
}

Here’s the R code to prepare the data and pass it to Stan. Interestingly, fitting this model is noticeably faster than the one with no polling data at all. My intuition for this is that now that the state space is constrained to be reasonably close to some actually observed measurements, it is easier for Stan to work out which regions of the parameter space are worth exploring.

#--------------------AC Nielson-------------------
ac <- AustralianElectionPolling %>%
  filter(org == "Nielsen") %>%
  mutate(MidDate = startDate + (endDate - startDate) / 2,
         MidDateNum = as.integer(MidDate - as.Date("2004-10-08")),  # ie number of days since first election; last election (9 October 2004) is day 1
         p = ALP / 100,
         se_alp = sqrt(p * (1- p) / sampleSize) * 100)

d2 <- list(
  mu_start = 37.64,
  mu_finish = 43.38,
  n_days = days_between_elections,
  y_values = ac$ALP,
  y_days = ac$MidDateNum,
  y_n = nrow(ac),
  y_se = ac$se_alp
)

system.time({
  stan_mod2 <- stan(file = 'oz-polls-2.stan', data = d2,
                   control = list(max_treedepth = 20))
}) # 510 seconds

plot_results(stan_mod2) +
   geom_point(data = ac, aes(x = MidDate, y = ALP)) +
   ggtitle("Voting intention for the ALP between the 2004 and 2007 Australian elections",
           "Latent variable estimated with use of just one firm's polling data (Nielsen)")

Including all five polling houses

Finally, the complete model replicating Jackman’s work:

As well as adding the other four sets of polls, I’ve introduced five house effects that need to be estimated (ie the bias for each polling firm/mode); and I’ve told Stan to estimate the standard deviation of the day-to-day innovations in the latent support for ALP rather than hard-coding it as 0.25. Jackman specified a uniform prior on [0, 1] for that parameter, but I found this led to lots of estimation problems for Stan. The Stan developers give some great practical advice on this sort of issue and I adapted some of that to specify the prior distribution for the standard deviation of day to day innovation as N(0.5, 0.5), constrained to be positive.

Here’s the Stan program:

// oz-polls-3.stan

data {
  int n_days;            // number of days
  real mu_start;                  // value at starting election
  real mu_finish;                 // value at final election
  
  // change the below into 5 matrixes with 3 columns each for values, days, standard error
  int y1_n;                     // number of polls
  int y2_n;
  int y3_n;
  int y4_n;
  int y5_n;
  real y1_values[y1_n];       // actual values in polls
  real y2_values[y2_n];       
  real y3_values[y3_n];       
  real y4_values[y4_n];       
  real y5_values[y5_n];       
  int y1_days[y1_n];          // the number of days since starting election each poll was taken
  int y2_days[y2_n]; 
  int y3_days[y3_n]; 
  int y4_days[y4_n]; 
  int y5_days[y5_n]; 
  real y1_se[y1_n];             // the standard errors of the polls
  real y2_se[y2_n];           
  real y3_se[y3_n];           
  real y4_se[y4_n];           
  real y5_se[y5_n];           
}
parameters {
  real mu[n_days];               // underlying state of vote intention
  real d[5];                                        // polling effects
  real<lower=0> sigma;           // sd of innovations
}

model {
  
  // state model
  mu[1] ~ normal(mu_start, 0.01); // starting point
  
  // Jackman used uniform(0, 1) for sigma, but this seems to be the cause
  // of a lot of problems with the estimation process.
  // https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
  // recommends not using a uniform, but constraining sigma to be positive
  // and using an open ended prior instead.  So:
  sigma ~ normal(0.5, 0.5);              // prior for innovation sd.  
  
  for (i in 2:n_days) 
      mu[i] ~ normal(mu[i - 1], sigma);
      
  // measurement model
  // 1. Election result
  mu[n_days] ~ normal(mu_finish, 0.01);
  
  // 2. Polls
  d ~ normal(0, 7.5); // ie a fairly loose prior for house effects
  
  for(t in 1:y1_n)
      y1_values[t] ~ normal(mu[y1_days[t]] + d[1], y1_se[t]);
  for(t in 1:y2_n)
      y2_values[t] ~ normal(mu[y2_days[t]] + d[2], y2_se[t]);
  for(t in 1:y3_n)
      y3_values[t] ~ normal(mu[y3_days[t]] + d[3], y3_se[t]);
  for(t in 1:y4_n)
      y4_values[t] ~ normal(mu[y4_days[t]] + d[4], y4_se[t]);
  for(t in 1:y5_n)
      y5_values[t] ~ normal(mu[y5_days[t]] + d[5], y5_se[t]);
}

Building the fact that there are 5 polling firms (or firm-mode combinations, as Morgan is in there twice) directly into the program must be bad practice, but seeing as there are different numbers of polls taken by each firm, and on different days, I couldn’t work out a better way to do it. Stan doesn’t support ragged arrays, or objects like R’s lists, or (I think) convenient subsetting of tables, which would be the three ways I’d normally try to do that in another language. So I settled for the approach above, even though it has some ugly bits of repetition.

Here’s the R code that sorts the data and passes it to Stan

#-------------------all 5 polls--------------------
all_polls <- AustralianElectionPolling %>%
  mutate(MidDate = startDate + (endDate - startDate) / 2,
         MidDateNum = as.integer(MidDate - as.Date("2004-10-08")),  # ie number of days since starting election
         p = ALP / 100,
         se_alp = sqrt(p * (1- p) / sampleSize) * 100,
         org = fct_reorder(org, ALP))


poll_orgs <- as.character(unique(all_polls$org))

p1 <- filter(all_polls, org == poll_orgs[[1]])
p2 <- filter(all_polls, org == poll_orgs[[2]])
p3 <- filter(all_polls, org == poll_orgs[[3]])
p4 <- filter(all_polls, org == poll_orgs[[4]])
p5 <- filter(all_polls, org == poll_orgs[[5]])


d3 <- list(
  mu_start = 37.64,
  mu_finish = 43.38,
  n_days = days_between_elections,
  y1_values = p1$ALP,
  y1_days = p1$MidDateNum,
  y1_n = nrow(p1),
  y1_se = p1$se_alp,
  y2_values = p2$ALP,
  y2_days = p2$MidDateNum,
  y2_n = nrow(p2),
  y2_se = p2$se_alp,
  y3_values = p3$ALP,
  y3_days = p3$MidDateNum,
  y3_n = nrow(p3),
  y3_se = p3$se_alp,
  y4_values = p4$ALP,
  y4_days = p4$MidDateNum,
  y4_n = nrow(p4),
  y4_se = p4$se_alp,
  y5_values = p5$ALP,
  y5_days = p5$MidDateNum,
  y5_n = nrow(p5),
  y5_se = p5$se_alp
)


system.time({
  stan_mod3 <- stan(file = 'oz-polls-3.stan', data = d3,
                    control = list(max_treedepth = 15,
                                   adapt_delta = 0.8),
                    iter = 4000)
}) # about 600 seconds

plot_results(stan_mod3) +
   geom_point(data = all_polls, aes(x = MidDate, y = ALP, colour = org), size = 2) +
   geom_line(aes(y = mean)) +
   labs(colour = "") +
   ggtitle("Voting intention for the ALP between the 2004 and 2007 Australian elections",
           "Latent variable estimated with use of all major firms' polling data")

Estimates of polling house effects

Here’s the house effects estimated by me with Stan, compared to those in Jackman’s 2009 book:

Basically we got the same results – certainly close enough anyway. Jackman writes:

“The largest effect is for the face-to-face polls conducted by Morgan; the point estimate of the house effect is 2.7 percentage points, which is very large relative to the classical sampling error accompanying these polls.”

Interestingly, Morgan’s phone polls did much better.

Here’s the code that did that comparison:

house_effects <- summary(stan_mod3, pars = "d")$summary %>%
  as.data.frame() %>%
  round(2) %>%
  mutate(org = poll_orgs,
         source = "Stan") %>%
  dplyr::select(org, mean, `2.5%`, `97.5%`, source)

jackman <- data_frame(
   org = c("Galaxy", "Morgan, F2F", "Newspoll", "Nielsen", "Morgan, Phone"),
   mean = c(-1.2, 2.7, 1.2, 0.9, 0.8),
   `2.5%` = c(-3.1, 1, -0.5, -0.8, -1),
   `97.5%` = c(0.6, 4.3, 2.8, 2.5, 2.3),
   source = "Jackman"
)

d <- rbind(house_effects, jackman) %>%
   mutate(org = fct_reorder(org, mean),
          ypos = as.numeric(org) + 0.1 - 0.2 * (source == "Stan")) 

d %>%
   ggplot(aes(y = ypos, colour = source)) +
   geom_segment(aes(yend = ypos, x = `2.5%`, xend = `97.5%`)) +
   geom_point(aes(x = mean)) +
   scale_y_continuous(breaks = 1:5, labels = levels(d$org),
                      minor_breaks = NULL) +
   theme(panel.grid.major.y = element_blank(),
         legend.position = "right") +
   labs(x = "House effect for polling firms, 95% credibility intervalsn(percentage points over-estimate of ALP vote)",
        y = "",
        colour = "",
        caption = "Source: Jackman's pscl R package; analysis at https://ellisp.github.io")  +
   ggtitle("Polling 'house effects' in the leadup to the 2007 Australian election",
           "Basically the same results in new analysis with Stan as in the original Jackman (2009)")

So there we go – state space modelling of voting intention, with variable house effects, in the Australian 2007 federal election.

To leave a comment for the author, please follow the link and comment on their blog: Peter’s stats stuff – R.



Hey! You there! You are welcome here

By Shannon E. Ellis

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

What’s that? You’ve heard of R? You use R? You develop in R? You know someone else who’s mentioned R? Oh, you’re breathing? Well, in that case, welcome! Come join the R community!

We recently had a group discussion at rOpenSci‘s #runconf17 in Los Angeles, CA about the R community. I initially opened the issue on GitHub. After this issue was well-received (check out the emoji-love below!), we realized people were keen to talk about this and decided to have an optional and informal discussion in person.

To get the discussion started I posed two general questions and then just let discussion fly. I prompted the group with the following:

  1. The R community is such an asset. How do we make sure that everyone knows about it and feels both welcome and comfortable?
  2. What are other languages/communities doing that we’re not? How could we adopt their good ideas?

The discussion focused primarily on the first point, and I have to say the group’s answers…were awesome. Take a look!

Photo (c) Nistara Randawa

How to find the community

Everyone seemed to be in agreement that (1) the community is one of R’s biggest strengths and (2) a lot within the R community happens on twitter. During discussion, Julia Lowndes mentioned she joined twitter because she heard that people asked and answered questions about R there, and others echoed this sentiment. Simply, the R community is not just for ‘power users’ or developers. It’s a place for users and people interested in learning more about R. So, if you want to get involved in the community and you are not already, consider getting a twitter account and check out the #rstats hashtag. We expect you’ll be surprised by how responsive, welcoming, and inclusive the community is.

In addition to twitter, there are many resources available within the R community where you can learn more about all things R. Below is a brief list of resources mentioned during our discussion that had helped us feel more included in the community. Feel free to suggest more!

  • R-Ladies – a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
  • Local R meetup groups – a google search may show that there’s one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable
  • Rweekly – an incredible weekly recap of all things R
  • R-bloggers – an awesome resource to find posts from many different bloggers using R
  • DataCarpentry and Software Carpentry – a resource of openly available lessons that promote and model reproducible research
  • Stack Overflow – chances are your R question has already been answered here (with additional resources for people looking for jobs)

Improving inclusivity

No community is perfect, and being willing to consider our shortcomings and think about ways in which we can improve is so important. The group came up with a lot of great suggestions, including many I had not previously thought of personally.

Alice Daish did a great job capturing the conversation and allowing for more discussion online:

Join the lunchtime #runconf17 discussion about the #rstats communities – what do we need to do to improve? pic.twitter.com/ztbXxNfqU7

— Alice Data (@alice_data) May 26, 2017

To summarize here:

  • Take the time to welcome new people. A simple hello can go a long way!
  • Reach out to people we may be missing: high school students, people of different backgrounds, individuals in other countries, etc.
  • Avoid the word “just” when helping others. “Here’s one way of thinking about that” >> “Just do it this way”
  • Include people whose primary language is not English in the conversation! Consider tweeting & retweeting in your own language to extend the community. This helps include others and spread knowledge!
  • Be involved in open projects. If you chose to turn down an opportunity that is not open, do your best to explain why being involved in open projects is important to you.
  • David Smith recently suggested getting #rbeginners to take off as a hashtag – a great way to direct newer members’ attention to tips and resources!
  • Be conscious of your tone. When in doubt, check out tone checker.
  • If you see someone being belittling in their answers, consider reaching out to the person who is behaving inappropriately. There was some agreement that reaching out privately may be more effective as a first approach than calling them out in public. Strong arguments against that strategy and in favor of a public response from Oliver Keyes can be found here.
  • Also, it’s often easier to defend on behalf of someone else than it is on one’s own behalf. Keep that in mind if you see negative things happening, and consider defending on someone else’s behalf.
  • Having a code of conduct is important. rOpenSci has one, and we like it a whole lot.

And, when times get tough, look to your community. Get out there. Be active. Communicate with one another. Tim Phan brilliantly summarized the importance of action and community in this thread:

Dear #runconf17, bye for now!Thanks to the organizers for all you do. Here’s an incoming tweet storm on R, community, and open science 1/6 pic.twitter.com/7DpkceOUC8

— Timothy Phan (@timothy_phan) May 26, 2017

Thank you

Thank you to all who participated in this conversation and all who contribute to the community to make R such a fun language in which to work and develop! Thank you to rOpenSci for hosting and giving us all the opportunity to get to know one another and work together. I’m excited to see where this community goes moving forward!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.



May New Package Picks

By R Views

(This article was first published on R Views, and kindly contributed to R-bloggers)



Two hundred and twenty-nine new packages were submitted to CRAN in May. Here are my picks for the “Top 40”, organized into seven categories: Data, Data Science and Machine Learning, Education, Miscellaneous, Statistics, Utilities, and Visualizations.

Data

angstroms v0.0.1: Provides helper functions for working with Regional Ocean Modeling System (ROMS) output.

bikedata v0.0.1: Download and aggregate data from public bicycle systems from around the world. There is a vignette.

datasauRus v0.1.2: The Datasaurus Dozen is a set of datasets that have the same summary statistics, despite having radically different distributions. As well as being an engaging variant on Anscombe’s Quartet, the data is generated in a novel way through a simulated annealing process. Look here for details, and in the vignette for examples.
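A quick way to see the point, assuming only that the package exposes the datasaurus_dozen data frame with dataset, x and y columns:

library(datasauRus)
library(dplyr)

# near-identical summary statistics across very different point clouds
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))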

dwapi v0.1.1: Provides a set of wrapper functions for data.world’s REST API. There is a quickstart guide.

HURDAT v0.1.0: Provides datasets from the Hurricane Research Division’s Hurricane Re-Analysis Project, giving details for most known hurricanes and tropical storms for the Atlantic and northeastern Pacific ocean (northwestern hemisphere). The vignette describes the datasets.

neurohcp v0.6: Implements an interface to the Human Connectome Project. The vignette shows how it works.

osmdata v0.0.3: Provides functions to download and import of OpenStreetMap data as ‘sf’ or ‘sp’ objects. There is an Introduction and a vignette describing Translation to Simple Features.

parlitools v0.0.4: Provides various tools for analyzing UK political data, including creating political cartograms and retrieving data. There is an Introduction, and vignettes on the British Election Study, Mapping Local Authorities, and Using Cartograms.

rerddap v0.4.2: Implements an R client to NOAA’s ERDDAP data servers. There is an Introduction.

soilcarbon v1.0.0: Provides tools for analyzing the Soil Carbon Database created by Powell Center Working Group. The vignette launches a local Shiny App.

suncalc v0.1: Implements an R interface to the ‘suncalc.js’ library, part of the SunCalc.net project, for calculating sun position, sunlight phases, moon position and lunar phase for a given location and time.

Data Science and Machine Learning

EventStudy v0.3.1: Provides an interface to the EventStudy API. There is an Introduction, and vignettes on Preparing EventStudy, parameters, and the RStudio Addin.

kmcudaR v1.0.0: Provides a fast, drop-in replacement for the classic K-means algorithm based on Yinyang K-means. Look here for details.

openEBGM v0.1.0: Provides an implementation of DuMouchel’s Bayesian data mining method for the market basket problem. There is an Introduction, and vignettes for Processing Raw Data, Hyperparameter Estimation, Empirical Bayes Metrics, and Objects and Class Functions.

spacyr v0.9.0: Provides a wrapper for the Python spaCy Natural Language Processing library. Look here for help with installation and use.

Education

learnr v0.9: Provides functions to create interactive tutorials for learning about R and R packages using R Markdown, using a combination of narrative, figures, videos, exercises, and quizzes. Look here to get started.

olsrr v0.2.0: Provides tools for teaching and learning ordinary least squares regression. There is an Introduction and vignettes on Heteroscedasticity, Measures of Influence, Collinearity Diagnostics, Residual Diagnostics and Variable Selection Methods.

rODE v0.99.4: Contains functions to show students how an ODE solver is made and how classes can be effective for constructing equations that describe natural phenomena. Have a look at the free book Computer Simulations in Physics. There are several vignettes providing brief examples, including one on the Pendulum and another on Planets.

Miscellaneous

atlantistools v0.4.2: Provides access to the Atlantis framework for end-to-end marine ecosystem modelling. There is a package demo and vignettes for model preprocessing, model calibration, species calibration, and model comparison.

phylodyn v0.9.0: Provides statistical tools for reconstructing population size from genetic sequence data. There are several vignettes including a Coalescent simulation of genealogies and a case study using New York Influenza data.

Statistics

adaptiveGPCA v0.1: Implements the adaptive gPCA algorithm described in Fukuyama. The vignette shows an example using data stored in a phyloseq object.

BayesNetBP v1.2.1: Implements belief propagation methods for Bayesian Networks based on the paper by Cowell. There is a function to invoke a Shiny App.

RPEXE.RPEXT v0.0.1: Implements the likelihood ratio test and backward elimination procedure for the reduced piecewise exponential survival analysis technique described in Han et al. 2012 and 2016. The vignette provides examples.

sfdct v0.0.3: Provides functions to construct a constrained ‘Delaunay’ triangulation from simple features objects. There is a vignette.

simglm v0.5.0: Provides functions to simulate linear and generalized linear models with up to three levels of nesting. There is an Introduction and vignettes on simulating GLMs, Missing Data, Power Analysis, and Unbalanced Data.

Utilities

checkarg v0.1.0: Provides utility functions that allow checking the basic validity of a function argument or any other value, including generating an error and assigning a default in a single line of code.

CodeDepends v0.5-3: Provides tools for analyzing R expressions or blocks of code and determining the dependencies between them. The vignette shows how to use them.

desctable v0.1.0: Provides functions to create descriptive and comparative tables that are ready to be saved as csv, or piped to DT::datatable() or pander::pander() to integrate into reports. There is a vignette to get you started.

lifelogr v0.1.0: Provides a framework for combining self-data from multiple sources, including fitbit and Apple Health. There is a general introduction as well as an introduction for visualization functions.

processx v2.0.0: Portable tools to run system processes in the background.

printr v0.1: Extends knitr generic function knit_print() to automatically print objects using an appropriate format such as Markdown or LaTeX. The vignette provides an introduction.

RHPCBenchmark v0.1.0: Provides microbenchmarks for determining the run-time performance of aspects of the R programming environment, and packages that are relevant to high-performance computation. There is an Introduction.

rlang v0.1.1: Provides a toolbox of functions for working with base types, core R features like the condition system, and core ‘Tidyverse’ features like tidy evaluation. The vignette explains R’s capabilities for creating Domain Specific Languages.

readtext v0.50: Provides functions for importing and handling text files and formatted text files with additional meta-data, including ‘.csv’, ‘.tab’, ‘.json’, ‘.xml’, ‘.pdf’, ‘.doc’, ‘.docx’, ‘.xls’, ‘.xlsx’ and other file types. There is a vignette.

tangram v0.2.6: Provides an extensible formula system to implement a grammar of tables for creating production-quality tables using a three-step process that involves a formula parser, statistical content generation from data, and rendering. There is a vignette introducing the Grammar, a Global Style for Rmd, and duplicating SAS PROC Tabulate.

tatoo v1.0.6: Provides functions to combine data.frames and to add metadata that can be used for printing and xlsx export. The vignette shows some examples.

Visualizations

ContourFunctions v0.1.0: Provides functions for making contour plots. A vignette introduces the package.

mbgraphic v1.0.0: Implements a two-step process for describing univariate and bivariate behavior similar to the cognostics measures proposed by Paul and John Tukey. First, measures describing variables are computed, and then plots are selected. The vignette describes the details.

polypoly v0.0.2: Provides tools for reshaping, plotting, and manipulating matrices of orthogonal polynomials. The vignette provides an overview.

RJSplot v2.1: Provides functions to create interactive graphs with R. It joins the data analysis power of R and the visualization libraries of JavaScript in one package. There is a tutorial.

To leave a comment for the author, please follow the link and comment on their blog: R Views.



Set Theory Arbitrary Union and Intersection Operations with R

By Aaron Schlegel

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)
Part 3 of 3 in the series Set Theory

The union and intersection set operations were introduced in a previous post using two sets, a and b. These set operations can be generalized to accept any number of sets.

Arbitrary Set Unions Operation

Consider a set of infinitely many sets:

 A = \{b_0, b_1, b_2, \cdots\}

It would be very tedious and unnecessary to write out the union statement repeatedly for any non-trivial number of sets; for example, the first few unions would be written as:

 b_0 \cup b_1 \cup b_2 \cup b_3 \cup b_4 \cup b_5

Thus a more general operation for performing unions is needed. This operation is denoted by the \bigcup symbol. For example, the set A above and the desired unions of the member sets can be generalized to the following using the new notation:

 \bigcup A = \bigcup_i b_i

We can then state the following definition: For a set A, the union \bigcup A of A is defined by:

 \bigcup A = \{x \mid (\exists b \in A) \; x \in b\}

For example, consider the three sets:

 a = \{2, 4, 6\} \qquad b = \{3, 5, 7\} \qquad c = \{2, 3, 8\}

The union of the three sets is written as:
 \bigcup \big\{ \{2,4,6\}, \{3,5,7\}, \{2,3,8\} \big\} = \{2,3,4,5,6,7,8\}

Recalling our union axiom from a previous post, the union axiom states that for two sets A and B, there is a set whose members consist entirely of those belonging to A or B, or both. More formally, the union axiom is stated as:

 \forall a \; \forall b \; \exists B \; \forall x \, (x \in B \Leftrightarrow x \in a \,\vee\, x \in b)

As we are now dealing with an arbitrary amount of sets, we need an updated version of the union axiom to account for the change.

Restating the union axiom:

For any set A, there exists a set B whose members are the same elements of the elements of A. Stated more formally:

 \forall x \big[ x \in B \Leftrightarrow (\exists b \in A) \; x \in b \big]

The definition of \bigcup A can be stated as:

 x \in \bigcup A \Leftrightarrow (\exists b \in A) \; x \in b

For example, we can demonstrate the updated axiom with the union of four sets {a, b, c, d}:

 \bigcup \{a, b, c, d\} = \big\{ x \mid (\exists b \in \{a, b, c, d\}) \; x \in b \big\}

 \bigcup \{a, b, c, d\} = a \cup b \cup c \cup d

We can implement the set operation for an arbitrary number of sets by expanding upon the function we wrote previously.

set.unions <- function(a, ...) {
  # Chain the base union() operation over any number of additional sets
  Reduce(union, list(...), a)
}

Perform the set union operation of four sets:

 a = \{1,2,3\} \qquad b = \{3,4,5\} \qquad c = \{1,4,6\} \qquad d = \{2,5,7\}
a <- c(1, 2, 3)
b <- c(3, 4, 5)
c <- c(1, 4, 6)
d <- c(2, 5, 7)

set.unions(a, b, c, d)
[1] 1 2 3 4 5 6 7
Intersections of an Arbitrary Number of Sets

The intersection set operation can also be generalized to any number of sets. Consider the previous set containing an infinite number of sets.

 A = \{b_0, b_1, b_2, \cdots\}

As before, writing out all the intersections would be tedious and not elegant. The intersection can instead be written as:

 \bigcap A = \bigcap_i b_i

As in our previous example of set intersections, there is no need for a separate axiom for intersections, unlike unions. Instead, we can state the following theorem: for a nonempty set A, there exists a set B such that for any element x:

 x \in B \Leftrightarrow (\forall a \in A) \; x \in a

Consider the following four sets:

 a = \{1,2,3,5\} \qquad b = \{1,3,5\} \qquad c = \{1,4,5,3\} \qquad d = \{2,5,1,3\}

The intersection of the sets is written as:

 \bigcap \big\{ \{1,2,3,5\}, \{1,3,5\}, \{1,4,5,3\}, \{2,5,1,3\} \big\} = \{1,2,3,5\} \cap \{1,3,5\} \cap \{1,4,5,3\} \cap \{2,5,1,3\} = \{1,3,5\}

We can write another function to implement the set intersection operation given any number of sets.

set.intersections <- function(a, ...) {
  # Chain the base intersect() operation over any number of additional sets
  Reduce(intersect, list(...), a)
}

Perform set intersections of the four sets specified earlier.

a <- c(1, 2, 3, 5)
b <- c(1, 3, 5)
c <- c(1, 4, 5, 3)
d <- c(2, 5, 1, 3)

set.intersections(a, b, c, d)
[1] 1 3 5
References

Enderton, H. (1977). Elements of set theory (1st ed.). New York: Academic Press.

The post Set Theory Arbitrary Union and Intersection Operations with R appeared first on Aaron Schlegel.

To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel.



RTutor: Emission Certificates and Green Innovation

By Economics and R – R posts

(This article was first published on Economics and R – R posts, and kindly contributed to R-bloggers)

Which policy instruments should we use to cost-effectively reduce greenhouse gas emissions? For a given technological level there are many economic arguments in favour of tradeable emission certificates or a carbon tax: they generate static efficiency by inducing emission reductions in those sectors and for those technologies where it is most cost effective.

Specialized subsidies, like the originally extremely high subsidies on solar energy in Germany and other countries, are often much more costly. Yet, we have seen a tremendous cost reduction for photovoltaics, which may not have been achieved on such a scale without those subsidies. And maybe in a world where the current president of a major polluting country seems not to care much about the risks of climate change, the development of cheap green technology that can cost-effectively substitute for fossil fuels even absent government support is the most decisive factor in fighting climate change.

Yet, the impact of different policy measures on innovation in green technology is very hard to assess. Are focused subsidies or mandates the best way, or can emission trading or carbon taxes also considerably boost innovation in green technologies? That is a tough quantitative question, but we can try to get at least some evidence.

In their article Environmental Policy and Directed Technological Change: Evidence from the European carbon market, Review of Economics and Statistics (2016), Raphael Calel and Antoine Dechezlepretre study the impact of the EU carbon emission trading system on the patent activities of the regulated firms. By matching them with unregulated firms, they estimate that emission trading has increased the regulated firms’ innovation activity in low-carbon technologies by 10%.

As part of his Master Thesis at Ulm University, Arthur Schäfer has generated an RTutor problem set that allows you to replicate the main insights of the paper in an interactive fashion.

Here is a screenshot:


Like in previous RTutor problem sets, you can enter free R code in a web-based Shiny app. The code will be automatically checked, and you can get hints on how to proceed. In addition you are challenged by many multiple-choice quizzes.

To install the problem set locally, first install RTutor as explained here:

https://github.com/skranz/RTutor

and then install the problem set package:

https://github.com/ArthurS90/RTutorEmissionTrading
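In practice the installation typically boils down to a couple of install_github() calls, roughly like the sketch below; the linked GitHub pages remain the authoritative, up-to-date instructions.

# a sketch; see the GitHub pages above for the official installation steps
install.packages("devtools")
devtools::install_github("skranz/RTutor")
devtools::install_github("ArthurS90/RTutorEmissionTrading")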

There is also an online version hosted by shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 30 hours total usage time per month, so it may be greyed out when you click on it.)

https://arthurs90.shinyapps.io/RTutorEmissionTrading/

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor

To leave a comment for the author, please follow the link and comment on their blog: Economics and R – R posts.



Interactive R visuals in Power BI

By David Smith

Power BI has long had the capability to include custom R charts in dashboards and reports. But in sharp contrast to standard Power BI visuals, these R charts were static. While R charts would update when the report data was refreshed or filtered, it wasn’t possible to interact with an R chart on the screen (to display tool-tips, for example). But in the latest update to Power BI, you can create custom R visuals that embed interactive R charts, like this:

The above chart was created with the plotly package, but you can also use htmlwidgets or any other R package that creates interactive graphics. The only restriction is that the output must be HTML, which can then be embedded into the Power BI dashboard or report. You can also publish reports including these interactive charts to the online Power BI service to share with others. (In this case though, you’re restricted to those R packages supported in Power BI online.)
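As a rough illustration of the kind of R script such a visual wraps: Power BI hands the visual’s data to R as a data frame called dataset, and the resulting widget is written out as HTML. The column names below are placeholders, and the exact output file and helper functions are dictated by the custom-visual (pbiviz) template, so treat this as a sketch rather than a drop-in script.

library(plotly)
library(htmlwidgets)

# 'dataset' is the data frame Power BI passes to an R visual;
# 'date', 'sales' and 'region' are placeholder field names.
p <- plot_ly(dataset, x = ~date, y = ~sales, color = ~region,
             type = "scatter", mode = "lines+markers")

# HTML-based visuals ultimately ship the widget as an HTML file.
saveWidget(p, "out.html", selfcontained = FALSE)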

Power BI now provides four custom interactive R charts, available as add-ins:

You can also create your own custom R visuals. The documentation explains how to create custom R visuals from HTML output, and you can also use the code on Github for the provided visuals linked above as a guide. For more on the new custom visuals, take a look at the blog post linked below.

Microsoft Power BI Blog: Interactive R custom visuals support is here!


Two years as a Data Scientist at Stack Overflow

By David Robinson

Last Friday marked my two year anniversary working as a data scientist at Stack Overflow. At the end of my first year I wrote a blog post about my experience, both to share some of what I’d learned and as a form of self-reflection.

After another year, I’d like to revisit the topic. While my first post focused mostly on the transition from my PhD to an industry position, here I’ll be sharing what has changed for me in my job in the last year, and what I hope the next year will bring.

Hiring a Second Data Scientist

In last year’s blog post, I noted how difficult it could be to be the only data scientist on a team:

Most of my current statistical education has to be self-driven, and I need to be very cautious about my work: if I use an inappropriate statistical assumption in a report, it’s unlikely anyone else will point it out.

This continued to be a challenge, and fortunately in December we hired our second data scientist, Julia Silge.

I have some very exciting news! I am joining the data team at @StackOverflow. ✨✨✨

— Julia Silge (@juliasilge) December 13, 2016

We started hiring for the position in September, and there were a lot of terrific candidates I got to meet and review during the application and review process. But I was particularly excited to welcome Julia to the team because we’d been working together during the course of the year, ever since we met and created the tidytext package at the 2016 rOpenSci unconference.

Julia, like me, works on analysis and visualization rather than building and productionizing features, and having a second person in that role has made our team much more productive. This is not just because Julia is an exceptional colleague, but because the two of us can now collaborate on statistical analyses or split them up to give each more focus. I did enjoy being the first data scientist at the company, but I’m glad I’m no longer the only one. Julia’s also a skilled writer and communicator, which was essential in achieving the next goal.

Company blog posts

In last year’s post, I shared some of the work that I’d done to explore the landscape of software developers, and set a goal for the following year (emphasis is new):

I’m also just intrinsically pretty interested in learning about and visualizing this kind of information; it’s one of the things that makes this a fun job. One plan for my second year here is to share more of these analyses publicly. In a previous post I looked at which technologies were the most polarizing, and I’m looking forward to sharing more posts like that soon.

I’m happy to say that we’ve made this a priority in the last six months. Since December I’ve gotten the opportunity to write a number of posts for the Stack Overflow company blog:

Other members of the team have written data-driven blog posts as well, including:

I’ve really enjoyed sharing these snapshots of the software developer world, and I’m looking forward to sharing a lot more on the blog this next year.

Teaching R at Stack Overflow

Last year I mentioned that part of my work has been developing data science architecture, and trying to spread the use of R at the company.

This also has involved building R tutorials and writing “onboarding” materials… My hope is that as the data team grows and as more engineers learn R, this ecosystem of packages and guides can grow into a true internal data science platform.

At the time, R was used mostly by three of us on the data team (Jason Punyon, Nick Larsen, and me). I’m excited to say it’s grown since then, and not just because of my evangelism.

“I’ve been thinking of switching to R, do you have any opinions on that?” he asked me at lunch, ill-advisedly

— David Robinson (@drob) March 1, 2017

Every Friday since last September, I’ve met with a group of developers to run internal “R sessions”, in which we analyze some of our data to develop insights and models. Together we’ve made discoveries that have led to real projects and features, for both the Data Team and other parts of the engineering department.

Every Friday for six months we’ve been doing internal #rstats lessons for @StackOverflow devs. In the last two sessions we made this! pic.twitter.com/M4duFAmolC

— David Robinson (@drob) March 10, 2017

There are about half a dozen developers who regularly take part, and they all do great work. But I especially appreciate Ian Allen and Jisoo Shin for coming up with the idea of these sessions back in September, and for following through in the months since. Ian and Jisoo joined the company last summer, and were interested in learning R to complement their development of product features. Their curiosity, and that of others in the team, has helped prove that data analysis can be a part of every engineer’s workflow.

Writing production code

My relationship to production code (the C# that runs the actual Stack Overflow website) has also changed. In my first year I wrote much more R code than C#, but in the second I’ve stopped writing C# entirely. (My last commit to production was more than a year ago, and I often go weeks without touching my Windows partition). This wasn’t really a conscious decision; it came from a gradual shift in my role on the engineering team. I’d usually rather be analyzing data than shipping features, and focusing entirely on R rather than splitting attention across languages has been helpful for my productivity.

Instead, I work with engineers to implement product changes based on analyses and push models into production. One skill I’ve had to work on is writing technical specifications, both for data sources that I need to query and for models that I’m proposing for production. One developer I’d like to acknowledge specifically is Nick Larsen, who works with me on the Data Team. Many of the blog posts I mention above answer questions like “What tags are visited in New York vs San Francisco”, or “What tags are visited at what hour of the day”, and these wouldn’t have been possible without Nick. Until recently, this kind of traffic data was very hard to extract and analyze, but he developed processes that extract and transform the data into more readily queryable tables. This has made many important analyses possible besides the blog posts, and I can’t appreciate this work enough.

(Nick also recently wrote an awesome post, How to talk about yourself in a developer interview, that’s worth checking out).

Working with other teams

Last year I mentioned that one of my projects was developing targeting algorithms for Job Ads, which match Stack Overflow visitors with jobs they may be interested in (such as, for example, matching people who visit Python and Javascript questions with Python web developer jobs). These are an important part of our business and still make up part of my data science work. But I learned in the last year about a lot of components of the business that data could help more with.

One team that I’ve worked with that I hadn’t in the first year is Display Ads. Display Ads are separate from job ads, and are purchased by companies with developer-focused products and services.

For example, I’ve been excited to work closer with Steve Feldman on the Display Ad Operations team. If you’re wondering why I’m not ashamed to work on ads, please read Steve’s blog post on how we sell display ads at Stack Overflow – he explains it better than I could. We’ve worked on several new methods for display ad targeting and evaluation, and I think there’s a lot of potential for data to have a positive impact for the company.

Changes in the rest of my career

There’ve been other changes in my second year out of academia. In my first year, I attended only one conference (NYR 2016) but I’ve since had more of a chance to travel, including to useR and JSM 2017, PLOTCON, rstudio::conf 2017, and NYR 2017. I spoke at a few of these, about my broom package, about gganimate and about the history of R as seen by Stack Overflow.

Julia and I wrote and published an O’Reilly book, Text Mining with R (now available on Amazon and free online here). I also self-published an e-book, Introduction to Empirical Bayes: Examples from Baseball Statistics, based on a series of blog posts. I really enjoyed the experience of turning blog posts into a larger narrative, and I’d like to continue doing so this next year.

There are some goals I didn’t achieve. I’ve had a longstanding interest in getting R into production (and we’ve idly investigated some approaches like Microsoft R Server), but as of now we’re still productionizing models by rewriting them in C#. And there are many teams at Stack Overflow that I’d like to give better support to – prioritizing the Data Team’s time has been a challenge, though having a second data scientist has helped greatly. But I’m still happy with how my work has gone, and excited about the future.

In any case, this made the whole year worthwhile:

Easily my favorite thing to come out of the Trump Twitter analysis was @nypost calling @StackOverflow “a Q&A site for egghead programmers” pic.twitter.com/0xrYkM2OOU

— David Robinson (@drob) November 4, 2016
