Going Bayes #rstats

By Daniel

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Some time ago I started working with Bayesian methods, using the great rstanarm-package. Besides the fantastic package-vignettes, and books like Statistical Rethinking or Doing Bayesian Data Analysis, I also found the resources from Tristan Mahr helpful for better understanding both Bayesian analysis and rstanarm. This motivated me to implement tools for Bayesian analysis in my packages as well.

Due to the latest tidyr-update, I had to update some of my packages to make them work again, so – besides some other features – some Bayes-stuff is now available in my packages on CRAN.

Finding shape or location parameters from distributions

The following functions are included in the sjstats-package. Given some known quantiles or percentiles, or a certain value or ratio and its standard error, the functions find_beta(), find_normal() or find_cauchy() help find the parameters for a distribution. Taking the example from here, the plot indicates that the mean value for the normal distribution is somewhat above 50. We can find the exact parameters with find_normal(), using the information given in the text:

library(sjstats)
find_normal(x1 = 30, p1 = .1, x2 = 90, p2 = .8)
#> $mean
#> [1] 53.78387
#>
#> $sd
#> [1] 30.48026

High Density Intervals for MCMC samples

The hdi()-function computes the high density interval for posterior samples. This is nothing special, since other packages provide such functions as well – however, you can use this function not only on vectors, but also on stanreg-objects (i.e. the results from models fitted with rstanarm). And, if required, you can also transform the HDI-values, e.g. if you need these intervals on an exponentiated scale.
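
For intuition, here is a minimal sketch of the idea behind an HDI for a vector of posterior draws – not sjstats’ actual implementation – based on taking the narrowest interval that contains the desired probability mass:

# toy HDI: the narrowest interval containing `prob` of the draws
hdi_sketch <- function(x, prob = 0.9) {
  x <- sort(x)
  n <- length(x)
  window <- max(1, floor(prob * n))        # number of draws inside each candidate interval
  starts <- seq_len(n - window)
  widths <- x[starts + window] - x[starts] # width of each candidate interval
  best <- which.min(widths)                # narrowest interval = highest density
  c(hdi.low = x[best], hdi.high = x[best + window])
}

set.seed(123)
hdi_sketch(rnorm(4000), prob = 0.9)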

library(rstanarm)

# fit linear regression model
# (the original model calls were lost to formatting; the calls shown here are
#  plausible reconstructions based on the terms in the output)
fit <- stan_glm(mpg ~ wt + am, data = mtcars)
hdi(fit)
#>          term   hdi.low  hdi.high
#> 1 (Intercept) 32.158505 42.341421
#> 2          wt -6.611984 -4.022419
#> 3          am -2.567573  2.343818
#> 4       sigma  2.564218  3.903652

# fit logistic regression model
fit <- stan_glm(vs ~ wt + am, data = mtcars, family = binomial("logit"))
hdi(fit, trans = exp)
#>          term      hdi.low     hdi.high
#> 1 (Intercept) 4.464230e+02 3.725603e+07
#> 2          wt 6.667981e-03 1.752195e-01
#> 3          am 8.923942e-03 3.747664e-01

Marginal effects for rstanarm-models

The ggeffects-package creates tidy data frames of model predictions, which are ready to use with ggplot2 (though there’s a plot()-method as well). ggeffects supports a wide range of models, and makes it easy to plot marginal effects for specific predictors, including interaction terms. The past updates added support for more model types, for instance polr (pkg MASS), hurdle and zeroinfl (pkg pscl), betareg (pkg betareg), truncreg (pkg truncreg), coxph (pkg survival) and stanreg (pkg rstanarm).

ggpredict() is the main function that computes marginal effects. Predictions for stanreg-models are based on the posterior distribution of the linear predictor (posterior_linpred()), mostly for convenience reasons. It is recommended to use the posterior predictive distribution (posterior_predict()) for inference and model checking, and you can do so using the ppd-argument when calling ggpredict(). However, especially for binomial or poisson models, the „confidence intervals“ are harder (and much slower) to compute that way, which is why relying on posterior_linpred() is the default for stanreg-models in ggpredict().
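
As a small illustration of the switch (a sketch with a toy model, not the model used later in this post), the ppd-argument is all that changes:

library(rstanarm)
library(ggeffects)

# toy model, purely for illustration
m0 <- stan_glm(mpg ~ wt + am, data = mtcars, chains = 2, iter = 1000, refresh = 0)

# default: predictions based on the posterior of the linear predictor
ggpredict(m0, terms = "wt")

# predictions based on the posterior predictive distribution instead
ggpredict(m0, terms = "wt", ppd = TRUE)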

Here is an example with two plots, one without raw data and one including data points:

library(sjmisc)
library(rstanarm)
library(ggeffects)
data(efc)
# make categorical
efc$c161sex <- to_factor(efc$c161sex)
# fit model and compute marginal effects (the original model call was lost
# to formatting; this is a plausible reconstruction based on the output below)
m <- stan_glm(neg_c_7 ~ c12hour + c161sex, data = efc)
dat <- ggpredict(m, terms = c("c12hour", "c161sex"))
dat
#> # A tibble: 128 x 5
#>        x predicted conf.low conf.high  group
#>    <dbl>     <dbl>    <dbl>     <dbl> <fctr>
#>  1     4  10.80864 10.32654  11.35832   Male
#>  2     4  11.26104 10.89721  11.59076 Female
#>  3     5  10.82645 10.34756  11.37489   Male
#>  4     5  11.27963 10.91368  11.59938 Female
#>  5     6  10.84480 10.36762  11.39147   Male
#>  6     6  11.29786 10.93785  11.61687 Female
#>  7     7  10.86374 10.38768  11.40973   Male
#>  8     7  11.31656 10.96097  11.63308 Female
#>  9     8  10.88204 10.38739  11.40548   Male
#> 10     8  11.33522 10.98032  11.64661 Female
#> # ... with 118 more rows

plot(dat)
plot(dat, rawdata = TRUE)

As you can see, if you work with labelled data, the model-fitting functions from the rstanarm-package preserve all value and variable labels, making it easy to create annotated plots. The „confidence bands“ are actually high density intervals, computed with the above mentioned hdi()-function.

Next…

Next I will integrate ggeffects into my sjPlot-package, making sjPlot more generic and supporting more model types. Furthermore, sjPlot will get a generic plot_model()-function which will replace former single functions like sjp.lm(), sjp.glm(), sjp.lmer() or sjp.glmer(). plot_model() should then produce a plot – either marginal effects, forest plots, interaction terms and so on – and accept (m)any model classes. This should help make sjPlot more convenient to work with, more stable, and easier to maintain…



Rcpp now used by 10 percent of CRAN packages

By Thinking inside the box

10 percent of CRAN packages

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Over the last few days, Rcpp passed another noteworthy hurdle. It is now used by over 10 percent of packages on CRAN (as measured by Depends, Imports and LinkingTo, but excluding Suggests). As of this morning 1130 packages use Rcpp out of a total of 11275 packages. The graph on the left shows the growth of both outright usage numbers (in darker blue, left axis) and relative usage (in lighter blue, right axis).
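
If you want to reproduce a rough version of the count yourself, here is a small sketch (current numbers will of course differ, since CRAN keeps growing, and the handling of the CRAN package database columns here is my own assumption):

db <- tools::CRAN_package_db()
db <- db[!duplicated(db$Package), ]

# look for Rcpp in Depends, Imports or LinkingTo (Suggests excluded)
uses_rcpp <- grepl("\\bRcpp\\b", paste(db$Depends, db$Imports, db$LinkingTo))

c(total = nrow(db), rcpp = sum(uses_rcpp), share = round(100 * mean(uses_rcpp), 1))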

Older posts on this blog took note when Rcpp passed round hundreds of packages, most recently in April for 1000 packages. The growth rates for both Rcpp, and of course CRAN, are still staggering. A big thank you to everybody who makes this happen, from R Core and CRAN to all package developers, contributors, and of course all users driving this. We have built ourselves a rather impressive ecosystem.

So with that a heartfelt Thank You! to all users and contributors of R, CRAN, and of course Rcpp, for help, suggestions, bug reports, documentation, encouragement, and, of course, code.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Simple practice: data wrangling the iris dataset

By Sharp Sight

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

In last week’s post, I emphasized the importance of practicing R and the Tidyverse with small, simple problems, drilling them until you are competent.

In that post, I gave you a few very small scripts to practice (which I suggest that you memorize).

This week, I want to give you another small example. We’re going to clean up the iris dataset.

More specifically, we’re going to:

  1. Coerce the iris dataset from an old-school data frame into a tibble.
  2. Rename the variables, so that all characters are lower case and “snake case” (underscores in place of the periods) is applied.

Like last week, this is a very simple example. However, (like I mentioned in the past) this is the sort of small task that you’ll need to be able to execute fluidly if you want to work on larger projects.

If you want to do large, complex analyses, it really pays to first master techniques on a small scale using much simpler datasets.

Ok, let’s dive in.

First, let’s take a look at the complete block of code.

library(tidyverse)
library(stringr)


#------------------
# CONVERT TO TIBBLE
#------------------
# – the iris dataframe is an old-school dataframe
#   ... this means that by default, it prints out
#   large numbers of records.
# - By converting to a tibble, functions like head()
#   will print out a small number of records by default

df.iris <- as_tibble(iris)

# RENAME VARIABLES
# - lower case, periods replaced with underscores ("snake case")
colnames(df.iris) <- df.iris %>%
  colnames() %>%
  str_to_lower() %>%
  str_replace_all("\\.", "_")

# INSPECT

df.iris %>% head()

What have we done here? We’ve combined several discrete functions of the Tidyverse together in order to perform a small amount of data wrangling.

Specifically, we’ve turned the iris dataset into a tibble, and we’ve renamed the variables to be more consistent with modern R code standards and naming conventions.

This example is quite simple, but useful. This is the sort of small task that you’ll need to be able to do in the context of a large analysis.

Breaking down the script

To make this a little clearer, let’s break this down into its component parts.

In the section where we renamed the variables, we only used three core functions:

  • colnames()
  • str_to_lower()
  • str_replace_all()

Each of these individual pieces is pretty straightforward.

We are using colnames() to retrieve the column names.

Then, we pipe the output into the stringr function str_to_lower() to convert all the characters to lower case.

Next, we use str_replace_all() to replace the periods (“.”) with underscores (“_”). This effectively transforms the variable names to “snake case.” (Keep in mind that str_replace_all() uses regular expressions. You have learned regular expressions, right?)
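
As a quick aside (my own illustration, not from the original post), this is why the period has to be escaped or wrapped in fixed() – a bare “.” in a regular expression matches any character:

library(stringr)

str_replace_all("Sepal.Length", "\\.", "_")      # "Sepal_Length"
str_replace_all("Sepal.Length", fixed("."), "_") # "Sepal_Length"
str_replace_all("Sepal.Length", ".", "_")        # "____________" – every character matched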

Finally, using the assignment operator (at the upper left-hand side of the code), we assign the resulting transformed column names back to the tibble via colnames(df.iris).

I will point out that we have used these functions in a “waterfall” pattern; we have combined them by using the pipe operator, %>%, such that the output of one step becomes the immediate input for the next step. This is a key feature of the Tidyverse. We can combine very simple functions together in new ways to accomplish tasks. This might not seem like a big deal, but it is extremely powerful. The modular nature of the Tidyverse functions, when used with the pipe operator, makes the Tidyverse flexible and syntactically powerful, while allowing the code to remain clear and easy to read.

A test of skill: can you write this fluently?

The functions that we just used are all critical for doing data science in R. With that in mind, this script is a good test of your skill: can you write code like this fluently, from memory?

That should be your goal.

To get there, you need to know how the individual functions work. What that means is that you need to study the functions (how they work). But to be able to put them into practice, you need to drill them. So after you understand how they work, drill each individual function until you can write each individual function from memory. Next, you should drill small scripts (like the one in this blog post). You ultimately want to be able to “put the pieces together” quickly and seamlessly in order to solve problems and get things done.

I’ve said it before: if you want a great data science job, you need to be one of the best. If you want to be one of the best, you need to master the toolkit. And to master the toolkit, you need to drill.

Sign up now, and discover how to rapidly master data science

To rapidly master data science, you need to practice.

You need to know what to practice, and you need to know how to practice.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

Sign up now for our email list, and you’ll receive regular tutorials and lessons. You’ll learn:

  • What data science tools you should learn (and what not to learn)
  • How to practice those tools
  • How to put those tools together to execute analyses and machine learning projects
  • … and more

If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.

SIGN UP NOW

The post Simple practice: data wrangling the iris dataset appeared first on SHARP SIGHT LABS.


useR!2017 Roundup

By Open Analytics

hNMF

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

Organising useR!2017 was a challenge but a very rewarding experience. With about 1200 attendees of over 55 nationalities exploring an interesting program, we believe it is appropriate to call it a success – something the aftermovie only seems to confirm.

Behind the Scenes

To give you a glimpse behind the scenes of the conference organization, Maxim Nazarov held a lightning talk on ‘redmineR and the story of automating the useR!2017 abstract review process’.

You can find the R package on the Open Analytics Github and slides are available here.

Laure Cougnaud presented during the useR! Newbies session in a talk called ‘Making the most of useR!’ and assisted the newbies throughout the conference as a conference buddy. She also served as the chair of the Bioinformatics I session.

In spite of recent appearances, Open Analytics does more than organize useR! conferences and, as a platinum sponsor, Tobias Verbeke had the opportunity to present Open Analytics in a sponsorship talk.

Open Analytics offers its services in four different service lines:

  • statistical consulting,
  • scientific programming,
  • application development & integration and
  • data analysis hardware & hosting.

The talks our consultants contributed can be nicely laid out along these service lines.

Statistical Consulting

On the methodological side (statistical consulting) Kathy Mutambanengwe held a talk on ‘A restricted composite likelihood approach to modelling Gaussian geostatistical data’.

Adriaan Blommaert and Nicolas Sauwen co-authored the lightning talk by Tor Maes on ‘Multivariate statistics for PAT data analysis: short overview of existing R packages and methods’.
Finally, in the poster session, Machteld Varewyck presented work on analyzing digital PCR data in R and Rytis Bagdziunas presented a poster on BIGL: assessing and visualizing drug synergy.

Scientific Programming

In the scientific programming area, Nicolas Sauwen held a talk on the ‘Differentiation of brain tumor tissue using hierarchical non-negative matrix factorization’


The application to differentiation of brain tumor tissue is an interesting case, but the hNMF method is currently the fastest NMF implementation and can be put to use in many other contexts of unsupervised learning. If you are interested, the discussed hNMF package can be found on CRAN.

Kirsten Van Hoorde held a lightning talk on the ‘R (‘template’) package to automate analysis workflow and reporting’, co-authored by Laure Cougnaud. For an example package demonstrating the approach, please see the Open Analytics Github – slides can be found here.

Application Development and Integration

Regarding application development and integration, Marvin Steijaert and Griet Laenen held a long talk on ‘Biosignature-Based Drug Design: from high dimensional data to business impact’ demonstrating how machine learning is put to use to design drugs (slides here).

Data Analysis Hardware and Hosting

Regarding hosting of data analysis applications, Tobias Verbeke held a well-attended talk on ShinyProxy, a fully open source product that lets you run Shiny apps at scale in an enterprise context.

All information can be found on the ShinyProxy website and sources are on Github.

We hope you enjoyed the conference as much as we did. Let us know if you have any questions or comments on these talks or on Open Analytics and its services offer.

By popular demand: here you can find the source code of the Poissontris game – that other Shiny app 😉

Poissontris screenshot


Gender roles in film direction, analyzed with R

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

What do women do in films? If you analyze the stage directions in film scripts — as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for ThePudding — it seems that women (but not men) are written to snuggle, giggle and squeal, while men (but not women) shoot, gallop and strap things to other things.

This is all based on an analysis of almost 2,000 film scripts mostly from 1990 and after. The words come from pairs of words beginning with “he” and “she” in the stage directions (but not the dialogue) in the screenplays — directions like “she snuggles up to him, strokes his back” and “he straps on a holster under his sealskin cloak”. The essay also includes an analysis of words by the writer and character’s gender, and includes lots of lovely interactive elements (including the ability to see examples of the stage directions).

The analysis, including the chart above, was created using the R language, and the R code is available on GitHub. The screenplay analysis makes use of the tidytext package, which simplifies the process of handling the text-based data (the screenplays), extracting the stage directions, and tabulating the word pairs.
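
As a rough illustration of the approach (a sketch with two made-up stage directions, not the authors’ actual pipeline), tidytext’s n-gram tokenizer makes it easy to pull out the word that follows “he” or “she”:

library(dplyr)
library(tidyr)
library(tidytext)

directions <- tibble(text = c("She snuggles up to him, strokes his back",
                              "He straps on a holster under his sealskin cloak"))

directions %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("pronoun", "word"), sep = " ") %>%
  filter(pronoun %in% c("he", "she")) %>%
  count(pronoun, word, sort = TRUE)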

You can find the complete essay linked below, and it’s well worth checking out to experience the interactive elements.

ThePudding: She Giggles, He Gallops


Caching httr Requests? This means WAR[C]!

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on github to work with WARC files it’s a tad fragile and improving it would mean reinventing many wheels (i.e. there are longstanding solid implementations of WARC libraries in many other languages that could be tapped vs writing a C++-backed implementation).

One of those implementations is JWAT, a library written in Java (as many WARC use-cases involve working in what would traditionally be called map-reduce environments). It has a small footprint and is structured well enough that I decided to take it for a spin as a set of R packages that wrap it with rJava. There are two packages since it follows a recommended CRAN model of having one package for the core Java Archive (JAR) files — since they tend to not change as frequently as the functional R package would and they tend to take up a modest amount of disk space — and another for the actual package that does the work. They are jwatjars and jwatr.

I’ll exposit on the full package at some later date, but I wanted to post a snippet showing that you may have a use for WARC files that you hadn’t considered before: pairing WARC files with httr web scraping tasks to maintain a local cache of what you’ve scraped.

Web scraping consumes network & compute resources on the server end that you typically don’t own and — in many cases — do not pay for. While there are scraping tasks that need to access the latest possible data, many times tasks involve scraping data that won’t change.

The same principle works for caching the results of API calls, since you may make those calls and use some data, but then realize you wanted to use more data and make the same API calls again. Caching the raw API results can also help with reproducibility, especially if the site you were using goes offline (like the U.S. Government sites that are being taken down by the anti-science folks in the current administration).
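
As an aside, the simplest version of that caching idea has nothing to do with WARC files: hash the URL and save the httr response object to disk (a generic sketch, assuming the digest package is available, and unrelated to the jwatr wrappers introduced below):

library(httr)

cached_GET <- function(url, cache_dir = tempdir(), ...) {
  path <- file.path(cache_dir, paste0(digest::digest(url), ".rds"))
  if (file.exists(path)) return(readRDS(path)) # cache hit: reuse the stored response
  res <- GET(url, ...)                         # cache miss: fetch and store it
  saveRDS(res, path)
  res
}

res <- cached_GET("https://data.police.uk/api/leicestershire/neighbourhoods")
length(content(res))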

To that end I’ve put together the beginning of some “WARC wrappers” for httr verbs that make it seamless to cache scraping or API results as you gather and process them. Let’s work through an example using the U.K. open data portal on crime and policing API.

First, we’ll need some helpers:

library(rJava)
library(jwatjars) # devtools::install_github("hrbrmstr/jwatjars")
library(jwatr) # devtools::install_github("hrbrmstr/jwatr")
library(httr)
library(jsonlite)
library(tidyverse)

Just doing library(jwatr) would have covered much of that but I wanted to show some of the work R does behind the scenes for you.

Now, we’ll grab some neighbourhood and crime info:

# (the original code block was lost to formatting: it opened the WARC file used
#  below and made WARC-wrapped httr GET calls to the data.police.uk API, storing
#  each response as it was processed)

As you can see, the standard httr response object is returned for processing, and the HTTP response itself is being stored away for us as we process it.

file.info("~/Data/wrap-test.warc.gz")$size
## [1] 76020

We can use these results later and, pretty easily, since the WARC file will be read in as a tidy R tibble (fancy data frame):

# read the WARC file back in (reader call assumed; the original was lost to formatting)
xdf <- read_warc("~/Data/wrap-test.warc.gz")
dplyr::glimpse(xdf)
## $ target_uri                  "https://data.police.uk/api/leicestershire/neighbourhoods", "https://data.police.uk/api/crimes-street...
## $ ip_address                  "54.76.101.128", "54.76.101.128", "54.76.101.128"
## $ warc_content_type           "application/http; msgtype=response", "application/http; msgtype=response", "application/http; msgtyp...
## $ warc_type                   "response", "response", "response"
## $ content_length              2984, 511564, 688
## $ payload_type                "application/json", "application/json", "application/json"
## $ profile                     NA, NA, NA
## $ date                        2017-08-22, 2017-08-22, 2017-08-22
## $ http_status_code            200, 200, 200
## $ http_protocol_content_type  "application/json", "application/json", "application/json"
## $ http_version                "HTTP/1.1", "HTTP/1.1", "HTTP/1.1"
## $ http_raw_headers            [ "", "",...
## $ payload                     [

The URLs are all there, so it will be easier to map the original calls to them.

Now, the payload field is the HTTP response body and there are a few ways we can decode and use it. First, since we know it’s JSON content (that’s what the API returns), we can just decode it:

for (i in 1:nrow(xdf)) { # loop body lost to formatting; a plausible reconstruction:
  res <- jsonlite::fromJSON(rawToChar(xdf$payload[[i]]))
  print(str(res, 2))
}

We can also use a jwatr helper function — payload_content() — which mimics the httr::content() function:

for (i in 1:nrow(xdf)) {
  
  payload_content(
    xdf$target_uri[i], 
    xdf$http_protocol_content_type[i], 
    xdf$http_raw_headers[[i]], 
    xdf$payload[[i]], as = "text"
  ) %>% 
    jsonlite::fromJSON() -> res
  
  print(str(res, 2))
  
}

The same output is printed, so I’m saving some blog content space by not including it.

Future Work

I kept this example small, but ideally one would write a warcinfo record as the first WARC record to identify the file, and I need to add options and functionality to store a WARC request record as well as a response record. But, I wanted to toss this out there to get feedback on the idiom and on what possible desired functionality should be added.

So, please kick the tyres and file as many issues as you have time or interest to. I’m still designing the full package API and making refinements to existing functions, so there’s plenty of opportunity to tailor this to the more data science-y and reproducibility use cases R folks have.


Some Neat New R Notations

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The R package seplyr supplies a few neat new coding notations.

An Abacus, which gives us the term “calculus.”

The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the following:

library("seplyr")

names <- c('a', 'b') := c('x', 'y')
print(names)
#>   a   b 
#> "x" "y"

This can be very useful when programming in R, as it allows indirection or abstraction on the left-hand side of inline name assignments (unlike c(a = 'x', b = 'y'), where all left-hand-sides are concrete values even if not quoted).

A nifty property of the named map builder is it commutes (in the sense of algebra or category theory) with R‘s “c()” combine/concatenate function. That is: c('a' := 'x', 'b' := 'y') is the same as c('a', 'b') := c('x', 'y'). Roughly this means the two operations play well with each other.
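
A quick check of that claim (my own snippet, assuming seplyr is attached as above):

v1 <- c('a' := 'x', 'b' := 'y')
v2 <- c('a', 'b') := c('x', 'y')
identical(v1, v2) # should be TRUE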

The second notation is an operator called the “anonymous function builder”. For technical reasons we use the same “:=” notation for this (and, as is common in R, pick the correct behavior based on runtime types).

The function construction is written as: “variables := { code }” (the braces are required) and the semantics are roughly the same as “function(variables) { code }“. This is derived from some of the work of Konrad Rudolph who noted that most functional languages have a more concise “lambda syntax” than “function(){}” (please see here and here for some details, and be aware the seplyr notation is not as concise as is possible).

This notation allows us to write the squares of 1 through 4 as:

sapply(1:4, x:={x^2})

instead of writing:

sapply(1:4, function(x) x^2)

It is only a few characters of savings, but being able to choose notation can be a big deal. A real victory would be being able to directly use lambda-calculus notation such as “(λx.x^2)”. In the development version of seplyr we are experimenting with the following additional notations:

sapply(1:4, lambda(x)(x^2))
sapply(1:4, λ(x, x^2))

(Both of these currently work in the development version, though we are not sure about submitting source files with non-ASCII characters to CRAN.)


So you (don’t) think you can review a package

By Mara Averick

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Contributing to an open-source community without contributing code is an oft-vaunted idea that can seem nebulous. Luckily, putting vague ideas into action is one of the strengths of the rOpenSci Community, and their package onboarding system offers a chance to do just that.

This was my first time reviewing a package, and, as with so many things in life, I went into it worried that I’d somehow ruin the package-reviewing process— not just the package itself, but the actual onboarding infrastructure…maybe even rOpenSci on the whole.

Barring the destruction of someone else’s hard work and/or an entire organization, I was fairly confident that I’d have little to offer in the way of useful advice. What if I have absolutely nothing to say other than, yes, this is, in fact, a package?!

So, step one (for me) was: confess my inadequacies and seek advice. It turns out that much of the advice vis-à-vis how to review a package is baked right into the documents. The reviewer template is a great trail map, the utility of which is fleshed out in the rOpenSci Package Reviewing Guide. Giving these a thorough read, and perusing a recommended review or two (links in the reviewing guide) will probably have you raring to go. But, if you’re feeling particularly neurotic (as I almost always am), the rOpenSci onboarding editors and larger community are endless founts of wisdom and resources.

visdat 📦👀

I knew nothing about Nicholas Tierney‘s visdat package prior to receiving my invitation to review it. So the first (coding-y) thing I did was play around with it in the same way I do for other cool R packages I encounter. This is a totally unstructured mish-mash of running examples, putting my own data in, and seeing what happens. In addition to being amusing, it’s a good way to sort of “ground-truth” the package’s mission, and make sure there isn’t some super helpful feature that’s going unsung.

If you’re not familiar with visdat, it “provides a quick way for the user to visually examine the structure of their data set, and, more specifically, where and what kinds of data are missing.”1 With early-stage EDA (exploratory data analysis), you’re really trying to get a feel of your data. So, knowing that I couldn’t be much help in the “here’s how you could make this faster with C++” department, I decided to fully embrace my role as “naïve user”.2

Questions I kept in mind as ~myself~ resident naïf:

  • What did I think this thing would do? Did it do it?
  • What are things that scare me off?

The latter question is key, and, while I don’t have data to back this up, can be a sort of “silent” usability failure when left unexamined. Someone who tinkers with a package, but finds it confusing doesn’t necessarily stop to give feedback. There’s also a pseudo curse-of-knowledge component. While messages and warnings are easily parsed, suppressed, dealt with, and/or dismissed by the veteran R user/programmer, unexpected, brightly-coloured text can easily scream Oh my gosh you broke it all!! to those with less experience.

Myriad lessons learned 💡

I can’t speak for Nick per the utility (or lack thereof) of my review (you can see his take here), but I can vouch for the package-reviewing experience as a means of methodically inspecting the innards of an R package. Methodical is really the operative word here. Though “read the docs,” or “look at the code” sounds straightforward enough, it’s not always easy to coax oneself into going through the task piece-by-piece without an end goal in mind. While a desire to contribute to open-source software is noble enough (and is how I personally ended up involved in this process– with some help/coaxing from Noam Ross), it’s also an abstraction that can leave one feeling overwhelmed, and not knowing where to begin.3

There are also self-serving bonus points that one simply can’t avoid, should you go the rOpenSci-package-reviewing route – especially if package development is new to you.4 Heck, the package reviewing guide alone was illuminating.

Furthermore, the wise-sage 🦉 rOpenSci onboarding editors5 are excellent matchmakers, and ensure that you’re actually reviewing a package authored by someone who wants their package to be reviewed. This sounds simple enough, but it’s a comforting thought to know that your feedback isn’t totally unsolicited.


  1. Yes, I’m quoting my own review.

  2. So, basically just playing myself… Also I knew that, if nothing more, I can proofread and copy edit.

  3. There are lots of good resources out there re. overcoming this obstacle, though (e.g. First Timers Only; or Charlotte Wickham‘s Collaborative Coding from useR!2017 is esp. 👍 for the R-user).

  4. OK, so I don’t have a parallel world wherein a very experienced package-developer version of me is running around getting less out of the process, but if you already deeply understand package structure, you’re unlikely to stumble upon quite so many basic “a-ha” moments.

  5. 👋Noam Ross, Scott Chamberlain, Karthik Ram, & Maëlle Salmon


Onboarding visdat, a tool for preliminary visualisation of whole dataframes

By Nicholas Tierney

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Take a look at the data

This is a phrase that comes up when you first get a dataset.

It is also ambiguous. Does it mean to do some exploratory modelling? Or make some histograms, scatterplots, and boxplots? Is it both?

Starting down either path, you often encounter the non-trivial growing pains of working with a new dataset. The mix-ups of data types – height in cm coded as a factor, categories that are numerics with decimals, strings that are datetimes, and somehow datetime is one long number. And let’s not forget everyone’s favourite: missing data.

These growing pains often get in the way of your basic modelling or graphical exploration. So, sometimes you can’t even start to take a look at the data, and that is frustrating.

The visdat package aims to make this preliminary part of analysis easier. It focuses on creating visualisations of whole dataframes, to make it easy and fun for you to “get a look at the data”.

Making visdat was fun, and it was easy to use. But I couldn’t help but think that maybe visdat could be more.

  • I felt like the code was a little sloppy, and that it could be better.
  • I wanted to know whether others found it useful.

What I needed was someone to sit down and read over it, and tell me what they thought. And hey, a publication out of this would certainly be great.

Too much to ask, perhaps? No. Turns out, not at all. This is what the rOpenSci onboarding process provides.

rOpenSci onboarding basics

Onboarding a package onto rOpenSci is an open peer review of an R package. If successful, the package is migrated to rOpenSci, with the option of putting it through an accelerated publication with JOSS.

What’s in it for the author?

  • Feedback on your package
  • Support from rOpenSci members
  • Maintain ownership of your package
  • Publicity from it being under rOpenSci
  • Contribute something to rOpenSci
  • Potentially a publication

What can rOpenSci do that CRAN cannot?

The rOpenSci onboarding process provides a stamp of quality on a package that you do not necessarily get when a package is on CRAN 1. Here’s what rOpenSci does that CRAN cannot:

  • Assess documentation readability / usability
  • Provide a code review to find weak points / points of improvement
  • Determine whether a package is overlapping with another.

So I submitted visdat to the onboarding process. For me, I did this for three reasons.

  1. So visdat could become a better package
  2. Pending acceptance, I would get a publication in JOSS
  3. I get to contribute back to rOpenSci

Submitting the package was actually quite easy – you go to submit an issue on the onboarding page on GitHub, and it provides a magical template for you to fill out 2, with no submission gotchas – this could be the future 3. Within 2 days of submitting the issue, I had a response from the editor, Noam Ross, and two reviewers assigned, Mara Averick, and Sean Hughes.

I submitted visdat and waited, somewhat apprehensively. What would the reviewers think?

In fact, Mara Averick wrote a post: “So you (don’t) think you can review a package” about her experience evaluating visdat as a first-time reviewer.

Getting feedback

Unexpected extras from the review

Even before the review started officially, I got some great concrete feedback from Noam Ross, the editor for the visdat submission.

  • Noam used the goodpractice package to identify bad code patterns and other places to immediately improve upon in a concrete way. This resulted in me:
    • Fixing error-prone code such as 1:length(...) or 1:nrow(...) (see the short example just after this list)
    • Improving testing using the visualisation testing software vdiffr
    • Reducing long code lines to improve readability
    • Defining global variables to avoid a NOTE (“no visible binding for global variable”)
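
For example (a generic illustration, not code from visdat itself), the 1:length(...) pattern that goodpractice flags silently misbehaves on empty input:

x <- integer(0)

1:length(x)   # 1 0 -- iterates twice over an empty vector
seq_along(x)  # integer(0) -- the safe replacement

1:nrow(mtcars)        # fine here, but breaks the same way on a zero-row data frame
seq_len(nrow(mtcars)) # the defensive spelling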

So before the review even started, visdat is in better shape, with 99% test coverage, and clearance from goodpractice.

The feedback from reviewers

I received prompt replies from the reviewers, and I got to hear really nice things like “I think visdat is a very worthwhile project and have already started using it in my own work.”, and “Having now put it to use in a few of my own projects, I can confidently say that it is an incredibly useful early step in the data analysis workflow. vis_miss(), in particular, is helpful for scoping the task at hand …”. In addition to these nice things, there was also great critical feedback from Sean and Mara.

A common thread in both reviews was that the way I initially had visdat set up was to have the first row of the dataset at the bottom left, and the variable names at the bottom. However, this doesn’t reflect what a dataframe typically looks like – with the names of the variables at the top, and the first row also at the top. There were also suggestions to add the percentage of missing data in each column.

On the left are the old vis_dat() and vis_miss() plots, and on the right are the new vis_dat() and vis_miss() plots.

Changing this makes the plots make a lot more sense, and read better.

Mara made me aware of the warning and error messages that I had let crop up in the package. This was something I had grown to accept – the plot worked, right? But Mara pointed out that, from a user perspective, seeing these warnings and messages can be a negative experience, and something that might stop them from using the package – how do they know if their plot is accurate with all these warnings? Are they using it wrong?

Sean gave practical advice on reducing code duplication, explaining how to write a general construction method to prepare the data for the plots. Sean also explained how to write C++ code to improve the speed of vis_guess().

From both reviewers I got nitty-gritty feedback about my writing – places where the documentation was just a bunch of notes I had made, or where I had reversed the order of a statement.

What did I think?

I think that getting feedback in general on your own work can be a bit hard to take sometimes. We get attached to our ideas, we’ve seen them grow from little thought bubbles all the way to “all growed up” R packages. I was apprehensive about getting feedback on visdat. But the feedback process from rOpenSci was, as Tina Turner put it, “simply the best”.

Boiling the onboarding review process down to a few key points, I would say it is transparent, friendly, and thorough.

Having the entire review process on GitHub means that everyone is accountable for what they say, and means that you can track exactly what everyone said about it in one place. No email chain hell with (mis)attached documents, accidental reply-alls or single replies. The whole internet is cc’d in on this discussion.

Being an rOpenSci initiative, the process is incredibly friendly and respectful of everyone involved. Comments are upbeat, but also, importantly, thorough, providing constructive feedback.

So what does visdat look like?

library(visdat)

vis_dat(airquality)

visdat-example

This shows us a visual analogue of our data: the variable names are shown along the top, and the class of each variable is shown, along with where data are missing.

You can focus in on missing data with vis_miss()

vis_miss(airquality)

vis-miss-example

This shows only missing and present information in the data. In addition to what vis_dat() shows, it displays the percentage of missing data for each variable and the overall amount of missing data. vis_miss() will also indicate when a dataset has no missing data at all, or only a very small percentage.
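
For instance (my own tiny example, not from the original post), running it on a complete dataset like mtcars shows the no-missing-data case:

library(visdat)
vis_miss(mtcars) # reports 0% missing, overall and per variable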

The future of visdat

There are some really exciting changes coming up for visdat. The first is making a plotly version of all of the figures, providing useful tooltips and interactivity. The second, to bring in later down the track, is to include the idea of visualising expectations, where the user can search their data for particular things – particular characters like “~”, values like -99 or -0, or conditions such as “x > 101” – and visualise them. A third idea is to make it easy to visually compare two dataframes of differing size. We also want to work on providing consistent palettes for particular datatypes. For example, character, numerics, integers, and datetime would all have different (and consistently different) colours.

I am very interested to hear how people use visdat in their work, so if you have suggestions or feedback I would love to hear from you! The best way to leave feedback is by filing an issue, or perhaps sending me an email at nicholas [dot] tierney [at] gmail [dot] com.

The future of your R package?

If you have an R package, you should give some serious thought to submitting it to rOpenSci through their onboarding process. There are very clear guidelines on their onboarding GitHub page. If you aren’t sure about package fit, you can submit a pre-submission enquiry – the editors are nice and friendly, and a positive experience awaits you!


  1. CRAN is an essential part of what makes the R project successful, and certainly without CRAN, R simply would not be the language that it is today. The tasks provided by the rOpenSci onboarding require human hours, and there just isn’t enough spare time and energy amongst CRAN managers.

  2. Never used GitHub? Don’t worry, creating an account is easy, and the template is all there for you. You provide very straightforward information, and it’s all there at once.

  3. With some journals, the submission process means you aren’t always clear what information you need ahead of time. Gotchas include things like “what is the residential address of every co-author”, or getting everyone to sign a copyright notice.


How to Create an Online Choice Simulator

By Tim Bock

Choice Simulator

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Choice model simulator

What is a choice simulator?

A choice simulator is an online app or an Excel workbook that allows users to specify different scenarios and get predictions. Here is an example of a choice simulator.

Choice simulators have many names: decision support systems, market simulators, preference simulators, desktop simulators, conjoint simulators, and choice model simulators.


How to create a choice simulator

In this post, I show how to create an online choice simulator, with the calculations done in R and the simulator hosted in Displayr.


Step 1: Import the model(s) results

First of all, choice simulators are based on models. So, the first step in building a choice simulator is to obtain the model results that are to be used in the simulator. For example, here I use respondent-level parameters from a latent class model, but there are many other types of data that could have been used (e.g., parameters from a GLM, draws from the posterior distribution, beta draws from a maximum simulated likelihood model).

If practical, it is usually a good idea to have model results at the case level (e.g., respondent level), as the resulting simulator can then easily be weighted and/or filtered automatically. If you have case-level data, the model results should be imported into Displayr as a Data Set. See Introduction to Displayr 2: Getting your data into Displayr for an overview of ways of getting data into Displayr.

The table below shows estimated parameters of respondents from a discrete choice experiment of the market for eggs. You can work your way through the choice simulator example used in this post here (the link will first take you to a login page in Displayr and then to a document that contains the data in the variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM).


Step 2: Simplify calculations using variable sets

Variable sets are a novel and very useful aspect of Displayr. Variable sets are related variables that are grouped. We can simplify the calculations of a choice simulator by using the variable sets, with one variable set for each attribute.

Variable sets in data tree

In this step, we group the variables for each attribute into separate variable sets, so that they appear as shown on the right. This is done as follows:

  1. If the variables are already grouped into a variable set, select the variable set, and select Data Manipulation > Split (Variables). In the dataset that I am using, all the variables I need for my calculation are already grouped into a single variable set called Individual-Level Parameter Means for Segments 26-Jun-17 9:01:57 AM, so I click on this and split it.
  2. Next, select the first attribute’s variables. In my example, this is the four variables that start with Weight:, each of which represents the respondent-level parameters for different egg weights. (The first of these contains only 0s, as dummy coding was used.)
  3. Then, go to Data Manipulation > Combine (Variables).
  4. Next set the Label for the new variable set to something appropriate. For reasons that will become clearer below, it is preferable to set it to a single, short word. For example, Weight.
  5. Set the Label field for each of the variables to whatever label you plan to show in the choice simulator. For example, if you want people to be able to choose an egg weight of 55g (about 2 ounces), set the Label to 55g.
  6. Finally, repeat this process for all the attributes. If you have any numeric attributes, then leave these as a single variable, like Price in the example here.

Step 3: Create the controls

Choice model simulation inputs

In my choice simulator, I have separate columns of controls (i.e., combo boxes) for each of the brands. The fast way to do this is to first create them for the first alternative (column), and then copy and paste them:

  1. Insert > Control (More).
  2. Type the levels, separated by semi-colons, into the Item list. These must match, exactly, to the labels that you have entered for the Labels for the first attribute in point 5 in the previous step. For example: 55g; 60g; 65g; 70g. I recommend using copy and paste because if you make some typos they will be difficult to track down. Where you have a numeric attribute, such as Price in the example, you enter the range of values that you wish the user to be able to choose from (e.g., 1.50; 2.00; 2.50; 3.00; 3.50; 4.00; 4.50; 5.00).
  3. Select the Properties tab in the Object Inspector and set the Name of the control to whatever you set as the Label for the corresponding variable set with the number 1 affixed at the end. For example, Weight.1 (You can use any label, but following this convention will save you time later on.)
  4. Click on the control and select the first level. For example, 55g.
  5. Repeat these steps until you have created controls for each of the attributes, each under each other, as shown above.
  6. Select all the controls that you have created, and then select Home > Copy and Home > Paste, and move the new set of labels to the right of the previous labels. Repeat this for as many sets of alternatives as you wish to include. In my example, there are four alternatives.
  7. Finally, add labels for the brands and attributes: Insert > TextBox (Text and Images).

See also Adding a Combo Box to a Displayr Dashboard for an intro to creating combo boxes.


Step 4: Calculate preference shares

  1. Insert an R Output (Insert > R Output (Analysis)), setting it to Automatic with the appropriate code, and positioning it underneath the first column of combo boxes. Press the Calculate button, and it should calculate the share for the first alternative. If you paste the code below, and everything is set up properly, you will get a value of 25%.
  2. Now, click on the R Output you just created, and copy-and-paste it. Position the new version immediately below the second column of combo boxes.
  3. Modify the very last line of code, replacing [1] with [2], which tells it to show the results of the second alternative.
  4. Repeat steps 2 and 3 for alternatives 3 and 4.

The code below can easily be modified for other models. A few key aspects of the code:

  • It works with four alternatives and is readily modified to deal with different numbers of alternatives.
  • The formulas for the utility of each alternative are expressed as simple mathematical expressions. Because I was careful with the naming of the variable sets and the controls, they are easy to read. If you are using Displayr, you can hover over the various elements of the formula and you will get a preview of their data.
  • The code is already set up to deal with weights. Just click on the R Output that contains the formula and apply a weight (Home > Weight).
  • It is set up to automatically deal with any filters. More about this below.
R Code to paste:
# Computing the utility for each alternative
u1 = Weight[, Weight.1] + Organic[, Organic.1] + Charity[, Charity.1] + Quality[, Quality.1] + Uniformity[, Uniformity.1] + Feed[, Feed.1] + Price*as.numeric(gsub("\\$", "", Price.1))
u2 = Weight[, Weight.2] + Organic[, Organic.2] + Charity[, Charity.2] + Quality[, Quality.2] + Uniformity[, Uniformity.2] + Feed[, Feed.2] + Price*as.numeric(gsub("\\$", "", Price.2))
u3 = Weight[, Weight.3] + Organic[, Organic.3] + Charity[, Charity.3] + Quality[, Quality.3] + Uniformity[, Uniformity.3] + Feed[, Feed.3] + Price*as.numeric(gsub("\\$", "", Price.3))
u4 = Weight[, Weight.4] + Organic[, Organic.4] + Charity[, Charity.4] + Quality[, Quality.4] + Uniformity[, Uniformity.4] + Feed[, Feed.4] + Price*as.numeric(gsub("\\$", "", Price.4))
# Computing preference shares
utilities = as.matrix(cbind(u1, u2, u3, u4))
eutilities = exp(utilities)
shares = prop.table(eutilities, 1)
# Filtering the shares, if a filter is applied.
shares = shares[QFilter, ]
# Filtering the weight variable, if required.
weight = if (is.null(QPopulationWeight)) rep(1, length(u1)) else QPopulationWeight
weight = weight[QFilter]
# Computing shares for the total sample
shares = sweep(shares, 1, weight, "*")
shares = as.matrix(apply(shares, 2, sum))
shares = 100 * prop.table(shares, 2)[1]
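
To see what the share calculation does in isolation, here is a tiny standalone sketch (made-up utilities, no weights or filters) of the multinomial-logit shares that the code above computes:

# two respondents x four alternatives of made-up utilities
utilities <- matrix(c(1.0, 0.2, -0.5, 0.3,
                      0.4, 0.8,  0.1, -1.0),
                    nrow = 2, byrow = TRUE)

shares <- prop.table(exp(utilities), 1) # each respondent's shares sum to 1
round(100 * colMeans(shares), 1)        # average over respondents = predicted shares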

Step 5: Make it pretty

If you wish, you can make your choice simulator prettier. The R Outputs and the controls all have formatting options. In my example, I got our designer, Nat, to create the pretty background screen, which she did in Photoshop, and then added using Insert > Image.


Step 6: Add filters

If you have stored the data as variable sets, you can quickly create filters. Note that the calculations will automatically update when the viewer selects the filters.


Step 7: Share

To share the dashboard, go to the Export tab in the ribbon (at the top of the screen), and click on the black triangle under the Web Page button. Next, check the option for Hide page navigation on exported page and then click Export… and follow the prompts.

Note, the URL for the choice simulator I am using in this example is https://app.displayr.com/Dashboard?id=21043f64-45d0-47af-9797-cd4180805849. This URL is public. You cannot guess or find this link by web-searching for security reasons. If, however, you give the URL to someone, then they can access the document. Alternatively, if you have an annual Displayr account, you can instead go into Settings for the document (the cog at the top-right of the screen) and press Disable Public URL. This will limit access to only people who are set up as users for your organization. You can set up people as users in the company’s Settings, accessible by clicking on the cog at the top-right of the screen. If you don’t see these settings, contact support@displayr.com to buy a license.


Worked example of a choice simulator

You can see the choice simulator in View Mode here (as an end-user will see it), or you can create your own choice simulator here (first log into Displayr and then edit or modify a copy of the document used to create this post).
