asymptotically exact inference in likelihood-free models [a reply from the authors]

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

[Following my post of lastTuesday, Matt Graham commented on the paper with force détails. Here are those comments. A nicer HTML version of the Markdown reply below is also available on Github.]

Thanks for the comments on the paper!

A few additional replies to augment what Amos wrote:

This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step.

The method is definitely computationally expensive. If the constraint function is of the form of a function from an M-dimensional space to an N-dimensional space, with MN, for large N the dominant costs at each timestep are usually the constraint Jacobian (c/u) evaluation (with reverse-mode automatic differentiation this can be evaluated at a cost of O(N) generator / constraint evaluations) and Cholesky decomposition of the Jacobian product (c/u)(c/u) with O(N³) cost (though in many cases e.g. i.i.d. or Markovian simulated data, structure in the generator Jacobian can be exploited to give a significantly reduced cost). Each inner Quasi-Newton update involves a pair of triangular solve operations which have a O(N²) cost, two matrix-vector multiplications with O(MN) cost, and a single constraint / generator function evaluation; the number of Quasi-Newton updates required for convergence in the numerical experiments tended to be much less than N hence the Quasi-Newton iteration tended not to be the main cost.

The high computation cost per update is traded off however with often being able to make much larger proposed moves in high-dimensional state spaces with a high chance of acceptance compared to ABC MCMC approaches. Even in the relatively small Lotka-Volterra example we provide which has an input dimension of 104 (four inputs which map to ‘parameters’, and 100 inputs which map to ‘noise’ variables), the ABC MCMC chains using the coarse ABC kernel radius ϵ=100 with comparably very cheap updates were significantly less efficient in terms of effective sample size / computation time than the proposed constrained HMC approach. This was in large part due to the elliptical slice sampling updates in the ABC MCMC chains generally collapsing down to very small moves even for this relatively coarse ϵ. Performance was even worse using non-adaptive ABC MCMC methods and for smaller ϵ, and for higher input dimensions (e.g. using a longer sequence with correspondingly more random inputs) the comparison becomes even more favourable for the constrained HMC approach.

A lot of the improvement seems to be coming from using gradient information: running standard HMC on the inputs u with a Gaussian ABC kernel which is equivalent to our baseline in the pose and MNIST experiments or the method proposed in the recent Pseudo-marginal Hamiltonian Monte Carlo by Lindsten and Doucet, seems to give comparable performance when normalised for computation time for moderate ϵ. Although the standard HMC updates are much cheaper to compute, the large difference in scale between the change in density in directions normal to the (soft) constraint and in the tangent space of the constraint manifold mean a small step-size needs to be used for the standard HMC updates to maintain reasonable accept rates (with in general the manifold being non-linear and so we cannot adjust for the scale differences by a simple linear transformation / non-identity mass matrix) which means that despite the much cheaper updates standard HMC tends to give similar effective samples per computation time (and as ϵ becomes smaller this approach becomes increasingly less efficient compared to the constrained HMC method).

I also find it surprising that this projection step does not jeopardise the stationary distribution of the process, as the argument found therein about the approximation of the approximation is not particularly deep.

The overall simulated constrained dynamic including the projection step is symplectic on the constraint manifold (as shown in Symplectic numerical integrators in constrained Hamiltonian systems by Leimkuhler and Reich) and time reversible, providing the projection iteration is run to convergence and an ‘appropriate’ step size is chosen (i.e. sufficiently small compared to the curvature of the manifold that there is guaranteed to be a solution to the projection on to the manifold and that there is only one close by solution to the projection step that will be converge to). The issues with requiring an appropriate step size to ensure a solution to the non-linear equations being solved exists and is locally unique are the same as in the implicit integrators used in Riemannian-manifold HMC methods.

In practice we found the use of the geodesic integration scheme which splits up the unforced motion on the constrained manifold in to multiple smaller steps for each outer forced step helped in allowing an appropriate step-size for the curvature of the manifold to be chosen independently to that appropriate for the change in the density. Providing an appropriately small step size was used non-convergence was very rarely an issue and generally only in the initial updates where the geometry of the constraint manifold might be expected to be non-typical of that seen after warm-up.

But the main thing that remains unclear to me after reading the paper is how the constraint that the pseudo-data be equal to the observable data can be turned into a closed form condition like g⁰(u)=0.

For concreteness I’ll assume a parameter inference problem with parameters θ generated from some prior distribution given a vector of standard variates u¹ i.e. θ=ρ(u¹) where ρ is a deterministic function.

Further let f(θ,u²) be our simulator function which given access to a further vector of variates from a random number generator u² and parameters θ=ρ(u¹) can produce simulated data x=f(ρ(u¹),u²).

If we have observed data x˘ then the constraint that x=x˘ can be written f(ρ(u¹),u²)x˘=0, so we would have c(u)=f(ρ(u¹),u²)x˘=0 where u=[u¹;u²].

As mentioned above, the authors assume a generative model based on uniform (or other simple) random inputs but this representation seems impossible to achieve in reasonably complex settings.

The representation requirements can be split in to two components:

  1. We can express our simulator model in the form y=g(u)
  2. g is differentiable with respect to u

The assumed differentiability of the generator is definitely a strong restriction, and does limit the models which this can be applied to.

I’d argue that its quite rare that in the context of simulator type models that the first assumption isn’t valid (though this may be a somewhat circular argument as it could just be what I’m defining as a simulator model is something which it applies to!).

All (1) requires is that in the simulator code all ‘randomness’ is introduced by drawing random variates from a (pseudo-)random number generator where the density of the drawn variates is known (up to some potentially unknown normalisation constant). More explicitly for the parameter inference setting above if we can write our generator function in the form

def generator(rng):
  params = generate_from_prior(rng)
  simulated_data = simulator(rng, params)
  return [params, simulated_data]

where rng is a pseudo-random number generator object allowing independent samples to be generated from standard densities then this assumption holds.

For (1) alone to hold the code in this function can be completely arbitrary . This could be code numerically integrating a set of stochastic differential equations, a graphics rendering pipeline or a learned parametric ‘neural network’ type model (with the three experiments in our paper providing toy examples of each of these).

There are degrees of freedom in what to consider the random inputs. I believe pseudo-random number generator implementations generally generate continuous random variates from just a pseudo-random uniform primitive (which itself will be generated from a pseudo-random integer primitive), e.g. RandomKit which is used for the random number implementation in (amongst other libraries) NumPy provides a rk_double function to generate a double-precision pseudo-random uniform variate in [0, 1) (and a rk_gauss function to generate a pair of Gaussian random variates but this uses rk_double internally). So assuming u is an arbitrarily long vector of random uniform variates will often be sufficient for allowing the simulator to be expressed as in (1) as mentioned by Dennis Prangle in the comments of your recent post ‘rare events for ABC’.

In general our proposed method is actually more suited to using random inputs with unbounded support otherwise it is necessary to deal with reflections at the intersection of the constraint manifold and bounds of the support (which is possible while maintaining reversibility but a bit painful implementation wise), so for example it is better to use Gaussian variates directly than specify uniform inputs then transform to Gaussians. A uniform variate might be generated by for example setting the base density to the logistic distribution and then transforming through the logistic sigmoid. This problem of automatically transforming to unconstrained inputs has been well studied in for example Stan.

Returning to the limitations applied by assuming (2) i.e. differentiability: some of the transforms from uniform variates used to generate random variates in random number generator implementations are non-differentiable e.g. if using an accept / reject step, for example common methods for generating a Gamma variate. There are a couple of options here. We can often just use the density of the output of the transform itself e.g. a Gamma base density on the relevant ui; if the parameters of the variate distribution are themselves dependent on other random inputs we need to make sure to include this dependency, but its possible to track these dependencies automatically – again probabilistic programming frameworks like Stan do just this. In other cases we might be able to use alternative (potentially less computationally efficient) transformations that avoid the accept/reject step e.g. using the original Box-Muller transform versus the more efficient but rejection based polar variant, or use tricks such as in the recent Rejection Sampling Variational Inference by Nasesseth et al.

Filed under:

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Frequency and chi-square test for independence Exercises

By Hasan Imtiaz


(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this exercise, we cover some basics on frequency tables. We also briefly look at chi-square test for independence to find relationships of two variables. Before proceeding, it might be helpful to look over the help pages for the table, summary, and margin.table functions.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
use the attach() command to load the trees dataset in R

Exercise 2
Use the table() command with the arguments: trees$Height and trees$Volume. This will generate a two-way frequency table. Store this in variable mytable.

Exercise 3
If you are familiar with excel pivot tables, then you will know this function. Use the margin.table( ) function to get the Height frequencies summed over Volume

Exercise 4
Use the margin.table( ) function to get the Volume frequencies summed over Height.

Exercise 5
Now use the table() function again but using all the features of the trees dataset, that includes girth, height and volume. This will print out a multidimensional 3 way frequency table.

Exercise 6
Suppose you have a variable ‘a’ that stores a second sample of heights of trees.

a=c(70, 65, 63, 72, 80, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75, 74, 85, 8, 71, 63, 78, 80, 74, 72, 77, 81, 82, 80, 86, 80, 87)

Use the cbind() to add the a column to your trees dataset. Store the results back into trees.

Exercise 7
Now create a 2 way frequency table between Height and a as the arguments. Store this table in mytable_2.

Exercise 8

Use the margin.table() function again from Q3 and get Height frequencies summer over a. What differences do you observe from the results of Q3.

Exercise 9
Chi Square test for independance:
a)Print the results of the summary() function on mytable. Note the Chi Square test for independance results and P value
b)Print the results of the summary() function on mytable_2. Note the Chi Square test for independance results and P value.

Exercise 10
What did the chi square test for independance help you to see?

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Missing Values, Data Science and R

By Joseph Rickert

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Joseph Rickert

One great advantages of working in R is the quantity and sophistication of the statistical functions and techniques available. For example, R’s quantile() function allows you to select one of the nine different methods for computing quantiles. Who would have thought there could be so many ways to do something that seems to be so simple? The issue here is not unnecessary complication, but rather an appreciation of the nuances associated with inference problems gained over the last hundred years of modern statistical practice. Much of this knowledge is reflected in R. As I imagine it, if R’s functions were laid out like city streets there would be many broad avenues displaying commonly used algorithms, but these would flow into many thousands of small, rarely-visited alleys, containing treasures for those who seek them out.

My impression is that most data science is done on the avenues, but that data scientists could benefit from occasionally venturing deeper into the alleys of statistical practice. Statistics might be the least important part of data science but there are times when coming to grips with sound statistical inference is essential. Consider the problem of dealing with missing values for example. It is difficult to imagine any large, real-world data set that wouldn’t require a strategy for imputing missing values. The obvious first step in developing a strategy would be to form some ideas about why the data are missing. Gelman and Hill identify four different “missingness mechanisms”: (1) Missingness completely at random, (2) Missingness at random, (3) Missingness that depends on unobserved predictors, (4) Missingness that depends on the missing value itself; and they provide some advice on how to cope with each of them. In a large data set, there is no reason to believe that there is just one mechanism in play! Evaluating your data with respect to these categories and their combinations could require a frightening amount of exploratory work. So tools for looking at patterns in missing values are likely to be very helpful, even if using them requires sampling your data set.

The next step of deciding how to proceed in a statistically sound manner will likely pose considerable technical challenges, and opting for simple solutions may not be advisable, or even possible. Even with big data, ignoring observations with missing values could result in a catastrophic loss of information, and simple approaches to imputation such as replacing missing the missing values of each variable with the variable mean, or a common value, will produce a “completed” data set reflecting an artificial amount of certainty that will likely underestimate the variability of the data and bias results. R can’t eliminate the hard work, but it does provide a formidable array of missing value imputation tools that can get you in and out of the statistical alleys and help you to decide what techniques might be useful.

Before evaluating the various methods, it is helpful to make a couple of distinctions about imputation methods for multivariate data. The first is between single and multiple imputation. In single imputation, a particular algorithm or technique is used to produce a single estimate for each missing value. Multiple imputation methods, on the other hand, develop distributions for missing values and estimate individual missing values by drawing from this distribution. In general, these algorithms proceed in three steps. In the first imputation step, multiple draws are made from the distribution of missing values for each variable. This process results in several “completed” data sets where each missing value has been estimated by a plausible value. In the second step, the intended analysis is performed on each completed data set. In the last step, a “pooling”algorithm estimates the final values for the statistics of interest.

The second distinction is between Joint Modeling (JM) and Fully Conditional Specification (FCS) imputing. In JM, the data are assumed to follow a multivariate parametric distribution of some sort. This is theoretically sound but may not be flexible enough to adequately model the data. In FCS, the multivariate data model is specified by developing a separate conditional model for each variable with missing values.

Here is a short list of R packages for missing value imputation. I have selected these to give some idea of the variety of tools available.

Amelia implements the Amelia II algorithm which assumes that the complete data set (missing and observed data) are multivariate normal. Imputations are done via the EMB (expectation-maximization with bootstrapping) algorithm. The JSS paper describes a strategy for combining the models resulting from each imputed data set. The Amelia vignette contains examples.

BaBoon provides two variants of the the Bayesian Bootstrap predictive mean matching to impute multiple missing values. Originally developed for survey data, the imputation algorithms are described as being robust with respect to imputation model misspecification. The best description and rationale for the algorithms seems to be the PhD thesis of one of the package authors.

Hmisc contains several functions that are helpful for missing value imputation including agreImpute(), impute() and transcan(). Documentation on Hmisc can be found here.

mi takes a Bayesian approach to imputing missing values. The imputation algorithm runs multiple MCMC chains to iteratively draw imputed values from conditional distributions of observed and imputed data. In addition to imputation algorithm, the package contains functions for visualizing the pattern of missing values in a data set and assessing the convergence of the MCMC chains. A vignette shows a worked example and the associated JSS paper delves deeper into the theory and the mechanics of using the method.

mice which is an acronym for multivariate imputation of chained equations, formalizes the multiple implementation process outline above and is probably the gold standard for FCS multiple imputation. Package features include:

  • Columnwise specification of the imputation model
  • Support for arbitrary patterns of missing data
  • Passive imputation techniques that maintain consistency among data transformations
  • Subset selection of predictors
  • Support of arbitrary complete-data methods
  • Support pooling various types of statistics
  • Diagnostics for imputations
  • Callable user-written imputation functions

The JSS paper describes how the desire to provide a separate imputation model for each variable led to the development of the chained equation technique where a Gibbs sampler fills out the missing data. miceAdds provides additional functions to be used with mice including plausible value imputation, multilevel imputation functions, imputation using partial least squares (PLS) for high dimensional predictors, nested multiple imputation, and two-way imputation.

missingDataGUI implements a nice graphical interface for exploring missing data patterns with numeric and graphical summaries for numeric and categorical missing values and implements a number of imputation methods. The figure below shows the missing value map for the HouseVotes84 data set in the mlbench package.

missMDA performs principal component methods on incomplete data sets obtaining scores, loadings and graphical representations despite missing values. The package also includes functions to perform single and multiple imputation. The JSS paper provides the details.

VIM provides tools for visualizing missing or imputed values. Before imputation they can be used to study the pattern of missing values, afterwards these same tools can be used as diagnostics. VIMGUI puts a front end on the VIM functions and helps with handling the plot and imputation functions. The vignette is thorough and provides some nice examples of how one might look at missing value distributions.

vtreat provides tools to assist in the statistically sound preparation of data sets. It is not a package explicitly devoted to missing value imputation, but it can produce “cleaned” data sets that have no “Infinite/NA/NaN in the effective variable columns”. I include it here to emphasize that proper data preparation can simplify the missing value problem. The package has several vignettes.

yaImpute takes what might be thought of as a machine learning approach to imputing missing values by using the k-nearest neighbor (kNN) algorithm to impute missing values. The JSS paper covers the theory and explains the package using a forestry application.

For additional R packages see Stef van Buuren’s webpage cataloging software for imputation which lists twelve R packages that implement some method of single imputation and eighteen R packages concerned with multiple imputation and provides a brief explanation of each of these. Also, have a look on the post on that provides informative short write ups on amelia, Hmisc, mi, mice and missForest that include some sample code.

I also recommend looking at Stef van Buuren presentation on the history and future of Fully Conditional Specification from the 2015 Rennes missing value conference and Julie Josse’s tutorial at useR! 2016.

Even with the best of tools there is no doubt that dealing with missing values in a statistically sound manner is difficult. However, this is the that kind of work that helps to put the “science” in data science, and R is the most helpful environment you are likely to find.

To leave a comment for the author, please follow the link and comment on their blog: RStudio. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Jupyter And R Markdown: Notebooks With R

By DataCamp Blog

Compare Jupyter With R Markdown

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

When working on data science problems, you might want to set up an interactive environment to work and share your code for a project with others. You can easily set this up with a notebook.

In other cases, you’ll just want to communicate about the workflow and the results that you have gathered for the analysis of your data science problem. For a transparent and reproducible report, a notebook can also come in handy.

That’s right; notebooks are perfect for situations where you want to combine plain text with rich text elements such as graphics, calculations, etc.

The topic of today’s blog post focuses on the two notebooks that are popular with R users, namely, the Jupyter Notebook and, even though it’s still quite new, the R Markdown Notebook. You’ll discover how to use these notebooks, how they compare to one another and what other alternatives exist.

R And The Jupyter Notebook

Contrary to what you might think, Jupyter doesn’t limit you to working solely with Python: the notebook application is language agnostic, which means that you can also work with other languages.

There are two general ways to get started on using R with Jupyter: by using a kernel or by setting up an R environment that has all the essential tools to get started on doing data science.

Running R in Jupyter With The R Kernel

As described above, the first way to run R is by using a kernel. If you want to have a complete list of all the available kernels in Jupyter, go here.

To work with R, you’ll need to load the IRKernel and activate it to get started on working with R in the notebook environment.

First, you’ll need to install some packages. Make sure that you don’t do this in your RStudio console, but in a regular R terminal, otherwise you’ll get an error like this:

Error in IRkernel::installspec() :
Jupyter or IPython 3.0 has to be installed but could neither run “jupyter” nor “ipython”, “ipython2” or “ipython3”.
(Note that “ipython2” is just IPython for Python 2, but still may be IPython 3.0)
$ R

> install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest'))

This command will prompt you to type in a number to select a CRAN mirror to install the necessary packages. Enter a number and the installation will continue.

> devtools::install_github('IRkernel/IRkernel')

Then, you still need to make the R kernel visible for Jupyter:

# Install IRKernel for the current user

> IRkernel::installspec()

# Or install IRKernel system-wide

> IRkernel::installspec(user = FALSE)

Now open up the notebook application with jupyter notebook. You’ll see R appearing in the list of kernels when you create a new notebook.

Using An R Essentials Environment In Jupyter

The second option to quickly work with R is to install the R essentials in your current environment:

conda install -c r r-essentials

These “essentials” include the packages dplyr, shiny, ggplot2, tidyr, caret, and nnet. If you don’t want to install the essentials in your current environment, you can use the following command to create a new environment just for the R essentials:

conda create -n my-r-env -c r r-essentials

Now open up the notebook application to start working with R.

You might wonder what you need to do if you want to install additional packages to elaborate your data science project. After all, these packages might be enough to get you started, but you might need other tools.

Well, you can either build a Conda R package by running, for example:

conda skeleton cran ldavis conda build r-ldavis/

Or you can install the package from inside of R via install.packages() or devtools::install_github (to install packages from GitHub). You just have to make sure to add the new package to the correct R library used by Jupyter:

install.packages("ldavis", "/home/user/anaconda3/lib/R/library")

If you want to know more about kernels or about running R in a Docker environment, check out this page.

Adding Some R Magic To Jupyter

A huge advantage of working with notebooks is that they provide you with an interactive environment. That interactivity comes mainly from the so-called “magic commands”.

These commands allow you to switch from Python to command line instructions or to write code in another language such as R, Julia, Scala, …

To switch from Python to R, you first need to download the following package:

%load_ext rpy2.ipython

After that, you can get started with R, or you can easily switch from Python to R in your data analysis with the %R magic command.

Let’s demonstrate how the R magic works with a small example:

# Hide warnings if there are any
import warnings
# Load in the r magic
%load_ext rpy2.ipython
# We need ggplot2
%R require(ggplot2)
# Load in the pandas library
import pandas as pd 
# Make a pandas DataFrame
df = pd.DataFrame({'Alphabet': ['a', 'b', 'c', 'd','e', 'f', 'g', 'h','i'],
                   'A': [4, 3, 5, 2, 1, 7, 7, 5, 9],
                   'B': [0, 4, 3, 6, 7, 10,11, 9, 13],
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
# Take the name of input variable df and assign it to an R variable of the same name
%%R -i df
# Plot the DataFrame df
ggplot(data=df) + geom_point(aes(x=A, y=B, color=C))

If you want more details about Jupyter, on how to set up a notebook, where to download the application, how you can run the notebook application (via Docker, pip install or with the Anaconda distribution) or other details, check out our Definitive Guide.

The R Notebook

Up until recently, Jupyter seems to have been a popular solution for R users, next to notebooks such as Apache Zeppelin or Beaker.

Also, other alternatives to report results of data analyses, such as R Markdown, Knitr or Sweave, have been hugely popular in the R community.

However, this might change with the recent release of the R or R Markdown Notebook by RStudio.

You see it: the context of the R Markdown Notebook is complex, and it’s worth looking into the history of reproducible research in R to understand what drove the creation and development of this notebook. Ultimately, you will also realize that this notebook is different from others.

R And The History of Reproducible Research

In his talk, J.J Allaire, confirms that the efforts in R itself for reproducible research, the efforts of Emacs to combine text code and input, the Pandoc, Markdown and knitr projects, and computational notebooks have been evolving in parallel and influencing each other for a lot of years. He confirms that all of these factors have eventually led to the creation and development of notebooks for R.

Firstly, computational notebooks have quite a history: since the late 80s, when Mathematica’s front end was released, there have been a lot of advancements. In 2001, Fernando Pérez started developing IPython, but only in 2011 the team released the 0.12 version of IPython was realized. The SageMath project began in 2004. After that, there have been many notebooks. The most notable ones for the data science community are the Beaker (2013), Jupyter (2014) and Apache Zeppelin (2015).

Then, there are also the markup languages and text editors that have influenced the creation of RStudio’s notebook application, namely, Emacs, Markdown, and Pandoc. Org-mode was released in 2003. It’s an editing and organizing mode for notes, planning and authoring in the free software text editor Emacs. Six years later, Emacs org-R was there to provide support for R users. Markdown, on the other hand, was released in 2004 as a markup language that allows you to format your plain text in such a way that it can be converted to HTML or other formats. Fast forward another couple of years, and Pandoc was released. It’s a writing tool and as a basis for publishing workflows.

Lastly, the efforts of the R community to make sure that research can be reproducible and transparent have also contributed to the rise of a notebook for R. 2002, Sweave was introduced in 2002 to allow the embedding of R code within LaTeX documents to generate PDF files. These pdf files combined the narrative and analysis, graphics, code, and the results of computations. Ten years later, knitr was developed to solve long-standing problems in Sweave and to combine features that were present in other add-on packages into one single package. It’s a transparent engine for dynamic report generation in R. Knitr allows any input languages and any output markup languages.

Also in 2012, R Markdown was created as a variant of Markdown that can embed R code chunks and that can be used with knitr to create reproducible web-based reports. The big advantage was and still is that it isn’t necessary anymore to use LaTex, which has a learning curve to learn and use. The syntax of R Markdown is very similar to the regular Markdown syntax but does have some tweaks to it, as you can include, for example, LaTex equations.

R Markdown Versus Computational Notebooks

R Markdown is probably one of the most popular options in the R community to report on data analyses. It’s no surprise whatsoever that it is still a core component in the R Markdown Notebook.

And there are some things that R Markdown and notebooks share, such as the delivering of a reproducible workflow, the weaving of code, output, and text together in a single document, supporting interactive widgets and outputting to multiple formats. However, they differ in their emphases: R Markdown focuses on reproducible batch execution, plain text representation, version control, production output and offers the same editor and tools that you use for R scripts.

On the other hand, the traditional computational notebooks focus on outputting inline with code, caching the output across sessions, sharing code and outputting in a single file. Notebooks have an emphasis on an interactive execution model. They don’t use a plain text representation, but a structured data representation, such as JSON.

That all explains the purpose of RStudio’s notebook application: it combines all the advantages of R Markdown with the good things that computational notebooks have to offer.

That’s why R Markdown is a core component of the R Markdown Notebook: RStudio defines its notebook as “an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input”.

How To Work With R Notebooks

If you’ve ever worked with Jupyter or any other computational notebook, you’ll see that the workflow is very similar. One thing that might seem very different is the fact that now you’re not working with code cells anymore by default: you’re rather working with a sort of text editor in which you indicate your code chunks with R Markdown.

How To Install And Use The R Markdown Notebook

The first requirement to use the notebook is that you have the newest version of RStudio available on your PC. Since notebooks are a new feature of RStudio, they are only available in version 1.0 or higher of RStudio. So, it’s important to check if you have a correct version installed.

If you don’t have version 1.0 or higher of RStudio, you can download the latest version here.

Then, to make a new notebook, you go to File tab, select“New File”, and you’ll see the option to create a new R Markdown Notebook. If RStudio prompts you to update some packages, just accept the offer and eventually a new file will appear.

Tip: double-check whether you’re working with a notebook by looking at the top of your document. The output should be html_notebook.

You’ll see that the default text that appears in the document is in R Markdown. R Markdown should feel pretty familiar to you, but if you’re not yet quite proficient, you can always check out our Reporting With R Markdown course or go through the material that is provided by RStudio.

Note that you can always use the gear icon to adjust the notebook’s working space: you have the option to expand, collapse, and remove the output of your code, to change the preview options and to modify the output options.

This last option can come in handy if you want to change the syntax highlighting, apply another theme, adjust the default width and height of the figures appearing in your output, etc.

From there onwards, you can start inserting code chunks and text!

You can add code chunks in two ways: through the keyboard shortcut Ctrl + Alt + I or Cmd + Option + I, or with the insert button that you find in the toolbar.

What’s great about working with these R Markdown notebooks is the fact that you can follow up on the execution of your code chunks, thanks to the little green bar that appears on the left when you’re executing large code chunks or multiple code chunks at once. Also, note that there’s a progress bar on the bottom.

You can see the green progress bar appearing in the gif below:

Using the R Notebook

Talking about code execution: there are multiple ways in which you can execute your R code chunks.

You can run a code chunk or run the next chunk, run all code chunks below and above; but you can also choose to restart R and run all chunks or to restart and to clear the output.

Note that when you execute the notebook’s code, you will also see the output appearing on your console! That might be a rather big difference for those who usually work with other computational notebooks such as Jupyter.

If there are any errors while the notebook’s code chunks are being executed, the execution will stop, and there will appear a red bar alongside the code piece that produces the error.

You can suppress the halt of the execution by adding errors = TRUE in the chunk options, just like this:

```{r, error=TRUE}
iris <- read.csv(url(""), header = FALSE)
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

Note that the error will still appear, but that the notebook’s code execution won’t be halted!

How To Use R Markdown Notebook’s Magic

Just like with Jupyter, you can also work interactively with your R Markdown notebooks. It works a bit differently from Jupyter, as there are no real magic commands; To work with other languages, you need to add separate Bash, Stan, Python, SQL or Rcpp chunks to the notebook.

These options might seem quite limited to you, but it’s compensated in the ease with which you can easily add these types of code chunks with the toolbar’s insert button.

Also working with these code chunks is easy: you can see an example of SQL chunks in this document, published by J.J Allaire. For Bash commands, you just type the command. There’s no need extra characters such as ‘!‘ to signal that you’re working in Bash, like you would do when you would work with Jupyter.

How To Output Your R Markdown Notebooks

Before you render the final version of a notebook, you might want to preview what you have been doing. There’s a handy feature that allows you to do this: you’ll find it in your toolbar.

Click on the “preview” button and the provisional version of your document will pop up on the right-hand side, in the “Viewer” tab.

By adding some lines to the first section on top of the notebook, you can adjust your output options, like this:

title: "Notebook with KNN Example"
    highlight: tango
    toc: yes
    toc: yes

To see where you can get those distributions, you can just try to knit, and the console output will give you the sites where you can download the necessary packages.

Note that this is just one of the many options that you have to export a notebook: there’s also the possibility to render GitHub documents, word documents, beamer presentation, etc. These are the output options that you already had with regular R Markdown files. You can find more info here.

Tips And Tricks To Work With R Notebook

Besides the general coding practices that you should keep in mind, such as documenting your code and applying a consistent naming scheme, code grouping and name length, you can also use the following tips to make a notebook awesome for others to use and read:

  • Just like with computational notebooks, it might be handy to split large code chunks or code chunks that generate more than one output into multiple chunks. This way, you will improve the general user experience and increase the transparency of a notebook.
  • Make use of the keyboard shortcuts to speed up your work. You will find most of them in the toolbar, next to the commands that you want to perform.
  • Use the spellchecker in the toolbar to make sure your report’s vocabulary is correct.
  • Take advantage of the option to hide your code if a notebook is code-heavy. You can do this through code chunk options or in the HTML file of the notebook itself!

The R Notebook Versus The Jupyter Notebook

Besides the differences between the Jupyter and R Markdown notebooks that you have already read above, there are some more things.

Let’s compare Jupyter with the R Markdown Notebook!

There are four aspects that you will find interesting to consider: notebook sharing, code execution, version control, and project management.

Notebook Sharing

The source code for an R Markdown notebook is an .Rmd file. But when you save a notebook, an .nb.html file is created alongside it. This HTML file is an associated file that includes a copy of the R Markdown source code and the generated output.

That means that you need no special viewer to see the file, while you might need it to view notebooks that were made with the Jupyter application, which are simple JSON documents, or other computational notebooks that have structured format outputs. You can publish your R Markdown notebook on any web server, GitHub or as an email attachment.

There also are APIs to render and parse R Markdown notebooks: this gives other frontend tools the ability to create notebook authoring modes for R Markdown. Or the APIs can be used to create conversion utilities to and from different notebook formats.

To share the notebooks you make in the Jupyter application, you can export the notebooks as slideshows, blogs, dashboards, etc. You can find more information in this tutorial. However, there are also the default options to generate Python scripts, HTML files, Markdown files, PDF files or reStructured Text files.

Code Execution

R Markdown Notebooks have options to run a code chunk or run the next chunk, run all code chunks below and above; In addition to these options, you can also choose to restart R and run all chunks or to restart and to clear the output.

These options are interesting when you’re working with R because the R Markdown Notebook allows all R code pieces to share the same environment. However, this can prove to be a huge disadvantage if you’re working with non-R code pieces, as these don’t share environments.

All in all, these code execution options add a considerable amount of flexibility for the users who have been struggling with the code execution options that Jupyter offers, even though if these are not too much different: in the Jupyter application, you have the option to run a single cell, to run cells and to run all cells. You can also choose to clear the current or all outputs. The code environment is shared between code cells.

Version control

There have been claims that Jupyter messes up the version control of notebooks or that it’s hard to use git with these notebooks. Solutions to this issue are to export the notebook as a script or to set up a filter to fix parts of the metadata that shouldn’t change when you commit or to strip the run count and output.

The R Markdown notebooks seem to make this issue a bit easier to handle, as they have associated HTML files that save the output of your code and the fact that the notebook files are essentially plain text files, version control will be much easier. You can choose to only put your .Rmd file on GitHub or your other versioning system, or you can also include the .nb.html file.

Project Management

As the R Markdown Notebook is native to the RStudio development kit, the notebooks will seamlessly integrate with your R projects. Also, these notebooks support other languages, including Python, C, and SQL.

On the other hand, the Jupyter project is not native to any development kit: in that sense, it will cost some effort to integrate this notebook seamlessly with your projects. But this notebook still supports more languages and will be a more suitable companion for you if you’re looking for use Scala, Apache Toree, Julia, or another language.

Alternatives to Jupyter or R Markdown Notebooks

Apart from the notebooks that you can use as interactive data science environments which make it easy for you to share your code with colleagues, peers, and friends, there are also other alternatives to consider.

Because sometimes you don’t need a notebook, but a dashboard, an interactive learning platform or a book, for example.

You have already read about options such as Sweave and Knitr in the second section. Some other options that are out there, are:

  • Even though this blog post has covered R Markdown to some extent, you should know that you can do so much more with it. For example, you can build dashboards with flexdashboard.
  • Or you can use Bookdown to quickly publish HTML, PDF, ePub, and Kindle books with R Markdown.
  • Shiny is a tool that you can also use to create dashboards. To get started with Shiny, go to this page.
  • In an educational setting, DataCamp Light might also come in handy to create interactive tutorials on your blog or website. If you want to see DataCamp light at work, go to this tutorial, for example.
To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Microsoft R Open 3.3.2 now available

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Open 3.3.2, Microsoft’s enhanced distribution of open source R, is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to version 3.3.2, adds new bundled packages and updates others, and upgrades the Intel Math Kernel Libraries.

The updated R 3.3.2 engine includes some performance improvements (particularly in calculation of eigenvalues), better handling of date axes in graphics, and improved documentation for the methods package. (See here for a complete list of fixes.) There are no user-visible changes in the language itself, which means that scripts and packages should work without changes from MRO 3.3.1.

This update to MRO comes bundled with some additional packages, notably jsonlite (for fast processing of JSON-formatted data), png (to support reading and writing of PNG images), R6 (allowing the creation of classes with reference semantice), and curl (for connecting to web-based data sources). The checkpoint and deployrRserve packages have also been updated.

The MKL libraries, which provide high-performance matrix and vector calculations to MRO, have also been updated. (This fixes some issues with matrix multiplication and arima that were reported.)

For reproducibility, installing packages with MRO 3.3.2 will by default get you package versions as of November 1, 2016. Many packages have been updated or released since MRO 3.3.1. (See here for some highlights of new packages.) As always, if you want to use newer versions of CRAN packages (or access older versions, for reproducibility), just use the checkpoint function to access the CRAN Time Machine.

MRO is supported on Windows, Mac and Linux. This release adds support for two new platforms: MacOS Sierra (10.12) and Ubuntu 16.04.

We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. To download Microsoft R Open (it’s free!), simply follow the link below.

MRAN: Download Microsoft R Open

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

satRday in Cape Town

By Andrew Collier


(This article was first published on R | Exegetic Analytics, and kindly contributed to R-bloggers)

The second satRday (and first satRday on African soil) will happen in Cape Town on 18 August 2017. It’s going to be a one day celebration of R.

We have a trio of phenomenal keynote speakers (Hilary Parker, Jenny Bryan and Julia Silge) who will be giving inspiring talks at the conference and also conducting workshops prior to the conference. There will be numerous other talks from local and international speakers covering a variety of topics relating to R.

Registration is open and early bird prices are available until 23 December 2016. Submit a talk proposal and you could join the lineup of R luminaries (in addition to getting a free ticket to the conference!).

The post satRday in Cape Town appeared first on Exegetic Analytics.

To leave a comment for the author, please follow the link and comment on their blog: R | Exegetic Analytics. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Package ggguitar on CRAN

By C

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

The initial release of ggguitar is available on CRAN! This package allows you to create guitar tablature in the style of ggplot2. As a quick example, the C Major chord shown above can be created as follows.

C_M ->
tablature(‘C Major’, C_M)

The tablature function takes the name of the chord for the first argument and a six element vector representing the six strings of the guitar. An NA indicates a string that is not played, zero represents an open string, and a number indicates a fret where a finger is to be placed on the specified string.

The “String” and “Fret” labels and corresponding axis ticks are optional.

tablature(‘C Major’, C_M, FALSE, FALSE)

See the package at CRAN, the vignette, or the code at Github for more information.

To leave a comment for the author, please follow the link and comment on their blog: R-Chart. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

BERT: a newcomer in the R Excel connection

By The R Trader


A few months ago a reader point me out this new way of connecting R and Excel. I don’t know for how long this has been around, but I never came across it and I’ve never seen any blog post or article about it. So I decided to write a post as the tool is really worth it and before anyone asks, I’m not related to the company in any way.

BERT stands for Basic Excel R Toolkit. It’s free (licensed under the GPL v2) and it has been developed by Structured Data LLC. At the time of writing the current version of BERT is 1.07. More information can be found here. From a more technical perspective, BERT is designed to support running R functions from Excel spreadsheet cells. In Excel terms, it’s for writing User-Defined Functions (UDFs) in R.

In this post I’m not going to show you how R and Excel interact via BERT. There are very good tutorials here, here and here. Instead I want to show you how I used BERT to build a “control tower” for my trading.

How do I use BERT?

My trading signals are generated using a long list of R files but I need the flexibility of Excel to display results quickly and efficiently. As shown above BERT can do this for me but I also want to tailor the application to my needs. By combining the power of XML, VBA, R and BERT I can create a good looking yet powerful application in the form of an Excel file with minimum VBA code. Ultimately I have a single Excel file gathering all the necessary tasks to manage my portfolio: database update, signal generation, orders submission etc… My approach could be broken down in the 3 steps below:

  1. Use XML to build user defined menus and buttons in an Excel file.
  2. The above menus and buttons are essentially calls to VBA functions.
  3. Those VBA functions are wrapup around R functions defined using BERT.

With this approach I can keep a clear distinction between the core of my code kept in R, SQL and Python and everything used to display and format results kept in Excel, VBA & XML. In the next sections I present the prerequisite to developed such an approach and a step by step guide that explains how BERT could be used for simply passing data from R to Excel with minimal VBA code.


1 – Download and install BERT from this link. Once the installation has completed you should have a new Add-Ins menu in Excel with the buttons as shown below. This is how BERT materialized in Excel.

2 – Download and install Custom UI editor: The Custom UI Editor allows to create user defined menus and buttons in Excel ribbon. A step by step procedure is available here.

Step by step guide

1 – R Code: The below R function is a very simple piece of code for illustration purposes only. It calculates and return the residuals from a linear regression. This is what we want to retrieve in Excel. Save this in a file called myRCode.R (any other name is fine) in a directory of your choice.

myFunction <- function(){
 aa <- rnorm(200)
 bb <- rnorm(200)
 res <- lm(aa~bb)$res

2 – functions.R in BERT: From Excel select Add-Ins -> Home Directory and open the file called functions.R. In this file paste the following code. Make sure you insert the correct path.


This is just sourcing into BERT the R file you created above. Then save and close the file functions.R. Should you want to make any change to the R file created in step 1 you will have to reload it using the BERT button “Reload Startup File” from the Add-Ins menu in Excel

3 – In Excel: Create and save a file called myFile.xslm (any other name is fine). This is a macro-enabled file that you save in the directory of your choice. Once the file is saved close it.

4 – Open the file created above in Custom UI editor: Once the file is open, paste the below code.

<customUI xmlns="">
 <ribbon startFromScratch="false">
 <tab id="RTrader" label="RTrader">
 <group id="myGroup" label="My Group">
 <button id="button1" label="New Button" size="large" onAction="myRCode" imageMso="Chart3DColumnChart" />

You should have something like this in the XML editor:

Essentially this piece of XML code creates an additional menu (RTrader), a new group (My Group) and a user defined button (New Button) in the Excel ribbon. Once you’re done, open myFile.xslm in Excel and close the Custom UI Editor. You should see something like this.


5 – Open VBA editor: In myFile.xlsm insert a new module. Paste the code below in the newly created module.

Sub myRCode(control As IRibbonControl)
   Dim a As Variant
   Dim theLength As Integer
   a = Application.Run("BERT.Call", "myFunction") 
   theLength = UBound(a, 1) + 1 
   ThisWorkbook.Sheets("Sheet1").Range("B1:B" & theLength).Value = a 
End Sub 

This erases previous results in the worksheet prior to coping new ones.

6 – Click New Button: Now go back to the spreadsheet and in the RTrader menu click the “New Button” button. You should see something like the below appearing.


You’re done!

The guide above is a very basic version of what can be achieved using BERT but it shows you how to combine the power of several specific tools to build your own custom application. From my perspective the interest of such an approach is the ability to glue together R and Excel obviously but also to include via XML (and batch) pieces of code from Python, SQL and more. This is exactly what I needed. Finally I would be curious to know if anyone has any experience with BERT?

Source:: R News

How to create a ggplot Theme – Unicorn Edition

By Florian Teschner

plot of chunk unnamed-chunk-2

Themes are an convenient way to give ggplot charts an individualized, sometimes stylish look. Most of the time, I rely on the ggthemes package and the Economist style. Last week colleagues asked me to change the look of my charts. We joked around and I agreed to create a unicorn ggplot theme. I want to use the challenge to detail a) how to create custom ggplot themes and b) look at unicorn startup data.

I took the Unicorn startup dataset, which contains startups with a valuation above 1 bil$.
For the theme, in the first step, I changed some basic parameters (coloring, no ticks axis.ticks=element_blank(), no legend title legend.title = element_blank() and the same background for legend elements as well as the legend background legend.key = element_rect(fill = "lightskyblue1", color = "lightskyblue1")). For the impatient reader; the complete code is at the end of the post.

ggplot(dd, aes(year, N, color=Industry, fill=Industry)) + geom_bar(stat="identity") +  xlab("") + ylab("Number of Unicorns") + theme_unicorn(base_size = 5)

Data-wise, we see that the number of unicorn valuations massively increased in recent years. Most unicorns are active in the area of ecommerce and marketplaces.
Theme-wise; while the chart looks colorful, there are some parts missing; font-style and appropriate coloring. In the chart above, ggplot defaulted to the standard color palette for the plotted elements.
In order to fix this, I got the “unicorn” color palette from here and created two functions to overwrite the color and fill functions. Additionally, a true unicorn chart needs a different font. Even though it is pretty old, Comic Sans might be appropriate here.

windowsFonts(F = windowsFont('Comic Sans MS'))
p <- ggplot(ddd[1:5,], aes(reorder(Industry, -m), m, color=Industry, fill=Industry)) + geom_point(size=4, shape="O")
p+ theme_unicorn(base_size = 8,font="F") +scale_color_discrete_unicorn()+scale_fill_unicorn() + ylab("Mean Valuation in Bil$")+ xlab("Industries")

plot of chunk unnamed-chunk-3

So, what’s missing to complete the unicorn theme? A real unicorn mixed with all the startups. Let’s add an png-image of an unicorn using annotation_custom(g, xmin=30, xmax=35, ymin=10, ymax=12). The annoying part of it is, that one needs to place the image somehow manually on top of the chart using the x,y parameters. I also tried to place it within a bar-chart with factors/categories along the x-axis but I have not been able to figure how to do that.

## adding an image
img <- readPNG(source = "image/unicorn.png")
g <- rasterGrob(img, interpolate=TRUE)
p <- ggplot(d4, aes(N, m)) + geom_point(size=2, shape=1, color="purple")
p<- p + theme_unicorn(base_size = 10,font="F")  +xlab("")+ ylab("Mean Valuation in Bil$") + xlab("Number of Unicorns") 
p <- p + annotation_custom(g, xmin=30, xmax=35, ymin=10, ymax=12) 
p <- p + annotate(geom="text", x=3.5, y=12, label="Transportation", color="lightblue" , family=F)
p <- p + annotate(geom="text", x=12, y=11.5, label="On Demand", color="lightblue")
p <- p + annotate(geom="text", x=34, y=4, label="Ecommerce", color="lightblue")

plot of chunk unnamed-chunk-4

While ggplot certainly has massively improved in terms of style and customizability, there are some parts which do not work 100% (or I was not able to get it working properly).
Font-family setting leads to warnings/errors (even though the chart appears). Additionally, the passing of the font-family breaks down. The chart above illustrates that, the axis labels are Comic Sans, the annotated text is not. I tried to set it directly in the annotate-function, but that would not work either.

Another issue is color-setting. Colors have to be defined differently depending on whether one works with a continuous scale or a discrete scale. Hence, they cannot be set within the “theme”, but need to be individually set depending on the chart. While that makes somehow sense, I assume a general custom color-setting function would be highly appreciated. Another encountered issue -when setting custom colors- is the number of colors to be defined in the discrete case. If you pass too many classes (e.g. over 20 industry types as in chart 1), you need to define at least 20+ colors. For the custom scale_color_manual(), it would be great if it would provide a fallback to pick n colors along the colors passed.

I hope you enjoyed the colorful unicorn. Please reach out if you have solutions to the problems mentioned above.
I hope I will find the time to create a customizable function in the form of: theme_custom(colorpalette=colors[1:5])

Source:: R News

Free online course: Analyzing big data with Microsoft R Server

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you’re already familiar with R, but struggling with out-of-memory or performance problems when attempting to analyze large data sets, you might want to check out this new EdX course, Analyzing Big Data with Microsoft R Server, presented by my colleague Seth Mottaghinejad. In the course, you’ll learn how to build models using the RevoScaleR package, and deploy those models to production environments like Spark and SQL Server. The course is self-paced with videos, tutorials and tests, and is free to audit.

(By the way, if you don’t already know R, you might want to check out the courses Introduction to R for Data Science and Programming in R for Data Science first.)

The RevoScaleR package isn’t available on CRAN: it’s included with Microsoft R Server and Microsoft R Client. You can download and use Microsoft R Client for free, which provides an installation of R with the RevoScaleR library built in and loaded when you start the session. An R IDE is also recommended: you can use R Tools for Visual Studio or RStudio.

The course is open now, and you can get started at EdX at the link below.

EdX: Analyzing Big Data with Microsoft R Server

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News