## Lattice exercises – part 1

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the exercises below we will use the lattice package. First, we have to install this package with install.packages("lattice") and then we will call it library(lattice) . The Lattice package permits us to create univariate, bivariate and trivariate plots. For this set of exercises we will see univariate and bivariate plots. We will use a dataset example that we have to download from here.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Create a barchart of goals done (Variable gd, grouping by team.

Exercise 2
Now let’s create two datesets (subset dataset by team) and then use the bwplot function to plot goals received (Var: gs).

Exercise 3

Exercise 4
Create a histogram of win matches (Var: Win), grouping by team.

Exercise 5
Now we introduce the xyplot command. In this exercises you have to create a plot of goals received against goals done and play at home (variable home), grouping by team.

Exercise 6
From the previous issue we will plot in this exercises a time-series xyplot. Again use goals done variable by team (hint:you have to do two different plots).

Exercise 7
Return to exercise 5 and create a time-series xyplot, grouping by team for goals received variable. Customize your plot with the cut option.

Exercise 8
Finally, produce a quantile-quantile (qqplot) plot of win matches against goals received and injuries (variable inj), grouping by team.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Base R Nostalgia — by, tapply, ave, …

(This article was first published on data_steve, and kindly contributed to R-bloggers)

photo credit: Paul Yoakum

This evening I was feeling nostalgic for base R group-bys. Before there was dplyr, there was apply and its cousins. I thought it’d be nice to get out the ol’ photo-album.

To start off, the base R proto-ancestor of magrittr piping for me was the with function, especially with apply. It just cleaned up the
syntax and visual appearance of the code by pulling out the redundancy of declaring the data.frame. So even though it isn’t necessary to use with for the
functions below, I think it makes things easier on the eyes and brain.

#### Aggregate Group-Bys

In terms of exploratory analysis, base R’s equivalents to dplyr::summarize are by and tapply. In the case below for both tapply and by you have some a factor variable cyl for which you want to execute a function mean over the corresponding cases in vector of numbers mpg. So since mtcars cylinder variable cyl has 3 levels (4, 6, 8), we take the average miles-per-gallon for cars grouped by each of those cylinder categories.

with(mtcars, by(mpg, cyl, mean))
cyl: 4
[1] 26.66364
-------------------------------------------------------------------------------------
cyl: 6
[1] 19.74286
-------------------------------------------------------------------------------------
cyl: 8
[1] 15.1

with(mtcars, tapply(mpg, cyl, mean))
4        6        8
26.66364 19.74286 15.10000

We can even get a similar behavior out of sapply by adding split to the mix. Since sapply doesn’t natively have a way to handle the grouped aspects of the calculation, we use the function split to break up mpg into the 3 groups first, like so:

\$`4`
[1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4

\$`6`
[1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7

\$`8`
[1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0

Using split returns a ragged list of 3 groups which sapply handles nicely:

with(mtcars, sapply(split(mpg, cyl), mean))
4        6        8
26.66364 19.74286 15.10000

I was delighted to see I could hack out the same output using 2 xtabs (sum/n):

with(mtcars, xtabs(mpg ~ cyl) / xtabs(~ cyl))
cyl
4        6        8
26.66364 19.74286 15.10000

tapply is the most compact for my taste, both in terms of code and output; but I must confess by does the vertically stacked display of output I got initially used to from my earliest exposures with SPSS and Stata. We can get that and a data frame to boot from aggregate, as long as we pass in our group variable as a list:

with(mtcars, aggregate(mpg, list(cyl), mean))
Group.1        x
1       4 26.66364
2       6 19.74286
3       8 15.10000

And this brings us back to dplyr with its dataframe output:

library(dplyr)
mtcars %>% group_by(cyl) %>% summarize(mean(mpg))
Source: local data frame [3 x 2]
cyl mean(mpg)
(dbl)     (dbl)
1     4  26.66364
2     6  19.74286
3     8  15.10000

#### Non-Aggregate Group-Bys

If tapply resembles dplyr‘s group_by() %>% summarize(), then ave somewhat resembles dplyr‘s group_by() %>% mutate(). ave‘s syntax works just like tapply‘s, though their outputs differ notably. Unlike tapply, ave returns a single vector answer of the same length of the data passed in.

with(mtcars, ave(mpg, cyl, FUN=mean))
[1] 19.74 19.74 26.66 19.74 15.10 19.74 15.10 26.66 26.66
[10] 19.74 19.74 15.10 15.10 15.10 15.10 15.10 15.10 26.66
[19] 26.66 26.66 26.66 15.10 15.10 15.10 15.10 26.66 26.66
[28] 26.66 15.10 19.74 15.10 26.66

This is because if tapply is for summarizing the data, then ave is for prepping those data for assignment <- back into the parent data.frame, as with mutate.

mtcars %>% group_by(cyl) %>% mutate(mean(mpg))

And again, with some cleverness we can get sapply return the same result as ave; this time passing in the levels of cyl to subset mpg and take its mean.

with(mtcars, sapply(cyl, function(x) mean(mpg[cyl==x])))

If you want to get dplyr to have somewhat similar behavior as ave, returning only the variables at play, use transmute instead of mutate. mutate returns the whole data.frame with the new variable included; transmute returns only the variables called or created in the code chunk.

mtcars %>% group_by(cyl) %>% transmute(mean(mpg))

Base R Nostalgia — by, tapply, ave, … was originally published by Steve Simpson at data_steve on April 30, 2016.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Identify, describe, plot, and remove the outliers from the dataset

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In statistics, a outlier is defined as a observation which stands far away from the most of other observations. Often a outlier is present due to the measurements error. Therefore, one of the most important task in data analysis is to identify and (if is necessary) to remove the outliers.

There are different methods to detect the outliers, including standard deviation approach and Tukey’s method which use interquartile (IQR) range approach. In this post I will use the Tukey’s method because I like that it is not dependent on distribution of data. Moreover, the Tukey’s method ignores the mean and standard deviation, which are influenced by the extreme values (outliers).

## The Script

I developed a script to identify, describe, plot and remove the outliers if it is necessary. To detect the outliers I use the command boxplot.stats()\$out which use the Tukey’s method to identify the outliers ranged above and below the 1.5*IQR. To describe the data I preferred to show the number (%) of outliers and the mean of the outliers in dataset. I also show the mean of data with and without outliers. Regarding the plot, I think that boxplot and histogram are the best for presenting the outliers. In the script below, I will plot the data with and without the outliers. Finally, with help from Selva, I added a question (yes/no) to ask whether to keep or remove the outliers in data. If the answer is yes then outliers will be replaced with NA.

Here it is the script:

outlierKD

You can simply run the script by using the code below. Replace the dat with your dataset name, and variable with your variable name.

source("http://goo.gl/UUyEzD")
outlierKD(dat, variable)

Here it is an example of the data description:

Outliers identified: 58
Propotion (%) of outliers: 3.8
Mean of the outliers: 108.1
Mean without removing outliers: 53.79
Mean if we remove outliers: 52.82
Do you want to remove outliers and to replace with NA? [yes/no]: y
Outliers successfully removed

Here it is an example of the plot:

I am aware that the scrip can be improved by adding other features, or revising and cleaning the code. Therefore, as always your feedback is appreciated.

Related Post

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## First Thoughts on Automatically Generating Accessible Text Descriptions of ggplot Charts in R

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

In a course team accessibility briefing last week, Richard Walker briefly mentioned a tool for automatically generating text descriptions of Statistics Canada charts to support accessibility. On further probing, the tool, created by Leo Ferres, turned out to be called iGraph-Lite:

… an extensible system that generates natural language descriptions of statistical graphs, particularly those created for Statistics Canada online publication, “The Daily”. iGraph-Lite reads a graph from a set of graphing programs, builds intermediate representations of it in several formats (XML, CouchDB and OWL) and outputs a description using NLG algorithms over a “string template” mechanism.

The tool is a C# application compiled as a Windows application, with the code available (unchanged now for several years) on Github.

The iGraph-Lite testing page gives a wide range of example outputs:

Increasingly, online charts are labeled with tooltip data that allows a mouseover or tab action to pop-up or play out text or audio descriptions of particular elements of the chart. A variety of “good chart design” principles designed to support clearer, more informative graphics (for example, sorting categorical variable bars in a bar chart by value rather than using an otherwise arbitrary alphabetic ordering) not only improves the quality of the graphic, it also makes a tabbed through audio description of the chart more useful. For more tips on writing clear chart and table descriptions, see Almost Everything You Wanted to Know About Making Tables and Figures.

The descriptions are quite formulaic, and to a certain extent represent a literal reading of the chart, along with some elements of interpretation and feature detection/description.

Here are a couple of examples – first, for a line chart:

This is a line graph. The title of the chart is “New motor vehicle sales surge in January”. There are in total 36 categories in the horizontal axis. The vertical axis starts at 110.0 and ends at 160.0, with ticks every 5.0 points. There are 2 series in this graph. The vertical axis is Note: The last few points could be subject to revisions when more data are added. This is indicated by the dashed line.. The units of the horizontal axis are months by year, ranging from February, 2005 to January, 2007. The title of series 1 is “Seasonally adjusted” and it is a line series. The minimum value is 126.047 occuring in September, 2005. The maximum value is 153.231 occuring in January, 2007. The title of series 2 is “Trend” and it is a line series. The minimum value is 133.88 occuring in February, 2005. The maximum value is 146.989 occuring in January, 2007.

And then for a bar chart:

This is a vertical bar graph, so categories are on the horizontal axis and values on the vertical axis. The title of the chart is “Growth in operating revenue slowed in 2005 and 2006”. There are in total 6 categories in the horizontal axis. The vertical axis starts at 0.0 and ends at 15.0, with ticks every 5.0 points. There is only one series in this graph. The vertical axis is % annual change. The units of the horizontal axis are years, ranging from 2001 to 2006. The title of series 1 is “Wholesale” and it is a series of bars. The minimum value is 2.1 occuring in 2001. The maximum value is 9.1 occuring in 2004.

One of the things I’ve pondered before was the question of textualising charts in R using a Grammar of Graphics approach such as that implemented by ggplot2. As well as literal textualisation of grammatical components, a small amount of feature detection or analysis could also be used to pull out some meaningful points. (It also occurs to me that the same feature detection elements could also then be used to drive a graphical highlighting layer back in the visual plane.)

Prompted by my quick look at the iGraph documentation, I idly wondered how easy it would be to pick apart an R ggplot2 chart object and generate a textual description of various parts of it.

A quick search around suggested two promising sources of structured data associated with the chart objects directly: firstly, we can interrogate the object return from calling ggplot() and it’s associated functions directly; and ggplot_build(g), which allows us to interrogate a construction generated from that object.

Here’s an example from the quickest of plays around this idea:

library(ggplot2)
g=ggplot(economics_long, aes(date, value01, colour = variable))
g = g + geom_line() + ggtitle('dummy title')

#The label values may not be the actual axis limits
txt=paste('The chart titled&quot;', g\$labels\$title,'&quot;',
'with x-axis', g\$labels\$x,'labeled from',
ggplot_build(g)\$panel\$ranges[[1]]\$x.labels[1], 'to',
tail(ggplot_build(g)\$panel\$ranges[[1]]\$x.labels, n=1),
'and y-axis', g\$labels\$y,' labeled from',
ggplot_build(g)\$panel\$ranges[[1]]\$y.labels[1], 'to',
tail(ggplot_build(g)\$panel\$ranges[[1]]\$y.labels,n=1), sep=' ')
if ('colour' %in% attributes(g\$labels)\$names){
txt=paste(txt, 'nColour is used to represent', g\$labels\$colour)

if ( class(g\$data[[g\$labels\$colour]]) =='factor') {
txt=paste(txt,', a factor with levels: ',
paste(levels(g\$data[[g\$labels\$colour]]), collapse=', '), '.', sep='')
}
}

txt

#The chart titled &quot;dummy title&quot; with x-axis date labeled from 1970 to 2010 and y-axis value01 labeled from 0.00 to 1.00
#Colour is used to represent variable, a factor with levels: pce, pop, psavert, uempmed, unemploy.&quot;

So for a five minute hack, I’d rate that approach as plausible, perhaps?!

I guess an alternative approach might also be to add an additional textualise=True property to the ggplot() call that forces each ggplot component to return a text description of its actions, and/or features that might be visually obvious from the action of the component as well as, or instead of, the graphical elements. Hmm… maybe I should use the same syntax as ggplot2 and try to piece together something along the lines of a ggdescribe library? Rather than returning graphical bits, each function would return appropriate text elements to the final overall description?!

I’m not sure if we could call the latter approach a “transliteration” of a chart from a graphical to a textual rendering of its structure and some salient parts of a basic (summary) interpretation of it, but it feels to me that this has some merit as a strategy for thinking about data2text conversions in general. Related to this, I note that Narrative Science have teamed up with Microsoft Power BI to produce a “Narratives for Power BI” plugin. I imagine this will start trickling down to Excel at some point, though I’ve already commented on how simple text summarisers can be added into to Excel using simple hand-crafted formulas (Using Spreadsheets That Generate Textual Summaries of Data – HSCIC).

PS does anyone know if Python/matplotib chart elements also be picked apart in a similar way to ggplot objects?

PPS In passing, here’s something that does look like it could be handy for simple natural language processing tasks in Python: polyglot, “a natural language pipeline that supports massive multilingual applications”; includes language detection, tokenisation, named entity extraction, part of speech tagging, transliteration, polarity (i.e. crude sentiment).

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Tufte-style graphics in R

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It’s not an overstatement to say that, at least for me personally, Edward Tufte’s book The Visual Display of Quantitative Information was transformative. Reading this book got me and, I feel confident saying, many many other data scientists passionate about visualizing data. This is the book that popularized Minard’s chart depicting Napoleon’s march on Russia, introduced the world to the concepts of chartjunk and the data-ink ratio, and demonstrated many times over the value of telling stories with data (as opposed to merely displaying it).

Tufte’s book was also a direct influence on the graphics system of the S Language (and also its successor, R): it was the first statistical programming language where Tufte’s concepts could easily be expressed in small amounts of code (even if his principles weren’t fully adhered to in the default settings). So it’s great to find Lukasz Piwek’s Tufte in R page, where many of the examples from The Visual Display are recreated in base R code (and sometimes using lattice and ggplot2 as well). Here for example is Tufte’s famous rugplot, where the axis tickmarks are replaced by dashes at the data points, giving a sense of the marginal distributions while also marking the data:

And here is Tufte’s original sparklines chart: minimal time series presented as small multiples of individual data units (and now all the range in business intelligence tools and spreadsheets).

Each of the examples comes with corresponding R code, usually just a dozen lines or so. Even the document itself is laid out in the style of Tufte’s book, with footnotes presented as sidenotes in the margins of the text, right where they’re referenced. (RStudio has an RMarkdown style for Tufte handouts like this.) Check out all the examples at the link below.

Lukasz Piwek: Tufte in R

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Cross-Validation: Estimating Prediction Error

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

## What is cross-validation?

Cross-Validation is a technique used in model selection to better estimate the test error of a predictive model. The idea behind cross-validation is to create a number of partitions of sample observations, known as the validation sets, from the training data set. After fitting a model on to the training data, its performance is measured against each validation set and then averaged, gaining a better assessment of how the model will perform when asked to predict for new observations. The number of partitions to construct depends on the number of observations in the sample data set as well as the decision made regarding the bias-variance trade-off, with more partitions leading to a smaller bias but a higher variance.

## K-Fold cross-validation

This is the most common use of cross-validation. Observations are split into K partitions, the model is trained on K – 1 partitions, and the test error is predicted on the left out partition k. The process is repeated for k = 1,2…K and the result is averaged. If K=n, the process is referred to as Leave One Out Cross-Validation, or LOOCV for short. This approach has low bias, is computationally cheap, but the estimates of each fold are highly correlated. In this tutorial we will use K = 5.

## Getting started

We will be using the boot package and data found in the MASS library. Let’s see how cross-validation performs on the dataset cars, which measures the speed versus stopping distance of automobiles.

install.packages("boot")
require(boot)
library(MASS)
plot(speed~dist, data=cars, main = "Cars" ,
xlab = "Stopping Distance", ylab = "Speed")

Here is the plot:

Let’s apply a generalized linear model to our data, and see how our cross-validated error estimate changes with each degree polynomial.

glm.fit = glm(speed~dist, data=cars)
degree=1:5
cv.error5=rep(0,5)
for(d in degree){
glm.fit = glm(speed~poly(dist, d), data=cars)
cv.error5[d] = cv.glm(cars,glm.fit,K=5)\$delta[1]
}

Here is the plot:

As you can see, a degree 1 or 2 polynomial seems to fit the model the closest while also holding the most predictive power. Since the difference is negligible, it is best to opt for the simpler model when possible. Notice how overfitting occurs after a certain degree polynomial, causing the model to lose its predictive performance.

## Conclusion

Cross-validation is a good technique to test a model on its predictive performance. While a model may minimize the Mean Squared Error on the training data, it can be optimistic in its predictive error. The partitions used in cross-validation help to simulate an independent data set and get a better assessment of a model’s predictive performance.

Related Post

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## testthat 1.0.0

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

testthat 1.0.0 is now available on CRAN. Testthat makes it easy to turn your existing informal tests into formal automated tests that you can rerun quickly and easily. Learn more at http://r-pkgs.had.co.nz/tests.html. Install the latest version with:

install.packages("testthat")

This version of testthat saw a major behind the scenes overhaul. This is the reason for the 1.0.0 release, and it will make it easier to add new expectations and reporters in the future. As well as the internal changes, there are improvements in four main areas:

• New expectations.
• Support for the pipe.
• More consistent tests for side-effects.
• Support for testing C++ code.

These are described in detail below. For a complete set of changes, please see the release notes.

## Improved expectations

There are five new expectations:

• expect_type() checks the base type of an object (with typeof()), expect_s3_class() tests that an object is S3 with given class, and expect_s4_class() tests that an object is S4 with given class. I recommend using these more specific expectations instead of the generic expect_is(), because they more clearly convey intent.
• expect_length() checks that an object has expected length.
• expect_output_file() compares output of a function with a text file, optionally update the file. This is useful for regression tests for print() methods.

A number of older expectations have been deprecated:

• expect_more_than() and expect_less_than() have been deprecated. Please use expect_gt() and expect_lt() instead.
• takes_less_than() has been deprecated.
• not() has been deprecated. Please use the explicit individual forms expect_error(..., NA) , expect_warning(.., NA), etc.

We also did a thorough review of the documentation, ensuring that related expectations are documented together.

## Piping

Most expectations now invisibly return the input object. This makes it possible to chain together expectations with magrittr:

factor("a") %>%
expect_type("integer") %>%
expect_s3_class("factor") %>%
expect_length(1)

To make this style even easier, testthat now imports and re-exports the pipe so you don’t need to explicitly attach magrittr.

## Side-effects

Expectations that test for side-effects (i.e. expect_message(), expect_warning(), expect_error(), and expect_output()) are now more consistent:

• expect_message(f(), NA) will fail if a message is produced (i.e. it’s not missing), and similarly for expect_output(), expect_warning(), and expect_error().
quiet <- function() {}
noisy <- function() message("Hi!")

expect_message(quiet(), NA)
expect_message(noisy(), NA)
#> Error: noisy() showed 1 message.
#> * Hi!
• expect_message(f(), NULL) will fail if a message isn’t produced, and similarly for expect_output(), expect_warning(), and expect_error().
expect_message(quiet(), NULL)
#> Error: quiet() showed 0 messages
expect_message(noisy(), NULL)

There were three other changes made in the interest of consistency:

• Previously testing for one side-effect (e.g. messages) tended to muffle other side effects (e.g. warnings). This is no longer the case.
• Warnings that are not captured explicitly by expect_warning() are tracked and reported. These do not currently cause a test suite to fail, but may do in the future.
• If you want to test a print method, expect_output() now requires you to explicitly print the object: expect_output("a", "a") will fail, expect_output(print("a"), "a") will succeed. This makes it more consistent with the other side-effect functions.

## C++

Thanks to the work of Kevin Ushey, testthat now includes a simple interface to unit test C++ code using the Catch library. Using Catch in your packages is easy – just call testthat::use_catch() and the necessary infrastructure, alongside a few sample test files, will be generated for your package. By convention, you can place your unit tests in src/test-.cpp. Here’s a simple example of a test file you might write when using testthat + Catch:

#include <testthat.h>
test_that("two plus two equals four") {
int result = 2 + 2;
expect_true(result == 4);
}
}

These unit tests will be compiled and run during calls to devtools::test(), as well as R CMD check. See ?use_catch for a full list of functions supported by testthat, and for more details.

For now, Catch unit tests will only be compiled when using the gcc and clang compilers – this implies that the unit tests you write will not be compiled + run on Solaris, which should make it easier to submit packages that use testthat for C++ unit tests to CRAN.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Talk on regtools and P-Values

By matloff

I’m deeply greatful to Hui Lin and the inimitable Yihui Xie for arranging for me to give a “virtual seminar talk” to the Central Iowa R Users Group. You can view my talk, including an interesting Q&A session, online. (The actual start is at 0:34.) There are two separate topics, my regtools package (related to my forthcoming book, From Linear Algebra to Machine Learning: Regression and Classification, with Examples in R), and the recent ASA report on p-values.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Interactions between Categorical Variables in Mixed Graphical Models

(This article was first published on Jonas Haslbeck – r, and kindly contributed to R-bloggers)

In a previous post we recovered the conditional independence structure in a dataset of mixed variables describing different aspects of the life of individuals diagnosed with Autism Spectrum Disorder, using the mgm package. While depicting the independence structure in multivariate data set gives a first overview of the relations between variables, in most applications we interested in the exact parameter estimates. For instance, for interactions between continuous variables, we would like to know the sign and the size of parameters – i.e., if the nodes in the graph are positively or negatively related, and how strong these associations are. In the case of interactions between categorical variables, we are interested in the signs and sizes of the set of parameters that describes the exact non-linear relationship between variables.

In this post, we take the analysis a step further and show how to use the output of the mgm package to take a closer look at the recovered dependencies. Specifically, we will recover the sign and weight of interaction parameter between continuous variables and zoom into interactions between categorical and continuous variables and between two categorical variables. Both the dataset and the code are available on Github.

We start out with the conditional dependence graph estimated in the previous post, however, now with variables grouped by their type:

We obtained this graph by fitting a mixed graphical model using the mgmfit() function as in the previous post:

## Display Edge Weights and Signs

We now also display the weights of the dependencies. In addition, for interactions between continuous (Gaussian, Poisson) variables, we are able determine the sign of the dependency, as it only depends on one parameter. The signs are saved in fit\$signs. To make plotting easier, there is also a matrix fit\$edgecolor, which gives colors to positive (green), negative (red) and undefined (grey) edge signs.

Now, to plot the weighted adjacency matrix with signs (where defined), we give fit\$edgecolor as input to the argument edge.color in qgraph and plot the weighted adjacency matrix fit\$wadj instead of the unweighted adjacency matrix fit\$adj:

This gives us the following figure:

Red edges correspond to negative edge weights and green edge weights correspond to positive edge weights. The width of the edges is proportional to the absolut value of the parameter weight. Grey edges connect categorical variables to continuous variables or to other categorical variables and are computed from more than one parameter and thus we cannot assign a sign to these edges.

While the interaction between continuous variables can be interpreted as a conditional covariance similar to the well-known multivariate Gaussian case, the interpretation of edge-weights involving categorical variables is more intricate as they are comprised of several parameters.

## Interpretation of Interaction: Continuous – Categorical

We first consider the edge weight between the continuous Gaussian variable ‘Working hours’ and the categorical variable ‘Type of Work’, which has the categories (1) No work, (2) Supervised work, (3) Unpaid work and (4) Paid work. We get the estimated parameters behind this edge weight from the matrix of all estimated parameters in the mixed graphical model fit\$mpar.matrix:

fit\$par.labels indicates which parameters in fit\$mpar.matrix belong to the interaction between which two variables. Note that in the case of jointly Gaussian data, fit\$mpar.matrix is equivalent to the inverse covariance matrix and each interaction would be represented by 1 value only.

The four values we got from the model parameter matrix represent the interactions of the continuous variable ‘Working hours’ with each of the categories of ‘Type of work’. These can be interpreted in a straight forward way of incraesing/decreasing the probability of a category depending on ‘Working hours’. We see that the probability of category (a) ‘No work’ is greatly decreased by an increase of ‘Working hours’. This makes sense as somebody who does not work has to work 0 hours. Next, working hours seem not to predict the probability of categories (b) ‘Supervised work’ and (c) ‘Unpaid work’. However, increasing working hours does increase the probabilty of category (d) ‘Paid work’, which indicates that individuals who get paid for their work, work longer hours. Note that these interactions are unique in the sense that the influence of all other variables is partialed out!

## Interpretation of Interaction: Categorical – Categorical

Next we consider the edge weight between the categorical variables (14) ‘Type of Housing’ and the variable (15) ‘Type of Work’ from above. ‘Type of Housing’ has to categories, (a) ‘Not independent’ and (b) ‘Independent’. As in the previous example, we take the relevant parameters from the model parameter matrix:

Again, the rows represent the categories of variable (14) ‘Type of Housing’. The columns indicate how the different catgories of variable (16) ‘Type of Work’ predict the probability of these categories. The first column is the dummy category ‘No work’. The parameters can therefore be interpreted as follows:

Having supervised or unpaid work, does not predict a probability of living independently or not that is different for individuals with no work. Having paid work, however, decreases the probability of living not independently and increases the probability of living independently, compared to the reference category ‘no work’.

The interpretations above correspond to the typical interpretation of parameters in a multinomial regression model, which is indeed what is used in the node wise regression approach we use in the mgm packge to estimate mixed graphical models. For details about the exact parameterization of the multinomial regression model check chapter 4 in the glmnet paper. Note that because we use the node wise regression approach, we could also look at how the categories in (16) ‘Type of work’ predict (17) ‘Working hours’ or how the categories of (14) ‘Type of housing’ predict the probabilities of (16) ‘Type of Housing’. These parameters can be obtained by exchanging the row indices with the column indices when subsetting fit\$mpar.matrix. For an elaborate explanation of the node wise regresssion approach and the exact structure of the model parameter matrix please check the mgm paper.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## gap frequencies [& e]

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

A riddle from The Riddler where brute-force simulation does not pay:

For a given integer N, pick at random without replacement integers between 1 and N by prohibiting consecutive integers until all possible entries are exhausted. What is the frequency of selected integers as N grows to infinity?

A simple implementation of the random experiment is as follows

generope=function(N){
frei=rep(1,N)
hus=0
while (max(frei)==1){
i=sample(rep((1:N)[frei==1],2),1)
frei[i]=frei[i+1]=frei[i-1]=0
hus=hus+1}
return(hus)}

It is however quite slow and does not exploit the recursive feature of the sampling, namely that the first draw breaks the set {1,…,N} into two sets:

generipe=function(N){
if (N<2){ return((N>0))}else{
i=sample(1:N,1)
return(1+generipe(i-2)+generipe(N-i-1))}}

But even this faster solution takes a while to run for large values of N:

frqns=function(N){
space=0
for (t in 1:1e3) space=space+generipe(N)
return(space/(N*1e3))}

as for instance

>  microbenchmark(frqns(100),time=10)
Unit: nanoseconds
expr       min        lq         mean    median        uq       max
frqns(100) 178720117 185020903 212212355.77 188710872 205865816 471395620
time         4         8        26.13        32        37       102

Hence this limits the realisation of simulation to, say, N=10⁴. Thinking further about the recursive aspect of the process however leads to a recursion on the frequencies qN, as it is straightforward to prove that

with q1=1 and q2=1/2. This recursion can be further simplified into

$q_N=frac{1}{N^2}+frac{2(N-2)}{N^2},q_{N-2}+frac{(N-1)^2}{N^2}q_{N-1}qquad(1)$

which allows for a much faster computation

s=seq(1,1e7) #s[n]=n*q[n]
for (n in 3:1e7) s[n]=(1+2*q[n-2]+(n-1)*q[n-1])/n

and a limiting value of 0.4323324… Since storing s does not matter, a sliding update works even better:

a=b=1
for (n in 3:1e8){ c=(1+2*a+(n-1)*b)/n;a=b;b=c}

still producing a final value of 0.4323324, which may mean we have reached some limit in the precision.

As I could not figure out a way to find the limit of the sequence (1) above, I put it on the maths forum of Stack Exchange and very quickly got the answer (obtained by a power series representation) that the limit is (rather amazingly!)

$dfrac{1 - e^{-2}}{2}$

which is 0.432332358.., hence very close to the numerical value obtained for n=3×10⁸. (Which does not change from n=10⁸, once again for precision reasons.) Now I wonder whether or not an easier derivation of the above is feasible, but we should learn about it in a few hours on The Riddler.

Filed under: Kids, R Tagged: misanthrope, R, sampling, The Riddler