## R Tip: Force Named Arguments

R tip: force the use of named arguments when designing function signatures.

R’s named argument binding is a great aid in writing correct programs. It is a good idea, when practical, to force optional arguments to be usable only by name. To do this, declare the additional arguments after “`...`” and enforce that none got lost in the “`...` trap” by using a checker such as `wrapr::stop_if_dot_args()`.

Example:

```
#' Increment x by inc.
#'
#' @param x item to add to
#' @param ... not used for values, forces later arguments to bind by name
#' @param inc (optional) value to add
#' @return x+inc
#'
#' @examples
#'
#' f(7) # returns 8
#'
f <- function(x, ..., inc = 1) {
  wrapr::stop_if_dot_args(substitute(list(...)), "f")
  x + inc
}

f(7)
#> [1] 8

f(7, inc = 2)
#> [1] 9

f(7, q = mtcars)
#> Error: f unexpected arguments: q = mtcars

f(7, 2)
#> Error: f unexpected arguments: 2
```

By R’s function evaluation rules, any unexpected arguments are captured by the “`...`” argument. `wrapr::stop_if_dot_args()` then inspects for such values and signals an error if any are present. The string “f” is returned as part of the error message; I chose the name of the function in this case. The `substitute(list(...))` part is R’s way of making the contents of “`...`” available for inspection.

You can also use the technique on required arguments. wrapr::stop_if_dot_args() is a simple low-dependency helper function intended to make writing code such as the above easier. This is under the rubric that hidden errors are worse than thrown exceptions. It is best to find and signal problems early, and near the cause.
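The same trick also works for required arguments; here is a minimal sketch in plain base R (the function `g` and its manual check are illustrative only, not wrapr’s mechanism):

```r
g <- function(..., x) {
  # anything passed positionally lands in ... and is rejected
  if (length(list(...)) > 0) {
    stop("g: unexpected positional arguments")
  }
  x + 1
}

g(x = 7)
#> [1] 8
# g(7) would signal: Error: g: unexpected positional arguments
```

Because `x` is declared after `...`, there is no way to bind it positionally; callers must write `x = `.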

The idea is that you should not expect a user to remember the positions of more than one to three arguments; the rest should be referable only by name. Do not make your users count along long sequences of arguments; the human brain may have special handling only for small sequences.

> If you have a procedure with 10 parameters, you probably missed some.

Alan Perlis, “Epigrams on Programming”, ACM SIGPLAN Notices 17 (9), September 1982, pp. 7–13.

Note that the “`substitute(list(...))`” part is the R idiom for capturing the unevaluated contents of “`...`”. I felt it best to use standard R as much as possible rather than introduce additional magic invocations.

Source:: R News

## Whys and Hows of Apply Family of Functions in R

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

# Introduction to Looping system

Imagine you were asked to perform a simple task, say calculating the column sums of a 3x3 matrix. What do you think is the best way? Calculating them directly using traditional methods, such as a calculator or even pen and paper, doesn’t sound like a bad approach. Many of us might prefer to calculate the sums manually rather than write an entire piece of code for such a small dataset.

Now, if the dataset were a 10x10 matrix, would you do the same? Probably not.

And if the dataset were bigger still, say a 100x100, 1000x1000 or 5000x5000 matrix, would you even think of doing it manually? I wouldn’t.

But let’s not worry about this task, because it is not as big as it may look at first sight. There is a concept called ‘looping’ which comes to our rescue in such situations. Anyone who has worked with a programming language will have encountered loops; they are one of the most useful concepts in any programming language. A ‘looping system’ is nothing but an iterative system that performs a specific task repeatedly until the given conditions are met or it is forced to break. Loops come in handy when we have to carry out a task iteratively, whether or not we know beforehand how many iterations will be needed. Instead of writing the same piece of code tens, hundreds or thousands of times, we write a small piece of code using a loop and it does the entire task for us.

Two loops are used extensively in programming: the ‘for’ loop and the ‘while’ loop. In the case of the ‘for’ loop we know beforehand how many times we want the loop to run, i.e. the number of iterations to be carried out. Let’s take a very simple example: printing the numbers 1 to 10. One way would be to write the code to print every number from 1 to 10; the smarter way is a two-line loop that does the work for us.

```
for (i in 1:10) {
  print(i)
}
```

The above code should print values from 1 to 10 for us.

```
> for (i in 1:10) {
+   print(i)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
```

The other very powerful loop is the ‘while’ loop. With a while loop, we don’t know beforehand how many iterations the loop will perform. It runs while a certain condition is met; as soon as the condition is violated, the loop breaks.

```
i = 1
while (i < 10) {
  print(i)
  i = i + 1
}
```

In the above code, we don’t know how many iterations there will be; we just know that the loop should run while the value of ‘i’ is less than 10, and it does exactly that.

```
> i = 1
> while (i < 10) {
+   print(i)
+   i = i + 1
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
```

Using very basic examples, we have seen how powerful these loops can be. However, there is one disadvantage of using these loops in R: they make our code run slowly. The number of computations that needs to be carried out increases, and this increases the time the system takes to execute the code.

But we need not worry about this limitation, because R offers a very good alternative to these loops in many situations: vectorization. Vectorization, as the name suggests, converts operations on scalars, or plain numbers, into single operations on vectors or matrices. A lot of tasks performed with loops can be performed through vectorization. Moreover, vectorization makes calculations run faster because the operations are executed in a lower-level language such as C or C++, which internally contains the loops. The user need not worry about these aspects and can simply use the vectorized functions directly.

Based on the concept of vectorization is a family of functions in R called the ‘apply’ family, part of the base R package. There are multiple functions in the apply family: apply, sapply, lapply, mapply, rapply, tapply and vapply. Their usage depends on the kind of input data we have, the kind of output we want and the kind of operations we want to perform on the data. Let’s go through some of these functions and implement toy examples with them.

## Apply Function

The apply function is the most commonly used function in the apply family. It works on arrays and matrices. Its syntax is as follows:

```
apply(X, MARGIN, FUN, ...)
```

Where:

• X refers to the array or matrix on which the operation is to be performed.
• MARGIN indicates how the function is to be applied: MARGIN = 1 means the function is applied over rows, MARGIN = 2 means it is applied over columns, and MARGIN = c(1, 2) means it is applied over both rows and columns.
• FUN refers to the operation that is to be performed on the data. It can be a predefined R function, such as sum, sd or colMeans, or a user-defined function.

Let’s take an example and use the function to see how it can help us.

```
ApplyFun = matrix(c(1:16), 4,4)

ApplyFun

apply(ApplyFun,2,sum)

apply(ApplyFun,2,mean)

apply(ApplyFun,1,var)

apply(ApplyFun,1,sum)
```

In the above code, we have applied sum and mean function on columns of the matrix; while variance and sum function on the rows. Let’s see the output of the above code.

```
> ApplyFun = matrix(c(1:16), 4,4)

> ApplyFun
[,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

> apply(ApplyFun,2,sum)
[1] 10 26 42 58

> apply(ApplyFun,2,mean)
[1]  2.5  6.5 10.5 14.5

> apply(ApplyFun,1,var)
[1] 26.66667 26.66667 26.66667 26.66667

> apply(ApplyFun,1,sum)
[1] 28 32 36 40
```

Let’s understand the first statement that we executed; others are based on the same logic. We first generated a matrix as below:

```
> ApplyFun
[,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
```

Now, in the second statement [apply(ApplyFun, 2, sum)], we are calculating the sum of all the columns of the matrix. Here, ‘2’ means that the operation should be performed on the columns, and sum is the function to be executed. The output generated here is a vector.
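The function passed to apply need not be predefined; a user-defined function works the same way. As a small sketch (the helper `range_width` is made up for illustration):

```r
ApplyFun = matrix(c(1:16), 4, 4)

# user-defined function: spread between the largest and smallest value
range_width = function(v) max(v) - min(v)

# apply it over the columns
apply(ApplyFun, 2, range_width)
#> [1] 3 3 3 3
```

Each column holds four consecutive integers, so every column spread is 3.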

## Lapply Function

The lapply function is similar to apply, but it takes a list or data frame as input and returns a list as output. It has a similar syntax to apply. Let’s take a couple of examples and see how it can be used.

```
LapplyFun = list(a = 1:5, b = 10:15, c = 21:25)

LapplyFun

lapply(LapplyFun, FUN = mean)

lapply(LapplyFun, FUN = median)

> LapplyFun = list(a = 1:5, b = 10:15, c = 21:25)
> LapplyFun
$a
[1] 1 2 3 4 5

$b
[1] 10 11 12 13 14 15

$c
[1] 21 22 23 24 25

> lapply(LapplyFun, FUN = mean)
$a
[1] 3

$b
[1] 12.5

$c
[1] 23

> lapply(LapplyFun, FUN = median)
$a
[1] 3

$b
[1] 12.5

$c
[1] 23
```

## Sapply Function

The sapply function is similar to lapply, but it returns a vector as output instead of a list.

```
set.seed(5)

SapplyFun = list(a = rnorm(5), b = rnorm(5), c = rnorm(5))

SapplyFun

sapply(SapplyFun, FUN = mean)

> set.seed(5)
>
> SapplyFun = list(a = rnorm(5), b = rnorm(5), c = rnorm(5))
>
> SapplyFun
$a
[1] -0.84085548  1.38435934 -1.25549186  0.07014277  1.71144087

$b
[1] -0.6029080 -0.4721664 -0.6353713 -0.2857736  0.1381082

$c
[1]  1.2276303 -0.8017795 -1.0803926 -0.1575344 -1.0717600

>
> sapply(SapplyFun, FUN = mean)
a          b          c
0.2139191 -0.3716222 -0.3767672
```

Let’s take another example and see the difference between lapply and sapply in further detail.

```
X = matrix(1:9,3,3)
X

Y = matrix(11:19,3,3)
Y

Z = matrix(21:29,3,3)
Z

Comp.lapply.sapply = list(X,Y,Z)
Comp.lapply.sapply

> X = matrix(1:9,3,3)
> X
[,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
>
> Y = matrix(11:19,3,3)
> Y
[,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19
>
> Z = matrix(21:29,3,3)
> Z
[,1] [,2] [,3]
[1,]   21   24   27
[2,]   22   25   28
[3,]   23   26   29
> Comp.lapply.sapply = list(X,Y,Z)
> Comp.lapply.sapply
[[1]]
[,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[[2]]
[,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19

[[3]]
[,1] [,2] [,3]
[1,]   21   24   27
[2,]   22   25   28
[3,]   23   26   29

lapply(Comp.lapply.sapply,"[", , 2)

lapply(Comp.lapply.sapply,"[", 1, )

lapply(Comp.lapply.sapply,"[", 1, 2)

> lapply(Comp.lapply.sapply,"[", , 2)
[[1]]
[1] 4 5 6

[[2]]
[1] 14 15 16

[[3]]
[1] 24 25 26

> lapply(Comp.lapply.sapply,"[", 1, )
[[1]]
[1] 1 4 7

[[2]]
[1] 11 14 17

[[3]]
[1] 21 24 27

> lapply(Comp.lapply.sapply,"[", 1, 2)
[[1]]
[1] 4

[[2]]
[1] 14

[[3]]
[1] 24

```

Now let’s get the output of the last statement using the sapply function.

```
> sapply(Comp.lapply.sapply,"[", 1,2)
[1]  4 14 24
```

We can see the difference between lapply and sapply in the above example: lapply returns a list as output, while sapply returns a vector.
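In fact, sapply is essentially lapply plus a simplification step; turning that step off makes the two agree. A quick check (the variable name is just for illustration):

```r
LapplyFun = list(a = 1:5, b = 10:15, c = 21:25)

# sapply with simplify = FALSE returns the same list that lapply would
identical(sapply(LapplyFun, FUN = mean, simplify = FALSE),
          lapply(LapplyFun, FUN = mean))
#> [1] TRUE
```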

## Mapply Function

The mapply function is a multivariate version of sapply: it applies a function to the first elements of each input, then the second elements, and so on, returning a vector as output. Let’s take an example and understand how it works.

```
X = matrix(1:9,3,3)
X

Y = matrix(11:19,3,3)
Y

Z = matrix(21:29,3,3)
Z

mapply(sum,X,Y,Z)

> X = matrix(1:9,3,3)
> X
[,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
>
> Y = matrix(11:19,3,3)
> Y
[,1] [,2] [,3]
[1,]   11   14   17
[2,]   12   15   18
[3,]   13   16   19
>
> Z = matrix(21:29,3,3)
> Z
[,1] [,2] [,3]
[1,]   21   24   27
[2,]   22   25   28
[3,]   23   26   29
>
> mapply(sum,X,Y,Z)
[1] 33 36 39 42 45 48 51 54 57
```

The above call adds the matrices element by element and returns a vector as output.

```
For e.g., 33 = X[1,1] + Y[1,1] + Z[1,1]
36 = X[2,1] + Y[2,1] + Z[2,1] and so on.
```

## How to decide which apply function to use

Now, comes the part of deciding which apply function should one use and how to decide which apply function will provide the desired results. This is mainly based on the following four parameters:

1. Input
2. Output
3. Intention
4. Section of Data

As discussed above, the apply family functions work on different types of datasets: apply works on arrays or matrices, lapply works on lists, sapply also works on lists, and so on. The kind of input we provide gives us a first level of filtering as to which functions can be used.

The second filter comes from the output we desire from the function. Both lapply and sapply work on lists; in that case, how do we decide which one to use? As we saw above, lapply gives us a list as output while sapply outputs a vector. This provides another level of filtering when deciding which function to choose.

Next comes the intention behind using an apply family function. By intention, we mean the kind of functions that we are planning to pass to the apply call. Section of data refers to the subset of the data that we want our function to operate on: rows, columns or the entire dataset.

These four things can help us figure out which apply function should we choose for our tasks.
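Two family members mentioned above but not shown, tapply and vapply, follow the same input/output logic: tapply applies a function within groups, and vapply is sapply with a declared result type. A toy sketch (the data here is made up):

```r
scores = c(10, 20, 30, 40, 50, 60)
group  = c("a", "b", "a", "b", "a", "b")

# tapply: apply sum within each group
tapply(scores, group, sum)
#>  a   b
#> 90 120

# vapply: like sapply, but each result must match the declared type (numeric(1))
vapply(list(c(1, 2, 3), c(4, 5, 6)), sum, numeric(1))
#> [1]  6 15
```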

After going through the article, I’m sure you will agree that these functions are much easier to use than loops, and provide faster and more efficient ways to execute code. However, this doesn’t mean we should never use loops. Loops have their own advantages for complex operations. Moreover, most other programming languages do not provide an apply family of functions, so there we have no option but to use loops. We should keep ourselves open to both ideas and decide what to use based on the requirements at hand.

Author Bio:

Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…


(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

R-Ladies Remote is kicking off and we want YOU! Do you want to be part of the R community but can’t attend meetups? There are many R-Ladies across the globe who love the idea of the organisation, but aren’t able to connect with it easily due to their distance, their work or their caring responsibilities. If child care ends at 6pm, ducking out to a chapter meeting at 6:30 isn’t always easy.

What do you need to join in? An interest in R and to be part of a gender minority in tech, that’s all. We are open to all R users, from new starters to experienced users. Sign up here.

What will R-Ladies Remote be doing? We’ll be hosting a variety of online events and speakers. We’ll be covering introductions to basic R and more advanced topics, discussions about remote working, independent consulting and seminars from our members.

Do you have an idea for an event, would you like to give a talk or would you like to come along to learn? If so we’d love to hear from you. Please show your interest by filling in our initial survey.



(This article was first published on R – TRinker’s R Blog, and kindly contributed to R-bloggers)

About a year and a half back I was working in Python a bit and became accustomed to the explicit importing of modules (akin to R packages) and functions. Python imports packages like this:

```
import tidyr
import dplyr as dp
from plyr import l_ply, rbind.fill
```

If these R packages were in Python, the first line would import tidyr, and the second would import dplyr as an object aliased as `dp`. These two lines are how Python imports packages. The user still needs to explicitly prefix objects from these modules when calling them; for example, one might call `tidyr::separate` or `dplyr::select`. The third line is more akin to how R loads packages and grants access to exported functions, but even then the Python user has to explicitly name every function that is imported. The way Python imports packages and functions is explicit and may seem annoying to an R user, but it avoids NAMESPACE clashes.

Back then I made a proof of concept package called pysty (Python style) to make R behave a bit more like Python for importing packages and functions. I had forgotten about this package…Fast forward to this week.

Within the last two days I have been bitten by a MASS/dplyr `select` clash, and a team-mate by a plyr/dplyr `summarise` clash. I now tend to prefix many of my functions with their package name in my code to avoid such headaches. This habit reminded me of pysty, and I thought others might find it interesting.
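The masking problem itself needs nothing exotic to reproduce; base R’s `stats::filter` shows the principle (the positive-values `filter` below is a made-up stand-in for dplyr’s):

```r
# defining our own filter masks stats::filter for bare calls
filter <- function(x) x[x > 0]

filter(c(-1, 2, -3, 4))
#> [1] 2 4

# the :: prefix still reaches the original, mask or no mask
stats::filter(1:5, rep(1/3, 3))  # 3-term moving average
```

Whichever definition is attached or defined last wins the bare name; only the `package::` prefix is unambiguous.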

The aliasing makes it convenient to reference a package explicitly without typing out the entire package name at the call site. pysty uses the colon operator to accomplish this explicit calling of functions. Most functions can also be imported and added to the global namespace in one fell swoop using the `from` call. All of these Python-style calls are made possible via `add_imports`.

## Installing pysty

```
if (!require("pacman")) install.packages("pacman")
pacman::p_load_current_gh("trinker/pysty")

library(pysty)
```

## Calling Dependencies

```
library(pysty)

add_imports('
import dplyr as dp
import MASS as m
import ggplot2 as gg
import tidyr
import plyr as p
from plyr import l_ply, rbind.fill
')

assign('%>%', dp::`%>%`) ## arrow assignment wasn't rendering in blog
```

The user can now access the functions in the first five packages, optionally via an alias if one was supplied. select can be called explicitly, without spelling out the entire package name, by using the alias. The object `select` does not exist in the global environment.

```
dp::select

## function (.data, ...)
## {
##     UseMethod("select")
## }

m::select

## function (obj)
## UseMethod("select")
##
##

select

## Error: object 'select' not found

dp::summarize

## function (.data, ...)
## {
##     UseMethod("summarise")
## }
##

p::summarize

## function (.data, ...)
## {
##     stopifnot(is.data.frame(.data) || is.list(.data) || is.environment(.data))
##     cols
```

## Use Cases

The following snippets demonstrate the use of such a package. Notice how the same function name can be used within the same chain even when it comes from another package. Because you have to be explicit about where functions come from, the likelihood of a NAMESPACE conflict is slim.
```
longley %>%
  dp::select(-Employed) %>%
  {m::lm.ridge(GNP.deflator ~ ., ., lambda = seq(0, 0.1, 0.0001))} %>%
  m::select()

## modified HKB estimator is 0.004974797
## modified L-W estimator is 0.03567913
## smallest value of GCV  at 0.003

p::baseball %>%
  dp::group_by(team) %>%
  dp::summarise(
    min_year = min(year),
    max_year = min(year)
  ) %>%
  p::summarise(
    duration = max(max_year) - min(min_year),
    nteams = length(unique(team))
  )

##   duration nteams
## 1      134    132
```

I doubt I’d use this myself in my workflow, but the idea was interesting to me and I wanted to share.



## Combine your hex stickers with magic(k)

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

Hex stickers remind me of Pogs, except they’re cooler because you can combine them together! Some people do that very smartly.

Now when I forget how to do an #rstats analysis I can just check the back of my laptop #runconf16 pic.twitter.com/DY7UgeBzYV

— David Robinson (@drob) 1 April 2016

I’ve got a pretty random hex stickers combination on my laptop, but after all it could be worse…

— Hilary Parker (@hspter) 13 September 2017

Now, since I’m a `magick`/collage fan, you can bet I’ve wondered how to use R to combine stickers automatically! Say I have a bunch of sticker PNGs: how could I produce a map to design my laptop style? Read on to find out more…

# Getting some hex stickers to play with

I do agree that in real life you’ll most often not have PNGs of the stickers, but hey, I can dream… especially since I’m sure one will soon be able to identify stickers on a conference haul pic in the blink of an eye with the `Rvision` package. I got stickers from hex.bin, whose existence I learnt from this tweet:

A plea to my favorite #rstats developers: please submit your BIG hex sticker image files (with vector versions ) to https://t.co/7KWg3d2fZA! @old_man_chester @nj_tierney @samfirke they will be right at home pic.twitter.com/AIqgnBIx1X

— Alison Hill (@apreshill) 21 February 2018

And well if more R packages are added to it my `Rvision` dream could come true.

Note: When I say I got them from hex.bin, I mean I lazily downloaded the “hexagons/” folder of this GitHub repository by using a link as explained in this SO answer.

# Combining a few of them at random

I’ll sample 42 stickers from my nice little internet haul.

```
library("magrittr")
set.seed(42)
ultimate_sample <- fs::dir_ls("hexagons") %>%
  sample(size = 42)
```

After that I’ll create rows, which is pretty straightforward.

```
no_row <- 6
no_col <- 7

row_paths <- split(ultimate_sample, rep(1:no_col, each = no_row))

# let's be crazy
rows <- purrr::map(row_paths, magick::image_read) %>%
  purrr::map(magick::image_append)
```

Now the rows have to be combined together, not just stacked, with a small shift. This will depend on the size of the pics; they seem to all be the same size, which is great. Note that what gets displayed below is what `magick::image_info` gives you about any `magick` object, so I’ll use that.

```
info <- magick::image_read(ultimate_sample[1:10]) %>%
  magick::image_info()
info
# the stickers share the same dimensions
height <- unique(info$height)
width <- unique(info$width)
```

The `col` parameter below is my laptop background’s hex code. I needed it before pasting the hex stickers.

```
my_laptop <- magick::image_blank(width = width * no_col,
                                 height = height * no_row * 0.9,
                                 col = "#FF6987")

for(i in 1:no_row){
  if(i/2 == floor(i/2)){
    offset1 <- 0
  }else{
    offset1 <- (width/2)
  }

  offset2 <- (i-1)*(height*0.75)

  my_laptop <- magick::image_composite(my_laptop, rows[[i]],
                                       offset = paste0("+", offset1,
                                                       "+", offset2))
}

magick::image_write(my_laptop, "data/my_laptop.png")
```

This looks fine; the only reason it’s not perfect is probably that the stickers don’t have exactly the same dimensions. Moreover, one of them doesn’t have transparent borders! That’s mean design. Jeroen Ooms told me I could correct it by smartly using `image_fill` with `color = "transparent"` and some fuzz, and `image_trim` before that to remove margins, but I was lazy.

Now, what could one do? One could order hex stickers by color using the code in my blog post about rainbowing a set of pictures. Or one could just combine all hex stickers together, then make them gray and use the resulting image as a place to collect stickers. #goals

```
magick::image_convert(my_laptop, colorspace = "Gray") %>%
  magick::image_write("data/my_laptop_goals.png")
```



## Topological Tomfoolery in R: Plotting a Möbius Strip

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Topology is, according to Clifford Pickover, the “silly putty of mathematics”. This branch of maths studies the transformation of shapes, knots and other complex geometry problems. One of the most famous topics in topology is the Möbius strip. This shape has some unusual properties which have inspired many artists, inventors, mathematicians and magicians.

You can make a Möbius strip by taking a strip of paper, giving it one twist and gluing the ends together to form a loop. If you now cut this strip lengthwise in half, you don’t end up with two separate strips, but with one long one.

The Möbius strip can also be described with the following parametric equations (where $0 \leq u \leq 2\pi$, $-1 \leq v \leq 1$ and $R$ is the radius of the loop):

$x(u,v) = \left(R + \frac{v}{2} \cos\frac{u}{2}\right) \cos u$
$y(u,v) = \left(R + \frac{v}{2} \cos\frac{u}{2}\right) \sin u$
$z(u,v) = \frac{v}{2} \sin\frac{u}{2}$

The mathematics of this set of parametric equations is not as complex as it looks. $R$ is the radius of the ring, $u$ is the polar angle of each point and $v$ indicates the width of the strip. The polar angle $u/2$ indicates the number of half twists. To make the ring twist twice, change the angle to $u$.

For my data science day job, I have to visualise some three-dimensional spaces so I thought I best learn how to do this by visualising a Möbis strip, using these three equations.

## Plotting a Möbius Strip

The RGL package provides the perfect functionality to play with virtual Möbius strips. This package produces interactive three-dimensional plots that you can zoom and rotate, with many options to change lighting, colours, shininess and so on. The code for plotting a Möbius strip is straightforward.

The first section defines the parameters and converts the $u$ and $v$ sequences to a mesh (from the plot3D package). This function creates two matrices with every possible combination of $u$ and $v$ which are used to calculate the $x, y, z$ points.

The last three lines define a 3D window with a white background and plot the 3D surface in blue. You can explore the figure with your mouse by zooming and rotating it. Parametric equations can be a bit of fun, play with the formula to change the shape and see what happens.

```
# Moebius strip
library(rgl)
library(plot3D)

# Define Parameters
R <- 4
u <- seq(0, 2 * pi, length.out = 100)
v <- seq(-1, 1, length.out = 100)
m <- mesh(u, v)
u <- m$x
v <- m$y

# Parametric equations
x <- (R + v/2 * cos(u/2)) * cos(u)
y <- (R + v/2 * cos(u/2)) * sin(u)
z <- v/2 * sin(u/2)

# Visualise
open3d()
bg3d(color = "white")
surface3d(x, y, z, color = "blue")
```

Plotting a Möbius Strip: RGL output.

We can take it to the next level by plotting a three-dimensional Möbius strip, or a Klein Bottle. The parametric equations for the bottle are mind-boggling:

$x(u,v) = -\frac{2}{15} \cos u \, (3 \cos v - 30 \sin u + 90 \cos^4 u \sin u - 60 \cos^6 u \sin u + 5 \cos u \cos v \sin u)$

$y(u,v) = -\frac{1}{15} \sin u \, (3 \cos v - 3 \cos^2 u \cos v - 48 \cos^4 u \cos v + 48 \cos^6 u \cos v - 60 \sin u + 5 \cos u \cos v \sin u - 5 \cos^3 u \cos v \sin u - 80 \cos^5 u \cos v \sin u + 80 \cos^7 u \cos v \sin u)$

$z(u,v) = \frac{2}{15} (3 + 5 \cos u \sin u) \sin v$

Where $0 \leq u \leq \pi$ and $0 \leq v \leq 2\pi$.

The code to visualise this bottle is essentially the same, just with more complex equations.
```
# Klein bottle
u <- seq(0, pi, length.out = 100)
v <- seq(0, 2 * pi, length.out = 100)
m <- mesh(u, v)
u <- m$x
v <- m$y

x <- -2/15 * cos(u) * (3 * cos(v) - 30 * sin(u) + 90 * cos(u)^4 * sin(u) -
       60 * cos(u)^6 * sin(u) + 5 * cos(u) * cos(v) * sin(u))
y <- -1/15 * sin(u) * (3 * cos(v) - 3 * cos(u)^2 * cos(v) -
       48 * cos(u)^4 * cos(v) + 48 * cos(u)^6 * cos(v) - 60 * sin(u) +
       5 * cos(u) * cos(v) * sin(u) - 5 * cos(u)^3 * cos(v) * sin(u) -
       80 * cos(u)^5 * cos(v) * sin(u) + 80 * cos(u)^7 * cos(v) * sin(u))
z <- 2/15 * (3 + 5 * cos(u) * sin(u)) * sin(v)

open3d()
bg3d(color = "white")
surface3d(x, y, z, color = "blue")
```

Plotting a Klein Bottle in RGL. Click to view RGL widget.

The RGL package has some excellent facilities for visualising three-dimensional objects, far beyond simple strips. I am still learning, and am working toward using it to visualise bathymetric surveys of water reservoirs. Möbius strips are, however, a lot more fun.

## Creating Real Möbius Strips

Even more fun than playing with virtual Möbius strips is to make some paper versions and start cutting, just like August Möbius did in his research. If you would like to create a reusable Möbius strip, purchase a large zipper from your local haberdashery shop, add some hook-and-loop fasteners to the ends and start playing. If you would like to know more about the mathematics of this topological curiosity, I can highly recommend Clifford Pickover’s book on the topic.

Möbius strip zipper.

## The Möbius Strip in Magic

In the first half of the twentieth century, many magicians used the Möbius strip as a magic trick. The great Harry Blackstone performed it regularly in his show.

If you are interested in magic tricks and Möbius strips, then you can read my ebook on the Afghan bands.

The post Topological Tomfoolery in R: Plotting a Möbius Strip appeared first on The Devil is in the Data.


## Machine Learning in R with TensorFlow

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Modern machine learning platforms like Tensorflow have to date been used mainly by the computer science crowd, for applications like computer vision and language understanding. But as JJ Allaire pointed out in his keynote at the RStudio conference earlier this month (embedded below), there’s a wealth of applications in the data science domain that have yet to be widely explored using these techniques. This includes things like time series forecasting, logistic regression, latent variable models, and censored data analysis (including survival analysis and failure data analysis).

The keras package for R provides a flexible, high-level interface for specifying machine learning models. (RStudio also provides some nice features when using the package, including a dynamically-updated convergence chart to show progress.) Networks defined with keras are flexible enough to specify models for data science applications, which can then be optimized using frameworks like Tensorflow (as opposed to traditional maximum-likelihood techniques), without limitations on data set size and with the ability to exploit modern computational hardware.

For learning materials, RStudio’s Tensorflow Gallery provides a good place to get started with several worked examples using real-world data. The book Deep Learning with R (Chollet and Allaire) provides even more worked examples translated from the original Python. If you want to dive into the mathematical underpinnings, the book Deep Learning (Goodfellow et al) provides the details there.

RStudio blog: TensorFlow for R



## R Tip: Use [[ ]] Wherever You Can

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

R tip: use `[[ ]]` wherever you can.

In R, `[[ ]]` is the operator that (when supplied a scalar argument) pulls a single element out of a list (while the `[ ]` operator pulls out a sub-list).

For vectors `[[ ]]` and `[ ]` appear to be synonyms. However, when writing reusable code you may not always be sure if your code is going to be applied to a vector or list in the future. It is safer to get into the habit of always using `[[ ]]` when you intend to retrieve a single element.

Example with lists:

```
list("a", "b")[1]
#> [[1]]
#> [1] "a"

list("a", "b")[[1]]
#> [1] "a"
```

Example with vectors:

```
c("a", "b")[1]
#> [1] "a"

c("a", "b")[[1]]
#> [1] "a"
```
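The “reusable code” point is where the difference bites. A tiny sketch (both helpers are hypothetical):

```r
inc_first <- function(x) x[[1]] + 1   # [[ ]]: works for vectors and lists
bad_inc   <- function(x) x[1] + 1     # [ ]: on a list, returns a sub-list

inc_first(c(5, 6))
#> [1] 6
inc_first(list(5, 6))
#> [1] 6

bad_inc(c(5, 6))
#> [1] 6
# bad_inc(list(5, 6))
#> Error in x[1] + 1 : non-numeric argument to binary operator
```

The `[ ]` version works today on vectors and fails the day someone passes a list; the `[[ ]]` version keeps working.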

This R tips series consists of short, simple notes on R best practices and additional packaged tools. The intent is to show both how to perform common tasks and how to avoid common pitfalls. I hope to share about 20 of these, roughly every other day, to learn which issues resonate with the community and also to introduce some features of our packages. It is an opinionated series that will sometimes touch on coding style, and it will try to showcase appropriate Win-Vector LLC R tools.


## Markdown based web analytics? Rectangle your blog

(This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

Locke Data’s great blog is Markdown-based: all blog posts exist as Markdown files (you can see all of them here). They then get rendered to HTML by some sort of magic cough blogdown cough that we don’t need to fully understand here. For marketing efforts, I needed a census of existing blog posts along with some precious information. Here is how I got it; in other words, here is how I rectangled the website’s GitHub repo and live version to serve our needs.


## “I have to randomize by cluster. Is it OK if I only have 6 sites?”

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

The answer is probably no, because there is a not-so-low chance (perhaps considerably higher than 5%) you will draw the wrong conclusions from the study. I have heard variations on this question not so infrequently, so I thought it would be useful (of course) to do a few quick simulations to see what happens when we try to conduct a study under these conditions. (Another question I get every so often, after a study has failed to find an effect: “can we get a post-hoc estimate of the power?” I was all set to post on the issue, but then I found this, which does a really good job of explaining why this is not a very useful exercise.) But, back to the question at hand.

Here is the bottom line: if there are differences between clusters that relate to the outcome, there is a good chance that we might confuse those inherent differences for treatment effects. These inherent differences could be the characteristics of people in the different clusters; for example, a health care clinic might attract healthier people than others. Or the differences could be characteristics of the clusters themselves; for example, we could imagine that some health care clinics are better at managing high blood pressure than others. In both scenarios, individuals in a particular cluster are likely to have good outcomes regardless of the intervention. And if these clusters happen to get assigned to the intervention, we could easily confuse the underlying structure or characteristics as an intervention effect.

This problem is easiest to observe if we generate data under the assumption that there is no treatment effect. Actually, I will generate lots of these data sets, and for each one I am going to test for statistical significance. (I am comfortable doing that in this situation, since I can literally repeat the identical experiment over and over again, a key prerequisite for properly interpreting a p-value.) I am going to estimate the proportion of cases where the test statistic would lead me to incorrectly reject the null hypothesis, or make a Type I error. (I am not getting into the case where there is actually a treatment effect.)

### A single cluster randomized trial with 6 sites

First, I define the cluster level data. Each cluster or site will have a “fixed” effect that will apply to all individuals within that site. I will generate the fixed effect so that on average (across all sites) it is 0 with a variance of 0.053. (I will explain that arbitrary number in a second.) Each site will have exactly 50 individuals.

```
library(simstudy)

defC <- defData(varname = "siteFix", formula = 0, variance = .053,
                dist = "normal", id = "cID")
defC <- defData(defC, varname = "nsite", formula = 50, dist = "nonrandom")

defC
```

```
##    varname formula variance      dist     link
## 1: siteFix       0    0.053    normal identity
## 2:   nsite      50    0.000 nonrandom identity
```

Now, I generate the cluster-level data and assign treatment:

```
set.seed(7)

dtC <- genData(6, defC)
dtC <- trtAssign(dtC, nTrt = 2, grpName = "trtGrp")
```

Once the cluster-level data are ready, I can define and generate the individual-level data. Each cluster will have 50 records, for a total of 300 individuals.

```
defI <- defDataAdd(varname = "y", formula = "siteFix", variance = 1)

dtI <- genCluster(dtC, cLevelVar = "cID", numIndsVar = "nsite", level1ID = "id")
dtI <- addColumns(defI, dtI)

dtI
```

```
##      cID trtGrp    siteFix nsite  id          y
##   1:   1      0  0.5265638    50   1  2.7165419
##   2:   1      0  0.5265638    50   2  0.8835501
##   3:   1      0  0.5265638    50   3  3.2433156
##   4:   1      0  0.5265638    50   4  2.8080158
##   5:   1      0  0.5265638    50   5  0.8505844
##  ---
## 296:   6      1 -0.2180802    50 296 -0.6351033
## 297:   6      1 -0.2180802    50 297 -1.3822554
## 298:   6      1 -0.2180802    50 298  1.5197839
## 299:   6      1 -0.2180802    50 299 -0.4721576
## 300:   6      1 -0.2180802    50 300 -1.1917988
```

I promised a little explanation of why the variance of the sites was specified as 0.053. The statistic that characterizes the extent of clustering is the intra-class correlation coefficient, or ICC. It is calculated by

\[ICC = \frac{\sigma^2_{clust}}{\sigma^2_{clust}+\sigma^2_{ind}}\] where \(\sigma^2_{clust}\) is the variance of the cluster means, and \(\sigma^2_{ind}\) is the variance of the individuals within the clusters. (We are assuming that the within-cluster variance is constant across all clusters.) The denominator represents the total variation across all individuals. The ICC ranges from 0 (no clustering) to 1 (maximal clustering). When \(\sigma^2_{clust} = 0\), then \(ICC = 0\): all variation is due to individual variation. And when \(\sigma^2_{ind} = 0\), \(ICC = 1\): there is no variation across individuals within a cluster (i.e. they are all the same with respect to this measure), and any variation across individuals more generally is due entirely to the cluster variation. I used a cluster-level variance of 0.053 so that the ICC is 0.05:

\[ICC = \frac{0.053}{0.053+1.00} \approx 0.05\]
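The arithmetic is easy to check directly:

```
# ICC implied by cluster-level variance 0.053 and within-cluster variance 1
icc <- 0.053 / (0.053 + 1.00)
round(icc, 3)  # 0.05
```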

OK – back to the data. Let’s take a quick look at it:

```
library(ggplot2)

ggplot(data = dtI, aes(x = factor(cID), y = y)) +
  geom_jitter(aes(color = factor(trtGrp)), width = .1) +
  scale_color_manual(labels = c("control", "rx"),
                     values = c("#ffc734", "#346cff")) +
  theme(panel.grid.minor = element_blank(),
        legend.title = element_blank())
```

Remember, there is no treatment effect (either positive or negative). But, due to cluster variation, Site 1 (randomized to control) has higher than average outcomes. We estimate the treatment effect using a fixed effects model. This model seems reasonable, since we don’t have enough sites to estimate the variability of a random effects model. We conclude that the treatment has a (deleterious) effect (assuming higher \(y\) is a good thing), based on the p-value for the treatment effect estimate, which is considerably less than 0.05.

```
library(broom)
library(lme4)

lmfit <- lm(y ~ trtGrp + factor(cID), data = dtI)
tidy(lmfit)
```

```
##           term   estimate std.error  statistic      p.value
## 1  (Intercept)  0.8267802 0.1394788  5.9276404 8.597761e-09
## 2       trtGrp -0.9576641 0.1972528 -4.8550088 1.958238e-06
## 3 factor(cID)2 -0.1162042 0.1972528 -0.5891129 5.562379e-01
## 4 factor(cID)3  0.1344241 0.1972528  0.6814812 4.961035e-01
## 5 factor(cID)4 -0.8148341 0.1972528 -4.1309123 4.714672e-05
## 6 factor(cID)5 -1.2684515 0.1972528 -6.4305878 5.132896e-10
```

### A more systematic evaluation

OK, so I was able to pick a seed that generated outcomes in that single instance that seemed to illustrate my point. But what happens if we look at this a bit more systematically? The series of plots that follow tell a story. Each one represents a series of simulations, similar to the one above. (I am not including the code because it is a bit convoluted, but I would be happy to share it if anyone wants it.)

The first scenario shown below is based on six sites using ICCs that range from 0 to 0.10. For each level of ICC, I generated 100 different samples of six sites. For each of those 100 samples, I generated 100 different randomization schemes (which I know is overkill in the case of 6 sites since there are only 20 possible randomization schemes) and generated a new set of individuals. For each of those 100 randomization schemes, I estimated a fixed effects model and recorded the proportion of the 100 where the p-values were below the 0.05 threshold.
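The author's simulation code is not shown; as a rough, simplified sketch of my own (not the original code, which nests samples within randomization schemes), one Type I error estimate for six sites at an ICC of about 0.05 might look like this:

```
# Simplified sketch (not the original code): estimate the Type I error rate
# for a fixed effects analysis of 6 sites, cluster-level variance 0.053
set.seed(1)
var_clust <- 0.053

pvals <- replicate(100, {
  site <- rnorm(6, 0, sqrt(var_clust))               # new site effects each iteration
  d <- data.frame(cID = rep(1:6, each = 50),
                  trt = rep(sample(rep(0:1, 3)), each = 50))  # 3 sites randomized to rx
  d$y <- rnorm(300, site[d$cID], 1)                  # no true treatment effect
  fit <- lm(y ~ trt + factor(cID), data = d)
  summary(fit)$coefficients["trt", 4]                # p-value for treatment
})

mean(pvals < 0.05)  # empirical Type I error rate
```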

How do we interpret this plot? When there is no clustering (\(ICC = 0\)), the probability of a Type I error is close to 5%, which is what we would expect based on theory. But once we have any kind of clustering, things start to go a little haywire. Even when \(ICC = 0.025\), we would make a lot of mistakes. The error rate only increases as the extent of clustering increases. There is quite a lot of variability in the error rate, which is a function of the variability of the site-specific effects.

If we use 24 sites, and continue to fit a fixed effect model, we see largely the same thing. Here, we have a much bigger sample size, so a smaller treatment effect is more likely to be statistically significant:

One could make the case that instead of fitting a fixed effects model, we should be using a random effects model (particularly if the sites themselves are randomly pulled from a population of sites, though this is hardly likely to be the case when you are having a hard time recruiting sites to participate in your study). The next plot shows that the error rate goes down for 6 sites, but still not enough for my comfort:
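For reference, a random-intercept model of this kind can be fit with `lme4::lmer()`. Here is a minimal sketch on freshly simulated data, my own construction mirroring the setup above rather than the author's code:

```
library(lme4)

# Simulate 6 sites with cluster-level variance 0.053 (ICC ~ 0.05), no treatment effect
set.seed(7)
site <- rnorm(6, 0, sqrt(0.053))
d <- data.frame(cID = rep(1:6, each = 50),
                trtGrp = rep(sample(rep(0:1, 3)), each = 50))
d$y <- rnorm(300, site[d$cID], 1)

# Random intercept for site in place of fixed site effects
refit <- lmer(y ~ trtGrp + (1 | cID), data = d)
summary(refit)$coefficients
```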

With 24 sites, the random effects model seems much safer to use:

But, in reality, if we only have 6 sites, the best that we could do is randomize within site and use a fixed effect model to draw our conclusions. Even at high levels of clustering, this approach will generally lead us towards a valid conclusion (assuming, of course, the study itself is well designed and implemented):

But, I assume the researcher couldn’t randomize at the individual level, otherwise they wouldn’t have asked that question. In which case I would say, “It might not be the best use of resources.”