sample_n_of(): a useful helper function

By Higher Order Functions

An illegible plot because too many facets are plotted

(This article was first published on Higher Order Functions, and kindly contributed to R-bloggers)

Here’s the problem: I have some data with nested time series. Lots of them. It’s
like there’s many, many little datasets inside my data. There are too many
groups to plot all of the time series at once, so I just want to preview a
handful of them.

For a working example, suppose we want to visualize the top 50 American female
baby names over time. I start by adding up the total number of births for each
name, finding the overall top 50 most populous names, and then keeping just the
time series from those top names.

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

babynames <- babynames::babynames %>% 
  filter(sex == "F")

top50 <- babynames %>% 
  group_by(name) %>% 
  summarise(total = sum(n)) %>% 
  top_n(50, total) 

# keep just rows in babynames that match a row in top50
top_names <- babynames %>%
  semi_join(top50, by = "name")

Hmm, so what does this look like?

ggplot(top_names) + 
  aes(x = year, y = n) + 
  geom_line() + 
  facet_wrap("name")

Aaack, I can’t read anything! Can’t I just see a few of them?

This is a problem I face so frequently that I wrote a helper function to handle it:
sample_n_of(). This is not a very clever name, but it works. Below, I call the
function from my personal R package and plot just the data from four names.

# For reproducible blogging
set.seed(20180524)

top_names %>% 
  tjmisc::sample_n_of(4, name) %>% 
  ggplot() + 
    aes(x = year, y = n) + 
    geom_line() + 
    facet_wrap("name")

A plot with four faceted timeseries

In this post, I walk through how this function works. It’s not very
complicated: It relies on some light tidy evaluation plus one obscure dplyr
function.

Working through the function

As usual, let’s start by sketching out the function we want to write:

sample_n_of <- function(data, size, ...) {
  # quote the dots
  dots <- quos(...)
  
  # ...now make things happen...
}

where size is the number of groups to sample and ... are the column names
that define the groups. We use quos(...) to capture and quote those column
names. (As I wrote before,
quotation is how we bottle up R code so we can deploy it for later.)
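
As a quick illustration (a toy example of mine, not from the original post), capturing
two column names just gives us back a list of two quosures that we can splice later on:

captured <- quos(name, year)
length(captured)
#> [1] 2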

For interactive testing, suppose our dataset is the time series from the top 50
names and we want data from a sample of 5 names. In this case, the values for
the arguments would be:

data <- top_names
size <- 5
dots <- quos(name)

A natural way to think about this problem is that we want to sample subgroups of
the dataframe. First, we create a grouped version of the dataframe using
group_by(). The function group_by() also takes a ... argument where the
dots are typically names of columns in the dataframe. We want to take the
names inside of our dots, unquote them and plug them in to where the ...
goes in group_by(). This is what the tidy evaluation world calls
splicing.

Think of splicing as doing this:

# Demo function that counts the number of arguments in the dots
count_args <- function(...) length(quos(...))
example_dots <- quos(var1, var2, var2)

# Splicing turns the first form into the second one
count_args(!!! example_dots)
#> [1] 3
count_args(var1, var2, var2)
#> [1] 3

So, we create a grouped dataframe by splicing our dots into the group_by()
function.

grouped <- data %>% 
  group_by(!!! dots)

There is a helper function buried in dplyr called group_indices() which
returns the grouping index for each row in a grouped dataframe.

grouped %>% 
  tibble::add_column(group_index = group_indices(grouped)) 
#> # A tibble: 6,407 x 6
#> # Groups:   name [50]
#>     year sex   name          n    prop group_index
#>    <dbl> <chr> <chr>     <int>   <dbl>       <int>
#>  1  1880 F     Mary       7065 0.0724           33
#>  2  1880 F     Anna       2604 0.0267            4
#>  3  1880 F     Emma       2003 0.0205           19
#>  4  1880 F     Elizabeth  1939 0.0199           17
#>  5  1880 F     Margaret   1578 0.0162           32
#>  6  1880 F     Sarah      1288 0.0132           45
#>  7  1880 F     Laura      1012 0.0104           29
#>  8  1880 F     Catherine   688 0.00705          11
#>  9  1880 F     Helen       636 0.00652          21
#> 10  1880 F     Frances     605 0.00620          20
#> # ... with 6,397 more rows

We can randomly sample five of the group indices and keep the rows for just
those groups.

unique_groups <- unique(group_indices(grouped))
sampled_groups <- sample(unique_groups, size)
sampled_groups
#> [1]  4 25 43 20 21

subset_of_the_data <- data %>% 
  filter(group_indices(grouped) %in% sampled_groups)
subset_of_the_data
#> # A tibble: 674 x 5
#>     year sex   name         n      prop
#>    <dbl> <chr> <chr>    <int>     <dbl>
#>  1  1880 F     Anna      2604 0.0267   
#>  2  1880 F     Helen      636 0.00652  
#>  3  1880 F     Frances    605 0.00620  
#>  4  1880 F     Samantha    21 0.000215 
#>  5  1881 F     Anna      2698 0.0273   
#>  6  1881 F     Helen      612 0.00619  
#>  7  1881 F     Frances    586 0.00593  
#>  8  1881 F     Samantha    12 0.000121 
#>  9  1881 F     Karen        6 0.0000607
#> 10  1882 F     Anna      3143 0.0272   
#> # ... with 664 more rows

# Confirm that only five names are in the dataset
subset_of_the_data %>% 
  distinct(name)
#> # A tibble: 5 x 1
#>   name    
#>   <chr>   
#> 1 Anna    
#> 2 Helen   
#> 3 Frances 
#> 4 Samantha
#> 5 Karen

Putting these steps together, we get:

sample_n_of <- function(data, size, ...) {
  dots <- quos(...)
  
  group_ids <- data %>% 
    group_by(!!! dots) %>% 
    group_indices()
  
  sampled_groups <- sample(unique(group_ids), size)
  
  data %>% 
    filter(group_ids %in% sampled_groups)
}

We can test that the function works as we might expect. Sampling 10 names
returns the data for 10 names.

ten_names <- top_names %>% 
  sample_n_of(10, name) %>% 
  print()
#> # A tibble: 1,326 x 5
#>     year sex   name         n      prop
#>    <dbl> <chr> <chr>    <int>     <dbl>
#>  1  1880 F     Sarah     1288 0.0132   
#>  2  1880 F     Frances    605 0.00620  
#>  3  1880 F     Rachel     166 0.00170  
#>  4  1880 F     Samantha    21 0.000215 
#>  5  1880 F     Deborah     12 0.000123 
#>  6  1880 F     Shirley      8 0.0000820
#>  7  1880 F     Carol        7 0.0000717
#>  8  1880 F     Jessica      7 0.0000717
#>  9  1881 F     Sarah     1226 0.0124   
#> 10  1881 F     Frances    586 0.00593  
#> # ... with 1,316 more rows

ten_names %>% 
  distinct(name)
#> # A tibble: 10 x 1
#>    name    
#>    <chr>   
#>  1 Sarah   
#>  2 Frances 
#>  3 Rachel  
#>  4 Samantha
#>  5 Deborah 
#>  6 Shirley 
#>  7 Carol   
#>  8 Jessica 
#>  9 Patricia
#> 10 Sharon

We can sample based on multiple columns too. Ten combinations of names and years
should return just ten rows.

top_names %>% 
  sample_n_of(10, name, year) 
#> # A tibble: 10 x 5
#>     year sex   name          n      prop
#>    <dbl> <chr> <chr>     <int>     <dbl>
#>  1  1907 F     Jessica      17 0.0000504
#>  2  1932 F     Catherine  5446 0.00492  
#>  3  1951 F     Nicole       94 0.0000509
#>  4  1953 F     Janet     17761 0.00921  
#>  5  1970 F     Sharon     9174 0.00501  
#>  6  1983 F     Melissa   23473 0.0131   
#>  7  1989 F     Brenda     2270 0.00114  
#>  8  1989 F     Pamela     1334 0.000670 
#>  9  1994 F     Samantha  22817 0.0117   
#> 10  2014 F     Kimberly   2891 0.00148

Next steps

There are a few tweaks we could make to this function. For example, in my
package’s version, I warn the user when the requested sample size is larger than
the number of groups.

too_many <- top_names %>% 
  tjmisc::sample_n_of(100, name)
#> Warning: Sample size (100) is larger than number of groups (50). Using size
#> = 50.

My version also randomly samples n rows when no grouping variables are
provided.

top_names %>% 
  tjmisc::sample_n_of(2)
#> # A tibble: 2 x 5
#>    year sex   name          n     prop
#>   <dbl> <chr> <chr>     <int>    <dbl>
#> 1  1934 F     Stephanie   128 0.000118
#> 2  2007 F     Mary       3674 0.00174
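
Here is one way those two tweaks might look if we fold them into the simple version
we wrote above. This is just a sketch of mine (sample_n_of2 is a made-up name), not
the actual tjmisc implementation:

sample_n_of2 <- function(data, size, ...) {
  dots <- quos(...)
  
  # With no grouping variables, just sample rows
  if (length(dots) == 0) {
    return(dplyr::sample_n(data, size))
  }
  
  group_ids <- data %>% 
    group_by(!!! dots) %>% 
    group_indices()
  
  n_groups <- length(unique(group_ids))
  
  # Warn and fall back to all the groups when the request is too big
  if (size > n_groups) {
    warning("Sample size (", size, ") is larger than number of groups (", 
            n_groups, "). Using size = ", n_groups, ".", call. = FALSE)
    size <- n_groups
  }
  
  sampled_groups <- sample(unique(group_ids), size)
  
  data %>% 
    filter(group_ids %in% sampled_groups)
}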

One open question is how to handle data that’s already grouped. The function we
wrote above fails.

top_names %>% 
  group_by(name) %>% 
  sample_n_of(2, year)
#> Error in filter_impl(.data, quo): Result must have length 136, not 6407

Is this a problem?

Here I think failure is okay, because it’s not obvious what should happen. It should
randomly choose 2 of the years for each name, but should it be the same two years
for every name? If so, then this would be fine:

top_names %>% 
  sample_n_of(2, year)
#> # A tibble: 100 x 5
#>     year sex   name         n    prop
#>    <dbl> <chr> <chr>    <int>   <dbl>
#>  1  1970 F     Jennifer 46160 0.0252 
#>  2  1970 F     Lisa     38965 0.0213 
#>  3  1970 F     Kimberly 34141 0.0186 
#>  4  1970 F     Michelle 34053 0.0186 
#>  5  1970 F     Amy      25212 0.0138 
#>  6  1970 F     Angela   24926 0.0136 
#>  7  1970 F     Melissa  23742 0.0130 
#>  8  1970 F     Mary     19204 0.0105 
#>  9  1970 F     Karen    16701 0.00912
#> 10  1970 F     Laura    16497 0.00901
#> # ... with 90 more rows

Or, should those two years be randomly selected for each name? Then, we should
let do() handle that. do() takes some code that returns a dataframe, applies
it to each group, and returns the combined result.

top_names %>% 
  group_by(name) %>% 
  do(sample_n_of(., 2, year))
#> # A tibble: 100 x 5
#> # Groups:   name [50]
#>     year sex   name       n      prop
#>    <dbl> <chr> <chr>  <int>     <dbl>
#>  1  1913 F     Amanda   346 0.000528 
#>  2  1953 F     Amanda   428 0.000222 
#>  3  1899 F     Amy      281 0.00114  
#>  4  1964 F     Amy     9579 0.00489  
#>  5  1916 F     Angela   715 0.000659 
#>  6  2005 F     Angela  2893 0.00143  
#>  7  1999 F     Anna    9092 0.00467  
#>  8  2011 F     Anna    5649 0.00292  
#>  9  1952 F     Ashley    24 0.0000126
#> 10  2006 F     Ashley 12340 0.00591  
#> # ... with 90 more rows

I think raising an error and forcing the user to clarify their code is better
than choosing one of these options and not doing what the user expects.
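
One simple way to make that failure more explicit is to refuse grouped input up front.
This is a sketch of mine (sample_n_of_strict is a made-up name), not something from
the package:

sample_n_of_strict <- function(data, size, ...) {
  # Refuse grouped data so the user has to say what they mean
  if (dplyr::is_grouped_df(data)) {
    stop("`data` is already grouped. Ungroup it first, or use do() to sample within groups.", 
         call. = FALSE)
  }
  sample_n_of(data, size, ...)
}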


Simple Spatial Modelling – Part 3: Exercises

By Hanif Kusuma

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

So far, we have learned how to account for spatial variability in our model. Please look at the two previous exercise sets here and here if you haven’t tried them yet. However, those only represent a 1-dimensional model. In this exercise, we will try to expand our spatial consideration into a 2-dimensional model.

Have a look at the plan view below to get an illustration of how the 2-dimensional model will work.

The water levels are stored in a 2-D array. They are numbered as follows:

The water flows are stored in 2 different 2-D arrays:
1. qv: defines water flows between buckets down the screen (in plan view)
2. qh: defines water flows between buckets across the screen.
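
As a concrete illustration of this layout (a sketch of mine with made-up object names, not part of the official solutions), the three arrays could be declared like this for a 25 x 25 grid:

n <- 25                                  # number of tanks in each direction
H <- matrix(1, nrow = n, ncol = n)       # water level in each tank
qv <- matrix(0, nrow = n + 1, ncol = n)  # flows between rows (down the screen), including the boundaries
qh <- matrix(0, nrow = n, ncol = n + 1)  # flows between columns (across the screen), including the boundaries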


Let’s get into the modelling by cracking the exercises below. Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Set all the required settings for the model:
a. set the number of timesteps. Here we use 1000 (this is how many timesteps the model will run for; you can change it as you want)
b. set the total number of cells; here we have 25 x 25 water tanks
c. set the timestep in seconds; here it is 1
d. set the time at the start of the simulation
e. set k between each water tank. Here we set a uniform value of k: 0.01

Exercise 2
Create matrix H for the initial water levels in the water tanks.

Exercise 3
Set the boundary conditions for the model; here we have water flowing (qh) into the water tanks from three sides (top, left and right) and water flowing out at the bottom (qv; see the plan view). Water flow to the right and to the bottom is considered positive. Don’t forget to declare the matrices for qh and qv.

Exercise 4
Create model output for every 100 timesteps.

Exercise 5
Run the model by creating a loop for qh, qv, the water storage update, and the model output (remember the threshold loop in the previous exercise?).

Exercise 6
Plot the model output using a contour plot.


Create your Machine Learning library from scratch with R ! (3/5) – KNN

By Antoine Guillot


(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

This is the second post of the “Create your Machine Learning library from scratch with R !” series. Today, we will see how you can implement K-nearest neighbors (KNN) using only the linear algebra available in R. Previously, we managed to implement PCA, and next time we will deal with SVM and decision trees.

The K-nearest neighbors (KNN) algorithm is a simple yet efficient classification and regression algorithm. KNN assumes that an observation will be similar to its K closest neighbors. For instance, if most of the neighbors of a given point belong to a given class, it seems reasonable to assume that the point will belong to the same class.

The mathematics of KNN

Now, let’s quickly derive the mathematics used for KNN regression (they are similar for classification).

Let $\mathbf{x}_1, \ldots, \mathbf{x}_n$ be the observations of our training dataset. The points are in $\mathbb{R}^{p}$. We denote $y_1, \ldots, y_n$ the variable we seek to estimate; we know its value for the training dataset.
Let $\mathbf{x}_{n+1}$ be a new point in $\mathbb{R}^{p}$. We do not know $y_{n+1}$ and will estimate it using our training dataset.

Let $k$ be a positive, non-zero integer (the number of neighbors used for estimation). We want to select the $k$ points from the dataset which are closest to $\mathbf{x}_{n+1}$. To do so, we compute the Euclidean distances $d_i = \lVert \mathbf{x}_i - \mathbf{x}_{n+1} \rVert_{L_2}$. From these distances we can compute $D_k$, the smallest radius of the circle centered on $\mathbf{x}_{n+1}$ which includes exactly $k$ points from the training sample.

An estimate $\hat{y}_{n+1}$ of $y_{n+1}$ is now easy to construct: it is the mean of the $y_i$ of the $k$ closest points to $\mathbf{x}_{n+1}$:

$$\hat{y}_{n+1} = \frac{1}{k} \sum_{i \leq n} y_i \, \mathbf{1}_{d_i \leq D_k}$$
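
To make the formula concrete, here is a direct, unvectorized translation for a single new point. This is a sketch of mine (knn_estimate, x_train, y_train and x_new are placeholder names), not code from the original post:

knn_estimate <- function(x_train, y_train, x_new, k = 5) {
  # Euclidean distances d_i between x_new and every training point
  d <- sqrt(rowSums((x_train - matrix(x_new, nrow(x_train), ncol(x_train), byrow = TRUE))^2))
  # D_k: the radius that contains exactly k training points
  D_k <- sort(d)[k]
  # Mean of the y_i whose distance is at most D_k
  mean(y_train[d <= D_k])
}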

KNN regression in R

First, we build a “my_knn_regressor” object which stores all the training points, the value of the target variable and the number of neighbors to use.

my_knn_regressor <- function(x,y,k=5)
{
  if (!is.matrix(x))
  {
    x <- as.matrix(x)
  }
  if (!is.matrix(y))
  {
    y <- as.matrix(y)
  }
  my_knn <- list()
  my_knn[['points']] <- x
  my_knn[['value']] <- y
  my_knn[['k']] <- k
  attr(my_knn, "class") <- "my_knn_regressor"
  return(my_knn)
}

The tricky part of KNN is computing the distances efficiently. We will use the function we created in our previous post on vectorization; the function and the mathematical derivations are explained in that post.

gramMatrix <- function(X, Y)
{
  # Matrix of dot products between the rows of X and the rows of Y
  tcrossprod(X, Y)
}

compute_pairwise_distance <- function(X, Y)
{
  # Squared Euclidean distances: ||x||^2 + ||y||^2 - 2 x.y, for every pair of rows of X and Y
  xn <- rowSums(X ** 2)
  yn <- rowSums(Y ** 2)
  outer(xn, yn, '+') - 2 * tcrossprod(X, Y)
}
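
As a quick sanity check (my own addition, not from the original post), the result should match the squared distances returned by base R's dist():

set.seed(1)
X <- matrix(rnorm(10), ncol = 2)
# should be (near) zero, up to floating-point error
max(abs(compute_pairwise_distance(X, X) - as.matrix(dist(X))^2))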

Now we can build our predictor:

predict.my_knn_regressor <- function(my_knn, x)
{
  if (!is.matrix(x))
  {
    x <- as.matrix(x)
  }
  ##Compute the pairwise distances between the new points and the training points
  dist_pair <- compute_pairwise_distance(x, my_knn[['points']])
  ##Rank the training points by distance and flag the k closest ones:
  ##M[i,j] = 1 if x_j is one of the k closest points to x_i, and 0 otherwise
  M <- t(apply(dist_pair, 1, function(d) rank(d, ties.method = "first") <= my_knn[['k']]))
  ##Sum the values of the k closest points and normalise by k
  M %*% my_knn[['value']] / my_knn[['k']]
}

The last two lines may seem complicated:

  1. apply(dist_pair, 1, ...) goes through the distances row by row, i.e. one new point at a time
  2. rank(d, ties.method = "first") <= my_knn[['k']] flags the k closest training points for that new point
  3. t(...) casts the result into a one-hot matrix M: M[i,j] = 1 if x_j is one of the k closest points to x_i, and M[i,j] = 0 otherwise
  4. M %*% my_knn[['value']] / my_knn[['k']] sums the values of the k closest points and normalises the sum by k

KNN Binary Classification in R

The previous code can be reused as is for binary classification. Your outcome should be encoded as a one-hot (0/1) variable. If the estimated output is greater (resp. less) than 0.5, you can assume that your point belongs to the class encoded as one (resp. zero). We will use the classical iris dataset and classify the setosa versus the virginica species.

iris_class <- iris[iris[["Species"]]!="versicolor",]
iris_class[["Species"]] <- as.numeric(iris_class[["Species"]]!="setosa")
knn_class <- my_knn_regressor(iris_class[,1:2],iris_class[,5])
predict(knn_class,iris_class[,1:2])
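
A quick way to check these predictions against the true labels, using the 0.5 cut-off described above (a usage sketch of mine, not from the original post):

pred <- predict(knn_class, iris_class[, 1:2])
# rows: predicted class with the 0.5 cut-off; columns: observed class
table(predicted = as.numeric(pred > 0.5),
      observed = iris_class[["Species"]])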

Since we only used 2 variables, we can easily plot the decision boundaries on a 2D plot.

#Build grid
x_coord <- seq(min(iris_class[,1]) - 0.2,max(iris_class[,1]) + 0.2,length.out = 200)
y_coord <- seq(min(iris_class[,2])- 0.2,max(iris_class[,2]) + 0.2 , length.out = 200)
coord <- expand.grid(x=x_coord, y=y_coord)
#predict probabilities
coord[['prob']] <- predict(knn_class,coord[,1:2])

library(ggplot2)
ggplot() + 
  ##Add tiles according to probabilities
  geom_tile(data=coord,mapping=aes(x, y, fill=prob)) + scale_fill_gradient(low = "lightblue", high = "red") +
  ##add points
  geom_point(data=iris_class,mapping=aes(Sepal.Length,Sepal.Width, shape=Species),size=3 ) + 
  #add the labels to the plots
  xlab('Sepal length') + ylab('Sepal width') + ggtitle('Decision boundaries of KNN')+
  #remove grey border from the tile
  scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0))

And this gives us this cool plot:

Possible extensions

Our current KNN is basic, but you can improve and test it in several ways:

  • What is the influence of the number of neighbors? (You should see some overfitting/underfitting.)
  • Can you implement metrics other than the $L_2$ distance (see the sketch below)? Can you create kernel KNNs?
  • Instead of doing estimations using only the mean, could you use a more complex mapping?
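
For the metric extension, one possible starting point is a pairwise Manhattan (L1) distance that could be used in place of compute_pairwise_distance() inside the predictor. This is a sketch of mine, not code from the original post:

compute_pairwise_manhattan <- function(X, Y)
{
  # |x - y| summed over coordinates, for every pair of rows of X and Y
  apply(Y, 1, function(y) rowSums(abs(sweep(X, 2, y))))
}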

Thanks for reading ! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.

The post Create your Machine Learning library from scratch with R ! (3/5) – KNN appeared first on Enhance Data Science.


Introducing Datazar Paper

By Aman Tsegai

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

We’re finally here. Datazar Paper is the most exciting tool we’ve created to date. Up until now, I’d like to believe we’ve only made par with the status quo in terms of the tools we have developed for our beloved community of industry researchers, hackers and academics.

We have a vision at Datazar where researchers would eventually be able to create and consume research in one continuous cycle. We introduced tools like Replication (the ability to replicate a research project simply by clicking a button), Discussions (real-time chat where you can exchange code snippets etc…) and Metrics (research tracking and feedback on views, usage etc…). These tools help users create research with their colleagues without juggling different tools.

Datazar Paper is the tool that ties all of these together. A common workflow on Datazar is (1) uploading data, (2) creating an analysis based on the data using, for example, an R notebook and (3) exporting the results/visualizations for reporting or for a research paper. That last step was always awkward to deal with because there are so many ways to go about it. At some point we created a versatile JavaScript notebook to try and tie all of it together to make reporting easier. Before that, we created a new language to simplify creating a report or paper while maintaining interactivity. Reports and papers are everywhere; the problem is they are not interactive and/or reproducible. This article* on the death of traditional papers explains it really well.

In short, we believe the traditional, static paper is not good enough anymore in today’s world to convey today’s complex ideas.

Datazar Paper Editor

Medium, yes this Medium, was a giant leap in terms of how easy it made creating articles. Datazar Paper was heavily influenced by Medium, specifically the editor. We took the same approach and created an editor that allows you to create an article/paper/report like a lego house. No code involved whatsoever. This last bit is extremely important because the people in an organization who usually do the reporting are the people with the least experience in coding. And to be frank, you shouldn’t really need code to create a simple paper or report.

We recognize that the overall transition to an interactive and easily reproducible writing process will be long so I’d like to take this opportunity to say that we still support LaTeX and MarkDown and will continue to do so. In case you didn’t know, Datazar gives you the ability to create LaTeX and MarkDown documents in your browser. In fact, LaTeX documents are the most popular files created on Datazar.

With Datazar Paper, we got one step closer to creating an ecosystem where research is created, shared and preserved automatically. Now you can store your data, analyze it using R & Python notebooks and publish everything from code to visualizations using Datazar Paper. All without switching between a dozen programs and without squeezing your CPU for computational power.

Link to Launch: https://www.producthunt.com/posts/datazar-paper

Some papers I picked out:

https://www.datazar.com/focus/f03b8705b-c0ba-454c-afa0-3a7729a6c96f

https://www.datazar.com/focus/f5abc508d-7091-40cd-a14b-c6f2da005c14

https://www.datazar.com/focus/f649aea35-953d-4eb2-8ad4-037d6c343225

*https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/


Introducing Datazar Paper was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.


Intro to FFTree Exercise

By Biswarup Ghosh

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the exercises below, we will work with the FFTrees package, which lets us use fast-and-frugal decision trees to model data.

Please install the package and load the library before starting
Answers to these exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

The FFTrees package comes with the heart.train and heart.test datasets. Check the heart.train data and look at the diagnosis column; this is our response variable.
Create an FFTrees model using heart.train and heart.test, and check the summary of the model.
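
For orientation, here is a minimal sketch of what such a call might look like, assuming the standard FFTrees() interface (this is not the official exercise solution):

library(FFTrees)

heart_fft <- FFTrees(formula = diagnosis ~ .,
                     data = heart.train,
                     data.test = heart.test)
summary(heart_fft)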

Exercise 2

An FFTrees object is easier to understand by plotting it. Use the plot function to see the tree, and check the probability of heart attack and the probability of a stable heart.
Exercise 3

Create your own custom tree using simple if-else blocks; this allows us to compare a different tree with the default tree.
The custom tree should follow this logic:
“if trestbps > 180 predict attack
if chol > 300 predict heart attack
if age
if thal equals fd or rd predict attack else stable”

Exercise 4

Plot and summarize the new model and check the confusion matrix. Did you improve the result?
Exercise 5

Now, rather than plotting everything, plot just the cues and see how the cues stack up in the FFTrees method.

Exercise 6

Plot the same FFTrees object without the stats. This will show the tree itself, for better understanding and without too much information.
Exercise 7

You can also print the best training tree to see how it’s different and how its confusion matrix differs from the tree that is chosen as the default.


PubMed retractions report has moved

By nsaunders


(This article was first published on R – What You’re Doing Is Rather Desperate, and kindly contributed to R-bloggers)

A brief message for anyone who uses my PubMed retractions report. It’s no longer available at RPubs; instead, you will find it here at GitHub. GitHub Pages hosting is great, once you figure out that docs/ corresponds to your web root 🙂

Now I really must update the code and try to make it more interesting than a bunch of bar charts.


Why you should regret not going to eRum 2018?

By Appsilon Data Science Blog

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

I spent 3 amazing days at the eRum conference in Budapest. The conference was a blast and the organizers (BIG thanks to them again) did a wonderful job compiling such a high-level event.

My favourite talks from the conference

Better than Deep Learning – Gradient Boosting Machines (GBM) in R – Szilard Pafka

Szilard gave a thorough overview of currently available ML algorithms, showed that gradient boosting works better on tabular data than deep learning, and gave his advice on which packages to choose depending on your goals, like maximizing speed on GPU/CPU or going to production.

My takeaway from his talk: you should choose an algorithm based on the problem you have and take into account outside constraints like interpretability. Choosing the model is one thing, but a lot of prediction improvement can come from feature engineering, so domain knowledge and problem understanding matter a lot.

Thanks to Erin LeDell’s talk, we know that the majority of ML tasks can be automated with their awesome AutoML framework.

Harness the R condition system – Lionel Henry

Lionel talked about improving errors in R. Currently R offers error handling solely through the tryCatch function. From the presentation we learned that errors are regular objects. This makes it possible for a user to provide custom classes and error metadata, which makes it much easier to implement handling and reporting. Some of the ideas he shared will be available through a new release of the rlang package.

Show my your model 2.0! – Przemysław Biecek

Przemek, together with Mateusz, gave both a workshop and a talk about the Dalex package, which is an impressive toolkit for understanding machine learning models. Dalex is being developed by a talented group of Przemek’s students in Poland. Thanks to Dalex you can produce single-variable explanations for more than one model at the same time. It’s also easier to understand how a single variable influences the prediction.
You may wonder why you should use Dalex if you are already familiar with Lime; the answer is that Dalex offers many methods for variable inspection (Lime has one) and lets you compare several models using a selected method.

R-Ladies event

The cherry on the cake was the R-Ladies Budapest event, where we could hear 6 amazing presentations. Some of the R-Ladies were giving their second talk during those 3 days. One of them was Omayma Said, talking about her Shiny app: “Stringr Explorer: Tweet Driven Development for a Shiny App!”. It’s a really cool app that helps you navigate the stringr package, and Omayma’s story of how it was created was entertaining and admirable.

Other conference perks

It’s always a pleasure to meet people from the R community and fellow R-Ladies in person.

A great thing about events like this is that there is always something extra you learn: the recipes package is a neat way to do data pre-processing for your model; thanks to Barbara I now know about 2 useful parameters, ignoreInit and once, in the observeEvent function; and Tobias explained when I would want to choose R vs. Python when doing deep learning: if you just need Keras, go for R; everything you can do in Python is available in R!

Making Shiny shine brighter!

Finally, I had the pleasure to give an invited talk about new Shiny packages: “Taking inspirations from proven frontend frameworks to add to Shiny with 6 new packages”. You can access the slides here and watch the video on YouTube. It was really valuable and motivating to get feedback on our open source packages. I’m proud that I’m part of such a great team!

If you like the idea of shiny.users and shiny.admin and would like to know when the packages are released, you can visit packages landing page or keep following our blog.

What next?

I hope the idea of eRum will continue and others will pick it up, so we can all meet again in 2 years’ time!
I’m already jealous of all the lucky people going to useR! this year. I sadly won’t be there, but Marek will, and let me reveal a little secret: he will have shiny.semantic and semantic.dashboard cheat sheets and stickers to give away!

Read the original post at
Appsilon Data Science Blog.

Follow Appsilon Data Science


Finalfit, knitr and R Markdown for quick results

By Ewen Harrison

(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

Thank you for the many requests to provide some extra info on how best to get finalfit results out of RStudio, and particularly into Microsoft Word.

Here is how.

Make sure you are on the most up-to-date version of finalfit.

devtools::install_github("ewenharrison/finalfit")

What follows is for demonstration purposes and is not meant to illustrate model building.

Does a tumour characteristic (differentiation) predict 5-year survival?

Demographics table

First explore variable of interest (exposure) by making it the dependent.

library(finalfit)
library(dplyr)

dependent = "differ.factor"

# Specify explanatory variables of interest
explanatory = c("age", "sex.factor", 
  "extent.factor", "obstruct.factor", 
  "nodes")

Note this useful alternative way of specifying explanatory variable lists:

colon_s %>% 
  select(age, sex.factor, 
  extent.factor, obstruct.factor, nodes) %>% 
  names() -> explanatory

Look at associations between our exposure and other explanatory variables. Include missing data.

colon_s %>% 
  summary_factorlist(dependent, explanatory, 
  p=TRUE, na_include=TRUE)
label              levels        Well    Moderate       Poor      p
       Age (years)           Mean (SD) 60.2 (12.8) 59.9 (11.7)  59 (12.8)  0.788
               Sex              Female   51 (11.6)  314 (71.7)  73 (16.7)  0.400
                                  Male    42 (9.0)  349 (74.6)  77 (16.5)       
  Extent of spread           Submucosa    5 (25.0)   12 (60.0)   3 (15.0)  0.081
                                Muscle   12 (11.8)   78 (76.5)  12 (11.8)       
                                Serosa   76 (10.2)  542 (72.8) 127 (17.0)       
                   Adjacent structures     0 (0.0)   31 (79.5)   8 (20.5)       
       Obstruction                  No    69 (9.7)  531 (74.4) 114 (16.0)  0.110
                                   Yes   19 (11.0)  122 (70.9)  31 (18.0)       
                               Missing    5 (25.0)   10 (50.0)   5 (25.0)       
             nodes           Mean (SD)   2.7 (2.2)   3.6 (3.4)  4.7 (4.4) 

Note the missing data in obstruct.factor. We will drop this variable for now (again, this is for demonstration only). Note also that nodes has not been labelled.
There are small numbers in some variables generating chisq.test warnings (expected counts less than 5 in some cells). Generate the final table.

Hmisc::label(colon_s$nodes) = "Lymph nodes involved"
explanatory = c("age", "sex.factor", 
  "extent.factor", "nodes")

colon_s %>% 
  summary_factorlist(dependent, explanatory, 
  p=TRUE, na_include=TRUE, 
  add_dependent_label=TRUE) -> table1
table1
Dependent: Differentiation                            Well    Moderate       Poor      p
                Age (years)           Mean (SD) 60.2 (12.8) 59.9 (11.7)  59 (12.8)  0.788
                        Sex              Female   51 (11.6)  314 (71.7)  73 (16.7)  0.400
                                           Male    42 (9.0)  349 (74.6)  77 (16.5)       
           Extent of spread           Submucosa    5 (25.0)   12 (60.0)   3 (15.0)  0.081
                                         Muscle   12 (11.8)   78 (76.5)  12 (11.8)       
                                         Serosa   76 (10.2)  542 (72.8) 127 (17.0)       
                            Adjacent structures     0 (0.0)   31 (79.5)   8 (20.5)       
       Lymph nodes involved           Mean (SD)   2.7 (2.2)   3.6 (3.4)  4.7 (4.4) 

Logistic regression table

Now examine the explanatory variables against the outcome. Check that the plot runs OK.

explanatory = c("age", "sex.factor", 
  "extent.factor", "nodes", 
  "differ.factor")
dependent = "mort_5yr"
colon_s %>% 
  finalfit(dependent, explanatory, 
  dependent_label_prefix = "") -> table2
Mortality 5 year                           Alive        Died           OR (univariable)         OR (multivariable)
          Age (years)           Mean (SD) 59.8 (11.4) 59.9 (12.5)  1.00 (0.99-1.01, p=0.986)  1.01 (1.00-1.02, p=0.195)
                  Sex              Female  243 (47.6)  194 (48.0)                          -                          -
                                     Male  268 (52.4)  210 (52.0)  0.98 (0.76-1.27, p=0.889)  0.98 (0.74-1.30, p=0.885)
     Extent of spread           Submucosa    16 (3.1)     4 (1.0)                          -                          -
                                   Muscle   78 (15.3)    25 (6.2)  1.28 (0.42-4.79, p=0.681)  1.28 (0.37-5.92, p=0.722)
                                   Serosa  401 (78.5)  349 (86.4) 3.48 (1.26-12.24, p=0.027) 3.13 (1.01-13.76, p=0.076)
                      Adjacent structures    16 (3.1)    26 (6.4) 6.50 (1.98-25.93, p=0.004) 6.04 (1.58-30.41, p=0.015)
 Lymph nodes involved           Mean (SD)   2.7 (2.4)   4.9 (4.4)  1.24 (1.18-1.30, p

Odds ratio plot

colon_s %>% 
  or_plot(dependent, explanatory, 
  breaks = c(0.5, 1, 5, 10, 20, 30))

To MS Word via knitr/R Markdown

Important: in most R Markdown set-ups, environment objects need to be saved and then loaded into the R Markdown document.

# Save objects for knitr/markdown
save(table1, table2, dependent, explanatory, file = "out.rda")

We use RStudio Server Pro set-up on Ubuntu. But these instructions should work fine for most/all RStudio/Markdown default set-ups.

In RStudio, select File > New File > R Markdown.

A useful template file is produced by default. Try hitting knit to Word on the knitr button at the top of the .Rmd script window.

Now paste this into the file:

---
title: "Example knitr/R Markdown document"
author: "Ewen Harrison"
date: "22/5/2018"
output:
  word_document: default
---

```{r setup, include=FALSE}
# Load data into global environment. 
library(finalfit)
library(dplyr)
library(knitr)
load("out.rda")
```

## Table 1 - Demographics
```{r table1, echo = FALSE, results='asis'}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Table 2 - Association between tumour factors and 5 year mortality
```{r table2, echo = FALSE, results='asis'}
kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Figure 1 - Association between tumour factors and 5 year mortality
```{r figure1, echo = FALSE}
colon_s %>% 
  or_plot(dependent, explanatory)
```

It’s ok, but not great.

Create Word template file

Now, edit the Word template. Click on a table: the style should be compact. Right click > Modify... > font size = 9. Alter heading and text styles in the same way, as desired. Save this as template.docx. Upload it to your project folder. Add this reference to the .Rmd YAML heading, as below. Make sure you get the spacing correct.

The plot also doesn’t look quite right and it prints with warning messages. Experiment with fig.width to get it looking right.

Now paste this into your .Rmd file and run:

---
title: "Example knitr/R Markdown document"
author: "Ewen Harrison"
date: "21/5/2018"
output:
  word_document:
    reference_docx: template.docx  
---

```{r setup, include=FALSE}
# Load data into global environment. 
library(finalfit)
library(dplyr)
library(knitr)
load("out.rda")
```

## Table 1 - Demographics
```{r table1, echo = FALSE, results='asis'}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Table 2 - Association between tumour factors and 5 year mortality
```{r table2, echo = FALSE, results='asis'}
kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Figure 1 - Association between tumour factors and 5 year mortality
```{r figure1, echo = FALSE, warning=FALSE, message=FALSE, fig.width=10}
colon_s %>% 
  or_plot(dependent, explanatory)
```

This is now looking good for me, and further tweaks can be made.

To PDF via knitr/R Markdown

Default settings for PDF:

---
title: "Example knitr/R Markdown document"
author: "Ewen Harrison"
date: "21/5/2018"
output:
  pdf_document: default
---

```{r setup, include=FALSE}
# Load data into global environment. 
library(finalfit)
library(dplyr)
library(knitr)
load("out.rda")
```

## Table 1 - Demographics
```{r table1, echo = FALSE, results='asis'}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Table 2 - Association between tumour factors and 5 year mortality
```{r table2, echo = FALSE, results='asis'}
kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"))
```

## Figure 1 - Association between tumour factors and 5 year mortality
```{r figure1, echo = FALSE}
colon_s %>% 
  or_plot(dependent, explanatory)
```

Again, ok but not great.

We can fix the plot in exactly the same way. But the table is off the side of the page. For this we use the kableExtra package. Install this in the normal manner. You may also want to alter the margins of your page using geometry in the preamble.

---
title: "Example knitr/R Markdown document"
author: "Ewen Harrison"
date: "21/5/2018"
output:
  pdf_document: default
geometry: margin=0.75in
---

```{r setup, include=FALSE}
# Load data into global environment. 
library(finalfit)
library(dplyr)
library(knitr)
library(kableExtra)
load("out.rda")
```

## Table 1 - Demographics
```{r table1, echo = FALSE, results='asis'}
kable(table1, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"),
						booktabs=TRUE)
```

## Table 2 - Association between tumour factors and 5 year mortality
```{r table2, echo = FALSE, results='asis'}
kable(table2, row.names=FALSE, align=c("l", "l", "r", "r", "r", "r"),
			booktabs=TRUE) %>% 
	kable_styling(font_size=8)
```

## Figure 1 - Association between tumour factors and 5 year mortality
```{r figure1, echo = FALSE, warning=FALSE, message=FALSE, fig.width=10}
colon_s %>% 
  or_plot(dependent, explanatory)
```

This is now looking pretty good for me as well.

There you have it. A pretty quick workflow to get final results into Word and a PDF.


The use of R in official statistics conference 2018

By mark

(This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

On September 12-14 the 6th international conference on the use of R in official statistics (#uRos2018) will take place at the Dutch National Statistical Office in Den Haag, the Netherlands. The conference is aimed at producers and users of official statistics from government, academia, and industry. The conference is modeled after the useR! conference and will consist of one day of tutorials (12th September 2018) followed by two days of conference (13, 14 September 2018). Topics include:

  • Examples of applying R in statistical production.
  • Examples of applying R in dissemination of statistics (visualisation, apps, reporting).
  • Analyses of big data and/or application of machine learning for official statistics.
  • Implementations of statistical methodology in the areas of sampling, editing, modelling and estimation, or disclosure control.
  • R packages connecting R to other standard tools/technical standards
  • Organisational and technical aspects of introducing R to the statistical office.
  • Teaching R to users in the office
  • Examples of accessing or using official statistics publications with R in other fields

Keynote speakers

We are very happy to announce that we have confirmed two fantastic keynote speakers.

  • Alina Matei is a professor of statistics at the University of Neuchatel and maintainer of the important sampling package.
  • Jeroen Ooms is a postdoc at UC Berkeley, author of many infrastructural R packages and maintainer of R and Rtools for Windows.

Call for abstracts

The call for abstracts is open until 31 May. You can contribute to the conference by proposing a 20-minute talk or a 3-hour tutorial. Authors also have the opportunity to submit a paper for one of the two journals that will devote a special issue to the conference. Read all about it over here.

Pointers

  • Conference website
  • Follow uRos2018 on Twitter

Why R? 2018 Conf – CfP ends May 25th

By Marcin Kosiński

(This article was first published on http://r-addict.com, and kindly contributed to R-bloggers)

We are pleased to announce the upcoming Why R? 2018 conference that is going to happen in central-eastern Europe (Poland, Wroclaw) this July (2-5th). It is the last week for the call for papers! Submit your talk here.

About

More about the conference can be found on the conference website whyr2018.pl and in the previous blog post we’ve prepared: Why R? 2018 Conference – Registration and Call for Papers Opened.

Pre-meetings

We are organizing pre-meetings in many European cities to cultivate the R experience of knowledge sharing. You are more than welcome to visit upcoming events and check photos and presentations from previous ones. If you are interested in co-organizing a Why R? pre-meeting in your city, let us know (under kontakt_at_whyr.pl) and the Why R? Foundation can provide speakers for the venue!

Past event

Why R? 2017 edition, organized in Warsaw, gathered 200 participants. The Facebook reach of the conference page exceeds 15 000 users, with almost 800 subscribers. Our official web page had over 8000 unique visitors and over 12 000 visits in general. To learn more about Why R? 2017 see the conference after movie (https://vimeo.com/239259242).

Why R? 2017 Conference from Kinema Indigo on Vimeo.