Using the tuber package to analyse a YouTube channel

By insightr

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

So I decided to have a quick look at the tuber package to extract YouTube data in R. My cousin is a singer (a hell of a good one) and he has a YouTube channel (dan vasc), which I strongly recommend, where he posts his covers. I will focus my analysis on his channel. The tuber package is very friendly and it downloads YouTube statistics on comments, views, likes and more straight to R using the YouTube API.

First, let us look at some general information on the channel (code for replication is at the end of the text). The table below shows the number of subscribers, views, videos, etc. at the moment I downloaded the data (2017-12-12, 11:20pm). If you run the code on your computer the results may differ because the channel will have had more activity. Dan’s channel is getting close to 1 million views and he has 58 times more likes than dislikes, averaging roughly 13,000 views per video.

Channel Subscriptions Views Videos Likes Dislikes Comments
Dan Vasc 5127 743322 57 9008 155 1993

We can also see some of the same statistics for each video. I selected only videos published after January 2016, which is when the channel became more active. The list has 29 videos. You can see that the channel became even more active in 2017, and in the last few months it moved to weekly publications.

date title viewCount likeCount dislikeCount commentCount
2016-03-09 “Heart Of Steel” – MANOWAR cover 95288 1968 53 371
2016-05-09 “The Sound Of Silence” – SIMON & GARFUNKEL / DISTURBED cover 13959 556 6 85
2016-07-04 One Man Choir – Handel’s Hallelujah 9390 375 6 70
2016-08-16 “Carry On” – MANOWAR cover 19146 598 12 98
2016-09-12 “You Are Loved (Don’t Give Up)” – JOSH GROBAN cover 2524 142 0 21
2016-09-26 “Hearts On Fire” – HAMMERFALL cover 6584 310 4 58
2016-10-26 “Dawn Of Victory” – RHAPSODY OF FIRE cover 10335 354 5 69
2017-04-28 “I Don’t Wanna Miss A Thing” – AEROSMITH cover 9560 396 5 89
2017-05-09 State of affairs 906 99 1 40
2017-05-26 “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Japanese) 2862 160 4 39
2017-05-26 “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (Português) 3026 235 3 62
2017-05-26 “Cha-La Head Cha-La” – DRAGON BALL Z INTRO cover (English) 2682 108 2 14
2017-06-14 HOW TO BE A YOUTUBE SINGER | ASKDANVASC 01 559 44 1 19
2017-06-17 Promotional Live || Q&A and video games 206 16 0 2
2017-07-18 “The Bard’s Song” – BLIND GUARDIAN cover (SPYGLASS INN project) 3368 303 2 47
2017-07-23 “Numb” – LINKIN PARK cover (R.I.P. CHESTER) 6717 350 14 51
2017-07-27 THE PERFECT TAKE and HOW TO MIX VOCALS | ASKDANVASC 02 305 29 0 11
2017-08-04 4000 Subscribers and Second Channel 515 69 1 23
2017-08-10 “Hello” – ADELE cover [ROCK VERSION] 6518 365 2 120
2017-08-27 “The Rains Of Castamere” (The Lannister Song) – GAME OF THRONES cover 2174 133 5 28
2017-08-31 “Africa” – TOTO cover 18251 642 10 172
2017-09-24 “Chop Suey!” – SYSTEM OF A DOWN cover 2562 236 6 56
2017-10-09 “An American Trilogy” – ELVIS PRESLEY cover 1348 168 1 48
2017-11-08 “Beauty And The Beast” – Main Theme cover | Feat. Alina Lesnik 2270 192 2 59
2017-11-16 “Bohemian Rhapsody” – QUEEN cover 2589 339 3 95
2017-11-23 “The Phantom Of The Opera” – NIGHTWISH/ANDREW LLOYD WEBBER cover | Feat. Dragica 1857 209 2 42
2017-11-24 “Back In Black” – AC/DC cover (RIP MALCOLM YOUNG) 2202 207 2 56
2017-11-30 “Immigrant Song” – LED ZEPPELIN cover 3002 204 1 62
2017-12-07 “Sweet Child O’ Mine” – GUNS N’ ROSES cover 1317 201 2 86

Now that we have seen the data, let’s explore it to check for structure and information. The plots below show how likes, dislikes and comments are related to views. The positive relationship is obvious. However, there is some degree of nonlinearity for likes and comments: the increment in likes and comments becomes smaller as views increase. The dislikes look more linear in the views, but the number of dislikes is too small to be sure.
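
If you want to check this saturation a little more formally, a rough sketch is to fit log-log regressions (this assumes the videostats table built in the code section at the end; a slope below 1 means likes or comments grow more slowly than views):

fit_likes    <- lm(log(likeCount) ~ log(viewCount), data = videostats[-1, ])
fit_comments <- lm(log(commentCount) ~ log(viewCount), data = videostats[-1, ])
coef(fit_likes); coef(fit_comments)  # elasticities of likes and comments with respect to views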

Another interesting aspect is how comments are distributed over time for each video. I selected the four most recent videos and plotted their comment time series below. All videos have a lot of activity in the first days, but it decreases fast a few days later. Followers and subscribers probably see the videos first, and they must be responsible for the intense activity at the beginning of each plot.

[Figure: daily comment counts for the four most recent videos]

The most important information might be how the channel grows over time. Dan’s channel had two important moments in 2017: it became much more active in April and it started weekly publications in November. We can clearly see that both strategies worked in the plot below. I added two dashed lines to mark these two events. In April the number of comments increased a lot, and it increased even more in November.

[Figure: comments per day across the whole channel, with dashed lines marking April and November 2017]

Finally, let’s have a look at what is in the comments using a WordCloud (wordcloud package). I removed words that are not informative such as “you, was, is, were” for English and Portuguese. The result is just below.

[Figure: word cloud of the comments]

Codes

Before using the tuber package you need an ID and a password from Google Developer Console. Click here for more information. If you are interested, the package tubern has some other tools to work with YouTube data such as generating reports.

library(tuber)
library(tidyverse)
library(lubridate)
library(stringi)
library(wordcloud)
library(gridExtra)

httr::set_config( config( ssl_verifypeer = 0L ) ) # = Fixes some certificate problems on linux = #

# = Autentication = #
yt_oauth("ID",
         "PASS",token = "")

# = Download and prepare data = #

# = Channel stats = #
chstat = get_channel_stats("UCbZRdTukTCjFan4onn04sDA")

# = Videos = #
videos = yt_search(term="", type="video", channel_id = "UCbZRdTukTCjFan4onn04sDA")
videos = videos %>%
  mutate(date = as.Date(publishedAt)) %>%
  filter(date > "2016-01-01") %>%
  arrange(date)

# = Comments = #
comments = lapply(as.character(videos$video_id), function(x){
  get_comment_threads(c(video_id = x), max_results = 1000)
})

# = Prep the data = #
# = Video Stat Table = #
videostats = lapply(as.character(videos$video_id), function(x){
  get_stats(video_id = x)
})
videostats = do.call(rbind.data.frame, videostats)
videostats$title = videos$title
videostats$date = videos$date
videostats = select(videostats, date, title, viewCount, likeCount, dislikeCount, commentCount) %>%
  as.tibble() %>%
  mutate(viewCount = as.numeric(as.character(viewCount)),
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))

# = General Stat Table = #
genstat = data.frame(Channel="Dan Vasc", Subscriptions=chstat$statistics$subscriberCount,
                   Views = chstat$statistics$viewCount,
                   Videos = chstat$statistics$videoCount, Likes = sum(videostats$likeCount),
                   Dislikes = sum(videostats$dislikeCount), Comments = sum(videostats$commentCount))

# = videostats Plot = #
p1 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3 = ggplot(data = videostats[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)

# = Comments TS = #
comments_ts = lapply(comments, function(x){
  as.Date(x$publishedAt)
})
comments_ts = tibble(date = as.Date(Reduce(c, comments_ts))) %>%
  group_by(date) %>% count()
ggplot(data = comments_ts) + geom_line(aes(x = date, y = n)) +
  geom_smooth(aes(x = date, y = n), se = FALSE) + ggtitle("Comments by day")+
  geom_vline(xintercept = as.numeric(as.Date("2017-11-08")), linetype = 2,color = "red")+
  geom_vline(xintercept = as.numeric(as.Date("2017-04-28")), linetype = 2,color = "red")

# = Comments by video = #
selected = (nrow(videostats) - 3):nrow(videostats)
top4 = videostats$title[selected]
top4comments = comments[selected]

p = list()
for(i in 1:4){
  df = top4comments[[i]]
  df$date = as.Date(df$publishedAt)
  df = df %>%
    arrange(date) %>%
    group_by(year(date), month(date), day(date)) %>%
    count()
  df$date = make_date(df$`year(date)`, df$`month(date)`,df$`day(date)`)
  p[[i]] = ggplot(data=df) + geom_line(aes(x = date, y = n)) + ggtitle(top4[i])
}
do.call(grid.arrange,p)

## = WordClouds = ##
comments_text = lapply(comments,function(x){
  as.character(x$textOriginal)
})
comments_text = tibble(text = Reduce(c, comments_text)) %>%
  mutate(text = stri_trans_general(tolower(text), "Latin-ASCII"))
remove = c("you","the","que","and","your","muito","this","that","are","for","cara",
         "from","very","like","have","voce","man","one","nao","com","with","mais",
         "was","can","uma","but","ficou","meu","really","seu","would","sua","more",
         "it's","it","is","all","i'm","mas","como","just","make","what","esse","how",
         "por","favor","sempre","time","esta","every","para","i've","tem","will",
         "you're","essa","not","faz","pelo","than","about","acho","isso",
         "way","also","aqui","been","out","say","should","when","did","mesmo",
         "minha","next","cha","pra","sei","sure","too","das","fazer","made",
         "quando","ver","cada","here","need","ter","don't","este","has","tambem",
         "una","want","ate","can't","could","dia","fiquei","num","seus","tinha","vez",
         "ainda","any","dos","even","get","must","other","sem","vai","agora","desde",
         "dessa","fez","many","most","tao","then","tudo","vou","ficaria","foi","pela",
         "see","teu","those","were")
words = tibble(word = Reduce(c, stri_extract_all_words(comments_text$text))) %>%
  group_by(word) %>% count() %>% arrange(desc(n)) %>% filter(nchar(word) >= 3) %>%
  filter(n > 10 & word %in% remove == FALSE) 

set.seed(3)
wordcloud(words$word, words$n, random.order = FALSE, random.color = TRUE,
          rot.per = 0.3, colors = 1:nrow(words))

To leave a comment for the author, please follow the link and comment on their blog: R – insightR.


Linear mixed-effect models in R

By Francisco Lima

(This article was first published on poissonisfish, and kindly contributed to R-bloggers)
Arabidopsis thaliana rosette. From: Wikimedia Commons

Statistical models generally assume that

  1. All observations are independent from each other
  2. The distribution of the residuals follows $\mathcal{N}(0, \sigma^2)$, irrespective of the values taken by the dependent variable $y$

When either of the two does not hold, more sophisticated modelling approaches are necessary. Let’s consider two hypothetical problems that violate the two respective assumptions, where $y$ denotes the dependent variable:

A. Suppose you want to study the relationship between average income (y) and the educational level in the population of a town comprising four fully segregated blocks. You will sample 1,000 individuals irrespective of their blocks. If you model as such, you neglect dependencies among observations – individuals from the same block are not independent, yielding residuals that correlate within block.

B. Suppose you want to study the relationship between anxiety (y) and the levels of triglycerides and uric acid in blood samples from 1,000 people, measured 10 times in the course of 24 hours. If you model as such, you will likely find that the variance of y changes over time – this is an example of heteroscedasticity, a phenomenon characterized by the heterogeneity in the variance of the residuals.

In A. we have a problem of dependency caused by spatial correlation, whereas in B. we have a problem of heterogeneous variance. As a result, classic linear models cannot help in these hypothetical problems, but both can be addressed using linear mixed-effect models (LMMs). In rigour though, you do not need LMMs to address the second problem.

LMMs are extraordinarily powerful, yet their complexity undermines their appreciation by the broader community. LMMs dissect hierarchical and / or longitudinal (i.e. time course) data by separating the variance due to random sampling from the main effects. In essence, on top of the fixed effects normally used in classic linear models, LMMs resolve i) correlated residuals by introducing random effects that account for differences among random samples, and ii) heterogeneous variance using specific variance functions, thereby improving the estimation accuracy and interpretation of fixed effects in one go.

I personally reckon that most relevant textbooks and papers are hard to grasp for non-mathematicians. Therefore, following the brief reference in my last post on GWAS I will dedicate the present tutorial to LMMs. For further reading I highly recommend the ecology-oriented Zuur et al. (2009) and the R-intensive Gałecki et al. (2013) books, and this simple tutorial from Bodo Winter. For agronomic applications, H.-P. Piepho et al. (2003) is an excellent theoretical introduction.

Here, we will build LMMs using the Arabidopsis dataset from the package lme4, from a study published by Banta et al. (2010). These data summarize variation in total fruit set per plant in Arabidopsis thaliana plants conditioned to fertilization and simulated herbivory. Our goal is to understand the effect of fertilization and simulated herbivory adjusted to experimental differences across groups of plants.

Mixed-effect linear models

Whereas the classic linear model with n observational units and p predictors has the vectorized form

$$\mathbf{y} = \mathbf{X}\beta + \epsilon$$

with the $n \times (p+1)$ predictor matrix $\mathbf{X}$, the vector of $p+1$ coefficient estimates $\beta$ and the $n$-long vectors of the response $\mathbf{y}$ and the residuals $\epsilon$, LMMs additionally accommodate separate variance components modelled with a set of random effects $\mathbf{u}$,

$$\mathbf{y} = \mathbf{X}\beta + \mathbf{Z}\mathbf{u} + \epsilon$$

where $\mathbf{Z}$ and $\mathbf{X}$ are design matrices that jointly represent the set of predictors. Random-effects models include only an intercept as the fixed effect and a defined set of random effects. Random effects comprise random intercepts and/or random slopes. Also, random effects might be crossed or nested. In terms of estimation, the classic linear model can easily be solved using the least-squares method. For the LMM, however, we need methods that predict, rather than estimate, $\mathbf{u}$, such as maximum likelihood (ML) and restricted maximum likelihood (REML). Bear in mind that, unlike ML, REML assumes that the fixed effects are not known, hence it is comparatively unbiased (see Chapter 5 in Zuur et al. (2009) for more details). Unfortunately, LMMs too have underlying assumptions: both residuals and random effects should be normally distributed. Residuals in particular should also have uniform variance over different values of the dependent variable, exactly as assumed in a classic linear model.

One of the most common doubts concerning LMMs is determining whether a variable is a random or a fixed effect. First of all, an effect might be fixed, random or even both simultaneously – it largely depends on how you approach a given problem. Generally, you should consider all factors that qualify as sampling from a population as random effects (e.g. individuals in repeated measurements, cities within countries, field trials, plots, blocks, batches) and everything else as fixed. As a rule of thumb, i) factors with fewer than 5 levels should be considered fixed and, conversely, ii) factors with numerous levels should be considered random effects in order to increase the accuracy of the variance estimation. Both points relate to the LMM assumption of having normally distributed random effects.

Best linear unbiased estimators (BLUEs) and predictors (BLUPs) correspond to the values of fixed and random effects, respectively. The usage of the so-called genomic BLUPs (GBLUPs), for instance, elucidates the genetic merit of animal or plant genotypes that are regarded as random effects when trial conditions, e.g. location and year of trials are considered fixed. However, many studies sought the opposite, i.e. using breeding values as fixed effects and trial conditions as random, when the levels of the latter outnumber the former, chiefly because of point ii) outlined above. In GWAS, LMMs aid in teasing out population structure from the phenotypic measures.

Let’s get started with R

We will follow a structure similar to the 10-step protocol outlined in Zuur et al. (2009): i) fit a full ordinary least squares model and run the diagnostics in order to understand what, if anything, is faulty about its fit; ii) fit an identical generalized linear model (GLM) estimated with ML, to serve as a reference for subsequent LMMs; iii) deploy the first LMM by introducing random effects, compare it to the GLM, and optimize the random structure in subsequent LMMs; iv) optimize the fixed structure by determining the significance of fixed effects, always using ML estimation; finally, v) use REML estimation on the optimal model and interpret the results.

You need to have nlme and lme4 installed to proceed. We will first examine the structure of the Arabidopsis dataset.

# Install (if necessary) and load nlme and lme4
library(nlme)
library(lme4)
# Load dataset, inspect size and additional info
data(Arabidopsis)
dim(Arabidopsis) # 625 observations, 8 variables
?Arabidopsis
attach(Arabidopsis)

The Arabidopsis dataset describes 625 plants with respect to the following 8 variables (transcribed from the R help page):

  • reg (region): a factor with 3 levels, NL (Netherlands), SP (Spain), SW (Sweden)
  • popu (population): a factor with the form n.R representing a population in region R
  • gen (genotype): a factor with 24 (numeric-valued) levels
  • rack: a nuisance factor with 2 levels, one for each of two greenhouse racks
  • nutrient: fertilization treatment/nutrient level (1, minimal nutrients, or 8, added nutrients)
  • amd: simulated herbivory or “clipping” (apical meristem damage): unclipped (baseline) or clipped
  • status: a nuisance factor for germination method (Normal, Petri.Plate, or Transplant)
  • total.fruits: total fruit set per plant (integer), henceforth TFPP for short

We will now visualise the absolute frequencies in all 7 factors and the distribution for TFPP.

# Overview of the variables
par(mfrow = c(2,4))
barplot(table(reg), ylab = "Frequency", main = "Region")
barplot(table(popu), ylab = "Frequency", main = "Population")
barplot(table(gen), ylab = "Frequency", las = 2, main = "Genotype")
barplot(table(rack), ylab = "Frequency", main = "Rack")
barplot(table(nutrient), ylab = "Frequency", main = "Nutrient")
barplot(table(amd), ylab = "Frequency", main = "AMD")
barplot(table(status), ylab = "Frequency", main = "Status")
hist(total.fruits, col = "grey", main = "Total fruits", xlab = NULL)

[Figure: frequency barplots for the seven factors and a histogram of total fruit set]

The frequencies are overall balanced, perhaps except for status (i.e. germination method). Genotype, greenhouse rack and fertilizer are incorrectly interpreted as quantitative variables. In addition, the distribution of TFPP is right-skewed. As such, we will encode these three variables as categorical variables and log-transform TFPP to approximate a Gaussian distribution (natural logarithm).

# Transform the three factor variables gen, rack and nutrient, and log-transform TFPP
# (reconstructed: the assignments were truncated; the transformations follow the text above)
Arabidopsis[, c("gen", "rack", "nutrient")] <- lapply(Arabidopsis[, c("gen", "rack", "nutrient")], factor)
Arabidopsis$total.fruits <- log(1 + Arabidopsis$total.fruits) # natural log; +1 copes with the zero counts

A closer look into the variables shows that each genotype is exclusive to a single region. The data contain no missing values.

# gen x popu table
table(gen, popu)
# Any NAs?
any(is.na(Arabidopsis)) # FALSE

Formula syntax basics

At this point I hope you are familiar with the formula syntax in R. Note that interaction terms are denoted by : and fully crossed effects with *, so that A*B = A + B + A:B. The following code example

lm(y ~ x1 + x2*x3)

builds a linear model of $y$ using $x_1$, $x_2$, $x_3$ and the interaction between $x_2$ and $x_3$. In case you want to perform arithmetic operations inside the formula, use the function I. You can also introduce polynomial terms with the function poly. One handy trick I use to expand all pairwise interactions among predictors is

model.matrix(y ~ .*., data = X)

provided a matrix X that gathers all predictors and y. You can also simply use .*. inside the lm call, however you will likely need to preprocess the resulting interaction terms.
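
As a small aside, here is a hypothetical use of I() and poly() mentioned above (dat and the variable names are placeholders):

# second-degree polynomial in x1 plus the literal square of x2
lm(y ~ poly(x1, 2) + I(x2^2), data = dat)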

While the syntax of lme is identical to lm for fixed effects, its random effects are specified under the argument random as

random = ~intercept + fixed effect | random effect

and can be nested using /. In the following example

random = ~1 + C | A/B

the random effect B is nested within random effect A, together with a random intercept and slope with respect to C. Therefore, not only will the groups defined by A and A/B have different intercepts, they will also be explained by slightly different shifts of $\beta$ from the fixed effect C.
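
As a concrete, purely hypothetical illustration of this syntax (dat, A, B, C and y are placeholders, not a model fitted below):

library(nlme)
# random intercept and slope for C, with grouping factor B nested within A
fit <- lme(y ~ C, random = ~ 1 + C | A/B, data = dat, method = "ML")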

Classic linear model

Ideally, you should start with a full model (i.e. including all independent variables). Here, however, we cannot use all descriptors in the classic linear model since the fit would be singular due to the redundancy in the levels of reg and popu. For simplicity I will exclude these alongside gen, since it contains a lot of levels and also represents a random sample (from many other extant Arabidopsis genotypes). Additionally, I would rather use rack and status as random effects in the following models, but note that, having only two and three levels respectively, it is advisable to keep them as fixed effects.

LM <- lm(total.fruits ~ rack + nutrient + amd + status, data = Arabidopsis) # reconstructed: the assignment was truncated
par(mfrow = c(2, 2)); plot(LM) # the diagnostic plots discussed below

[Figure: diagnostic plots (residuals vs. fitted, QQ plot, scale-location, leverage) of the classic linear model]

These diagnostic plots show that the residuals of the classic linear model poorly qualify as normally distributed. Because we have no obvious outliers, the leverage analysis provides acceptable results. We will try to improve the distribution of the residuals using LMMs.

Generalized linear model

We need to build a GLM as a benchmark for the subsequent LMMs. This model can be fit without random effects, just like a lm but employing ML or REML estimation, using the gls function. Hence, it can be used as a proper null model with respect to random effects. The GLM is also sufficient to tackle heterogeneous variance in the residuals by leveraging different types of variance and correlation functions, when no random effects are present (see arguments correlation and weights).

GLM <- gls(total.fruits ~ rack + nutrient + amd + status, data = Arabidopsis, method = "ML") # reconstructed GLS benchmark, ML estimation
summary(GLM)

At this point you might consider comparing the GLM and the classic linear model and note that they are identical. Also, you might wonder why we are using ML instead of REML – as hinted in the introduction, REML comparisons are meaningless in LMMs that differ in their fixed effects. Therefore, we will base all of our comparisons on ML and only use the REML estimation on the final, optimal model.

Optimal random structure

Let's fit our first LMM with all fixed effects used in the GLM and introducing reg, popu, gen, reg/popu, reg/gen, popu/gen and reg/popu/gen as random intercepts, separately.

In order to compare LMMs (and GLM), we can use the function anova (note that it does not work for lmer objects) to compute the likelihood ratio test (LRT). This test will determine if the models are significantly different with respect to goodness-of-fit, as weighted by the trade-off between variance explained and degrees-of-freedom. The model fits are also evaluated based on the Akaike (AIC) and Bayesian information criteria (BIC) – the smaller their value, the better the fit.

lmm1 <- lme(total.fruits ~ rack + nutrient + amd + status, random = ~ 1 | reg, data = Arabidopsis, method = "ML") # first of the seven candidates (reconstructed)
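
A plausible reconstruction of the remaining candidate models implied by the text, and of the comparison call (model numbering assumed):

lmm2 <- update(lmm1, random = ~ 1 | popu)
lmm3 <- update(lmm1, random = ~ 1 | gen)
lmm4 <- update(lmm1, random = ~ 1 | reg/popu)
lmm5 <- update(lmm1, random = ~ 1 | reg/gen)
lmm6 <- update(lmm1, random = ~ 1 | popu/gen)
lmm7 <- update(lmm1, random = ~ 1 | reg/popu/gen)
anova(GLM, lmm1, lmm2, lmm3, lmm4, lmm5, lmm6, lmm7) # LRT, AIC and BIC side by side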

We could now base our selection on the AIC, BIC or log-likelihood. Considering most models are indistinguishable with respect to the goodness-of-fit, I will select lmm6 and lmm7 as the two best models so that we have more of a random structure to look at. Take a look at the distribution of the random effects with plot(ranef(MODEL)). We next proceed to incorporate random slopes.

There is the possibility that the different researchers from the different regions might have handled and fertilized plants differently, thereby exerting slightly different impacts. Let's update lmm6 and lmm7 to include random slopes with respect to nutrient. We first need to set up a control setting that ensures the new models converge.

# Set optimization pars
ctrl <- lmeControl(opt = "optim") # reconstructed: switch the optimizer so the random-slope models converge
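
The updated models themselves are a hedged reconstruction consistent with the text (random intercept and slope for nutrient at the grouping levels of lmm6 and lmm7):

lmm6.2 <- update(lmm6, random = ~ nutrient | popu/gen, control = ctrl)
lmm7.2 <- update(lmm7, random = ~ nutrient | reg/popu/gen, control = ctrl)
anova(lmm6, lmm6.2) # does the random slope improve lmm6?
anova(lmm7, lmm7.2) # and lmm7?
anova(lmm6.2, lmm7.2) # head-to-head comparison of the two augmented models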

Assuming a level of significance $\alpha = 0.05$, the inclusion of random slopes with respect to nutrient improved both lmm6 and lmm7. Comparing lmm6.2 and lmm7.2 head-to-head provides no evidence for differences in fit, so we select the simpler model, lmm6.2. Let's check how the random intercepts and slopes are distributed at the highest level (i.e. gen within popu).

plot(ranef(lmm6.2, level = 2))

[Figure: random intercepts (left) and random slopes for nutrient (right) at the gen-within-popu level]

The random intercepts (left) appear to be normally distributed, except for genotype 34, biased towards negative values. This could warrant repeating the entire analysis without this genotype. The random slopes (right), on the other hand, are rather normally distributed. Interestingly, there is a negative correlation of -0.61 between random intercepts and slopes, suggesting that genotypes with low baseline TFPP tend to respond better to fertilization. Try plot(ranef(lmm6.2, level = 1)) to observe the distributions at the level of popu only. Next, we will use QQ plots to compare the residual distributions between the GLM and lmm6.2 to gauge the relevance of the random effects.

# QQ plots (drawn to the same scale!)
par(mfrow = c(1,2))
lims <- c(-3.5, 3.5) # shared axis limits so both panels are on the same scale (reconstructed)
qqnorm(resid(GLM, type = "normalized"), ylim = lims, main = "GLM"); qqline(resid(GLM, type = "normalized"))
qqnorm(resid(lmm6.2, type = "normalized"), ylim = lims, main = "lmm6.2"); qqline(resid(lmm6.2, type = "normalized"))

[Figure: QQ plots of the normalized residuals from the GLM (left) and lmm6.2 (right), drawn to the same scale]

The improvement is clear. Bear in mind these results do not change with REML estimation. Try different arrangements of random effects with nesting and random slopes, explore as much as possible!

Optimal fixed structure

Now that we are happy with the random structure, we will look into the summary of the optimal model so far (i.e. lmm6.2) and determine if we need to modify the fixed structure.

summary(lmm6.2)

All effects are significant with $\alpha = 0.05$, except for one of the levels from status that represents transplanted plants. Given the significant effect from the other two levels, we will keep status and all current fixed effects. Just for fun, let’s add the interaction term nutrient:amd and see if there is any significant improvement in fit.

lmm8 <- update(lmm6.2, . ~ . + nutrient:amd) # reconstructed: add the interaction, still with ML estimation
anova(lmm6.2, lmm8)

The addition of the interaction was non-significant with respect to both $\beta$ and the goodness-of-fit, so we will drop it. Note that it is not a good idea to add new terms after optimizing the random structure; I did so only because otherwise there would be nothing to do with respect to the fixed structure.

Fit optimal model with REML

We could play a lot more with different model structures, but to keep it simple let's finalize the analysis by fitting the lmm6.2 model using REML and finally identifying and understanding the differences in the main effects caused by the introduction of random effects.

finalModel <- update(lmm6.2, method = "REML") # reconstructed: same structure, now estimated with REML
summary(finalModel)

We will now contrast our REML-fitted final model against a REML-fitted GLM and determine the impact of incorporating random intercept and slope, with respect to nutrient, at the level of popu/gen. Therefore, both will be given the same fixed effects and estimated using REML.

dev.off() # Reset previous graphical pars
# New GLM, updated from the first by estimating with REML
GLM2 <- update(GLM, method = "REML") # reconstructed: same fixed effects as GLM, REML estimation
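
A sketch (not necessarily the original plotting code) that produces a comparable figure from GLM2 (black) and finalModel (red), using the tTable component that summary() returns for both gls and lme fits:

glsTab <- summary(GLM2)$tTable[, 1:2]        # columns: Value, Std.Error
lmmTab <- summary(finalModel)$tTable[, 1:2]
x <- seq_len(nrow(glsTab))
ylim <- range(c(glsTab[, 1] - glsTab[, 2], glsTab[, 1] + glsTab[, 2],
                lmmTab[, 1] - lmmTab[, 2], lmmTab[, 1] + lmmTab[, 2]))
plot(x - 0.1, glsTab[, 1], pch = 16, xlim = c(0.5, length(x) + 0.5), ylim = ylim,
     xaxt = "n", xlab = "", ylab = "Estimate")
points(x + 0.1, lmmTab[, 1], pch = 16, col = "red")
arrows(x - 0.1, glsTab[, 1] - glsTab[, 2], x - 0.1, glsTab[, 1] + glsTab[, 2],
       angle = 90, code = 3, length = 0.05)
arrows(x + 0.1, lmmTab[, 1] - lmmTab[, 2], x + 0.1, lmmTab[, 1] + lmmTab[, 2],
       angle = 90, code = 3, length = 0.05, col = "red")
axis(1, at = x, labels = rownames(glsTab), las = 2, cex.axis = 0.7)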

[Figure: fixed-effect estimates with standard errors from the REML GLM (black) and the final LMM (red)]

The figure above depicts the estimated $\beta$ from the different fixed effects, including the intercept, for the GLM (black) and the final LMM (red). Error bars represent the corresponding standard errors (SE). Overall the results are similar but uncover two important differences. First, for all fixed effects except the intercept and nutrient, the SE is smaller in the LMM. Second, the relative effects from two levels of status are opposite. With the consideration of random effects, the LMM estimated a more negative effect of culturing in Petri plates on TFPP, and conversely a less negative effect of transplantation.

plot(finalModel)

[Figure: residuals vs. fitted values for the final LMM]

The distribution of the residuals as a function of the predicted TFPP values in the LMM is still similar to the first panel in the diagnostic plots of the classic linear model. The usage of additional predictors and generalized additive models would likely improve it.

Conclusions

Now that we account for genotype-within-region random effects, how do we interpret the LMM results?

Plants that were placed in the first rack, left unfertilized, clipped and grown normally have an average $\ln(\mathrm{TFPP})$ of 2.15. This is the value of the estimated grand mean (i.e. intercept), and the predicted $\ln(\mathrm{TFPP})$ when all other factors and levels do not apply. For example, a plant grown under the same conditions but placed in the second rack will be predicted to have a smaller yield, more precisely $\ln(\mathrm{TFPP}) = \beta_{\mathrm{intercept}} + \beta_{\mathrm{rack}} = 2.15 + (-0.75) = 1.40$. To these reported yield values, we still need to add the random intercepts predicted for region and genotype within region (which are tiny values by comparison; think of them as a small adjustment). Moreover, we can state that

  • Fertilized plants produce more fruits than those kept unfertilized. This was the strongest main effect and represents a very sensible finding.
  • Plants grown in the second rack produce fewer fruits than those in the first rack. This was the second strongest main effect identified. Could this be due to light / water availability?
  • Simulated herbivory (AMD) negatively affects fruit yield. This is also a sensible finding – when plants are attacked, more energy is allocated to build up biochemical defence mechanisms against herbivores and pathogens, hence compromising growth and eventually fruit yield.
  • Both culturing in Petri plates and transplantation, albeit indistinguishable, negatively affect fruit yield as opposed to normal growth. When conditions are radically changed, plants must adapt swiftly and this comes at a cost as well. Thus, these observations too make perfect sense.
  • One important observation is that the genetic contribution to fruit yield, as gauged by gen, was normally distributed and adequately modelled as random. One single outlier could eventually be excluded from the analysis. This makes sense assuming plants have evolved universal mechanisms in response to both nutritional status and herbivory that overrule any minor genetic differences among individuals from the same species.

Wrap-up

  • Always check the residuals and the random effects! While both linear models and LMMs require normally distributed residuals with homogeneous variance, the former assumes independence among observations and the latter normally distributed random effects. Use normalized residuals to establish comparisons.
  • One key additional advantage of LMMs we did not discuss is that they can handle missing values.
  • Wide format data should be first converted to long format, using e.g. the R package reshape.
  • Variograms are very helpful in determining spatial or temporal dependence in the residuals. In the case of spatial dependence, bubble plots nicely represent residuals in the space the observations were drawn from (e.g. latitude and longitude; refer to Zuur et al. (2009) for more information).
  • REML estimation is unbiased but does not allow for comparing models with different fixed structures. Only use the REML estimation on the optimal model.

With respect to this particular set of results:

  • The analysis outlined here is not as exhaustive as it should be. Among other things, we did neither initially consider interaction terms among fixed effects nor investigate in sufficient depth the random effects from the optimal model.
  • The dependent variable (total fruit set per plant) was highly right-skewed and required a log-transformation for basic modeling. The large amount of zeros would in rigour require zero inflated GLMs or similar approaches.
  • All predictors used in the analysis were categorical factors. We could similarly use an ANOVA model. LMMs are likely more relevant in the presence of quantitative or mixed types of predictors.

I would like to thank Hans-Peter Piepho for answering my nagging questions over ResearchGate. I hope these superficial considerations were clear and insightful. I look forward to your suggestions and feedback. My next post will cover a joint multivariate model of multiple responses, the graph-guided fused LASSO (GFLASSO), using an R package I am currently developing. Happy holidays!

To leave a comment for the author, please follow the link and comment on their blog: poissonisfish.


styler – A non-invasive source code formatter for R

By Rsome – A blog about some R stuff

(This article was first published on Rsome – A blog about some R stuff, and kindly contributed to R-bloggers)

I am pleased to announce that the R package
styler, which I have worked on
through Google Summer of Code 2017
with Kirill Müller and Yihui Xie,
has reached a mature stage.

You can now install it from CRAN

install.packages("styler")

If your CRAN mirror does not yet have it, you can get it from GitHub with remotes::install_github("r-lib/styler").

The package formats R code, by default according to the tidyverse style guide.
The distinguishing feature of styler is its flexibility. We will introduce some
of the options below. Before I continue, I want to thank my two mentors
from Google Summer of Code, in particular
Kirill Müller, who was an amazing
companion during the three months of coding – and beyond. I feel really blessed
how everything came about. In addition, I would like to thank Google for
organizing GSOC this year and facilitating the involvement of students in open
source projects.

Back to the package: styler can style text, single files, packages and entire
R source trees with the following functions:

  • style_text() styles a character vector.
  • style_file() styles R and Rmd files.
  • style_dir() styles all R and/or Rmd files in a directory.
  • style_pkg() styles the source files of an R package.
  • An RStudio Addin that styles the active R or Rmd file, the current
    package or the highlighted code.

Styling options

We can limit ourselves to styling just spacing information by indicating this
with the scope argument:

library("styler")
library("magrittr")
style_text("a=3; 2", scope = "spaces")
a = 3; 2

If you are reading this post on r-bloggers, there might be issues with
displaying code and rendered output correctly. You can continue reading on
the page this post was published initially.

Or, on the other extreme of the scale, styling spaces, indention, line breaks
and tokens:

style_text("a=3; 2", scope = "tokens")
a <- 3
2

Another option that is helpful to determine the level of ‘invasiveness’ is
strict. If set to TRUE, spaces and line breaks before or after tokens are
set to either zero or one. However, in some situations this might be
undesirable (so we set strict = FALSE), as the following example shows:

style_text(
  "data_frame(
     small  = 2 ,
     medium = 4,#comment without space
     large  =6
   )",
  strict = FALSE
)
data_frame(
  small  = 2,
  medium = 4, # comment without space
  large  = 6
)

We prefer to keep the equals signs after “small”, “medium” and “large” aligned, so we set strict = FALSE to require at least one space (rather than exactly one) around =.

Though simple, hopefully the above examples convey some of the flexibility of
the configuration options available in styler. You can find out more about
options available with the tidyverse style by checking out the help file for tidyverse_style().

Gallery

In the sequel, let us focus on a configuration with
strict = TRUE and scope = "tokens" and illustrate a few more examples of
code before and after styling.

styler can identify and handle unary operators and other math tokens:

# Before
1++1-1-1/2
# After
1 + +1 - 1 - 1 / 2

This is tidyverse style. However, styler offers very granular control over math token spacing. Suppose you like spacing around + and -, but not around /, * and ^. This can be achieved as follows:

style_text(
"1++1/2*2^2",
math_token_spacing = specify_math_token_spacing(zero = c("'/'", "'*'", "'^'"))
)
1 + +1/2*2^2

It can also format complicated expressions that involve line breaking and
indention based on both brace expressions and operators:

# Before
if (x >3) {stop("this is an error")} else {
c(there_are_fairly_long,
1 / 33 *
2 * long_long_variable_names)%>% k(

) }
# After
if (x > 3) {
  stop("this is an error")
} else {
  c(
    there_are_fairly_long,
    1 / 33 *
      2 * long_long_variable_names
  ) %>% k()
}

Lines are broken after ( if a function call spans multiple lines:

# Before
do_a_long_and_complicated_fun_cal("which", has, way, to,
"and longer then lorem ipsum in its full length"
)
# After
do_a_long_and_complicated_fun_cal(
  "which", has, way, to,
  "and longer then lorem ipsum in its full length"
)

styler replaces = with <- for assignment, handles single quotes within strings if necessary, and adds braces to function calls in pipes:

# Before
one= 'one string'
two= "one string in a 'string'"
a %>%
b %>%
c
# After
one <- "one string"
two <- "one string in a 'string'"
a %>%
  b() %>%
  c()

Function declarations are indented if multi-line:

# Before
my_fun <- function(x,
y,
z) {
just(z)
}
# After
my_fun <- function(x,
                   y,
                   z) {
  just(z)
}

styler can also deal with tidyeval syntax:

# Before
mtcars %>%
group_by(!!!my_vars)
# After
mtcars %>%
group_by(!!! my_vars)

If you, say, don’t want comments starting with ### to be indented, you can
formulate an unindention rule:

style_text(
  c(
    "a <- function() {",
    "### not to be indented",
    "# indent normally",
    "33",
    "}"
  ),
  reindention = specify_reindention(regex_pattern = "###", indention = 0)
)
a <- function() {
### not to be indented
  # indent normally
  33
}

Customizing styler – implementing your own style guide

You can customize styler not only with the options of tidyverse_style(): the real flexibility of styler is its support for third-party style guides. Technically speaking, a style guide such as tidyverse_style() is
nothing but a set of transformer functions and options. How you can create
your own style guide is explained in this
vignette.

Wrap-up

I hope I have convinced you that you should give styler a try. If you find
unexpected behavior, you are welcome to file an issue on
GitHub.

To leave a comment for the author, please follow the link and comment on their blog: Rsome – A blog about some R stuff.


Lots of Package News

By Max Kuhn


(This article was first published on Blog – Applied Predictive Modeling, and kindly contributed to R-bloggers)

I’ve sent a slew of packages to CRAN recently (thanks to Swetlana and Uwe). There are updates to:

  • caret was primarily updated to deal with an issue introduced in the last version. It is a good idea to avoid fully loading the underlying modeling packages to avoid name conflicts. We made that change in the last version and it ended up being more complex than thought. A quirk in the regression tests missed it too but the whole thing can be avoided by loading the modeling package. news file

  • rsample had some modest additions including bridges to caret and recipes. The website added more application examples (time series and survival analysis). news file

  • recipes had more substantial enhancements including six new steps, a better interface for creating interactions (using selectors), and the ability to save the processed data in a sparse format. news file

  • Cubist and C50 have been updated and brought into the age of roxygen and pkgdown. C5.0 now has a nice plot method à la partykit as well as a vignette. I’ll be adding a few features to each over time.

Two new packages:

  • yardstick contains many of the performance measurement methods in caret but in a format that is easier to use with dplyr syntax and functional programming (see the short sketch after this list).

  • tidyposterior is a Bayesian version of caret‘s resamples function. It can be used to take the resampling results from multiple models and do more formal statistical comparisons. It is similar in spirit to Benavoli et al (2017). We are looking for nominations for the hex logo so please offer your suggestions (but keep it clean).
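
As a quick illustration of that dplyr-friendly interface, here is a sketch based on the example data bundled with yardstick (not code from the original announcement; the exact output depends on the package version):

library(yardstick)
library(dplyr)
# two_class_example ships with yardstick; metrics() reports accuracy and kappa for it
two_class_example %>% metrics(truth = truth, estimate = predicted)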

To leave a comment for the author, please follow the link and comment on their blog: Blog – Applied Predictive Modeling.


Shiny Server (Pro) 1.5.6

By Alan Dipert

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Shiny Server 1.5.6.875 and Shiny Server Pro 1.5.6.902 are now available.

This release of Shiny Server Pro includes floating license support and Shiny Server contains a small enhancement to the way errors are displayed. We recommend upgrading at your earliest convenience.

Shiny Server 1.5.6.875

  • Use HTTPS for Google Fonts on the error page, which resolves insecure content errors on some browsers when run behind SSL.
    (PR #322)

Shiny Server Pro 1.5.6.902

This release adds floating license support through the license_type configuration directive.
Full documentation can be found at http://docs.rstudio.com/shiny-server/#floating-licenses.

Floating licensing allows you to run fully licensed copies of Shiny Server Pro easily in ephemeral instances, such as Docker containers, virtual machines, and EC2 instances. Instances don’t have to be individually licensed, and you don’t have to manually activate and deactivate a license in each instance. Instead, a lightweight license server distributes temporary licenses (“leases”) to each instance, and the instance uses the license only while it’s running.

This model is perfect for environments in which Shiny Server Pro instances are frequently created and destroyed on demand, and only requires that you purchase a license for the maximum number of concurrent instances you want to run.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


Connecting R to Keras and TensorFlow

By R Views

(This article was first published on R Views, and kindly contributed to R-bloggers)

It has always been the mission of R developers to connect R to the “good stuff”. As John Chambers puts it in his book Extending R:

One of the attractions of R has always been the ability to compute an interesting result quickly. A key motivation for the original S remains as important now: to give easy access to the best computations for understanding data.

From the day it was announced a little over two years ago, it was clear that Google’s TensorFlow platform for Deep Learning is good stuff. This September (see announcement), J.J. Allaire, François Chollet, and the other authors of the keras package delivered on R’s “easy access to the best” mission in a big way. Data scientists can now build very sophisticated Deep Learning models from an R session while maintaining the flow that R users expect. The strategy that made this happen seems to have been straightforward. But, the smooth experience of using the Keras API indicates inspired programming all the way along the chain from TensorFlow to R.

The Keras Strategy

TensorFlow itself is implemented as a Data Flow Language on a directed graph. Operations are implemented as nodes on the graph and the data, multi-dimensional arrays called “tensors”, flow over the graph as directed by control signals. An overview and some of the details of how this all happens are lucidly described in a paper by Abadi, Isard and Murray of the Google Brain Team,

and even more details and some fascinating history are contained in Peter Goldsborough’s paper, A Tour of TensorFlow.

This kind of programming will probably strike most R users as being exotic and obscure, but my guess is that because of the long history of dataflow programming and parallel computing, it was an obvious choice for the Google computer scientists who were tasked to develop a platform flexible enough to implement arbitrary algorithms, work with extremely large data sets, and be easily implementable on any kind of distributed hardware including GPUs, CPUs, and mobile devices.

The TensorFlow operations are written in C++, CUDA, Eigen, and other low-level languages optimized for different operations. Users don’t directly program TensorFlow at this level. Instead, they assemble flow graphs or algorithms using a higher-level language, most commonly Python, that accesses the elementary building blocks through an API.

The keras R package wraps the Keras Python Library that was expressly built for developing Deep Learning Models. It supports convolutional networks (for computer vision), recurrent networks (for sequence processing), and any combination of both, as well as arbitrary network architectures: multi-input or multi-output models, layer sharing, model sharing, etc. (It should be pretty clear that the Python code that makes this all happen counts as good stuff too.)

Getting Started with Keras and TensorFlow

Setting up the whole shebang on your local machine couldn’t be simpler, just three lines of code:

install.packages("keras")
library(keras)
install_keras()

Just install and load the keras R package and then run the keras::install_keras() function, which installs TensorFlow, Python and everything else you need including a Virtualenv or Conda environment. It just works! For instructions on installing Keras and TensorFlow on GPUs, look here.

That’s it; just a few minutes and you are ready to start a hands-on exploration of the extensive documentation on RStudio’s TensorFlow webpage tensorflow.rstudio.com, or jump right in and build a Deep Learning model to classify hand-written numerals using the MNIST data set, which comes with the keras package, or any one of the other twenty-five pre-built examples.
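
To give a taste of what such a model looks like, here is a condensed sketch along the lines of the package’s documented MNIST example (not code from this post; the layer sizes and training settings are illustrative):

library(keras)
mnist <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 784)) / 255
y_train <- to_categorical(mnist$train$y, 10)
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = 784) %>%
  layer_dense(units = 10, activation = "softmax")
model %>% compile(loss = "categorical_crossentropy",
                  optimizer = optimizer_rmsprop(), metrics = "accuracy")
model %>% fit(x_train, y_train, epochs = 5, batch_size = 128, validation_split = 0.2)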

Beyond Deep Learning

Being able to build production-level Deep Learning applications from R is important, but Deep Learning is not the answer to everything, and TensorFlow is bigger than Deep Learning. The really big ideas around TensorFlow are: (1) TensorFlow is a general-purpose platform for building large, distributed applications on a wide range of cluster architectures, and (2) while data flow programming takes some getting used to, TensorFlow was designed for algorithm development with big data.

Two additional R packages make general modeling and algorithm development in TensorFlow accessible to R users.

The tfestimators package, currently on GitHub, provides an interface to Google’s Estimators API, which provides access to pre-built TensorFlow models including SVMs, random forests and k-means. The architecture of the API looks something like this:

[Figure: architecture of the Estimators API]
There are several layers in the stack, but execution on the small models I am running locally goes quickly. Look here for documentation and sample models that you can run yourself.

At the deepest level, the tensorflow package provides an interface to the core TensorFlow API, which comprises a set of Python modules that enable constructing and executing TensorFlow graphs. The documentation on the package’s webpage is impressive, containing tutorials for different levels of expertise, several examples, and references for further reading. The MNIST for ML Beginners tutorial revisits the classification problem described above at a much lower level than the Keras interface, working through the details of a softmax regression.
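
For a flavour of that core API, here is a minimal sketch assuming the TensorFlow 1.x graph-and-session interface that was current at the time:

library(tensorflow)
sess <- tf$Session()
a <- tf$constant(2)
b <- tf$constant(3)
sess$run(a + b) # builds a tiny graph and evaluates it in the session, returning 5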

While Deep Learning is sure to capture most of the R to TensorFlow attention in the near term, I think having easy access to a big league computational platform will turn out to be the most important benefit to R users in the long run.

As a final thought, I am very much enjoying reading the MEAP from the forthcoming Manning Book, Deep Learning with R by François Chollet, the creator of Keras, and J.J. Allaire. It is a really good read, masterfully balancing theory and hands-on practice, that ought to be helpful to anyone interested in Deep Learning and TensorFlow.

To leave a comment for the author, please follow the link and comment on their blog: R Views.


How to develop good R packages (for open science)

By Maëlle Salmon

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I was invited to an exciting ecology & R hackathon in my capacity as a co-editor for the rOpenSci onboarding system for packages. It also worked well geographically, since this hackathon was to take place in Ghent (Belgium), which is not too far away from my new city, Nancy (France). The idea was to have me talk about my “top tips on how to design and develop high-quality, user-friendly R software” in the context of open science, and then be a facilitator at the hackathon.

The talk topic sounded a bit daunting but as soon as I started preparing the talk I got all excited gathering resources – and as you may imagine since I was asked to talk about my tips I did not need to try & be 100% exhaustive. I was not starting from scratch obviously: we at rOpenSci already have well-formed opinions about such software, and I had given a talk about automatic tools for package improvement whose content was part of my top tips.

As I’ve done in the past with my talks, I chose to increase the impact/accessibility of my work by sharing it on this blog. I’ll also share this post on the day of the hackathon to provide my audience with a more structured document than my slides, in case they want to keep some trace of what I said (and it helped me build a good narrative for the talk!). Most of these tips will be useful for package development in general, and a few of them specific to scientific software.

What is a good R package in my opinion?

The answer to this will appear throughout the blog post, but if I had to say it in one sentence, I’d say that a good R package for open science is a useful, robust and well-documented package that’s easy to find and that users will trust.

How to plan your R package

Should your package even exist?

You probably want to write and publish an R package because you have some sort of scripts or functions that you think would be useful to other people, or you do not even have that but wish the functionality existed. I won’t speak about the benefits of building an R package for yourself, or a research compendium to make your work reproducible.

I think you should ask yourself these questions before starting to code away:

  • Will this piece of software really be useful to someone else than you? If not, building a package will still save you time, but you maybe don’t need to put a lot of effort into it. If you doubt, you can ask around you if other people would be interested in using it. Brainstorming package ideas is part of what happened before the hackathon, in a collaborative document.

  • Is your idea really new, or is there already a package out there performing the exact same task?

  • If there is one, would your implementation bring something to the market, e.g. a more user-friendly implementation? There are for instance two R packages that interface the Keras Deep Learning Library, keras and kerasR. In some cases having several packages will be too much, in some cases it won’t because one package will suit some users better.

  • If there is already a similar package, should you work on your own or collaborate with the author(s) of the original package? The rtweet package actually completely replaced the twitteR package after the two maintainers agreed on it. I guess the best thing is to just ask the maintainer, unless their contributing guide already makes clear what’s welcome and what’s not (more on contributing guides later!).

What should your package contain exactly?

Then, once you’ve really decided that you want to write and publish this R package, you could start planning its development. Two things seem important to me at this point:

  • Make your package compatible with the workflow of your users. For instance, if your package has something to do with time series, it’s probably best to try and make your output compatible with time series classes people are used to, and for which there are other packages available.

  • Do not get too ambitious. For instance, do you want to have a specific plotting method for your output, or can you rely on your users’ using, say, ggplot2, merely giving them an example of how to do so in a vignette? I’ve heard that’s part of the Unix philosophy, small pieces building on top of each other. This will save you time, and let you concentrate on the novel and important piece you’re providing to the community.

Once you’ve decided on all of this, maybe sketch a to-do list. If you’re already familiar with Github, which you will be after reading some books recommended in the next section, you can have one issue per item, or even take advantage of Github milestones feature. You don’t need to have a master plan for your whole package life, because one can hope your users will also influence the future features, and well you’ll keep getting smarter as time goes by and the ideas will keep flowing.

If you want to work on your package as a team, make roles clear from the beginning, e.g. who will be the maintainer and keep others right on track… in particular if you start the project as a hackathon, who will make sure development goes forward once everyone goes home?

What will your package be called?

Last but not least in this planning phase, you should call your package something! I’d advise names in lower case as recommended by rOpenSci, because this way your user doesn’t need to remember where the capital letters are (yes, it is a funny recommendation by an organization called rOpenSci), and because a full capital letter name looks like screaming. I recommend checking your package name via the available package which will tell you whether it’s valid, available on the usual packages venues and also whether it has some unintended meanings.
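
For example, a quick check might look like this (the candidate name is arbitrary, and the exact output depends on the package version):

# install.packages("available")
library(available)
available("coolpkg") # validity, availability on CRAN/Bioconductor/GitHub, and possible unintended meanings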

How to build an R package

Here I’ll list useful resources to learn how to build an R package while respecting best practice. When googling a bit about R package development, I was surprised to see that although there is a ton of R packages out there, there aren’t that many guides. Maybe we do not need more than CRAN official guide after all, and Hadley Wickham’s book is excellent, especially with the corresponding RStudio and devtools magic combo… but well diversity of resources is good too!

I’d recommend your building packages in RStudio making the most of automatic tools like devtools. Note, in comparison to the content of Hadley’s book, some automation is now supported by usethis.

Using these short and long form resources will help you increase the quality of your package. It’ll have the structure it should have for instance.

Short tutorials

To learn the 101 of package building, check out Hilary Parker’s famous blog post Writing an R package from scratch that got many R package developers started (I’ve for instance seen quite a few tweets of people celebrating their first R package and thanking Hilary!) or this pictorial by Matthew J Denny. How someone can be patient enough to do so many screenshots is beyond me but it’s really great that he did!

Books

Then you can read books! I’ve mentioned Hadley’s book. There’s also a very good one called Mastering Software Development in R by Roger D. Peng, Sean Kross, and Brooke Anderson. And I’d be damned if I do not mention the bible a.k.a Writing R extensions, the official CRAN guide which is not really a book but well a (very) long form resource. In all cases you can cherry pick what you need and want to read but be sure you follow the rules when submitting a package to CRAN for instance.

MOOCs

If you’re more into MOOCs, there is a Coursera specialization corresponding to the book by Roger Peng, Sean Kross and Brooke Anderson, with a course specifically about R packages.

In-person training

You can also look for in-person training! Look up your local R user group or R-Ladies chapter if there’s one, maybe book a company, etc. If you happen to be in or near South Africa in March 2018, I’ll be teaching a workshop about R package development in Cape Town at satRday 2018 on March the 16th; maybe you can buy a ticket and come!

More precise guidelines

If you want to follow even more precise guidelines than the ones in the resources, check out rOpenSci packaging guide.

Where should your package live?

Open your code!

I was giving a talk in the context of open science, so this one was obvious. R is an open-source language, so if you publish your package there’s no black box. On top of that, you should make the place where your code is created and modified open, to facilitate browsing it, filing bug reports (if they are open, there’s less chance they’ll be duplicated, and people can see how fast, and whether, you respond to them), and collaborating.

I’ve mentioned Github when giving my planning tips earlier. My recommendation is to have the code of your R package live in an open repository with version control. That might be R-Forge and svn if you like, or Github/Gitlab and git. I was introduced to version control basics with svn and R-Forge, and all I can tell you about svn is that the small tortoise of the TortoiseSVN software for Windows was cute! I then learnt git and Github thanks to this part of Hadley’s book, which also gives you good reasons to use git and Github. I’ve heard good things about this online book by Jenny Bryan, and about this website for when bad things happen to you.

Put your package in one of the official platforms

That is, CRAN or Bioconductor, where your package is easier to install, and where it gets a badge of trust, although you might already follow stricter guidelines than these platforms require.

By the way regarding trust, this blog post by Jeff Leek, How I decide when to trust an R package is an interesting read.

A tip specific to CRAN: release your package using devtools::release(), which will help you catch some issues before submitting your package to CRAN, and will let you do the whole submission without even leaving R.

Link the two

I wrote this with my CRAN experience in mind, but this is probably relevant for Bioconductor too.

I want to underline one point: please indicate the URL of your code repository and the link to its issues or bug reports page in the DESCRIPTION of your package. This way, once your package is up on CRAN for instance, users can trace it back more easily. Really, do not forget to do that: put the repository link under URL and the issue page link under BugReports.
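
For instance, the relevant lines of DESCRIPTION could look like this (the repository URL is of course a placeholder for your own):

URL: https://github.com/your-name/yourpackage
BugReports: https://github.com/your-name/yourpackage/issues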

Another important point is that if your package is on CRAN, and your code repository on Github, it’s good to:

  • create a release and attach the corresponding NEWS each time you upload a new version to CRAN (otherwise your release is a surprise release). One reason for having releases is archiving: users can download and install a given version quite easily. Another one is that users might use tools such as Sibbell to be notified of new releases.

  • have a CRAN badge in the README, see this README for instance. It’s easier to install packages from CRAN so people might want to do that, they can see what the latest CRAN version number is, and it acknowledges the hosting on CRAN, for which you should be thankful.

How to check your package automatically

Here I will assume your package has unit tests and is hosted on Github, although the latter isn’t a condition for all tips here. If you’ve read my blog post about automatic tools for improving packages thoroughly, this will sound quite familiar, and I’d advise you to go to this post for more details.

These tools are all here to help you get a better package before releasing it.

R CMD check

Use devtools::check()! The link I’ve put here explains why you should, and the effort you should put into solving the different messages it’ll output.
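
In its simplest form the loop looks like this (a minimal sketch, assuming devtools is installed and your working directory is the package root):

library(devtools)

check()      # runs R CMD check on the package in the working directory
# release()  # once the results are clean: guided, interactive submission to CRAN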

Continuous integration

As a complement to the last sub-section, if your code lives on say Github, you can have R CMD check run on an external platform each time you push changes. This is called continuous integration and will help you spot problems without needing to remember to do the checks. Furthermore, having them run on an external platform allows you to choose the R version so you can check your package works with different R versions without installing them all.

This blog post by Julia Silge is an excellent introduction to continuous integration for beginners.

To perform checks on, say, Windows only once in a while, you might want to check out the R-hub project. Making sure your package works on several platforms is a condition for CRAN release, and in any case something you’ll want to do to have a user-friendly package.
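
As a rough sketch, and assuming the rhub package and its check_for_cran() helper are available to you, a one-off multi-platform check could be as simple as:

library(rhub)

# submit the package in the working directory to the R-hub builders
# (including a Windows platform) and get the results back by email
check_for_cran()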

Style your code

Style your code for future readers: future you, collaborators and curious users. If your code is pretty, it’ll be easier for them to concentrate on its content.

Use lintr for static code analysis.

To prettify your code, use styler.
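
A minimal sketch of both steps, assuming lintr and styler are installed and that you run them from the package root:

# static code analysis: flags style issues and suspicious constructs
lintr::lint_package()

# restyles the package's R files in place, following the tidyverse style guide
styler::style_pkg()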

Get automatic advice about your package

Install and run goodpractice for “Advice (that) includes functions and syntax to avoid, package structure, code complexity, code formatting, etc.”.

Remove typos!

Use the devtools::spell_check() function. Typos in documentation are not the end of the world but can annoy the user, and it’s not hard to remove them thanks to this tool, so do it when you get a chance.
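
Both of these tools are one-liners; a minimal sketch, assuming goodpractice and devtools are installed and you are in the package root:

# runs a battery of checks: syntax to avoid, package structure, complexity, formatting, ...
goodpractice::gp()

# spell-checks the package documentation
devtools::spell_check()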

All these tools provide you with a list of things to change. This means they create a to-do list without your needing to think too much, so if you’re a bit tired but still want to improve your package, you can let them set the agenda!

Bring the package to its users and get feedback

These two aspects are linked because your users will help you make your package better for them, but for that to happen they need to know about your package and to understand it.

Documentation!

You know how your package works, but the users don’t (and future you might forget as well!). Document your parameters and functions, add examples, and also write vignettes. At least one vignette should show your users how the different functions of your package interact.
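
With roxygen2, for instance, the documentation lives right above the function it describes; here is a minimal sketch (the function and its arguments are invented for illustration):

#' Summarise a numeric vector
#'
#' @param x A numeric vector.
#' @param na.rm Should missing values be dropped? Defaults to TRUE.
#'
#' @return A named numeric vector with the mean and standard deviation of x.
#' @export
#' @examples
#' summarise_vector(c(1, 2, 3, NA))
summarise_vector <- function(x, na.rm = TRUE) {
  c(mean = mean(x, na.rm = na.rm), sd = sd(x, na.rm = na.rm))
}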

At the beginning of this document I mentioned that when planning your package you should think about making it compatible with your users’ workflow. One of the vignettes can be an example of just that. Otherwise, it’ll be hard for users to see how useful your package is… and when writing the example workflow, you might find that your package lacks some functionality.

Once you’ve written all these documentation parts, create a website for your package using pkgdown. This tutorial is excellent.
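
Building the site is then a single call (a sketch, assuming pkgdown is installed and run from the package root):

# turns the existing documentation, README and vignettes into a static website
pkgdown::build_site()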

User-friendly programming

I’ll give four tips here, partly inspired by this essay by François Chollet. I want to share these two sentences from the essay with you: “In the long run, good design wins, because it makes its adepts more productive and more impactful, thus spreading faster than user-hostile undesign. Good design is infectious.”

  • Have error and warning messages that are easy to understand. For instance, if communicating with an API, translate HTTP errors.

  • Check the input. For instance, if a parameter provided to a function should be a positive number, check that it is, and if it is not, return a friendly error message (see the sketch after this list).

  • Offer shortcuts. For instance, if your function relies on an API key being provided, you can silently look for it in environment variables without the user doing anything, provided you explain how to store an environment variable in the documentation. This is how my opencage package works, if you want an example. Noam Ross made the PR that added this functionality, and I agree that it makes the package more user-friendly. Likewise, have good default values for other parameters if possible.

  • Choose a good naming scheme that will allow your users to more easily remember the names of your functions. You could choose to have a prefix for all the functions of your package, which is recommended by rOpenSci, and discussed in this very interesting and nuanced blog post by Jon Calder.
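
Here is a minimal sketch of the second and third tips combined; the function, its purpose and the MYSERVICE_API_KEY variable are all hypothetical, invented purely for illustration:

# hypothetical example: check the input and silently pick up an API key
geocode_city <- function(city, api_key = Sys.getenv("MYSERVICE_API_KEY")) {
  if (!is.character(city) || length(city) != 1 || !nzchar(city)) {
    stop("`city` must be a single, non-empty character string, e.g. \"Nairobi\".",
         call. = FALSE)
  }
  if (!nzchar(api_key)) {
    stop("No API key found. Store one with Sys.setenv(MYSERVICE_API_KEY = \"your-key\") ",
         "or pass it via the `api_key` argument.", call. = FALSE)
  }
  # ... call the web service here and translate any HTTP error
  # into a human-readable message ...
  invisible(TRUE)
}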

Promotion

Now that you have a well-documented, error-friendly package… how are your end users going to learn that it exists? You should already have a good idea of your potential users from planning your package (or at least part of them; who knows who else might find it useful?). One way to find good venues for promotion is to think of how you learn about R packages, but as an R developer you’re probably more into R and R news than your future average user! For information, you can also see the results of Julia Silge’s survey about how people discover packages here.

Obvious venues for promotion are social media, for instance Twitter with the #rstats hashtag, in the form of a carefully crafted tweet including a link, a cool image like a figure or screenshot, and maybe emojis; but that might not be enough for your niche package. If you write an ecology package and you have a well-recognized blog in that area, blog about your package of course. You could also ask someone to let you guest post on their well-recognized niche-field blog.

If your package is an interface to an existing tool or internet resource, try to have your package listed on that tool or resource webpage.

As a scientist, you can write a paper about your package and publish it in a journal well known in your field, like Methods in Ecology and Evolution. By the way, when you use R and packages in such a paper, or any other paper, do not forget to cite them (using the output of citation() for instance) to acknowledge their usefulness and help their authors feel their work is valued (furthermore, if they’re academics, having their work cited is quite crucial for their career advancement).
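
For example, getting a ready-made citation for a package you used (ggplot2 here, purely as an example) only takes:

citation("ggplot2")            # human-readable citation
toBibtex(citation("ggplot2"))  # the same entry in BibTeX format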

Review

As a scientist you’re used to having papers peer-reviewed. In general if you submit a paper about your package, your paper will be reviewed more than your package, but you can actually get the benefits of peer-review for your package itself! This is a fantastic way to improve the quality and user-friendliness of your package.

Where can you have your package reviewed?

  • You don’t need anything formal; maybe you can ask someone you know, who could for instance be a user of your package? You can ask them to review the code or its documentation or both, depending on their skills and time of course.

  • The R Journal asks reviewers of papers to have a look at the accompanying package, as does the Journal of Statistical Software. That said, you’ll need to write a whole paper to go with your package, which is maybe something you don’t want to do.

  • The Journal of Open Source Software (JOSS) asks reviewers to check some points in your package, and asks you to write a very minimal paper about it. The review process is minimal but you can still get useful feedback.

  • At rOpenSci we review packages that fit in our categories. Your package will be reviewed by two people. We have a partnership with the Journal of Open Source Software, so that if your package fits in both venues, the review over at JOSS will be a breeze after the rOpenSci reviews. We also have a partnership with MEE, allowing a paper-package duo to be reviewed by MEE and rOpenSci respectively, see this post.

Needless to say, but I’ll write it anyway, all these review systems work both ways… Offer your reviewing services! You’ll actually learn so much that you’ll find ideas for improving your own packages.

Make your repo a friendly place

Another way to make your package higher quality and more user-friendly is to make people feel at ease when they have a question, a bug report, or a suggestion for improvement. Adding a code of conduct (use devtools::use_code_of_conduct() as a starting point) and a contributing guide (with issue and PR templates if you want, in any case making clear what kind of contributions you welcome) to your package repository will help ensure that. Also try to respond to requests in a timely manner.

On top of having a contributing guide, you could also add a status badge to your repository so that any visitor can see at a glance whether your package is still in an early development stage.

If people actually start contributing to the package, be generous with acknowledgements. A selfish reason is that it’ll motivate people to keep contributing to your package. A less selfish reason is that it makes the whole R community more welcoming to new contributors. So you can add discoverers and solvers of bugs, for instance, as ctb in DESCRIPTION, and add a line about their contribution in NEWS.md.
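
In DESCRIPTION this goes through the Authors@R field, which is plain R code; a sketch with placeholder names and a hypothetical function foo():

Authors@R: c(
    person("Package", "Author", email = "author@example.com", role = c("aut", "cre")),
    person("Helpful", "Contributor", role = "ctb",
           comment = "found and helped fix a bug in foo()")
    )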

Package analytics

If you’re curious, you can get an idea of the number of downloads of your package via the cranlogs package. Using it, you can compare your package’s popularity to that of similar packages, and check whether there was a peak after a course you gave, for instance.

Similarly, if your package repository is hosted on Github, using the gh package you can do similar analyses, looking at the number of stars over time for instance.
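
A small sketch of both ideas, using the opencage package and its GitHub repository purely as an example (any package name and repository slug would do):

# daily CRAN downloads over the last month
cranlogs::cran_downloads(packages = "opencage", when = "last-month")

# current number of GitHub stars, via the GitHub API
repo <- gh::gh("GET /repos/:owner/:repo", owner = "ropensci", repo = "opencage")
repo$stargazers_count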

The end

This is the end of my list of tips, do not hesitate to add suggestions in the comments!

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.


When there’s a fork in the road, take it. Or, taking a look at marginal structural models.

By Keith Goldfeld

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I am going to cut right to the chase, since this is the third of three posts related to confounding and weighting, and it’s kind of a long one. (If you want to catch up, the first two are here and here.) My aim with these three posts is to provide a basic explanation of the marginal structural model (MSM) and how we should interpret the estimates. This is obviously a very rich topic with a vast literature, so if you remain interested in the topic, I recommend checking out this (as of yet unpublished) text book by Hernán & Robins for starters.

The DAG below is a simple version of how things can get complicated very fast if we have sequential treatments or exposures that both affect and are affected by intermediate factors or conditions.

\(A_0\) and \(A_1\) represent two treatment points, and \(L_0\) and \(L_1\) represent measurements taken before and after the treatments, respectively. Both treatments and at least \(L_1\) affect the outcome \(Y\). (I am assuming that the \(A\)’s and \(L\)’s are binary and that \(Y\) is continuous. \(\epsilon\) is \(N(0, \sigma_\epsilon^2)\).)

An example of this might be a situation where we are interested in the effect of a drug treatment on mental health well-being for patients with prehypertension or hypertension. A physician’s decision to administer the drug at each visit is influenced by the patient’s level of hypertension. In turn, the treatment \(A_0\) potentially reduces the probability of hypertension, \(P(L_1=1)\). And finally, \(L_1\) influences the next treatment decision and ultimately the mental health outcome \(Y\).

The complicating factor is that the hypertension level following the first treatment, \(L_1\), is both a mediator of the effect of treatment \(A_0\) and a confounder of the effect of treatment \(A_1\) on \(Y\). To get an unbiased estimate of the effect of the combined treatment regime (\(A_0\) and \(A_1\)) we need to both control for \(L_1\) and not control for \(L_1\). This is where MSMs and inverse probability weighting (IPW) come into play.

The MSM is marginal in the sense that we’ve been talking about in this series: the estimate will be a population-wide estimate that reflects the mixture of the covariates that influence the treatments and outcomes (in this case, the \(L\)’s). It is structural in the sense that we are modeling potential outcomes. Nothing has changed from the last post except for the fact that we are now defining the exposures as a sequence of different treatments (here \(A_0\) and \(A_1\), but could easily extend to \(n\) treatments, up to \(A_n\)).

Imagine an experiment …

To understand the MSM, it is actually helpful to think about how a single individual fits into the picture. The tree diagram below literally shows that. The MSM posits a weird experiment where measurements (of \(L\)) are collected and treatments (\(A\)) are assigned repeatedly until a final outcome is measured. In this experiment, the patient is not just assigned to one treatment arm, but to both! Impossible of course, but that is the world of potential outcomes.

At the start of the experiment, a measurement of \(L_0\) is collected. This sends the patient down one of the branches of the tree. Since the patient is assigned to both \(A_0=0\) and \(A_0=1\), she actually heads down two different branches simultaneously. Following the completion of the first treatment period \(A_0\), the second measurement (\(L_1\)) is collected. But two measurements are taken for the patient, one for each branch. The results need not be the same. In fact, if the treatment has an individual-level effect on \(L_1\), then the results will be different for this patient. In the example below, this is indeed the case. Following each of those measurements (in parallel universes), the patient is sent down the next treatment branches (\(A_1\)). At this point, the patient finds herself in four branches. At the end of each, the measurement of \(Y\) is taken, and we have four potential outcomes for individual \(i\): \(Y^i_{00}\), \(Y^i_{10}\), \(Y^i_{01}\), and \(Y^i_{11}\).

A different patient will head down different branches based on his own values of \(L_0\) and \(L_1\), and will thus end up with different potential outcomes. (Note: the values represented in the figure are intended to be average values for that particular path.)

How do we define the causal effect?

With four potential outcomes rather than two, it is less obvious how to define the causal effect. We could, for example, consider three separate causal effects by comparing each of the treatment “regimes” that include at least one exposure to the intervention to the single regime that leaves the patient entirely unexposed. That is, we could be interested in (at the individual \(i\) level) \(E^i_1 = Y^i_{10}-Y^i_{00}\), \(E^i_2 = Y^i_{01}-Y^i_{00}\), and \(E^i_3 = Y^i_{11}-Y^i_{00}\). This is just one possibility; the effects of interest are driven entirely by the research question.

When we have three or four or more intervention periods, the potential outcomes can start to pile up rapidly (we will have \(2^n\) potential outcomes for a sequence of \(n\) treatments), so the researcher might want to be judicious in deciding which contrasts to make. Maybe something like \(Y_{1111} - Y_{0000}\), \(Y_{0111} - Y_{0000}\), \(Y_{0011} - Y_{0000}\), and \(Y_{0001} - Y_{0000}\) for a four-period intervention. This would allow us to consider the effect of starting (and never stopping) the intervention in each period compared to never starting the intervention at all. By doing this, though, we would be using only 5 out of the 16 potential outcomes. If the remaining 11 paths are not so rare, we might be ignoring a lot of data.

The marginal effect

The tree below represents an aggregate set of branches for a sample of 5000 individuals. The sample is initially characterized only by the distribution of \(L_0\). Each individual will go down her own set of four paths, which depend on the starting value of \(L_0\) and how each value of \(L_1\) responds in the context of each treatment arm.

Each individual \(i\) (at least in theory) has four potential outcomes: \(Y^i_{00}\), \(Y^i_{10}\), \(Y^i_{01}\), and \(Y^i_{11}\). Averaging across the sample provides a marginal estimate of each of these potential outcomes. For example, \(E(Y_{00})=\sum_i{Y^i_{00}}/5000\). This can be calculated from the tree as \[(1742 \times 53 + 1908 \times 61 + 392 \times 61 + 958 \times 69)/5000 = 59.7.\] Similarly, \(E(Y_{11}) = 40.1\). The sample average causal effects are estimated using the respective averages of the potential outcomes. For example, \(E_3\) at the sample level would be defined as \(E(Y_{11}) - E(Y_{00}) = 40.1 - 59.7 = -19.6\).

Back in the real world

In reality, there are no parallel universes. Maybe we could come up with an actual randomized experiment to mimic this, but it may be difficult. More likely, we’ll have observed data that looks like this:

Each individual heads down his or her own path, receiving a single treatment at each time point. Since this is not a randomized trial, the probability of treatment differs across levels of \(L_0\) and \(L_1\), and \(L_0\) and \(L_1\) are associated with the outcome (i.e. there is confounding).

Estimating the marginal effects

In the previous posts in this series, I provided some insight as to how we might justify using observed data only to estimate these sample-wide average potential outcomes. The most important assumption is that when we have measured all confounders, we may be able to say, for example, \(E(Y_{01}) = E(Y | A_0 = 0, A_1 = 1)\). The potential outcome for everyone in the sample is equal to the observed outcome for the subgroup who actually followed the particular path that represents that potential outcome. We will make the same assumption here.

At the start of this post, I argued that given the complex nature of the data generating process (in particular, given that \(L_1\) is both a mediator and a confounder), it is challenging to get unbiased estimates of the intervention effects. One way to do this is with marginal structural models (another way is g-computation, but I won’t talk about that here). Inverse probability weighting converts the observed tree graph from the real world into the marginal tree graph, so that we can estimate sample-wide average (marginal) potential outcomes as estimates of the population causal effects.

In this case, the inverse probability weight is calculated as \[IPW = \frac{1}{P(A_0=a_0 | L_0=l_0) \times P(A_1=a_1 | L_0=l_0, A_0=a_0, L_1=l_1)}.\] In practice, we estimate both probabilities using logistic regression or some other modeling technique. But here, we can read the probabilities off the tree graph. For example, if we are interested in the weight associated with individuals observed with \(L_0=1\), \(A_0=0\), \(L_1=0\), and \(A_1=1\), the probabilities are \[P(A_0 = 0 | L_0=1) = \frac{676}{1350}=0.50\] and \[P(A_1=1 | L_0=1, A_0=0, L_1=0) = \frac{59}{196} = 0.30.\]

So, the inverse probability weight for these individuals is \[IPW = \frac{1}{0.50 \times 0.30} = 6.67.\] For the 59 individuals that followed this pathway, the weighted number is \(59 \times 6.67 = 393\). In the marginal world of parallel universes, there were 394 individuals.
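
Just to make the arithmetic concrete, the same weight can be computed directly in R from the counts read off the tree:

p_A0 <- 676 / 1350   # P(A0 = 0 | L0 = 1), roughly 0.50
p_A1 <- 59 / 196     # P(A1 = 1 | L0 = 1, A0 = 0, L1 = 0), roughly 0.30
ipw  <- 1 / (p_A0 * p_A1)   # inverse probability weight, roughly 6.6
59 * ipw                    # weighted count for this path, close to the 394 in the marginal tree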

Simulating data from an MSM

Before I jump into the simulation, I do want to reference a paper by Havercroft and Didelez that describes in great detail how to generate data from an MSM with time-dependent confounding. It turns out that the data can’t be generated exactly using the initial DAG (presented above), but rather needs to come from something like this:

where \(U\) is an unmeasured, maybe latent, covariate. The observed data (that ignore \(U\)) will indeed have a DAG that looks like the one that we started with.

When doing simulations with potential outcomes, we can generate all the potential outcomes for each individual using a parallel-universe approach. The observed data (treatment choices and observed outcomes) are generated separately. The advantage of this is that we can confirm the true causal effects, because we have actually generated the potential outcomes. The disadvantage is that the code is considerably more complicated and the quantity of data generated grows. The situation is not so bad with just two treatment periods, but the size of the data increases exponentially with the number of treatments: as I mentioned earlier, there will be \(2^n\) potential outcomes for each individual.

Alternatively, we can just generate the observed data directly. Since we know the true causal parameters, we actually "know" the causal effects and can compare them with our estimates.

I will go through the convoluted approach because I think it clarifies (at least a bit) what is going on. As an addendum, I will show how all of this could be done in a few lines of code if we take the second approach …

library(broom)
library(simstudy)

# define U, e and L0

defA0 
##    id         U          e L0
## 1:  1 0.1137034 -3.5951796  0
## 2:  2 0.6222994 -0.5389197  0
## 3:  3 0.6092747  1.0675660  0
## 4:  4 0.6233794 -0.7226909  1
## 5:  5 0.8609154  0.8280401  0
## 6:  6 0.6403106  3.3532399  0

Now we need to create the two parallel universes, assigning each individual to both treatments. simstudy has a function addPeriods to generate longitudinal data. I am not generating longitudinal data here, but I can use it to generate 2-period data and change the name of the "period" field to "A0".

dtA0 
##    id A0         U          e L0 timeID
## 1:  1  0 0.1137034 -3.5951796  0      1
## 2:  1  1 0.1137034 -3.5951796  0      2
## 3:  2  0 0.6222994 -0.5389197  0      3
## 4:  2  1 0.6222994 -0.5389197  0      4
## 5:  3  0 0.6092747  1.0675660  0      5
## 6:  3  1 0.6092747  1.0675660  0      6

Now we are ready to randomly assign a value of \(L_1\). The probability is lower for cases where \(A_0 = 1\), so individuals themselves may have different values of \(L_1\) in the alternative paths.

# generate L1 as a function of U, L0, and A0

addA0 
##    id A0         U          e L0 timeID L1
## 1:  1  0 0.1137034 -3.5951796  0      1  0
## 2:  1  1 0.1137034 -3.5951796  0      2  0
## 3:  2  0 0.6222994 -0.5389197  0      3  1
## 4:  2  1 0.6222994 -0.5389197  0      4  0
## 5:  3  0 0.6092747  1.0675660  0      5  0
## 6:  3  1 0.6092747  1.0675660  0      6  0
# L1 is clearly a function of A0

dtA0[, .(prob_L1 = mean(L1)), keyby = .(L0,A0)]
##    L0 A0   prob_L1
## 1:  0  0 0.5238369
## 2:  0  1 0.1080039
## 3:  1  0 0.7053957
## 4:  1  1 0.2078551

Now we create two additional parallel universes for treatment \(A_1\) and the potential outcomes. This will result in four records per individual:

dtA1 
##    id A1 A0         U          e L0 timeID L1     Y_PO
## 1:  1  0  0 0.1137034 -3.5951796  0      1  0 40.90296
## 2:  1  0  1 0.1137034 -3.5951796  0      2  0 32.90296
## 3:  1  1  0 0.1137034 -3.5951796  0      3  0 28.90296
## 4:  1  1  1 0.1137034 -3.5951796  0      4  0 20.90296
## 5:  2  0  0 0.6222994 -0.5389197  0      5  1 64.30306
## 6:  2  0  1 0.6222994 -0.5389197  0      6  0 56.30306
## 7:  2  1  0 0.6222994 -0.5389197  0      7  1 52.30306
## 8:  2  1  1 0.6222994 -0.5389197  0      8  0 44.30306

Not surprisingly, the estimates for the causal effects mirror the parameters we used to generate the \(Y\)’s above…

# estimate for Y_00 is close to what we estimated from the tree

Y_00 
## [1] 59.96619
Y_10 
## [1]  -8 -12 -20

Now that we’ve generated the four parallel universes with four potential outcomes per individual, we will generate an observed treatment sequence and measurements of the \(L\)’s and \(Y\) for each individual. The observed data set will have a single record for each individual:

dt 
##           id L0
##     1:     1  0
##     2:     2  0
##     3:     3  0
##     4:     4  1
##     5:     5  0
##    ---         
## 49996: 49996  1
## 49997: 49997  0
## 49998: 49998  1
## 49999: 49999  0
## 50000: 50000  1

\(A_0\) is a function of \(L_0\):

dtAdd 
##    L0        V1
## 1:  0 0.3015964
## 2:  1 0.4994783

Now, we need to pull the appropriate value of \(L_1\) from the original data set that includes both possible values for each individual. The value that gets pulled will be based on the observed value of \(A_0\):

setkeyv(dt, c("id", "A0"))
setkeyv(dtA1, c("id", "A0"))

dt 
##           id L0 A0 L1
##     1:     1  0  1  0
##     2:     2  0  1  0
##     3:     3  0  0  0
##     4:     4  1  1  1
##     5:     5  0  0  1
##    ---               
## 49996: 49996  1  1  0
## 49997: 49997  0  1  0
## 49998: 49998  1  1  0
## 49999: 49999  0  0  1
## 50000: 50000  1  0  0

Finally, we generate \(A_1\) based on the observed values of \(A_0\) and \(L_1\), and select the appropriate value of \(Y\):

dtAdd 
##           id L0 A0 L1 A1        Y
##     1:     1  0  1  0  0 32.90296
##     2:     2  0  1  0  1 44.30306
##     3:     3  0  0  0  1 53.38856
##     4:     4  1  1  1  1 44.16249
##     5:     5  0  0  1  0 75.21466
##    ---                           
## 49996: 49996  1  1  0  0 74.09161
## 49997: 49997  0  1  0  0 50.26162
## 49998: 49998  1  1  0  0 73.29376
## 49999: 49999  0  0  1  0 52.96703
## 50000: 50000  1  0  0  0 57.13109

If we do a crude estimate of the causal effects using the unadjusted observed data, we know we are going to get biased estimates (remember the true causal effects are -8, -12, and -20):

Y_00 
## [1]  -6.272132 -10.091513 -17.208856

This biased result is confirmed using an unadjusted regression model:

lmfit 
##          term   estimate  std.error statistic p.value
## 1 (Intercept)  58.774695 0.07805319 753.00828       0
## 2          A0  -6.681213 0.10968055 -60.91520       0
## 3          A1 -10.397080 0.10544448 -98.60241       0

Now, shouldn’t we do better if we adjust for the confounders? I don’t think so: the parameter estimate for \(A_0\) should be close to \(-8\) and the estimate for \(A_1\) should be approximately \(-12\), but this is not the case, at least not for both of the estimates:

lmfit 
##          term   estimate  std.error  statistic p.value
## 1 (Intercept)  53.250244 0.08782653  606.31157       0
## 2          L0   7.659460 0.10798594   70.93016       0
## 3          L1   8.203983 0.10644683   77.07119       0
## 4          A0  -4.369547 0.11096204  -39.37875       0
## 5          A1 -12.037274 0.09592735 -125.48323       0

Maybe if we just adjust for \(L_0\) or \(L_1\)?

lmfit 
##          term   estimate  std.error  statistic       p.value
## 1 (Intercept)  54.247394 0.09095074  596.44808  0.000000e+00
## 2          L1   9.252919 0.11059038   83.66839  0.000000e+00
## 3          A0  -2.633981 0.11354466  -23.19775 2.031018e-118
## 4          A1 -12.016545 0.10063687 -119.40499  0.000000e+00
lmfit 
##          term   estimate  std.error  statistic p.value
## 1 (Intercept)  57.036320 0.07700591  740.67459       0
## 2          L0   8.815691 0.11311215   77.93761       0
## 3          A0  -8.150706 0.10527255  -77.42480       0
## 4          A1 -10.632238 0.09961593 -106.73231       0

So, none of these approaches seems to work. This is where IPW provides a solution. First we estimate the treatment/exposure models, then we estimate the IPW, and finally we use weighted regression, or just estimate weighted average outcomes directly (we’d have to bootstrap here if we want standard errors for the simple average approach):

# estimate P(A0|L0) and P(A1|L0, A0, L1)

fitA0 
##          term   estimate  std.error  statistic p.value
## 1 (Intercept)  59.982379 0.09059652  662.08257       0
## 2          A0  -7.986486 0.10464257  -76.32157       0
## 3          A1 -12.051805 0.10464258 -115.17114       0
# non-parametric estimation

Y_00 
## [1]  -8.04 -12.10 -20.04
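
For reference, here is one way the weighting step just described could be coded (a minimal sketch, assuming the observed data sit in a data.table called dt with columns L0, A0, L1, A1 and Y, as generated above):

library(data.table)

# estimate P(A0 | L0) and P(A1 | L0, A0, L1) with logistic regressions
fitA0 <- glm(A0 ~ L0, data = dt, family = binomial())
fitA1 <- glm(A1 ~ L0 + A0 + L1, data = dt, family = binomial())

# probability of the treatment actually received at each time point
pA0 <- predict(fitA0, type = "response")
pA1 <- predict(fitA1, type = "response")
dt[, ipw := 1 / ((A0 * pA0 + (1 - A0) * (1 - pA0)) *
                 (A1 * pA1 + (1 - A1) * (1 - pA1)))]

# weighted regression recovers the marginal effects (about -8 and -12)
summary(lm(Y ~ A0 + A1, data = dt, weights = ipw))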

Addendum

This post has been quite long, so I probably shouldn’t go on. But I wanted to show that we can do the data generation in a much less convoluted way that avoids generating all possible forking paths for each individual. As always with simstudy, the data generation process requires us to create a data definition table. In this example, I’ve created that table in an external file named msmDef.csv. In the end, this simpler approach reduces the necessary code by about 95%.

defMSM 
##    varname                         formula variance      dist     link
## 1:       U                             0;1        0   uniform identity
## 2:       e                               0        9    normal identity
## 3:      L0                      -2.66+ 3*U        0    binary    logit
## 4:      A0                  0.3 + L0 * 0.2        0    binary identity
## 5:      L1    -1.2 + 3*U + 0.2*L0 - 2.5*A0        0    binary    logit
## 6:      A1           0.3 + L1*0.2 + A0*0.2        0    binary identity
## 7:       Y 39.95 + U*40 - A0*8 - A1*12 + e        0 nonrandom identity
dt 
##          term   estimate  std.error  statistic p.value
## 1 (Intercept)  60.061609 0.09284532  646.89967       0
## 2          A0  -7.931715 0.10715916  -74.01808       0
## 3          A1 -12.131829 0.10715900 -113.21335       0
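
For completeness, here is a minimal sketch of what building that definition table directly in code (rather than reading msmDef.csv) could look like, based on the formulas printed above; the sample size is arbitrary:

library(simstudy)

defMSM <- defData(varname = "U",  formula = "0;1", dist = "uniform")
defMSM <- defData(defMSM, varname = "e",  formula = 0, variance = 9, dist = "normal")
defMSM <- defData(defMSM, varname = "L0", formula = "-2.66 + 3*U", dist = "binary", link = "logit")
defMSM <- defData(defMSM, varname = "A0", formula = "0.3 + L0 * 0.2", dist = "binary")
defMSM <- defData(defMSM, varname = "L1", formula = "-1.2 + 3*U + 0.2*L0 - 2.5*A0",
                  dist = "binary", link = "logit")
defMSM <- defData(defMSM, varname = "A1", formula = "0.3 + L1*0.2 + A0*0.2", dist = "binary")
defMSM <- defData(defMSM, varname = "Y",  formula = "39.95 + U*40 - A0*8 - A1*12 + e",
                  dist = "nonrandom")

dt <- genData(50000, defMSM)   # observed data in one call; weight and fit as above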

Does the MSM still work with more complicated effects?

In conclusion, I wanted to show that MSMs still function well even when the causal effects do not follow a simple linear pattern. (And I wanted to be able to end with a figure.) I generated 10000 datasets of 900 observations each, and calculated the crude and marginal causal effects after each iteration. The true treatment effects are described by an "interaction" between \(A_0\) and \(A_1\). If treatment is received in both periods (i.e. \(A_0=1\) and \(A_1=1\)), there is an extra additive effect:

\[ Y = 39.95 + 40U - 8A_0 - 12A_1 - 3A_0A_1 + e \]

The purple density shows the (biased) crude estimates and the green density the (unbiased) IPW-based estimates. Again, the true causal effects are -8, -12, and -23:

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.


The ‘I-Love-IKEA’ – web app, built at the IKEA Hackaton with R and Shiny

By Longhow Lam

(This article was first published on R – Longhow Lam’s Blog, and kindly contributed to R-bloggers)

Introduction

On the 8th, 9th and 10th of December I participated in the IKEA hackathon. In one word it was just FANTASTIC! Well organized, good food, and participants from literally all over the world; even the heavy snowfall on Sunday did not stop us from getting there!

I formed a team with Jos van Dongen and his son Thomas van Dongen and we created the “I-Love-IKEA” app to help customers find IKEA products. And of course using R.

The “I-Love-IKEA” Shiny R app

The idea is simple. Suppose you are in the unfortunate situation that you are not in an IKEA store, and you see a chair, a nice piece of furniture, or something else entirely… Does IKEA have something similar? Just take a picture, upload it using the I-Love-IKEA R Shiny app, and get the best matching IKEA products back.

Implementation in R

How was this app created? The following outlines the steps we took during the hackathon to build the web app.

First, we scraped 9000 IKEA product images using rvest; then each image is 'scored' using a pre-trained VGG16 network with the top layers removed.

That means that for each IKEA image we have a 7*7*512 tensor; we flattened this tensor into a 25088-dimensional vector. Putting all these vectors in a matrix, we have a 9000 by 25088 matrix.

If you have a new image, we use the same pre-trained VGG16 network to generate a 25088-dimensional vector for it. Now we can calculate all 9000 distances (for example, cosine similarities) between this new image and the 9000 IKEA images, and select, say, the top 7 matches.
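
Here is a rough sketch of that scoring step, assuming the keras package with a working TensorFlow backend, and that the 9000 flattened IKEA feature vectors are already stored row-wise in a matrix called ikea_features (the object names are made up for illustration):

library(keras)

# VGG16 without its classification head: outputs a 7 x 7 x 512 feature tensor
vgg <- application_vgg16(weights = "imagenet", include_top = FALSE)

featurize <- function(path) {
  img <- image_load(path, target_size = c(224, 224))
  x   <- image_to_array(img)
  x   <- array_reshape(x, c(1, dim(x)))
  x   <- imagenet_preprocess_input(x)
  as.numeric(predict(vgg, x))          # flatten to a 25088-dimensional vector
}

new_vec <- featurize("my_photo.jpg")

# cosine similarity between the new image and all 9000 IKEA images
sims <- (ikea_features %*% new_vec) /
  (sqrt(rowSums(ikea_features^2)) * sqrt(sum(new_vec^2)))

head(order(sims, decreasing = TRUE), 7)  # indices of the 7 best matches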

A few examples

A Shiny web app

To make this useful for an average consumer, we put it all in an R Shiny app using the miniUI package, so that the web site is mobile friendly. A few screenshots:

The web app starts with an 'IKEA-style' instruction screen, then it allows you to take a picture with your phone, or use one that you already have on your phone. It uploads the image and searches for the best matching IKEA products.

The R code is available from my GitHub, and a live running Shiny app can be found here.

Conclusions

Obviously, there are still many adjustments you could make to the app to improve the matching, for example pre-processing the images before they are sent through the VGG network. But there was no more time.

Unfortunately, we did not win a prize at the hackathon; the jury did, however, find our idea very interesting. More importantly, we had a lot of fun. In Dutch: "Het waren 3 fantastische dagen!" ("They were 3 fantastic days!").

Cheers, Longhow.

To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam’s Blog.


nrow, references and copies

By Pierre Jacob

(This article was first published on R – Statisfaction, and kindly contributed to R-bloggers)

Claude Monet’s paintings have nothing to do with the topic of this post.

Hi all,

This post deals with a strange phenomenon in R that I have noticed while working on unbiased MCMC. Reducing the problem to a simple form, consider the following code, which iteratively samples a vector ‘x’ and stores it in a row of a large matrix called ‘chain’ (I’ve kept the MCMC terminology).

dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
 if (imcmc == nrow(chain)){ #call to nrow
 }
 x = rnorm(dimstate, mean = 0, sd = 1)
 chain[imcmc,] = x #copying of x in chain
}

If you execute this code, you will see that it is surprisingly slow: it takes close to a minute on my computer. Now, consider the next block, which does exactly the same except that the vector ‘x’ is not copied into the matrix ‘chain’.

dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
 if (imcmc == nrow(chain)){ #call to nrow
 }
 x = rnorm(dimstate, mean = 0, sd = 1)
 # chain[imcmc,] = x #no more copying
}

This code runs nearly instantaneously. Could it be that just copying a vector into a matrix takes a lot of time? Sounds unlikely. Now consider this third block.

dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
 if (imcmc == nmcmc){ #no call to nrow
 }
 x = rnorm(dimstate, mean = 0, sd = 1)
 chain[imcmc,] = x #copying of x in chain
}

This code runs nearly instantaneously as well; this time 'x' is copied into 'chain', but the call to the nrow function is removed…?! What is nrow doing? It is meant to simply return dim(chain)[1], the first dimension of chain. So consider this fourth block.

dimstate = 100
nmcmc = 1e4
chain = matrix(0, nrow = nmcmc, ncol = dimstate)
for (imcmc in 1:nmcmc){
 if (imcmc == dim(chain)[1]){ #call to dim instead of nrow
 }
 x = rnorm(dimstate, mean = 0, sd = 1)
 chain[imcmc,] = x #copying of x in chain
}

This one also runs instantaneously! So replacing nrow(chain) by dim(chain)[1] solves the problem. Why?

The answer comes from R guru and terrific statistician Louis Aslett. I directly quote from an exchange of emails, since he brilliantly explains the phenomenon.

You probably know R stores everything by reference, so if I do:

x
y

I actually only have one copy of the matrix in memory with two references to it. If I then do:

x[1,1]

R will first make a copy of the whole matrix, update x to point to that and then change the first element to one. This idea is used when you pass a variable to a standard (i.e. non-core, non-primitive) R function, which nrow is: it creates a reference to the variable you pass so that it doesn’t have to copy and the function call is very fast …. as long as you don’t write to it inside the function, no copy need ever happen. But the “bad design” bit is that R makes a decision whether to copy on write based only on a reference count and crucially that reference count stays increased even after a function returns, irrespective of whether or not the function has touched the variable.

So:

x
x[1,1]
nrow(x) # ref count is 2 even though nothing was touched
x[1,1]
x[2,2]
nrow(x) # ref count jumps
x[3,3]

So by calling nrow in the loop for the first example, the chain matrix is being copied in full on every iteration. In the second example, chain is never written to so there is no negative side effect to the ref count having gone up. In the third example, chain only ever has ref count 1 so there are no copies and each row is written in-place. I did a quick bit of profiling and indeed in the slow example, the R garbage collector allocates and tidies up nearly 9GB of RAM when executing the loop!

The crazy thing is that dim(chain)[1] works full speed even though that is all that nrow is doing under the hood, but the reason is that dim is a so-called “primitive” core R function which is special because it doesn’t affect the reference counter of its arguments. If you want to dig into this yourself, there’s a function refs() in the pryr package which tells you the current reference count to any variable.

Thanks Louis!
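
If you want to poke at this yourself, here is a small sketch using pryr; the exact counts you see can depend on your R version and front end, so treat the printed values as indicative only:

library(pryr)

chain <- matrix(0, nrow = 10, ncol = 10)
refs(chain)          # reference count right after creation

invisible(nrow(chain))
refs(chain)          # the non-primitive call typically bumps the count

chain2 <- matrix(0, nrow = 10, ncol = 10)
invisible(dim(chain2)[1])
refs(chain2)         # dim() is primitive, so the count is typically unchanged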

To leave a comment for the author, please follow the link and comment on their blog: R – Statisfaction.
