## Canada Labour Market: Future Perspectives

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

For anyone looking for job opportunities, it is nice to have an idea how the job market will perform in the future for your chosen career or industry. Many countries have open data sets that offer this kind of data. In these exercises we will use R to analyze the future perspective of Canadian labour market.

Answers to the exercises are available here.

Exercise 1

Exercise 2
Load library tidyr. Use `gather` to rearrange any occupation related data set that present time series data into a tidy data format.

Exercise 3
Use `gather` to rearrange ALL other occupation related data sets that present time series data into a tidy data format, and pile out them in a unique data frame.

Exercise 4
Remove lines that present NA values, columns in French, and the “X” in front every year. Take a look at your tidy data set.

Learn more about Data manipulation in the online course R: Complete Data Analysis Solutions. In this course you will learn how to:

• Learn indepth how to work with dplyr
• Get a full introduction to the data.table package
• And much more

Exercise 5
Let’s do the same with industries data sets. Start by taking one of the industries data sets that present data in a time series an use `gather`.

Exercise 6
Do the same procedure of exercise 5 to all other industries data sets. Pile out them in a new data frame.

Exercise 7
Remove NAs, and French columns. In addition, set year and value as numeric, and take a look at your new tidy data set about industries.

Exercise 8
Find out the industries that have que lowest number of jobseekers, and create a new data set by sub setting the previous one.

Exercise 9
Plot the recently create data set using a line for each industry.

Exercise 10
Create a similar plot for the top 5 occupations in terms of low amount of jobseekers.

### Related exercise sets:

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## In defense of wrapr::let()

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Saw this the other day:

In defense of `wrapr::let()` (originally part of `replyr`, and still re-exported by that package) I would say:

• `let()` was deliberately designed for a single real-world use case: working with data when you don’t know the column names when you are writing the code (i.e., the column names will come later in a variable). We can re-phrase that as: there is deliberately less to learn as `let()` is adapted to a need (instead of one having to adapt to `let()`).
• The `R` community already has months of experience confirming `let()` working reliably in production while interacting with a number of different packages.
• `let()` will continue to be a very specific, consistent, reliable, and relevant tool even after `dpyr 0.6.*` is released, and the community gains experience with `rlang`/`tidyeval` in production.

If `rlang`/`tidyeval` is your thing, by all means please use and teach it. But please continue to consider also using `wrapr::let()`. If you are trying to get something done quickly, or trying to share work with others: a “deeper theory” may not be the best choice.

An example follows.

In “base R” one can write:

```d
If we know the column name we wish to add to this data frame we write:
```
```d\$newColumn
The above is “non-parameterized” evaluation in that the variable name is taken from the source code, and not from a variable carrying the information.  This is great for ad-hoc analysis, but it would be hard to write functions, scripts and packages if this was the only way we had to manipulate columns.
This isn't a problem as `R` supplies an additional parameterized notation as we show here.  When we don't know the name of the column (but expect it to be in a variable at run time) we write:
```
```# code written very far away from us
variableHoldingColumnName
The purpose of `wrapr::let()` is to allow one to use the non-parameterized form as if it were parameterized. Obviously this is only useful if there is not a convenient parameterized alternative (which is the case for some packages).  But for teaching purposes: how would `wrapr::let()` let us use the “`\$`” as if it were parameterized (which we would have to do if `[[]]` and `[]` did not exist)?
With `wrapr::let()` we can re-write the “dollar-sign form” as:
```
```wrapr::let(
c(NEWCOL = variableHoldingColumnName),
{
d\$NEWCOL
The name “`NEWCOL`” is a stand-in name that we write all our code in terms of.  The expression “`c(NEWCOL = variableHoldingColumnName)`” is saying: “`NEWCOL` is to be interpreted as being whatever name is stored in the variable `variableHoldingColumnName`.”  Notice we can't tell from the code what value is in `variableHoldingColumnName`, as that won't be known until execution time.  The alias list or vector can be arbitrarily long and built wherever you like (it is just data).  The expression block can be arbitrarily large and complex (so you need only pay the mapping notation cost once per function, not once per line of code).
And that is `wrapr::let()`.
If you don't need parametric names you don't need `wrapr::let()`. If you do need parametric names `wrapr::let()` is one of the most reliable and easiest to learn ways to get them.
A nice article from a user recording their experience trying different ways to parameterize their code (including `wrapr::let()`) can be found here.
If you wish to learn more we have a lot of resources available:
```

And many more examples on our blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## A Primer in functional Programming in R (part -2)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the last exercise, We have seen how powerful functional programming principles can be and how it can drammatically increase the readablity of the code and how easily you can work with them .In this set of exercises we will look at functional programming principles with `purrr`.Purrr comes with a number of interesting features and is really useful in writing clean and concise code . Please check the documentation and load the purrr library in your R session before starting these exercise set .
Answers to the exercises are available here

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
From the airquality dataset( available in base R ) , Find the mean ,median ,standard deviation of all columns using map functions .

Exercise 2

In the same dataset,find 95th percentile of each column excluding the NA values

Exercise 3
Load the iris dataset ,with help of `pipe` and `map ` functions find out the mean of the relavant columns.Keep in mind mean is meant for numeric columns ,so you may need multiple map like functions.I expect the output as a dataframe .

Exercise 4
I have a vector ` x ,I want to multiply every odd element with 2 and every even element with 4 , find a solution using purrr . `

Exercise 5
I have a sequence ``` x , I want to increase the 1st,11th, 21st…91st element by 1 . How can I achieve that .Exercise 6 Suppose I have two character vectors x y ```

``` How do I join each capital letter with the small letter so that the end result looks like this "A,a" "B,b" ,.. and so on Exercise 7 The previous exercise gave you the intuition of how to work parallely on two vectors.Now accepts a list of character vectors of same size and joins them like the previous exercise . so if I have a list like x ,it should give me output as . [1] "a d g" "b e h" "c f i" ```

Exercise 8
using a functional tool from purrr ,reverse the letters so that my output is a string of 26 character starting from z and ending at a
like “z y x w v u t s r q p o n m l k j i h g f e d c b a”

Exercise 9
Exercise on Purrr wont be complete if We dont mention its “Filter” like functions . take a numeric vector of 1:100 , keep all the numbers which are divisible by 2 and after that remove the one’s which are divisible by 5

Exercise 10
Ability to create partial functions is a powerful tool in functional programming. This allow functions to behave a little like data structures .Consider creating a partial function whenever you see you are repeating functions argument. This may not sound very useful in context of this exercise but I am sure you will find it very useful in your R gigs .Now create a Partial function
ql ,which is to find the 25th percentile and 50th percentile of a column from a dataframe .use it to find the same from airquality dataset .Dont use quantile twice in this exercise .

### Related exercise sets:

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## A tidy model pipeline with twidlr and broom

(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

## The problem

Different model functions take different types of inputs (data.frames, matrices, etc) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:

Thus, different models may need very different code.

However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.

## Two-step modelling

To understand the solution, think of the problem as a two-step process, depicted in this Figure:

### Step 1: from data to fitted model

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!

To demonstrate:

``````#devtools::install_github("drimonj/twidlr")  # To install
library(twidlr)

lm(mtcars, hp ~ .)
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat
#>      79.048       -2.063        8.204        0.439       -4.619
#>          wt         qsec           vs           am         gear
#>     -27.660       -1.784       25.813        9.486        7.216
#>        carb
#>      18.749
``````

This means we can pipe data.frames into any model function exposed by twidlr. For example:

``````library(dplyr)

mtcars %>% lm(hp ~ .)
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat
#>      79.048       -2.063        8.204        0.439       -4.619
#>          wt         qsec           vs           am         gear
#>     -27.660       -1.784       25.813        9.486        7.216
#>        carb
#>      18.749
``````

### Step2: fitted model to tidy results

Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: `glance`, `tidy`, and `augment`! To demonstrate:

``````#install.packages("broom")  # To install
library(broom)

fit % lm(hp ~ .)

glance(fit)
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

tidy(fit)
#>           term    estimate   std.error  statistic     p.value
#> 1  (Intercept)  79.0483879 184.5040756  0.4284371 0.672695339
#> 2          mpg  -2.0630545   2.0905650 -0.9868407 0.334955314
#> 3          cyl   8.2037204  10.0861425  0.8133655 0.425134929
#> 4         disp   0.4390024   0.1492007  2.9423609 0.007779725
#> 5         drat  -4.6185488  16.0829171 -0.2871711 0.776795845
#> 6           wt -27.6600472  19.2703681 -1.4353668 0.165910518
#> 7         qsec  -1.7843654   7.3639133 -0.2423121 0.810889101
#> 8           vs  25.8128774  19.8512410  1.3003156 0.207583411
#> 9           am   9.4862914  20.7599371  0.4569518 0.652397317
#> 10        gear   7.2164047  14.6160152  0.4937327 0.626619355
#> 11        carb  18.7486691   7.0287674  2.6674192 0.014412403

#>           .rownames  hp  mpg cyl disp drat    wt  qsec vs am gear carb
#> 1         Mazda RX4 110 21.0   6  160 3.90 2.620 16.46  0  1    4    4
#> 2     Mazda RX4 Wag 110 21.0   6  160 3.90 2.875 17.02  0  1    4    4
#> 3        Datsun 710  93 22.8   4  108 3.85 2.320 18.61  1  1    4    1
#> 4    Hornet 4 Drive 110 21.4   6  258 3.08 3.215 19.44  1  0    3    1
#> 5 Hornet Sportabout 175 18.7   8  360 3.15 3.440 17.02  0  0    3    2
#> 6           Valiant 105 18.1   6  225 2.76 3.460 20.22  1  0    3    1
#>     .fitted     .resid      .hat   .sigma     .cooksd .std.resid
#> 1 148.68122 -38.681220 0.2142214 24.75946 0.069964902 -1.6801773
#> 2 140.62866 -30.628664 0.2323739 25.43881 0.049861042 -1.3460408
#> 3  79.99158  13.008418 0.3075987 26.38216 0.014633059  0.6019364
#> 4 125.75448 -15.754483 0.2103960 26.31579 0.011288712 -0.6826601
#> 5 183.21756  -8.217565 0.2016137 26.53317 0.002878707 -0.3541128
#> 6 111.38490  -6.384902 0.3147448 26.55680 0.003682813 -0.2969840
``````

## A single, tidy pipeline

So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:

``````library(twidlr)
library(broom)

mtcars %>%
lm(hp ~ .)  %>%
glance()
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21
``````

Any model included in twidlr and broom can be used in this same way. Here’s a `kmeans` example:

``````iris %>%
select(-Species) %>%
kmeans(centers = 3) %>%
tidy()
#>         x1       x2       x3       x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871   62 39.82097       1
#> 2 5.006000 3.428000 1.462000 0.246000   50 15.15100       2
#> 3 6.850000 3.073684 5.742105 2.071053   38 23.87947       3
``````

And a ridge regression with cross-fold validation example:

``````mtcars %>%
cv.glmnet(am ~ ., alpha = 0) %>%
glance()
#>   lambda.min lambda.1se
#> 1  0.2284167  0.8402035
``````

So next time you want to do some tidy modelling, keep this pipeline in mind:

## Limitations

Currently, the major limitation for this approach is that a model must be covered by twidlr and broom. For example, you can’t use `randomForest` in this pipeline because, although twidlr exposes a data.frame friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.

## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Correcting bias in meta-analyses: What not to do (meta-showdown Part 1)

By FelixS

tl;dr: Publication bias and p-hacking can dramatically inflate effect size estimates in meta-analyses. Many methods have been proposed to correct for such bias and to estimate the underlying true effect. In a large simulation study, we found out which methods do not work well under which conditions, and give recommendations what not to use.

It is well known that publication bias and p-hacking inflate effect size estimates from meta-analyses. In the last years, methodologists have developed an ever growing menu of statistical approaches to correct for such overestimation. However, to date it was unclear under which conditions they perform well, and what to do if they disagree. Born out of a Twitter discussion, Evan Carter, Joe Hilgard, Will Gervais and I did a large simulation project, where we compared the performance of naive random effects meta-analysis (RE), trim-and-fill (TF), p-curve, p-uniform, PET, PEESE, PET-PEESE, and the three-parameter selection model (3PSM).

Previous investigations typically looked only at publication bias or questionable research practices QRPs (but not both), used non-representative study-level sample sizes, or only compared few bias-correcting techniques, but not all of them. Our goal was to simulate a research literature that is as realistic as possible for psychology. In order to simulate several research environments, we fully crossed five experimental factors: (1) the true underlying effect, δ (0, 0.2, 0.5, 0.8); (2) between-study heterogeneity, τ (0, 0.2, 0.4); (3) the number of studies in the meta-analytic sample, k (10, 30, 60, 100); (4) the percentage of studies in the meta-analytic sample produced under publication bias (0%, 60%, 90%); and (5) the use of QRPs in the literature that produced the meta-analytic sample (none, medium, high).

This blog post summarizes some insights from our study, internally called “meta-showdown”. Check out the preprint; and the interactive app metaExplorer. The fully reproducible and reusable simulation code is on Github, and more information is on OSF.

In this blog post, I will highlight some lessons that we learned during the project, primarily focusing on what not do to when performing a meta-analysis.

Limitation of generalizability disclaimer: These recommendations apply to typical sample sizes, effect sizes, and heterogeneities in psychology; other research literatures might have different settings and therefore a different performance of the methods. Furthermore, the recommendations rely on the modeling assumptions of our simulation. We went a long way to make them as realistic as possible, but other assumptions could lead to other results.

### Never trust a naive random effects meta-analysis or trim-and-fill (unless you meta-analyze a set of registered reports)

If studies have no publication bias, nothing can beat plain old random effects meta-analysis: it has the highest power, the least bias, and the highest efficiency compared to all other methods. Even in the presence of some (though not extreme) QRPs, naive RE performs better than all other methods. When can we expect no publication bias? If (and, in my opinion only if) we meta-analyze a set of registered reports.

But.

In any other setting except registered reports, a consequential amount of publication bias must be expected. In the field of psychology/psychiatry, more than 90% of all published hypothesis tests are significant (Fanelli, 2011) despite the average power being estimated as around 35% (Bakker, van Dijk, & Wicherts, 2012) – the gap points towards a huge publication bias. In the presence of publication bias, naive random effects meta-analysis and trim-and-fill have false positive rates approaching 100%:

More thoughts about trim-and-fill’s inability to recover δ=0 are in Joe Hilgard’s blog post. (Note: this insight is not really new and has been shown multiple times before, for example by Moreno et al., 2009, and Simonsohn, Nelson, and Simmons, 2014).

Our recommendation: Never trust meta-analyses based on naive random effects and trim-and-fill, unless you can rule out publication bias. Results from previously published meta-analyses based on these methods should be treated with a lot of skepticism.

### Do not use p-curve and p-uniform for effect size estimation (under heterogeneity)

As a default, heterogeneity should always be expected – even under the most controlled conditions, where many labs perform the same computer-administered experiment, a large proportion showed significant and substantial amounts of between-study heterogeneity (cf. ManyLabs 1 and 3; see also our supplementary document for more details). p-curve and p-uniform assume homogeneous effect sizes, and their performance is impacted to a large extent by heterogeneity:

As you can see, all other methods retain the nominal false positive rate, but p-curve and p-uniform go through the roof as soon as heterogeneity comes into play (see also McShane, Böckenholt, & Hansen, 2016; van Aert et al., 2016).

Under H1, heterogeneity leads to overestimation of the true effect:

(additional settings for these plots: no QRPs, no publication bias, k = 100 studies, true effect size = 0.5)

Note that in their presentation of p-curve, Simonsohn et al. (2014) emphasize that, in the presence of heterogeneity, p-curve is intended as an estimate of the average true effect size among the studies submitted to p-curve (see here, Supplement 2). p-curve may indeed yield an accurate estimate of the true effect size among the significant studies, but in our view, the goal of bias-correction in meta-analysis is to estimate the average effect of all conducted studies. Of course this latter estimation hinges on modeling assumptions (e.g., that the effects are normally distributed), which can be disputed, and there might be applications where indeed the underlying true effect of all significant studies is more interesting.

Furthermore, as McShane et al (2016) demonstrate, p-curve and p-uniform are constrained versions of the more general three-parameter selection model (3PSM; Iyengar & Greenhouse, 1988). The 3PSM estimates (a) the mean of the true effect, δ, (b) the heterogeneity, τ, and (c) the probability that a non-significant result enters the literature, p. The constraints of p-curve and p-uniform are: 100% publication bias (i.e., p = 0) and homogeneity (i.e., τ = 0). Hence, for the estimation of effect sizes, 3PSM seems to be a good replacement for p-curve and p-uniform, as it makes these constraints testable.

Our recommendation: Do not use p-curve or p-uniform for effect size estimation when heterogeneity can be expected (which is nearly always the case).

### Ignore overadjustments in the opposite direction

Many bias-correcting methods are driven by QRPs – the more QRPs, the stronger the downward correction. However, this effect can get so strong, that methods overadjust into the opposite direction, even if all studies in the meta-analysis are of the same sign:

Note: You need to set the option “Keep negative estimates” to get this plot.

Our recommendation: Ignore bias-corrected results that go into the opposite direction; set the estimate to zero, do not reject H₀.

### Do not take it seriously if PET-PEESE does a reverse correction

Typical small-study effects (e.g., by p-hacking or publication bias) induce a negative correlation between sample size and effect size – the smaller the sample, the larger the observed effect size. PET-PEESE aims to correct for that relationship. In the absence of bias and QRPs, however, random fluctuations can lead to a positive correlation between sample size and effect size, which leads to a PET and PEESE slope of the unintended sign. Without publication bias, this reversal of the slope actually happens quite often.

See for example the next figure. The true effect size is zero (red dot), naive random effects meta-analysis slightly overestimates the true effect (see black dotted triangle), but PET and PEESE massively overadjust towards more positive effects:

PET-PEESE was never intended to correct in the reverse direction. An underlying biasing process would have to systematically remove small studies that show a significant result with larger effect sizes, and keep small studies with non-significant results. In the current incentive structure, I see no reason for such a process.

Our recommendation: Ignore the PET-PEESE correction if it has the wrong sign.

### PET-PEESE sometimes overestimates, sometimes underestimates

A bias can be more easily accepted if it always is conservative – then one could conclude: “This method might miss some true effects, but if it indicates an effect, we can be quite confident that it really exists”. Depending on the conditions (i.e., how much publication bias, how much QRPs, etc.), however, PET/PEESE sometimes shows huge overestimation and sometimes huge underestimation.

For example, with no publication bias, some heterogeneity (τ=0.2), and severe QRPs, PET/PEESE underestimates the true effect of δ = 0.5:

In contrast, if no effect exists in reality, but strong publication bias, large heterogeneity and no QRPs, PET/PEESE overestimates at lot:

In fact, the distribution of PET/PEESE estimates looks virtually identical for these two examples, although the underlying true effect is δ = 0.5 in the upper plot and δ = 0 in the lower plot. Furthermore, note the huge spread of PET/PEESE estimates (the error bars visualize the 95% quantiles of all simulated replications): Any single PET/PEESE estimate can be very far off.

Our recommendation: As one cannot know the condition of reality, it is safest not to use PET/PEESE at all.

## Recommendations in a nutshell: What you should not use in a meta-analysis

Again, please consider the “limitations of generalizability” disclaimer above.

• When you can exclude publication bias (i.e., in the context of registered reports), do not use bias-correcting techniques. Even in the presence of some QRPs they perform worse than plain random effects meta-analysis.
• In any other setting except registered reports, expect publication bias, and do not use random effects meta-analysis or trim-and-fill. Both will give you a 100% false positive rate in typical settings, and a biased estimation.
• Under heterogeneity, p-curve and p-uniform overestimate the underlying effect and have false positive rates >= 50%
• Even if all studies entering a meta-analysis point into the same direction (e.g., all are positive), bias-correcting techniques sometimes overadjust and return a significant estimate of the opposite direction. Ignore these results, set the estimate to zero, do not reject H₀.
• Sometimes PET/PEESE adjust into the wrong direction (i.e., increasing the estimated true effect size)

As with any general recommendations, there might be good reasons to ignore them.

• The p-uniform package (v. 0.0.2) very rarely does not provide a lower CI. In this case, ignore the estimate.
• Do not run p-curve or p-uniform on supplemental material for more information and plots.
• If the 3PSM method (in the implementation of McShane et al., 2016) returns an incomplete covariance matrix, ignore the result (even if a point estimate is provided).

The post Correcting bias in meta-analyses: What not to do (meta-showdown Part 1) appeared first on Nicebread.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## A Partial Remedy to the Reproducibility Problem

By matloff

Several years ago, John Ionnidis jolted the scientific establishment with an article titled, “Why Most Published Research Findings Are False.” He had concerns about inattention to statistical power, multiple inference issues and so on. Most people had already been aware of all this, of course, but that conversation opened the floodgates, and many more issues were brought up, such as hidden lab-to-lab variability. In addition, there is the occasional revelation of outright fraud.

Many consider the field to be at a crisis point.

In the 2014 JSM, Phil Stark organized a last-minute session on the issue, including Marcia McNutt, former editor of Science and Yoav Benjamini of multiple inference methodology fame. The session attracted a standing-room-only crowd.

In this post, Reed Davis and I are releasing the prototype of an R package that we are writing, revisit, with the goal of partially remedying the statistical and data wrangling aspects of this problem. It is assumed that the authors of a study have supplied (possibly via carrots or sticks) not only the data but also the complete code for their analyses, from data cleaning up through formal statistical analysis.

There are two main aspects:

• The package allows the user to “replay” the authors’ analysis, and most importantly, explore other alternate analyses that the authors may have overlooked. The various alternate analyses may be saved for sharing.
• Warn of statistical errors, such as: overreliance on p-values; need for multiple inference procedures; possible distortion due to outliers; etc.

The term user here could refer to several different situations:

• The various authors of a study, collaborating and trying different analyses during the course of the study.
• Reviewers of a paper submitted for publication on the results of the study.
• Fellow scientists who wish to delve further into the study after it is published.

The package has text and GUI versions. The latter is currently implemented as an RStudio add-in.

The package is on my GitHub site, and has a fairly extensive README file introducing the goals and usage.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## U.S. Residential Energy Use: Machine Learning on the RECS Dataset

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Contributed by Thomas Kassel. He is currently enrolled in the NYC Data Science Academy remote bootcamp program taking place from January-May 2017. This post is based on his final capstone project, focusing on the use of machine learning techniques learned throughout the course.

## Introduction

The residential sector accounts for up to 40% of annual U.S. electricity consumption, representing a large opportunity for energy efficiency and conservation. A strong understanding of the main electricity end-uses in residences can allow homeowners to make more informed decisions to lower their energy bills, help utilities maximize efficiency/incentive programs, and allow governments or NGOs to better forecast energy demand and address climate concerns.

The Residential Energy Consumption Survey, RECS, collects energy-related data on a nationally representative sample of U.S. homes. First conducted in 1978 and administered every 4-5 years by the Energy Information Administration, it is a leading data source for residential energy analysis. Methodology for the study is well-documented – a comprehensive overview of RECS can be found in the program’s technical documentation.

This project applied machine learning methods to the most recently available RECS dataset, published in 2009. The primary goals were twofold, as is common in many regression exercises:

• Descriptionidentification of variable importance or features with the strongest statistical associations to annual residential electricity usage (kWh)
• Prediction – feature engineering and use of machine learning platform h2o to optimize predictive accuracy over annual electricity usage

## EDA and Data Pre-Processing

RECS 2009 consists of 12,083 observations (each representing a unique housing unit) and over 900 features encompassing physical building characteristics, appliances, occupant behavior/usage patterns, occupant demographics, etc. These features serve as independent variables in predicting the outcome variable, annual electricity usage in kilowatt-hours (kWh). Because RECS aims to cover a comprehensive overview of many types of residences for a variety of analytical purposes (beyond energy use prediction), many of the features are sparsely populated, collinear, or uncorrelated with kWh usage. Therefore, a preliminary and recurring task throughout the project was dimension reduction.

Since most of the independent variables were collected through occupant interviews – as opposed to exact physical examination of the residence – the values of many are binned as factor/categorical variables. The dataset’s 931 original variables had the following characteristics:

Missingness was common in the raw dataset, with 73 features having NA values in more than 95% of observations. To quickly gauge whether these features correlated with the outcome variable kWh (and therefore should be retained/imputed), I made use of flexible EDA graphing functions from the GGally R package. An example is shown below on continuous variables, revealing generally low r-statistics in correlation to kWh and having non-significant p-values (not shown). GGally also accommodates similar exploratory analysis on factor variables via pairwise box plots. Overall, although a more complex imputation strategy could have been employed to address these 73 features, they were dropped due to their high missingness and little evidence of predictive power over the outcome variable.

## Feature Engineering/Reduction

Factor Encoding

As previously mentioned, the majority of the dataset features were factor variables, many of which needed recoding in order to correctly capture the variation being described. As illustrated below, nominal factor variables such as WALLTYPE have no intrinsic ordering; it would not make sense to order “stucco” higher than “wood”, for example. On the other hand, ordinal factors such as AGERFRI1 (age of most-used refrigerator) have a clear order to their factor levels; each level denotes a numerically higher value than the previous. Information contained in variables like AGERFRI1 is often more appropriately formatted in integer/numeric form, which can better convey their continuous range of values, aid interpretation of the variable coefficient, and reduce standard error by decreasing the number of coefficients for model estimation (many algorithms generate a separate model coefficient for each factor level via one-hot-encoding).

In the image above, WALLTYPE was left as a factor while AGERFRI1 was recoded as an integer value (using the mid range of each bin as number of years). This factor-to-integer recoding was applied to additional binned variables relating to age of appliances and frequency of use of those appliances.

New factor variables were also created by consolidating information from existing factors into one-hot-encoded binary values such as presence of heated pool, use of electricity for space heating, etc.

Feature Reduction

We can visualize multicollinearity among the integer and numeric features using correlation plots, again made with R package GGally. In the plot below, a conceptual group of collinear variables stands out, i.e. those directly or indirectly representing a residence’s size, for example, total square footage (TOTSQFT), cooling square footage (TOTCSQFT), number of bathrooms (NCOMBATH), number of refrigerators (NUMFRIG), etc.

To confirm the degree of multicollinearity, variance inflation factors were calculated based on a linear model with the above numeric features – those with high VIF scores (loosely following the rule of thumb VIF > 5.0) were dropped.

Despite variable recoding and de-correlation using the methods previously described, a large number of features remained in the dataset, the majority of them being factor variables. Although principal component analysis (PCA) is a powerful method for dimension reduction, it does not generalize easily to categorical data. Therefore, feature reduction was continued instead through the use of an exploratory LASSO regression model, utilizing the L1 regularization penalty to drive non-significant variable coefficients in the model’s output to 0. This was helpful in identifying and dropping features with very little predictive power over kWh usage, particularly factor variables that were not covered in the multicollinearity exercise above.

## Modeling With H2o

The processed and cleaned dataset was migrated to open-source, online machine learning platform h2o, which offers highly efficient and scalable modeling methods. In contrast to machine learning conducted in slower native R packages (caret, glm) in the local R environment, R package h2o facilitates API calls to h2o’s online platform, sending the given dataset to be distributed and parallel-processed among multiple clusters. H2o offers an array of the most common machine learning algorithms (glm, kNN, random forests, gbm, deep learning) at very impressive run times – lessening the large burden of computational speed in the model fitting and tuning process. It also provides handy, built-in functions for pre-processing such as training/validation/test set splitting, in this case chosen to be 70/20/10.

View the code on Gist.

Multivariable Regression

To best understand variable importance and interpret linear effects of the features on kWh usage, a simple multivariate regression model was fit using h2o’s glm function with default parameters (no regularization). A variable importance plot (see below) ranks the top 15 features by z-statistic, all of which are quite large – an observed z-score of 15 translates to a p-value of 9.63e-54, or virtually zero. Hence all 15 variables show extremely significant statistical evidence of a relationship to kWh, with an additional ~100 of the features (not shown) also meeting significance at the p = 0.05 threshold. Variables beginning with “hasElec” were feature engineered (previously described), validating the approach of using domain knowledge to add or recode valuable information in new features. Coefficient estimates are denoted on the bar for each feature. We can observe that the presence of electric space heating at a residence (one-hot-encoded as a factor variable) increases yearly electricity usage at the residence by about 2,722 kWh; each additional TOTCSQFT (air conditioned square feet, numeric) adds 1.146 kWh/year.

ElasticNet Regression with Grid Search

While default h2o glm provided interpretability and p-values, adding regularization to the linear model enabled an increase in predictive accuracy. H2o allows flexible grid search methods to tune the main glm hyperparameters alpha (type of regularization) and lambda (degree of coefficient shrinkage). Because grid search can quickly become computationally expensive, I utilized h2o’s powerful early-stopping and random search functionality to ensure that computation time is not wasted for very small, marginal gains in validation accuracy.

View the code on Gist.

A primary purpose of grid search is to choose optimal tuning parameters according to a validation metric – in this case root-mean-squared-error (RMSE) of kWh prediction on the validation set – and then train a new model using the optimal values discovered on the full dataset (no train/validation split) to maximize sample size. In 5 minutes of training, h2o fit all ~440 model combinations specified in the grid search, with the best RMSE model having alpha = 1.0 (LASSO regression) and lambda = 1.0 (small amount of regularization). These parameters were then used to fit a final model on 90% of the data (combining 70% train and 20% validation sets) and performance was evaluated on the 10% holdout test data, which had not yet been seen by the model.

To increase predictive accuracy over generalized linear model-based approaches, I used h2o’s implementation of gradient boosted machines. Boosting, a nonlinear tree-based approach, sequentially fits many decorrelated decision trees to slowly reduce overall predictive error (in the regression setting, often RMSE). Grid search was performed using the same methods as above and with the following tuning parameters:

• learn_rate – how much error should each tree “remember” from the previous
• max_depthmax tree depth (rule of thumb – sqrt of # of features)
• sample_rate – percentage of rows/observations to be chosen to fit a given tree
• col_sample_rate – percentage of columns to be chosen to fit a given tree

Following the same process as with GLM, parameter values were then chosen from the trained model with the best validation set RMSE. The model was re-trained on the full 90% of training data and tested on the final 10% holdout split.

Results from the two modeling strategies are summarized in the table below.

Additionally, predicted vs actual kWh values for either modeling strategy are plotted against the most significant numeric feature, CDD30YR (30-year average of cooling degree-days in the residence’s region). Both model types demonstrate reasonable ability to predict kWh over a range of annual cooling loads/climate regions.

## Conclusions

Final predictive accuracy was similar for both linear-based (GLM) and tree-based (GBM) models on the RECS dataset. The two models’ RMSE values represent a large improvement over an RMSE of 7,560, which was established as a baseline accuracy by initially fitting a default GLM on the raw dataset (before any feature engineering or reduction). This improvement validates the approaches used – using domain knowledge to feature engineer highly significant variables and grid search for hyperparameter tuning.

Aside from the regression setting, future analysis could be applied to RECS in a classification context. Since the majority of the dataset’s features are factor variables, energy providers could use the rich training set to attempt to answer important energy-related classification questions about their customers – type of space heating, presence of major appliances, or whether the home is a good candidate for solar installation. Greater predictive ability over residential electricity use will contribute over time to a smarter, cleaner, and more efficient power grid.

The post U.S. Residential Energy Use: Machine Learning on the RECS Dataset appeared first on NYC Data Science Academy Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Complete Subset Regressions, simple and powerful

By insightr

(This article was first published on R – insightR, and kindly contributed to R-bloggers)
By Gabriel Vasconcelos

The complete subset regressions (CSR) is a forecasting method proposed by Elliott, Gargano and Timmermann in 2013. It is as very simple but powerful technique. Suppose you have a set of variables and you want to forecast one of them using information from the others. If your variables are highly correlated and the variable you want to predict is noisy you will have collinearity problems and in-sample overfitting because the model will try to fit the noise.

These problems may be solved if you estimate a smaller model using only a subset of the explanatory variables, however, if you do not know which variables are important you may loose information. What if we estimate models from many different subsets and combine their forecasts? Even better, what if we estimate models for all possible combinations of variables? This would be a good solution, however, if you have only 20 variables, the number of regressions would be more the 1 million. The CSR is a solution between using only one subset and all possible subsets. Instead of estimating all possible combinations of models, we fix a value and estimate all possible models with variables. Then we compute the forecasts from all these models to get our final result. Naturaly, must be smaller than the number of variables you have. Let us review the steps:

• Suppose you have $K$ explanatory variables. Estimate all possible models with <img src="https://s0.wp.com/latex.php?latex=k%3CK&bg=ffffff&%23038;fg=333333&%23038;s=0" alt="k<K" title="k variables,
• Compute the forecasts for all the models,
• Compute the average of all the forecasts to have the final result.

## Aplication

We are going to generate data from a linear model where the explanatory variables, $X$, are draw from a multivariate normal distribution. The coefficients are generated from a normal distribution and multiplied by a parameter $alpha$ to ensure the dependent variable $y$ has a significant random component. The value for $K$ is 10 and the CSR is estimated with $k=4$.

```set.seed(1) # = Seed for replication = #
K = 10 # = Number of Variables = #
T = 300 # = Number of Observations = #

# = Generate covariance matrix = #
library(mvtnorm)
D = diag(0.1, K)
P = matrix(rnorm(K * K), K, K)
sigma = t(P)%*%D%*%P

alpha = 0.1
beta = rnorm(K) * alpha # = coefficients = #
X = rmvnorm(T, sigma = sigma) # = Explanatory Variables = #
u = rnorm(T) # = Error = #
y = X%*%beta + u # = Generate y = #

# = Break data into in-sample and out-of-sample = #
y.in = y[1:200]
y.out = y[-c(1:200)]
X.in = X[1:200, ]
X.out = X[-c(1:200), ]
```
```# = Estimate model by OLS to compare = #
model = lm(y.in ~ X.in)
pred = cbind(1, X.out)%*%coef(model)

# = CSR = #
k = 4
models = combn(K, k) # = Gives all combinations with 4 variables = #
csr.fitted = rep(0, length(y.in)) # = Store in-sample fitted values = #
csr.pred = rep(0, length(y.out)) # = Store forecast = #
for(i in 1:ncol(models)){
m = lm(y.in ~ X.in[ ,models[ ,i]])
csr.fitted = csr.fitted + fitted(m)
csr.pred = csr.pred + cbind(1, X.out[ ,models[ ,i]])%*%coef(m)
}

R2ols = 1 - var(y.in - fitted(model))/var(y.in) # = R2 OLS = #
R2csr = 1 - var(y.in - csr.fitted/ncol(models))/var(y.in) # = R2 CSR = #
c(R2ols, R2csr)
```
```## [1] 0.1815733 0.1461342
```
```# = In-sample fit = #
plot(y.in, type="l")
lines(fitted(model), col=2)
lines(csr.fitted/ncol(models), col=4)
```

```# = Out-of-sample fit = #
plot(y.out, type="l")
lines(pred, col = 2)
lines(csr.pred/ncol(models), col = 4)
```

```# = MAPE = #
MAPEols=mean(abs(y.out - pred))
MAPEcsr=mean(abs(y.out - csr.pred/ncol(models)))
c(MAPEols, MAPEcsr)
```
```## [1] 0.8820019 0.8446682
```

The main conclusion from the results is that the CSR gives up some in-sample performance to improve the forecasts. In fact, the CSR is very robust to overfitting and you should definitively add it to your collection to use when you believe that you are having this type of problem. Most modern forecasting techniques have the same idea of accepting some bias in-sample to have a more parsimonious or stable model out-of-sample.

This is the simplest implementation possible for the CSR. In some cases you may have fixed controls you want to include in all regressions such as lags of the dependent variable. The CSR computational costs increase fast with the number of variables. If you are in a high-dimensional framework you may need to do some type of pre-selection of the variables to reduce the problem’s size.

## References

Elliott, Graham, Antonio Gargano, and Allan Timmermann. “Complete subset regressions.” Journal of Econometrics 177.2 (2013): 357-373.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Euler Problem 23: Non-Abundant Sums

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

A demonstration of the abundance of the number 12 using Cuisenaire rods (Wikipedia).

Euler problem 23 asks us to solve a problem with abundant or excessive numbers.

These are numbers for which the sum of its proper divisors is greater than the number itself.

12 is an abundant number because the sum of its proper divisors (the aliquot sum) is larger than 12: (1 + 2 + 3 + 4 + 6 = 16).

All highly composite numbers or anti-primes greater than six are abundant numbers. These are numbers that have so many divisors that they are considered the opposite of primes, as explained in the Numberphile video below.

## Euler Problem 23 Definition

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.

A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.

As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis, even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.

Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.

## Solution

This solution repurposes the divisors function that determines the proper divisors for a number, introduced for Euler Problem 21. The first code snippet creates the sequence of all abundant numbers up to 28123 (sequence A005101 in the OEIS). An abundant number is one where its aliquot sum is larger than n.

```# Generate abundant numbers (OEIS A005101)
A005101  n) {
abundant[a]
The solution to this problem is also a sequence in the Online Encyclopedia of Integer Sequences (OEIS A048242). This page states that the highest number in this sequence is 20161, not 28123 as stated in the problem definition.
The second section of code creates a list of all potential numbers not the sum of two abundant numbers. The next bit of code sieves any sum of two abundant numbers from the list. The answer is determined by adding remaining numbers in the sequence.
```
```# Create a list of potential numbers that are not the sum of two abundant numbers
A048242
The post Euler Problem 23: Non-Abundant Sums appeared first on The Devil is in the Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
```

Source:: R News

## My new DataCamp course: Forecasting Using R

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

For the past few months I’ve been working on a new DataCamp course teaching Forecasting using R. I’m delighted that it is now available for anyone to do.
Course blurb Forecasting involves making predictions about the future. It is required in many situations such as deciding whether to build another power generation plant in the next ten years requires forecasts of future demand; scheduling staff in a call center next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements.