Advisory on Multiple Assignment dplyr::mutate() on Databases

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I currently advise R dplyr users to take care when using multiple assignment dplyr::mutate() commands on databases.

(image: Kingroyos, Creative Commons Attribution-Share Alike 3.0 Unported License)

In this note I exhibit a troublesome example, and a systematic solution.

First let’s set up dplyr, our database, and some example data.

## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##     filter, lag

## The following objects are masked from 'package:base':
##     intersect, setdiff, setequal, union
packageVersion("dplyr")    # version calls reconstructed from the printed output
## [1] '0.7.4'
packageVersion("dbplyr")
## [1] '1.2.0'

db <- DBI::dbConnect(RSQLite::SQLite(),
                     ":memory:")  # connection string assumed (elided in the original)

d <- dplyr::copy_to(
  db,
  data.frame(xorig = 1:5,
             yorig = sin(1:5)),
  name = "d")
Now suppose somewhere in one of your projects somebody (maybe not even you) has written code that looks somewhat like the following.

d %>%
  mutate(    # mutate() wrapper restored (elided in the original)
    delta = 0,
    x0 = xorig + delta,
    y0 = yorig + delta,
    delta = delta + 1,
    x1 = xorig + delta,
    y1 = yorig + delta,
    delta = delta + 1,
    x2 = xorig + delta,
    y2 = yorig + delta
  ) %>%
  select(-xorig, -yorig, -delta)
  # (a final rendering step, e.g. knitr::kable(), was elided in the original)
## x0         y0 x1         y1 x2         y2
##  1  0.8414710  1  0.8414710  1  0.8414710
##  2  0.9092974  2  0.9092974  2  0.9092974
##  3  0.1411200  3  0.1411200  3  0.1411200
##  4 -0.7568025  4 -0.7568025  4 -0.7568025
##  5 -0.9589243  5 -0.9589243  5 -0.9589243

Notice the above gives an incorrect result: all of the x_i columns are identical, and all of the y_i columns are identical. I am not saying the above code is in any way desirable (though something like it does arise naturally in certain test designs), but if this is truly “incorrect dplyr code” we should have seen an error or exception. Unless you can be certain a database-backed dplyr project contains no code like this, you cannot be certain you have not run into this problem, which silently corrupts data and results.

The issue is: dplyr on databases does not appear to guarantee the order in which the assignments inside a single mutate() are executed. The running counter delta takes only one value for the entire lifetime of the dplyr::mutate() statement (clearly not what the user intended).
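
The difference between the sequential semantics the user intended and the simultaneous semantics the database backend effectively applied can be sketched in plain base R (not from the original post; plain vectors stand in for database columns):

```r
xorig <- 1:5

# Sequential semantics (what the user intended): each assignment
# sees the result of the previous one.
delta <- 0
delta <- delta + 1
x1_sequential <- xorig + delta      # delta is 1 here

# Simultaneous semantics (what the database effectively did): every
# expression is evaluated against the original values, so the updated
# delta is never seen.
delta0 <- 0
x1_simultaneous <- xorig + delta0   # delta is still 0 here

x1_sequential     # 2 3 4 5 6
x1_simultaneous   # 1 2 3 4 5
```

This is exactly why all the x_i columns in the output above came out identical.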

The fix is: break the dplyr::mutate() into a series of smaller mutates that do not exhibit the problem. It is a trade-off: breaking up a dplyr::mutate() on a database causes deeper statement nesting and a potential loss of performance. However, correct results should come before speed.

One automated variation of the fix is to use seplyr's statement partitioner. seplyr can factor the large mutate into a minimal number of very safe sub-mutates (and use dplyr to execute them).

d %>%
  # the wrapping call was elided in the original; seplyr::mutate_se()
  # over assignments quoted with wrapr::qae() is assumed here
  seplyr::mutate_se(wrapr::qae(
      delta = 0,
      x0 = xorig + delta,
      y0 = yorig + delta,
      delta = delta + 1,
      x1 = xorig + delta,
      y1 = yorig + delta,
      delta = delta + 1,
      x2 = xorig + delta,
      y2 = yorig + delta
    )) %>%
  select(-xorig, -yorig, -delta)
  # (a final rendering step, e.g. knitr::kable(), was elided in the original)
## x0         y0 x1        y1 x2       y2
##  1  0.8414710  2 1.8414710  3 2.841471
##  2  0.9092974  3 1.9092974  4 2.909297
##  3  0.1411200  4 1.1411200  5 2.141120
##  4 -0.7568025  5 0.2431975  6 1.243197
##  5 -0.9589243  6 0.0410757  7 1.041076

The above notation is, however, a bit clunky for everyday use. We did not use the more direct seplyr::mutate_nse(), as we are deprecating the direct non-standard-evaluation methods in seplyr (to lower maintenance effort) in favor of code using seplyr::quote_mutate() or wrapr::qae().

One can instead use seplyr as a code inspecting and re-writing tool with seplyr::factor_mutate().

cat(seplyr::factor_mutate(   # surrounding call reconstructed; cat() of the
  delta = 0,                 # returned code string is assumed
  x0 = xorig + delta,
  y0 = yorig + delta,
  delta = delta + 1,
  x1 = xorig + delta,
  y1 = yorig + delta,
  delta = delta + 1,
  x2 = xorig + delta,
  y2 = yorig + delta
))

Warning in seplyr::factor_mutate(delta = 0, x0 = xorig + delta, y0 = yorig
+ : Mutate should be split into more than one stage.

   mutate(delta = 0) %>%
   mutate(x0 = xorig + delta,
          y0 = yorig + delta) %>%
   mutate(delta = delta + 1) %>%
   mutate(x1 = xorig + delta,
          y1 = yorig + delta) %>%
   mutate(delta = delta + 1) %>%
   mutate(x2 = xorig + delta,
          y2 = yorig + delta)

seplyr::factor_mutate() both issued a warning and produced the factored code snippet seen above. We think this is in fact a different issue than the one explored in our prior note on dependency-driven result corruption, and fixes for that first issue had not fixed this one the last time we looked.

And that is why you should continue to be careful when using multiple-assignment dplyr::mutate() statements with database-backed data.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R Weekly Bulletin Vol – XIV

By QuantInsti

(This article was first published on R programming, and kindly contributed to R-bloggers)

This week’s R bulletin covers some interesting ways to list functions, ways to list files, and illustrates the use of the double colon operator.

We will also cover functions like path.package, fill.na, and rank. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. New document – Ctrl+Shift+N
2. Close active document – Ctrl+W
3. Close all open documents – Ctrl+Shift+W

Problem Solving Ideas

How to list functions from an R package

We can view the functions from a particular R package by using the “jwutil” package. Install the package and use the lsf function from the package. The syntax of the function is given as:

lsf(pkg)

where pkg is a character string containing the package name.

The function returns a character vector of function names in the given package.
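
A short example (the jwutil call follows the syntax described above; the base-R alternative needs no extra package):

```r
# Assumed usage of jwutil::lsf(), per the syntax described above:
# install.packages("jwutil")
# library(jwutil)
# lsf("stats")

# A base-R alternative: list everything exported by an attached package.
fns <- ls("package:stats")
is.character(fns)    # TRUE
"median" %in% fns    # TRUE
```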



How to list files with a particular extension

To list files with a particular extension, one can use the pattern argument in the list.files function. For example to list CSV files use the following syntax:


# This will list all the CSV files present in the current working directory.
# To list files in any other folder, you need to provide the folder path.

files = list.files(pattern = "\\.csv$")

# "$" at the end anchors the match to the end of the file name.
# "\\." matches a literal dot (an unescaped "." would match any character),
# so only files with the extension .csv are matched.

list.files(path = "C:/Users/MyFolder", pattern = "\\.csv$")

Using the double colon operator

The double colon operator is used to access exported variables in a namespace. The syntax is given as:

pkg::name

where pkg is the package name symbol or literal character string, and name is the variable name symbol or literal character string.

The expression pkg::name returns the value of the exported variable from the package if it has a namespace. The package will be loaded if it was not loaded already before the call. Using the double colon operator has its advantage when we have functions of the same name but from different packages. In such a case, the sequence in which the libraries are loaded is important.

To see the help documentation for these colon operators, run ?'::' or help(":::") in R.



first = c(1:6)
second = c(3:9)

dplyr::intersect(first, second)
[1] 3 4 5 6
base::intersect(first, second)
[1] 3 4 5 6

In this example, we have two functions having the same names but from different R packages. In some cases, functions having same names can produce different results. By specifying the respective package name using the double colon operator, R knows in which package to look for the function.

Functions Demystified

path.package function

The path.package function returns the path to the location where the given package is found. If no package is mentioned, the function returns the paths of all the currently attached packages. The syntax of the function is given as:

path.package(package, quiet = FALSE)

The quiet argument takes a default value of FALSE, in which case the function throws a warning if a package named in the argument is not attached, and an error if none of them are. Setting quiet = TRUE suppresses the warning and error.


path.package("stats")

fill.na function

There are different R packages which have functions to fill NA values. The fill.na function is part of the mefa package; it replaces each NA value with the nearest non-NA value above it in the same column. The syntax of the function is given as:

fill.na(x)

where x can be a vector, a matrix or a data frame.

x = c(12, NA, 15, 17, 21, NA)
mefa::fill.na(x)
# each NA is replaced by the nearest value above it (12 and 21 here)

rank function

The rank function returns the sample ranks of the values in a vector. Ties (i.e., equal values) and missing values can be handled in several ways.

rank(x, na.last = TRUE, ties.method = c("average", "first", "random", "max", "min"))

x: numeric, complex, character or logical vector
na.last: for controlling the treatment of NAs. If TRUE, missing values in the data are put last; if FALSE, they are put first; if NA, they are removed; if “keep” they are kept with rank NA
ties.method: a character string specifying how ties are treated


x = c(3, 5, 1, -4, NA, Inf, 90, 43)

rank(x, na.last = FALSE)
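
The example above can be completed, and the handling of ties illustrated, with base R alone:

```r
x <- c(3, 5, 1, -4, NA, Inf, 90, 43)

rank(x, na.last = FALSE)
# [1] 4 5 3 2 1 8 7 6   (the NA is put first, so it receives rank 1)

# How ties.method changes the result for tied values:
y <- c(2, 2, 1)
rank(y, ties.method = "average")  # 2.5 2.5 1.0 -- ties share the average rank
rank(y, ties.method = "first")    # 2 3 1      -- earlier occurrence ranks lower
rank(y, ties.method = "max")      # 3 3 1      -- ties all get the maximum rank
```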

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

The post R Weekly Bulletin Vol – XIV appeared first on .

To leave a comment for the author, please follow the link and comment on their blog: R programming.

Source:: R News

Who wants to work at Google?

By Anu Rajaram

(This article was first published on R – Journey of Analytics, and kindly contributed to R-bloggers)

In this tutorial, we will explore the open roles at Google, and try to see what common attributes Google is looking for, in future employees.

This dataset is a compilation of job descriptions of 1200+ open roles at Google offices across the world. This dataset is available for download from the Kaggle website, and contains text information about job location, title, department, minimum, preferred qualifications and responsibilities of the position. You can download the dataset here, and run the code on the Kaggle site itself here.

Using this dataset we will try to answer the following questions:

  1. Where are the open roles?
  2. Which departments have the most openings?
  3. What are the minimum and preferred educational qualifications needed to get hired at Google?
  4. How much experience is needed?
  5. What categories of roles are the most in demand?

Step 1 – Data Preparation and Cleaning:

The data is all in free-form text, so we do need to do a fair amount of cleanup to remove non-alphanumeric characters. Some of the job locations have special characters too, so we remove those using basic string manipulation functions. Once we read in the file, this is the snapshot of the resulting dataframe:

Step 2 – Analysis:

Now we will use R programming to identify patterns in the data that help us answer the questions of interest.

a) Job Categories:

First let us look at which departments have the most open roles. Surprisingly, there are more roles open in the “Marketing and Communications” and “Sales & Account Management” categories than in the traditional technical business units (like Software Engineering or Networking).

b) Full-time versus internships:

Let us see how many roles are full-time and how many are for students. As expected, only ~13% of roles are for students, i.e. internships. The majority are full-time positions.

c) Technical Roles:

Since Google is a predominantly technical company, let us see how many positions need technical skills, irrespective of the business unit (job category).

d) Roles related to “Google Cloud”:

To check this, we investigate how many roles have the phrase either in the job title or the responsibilities. As shown in the graph below, ~20% of the roles are related to Cloud infrastructure, clearly showing that Google is making Cloud services a high priority.

e) Senior Roles and skills:

A quick word search also reveals how many senior roles (roles that require 10+ years of experience) use the word “strategy” in their list of requirements, under either qualifications or responsibilities. Word association analysis can also show this (not shown here).

Educational Qualifications:

Here we are basically parsing the “min_qual” and “pref_qual” columns to see the minimum qualifications needed for the role. If we only take the minimum qualifications into consideration, we see that 80% of the roles explicitly ask for a bachelors degree. Less than 5% of roles ask for a masters or PhD.

However, when we consider the “preferred” qualifications, the ratio increases to a whopping ~25%. Thus, a fourth of all roles would be more suited to candidates with masters degrees and above.

Google Engineers:

Google is famous for hiring engineers for all types of roles. So we will read the job qualification requirements to identify what percentage of roles requires a technical degree or degree in Engineering.
As seen from the data, 35% specifically ask for an Engineering or computer science degree, including roles in marketing and non-engineering departments.

(image: technical qualifications for Google)

Years of Experience:

We see that 30% of the roles require at least 5 years of experience, while 35% of roles need even more.
So if you did not get hired at Google right after graduation, no worries: you have a better chance after gaining strong experience at other companies.

Role Locations:

The dataset does not include geographical coordinates for mapping. However, this is easily overcome by using the geocode() function (from the ggmap package) and the amazing rworldmap package. We are only plotting locations, not counts, so some places have more roles than the map suggests. We see open roles in all parts of the world; however, the most positions are in the US, followed by the UK, and then Europe as a whole.

Responsibilities – Word Cloud:

Let us create a word cloud to see what skills are most needed for the Cloud engineering roles. We see that words like “partner”, “custom solutions”, “cloud”, “strategy” and “experience” are more frequent than any specific technical skill. This shows that the Google Cloud roles are best filled by senior resources, where leadership and business skills become more significant than expertise in a specific technology.
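
The counting behind such a word cloud can be sketched in base R. This is a hedged sketch, not the post's actual code: the two responsibility strings are made up, and the final rendering call (commented out) assumes the wordcloud package:

```r
# Two made-up responsibility strings stand in for the real column.
responsibilities <- c("partner with cloud customers on custom solutions",
                      "define cloud strategy with partner teams")

# Tokenize, lowercase, and count word frequencies.
words <- unlist(strsplit(tolower(responsibilities), "[^a-z]+"))
words <- words[nchar(words) > 3]              # crude short-token filter
freq  <- sort(table(words), decreasing = TRUE)
freq
# "cloud" and "partner" come out on top here (2 occurrences each).
# wordcloud::wordcloud(names(freq), as.numeric(freq)) would then render it.
```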


So who has the best chance of getting hired at Google?

For most of the roles (from this dataset), a candidate with the following traits has the best chance of getting hired:

  1. 5+ years of experience.
  2. Engineering or Computer Science bachelor’s degree.
  3. Masters degree or higher.
  4. Working in the US.

The code for this script and graphs are available here on the Kaggle website. If you liked it, don’t forget to upvote the script. 🙂 And don’t forget to share!

Next Steps:

You can tweak the code to perform the same analysis, but on a subset of data. For example, only roles in a specific department, location (HQ in California) or Google Cloud related roles.

Thanks and happy coding!

(Please note that this post has been reposted from the main blog site at )

To leave a comment for the author, please follow the link and comment on their blog: R – Journey of Analytics.

Source:: R News

Rcpp 0.12.15: Numerous tweaks and enhancements

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The fifteenth release in the 0.12.* series of Rcpp landed on CRAN today after just a few days of gestation in incoming/.

This release follows the 0.12.0 release from July 2015, the 0.12.1 release in September 2015, the 0.12.2 release in November 2015, the 0.12.3 release in January 2016, the 0.12.4 release in March 2016, the 0.12.5 release in May 2016, the 0.12.6 release in July 2016, the 0.12.7 release in September 2016, the 0.12.8 release in November 2016, the 0.12.9 release in January 2017, the 0.12.10 release in March 2017, the 0.12.11 release in May 2017, the 0.12.12 release in July 2017, the 0.12.13 release in late September 2017, and the 0.12.14 release in November 2017, making it the nineteenth release at the steady and predictable bi-monthly release frequency.

Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 1288 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with another 91 in BioConductor.
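
For readers new to the package, here is a minimal illustration (not from this release announcement) of what Rcpp does: compile a C++ function inline and call it from R. It assumes Rcpp is installed along with a working C++ toolchain; the function sumsq is just an example of mine:

```r
library(Rcpp)

# Define and compile a small C++ function, then call it like any R function.
cppFunction("
  double sumsq(NumericVector x) {
    double s = 0.0;
    for (int i = 0; i < x.size(); ++i) s += x[i] * x[i];
    return s;
  }
")

sumsq(c(1, 2, 3))   # 14
```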

This release contains a pretty large number of pull requests by a wide variety of authors. Most of these pull requests are very focused on a particular issue at hand. One was larger and ambitious with some forward-looking code for R 3.5.0; however this backfired a little on Windows and is currently “parked” behind a #define. Full details are below.

Changes in Rcpp version 0.12.15 (2018-01-16)

  • Changes in Rcpp API:

    • Calls from exception handling to Rf_warning() now correctly set an initial format string (Dirk in #777 fixing #776).

    • The ‘new’ Date and Datetime vectors now have is_na methods too. (Dirk in #783 fixing #781).

    • Protect more temporary SEXP objects produced by wrap (Kevin in #784).

    • Use public R APIs for new_env (Kevin in #785).

    • Evaluation of R code is now safer when compiled against R 3.5 (you also need to explicitly define RCPP_PROTECTED_EVAL before including Rcpp.h). Longjumps of all kinds (condition catching, returns, restarts, debugger exit) are appropriately detected and handled, e.g. the C++ stack unwinds correctly (Lionel in #789). [ Committed but subsequently disabled in release 0.12.15 ]

    • The new function Rcpp_fast_eval() can be used for performance-sensitive evaluation of R code. Unlike Rcpp_eval(), it does not try to catch errors with tryEval in order to avoid the catching overhead. While this is safe thanks to the stack unwinding protection, this also means that R errors are not transformed to an Rcpp::exception. If you are relying on error rethrowing, you have to use the slower Rcpp_eval(). On old R versions Rcpp_fast_eval() falls back to Rcpp_eval() so it is safe to use against any versions of R (Lionel in #789). [ Committed but subsequently disabled in release 0.12.15 ]

    • Overly-clever checks for NA have been removed (Kevin in #790).

    • The included tinyformat has been updated to the current version, Rcpp-specific changes are now more isolated (Kirill in #791).

    • Overly picky fall-through warnings by gcc-7 regarding switch statements are now pre-empted (Kirill in #792).

    • Permit compilation on ANDROID (Kenny Bell in #796).

    • Improve support for NVCC, the CUDA compiler (Iñaki Ucar in #798 addressing #797).

    • Speed up tests for NA and NaN (Kirill and Dirk in #799 and #800).

    • Rearrange stack unwind test code, keep test disabled for now (Lionel in #801).

    • Further condition away protect unwind behind #define (Dirk in #802).

  • Changes in Rcpp Attributes:

    • Addressed a missing Rcpp namespace prefix when generating a C++ interface (James Balamuta in #779).
  • Changes in Rcpp Documentation:

    • The Rcpp FAQ now shows Rcpp::Rcpp.plugin.maker() and not the outdated ::: use applicable to non-exported functions.

Thanks to CRANberries, you can also look at a diff to the previous release. As always, details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Source:: R News

Winter solstice challenge #3: the winner is Bianca Kramer!

By Egon Willighagen

Part of the winning submission in the category ‘best tool‘.

A bit later than intended, but I am pleased to announce the winner of the Winter solstice challenge: Bianca Kramer! Of course, she was the only contender, but her solution is awesome! In fact, I am surprised no one took her tool, ran it on their own data and just submitted that (which was perfectly well within the scope of the challenge).

Best Tool: Bianca Kramer
The best tool (see the code snippet on the right) uses R and a few R packages (rorcid, rjson, httpcache) and services like ORCID and CrossRef (and the I4OC project), and the (also awesome) project. The code is available on GitHub.

Highest Open Knowledge Score: Bianca Kramer
I did not check the self-reported score of 54%, but since no one challenged her, Bianca wins this category too.

So, what next? First, start calculating your own Open Knowledge Scores. Just to be prepared for the next challenge in 11 months. Of course, there is still a lot to explore. For example, how far should we recurse with calculating this score? The following tweet by Daniel Gonzales visualizes the importance so clearly (go RT it!):

We have all been there, and I really think we should not teach our students that it is normal to have to trust your current read and not be able to look up details. I do not know how much time Gonzales spent traversing this trail, but it should not take more than a minute, IMHO. Clearly, any paper in this trail that is not Open will require a look-up, and if your library does not have access, an interlibrary loan will make the traverse much, much longer. Unacceptable. And many seem to agree, because Sci-Hub seems to be getting more popular every day. About the latter: almost two years ago I wrote Sci-Hub: a sign on the wall, but not a new sign.
Of course, in the end, it is the scholars that should just make their knowledge open, so that every citizen can benefit from it (keep in mind, a European goal is to educate half the population with higher education, so half of the population is basically able to read primary literature!).
That completes the circle back to the winner. After all, Bianca Kramer has done really important work on how scientists can exactly do that: make their research open. I was shocked to see this morning that Bianca did not have a Scholia page yet, but that is fixed now (though far from complete):
Other papers that should be read more include:
Congratulations, Bianca!

Source:: R News

Version 2.2.2 Released

By Nicholas Hamilton

(This article was first published on ggtern: ternary diagrams in R, and kindly contributed to R-bloggers)

ggtern version 2.2.2 has just been submitted to CRAN, and it includes a number of new features. This time around, I have adapted the hexbin geometry (and stat) and, additionally, created an almost equivalent geometry which operates on a triangular mesh rather than a hexagonal mesh. There are some subtle differences which give some added functionality, and together these will provide an additional level of richness to ternary diagrams produced with ggtern when the data set is significantly large and individual points start to lose their meaning from visual clutter.

Ternary Hexbin

Firstly, let's look at the ternary hexbin which, as the name suggests, has the capability to bin points in a regular hexagonal grid to produce a pseudo-surface. In the original ggplot version, this geometry is somewhat limited since it only performs a ‘count’ of the points in each bin; however, it is not hard to imagine how useful it would be to instead apply a ‘mean’, ‘standard deviation’, or other user-defined scalar function to a mapping provided by the user:

n  = 5000
df = data.frame(x     = runif(n),
                y     = runif(n),
                z     = runif(n),
                value = runif(n))

ggtern(df,aes(x,y,z)) +
  theme_bw() +
  geom_hex_tern(bins=5)   # call completed; the original arguments were elided

Now because we can define user functions, we can do something a little more fancy. Here we will calculate the mean within each hexagon, and, also superimpose a text label over the top.

ggtern(df,aes(x,y,z)) +
  theme_bw() +
  geom_hex_tern(bins=5,fun=mean,aes(value=value,fill=..stat..)) +
  # text layer reconstructed; the original call was partially elided
  stat_hex_tern(bins=5, fun=mean, geom='text',
                aes(value=value, label=sprintf('%.2f', ..stat..)),
                size=3, color='white')

Ternary Tribin

The ternary tribin operates much the same, except that the binwidth no longer has meaning; instead, the density (number of panels) of the triangular mesh is controlled exclusively by the ‘bins’ argument. Using the same data as above, let's create some equivalent plots:

ggtern(df,aes(x,y,z)) +
  theme_bw() +
  geom_tri_tern(bins=5)   # call completed; the original arguments were elided

There is a subtle difference with the labelling in the stat_tri_tern usage below: we have introduced a ‘centroid’ parameter, because the orientation of each polygon is not consistent (some point up, some point down). Unlike the hexbin, where the centroid is returned by default and the construction of the polygons is handled by the geometry, for the tribin this is all handled in the stat.

ggtern(df,aes(x,y,z)) +
  theme_bw() +
  geom_tri_tern(bins=5,fun=mean,aes(value=value,fill=..stat..)) +
  # text layer reconstructed; the original call was partially elided
  stat_tri_tern(bins=5, fun=mean, geom='text',
                aes(value=value, label=sprintf('%.2f', ..stat..)),
                size=3, color='white', centroid=TRUE)

These new geometries have been on the cards for quite some time; several users have requested them. Many thanks to Laurie and Steve from QEDInsight for partially supporting the development of this work, and for forcing me to pull my finger out and get it done. Hopefully we will see these in some awesome publications this year. Until this is accepted on CRAN, you will have to download from my bitbucket repo.



The post Version 2.2.2 Released appeared first on ggtern: ternary diagrams in R.

To leave a comment for the author, please follow the link and comment on their blog: ggtern: ternary diagrams in R.

Source:: R News

The Friday #rstats PuzzleR : 2018-01-19

By hrbrmstr


(This article was first published on R –, and kindly contributed to R-bloggers)

Peter Meissner released his crossword.r package to CRAN today. It’s a spiffy package that makes it dead simple to generate crossword puzzles.

He also made a super spiffy javascript library to pair with it, which can turn crossword model output into an interactive puzzle.

I thought I’d combine those two creations with a way to highlight new/updated packages from the previous week, cool/useful packages in general, and some R functions that might come in handy. Think of it as a weekly way to get some R information while having a bit of fun!

This was a quick, rough creation and I’ll be changing the styles a bit for next Friday’s release, but Peter’s package is so easy to use that I have absolutely no excuse to not keep this a regular feature of the blog.

I’ll release a static, ggplot2 solution to each puzzle the following Monday(s). If you solve it before then, tweet a screen shot of your solution with the tag #rstats #puzzler and I’ll pick the first time-stamped one to highlight the following week.

I’ll also get a GitHub setup for suggestions/contributions to this effort + to hold the puzzle data.

To leave a comment for the author, please follow the link and comment on their blog.

Source:: R News

Curb your imposterism, start meta-learning

By Edwin Thoen

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

Recently, there has been a lot of attention for the imposter syndrome. Even seasoned programmers admit they suffer from feelings of anxiety and low self-esteem. Some share their personal stories, which can be comforting for those suffering in silence. Here I focus on a method that helped me grow confidence in recent years. It is a simple, yet very effective, way to deal with being overwhelmed by the many things a data scientist can acquaint him or herself with.

Two Faces of the Imposter Demon

I think imposterism can be broken into two related entities. The first is the feeling that you are falling short on a personal level: you think you are not intelligent enough, that you lack perseverance, or any other way of describing that you are not up to the job. Most advice for overcoming imposterism focuses on this part. I do not. Rather, I focus on the second foe: the feeling that you don’t know enough. This can be closely related to the feeling of failing on a personal level; you might feel you don’t know enough because you are too slow a learner. However, I think it is helpful to approach it as objectively as possible. The feeling of not knowing enough can be combated more actively: not by learning as much as you can, but by considering not knowing a choice, rather than an imperfection.

You can’t have it all

The field of data science is incredibly broad. It comprises, among many others, getting data out of computer systems, preparing data in databases, principles of distributed computing, building and interpreting statistical models, data visualization, building machine learning pipelines, text analysis, translating business problems into data problems, and communicating results to stakeholders. To make matters worse, for each and every topic there are several, if not dozens, of databases, languages, packages and tools. This means, by definition, no one is going to have mastery of everything the field comprises. And thus there are things you do not, and never will, know.

Learning new stuff

To stay effective you have to keep up with developments within the field. New packages will aid your data preparation, new tools might process data in a faster way, and new machine learning models might give superior results, just to name a few. I think a great deal of impostering comes from feeling you can’t keep up; there is a constant list in the back of your head with cool new stuff you still have to try out. This is where meta-learning comes into play: actively deciding what you will and will not learn. For my peace of mind it is crucial to decide the things I am not going to do. I keep a log (a Google Sheets document) that has two simple tabs. The first is a collector of stuff I come across in blogs and on twitter: things that do look interesting but need a more thorough look. I also add things that I come across in the daily job, such as a certain part of SQL I don’t fully grasp yet. Once in a while I empty the collector, trying to pick up the small stuff right away and moving the larger things either to the second tab or to will-not-do. The second tab holds the larger things I am actually going to learn. With time at hand, at work or at home, I work on learning the things on the second tab. More about this later.

Define Yourself

So you cannot have it all, you have to choose. What can be of good help when choosing is to have a definition of your unique data science profile. Here is mine:

I have thorough knowledge of statistical models and know how to apply them. I am a good R programmer, both in interactive analysis and in software development. I know enough about databases to work effectively with them; if necessary I can do the full data preparation in SQL. I know enough math to understand new models and read textbooks, but I can’t derive and prove new results on my own. I have a good understanding of the principles of machine learning and can apply most of the algorithms in practice. My oral and written communication is quite good, which helps me in translating back and forth between data and business problems.

That’s it: focused on what I do well and where I am effective. Some things that are not in there: building a full data pipeline on a Hadoop cluster, telling data stories with d3.js, creating custom algorithms for a business, optimizing a database, effective use of Python, and many more. If someone comes to me with one of these tasks, it is just “Sorry, I am not your guy”.

I used to feel that I had to know everything. For instance, I started to learn Python because I thought a good data scientist should know it as well as R. Eventually, I realized I would never be good at Python, because I will always use R as my bread and butter. I know enough Python to cooperate in a project where it is used, but that’s it, and that is how it will remain. Instead, I now spend time and effort improving what I already do well. This is not because I think R is superior to Python; I just happen to know R, and I am content with knowing R very well at the cost of not having access to all the wonderful work done in Python. I will never learn d3.js, because I don’t know JavaScript and it would take me ages to learn. Instead, I might focus on learning Stan, which fits my profile much better. I think it is both effective and stress-relieving to go deep on the things you are good at and to deliberately choose the things you will not learn.

The meta-learning

I told you about the collector; now a few more words about the meta-learning tab. It has three simple columns: what I am going to learn and how I am going to do it are the first two, obvious, categories. The most important, however, is why I am going to learn it. For me there are only two valid reasons: either I am very interested in the topic and envision enjoying it, or it will allow me to do my current job more effectively. I stressed current there, because scrolling through the requirements of job openings is about the perfect way to feed your imposter monster. Focus on what you are doing now and have faith that you will pick up new skills if a future job demands them.

Meta-learning gives me focus, relaxation and efficiency. At its core it is defining yourself as a data scientist and deliberately choosing what you are and, more importantly, what you are not going to learn. I have experienced that doing this with rigor actively fights imposterism. Now, what works for me might not work for you; maybe a different system fits you better. However, I think everybody benefits from defining the data scientist they are and actively choosing what not to learn.

To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

501 days of Summer (school)

By Gianluca Baio

(This article was first published on Gianluca Baio’s blog, and kindly contributed to R-bloggers)

As I anticipated earlier, we’re now ready to open registration for our Summer School in Florence (I was waiting for UCL to set up the registration system and thought it might take much longer than it actually did, so well done UCL!).

We’ll probably have a few changes here and there in the timetable, as we’re thinking of introducing some new topics, and I think I’ll certainly merge a couple of my intro lectures to leave some time for those…

Nothing is fixed yet and we’re in the process of deliberating all the changes, but I’ll post as soon as we have a clearer plan for the revised timetable.

Here’s the advert (which I’ve also sent out to some relevant mailing lists).

Summer school: Bayesian methods in health economics
Date: 4-8 June 2018
Venue: CISL Study Center, Florence (Italy)

COURSE ORGANISERS: Gianluca Baio, Chris Jackson, Nicky Welton, Mark Strong, Anna Heath

This summer school is intended to provide an introduction to Bayesian analysis and MCMC methods using R and MCMC sampling software (such as OpenBUGS and JAGS), as applied to cost-effectiveness analysis and typical models used in health economic evaluations. We will present a range of modelling strategies for cost-effectiveness analysis as well as recent methodological developments for the analysis of the value of information.

The course is intended for health economists, statisticians, and decision modellers interested in the practice of Bayesian modelling and will be based on a mixture of lectures and computer practicals, although the emphasis will be on examples of applied analysis: software and code to carry out the analyses will be provided. Participants are encouraged to bring their own laptops for the practicals.

We shall assume a basic knowledge of standard methods in health economics and some familiarity with a range of probability distributions, regression analysis, Markov models and random-effects meta-analysis. However, statistical concepts are reviewed in the context of applied health economic evaluations in the lectures.

The summer school is hosted in the beautiful complex of the Centro Studi Cisl, overlooking, and a short distance from, Florence (Italy). The registration fees include full board accommodation in the Centro Studi.

More information can be found at the summer school webpage. Registration is available from the UCL Store. For more details or enquiries, email Dr Gianluca Baio.

To leave a comment for the author, please follow the link and comment on their blog: Gianluca Baio’s blog.


General Linear Models: The Basics

By Bluecology blog

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

General Linear Models: The Basics

General linear models are one of the most widely used statistical tools
in the biological sciences. This may be because they are so flexible
that they can address many different problems, because they provide
useful outputs about statistical significance AND effect sizes, or just
because they are easy to run in many common statistical packages.

The maths underlying General Linear Models (and Generalized linear
models, which are a related but different class of model) may seem
mysterious to many, but are actually pretty accessible. You would have
learned the basics in high school maths.

We will cover some of those basics here.

Linear equations

As the name suggests General Linear Models rely on a linear equation,
which in its basic form is simply:

yi = α + βxi + ϵi

The equation for a straight line, with some error added on.

If you aren’t that familiar with mathematical notation, notice a few
things about this equation (I have followed standard conventions here).
I used normal characters for variables (i.e. things you measure) and
Greek letters for parameters, which are estimated when you fit the model
to the data.

yi are your response data; I indexed y with i to indicate that there
are multiple observations. xi is variously known as a covariate,
predictor variable or explanatory variable. α is an intercept that
will be estimated. α has the same units as y (e.g. if y is the number
of animals, then α is the expected number of animals when x = 0).

β is a slope parameter that will also be estimated. β is also termed
the effect size because it measures the effect of x on y. β has units
of ‘y per x’. For instance, if x is temperature, then β has units of
number of animals per degree C. β thus measures how much we expect y
to change if x were to increase by 1.

Finally, don’t forget ϵi, which is the error. ϵi measures the distance
between each prediction of yi made by the model and the observed value
of yi.

These predictions will simply be calculated as:

yi = α + βxi

(notice I just removed the ϵi from the end). You can think of the
linear predictions as the mean or ‘expected’ value a new observation
yi would take if we only knew xi, and also as the ‘line of best fit’.

Simulating ideal data for a general linear model

Now that we know the model, we can generate some idealized data.
Hopefully this will then give you a feel for how we can fit a model to
data. Open up R and we will create these parameters:
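The code chunk from the original post was lost from the page; here is a minimal reconstruction, where the specific values (n, alpha, beta) are assumptions chosen to roughly match the fitted coefficients printed later in the post:

```r
# Assumed values: the original chunk was lost, so these are a guess
# (alpha and beta picked to match the coefficients shown later)
n <- 100     # sample size
alpha <- 30  # true intercept
beta <- 2    # true slope
```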


Where n is the sample size and alpha and beta are as above.

We also need some covariate data; we will just generate a sequence of
n numbers from 0 to 1:
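The original chunk is missing; a sketch using seq(), with the assumed n repeated here so the snippet runs on its own:

```r
n <- 100                        # assumed sample size, as above
x <- seq(0, 1, length.out = n)  # n evenly spaced values from 0 to 1
```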


The model’s expectation is thus this straight line:
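Again the chunk is missing; the straight line is just the linear predictor, with the assumed parameters repeated so the snippet is self-contained:

```r
n <- 100; alpha <- 30; beta <- 2  # assumed values, as above
x <- seq(0, 1, length.out = n)
y_true <- alpha + beta * x        # the error-free expectation
```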


Because we made the model up, we can say this is the true underlying
relationship. Now we will add error to it and see if we can recover that
relationship with a general linear model.

Let’s generate some error:
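The missing chunk presumably drew normal errors with rnorm() and added them to y_true. Here sigma = 2.4 is an assumption, and seed 42 is assumed because the post discusses set.seed(42) just below:

```r
set.seed(42)                    # assumed seed; discussed below
n <- 100; sigma <- 2.4          # sigma is an assumption
alpha <- 30; beta <- 2          # assumed values, as above
x <- seq(0, 1, length.out = n)
y_true <- alpha + beta * x
epsilon <- rnorm(n, mean = 0, sd = sigma)  # the simulated errors
y_obs <- y_true + epsilon       # what our ecologist 'measures'
```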


Here sigma is our standard deviation, which measures how much the
observations y vary around the true relationship. We then used rnorm
to generate n random normal numbers, which we just add to our predicted
line y_true to simulate observing this relationship.

Congratulations, you just created a (modelled) reality and simulated an
ecologist going out and measuring that reality.

Note the set.seed() command. This just ensures the random number
generator produces the same set of numbers every time it is run in R,
and it is good practice to use it (so your code is repeatable). Here is
a great explanation of seed setting and why 42 is such a popular seed.

Also, check out the errors:
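The lost chunk was presumably a histogram; a self-contained sketch, with the seed and sigma assumed as before:

```r
set.seed(42)
epsilon <- rnorm(100, mean = 0, sd = 2.4)  # assumed sigma
hist(epsilon, main = "Simulated errors")   # roughly bell-shaped
```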


Looks like a normal distribution hey? That’s because we generated them
from a normal distribution. That was a handy trick, because the basic
linear model assumes the errors are normally distributed (but not
necessarily the raw data).

Also note that sigma is constant (e.g. it doesn’t get larger as x gets
larger). That is another assumption of basic linear models called
‘homogeneity of variance’.

Fitting a model

To fit a basic linear model in R we can use the lm() function:
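The fitting chunk is missing; rebuilding the simulated data (assumed values as above) so the snippet runs on its own, then fitting with lm():

```r
# Rebuild the simulated data, then fit y_obs as a linear function of x
set.seed(42)
n <- 100; alpha <- 30; beta <- 2; sigma <- 2.4  # assumed values
x <- seq(0, 1, length.out = n)
y_obs <- alpha + beta * x + rnorm(n, sd = sigma)
m1 <- lm(y_obs ~ x)
```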


It takes a formula argument, which simply says here that y_obs depends
on (the tilde ~) x. R will do all the number crunching to estimate
the parameters now.

To see what it came up with try:
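Most likely this was coef(m1), whose printed output matches the two-column display below. Note the numbers shown in the post came from the author’s (unknown) seed and sigma, so this reconstruction with assumed values will print slightly different coefficients:

```r
# Refit with assumed values, then extract the estimated coefficients
set.seed(42)
x <- seq(0, 1, length.out = 100)
y_obs <- 30 + 2 * x + rnorm(100, sd = 2.4)  # assumed alpha, beta, sigma
m1 <- lm(y_obs ~ x)
coef(m1)  # named vector: (Intercept) and x
```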


## (Intercept)           x
##   30.163713    2.028646

This command tells us the estimate of the intercept ((Intercept)) and
of the slope on x (under x). Notice they are close to, but not exactly
the same as, alpha and beta. So the model has done a pretty decent job
of recovering our original process. The reason the values are not
identical is that we simulated someone measuring the real process with
error (that was when we added the normal random numbers).

We can get slightly more details about the model fit like this:
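That is summary(m1); again self-contained with assumed simulation values, so the printed numbers will differ slightly from the output below:

```r
set.seed(42)
x <- seq(0, 1, length.out = 100)
y_obs <- 30 + 2 * x + rnorm(100, sd = 2.4)  # assumed values
m1 <- lm(y_obs ~ x)
summary(m1)  # coefficients with standard errors, R-squared, residual SE
```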


## Call:
## lm(formula = y_obs ~ x)
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7.2467 -1.5884  0.1942  1.5665  5.3433
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  30.1637     0.4985  60.503   

I’m not going to go overboard with explaining this output now, but
notice a few key things. With the summary, we get standard errors for
the parameter estimates (which is a measure of how much they might
vary). Also notice the R-squared, which can be handy. Finally, notice
that the Residual standard error is close to the value we used for
sigma, which is because it is an estimate of sigma from our
simulated data.

Your homework is to play around with the model and sampling process.
Try changing alpha, beta, n and sigma, then refit the model and see
what happens.

Final few points

So did you do the homework? If you did, well done, you just performed a
simple power analysis (in the broad sense).

In a more formal power analysis (which is what you might have come
across previously) you could systematically vary n or beta across
1000 randomised data sets, then calculate the proportion of those
data sets in which your p-value was ‘significant’ (e.g. less than a
critical threshold like the ever-popular 0.05). This number tells you
how good you are at detecting ‘real’ effects.
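The procedure just described can be sketched as a short simulation, with assumed parameter values (this is an illustration, not code from the post or the book):

```r
# Estimate power: the proportion of simulated data sets in which the
# slope on x is 'significant' at the 0.05 level
set.seed(42)
n <- 30; alpha <- 30; beta <- 2; sigma <- 2.4  # assumed values
x <- seq(0, 1, length.out = n)
pvals <- replicate(1000, {
  y_obs <- alpha + beta * x + rnorm(n, sd = sigma)
  summary(lm(y_obs ~ x))$coefficients["x", "Pr(>|t|)"]
})
mean(pvals < 0.05)  # estimated power for this n, beta and sigma
```

Rerunning this over a grid of n values shows how large a sample you would need to reliably detect a slope of this size.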

Here’s a great intro to power analysis in the broad sense: Bolker,
Ecological Models and Data in R.

One more point. Remember we said above that there are some
‘assumptions’? Well, we can check those in R quite easily:

plot(m1, 1)

This shows a plot of the residuals (a.k.a. errors) versus the predicted
values. We are looking for ‘heteroskedasticity’, which is a fancy way of
saying the errors aren’t equal across the range of predictions (remember
I said sigma is a constant?).

Another good plot:

plot(m1, 2)

Here we are looking for deviations of the points from the line. Points
on the line mean the errors are approximately normally distributed,
which was a key assumption. Points far from the line could indicate the
errors are skewed left or right, too fat in the middle, or too skinny
in the middle. More on that issue another time.

The end

So the basics might belie the true complexity of situations we can
address with General Linear Models and their relatives, Generalized
Linear Models. But, just to get you excited, here are a few things you
can do by adding more terms to the right-hand side of the linear
equation:

  1. Model multiple, interacting covariates.
  2. Include factors as covariates (instead of continuous variables). Got
    a factor and a continuous variable? Don’t bother with the old-school
    ANCOVA method, just use a linear model.
  3. Include a spline to model non-linear effects (that’s a GAM).
  4. Account for hierarchies in your sampling, like transects sampled
    within sites (that’s a mixed effects model)
  5. Account for spatial or temporal dependencies.
  6. Model varying error variance (e.g. when the variance increases with the mean).

You can also change the left-hand side, so that it no longer assumes
normality (then that’s a Generalized Linear Model). Or even add chains
of models together to model pathways of cause and effect (that’s a
‘path analysis’ or ‘structural equation model’).

If this taster has left you keen to learn more, then check out any one
of the zillion online courses or books on GLMs with R, or if you can
get to Brisbane, come to our next course (which as of writing was in
Feb 2018, but we do them regularly).

Now you know the basics: practice, practice, practice, and pretty soon
you will be running General Linear Models behind your back while you
watch your 2-year-old, which is what I do for kicks.

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.
