Euler Problem 10: Summation of Primes

By Peter Prevos

Euler Problem 10: Prime gaps for all primes up to two million

Euler Problem 10 asks for the summation of primes. Computationally this is a simple problem because we can re-use the prime sieve developed for problem 3.

When generating a large number of primes the erratic pattern at which they occur is much more interesting than their sum. Mathematicians consider primes the basic building blocks of number theory. No matter how hard we look, however, they do not seem to obey any logical sequence. This Euler Problem is simply an exercise in adding a long list of numbers.

Euler Problem 10 Definition

The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17. Find the sum of all the primes below two million.

Solution

The sieve of Eratosthenes function used in Euler Problem 3 can be reused once again to generate the prime numbers between two and two million. An interesting problem occurs when I run the code. When I sum all the primes without the as.numeric conversion, R throws an integer overflow error and recommends the conversion.
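The esieve() function is defined in the post for Euler Problem 3 and is not repeated here. As a minimal sketch (not necessarily the original implementation), a sieve that would satisfy the call below could look like this:

esieve <- function(n) {
  # Mark 2..n as candidate primes, then strike out multiples of each prime
  sieve <- rep(TRUE, n)
  sieve[1] <- FALSE
  for (i in 2:floor(sqrt(n))) {
    if (sieve[i]) {
      sieve[seq(i * i, n, by = i)] <- FALSE
    }
  }
  which(sieve)
}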

primes <- esieve(2e6)
answer <- (sum(as.numeric(primes)))
print(answer)

This results in a vector containing the first 148,933 prime numbers. The largest prime gap is 132, and it seems that sexy primes (pairs of primes that differ by six) are more common than other prime pairs with small gaps (note the spikes at intervals of 6 in the bar chart).

The summing of primes reveals an interesting problem in mathematics. Goldbach’s conjecture is one of the oldest and best-known unsolved problems in number theory and states that:

Every even integer greater than 2 can be expressed as the sum of two primes.

Note that this conjecture is only about even numbers. Goldbach also theorised that every odd composite number can be written as the sum of a prime and twice a square. This conjecture is the subject of Euler Problem 46, which I will work on soon.

The post Euler Problem 10: Summation of Primes appeared first on The Devil is in the Data.

Source:: R News

Staying Current in Bioinformatics & Genomics: 2017 Edition

By Stephen Turner

A while back I wrote this post about how I stay current in bioinformatics & genomics. That was nearly five years ago. A lot has changed since then. A few links are dead. Some of the blogs or Twitter accounts I mentioned have shifted focus or haven’t been updated in years (guilty as charged). The way we consume media has evolved — Google thought they could kill off RSS (long live RSS!), there are many new literature alert services, preprints have really taken off in this field, and many more scientists are engaging via social media than before.
People still frequently ask me how I stay current and keep a finger on the pulse of the field. I’m not claiming to be able to do this well — that’s a near-impossible task for anyone. Five years later and I still run our bioinformatics core, and I’m still mostly focused on applied methodology and study design rather than any particular phenotype, model system, disease, or specific method. It helps me to know that transcript-level estimates improve gene-level inferences from RNA-seq data, and that there’s software to help me do this, but the details underlying kmer shredding vs pseudoalignment to a transcriptome de Bruijn graph aren’t as important to me as knowing that there’s a software implementation that’s well documented, actively supported, and performs well in fair benchmarks. As such, most of what I pay attention to is applied/methods-focused.
What follows is a scattershot, noncomprehensive guide to the people, blogs, news outlets, journals, and aggregators that I lean on in an attempt to stay on top of things. I’ve inevitably omitted some key resources, so please don’t be offended if you don’t see your name/blog/Twitter/etc. listed here (drop a link in the comments!). Whatever I write here now will be out of date in no time, so I’ll try to write an update post every year instead of every five.

Twitter

In the 2012 post I ended with Twitter, but I have to lead with it this time. Twitter is probably my most valuable resource for learning about the bleeding-edge developments in genomics & bioinformatics. It’s great for learning what’s new and contributing to the dialogue in your field, but only when used effectively.
I aggressively prune the list of people I follow to keep what I see relevant and engaging. I can tolerate an occasional digression into politics, posting pictures of you drinking with colleagues at a conference, or self-congratulatory announcements. But once these off-topic Tweets become the norm, I unfollow. I also rely on the built-in list feature. I follow a few hundred people, but I only add a select few dozen to a “notjunk” list that I look at when I’m short on time. Folks in this list don’t Tweet too often and have a high signal-to-noise ratio (as far as what I’m interested in reading). If I don’t get a chance to catch up on my entire timeline, I can at least breeze through recent Tweets from folks on this list.
I’m also wary of following extremely prolific users. For example — if someone’s been on Twitter less than a year, already has 20,000 Tweets, but only 100 followers, it tells me they’ve got a lot to say but nobody cares. I let the hive mind work for me in this case, using this Tweet-to-follower ratio as sort of a proxy for signal-to-noise.
I mostly follow individuals and aggregators, but I also follow a few organization accounts. These can be a mixed bag. Only a few organization accounts do this well, delivering interesting and applicable content to a targeted audience, while many more are poor attempts at marketing and self-promotion while not offering any substantive value or interesting content.
Individuals: In no particular order, here’s an incomplete list of people who Tweet content that I find consistently on-topic and interesting.
Others: Besides individual accounts, there are also a number of aggregators and organizations that I keep on a high signal-to-noise list.

Blogs

I follow these and other blogs using RSS. I’ve been happy with the free version of Feedly ever since Google Reader was killed. The web interface and iOS app have everything I need, and they both integrate nicely with other services like Evernote, Instapaper, Buffer, Twitter, etc. If you can’t find a direct link to the blog’s RSS feed, you can usually type the name of the blog into Feedly’s search bar and it’ll find it for you. Similar to my “notjunk” list in Twitter, I have a Favorites category in Feedly where I include only the feeds I absolutely wouldn’t want to miss.
These are some of the few that I try to read whenever something new is posted, and Feedly helps me keep those organized, either by “starring” something I want to come back to, or saving it for later with Instapaper. They’re in no particular order, and I’m sure I’ve forgotten something.
  • Variance Explained: David Robinson’s blog (Data Scientist at Stack Overflow, works in R and Python).
  • Global Biodefense: News on pathogens, outbreaks, and preparedness, with periodic posts on genomics and bioinformatics-related developments and funding opportunities.
  • In between lines of code: Lex Nederbragt’s blog on biology, sequencing, bioinformatics, …
  • Simply Statistics: A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek.
  • Bits of DNA: Reviews and commentary on computational biology by Lior Pachter (fair warning: dialogue here can get a bit heated!).
  • Blue Collar Bioinformatics: articles related to tool validation and the open source bioinformatics community.
  • Microbiome Digest – Bik’s Picks: A daily digest of scientific microbiome papers, by Elisabeth Bik, Science Editor at uBiome.
  • Living in an Ivory Basement: Titus Brown’s blog on metagenomics, open science, testing, reproducibility, and programming.
  • Enseqlopedia: James Hadfield’s blog on all things NGS.
  • Epistasis Blog: Jason Moore’s computational biology blog.
  • RStudio Blog: announcements about new RStudio functionality, updates about the tidyverse, and more.
  • nextgenseek.com: Next-Gen Sequencing Blog covering new developments in NGS data & analysis.
  • RNA-Seq Blog: Transcriptome Research & Industry News.
  • The Allium: We all need a little humor in our lives. Like The Onion, but for science.

Others

I’m unsure how to categorize the rest. These are things like aggregators, Q&A sites/forums, and others.
  • Nuzzel is something I’ve only been using for a few months but it works very well. It’s meant to solve the Twitter / social media overload problem. If you’re following a few hundred people, you could easily have thousands of Tweets per day to read through (or miss). Nuzzel emails you a daily newsletter of the most relevant content in your Twitter feed. I’m guessing it does this by analyzing how many people you follow share, retweet, or favorite the same links. I try to read everything in my RSS feeds but I could never do this with Twitter (nor should you worry about trying). Nuzzel helps you catch up on things that are trending among the people you follow. It’s not a substitute for following the right people (see the Twitter section above).
  • RWeekly: weekly updates from the entire R community. Offers an RSS feed but I subscribe to the weekly email. Each email sends out about 50 links with one-sentence descriptions to things being done in the R community that week.
  • R Bloggers aggregates RSS feeds from hundreds of blogs about R. Much more comprehensive than RWeekly, but lots to sort through.
  • GenomeWeb still provides high-quality original content as well as summaries of what’s going on in the field. Create an account, log in, view your profile page, and subscribe to some of their regular emails. I subscribe to their daily news, the scan, informatics, sequencing, and infectious diseases bulletins. Pro tip: Much of their content is only available for premium subscribers. If you sign up with a .edu address, you can access all this content for free.
  • F1000’s Smart Search is one of the few literature recommendation services that I find useful, relevant, and current. My RNA-seq and metagenomics alerts consistently deliver relevant and fresh content.
  • BioStars: This is a Stack Exchange-style Q&A site focused on bioinformatics, computational genomics, and biological data analysis. You can go to the homepage and sort by topic, views, answers, etc., and the platform offers several granular ways to subscribe via RSS.
  • Bioconductor Support: This is a Q&A site much like BioStars that replaced the Bioconductor mailing list. You can do things like limit to a certain time period and sort by views, for example, if you only want to log in occasionally to see what’s being talked about.
  • SEQanswers: I subscribe to all new threads in the SEQanswers bioinformatics forum, and regularly browse post titles. When something sparks my interest, I’ll click into that post and subscribe to future updates on that post via email.
  • Google Scholar lets you search and create email alerts.
  • PubMed Alerts: You can save, automate, and have search results emailed to you through your MyNCBI account. Surprisingly, these seem to be more relevant than the Google Scholar searches for the terms that I use.
  • PubMed Trending – I have no idea how PubMed ranks these. It seemed to be more useful in the past, but now it seems that the top “trending” articles alternate between CRISPR/Cas9, and old kinesiology / sports medicine articles.
  • IFTTT: If This Then That is a service that connects many different web services together in an endless number of ways. At home I might connect Facebook and Dropbox, so that whenever someone tags me in a photo, that photo is automatically downloaded to my Dropbox. At work I can connect an RSS feed to an Evernote note or Google Doc. It’s useful in so many ways, both for personal and for work-related tasks. I mostly use it here as a last safeguard so that things I really shouldn’t miss don’t slip through the cracks. I have recipes that do things like email me if certain low-volume Twitter accounts post a new Tweet, others that automatically save to Instapaper things like starred articles in Feedly. I also use this to keep a close eye on a few accounts on GitHub. I have connections set up for a few users on GitHub so that whenever one of these users creates a new public repository, I get an email. I’ve also used IFTTT to archive Tweets coming out of various hashtags — you can create a recipe where if a new Tweet contains certain keywords or hashtags, then save that Tweet to Evernote, a shared Google Doc spreadsheet, etc. Zapier is a similar service that I’ve heard provides more granular control, but I haven’t tried it.
  • Podcasts: I listen to every episode of Roger Peng and Hilary Parker’s Not So Standard Deviations data science podcast, and most episodes of Roger Peng and Elizabeth Matsui’s The Effort Report (this one’s more about life in academia in general). I use the Overcast iOS app to listen to these and other podcasts on ~1.75X speed. (When I met Hilary at the RStudio Conference I heard her speak for the first time at regular 1X speed. Odd experience.) Finally, I just learned about the R podcast. I haven’t listened to much yet, but I’ve added it to my long Overcast queue.

Preprints!

Preprints in life sciences were nearly unheard of when I wrote the 2012 post. Now everybody’s doing it. There are still a few people using the arXiv Quantitative biology channel, and I’ll occasionally find something in PeerJ Preprints that grabs my attention.
bioRxiv is the biggest player here, hands down. The Alerts/RSS page lets you sign up for email alerts on particular topics, or subscribe to RSS feeds coming from particular categories that interest you. I subscribe to the Genomics and Bioinformatics feeds. I also follow several of bioRxiv’s top-level and category Twitter feeds (@biorxivpreprint, @biorxiv_bioinfo, and @biorxiv_genomic).
F1000 Research deserves some special attention here. It’s somewhere in-between a preprint server and a peer-reviewed publication. You can upload manuscripts (or other research outputs like posters or slides), and they’re immediately and permanently published, and given a DOI. Then one or more rounds of open peer review as well as public comment take place, and authors can update the published paper for further review. Check out the transcript estimates / gene inference paper I mentioned earlier. You’ll see it’s “version 2,” and was approved by two referees. If you look at the right-hand panel, you can actually go back and see the paper prior to revision, as well as see who reviewed it, what the reviewer wrote, and how the authors responded to those reviews. It’s an innovative platform where peer review is open and transparent, and is independent of publication, since papers are published before they are reviewed, and remain published regardless of the outcome of the review. F1000 Research has a number of channels that are externally curated by different organizations, societies, conferences, etc. I subscribe to and get alerts about the R package and Bioconductor channels. Whenever a new preprint is dropped into one of these channels, I’ll get an email and an RSS item.
I only recently discovered PrePubMed, which looks very useful. PrePubMed indexes preprints from arXiv q-bio, PeerJ Preprints, bioRxiv, F1000Research, preprints.org, The Winnower, Nature Precedings, and Wellcome Open Research. In the tools box on the homepage, you can enter a search string and get back an RSS feed with results from that search. It looks like PrePubMed is maintained by a single person, but he’s made the entire thing open source, so you could presumably set this up and mirror it on your own, should you check back in 2021 and the link be dead.

Journals

I started with Journals in my 2012 post, but they’re last (and probably least) here. I still subscribe to a few journals’ RSS feeds, but in most cases, by the time I see a new Table of Contents hit my RSS reader, I probably saw the publications making the rounds on Twitter, blogs, or other channels mentioned above. It’s also no longer unusual to see a “publication” land where I read the preprint on bioRxiv months ago, and perhaps even a blog post before that! What “publication” means is changing rapidly, and I’m sure the lines between a blog post, preprint, and journal article will be even blurrier by the time I write the 2022 edition of this post.

How do you have the time to do this?

How do you not? It’s not as bad as it seems. I probably spend an hour each weekday scanning all the resources mentioned here, and I find the time well spent. I can breeze through my Twitter and RSS feeds on my bus ride into work, and save things I actually want to look at later with a bookmark, star, favorite, Instapaper, etc.
I should have prefaced this whole article with the note that I hardly ever actually fully read any of the papers or blog posts I see here. If I see, for example, a new WGS variant caller published, I’ll glance at the figures benchmarking it against GATK and FreeBayes, and skim through the documentation on the GitHub README or BioConductor vignette. If either of these is missing or falls short, that’s usually enough for me to ignore the publication completely (don’t underestimate the importance of good documentation!).
It’s taken me a decade to compile and continually hone this list of resources to the things that I find useful and relevant. This is what works for me, now, in 2017. It’s not a one-size-fits-all, and the 2018-me will probably have a somewhat different list, but I hope you’ll find it useful. If your interests are similar to what I’ve discussed here, how do you stay current? What have I left out? Let me know in the comments!

Source:: R News

roxygen2 6.0.0

By hadleywickham

roxygen2 6.0.0 is now available on CRAN. roxygen2 helps you document your packages by turning specially formatted inline comments into R’s standard Rd format. It automates everything that can be automated, and provides helpers for sharing documentation between topics. Learn more at http://r-pkgs.had.co.nz/man.html. Install the latest version with:

install.packages("roxygen2")

There are two headline features in this version of roxygen2:

  • Markdown support.
  • Improved documentation inheritance.

These are described in detail below.

This release also includes many minor improvements and bug fixes. For a full list of changes, please see the release notes. A big thanks to all the contributors to this release: @dlebauer, @fmichonneau, @gaborcsardi, @HenrikBengtsson, @jefferis, @jeroenooms, @jimhester, @kevinushey, @krlmlr, @LiNk-NY, @lorenzwalthert, @maxheld83, @nteetor, @shrektan, @yutannihilation

Markdown

Thanks to the hard work of Gabor Csardi you can now write roxygen2 comments in markdown. While we have tried to make markdown mode as backward compatible as possible, there are a few cases where you will need to make some minor changes. For this reason, you’ll need to explicitly opt-in to markdown support. There are two ways to do so:

  • Add Roxygen: list(markdown = TRUE) to your DESCRIPTION to turn it on everywhere.
  • Add @md to individual roxygen blocks to enable for selected topics.

roxygen2’s markdown dialect supports inline formatting (bold, italics, code), lists (numbered and bulleted), and a number of helpful link shortcuts:

  • [func()]: links to a function in the current package, and is translated to \code{\link[=func]{func()}}.
  • [object]: links to an object in the current package, and is translated to \link{object}.
  • [link text][object]: links to an object with custom link text, and is translated to \link[=object]{link text}.

Similarly, you can link to functions and objects in other packages with [pkg::func()], [pkg::object], and [link text][pkg::object]. For a complete list of supported syntax, and for how to handle common problems, please see vignette("markdown").
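As a sketch of what a markdown-mode block might look like (the function below is a made-up example, with @md enabling markdown for just this block):

#' Summarise a numeric vector
#'
#' Computes the **mean** and *standard deviation* of `x`.
#' See [base::summary()] for the more general base R equivalent.
#'
#' @md
#' @param x A numeric vector.
#' @return A named numeric vector with elements `mean` and `sd`.
#' @export
summarise_vec <- function(x) {
  c(mean = mean(x), sd = sd(x))
}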

To convert an existing roxygen2 package to use markdown, try https://github.com/r-pkgs/roxygen2md. Happy markdown-ing!

Improved inheritance

Writing documentation is challenging because you want to reduce duplication as much as possible (so you don’t accidentally end up with inconsistent documentation) but you don’t want the user to have to follow a spider’s web of cross-references. This version of roxygen2 provides more support for writing documentation in one place and then reusing it in multiple topics.

The new @inherit tag allows you to inherit parameters, return, references, title, description, details, sections, and seealso from another topic. @inherit my_fun will inherit everything; @inherit my_fun return params will allow you to inherit specified components. @inherit fun sections will inherit all sections; if you’d like to inherit a single section, you can use @inheritSection fun title. You can also inherit from a topic in another package with @inherit pkg::fun.

Another new tag is @inheritDotParams, which allows you to automatically generate parameter documentation for ... for the common case where you pass ... on to another function. The documentation generated is similar to the style used in ?plot and will eventually be incorporated into RStudio’s autocomplete. When you pass along ... you often override some arguments, so the tag has a flexible specification:

  • @inheritDotParams foo takes all parameters from foo().
  • @inheritDotParams foo a b e:h takes parameters a, b, and all parameters between e and h.
  • @inheritDotParams foo -x -y takes all parameters except for x and y.

All the @inherit tags (including the existing @inheritParams) now work recursively, so you can inherit from a function that inherited from elsewhere.
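As a sketch with two made-up functions in the same package, a thin wrapper could document its ... like this:

#' Scatter plot helper (made-up example)
#'
#' @param x,y Numeric vectors of coordinates.
#' @param col Point colour.
#' @param cex Point size multiplier.
draw_points <- function(x, y, col = "black", cex = 1) {
  plot(x, y, col = col, cex = cex)
}

#' Scatter plot wrapper that forwards extra arguments to draw_points()
#'
#' @param x,y Numeric vectors of coordinates.
#' @inheritDotParams draw_points -x -y
#' @export
quick_points <- function(x, y, ...) {
  draw_points(x, y, ...)
}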

If you want to generate a basic package documentation page (accessible from package?packagename and ?packagename), you can document the special sentinel value "_PACKAGE". It automatically uses the title, description, authors, url and bug reports fields from the DESCRIPTION. The simplest approach is to do this:

#' @keywords internal
"_PACKAGE"

It only includes what’s already in the DESCRIPTION, but it will typically be easier for R users to access.

Source:: R News

Announcing ShinyTester – a package that helps you build Shiny apps

By Amit

(This article was first published on R – AmitKohli.com, and kindly contributed to R-bloggers)

Shiny is awesome, but can be a bit daunting and easy to make mistakes in. I recently came back to Shiny after a hiatus of a few years and it was much more challenging than I feel comfortable admitting. I was making bonehead mistakes like writing something instead of output$something, confusing where to put Output commands vs Render commands, etc. I would eventually find my mistake, curse myself and move on with a crumpled ego. Then I had the realization that maybe if I was a beginner, I wouldn’t even know what I was doing wrong. Thusly did I conclude that I was in a unique position to help out the R community: Dumb enough to make mistakes, but experienced enough to eventually remember how to resolve them. So naturally, I wrote an R package that tests the code of the Shiny app itself.

To install: install.packages("ShinyTester") cause yes, it’s on CRAN (my first!!! Shoutout to @jbkunst for invaluable help!)

Quick caveat: Since this package parses code and everyone writes code differently, it will necessarily be super buggy. If this package doesn’t work for your app (after reading the full list of caveats at the bottom of this post), please help by opening an Issue on the GitHub repo.

The package consists of two functions that analyze the code itself:

  • ShinyDummyCheck() – checks how items are created in server.R and then how they are called in ui.R and runs some fairly naive checks
  • ShinyHierarchy() – to create an ad hoc hierarchy of the structure of the Shiny app – ie – what inputs go to what reactives, what reactives go to other reactives, and what then gets pushed back out to the UI as an output.

It is my hope that both of these combined minimize the intrinsic boneheadedness in us all. This is really quite beta though… please do check the Caveats! In the meantime, some examples:

Examples for ShinyDummyCheck:

ShinyTester::ShinyDummyCheck("https://raw.githubusercontent.com/mexindian/ShinyServer/master/LineSelector")

Provides this table:

Which shows that there are no errors in the Shiny app, oh except for the fact that I defined an object twice… whoops (Yeah, see that’s exactly the boneheadedness I’m talkin bout). The structure of the table is as follows:

  • Item – The name of the asset that maybe should be on both server.R and ui.R
  • SrvCall – the TYPE of object that you’re saying this specific item is (in server.R)
  • isOutput – is a binary that will specify if in server.R you wrote just item or output$item
  • VisualCall – is the TYPE of thingie you’re trying to push the item into (in ui.R).
  • Status – Compares the SrvCall to the VisualCall, also looks at isOutput and then applies some rules to figure out if it’s probably ok or not

Examples for ShinyHierarchy:

A simple example:

library(ShinyTester)
ShinyHierarchy("https://raw.githubusercontent.com/rstudio/shiny-examples/master/003-reactivity")

Will yield:

Which shows one of the weaknesses of the function… it assumes all Item names are unique… and will act strangely if this assumption doesn’t hold (ie – caption).

A more complex example:

ShinyTester::ShinyHierarchy("https://raw.githubusercontent.com/mexindian/ShinyServer/master/LineSelector")

Yields:

(figure: ShinyHierarchy plot for the LineSelector app)

And here we can start to see the structure that I’m attempting to show… there are basically three ROWS of nodes. The first row holds the UI inputs, the second row the reactives (kinda…), and the third row the outputs being visualized. I said the reactives are “kinda” the second row because I have introduced a small shift to each node in the middle row in order to see reactive flows into each other (if they are all in the same row, you can’t really see them). The structure is made evident in a more complex case below (forgive the redacted names):

(figure: ShinyHierarchy plot for a more complex app, names redacted)

If you want to suppress the shift in reactive nodes, use offsetReactives = F.

Caveats:

This is a very naive app, and in early stages at that… it works best with my style of programming and will probably take significant work to universalize (since we’re talking about code… maybe it’s impossible to fully universalize). Some other caveats:

  • For now only works with <- assignments, not = or -> assignments
  • For now calling items only works with double quotes (ie. plotOutput("thingie") works, plotOutput('thingie') doesn’t).
  • For now, only supports separate ui.R and server.R Shiny apps… the single app.R implementation is not supported.
  • isolate and observe are not supported yet
  • For now, I don’t read in data outside the shinyServer call (for example, if I want to pass data in that only needs to be calculated once). Not sure yet what’s the best way.
  • For now it only analyzes the main scripts, if you are SOURCEing files in from other places, it won’t work.

Other tips for working in Shiny:

  • Add TEST items to your server.R and ui.R. For example, add one for a data.frame and one for a figure. You can keep these commented out or displaying random data… then, when you add a new element, just test them in the test blocks before adding them to the exact place. Saves time.
  • Likewise, during testing, if you need to run through the code to debug, you can always simulate inputs by writing this: input <- data.frame(Parameter1="thingie1",Parameter2="thingie2"). Keep this commented out, but when you test, you can run through the Shiny app as if it were live.
  • Check Dean Attali’s excellent tips and tricks (http://deanattali.com/blog/advanced-shiny-tips/).

Enjoy!

(Thanks to my rusers community, especially to Joshua Kunst and Colin Phillips for discussion, help and encouragement required to push this through to CRAN).

To leave a comment for the author, please follow the link and comment on their blog: R – AmitKohli.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

The R Formula Method: The Good Parts

By Joseph Rickert

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Max Kuhn

Introduction

The formula interface to symbolically specify blocks of data is ubiquitous in R. It is commonly used to generate design matrices for modeling functions (e.g. lm). In traditional linear model statistics, the design matrix is the two-dimensional representation of the predictor set where instances of data are in rows and variable attributes are in columns (a.k.a. the X matrix).

A simple motivating example uses the inescapable iris data in a linear regression model:
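The post’s original code chunk is not reproduced in this excerpt; a call along these lines (a hypothetical stand-in, consistent with the model frame described below) illustrates the idea:

mod <- lm(Sepal.Width ~ log(Petal.Length) + Species,
          data = iris, subset = Petal.Length > 2)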

While the purpose of this code chunk is to fit a linear regression model, the formula is used to specify the symbolic model as well as to generate the intended design matrix. Note that the formula method defines the columns to be included in the design matrix, as well as which rows should be retained.

Formulas are used in R beyond specifying statistical models, and their use has been growing over time (see this or this).

In this post, I’ll walk through the mechanics of how some modeling functions use formulas to make a design matrix using lm to illustrate the details. Note, however, that the syntactical minutiae are likely to be different from function to function, even within base R.

Formulas and Terms

lm initially uses the formula and the appropriate environment to translate the relationships between variables into a data frame containing the data. R has a fairly standard set of operators that can be used to create a matrix of predictors for models.

We will start by looking at some of the internals of lm (circa December 2016).

Preparing for the Model Frame

The main tools used to get the design matrix are the model.frame and model.matrix functions. The definition and first few lines of lm are:
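The excerpt itself is not reproduced here; for reference, the opening of lm in recent versions of base R looks roughly like this (lightly abridged):

lm <- function(formula, data, subset, weights, na.action, method = "qr",
               model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
               singular.ok = TRUE, contrasts = NULL, offset, ...)
{
  cl <- match.call()
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"),
             names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf$drop.unused.levels <- TRUE
  mf[[1L]] <- quote(stats::model.frame)
  mf <- eval(mf, parent.frame())
  # ... (remainder of lm omitted)
}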

The goal of this code is to manipulate the formula and other arguments into an acceptable set of arguments for the model.frame function in the stats package. The code modifies the call to lm so that it can be used as the substrate for model.frame, which has many similar arguments (e.g. formula, data, subset, and na.action). However, there are arguments that are not common to both functions.

The object mf is initially created to mirror the original call. After executing match.call(expand.dots = FALSE), our original call generates a value of
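Continuing the hypothetical call from above, mf at this point would print as something like:

lm(formula = Sepal.Width ~ log(Petal.Length) + Species, data = iris,
    subset = Petal.Length > 2)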

and class(mf) has a value of "call". Note that the first element of the call, mf[[1L]], has a value of lm with the class of name.

The next few lines remove any arguments to lm that are not arguments to model.frame, and add another (drop.unused.levels). Finally, the call is modified by replacing its first element (lm) with stats::model.frame. Now mf has a value of
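(continuing the same hypothetical example, roughly:)

stats::model.frame(formula = Sepal.Width ~ log(Petal.Length) + Species,
    data = iris, subset = Petal.Length > 2, drop.unused.levels = TRUE)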

Creating the Model Frame and Terms

When this code is executed using eval(expr = mf, envir = parent.frame()), the model.frame function returns

A data.frame containing the variables used in formula plus those specified in .... It will have additional attributes, including “terms” for an object of class “terms” derived from formula, and possibly “na.action” giving information on the handling of NAs (which will not be present if no special handling was done, e.g. by na.pass).

For our particular call, the first six values of mf are

Note that:

  • all of the columns are present, predictors and response,
  • the filtering defined by the subset command is executed here (note the row names above),
  • the Petal.Length has been log transformed and the column name is not a valid name, and
  • the Species variable has not generated any dummy variables.

If weights or an offset was used in the model, the resulting model frame would also include these.

As alluded to above, mf has several attributes, including one that would not normally be associated with a data frame ("terms"). The terms object contains the data that defines the relationships between variables in the formula, as well as any transformations of the individual predictors (e.g. log). For our original model:

The terms object will be used to generate design matrices on new data (e.g. samples being predicted).

Creating the Design Matrix

The lm code has some additional steps to save the model terms and generate the design matrix:
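The corresponding excerpt from lm is not reproduced here; the relevant steps are roughly:

mt <- attr(mf, "terms")               # save the terms object
y <- model.response(mf, "numeric")    # pull out the outcome
x <- model.matrix(mt, mf, contrasts)  # expand factors and interactions into the design matrix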

The model.matrix function uses the data in the terms object to generate any interactions and/or dummy variables from factors. This work is mostly accomplished by a C routine.

The Predictive Nature of the terms

In the previous example, the log transformation is applied to one of the columns. When using an inline function inside a formula, this transformation will be applied to the current data, as well as any future data points (say, via predict.lm). The same workflow is followed where a model frame is used with the terms object and model.matrix.

However, there are some operations that can be specified in a formula that require statistical estimates. Two examples:

  • a natural spline (splines::ns) takes a numeric variable, does some computations, and expands that variable into multiple features that can be used to model that predictor in a nonlinear fashion.
  • orthogonal polynomials (stats::poly) is a basis expansion that takes a single predictor and produces new columns that correspond to the polynomial degree.

As an example of natural splines:
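The post’s own code is not reproduced in this excerpt; a minimal sketch on the iris data (an assumed example) might be:

library(splines)

spline_basis <- ns(iris$Petal.Length, df = 3)
head(spline_basis)                # the basis expansion (three columns)
names(attributes(spline_basis))   # includes "knots", "Boundary.knots", "intercept"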

ns returns multiple elements: the basis function spline results (shown just above) and the data required to generate them for new data (in the attributes).

It turns out that the only statistics required to produce new spline results are the knots, Boundary.knots, and intercept attributes. When new data are predicted, those statistical quantities are used:
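Continuing the sketch above, the stored attributes let predict() rebuild the basis for new values:

predict(spline_basis, newx = c(1.5, 4.5))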

Now, getting back to formulas, we can include a function like this inline:
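Continuing the hypothetical iris example, an inline spline in the formula looks like:

fit <- lm(Sepal.Width ~ ns(Petal.Length, df = 3) + Species, data = iris)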

The resulting terms object saves the model specification, and also the values required to reproduce the spline basis function. The terms object contains an attribute that is misleadingly named predvars that has this information:
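For the hypothetical fit above, that attribute can be inspected with:

attr(terms(fit), "predvars")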

When predict.lm is invoked with a new data set, the terms are passed to model.frame and the predvars are evaluated on the new data, e.g.,
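(for instance, sticking with the hypothetical model:)

new_obs <- data.frame(Petal.Length = c(1.5, 4.5),
                      Species = c("setosa", "virginica"))
predict(fit, newdata = new_obs)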

In summary, R’s formula interface works by exploiting the original formula as a general expression, and the terms object is where most of the information about the design matrix is stored.

Finally, it is possible to include more complex operations in the formula. For example, two techniques for imputation in predictive models are using tree ensembles and K-nearest neighbors. In each case, a model can be created to impute a predictor, and this model also could be embedded into predvars as long as the prediction function can be exploited as an expression.

There are some severe limitations to formulas which may not be obvious and various packages have had to work around these issues. The second post on formulas (“The Bad”) will illustrate these shortcomings.

To leave a comment for the author, please follow the link and comment on their blog: RStudio.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R Jobs for R users – 14 jobs from around the world (2017-02-01)

By Tal Galili


To post your R job on the next post

Just visit this link and post a new R job to the R community. You can post a job for free (and there is also a “featured job” option for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

  1. Full-Time
    Quantitative/Data Scientist
    Signal Maritime – Posted by s.malle
    Glifada, Glyfada
    Attica, Greece
    13 Jan2017
  2. Full-Time
    Software Engineer
    Icahn School of Medicine at Mt. Sinai – Posted by dtelad11
    New York
    New York, United States
    11 Jan2017
  3. Full-Time
    Data Analyst in survey design and quantitative fisheries
    Sampling Design And Analysis Group @ Institute of Marine Research – Posted by Sampling.Design.And.Analysis.Group
    Lysekil
    Västra Götalands län, Sweden
    11 Jan2017
  4. Full-Time
    Customer Success Rep
    RStudio – Posted by jclemens1
    Anywhere
    3 Jan2017

More New Jobs

  1. Full-Time
    1-year postdoctoral position in Bioinformatics/Statistics is available @ Madrid, Spain.
    rdiaz02
    Madrid
    Comunidad de Madrid, Spain
    17 Jan2017
  2. Full-Time
    Transportation Market Research Analyst @ Burlington, Vermont, United States
    RSG – Posted by patricia.holland@rsginc.com
    Burlington
    Vermont, United States
    13 Jan2017
  3. Full-Time
    Data Scientist @ New York, United States
    7Park Data – Posted by Derek Darves
    New York
    New York, United States
    13 Jan2017
  4. Full-Time
    Bioinformatics Data Analyst @ Dresden, Sachsen, Germany
    Scionics Computer Innovation GmbH – Posted by holger brandl
    Dresden
    Sachsen, Germany
    13 Jan2017
  5. Full-Time
    Data Scientist – Modeller @ Manchester, England, United Kingdom
    Hello Soda – Posted by LeanneFitzpatrick
    Manchester
    England, United Kingdom
    13 Jan2017
  6. Full-Time
    Full stack Java developer and healthcare data scientist @ Greenwich, Connecticut, United States
    Health Care Data Analytics Company – Posted by mogorman
    Greenwich
    Connecticut, United States
    13 Jan2017
  7. Full-Time
    Data Scientist @ München, Bayern, Germany
    Stylight GmbH – Posted by StylightGmbH
    München
    Bayern, Germany
    13 Jan2017
  8. Full-Time
    Quantitative/Data Scientist
    Signal Maritime – Posted by s.malle
    Glifada, Glyfada
    Attica, Greece
    13 Jan2017
  9. Full-Time
    Software Engineer
    Icahn School of Medicine at Mt. Sinai – Posted by dtelad11
    New York
    New York, United States
    11 Jan2017
  10. Full-Time
    Data Analyst in survey design and quantitative fisheries
    Sampling Design And Analysis Group @ Institute of Marine Research – Posted by Sampling.Design.And.Analysis.Group
    Lysekil
    Västra Götalands län, Sweden
    11 Jan2017
  11. Freelance
    R Programmer for Statistical Research on hedge funds
    Empiricus – Posted by empiricus
    Anywhere
    7 Jan2017
  12. Full-Time
    Data Scientist – Analytics @ Amsterdam, Noord-Holland, Netherlands
    Booking.com – Posted by work_at_booking
    Amsterdam
    Noord-Holland, Netherlands
    5 Jan2017
  13. Full-Time
    Data Scientist @ Etterbeek, Bruxelles, Belgium
    dataroots – Posted by bart
    Etterbeek
    Bruxelles, Belgium
    4 Jan2017
  14. Full-Time
    Customer Success Rep
    RStudio – Posted by jclemens1
    Anywhere
    3 Jan2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).

Source:: R News

Metro Systems Over Time: Part 3

By Page Piccinini

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Note, at the time of this writing using the packages ggplot2 and ggmap from CRAN will result in an error. To avoid the error be sure to install both packages from GitHub with the package devtools and restart R if the problem persists.

devtools::install_github("dkahle/ggmap")
devtools::install_github("hadley/ggplot2")

Note, at the time of this writing the package gganimate requires the package cowplot. Be sure to install and load the package before continuing.

install.packages("cowplot")
library(cowplot)

Introduction

In Part 1 and Part 2 of this series we made maps of metro systems across four European cities, and then computed Delaunay triangulations and centroids for each city. In the third and final part, we’ll do these same steps but at multiple time points to make a .gif of how metro systems change over time.

Data

As a reminder, our data is the hand corrected values of the data we pulled down from Google. To see how we got the data go back to Part 1: Data.

Maps with Change Over Time

We now have a good sense of what each city’s current metro system looks like, but how did these systems come to be this way? Now we’ll look at how these systems have changed and grown over time. That’s why at the beginning we made a column for opened_year. At this point the code gets less elegant but we’ll go through it step by step. It’s all the same principles as when we made our figures earlier.

The main idea of the following code is that we’re going to create unique triangulations for each year within each city. As more metro stations get added each year the triangulation will change. Just as we had data_deldir_delsgs and data_deldir_cent, we’re going to start by creating two empty data frames time_deldir_delsgs and time_deldir_sum (remember that our centroid data frame was based on the summary data). With our empty data frames initialized we can make a for loop. We want to go through each year, but for each city separately, so our first for loop goes through each city, filtering our data to only the city in question. Next we have our second for loop going through each year starting with the minimum year in the data for that city and up to 2015, the maximum year for the full data set. For a given year we filter() to include only metro stops that were opened that year or earlier. We do equal to or less than because we don’t want to ignore metro stops from earlier years, we want the whole metro system as it exists for a given year. Note, we need at least three points to make a triangle, and you may think that a city wouldn’t ever have only one or two metro stops but you would be wrong (cough Barcelona cough), so we’re going to put a stop gap saying if the number of data points is less than three the loop should skip that year and move to the next one.

Okay, assuming there are at least three data points though we’re going to run the deldir() call and then save it in year_deldir. Then we create two new data frames. the first is year_deldir_delsgs which contains the delsgs information from deldir. We’re going to add two columns too, city and opened_year, so we know which city and year this data comes from. We then add this information to our existing time_deldir_delsgs data frame with a bind_rows() call. We then do the same thing to create year_deldir_sum, only we pull out the summary information from year_deldir instead of the delsgs information. We also add our city and opened_year columns and then bind_rows() it with time_deldir_sum. The loop does this for every city from the minimum year in the data up to 2015. See below the head of the two data frames we created.

time_deldir_delsgs = data.frame()

time_deldir_sum = data.frame()

for(c in c("Paris", "Berlin", "Barcelona", "Prague")) {
  data_city = filter(data, city == c)
  for(year in min(data_city$opened_year):2015) {
    data_year = filter(data_city, opened_year <= year)
    
    # Add condition to skip if number of stops less than 3
    if(dim(data_year)[1] < 3) next

      year_deldir = deldir(data_year$lon, data_year$lat)

      year_deldir_delsgs = year_deldir$delsgs %>%
        mutate(city = c) %>%
        mutate(opened_year = year)
    
      time_deldir_delsgs = bind_rows(time_deldir_delsgs, year_deldir_delsgs)
    
      year_deldir_sum = year_deldir$summary %>%
        mutate(city = c) %>%
        mutate(opened_year = year)
    
      time_deldir_sum = bind_rows(time_deldir_sum, year_deldir_sum)
  }
}

head(time_deldir_delsgs)
head(time_deldir_sum)
        x1       y1       x2       y2 ind1 ind2  city opened_year
1 2.358279 48.85343 2.369510 48.85302   11    2 Paris        1900
2 2.374377 48.84430 2.369510 48.85302    9    2 Paris        1900
3 2.374377 48.84430 2.358279 48.85343    9   11 Paris        1900
4 2.414484 48.84640 2.369510 48.85302   16    2 Paris        1900
5 2.414484 48.84640 2.374377 48.84430   16    9 Paris        1900
6 2.386581 48.84732 2.369510 48.85302   18    2 Paris        1900

         x        y n.tri del.area  del.wts n.tside nbpt dir.area  dir.wts  city opened_year
1 2.289364 48.87577     4  1.9e-05 0.012510       4    2 0.000066 0.009329 Paris        1900
2 2.369510 48.85302     5  7.0e-05 0.047518       4    2 0.000520 0.073787 Paris        1900
3 2.290121 48.86685     5  3.7e-05 0.024907       5    0 0.000074 0.010541 Paris        1900
4 2.313310 48.86750     4  3.7e-05 0.024692       3    4 0.000388 0.054946 Paris        1900
5 2.294900 48.87393     5  1.6e-05 0.010551       3    2 0.000061 0.008594 Paris        1900
6 2.347301 48.85869     4  1.1e-05 0.007359       3    4 0.000392 0.055588 Paris        1900

As you may recall though we’re not necessarily interested in all the summary information, we just want it to compute our centroid. So, we make a new data frame time_deldir_cent. The code is the same as our earlier code for computing centroids, the only difference is that we’ll also group by opened_year, not just city, since we want unique centroids for each year for each city. See part of the data frame of the centroids below.

time_deldir_cent = time_deldir_sum %>%
  group_by(city, opened_year) %>%
  summarise(cent_x = sum(x * del.wts),
            cent_y = sum(y * del.wts)) %>%
  ungroup()
head(time_deldir_cent)
# A tibble: 6 × 4
       city opened_year   cent_x   cent_y
      <chr>       <int>    <dbl>    <dbl>
1 Barcelona        1924 2.116839 41.38132
2 Barcelona        1925 2.120019 41.37782
3 Barcelona        1926 2.121921 41.37834
4 Barcelona        1927 2.121921 41.37834
5 Barcelona        1928 2.122325 41.37384
6 Barcelona        1929 2.113628 41.37543

There’s still one more thing I want to do before we make our figures. Right now the figures will have different start dates depending on when the first metro stop was built in a given city. Instead, I want all figures to start at the same year so we see them change over time with the same start date for each city. To do this we’ll make a new data frame called years that simply lists the years 1900 to 2015 four times, once for each city. We then do a left_join() with our data. As a result any time the opened_year in question is not found in the data frame for a given city an empty row will be added, empty except for the opened_year and city values. You’ll also notice that I filter()ed to only include decade years (1900, 1910, 1920, etc.), and the year 2015 so it includes the last year of our data. This is because if we include every year our gif will be very large and non-portable. Also it’s more dramatic to see changes every 10 years.

years = data.frame(opened_year = rep(seq(1900, 2015), 4),
                   city = c(rep("Paris", 116), rep("Berlin", 116),
                            rep("Barcelona", 116), rep("Prague", 116)))

data_time = left_join(years, data) %>%
  mutate(opened_by_year = ifelse(opened_year %% 10 == 0, opened_year,
                                 opened_year + (10 - (opened_year %% 10)))) %>%
  filter(opened_by_year <= 2015)

time_deldir_delsgs_sub = time_deldir_delsgs %>%
  filter(opened_year %% 10 == 0 | opened_year == 2015)

time_deldir_cent_sub = time_deldir_cent %>%
  filter(opened_year %% 10 == 0 | opened_year == 2015)

I kept saying we were going to make maps showing the change over time, but how are we going to do that? Well instead of building a single static plot for each city we’re going to build an animation where as the year changes so will the map. To do this we’ll use the package gganimate which works on top of ggplot2 (which is useful since we’re already using ggmap which works on top of ggplot2). We build our plot just as we would any other ggplot2 figure, but for data we want to add the frame setting. The frame is the thing in the plot that changes, in our case opened_year. Also, while we only want to plot the triangulations and centroids specific to a given year, we want the points for the metro stops to be additive. For example, when frame is 2000 we still want the points from 1990 to be plotted. To do this we add cumulative = TRUE to the call for those points. Finally, since we updated our data to include empty rows so that all plots start on 1900, all plots will have a frame starting at 1900, even if there are no data points to plot. I’ve again made a function to make our plots. See below for the code for the Paris map as well as all four animations. Also, notice that in 1920 (actually 1912) Barcelona gets their first metro stop…but doesn’t get anymore until 1930 (actually 1924). Take a look to see if you can find any other interesting things about how the systems changed over time.

devtools::install_github("dgrtwo/gganimate")
library(gganimate)

time_plot = function(city_name, city_map){
  ggmap(city_map, extent = "device") +
    geom_segment(data = subset(time_deldir_delsgs_sub, city == city_name),
                 aes(x = x1, y = y1, xend = x2, yend = y2, frame = opened_year),
                 size = 1, color= "#92c5de") +
    geom_point(data = subset(data_time, city == city_name),
               aes(x = lon, y = lat, frame = opened_by_year, cumulative = TRUE),
               color = "#0571b0", size = 3) +
    geom_point(data = subset(time_deldir_cent_sub, city == city_name),
               aes(x = cent_x, y = cent_y, frame = opened_year),
               size = 6, color= "#ca0020") +
      theme(plot.title = element_text(hjust = 0.5))
}

paris_time.plot = time_plot("Paris", paris_map)
gganimate(paris_time.plot)

Plot for Paris:

Conclusion

In these three posts we looked at how the metro systems of four European cities changed over time. To do this we used a lot of different packages. We used the packages dplyr, tidyr, purrr, and ggplot2, which are all now a part of the package tidyverse. We used two other plotting packages that build upon ggplot2, ggmap and gganimate. Finally we used the deldir package to make Delaunay triangulations and compute centroids of city metro systems over time. All of these skills can be applied to any other type of spatial data with unique shapes, and can be used to make your very own gifs. Try your city as a practice exercise!

    Related Post

    1. Metro Systems Over Time: Part 2
    2. Metro Systems Over Time: Part 1
    3. Outlier App: An Interactive Visualization of Outlier Algorithms
    4. Creating an animation using R
    5. The importance of Data Visualization
    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Taking steps (in XML)

    By nsaunders

    (This article was first published on R – What You’re Doing Is Rather Desperate, and kindly contributed to R-bloggers)

    So the votes are in:

    Your established blog is mostly about your work. Your work changes. Do you continue at the current blog or start a new one?

    — Neil Saunders (@neilfws) January 23, 2017

    I thank you, kind readers. So here’s the plan: (1) keep blogging here as frequently as possible (perhaps monthly), (2) on more general “how to do cool stuff with data and R” topics, (3) which may still include biology from time to time. Sounds OK? Good.

    So: let’s use R to analyse data from the iOS Health app.

    I own an iPhone. It comes with a Health app installed by default. Not being a big user of mobile apps, it was several months before I opened it and realised that it had been collecting data. Which makes me wonder what else the phone does without my knowledge…but back to the topic. It turns out that health data can be exported by tapping at top-right on the overview page, then choosing export.


    This generates a compressed file, ios_health_export.zip. Upload it from your phone to your destination of choice; I went with Google Drive.

    Being Apple, I’d assumed that the contents might be some hideous proprietary binary format but in fact unzipping the file reveals a directory, apple_health_export, in which reside two XML files. The larger export.xml contains your health data.

    Records in the XML file consist of lines that specify the record type (measurement), source, three timestamps for creation, start and end, and the value of the measurement. Most of my records are step counts, which look like this:

    <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Health" unit="count" creationDate="2014-09-24 09:25:06 +1100" startDate="2014-09-23 18:01:22 +1100" endDate="2014-09-23 18:01:24 +1100" value="9"/>
    

    And so to R. In the past I would have used the XML package but in my ongoing effort to convert to the “tidyverse”, I’ll try xml2 instead. We’ll use purrr too for reasons that will become apparent, ggplot2 for plotting and dplyr because it is awesome.

    Reading in the file could not be easier:

    library(xml2)
    library(purrr)
    library(ggplot2)
    library(dplyr)
    
    health_data <- read_xml("export.xml")
    

    Nor could extracting the records that contain step counts. We use an xpath expression, then pipe the result to purrr’s mapping functions to go straight from XML attributes to a data frame, as described here.

    steps <- xml_find_all(health_data, ".//Record[@type='HKQuantityTypeIdentifierStepCount']") %>% map(xml_attrs) %>% map_df(as.list)
    
    glimpse(steps)
    Observations: 188,677
    Variables: 9
    $ type          <chr> "HKQuantityTypeIdentifierStepCount", "HKQuantityTypeIdentifierStepCount", "HKQuantityTypeIdentifierStepCount", "HKQuantityTypeIdentifierStepCount", ...
    $ sourceName    <chr> "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health", "Health"...
    $ unit          <chr> "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "count", "cou...
    $ creationDate  <chr> "2014-09-24 09:25:06 +1100", "2014-09-24 09:25:06 +1100", "2014-09-24 09:25:06 +1100", "2014-09-24 09:25:06 +1100", "2014-09-24 09:25:06 +1100", "20...
    $ startDate     <chr> "2014-09-23 17:58:58 +1100", "2014-09-23 17:59:08 +1100", "2014-09-23 17:59:18 +1100", "2014-09-23 17:59:28 +1100", "2014-09-23 17:59:58 +1100", "20...
    $ endDate       <chr> "2014-09-23 17:59:03 +1100", "2014-09-23 17:59:13 +1100", "2014-09-23 17:59:23 +1100", "2014-09-23 17:59:33 +1100", "2014-09-23 18:00:03 +1100", "20...
    $ value         <chr> "12", "5", "17", "1", "14", "4", "10", "2", "4", "2", "9", "7", "4", "9", "7", "6", "11", "13", "6", "8", "5", "8", "6", "9", "1", "7", "13", "6", "...
    $ sourceVersion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
    $ device        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
    

    To illustrate an example analysis, let’s aggregate steps to a monthly count and plot counts by month. We’ll assume that startDate is a proxy for day (i.e. I’m not walking at midnight so steps don’t straddle day boundaries). We’ll also assign the monthly count to the first day of the month, to avoid having to figure out which day number ends each month.

    So, to recode step count as an integer, convert the start date to a date object, summarise by month and plot, let’s see dplyr in action:

    steps %>%
      select(startDate, value) %>%
      group_by(Date = as.Date(paste(substr(startDate, 1, 7), "01", sep = "-"))) %>%
      summarise(count = sum(as.numeric(value))) %>%
      ggplot(aes(Date, count)) +
        geom_col(fill = "skyblue3") +
        theme_bw() +
        labs(y = "monthly step count",
             title = "Steps by month September 2014 - January 2017 as measured by iOS")
    

    Result:

    ios_steps

    As to how accurate the counts are: that’s for another day.


    To leave a comment for the author, please follow the link and comment on their blog: R – What You’re Doing Is Rather Desperate.


    Source:: R News

    Tidy grid search with pipelearner

    By Simon Jackson

    (This article was first published on blogR, and kindly contributed to R-bloggers)

    @drsimonj here to show you how to use pipelearner to easily grid-search hyperparameters for a model.

pipelearner is a package for making machine learning pipelines and is currently available to install from GitHub by running the following:

    # install.packages("devtools")  # Run this if devtools isn't installed
    devtools::install_github("drsimonj/pipelearner")
    library(pipelearner)
    

    In this post we’ll grid search hyperparameters of a decision tree (using the rpart package) predicting cars’ transmission type (automatic or manual) using the mtcars data set. Let’s load rpart along with tidyverse, which pipelearner is intended to work with:

    library(tidyverse)
    library(rpart)
    

    The data

    Quickly convert our outcome variable to a factor with proper labels:

    d <- mtcars %>% 
      mutate(am = factor(am, labels = c("automatic", "manual")))
    head(d)
    #>    mpg cyl disp  hp drat    wt  qsec vs        am gear carb
    #> 1 21.0   6  160 110 3.90 2.620 16.46  0    manual    4    4
    #> 2 21.0   6  160 110 3.90 2.875 17.02  0    manual    4    4
    #> 3 22.8   4  108  93 3.85 2.320 18.61  1    manual    4    1
    #> 4 21.4   6  258 110 3.08 3.215 19.44  1 automatic    3    1
    #> 5 18.7   8  360 175 3.15 3.440 17.02  0 automatic    3    2
    #> 6 18.1   6  225 105 2.76 3.460 20.22  1 automatic    3    1
    

    Default hyperparameters

    We’ll first create a pipelearner object that uses the default hyperparameters of the decision tree.

    pl <- d %>% pipelearner(rpart, am ~ .)
    pl
    #> $data
    #> # A tibble: 32 × 11
    #>      mpg   cyl  disp    hp  drat    wt  qsec    vs        am  gear  carb
    #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <fctr> <dbl> <dbl>
    #> 1   21.0     6 160.0   110  3.90 2.620 16.46     0    manual     4     4
    #> 2   21.0     6 160.0   110  3.90 2.875 17.02     0    manual     4     4
    #> 3   22.8     4 108.0    93  3.85 2.320 18.61     1    manual     4     1
    #> 4   21.4     6 258.0   110  3.08 3.215 19.44     1 automatic     3     1
    #> 5   18.7     8 360.0   175  3.15 3.440 17.02     0 automatic     3     2
    #> 6   18.1     6 225.0   105  2.76 3.460 20.22     1 automatic     3     1
    #> 7   14.3     8 360.0   245  3.21 3.570 15.84     0 automatic     3     4
    #> 8   24.4     4 146.7    62  3.69 3.190 20.00     1 automatic     4     2
    #> 9   22.8     4 140.8    95  3.92 3.150 22.90     1 automatic     4     2
    #> 10  19.2     6 167.6   123  3.92 3.440 18.30     1 automatic     4     4
    #> # ... with 22 more rows
    #> 
    #> $cv_pairs
    #> # A tibble: 1 × 3
    #>            train           test   .id
    #>           <list>         <list> <chr>
    #> 1 <S3: resample> <S3: resample>     1
    #> 
    #> $train_ps
    #> [1] 1
    #> 
    #> $models
    #> # A tibble: 1 × 5
    #>   target model     params     .f   .id
    #>    <chr> <chr>     <list> <list> <chr>
    #> 1     am rpart <list [1]>  <fun>     1
    #> 
    #> attr(,"class")
    #> [1] "pipelearner"
    

    Fit the model with learn():

    results <- pl %>% learn()
    results
    #> # A tibble: 1 × 9
    #>   models.id cv_pairs.id train_p         fit target model     params
    #>       <chr>       <chr>   <dbl>      <list>  <chr> <chr>     <list>
    #> 1         1           1       1 <S3: rpart>     am rpart <list [1]>
    #> # ... with 2 more variables: train <list>, test <list>
    

    The fitted results include our single model. Let’s assess the model’s performance on the training and test sets:

    # Function to compute accuracy
    accuracy <- function(fit, data, target_var) {
      # Coerce `data` to data.frame (needed for resample objects)
      data <- as.data.frame(data)
      # Obtain predicted class
      predicted <- predict(fit, data, type = "class")
      # Return accuracy
      mean(predicted == data[[target_var]])
    }
    
    # Training accuracy
    accuracy(results$fit[[1]], results$train[[1]], results$target[[1]])
    #> [1] 0.92
    
    # Test accuracy
    accuracy(results$fit[[1]], results$test[[1]], results$target[[1]])
    #> [1] 0.8571429
    

    Looks like we’ve achieved 92% accuracy on the training data and 86% accuracy on the test data. Perhaps we can improve on this by tweaking the model’s hyperparameters.

    Adding hyperparameters

When using pipelearner, you can pass any arguments that the learning function accepts after the formula. For example, run ?rpart and you'll see that control options can be added. To see these options, run ?rpart.control.
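
For quick reference, the defaults can also be inspected directly in the console rather than via the help page. A minimal sketch (rpart is already loaded above):

# List of default control settings, including minsplit and maxdepth
str(rpart.control())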

An obvious choice for decision trees is minsplit, which determines "the minimum number of observations that must exist in a node in order for a split to be attempted." By default it's set to 20. Given that we have such a small data set, this seems like a poor choice. We can adjust it as follows:

    pl <- d %>% pipelearner(rpart, am ~ ., minsplit = 5)
    results <- pl %>% learn()
    
    # Training accuracy
    accuracy(results$fit[[1]], results$train[[1]], results$target[[1]])
    #> [1] 0.92
    
    # Test accuracy
    accuracy(results$fit[[1]], results$test[[1]], results$target[[1]])
    #> [1] 0.8571429
    

Reducing minsplit will generally increase your training accuracy. Too small, however, and you'll overfit the training data, resulting in poorer test accuracy.

    Using vectors

    All the model arguments you provide to pipelearner() can be vectors. pipelearner will then automatically expand those vectors into a grid and test all combinations. For example, let’s try out many values for minsplit:

    pl <- d %>% pipelearner(rpart, am ~ ., minsplit = c(2, 4, 6, 8, 10))
    results <- pl %>% learn()
    results
    #> # A tibble: 5 × 9
    #>   models.id cv_pairs.id train_p         fit target model     params
    #>       <chr>       <chr>   <dbl>      <list>  <chr> <chr>     <list>
    #> 1         1           1       1 <S3: rpart>     am rpart <list [2]>
    #> 2         2           1       1 <S3: rpart>     am rpart <list [2]>
    #> 3         3           1       1 <S3: rpart>     am rpart <list [2]>
    #> 4         4           1       1 <S3: rpart>     am rpart <list [2]>
    #> 5         5           1       1 <S3: rpart>     am rpart <list [2]>
    #> # ... with 2 more variables: train <list>, test <list>
    

    Combining mutate from dplyr and map functions from the purrr package (all loaded with tidyverse), we can extract the relevant information for each value of minsplit:

    results <- results %>% 
      mutate(
        minsplit = map_dbl(params, "minsplit"),
        accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
        accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
      )
    
    results %>% select(minsplit, contains("accuracy"))
    #> # A tibble: 5 × 3
    #>   minsplit accuracy_train accuracy_test
    #>      <dbl>          <dbl>         <dbl>
    #> 1        2              1     0.5714286
    #> 2        4              1     0.5714286
    #> 3        6              1     0.5714286
    #> 4        8              1     0.5714286
    #> 5       10              1     0.5714286
    
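A quick way to eyeball how accuracy moves with minsplit is to plot it. A minimal sketch using tidyr's gather() and ggplot2 (both loaded with tidyverse), assuming results is the mutated tibble from above:

# Reshape the two accuracy columns to long form and plot them against minsplit
results %>%
  select(minsplit, accuracy_train, accuracy_test) %>%
  gather(set, accuracy, -minsplit) %>%
  ggplot(aes(minsplit, accuracy, colour = set)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  labs(title = "Decision tree accuracy by minsplit")
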

    This applies to as many hyperparameters as you care to add. For example, let’s grid search combinations of values for minsplit, maxdepth, and xval:

    pl <- d %>% pipelearner(rpart, am ~ .,
                            minsplit = c(2, 20),
                            maxdepth = c(2, 5),
                            xval     = c(5, 10))
    pl %>%
  learn() %>%
      mutate(
        minsplit = map_dbl(params, "minsplit"),
        maxdepth = map_dbl(params, "maxdepth"),
        xval     = map_dbl(params, "xval"),
        accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
        accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
      ) %>%
      select(minsplit, maxdepth, xval, contains("accuracy"))
    #> # A tibble: 8 × 5
    #>   minsplit maxdepth  xval accuracy_train accuracy_test
    #>      <dbl>    <dbl> <dbl>          <dbl>         <dbl>
    #> 1        2        2     5           1.00     0.8571429
    #> 2       20        2     5           0.92     0.8571429
    #> 3        2        5     5           1.00     0.8571429
    #> 4       20        5     5           0.92     0.8571429
    #> 5        2        2    10           1.00     0.8571429
    #> 6       20        2    10           0.92     0.8571429
    #> 7        2        5    10           1.00     0.8571429
    #> 8       20        5    10           0.92     0.8571429
    

    Not much variance in the accuracy, but it demonstrates how you can use this in your own work.
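
As an aside, the eight rows above are simply the Cartesian product of the three vectors supplied, the same combinations base R's expand.grid() would list. Purely as an illustration (pipelearner builds the grid for you, so you never need to call expand.grid() yourself):

# All 2 x 2 x 2 = 8 combinations of the hyperparameter values used above
expand.grid(minsplit = c(2, 20), maxdepth = c(2, 5), xval = c(5, 10))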

Using learn_models()

A bonus tip for those of you who are comfortable so far: you can use learn_models() to isolate multiple grid searches. For example:

    pl <- d %>%
      pipelearner() %>% 
      learn_models(rpart, am ~ ., minsplit = c(1, 2), maxdepth = c(4, 5)) %>% 
      learn_models(rpart, am ~ ., minsplit = c(6, 7), maxdepth = c(1, 2))
    
    pl %>%
  learn() %>%
      mutate(
        minsplit = map_dbl(params, "minsplit"),
        maxdepth = map_dbl(params, "maxdepth"),
        accuracy_train = pmap_dbl(list(fit, train, target), accuracy),
        accuracy_test  = pmap_dbl(list(fit, test,  target), accuracy)
      ) %>%
      select(minsplit, maxdepth, contains("accuracy"))
    #> # A tibble: 8 × 4
    #>   minsplit maxdepth accuracy_train accuracy_test
    #>      <dbl>    <dbl>          <dbl>         <dbl>
    #> 1        1        4           1.00     1.0000000
    #> 2        2        4           1.00     1.0000000
    #> 3        1        5           1.00     1.0000000
    #> 4        2        5           1.00     1.0000000
    #> 5        6        1           0.88     0.8571429
    #> 6        7        1           0.88     0.8571429
    #> 7        6        2           0.96     0.8571429
    #> 8        7        2           0.96     0.8571429
    

    Notice the separate grid searches for minsplit = c(1, 2), maxdepth = c(4, 5) and minsplit = c(6, 7), maxdepth = c(1, 2).

This is because grid search is applied separately for each model defined by a learn_models() call. This means you can keep different sets of hyperparameter combinations separate if you want to.

    Sign off

    Thanks for reading and I hope this was useful for you.

    For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

    If you’d like the code that produced this blog, check out the blogR GitHub repository.

    To leave a comment for the author, please follow the link and comment on their blog: blogR.


    Source:: R News

    Install R Packages on the Ubuntu Virtual Machine

    By arthur charpentier

For the (Advanced) R Crash Course of the Data Science for Actuaries program, we will use the Ubuntu virtual machine. There might be some issues when installing some packages… One trick is to open a terminal and use the sudo command to install the required system libraries (you will be asked for your password). Just type (or copy/paste):

    sudo apt-get install libcurl4-openssl-dev
    sudo apt-get install libxml2-dev
    sudo apt-get install openjdk-7-*
sudo update-alternatives --config java
    sudo apt-get install aptitude
    sudo aptitude install libgdal-dev
    sudo aptitude install libproj-dev
    sudo apt-get install curl
    sudo apt-get build-dep r-cran-rgl
    sudo apt-get install r-cran-plyr r-cran-xml r-cran-reshape r-cran-reshape2 r-cran-rmysql
    sudo apt-get install r-cran-rjava

    Then, in RStudio, enter

    install.packages("RCurl")
    install.packages("xml")
    install.packages("rJava")
    install.packages("rgdal")
    install.packages("xlsx")

    It should be fine…
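
If you want to verify that the packages are available afterwards, a quick sanity check is the following minimal sketch, which reports TRUE or FALSE for each package without attaching it:

# TRUE for each package whose namespace can be loaded
sapply(c("RCurl", "XML", "rJava", "rgdal", "xlsx"), requireNamespace, quietly = TRUE)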

    Source:: R News