DataCamp talks @user2015aalborg

By DataCamp


(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

The team behind DataCamp is getting ready for its third useR! attendance in as many years. This year we will contribute two talks: one on teaching R in (an online) class, and another on our testwhat package.

Talk One: Teaching R in (an Online) Class

Today, over 145,000 people have started a course on DataCamp. In this talk we will present some of the latest tools we have developed to make the learning experience even better.

The headliner of our presentation is a brand-new DataCamp feature that lets you create, manage, and participate in professional teams and groups of students taking courses on DataCamp. This feature comes with a dashboard that gives detailed insight into student and employee performance within and across courses. An instructor, professor, or team manager can use the tool both in a fully online setting and in a blended learning environment, and thanks to the underlying automation, less time needs to be spent on tasks such as grading.

Furthermore, we will give a brief introduction to creating courses on DataCamp for both our traditional interface and our new swirl interface. Finally, we will share some key insights from analyzing the data of students learning R.

When? Teaching 2 session, Thursday 16:00-17:30.

Talk Two: Taking testing to another level with testwhat

The architecture of the testthat R package (the de facto standard for writing unit tests for R packages) is very generic and lends itself to extension and adaptation. At DataCamp we adapted testthat for use on the R backend of our interactive learning platform. By defining a new type of reporter and adding user-friendly test functions designed specifically for checking the correctness of a student’s submission, the testwhat package now exists as a wrapper around testthat.
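
To make the architecture concrete, here is a minimal sketch of the idea — not the actual testwhat API; the function name and arguments are invented for illustration — showing how a student-facing check can be built on top of a testthat expectation:

library(testthat)

# Hypothetical helper: run a testthat expectation against the student's
# environment and convert the pass/fail outcome into friendly feedback.
check_object_value <- function(student_env, name, expected, feedback) {
  tryCatch({
    expect_equal(get(name, envir = student_env), expected)
    list(correct = TRUE, message = "Well done!")
  }, error = function(e) {
    list(correct = FALSE, message = feedback)
  })
}

# Pretend the student submitted `x <- 41`
student_env <- new.env()
assign("x", 41, envir = student_env)
check_object_value(student_env, "x", 42, "Did you assign the value 42 to `x`?")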

The talk intends to give a brief overview of testthat and its internals, followed by a more detailed discussion about testwhat and the elegant adaptations that have been made to leverage testthat’s functionality for an entirely different application.

When? Data Management session, Wednesday 13:00-14:30

See you soon!


The post DataCamp talks @user2015aalborg appeared first on The DataCamp Blog .

To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.


Everyone loves R markdown and Github; stories from the R Summit, day two

By richierocks

(This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers)

More excellent talks today!

Andrie de Vries of Microsoft kicked off today’s talks with a demo of checkpoint. This is his package for assisting reproducibility by letting you install packages from a specific date.

The idea is that a lot of R packages, particularly those from PhD projects, don’t get maintained and suffer bit rot. That means they often don’t work with current versions of R and current packages.

Since I now work in a university, and our team is trying to make sure we release an accompanying package of the data analysis steps with every paper, having a system for reproducibility is important to me. I’ve played around with packrat, and it is nice but a bit too much effort to bother with on a day-to-day basis. I’ve not used checkpoint before, but from Andrie’s demo it seems a little easier. You just add these lines to the top of your script:

library(checkpoint)
# use CRAN as it was on this date (and check the R version), installing into a separate library
checkpoint("2014-08-18", R.version = "3.0.3")

and it checks that you are running the correct version of R, downloads the packages your script needs from an archived snapshot of CRAN on the date you specified, and sets them up as a new library. Easy reproducibility.

Jeroen Ooms of UCLA (sort of) talked about his streaming suite of packages: curl, jsonlite and mongolite, for downloading web data, converting JSON to and from R objects, and working with MongoDB databases.

There have been other packages around for each of these three tasks, but the selling point is that Jeroen is a disgustingly talented coder and has written definitive versions.

jsonlite takes care to be consistent about the weird edge cases when converting between R and JSON. (The JSON spec doesn’t support infinity or NA, so you get a bit of control over what happens there, for example.)
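
A quick illustration of that control (a sketch; see ?toJSON for the exact options and defaults):

library(jsonlite)

# JSON has no native representation for NA or Inf, so toJSON() lets you choose
# how they are encoded via the `na` argument.
toJSON(c(1.5, NA, Inf), na = "string")  # encode them as the strings "NA" and "Inf"
toJSON(c(1.5, NA), na = "null")         # encode NA as a JSON null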

jsonlite also supports ndjson, where each line is a JSON object. This is important for large files: you can parse one line at a time and then return the whole thing as a list or data frame, or you can define line-specific behaviour.

This last case is what enables the streaming functionality. You use stream_in to read an ndjson file, then parse a line, manipulate the result, maybe stream_out to somewhere else, then move on to the next line. It made me wish I was doing fancy things with the Twitter firehose.
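
As a rough sketch of that streaming pattern (the file names and the lang field are hypothetical; each chunk arrives as a data frame):

library(jsonlite)

out <- file("english_only.ndjson", open = "wb")
stream_in(file("tweets.ndjson"),
          handler = function(chunk) {
            # keep only the rows we care about, then stream them straight back out
            keep <- chunk[!is.na(chunk$lang) & chunk$lang == "en", , drop = FALSE]
            if (nrow(keep) > 0) stream_out(keep, out, verbose = FALSE)
          },
          pagesize = 1000)
close(out)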

The MongoDB stuff also sounded interesting; I didn’t quite get a grasp of what makes it better than the other mongo packages, but he gave some examples of fast document searching.

Gabor Csardi of Harvard University talked about METACRAN, which is his spare time hobby.

One of the big sticking points in many people’s work in R is trying to find the best package to do something. There are a lot of tools you can use for this: Task Views, rdocumentation.org, crantastic.org, rseek.org, MRAN, the sos package, and so on. However, none of them are very good at recommending the best package for a given task.

This is where METACRAN comes in. Gabor gave a nice demo where he showed that when you search for “networks”, it successfully returns his igraph package. Um.

Jokes aside, it does seem like a very useful tool. His site also gives you information on trending packages.

He also mentioned another project, github.com/cran, a read-only mirror of CRAN, that lets you see what’s been updated when a new version of a package reaches CRAN. (Each package is a repository and each new version on CRAN counts as one commit.)

Peter Dalgaard of Copenhagen Business School, R-Core member and organiser of the summit, talked about R development conventions and directions.

He mentioned that many of the development principles that have shaped R were decided back in 1999, when R was desperately trying to gain credibility with organisations using SAS and Stata. This is, for example, why the base R distribution contains packages like nnet and spatial. While many users may not need neural networks or spatial statistics, it was important for the fledgling language to be seen to have these capabilities built in.

Peter said that some ubiquitous user-contributed packages are being considered for inclusion in the base distribution; data.table, Rcpp and plyr in particular were mentioned. The balancing act is that inclusion would ensure these packages keep working with new versions of R, but it requires more effort from R-Core, which is a finite resource.

Some other things Peter talked about were that R-Core worry about whether their traditional, very conservative approach to the code base is too strict and slows R’s development, and whether they should be more aggressive about removing quirks. This last point was referencing my talk from yesterday, so I was pleased that he had been listening.

Joe Rickert of Microsoft talked about the R community. He pointed out that “I use Excel, you use Excel, we have so much in common” is a conversation no-one has ever had. Joe thinks that “community” is mostly meaningless marketing speak, but R has it for real. I’m inclined to agree.

He talked about the R Consortium, which I’d somehow failed to hear about before. The point of the organisation is (mostly) to help build R infrastructure projects. The first big project they have in mind is a new version of R-Forge, probably built on top of GitHub, though official details are still secret, so there’s a bit of reading between the lines.

Users will be allowed to submit proposals for things for the consortium to build, and the proposals get voted on (I’m not sure whether by users or by the consortium members).

Joe also had a nice map of R User groups around the world, though my nearest one was several countries away. I guess I’d better start one of my own. If anyone in Qatar is interested in an R User Group, let me know in the comments.

Bettina Grün of Johannes Kepler Universitat talked about the R Journal. She discussed the history of the journal, and the topics that you can write about: packages, programming hints, and applications of R. There is also some content about changes in R and CRAN, and conference announcements.

One thing I didn’t get to ask her is how you blind a review of a paper about a package, since the package author is usually pretty easy to determine. My one experience of reviewing for the R Journal involved a paper about an update to the grid package, and included links to content on the University of Auckland website, and it was pretty clear that the only person that could have written it was Paul Murrell.

Mine Çetinkaya-Rundel of Duke University gave an impressive overview of how she teaches R to her undergraduate students. She suggested that while teaching programming at the same time as data analysis seems like it ought to make things harder, running a few lines of code often takes less instruction than telling students where to point and click.

Her other teaching tips included: work on datasets that are big enough to make working in Excel annoying, so students appreciate programming (and R) more; use interactive examples; use real-world datasets to get more engagement; and force the students to learn a reproducible workflow by making them write R markdown documents.

Mine also talked briefly about her other projects: Datafest, which is a weekend long data analysis competition, and reach, a coursera data analysis course.

Jenny Bryan of the University of British Columbia had another talk about teaching R, this time to grad students. She’s developed a pretty slick workflow where each student submits their assignments (also R markdown documents) into GitHub repos, which makes it really easy to check and run their code, comment on it, give them hints via pull requests, and let them peer review each other’s code. Since using git seems to be an essential skill for data scientists these days, it seems like a good idea to teach it explicitly while they are at university.

Jenny also talked about ways of finding interesting code via GitHub search. For example, if you want to see how vapply is used in practice, rather than just limiting yourself to the examples on the ?vapply help page, you can search all the packages on CRAN by going to Gabor’s GitHub CRAN mirror (or maybe Winston Chang’s R-source mirror) and typing vapply user:cran extension:R.

Karthik Ram of the University of California, Berkeley, headlined the day, talking about his work with ROpenSci, which is an organisation that creates open tools for data analysis (mostly R packages).

They have a ridiculously extensive set of packages for downloading online datasets, retrieving text corpuses, publishing your results, and working with spatial data.

ROpenSci also host regular community events and group phone calls.

Overall, it was an exciting day, and now I’m looking forward to going to the useR! conference. See you in Aalborg!


To leave a comment for the author, please follow the link and comment on his blog: 4D Pie Charts » R.


Shiny finally has a colour picker – use colourInput to select colours in Shiny apps

By Dean Attali's R Blog


(This article was first published on Dean Attali’s R Blog, and kindly contributed to R-bloggers)

I don’t always think Shiny is missing anything, but when I do – I fill in the gap myself.

That was meant to be read as The Most Interesting Man In The World, but now that I think about it – maybe he’s not the type of guy who would be building Shiny R packages…

Shiny has many useful input controls, but there was one that was always missing until today – a colour picker. The package shinyjs now has a colourInput() function and, of course, a corresponding updateColourInput(). There have been many times when I wanted to allow users in a Shiny app to select a colour, and I’ve seen that feature being requested multiple times on different online boards, so I decided to make my own such input control.


Demo

Click here for a Shiny app showing several demos of colourInput. If you don’t want to check out the Shiny app, here is a short GIF demonstrating the most basic functionality of colourInput.

The colours don’t actually look as ugly as they do in the GIF; here’s a screenshot of what a plain colourInput looks like.

Availability

colourInput() is available in shinyjs. You can either install it from GitHub with devtools::install_github("daattali/shinyjs") or from CRAN with install.packages("shinyjs").

Features

Simple and familiar

Using colourInput is simple if you’ve used Shiny before, and it’s as easy to use as any other input control. It was implemented to closely mimic all other Shiny inputs, so using it will feel very familiar. You can add a simple colour input to your Shiny app with colourInput("col", "Select colour", value = "red"). The return value from a colourInput is an uppercase HEX colour, so in the previous example the value of input$col would be #FF0000 (the HEX value of the colour red). If no initial value is given, the default is white (#FFFFFF).
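
As a minimal sketch of what that looks like in a complete app (assuming colourInput() is available from shinyjs as described above):

library(shiny)
library(shinyjs)  # provides colourInput() at the time of writing

ui <- fluidPage(
  colourInput("col", "Select colour", value = "red"),
  plotOutput("scatter")
)

server <- function(input, output, session) {
  output$scatter <- renderPlot({
    # input$col is an uppercase HEX string such as "#FF0000"
    plot(cars, col = input$col, pch = 19)
  })
}

shinyApp(ui, server)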

Allowing “transparent”

Since most functions in R that accept colours can also accept the value “transparent”, colourInput has an option to allow selecting the “transparent” colour. By default, only real colours can be selected, so you need to use the allowTransparent = TRUE parameter. When this feature is turned on, a checkbox appears inside the input box.

If the user checks the checkbox for “transparent”, then the colour input is grayed out and the returned value of the input is transparent. This is the only case when the value returned from a colourInput is not a HEX value. When the checkbox is unchecked, the value of the input will be the last selected colour prior to selecting “transparent”.

By default, the text of the checkbox reads “Transparent”, but you can change that with the transparentText parameter. For example, it might be more clear to a user to use the word “None” instead of “Transparent”. Note that even if you change the checkbox text, the return value will still be transparent since that’s the actual colour name in R.
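
A small sketch of an input with transparency enabled (the input id and labels are arbitrary):

colourInput("bg", "Background colour", value = "steelblue",
            allowTransparent = TRUE, transparentText = "None")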

This is what a colour input with transparency enabled looks like

allowTransparent demo

How the chosen colour is shown inside the input

By default, the colour input’s background will match the selected colour and the text inside the input field will be the colour’s HEX value. If that’s too much for you, you can customize the input with the showColour parameter to either only show the text or only show the background colour.

Here is what a colour input with each of the possible values for showColour looks like

showColour demo

Updating a colourInput

As with all other Shiny inputs, colourInput can be updated with the updateColourInput function. Any parameter that can be used in colourInput can be used in updateColourInput. This means that you can start with a basic colour input such as colourInput("col", "Select colour") and completely redesign it with

updateColourInput(session, "col", label = "COLOUR:", value = "orange",
  showColour = "background", allowTransparent = TRUE, transparentText = "None")

Flexible colour specification

Specifying a colour to the colour input is made very flexible to allow for easier use. When giving a colour as the value parameter of either colourInput or updateColourInput, there are a few ways to specify a colour:

  • Using a name of an R colour, such as red, gold, blue3, or any other name that R supports (for a full list of R colours, type colours())
  • If transparency is allowed in the colourInput, the value transparent (lowercase) can be used. This will update the UI to check the checkbox.
  • Using a 6-character HEX value, either with or without the leading #. For example, initializing a colourInput with any of the following values will all result in the colour red: ff0000, FF0000, #ff0000.
  • Using a 3-character HEX value, either with or without the leading #. These values will be converted to full HEX values by automatically doubling every character. For example, all the following values would result in the same colour: 1ac, #1Ac, 11aacc.

Works on any device

If you’re worried that maybe someone viewing your Shiny app on a phone won’t be able to use this input properly – don’t you worry. I haven’t quite checked every single device out there, but I did spend extra time making sure the colour selection JavaScript works in most devices I could think of. colourInput will work fine in Shiny apps that are viewed on Android cell phones, iPhones, iPads, and even Internet Explorer 8+.

Misc

In order to build colourInput, I needed to use a JavaScript colour picker library. After experimenting with many different colour pickers, I decided to use this popular jQuery colour picker as a base and extend it myself to make it work well with Shiny. I simplified much of the code and added some features that would make it integrate with Shiny much more easily. You can see the exact changes I’ve made in the README for my version of the library. The main features I added were support for a “transparent” checkbox, a complete redesign of the input field’s look, and rendering the colour picker’s colours entirely in CSS instead of using images.

It’s been pointed out that this function is not exactly in line with the general shinyjs idea, so it might not stay there forever. Ideally, colourInput will soon be part of shiny itself, but until it finds a more loving home I’ll keep it where it is.


If anyone has any comments or feedback, whether negative or positive, I’d love to hear it! Feel free to open issues on GitHub if there are any problems.

To leave a comment for the author, please follow the link and comment on his blog: Dean Attali’s R Blog.


Generalizing from Marketing Research: The Right Question and the Correct Analysis

By Joel Cadwell

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

The marketing researcher asks some version of the same question in every study: “Tell me what you want.” The rest is a summary of the notes taken during the ensuing conversation.

Steve Jobs’ quote suggests that we might do better getting a reaction to an actual product. You tell me that price is not particularly important to you, yet this one here costs too much. You claim that design is not an issue, except you love the look of the product shown. In casual discussion color does not matter, but that shade of aqua is ugly and you will not buy it.

Although Steve Jobs was speaking of product design using focus groups, we are free to apply his rule to all decontextualized research. “Show it to them” provides the context for product design when we embed the showing within a usage occasion. On the other hand, if you seek incremental improvements to current products and services, you ask about problems experienced or extensions desired in concrete situations because that is the context within which these needs arise. Of course, we end up with a lot more variables in our datasets as soon as we start asking about the details of feature preference or product usage.

For example, instead of rating the importance of color in your next purchase of a car, suppose that you are shown a color array with numerous alternatives to which many of your responses are likely to be “no” or marked “not applicable” because some colors are associated with options you are not buying. Yet, this is the context within which cars are purchased, and the manufacturer must be careful not to lose a customer when no color option is acceptable. In order to respond to the rating question, the car buyer searches memory for instances of “color problems” in the past. The manufacturer, on the other hand, is concerned about “color problems” in the future when only a handful of specific color combinations are available. Importance is simply the wrong question given the strategic issues.

Because the resulting data are high-dimensional and sparse, they are difficult to analyze with traditional multivariate techniques. This is where R makes its contribution, by offering tools from machine and statistical learning designed for the sparse, high-dimensional data that are produced whenever we provide a context.

We find such analyses in the data from fragmented product categories, where diverse consumer segments shop within distinct distribution channels for non-overlapping products and features (e.g., music purchases by young teens and older retirees). We can turn to R packages for nonnegative matrix factorization (NMF) and matrix completion (softImpute) to exploit such fragmentation and explain the observed high-dimensional and sparse data in terms of a much smaller set of inferred benefits.
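
As a rough sketch of the NMF side of this (simulated data; the NMF package is one of several implementations, and softImpute plays the analogous role when the matrix has missing entries):

library(NMF)

# Simulated sketch: rows = respondents, columns = colour options,
# entries = how often each colour was chosen (sparse, mostly zeros).
set.seed(1)
prefs <- matrix(rpois(200 * 30, lambda = 1), nrow = 200, ncol = 30)

fit <- nmf(prefs, rank = 4)  # explain the choices with 4 latent "benefits"
W <- basis(fit)              # respondents x benefits: who wants which benefit
H <- coef(fit)               # benefits x colours: how each benefit maps to colours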

What does your car color say about you? It’s a topic discussed in the media and among friends. It is a type of collaboration among purchasers who may have never met yet find themselves in similar situations and satisfy their needs in much the same manner. A particular pattern of color preferences has meaning only because it is shared by some community. Matrix factorization reveals that hidden structure by identifying the latent benefits responsible for the observed color choices.

I may be mistaken, but I imagine that Steve Jobs might find all of this helpful.

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


Bio7 at the useR! Conference 2015

By » R

(This article was first published on » R, and kindly contributed to R-bloggers)

I recently released a new version of Bio7, just in time for the useR! 2015 conference, where I will present Bio7 in the Ecology oral session in a talk titled:

“A Graphical User Interface for R in an Integrated Development Environment for Ecological Modeling, Scientific Image Analysis and Statistical Analysis”.

I hope to see you there and if you have any questions about Bio7 don’t hesitate to contact me. I’m looking forward to the conference.

To leave a comment for the author, please follow the link and comment on his blog: » R.


DotCity: a game written in R? and other statistical computer games?

By civilstat


(This article was first published on Civil Statistician » R, and kindly contributed to R-bloggers)

A while back I recommended Nathan Uyttendaele’s beginner’s guide to speeding up R code.

I’ve just heard about Nathan’s computer game project, DotCity. It sounds like a statistician’s minimalist take on SimCity, with a special focus on demographic shifts in your population of dots (baby booms, aging, etc.). Furthermore, he’s planning to program the internals using R.

This is where scatterplot points go to live and play when they’re not on duty.

Consider backing the game on Kickstarter (through July 8th). I’m supporting it not just to play the game itself, but to see what Nathan learns from the development process. How do you even begin to write a game in R? Will gamers need to have R installed locally to play it, or will it be running online on something like an RStudio server?

Meanwhile, do you know of any other statistics-themed computer games?

  • I missed the boat on backing Timmy’s Journey, but happily it seems that development is going ahead.
  • SpaceChem is a puzzle game about factory line optimization (and not, actually, about chemistry). Perhaps someone can imagine how to take it a step further and gamify statistical process control à la Shewhart and Deming.
  • It’s not exactly stats, but working with data in textfiles is an important related skill. The Command Line Murders is a detective noir game for teaching this skill to journalists.
  • The command line approach reminds me of Zork and other old text adventure / interactive fiction games. Perhaps, using a similar approach to the step-by-step interaction of swirl (“Learn R, in R”), someone could make an I.F. game about data analysis. Instead of OPEN DOOR, ASK TROLL ABOUT SWORD, TAKE AMULET, you would type commands like READ TABLE, ASK SCIENTIST ABOUT DATA DICTIONARY, PLOT RESIDUALS… all in the service of some broader story/puzzle context, not just an analysis by itself.
  • Kim Asendorf wrote a fictional “short story” told through a series of data visualizations. (See also FlowingData’s overview.) The same medium could be used for a puzzle/mystery/adventure game.
To leave a comment for the author, please follow the link and comment on his blog: Civil Statistician » R.


Deaths in the Netherlands by cause and age

By Wingfeet

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I downloaded counts of deaths by age, year, and major cause from the Dutch statistics site. In this post I make some plots to look at causes and at changes between the years.

Data

Data come from CBS. I downloaded the data in Dutch, hence the first thing to do was provide some kind of translation. The coding used seems slightly different from the ICD-10 main categories (and has been reordered alphabetically compared to them). I used Google Translate and ICD-10 to obtain the translations.

Plots

Preparation

In the following I will be using both percentage of population and percentage of deaths by age cohort. The percentage of deaths is needed because in some cohorts the percentages of deaths are much higher, thereby hiding anything happening in the other cohorts. In addition I should mention that, for visual purposes, only the eight most important causes are used in the plots.

Young

It seems that most of the risks are associated with birth. In addition, these risks have been steadily decreasing.
Looking at the age cohorts above 0 years, accidents seem to be the most important cause. Most remarkable is a spike in 1953, which occurs for all four ages. After some consideration, I link this to the North Sea flood of 1953. It is remarkable that this is visible in the plot; it says a lot about how safe we are from accidents that it is. In the age category 15 to 20 there is also a relatively large bump from 1970 to 1975. This is more difficult to explain, but I suspect traffic, especially the moped: a light motorcycle that was preferably boosted to run much faster than the legal speed. 1975 saw the requirement to wear a helmet. It was much hated at the time, but in hindsight I can see that the government felt compelled to do something, and that it did have an effect.
Looking at the plots, it seems the next big cause is Neoplasms. This is not because they have become more deadly, but because accidents are being brought under control.

Elderly

For the elderly, diseases of the circulatory system are the main cause, and they are decreasing quite a bit. The number of Symptoms and Abnormal Clinical Observations seems to decrease too; since this seems to be a polite name for the ‘other’ category, this may reflect better diagnostics.
What is less visible is the increase in mental and behavioral disorders, especially after 1980 and at the oldest ages. It also seems that Neoplasms are decreasing very slowly.

Code

data reading

library(dplyr)
library(ggplot2)
txtlines <- readLines('Overledenen__doodsoo_170615161506.csv')
txtlines <- grep('Centraal', txtlines, value = TRUE, invert = TRUE)
#txtlines[1:5]
#cat(txtlines[4])
r1 <- read.csv(sep = ';', header = FALSE,
    col.names = c('Causes', 'Causes2', 'Age', 'year', 'aantal', 'count'),
    na.strings = '-', text = txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
transcauses <- c(
“Infectious and parasitic diseases”,
“Diseases of skin and subcutaneous”,
“Diseases musculoskeletal system and connective “,
“Diseases of the genitourinary system”,
“Pregnancy, childbirth”,
“Conditions of perinatal period”,
“Congenital abnormalities”,
“Sympt., Abnormal clinical Observations”,
“External causes of death”,
“Neoplasms”,
“Illness of blood, blood-forming organs”,
“Endocrine, nutritional, metabolic illness”,
“Mental and behavioral disorders”,
“Diseases of the nervous system and sense organs”,
“Diseases of the circulatory system”,
“Diseases of the respiratory organs”,
“Diseases of the digestive organs”,
“Population”,
“Total all causes of death”)
#cc <- cbind(transcauses, levels(r1$Causes))
#options(width=100)
levels(r1$Causes) <- transcauses
levels(r1$Age) <-
  gsub('jaar', 'year', levels(r1$Age)) %>%
  gsub('tot', 'to', .) %>%
  gsub('of ouder', '+', .)

Preparation for plots

perc.of.death <- filter(r1, Causes == 'Total all causes of death') %>%
  mutate(., Population = count) %>%
  select(., -count, -Causes) %>%
  merge(., r1) %>%
  filter(., Causes %in% transcauses[1:17]) %>%
  mutate(., Percentage = 100 * count / Population,
         Causes = factor(Causes),
         year = as.numeric(gsub('*', '', year, fixed = TRUE))
  )
perc.of.pop <- filter(r1, Causes == 'Population') %>%
  mutate(., Population = count) %>%
  select(., -count, -Causes) %>%
  merge(., r1) %>%
  filter(., Causes %in% transcauses[1:17]) %>%
  mutate(., Percentage = 100 * count / Population,
         Causes = factor(Causes),
         year = as.numeric(gsub('*', '', year, fixed = TRUE))
  )

young

png('youngpop1.png')
tmp1 <- perc.of.pop %>% filter(., Age %in% levels(perc.of.pop$Age),
    !is.na(Percentage)) %>%
  mutate(., Age = factor(Age, levels = levels(perc.of.pop$Age)),
         Causes = factor(Causes))
# select 'important' causes (those that at some point got over 15%)
group_by(tmp1, Causes) %>%
  summarize(., mp = max(Percentage)) %>%
  mutate(., rk = rank(-mp)) %>%
  merge(., tmp1) %>%
  filter(., rk <= 8) %>%
  ggplot(.,
      aes(y = Percentage, x = year, col = Causes)) +
  geom_line() +
  guides(col = guide_legend(ncol = 2)) +
  facet_wrap(~Age) +
  theme(legend.position = "bottom") +
  ylab('Percentage of Cohort')
dev.off()
###
png('youngpop2.png')
tmp1 <- perc.of.pop %>% filter(., Age %in% levels(perc.of.pop$Age),
    !is.na(Percentage)) %>%
  mutate(., Age = factor(Age, levels = levels(perc.of.pop$Age)),
         Causes = factor(Causes))
# select 'important' causes (those that at some point got over 15%)
group_by(tmp1, Causes) %>%
  summarize(., mp = max(Percentage)) %>%
  mutate(., rk = rank(-mp)) %>%
  merge(., tmp1) %>%
  filter(., rk <= 8) %>%
  ggplot(.,
      aes(y = Percentage, x = year, col = Causes)) +
  geom_line() +
  guides(col = guide_legend(ncol = 2)) +
  facet_wrap(~Age) +
  theme(legend.position = "bottom") +
  ylab('Percentage of Cohort')
# https://en.wikipedia.org/wiki/North_Sea_flood_of_1953
dev.off()

old

png('oldpop.png')
tmp2 <- perc.of.pop %>% filter(., Age %in% levels(perc.of.pop$Age)[18:21],
    !is.na(Percentage)) %>%
  mutate(., Age = factor(Age),
         Causes = factor(Causes))
group_by(tmp2, Causes) %>%
  summarize(., mp = max(Percentage)) %>%
  mutate(., rk = rank(-mp)) %>%
  merge(., tmp2) %>%
  filter(., rk <= 8) %>%
  ggplot(.,
      aes(y = Percentage, x = year, col = Causes)) +
  geom_line() +
  guides(col = guide_legend(ncol = 2)) +
  facet_wrap(~Age) +
  theme(legend.position = "bottom") +
  ylab('Percentage of Cohort')
dev.off()
# rj.GD
# 2

png('oldpop2.png')
tmp2 <- perc.of.death %>%
  filter(.,
      Age %in% levels(perc.of.death$Age)[18:21],
      year >= 1980,
      !is.na(Percentage)) %>%
  mutate(., Age = factor(Age),
         Causes = factor(Causes))
group_by(tmp2, Causes) %>%
  summarize(., mp = max(Percentage)) %>%
  mutate(., rk = rank(-mp)) %>%
  merge(., tmp2) %>%
  filter(., rk <= 8) %>%
  ggplot(.,
      aes(y = Percentage, x = year, col = Causes)) +
  geom_line() +
  guides(col = guide_legend(ncol = 2)) +
  facet_wrap(~Age) +
  theme(legend.position = "bottom") +
  ylab('Percentage of Cohort')

dev.off()








To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.


The Workflow of Infinite Shame, and other stories from the R Summit

By richierocks

A flow diagram of CRAN submission steps with an infinite loop

(This article was first published on 4D Pie Charts » R, and kindly contributed to R-bloggers)

At day one of the R Summit at Copenhagen Business School there was a lot of talk about the performance of R, and alternate R interpreters.

Luke Tierney of the University of Iowa, the author of the compiler package, and R-Core member who has been working on R’s performance since, well, pretty much since R was created, talked about future improvements to R’s internals.

Plans to improve R’s performance include implementing proper reference counting (that is, tracking how many variables point at a particular bit of memory; the current version only counts zero/one/two-or-more, and a more accurate count means less copying). Improving scalar performance and reducing function overhead are also high priorities for performance enhancement. Currently, when you do something like

for(i in 1:100000000) {}

R will allocate a vector of length 100000000, which takes a ridiculous amount of memory. By being smart and realising that you only ever need one number at a time, you can store the vector much more efficiently. The same principle applies to seq_len and seq_along.
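
A rough way to see the cost (the exact figure depends on your R version, and future versions may store such sequences more compactly):

x <- 1:100000000
print(object.size(x), units = "Mb")  # roughly 381 Mb when fully materialised as integers
rm(x); gc()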

Other possible performance improvements that Luke discussed include a more efficient data structure for environments, and storing the results of complex objects like model fits more efficiently. (How often do you use that qr element in an lm model anyway?)

Tomas Kalibera of Northeastern University has been working on a tool for finding PROTECT bugs in the R internals code. I last spoke to Tomas in 2013 when he was working with Jan Vitek on the alternate R engine, FastR. See Fearsome Engines part 1, part 2, part 3. Since then FastR has become a purely Oracle project (more on that in a moment), and the Purdue University fork of FastR has been retired.

The exact details of the PROTECT macro went a little over my head, but the essence is that it is used to stop memory being overwritten, and it’s a huge source of obscure bugs in R.

Lukas Stadler of Oracle Labs is the heir to the FastR throne. The Oracle Labs team have rebuilt it on top of Truffle, an Oracle product for generating dynamically optimized Java bytecode that can then be run on the JVM. Truffle’s big trick is that it can auto-generate this bytecode for a variety of languages: R, Ruby and JavaScript are the officially supported languages, with C, Python and Smalltalk as side projects. Lukas claimed that peak performance (that is, “on a good day”) for Truffle-generated code is comparable to language-specific optimized code.

Non-vectorised code is the main beneficiary of the speedup. He had a cool demo where a loopy version of the sum function ran slowly, then Truffle learned how to optimise it, and the result became almost as fast as the built-in sum function.

He has a complaint that the R.h API from R to C is really an API from GNU R to C; that is, it makes too many assumptions about how GNU R works, and these don’t hold true when you are running a Java version of R.

Maarten-Jan Kallen from BeDataDriven works on Renjin, the other R interpreter built on top of the JVM. Based on his talk, and some other discussion with Maarten, it seems that there is a very clear mission statement for Renjin: BeDataDriven just want a version of R that runs really fast inside Google App Engine. They also have an interesting use case for Renjin – it is currently powering software for the United Nations’ humanitarian effort in Syria.

Back to the technical details, Maarten showed an example where R 3.1.0 introduced the anyNA function as a fast version of any(is.na(x)). In the case of Renjin, this isn’t necessary since the original idiom is already fast. (Though if Luke Tierney’s plans come true, it won’t be needed in GNU R soon either.)
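
A quick way to see the difference in GNU R — a sketch using the microbenchmark package:

library(microbenchmark)

x <- c(rnorm(1e6), NA)
microbenchmark(
  any(is.na(x)),  # allocates a full logical vector before reducing it
  anyNA(x)        # single pass, no intermediate logical vector
)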

Calling external code still remains a problem for Renjin; in particular, Rcpp and its reverse dependencies won’t build for it. The spread of Rcpp, he lamented, even includes roxygen2.

Hannes Mühleisen has also been working with BeDataDriven, and completed Maarten’s talk. He previously worked on integrating MonetDB with R, and has been applying his database expertise to Renjin. In the same way that when you run a query in a database, it generates a query plan to try and find the most efficient way of retrieving your results, Renjin now generates a query plan to find the most efficient way to evaluate your code. That means using a deferred execution system where you avoid calculating things until the last minute, and in some cases not at all because another calculation makes them obsolete.

Karl Millar from Google has been working on CXXR. This was a bit of a shock to me. When I interviewed Andrew Runnalls in 2013, he didn’t really sell CXXR to me particularly well. The project goals at the time were to clean up the GNU R code base, rewriting it in modern C++ and documenting it properly, to use as a reference implementation of R. It all seemed like a bit of academic fun rather than anything useful. Since Google started working on the project, the focus has changed: it is now all about having a high-performance version of R.

I asked Andrew why he chose CXXR for this purpose. After all, of the half a dozen alternate R engines, CXXR was the only one that didn’t explicitly have performance as a goal. His response was that it has nearly 100% code compatibility with GNU R, and that the code is so clear that it is easy to make changes.

That talk focussed on some of the difficulties of optimizing R code. For example, in the assignment

a <- b + c

you don’t know how long b and c are, or what their classes are, so you have to spend a long time looking things up. At runtime, however, you can guess a bit better: b and c are probably the same size and class as they were last time, so guess that first.

He also had a little dig at Tomas Kalibera’s work, saying that CXXR has managed to eliminate almost all the PROTECT macros in its codebase.

Radford Neal talked about some optimizations in his pqR project, which uses both interpreted and byte-compiled code.

In interpreted code, pqR uses a “variant result” mechanism, which sounded similar to Renjin’s “deferred execution”.

A performance boost comes from having a fast interface to unary primitives. This also makes eval faster. Another boost comes from a smarter way of not looking for variables in certain frames. For example, a list of which frames contain overrides for special symbols (+, [, if, etc.) is maintained, so calling them is faster.

Matt Dowle (“the data.table guy”) of H2O gave a nice demo of H2O Flow, a slick web-based GUI for machine learning on big datasets. It does lots of things in parallel, and is scriptable.

Indrajit Roy of HP Labs and Michael Lawrence of Genentech gave a talk on distributed data structures. These seem very good for cases where you need to access your data from multiple machines.

The SparkR package gives access to distributed data structures with a Spark backend, however Indrajit wasn’t keen, saying that it is too low-level to be easy to work with.

Instead, he and Michael have developed the dds package, which gives a standard interface for using distributed data structures. The package sits on top of Spark.dds and distributedR.dds. The analogy is with DBI providing a standard database interface that uses RSQLite or RPostgreSQL underneath.

Ryan Hafen of Tessera talked about their product (I think also called Tessera) for analysing large datasets. It’s a fancy wrapper to MapReduce that also has distributed data objects. I didn’t get a chance to ask if they support the dds interface. The R packages of interest are datadr and trelliscope.

My own talk was less technical than the others today. It consisted of a series of rants about things I don’t like about R, and how to fix them. The topics included how to remove quirks from the R language (please deprecate indexing with factors), helping new users (let the R community create some vignettes to go in base-R), and how to improve CRAN (CRAN is not a code hosting service, CRAN is a shop for packages). I don’t know if any of my suggestions will be taken up, but my last slide seemed to generate some empathy.
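
For anyone who hasn’t hit that particular quirk, a small example of why indexing with factors surprises people:

x <- c(a = 1, b = 2, c = 3)
f <- factor("b")
x[f]                # uses the factor's underlying integer code (1), so this returns the "a" element
x[as.character(f)]  # indexing by the label "b" is almost always what was intended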

I’ve named my CRAN submission process “The Workflow of Infinite Shame”. What tends to happen is that I check that things work on my machine, submit to CRAN, and about an hour later get a response saying “we see these errors, please fix”. Quite often, especially for things involving locales or writing files, I cannot reproduce the issue, so I fiddle about a bit and guess, then resubmit. After five or six iterations, I’ve lost all sense of dignity, and while R-core are very patient, I’m sure they assume that I’m an idiot.

CRAN currently includes a Win-builder service that lets you submit packages, builds them under Windows, then tells you the results. What I want is an everything-builder service that builds and checks my package on all the necessary platforms (Windows, OS X, Linux, BSD, Solaris on R-release, R-patched, and R-devel), and only if it passes does a member of R-core get to see the problem. That way, R-core’s time isn’t wasted, and more importantly I look like less of an idiot.

The workflow of infinite shame encapsulates my CRAN submission process.

Tagged: r, r-summit

To leave a comment for the author, please follow the link and comment on his blog: 4D Pie Charts » R.


What is a good Sharpe ratio?

By John Mount

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

We have previously written that we like the investment performance summary called the Sharpe ratio (though it does have some limits). What the Sharpe ratio does is give you a dimensionless score to compare similar investments that may vary both in riskiness and returns, without needing to know the investor’s risk tolerance. It does this … Continue reading What is a good Sharpe ratio?
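
For readers who want the gist without clicking through: the Sharpe ratio is the mean excess return divided by the standard deviation of those excess returns. A minimal sketch in R (the annualisation factor of 252 trading days is an assumption, not part of the original post):

# Sharpe ratio from periodic returns; rf is the per-period risk-free rate.
sharpe_ratio <- function(returns, rf = 0, periods_per_year = 252) {
  excess <- returns - rf
  sqrt(periods_per_year) * mean(excess) / sd(excess)
}

set.seed(42)
daily <- rnorm(252, mean = 0.0004, sd = 0.01)  # a hypothetical year of daily returns
sharpe_ratio(daily)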

To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R.


Open Analytics @ UseR! 2015

By tobias


(This article was first published on Open Analytics – Blog, and kindly contributed to R-bloggers)
Saturday 27 June 2015 – 22:47

The knarrs of Open Analytics have left the port of Antwerp on their way to Denmark. What will our delegation bring to Aalborg besides our loyal sponsorship?

Tuesday

For starters, we’ve just released a Dockerfile Editor, which may be particularly useful for Dirk Eddelbuettel’s tutorial on Docker on June 30. An overview of all tutorials can be found here.

Wednesday

On the first day of the conference, Willem Ligtenberg will present at 16:00 on how to use databases in R without writing a line of SQL, using Rango. The Databases session will see the first public appearance of an object-relational mapper (ORM) for R.

At the same time (16:00, Kaleidoscope 3), Tobias Verbeke will demonstrate our flagship product Architect, a full-fledged integrated development environment for data science. Keep an eye on our home page: rumours say a new release is approaching, and that it is packed with new features…

For the more statistically inclined, Raphaël Coudret will present his implementation of the SAEM algorithm for left-censored data as part of the Lightning Talks.

Our final contribution for Wednesday will be a number of posters at the poster session starting at 18:00. As always, Jason Waddell has some insightful visualization experience to share; this time he will focus on density traces in plot legends. Maxim Nazarov has a poster on a suite of R packages for toxicological analyses following recently updated recommendations from the OECD. Meryam Krit, finally, presents goodness-of-fit techniques for exponential and two-parameter Weibull distributions. She provides fast implementations of existing and novel methods that have, above all, been nicely packaged!

Thursday

On Thursday, Arunkumar Srinivasan kicks off at 10:30 (Kaleidoscope session) and will show how to push the limits of data manipulation with the recent features in the data.table package. It will be fast, it will be flexible, and it will be memory-efficient!

Pushing the limits is also what Marvin Steijaert will talk about (10:30, Medicine session), albeit on how predictive pipelines are deployed in pharmaceutical companies to push the frontiers of drug discovery using phenotypic deconvolution.

On Thursday at 13:00 (Visualisation 1), Kirsten Van Hoorde will demonstrate multiclass calibration tools in her new multiCalibration package.
At 16:00 (Visualisation 2), Laure Cougnaud shows how she stays on top of high-dimensional genomic datasets using interactive versions of well-known and less well-known visualization techniques.

Friday?

What about Friday? No more contributions from Open Analytics. Why? Conference dinner on Thursday evening. :-)

Looking forward to a great conference!

This post is about:

architect, r, useR

To leave a comment for the author, please follow the link and comment on his blog: Open Analytics – Blog.
