simmer 4.0.0

By Iñaki Úcar

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The 4.0.0 release of simmer, the Discrete-Event Simulator for R, is on CRAN under a new license: we decided to switch to GPL >= 2. Most notably in this major release, the C++ core has been refactorised and exposed under inst/include. This is not a big deal for most users, but it enables extensions. As an example of this, simmer.mon is an experimental package that links to simmer and extends its monitoring facilities to provide a new DBI-based backend. Not a very efficient one, but it demonstrates how to extend the simmer core from another package.

Exception handling has been remarkably improved. In previous releases, errors were reported as happening in the run() method, which is, obviously, where everything happens. In this version, errors are caught and more information is provided, in particular the simulation time, the arrival and the activity involved:

library(simmer)

bad.traj <- trajectory() %>%
  timeout(function() NA)

simmer() %>%
  add_generator("dummy", bad.traj, at(pi)) %>%
  run()
## Error: 'dummy0' at 3.14 in 'Timeout':
##  missing value (NA or NaN returned)

Another improvement has to do with attributes. These are commonly used to build incremental indices, but some boilerplate was needed to initialise them. Now this is automatic (and configurable):

index.traj <- trajectory() %>%
  set_global("index", 1, mod="+", init=10)

simmer() %>%
  add_generator("dummy", index.traj, at(1:3), mon=2) %>%
  run() %>%
  get_mon_attributes()
##   time name   key value replication
## 1    1      index    11           1
## 2    2      index    12           1
## 3    3      index    13           1

Finally, the log_ activity was created for occasional debugging, but we noticed that simmer users rely on it much more heavily to understand what is happening while they build models; all that output becomes annoying once a model is complete. Therefore, we have implemented simulation-scoped logging levels, so that specific messages can be turned on and off on demand:

log.traj <- trajectory() %>%
  log_("This will be always printed") %>% # level=0
  log_("This can be disabled", level=1)

simmer(log_level=1) %>%
  add_generator("dummy", log.traj, at(pi)) %>%
  run() %>% invisible()
## 3.14159: dummy0: This will be always printed
## 3.14159: dummy0: This can be disabled
simmer() %>% # log_level=0
  add_generator("dummy", log.traj, at(pi)) %>%
  run() %>% invisible()
## 3.14159: dummy0: This will be always printed

See below for a comprehensive list of changes.

New features:

  • The C++ core has been refactorised into a header-only library under inst/include (#147 closing #145). Therefore, from now on it is possible to extend the C++ API from another package by listing simmer under the LinkingTo field in the DESCRIPTION file.
  • New generic monitor constructor enables the development of new monitoring backends in other packages (179f656, as part of #147).
  • New simulation-scoped logging levels. The log_ activity has a new argument level which determines whether the message is printed depending on a global log_level defined in the simmer constructor (#152).
  • set_attribute and set_global gain a new argument to automatically initialise new attributes (#157). Useful to update counters and indexes in a single line, without initialisation boilerplate.

Minor changes and fixes:

  • Enhanced exception handling, with more informative error messages (#148).
  • Refactorisation of the printing methods and associated code (#149).
  • Allow empty trajectories in sources and activities with sub-trajectories (#151 closing #150).
  • Enable -DRCPP_PROTECTED_EVAL (Rcpp >= 0.12.17.3), which provides fast evaluation of R expressions by leveraging the new stack unwinding protection API (R >= 3.5.0).
  • Replace backspace usage in vector’s ostream method (2b2f43e).
  • Fix namespace clashes with rlang and purrr (#154).

Article originally published in Enchufa2.es: simmer 4.0.0.

To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

RStudio:addins part 4 – Unit testing coverage investigation and improvement, made easy

By Jozef's Rblog

The pretty rewards for your tests

(This article was first published on Jozef’s Rblog, and kindly contributed to R-bloggers)

Introduction

A developer always pays his technical debts! And we have a debt to pay to the gods of coding best practices, as we did not present many unit tests for our functions yet. Today we will show how to efficiently investigate and improve unit test coverage for our R code, with focus on functions governing our RStudio addins, which have their own specifics.

As a practical example, we will do a simple restructuring of one of our functions to increase its test coverage from a mere 34% to over 90%.


Fly-through of unit testing in R

Much has been written on the importance of unit testing, so we will not spend more time on convincing the readers, but rather very quickly provide a few references in case the reader is new to unit testing with R. In the later parts of the article we assume that these basics are known.

In a few words

  • devtools – makes package development easier by providing R functions that simplify common tasks
  • testthat – is the most popular unit testing package for R
  • covr – helps track test coverage for R packages and view reports locally or (optionally) upload the results

For a start guide to use testthat within a package, visit the Testing section of R packages by Hadley Wickham. I would also recommend checking out the showcase on the 2.0.0 release of the testthat itself.
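For readers who have not used testthat before, a unit test is just a set of expectations wrapped in test_that(). The following self-contained sketch (a toy example, not code from the jhaddins package) shows the basic shape:

```r
library(testthat)

# A trivial function to test
add_one <- function(x) x + 1

# test_that() groups related expectations under a description
test_that("add_one increments numeric input", {
  expect_equal(add_one(1), 2)
  expect_equal(add_one(c(1, 2)), c(2, 3))
  expect_error(add_one("a"))  # non-numeric input should fail
})
```

In a package, files of such tests live under tests/testthat/ and are run by devtools::test().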

Investigating test coverage within a package

For the purpose of investigating the test coverage of a package we can use the covr package. Within an R project, we can call the package_coverage() function to get a nicely printed high-level overview, or we can provide a specific path to a package root directory and call it as follows:

# This looks much prettier in the R console ;)
covr::package_coverage(pkgPath)
## jhaddins Coverage: 59.05%
## R/viewSelection.R: 34.15%
## R/addRoxytag.R: 40.91%
## R/makeCmd.R: 92.86%

For a deeper investigation, converting the results to a data.frame might be very useful. The below shows the count of number of times that given expression was called during the running of our tests for each group of code lines:

covResults <- as.data.frame(covr::package_coverage(pkgPath))
covResults
##             filename         functions first_line last_line value
## 1     R/addRoxytag.R            roxyfy         10        12     6
## 2     R/addRoxytag.R            roxyfy         11        11     2
## 3     R/addRoxytag.R            roxyfy         13        15     4
## 4     R/addRoxytag.R            roxyfy         14        14     2
## 5     R/addRoxytag.R            roxyfy         16        16     2
## 6     R/addRoxytag.R            roxyfy         17        17     2
## 7     R/addRoxytag.R            roxyfy         18        18     2
## 8     R/addRoxytag.R        addRoxytag         29        29     0
## 9     R/addRoxytag.R        addRoxytag         30        37     0
## 10    R/addRoxytag.R        addRoxytag         32        34     0
## 11    R/addRoxytag.R        addRoxytag         38        38     0
## 12    R/addRoxytag.R    addRoxytagCode         44        44     0
## 13    R/addRoxytag.R    addRoxytagLink         50        50     0
## 14    R/addRoxytag.R     addRoxytagEqn         56        56     0
## 15       R/makeCmd.R           makeCmd         20        24     5
## 16       R/makeCmd.R           makeCmd         21        21     0
## 17       R/makeCmd.R           makeCmd         23        23     5
## 18       R/makeCmd.R           makeCmd         25        27     5
## 19       R/makeCmd.R           makeCmd         26        26     4
## 20       R/makeCmd.R           makeCmd         28        32     5
## 21       R/makeCmd.R           makeCmd         33        35     5
## 22       R/makeCmd.R           makeCmd         34        34     2
## 23       R/makeCmd.R           makeCmd         36        38     5
## 24       R/makeCmd.R           makeCmd         37        37     1
## 25       R/makeCmd.R           makeCmd         39        39     5
## 26       R/makeCmd.R      replaceTilde         48        50     1
## 27       R/makeCmd.R      replaceTilde         49        49     1
## 28       R/makeCmd.R      replaceTilde         51        51     1
## 29       R/makeCmd.R        executeCmd         61        61     5
## 30       R/makeCmd.R        executeCmd         62        66     5
## 31       R/makeCmd.R        executeCmd         68        72     3
## 32       R/makeCmd.R        executeCmd         69        69     0
## 33       R/makeCmd.R        executeCmd         71        71     3
## 34       R/makeCmd.R runCurrentRscript         90        90     1
## 35       R/makeCmd.R runCurrentRscript         91        91     1
## 36       R/makeCmd.R runCurrentRscript         92        96     1
## 37       R/makeCmd.R runCurrentRscript         93        95     1
## 38       R/makeCmd.R runCurrentRscript         94        94     0
## 39 R/viewSelection.R     viewSelection          7         7     0
## 40 R/viewSelection.R     viewSelection          8        12     0
## 41 R/viewSelection.R     viewSelection         10        10     0
## 42 R/viewSelection.R     viewSelection         13        13     0
## 43 R/viewSelection.R  getFromSysframes         24        24     6
## 44 R/viewSelection.R  getFromSysframes         25        25     3
## 45 R/viewSelection.R  getFromSysframes         26        26     3
## 46 R/viewSelection.R  getFromSysframes         28        28     3
## 47 R/viewSelection.R  getFromSysframes         29        29     3
## 48 R/viewSelection.R  getFromSysframes         30        30     3
## 49 R/viewSelection.R  getFromSysframes         31        31    92
## 50 R/viewSelection.R  getFromSysframes         32        32    92
## 51 R/viewSelection.R  getFromSysframes         33        33    92
## 52 R/viewSelection.R  getFromSysframes         34        34     2
## 53 R/viewSelection.R  getFromSysframes         37        37     1
## 54 R/viewSelection.R        viewObject         56        56     3
## 55 R/viewSelection.R        viewObject         57        57     3
## 56 R/viewSelection.R        viewObject         58        58     3
## 57 R/viewSelection.R        viewObject         61        61     0
## 58 R/viewSelection.R        viewObject         64        64     0
## 59 R/viewSelection.R        viewObject         65        65     0
## 60 R/viewSelection.R        viewObject         66        66     0
## 61 R/viewSelection.R        viewObject         69        69     0
## 62 R/viewSelection.R        viewObject         70        70     0
## 63 R/viewSelection.R        viewObject         71        71     0
## 64 R/viewSelection.R        viewObject         74        74     0
## 65 R/viewSelection.R        viewObject         76        76     0
## 66 R/viewSelection.R        viewObject         77        77     0
## 67 R/viewSelection.R        viewObject         79        79     0
## 68 R/viewSelection.R        viewObject         81        81     0
## 69 R/viewSelection.R        viewObject         82        82     0
## 70 R/viewSelection.R        viewObject         83        83     0
## 71 R/viewSelection.R        viewObject         88        88     0
## 72 R/viewSelection.R        viewObject         89        89     0
## 73 R/viewSelection.R        viewObject         91        91     0
## 74 R/viewSelection.R        viewObject         92        92     0
## 75 R/viewSelection.R        viewObject         93        93     0
## 76 R/viewSelection.R        viewObject         96        96     0

Calling covr::zero_coverage with a coverage object returned by package_coverage() will provide a data.frame with locations that have 0 test coverage. The nice thing about running it within RStudio is that it outputs the results on the Markers tab, where we can easily investigate:

zeroCov <- covr::zero_coverage(covr::package_coverage(pkgPath))
zero_coverage markers

Test coverage for RStudio addin functions

Investigating our code, let us focus on the results for the viewSelection.R, which has a very weak 34% test coverage. We can analyze exactly which lines have no test coverage in a specific file:

zeroCov[zeroCov$filename == "R/viewSelection.R", "line"]
##  [1]  7  8  9 10 11 12 13 61 64 65 66 69 70 71 74 76 77 79 81 82 83 88 89
## [24] 91 92 93 96

Looking at the code, we can see that the first chunk of lines (7:13) represents the viewSelection function, which just calls lapply and invisibly returns NULL.
The main weak spot however is the function viewObject, out of which we only test the early return in case of invalid chr argument provided. None of the other functionality is tested.

The reason behind this is that RStudio functionality is not available while the tests run, so we cannot test even the not-so-well designed return values: they are almost always preceded by a call to rstudioapi or other RStudio-related functionality, such as the object viewer, because that is what these functions are designed to do. This means we must restructure the code so that the RStudio-dependent functionality is contained to the necessary minimum, keeping the large majority of the code testable, and only call the side-effecting rstudioapi when actually executing the addin functionality itself.

Rewriting an addin function for better coverage

We will now show one potential way to solve this issue for the particular case of our viewObject function.

The idea behind the solution is to only return the arguments for the call to the RStudio API related functionality, instead of executing them in the function itself – hence the rename to getViewArgs.

This way we can test the function’s return value against the expected arguments and only execute them with do.call in the addin execution wrapper itself. A picture may be worth a thousand words, so here is the diff with relevant changes:

Refactoring for testability

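The gist of the change can be sketched like this (a simplified reconstruction for illustration, not the actual jhaddins diff; the real functions do more validation):

```r
# Before: hard to test, because View() talks to the RStudio IDE directly
viewObject <- function(chr) {
  obj <- get(chr, envir = globalenv())
  View(obj, title = chr)  # side effect fires inside the function
}

# After: getViewArgs() only *builds* the call, so its return value
# can be compared against expectations in a plain R session...
getViewArgs <- function(chr) {
  obj <- get(chr, envir = globalenv())
  list(what = "View", args = list(x = obj, title = chr))
}

# ...and the thin addin wrapper is the only place the side effect fires:
viewObject <- function(chr) {
  args <- getViewArgs(chr)
  do.call(args$what, args$args)
}
```

Everything except the final do.call() is now plain, deterministic R that testthat can exercise.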

Testing the rewritten function and gained coverage

Now that our return values are testable across the entire getViewArgs function, we can easily write tests to cover the entire function, a couple examples:

test_that("getViewArgs for function"
        , expect_equal(
            getViewArgs("reshape")
          , list(what = "View", args = list(x = reshape, title = "reshape"))
          )
        )
test_that("getViewArgs for data.frame"
        , expect_equal(
            getViewArgs("datasets::women")
          , list(what = "View",
                 args = list(x = data.frame(
                     height = c(58, 59, 60, 61, 62, 63, 64, 65,
                                66, 67, 68, 69, 70, 71, 72),
                     weight = c(115, 117, 120, 123, 126, 129, 132, 135,
                                139, 142, 146, 150, 154, 159, 164)
                     ),
                   title = "datasets::women"
                   )
            )
          )
        )

Looking at the test coverage provided after our changes, we can see that we are at more than 90% coverage for viewSelection.R:

# This looks much prettier in the R console ;)
covResults <- covr::package_coverage(pkgPath)
covResults
## jhaddins Coverage: 82.05%
## R/addRoxytag.R: 40.91%
## R/viewSelection.R: 90.57%
## R/makeCmd.R: 92.86%

And looking at the lines that are not covered in viewSelection.R, we can indeed see that the only uncovered lines left are those of the viewSelection function itself, which is responsible only for executing the addin:

covResults <- as.data.frame(covr::package_coverage(pkgPath))
covResults[covResults$filename == "R/viewSelection.R" & covResults$value == 0, ]
##             filename     functions first_line last_line value
## 39 R/viewSelection.R viewSelection         13        17     0
## 40 R/viewSelection.R viewSelection         15        15     0
## 41 R/viewSelection.R viewSelection         16        16     0

In the ideal world we would of course want to also automate the testing of our addin execution itself by examining if their effects in the RStudio IDE are as expected, however this is far beyond the scope of this post. For some of our addin functionality we can however even directly test the side-effects, such as when the addin should produce a file with certain content.

TL;DR – Just give me the package

References

Did you find the article helpful or interesting? Help others find it by sharing

To leave a comment for the author, please follow the link and comment on their blog: Jozef’s Rblog.


Source:: R News

MiKTeX Behind a Windows Firewall

By TeachR

RStudio: No pdflatex error message

(This article was first published on TeachR, and kindly contributed to R-bloggers)

I’ve always had problems with MiKTeX on my work computer.
I can install it just fine, or get IT to install it, but then the package manager doesn’t work because of our firewall.
You can set up a local repository to get around this problem, and I will show you how.
I’m just doing a basic setup here, just enough to compile the RStudio Rmd template to PDF.

“Why do I need MiKTeX?”, you might ask.
Well, because if you want to create a PDF from an RMarkdown file in RStudio it is required.
Otherwise you get this polite error message.

I downloaded the 32-bit basic version from https://MiKTeX.org/download.

MiKTeX Basic 32-bit Download

You’ll probably just want to pick the default options for the install EXCEPT for
the question that asks whether you want MiKTeX to install missing packages automatically.
I’m saying “no” to that because that is where the firewall gets in the way.

Say "no" to automatic package installation

MiKTeX will update your PATH system variable, but to get RStudio to recognize that it has changed you will need to restart it.
Trying to knit the RStudio Rmd template to PDF right after install gives a new error, that it cannot find the file “url.sty”.
Actually, if you had done the default install, MiKTeX would have tried to install the needed package automatically.
But, this would have just failed if the needed port is not open on your firewall.

One way to get the packages you need is to create a local package repo and install the packages “manually”.

I’ve created a github repo with a minimal set of LaTeX packages to get this working,
https://github.com/adamleerich/MiKTeX_pkgs.
So, you can clone it straight, download a ZIP, or fork and clone. Up to you.
I’m going to assume you know how to get these files from github and put them somewhere inside your firewall.
I’ve put mine in a folder called c:\home\git\Other\MiKTeX_pkgs.

Next, open the MiKTeX package manager on your computer.
It was just installed.
Select change package repository from the repository menu.
It will then ask you what kind of repo you want.

You want a local repo

On the next screen browse to the folder where you cloned the git repo, and point to the “repo-local” subdirectory.

You can send me an email if I've lost you

MiKTeX will take forever to synchronize the package database, but once it is done, you can select the packages you want installed.
Unfortunately, I have not been able to figure out how to create an index file that just has the downloaded packages in it.
So, the package manager will still list all of the packages available at the CTAN mirror.
I know, it sucks.
Then you can run the batch file scripts/100-mpm-install.bat as an admin and it will install the packages for you.

If you decide you want to add more packages to your local repo, just put the file name in the DOWNLOAD file
and re-run the script ./scripts/050-download-locals.bat and then ./scripts/100-mpm-install.bat.

Let me know if this helps or if it doesn’t work for you.

To leave a comment for the author, please follow the link and comment on their blog: TeachR.


Source:: R News

R.devices – Into the Void

By JottR on R

(This article was first published on JottR on R, and kindly contributed to R-bloggers)

R.devices 2.16.0 – Unified Handling of Graphics Devices – is on CRAN. With this release, you can now easily suppress unwanted graphics, e.g. graphics produced by one of those do-everything-in-one-call functions that we all bump into once in a while. To suppress graphics, the R.devices package provides graphics device nulldev(), and function suppressGraphics(), which both send any produced graphics into the void. This works on all operating systems, including Windows.

Guillaume Nery base jumping at Dean’s Blue Hole, filmed on breath hold by Julie Gautier

Examples

library(R.devices)
nulldev()
plot(1:100, main = "Some Ignored Graphics")
dev.off()
R.devices::suppressGraphics({
  plot(1:100, main = "Some Ignored Graphics")
})

Other Features

Some other reasons for using the R.devices package:

  • No need to call dev.off() – Did you ever forget to call dev.off(), or did a function call produce an error so that dev.off() was never reached, leaving a graphics device open? By using one of the toPDF(), toPNG(), … functions, or the more general devEval() function, dev.off() is automatically taken care of.

  • No need to specify filename extension – Did you ever switch from using png() to, say, pdf(), and forgot to update the filename resulting in a my_plot.png file that is actually a PDF file? By using one of the toPDF(), toPNG(), … functions, or the more general devEval() function, filename extensions are automatically taken care of – just specify the part without the extension.

  • Specify the aspect ratio – rather than having to manually calculate device-specific arguments width or height, e.g. toPNG("my_plot", { plot(1:10) }, aspectRatio = 2/3). This is particularly useful when switching between device types, or when outputting to multiple ones at the same time.

  • Unified API for graphics options – conveniently set (most) graphics options including those that can otherwise only be controlled via arguments, e.g. devOptions("png", width = 1024).

  • Control where figure files are saved – the default is folder figures/ but can be set per device type or globally, e.g. devOptions("*", path = "figures/col/").

  • Easily produce EPS and favicons – toEPS() and toFavicon() are friendly wrappers for producing EPS and favicon graphics.

  • Capture and replay graphics – for instance, use future::plan(remote, workers = "remote.server.org"); p %<-% capturePlot({ plot(1:10) }) to produce graphics on a remote machine, and then display it locally by printing p.

Some more examples

R.devices::toPDF("my_plot", {
  plot(1:100, main = "Amazing Graphics")
})
### [1] "figures/my_plot.pdf"
R.devices::toPNG("my_plot", {
  plot(1:100, main = "Amazing Graphics")
})
### [1] "figures/my_plot.png"
R.devices::toEPS("my_plot", {
  plot(1:100, main = "Amazing Graphics")
})
### [1] "figures/my_plot.eps"
R.devices::devEval(c("png", "pdf", "eps"), name = "my_plot", {
  plot(1:100, main = "Amazing Graphics")
}, aspectRatio = 1.3)
### $png
### [1] "figures/my_plot.png"
### 
### $pdf
### [1] "figures/my_plot.pdf"
### 
### $eps
### [1] "figures/my_plot.eps"

Links

See also

To leave a comment for the author, please follow the link and comment on their blog: JottR on R.


Source:: R News

Regional population structures at a glance

By Ilya Kashnitsky

fig0

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

I am happy to announce that our paper is published today in The Lancet.

Kashnitsky, I., & Schöley, J. (2018). Regional population structures at a glance. The Lancet, 392(10143), 209–210. https://doi.org/10.1016/S0140-6736(18)31194-2

At a glance

Demographic history of a population is imprinted in its age structure. A meaningful representation of regional population age structures can tell numerous demographic stories – at a glance. To produce such a snapshot of regional populations, we use an innovative approach of ternary colour coding.

Here is the map:

We let the data speak colours

With ternary colour coding, each element of a three-dimensional array of compositional data is represented with a unique colour. The resulting colours show direction and magnitude of deviations from the centrepoint, which represents the average age of the European population, and is dark grey. The hue component of a colour encodes the direction of deviation: yellow indicates an elderly population (>65 years), cyan indicates people of working age (15–64 years), and magenta indicates children (0–14 years).
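To make the idea concrete, here is a deliberately simplified base-R sketch of the principle. It is not how tricolore works internally (the package operates in a perceptually sensible colour space and centres colours on the European average); this toy version merely maps each share of a composition to an RGB channel:

```r
# Toy ternary colour coding: a composition (children, working age, elderly)
# summing to 1 is mapped to a single hex colour.
composition_to_colour <- function(p) {
  stopifnot(length(p) == 3, abs(sum(p) - 1) < 1e-8)
  # The paper uses magenta/cyan/yellow hues; here each share simply
  # drives one of the R, G, B channels.
  grDevices::rgb(p[1], p[2], p[3])
}

composition_to_colour(c(0.15, 0.65, 0.20))  # a dominantly working-age colour
```

Regions with similar age structures thus receive similar colours, which is what makes the map readable at a glance.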

The method is very flexible, and one can easily produce these meaningful colours using our R package tricolore. Just explore the capabilities of the package in a built-in shiny app using the following lines of code:

install.packages("tricolore")
library(tricolore)
DemoTricolore()

Replication materials at github

Follow us on Twitter: @ikashnitsky, @jschoeley.

SEE ALSO

To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky.


Source:: R News

A hex sticker wall, created with R

By David Smith


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Bill Venables, member of the R Foundation, co-author of the Introduction to R manual, R package developer, and one of the smartest and nicest (a rare combo!) people you will ever meet, received some gifts from the useR!2018 conference committee in recognition of his longtime service to the R community. He received a book of reminiscences from those whose lives he has touched, and this blanket:

This morning we took care of some unfinished #user2018 business. @visnut @tslumley @_jessie_roberts pic.twitter.com/gW0wX2PLF2

— Miles McBain (@MilesMcBain) July 19, 2018

The design on the blanket is the “Hex Sticker Wall” that was on display during the conference, created by Mitchell O’Hara-Wild:

The #useR2018 #hexwall has been revealed! Read about how it was created in #rstats on this blog post: https://t.co/krYYOQ3N84

A huge thanks to everyone who has submitted stickers and provided feedback. I hope you enjoy the end result as much as I have had creating it! 🎉pic.twitter.com/GnG9m2cZme

— Mitchell O’Hara-Wild (@mitchoharawild) July 11, 2018

The design for the wall was created — naturally — using R. Mitchell created an R script that will arrange a folder of hexagon images according to a specified layout. He then used a map of Australia to lay out the hex sticker coordinates within the map boundaries and plotted that to create the Hexwall. If you have a similar collection of hex images you could use the same technique to arrange them into any shape you like. The details are provided in the blog post linked below.

Mitchell O’Hara-Wild: UseR! 2018 Feature Wall

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

A new ‘boto3’ Amazon Athena client wrapper with dplyr async query support

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

A previous post explored how to deal with Amazon Athena queries asynchronously. The function presented there is a beast, though that is on purpose (to provide options for folks).

In reality, nobody really wants to use rJava wrappers much anymore and dealing with icky Python library calls directly just feels wrong, plus Python functions often return truly daft/ugly data structures. R users deserve better than that.

R is not a first-class citizen in Amazon-land and while the cloudyr project does a fine job building native-R packages for various Amazon services, the fact remains that the official Amazon SDKs are in other languages. The reticulate package provides an elegant interface to Python so it seemed to make sense to go ahead and wrap the boto3 Athena client into something more R-like and toss in the collect_async() function for good measure.

Dependencies

I forced a dependency on Python 3.5 because friends don’t let friends rely on dated, fragmented ecosystems. Python versions can gracefully (mostly) coexist so there should be no pain/angst associated with keeping an updated Python 3 environment around. As noted in the package, I highly recommend adding RETICULATE_PYTHON=/usr/local/bin/python3 to your R environment (~/.Renviron is a good place to store it) since it will help reticulate find the proper target.

If boto3 is not installed, you will need to do pip3 install boto3 to ensure you have the necessary Python module available and associated with your Python 3 installation.

It may seem obvious, but an Amazon AWS account is also required and you should be familiar with the Athena service and AWS services in general. Most of the roto.athena functions have a set of optional parameters:

  • aws_access_key_id
  • aws_secret_access_key
  • aws_session_token
  • region_name
  • profile_name

Ideally, these should be set up in the proper configuration files and you should let boto3 handle the details of retrieving them. One parameter you will see used in many of my examples is profile_name = "personal". I have numerous AWS accounts and manage them via the profile ids. By ensuring the AWS configuration files are thoroughly populated, I avoid the need to load and pass around the various keys and/or tokens most AWS SDK API calls require. You can read more about profile management in the official docs: 1, 2.
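For reference, a named profile lives in the standard AWS configuration files; something along these lines (all values are placeholders) is all boto3 needs to resolve profile_name = "personal":

```ini
# ~/.aws/credentials
[personal]
aws_access_key_id = AKIA................
aws_secret_access_key = ........................................

# ~/.aws/config
[profile personal]
region = us-east-1
```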

Usage

The project README and package manual pages are populated and have a smattering of usage examples. It is likely you will really just want to execute a manually prepared SQL query and retrieve the results or do the dplyr dance and collect the results asynchronously. We’ll cover both of those use-cases now, starting with a manual SQL query.

If you have not deleted it, your Athena instance comes with a sampledb that contains an elb_logs table. We’ll use that for our example queries. First, let’s get the packages we’ll be using out of the way:

library(DBI) # for dplyr access later
library(odbc) # for dplyr access later
library(roto.athena) # hrbrmstr/roto.athena on gh or gl
library(tidyverse) # b/c it rocks

Now, we’ll prepare and execute the query. This is a super-simple one:

query <- "SELECT COUNT(requestip) AS ct FROM elb_logs"

start_query_execution(
  query = query,
  database = "sampledb",
  output_location = "s3://aws-athena-query-results-redacted",
  profile = "personal"
) -> qex_id

The qex_id contains the query execution id. We can pass that along to get information on the status of the query:

get_query_execution(qex_id, profile = "personal") %>%
  glimpse()
## Observations: 1
## Variables: 10
## $ query_execution_id   "7f8d8bd6-9fe6-4a26-a021-ee10470c1048"
## $ query                "SELECT COUNT(requestip) AS ct FROM elb_logs"
## $ output_location      "s3://aws-athena-query-results-redacted/7f...
## $ database             "sampledb"
## $ state                "RUNNING"
## $ state_change_reason  NA
## $ submitted            "2018-07-20 11:06:06.468000-04:00"
## $ completed            NA
## $ execution_time_ms    NA
## $ bytes_scanned        NA

If the state is not SUCCEEDED then you’ll need to be patient before trying to retrieve the results.
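One way to be patient programmatically is a small polling loop; a minimal sketch (the two-second interval and the set of terminal states are my assumptions, and it needs a live Athena query to run against):

```r
# Poll until the query reaches a terminal state
repeat {
  state <- get_query_execution(qex_id, profile = "personal")$state
  if (state %in% c("SUCCEEDED", "FAILED", "CANCELLED")) break
  Sys.sleep(2)  # avoid hammering the Athena API
}
```

After the loop exits you can check whether `state` is `SUCCEEDED` before calling get_query_results().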

get_query_results(qex_id, profile = "personal")
## # A tibble: 1 x 1
##   ct             
##   <int>
## 1 4229

Now, we’ll use dplyr via the Athena ODBC driver:

DBI::dbConnect(
  odbc::odbc(),
  driver = "/Library/simba/athenaodbc/lib/libathenaodbc_sbu.dylib",
  Schema = "sampledb",
  AwsRegion = "us-east-1",
  AwsProfile = "personal",
  AuthenticationType = "IAM Profile",
  S3OutputLocation = "s3://aws-athena-query-results-redacted"
) -> con

elb_logs <- tbl(con, "elb_logs")

I've got the ODBC DBI fragment in a parameterized RStudio snippet, and others may find that a time-saver if you're not doing that already.
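For anyone not using snippets yet, such a snippet might look like the following (added via Tools → Global Options → Code → Edit Snippets; the snippet name and placeholder fields are mine, and snippet bodies must be tab-indented):

```
snippet athena
	DBI::dbConnect(
	  odbc::odbc(),
	  driver = "${1:driver_path}",
	  Schema = "${2:schema}",
	  AwsRegion = "${3:region}",
	  AwsProfile = "${4:profile}",
	  AuthenticationType = "IAM Profile",
	  S3OutputLocation = "${5:s3_output_location}"
	) -> con
```

Typing the snippet name in the editor and pressing Tab then expands the whole connection block with tab-stops for each field.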

Now to build and submit the query:

mutate(elb_logs, tsday = substr(timestamp, 1, 10)) %>%
  filter(tsday == "2014-09-29") %>%
  select(requestip, requestprocessingtime) %>%
  collect_async(
    database = "sampledb",
    output_location = "s3://aws-athena-query-results-redacted",
    profile_name = "personal"
  ) -> qex_id

As noted in the previous blog post, collect_async() turns the dplyr chain into a SQL query, then fires off the whole thing to start_query_execution() for you and returns the query execution id:

get_query_execution(qex_id, profile = "personal") %>%
  glimpse()
## Observations: 1
## Variables: 10
## $ query_execution_id   "95bd158b-7790-42ba-aa83-e7436c3470fe"
## $ query                "SELECT "requestip", "requestprocessing...
## $ output_location      "s3://aws-athena-query-results-redacted/95...
## $ database             "sampledb"
## $ state                "RUNNING"
## $ state_change_reason  NA
## $ submitted            "2018-07-20 11:06:12.817000-04:00"
## $ completed            NA
## $ execution_time_ms    NA
## $ bytes_scanned        NA

Again, you’ll need to be patient and wait for the state to be SUCCEEDED to retrieve the results.

get_query_results(qex_id, profile = "personal")
## # A tibble: 774 x 2
##    requestip       requestprocessingtime
##    <chr>                           <dbl>
##  1 255.48.150.122              0.0000900
##  2 249.213.227.93              0.0000970
##  3 245.108.120.229             0.0000870
##  4 241.112.203.216             0.0000940
##  5 241.43.107.223              0.0000760
##  6 249.117.98.137              0.0000830
##  7 250.134.112.194             0.0000630
##  8 250.200.171.222             0.0000540
##  9 248.193.76.218              0.0000820
## 10 250.57.61.131               0.0000870
## # ... with 764 more rows

You can also use the query execution id to sync the resultant CSV from S3. Which one is more performant is definitely something you'll need to test, since it varies with AWS region, result set size, your network connection and other environmental factors. One benefit of using get_query_results() is that it uses the column types to set the data frame column types appropriately (I still need to set up a full test of all possible types, so not all are handled yet).
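A hypothetical sketch of the S3 route, using the aws.s3 and readr packages (neither is part of roto.athena, and read_csv will guess the column types rather than use Athena's):

```r
# Fetch the result CSV that Athena wrote to the output location
res <- get_query_execution(qex_id, profile = "personal")
aws.s3::save_object(object = res$output_location, file = "results.csv")
results <- readr::read_csv("results.csv")
```

Timing both approaches on your own result sets is the only reliable way to pick a winner.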

Kick the tyres

The package is up on both GitLab and GitHub and any and all feedback (i.e. Issues) or tweaks (i.e. PRs) are most welcome.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R trainings and data science talks in Stuttgart: save 10% with our summer special

By eoda GmbH

(This article was first published on eoda english R news – Der Datenanalyse Blog von eoda, and kindly contributed to R-bloggers)

Summer holidays are over and work is calling. We will get you back on track by bringing you our popular R trainings and data science talks to the South of Germany.

With more than 1,500 satisfied participants, eoda's R trainings are among the leading training courses for the programming language R in the German-speaking area. This time we bring our most popular R trainings "Introduction to R" and "Introduction to Data Mining with R" to Stuttgart. Furthermore, we offer the opportunity to benefit from practical recommendations given by data science specialists in interesting talks. Sign up now and get your ticket for the summer special price – available only for a limited time!

Save 10% on your registration fee for R trainings or data science talks with the promotional code DataScienceSummer until July 3rd.

What can you look forward to? Our program at a glance:

September 17 – 18 | Introduction to R
The introductory course deals with the basics and terminologies of the programming language R, giving practical exercises and tips to create a basis for independent working.

September 19 – 20 | Introduction to Data Mining with R
Our data mining course is an introduction to basic concepts, most popular data mining algorithms, validation techniques and other specializations.

Finally, our trainings are followed by informative get-togethers: data science talks held by experienced specialists from the field, who explain how to extract strategic knowledge from data and make optimum use of it. Whether you are a manager, business user or IT Administrator – our talks will show you proven approaches and give you impulses on how to benefit successfully from data science in your company.

September 21 – Noon | Data Science, Künstliche Intelligenz und Machine Learning: Wie hebe ich den Datenschatz?
Explaining the background and providing insights into cross-industry use cases: this talk highlights how analytics can become a decisive building block for your company.

September 21 – Afternoon | Data Science operativ: Wie schaffe ich die richtige IT-Infrastruktur für die Anforderungen im Big-Data-Zeitalter?
Gain valuable insights into technical requirements of data science, the most important building blocks of your IT landscape and a deep understanding for the right tool and technology decisions based on practical reference scenarios.

So, what are you waiting for? Sign up now, use the promotion code DataScienceSummer and get a 10% discount on your registration fee for the event of your choice.

Trainings and Talks will be held in German.
*(Bundle offers and already reduced prices are not included)

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news – Der Datenanalyse Blog von eoda.


Benchmarking Feature Selection Algorithms with Xy()

By André Bleier


(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

Feature Selection

Feature Selection is one of the most interesting fields in machine learning in my opinion. It is a boundary point of two different perspectives on machine learning – performance and inference. From a performance point of view, feature selection is typically used to increase the model performance or to reduce the complexity of the problem in order to optimize computational efficiency. From an inference perspective, it is important to extract variable importance to identify key drivers of a problem.

Many people argue that in the era of deep learning feature selection is not important anymore. As a method of representation learning, deep learning models can find important features of the input data on their own. Those features are basically nonlinear transformations of the input data space. However, not every problem is suited to be approached with neural nets (actually, many problems). In many practical ML applications feature selection plays a key role on the road to success.

At STATWORX we work on a lot of dynamic pricing projects, where feature selection is a key piece of the puzzle: the resulting models need to be interpretable, and component-wise Gradient Boosting is an excellent example of such a white-box machine.

Introducing Xy()

In this blog post I will try to challenge feature selection algorithms with my simulation function Xy(). The function is designed to simulate regression and classification problems with a maximum degree of freedom for the user. If you want to get a quick introduction to the function I recommend reading my previous blog post, where I have analyzed Ordinary Least Squares coefficients under different signal strengths.

To challenge the feature selection algorithms, we first need to simulate data. By default, Xy() simulates regression learning data with 1000 observations, two linear and two nonlinear features as well as five randomly generated variables, which are not used to generate the target. We will use these random variables as pitfalls for the feature selection algorithms. To make things a little spicier let’s see how we can generate ten linear, ten non-linear and three categorical variables along with 50 random variables.

# Simulating regression data 

# install the package
devtools::install_github("andrebleier/Xy")

# load the library
library(Xy)

fs_sim = Xy(n = 1000,
            numvars = c(10,10),
            catvars = c(3, 2),
            noisevars = 50,
            task = Xy_task(),
            nlfun = function(x) {x^2},
            interactions = 1,
            sig = c(1,4), 
            cor = c(0),
            weights = c(-10,10),
            intercept = TRUE,
            stn = 4)

print(fs_sim)
 Xy Simulation 
 	 | 
 	 | + task regression
 	 | + observations 1000
 	 | + interactions 1D
 	 | + signal to noise ratio 4
 	 | + effects 
 	   | - linear 10
 	   | - nonlinear 10
 	   | - categorical 3
 	   | - noise 50
 	 | + intervals 
 	   | - correlation 0
 	   | - weights [-10,10]
 	   | - sd [1,4]


Target generating process: 
y = -0.218 + 5.4NLIN_01 - 9.48NLIN_02 + 7NLIN_03 - 7.59NLIN_04 - 7.43NLIN_05 - 0.39NLIN_06 - 7.72NLIN_07 + 0.02NLIN_08 + 9.05NLIN_09 + 7.75NLIN_10 + 8.76LIN_01 + 7.7LIN_02 - 0.77LIN_03 + 8.16LIN_04 + 1.06LIN_05 + 4.46LIN_06 - 4.01LIN_07 + 3.4LIN_08 - 7.94LIN_09 - 8.65LIN_10 - 3.51DUMMY_01__2 - 10.04DUMMY_02__2 - 1.63DUMMY_03__2 + e ~ N(0,99.28)

You can extract the feature importance with the varimp() function. Furthermore, there is a build-in plotting argument, that allows for visualization. I decided to use Boxplots rather than a plain barchart as it visualizes a local and global view on the importance.

# Feature Importance
fs_varimp <- varimp(fs_sim, plot = TRUE)

[Figure: Xy() variable importance boxplots]

Benchmarking Framework

Ok, we saw how we can generate data, but how about the algorithms? We are going to use four different classical machine learning algorithms.

  1. Gradient Boosting with library(xgboost)
  2. Component-wise Gradient Boosting with library(mboost)
  3. Ridge Regression with library(foba)
  4. Random Forest with library(rpart)

Two gradient boosting implementations, a greedy ridge regression and a random forest. If you want to read up on the mechanics of the algorithms I strongly recommend the package documentations.
In the end we want to elect a winner, of course, so we have to define the rules. There is not much literature on this matter, so I created three measures that made sense to me. Why three, though? I think there are several angles to this problem. Sure, a mean absolute error (M1) between the real and estimated importance could be a first approach.

\[M1 = \frac{1}{n} \sum_{i=1}^{n} | \widehat{Imp} - Imp |\]

This gives us a feeling of how close an algorithm gets to the real importance. However, one could argue, that the deviation itself is not as interesting as the plain number of features it got right (M2).

\[M2 = \frac{p_{est}}{p_{real}} \cdot 100\]

where p_{est} is the number of true features selected by the algorithm and p_{real} is the total number of true features.
Another angle to this problem is falsely selected features. Remember, we simulated fifty noise variables in the example above. The last measure aims to capture the ratio of true to noise features among those selected by an algorithm.

\[M3 = \frac{p_{est} \cdot imp_{est}}{p_{true} \cdot imp_{est} + p_{noise} \cdot imp_{noise}}\]

where imp_{est} is the aggregated importance of all estimated true features, and p_{noise} is the number of noise variables selected by the algorithm, with summarized importance imp_{noise}.
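To make the three measures concrete, here is a toy sketch with made-up importance vectors (the variable names and numbers are illustrative only, not from the simulation; true features the algorithm missed count as zero estimated importance):

```r
true_imp <- c(LIN_01 = 0.5, LIN_02 = 0.3, NLIN_01 = 0.2)   # simulated truth
est_imp  <- c(LIN_01 = 0.4, LIN_02 = 0.1, NOISE_07 = 0.1)  # algorithm output

# M1: mean absolute error between estimated and true importance
est_on_true <- est_imp[names(true_imp)]
est_on_true[is.na(est_on_true)] <- 0
m1 <- mean(abs(est_on_true - true_imp))                    # 0.167

# M2: percentage of true features the algorithm selected
m2 <- 100 * sum(names(est_imp) %in% names(true_imp)) / length(true_imp)  # 66.7

# M3: share of the selected importance mass sitting on true features
m3 <- sum(est_imp[names(est_imp) %in% names(true_imp)]) / sum(est_imp)   # 0.833
```

Here the algorithm recovers two of three true features but wastes some importance on a noise variable, which M3 penalizes.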

Each algorithm gets a total of 100 randomly drawn hyperparameter sets. The hyperparameter configuration with the lowest cross-validation error represents the feature importance of that particular algorithm. Furthermore, we repeat the data simulation 20 times to create a little more variation. Each time, the number of features, the number of noise variables and the polynomial degree of nonlinearity are drawn randomly. You can find the simulation code on my GitHub. The whole simulation took three days on eight cores, so you might want to change the parameters at the beginning of the script.

Evaluation

A further point to note is that the tree algorithms, gradient boosting (xgboost) and the random forest (rpart), can capture nonlinearities, while the ridge regression and the component-wise gradient boosting apply ordinary least squares approaches. Since we are simulating nonlinear as well as linear features, this might be an advantage for the tree algorithms. The results are visualized in the following figure.

[Figure: simulation results for M1–M3 across the four algorithms]

While the metrics do not speak in favor of one particular algorithm overall, there are several takeaways.

  • The greedy ridge regression (foba) tends to select all important features (M2), however, at the cost of selecting a relatively large amount of irrelevant variables (M3).
  • The component-wise gradient boosting algorithm (glmboost) selects over 78% of the true features on average (M2), while it can reassemble the true feature importance with the lowest mean absolute error (M1). Unlike the ridge regression it does not select a large amount of noise variables (M3).
  • The random forest (rpart) tends to be more restrictive on the amount of features it selects (M2). However, those which get selected seem to be true features (M3).
  • The gradient boosted trees (xgbDART), seem to be the winner for most research questions, since it has the best M2 and M3 scores. On average xgboost finds over 93% of the true features (M2), while selecting a relatively low amount of irrelevant features (M3). However, compared to the other algorithms it cannot really reassemble the true feature importance (M1).

Of course, real-world problems are by far more complex than these simulations. However, I think that at least you can extract some tendencies about real-world problems from this example. If you have any ideas about other metrics or angles I did not cover in this example or you have an idea to improve my function, please feel free to give me some feedback.

About the author

André Bleier

André is part of the Data Science team and organizes our internal afterwork events. In his spare time he does sports and enjoys writing R code, both small and large.

The post Benchmarking Feature Selection Algorithms with Xy() first appeared on STATWORX.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX.


The end of errors in ANOVA reporting

By Dominique Makowski


(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

Psychology is still (unfortunately) massively using analysis of variance (ANOVA). Despite its relative simplicity, I am very often confronted with errors in its reporting, for instance in students' theses or manuscripts. Beyond incomplete, incomprehensible or just plain wrong reporting, one can find a tremendous amount of genuine errors (that could influence the results and their interpretation), even in published papers! (See the excellent statcheck to quickly check the stats of a paper.) This error proneness can be at least partially explained by the fact that copy/pasting the (appropriate) values out of any statistical software and formatting them textually is a very annoying process.

How to end it?

We believe that this could be solved (at least partially) by the default implementation of current best practices of statistical reporting: a tool that automatically transforms a statistical result into copy/pastable text. Of course, this automation cannot be suitable for each and every advanced usage, but it would probably be satisfying for a substantial proportion of use cases. Implementing this unified, end-user oriented pipeline is the goal of the psycho package.

Fit an anova

Let's start by doing a traditional ANOVA with Adjusting (the ability to flexibly regulate one's emotions) as the dependent variable, and Sex and Salary as categorical predictors.

# devtools::install_github("neuropsychology/psycho.R")  # Install the latest psycho version
library(psycho)

df <- psycho::affective  # load a dataset available in the psycho package

aov_results <- aov(Adjusting ~ Sex * Salary, data=df)  # Fit the ANOVA
summary(aov_results)  # Inspect the results
             Df Sum Sq Mean Sq F value   Pr(>F)    
Sex           1   35.9   35.94  18.162 2.25e-05 ***
Salary        2    9.4    4.70   2.376   0.0936 .  
Sex:Salary    2    3.0    1.51   0.761   0.4674    
Residuals   859 1699.9    1.98                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
386 observations deleted due to missingness

APA formatted output

The psycho package includes a simple function, analyze(), that can be applied to an ANOVA object to format its content.

analyze(aov_results)
   - The effect of Sex is significant (F(1, 859) = 18.16, p < .001).
   - The effect of Salary is not significant (F(2, 859) = 2.38, p > .05).
   - The interaction between Sex and Salary is not significant (F(2, 859) = 0.76, p > .1) and can be considered as very small (Partial Omega-squared = 0).

It formats the results, computes the partial omega-squared as an index of effect size (better than eta², see Levine et al. 2002, Pierce et al. 2004) as well as its interpretation, and presents the results in an APA-compatible way.
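As a rough cross-check of that effect size, partial omega-squared can be approximated from the F statistic alone; a hedged sketch using values from the summary above (N = 865 complete observations after the 386 deleted for missingness; the approximation formula is my addition, not from psycho):

```r
# Approximation: omega_p^2 = (F - 1) * df_num / ((F - 1) * df_num + N)
omega_p2 <- function(f, df_num, n) ((f - 1) * df_num) / ((f - 1) * df_num + n)
omega_p2(18.162, 1, 865)  # effect of Sex: ~0.02
```

Hand-checking the automated output this way is a good habit before pasting it into a manuscript.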

Correlations, t-tests, regressions…

Note that the analyze() method also exists for other statistical procedures, such as correlations, t-tests and regressions.

Evolution

Of course, these reporting standards may change, depending on new expert recommendations or official guidelines. The goal of this package is to flexibly adapt to such changes and to evolving good practices. Therefore, if you have any advice or opinions, we encourage you to let us know by opening an issue, or even better, to implement them yourself by contributing to the code.

Credits

Did this package help you? Don't forget to cite the various packages you used.

You can cite psycho as follows:

  • Makowski, (2018). The psycho Package: An Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

On similar topics

To leave a comment for the author, please follow the link and comment on their blog: Dominique Makowski.
