Creating a Table of Monthly Returns With R and a Volatility Trading Interview

By Ilya Kipnis

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will cover two things: the first is a function to convert daily returns into a table of monthly returns, complete with drawdowns and annual returns. The second is an interview I had with David Lincoln (now on YouTube) to talk about the events of Feb. 5, 2018, and my philosophy on volatility trading.

So, to start off with, below is a function I wrote that's meant to mimic PerformanceAnalytics's table.CalendarReturns. What table.CalendarReturns is supposed to do is create a month-by-year table of monthly returns, with months across and years down. However, it never seemed to give me the output I was expecting, so I went and wrote another function.

Here’s the code for the function:

require(data.table)
require(PerformanceAnalytics)
require(scales)
require(Quandl)

# helper functions
pastePerc 
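The gist of it: aggregate the daily returns by month and year, then append each year's total return and maximum drawdown. A stripped-down sketch of that logic looks something like the following (a simplified stand-in rather than the full function, which also takes care of the percentage formatting, presumably via the pastePerc helper above, and of NA handling):

# Simplified sketch (not the full calendarReturnTable): month-by-year table of
# returns from an xts of daily returns, with Annual and DD (max drawdown) columns.
# Uses xts's apply.monthly/apply.yearly, which come along with PerformanceAnalytics.
calendarReturnSketch <- function(dailyRets, digits = 3) {
  monthly <- apply.monthly(dailyRets, Return.cumulative)
  yrs <- as.integer(format(index(monthly), "%Y"))
  mos <- as.integer(format(index(monthly), "%m"))
  out <- matrix(NA_real_, nrow = length(unique(yrs)), ncol = 14,
                dimnames = list(unique(yrs), c(month.abb, "Annual", "DD")))
  out[cbind(match(yrs, unique(yrs)), mos)] <- round(as.numeric(monthly), digits)
  out[, "Annual"] <- round(as.numeric(apply.yearly(dailyRets, Return.cumulative)), digits)
  out[, "DD"]     <- round(as.numeric(apply.yearly(dailyRets, maxDrawdown)), digits)
  as.data.frame(out)
}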

Here's how the output is produced, using daily SPY returns (the spyRets series).
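If you don't have a Quandl key handy, a substitute data step with quantmod's Yahoo data works just as well for illustration (this is a stand-in for the Quandl-based code I used, not the same data source):

# Substitute data step: daily SPY returns from adjusted closes via quantmod
library(quantmod)
getSymbols("SPY", from = "1993-01-01")
spyRets <- PerformanceAnalytics::Return.calculate(Ad(SPY))
calendarReturnTable(spyRets)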

And with percentage formatting:

calendarReturnTable(spyRets, percent = TRUE)
Using 'Value' as value column. Use 'value.var' to override
         Jan      Feb     Mar     Apr     May     Jun     Jul      Aug      Sep      Oct     Nov     Dec   Annual      DD
1993  0.000%   1.067%  2.241% -2.559%  2.697%  0.367% -0.486%   3.833%  -0.726%   1.973% -1.067%  1.224%   8.713%  4.674%
1994  3.488%  -2.916% -4.190%  1.121%  1.594% -2.288%  3.233%   3.812%  -2.521%   2.843% -3.982%  0.724%   0.402%  8.537%
1995  3.361%   4.081%  2.784%  2.962%  3.967%  2.021%  3.217%   0.445%   4.238%  -0.294%  4.448%  1.573%  38.046%  2.595%
1996  3.558%   0.319%  1.722%  1.087%  2.270%  0.878% -4.494%   1.926%   5.585%   3.233%  7.300% -2.381%  22.489%  7.629%
1997  6.179%   0.957% -4.414%  6.260%  6.321%  4.112%  7.926%  -5.180%   4.808%  -2.450%  3.870%  1.910%  33.478% 11.203%
1998  1.288%   6.929%  4.876%  1.279% -2.077%  4.259% -1.351% -14.118%   6.362%   8.108%  5.568%  6.541%  28.688% 19.030%
1999  3.523%  -3.207%  4.151%  3.797% -2.287%  5.538% -3.102%  -0.518%  -2.237%   6.408%  1.665%  5.709%  20.388% 11.699%
2000 -4.979%  -1.523%  9.690% -3.512% -1.572%  1.970% -1.570%   6.534%  -5.481%  -0.468% -7.465% -0.516%  -9.730% 17.120%
2001  4.446%  -9.539% -5.599%  8.544% -0.561% -2.383% -1.020%  -5.933%  -8.159%   1.302%  7.798%  0.562% -11.752% 28.808%
2002 -0.980%  -1.794%  3.324% -5.816% -0.593% -7.376% -7.882%   0.680% -10.485%   8.228%  6.168% -5.663% -21.588% 32.968%
2003 -2.459%  -1.348%  0.206%  8.461%  5.484%  1.066%  1.803%   2.063%  -1.089%   5.353%  1.092%  5.033%  28.176% 13.725%
2004  1.977%   1.357% -1.320% -1.892%  1.712%  1.849% -3.222%   0.244%   1.002%   1.288%  4.451%  3.015%  10.704%  7.526%
2005 -2.242%   2.090% -1.828% -1.874%  3.222%  0.150%  3.826%  -0.937%   0.800%  -2.365%  4.395% -0.190%   4.827%  6.956%
2006  2.401%   0.573%  1.650%  1.263% -3.012%  0.264%  0.448%   2.182%   2.699%   3.152%  1.989%  1.337%  15.847%  7.593%
2007  1.504%  -1.962%  1.160%  4.430%  3.392% -1.464% -3.131%   1.283%   3.870%   1.357% -3.873% -1.133%   5.136%  9.925%
2008 -6.046%  -2.584% -0.903%  4.766%  1.512% -8.350% -0.899%   1.545%  -9.437% -16.519% -6.961%  0.983% -36.807% 47.592%
2009 -8.211% -10.745%  8.348%  9.935%  5.845% -0.068%  7.461%   3.694%   3.545%  -1.923%  6.161%  1.907%  26.364% 27.132%
2010 -3.634%   3.119%  6.090%  1.547% -7.945% -5.175%  6.830%  -4.498%   8.955%   3.820%  0.000%  6.685%  15.057% 15.700%
2011  2.330%   3.474%  0.010%  2.896% -1.121% -1.688% -2.000%  -5.498%  -6.945%  10.915% -0.406%  1.044%   1.888% 18.609%
2012  4.637%   4.341%  3.216% -0.668% -6.006%  4.053%  1.183%   2.505%   2.535%  -1.820%  0.566%  0.900%  15.991%  9.687%
2013  5.119%   1.276%  3.798%  1.921%  2.361% -1.336%  5.168%  -2.999%   3.168%   4.631%  2.964%  2.589%  32.307%  5.552%
2014 -3.525%   4.552%  0.831%  0.695%  2.321%  2.064% -1.344%   3.946%  -1.379%   2.355%  2.747% -0.256%  13.462%  7.273%
2015 -2.963%   5.620% -1.574%  0.983%  1.286% -2.029%  2.259%  -6.095%  -2.543%   8.506%  0.366% -1.718%   1.252% 11.910%
2016 -4.979%  -0.083%  6.724%  0.394%  1.701%  0.350%  3.647%   0.120%   0.008%  -1.734%  3.684%  2.028%  12.001% 10.306%
2017  1.789%   3.929%  0.126%  0.993%  1.411%  0.637%  2.055%   0.292%   2.014%   2.356%  3.057%  1.209%  21.700%  2.609%
2018  5.636%  -3.118%      NA      NA      NA      NA      NA       NA       NA       NA      NA      NA   2.342% 10.102%

That covers it for the function. Now, onto volatility trading. Dodging the February short-volatility meltdown has, in my opinion, been one of the best out-of-sample validators for my volatility trading research. My subscriber numbers confirm it: I've received 12 new subscribers this month, as individuals interested in the volatility trading space have gained a newfound respect for the risk management that my system uses. After all, it's the down months that vindicate system traders like myself who do not employ leverage in the up times. Those interested in following my trades can subscribe here. Furthermore, I recently got the chance to speak with David Lincoln about my background, my philosophy on trading in general, and trading volatility in particular. Those interested can view the interview here.

Thanks for reading.

NOTE: I am currently interested in networking, full-time positions related to my skill set, and long-term consulting projects. Those interested in discussing professional opportunities can find me on LinkedIn after writing a note expressing their interest.

To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.


Effortless but Powerful Exception Logging in R: loggit! 1.0.0 Released on CRAN

By Ryan Price

(This article was first published on Another Blog About R, and kindly contributed to R-bloggers)
Frustrated with the lack of pain-free logging in R, a colleague inspired me to write my first public R package: loggit. What follows is the README that you can find on GitHub, and version 1.0.0 of loggit is now available on CRAN. Please feel free to submit requests, bug reports, etc.!
This package is designed for anyone who has written their own package or analysis functions that already use basic exception handling by way of base R’s message(), warning(), and stop(). The loggit package masks base R’s versions of these functions, but slots in a logging function (loggit()) just before the exception handler executes. Since these functions maintain identical functionality to their base equivalents, there is no need for any change in your existing code, if you have already written handlers. Just install & load/import loggit, tell R where to store your log file on startup, and instantly have your exceptions recorded.
The included loggit() function is also very flexible on its own, and allows for any number of custom fields to be added to your logs, in addition to maintaining the aforementioned functionality.
This package was inspired by a colleague’s suggestion of how we could log our ETL workflow results in R, as well as the rlogging package. When we looked at how rlogging was doing this, we felt that we could expand upon it and make it more flexible, system-portable, and with even less user effort. As opposed to other packages like log4r, futile.logger and logging, which may allow for more powerful logs in some cases, we wanted to make great logging “out of sight, out of mind”; at least until your boss asks for those ETL logs!

Features

R has a selection of built-in functions for handling different exceptions, or special cases where diagnostic messages are provided and/or function execution is halted because of an error. However, R itself provides nothing to record these diagnostics post hoc; useRs are left with what is printed to the console as their only means of analyzing the what-went-wrong of their code. There are some slightly hacky ways of capturing this console output, such as sinking to a text file, repetitively cat()ing identical exception messages that are passed to existing handler calls, etc. But there are two main issues with these approaches:
  1. The console output is not at all easy to parse, so that a user can quickly identify the causes of failure without manually scanning through it
  2. Even if the user tries to structure a text file output, they would likely have to ensure consistency in that output across all their work, and there is still the issue of parsing that text file into a familiar, usable format
Enter: JSON
JSON is a lightweight, portable (standardized) data format that is easy to read and write by both humans and machines. An excerpt from the introduction of the JSON link above:
“JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition – December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.”
Basically, you can think of JSON objects like you would think of lists in R: a set of key-value pairs, with the option to nest lists within other lists. This makes JSON not only as powerful as R’s lists, but it allows this nested-list functionality to be translated to other software, and let them handle them just as easily! As such, JSON is the default log file output type in the loggit package. You can manipulate & review JSON objects in R by way of the jsonlite package.
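As a tiny illustration of that list-to-JSON analogy (the values here are made up):

library(jsonlite)
x <- list(log_lvl = "INFO",
          log_msg = "Job finished",
          details = list(rows = 120, warnings = 0))
toJSON(x, auto_unbox = TRUE, pretty = TRUE)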

How to Use

Chances are that if you’ve read this far, you’re familiar with R’s exception handlers. If so, then there are only two other things that you need to know in order to use loggit:
  1. As per CRAN policies, a package cannot write to a user’s “home filespace” without approval. Therefore, you need to set the log file before any logs are written to disk, using setLogFile(logfile) (I recommend in your working directory, and naming it “loggit.json”). If you are using loggit in your own package, you can wrap this in a call to .onLoad(), so that logging is set on package load. If not, then make the set call as soon as possible (e.g. at the top of your script(s), right after your calls to library()); otherwise, no logs will be written to persistent storage!
  2. The logs are output to your log file each time an exception is raised, or when loggit() is called, and you can review it there. The fields supplied by default are:
  • timestamp: when the exception was raised
  • log_lvl: the “level” of the exception (INFO, WARN, ERROR, or custom)
  • log_msg: the diagnostic message generated by R
  • log_detail: additional exception details you can pass to the exception handlers, though not necessary to supply (will be filled with “” if not supplied).
That’s it!
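To make that concrete, a minimal setup might look like this (the file name, messages, and custom field are all just examples):

library(loggit)
setLogFile("loggit.json")  # set this before any exceptions fire

message("Starting the nightly ETL job")       # still prints, and is now also logged
warning("Row counts differ from yesterday")   # likewise logged before the warning is raised
loggit("INFO", "Load complete", n_rows = "14203")  # direct call with a custom field

# Inside your own package, the equivalent setup would be something like:
# .onLoad <- function(libname, pkgname) loggit::setLogFile("loggit.json")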

However, if you want more control, or care a bit more about the details:
While this package automatically masks the base handlers, the loggit function that they call is also exported for use, if so desired. This allows you to log arbitrary things as much or as little as you see fit. The loggit function is called as: loggit(log_lvl, log_msg, log_detail, ...). The argument you may not recognize is ..., which is supplied as named vectors (each of length one) providing any additional field names you would like to log:
loggit("INFO", "This is a message", but_maybe = "you want more fields?", sure = "why not?", like = 2, or = 10, what = "ever")
Since loggit() calls dplyr::bind_rows() behind the scenes, you can add as many fields of any type as you want without any append issues or errors. Note, however, that this means any future calls to loggit() with the same custom fields require that those fields be spelled the same as originally provided. If not, the misspelling will cause another field to be created, and (the data frame version of) your log file will look like this:
> loggit("INFO", "Result 1", scientist = "Ryan")
> loggit("INFO", "Result 2", scietist = "Thomas")

            timestamp log_lvl  log_msg log_detail scientist scietist
1 2018-02-13 18:01:32    INFO Result 1         ""      Ryan     <NA>
2 2018-02-13 18:02:32    INFO Result 2         ""      <NA>   Thomas
Also note that you do not always have to supply a custom field once it is created; thanks, JSON and dplyr!
> loggit("INFO", "Result 1", scientist = "Ryan")
> loggit("INFO", "Result 2", scientist = "Thomas")
> loggit("INFO", "Result 3")

            timestamp log_lvl  log_msg log_detail scientist
1 2018-02-13 18:01:32    INFO Result 1         ""      Ryan
2 2018-02-13 18:02:32    INFO Result 2         ""    Thomas
3 2018-02-13 18:03:32    INFO Result 3         ""      <NA>

Also, other control options:
  • You can control the output name & location of the log file using setLogFile(logfile, folder). loggit will not write to disk unless an exception is raised, or unless loggit() is called, but you should specify this change early, if desired. You can see the current log file path at package attachment, or by calling getLogFile().
  • If for any reason you do not want to log the output to a log file, you can set each handler’s .loggit argument to FALSE. This will eventually be a global option that the user can set, and leave the handlers without the argument. If using loggit in your own package, you can just importFrom(loggit, function) the handlers that you do want.
  • You can control the format of the timestamp in the logs; it defaults to ISO format "%Y-%m-%d %H:%M:%S", but you may set it yourself using setTimestampFormat(). Note that this format is passed to format.Date(), so the supplied format needs to be valid. (A short example combining these options follows this list.)
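Putting those options together (file, folder and format strings are just examples):

setLogFile("loggit.json", folder = "logs")  # where the JSON log will be written
getLogFile()                                # check the current log file path
setTimestampFormat("%Y/%m/%d %H:%M:%S")     # a format string handed to the date formatter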

Note on What Gets Logged

Note that this package does not mask the handler functions included in other packages, including in base R; for example, running the following will throw an error as usual, but will not write to the log file:
> 2 + "a"
Error in 2 + "a" : non-numeric argument to binary operator
> dplyr::left_join(data.frame(a = 1), data.frame(b = 2))
Error: `by` required, because the data sources have no common variables
> # Did loggit write these exception messages to the logfile?
> file.exists(loggit:::.config$logfile)
[1] FALSE
This is integral to how R works with packages, and is by design; it has to do with namespaces, which is how R looks for what to do when you ask it to do something. Basically, if a package you use doesn’t have loggit in its NAMESPACE file, then its internal exception handlers won’t be masked by loggit.
If you really wish to have all exception messages logged by loggit, please be patient, as this feature is in the works.

Installation

You can install the latest CRAN release of loggit via install.packages("loggit").
Or, to get the latest development version from GitHub —
devtools::install_github("ryapric/loggit")
Or, clone & build from source:
cd /path/to/your/repos
git clone https://github.com/ryapric/loggit.git loggit
R CMD INSTALL loggit
To use the most recent development version of loggit in your own package, you can include it in your Remotes: field in your DESCRIPTION file:
Remotes: github::ryapric/loggit
Note that packages being submitted to CRAN cannot have a Remotes field. Refer here for more info.

To leave a comment for the author, please follow the link and comment on their blog: Another Blog About R.


Speeding up spatial analyses by integrating `sf` and `data.table`: a test case

By Lorenzo Busetto

(This article was first published on Spatial Processing in R, and kindly contributed to R-bloggers)

The problem

Last week, I replied to this interesting question posted by @Tim_K over stackoverflow. He was seeking efficient solutions to identify all points falling within a maximum distance of xx meters with respect to each single point in a spatial points dataset.
If you have a look at the thread, you will see that a simple solution based on creating a “buffered” polygon dataset beforehand and then intersecting it with the original points is quite fast for “reasonably sized” datasets, thanks to sf spatial indexing capabilities which reduce the number of required comparisons (see http://r-spatial.org/r/2017/06/22/spatial-index.html). In practice, something like this:
# create test data: 50000 uniformly distributed points on a "square" of 100000
# metres
library(magrittr)  # for the %>% pipe
maxdist <- 500
pts <- data.frame(x = runif(50000, 0, 100000),
                  y = runif(50000, 0, 100000),
                  id = 1:50000) %>%
  sf::st_as_sf(coords = c("x", "y"))
# create buffered polygons
pts_buf <- sf::st_buffer(pts, maxdist)
# Find points within 500 meters wrt each point
int <- sf::st_intersects(pts_buf, pts)
int
## Sparse geometry binary predicate list of length 50000, where the predicate was `intersects'
## first 10 elements:
## 1: 1, 11046, 21668, 25417
## 2: 2, 8720, 12595, 23620, 26926, 27169, 39484
## 3: 3, 11782, 20058, 27869, 33151, 47864
## 4: 4, 35665, 45691
## 5: 5, 37093, 37989
## 6: 6, 31487
## 7: 7, 38433, 42009, 45597, 49806
## 8: 8, 12129, 31486
## 9: 9, 27840, 35577, 36797, 40906
## 10: 10, 15482, 16615, 26103, 41417
However, this starts to show problems over really large datasets, because the total number of comparisons to be done still increases rapidly despite the use of spatial indexes. A test done by varying the number of points in the above example in the range 25,000 – 475,000 shows, for example, this kind of behavior for two different values of maxdist (500 and 2000 m):

On the test dataset, the relationships are almost perfectly quadratic (due to the uniform distribution of points). Extrapolating them to the 12 million points dataset of the OP, we would get an execution time of about 14 hours for maxdist = 500, and a staggering 3.5 days for maxdist = 2000. Still doable, but not ideal…
My suggestion to the OP was therefore to “split” the points into chunks based on the x-coordinate and then work on a per-split basis, eventually assigning each chunk to a different core within a parallelized cycle.
In the end, I got curious and decided to give it a go to see what kind of performance improvement it was possible to obtain with that kind of approach. You can find results of some tests below.

A (possible) solution: Speeding up computation by combining data.table and sf::st_intersects

The idea here is to use a simple divide-and-conquer approach.
We first split the total spatial extent of the dataset in a certain number of regular quadrants. We then iterate over the quadrants and for each one we:
  1. Extract the points contained into the quadrant and apply a buffer to them;
  2. Extract the points contained in a slightly larger area, computed by expanding the quadrant by an amount equal to the maximum distance for which we want to identify the “neighbors”;
  3. Compute and save the intersection between the buffered points and the points contained in the “expanded” quadrant
“Graphically”, this translates to exploring the dataset like this:

Here, the points included in the current “quadrant” are shown in green, and the additional points needed to perform the analysis for that quadrant are shown in red.

Provided that the subsetting operations do not introduce an excessive overhead (i.e., they are fast enough…) this should provide a performance boost, because it should consistently reduce the total number of comparisons to be done.

Now, every “R” expert will tell you that if you need to perform fast subsetting over large datasets the way to go is to use properly indexed data.tables, which provide lightning-speed subsetting capabilities.
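As a tiny illustration of the kind of keyed subsetting the function below relies on (the column names and sizes here are made up for the example):

library(data.table)
dt <- data.table(x = runif(1e6), y = runif(1e6))
setkey(dt, x)                      # physically sorts dt by x and marks the key
sub <- dt[x >= 0.25 & x <= 0.26]   # range subset on the keyed column, as done in the function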
So, let’s see how we could code this in a function:
points_in_distance <- function(in_pts,
                               maxdist,
                               ncuts = 10) {

  require(data.table)
  require(sf)
  require(magrittr)  # for the %>% pipe used below

  # convert points to data.table and create a unique identifier
  pts <- data.table(in_pts)
  pts <- pts[, or_id := 1:dim(in_pts)[1]]

  # divide the extent in ncuts*ncuts quadrants and assign each
  # point to a quadrant, then create the index over "x" to speed-up
  # the subsetting
  range_x  <- range(pts$x)
  limits_x <- (range_x[1] + (0:ncuts)*(range_x[2] - range_x[1])/ncuts)
  range_y  <- range(pts$y)
  limits_y <- range_y[1] + (0:ncuts)*(range_y[2] - range_y[1])/ncuts
  pts[, `:=`(xcut = as.integer(cut(x, ncuts, labels = 1:ncuts)),
             ycut = as.integer(cut(y, ncuts, labels = 1:ncuts)))] %>%
    setkey(x)

  results <- list()
  count   <- 0
  # start cycling over quadrants
  for (cutx in seq_len(ncuts)) {

    # get the points included in a x-slice extended by `maxdist`, and build
    # an index over y to speed-up subsetting in the inner cycle
    min_x_comp <- ifelse(cutx == 1,
                         limits_x[cutx],
                         (limits_x[cutx] - maxdist))
    max_x_comp <- ifelse(cutx == ncuts,
                         limits_x[cutx + 1],
                         (limits_x[cutx + 1] + maxdist))
    subpts_x <- pts[x >= min_x_comp & x <= max_x_comp] %>%
      setkey(y)

    for (cuty in seq_len(ncuts)) {
      count <- count + 1

      # subset over subpts_x to find the final set of points needed for the
      # comparisons
      min_y_comp <- ifelse(cuty == 1,
                           limits_y[cuty],
                           (limits_y[cuty] - maxdist))
      max_y_comp <- ifelse(cuty == ncuts,
                           limits_y[cuty + 1],
                           (limits_y[cuty + 1] + maxdist))
      subpts_comp <- subpts_x[y >= min_y_comp & y <= max_y_comp]

      # subset over subpts_comp to get the points included in a x/y chunk,
      # whose "neighbours" we want to find. Then buffer them by maxdist.
      subpts_buf <- subpts_comp[ycut == cuty & xcut == cutx] %>%
        sf::st_as_sf() %>%
        sf::st_buffer(maxdist)

      # retransform to sf since data.tables lost the geometric attributes
      subpts_comp <- sf::st_as_sf(subpts_comp)

      # compute the intersection and save results in an element of "results".
      # For each point, save its "or_id" and the "or_ids" of the points within maxdist
      inters <- sf::st_intersects(subpts_buf, subpts_comp)

      # save results
      results[[count]] <- data.table(
        id      = subpts_buf$or_id,
        int_ids = lapply(inters, FUN = function(x) subpts_comp$or_id[x]))
    }
  }
  data.table::rbindlist(results)
}
The function takes as input an sf points object, a target distance and a number of “cuts” used to divide the extent into quadrants, and returns a data frame in which, for each original point, the “ids” of the points within maxdist are reported in the int_ids list column.
Now, let’s see if this works:
library(sf)
library(magrittr)  # for the %>% pipe
pts <- data.frame(x = runif(20000, 0, 100000),
                  y = runif(20000, 0, 100000),
                  id = 1:20000) %>%
  st_as_sf(coords = c("x", "y"), remove = FALSE)
maxdist <- 2000
out <- points_in_distance(pts, maxdist = maxdist, ncuts = 10)
out
##           id                              int_ids
## 1: 15119 2054,18031, 9802, 8524, 4107, 7412,
## 2: 14392 12213,10696, 6399,14392, 4610, 1200,
## 3: 14675 2054,18031, 9802, 8524, 4107, 7412,
## 4: 3089 12293,18031, 8524, 4107,12727, 2726,
## 5: 17282 9802,8524,4107,7412,2726,1275,
## ---
## 19995: 248 16610, 7643, 8059,15998,16680, 1348,
## 19996: 8433 16821,15638,16680, 3876,13851, 1348,
## 19997: 17770 11060, 7643, 8059,19868, 7776,10146,
## 19998: 11963 9948, 9136,15956,18512, 9219, 8925,
## 19999: 15750 5291,18093,14462,15362,12575, 5189,
# get a random point
sel_id  <- sample(pts$id, 1)
pt_sel  <- pts[sel_id, ]
pt_buff <- pt_sel %>% sf::st_buffer(maxdist)
# get ids of points within maxdist
id_inters <- unlist(out[id == sel_id, ]$int_ids)
pt_inters <- pts[id_inters, ]

# plot results
library(ggplot2)
ggplot(pt_buff) + theme_light() +
  geom_point(data = pts, aes(x = x, y = y), size = 1) +
  geom_sf(col = "blue", size = 1.2, fill = "transparent") +
  geom_sf(data = pt_inters, col = "red", size = 1.5) +
  geom_point(data = pt_sel, aes(x = x, y = y), size = 2, col = "green") +
  xlim(st_bbox(pt_buff)[1] - maxdist, st_bbox(pt_buff)[3] + maxdist) +
  ylim(st_bbox(pt_buff)[2] - maxdist, st_bbox(pt_buff)[4] + maxdist) +
  ggtitle(paste0("id = ", sel_id, " - Number of points within distance = ", length(id_inters)))

So far, so good. Now, let’s do the same exercise with a varying number of points to see how it behaves in terms of speed:

Already not bad! In particular for the maxdist = 2000 case, we get a quite large speed improvement!
However, a nice thing about the points_in_distance approach is that it is easily parallelizable. All that is needed is to change a few lines of the function so that the outer loop over the x “chunks” exploits a parallel backend of some kind. (You can find an example implementation exploiting foreach in this gist.)
On a not-particularly-fast PC, using a 6-cores parallelization leads to this:

Looking good! A more skilled programmer could probably squeeze even more speed out of it with some additional data.table magic, but the improvement is already very noticeable.
In terms of execution time, extrapolating again to the “infamous” 12 Million points dataset, this would be what we get:

Method                          Maxdist   Expected completion time (hours)
st_intersects                       500                               15.00
points_in_distance – serial         500                                2.50
points_in_distance – parallel       500                                0.57
st_intersects                      2000                               85.00
points_in_distance – serial        2000                               15.20
points_in_distance – parallel      2000                                3.18
So, we get a 5-6X speed improvement already with the “serial” implementation, and another 5X thanks to parallelization over 6 cores! In the maxdist = 2000 case, this means going from more than 3 days to about 3 hours. And if we had more cores and RAM to throw at it, it would finish in minutes!
Nice!

Final Notes

  • The timings shown here are merely indicative, and related to the particular test-dataset we built. On a less uniformly distributed dataset I would expect a lower speed improvement.
  • Some time is “wasted” because sf does not (yet) extend data.tables, making it necessary to recreate sf objects from the data.table subsets.
  • The parallel implementation is quick-and-dirty, and it is a bit of a memory hog! Be careful before throwing 25 processors at it!
  • Speed is influenced in a non-trivial way by the number of “cuts” used to subdivide the spatial extent. There may be a sweet spot, related to the points distribution and maxdist, that allows maximum speed to be reached.
  • A similar approach for parallelization could exploit repeatedly “cropping” the original sf points object over the extent of the chunk/extended chunk. The data.table approach, however, seems to be faster.

That’s all! Hope you liked this (rather long) post!

To leave a comment for the author, please follow the link and comment on their blog: Spatial Processing in R.


Deep Learning Image Classification with Keras and Shiny

By Jasmine Dumas' R Blog


(This article was first published on Jasmine Dumas’ R Blog, and kindly contributed to R-bloggers)

I have to admit my initial thoughts on deep learning were pessimistic, and in order not to succumb to impostor syndrome I put off learning any new techniques in this growing subfield of machine learning until recently. After attending & speaking at Data Day Texas and listening to Lukas Biewald’s keynote, Deep Learning in the Real World, I began to see through the complexities of deep learning and understand the real-world applications. My favorite example from the keynote was Coca-Cola deploying a deep learning model to easily capture under-the-cap promotional codes. I left the conference with some initial ideas about detecting deer in my backyard using a web cam and running an image classification algorithm as my first step into learning by doing.

For this image classification project I leveraged a pre-trained model from the R interface to Keras, which had previously been trained on a similar task. This enabled me to prototype something quickly and cheaply over a weekend and wrap the code as an interactive web app with a Shiny flexdashboard. Here is the link to the Shiny app, which enables you to upload an image and get back the top 3 predicted classes for that image: https://jasminedumas.shinyapps.io/image_clf/ and a preview of the app in action below.
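For anyone curious what the core of such a prototype looks like, here is a minimal sketch with the keras R package. This is not the app's actual code: the ResNet50 model and the file name are just stand-ins for whatever pre-trained model and uploaded image the app uses.

library(keras)

# Load a model pre-trained on ImageNet (stand-in for the app's model)
model <- application_resnet50(weights = "imagenet")

# Prepare an uploaded image (hypothetical file name) for the network
img <- image_load("uploaded_image.jpg", target_size = c(224, 224))
x <- image_to_array(img)
x <- array_reshape(x, c(1, dim(x)))
x <- imagenet_preprocess_input(x)

# Predict and keep the top 3 classes, as the app does
preds <- predict(model, x)
imagenet_decode_predictions(preds, top = 3)[[1]]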

To leave a comment for the author, please follow the link and comment on their blog: Jasmine Dumas’ R Blog.


BIKE SERVICES API + SHINY = NICE APP

By rdata.lu Blog | Data science with R

(This article was first published on rdata.lu Blog | Data science with R, and kindly contributed to R-bloggers)


Hi everyone,

In this blog post I will keep it short and introduce our Shiny application on bike self-service stations.

The code is in two parts: the ui.R file for the interface and the server.R file for the backend. You can check the code on GitHub if you want to download it and run it on your computer. JCDecaux provides an API that gives us real-time information on each bike self-service station. This information is:
  • Station id
  • Station name
  • Address
  • Position (latitude/longitude)
  • Presence of a payment terminal
  • Presence of a bonus station
  • Whether the station is open
  • How many bike stands are in the station
  • How many bike stands are available (no bikes on the stand)
  • How many bikes are available
  • Time of the last API update

The JCDecaux API gives the data under the following format:

{
  "number": 123,
  "contract_name" : "Paris",
  "name": "stations name",
  "address": "address of the station",
  "position": {
    "lat": 48.862993,
    "lng": 2.344294
  },
  "banking": true,
  "bonus": false,
  "status": "OPEN",
  "bike_stands": 20,
  "available_bike_stands": 15,
  "available_bikes": 5,
  "last_update": 
}

Our Shiny application thus gets real-time information on bike stations in 27 cities.
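For reference, here is roughly how that data can be pulled into R. This is a sketch, not the app's actual code: it assumes the standard JCDecaux self-service bike ("vls") endpoint, you need to request your own API key, and the contract name is just an example.

library(httr)
library(jsonlite)

api_key <- "YOUR_JCDECAUX_API_KEY"
resp <- GET("https://api.jcdecaux.com/vls/v1/stations",
            query = list(contract = "luxembourg", apiKey = api_key))
stations <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
head(stations[, c("name", "available_bikes", "available_bike_stands", "status")])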

We also used the modern open-source library leaflet to display the city map (OpenStreetMap).

This application works better on a computer than on a smartphone, because Shiny is not fully smartphone-friendly. Shiny does, however, provide a user-friendly interface.

If you want to try your own Shiny app, I advise you to check this gallery. It contains a lot of examples that will serve as a good introduction.

Don't hesitate to follow us on Twitter @rdata_lu and to subscribe to our YouTube channel.
You can also contact us if you have any comments or suggestions. See you for the next post!

To leave a comment for the author, please follow the link and comment on their blog: rdata.lu Blog | Data science with R.


webmockr: mock HTTP requests

By rOpenSci – open tools for open science

(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)

webmockr

webmockr is an R library for stubbing and setting expectations on HTTP requests.
It is a port of the Ruby gem webmock.

webmockr works by plugging in to another R package that does HTTP requests. It only works with crul right now, but we plan to add support for curl and httr later.

webmockr has the following high level features:

  • Stubbing HTTP requests at the low HTTP client level
  • Setting expectations on HTTP requests
  • Matching requests based on any combination of HTTP method (e.g., GET/POST/PUT), URI (i.e., URL), request headers and body
  • Will soon integrate with vcr so that you can cache real HTTP responses – and easily integrate with testthat

webmockr has been on CRAN for a little while now, but I’ve recently made some improvements and am nearing another
CRAN release, which is also preparation for a first release of the related vcr package. The following is a
rundown of major features and how to use the package, including a bit on testing at the end.

Similar art in other languages

webmockr was most closely inspired by Ruby’s webmock, but there are others out there, e.g., HTTPretty and pook for Python, from which we’ll steal ideas as time allows.

Installation

webmockr is on CRAN, so you can install that version

install.packages("webmockr")

But I’ve been making some improvements, so you’ll probably want the dev version:

Install some dependencies

if (!requireNamespace("remotes", quietly=TRUE)) install.packages("remotes")
remotes::install_github(c("ropensci/vcr@webmockit", "ropenscilabs/fauxpas"))

Install webmockr

remotes::install_github("ropensci/webmockr")

Load webmockr

library(webmockr)

Using webmockr

After loading webmockr

library(webmockr)

webmockr is loaded but not “turned on”. At this point webmockr doesn’t
change anything about your HTTP requests.

Once you turn on webmockr:

webmockr::enable()

webmockr will now by default NOT allow real HTTP requests from the http
libraries that adapters are loaded for (right now only crul).

Next, you’ll want to think about stubbing requests

Stubbing requests

Stubbing requests simply refers to the act of saying “I want all HTTP
requests that match this pattern to return this thing”. The “thing”
can be made up of a lot of different things.

First, load crul and enable webmockr

library(crul)
webmockr::enable()

You can stub requests based on HTTP method and uri

stub_request("get", "https://httpbin.org/get")
#>  
#>   method: get
#>   uri: https://httpbin.org/get
#>   with: 
#>     query: 
#>     body: 
#>     request_headers: 
#>   to_return: 
#>     status: 
#>     body: 
#>     response_headers: 
#>   should_timeout: FALSE
#>   should_raise: FALSE

The returned object is of class StubbedRequest, with attributes and
accessors – which you usually won’t use yourself – although we use the
object internally within webmockr. The object has a print method that
summarizes your stub, including the HTTP method, the URI, if you set
any patterns for query parameters, the query body, or request headers,
and whether you set any expectations on the response, including
status code, body or response headers. Last, there are booleans for whether
you set a timeout or raise expectation (more on that below).

You can check the stub registry at any time to see what stubs
you have:

stub_registry()
#>  
#>  Registered Stubs
#>    get: https://httpbin.org/get

protip: you can clear all stubs at any time by running stub_registry_clear()

Once we have the stub, we can then do a request with crul

x <- HttpClient$new(url = "https://httpbin.org/get")$get()
x
#>   url: https://httpbin.org/get
#>   request_headers: 
#>     User-Agent: libcurl/7.54.0 r-curl/3.1 crul/0.5.1.9210
#>     Accept-Encoding: gzip, deflate
#>     Accept: application/json, text/xml, application/xml, */*
#>   response_headers: 
#>   status: 200

A response is returned that is of the same structure as a normal
crul response, but no actual HTTP request was performed. The
response object is a bit different from a real HTTP response in that
we don’t have response headers, but you can set an expectation of
response headers, and even use real response headers you’ve
retrieved in real HTTP requests if you like.

Once we’ve made a request we can take a peek into the request
registry:

request_registry()
#>  
#>   Registered Requests
#>   GET https://httpbin.org/get   
#>      with headers {
#>        User-Agent: libcurl/7.54.0 r-curl/3.1 crul/0.5.0, 
#>        Accept-Encoding: gzip, deflate, 
#>        Accept: application/json, text/xml, application/xml, */*
#>      } was made 1 times

(the printing behavior above is manual – if you run this you’ll see one
line for each request – will make some prettier printing behavior later)

We can see that a request to https://httpbin.org/get was made with
certain request headers and was made 1 time. If we make that request
again the registry will then say 2 times. And so on.

You can also set the request headers that you want to match on.
If you do this the request has to match the HTTP method, the URI,
and the certain request headers you set.

stub_request("get", "https://httpbin.org/get") %>%
  wi_th(headers = list('User-Agent' = 'libcurl/7.51.0 r-curl/2.6 crul/0.3.6', 
                       'Accept-Encoding' = "gzip, deflate"))
#>  
#>   method: get
#>   uri: https://httpbin.org/get
#>   with: 
#>     query: 
#>     body: 
#>     request_headers: User-Agent=libcurl/7.51.0 r-curl/2.6 crul/0.3.6, Accept-Encoding=gzip, deflate
#>   to_return: 
#>     status: 
#>     body: 
#>     response_headers: 
#>   should_timeout: FALSE
#>   should_raise: FALSE

You can set the query parameters in a stub as well:

stub_request("get", "https://httpbin.org/get") %>%
  wi_th(
    query = list(hello = "world"))
#>  
#>   method: get
#>   uri: https://httpbin.org/get
#>   with: 
#>     query: hello=world
#>     body: 
#>     request_headers: 
#>   to_return: 
#>     status: 
#>     body: 
#>     response_headers: 
#>   should_timeout: FALSE
#>   should_raise: FALSE

Stubbing responses

Up until now the stubs we’ve created have only set what to match on, but
have not set what to return. With to_return you can set response headers,
response body, and response HTTP status code. You don’t match requests
on these three things, but rather they determine what’s returned in the
response.

Here, we’ll state that we want an HTTP status of 418 returned:

stub_request("get", "https://httpbin.org/get") %>%
    to_return(status = 418)
#>  
#>   method: get
#>   uri: https://httpbin.org/get
#>   with: 
#>     query: 
#>     body: 
#>     request_headers: 
#>   to_return: 
#>     status: 418
#>     body: 
#>     response_headers: 
#>   should_timeout: FALSE
#>   should_raise: FALSE

Stubbing HTTP exceptions

Sometimes it’s useful in your stubs to say that you expect a particular HTTP
error/exception.

webmockr has two functions for this:

  • to_raise: raise any HTTP exception. pass in any http exception from the
    fauxpas package
  • to_timeout: raise an HTTP timeout exception. This is its own function
    since timeout exceptions are rather common and one may want to use them often

Here, with to_raise we pass in an HTTP exception, in this case the HTTPBadRequest
exception:

library(fauxpas)
stub_request("get", "https://httpbin.org/get?a=b") %>% 
    to_raise(fauxpas::HTTPBadRequest)
#>  
#>   method: get
#>   uri: https://httpbin.org/get?a=b
#>   with: 
#>     query: 
#>     body: 
#>     request_headers: 
#>   to_return: 
#>     status: 
#>     body: 
#>     response_headers: 
#>   should_timeout: FALSE
#>   should_raise: HTTPBadRequest
x <- HttpClient$new(url = "https://httpbin.org/get")$get(query = list(a = "b"))
#> Error: Bad Request (HTTP 400).
#>  - The request could not be understood by the server due to malformed syntax. The client SHOULD NOT repeat the request without modifications.

With to_timeout a matched request to our stub will then return a timeout error:

stub_request("post", "https://httpbin.org/post") %>% 
    to_timeout()
#>  
#>   method: post
#>   uri: https://httpbin.org/post
#>   with: 
#>     query: 
#>     body: 
#>     request_headers: 
#>   to_return: 
#>     status: 
#>     body: 
#>     response_headers: 
#>   should_timeout: TRUE
#>   should_raise: FALSE
x <- HttpClient$new(url = "https://httpbin.org/post")$post()
#> Error: Request Timeout (HTTP 408).
#>  - The client did not produce a request within the time that the server was prepared to wait. The client MAY repeat the request without modifications at any later time.

Check out fauxpas for more information about HTTP exceptions.

Allowing real requests

You can always disable webmockr by using webmockr::disable(), which completely
disables mocking.

You can also allow some real requests while still using webmockr.

One way to allow some requests is to execute webmockr_allow_net_connect(), which
doesn’t disable webmockr completely – it allows real HTTP requests to be
made that have no stubs associated.

Another way is to disallow all requests except for certain URIs using
webmockr_disable_net_connect(). For example, if we run
webmockr_disable_net_connect("google.com") no real HTTP requests are allowed
other than those that match "google.com".

You can also allow only localhost HTTP requests with the allow_localhost parameter
in the webmockr_configure() function. Run webmockr_configure(allow_localhost = TRUE),
and then all localhost requests will be allowed while all non-localhost
requests will not be allowed. allow_localhost works for all the localhost variants:
localhost, 127.0.0.1, and 0.0.0.0.

You can check whether you are allowing real requests with
webmockr_net_connect_allowed(), and you can see your webmockr configuration with
webmockr_configuration().
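In code, those options look like this (all functions as described above; the URI is just an example):

webmockr::disable()                          # turn mocking off entirely
webmockr::enable()                           # ...and back on
webmockr_allow_net_connect()                 # allow real requests that have no stub
webmockr_disable_net_connect("google.com")   # only allow real requests matching google.com
webmockr_configure(allow_localhost = TRUE)   # always allow localhost requests
webmockr_net_connect_allowed()               # are real requests currently allowed?
webmockr_configuration()                     # inspect the current webmockr configuration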

Storing actual HTTP responses

webmockr doesn’t do that. Check out vcr for that.

Some have noted that in some cases storing actual HTTP responses can become too cumbersome (maybe too much disk space, or other reasons), and have reverted from using a tool like vcr that caches real responses to something like webmockr that only “stubs” fake responses.

There’s many options for testing a library that does HTTP requests:

  1. Do actual HTTP requests
  2. Stub HTTP requests with a tool like webmockr
  3. Cache real HTTP responses with a tool like vcr

webmockr for your test suite

You can use webmockr for your test suite right now. Here’s a quick example of using
webmockr with testthat

library(crul)
library(testthat)

Make a stub

stub_request("get", "https://httpbin.org/get") %>%
   to_return(body = "success!", status = 200)
#>  
#>   method: get
#>   uri: https://httpbin.org/get
#>   with: 
#>     query: 
#>     body: 
#>     request_headers: 
#>   to_return: 
#>     status: 200
#>     body: success!
#>     response_headers: 
#>   should_timeout: FALSE
#>   should_raise: FALSE

Check that it’s in the stub registry

stub_registry()
#>  
#>  Registered Stubs
#>    get: https://httpbin.org/get   | to_return:  with body "success!"  with status 200

Make the request

z <- HttpClient$new(url = "https://httpbin.org/get")$get()

Run tests (nothing returned means it passed)

expect_is(z, "HttpResponse")
expect_equal(z$status_code, 200)
expect_equal(z$parse("UTF-8"), "success!")

Todo

There are a number of things still to get done with webmockr

Feedback!

We’d love to get some eyes on this; to sort out problems that will no doubt arise from real world scenarios; to flesh out new use cases we hadn’t thought of, etc.

Open an issue: https://github.com/ropensci/webmockr/issues/new

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.


Blog about something you just learned

By Edwin Thoen

(This article was first published on That’s so Random, and kindly contributed to R-bloggers)

Great effort has recently been made to encourage the not-so-experienced, too, to jump into the water and blog about data science. Some of the community’s hot shots graciously draw attention to blogs of first-timers on Twitter to give them extra exposure. Most of the blogger newbies write about an analysis they did on a topic they care about. That sounds like the obvious thing to do. However, I think there is a second option that is not considered by many: writing about a topic you just learned. This might seem strange; why would you want to tell the world about something you are by no means an expert on? Won’t you just make a fool of yourself by pretending you know stuff you just picked up? I don’t think so. Let me give you four reasons why I think this is actually a good idea.

It forces you to study deeply

One of the challenges of learning new aspects of R, or data science in general, on your own is that there is no pressure to fully study something other than intrinsic motivation. Combined with a million other things to do in your life, this all too often results in picking up the new thing haphazardly: revisiting the topic time and time again and leaving yourself with the nagging feeling that you never finish it. By blogging about it, you suddenly have a great motivation to explore the topic inside out. When you write about it you want to make sure you have covered the most important material there is on the topic, and you will not rest until you grasp most of it. As a result you will have a more profound understanding and you will be a more satisfied learner.

You understand other learners

Because you only just learned it yourself, nobody has a better feel for the challenges of grasping the topic. A seasoned expert not only forgets the challenges she had when she was learning, she probably forgets it was a challenge in the first place. Here is David Robinson raising awareness of this fallacy:

When teaching, be careful not to mix up “I learned this a long time ago” with “This is simple”#rstats

— David Robinson (@drob) April 20, 2016

By reflecting on your journey of picking up the new thing, you are most likely a more thorough and compassionate teacher than the long time expert.

You draw attention of experts

You might think experts couldn’t care less about some newbie getting his feet wet. You are wrong. R community members care deeply about what others do and they provide feedback in a constructive way. Even, or maybe especially, when you are at the beginning of a path they have already completed. When I published a blog post on NSE, several experts got back to me within a day. They pointed out that I was putting way too much emphasis on a detail of NSE, which distorted the whole picture. As a result the blog post got better (I adjusted it) and my knowledge became more profound.

You train in being wrong

So, the NSE blog post I worked on for quite some time still had a couple of significant flaws. Experts pointed this out to me publicly (in the blog’s Disqus and on Twitter) and this was a great thing. The longer I work as a data scientist, the more I come to value that being wrong is a skill one should actively develop. Sounds strange? Let me explain. Whether in blogging or in daily work, one’s objective should be to bring as much insight as possible. This is not the same as being right all the time. The latter is the goal of your ego, and the ego has its own agenda. Typically, wanting to be always right leads to withdrawing and postponing, because you are still unsure of $x$ or you haven’t completely figured out $y$ yet. As a result you keep chewing on it, and meanwhile (collective) insight doesn’t grow. Of course you want to do your work thoroughly; I am not encouraging sloppiness or laziness here. Rather, when you have done your best, by all means get it out in the open and let others shoot at it. Your ego will get a few blows now and then, but that’s fine, it had better get used to it. Blogging in this way is perfect practice for publishing stuff you haven’t totally figured out yet and experiencing what it is like to be publicly wrong from time to time.

I hope you will consider writing blog posts about stuff you only just learned about. It is a great way to grow and I will certainly keep doing it!

To leave a comment for the author, please follow the link and comment on their blog: That’s so Random.


R Tip: Use seq_len() to Avoid The Backwards Sequence Bug

By John Mount

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Another R tip: use seq_len() to avoid the backwards sequence bug.

Many R users use the “colon sequence” notation to build sequences. For example:

for(i in 1:5) {
  print(paste(i, i*i))
}
#> [1] "1 1"
#> [1] "2 4"
#> [1] "3 9"
#> [1] "4 16"

However, the colon notation can be unsafe as it does not properly handle the empty sequence case:

n <- 0
1:n
#> [1] 1 0

Notice the above example built a reversed sequence, instead of an empty sequence. To avoid this use seq_len():

seq_len(5)
#> [1] 1 2 3 4 5

n <- 0
seq_len(n)
#> integer(0)

integer(0)” is a length zero sequence of integers (not a sequence containing the value zero).

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


DALEX: understand a black box model – conditional responses for a single variable

By smarterpoland


(This article was first published on SmarterPoland.pl » English, and kindly contributed to R-bloggers)

Black-box models, like random forest models or gradient boosting models, are commonly used in predictive modelling due to their flexibility and high accuracy. The problem is that it is hard to understand how a single variable affects model predictions.

As a remedy one can use excellent tools like the pdp package (Brandon Greenwell, pdp: An R Package for Constructing Partial Dependence Plots, The R Journal 9 (2017)) or the ALEPlot package (Apley, Dan. Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models (2016)).
Or one can now use the DALEX package to not only plot a conditional model response but also superimpose responses from different models, to better understand the differences between them.

Consult the following vignette to learn more about the DALEX package and explainers for a single variable.

Or, if you want to learn more about explainers, join our DALEX Invasion!
Find our DALEX workshops at SER (Warsaw, April 2018), ERUM (Budapest, May 2018), WhyR (Wroclaw, June 2018) or UseR (Brisbane, July 2018).

To leave a comment for the author, please follow the link and comment on their blog: SmarterPoland.pl » English.


Sketchnotes from TWiML&AI #111: Learning “Common Sense” and Physical Concepts with Roland Memisevic

By Dr. Shirin Glander


(This article was first published on Shirin’s playgRound, and kindly contributed to R-bloggers)

These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Learning “Common Sense” and Physical Concepts with Roland Memisevic:

Sketchnotes from TWiMLAI talk #111: Learning “Common Sense” and Physical Concepts with Roland Memisevic

You can listen to the podcast here.

In today’s episode, I’m joined by Roland Memisevic, co-founder, CEO, and chief scientist at Twenty Billion Neurons. Roland joined me at the RE•WORK Deep Learning Summit in Montreal to discuss the work his company is doing to train deep neural networks to understand physical actions. In our conversation, we dig into video analysis and understanding, including how data-rich video can help us develop what Roland calls comparative understanding, or AI “common sense”. We briefly touch on the implications of AI/ML systems having comparative understanding, and how Roland and his team are addressing problems like getting properly labeled training data. https://twimlai.com/twiml-talk-111-learning-common-sense-physical-concepts-roland-memisevic/

To leave a comment for the author, please follow the link and comment on their blog: Shirin’s playgRound.
