2017 rOpenSci ozunconf :: Reflections and the realtime Package

By rOpenSci – open tools for open science

(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)

This year’s rOpenSci ozunconf was held in Melbourne, bringing together over 45 R enthusiasts from around the country and beyond. As is customary, ideas for projects were discussed in GitHub Issues (41 of them by the time the unconf rolled around!) and there was no shortage of enthusiasm, interesting concepts, and varied experience.

I’ve been to a few unconfs now and I treasure the time I get to spend with new people, new ideas, new backgrounds, new approaches, and new insights. That’s not to take away from the time I get to spend with people I met at previous unconfs; I’ve gained great friendships and started collaborations on side projects with these wonderful people.

When the call for nominations came around this year it was an easy decision. I don’t have employer support to attend these things so I take time off work and pay my own way. This is my networking time, my development time, and my skill-building time. I wasn’t sure what sort of project I’d be interested in but I had no doubts something would come up that sounded interesting.

As it happened, I had been playing around with a bit of code, purely out of interest and hoping to learn how htmlwidgets work. The idea I had was to make a classic graphic equaliser visualisation like this

using R.

This presents several challenges: how can I get live audio into R, and how fast can I plot the signal? I had doubts about both parts, partly because of the way that R calls tie up the session (for now…) and partly because constructing a ggplot2 object is somewhat slow (in terms of raw audio speeds). I’d heard about htmlwidgets and thought there must be a way to leverage that towards my goal.
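For a sense of scale on that second point, a quick (and rough) timing of even a trivial plot makes the gap obvious:

# how long does one plot take, compared with raw audio rates?
library(ggplot2)
p <- ggplot(data.frame(x = 1:32, y = runif(32)), aes(x, y)) + geom_col()
system.time(print(p))   # typically around a tenth of a second
1 / 44100               # versus ~23 microseconds between samples at 44.1 kHz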

I searched for a graphic equaliser javascript library to work with and didn’t find much that aligned with what I had in my head. Eventually I stumbled on p5.js and its examples page which has an audio-input plot with a live demo. It’s a frequency spectrum, but I figured that’s just a bit of binning away from what I need. Running the example there looks like

This seemed to be worth a go. I managed to follow enough of this tutorial to have the library called from R. I modified the javascript canvas code to look a little more familiar, and the first iteration of geom_realtime() was born

This seemed like enough of an idea that I proposed it in the GitHub Issues for the unconf. It got a bit of attention, which was worrying, because I had no idea what to do with this next. Peter Hickey pointed out that Sean Kross had already wrapped some of the p5.js calls into R calls with his p5 package, so this seemed like a great place to start. It’s quite a clever way of doing it too; it involves re-writing the javascript which htmlwidgets calls on each time you want to do something.

Fast forward to the unconf and a decent number of people gathered around a little slip of paper with geom_realtime() written on it. I had to admit to everyone that the ggplot2 aspect of my demo was a sham (it’s surprisingly easy to draw a canvas in just the right shade of grey with white gridlines), but people stayed, and we got to work seeing what else we could do with the idea. We came up with some suggestions for input sources, some different plot types we might like to support, and set about trying to understand what Sean’s package actually did.

As it tends to work out, we had a great mix of people with different experience levels in different aspects of the project; some who knew how to make a package, some who knew how to work with javascript, some who knew how to work with websockets, some who knew about realtime data sources, and some who knew about nearly none of these things (✋ that would be me). If everyone knew every aspect about how to go about an unconf project I suspect the endeavor would be a bit boring. I love these events because I get to learn so much about so many different topics.

I shared my demo script and we deconstructed the pieces. We dug into the inner workings of the p5 package and started determining which parts we could siphon off to meet our own needs. One of the aspects that we wanted to figure out was how to simulate realtime data. This could be useful both for testing, and also in the situation where one might want to ‘re-cast’ some time-coded data. We were thankful that Jackson Kwok had done a deep dive into websockets, and pretty soon (surprisingly soon, perhaps; within the first day) we had examples of (albeit constructed) real-time (every 100ms) data streaming from a server and being plotted at speed

Best of all, running the plot code didn’t tie up the session; it uses a listener written into the javascript so it just waits for input on a particular port.
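For anyone wanting to try this at home, here’s a rough sketch (assuming the httpuv and later packages, not the code we actually used) of a websocket server that pushes a new value every 100ms for a listener on the plotting side to pick up:

library(httpuv)

app <- list(
  # minimal HTTP handler so the server also answers plain requests
  call = function(req) {
    list(status = 200L, headers = list("Content-Type" = "text/plain"), body = "ok")
  },
  # push a simulated data point down the websocket every 100ms
  onWSOpen = function(ws) {
    open <- TRUE
    ws$onClose(function() open <<- FALSE)
    push <- function() {
      if (!open) return(invisible(NULL))
      ws$send(sprintf('{"t": %.3f, "y": %.3f}', as.numeric(Sys.time()), rnorm(1)))
      later::later(push, 0.1)   # schedule the next point in 100ms
    }
    push()
  }
)

server <- startServer("127.0.0.1", 5006, app)
# the javascript listener connects to ws://127.0.0.1:5006 and redraws as messages arrive
# stopServer(server) when finished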

With the core goal well underway, people started branching out into aspects they found most interesting. We had some people work on finding and connecting actual data sources, such as the bitcoin exchange rate

and a live-stream of binary-encoded data from the Australian National University (ANU) Quantum Random Numbers Server

Others formalised the code so that it could be piped into different ‘themes’ while retaining the p5 structure for adding more components

These were still toy examples of course, but they highlight what’s possible. They were each constructed using an offshoot of the p5 package whereby the javascript is re-written to include various features each time the plot is generated.

Another route we took was to use the direct javascript binding API with factory functions. This had less flexibility in terms of adding modular components, but meant that the javascript could be modified without worrying so much about how it needed to interact with p5. This resulted in some outstanding features such as side-scrolling and date-time stamps. We also managed to pipe the data off to another thread for additional processing (in R) before it was sent to the plot.

The example we ended up with reads the live-feed of Twitter posts under a given hashtag, computes a sentiment analysis on the words with R, and live-plots the result:

Overall I was amazed at the progress we made over just two days. Starting from a silly idea/demo, we built a package which can plot realtime data, and can even serve up some data to be plotted. I have no expectations that this will be the way of the future, but it’s been a fantastic learning experience for me (and hopefully others too). It’s highlighted that there are ways to achieve realtime plots, even if we’ve used a library built for drawing rather than one built for plotting per se.

It’s even inspired offshoots in the form of some R packages: tRainspotting, which shows realtime data on New South Wales public transport using leaflet as the canvas

and jsReact, which explores the interaction between R and JavaScript

The possibilities are truly astounding. My list of ‘things to learn’ has grown significantly since the unconf, and projects are still starting up or continuing to develop. The ggeasy package isn’t related, but it was spawned from another unconf GitHub Issue idea. Again: ideas and collaborations starting and developing.

I had a great time at the unconf, and I can’t wait until the next one. My hand will be going up to help out, attend, and help start something new.

My thanks and congratulations go out to each of the realtime developers: Richard Beare, Jonathan Carroll, Kim Fitter, Charles Gray, Jeffrey O Hanson, Yan Holtz, Jackson Kwok, Miles McBain and the entire cohort of 2017 rOpenSci ozunconf attendees. In particular, my thanks go to the organisers of such a wonderful event; Nick Tierney, Rob Hyndman, Di Cook, and Miles McBain.


Updated curl package provides additional security for R on Windows

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

There are many R packages that connect to the internet, whether it’s to import data (readr), install packages from Github (devtools), connect with cloud services (AzureML), or many other web-connected tasks. There’s one R package in particular that provides the underlying connection between R and the Web: curl, by Jeroen Ooms, who is also the new maintainer for R for Windows. (The name comes from curl, a command-line utility and interface library for connecting to web-based services). The curl package provides replacements for the standard url and download.file functions in R with support for encryption, and the package was recently updated to enhance its security, particularly on Windows.

To implement secure communications, the curl package needs to connect with a library that handles the SSL (secure socket layer) encryption. On Linux and Macs, curl has always used the OpenSSL library, which is included on those systems. Windows doesn’t have this library (at least, outside of the Subsystem for Linux), so on Windows the curl package included the OpenSSL library and an associated certificate bundle. This raises its own set of issues (see the post linked below for details), so version 3.0 of the package instead uses the built-in winSSL library. This means curl uses the same security architecture as other connected applications on Windows.

This shouldn’t have any impact on your web connectivity from R now or in the future, beyond the knowledge that the underlying architecture is more secure. Nonetheless, it’s possible to switch back to OpenSSL-based encryption (and this remains the default on Windows 7, which does not include winSSL).
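If you want to check which backend your own build is using, the curl package can report it; a quick sketch (the exact version strings will vary by platform and build):

library(curl)

# which SSL/TLS backend is libcurl compiled against?
# (winSSL/Schannel on Windows with curl >= 3.0; OpenSSL on Linux and Mac)
curl_version()$ssl_version

# downloads over HTTPS work the same way regardless of the backend
curl_download("https://cran.r-project.org/", tempfile())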

Version 3.0 of the curl package is available now on CRAN (though you’ll likely never need to load it explicitly — packages that use it do that for you automatically). You can learn more about the changes at the link below. If you’d like to know more about what the curl package can do, this vignette is a great place to start. Many thanks to Jeroen Ooms for this package.

rOpenSci: Changes to Internet Connectivity in R on Windows


Twitter Outer Limits : Seeing How Far Have Folks Fallen Down The Slippery Slope to “280” with rtweet

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

By now, virtually every major media outlet has covered the “280 Apocalypse”™. For those still not “in the know”, Twitter recently moved the tweet character cap to 280 after a “successful” beta test (some of us have different ideas of what “success” looks like).

I had been on a hiatus from the platform for a while and planned to (and did) jump back into the fray today but wanted to see what my timeline looked like tweet-length-wise. It’s a simple endeavour: use rtweet to grab the timeline, count the characters per-tweet and look up the results. I posted the results of said process to — of course — Twitter and some folks asked me for the code.

Others used it and there were some discussions as to why timelines looked similar (distribution-wise) with not many Tweets going over 140 characters. One posit I had was that it might be due to client-side limitations since I noted that Twitter for macOS — a terrible client they haven’t updated in ages (but there really aren’t any good ones) — still caps tweets at 140 characters. Others, like Buffer on the web, do have support for 280, so I modified the code a bit to look at the distribution by client.

Rather than bore you with my own timeline analysis, and to help the results be a tad more reproducible (which was another discussion that spawned from the tweet-length tweet), here’s a bit of code that tries to grab the last 3,000 tweets with the #rstats hashtag and plots the distribution by Twitter client:

library(rtweet)
library(ggalt)
library(rprojroot)
library(hrbrthemes)
library(tidyverse)

rt <- find_rstudio_root_file()   # project root via rprojroot (handy if you cache the search results)

# grab the most recent #rstats tweets and compute the length of each one
rstats <- search_tweets("#rstats", n = 3000)
rstats <- mutate(rstats, tweet_length = nchar(text))

# keep only clients with enough tweets to estimate a density
count(rstats, source) %>%
  filter(n > 5) -> usable_sources  # need data for density + I wanted a nice grid 

# We want max tweet length & total # of tweets for sorting & labeling facets
filter(rstats, source %in% usable_sources$source) %>%
  group_by(source) %>%
  summarise(max=max(tweet_length), n=n()) %>%
  arrange(desc(max)) -> ordr

# four breaks per panel regardless of the scales (we're using free-y scales)
there_are_FOUR_breaks <- function(limits) seq(limits[1], limits[2], length.out = 4)

rstats %>%
  filter(source %in% usable_sources$source) %>%
  mutate(source = factor(source, levels=ordr$source,
                         labels=sprintf("%s (n=%s)", ordr$source, ordr$n))) %>%
  ggplot(aes(tweet_length)) +
  geom_bkde(aes(color=source, fill=source), bandwidth=5, alpha=2/3) +
  geom_vline(xintercept=140, linetype="dashed", size=0.25) +
  scale_x_comma(breaks=seq(0,280,70), limits=c(0,280)) +
  scale_y_continuous(breaks=there_are_FOUR_breaks, expand=c(0,0)) +
  facet_wrap(~source, scales="free", ncol=5) +
  labs(x="Tweet length", y="Density",
       title="Tweet length distributions by Twitter client (4.5 days #rstats)",
       subtitle="Twitter client facets in decreasing order of ones with >140 length tweets",
       caption="NOTE free Y axis scalesnBrought to you by #rstats, rtweet & ggalt") +
  theme_ipsum_rc(grid="XY", strip_text_face="bold", strip_text_size=8, axis_text_size=7) +
  theme(panel.spacing.x=unit(5, "pt")) +
  theme(panel.spacing.y=unit(5, "pt")) +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(axis.text.y=element_blank()) +
  theme(legend.position="none")

FIN

While the 140 barrier has definitely been broken, it has not been abused (yet) but the naive character counting is also not perfect since it looks like it doesn’t “count” the same way Twitter-proper does (image “attachments”, as an example, are counted as characters here but they aren’t counted that way in Twitter clients). Bots are also counted as Twitter clients.

It’ll be interesting to track this in a few months as folks start to inch-then-blaze past the former hard-limit.

Give the code (or use your timeline info) a go and post a link with your results! You can find an RStudio project directory over on GitHub 🔗.


normal variates in Metropolis step

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

A definitely puzzled participant on X validated, confusing the Normal variate or variable used in the random walk Metropolis-Hastings step with its Normal density… It took some cumulated efforts to point out the distinction. Especially as the originator of the question had a rather strong a priori about his or her background:

“I take issue with your assumption that advice on the Metropolis Algorithm is useless to me because of my ignorance of variates. I am currently taking an experimental course on Bayesian data inference and I’m enjoying it very much, i believe i have a relatively good understanding of the algorithm, but i was unclear about this specific.”

despite pondering the meaning of the call to rnorm(1)… I will keep this question in store to use in class when I teach Metropolis-Hastings in a couple of weeks.
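For the record, here is a minimal sketch of the random-walk step (with a Cauchy target for illustration): rnorm(1) produces the Normal variate that drives the proposal, while the only densities in the acceptance ratio are those of the target, the symmetric Normal proposal density cancelling out.

# random-walk Metropolis targeting a standard Cauchy, with a Normal proposal
metropolis_step <- function(x, sd_prop = 1) {
  y <- x + sd_prop * rnorm(1)                                   # Normal variate: the proposed move
  log_alpha <- dcauchy(y, log = TRUE) - dcauchy(x, log = TRUE)  # target density ratio
  if (log(runif(1)) < log_alpha) y else x                       # accept or stay put
}

x <- numeric(1e4)
for (t in 2:1e4) x[t] <- metropolis_step(x[t - 1])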

Filed under: Books, Kids, R, Statistics, University life Tagged: cross validated, Gaussian random walk, Markov chain Monte Carlo algorithm, MCMC, Metropolis-Hastings algorithm, Monte Carlo Statistical Methods, normal distribution, normal generator, random variates


Spatial networks – case study St James centre, Edinburgh (2/3)

By mikerspencer


(This article was first published on R – scottishsnow, and kindly contributed to R-bloggers)

This is part two in a series I’m writing on network analysis. The first part is here. In this section I’m going to cover allocating resources, again using the St James’ development in Edinburgh as an example. Most excitingly (for me), the end of this post covers the impact of changes in resource allocation.

Edinburgh (and surrounds) has more than one shopping centre. Many more. I’ve had a stab at narrowing these down to those that are similar to the St James centre, i.e. they’re big, (generally) covered and may have a cinema. You can see a plot of these below. As you can see the majority are concentrated around the population centre of Edinburgh.

Location of big shopping centres in and around Edinburgh.

As with the previous post I’ve used GRASS GIS for the network analysis, QGIS for cartography and R for some subsequent analysis. I’ve used the Ordnance Survey code-point open and openroads datasets for the analysis and various Ordnance Survey maps for the background.

An allocation map shows how you can split your network to be serviced by different resource centres. I like to think of it as deciding which fire station sends an engine to which road. But this can be extended to any resource with multiple locations: bank branches, libraries, schools, swimming pools. In this case we’re using shopping centres. As always the GRASS manual page contains a full walk through of how to run the analysis. I’ll repeat the steps I took below:

# connect points to network
v.net input=roads_EH points=shopping_centres out=centres_net op=connect thresh=200

# allocate, specifying range of center cats (easier to catch all):
v.net.alloc centres_net out=centres_alloc center_cats=1-100000 node_layer=2

# Create db table
v.db.addtable map=centres_alloc@shopping_centres
# Join allocation and centre tables
v.db.join map=centres_alloc column=cat other_table=shopping_centres other_column=cat

# Write to shp
v.out.ogr -s input=centres_alloc output=shopping_alloc format=ESRI_Shapefile output_layer=shopping_alloc

The last step isn’t strictly necessary, as QGIS and R can connect directly to the GRASS database, but old habits die hard! We’ve now got a copy of the road network where all roads are tagged with which shopping centre they’re closest to. We can see this below:


Allocation network of EH shopping centres.

A few things stand out for me:

  • Ocean terminal is a massive centre but is closest to few people.
  • Some of the postcodes closest to St James are really far away.
  • The split between Fort Kinnaird and St James is really stark just east of the A702.

If I were a councillor and I coordinated shopping centres in a car-free world, I now know where I’d be lobbying for better public transport!

We can also do a similar analysis using the shortest path, as in the previous post. Instead of looking for the shortest path to a single point, we can get GRASS to calculate the distance from each postcode to its nearest shopping centre (note this is using the postcodes_EH file from the previous post):

# connect postcodes to streets as layer 2
v.net --overwrite input=roads_EH points=postcodes_EH output=roads_net1 operation=connect thresh=400 arc_layer=1 node_layer=2

# connect shops to streets as layer 3
v.net --overwrite input=roads_net1 points=shopping_centres output=roads_net2 operation=connect thresh=400 arc_layer=1 node_layer=3

# inspect the result
v.category in=roads_net2 op=report

# shortest paths from postcodes (points in layer 2) to nearest shopping centres (points in layer 3)
v.net.distance --overwrite in=roads_net2 out=pc_2_shops flayer=2 to_layer=3

# Join postcode and distance tables
v.db.join map=postcodes_EH column=cat other_table=pc_2_shops other_column=cat
# Join shopping centre and distance tables
v.db.join map=postcodes_EH column=tcat other_table=shopping_centres other_column=cat subset_columns=Centre

# Make a km column
# Really short field name so we can output to shp
v.db.addcolumn map=postcodes_EH columns="dist_al_km double precision"
v.db.update map=postcodes_EH column=dist_al_km qcol="dist/1000"

# Make a st james vs column
# Uses results from the previous blog post
v.db.addcolumn map=postcodes_EH columns="diff_km double precision"
v.db.update map=postcodes_EH column=diff_km qcol="dist_km-dist_al_km"

# Write to shp
v.out.ogr -s input=postcodes_EH output=pc_2_shops format=ESRI_Shapefile output_layer=pc_2_shops

Again we can plot these up in QGIS (below). These results are really similar to the road allocation above, but give us a little more detail on where the population is, as each postcode is shown. However, the eagle-eyed among you will have noticed we pulled out the distance for each postcode in the code above and then compared it to the distance to St James alone. We can use this to consider the impact of resource allocation.


Closest shopping centre for each EH postcode.

Switching to R, we can interrogate the postcode data further. Using R’s rgdal package we can read in the shp file and generate some summary statistics:

Centre           No. of postcodes closest
Almondvale        4361
Fort Kinnaird     7813
Gyle              3437
Ocean terminal    1321
St James          7088
# Package
install.packages("rgdal")
library(rgdal)

# Read file
postcodes = readOGR("/home/user/dir/dir/network/data/pc_2_shops.shp")

# How many postcodes for each centre?
table(postcodes$Centre)

We can also look at the distribution of distances for each shopping centre using a box and whisker plot. As in the map, we can see that Fort Kinnaird and St James are closest to the most distant postcodes, and that Ocean terminal has a small geographical catchment. The code for this plot is at the end of this post.

We can also repeat the plot from the previous blog post and look at how many postcodes are within walking and cycling distance of their nearest centre. In the previous post I showed the solid line and circle points for the St James centre. We can now compare those results to the impact of people travelling to their closest centre (below). The number of postcodes within walking distance of their nearest centre is nearly double that of St James alone, and those within cycling distance rises to nearly 50%! Code at the end of the post.

all-shops_postcode-distance

We also now have two curves on the above plot, and the area between them is the distance saved if each postcode travelled to its closest shopping centre instead of the St James.

The total distance is a whopping 123,680 km!

This impact analysis is obviously of real use in these times of reduced public services. My local council, Midlothian, is considering closing all its libraries bar one. What impact would this have on users? How would the road network around the kept library cope? Why have they just been building new libraries? It’s also analysis I really hope the DWP undertook before closing job centres across Glasgow. Hopefully the work of this post helps people investigate these impacts themselves.

# distance saved
# NA value is one postcode too far to be joined to road - oops!
sum(postcodes$diff_km, na.rm=T)

# Boxplot
png("~/dir/dir/network/figures/all-shops_distance_boxplot.png", height=600, width=800)
par(cex=1.5)
boxplot(dist_al_km ~ Centre, postcodes, lwd=2, range=0,
        main="Box and whiskers of EH postcodes to their nearest shopping centre",
        ylab="Distance (km)")
dev.off()

# Line plot
# Turn into percentage instead of postcode counts
x = sort(postcodes$dist_km)
x = quantile(x, seq(0, 1, by=0.01))
y = sort(postcodes$dist_al_km)
y = quantile(y, seq(0, 1, by=0.01))

png("~/dir/dir/network/figures/all-shops_postcode-distance.png", height=600, width=800)
par(cex=1.5)
plot(x, type="l",
     main="EH postcode: shortest road distances to EH shopping centres",
     xlab="Percentage of postcodes",
     ylab="Distance (km)",
     lwd=3)
lines(y, lty=2, lwd=3)

# Mark how many postcodes fall within walking and cycling distance on each curve
# (threshold values here are illustrative; use the ones from the previous post)
walk_km <- 2
cycle_km <- 5
points(max(which(x <= walk_km)), walk_km, pch = 16, cex = 2)
points(max(which(x <= cycle_km)), cycle_km, pch = 16, cex = 2)
points(max(which(y <= walk_km)), walk_km, pch = 17, cex = 2)
points(max(which(y <= cycle_km)), cycle_km, pch = 17, cex = 2)

dev.off()


SQL Saturday statistics – Web Scraping with R and SQL Server

By tomaztsql

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

I wanted to check a simple query: how many times has a particular topic been presented, and by how many different presenters?

Sounds interesting, and tackling it should not be a problem; it’s just that the final numbers may vary, since there will be some text analysis involved.

First of all, some web scraping to get the information from the SQLSaturday web page. With the R/Python integration in SQL Server, reading the information from the website is a fairly straightforward task:

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 library(rvest)
 library(XML)
 library(dplyr)

 #URL to schedule
 url_schedule <- "http://www.sqlsaturday.com/687/Sessions/Schedule.aspx"

 # read the schedule page; the session titles and speakers are then
 # picked out of the returned HTML (wrapped into a stored procedure below)
 schedule_page <- read_html(url_schedule)
 ';

Python offers the BeautifulSoup library, which will do pretty much the same (or even better) job as the rvest and XML packages combined. Nevertheless, once we have the data out of a test page (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole web page and generate all the needed information.

The SQLSaturday website has every event enumerated, making it very easy to parametrize the web scraping process:

So we will scrape through the last 100 events by simply incrementing the event number; the input parameter will be parsed as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether the website functions or not. Results will be returned to the SQL Server database.
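Sketched first in plain R (outside SQL Server), the scraping loop is nothing more than this (the CSS selector below is a placeholder for whatever matches the schedule page markup):

library(rvest)

get_sessions <- function(event_id) {
  url <- sprintf("http://www.sqlsaturday.com/%d/Sessions/Schedule.aspx", event_id)
  page <- tryCatch(read_html(url), error = function(e) NULL)  # some event pages may not exist
  if (is.null(page)) return(NULL)
  html_text(html_nodes(page, ".session"))   # placeholder selector
}

sessions <- lapply(600:690, get_sessions)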

To run the same thing from SQL Server and store the results, creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL VARCHAR(500)
SET @URL = 'http://www.sqlsaturday.com/' +CAST(@eventID AS NVARCHAR(5)) + '/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
 SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N' 
 library(rvest)
 library(XML)
 library(dplyr)
 library(httr)
 library(curl)
 library(selectr)
 
 #URL to schedule (passed in from T-SQL via @params)
 url_schedule <- URL
 page <- url_schedule %>%
   read_html()

 # Event schedule: pull out the session titles and speakers
 # (the CSS selectors are placeholders; match them to the schedule page markup)
 SqlSatTitle   <- html_text(html_nodes(page, ".session-title"))
 SQLSatSpeaker <- html_text(html_nodes(page, ".session-speaker"))
 OutputDataSet <- data.frame(SqlSatTitle, SQLSatSpeaker)
 ';

INSERT INTO @TEMP (SqlSatTitle, SQLSatSpeaker)
EXEC sp_execute_external_script
  @language = N'R'
 ,@script = @RCODE
 ,@params = N'@URL NVARCHAR(500)'
 ,@URL = @URL;

INSERT INTO SQLSatSessions (SqlSat, SqlSatTitle, SQLSatSpeaker)
SELECT @eventID, SqlSatTitle, SQLSatSpeaker
FROM @TEMP;
GO

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
 id SMALLINT IDENTITY(1,1) NOT NULL
,SqlSat SMALLINT NOT NULL
,SqlSatTitle NVARCHAR(500) NOT NULL
,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

There you go! Now you can run the stored procedure for a particular event (in this case SQL Saturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run this procedure against multiple SQLSaturday events and web scrape the data from the SQLSaturday.com website instantly.

For Slovenian SQLSaturday, I get the following sessions and speakers list:


Please note that if you are running this code behind a firewall or proxy, some additional changes for the proxy or firewall might be needed!

So, back to the original question: how many times has Query Store been presented at SQLSaturdays (from SQLSat600 to SQLSat690)? Here is the frequency table:


Or presented with a pandas graph:

session_stats

Query Store is popular, beyond all the R, Python or Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people! 🙂

As always, code is available at Github.


Visualizing classifier thresholds

By Corey Chivers

(This article was first published on Rstats – bayesianbiologist, and kindly contributed to R-bloggers)

Lately I’ve been thinking a lot about the connection between prediction models and the decisions that they influence. There is a lot of theory around this, but communicating how the various pieces all fit together with the folks who will use and be impacted by these decisions can be challenging.

One of the important conceptual pieces is the link between the decision threshold (how high does the score need to be to predict positive) and the resulting distribution of outcomes (true positives, false positives, true negatives and false negatives). As a starting point, I’ve built this interactive tool for exploring this.

The idea is to take a validation sample of predictions from a model and experiment with the consequences of varying the decision threshold. The hope is that the user will be able to develop an intuition around the tradeoffs involved by seeing the link to the individual data points involved.
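To make that link concrete, here is a minimal sketch (with simulated scores and labels, not the tool’s code): sweep the threshold across a validation set and tabulate the four outcome types at each value.

set.seed(1)
score <- runif(1000)                      # predicted scores on a validation sample
label <- rbinom(1000, 1, prob = score)    # outcomes loosely driven by the score

outcomes_at <- function(threshold) {
  pred <- as.integer(score >= threshold)
  c(threshold = threshold,
    TP = sum(pred == 1 & label == 1),
    FP = sum(pred == 1 & label == 0),
    TN = sum(pred == 0 & label == 0),
    FN = sum(pred == 0 & label == 1))
}

# raising the threshold trades false positives for false negatives
t(sapply(seq(0.1, 0.9, by = 0.2), outcomes_at))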

Code for this experiment is available here. I hope to continue to build on this with other interactive, visual tools aimed at demystifying the concepts at the interface between predictions and decisions.


Make memorable plots with memery. v0.3.0 now on CRAN.

By Matt Leonawicz

Make memorable plots with memery. memery is an R package that generates internet memes including superimposed inset graphs and other atypical features, combining the visual impact of an attention-grabbing meme with graphic results of data analysis. Version 0.3.0 of memery is now on CRAN. The latest development version and a package vignette are available on GitHub.


Below is an example interleaving a semi-transparent ggplot2 graph between a meme image backdrop and overlying meme text labels. The meme function will produce basic memes without needing to specify a number of additional arguments, but this is not the main purpose of the package. Adding a plot is then as simple as passing the plot to inset.

memery offers sensible defaults as well as a variety of basic templates for controlling how the meme and graph are spliced together. The example here shows how additional arguments can be specified to further control the content and layout. See the package vignette for a more complete set of examples and description of available features and graph templates.

Please do share your data analyst meme creations. Enjoy!

library(memery)

# Make a graph of some data
library(ggplot2)
x <- seq(0, 2 * pi, length.out = 100)
p <- ggplot(data.frame(x = x, y = sin(x)), aes(x, y)) + geom_line()

# Splice the plot into a meme image (image and label here are placeholders)
img <- system.file("philosoraptor.jpg", package = "memery")
meme(img, "Make memorable plots", "meme.png", inset = p)


Update on coordinatized or fluid data

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We have just released a major update of the cdata R package to CRAN.

If you work with R and data, now is the time to check out the cdata package.

Among the changes in the 0.5.* version of cdata package:

  • All coordinatized data or fluid data operations are now in the cdata package (no longer split between the cdata and replyr packages).
  • The transforms are now centered on the more general table driven moveValuesToRowsN() and moveValuesToColumnsN() operators (though pivot and un-pivot are now made available as convenient special cases).
  • All the transforms are now implemented in SQL through DBI (no longer using tidyr or dplyr, though we do include examples of using cdata with dplyr).
  • This is (unfortunately) a user visible API change, however adapting to the changed API is deliberately straightforward.

cdata now supplies very general data transforms on both in-memory data.frames and remote or large data systems (PostgreSQL, Spark/Hive, and so on). These transforms include operators such as pivot/un-pivot that were previously not conveniently available for these data sources (for example tidyr does not operate on such data, despite dplyr doing so).
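If the pivot/un-pivot vocabulary is unfamiliar, here is a tiny base R illustration (deliberately not the cdata API) of what an un-pivot, i.e. moving values to rows, does to a table:

wide <- data.frame(id = c(1, 2), jan = c(10, 20), feb = c(11, 21))

# un-pivot: the jan/feb measurements move into rows, keyed by a new column
long <- data.frame(id = rep(wide$id, times = 2), stack(wide[c("jan", "feb")]))
names(long) <- c("id", "value", "month")
long
#   id value month
# 1  1    10   jan
# 2  2    20   jan
# 3  1    11   feb
# 4  2    21   feb

Pivoting is the reverse trip, moving those values back out to columns; cdata’s contribution is doing this at scale, directly against databases or Spark.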

To help with the transition we have updated the existing documentation.

The fluid data document is a bit long, as it covers a lot of concepts quickly. We hope to develop more targeted training material going forward.

In summary: cdata theory and package now allow very concise and powerful transformations of big data using R.


ShinyProxy 1.0.2

By Open Analytics


(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise or larger organizations. Since our last blog post, ten new releases of ShinyProxy have seen the light of day, but with the 1.0.2 release it is time to provide an overview of the lines of development and the advances made.

Scalability

ShinyProxy now allows you to run 1000s of Shiny apps concurrently on a Docker Swarm cluster. Moreover, ShinyProxy will automatically detect whether the Docker API URL is a Docker Engine API or a Swarm cluster API. In other words, changing the back-end from a single Docker host to a Docker Swarm is plug and play.

Single-Sign On

Complex deployments asked for advanced functionality for identity and access management (IAM). To tackle this we introduced a new authentication mechanism, authentication: keycloak, which integrates ShinyProxy with Keycloak, Red Hat’s open source IAM solution. Features like single sign-on, identity brokering, user federation etc. are now available for ShinyProxy deployments.


Larger Applications and Networks

Oftentimes Shiny applications will be offered as part of larger applications that are written in languages other than R. To enable this type of integration, we have introduced functionality to entirely hide the ShinyProxy user interface elements for seamless embedding as views in bigger user interfaces.

Next to integration within other user interfaces, the underlying Shiny code may need to interact
with applications that live in specific networks. To make sure the Shiny app containers
have network interfaces configured for the right networks, a new docker-network configuration
parameter has been added to the app-specific configurations. Together with Docker volume mounting
for persistence, and the possibility to pass environment variables to Docker containers,
this gives Shiny developers lots of freedom to develop serious applications. An example configuration is given below. A Shiny app communicates over a dedicated Docker network db-net with a database back-end and configuration information is made available to the Shiny app via environment variables that are
read from a configuration file db.env:

  - name: db-enabled-app
    display-name: Shiny App with a Database Persistence Layer
    description: Shiny App connecting with a Database for Persistence
    docker-image: registry.openanalytics.eu/public/db-enabled-app:latest
    docker-network-connections: [ "db-net" ]
    docker-env-file: db.env
    groups: [db-app-users]

Usage Statistics

Gathering usage statistics has been part of ShinyProxy since version 0.6.0, but until now was limited to an InfluxDB back-end. Customers asked us to integrate Shiny applications with MonetDB (and did not want a separate database to store usage statistics), so we developed a MonetDB adapter for version 0.8.4. Configuration has been streamlined with a usage-stats-url, and support for DB credentials is now offered through a usage-stats-username and usage-stats-password.


Security

Proper security for ShinyProxy setups of all sizes is highly important and a number of improvements have been implemented. The ShinyProxy security page has been extended and extra content has been added on dealing with sensitive configuration. On the authentication side, LDAPS support has been around for a long time, but since release 1.0.0 we also offer LDAP+StartTLS support out of the box.

Deployment

Following production deployments for customers, we now also offer RPM files for deployment
on CentOS 7 and RHEL 7, besides the .deb packages for Ubuntu and the platform-independent
JAR files.

Further Information

For all these new features, detailed documentation is provided on http://shinyproxy.io and as always community support on this new release is available at

https://support.openanalytics.eu

Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!
