Visualising SSH attacks with R

By Iñaki Úcar

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

If you have any machine with an SSH server open to the world and you take a look at your logs, you may be alarmed by the number of login attempts from so many unknown IP addresses. DenyHosts is a pretty neat service for Unix-based systems that works in the background reviewing such logs and appending the offending addresses to the hosts.deny file, thus blocking brute-force attacks.

The following R snippet may be useful to quickly visualise a hosts.deny file with logs from DenyHosts. Such a file may contain comments (lines starting with #), and actual records are stored in the form service: IP. Therefore, read.table is more than enough to load it into R. The rgeolocate package is used to geolocate the IPs, and the counts per country are represented in a world map using rworldmap:

library(dplyr)
library(rgeolocate)
library(rworldmap)

hosts.deny <- "/etc/hosts.deny"
db <- system.file("extdata", "GeoLite2-Country.mmdb", package = "rgeolocate")

read.table(hosts.deny, col.names = c("service", "IP")) %>%
  pull(IP) %>%
  maxmind(db, fields = "country_code") %>%
  count(country_code) %>%
  joinCountryData2Map(joinCode = "ISO2", nameJoinColumn = "country_code") %>%
  mapCountryData(nameColumnToPlot = "n", catMethod = "pretty", mapTitle = "Attacks per country")
## 74 codes from your data successfully matched countries in the map
## 2 codes from your data failed to match with a country code in the map
## 168 codes from the map weren't represented in your data

Then, you may consider more specific access restrictions based on IP prefixes…
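As a starting point for that, here is a hedged sketch (it only assumes the hosts.deny record format described above, service: IP) that tallies the most frequent /24 prefixes among the offending addresses:

```r
library(dplyr)

read.table("/etc/hosts.deny", col.names = c("service", "IP")) %>%
  mutate(prefix = sub("\\.[0-9]+$", ".0/24", IP)) %>%  # collapse each IP to its /24 prefix
  count(prefix, sort = TRUE) %>%
  head(10)
```

Prefixes that appear many times are natural candidates for a blanket firewall rule.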

Article originally published in Visualising SSH attacks with R.

To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more…

Source:: R News

R 3.4.3 released

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R 3.4.3 has been released, as announced by the R Core team today. As of this writing, only the source distribution (for those that build R themselves) is available, but binaries for Windows, Mac and Linux should appear on your local CRAN mirror within the next day or so.

This is primarily a bug-fix release. It fixes an issue with incorrect time zones on macOS High Sierra, and some issues with handling Unicode characters. (Incidentally, representing international and special characters is something that R takes great care to handle properly. It’s not an easy task: a 2003 essay by Joel Spolsky describes the minefield that is character representation, and not much has changed since then.) You can check out the complete list of changes here. Whatever your platform, R 3.4.3 should be backwards-compatible with other R versions in the R 3.4.x series, and so your scripts and packages should continue to function as they did before.

The codename for this release is “Kite-Eating Tree”, and as with all R codenames this is a reference to a classic Peanuts strip. If you’re interested in the source of other R release names, Lucy D’Agostino McGowan provides the Peanuts references for R release names back to R 2.14.0.

r-devel mailing list: R 3.4.3 is released


A quick introduction to using color in density plots

By Sharp Sight

(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

Right now, many people are pursuing data science because they want to learn artificial intelligence and machine learning. And for good reason. Machine learning is white hot right now, and it will probably reshape almost every sector of the world economy.

Having said that, if you want to be a great machine learning expert, and a great data scientist in general, you need to master data visualization too.

This is because data visualization is a critical prerequisite for advanced topics (like machine learning), and also because visualization is very useful for getting things done in its own right.

So let’s talk a little more about data visualization. As you begin learning data visualization in R, you should master the basics: how to use ggplot2, how to think about data visualization, and how to make basic plots (like the bar chart, line chart, histogram, and scatterplot).

And I really mean that you need to master these basics. To be a great data scientist, you need to be “fluent” in these basics. You should be able to write the code for basic charts and plots without even thinking about it. If you can’t, you should go back and practice them. Don’t get shiny object syndrome and try to move on to advanced topics before you do. Show some discipline. Master the foundations.

After you do master the foundations though, you’ll need to learn some intermediate tools.

One such thing that you’ll need to learn is how to work with color. Specifically, you’ll need to learn how to manipulate the “fill” color of things like density plots (as well as heatmaps).

With that in mind, let’s take a look at using color in density plots.

As always, we’re going to use the meta-learning strategy of learning the tools with basic cases. Essentially, we’ll learn and practice how to modify the fill aesthetic for very simple plots. Later, once you get the hang of it, you can move on to more advanced applications.

As always, first we will load the packages we will need.
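The package-loading code did not survive extraction; judging from the functions used later in this post, it was something like the following (the exact list is an assumption):

```r
library(tibble)       # tibble()
library(ggplot2)      # ggplot(), stat_density2d(), scale_fill_distiller()
library(RColorBrewer) # display.brewer.all()
library(viridis)      # scale_fill_viridis()
```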



Next, we will set a “seed” that will make the dataset exactly reproducible when we create our data. Without this, running rnorm() and runif() would produce data that is similar, but not exactly the same as the data you see here. To be clear, it won’t make too much difference for the purposes of this blog post, but it’s a best practice when you want to show reproducible results, so we will do this anyway.
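The seed-setting code itself is missing from this excerpt; the original seed value is unknown, so the value below is an assumption (any fixed integer serves the purpose):

```r
set.seed(42)  # seed value assumed; any fixed value makes the random draws reproducible
```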


Now, we will create our data frame.

We will use runif() to create 20,000 uniformly distributed values for our x variable, and rnorm() to create 20,000 random normal values for our y variable. We’re also using the tibble() function to create a tibble/data.frame out of these two new variables.
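The data-creation code was lost in extraction; here is a sketch matching the description above (the data frame name df.data is an assumption, as the original name is not shown):

```r
library(tibble)

df.data <- tibble(
  x = runif(20000),  # 20,000 uniformly distributed values
  y = rnorm(20000)   # 20,000 random normal values
)
```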


Now that we have a dataset created, let's create a simple plot of the data. Ultimately, we will be working with density plots, but it will be useful to first plot the data points as a simple scatter plot.

Here, we're using the typical ggplot syntax: we're specifying the data frame inside of ggplot() and specifying our variable mappings inside of aes().


ggplot(df.data, aes(x = x, y = y)) +  # df.data: the x/y data frame created above (name assumed)
  geom_point()

Ok. We can at least see the data points and the general structure of the data (i.e., the horizontal band).

Having said that, these data are very heavily overplotted. There are a few ways to mitigate this overplotting (e.g., manipulating the alpha aesthetic), but a great way is to create a density plot.

To create the density plot, we’re using stat_density2d(). I won’t explain this in detail here, but essentially in this application, stat_density2d() calculates the density of observations in each region of the plot, and then fills in that region with a color that corresponds to the density.


ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F)

This isn’t bad. It gives us a sense of the density of the data (you can see the thick band across the middle). However, there are two issues.

First, the differences in density are not completely obvious, because of the color scale. The default light blue/dark blue color scheme doesn’t illuminate the differences in data density.

Second, this is just not very aesthetically appealing. It just doesn’t look that good.

To fix these issues, let’s modify the color scheme. I’ll show you a few options. Some will work better than others, and after you see these, I’ll encourage you to experiment with other color palettes.

Let’s first take a look at the color palette options.

You can examine a large number of ready-made color palettes from the RColorBrewer package by using display.brewer.all().
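The code block was dropped in extraction; it is a one-liner:

```r
library(RColorBrewer)
display.brewer.all()  # plots every ready-made RColorBrewer palette
```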



As you can see, there are quite a few palettes from RColorBrewer. We’ll use a few of these to change the fill color of our plot.


ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_distiller(palette = 'Greens')

Now let’s use a few more color palettes.

# FILL WITH COLOR PALETTE: Reds, Greys, Red/Yellow/Blue

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_distiller(palette = 'Reds')

… grey

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_distiller(palette = 'Greys')

… and a scale from red, to yellow, to blue.

ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_distiller(palette = 'RdYlBu')

These aren’t bad.

I think they work a little better than the default color scheme, but I think we can do better, so let’s try one more.

The following plot uses a custom color palette from the viridis package.


ggplot(df.data, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..density..), geom = 'tile', contour = F) +
  scale_fill_viridis()  # from the viridis package

I should have called this blog post “ggplot for people who love Mark Rothko.”

Ok, the viridis color palette (and a related set of palettes in the viridis package) is probably my favorite option. Not only do I think this color palette is one of the most aesthetically attractive, it’s also more functional. As noted in the documentation for the package, the viridis color palette is “designed in such a way that it will analytically be perfectly perceptually-uniform … It is also designed to be perceived by readers with the most common form of color blindness.”

Master these techniques with simple cases

Admittedly, when you move on to more complex datasets later, it will take a bit of finesse to properly apply these color palettes.

But as I noted earlier, when you’re trying to master R (or any programming language), you should first master the syntax and techniques on basic cases, and then increase the complexity as you attain basic competence.

With that in mind, if you want to master these color and fill techniques, learn and practice these tools with simple cases like the ones shown here, and you can attempt more advanced applications later.

Sign up now, and discover how to rapidly master data science

To master data visualization and data science, you need to master the essential tools.

Moreover, to make rapid progress, you need to know what to learn, what not to learn, and you need to know how to practice what you learn.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

Sign up now for our email list, and you’ll receive regular tutorials and lessons.

You’ll learn:

  • How to do data visualization in R
  • How to practice data science
  • How to apply data visualization to more advanced topics (like machine learning)
  • … and more

If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.


The post A quick introduction to using color in density plots appeared first on SHARP SIGHT LABS.


Introducing RITCH: Parsing ITCH Files in R (Finance & Market Microstructure)

By David Zimmermann

(This article was first published on R – Data Shenanigans, and kindly contributed to R-bloggers)

Recently I was faced with a file compressed in NASDAQ’s ITCH protocol. As I wasn’t able to find an R package that parses and loads the file into R for me, I spent (probably) way too much time writing one, so here it is.

But you might wonder, what exactly is ITCH and why should I care?

Well, ITCH is the outbound protocol NASDAQ uses to communicate market data to its clients, that is, all information including market status, orders, trades, circuit breakers, etc., with nanosecond timestamps for each day and each exchange. Kind of a must-have if you are looking into market microstructure, and good to have looked into if you are interested in general finance and/or in data analytics and large structured datasets.

If you are wondering where you might get some of these fancy datasets in the first place, I have good news for you. NASDAQ provides some sample datasets (6 days for 3 exchanges (NASDAQ, PSX, and BX), together about 25GBs gzipped) on its FTP server:

The RITCH Package

Now that I (hopefully) have your attention, let me present to you RITCH, an R package that parses ITCH files (version 5.0).

Currently the package only lives on GitHub, but it should find its way onto CRAN eventually. Until then, you have to use devtools to install it:

# install.packages("devtools")
devtools::install_github("DavZim/RITCH")

And you should be good to go (if not, please let me know! If you get a message related to make, you can safely ignore it for now).

RITCH so far has a very limited scope: extracting the messages from the ITCH-file plus some functions to count messages. The package leverages C++ and the excellent Rcpp library to optimise parsing. RITCH itself does not contain any data as the datasets are too large for any repos and I have no copyright on the datasets in any way.

For the following code I will use the 20170130.BX_ITCH_50 file from NASDAQ’s FTP server, as it’s not too large at 714MB gzipped (1.6GB gunzipped), but still has almost 55 million messages. All functions can take the gzipped or unzipped files, but if you use the file more than once and hard-drive space is not of utmost concern, I suggest you gunzip the file by hand (i.e., use R.utils::gunzip(file, new_file, remove = FALSE) in R or gunzip -k YYYYMMDD.XXX_ITCH_50.gz in the terminal) and call the functions on the “plain” file. I will address some concerns about size and speed later on.

To download and prepare the data in R, we can use the following code

# might take some time as it downloads 714MB
download.file("", "20170130.BX_ITCH_50.gz", mode = "wb")

# gunzip the file, but keep the original file
R.utils::gunzip("20170130.BX_ITCH_50.gz", "20170130.BX_ITCH_50", remove = FALSE)

First, we want to get a general overview of the file, which we can do with count_messages()

file <- "20170130.BX_ITCH_50"
msg_count <- count_messages(file)
#> [Counting]   54473386 messages found
#> [Converting] to data.table

#>     msg_type    count                                  msg_name                                    msg_group  doc_nr
#>  1:        S        6                      System Event Message                         System Event Message     4.1
#>  2:        R     8371                           Stock Directory                       Stock Related Messages   4.2.1
#>  3:        H     8401                      Stock Trading Action                       Stock Related Messages   4.2.2
#>  4:        Y     8502                       Reg SHO Restriction                       Stock Related Messages   4.2.3
#>  5:        L     6011               Market Participant Position                       Stock Related Messages   4.2.4
#>  6:        V        2                MWCB Decline Level Message                       Stock Related Messages
#>  7:        W        0                       MWCB Status Message                       Stock Related Messages
#>  8:        K        0                 IPO Quoting Period Update                       Stock Related Messages   4.2.6
#>  9:        J        0                       LULD Auction Collar                       Stock Related Messages   4.2.7
#> 10:        A 21142017                         Add Order Message                            Add Order Message   4.3.1
#> 11:        F    20648      Add Order - MPID Attribution Message                            Add Order Message   4.3.2
#> 12:        E  1203625                    Order Executed Message                        Modify Order Messages   4.4.1
#> 13:        C     8467 Order Executed Message With Price Message                        Modify Order Messages   4.4.2
#> 14:        X  1498904                      Order Cancel Message                        Modify Order Messages   4.4.3
#> 15:        D 20282644                      Order Delete Message                        Modify Order Messages   4.4.4
#> 16:        U  3020278                     Order Replace Message                        Modify Order Messages   4.4.5
#> 17:        P   330023                 Trade Message (Non-Cross)                               Trade Messages   4.5.1
#> 18:        Q        0                       Cross Trade Message                               Trade Messages   4.5.2
#> 19:        B        0                      Broken Trade Message                               Trade Messages   4.5.3
#> 20:        I        0                              NOII Message Net Order Imbalance Indicator (NOII) Message     4.6
#> 21:        N  6935487                   Retail Interest Message    Retail Price Improvement Indicator (RPII)     4.7
#>     msg_type    count                                  msg_name                                    msg_group  doc_nr

As you can see, there are a lot of different message types. Currently this package parses only messages from the group “Add Order Messages” (types ‘A’ and ‘F’), “Modify Order Messages” (types ‘E’, ‘C’, ‘X’, ‘D’, and ‘U’), and “Trade Messages” (types ‘P’, ‘Q’, and ‘B’). You can extract the different message types using the functions get_orders, get_modifications, and get_trades, respectively. The doc number refers to the section in the official documentation (which also contains a more detailed description of what each type contains).

If you are annoyed by the feedback the function gives you ([Counting] ... [Converting]...), you can always turn the feedback off with the quiet = TRUE option (this applies to all functions).

Let’s try to parse the first 10 orders:

orders <- get_orders(file, 1, 10)
#> 10 messages found
#> [Loading]    .
#> [Converting] to data.table
#> [Formatting]

#>     msg_type locate_code tracking_number    timestamp order_ref   buy shares stock   price mpid       date            datetime
#>  1:        A        7584               0 2.520001e+13     36132  TRUE 500000  UAMY  0.0001   NA 2017-01-30 2017-01-30 07:00:00
#>  2:        A        3223               0 2.520001e+13     36133  TRUE 500000  GLOW  0.0001   NA 2017-01-30 2017-01-30 07:00:00
#>  3:        A        2937               0 2.520001e+13     36136 FALSE    200   FRP 18.6500   NA 2017-01-30 2017-01-30 07:00:00
#>  4:        A        5907               0 2.520001e+13     36137  TRUE   1500   PIP  3.1500   NA 2017-01-30 2017-01-30 07:00:00
#>  5:        A        5907               0 2.520001e+13     36138 FALSE   2000   PIP  3.2500   NA 2017-01-30 2017-01-30 07:00:00
#>  6:        A        5907               0 2.520001e+13     36139  TRUE   3000   PIP  3.1000   NA 2017-01-30 2017-01-30 07:00:00
#>  7:        A        5398               0 2.520001e+13     36140  TRUE    200   NSR 33.0000   NA 2017-01-30 2017-01-30 07:00:00
#>  8:        A        5907               0 2.520001e+13     36141 FALSE    500   PIP  3.2500   NA 2017-01-30 2017-01-30 07:00:00
#>  9:        A        2061               0 2.520001e+13     36142 FALSE   1300  DSCI  7.0000   NA 2017-01-30 2017-01-30 07:00:00
#> 10:        A        1582               0 2.520001e+13     36143  TRUE    500  CPPL 17.1500   NA 2017-01-30 2017-01-30 07:00:00

The same works for trades using the get_trades() function and for order modifications using the get_modifications() function.

To speed up the get_* functions, we can use the message-count information from earlier. For example the following code yields the same results as above, but saves time.


If you want to get more information about each field, you can have a look at the
official ITCH-protocol specification manual or you can get a small data.table about each message type by calling get_meta_data().

Having a look at some of the most traded ETFs

To have at least one piece of eye-candy in this post, let’s have a quick go at the orders and trades of SPY (an S&P 500 ETF and one of the most traded assets, in case you didn't know), IWO (Russell 2000 Growth ETF), IWM (Russell 2000 Index ETF), and VXX (S&P 500 VIX ETF) on the BX exchange.

In case you are wondering, I got these four tickers with


get_orders(file, 1, count_orders(msg_count), quiet = T) %>% 
  .$stock %>% 
  table %>% 
  sort(decreasing = T) %>% 
  head(4)  # the line selecting the top tickers was lost; head(4) assumed
#> .
#>    SPY    IWO    IWM    VXX 
#> 135119 135016 118123 117395 

First we load the data (orders and trades) from the file, then we do some data munging, and finally plot the data using ggplot2.


# 0. load the data
orders <- get_orders(file, 1, count_orders(msg_count))
#> 21162665 messages found
#> [Loading]    ................
#> [Converting] to data.table
#> [Formatting]

trades <- get_trades(file, 1, count_trades(msg_count))
#> 330023 messages found
#> [Loading]    ................
#> [Converting] to data.table
#> [Formatting]

# 1. data munging
tickers <- c("SPY", "IWO", "IWM", "VXX")
# (the remaining munging code, filtering prices around "0.99 * min_price", was truncated in the original)

Now it’s up to you to do something interesting with the data; I hope RITCH can help you with it.

Speed & RAM concerns

If your machine struggles with some files, you can always load only parts of a file as shown above. And of course, make sure that you have only necessary datasets in your R-environment and that no unused programs are open (Chrome with some open tabs in the background happily eats your RAM).

If your machine is able to handle larger datasets, you can increase the buffersize of each function call to 1GB or more using the buffer_size = 1e9 argument, increasing the speed with which a file is parsed.
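For example (a sketch reusing the file and msg_count objects from earlier; the buffer_size argument is the one named in the text):

```r
# parse all orders with a 1GB read buffer instead of the default
orders <- get_orders(file, 1, count_orders(msg_count), buffer_size = 1e9)
```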


If you find this package useful or have any other kind of feedback, I'd be happy if you let me know. Otherwise, if you need more functionality for additional message types, please feel free to create an issue or a pull request on GitHub.



A crazy day in the Bitcoin World

By insightr

(This article was first published on R – insightR, and kindly contributed to R-bloggers)
By Gabriel Vasconcelos

Today, November 29, 2017, was a crazy day in the Bitcoin world, and the craziness is still going on as I write this post. The price ranged over thousands of dollars in a few hours. Bitcoin was the main topic today in all the discussion groups I participate in. Some people believe we are in the middle of a giant bubble and are very skeptical about Bitcoin’s intrinsic value, while other people believe cryptocurrencies are the future and are already counting on a price of hundreds of thousands of dollars in a few years. I am no expert and I have no idea which group is right, but I hope it is the second, because I really like the idea of Bitcoin as the money of the future.

My objective here is to have a brief look at what happened with Bitcoin prices and transactions today (2017-11-29) compared to two days ago (2017-11-27), which was an ordinary day given the current levels of Bitcoin prices and volume. I will use data made available by a Brazilian Bitcoin exchange called Mercado Bitcoin (MB), which translates to “Bitcoin Market”. You can see more details on their GitHub page. The prices are in Brazilian Real (BRL), which trades today at approximately 31 US cents. Since some readers may be interested only in the results, I will put all the code to download the data and generate the plots at the end of the post.

First, I will show a small table that demonstrates how volume and prices changed from one day to the other. The table shows that the number of transactions and the number of Bitcoins traded increased by something close to 500%, and the average amount of Bitcoins per transaction also increased. The price range was less than 2,000 BRL on November 27th, compared to more than 10,000 BRL today (November 29th).

stat          Nov.29     Nov.27
Transactions  31838.000  6975.000
Amount         2583.511   446.133
Avg.Amount        0.081     0.064
Avg.Price     40063.902 34274.563
Min.Price     33002.000 33500.000
Max.Price     44980.000 34965.000

The boxplot helps us understand how the prices were distributed. The distance between the 25% and 75% percentiles of today covers much more than the entire range of prices from November 27th. If you are more interested in the dynamics of prices and volume, you should look at the figure below the boxplot. It shows how prices and volume behaved across the two days (the data is aggregated to minutes). Both figures show that November 29th was clearly an unusual day in the Bitcoin world.




library(jsonlite)
library(dplyr)
library(lubridate)
library(zoo)
library(reshape2)
library(ggplot2)
library(scales)
library(grid)

## == The data is stored in JSON  == ##
## == We can only download 1000 trades each time == ##
## == The while is to download all trades in the day == ##
## == Dates are in UNIX time == ##
## == Data from the last 24h (now it is 23:17 - 2017-11-29) == ##
start = 1512004273 - 86400
end = 1512004273
datalist = list()
while(start < end){
  file = paste("", start, "/", sep = "")  # API endpoint URL elided in the original
  data <- fromJSON(file, flatten = TRUE)
  start = max(data$date) + 1
  datalist[[length(datalist) + 1]] = data
}
df_present = Reduce("rbind", datalist)
df_present = df_present %>% filter(date <= end)

## == Data from the same time 2 days before == ##
start = 1512004273 - 86400 * 2
end = 1512004273 - 86400 * 1
datalist = list()
while(start < end){
  file = paste("", start, "/", sep = "")  # API endpoint URL elided in the original
  data <- fromJSON(file, flatten = TRUE)
  start = max(data$date) + 1
  datalist[[length(datalist) + 1]] = data
}
df_past = Reduce("rbind", datalist)
df_past = df_past %>% filter(date <= end)

## = adjust date = ##
df_past$date = as.POSIXct(df_past$date, origin = "1970-01-01")
df_present$date = as.POSIXct(df_present$date, origin = "1970-01-01")

# = statistics = #
past = c(nrow(df_past), sum(df_past$amount, na.rm = TRUE), mean(df_past$amount, na.rm = TRUE), mean(df_past$price), range(df_past$price))
present = c(nrow(df_present), sum(df_present$amount, na.rm = TRUE), mean(df_present$amount, na.rm = TRUE), mean(df_present$price), range(df_present$price))
stat = round(data.frame("Nov.29" = present, "Nov.27" = past), 3)
stat = data.frame(stat = c("Transactions","Amount","Avg.Amount","Avg.Price","Min.Price","Max.Price"),
                  stat)

## =  make data by minute = ##
df_present_min = df_present %>%
  mutate(day = 1 + day(date) - min(day(date)), hour = hour(date), min = minute(date)) %>%
  mutate(date = make_datetime(day = day, hour = hour,min = min,tz = "BRST")) %>%
  group_by(date) %>% summarise(price = tail(price, 1) ,vol = sum(amount))

df_past_min=df_past %>%
  mutate(day = 1 + day(date) - min(day(date)), hour = hour(date) ,min = minute(date)) %>%
  mutate(date = make_datetime(day = day, hour = hour,min = min, tz="BRST")) %>%
  group_by(date) %>% summarise(price = tail(price, 1), vol = sum(amount))

df_min = full_join(df_present_min, df_past_min,by=c("date")) %>% arrange(date)
df_min$price.x = na.locf(df_min$price.x, na.rm = FALSE)
df_min$price.y = na.locf(df_min$price.y, na.rm = FALSE)

## = Plots = ##
df1 = melt(df_min[, c(1, 2, 4)], id.vars = "date")
df2 = melt(df_min[, c(1, 3, 5)], id.vars = "date")

p0 = ggplot() +
  geom_boxplot(data = df1, aes(variable, value)) + labs(x="Day", y="Price") +
  scale_x_discrete(labels = c("Nov. 29", "Nov. 27"))

p1 = ggplot() +
  geom_line(data = df1, aes(x = date, y = value, color = factor(variable, labels = c("Nov. 29", "Nov. 27")))) +
  labs(x = "Time",y = "Price") + guides(color = guide_legend(title = "Day")) +
  scale_x_datetime(labels = date_format("%H:%M"))

p2 = ggplot() +
  geom_line(data = df2,aes(x = date,y = value,color = factor(variable, labels=c("Nov. 29", "Nov. 27")))) +
  labs(x = "Time", y = "Volume") + guides(color = guide_legend(title = "Day")) +
  scale_x_datetime(labels = date_format("%H:%M"))

grid.draw(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "first"))


Using Heatmap Coloring on a Density Plot Using R to Visualize Distributions

By Tim Bock

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

Lots of different visualizations have been proposed for understanding distributions. In this post, I am going to show how to create my current favorite, which is a density plot using heatmap shading. I find them particularly useful for showing duration data.

Example: time to purchase

The table below shows the number of days it took 200 people to purchase a product, after their first trial of the product. The average time taken is 131 days and the median is 54 days. The rest of the post looks at a number of different ways of visualizing this data, finishing with the heated density plot.
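The table itself did not survive extraction, and the days vector used in the code below is not defined in this excerpt. To follow along, similar duration data can be simulated; the log-normal parameters below are chosen to roughly reproduce the stated mean (131) and median (54), and are not taken from the article:

```r
set.seed(1)
days <- rlnorm(200, meanlog = log(54), sdlog = 1.33)  # median ~54 days, mean ~131 days
```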

Density plot

One of the classic ways of plotting this type of data is as a density plot. The standard R version is shown below. I have set the from argument to better display this data, as otherwise density plots tend to show negative values even when the data contains no negative values.

plot(density(days, from = 0),
             main = "Density plot",
             xlab =  "Number of days since trial started")

This plot clearly shows that purchases occur relatively close to 0 days after the trial started. But, unless you use a ruler, there is no way to work out precisely where the peak occurs. Is it at 20 days or 50 days? The values shown on the y-axis have no obvious meaning (unless you read the technical documentation, and even then they are not numbers that can be explained in a useful way to non-technical people).

We can make this easier to read by only plotting the data for the first 180 days (to = 180), and changing the bandwidth used in estimating the density (adjust = 0.2) to make the plot less smooth. We can now see that the peak is around 30. While the plot does a good job at describing the shape of the data, it does not allow us to draw conclusions regarding the cumulative proportion of people to purchase by a specific time. For example, what proportion of people buy in the first 100 days? We need a different plot.

plot(density(days, from = 0, to = 180, adjust = 0.2),
             main = "Density plot - Up to 180 days (86% of data)",
             xlab =  "Number of days since trial started")

Survival curve

Survival curves have been invented for this type of data. An example is shown below. In this case, the survival curve shows the proportion of people who have yet to purchase at each point in time. The relatively sharp drop in the survival function at 30 days is the same pattern that was revealed by the peak in the density plot. We can also see that at about 100 days, around 26% or so of people have “survived” (i.e., not purchased). The plot also reveals that around 14% of customers have not purchased during the time interval shown (as this is where the line crosses the right side of the plot). To my mind the survival plot is more informative than the density plot. But it is also harder work. It is hard to see most non-technical audiences finding all the key patterns in this plot without guidance.

library(survival)
surv.days = Surv(days)
surv.fit = survfit(surv.days ~ 1)
plot(surv.fit, main = "Kaplan-Meier estimate with 95% confidence bounds (86% of data)",
     xlab = "Days since trial started",
     xlim = c(0, 180),
     ylab = "Survival function")
grid(20, 10, lwd = 2) 

Heated density plot

I call the visualization below a heated density plot. No doubt somebody invented this before we did, so please tell me if there is a more appropriate name. It is identical to the density plot from earlier in this post, except that:

  • The heatmap coloring shows the cumulative proportion to purchase, ranging from red (0%), to yellow (50%, the median), to blue (100%). Thus, we can now see that the median is at about 55, which could not be ascertained from the earlier density plots.
  • If you hover your mouse pointer (if you have one) over the plot, it will show you the cumulative percentage of people to purchase. We can readily see that the peak occurs near 30 days. Similarly, we can see that 93% of people purchased within 500 days.
HeatedDensityPlot(days, from = 0,
                  title = "Heated density plot", 
                  x.title = "Number of days since trial started",
                  legend.title = "% of buyers")

Below I show the plot limited to the first 180 days with the bandwidth adjusted as in the earlier density plot. The use of the legend on the right allows the viewer to readily see that only a little over 85% of people have purchased.

HeatedDensityPlot(days, from = 0, to = 180, adjust = 0.2,
                  title = "Heated density plot - Up to 180 days (86% of data)", 
                  x.title = "Number of days since trial started",
                  legend.title = "% of buyers")

Bonus: they are awesome for discrete data

The visualization below shows randomly-generated data where I have generated whole numbers in the range of 0 through 12 (i.e., one 0, six 1s, seven 2s, etc.). In all the previous heated density plots the color transition was relatively smooth; here, the discrete nature of the data is emphasized by the clear vertical lines between each color. This is not a feature that can be seen on either a traditional density plot or histogram with small sample sizes.

x = rpois(100, 5)
HeatedDensityPlot(x, from = 0, 
                  title = "Poisson(5), n = 100", 
                  x.title = "x")


The heated density plot is, to my mind, a straightforward improvement on traditional density plots. It shows additional information about the cumulative values of the distribution and reveals when the data is discrete. However, the function we have written is still a bit rough around the edges. For example, sometimes a faint line appears at the top of the plot, and a whole host of warnings appear in R when the plot is created. If you can improve on this, please tell us how!

The code

The R function used to create the plot is shown below. If you run this code you will get a plot of a normal distribution. If you want to reproduce my exact examples, please click here to sign into Displayr and access the document that contains all my code and charts. This document contains further additional examples of the use of this plot. To see the code, just click on one of the charts in the document, and the code should be shown on the right in Properties > R CODE.



My colleague Carmen wrote most of the code. It is written using the wonderful plotly package.

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

Source:: R News

Sentiment Analysis of “A Christmas Carol”

By hrbrmstr

(This article was first published on R –, and kindly contributed to R-bloggers)

Our family has been reading, listening to and watching “A Christmas Carol” for just about 30 years now. I got it into my crazy noggin to perform a sentiment analysis on it the other day and tweeted out the results, but a large chunk of the R community is not on Twitter and it would be good to get a holiday-themed post or two up for the season.

One reason I embarked on this endeavour is that @juliasilge & @drob made it so gosh darn easy to do so with:

(btw: That makes an excellent holiday gift for the data scientist[s] in your life.)

Let us begin!

STAVE I: hrbrmstr’s Code

We need the text of this book to work with and thankfully it’s long been in the public domain. As @drob noted, we can use the gutenbergr package to retrieve it. We’ll use an RStudio project structure for this and cache the results locally to avoid burning bandwidth:
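A minimal sketch of such a retrieve-and-cache step (the file path and object names here are my own assumptions, not necessarily what the original post used) might be:

```r
library(gutenbergr)

carol_rds <- "data/carol.rds"                 # assumed local cache location
if (!file.exists(carol_rds)) {
  dir.create("data", showWarnings = FALSE)
  saveRDS(gutenberg_download(46), carol_rds)  # 46 = "A Christmas Carol"
}
carol_df <- readRDS(carol_rds)                # tibble with gutenberg_id + text columns
```

On subsequent runs the book is read from the local `.rds` file, so Project Gutenberg's mirrors are only hit once.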



How did I know to use 46? We can use gutenberg_works() to get to that info:

gutenberg_works(author=="Dickens, Charles")
## # A tibble: 74 x 8
##    gutenberg_id                                                                                    title
##  1           46                             A Christmas Carol in Prose; Being a Ghost Story of Christmas
##  2           98                                                                     A Tale of Two Cities
##  3          564                                                               The Mystery of Edwin Drood
##  4          580                                                                      The Pickwick Papers
##  5          588                                                                  Master Humphrey's Clock
##  6          644                                                  The Haunted Man and the Ghost's Bargain
##  7          650                                                                      Pictures from Italy
##  8          653 "The Chimes\r\nA Goblin Story of Some Bells That Rang an Old Year out and a New Year In"
##  9          675                                                                           American Notes
## 10          678                                          The Cricket on the Hearth: A Fairy Tale of Home
## # ... with 64 more rows, and 6 more variables: author <chr>, gutenberg_author_id <int>, language <chr>,
## #   gutenberg_bookshelf <chr>, rights <chr>, has_text <lgl>

STAVE II: The first of three wrangles

We’re eventually going to make a ggplot2 faceted chart of the sentiments by paragraphs in each stave (chapter). I wanted nicer titles for the facets so we’ll clean up the stave titles first:

#' Convenience only (carol_df: the tibble from the gutenberg download above)
carol_txt <- paste0(carol_df$text, collapse = "\n")

data_frame(
  stave = 1:5,
  title = stri_match_all_regex(carol_txt, "STAVE [[:alpha:]]{1,3}: [[:alpha:] [:punct:]]+")[[1]][, 1] %>%
    stri_replace_first_regex("STAVE [[:alpha:]]{1,3}: ", "") %>%
    stri_trans_totitle()
) -> stave_titles

stri_trans_totitle() is a super-handy function and all we’re doing here is extracting the stave titles and doing a small transformation. There are scads of ways to do this, so don’t get stuck on this example. Try out other ways of doing this munging.

You’ll also see that I made sure we started at the first stave break rather than including the title bits in the analysis.

Now, we need to prep the text for text analysis.

STAVE III: The second of three wrangles

There are other text mining packages and processes in R. I’m using tidytext because it takes care of so many details for you and does so elegantly. I was also at the rOpenSci Unconf where the idea was spawned & worked on and I’m glad it blossomed into such a great package and a book!

Since we (I) want to do the analysis by stave & paragraph, let’s break the text into those chunks. Note that I’m doing an extra break by sentence in the event folks out there want to replicate this work but do so on a more granular level.

#' Break the text up into chapters, paragraphs, sentences, and words,
#' preserving the hierarchy so we can use it later.
data_frame(txt = carol_txt) %>%
  unnest_tokens(chapter, txt, token="regex", pattern="STAVE [[:alpha:]]{1,3}: [[:alpha:] [:punct:]]+") %>%
  mutate(stave = 1:n()) %>%
  unnest_tokens(paragraph, chapter, token = "paragraphs") %>% 
  group_by(stave) %>%
  mutate(para = 1:n()) %>% 
  ungroup() %>%
  unnest_tokens(sentence, paragraph, token="sentences") %>% 
  group_by(stave, para) %>%
  mutate(sent = 1:n()) %>% 
  ungroup() %>%
  unnest_tokens(word, sentence) -> carol_tokens

## # A tibble: 28,710 x 4
##   stave  para  sent   word
## 1     1     1     1 marley
## 2     1     1     1    was
## 3     1     1     1   dead
## 4     1     1     1     to
## 5     1     1     1  begin
## 6     1     1     1   with
## 7     1     1     1  there
## 8     1     1     1     is
## 9     1     1     1     no
## 10     1     1     1  doubt
## # ... with 28,700 more rows

By indexing each hierarchy level, we have the flexibility to do all sorts of structured analyses just by choosing grouping combinations.

STAVE IV: The third of three wrangles

Now, we need to layer in some sentiments and do some basic sentiment calculations. Many of these sentiment-al posts (including this one) take a naive approach, doing basic matching and looking only at 1-grams. One reason I didn’t go further was to make the code accessible to new R folk (since I primarily blog for new R folk :-). I’m prepping some 2018 posts with more involved text analysis themes and will likely add some complexity then with other texts.

#' Retrieve sentiments and compute them.
#' I left the `index` in vs just use `paragraph` since it'll make this easier to reuse
#' this block (which I'm not doing but thought I might).
inner_join(carol_tokens, get_sentiments("nrc"), "word") %>%
  count(stave, index = para, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  left_join(stave_titles, "stave") -> carol_with_sent

STAVE V: The end of it

Now, we just need to do some really basic ggplot-ing to get to our desired result:

ggplot(carol_with_sent) +
  geom_segment(aes(index, sentiment, xend=index, yend=0, color=title), size=0.33) +
  scale_x_comma(limits=range(carol_with_sent$index)) +
  scale_y_comma() +
  scale_color_ipsum() +
  facet_wrap(~title, scales="free_x", ncol=5) +
  labs(x=NULL, y="Sentiment",
       title="Sentiment Analysis of A Christmas Carol",
       subtitle="By stave & ¶",
       caption="Humbug!") +
  theme_ipsum_rc(grid="Y", axis_text_size = 8, strip_text_face = "italic", strip_text_size = 10.5)
You’ll want to tap/click on that to make it bigger.

Despite using a naive analysis, I think it tracks pretty well with the flow of the book.

Stave one is quite bleak. Marley is morose and frightening. There is no joy apart from Fred’s brief appearance.

The truly terrible (-10 sentiment) paragraph also makes sense:

Marley’s face. It was not in impenetrable shadow as the other objects in the yard were, but had a dismal light about it, like a bad lobster in a dark cellar. It was not angry or ferocious, but looked at Scrooge as Marley used to look: with ghostly spectacles turned up on its ghostly forehead. The hair was curiously stirred, as if by breath or hot air; and, though the eyes were wide open, they were perfectly motionless. That, and its livid colour, made it horrible; but its horror seemed to be in spite of the face and beyond its control, rather than a part of its own expression.

(I got to that via this snippet which you can use as a template for finding the other significant sentiment points:)

filter(
  carol_tokens, stave == 1,
  para == filter(carol_with_sent, stave == 1) %>% 
    filter(sentiment == min(sentiment)) %>% 
    pull(index)
)
Stave two (Christmas past) is all about Scrooge’s youth and includes details about Fezziwig’s party so the mostly-positive tone also makes sense.

Stave three (Christmas present) has the highest:

The Grocers’! oh, the Grocers’! nearly closed, with perhaps two shutters down, or one; but through those gaps such glimpses! It was not alone that the scales descending on the counter made a merry sound, or that the twine and roller parted company so briskly, or that the canisters were rattled up and down like juggling tricks, or even that the blended scents of tea and coffee were so grateful to the nose, or even that the raisins were so plentiful and rare, the almonds so extremely white, the sticks of cinnamon so long and straight, the other spices so delicious, the candied fruits so caked and spotted with molten sugar as to make the coldest lookers-on feel faint and subsequently bilious. Nor was it that the figs were moist and pulpy, or that the French plums blushed in modest tartness from their highly-decorated boxes, or that everything was good to eat and in its Christmas dress; but the customers were all so hurried and so eager in the hopeful promise of the day, that they tumbled up against each other at the door, crashing their wicker baskets wildly, and left their purchases upon the counter, and came running back to fetch them, and committed hundreds of the like mistakes, in the best humour possible; while the Grocer and his people were so frank and fresh that the polished hearts with which they fastened their aprons behind might have been their own, worn outside for general inspection, and for Christmas daws to peck at if they chose.

and lowest (sentiment) points of the entire book:

And now, without a word of warning from the Ghost, they stood upon a bleak and desert moor, where monstrous masses of rude stone were cast about, as though it were the burial-place of giants; and water spread itself wheresoever it listed, or would have done so, but for the frost that held it prisoner; and nothing grew but moss and furze, and coarse rank grass. Down in the west the setting sun had left a streak of fiery red, which glared upon the desolation for an instant, like a sullen eye, and frowning lower, lower, lower yet, was lost in the thick gloom of darkest night.

Stave four (Christmas yet to come) is fairly middling. I had expected to see lower marks here. The standout negative sentiment paragraph (and the one that follows) are pretty dark, though:

They left the busy scene, and went into an obscure part of the town, where Scrooge had never penetrated before, although he recognised its situation, and its bad repute. The ways were foul and narrow; the shops and houses wretched; the people half-naked, drunken, slipshod, ugly. Alleys and archways, like so many cesspools, disgorged their offences of smell, and dirt, and life, upon the straggling streets; and the whole quarter reeked with crime, with filth, and misery.

Finally, Stave five is both short and positive (whew!). Which I heartily agree with!


The code is up on GitHub and I hope that it will inspire more folks to experiment with this fun (& useful!) aspect of data science.

Make sure to send links to anything you create and shoot over PRs for anything you think I did that was awry.

For those who celebrate Christmas, I hope you keep Christmas as well as or even better than old Scrooge. “May that be truly said of us, and all of us! And so, as Tiny Tim observed, God bless Us, Every One!”

To leave a comment for the author, please follow the link and comment on their blog.

Source:: R News

How to generate a Secret Santa list with R

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Several recent blog posts have explored the Secret Santa problem and provided solutions in R. This post provides a roundup of various solutions and how they are implemented in R.

If you wanted to set up a “Secret Santa” gift exchange at the office, you could put everyone’s name into a hat and have each participant draw a name at random. The problem is that someone might draw their own name, but if that happens you can just reshuffle all the names back into the hat and start the process over. That’s essentially what the R code below, from a blog post by David Selby, does:

🎁 w/ code:
“Secret Santa in R” ✏ @TeaStats #rstats

— Mara Averick (@dataandme) November 24, 2017
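Selby's snippet itself isn't reproduced above, but the draw-and-reshuffle idea he describes can be sketched as follows (this is my own minimal version, not his actual code, and the names are illustrative):

```r
# Draw names from the hat; reshuffle whenever anyone draws themselves
secret_santa <- function(names) {
  repeat {
    recipients <- sample(names)
    if (!any(recipients == names)) break   # valid: no self-giving
  }
  data.frame(santa = names, recipient = recipients)
}

secret_santa(c("Ada", "Grace", "John", "Rose"))
```

The `repeat` loop is exactly the "put the names back in the hat and start over" step done in code.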

That’s not an entirely satisfying solution (at least to me), with all of the having to check for self-giving and restarting if so. Thomas Lumley calculates that the chance of requiring a do-over is about 63% (for more than 2 participants, anyway), and on average you’d need about 2.7 tries to get a “valid” gift list. (This is an example of the negative binomial distribution in action: we keep on drawing a set of names from the hat, until we get a set that has no-one giving to themselves. You can generate 100 examples of this playing out in R with rnbinom(100,1, exp(-1))+1 — the +1 is for the final successful draw from the hat.)
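That 63% figure is the classic derangement result — the probability that a uniformly random permutation of n names has at least one fixed point tends to 1 − 1/e ≈ 0.632 as n grows — and it is easy to check by simulation (this snippet is mine, not Lumley's):

```r
set.seed(2017)
n_people <- 10
# TRUE whenever a shuffled draw hands at least one person their own name
needs_redo <- replicate(10000, any(sample(n_people) == seq_len(n_people)))
mean(needs_redo)   # close to 1 - exp(-1), i.e. about 0.63
```

The mean converges to the do-over probability, matching the ~63% quoted above.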

An easier way might be simply to seat all the participants in random order in a circle and assign them to give a gift to the person on their right. This is easy to do in R, and doing it in code has the benefit of keeping the recipients secret from each other. First, let’s select 10 names from the babynames dataset:

> library(babynames)
> santas <- sample(babynames$name, 10, prob=babynames$prop)

(Using prob=babynames$prop selects names according to their prevalence in US births 1880-2015.) Then, it’s a simple matter of reordering the names at random, and assigning a gift to the next in line:

> p <- sample(santas)
> cbind(santa=p, recipient=c(tail(p,-1),p[1]))
      santa       recipient  
 [1,] "Sherman"   "Shayna"   
 [2,] "Shayna"    "Elizabeth"
 [3,] "Elizabeth" "Mary"     
 [4,] "Mary"      "Kathleen" 
 [5,] "Kathleen"  "Russell"  
 [6,] "Russell"   "James"    
 [7,] "James"     "Arlene"   
 [8,] "Arlene"    "Ruth"     
 [9,] "Ruth"      "Darryl"   
[10,] "Darryl"    "Sherman"  

Now, this “sit in a circle” process isn’t quite the same as the “keep on drawing names from the hat” process. In the above example, the 10 names form a cycle, and it will never happen that Arlene gives to Ruth and Ruth gives to Arlene. Nonetheless, I think it’s the simplest and fairest way of generating a Secret Santa list.

Thinking of the gift-giving relationship as a graph, the process above generates a Hamiltonian path through each of the recipients, and never generates more than one clique. Tristan Mahr explores the graph-theory nature of the Secret Santa problem in a blog post. There, he uses the DiagrammeR package to solve variants of the Secret Santa problem by constructing graphs, which can be created and visualized quite easily in R.

Finally, Sarah Lotspeich and Lucy D’Agostino McGowan take the whole process one step further and show how to generate emails to each participant using the ponyexpress package, notifying them of their Secret Santa recipient.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Source:: R News

R Programmers: What is your biggest problem when working with Census data?

By Ari Lamstein

(This article was first published on R –, and kindly contributed to R-bloggers)

A few weeks ago I announced my latest project: Data Science instruction at the Census Bureau.

In addition to announcing the project, I also snuck a bit of market research into the post. I asked people the types of analyses they do when working with Census data. I also asked what packages they use when solving those problems.

23 people left comments, and they have been very helpful in shaping the curriculum of the course. Thank you to everyone who left a comment!

That was such an effective way to learn about the community of R Census users that I’ve decided to do it again. If you are an R programmer who has worked with Census data, please leave a comment with an answer to this question:

What is your biggest problem when working with Census data in R?

Understanding the obstacles people face has the potential to help us design better courses.

Leave your answer as a comment below!

The post R Programmers: What is your biggest problem when working with Census data? appeared first on

To leave a comment for the author, please follow the link and comment on their blog.

Source:: R News

Win-Vector LLC announces new “big data in R” tools

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN):

  • partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners can make your code faster and sequence steps to avoid critical issues (the complementary problems of too long in-mutate dependence chains, of too many mutate steps, and incidental bugs; all explained in the linked tutorials).
  • if_else_device(): provides a dplyr::mutate() based simulation of per-row conditional blocks (including conditional assignment). This allows powerful imperative code (such as often seen in porting from SAS) to be directly and legibly translated into performant dplyr::mutate() data flow code that works on Spark (via Sparklyr) and databases.


For “big data in R” users these two function families (plus the included support functions and examples) are simple, yet game changing. These tools were developed by Win-Vector LLC to fill gaps identified by Win-Vector and our partners when standing-up production scale R plus Apache Spark projects.

We are happy to share these tools as open source, and very interested in consulting with your teams on developing R/Spark solutions (including porting existing SAS code). For more information please reach out to Win-Vector.

To help teams get started, we are supplying the following initial documentation, discussion, and examples:

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Source:: R News