Our R package roundup

By Christoph Safferling


A year in review

And yet again, it’s that time of the year when one eats too much and gets into a reflective mood! 2016 is nearly over, and we bloggers here at opiateforthemass.es thought it would be nice to argue endlessly about which R package was the best/neatest/most fun/most useful/most whatever this year! By now, it’s become a tradition.

Like last year, we will each present our top pick for this year’s R package, a purely subjective list of packages we (and Chuck Norris) approve of.

But do not despair, dear reader! We have also pulled hard data on R package popularity from CRAN, and will present this first.

Top Popular CRAN packages

Let’s start with some factual data before we go into our personal favourites of 2016. We’ll pull the titles of the new 2016 R packages from cranberries. Our approach from last year no longer works, since rvest::read_html chokes on the tags in the page – they must have been added during this year. Instead, we read the RSS feed of Dirk’s page and go a little crazy on the grep. Always a healthy thing, I say. Any problem can be solved if you throw enough grep at it. Randall agrees:

xkcd regexp problems
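In rough outline, the grep-heavy approach looks something like this (a sketch only, not the actual script; the feed URL and the exact wording of the feed titles are assumptions):

# read the cranberries RSS feed and keep only the titles of "new package" items
feed   <- readLines("http://dirk.eddelbuettel.com/cranberries/index.rss")
titles <- grep("<title>", feed, value = TRUE)
newpkg <- grep("New package", titles, value = TRUE)
# titles are assumed to look like "New package foo with initial version 1.2.3"
pkgs   <- sub(".*New package ([^ ]+) with.*", "\\1", newpkg)
head(pkgs)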

As mentioned last year, using downloads per day as a ranking metric could have the problem that earlier package releases have had more time to create a buzz and shift up the average downloads per day, skewing the data in favour of older releases. Or, it could have the complication that younger package releases are still on the early “hump” part of the downloads (let’s assume they’ll follow a log-normal (exponential decay) distribution, which most of these things do), thus skewing the data in favour of younger releases. We still don’t know, since we still didn’t bother to dig deeper into this topic. Perhaps we’ll manage next year, but I’m not going to do any New Year’s Resolutions…

For now, let’s again assume that average downloads per day is a relatively stable metric to gauge package success with. With some quick dplyr and ggplot magic, these are the top 20 new CRAN packages from 2016, by average number of daily downloads:

top 20 new CRAN packages in 2016

The full code and data are available on github, of course.
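For a flavour of what that code does, here is a rough sketch of the kind of dplyr/ggplot pipeline involved (not the original script; it uses the cranlogs package for the download counts, and pkgs is assumed to hold the 2016 package names, e.g. as grepped out of the RSS feed above):

library(cranlogs)
library(dplyr)
library(ggplot2)

dl <- cran_downloads(packages = pkgs, from = "2016-01-01", to = "2016-12-31")

top20 <- dl %>%
  group_by(package) %>%
  filter(any(count > 0)) %>%
  filter(date >= min(date[count > 0])) %>%   # only count days since the first download
  summarise(avg_daily = mean(count)) %>%
  arrange(desc(avg_daily)) %>%
  slice(1:20)

ggplot(top20, aes(x = reorder(package, avg_daily), y = avg_daily)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "average downloads per day in 2016")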

As we can see, the main bias does not come from our choice of ranking metric, but from the fact that some packages are more “under the hood” and are pulled in by many other packages as dependencies, thus inflating their download statistics.

tibble is Hadley’s re-take of the venerable data.frame and is pulled in as a dependency by practically all of his packages. It was always part of dplyr, known there as tbl, and has now been given its own package to keep things clean and isolated. tidyverse is a meta-package containing all of Hadley’s “tidy data” packages, and allows for easier loading of these. We thought about picking these as our top packages, but we settled for naming them “honorable, because obvious, mentions”. We want to put some focus on the not-so-universally-known packages as well!

The packages (sourcetools, hms, plogr, backports) are mainly technical packages, racking up downloads by being pulled in as dependencies of popular packages or by being helpful for running R code “deeply”. ModelMetrics provides rapid calculation of model evaluation metrics, implemented with Rcpp. rsconnect and miniUI are packages used by shiny, hence their high placement in the list.

Excluding these, the clear winners among “frontline” packages are rprojroot, which lets you define your project root directory by matching a certain criterion (for instance, that a certain file is contained in that directory), forcats, which makes working with categorical variables easier, and modelr, which enhances the pipe workflow established by magrittr. I must admit that I had not known of the rprojroot package, probably because I’m a stickler for clean directory setups.
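For instance, a minimal illustration of the rprojroot idea (the marker file and the data path below are made up for the example, not taken from the post):

library(rprojroot)
# locate the directory containing the marker file, wherever the script is run from
root <- find_root(has_file("myproject.Rproj"))
dat  <- read.csv(file.path(root, "data", "raw.csv"))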

An interesting find is ggrepel, as it’s the only “gg-package” (of 23 in total) that made it into the top 20. Since it became possible to build extensions for ggplot2, there has been an explosion of specialised and fantastic packages for building even more, and better, graphs in R. In fact, the new version of ggplot2 should be listed as our top “new” package for 2016, if we counted new major versions as new packages. Also this year, RStudio finally reached its 1.0.0 version, with many cool new features.

But now, this blog’s authors’ personal top four of new R packages for 2016:

bookdown

(safferli’s pick)

bookdown is a specialised markdown package by the venerable Yihui, of knitr fame. It’s still early in development, but already very usable. It’s used to build actual books from Rmarkdown documents, and can build gitbook, pdf, mobile, and html books, all from a single markdown source.

Being a huge LaTeX fan, I’m a little sad that, with me working more with data in R and the advent of knitr and now bookdown, I never actually get to use it directly any more. For extremely specialised (and static) books or reports I would probably still move back to LaTeX, but the power and flexibility of Rmarkdown to build dynamic html reports outweighs the detailed configuration possibilities and pure beauty of direct LaTeX. Especially if you are using data in your reports.

A big thumbs up for this package. If you’re looking to write long-ish documentation or a report, currently nothing beats gitbook, and bookdown gives you the power of R and Rmarkdown to work with.
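For the record, a minimal sketch of how such a book is built (following the standard bookdown conventions; this is not code from the post):

# an index.Rmd plus one .Rmd file per chapter in the book directory, then:
library(bookdown)
render_book("index.Rmd", output_format = "bookdown::gitbook")    # html gitbook
render_book("index.Rmd", output_format = "bookdown::pdf_book")   # pdf (via LaTeX)
render_book("index.Rmd", output_format = "bookdown::epub_book")  # mobile/epub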

feather

(Yuki’s pick)

Yuki now focuses more on Python, and will continue blogging here about Python and data science topics. To honor this, we picked feather for Yuki: a package that makes storing dataframes to disk lightning fast, and shareable between R and Python.
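A minimal illustration of the R side (the file name is arbitrary; the same file can then be read from Python with the feather/pandas bindings):

library(feather)
write_feather(mtcars, "mtcars.feather")   # write a data frame to disk from R
df <- read_feather("mtcars.feather")      # read it back into R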

flexdashboard

(Kirill’s pick)

I discovered my pick of the year quite late (two weeks ago), so maybe I am biased, because I am still amazed by it. I was building interactive graphs with highcharter and wanted to share the results with my colleagues. Since I had a couple of charts, I thought I would make an Rmarkdown document and insert them there. That’s when I discovered flexdashboard: basically an Rmarkdown format for making static dashboards. Just build some interactive graphs with any htmlwidget-based package (or any package you like), arrange them easily with flexdashboard, et voilà, you have a simple static dashboard. I also realized that in many cases where people want to share interactive graphs with others, the go-to tool is shiny, but that is often over-engineering if all it does is filter the data by date or category. The beauty of flexdashboard is that you get a single standalone html file, which doesn’t need server-side logic or any complex stuff in the backend.
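To give a flavour, here is a bare-bones flexdashboard Rmd skeleton (the highcharter chart is just a placeholder example, not from Kirill’s dashboards):

---
title: "Charts for my colleagues"
output: flexdashboard::flex_dashboard
---

Column
-------------------------------------

### Interactive scatter

```{r}
library(highcharter)
hchart(mtcars, "scatter", hcaes(x = wt, y = mpg, group = factor(cyl)))
```

Knitting this produces a single standalone html file that can simply be mailed around.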

ggiraph

(Jess’s pick)

I spent 2016 discovering the beauty of shinydashboard, building quite a few dashboards with many many tabs and many many more plots. Eventually I wanted values to be displayed when hovering over a point with the mouse. This is when ggiraph came to the rescue: a package written by David Gohel that extends ggplot2 and makes it easy to add interactivity, such as hover tooltips, to ggplots. Take a look at the website and browse through the examples to see what ggiraph is capable of. Super easy to use with plain ggplot as well as with shiny!
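A minimal sketch of the hover tooltips described above (the ggiraph() call was the API at the time; newer versions also offer girafe(); the columns used are arbitrary):

library(ggplot2)
library(ggiraph)

gg <- ggplot(mtcars,
             aes(x = wt, y = mpg,
                 tooltip = rownames(mtcars),    # text shown on hover
                 data_id = rownames(mtcars))) +
  geom_point_interactive(size = 3)

ggiraph(code = print(gg))   # returns an htmlwidget usable in Rmarkdown or shiny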

Our R package roundup was originally published by Kirill Pomogajko at Opiate for the masses on December 31, 2016.

Source:: R News

Power BI custom visuals, based on R

By David Smith

You’ve been able to include user-defined charts using R in Power BI dashboards for a while now, but a recent update to Power BI includes seven new custom charts based on R in the custom visuals gallery. You can see the new chart types by visiting the Power BI Custom Visuals Gallery and clicking on the “R-powered visuals” tab.

The new custom visuals are listed below. Click on the visual names to see an example, and click on the “GitHub” link at the bottom of the pop-up to see the actual R code used.

You can also combine custom R charts with other Power BI tools to create interactive R charts. A favorite of mine is the Timeline slicer, which allows you to select a range from a date variable. (The dashboard, including the R charts, is updated using only data from the date range you select.) You can also use the built-in date slicer for similar effect, and Tomaž Kaštrun demonstrates how to combine it with a custom R chart in his blog post R graphs and Tables in Power BI Desktop.

Source:: R News

Fireworks (in R)

By Jeroen Kromme

Happy New Year

(This article was first published on Rbloggers – The Analytics Lab, and kindly contributed to R-bloggers)

New Year – a new chapter, new verse, or just the same old story? Ultimately we write it. The choice is ours. ― Alex Morritt

The Analytics Lab and Cmotions wish everybody a happy new year. A year full of challenges, new experiences and new knowledge.

Github


library(ggplot2)
rm(list = ls())

# First rocket
t1 = 1:30            # number of points
size1 = 2            # size of rocket
r1_center = c(-4,10) # end center

rocket1 = data.frame(x1 = rep(r1_center[1], length(t1)) 
                , x2 = size1*cos(t1)  + r1_center[1]
                , y1 = rep(r1_center[2], length(t1))
                , y2 = size1*sin(t1) + r1_center[2]
                ) 

# Second rocket
t2 = 1:44
size2 = 3
r2_center = c(3,12)
rocket2 = data.frame(x1 = rep(r2_center[1], length(t2)) 
                     , x2 = size2*cos(t2)  + r2_center[1]
                     , y1 = rep(r2_center[2], length(t2))
                     , y2 = size2*sin(t2) + r2_center[2]
) 


# Third rocket
t3 = 1:44
size3 = 4
r3_center = c(-2,17)
rocket3 = data.frame(x1 = rep(r3_center[1], length(t3)) 
                     , x2 = size3*cos(t3)  + r3_center[1]
                     , y1 = rep(r3_center[2], length(t3))
                     , y2 = size3*sin(t3) + r3_center[2]
) 

# Fourth rocket
t4 = 1:44
size4 = 4
r4_center = c(4,20)
rocket4 = data.frame(x1 = rep(r4_center[1], length(t4)) 
                     , x2 = size4*cos(t4)  + r4_center[1]
                     , y1 = rep(r4_center[2], length(t4))
                     , y2 = size4*sin(t4) + r4_center[2]
) 



# Plot fireworks
ggplot() +
  
  # set theme
  theme(panel.background = element_rect(fill = '#252525', colour = '#252525'),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.line=element_blank(),axis.text.x=element_blank(),
        axis.text.y=element_blank(),axis.ticks=element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),legend.position="none") + 
  
  # first rocket
  geom_point(aes(x = x2, y = y2), data = rocket1, shape = 4, colour = '#74add1')+
  geom_curve(aes(x = x1, y = y1, xend = x2, yend = y2 ), curvature = 0.1, data = rocket1, colour = '#74add1') +
  geom_curve(aes(x = 0, y = -7, xend = r1_center[1], yend = r1_center[2] ), curvature = 0.1, colour = '#878787') +
  
  # second rocket
  geom_point(aes(x = x2, y = y2), data = rocket2, shape = 4, colour = '#fed976')+
  geom_curve(aes(x = x1, y = y1, xend = x2, yend = y2), curvature = 0.1, data = rocket2, colour = '#fed976') +
  geom_curve(aes(x = 0, y = -7, xend = r2_center[1], yend = r2_center[2] ), curvature = -0.1, colour = '#878787') +

  # third rocket
  geom_point(aes(x = x2, y = y2), data = rocket3, shape = 4, colour = '#fa9fb5')+
  geom_curve(aes(x = x1, y = y1, xend = x2, yend = y2 ), curvature = 0.1, data = rocket3, colour = '#fa9fb5') +
  geom_curve(aes(x = 0, y = -7, xend = r3_center[1], yend = r3_center[2] ), curvature = 0.1, colour = '#878787') +

  # fourth rocket
  geom_point(aes(x = x2, y = y2), data = rocket4, shape = 4, colour = '#addd8e')+
  geom_curve(aes(x = x1, y = y1, xend = x2, yend = y2 ), curvature = 0.1, data = rocket4, colour = '#addd8e') +
  geom_curve(aes(x = 0, y = -7, xend = r4_center[1], yend = r4_center[2] ), curvature = -0.2, colour = '#878787') +
  
  # title
  ggtitle('The Analytics Lab wishes you a Happy New Year') +
  
  # save
  ggsave('Happy New Year.png', units = 'cm', width = 15, height = 20)

To leave a comment for the author, please follow the link and comment on their blog: Rbloggers – The Analytics Lab.


Source:: R News

Outlier App: An Interactive Visualization of Outlier Algorithms

By Rajiv Shah

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

I was recently trying various outlier detection algorithms. For me, the best way to understand an algorithm is to tinker with it. I built a shiny app that allows you to play around with various outlier algorithms and wanted to share it with everyone.

The shiny app is available on my site, but even better, the code is on github for you to run locally or improve! Let me give you a quick tour of the app in this post. If you prefer, I have also posted a video that provides background on the app. Another tutorial on how to build interactive web apps with shiny is published at DataScience+.

Background

The available algorithms include:

– Hierarchical Clustering (DMwR)
– Kmeans Euclidean Distance
– Kmeans Mahalanobis
– Kmeans Manhattan
– Fuzzy kmeans – Gustafson and Kessel
– Fuzzy k-medoids
– Fuzzy k-means with polynomial fuzzifier
– Local Outlier Factor (dbscan)
– RandomForest (proximity from randomForest)
– Isolation Forest (IsolationForest)
– Autoencoder (Autoencoder)
– FBOD and SOD (HighDimOut)
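For instance, a bare-bones version of the “Kmeans Euclidean Distance” idea above (a base-R sketch, not the app’s actual code) scores each point by its Euclidean distance to its assigned cluster centre and flags the largest distances as outliers:

set.seed(42)
x <- rbind(matrix(rnorm(200), ncol = 2),           # two clusters ...
           matrix(rnorm(200, mean = 4), ncol = 2),
           matrix(runif(10, -6, 10), ncol = 2))    # ... plus a few stray points

xs <- scale(x)                                     # scaling matters, as the app shows
km <- kmeans(xs, centers = 2)
centres  <- km$centers[km$cluster, ]               # centre assigned to each point
score    <- sqrt(rowSums((xs - centres)^2))        # distance-to-centre outlier score
outliers <- order(score, decreasing = TRUE)[1:5]   # flag the 5 most distant points

plot(x, col = km$cluster, pch = 19, xlab = "", ylab = "")
points(x[outliers, , drop = FALSE], col = "red", cex = 2, lwd = 2)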

There are also a wide range of datasets to try as well. They include randomly scattered points, defined clusters, and some more unusual patterns like the smiley face and spirals. Additionally, you can use your mouse to add and/or remove points by clicking directly within the visualization. This allows you to create your own dataset.

Using the app

Once the data is loaded, you can start exploring. One thing you can do is look at the effect scaling can have. In this example, you can see how outliers differ when scaling is used with kmeans. The values on the far right no longer dominate the distance measurements, and there are now outliers from other areas:

It quickly becomes apparent that different algorithms may select different outliers. In this case, you see a difference between the outliers selected using an autoencoder versus isolation forest.

Another example here is the difference between chosen outliers using kmeans versus fuzzy kmeans:

A density-based algorithm can also select different outliers than a distance-based algorithm. This example nicely shows the difference between kmeans and lof (local outlier factor from dbscan).

An important part of using this visualization is studying the distance numbers that are calculated. Are these numbers meshing with your intuition? How big of a quantitative difference is there between outliers and other points?

3D+ App?

The next thing is whether to expand this to larger datasets. This is something that you would run locally (large datasets take too long to run on my shiny server). The downside of larger datasets is that it gets trickier to visualize them. For now, I am using a TSNE plot. I am open to suggestions, but the intent here is a way to evaluate outlier algorithms on a variety of datasets.

Source Code

The source code for the outlier app is on github. The app is built off a variety of R packages and could easily be extended to new packages or incorporate additional datasets. Please send me bug fixes, additional algorithms, tighter code, or ideas for improving the app.

    Related Post

    1. Creating an animation using R
    2. The importance of Data Visualization
    3. ggplot2 themes examples
    4. Map the Life Expectancy in United States with data from Wikipedia
    5. What can we learn from the statistics of the EURO 2016 – Application of factor analysis
    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


    Source:: R News

    Sparse matrices, k-means clustering, topic modelling with posts on the 2004 US Presidential election

    By Peter's stats stuff – R

    (This article was first published on Peter’s stats stuff – R, and kindly contributed to R-bloggers)

    Daily Kos bags of words from the time of the 2004 Presidential election

    This is a bit of a rambly blog entry today. My original motivation was just to explore moving data around from R into the H2O machine learning software. While successful on this, and more on it below, I got a bit interested in my example data in its own right. The best initial exploratory analysis of that data turns out, in my opinion, not to be one of the approaches implemented in H2O, so by the end of the post I’m back in R illustrating the use of topic modelling for analysing blog posts. As a teaser, this is where we’re going to end up:

    That graphic represents eight latent “topics” identified in 3,430 posts on the liberal (as per US terminology) Daily Kos site from the time of the 2004 US Presidential Election. The data are available at the Machine Learning Repository of the University of California’s Center for Machine Learning and Intelligent Systems. The metadata are low on context (or perhaps I was just looking in the wrong spot) – the origin is just described as “KOS blog entries … orig source: dailykos.com”, and I’m presuming from their content that they are from 2004. More on what it all means down below.

    Sparse matrices

    I chose this data because it’s a nice example of a medium sized sparse matrix. The data has the number of occurrences of 6,906 words in 3,430 documents, so the obvious conceptual format is a matrix with 3,430 rows and 6,906 columns, where most cells are empty (meaning zero occurrences of that word in that document) and the filled cells are counts. Of course, we try to avoid moving data around like that. Instead of that super-wide matrix, the data is distributed as a text file with three columns that looks like this:

    1 61 2
    1 76 1
    1 89 1
    1 211 1
    1 296 1
    1 335 1
    1 404 1
    1 441 1
    1 454 2

    In this case, it means that the first document has two counts of word 61; one count of word 76; one count of word 89; and so on. A separate file tells us that word 61 is “action”, word 76 is “added” and so on. In fact, the two files for distributing these data are just the two tables that you’d represent them with in a relational database. This is the optimally efficient way of storing data of this sort.

    Now, I’ve been exploring the machine learning software H2O which provides fast and efficient estimation of models such as:

    • deep learning neural networks
    • random forests
    • gradient boosting machines
    • generalized linear models with elastic net regularisation
    • k-means clustering

    All these tools need data to be represented in a wide format, rather than the efficient long format you’d have in a relational database. In this case, that would mean a row for each document (or observation) and a column for each word, which would be used as “features” in machine learning terminology. That is, it needs the sort of operation you’d use PIVOT for in SQL Server, or tidyr::spread in R (or, before the coming of the tidyverse, reshape or reshape2::cast).

    H2O is expected to work with another tool for data persistence eg a Hadoop cluster. One thing I’m still getting my head around is the several variants on making data actually available to H2O. I’m particularly interested in the case of medium-sized data that can be handled on a single machine in efficient sparse format, when a Hadoop cluster isn’t available. So the Daily Kos data was my small familiarisation example in this space.

    A common workflow I expect with this sort of data is a bunch of pre-processing that will need a general data munging tool like R or Python before upload into H2O. For example, with text data, I might want to do one or more of remove common “stopwords” (like ‘the’, ‘and’, etc); reduce words to their “stems” (so “abandons” and “abandon” are treated as the same); match words to sentiment; etc. This is the sort of thing done with NLTK in Python or tm, SnowballC and the brilliant new tidytext in R.

    tidytext by Julia Silge and David Robinson makes it possible to analyse textual data using the general data manipulation grammar familiar to users of Hadley Wickham’s tidyverse. One verb tidytext offers is cast_sparse, which converts data in long tidy format into a wide but sparse Matrix format (using Doug Bates and Martin Maechler’s Matrix package, which is distributed with base R) that is suitable for a range of statistical and machine learning models. So I was pleased to note that the latest stable version of H2O (not yet on CRAN at the time of writing) and its accompanying R interface expand the h2o::as.h2o function to support the sparse Matrix format.

    Importing and modifying the data

    So here’s how I set about exploring a workflow based on the following key tools

    • Pre-processing in R

      • Import data with data.table::fread – the fastest way to get data in. While readr or even utils functions would also do the job well in this modest-sized case, I want to scale up in the future and fread is the best there is for getting large, regular text data into R.
      • Modify with tools from the tidyverse and tidytext, calling in specialist packages for stemming etc as necessary
      • Cast to sparse Matrix format and export to H2O with h2o::as.h2o
    • Analyse in H2O
    • Export the summary results back to R for presentation

    I had my doubts about whether h2o::as.h2o is right for the job, because I understand it doesn’t make the most of H2O’s own fast data importing abilities. But it is definitely a candidate for medium sized data.

    Here’s how I got started. This first chunk just loads the functionality used in the rest of the post and downloads the main table, if it hasn’t been downloaded before (I developed this script in little chunks of time between doing other stuff, and didn’t want to download the file every time I ran it or have to bother with selectively running bits of the script). Lots of R packages needed for today’s exploration:

    library(ggplot2)
    library(scales)
    library(R.utils)    # for gunzip
    library(tidyverse)
    library(tidytext)
    library(Matrix)     # for sparse matrices
    library(data.table) # for fread, fast version of read.table
    library(SnowballC)  # for wordStem
    library(foreach)
    library(doParallel)
    library(testthat)   # used for various checks and tests along the way
    library(topicmodels)
    library(slam)       # for other types of sparse matrices, used with topicmodels
    library(wordcloud)
    library(grid)       # for annotating graphics 
    
    # latest version of h2o needed at time of writing (December 2016) for sparse matrix support
    # install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-tutte/1/R")))
    library(h2o)
    
    # https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
    # download the smallest example bag of words for first trial
    if(!"docword.kos.txt" %in% list.files()){
       download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.kos.txt.gz",
                     destfile = "docword.kos.txt.gz", mode = "wb")
       gunzip("docword.kos.txt.gz")
    }

    Importing the data is super fast with data.table::fread. Combining it with the vocabulary look up table, removing stop words and performing stemming is made super-easy by tidytext in combination with text mining specialist tools, in this case SnowballC::wordStem:

    kos <- fread("docword.kos.txt", skip = 3, sep = " ")  
    names(kos) <- c("doc", "word", "count")
    expect_equal(sum(kos$count), 467714)
    
    kos_vocab <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.kos.txt")$V1
    
    kos2 <- kos %>%
       # attach the names of words back to the words
       mutate(word = factor(word, labels = kos_vocab)) %>%
       # stopwords are meant to be already removed but no harm in checking (note - makes no difference):
       anti_join(stop_words, by = "word") %>%
       # reduce words to their stems (otherwise abandon, abandoning, abandons all treated as different, etc),
       mutate(word = wordStem(word)) %>%
       mutate(word = gsub("^iraq$", "iraqi", word)) %>%
       # ... and aggregate back up by the new stemmed words to remove duplicates
       group_by(doc, word) %>%
       summarise(count = sum(count)) %>%
       # correct some names that went wrong in the stemming:
       mutate(word = gsub("^chenei$", "cheney", word),
              word = gsub("^kerri$", "kerry", word),
              word = gsub("^georg$", "george", word),
              word = gsub("^john$", "John", word))
    		  
    # how many words per post?
    wordcount <- kos2 %>%
       group_by(doc) %>%
       summarise(words = sum(count))
    
    par(font.main = 1)
    hist(wordcount$words, col = "grey", fg = "grey80", border = "grey95",
         main = "Words per post in the KOS data set", xlab = "Number of words")

    histogram

    Because the post lengths varied in the predictable way (histogram above), I decided to convert all the counts to proportions of words in each original document. The next chunk does this, and then casts the data out of long format into wide, analysis-ready format. I create both sparse and dense versions of the wide matrix to check that things are working the way I want (this is one of the reasons for checking things out on a fairly small dataset). In the chunk below, note that I had a bit of a fiddle (and then built in a test to check I got it right) to get the columns in the same order with cast_sparse as when I used spread for the dense version.

    # OK, quite a variety of words per post.
    # Convert counts to *relative* count (ie proportion of words in each doc)
    # need to do this as otherwise the main dimension in the data is just the 
    # length of each blog post
    kos3 <- kos2 %>%
       group_by(doc) %>%
       mutate(count = count / sum(count))
    
    # Sparse version
    kos_sparse <- kos3 %>%
       # arrange it so the colnames after cast_sparse are in alphabetical order:
       arrange(word) %>%
       cast_sparse(doc, word, count)
    
    # Dense version
    kos_df <- kos3 %>%
       spread(word, count, fill = 0)
    kos_dense <- as.matrix(kos_df[ , -1])
    
    expect_equal(colnames(kos_dense), colnames(kos_sparse))
    
    dim(kos_sparse) # would be 3430 documents and 6906 words except that there's been more processing, so only 4566 words
    object.size(kos) # about 4MB
    object.size(kos_sparse) # about the same.  Slightly less efficient
    object.size(kos_dense) # 120+ MB

    k-means cluster analysis in R

    As this exercise started as essentially just a test of getting data into H2O, I needed an H2O algorithm to test on the data. With just bags of words and no other metadata (such as how popular each post was, or who wrote it, or anything else of that sort) there’s no obvious response variable to test predictive analytics methods on, so I chose k-means clustering as my test algorithm instead. There are reasons why this method isn’t particularly useful for this sort of high-dimensional (ie lots of columns) data, but they’re not enough to stop it being a test of sorts.

    One of the challenges of cluster analysis is to know how many groups (k) to divide the data into. So to give R and H2O a decent run, I applied the k-means clustering to the data (using the dense version of the matrix) with each value of k from 1 to 25.

    Close readers – if there are any – may note that I haven’t first scaled all the columns in the matrix before applying the k-means algorithm, even though this is often recommended to avoid columns with high variance dominating the clustering. This is a deliberate choice from me, applicable to this particular dataset. As all the columns are on the same scale (“this word’s frequency as a proportion of total words in the document for this row”) I thought it made sense to let the high variance columns sway the results, rather than (in effect) let low variance columns for rarely-used words get equal votes.

    #========k-means cluster analysis in R===========
    # set up parallel cluster for the R version of analysis 
    # (sorry for there being two uses of the word "cluster"...)
    cluster <- makeCluster(7) # only any good if you have at least 7 processors :)
    registerDoParallel(cluster)
    clusterExport(cluster, "kos_dense")
    
    # How to work out how many clusters to find in the data?
    # great answer at http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters 
    # But I only use the simplest method - visual inspection of a scree plot of the total within squares
    
    # takes some time, even with the parallel processing:
    max_groups <- 25
    wss <- foreach (i = 1:max_groups) %dopar% {
       kmeans(kos_dense, centers = i)$tot.withinss
    }
    
    par(font.main = 1)
    plot(1:max_groups, wss, type="b", xlab="Number of groups",
         ylab="Within groups sum of squares", bty = "l")
    grid()
    
    stopCluster(cluster)

    This graphic shows the total within-groups sum of squares for each value of k. The within-groups sum of squares is an indicator of how homogenous the points within each group are. Inevitably it goes down as the number of groups dividing the data increases; we draw plots like this one to see if there is a visually-obvious “levelling-off” point to indicate when we are getting diminishing marginal returns from adding more groups.

    scree

    I won’t worry too much about interpreting this as I think the method is of dubious value anyway. But my gut feel is that either there is 1, 2, or about 20 groups in this data.

    Getting the data into H2O

    For the H2O comparison, first step is to initiate an H2O cluster and get the data into it. With the newly enhanced as.h2o function working with sparse Matrix objects this is easy and takes only 3 seconds (compared to 22 seconds for uploading the dense version of the matrix, code not shown).

    There’s a funny quirk to watch out for; the column names don’t get imported, and only 100 column names are applied.

    I suspect the colnames issue after importing a sparse Matrix to H2O is a bug, but will make sure before I report it as an issue.

    Having imported the data, I do a basic check. k-means with a single group should return just the mean of each column as the p-dimensional centroid of a single big hyper-sphere containing all the data. I want to check that the various versions of the data in R and H2O all return the same values for this.

    #==============cluster analysis in h2o======================
    h2o.init(nthreads = -1, max_mem_size = "10G")
    
    system.time({ kos_h1 <- as.h2o(kos_sparse)}) # 3 seconds
    dim(kos_h1) # correct
    colnames(kos_h1) # for some reason, only 100 columnames, c1:100
    colnames(kos_h1) <- colnames(kos_sparse)
    
    #--------------------checking we have the same data in all its different versions--------------------
    # Compare the simplest single cluster case - should just be the mean of each column
    simple_r <- kmeans(kos_dense, centers = 1)
    simple_h <- h2o.kmeans(kos_h1, x = colnames(kos_h1), k = 1, standardize = FALSE)
    simple_m1 <- apply(kos_dense, 2, mean)
    simple_m2 <- t(apply(kos_h1, 2, mean))
    simple_m3 <- apply(kos_sparse, 2, mean)
    
    comparison <- data.frame(
       R = simple_r$centers[1 ,],
       manual1 = simple_m1,
       H2O = t(h2o.centers(simple_h))[,1],
       manual2 = as.vector(simple_m2[,1]),
       manual3 = simple_m3
    )
    
    par(font.main = 1)
    pairs(comparison, main = "Different methods of estimating the center of the data")

    And the result is fine. The data import has worked ok. Note that this wasn’t the case when I first did it – after a while I tracked it down to the non-alphabetical ordering of columns created by cast_sparse, which I have now forced to be alphabetical in the code earlier in this post.

    k-means cluster analysis in H2O

    To complete my experiment, I ran the k-means clustering algorithm on the data in H2O in a similar way to how I’d already done it in base R. One advantage of H2O in this context is it lets you use cross-validation for assessing the within-group sum of squares, rather than getting it from the data that was used to set the centers. This takes longer but is a better way of doing it. In fact, I tried it with and without cross validation and got similar results – the cross-validation total within groups sum of squares is always a little higher than that from the training data, but not enough to disrupt patterns.

    max_groups <- 25
    wss_h2o <- numeric(max_groups)
    for(i in 1:max_groups){
       kos_km1 <- h2o.kmeans(kos_h1, x= colnames(kos_h1), estimate_k = FALSE, k = i, standardize = FALSE, 
                             nfolds = 5, max_iterations = 25)   
       wss_h2o[i] <- h2o.tot_withinss(kos_km1, xval = TRUE) # xval=TRUE means cross validation used
    }
    
    par(font.main = 1)
    plot(1:max_groups, wss_h2o, type="b", xlab="Number of Clusters",
         ylab="Within groups sum of squares", bty = "l")
    grid()

    h2o-scree

    Note that the scree plot from H2O is much less regular than that from base R. This was the case whether it was created with cross-validation values of total within group sum of squares, or those from the training data. Basically, the H2O k-means algorithm appears to be a little less steady than the one in R. This would worry me except for my concerns with the appropriateness of the method at all for this sort of data. Because of the “curse of dimensionality”, nearly all points in this 4,000-dimensional space are pretty much equally far from each other, and it’s not surprising that a technique that aims to efficiently group them into clusters finds itself in unstable territory.

    So I’m not going to spend more space on presenting the final model, as I just don’t think the method is a good one in this case. The exercise has served its purpose of showing that the new as.h2o upload of sparse Matrix format data into H2O works. Unfortunately, it didn’t scale up to larger data. The New York Times bags of words in the same repository as the Kos Blog data has about 300,000 documents and a vocabulary of 100,000 words, and although R, tidytext and Matrix handle it fine I couldn’t upload to H2O with as.h2o. So this remains an open problem. If you know how to fix it, please answer my question on Stack Overflow! The solution could be either a way of pivoting/spreading long form data in H2O itself, or a better way of handling the upload of pre-pivoted wide sparse data. Or perhaps I just need more computing power.

    Topic modelling in R

    I would have stopped there but I thought it was a waste of a good dataset to use only an admittedly inferior method to analyse it. So I reached into the toolbox and grabbed the topic modelling approach, which is a more appropriate tool in this case. Topic modelling is a Bayesian approach that postulates a number of unseen latent “topics”. Each topic has its own probability distribution to generate a bag of words. The probability of each estimated topic generating each observed bag of words, conditional on the current estimate of the topics, is considered; the estimates are improved iteratively, and eventually one is left with a bunch of topics, each of which has a probability of generating each word.

    As an inferential method I think it’s over-rated, but as an exploratory approach to otherwise intractable large bunches of bags of words it’s pretty useful – sort of the equivalent exploratory analytical/graphical combined method (in my view) to the armoury of density plots, scatter plots and line charts that we use when first exploring continuous data. Oddly enough, the results are usually presented as tables of data listing the words most characteristic of each topic. I think we can do better than that… here’s the graphic from the top of the post again, for convenience:

    Word clouds can be a silly gimmick, but if used well they are a good use of space to pack text in. Like any graphic, they should facilitate comparisons (in this case, between different topics).

    My interpretation of these topics?

    • Topics 1 and 8 might be the more tactically focused discussion of polls and punditry in the leadup to the actual election
    • Topic 2 stands out clearly as the Iraq War topic
    • Topic 6 clearly relates to the broader war on terror, intelligence and defence issues
    • Topic 5 looks to be the general “Bush Administration” discussions
    • Topic 3 seems to be focused on the Democratic primary process
    • Topic 4 might be discussion of non-Presidential (Senate, House, Governor) races
    • Topic 7 might be a more general discussion of the Democratic party, people, politics and parties
    • The Daily Kos writers are a pretty nationally focused lot. A political blog in Australia or New Zealand might discuss elections in the USA, Britain, and probably France, Germany, Japan and maybe Indonesia as well as their own country, but the Daily Kos has the luxury of focusing on just the USA.

    Overall, it’s not a bad overview of the subject-matter of 3,400 articles that the computer has read so we don’t need to.

    As this is a standard and nicely curated dataset I thought there would be stacks of people who have done this before me but on a cursory search I couldn’t find it. Here are a couple of bits of related analysis:

    There’s a fair bit of discussion and not much clarity on the web on how to choose the correct number of topics for a given corpus of text. I think I’ll save that for another time.

    Here’s the R code that created it, no frills or playing around with arguments. Note that I had to put the data into another sparse matrix format for the excellent topicmodels R package by Bettina Grün to work with. I use the kos2 object defined earlier in this post, which was the data after stemming and re-aggregation, but before the counts were converted to relative frequencies. The most common form of topic modelling uses a method called latent Dirichlet allocation which, unless I’m mistaken, does that conversion under the hood; but the topicmodels::LDA function I use below needs to be provided counts, not proportions.

    kos_counts <- kos2 %>%
       arrange(word) %>%
       # take advantage of factor() as a quick way to make number:level pairings
       mutate(word = factor(word))
    
    # convert to triplet (row, column, cell) sparse matrix format:
    kos_trip <- with(kos_counts, simple_triplet_matrix(i = doc, j = as.numeric(word), v = count))
    
    dimnames(kos_trip)[[2]] <- levels(kos_counts$word)
    
    # function for drawing word clouds of all topics from an object created with LDA
    wclda <- function(lda, n = 100, palette = "Blues", ...){
       p <- posterior(lda)
       w1 <- as.data.frame(t(p$terms)) 
       w2 <- w1 %>%
          mutate(word = rownames(w1)) %>%
          gather(topic, weight, -word) 
       
       pal <- rep(brewer.pal(9, palette), each = ceiling(n / 9))[n:1]
       
       for(i in 1:ncol(w1)){
          w3 <- w2 %>%
             filter(topic == i) %>%
             arrange(desc(weight))
          with(w3[1:n, ], 
               wordcloud(word, freq = weight, random.order = FALSE, 
                         ordered.colors = TRUE, colors = pal, ...))
          title(paste("Topic", i))
       }
    }
    
    system.time(lda3 <- LDA(kos_trip, k = 8)) # about 300 seconds
    
    par(mfrow = c(2, 4), font.main = 1)
    wclda(lda3)
    grid.text(0.5, 0.55, label = "Latent topics identified in 3,430 dailykos.com blog posts in 2004",
              gp = gpar(fontface = "italic"))
    grid.text(0.98, 0.02, hjust = 1,
              label = "Source: data from https://archive.ics.uci.edu/ml/datasets/Bag+of+Wordsn           analysis from https://ellisp.github.io",
              gp = gpar(fontface = "italic", cex = 0.7, col = "grey60"))
    To leave a comment for the author, please follow the link and comment on their blog: Peter’s stats stuff – R.


    Source:: R News

    dotplot for GSEA result

    By R on Guangchuang Yu

    (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

    For GSEA analysis, we are familiar with the figure above, which shows the running enrichment score. But most software lacks a visualization method to summarize the whole enrichment result.

    In DOSE (and related tools including clusterProfiler, ReactomePA and meshes), we provide enrichMap and cnetplot to summarize GSEA results.

    Here is an example of using enrichMap, which is good for visualizing relationships among enriched gene sets.

    cnetplot excels at visualizing relationships among gene sets and their corresponding core genes.

    Now DOSE supports visualizing GSEA results using dotplot, which can show more enriched gene sets in one figure. This was a feature request from @guidohooiveld.

    dotplot was previously implemented in DOSE to visualize hypergeometric test results. I modified it to support GSEA results.

    Internally, .sign was reserved for the sign of NES (activated for NES > 0 and suppressed for NES < 0). We can pass the split parameter, and showCategory will then be applied to each split of the results. The following example plots 30 activated and 30 suppressed enriched disease sets.
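    A sketch of the call being described (the GSEA result object y is illustrative, in the spirit of the DOSE vignettes, not code copied from this post):

    library(DOSE)
    library(ggplot2)

    # y is assumed to be a GSEA result object, e.g. from gseDO()
    dotplot(y, showCategory = 30, split = ".sign") + facet_grid(. ~ .sign)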

    PS: Count is the number of core genes and GeneRatio is Count/setSize.

    Citation

    G Yu, LG Wang, GR Yan, QY He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609.

    To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.


    Source:: R News

    Pokemon and TrelliscopeJS!

    By ryan hafen

    (This article was first published on Ryan Hafen, and kindly contributed to R-bloggers)

    I’m always looking for ways to spark my kids’ interest in computers, data, etc. This has proven to be more difficult than I thought it would be (kids these days…). I suspect this may have something to do with the ubiquity of electronic devices that “just work”, making them less novel and less interesting to tinker with, but speculation on this is a post for another time…

    Anyway, all of my kids are crazy into Pokemon so when I came across some Pokemon data the other day that lent itself very nicely to a Trelliscope display, I thought I might have a chance to engage them. And then I thought why not write a post about it. You can find the resulting display and code to recreate it in this post. Hope you enjoy!

    To start out, here’s the display:

    If this display doesn’t appear correctly for you (because of blog aggregators, etc.), you can follow this link to the display in a dedicated window. For better viewing, you can also click the bottom right “fullscreen” button to expand the display to fill the window.

    The data from which this was created is a simple data frame of Pokemon statistics, based on this source (which borrows from here). I slightly modified the data to add some variables that enhance the display (I changed the image URL to a better source, added a variable “pokedex” that provides a link to the pokemon’s pokedex entry on pokemon.com, and removed a few special Pokemon that I couldn’t find on pokedex).

    Since this data is simply a data frame where each row refers to a Pokemon, it lends itself nicely to a Trelliscope display showing an image of the Pokemon as the panel and allowing interaction with the Pokemon being viewed based on the various statistics provided.

    Here’s the code to make the display. Once the data is read, it’s just a few lines.

    # install packages if not installed
    devtools::install_github("hafen/trelliscopejs")
    install.packages(c("readr", "dplyr"))
    
    library(readr)
    library(dplyr)
    library(trelliscopejs)
    
    # read the data (making "_id" columns strings)
    pok <-
      read_csv("https://raw.githubusercontent.com/hafen/pokRdex/master/pokRdex_mod.csv") %>%
      mutate_at(vars(matches("_id$")), as.character)
    
    # take a look
    glimpse(pok)
    
    Observations: 801
    Variables: 30
    $ pokemon                 <chr> "bulbasaur", "ivysaur", "venusaur", "ve...
    $ id                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
    $ species_id              <chr> "1", "2", "3", "3", "4", "5", "6", "6",...
    $ height                  <int> 7, 10, 20, 24, 6, 11, 17, 17, 17, 5, 10...
    $ weight                  <int> 69, 130, 1000, 1555, 85, 190, 905, 1105...
    $ base_experience         <int> 64, 142, 236, 281, 62, 142, 240, 285, 2...
    $ type_1                  <chr> "grass", "grass", "grass", "grass", "fi...
    $ type_2                  <chr> "poison", "poison", "poison", "poison",...
    $ attack                  <int> 49, 62, 82, 100, 52, 64, 84, 130, 104, ...
    $ defense                 <int> 49, 63, 83, 123, 43, 58, 78, 111, 78, 6...
    $ hp                      <int> 45, 60, 80, 80, 39, 58, 78, 78, 78, 44,...
    $ special_attack          <int> 65, 80, 100, 122, 60, 80, 109, 130, 159...
    $ special_defense         <int> 65, 80, 100, 120, 50, 65, 85, 85, 115, ...
    $ speed                   <int> 45, 60, 80, 80, 65, 80, 100, 100, 100, ...
    $ ability_1               <chr> "overgrow", "overgrow", "overgrow", "th...
    $ ability_2               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
    $ ability_hidden          <chr> "chlorophyll", "chlorophyll", "chloroph...
    $ color_1                 <chr> "#78C850", "#78C850", "#78C850", "#78C8...
    $ color_2                 <chr> "#A040A0", "#A040A0", "#A040A0", "#A040...
    $ color_f                 <chr> "#81A763", "#81A763", "#81A763", "#81A7...
    $ egg_group_1             <chr> "monster", "monster", "monster", "monst...
    $ egg_group_2             <chr> "plant", "plant", "plant", "plant", "dr...
    $ url_image               <chr> "http://assets.pokemon.com/assets/cms2/...
    $ generation_id           <chr> "1", "1", "1", NA, "1", "1", "1", NA, N...
    $ evolves_from_species_id <chr> NA, "1", "2", NA, NA, "4", "5", NA, NA,...
    $ evolution_chain_id      <chr> "1", "1", "1", NA, "2", "2", "2", NA, N...
    $ shape_id                <chr> "8", "8", "8", NA, "6", "6", "6", NA, N...
    $ shape                   <chr> "quadruped", "quadruped", "quadruped", ...
    $ pokebase                <chr> "bulbasaur", "ivysaur", "venusaur", "ve...
    $ pokedex                 <chr> "http://www.pokemon.com/us/pokedex/bulb...
    

    Now we can create a Trelliscope display by specifying url_image as the source for the panel images. We also specify a default state indicating that the values for the variables pokemon and pokedex should be shown as labels by default.

    pok %>%
      mutate(panel = img_panel(url_image)) %>%
      trelliscope("pokemon", nrow = 3, ncol = 6,
        state = list(labels = c("pokemon", "pokedex")))
    

    This will produce the interactive plot shown at the top of this post. You can use the display to find Pokemon based on sorting or filtering on several of their attributes.

    Note that despite my kids constantly telling me about and showing me their Pokemon cards, I am not a Pokemon expert, so there may be some interesting things I am missing. But I can say that my kids were finally impressed and engaged with something that I showed them. Success!

    If this is your first exposure to Trelliscope and you are interested in other things it can do, please see my original blog post.

    This is my last post of the year. Happy new year!

    To leave a comment for the author, please follow the link and comment on their blog: Ryan Hafen.


    Source:: R News

    a Galton-Watson riddle

    By xi’an

    riddle

    (This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

    The Riddler of this week has an extinction riddle which summarises as follows:

    One observes a population of N individuals, each with a probability of 10⁻⁴ to kill the observer each day. From one day to the next, the population decreases by one individual with probability

    K√N 10⁻⁴

    What is the value of K that leaves the observer alive with probability ½?

    Given the sequence of population sizes N, N₁, N₂, …, the probability that the observer remains alive is

    (1 − 10⁻⁴)^(N + N₁ + N₂ + ⋯)

    where the sum stops with the (sure) extinction of the population. This is the moment generating function of the sum, evaluated at x = 1 − 10⁻⁴. Hence the problem relates to a Galton-Watson extinction problem. However, given the nature of the extinction process, I do not see a way to determine the distribution of the sum except by simulation, which returns K = 26.3 for the specific value of N = 9.

    N=9            # initial population size
    K=3*N          # starting guess for K
    M=10^4         # number of simulated extinction paths
    vals=rep(0,M)  # sum of daily population sizes for each simulated path
    targ=0
    ite=1
    while (abs(targ-.5)>.01){
      # simulate M extinction paths, accumulating the population sizes in vals[t]
      for (t in 1:M){
        gen=vals[t]=N
        while (gen>0){
          gen=gen-(runif(1)<K*sqrt(gen)/10^4)  # one death with probability K*sqrt(gen)*10^-4
          vals[t]=vals[t]+gen}
      }
      # Monte Carlo estimate of the survival probability (1-10^-4)^S
      targ=mean(exp(vals*log(.9999)))
      print(c(as.integer(ite),K,targ))
      # damped multiplicative update of K, aiming at survival probability 1/2
      if (targ<.5){ K=K*ite/(1+ite)}else{
        K=K/(ite/(1+ite))}
      ite=ite+1}
    

    Filed under: R, Travel Tagged: Francis Galton, Galton-Watson extinction, R, The Riddler

    To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.


    Source:: R News

    Using R to prevent food poisoning in Chicago

    By David Smith

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    There are more than 15,000 restaurants in Chicago, but fewer than 40 inspectors tasked with making sure they comply with food-safety standards. To help prioritize the facilities targeted for inspection, the City of Chicago used R to create a model that predicts which restaurants are most likely to fail an inspection. Using this model to deploy inspectors, the City is able to detect unsafe restaurants more than a week sooner than by using traditional selection methods, and cite 37 additional restaurants per month.

    Chicago’s Department of Public Health used the R language to build and deploy the model, and made the code available as an open source project on GitHub. The reasons given are twofold:

    An open source approach helps build a foundation for other models attempting to forecast violations at food establishments. The analytic code is written in R, an open source, widely-known programming language for statisticians. There is no need for expensive software licenses to view and run this code.

    Releasing the model as open source has had benefits far beyond Chicago as well: Montgomery County, MD adopted the process and also saw improvements in its food safety inspection process.

    You can see how the model is used in practice in the video below from PBS NewsHour. Fast forward to the 3:00 mark to see Tom Schenk, Chief Data Officer for the City of Chicago, describe how the data science team there used R to develop the model. (There’s also a close-up of R code using the data.table package around the 6:45 mark.)

    The video also describes the Foodborne Chicago Twitter detection system for flagging tweets describing food poisoning in Chicago (also implemented with R).

    PBS NewsHour: Up to code? An algorithm is helping Chicago health officials predict restaurant safety violations (via reader MD)

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    Source:: R News

    Intermediate Tree 1

    By Hasan Imtiaz

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    If you followed through the Basic Decision Tree exercise, this should be useful for you. This is like a continuation, but we add so much more. We are working with a bigger and badder dataset. We will also be using techniques we learned from model evaluation and work with ROC, accuracy and other metrics.

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1
    Read in the adult.csv file with header=FALSE. Store this in df. Use the str() command to see the dataframe. Download the data from here.

    Exercise 2
    You are given the meta_data that goes with the CSV. You can download this here. Use that to add the column names for your dataframe. Notice the df is ordered going from V1, V2, V3 and so on. As a side note, it is always best practice to use that to match and see if all the columns are read in correctly.

    Exercise 3
    Use the table command and print out the distribution of the class feature.

    Exercise 4
    Change the class column to binary.

    Learn more about decision trees in the online courses

    Exercise 5
    Use the cor() command to see the correlation of all the numeric and integer columns including the class column. Remember that numbers close to 1 mean high correlation and numbers close to 0 mean low correlation. This will give you a rough idea for feature selection.

    Exercise 6
    Split the dataset into Train and Test samples. You may use sample.split() with the ratio as 0.7 and set the seed to be 1000. Make sure to install and load the caTools package.

    Exercise 7
    Check the number of rows of Train
    Check the number of rows of Test

    Exercise 8
    We are ready to use decision trees on our dataset. Load the packages “rpart” and “rpart.plot”. If they are not installed, then use the install.packages() command.

    Exercise 9
    Use rpart to build the decision tree on the Train set. Include all features. Store this model in dec.

    Exercise 10
    Use the prp() function to plot the decision tree. If you get any error use this code before the prp() command

    par(mar = rep(2, 4))

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


    Source:: R News