Tutorial: Using R for Scalable Data Analytics

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

At the recent Strata conference in San Jose, several members of the Microsoft Data Science team presented the tutorial Using R for Scalable Data Analytics: Single Machines to Spark Clusters. The materials are all available online, including the presentation slides and hands-on R scripts. You can follow along with the materials at home, using the Data Science Virtual Machine for Linux, which provides all the necessary components like Spark and Microsoft R Server. (If you don’t already have an Azure account, you can get $200 credit with the Azure free trial.)

The tutorial covers many different techniques for training predictive models at scale, and for deploying the trained models as predictive engines within production environments. Among the technologies you’ll use are Microsoft R Server running on Spark, the SparkR package, the sparklyr package, and H2O (via the rsparkling package). It also touches on some non-Spark methods, like the bigmemory and ff packages for R (and various other packages that make use of them), and using the foreach package for coarse-grained parallel computations. You’ll also learn how to create prediction engines from these trained models using the mrsdeploy package.
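To give a flavour of that last point, here is a minimal sketch (ours, not taken from the tutorial materials) of a coarse-grained parallel computation with foreach, fitting independent bootstrap models across local cores:

# Minimal foreach sketch: fit one model per bootstrap sample, in parallel
library(foreach)
library(doParallel)

cl <- makeCluster(2)       # two local worker processes
registerDoParallel(cl)

fits <- foreach(i = 1:10) %dopar% {
  idx <- sample(nrow(mtcars), replace = TRUE)   # bootstrap resample
  lm(mpg ~ wt + hp, data = mtcars[idx, ])
}

stopCluster(cl)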

The tutorial also includes scripts for comparing the performance of these various techniques, both for training the predictive model and for generating predictions (scoring) from the trained model.

(These tests used 4 worker nodes and 1 edge node, each with 16 cores and 112 GB of RAM.)

You can find the tutorial details, including slides and scripts, at the link below.

Strata + Hadoop World 2017, San Jose: Using R for scalable data analytics: From single machines to Hadoop Spark clusters


ggedit 0.2.0 is now on CRAN

By Jonathan Sidi

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Jonathan Sidi, Metrum Research Group

We are pleased to announce the release of the ggedit package on CRAN.

To install the package you can call the standard R command

install.packages('ggedit')

The source version is still tracked on GitHub, which has been reorganized to be easier to navigate.

To install the dev version:

devtools::install_github('metrumresearchgroup/ggedit')

What is ggedit?

ggedit is an R package that is used to facilitate ggplot formatting. With ggedit, R users of all experience levels can easily move from creating ggplots to refining aesthetic details, all while maintaining portability for further reproducible research and collaboration.
ggedit is run from an R console or as a reactive object in any Shiny application. The user inputs a ggplot object or a list of objects. The application populates Bootstrap modals with all of the elements found in each layer, scale, and theme of the ggplot objects. The user can then edit these elements and interact with the plot as changes occur. During editing, a comparison of the script is logged, which can be directly copied and shared. The application output is a nested list containing the edited layers, scales, and themes in both object and script form, so you can apply the edited objects independently of the original plot using regular ggplot2 grammar.
Why does it matter? ggedit promotes efficient collaboration. You can share your plots with team members to make formatting changes, and they can then send any objects they’ve edited back to you for implementation. No more email chains to change a circle to a triangle!

Updates in ggedit 0.2.0:

  • The layer modal (popups) elements have been reorganized for less clutter and easier navigation.
  • The S3 method written to plot and compare themes has been removed from the package, but it can still be found on the repo; see plot.theme.

Deploying

  • call from the console: ggedit(p)
  • call from the addin toolbar: highlight the script of a plot object in the RStudio source editor and run ggedit from the Addins toolbar.
  • call as part of Shiny: use the Shiny module syntax to call the ggEdit UI elements (a minimal app sketch follows this list).
    • server: callModule(ggEdit,'pUI',obj=reactive(p))
    • ui: ggEditUI('pUI')
  • if you have installed the package you can see an example of a Shiny app by executing runApp(system.file('examples/shinyModule.R',package = 'ggedit'))
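Putting the module pieces together, a minimal Shiny app might look like the sketch below (assembled from the callModule/ggEditUI calls above; the plot p is our own example):

library(shiny)
library(ggplot2)
library(ggedit)

p <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

ui <- fluidPage(
  ggEditUI('pUI')                  # module UI; the id must match the server call
)

server <- function(input, output, session) {
  callModule(ggEdit, 'pUI', obj = reactive(p))
}

shinyApp(ui, server)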

Outputs

ggedit returns a list containing 8 elements, either to the global environment or as a reactive output in Shiny (a short usage sketch follows the list).

  • updatedPlots
    • List containing updated ggplot objects
  • updatedLayers
    • For each plot a list of updated layers (ggproto) objects
    • Portable object
  • updatedLayersElements
    • For each plot a list elements and their values in each layer
    • Can be used to update the new values in the original code
  • updatedLayerCalls
    • For each plot a list of scripts that can be run directly from the console to create a layer
  • updatedThemes
    • For each plot a list of updated theme objects
    • Portable object
    • If the user doesn’t edit the theme, updatedThemes will not be returned
  • updatedThemeCalls
    • For each plot a list of scripts that can be run directly from the console to create a theme
  • updatedScales
    • For each plot a list of updated scales (ggproto) objects
    • Portable object
  • updatedScaleCalls
    • For each plot a list of scripts that can be run directly from the console to create a scale
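As a hypothetical usage sketch (the element names are those listed above; the updatedThemes line assumes a theme was actually edited), the returned pieces can be reused on other plots:

library(ggplot2)
library(ggedit)

p <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
out <- ggedit(p)                  # edit interactively, then close the gadget

out$updatedPlots[[1]]             # the edited plot itself

ggplot(mtcars, aes(wt, mpg)) +    # reuse the edited theme on a different plot
  geom_point() +
  out$updatedThemes[[1]]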

Short Clip to use ggedit in Shiny


Jonathan Sidi joined Metrum Research Group in 2016 after working for several years on problems in applied statistics, financial stress testing and economic forecasting in both industrial and academic settings. To learn more about additional open-source software packages developed by Metrum Research Group, please visit the Metrum website. For questions and comments, feel free to email me at yonis@metrumrg.com, or open an issue for bug fixes or enhancements on GitHub.


The 5 Most Effective Ways to Learn R

By Brian Back

(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)

Whether you’re plotting a simple time series or building a predictive model for the next election, the R programming language’s flexibility will ensure you have all the capabilities you need to get the job done. In this blog we will take a look at five effective tactics for learning this essential data science language, as well as some of the top resources associated with each. These tactics should be used to complement one another on your path to mastering the world’s most powerful statistical language!

1. Watch Instructive Videos

We often flock to YouTube when we want to learn how to play a song on the piano, change a tire, or chop an onion, but why should it be any different when learning how to perform calculations using the most popular statistical programming language? LearnR, Google Developers, and MarinStatsLectures are all fantastic YouTube channels with playlists specifically dedicated to the R language.

2. Read Blogs

There’s a good chance you came across this article through the R-bloggers website, which curates content from some of the best blogs about R that can be found on the web today. Since there are 750+ blogs that are curated on R-bloggers alone, you shouldn’t have a problem finding an article on the exact topic or use case you’re interested in!

A few notable R blogs:

3. Take an Online Course

As we’ve mentioned in previous blogs, there are a great number of online classes you can take to learn specific technical skills. In many instances, these courses are free or very affordable, with some offering discounts to college students. Why spend thousands of dollars on a university course when you can get as good an understanding online, if not better (IMHO)?

Some sites that offer great R courses include:

4. Read Books

Many times, books are given a bad rap since most programming concepts can be found online, for free. Sure, if you are going to use the book just as a reference, you’d probably be better off saving that money and taking to Google search. However, if you’re a beginner, or someone who wants to learn the fundamentals, working through an entire book at the foundational level will provide a high degree of understanding.

There is a fantastic list of the best books for R at Data Science Central.

5. Experiment!

You can read articles and watch videos all day long, but if you never try it for yourself, you’ll never learn! Datazar is a great place for you to jump right in and experiment with what you’ve learned. You can immediately start by opening the R console or creating a notebook in our cloud-based environment. If you get stuck, you can consult with other users and even work with scripts that have been opened up by others!

I hope you found this helpful and as always if you would like to share any additional resources, feel free to drop them in the comments below!

The 5 Most Effective Ways to Learn R was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.


Easy leave-one-out cross validation with pipelearner

By Simon Jackson


(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to show you how to do leave-one-out cross validation using pipelearner.

Leave-one-out cross validation

Leave-one-out is a type of cross validation whereby the following is done for each observation in the data:

  • Run model on all other observations
  • Use model to predict value for observation

This means that a model is fitted and a prediction made n times, where n is the number of observations in your data.
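As a plain base-R illustration of that idea (our own sketch, before we hand the bookkeeping over to pipelearner):

# Leave-one-out by hand: refit the model n times, predicting the held-out row
preds <- vapply(seq_len(nrow(mtcars)), function(i) {
  fit <- lm(hp ~ ., data = mtcars[-i, ])               # fit on all but row i
  predict(fit, newdata = mtcars[i, , drop = FALSE])    # predict row i
}, numeric(1))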

Leave-one-out in pipelearner

pipelearner is a package for streamlining machine learning pipelines, including cross validation. If you’re new to it, check out blogR for other relevant posts.

To demonstrate, let’s use regression to predict horsepower (hp) with all other variables in the mtcars data set. Set this up in pipelearner as follows:

library(pipelearner)

pl <- pipelearner(mtcars, lm, hp ~ .)

How cross validation is done is handled by learn_cvpairs(). For leave-one-out, specify k = number of rows:

pl <- learn_cvpairs(pl, k = nrow(mtcars))

Finally, learn() the model on all folds:

pl <- learn(pl)

This can all be written in a pipeline:

pl <- pipelearner(mtcars, lm, hp ~ .) %>% 
  learn_cvpairs(k = nrow(mtcars)) %>% 
  learn()

pl
#> # A tibble: 32 × 9
#>    models.id cv_pairs.id train_p      fit target model     params
#>        <chr>       <chr>   <dbl>   <list>  <chr> <chr>     <list>
#> 1          1          01       1 <S3: lm>     hp    lm <list [1]>
#> 2          1          02       1 <S3: lm>     hp    lm <list [1]>
#> 3          1          03       1 <S3: lm>     hp    lm <list [1]>
#> 4          1          04       1 <S3: lm>     hp    lm <list [1]>
#> 5          1          05       1 <S3: lm>     hp    lm <list [1]>
#> 6          1          06       1 <S3: lm>     hp    lm <list [1]>
#> 7          1          07       1 <S3: lm>     hp    lm <list [1]>
#> 8          1          08       1 <S3: lm>     hp    lm <list [1]>
#> 9          1          09       1 <S3: lm>     hp    lm <list [1]>
#> 10         1          10       1 <S3: lm>     hp    lm <list [1]>
#> # ... with 22 more rows, and 2 more variables: train <list>, test <list>

Evaluating performance

Performance can be evaluated in many ways depending on your model. We will calculate R2:

library(tidyverse)

# Extract true and predicted values of hp for each observation
pl <- pl %>% 
  mutate(true = map2_dbl(test, target, ~as.data.frame(.x)[[.y]]),
         predicted = map2_dbl(fit, test, predict)) 

# Summarise results
results <- pl %>% 
  summarise(
    sse = sum((predicted - true)^2),
    sst = sum(true^2)
  ) %>% 
  mutate(r_squared = 1 - sse / sst)

results
#> # A tibble: 1 × 3
#>        sse    sst r_squared
#>      <dbl>  <dbl>     <dbl>
#> 1 41145.56 834278 0.9506812

Using leave-one-out cross validation, the regression model obtains an R2 of 0.95 when generalizing to predict horsepower in new data.

We’ll conclude with a plot of each true data point and its predicted value:

pl %>% 
  ggplot(aes(true, predicted)) +
      geom_point(size = 2) +
      geom_abline(intercept = 0, slope = 1, linetype = 2)  +
      theme_minimal() +
      labs(x = "True value", y = "Predicted value") +
      ggtitle("True against predicted values based on leave-one-out cross validation")

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.


#2: Even Easier Package Registration

By Thinking inside the box

Welcome to the second post in the rambling random R recommendation series, or R4 for short.

Two days ago I posted the initial (actual) post. It provided context for why we need package registration entries (tl;dr: because R CMD check now tests for it, and because it is The Right Thing to do; see the documentation in the posts). I also showed how generating such a file src/init.c was essentially free, as all it took was a single call to a new helper function added to R-devel by Brian Ripley and Kurt Hornik.

Now, to actually use R-devel you obviously need to have it accessible. There are a myriad of ways to achieve that: just compile it locally as I have done for years, use a Docker image as I showed in the post — or be creative with eg Travis or win-builder, both of which give you access to R-devel if you’re clever about it.

But as no good deed goes unpunished, I was of course told off today for showing a Docker example as Docker was not “Easy”. I think the formal answer to that is baloney. But we leave that aside, and promise to discuss setting up Docker at another time.

R is after all … just R. So below please find a script you can save as, say, ~/bin/pnrrs.r. And calling it—even with R-release—will generate the same code snippet as I showed via Docker. Call it a one-off backport of the new helper function — with a half-life of a few weeks at best as we will have R 3.4.0 as default in just a few weeks. The script will then reduce to just the final line as the code will be present with R 3.4.0.

#!/usr/bin/r

library(tools)

.find_calls_in_package_code <- tools:::.find_calls_in_package_code
.read_description <- tools:::.read_description

## all that follows is from R-devel, aka R 3.4.0 to be

package_ff_call_db <- function(dir) {
    ## A few packages such as CDM use base::.Call
    ff_call_names <- c(".C", ".Call", ".Fortran", ".External",
                       "base::.C", "base::.Call",
                       "base::.Fortran", "base::.External")

    predicate <- function(e) {
        (length(e) > 1L) &&
            !is.na(match(deparse(e[[1L]]), ff_call_names))
    }

    calls <- .find_calls_in_package_code(dir,
                                         predicate = predicate,
                                         recursive = TRUE)
    calls <- unlist(Filter(length, calls))

    if(!length(calls)) return(NULL)

    attr(calls, "dir") <- dir
    calls
}

native_routine_registration_db_from_ff_call_db <- function(calls, dir = NULL, character_only = TRUE) {
    if(!length(calls)) return(NULL)

    ff_call_names <- c(".C", ".Call", ".Fortran", ".External")
    ff_call_args <- lapply(ff_call_names,
                           function(e) args(get(e, baseenv())))
    names(ff_call_args) <- ff_call_names
    ff_call_args_names <-
        lapply(lapply(ff_call_args,
                      function(e) names(formals(e))), setdiff,
               "...")

    if(is.null(dir))
        dir <- attr(calls, "dir")

    package <- # drop name
        as.vector(.read_description(file.path(dir, "DESCRIPTION"))["Package"])

    symbols <- character()
    nrdb <-
        lapply(calls,
               function(e) {
                   if (startsWith(deparse(e[[1L]]), "base::"))
                       e[[1L]] <- e[[1L]][3L]
                   ## First figure out whether ff calls had '...'.
                   pos <- which(unlist(Map(identical,
                                           lapply(e, as.character),
                                           "...")))
                   ## Then match the call with '...' dropped.
                   ## Note that only .NAME could be given by name or
                   ## positionally (the other ff interface named
                   ## arguments come after '...').
                   if(length(pos)) e <- e[-pos]
                   ## drop calls with only ...
                   if(length(e) < 2L) return(NULL)
                   cname <- as.character(e[[1L]])
                   ## The help says
                   ##
                   ## '.NAME' is always matched to the first argument
                   ## supplied (which should not be named).
                   ##
                   ## But some people do (Geneland ...).
                   nm <- names(e); nm[2L] <- ""; names(e) <- nm
                   e <- match.call(ff_call_args[[cname]], e)
                   ## Only keep ff calls where .NAME is character
                   ## or (optionally) a name.
                   s <- e[[".NAME"]]
                   if(is.name(s)) {
                       s <- deparse(s)[1L]
                       if(character_only) {
                           symbols <<- c(symbols, s)
                           return(NULL)
                       }
                   } else if(is.character(s)) {
                       s <- s[1L]
                   } else { ## expressions
                       symbols <<- c(symbols, deparse(s))
                       return(NULL)
                   }
                   ## Drop the ones where PACKAGE gives a different
                   ## package. Ignore those which are not char strings.
                   if(!is.null(p <- e[["PACKAGE"]]) &&
                      is.character(p) && !identical(p, package))
                       return(NULL)
                   n <- if(length(pos)) {
                            ## Cannot determine the number of args: use
                            ## -1 which might be ok for .External().
                            -1L
                        } else {
                            sum(is.na(match(names(e),
                                            ff_call_args_names[[cname]]))) - 1L
                        }
                   ## Could perhaps also record whether 's' was a symbol
                   ## or a character string ...
                   cbind(cname, s, n)
               })
    nrdb <- do.call(rbind, nrdb)
    nrdb <- as.data.frame(unique(nrdb), stringsAsFactors = FALSE)

    if(NROW(nrdb) == 0L || length(nrdb) != 3L)
        stop("no native symbols were extracted")
    nrdb[, 3L] <- as.numeric(nrdb[, 3L])
    nrdb <- nrdb[order(nrdb[, 1L], nrdb[, 2L], nrdb[, 3L]), ]
    nms <- nrdb[, "s"]
    dups <- unique(nms[duplicated(nms)])

    ## Now get the namespace info for the package.
    info <- parseNamespaceFile(basename(dir), dirname(dir))
    ## Could have ff calls with symbols imported from other packages:
    ## try dropping these eventually.
    imports <- info$imports
    imports <- imports[lengths(imports) == 2L]
    imports <- unlist(lapply(imports, `[[`, 2L))

    info <- info$nativeRoutines[[package]]
    ## Adjust native routine names for explicit remapping or
    ## namespace .fixes.
    if(length(symnames <- info$symbolNames)) {
        ind <- match(nrdb[, 2L], names(symnames), nomatch = 0L)
        nrdb[ind > 0L, 2L] <- symnames[ind]
    } else if(!character_only &&
              any((fixes <- info$registrationFixes) != "")) {
        ## There are packages which have not used the fixes, e.g. utf8latex
        ## fixes[1L] is a prefix, fixes[2L] is an undocumented suffix
        nrdb[, 2L] <- sub(paste0("^", fixes[1L]), "", nrdb[, 2L])
        if(nzchar(fixes[2L]))
            nrdb[, 2L] <- sub(paste0(fixes[2L], "$"), "", nrdb[, 2L])
    }
    ## See above.
    if(any(ind <- !is.na(match(nrdb[, 2L], imports))))
        nrdb <- nrdb[!ind, , drop = FALSE]

    ## Fortran entry points are mapped to l/case
    dotF <- nrdb$cname == ".Fortran"
    nrdb[dotF, "s"] <- tolower(nrdb[dotF, "s"])

    attr(nrdb, "package") <- package
    attr(nrdb, "duplicates") <- dups
    attr(nrdb, "symbols") <- unique(symbols)
    nrdb
}

format_native_routine_registration_db_for_skeleton <- function(nrdb, align = TRUE, include_declarations = FALSE) {
    if(!length(nrdb))
        return(character())

    fmt1 <- function(x, n) {
        c(if(align) {
              paste(format(sprintf("    {\"%s\",", x[, 1L])),
                    format(sprintf(if(n == "Fortran")
                                       "(DL_FUNC) &F77_NAME(%s),"
                                   else
                                       "(DL_FUNC) &%s,",
                                   x[, 1L])),
                    format(sprintf("%d},", x[, 2L]),
                           justify = "right"))
          } else {
              sprintf(if(n == "Fortran")
                          "    {\"%s\", (DL_FUNC) &F77_NAME(%s), %d},"
                      else
                          "    {\"%s\", (DL_FUNC) &%s, %d},",
                      x[, 1L],
                      x[, 1L],
                      x[, 2L])
          },
          "    {NULL, NULL, 0}")
    }

    package <- attr(nrdb, "package")
    dups <- attr(nrdb, "duplicates")
    symbols <- attr(nrdb, "symbols")

    nrdb <- split(nrdb[, -1L, drop = FALSE],
                  factor(nrdb[, 1L],
                         levels =
                             c(".C", ".Call", ".Fortran", ".External")))

    has <- vapply(nrdb, NROW, 0L) > 0L
    nms <- names(nrdb)
    entries <- substring(nms, 2L)
    blocks <- Map(function(x, n) {
                      c(sprintf("static const R_%sMethodDef %sEntries[] = {",
                                n, n),
                        fmt1(x, n),
                        "};",
                        "")
                  },
                  nrdb[has],
                  entries[has])

    decls <- c(
        "/* FIXME: ",
        "   Add declarations for the native routines registered below.",
        "*/")

    if(include_declarations) {
        decls <- c(
            "/* FIXME: ",
            "   Check these declarations against the C/Fortran source code.",
            "*/",
            if(NROW(y <- nrdb$.C)) {
                 args <- sapply(y$n, function(n) if(n >= 0)
                                paste(rep("void *", n), collapse=", ")
                                else "/* FIXME */")
                c("", "/* .C calls */",
                  paste0("extern void ", y$s, "(", args, ");"))
           },
            if(NROW(y <- nrdb$.Call)) {
                args <- sapply(y$n, function(n) if(n >= 0)
                               paste(rep("SEXP", n), collapse=", ")
                               else "/* FIXME */")
               c("", "/* .Call calls */",
                  paste0("extern SEXP ", y$s, "(", args, ");"))
            },
            if(NROW(y <- nrdb$.Fortran)) {
                 args <- sapply(y$n, function(n) if(n >= 0)
                                paste(rep("void *", n), collapse=", ")
                                else "/* FIXME */")
                c("", "/* .Fortran calls */",
                  paste0("extern void F77_NAME(", y$s, ")(", args, ");"))
            },
            if(NROW(y <- nrdb$.External))
                c("", "/* .External calls */",
                  paste0("extern SEXP ", y$s, "(SEXP);"))
            )
    }

    headers <- if(NROW(nrdb$.Call) || NROW(nrdb$.External))
        c("#include <R.h>", "#include <Rinternals.h>")
    else if(NROW(nrdb$.Fortran)) "#include <R_ext/RS.h>"
    else character()

    c(headers,
      "#include <stdlib.h> // for NULL",
      "#include <R_ext/Rdynload.h>",
      "",
      if(length(symbols)) {
          c("/*",
            "  The following symbols/expressions for .NAME have been omitted",
            "", strwrap(symbols, indent = 4, exdent = 4), "",
            "  Most likely possible values need to be added below.",
            "*/", "")
      },
      if(length(dups)) {
          c("/*",
            "  The following name(s) appear with different usages",
            "  e.g., with different numbers of arguments:",
            "", strwrap(dups, indent = 4, exdent = 4), "",
            "  This needs to be resolved in the tables and any declarations.",
            "*/", "")
      },
      decls,
      "",
      unlist(blocks, use.names = FALSE),
      ## We cannot use names with '.' in: WRE mentions replacing with "_"
      sprintf("void R_init_%s(DllInfo *dll)",
              gsub(".", "_", package, fixed = TRUE)),
      "{",
      sprintf("    R_registerRoutines(dll, %s);",
              paste0(ifelse(has,
                            paste0(entries, "Entries"),
                            "NULL"),
                     collapse = ", ")),
      "    R_useDynamicSymbols(dll, FALSE);",
      "}")
}

package_native_routine_registration_db <- function(dir, character_only = TRUE) {
    calls <- package_ff_call_db(dir)
    native_routine_registration_db_from_ff_call_db(calls, dir, character_only)
}

package_native_routine_registration_skeleton <- function(dir, con = stdout(), align = TRUE,
                                                         character_only = TRUE, include_declarations = TRUE) {
    nrdb <- package_native_routine_registration_db(dir, character_only)
    writeLines(format_native_routine_registration_db_for_skeleton(nrdb,
                align, include_declarations),
               con)
}

package_native_routine_registration_skeleton(".")  ## when R 3.4.0 is out you only need this line

Here I use /usr/bin/r as I happen to like littler a lot, but you can use Rscript the same way.

Easy enough now?

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Take your data frames to the next level.

By realdataweb

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

While finishing up R-rockstar Hadley Wickham’s book (Free Book – R for Data Science), I came across something pretty cool in the section on model building that I had no idea about – list columns.

Most of us have probably seen the following data frame column format:

df <- data.frame("col_uno" = c(1,2,3),"col_dos" = c('a','b','c'), "col_tres" = factor(c("google", "apple", "amazon")))

And the output:

df
##   col_uno col_dos col_tres
## 1       1       a   google
## 2       2       b    apple
## 3       3       c   amazon

This is an awesome way to organize data and one of R’s strong points. However, we can use list functionality to go deeper. Check this out:

library(tidyverse)
library(datasets)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
nested <- iris %>%
  group_by(Species) %>%
  nest()
nested
## # A tibble: 3 × 2
##      Species              data
##       <fctr>            <list>
## 1     setosa <tibble [50 × 4]>
## 2 versicolor <tibble [50 × 4]>
## 3  virginica <tibble [50 × 4]>

Using nest we can compartmentalize our data frame for readability and more efficient iteration. Here we can use map from the purrr package to compute the mean of each column in our nested data.

means <- map(nested$data, colMeans)
means
## [[1]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.006        3.428        1.462        0.246 
## 
## [[2]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.936        2.770        4.260        1.326 
## 
## [[3]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        6.588        2.974        5.552        2.026
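Another common list-column pattern (a quick sketch of our own, not from the excerpt above) is to store one fitted model per group right next to the data that produced it:

models <- nested %>%
  mutate(fit = map(data, ~ lm(Sepal.Length ~ Petal.Length, data = .x)))

models$fit[[1]]  # inspect the model fitted to the setosa rows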

Once you’re done messing around with data-ception, use unnest to revert your data back to its original state.

head(unnest(nested))
## # A tibble: 6 × 5
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##    <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
## 6  setosa          5.4         3.9          1.7         0.4

I was pretty excited to learn about this property of data.frames and will definitely make use of it in the future. If you have any neat examples of nested dataset usage, please feel free to share in the comments. As always, I’m happy to answer questions or talk data!

Kiefer Smith


Customising Shiny Server HTML Pages

By Mark Sellors


(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)
Mark Sellors
Head of Data Engineering

At Mango we work with a great many clients using the Shiny framework for R. Many of those use Shiny Server or Shiny Server Pro to publish their shiny apps within their organisations. Shiny Server Pro in particular is a great product, but some of the stock html pages are a little plain, so I asked the good folks at RStudio if it was possible to customise them to match corporate themes and so on. It turns out that it’s a documented feature that’s been available for around 3 years now!

The stock Shiny Server Pro login screen

If you want to try this yourself, check out the Shiny Server Admin Guide, but it’s pretty simple to do. My main point of interest in this is to customise the Shiny Server Pro login screen, but you can customise several other pages too. Though it’s worth noting that if you create a custom 404, it will not be available from within an application, only in places where it would be rendered by Shiny Server rather than by Shiny itself.

To get a feel for how this works, I asked Mango’s resident web dev, Ben, to rustle me up a quick login page for a fake company and I then set about customising that to fit the required format. The finished article can be seen below and we hope you’ll agree, it’s a marked improvement on the stock one. (Eagle eyed readers and sci-fi/film-buffs will hopefully recognise the OCP logo!)

Our new, customised Shiny Server Pro login screen

This new login screen works exactly like the original and is configured for a single app only on the server, rather than all apps. We do have an additional “Forgot” button, that could be used to direct users to a support portal or to mail an administrator or similar.

Customisations are very simple, and use the reasonably common handlebars/mustache format. Values placed within handlebars, like `{{value}}`, are variables that Shiny Server replaces with the correct info when it renders the page. To keep things simple, I stripped out some logic from the original template that we didn’t need, such as handling for servers that use different authentication mechanisms.
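As a purely illustrative sketch (hypothetical variable and class names, not the actual Shiny Server Pro template), a customised fragment might look like this:

<!-- Hypothetical fragment in the handlebars style described above -->
<div class="login-panel">
  <h1>{{page_title}}</h1>
  <form method="POST">
    <input type="text" name="username" placeholder="Username">
    <input type="password" name="password" placeholder="Password">
    <button type="submit">Log in</button>
  </form>
</div>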



Hopefully this post has inspired you to check out this feature. It’s an excellent way to provide custom corporate branding, especially as it can be applied on an app-by-app basis. It’s worth knowing that this feature is not yet available in RStudio Connect, but hopefully it will arrive in a future update. If you do create any customisations of your own be sure to let us know! You’ll find us on twitter @MangoTheCat.


Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger

By Super User


(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

Parts of Speech (POS) tagging is a crucial part of natural language processing. It consists of labelling each word in a text document with a certain category like noun, verb, adverb, pronoun, and so on. At BNOSAC, we use it on a daily basis in order to select only nouns before we do topic detection, or in specific NLP flows. For R users working with different languages, the number of POS tagging options is small, and all have upsides and downsides. The following taggers are commonly used.

  • The Stanford Part-Of-Speech Tagger which is terribly slow, the language set is limited to English/French/German/Spanish/Arabic/Chinese (no Dutch). R packages for this are available at http://datacube.wu.ac.at.
  • Treetagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger) contains more languages but is only usable for non-commercial purposes (it can be used via the koRpus R package).
  • OpenNLP is faster and allows to do POS tagging for Dutch, Spanish, Polish, Swedish, English, Danish, German but no French or Eastern-European languages. R packages for this are available at http://datacube.wu.ac.at.
  • Package pattern.nlp (https://github.com/bnosac/pattern.nlp) allows Parts of Speech tagging and lemmatisation for Dutch, French, English, German, Spanish, Italian but needs Python installed which is not always easy to request at IT departments
  • SyntaxNet and Parsey McParseface (https://github.com/tensorflow/models/tree/master/syntaxnet) have good accuracy for POS tagging but need tensorflow installed, which might be too much installation hassle in a corporate setting, not to mention the computational resources needed.

Enter RDRPOSTagger, which BNOSAC released at https://github.com/bnosac/RDRPOSTagger. It has the following features:

  1. Easily installable in a corporate environment as a simple R package based on rJava
  2. Covering more than 40 languages:
    UniversalPOS annotation for languages: Ancient_Greek, Ancient_Greek-PROIEL, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Czech-CAC, Czech-CLTT, Danish, Dutch, Dutch-LassySmall, English, English-LinES, Estonian, Finnish, Finnish-FTB, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Kazakh, Latin, Latin-ITTB, Latin-PROIEL, Latvian, Norwegian, Old_Church_Slavonic, Persian, Polish, Portuguese, Portuguese-BR, Romanian, Russian-SynTagRus, Slovenian, Slovenian-SST, Spanish, Spanish-AnCora, Swedish, Swedish-LinES, Tamil, Turkish. Prepend the UD_ to the language if you want to used these models.
    MORPH annotation for languages: Bulgarian, Czech, Dutch, French, German, Portuguese, Spanish, Swedish
    POS annotation for languages: English, French, German, Hindi, Italian, Thai, Vietnamese
  3. Fast tagging as the Single Classification Ripple Down Rules are easy to execute and hence are quick on larger text volumes
  4. Competitive accuracy in comparison to state-of-the-art POS and morphological taggers
  5. Cross-platform running on Windows/Linux/Mac
  6. It allows morphological tagging, POS tagging and universal POS tagging of sentences

The Ripple Down Rules are basic binary classification trees built on top of the Universal Dependencies datasets available at http://universaldependencies.org. The methodology is explained in detail in the paper ‘A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging’, available at http://content.iospress.com/articles/ai-communications/aic698. If you just want to apply POS tagging on your text, you can go ahead as follows:

library(RDRPOSTagger)
rdr_available_models()

## POS annotation
x <- c("Oleg Borisovich Kulik is a Ukrainian-born Russian performance artist")
tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = x)

## MORPH/POS annotation
x <- c("Dus godvermehoeren met pus in alle puisten , zei die schele van Van Bukburg .",
"Er was toen dat liedje van tietenkonttieten kont tieten kontkontkont",
" ", "", NA)
tagger <- rdr_model(language = "Dutch", annotation = "MORPH")
rdr_pos(tagger, x = x)

## Universal POS tagging annotation
tagger <- rdr_model(language = "UD_Dutch", annotation = "UniversalPOS")
rdr_pos(tagger, x = x)

## This gives the following output
sentence.id word.id             word word.type
          1       1              Dus        ADV
          1       2   godvermehoeren       VERB
          1       3              met        ADP
          1       4              pus       NOUN
          1       5               in        ADP
          1       6             alle       PRON
          1       7          puisten       NOUN
          1       8                ,      PUNCT
          1       9              zei       VERB
          1      10              die       PRON
          1      11           schele        ADJ
          1      12              van        ADP
          1      13              Van      PROPN
          1      14          Bukburg      PROPN
          1      15                .      PUNCT
          2       1               Er        ADV
          2       2              was        AUX
          2       3             toen      SCONJ
          2       4              dat      SCONJ
          2       5           liedje       NOUN
          2       6              van        ADP
          2       7 tietenkonttieten       VERB
          2       8             kont      PROPN
          2       9           tieten       VERB
          2      10     kontkontkont      PROPN
          2      11                .      PUNCT
          3       0             <NA>       <NA>
          4       0             <NA>       <NA>
          5       0             <NA>       <NA>

The function rdr_pos requests as input a vector of sentences. If you need to transform your text data to sentences, just use tokenize_sentences from the tokenizers package.
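For example (a small sketch with our own sample text):

library(tokenizers)

txt <- "This is one sentence. This is another one."
sentences <- unlist(tokenize_sentences(txt))   # one element per sentence

tagger <- rdr_model(language = "English", annotation = "POS")
rdr_pos(tagger, x = sentences)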

Good luck with your text mining.
If you need our help on a text mining project, let us know; we’ll be glad to get you started.


Live Presentation: Kyle Walker on the tigris Package

By Ari Lamstein

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)


Next Tuesday (April 4) at 9am PT I will be hosting a webinar with Kyle Walker about the tigris package in R. In addition to creating the tigris package, Kyle is a Professor of Geography at Texas Christian University.

If you are interested in R, Mapping and US Census Geography then I encourage you to attend!

What tigris Does

When I released choroplethr back in 2014, many people loved that it allows you to easily map US States, Counties and ZIP Codes. What many people don’t realize is that:

  1. These maps come from the US Census Bureau
  2. The Census Bureau publishes many more maps than choroplethr ships with

What tigris does is remarkable: rather than relying on someone creating an R package with the US map you want, it allows you to download shapefiles directly from the US Census Bureau and import them into your R session, all without leaving the R console! From there you can render the map with your favorite graphics library.
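For instance, a minimal sketch (our own; treat the specifics as illustrative) to pull county boundaries for Texas:

library(tigris)

tx_counties <- counties(state = "TX")   # downloads TIGER shapefiles from the Census Bureau
plot(tx_counties)                       # render with your graphics library of choice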

Kyle will explain what shapefiles the Census Bureau has available, and also give a demonstration of how to use tigris. After the presentation, there will be a live Q&A session.

If you are unable to attend live, please register so I can email you a link to the replay.

SAVE MY SEAT

The post Live Presentation: Kyle Walker on the tigris Package appeared first on AriLamstein.com.


Learning Scrabble strategy from robots, using R

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While you might think of Scrabble as that game you play with your grandparents on a rainy Sunday, some people take it very seriously. There’s an international competition devoted to Scrabble, and no end of guides and strategies for competitive play. James Curley, a psychology professor at Columbia University, has used an interesting method to collect data about what plays are most effective in Scrabble: by having robots play against each other, thousands of times.

The data were generated with a Visual Basic script that automated two AI players completing a game in Quackle. Quackle emulates the Scrabble board, and provides a number of AI players; the simulation used the “Speedy Player” AI which tends to make tactical scoring moves while missing some longer-term strategic plays (like most reasonably skilled Scrabble players). He recorded the results of 2566 games between two such computer players and provided the resulting plays and boards in an R package. With these data, you can see some interesting statistics on long-term outcomes from competitive Scrabble games, like this map (top-left) of which squares on the board are most used in games (darker means more frequently), and also for just the Q, Z and blank tiles. Scrabble games in general tend to follow the diagonals where the double-word score squares are located, while the high-scoring Q and Z tiles tend to be used on double- and triple-letter squares. The zero-point blank tile, by comparison, is used fairly uniformly across the board.

Further analysis of the actual plays during the simulated games reveals some interesting Scrabble statistics:

It’s best to play first. Player 1 won 54.6% of games, while Player 2 won 44.9%, a statistically significant difference. (The remaining 0.4% of the games were ties.)
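(As a rough back-of-the-envelope check of that significance claim, here is our own sketch in R, excluding the ties:)

games <- 2566
p1_wins <- round(0.546 * games)          # about 1401 wins for Player 1
p2_wins <- round(0.449 * games)          # about 1152 wins for Player 2
binom.test(p1_wins, p1_wins + p2_wins, p = 0.5)   # tests departure from 50/50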

Some uncommon words are frequently used. The 10 most frequently-played words were QI, QAT, QIN, XI, OX, EUOI, XU, ZO, ZA, and EX; not necessarily words you’d use in casual conversation, but all words very familiar to competitive scrabble players. As we’ll see later though, not all are high-scoring plays. Sometimes it’s a good idea to get rid of a high-scoring but restrictive letter (QI, the life energy in Chinese philosophy), or simply to shake up a vowel-heavy rack for more opportunities next turn (EUOI, a cry of impassioned rapture in ancient Bacchic revels).

For high-scoring plays, go for the bingo. A Scrabble Bingo, where you lay down all 7 tiles in your rack in one play, comes with a 50-point bonus. The top three highest-scoring plays in the simulation were all bingo plays: REPIQUED (239 points), CZARISTS (230 points), and IDOLIZED. (Remember though, that this is from just a couple of thousand simulated games; there are many many more potentially high-scoring words.)

High-scoring non-bingo plays can be surprisingly long words. It’s super-satisfying to lay down a short word like EX with the X making a second word on a triple-letter tile (for a 9x bonus), so I was surprised to see the top 10 highest-scoring non-bingo plays were still fairly long words: XENURINE (144 points), CYANOSES (126 pts) and SNAPWEED (126 points), all using at least 2 tiles already on the board. The shortest word in the top 10 of this list was ZITS.

Some tiles just don’t work well together. The pain of getting a Q without a U seems obvious, but it turns out getting two U’s is way worse in point-scoring potential. From the simulation, you can estimate the point-scoring potential of any pair of tiles in your rack: lighter is better, and darker is worse.

Managing the scoring potential of the tiles in your rack is a big part of Scrabble strategy, as we saw in another Scrabble analysis using R a few years ago. The lowly zero-point blank is actually worth a lot of potential points, while the highest-scoring tile Q is actually a liability. Here are the findings from that analysis:

  • The blank is worth about 30 points to a good player, mainly by making 50-point “bingo” plays possible.
  • Each S is worth about 10 points to the player who draws it.
  • The Q is a burden to whichever player receives it, effectively serving as a 5 point penalty for having to deal with it due to its effect in reducing bingo opportunities, needing either a U or a blank for a chance at a bingo and a 50-point bonus.
  • The J is essentially neutral pointwise.
  • The X and the Z are each worth about 3-5 extra points to the player who receives them. Their difficulty in playing in bingoes is mitigated by their usefulness in other short words.

For more of James Curley’s recent Scrabble analysis, including the R code using his scrabblr package, follow the link below.

RPubs: Analyzing Scrabble Games
