sparklyr 0.8

By Kevin Kuo


We’re pleased to announce that sparklyr 0.8 is now available on CRAN! sparklyr provides an R interface to Apache Spark: it supports dplyr syntax for working with Spark DataFrames and exposes the full range of machine learning algorithms available in Spark ML. You can learn more about Apache Spark and sparklyr at spark.rstudio.com and in the sparklyr webinar series. This version adds support for Spark 2.3 and Livy 0.5, along with various enhancements and bug fixes. In this post, we’d like to highlight a new feature from Spark 2.3 and introduce the mleap and graphframes extensions.
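For readers new to the package, here is a minimal sketch of the dplyr workflow sparklyr supports; the local-mode connection and the mtcars data are used purely for illustration and are not part of the examples below:

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and copy a data frame to it
sc <- spark_connect(master = "local", version = "2.3.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed in Spark
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

spark_disconnect(sc)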

Parallel Cross-Validation

Spark 2.3 supports parallelism in hyperparameter tuning: instead of training each model specification serially, you can now train them in parallel. Enable this by setting the parallelism parameter in ml_cross_validator() or ml_train_validation_split(). Here’s an example:

library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.0")
iris_tbl <- sdf_copy_to(sc, iris)

# Define the pipeline
labels <- c("setosa", "versicolor", "virginica")
pipeline <- ml_pipeline(sc) %>%
  ft_vector_assembler(
    c("Sepal_Width", "Sepal_Length", "Petal_Width", "Petal_Length"),
    "features"
  ) %>%
  ft_string_indexer_model("Species", "label", labels = labels) %>%
  ml_logistic_regression()

# Specify hyperparameter grid
grid <- list(logistic_regression = list(
  elastic_net_param = c(0.25, 0.75),
  reg_param = c(1e-3, 1e-4)
))

# Create the cross-validator, training up to 4 models in parallel
cv <- ml_cross_validator(
  sc, estimator = pipeline, estimator_param_maps = grid,
  evaluator = ml_multiclass_classification_evaluator(sc),
  num_folds = 3, parallelism = 4
)

# Train the models
cv_model <- ml_fit(cv, iris_tbl)

Once the models are trained, you can inspect the performance results by using the newly available helper function ml_validation_metrics():

ml_validation_metrics(cv_model)
##       f1 elastic_net_param_1 reg_param_1
## 1 0.9506                0.25       1e-03
## 2 0.9384                0.75       1e-03
## 3 0.9384                0.25       1e-04
## 4 0.9569                0.75       1e-04
spark_disconnect(sc)
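Because ml_validation_metrics() returns a plain data frame, the grid results are also easy to chart. Here is a minimal ggplot2 sketch (not part of the original workflow), assuming the metrics were saved to a local data frame called metrics before disconnecting:

library(ggplot2)

# metrics <- ml_validation_metrics(cv_model), captured before spark_disconnect()
ggplot(metrics, aes(x = factor(reg_param_1), y = f1,
                    fill = factor(elastic_net_param_1))) +
  geom_col(position = "dodge") +
  labs(x = "reg_param", y = "F1", fill = "elastic_net_param")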

Pipelines in Production

Earlier this year, we announced support for ML Pipelines in sparklyr, and discussed how one can persist models to disk. While that workflow is appropriate for batch scoring of large datasets, we also wanted to enable real-time, low-latency scoring using pipelines developed with sparklyr. To enable this, we’ve developed the mleap package, available on CRAN, which provides an interface to the MLeap open source project.

MLeap allows you to use your Spark pipelines in any Java-enabled device or service. This works by serializing Spark pipelines which can later be loaded into the Java Virtual Machine (JVM) for scoring without requiring a Spark cluster. This means that software engineers can take Spark pipelines exported with sparklyr and easily embed them in web, desktop or mobile applications.

To get started, simply grab the package from CRAN and install the necessary dependencies:

install.packages("mleap")
library(mleap)
install_maven()
install_mleap()

Then, build a pipeline as usual:

library(sparklyr)
sc <- spark_connect(master = "local", version = "2.2.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)

# Define and fit the pipeline
pipeline <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg")

pipeline_model <- ml_fit(pipeline, mtcars_tbl)

Once we have the pipeline model, we can export it via ml_write_bundle():

# Export model
model_path <- file.path(tempdir(), "mtcars_model.zip")
ml_write_bundle(pipeline_model, mtcars_tbl, model_path)

At this point, we’re ready to use mtcars_model.zip in other applications. Notice that the following code does not require Spark:

# Import model
model <- mleap_load_bundle(model_path)

# Create a new data frame to score
newdata <- tibble::tribble(
  ~qsec, ~hp, ~wt,
  16.2,  101, 2.68,
  18.1,  99,  3.08
)

# Transform the data frame with the MLeap pipeline
transformed_df <- mleap_transform(model, newdata)
dplyr::glimpse(transformed_df)
## Observations: 2
## Variables: 6
## $ qsec       <dbl> 16.2, 18.1
## $ hp         <dbl> 101, 99
## $ wt         <dbl> 2.68, 3.08
## $ big_hp     <dbl> 1, 0
## $ features   <list> [[[1, 2.68, 16.2], [3]], [[0, 3.08, 18.1], [3]]]
## $ prediction <dbl> 21.07, 22.37
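To illustrate the kind of low-latency scoring this enables, here is a hypothetical sketch that wraps the exported bundle in a small plumber web service; plumber is not part of mleap and is used here purely as an example of embedding the model in an application without a Spark cluster:

# save as api.R and start with: plumber::plumb("api.R")$run(port = 8000)
library(mleap)

model <- mleap_load_bundle("mtcars_model.zip")

#* Score a single car
#* @param qsec:numeric
#* @param hp:numeric
#* @param wt:numeric
#* @post /predict
function(qsec, hp, wt) {
  newdata <- data.frame(
    qsec = as.numeric(qsec),
    hp   = as.numeric(hp),
    wt   = as.numeric(wt)
  )
  mleap_transform(model, newdata)$prediction
}

A POST request to /predict with qsec, hp, and wt parameters would then return the predicted mpg.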

Notice that MLeap requires Spark 2.0 to 2.2. You can find additional details in the production pipelines guide.

Graph Analysis

The other extension we’d like to highlight is graphframes, which provides an interface to the GraphFrames Spark package. GraphFrames allows us to run graph algorithms at scale using a DataFrame-based API.

Let’s see graphframes in action through a quick example, where we analyze the relationships among packages on CRAN.

library(graphframes)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.1.0")

# Grab the list of CRAN packages and their dependencies
packages <- available.packages() %>%
  `[`(, c("Package", "Depends", "Imports")) %>%
  as_tibble() %>%
  transmute(
    package = Package,
    dependencies = paste(Depends, Imports, sep = ",") %>%
      gsub("\\n|\\s+", "", .)
  )

# Copy data to Spark
packages_tbl <- sdf_copy_to(sc, packages, overwrite = TRUE)

# Build the edges table: one row per (package, dependency) pair
edges_tbl <- packages_tbl %>%
  mutate(
    dependencies = dependencies %>%
      regexp_replace("\\(([^)]+)\\)", "")
  ) %>%
  ft_regex_tokenizer(
    "dependencies", "dependencies_vector",
    pattern = "(\\s+)?,(\\s+)?", to_lower_case = FALSE
  ) %>%
  transmute(
    src = package,
    dst = explode(dependencies_vector)
  ) %>%
  filter(!dst %in% c("R", "NA"))

Once we have an edges table, we can easily create a GraphFrame object by calling gf_graphframe() and running PageRank:

# Create a GraphFrame object and run PageRank
g <- gf_graphframe(edges = edges_tbl)
ranks <- gf_pagerank(g, tol = 0.01)

# Inspect the highest-ranked vertices
ranks %>%
  gf_vertices() %>%
  arrange(desc(pagerank))
## # Source:     table [?? x 2]
## # Database:   spark_connection
## # Ordered by: desc(pagerank)
##    id        pagerank
##    <chr>        <dbl>
##  1 methods      259. 
##  2 stats        209. 
##  3 utils        194. 
##  4 Rcpp         109. 
##  5 graphics     104. 
##  6 grDevices     60.0
##  7 MASS          53.7
##  8 lattice       34.7
##  9 Matrix        33.3
## 10 grid          32.1
## # ... with more rows

We can also collect a sample of the graph locally for visualization:

library(gh)
library(visNetwork)
# List the repositories of a GitHub organization
list_repos <- function(org) {
  gh::gh("/orgs/:org/repos", org = org, .limit = Inf) %>%
    vapply("[[", "", "name")
}
rlib_repos <- list_repos("r-lib")
tidyverse_repos <- list_repos("tidyverse")

# Identify the base packages that ship with R
base_packages <- installed.packages() %>%
  as_tibble() %>%
  filter(Priority == "base") %>%
  pull(Package)

# Keep the 75 highest-ranked packages
top_packages <- ranks %>%
  gf_vertices() %>%
  arrange(desc(pagerank)) %>%
  head(75) %>%
  pull(id)

edges_local <- ranks %>%
  gf_edges() %>%
  filter(src %in% !!top_packages && dst %in% !!top_packages) %>%
  rename(from = src, to = dst) %>%
  collect()

vertices_local <- ranks %>%
  gf_vertices() %>%
  filter(id %in% top_packages) %>%
  mutate(
    group = case_when(
      id %in% !!rlib_repos ~ "r-lib",
      id %in% !!tidyverse_repos ~ "tidyverse",
      id %in% !!base_packages ~ "base",
      TRUE ~ "other"
    ),
    title = id) %>%
  collect()

visNetwork(vertices_local, edges_local, width = "100%") %>%
  visEdges(arrows = "to")

spark_disconnect(sc)

Notice that GraphFrames currently supports Spark 2.0 and 2.1. You can find additional details in the graph analysis guide.
