A minimal Project Tree in R

By mareviv

(This article was first published on R – Tales of R, and kindly contributed to R-bloggers)

You can also check this post, written in #blogdown, here: minimal-project-tree-r.

Introduction

Over the last two days, some discussions arrived in my Twitter feed about how bad the following statements are at the beginning of your R script/notebook, sparked by @JennyBryan's slides at the IASC-ARS/NZSA Conference:

setwd()

and

rm(list = ls())

Blog post elaborating on my advice to avoid setwd() and rm(list = ls()) at top of an #rstats script.
Wow that reall… twitter.com/i/web/status/9…


Jenny Bryan (@JennyBryan) December 12, 2017

The only two things that make @JennyBryan 😠🤯. Instead use projects + here::here() #rstats https://t.co/GwxnHePL4n


Hadley Wickham (@hadleywickham) December 11, 2017

How to ensure @JennyBryan doesn’t 🔥 your 💻, or: “Project-oriented workflow” (from the arsonist herself)… twitter.com/i/web/status/9…


Mara Averick (@dataandme) December 12, 2017

Jenny Bryan offered a detailed explanation for this, as well as some fixes, in her tidyverse blog post. The main idea was:

  • To ensure reproducibility within a stable working directory tree. She proposes the very concise here::here(), but other methods are available, such as the template package.
  • To avoid wreaking havoc in other people's computers with rm(list = ls())!

All of this buzz around project self-containment and reproducibility motivated me to finish a minimal directory tree that (with some variations) I have been using for this year’s data analysis endeavours.

It is an extremely simple tree which separates a /data, a /plot and an /img directory inside the main folder (root):

  • The data folder contains both raw data and processed data files saved by R.
  • The plot folder contains all the plots saved during the workflow.
  • The img folder has every other image (logos, etc) that R takes as an input to build the results.
  • Inside the root folder I store the main .R or .Rmd scripts.

This ensures that every folder has an unidirectional relationship with the root folder (except the data dir in this case). But the important thing is that the paths in the scripts are set relative to the root folder, so the entire tree can be copied elsewhere and still work as expected.
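For instance, a minimal sketch of that relative-path idea with here::here() (my own illustration; the file names are placeholders and not part of the original template):

library(here)                                    # here() resolves paths from the project root
mydata <- read.csv(here("data", "mydata.csv"))   # still works wherever the tree is copied
ggplot2::ggsave(here("plot", "p1.png"))          # save a plot into /plot, relative to root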

I also added some more features to the tree:

  • An .Rproj file
  • Parametrize the .Rmd file (a minimal sketch of such a header follows this list)
  • A Git repository so the tree can be conveniently cloned or downloaded, with a .gitignore file:
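As a hedged illustration (not shown in the original post), the YAML header of a parametrized .Rmd could look roughly like this, so that params$dataname is available to the chunks below; the title and default value are assumptions:

---
title: "Data analysis notebook"
output: html_document
params:
  dataname: "cars"
---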

Here is a sketch of how it works:

And here is the actual code of the notebook/script. I have not included regular markdown text outside the R chunks, as this template is intended to be filled with new text each time:

Script code

# Installs missing libraries on render!
list.of.packages <- c("rmarkdown", "dplyr", "ggplot2", "Rcpp", "knitr", "readxl")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, repos='https://cran.rstudio.com/')
library(dplyr)
library(knitr)
library(ggplot2)

Working directories

# directory where the notebook is
wdir <- getwd()
# directory where data are imported from & saved to
datadir <- paste(wdir, "/data", sep="")
# directory where external images are imported from
imgdir <- paste(wdir, "/img", sep="")
# directory where plots are saved to
plotdir <- paste(wdir, "/plot", sep="")
# and the folder immediately above
wdirUp <- gsub("ProjectR", "", wdir) 

Data import

# Data name (stored as a parameter in the Rmarkdown notebook)
params$dataname <- "cars"
# MSEXCEL
dataname <- params$dataname # file name
routexl <- paste(datadir, "/", dataname, ".xlsx", sep="")  # complete path to the file

library(readxl)
mydata <- read_excel(routexl, sheet = 1)  # imports first sheet
# CSV / TSV (separated by tabs in this example)
dataname <- params$dataname # file name
routecsv <- paste(datadir, "/", dataname, ".csv", sep="")  # complete path to the file

mydata <- read.table(paste(routecsv, sep=""), 
         header = TRUE, 
         sep = "\t",
         dec = ",")

Data operations

head(mydata)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
p1 <- ggplot(cars, aes(x=speed, y=dist)) + geom_point()
p1

Save plots

# TO PDF 
plotname1 <- "p1.pdf"
# TO PNG 
plotname2 <- "p1.png"

routeplot1 <- paste(plotdir, "/", plotname1, sep="")
routeplot2 <- paste(plotdir, "/", plotname2, sep="")
ggsave(routeplot1)  # (see http://ggplot2.tidyverse.org/reference/ggsave.html)
ggsave(routeplot2) 

Save data

# RDATA
save(mydata, file="data/mydata.RData")
# MSEXCEL # not run
dataname2 <- "mydata"  # name we will give to the file
routexl2 <- paste(datadir, "/", dataname2, ".xlsx", sep="")   # complete path to the future file

library(xlsx)
write.xlsx(mydata, routexl2) # creates the file at the specified path
# CSV / TSV (separated by tabs in this example)
dataname2 <- "mydata"  # name we will give to the file
routecsv2 <- paste(datadir, "/", dataname2, ".csv", sep="")  # complete path to the future file

write.table(mydata, file = routecsv2, append = FALSE, quote = FALSE, sep = "\t",
            eol = "\n", na = "NA", dec = ".", row.names = FALSE,
            col.names = TRUE)

This script (and the dir tree that contains it) is saving me a lot of time and headaches (where did I put that data?…). I hope it can also be useful for people out there!

Future improvements

To leave a comment for the author, please follow the link and comment on their blog: R – Tales of R.


Source:: R News

A chart of Bechdel Test scores

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

A movie is said to satisfy the Bechdel Test if it satisfies the following three criteria:

  1. The movie has at least two named female characters
  2. … who have a conversation with each other
  3. … about something other than a man

The website BechdelTest.com scores movies accordingly, granting one point for each of the criteria above for a maximum of three points. The recent Wonder Woman movie scores the full three points (Diana and her mother discuss war, for example), while Dunkirk, which features no named female characters, gets zero. (It was still a great film, however.)

The website also offers an API, which enabled data and analytics director Austin Wehrwein to create this time series chart of Bechdel scores for movies listed on BechdelTest.com:

This chart only includes ratings for the subset of movies listed on Bechdeltest.com, so it's not clear whether it is representative of movies as a whole. Austin suggests combining these data with the broader data from IMDb, so maybe someone wants to give that a try. Austin's R code for generating the above chart and several others is available at the link below, so click through for the full analysis.

Austin Wehrwein: A quick look at Bechdel test data (& an awtools update) (via Mara Averick)

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

Leveraging pipelines in Spark through Scala and sparklyr

By pierluigi mazzuca

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Intro

The pipeline concept is definitely not new to the software world: the Unix pipe operator (|) links two tasks by feeding the output of one as the input of the other. In machine learning solutions it is quite usual to apply several transformations and manipulations to datasets, or to different portions or samples of the same dataset (from classic test/train splits to samples taken for cross-validation). In these cases, pipelines are definitely useful to define a sequence of tasks chained together into a complete process, which in turn simplifies the operation of the ML solution. In addition, in a Big Data environment it is possible to apply the "laziness" of execution to the entire process in order to make it more scalable and robust, so it is no surprise to see pipelines implemented in Spark's machine learning library, with an R API made available by the sparklyr package to leverage the construct. Pipeline components in Spark are basically of two types:

  • transformer: since dataframes usually need to undergo several kinds of changes column-wise, row-wise or even single-value-wise, transformers are the components meant to deliver these transformations. Typically a transformer takes a table or dataframe as input and delivers a table/dataframe as output. Spark, through MLlib, provides a set of feature transformers meant to address the most common transformations needed;
  • estimator: an estimator is the component which delivers a model by fitting an algorithm to training data. Indeed fit() is the key method for an estimator and produces, as said, a model, which is itself a transformer. Leveraging the parallel processing provided by Spark, it is possible to run several estimators in parallel on different training datasets in order to find the best solution (or even to avoid overfitting). ML algorithms are basically a set of estimators; they build the rich set of machine learning (ML) algorithms available from MLlib, a library designed to be scalable and to run in a parallel environment. Starting from the 2.0 release of Spark, the RDD-based library is in maintenance mode (the RDD-based APIs are expected to be removed in the 3.0 release) whereas mainstream development is focused on supporting dataframes. In MLlib, features are to be expressed as labeled points, which means numeric vectors for features and predictors. Pipelines of transformers are, also for this reason, extremely useful to operationalize Spark-based ML solutions. For additional details on MLlib refer to the Apache Spark documentation.

In this post we'll see a simple example of pipeline usage, in two versions: the first one using Scala (which is a kind of "native" language for the Spark environment); afterwards we'll see how to implement the same example in R, leveraging the sparklyr package, in order to show how powerful and complete it is.

The dataset

For this example, the dataset comes from the UCI Machine Learning Repository (Irvine, CA: University of California, School of Information and Computer Science). The "Adult Income" dataset reports individuals' annual income as a result of various factors. The annual income will be our label (it is divided into two classes: <=50K and >50K) and there are 14 features, for each individual, that we can leverage to explore the possibility of predicting income level. For additional detail on the "adult" dataset refer to the UCI machine learning repository http://www.cs.toronto.edu/~delve/data/adult/desc.html.

Scala code

As said, we'll show how we can use the Scala API to access pipelines in MLlib, therefore we need to include references to the classes we're planning to use in our example and to start a Spark session:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer, VectorIndexer}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession
Then we'll read the dataset and start to manipulate the data in order to prepare it for the pipeline. In our example, we'll get data out of a local repository (instead of referring to e.g. an HDFS or Datalake repository; there are APIs, for both Scala and R, which allow access to these repositories as well). We'll leverage this loading step also to rename some columns; in particular, we'll rename the "income" column as "label" since we'll use it as the label column in our logistic regression algorithm.


//load data source from local repository
val csv = spark.read.option("inferSchema","true")
  .option("header", "true").csv("...yyyyxxxxadult.csv")

val data_start = csv.select(($"workclass"),($"gender"),($"education"),($"age"),
($"marital-status").alias("marital"), ($"occupation"),($"relationship"), 
($"race"),($"hours-per-week").alias("hr_per_week"), ($"native-country").alias("country"),
($"income").alias("label"),($"capital-gain").alias("capitalgain"),
($"capital-loss").alias("capitalloss"),($"educational-num").alias("eduyears")).toDF

We'll do some data clean-up, basically recoding the "workclass" and "marital" columns in order to reduce the number of codes, and we'll get rid of rows for which "occupation", "workclass" (even recoded) and "capital gain" are not available. For the first two columns the dataset has the "?" value instead of "NA"; for capital gain there's the 99999 value, which will be filtered out. To recode the "workclass" and "marital" columns we'll use UDF functions, which in turn are wrappers of the actual recoding functions. To add a new column to the (new) dataframe we'll use the "withColumn" method, which will add "new_marital" and "new_workclass" to the startingdata dataframe. Afterwards, we'll filter out all missing values and we'll be ready to build the pipeline.


// recoding marital status and working class, adding a new column 
def newcol_marital(str1:String): String = { 
    var nw_str = "noVal"
     if ((str1 == "Married-spouse-absent") | (str1 =="Married-AF-spouse") 
         | (str1 == "Married-civ-spouse")) {nw_str = "Married" } 
        else if ((str1 == "Divorced") | (str1 == "Never-married" ) 
                 | (str1 == "Separated" ) | (str1 == "Widowed")) 
          {nw_str = "Nonmarried" } 
        else { nw_str = str1}
    return nw_str
}
val udfnewcol = udf(newcol_marital _)
val recodeddata = data_start.withColumn("new_marital", udfnewcol($"marital"))

def newcol_wkclass(str1:String): String = { 
    var nw_str = "noVal"
     if ((str1 == "Local-gov") | (str1 =="Federal-gov") | (str1 == "State-gov")) 
        {nw_str = "Gov" } 
    else if ((str1 == "Self-emp-not-inc") | (str1 == "Self-emp-inc" )) 
        {nw_str = "Selfemployed" } 
    else if ((str1 == "Never-worked") | (str1 == "Without-pay")) 
        {nw_str = "Notemployed" } 
    else { nw_str = str1}
    return nw_str
}

val udfnewcol = udf(newcol_wkclass _)
val startingdata = recodeddata.withColumn("new_workclass", udfnewcol($"workclass"))

// remove missing data
val df_work01 = startingdata.na.drop("any")
val df_work = startingdata.filter("occupation <> '?' and capitalgain < 99999 " +
                                  "and new_workclass <> '?' and country <> '?'")

In our example, we're going to use 12 features: 7 are categorical variables and 5 are numeric variables. The feature array we'll use to fit the model will be the result of merging two arrays, one for categorical variables and the second one for numeric variables. Before building the categorical variable array, we need to transform categories to indexes using transformers provided by spark.ml; even the label must be transformed to an index. Our pipeline then will include 7 pipeline stages to transform categorical variables, 1 stage to transform the label, 2 stages to build categorical and numeric arrays, and the final stage to fit the logistic model. Finally, we'll build an 11-stage pipeline.
To transform categorical variables into indexes we'll use the StringIndexer transformer. StringIndexer encodes a column of strings into a column of non-negative indices, ranging from 0 to the number of values. The indices are ordered by label frequencies, so the most frequent value gets index 0. For each variable, we need to define the input column and the output column which we'll use as input for other transformers or evaluators. Finally, it is possible to define the strategy to handle unseen labels (possible when you use the StringIndexer to fit a model and run the fitted model against a test dataframe) through the setHandleInvalid method; in our case we simply put "skip" to tell StringIndexer we want to skip unseen labels (additional details are available in the MLlib documentation).


// define stages
val new_workclassIdx = new StringIndexer().setInputCol("new_workclass")
.setOutputCol("new_workclassIdx").setHandleInvalid("skip")

val genderIdx = new StringIndexer().setInputCol("gender")
.setOutputCol("genderIdx").setHandleInvalid("skip")

val maritalIdx = new StringIndexer().setInputCol("new_marital")
.setOutputCol("maritalIdx").setHandleInvalid("skip")

val occupationIdx = new StringIndexer().setInputCol("occupation")
.setOutputCol("occupationIdx").setHandleInvalid("skip")

val relationshipIdx = new StringIndexer().setInputCol("relationship")
.setOutputCol("relationshipIdx").setHandleInvalid("skip")

val raceIdx = new StringIndexer().setInputCol("race")
.setOutputCol("raceIdx").setHandleInvalid("skip")

val countryIdx = new StringIndexer().setInputCol("country")
.setOutputCol("countryIdx").setHandleInvalid("skip")

val labelIdx = new StringIndexer().setInputCol("label")
.setOutputCol("labelIdx").setHandleInvalid("skip")

In addition to the Transformers and Estimators provided by the spark.ml package, it is possible to define custom Estimators and Transformers. As an example we'll see how to define a custom transformer aimed at recoding "marital status" in our dataset (basically we'll do the same task we have already seen, implementing it with a custom transformer; for additional details on implementing custom estimators and transformers see the nice article by H. Karau). To define a custom transformer, we'll define a new Scala class, columnRecoder, which extends the Transformer class; we'll override the transformSchema method to map the correct type of the new column we're going to add with the transformation, and we'll implement the transform method which actually does the recoding in our dataset. A possible implementation is:


import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable

class columnRecoder(override val uid: String) extends Transformer {
  final val inputCol= new Param[String](this, "inputCol", "input column")
  final val outputCol = new Param[String](this, "outputCol", "output column")

def setInputCol(value: String): this.type = set(inputCol, value)

def setOutputCol(value: String): this.type = set(outputCol, value)

def this() = this(Identifiable.randomUID("columnRecoder"))

def copy(existingParam: ParamMap): columnRecoder = {defaultCopy(existingParam)}
override def transformSchema(schema: StructType): StructType = {
    // Check inputCol type
    val idx = schema.fieldIndex($(inputCol))
    val field = schema.fields(idx)
    if (field.dataType != StringType) {
      throw new Exception(s"Input type ${field.dataType} type mismatch: String expected")
    }
    // The return field
    schema.add(StructField($(outputCol),StringType, false))
  }

// recoding function for marital status
private def newcol_recode(str1: String): String = { 
    var nw_str = "noVal"
     if ((str1 == "Married-spouse-absent") | (str1 =="Married-AF-spouse") 
        | (str1 == "Married-civ-spouse")) 
       {nw_str = "Married" } 
        else if ((str1 == "Divorced") | (str1 == "Never-married" ) 
        | (str1 == "Separated" ) | (str1 == "Widowed")) 
          {nw_str = "Nonmarried" } 
        else { nw_str = str1}
    nw_str
  }
  
private def udfnewcol = udf(newcol_recode _)
  
def transform(df: Dataset[_]): DataFrame = { 
  df.withColumn($(outputCol), udfnewcol(df($(inputCol)))) 
  }
}

Once defined as a transformer, we can use it in our pipeline as the first stage.


// define stages
val new_marital = new columnRecoder().setInputCol("marital")
.setOutputCol("new_marital")

val new_workclassIdx = new StringIndexer().setInputCol("new_workclass")
.setOutputCol("new_workclassIdx").setHandleInvalid("skip")

val genderIdx = new StringIndexer().setInputCol("gender")
.setOutputCol("genderIdx").setHandleInvalid("skip")

val maritalIdx = new StringIndexer().setInputCol("new_marital")
.setOutputCol("maritalIdx").setHandleInvalid("skip")

.......

A second step in building our pipeline is to assemble the categorical indexes into a single vector: the categorical indexes are put all together in a single vector through the VectorAssembler transformer. This VectorAssembler will deliver a single feature column which will, in turn, be transformed to indexes by the VectorIndexer transformer to deliver indexes within the "catFeatures" column:

// cat vector for categorical variables
val catVect = new VectorAssembler()
                  .setInputCols(Array("new_workclassIdx", "genderIdx", "maritalIdx", "occupationIdx","relationshipIdx","raceIdx","countryIdx"))
                  .setOutputCol("cat01Features")

val catIdx = new VectorIndexer()
                  .setInputCol(catVect.getOutputCol)
                  .setOutputCol("catFeatures")

For numeric variables we need just to assemble columns with VectorAssembler, then we’re ready to put these two vectors (one for categorical variables, the other for numeric variables) together in a single vector.


// numeric vector for numeric variable
val numVect = new VectorAssembler()
                  .setInputCols(Array("age","hr_per_week","capitalgain","capitalloss","eduyears"))
                  .setOutputCol("numFeatures")

val featVect = new VectorAssembler()
                    .setInputCols(Array("catFeatures", "numFeatures"))
                    .setOutputCol("features")

We now have the label and features ready to build the logistic regression model, which is the final component of our pipeline. We can also set some parameters for the model: in particular, we can define the threshold (by default set to 0.5) used to decide between label values, as well as the max number of iterations for the algorithm and a parameter to tune regularization.
When all the stages of the pipeline are ready, we just need to define the pipeline component itself, passing as an input parameter an array with all the defined stages:

val lr = new LogisticRegression().setLabelCol("labelIdx").setFeaturesCol("features")
  .setThreshold(0.33).setMaxIter(10).setRegParam(0.2)

val pipeline = new Pipeline().setStages(Array(new_marital,new_workclassIdx, labelIdx,maritalIdx,occupationIdx, relationshipIdx,raceIdx,genderIdx, countryIdx,catVect, catIdx, numVect,featVect,lr))

Now the pipeline component, which encompasses a number of transformations as well as the classification algorithm, is ready; to actually use it we supply a train dataset to fit the model and then a test dataset to evaluate our fitted model. Since we have defined a pipeline, we'll be sure that both train and test datasets undergo the same transformations; therefore, we don't have to replicate the process twice.
We now need to define the train and test datasets. In our dataset, label values are unbalanced, the "more than 50K USD per year" value being around 25% of the total; in order to preserve the same proportion between label values we'll subset the original dataset based on the label value, obtaining a low-income dataset and a high-income dataset. We'll split both datasets into train (70%) and test (30%) parts, then we'll merge back the two "train" and the two "test" datasets, and we'll use the resulting "train" dataset as input for our pipeline:

// split betwen train and test
val df_low_income = df_work.filter("label == '<=50K'")
val df_high_income = df_work.filter("label == '>50K'")

val splits_LI = df_low_income.randomSplit(Array(0.7, 0.3), seed=123)
val splits_HI = df_high_income.randomSplit(Array(0.7, 0.3), seed=123)

val df_work_train = splits_LI(0).unionAll(splits_HI(0))
val df_work_test = splits_LI(1).unionAll(splits_HI(1))

// fitting the pipeline
val data_model = pipeline.fit(df_work_train)

Once the pipeline is trained, we can use data_model for testing against the test dataset, calculate the confusion matrix and evaluate the classifier metrics:

// generate prediction
val data_prediction = data_model.transform(df_work_test)
val data_predicted = data_prediction.select("features", "prediction", "label","labelIdx")

// confusion matrix
val tp = data_predicted.filter("prediction == 1 AND labelIdx == 1").count().toFloat
val fp = data_predicted.filter("prediction == 1 AND labelIdx == 0").count().toFloat
val tn = data_predicted.filter("prediction == 0 AND labelIdx == 0").count().toFloat
val fn = data_predicted.filter("prediction == 0 AND labelIdx == 1").count().toFloat
val metrics = spark.createDataFrame(Seq(
 ("TP", tp),
 ("FP", fp),
 ("TN", tn),
 ("FN", fn),
 ("Precision", tp / (tp + fp)),
 ("Recall", tp / (tp + fn)))).toDF("metric", "value")
metrics.show()

R code and SparklyR

Now we'll try to replicate the same example we just saw in R, more precisely working with the sparklyr package.
We'll use the development version of sparklyr (as you possibly know, there's an interesting debate on the best API to access Apache Spark resources from R; for those who want to know more, see https://github.com/rstudio/sparklyr/issues/502 and http://blog.revolutionanalytics.com/2016/10/tutorial-scalable-r-on-spark.html).
We need to install it from GitHub before connecting to the Spark environment.
In our case Spark is a standalone instance running version 2.2.0. As reported in the official sparklyr documentation, configuration parameters can be set through the spark_config() function; in particular, spark_config() provides the basic configuration used by default for the Spark connection. To change parameters it's possible to get the default configuration via spark_config() and then change parameters as needed (here's the link for additional details on running sparklyr on a Yarn cluster).

devtools::install_github("rstudio/sparklyr")
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", spark_home = "...Localspark", version = "2.2.0")

After connecting to Spark, we'll read the dataset into a Spark dataframe and select the fields (with column renaming, where needed) we're going to use for our classification example. It is worthwhile to remember that dplyr (and therefore sparklyr) uses lazy evaluation when accessing and manipulating data, which means that 'the data is only fetched at the last moment when it's needed' (Efficient R Programming, C. Gillespie and R. Lovelace). For this reason, later on we'll force statement execution by calling action functions (like collect()).
As we did in the Scala script, we'll recode "marital status" and "workclass" in order to simplify the analysis. In renaming dataframe columns, we'll use the "select" function available from the dplyr package; indeed one of the sparklyr aims is to allow the manipulation of Spark dataframes/tables through dplyr functions. This is an example of how functions like "select" or "filter" can also be used on Spark dataframes.

income_table <- spark_read_csv(sc, "income_table", "...adultincomeadult.csv")

income_table <- select(income_table,"workclass","gender","eduyears"="educationalnum",
        "age","marital"="maritalstatus","occupation","relationship",
        "race","hr_per_week"="hoursperweek","country"="nativecountry",
        "label"="income","capitalgain","capitalloss")

# recoding marital status and workingclass

income_table <- income_table %>% mutate(marital = ifelse(marital == "Married-spouse-absent" | marital == "Married-AF-spouse" | 
marital == "Married-civ-spouse","married","nonMarried"))

income_table <- income_table %>% mutate(workclass = ifelse(workclass == "Local-gov"| workclass == "Federal-gov" | workclass == "State-gov",
"Gov",ifelse(workclass == "Self-emp-inc" | workclass == "Self-emp-not-inc","Selfemployed",ifelse(workclass=="Never-worked" | 
workclass == "Without-pay","Notemployed",workclass))))   
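As a side note on the lazy evaluation mentioned above, here is a minimal sketch of my own (using the columns selected earlier): the dplyr verbs only build a Spark SQL query, and nothing is computed until an action such as collect() pulls the result into R.

# nothing is executed yet: this just defines a query on the Spark table
high_hours <- income_table %>%
  filter(hr_per_week > 40) %>%
  select(workclass, hr_per_week)

# collect() triggers execution and returns a local R data frame
local_high_hours <- collect(high_hours)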

sparklyr provides functions which are bound to Spark's spark.ml package, therefore it is possible to build machine learning solutions putting together the power of the dplyr grammar with Spark ML algorithms. To link package functions to spark.ml, sparklyr uses specific prefixes to identify the function groups:

  • functions prefixed with sdf_ generally access the Scala Spark DataFrame API directly (as opposed to the dplyr interface which uses Spark SQL) to manipulate dataframes;
  • functions prefixed with ft_ are functions to manipulate and transform features. Pipeline transformers and estimators belong to this group of functions;
  • functions prefixed with ml_ implement algorithms to build machine learning workflow. Even pipeline instance is provided by ml_pipeline() which belongs to these functions.

We can then proceed with the pipeline, stages and feature array definition. The ft-prefixed functions recall the spark.ml functions they are bound to:

income_pipe <- ml_pipeline(sc,uid="income_pipe")
income_pipe <- ft_string_indexer(income_pipe,input_col="workclass",output_col="workclassIdx")
income_pipe <- ft_string_indexer(income_pipe,input_col="gender",output_col="genderIdx")
income_pipe <- ft_string_indexer(income_pipe,input_col="marital",output_col="maritalIdx")
income_pipe <- ft_string_indexer(income_pipe,input_col="occupation",output_col="occupationIdx")
income_pipe <- ft_string_indexer(income_pipe,input_col="race",output_col="raceIdx")
income_pipe <- ft_string_indexer(income_pipe,input_col="country",output_col="countryIdx")

income_pipe <- ft_string_indexer(income_pipe,input_col="label",output_col="labelIdx")

array_features <- c("workclassIdx","genderIdx","maritalIdx",
                    "occupationIdx","raceIdx","countryIdx","eduyears",
                    "age","hr_per_week","capitalgain","capitalloss")

income_pipe <- ft_vector_assembler(income_pipe, input_col=array_features, output_col="features")

In our example we'll use ml_logistic_regression() to implement the classification solution aimed at predicting the income level. Being bound to the LogisticRegression() function in the spark.ml package, (as expected) all parameters are available for specific settings, therefore we can "mirror" the same call we made in the Scala code: e.g. we can manage the "decision" threshold for our binary classifier (setting the value to 0.33).

# putting in pipe the logistic regression evaluator
income_pipe <- ml_logistic_regression(income_pipe, 
features_col = "features", label_col = "labelIdx",
family= "binomial",threshold = 0.33, reg_param=0.2, max_iter=10L)

The final steps in our example are to split the data between train and test subsets, fit the pipeline and evaluate the classifier. As we already did in the Scala code, we'll manage the imbalance in label values, splitting the dataframe in a way which preserves the relative percentage of label values. Afterwards, we'll fit the pipeline and get the predictions for the test dataframe (as we did already in the Scala code).

# data split
# dealing with label inbalance

df_low_income = filter(income_table,income_table$label == "<=50K")
df_high_income = filter(income_table,income_table$label == ">50K")

splits_LI <- sdf_partition(df_low_income, test=0.3, train=0.7, seed = 7711)
splits_HI <- sdf_partition(df_high_income,test=0.3,train=0.7,seed=7711)

df_test <- sdf_bind_rows(splits_LI[[1]],splits_HI[[1]])
df_train <- sdf_bind_rows(splits_LI[[2]],splits_HI[[2]])

df_model <- ml_fit(income_pipe,df_train)

Once fitted, the model exposes, for each pipeline stage, all the parameters, as well as the coefficients of the logistic regression (which is the last element in the stages list).

df_model$stages
df_model$stages[[9]]$coefficients

We can then process the test dataset putting it in the pipeline to get predictions and evaluate the fitted model

df_prediction <- ml_predict(df_model,df_test)
df_results <- select(df_prediction,"prediction","labelIdx","probability")

We can then proceed to evaluate the confusion matrix and get main metrics to evaluate the model (from precision to AUC).

# calculating confusion matrix

df_tp <- filter(df_results,(prediction == 1 && labelIdx == 1))
df_fp <- filter(df_results,(prediction == 1 && labelIdx == 0))
df_tn <- filter(df_results,(prediction == 0 && labelIdx == 0))
df_fn <- filter(df_results,(prediction == 0 && labelIdx == 1))


tp <- df_tp %>% tally() %>% collect() %>% as.integer()
fp <- df_fp %>% tally() %>% collect() %>% as.integer()
tn <- df_tn %>% tally() %>% collect() %>% as.integer()
fn <- df_fn %>% tally() %>% collect() %>% as.integer()

df_precision <- (tp/(tp+fp))
df_recall <- (tp/(tp+fn))
df_accuracy = (tp+tn)/(tp+tn+fp+fn)

c_AUC <- ml_binary_classification_eval(df_prediction, label_col = "labelIdx",
prediction_col = "prediction", metric_name = "areaUnderROC")

Conclusion

As we have seen, pipelines are a useful mechanism to assemble and serialize transformations in order to make them repeatable for different sets of data. They are then a simple way of fitting and evaluating models through train/test datasets, but are also suitable for running the same sequence of transformers/estimators in parallel over different nodes of the Spark cluster (i.e. to find the best parameter set). Moreover, a powerful API to deal with pipelines in R is available through the sparklyr package, which in addition provides a comprehensive set of functions to leverage the ML Spark package (here's a link for a guide to deploying sparklyr in different cluster environments). Finally, support for running R code distributed across a Spark cluster has been added to sparklyr with the spark_apply() function (https://spark.rstudio.com/guides/distributed-r/), which makes the possibility of leveraging pipelines in R in a distributed environment for analytical solutions even more interesting.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


Source:: R News

New R Course: Introduction to the Tidyverse!

By gabriel

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Hi! Big announcement today as we just launched Introduction to the Tidyverse R course by David Robinson!

This is an introduction to the programming language R, focused on a powerful set of tools known as the “tidyverse”. In the course you’ll learn the intertwined processes of data manipulation and visualization through the tools dplyr and ggplot2. You’ll learn to manipulate data by filtering, sorting and summarizing a real dataset of historical country data in order to answer exploratory questions. You’ll then learn to turn this processed data into informative line plots, bar plots, histograms, and more with the ggplot2 package. This gives a taste both of the value of exploratory data analysis and the power of tidyverse tools. This is a suitable introduction for people who have no previous experience in R and are interested in learning to perform data analysis.

Take me to chapter 1!

Introduction to the Tidyverse features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a Tidyverse expert!

What you’ll learn

1. Data wrangling

In this chapter, you’ll learn to do three things with a table: filter for particular observations, arrange the observations in a desired order, and mutate to add or change a column. You’ll see how each of these steps lets you answer questions about your data.
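As a rough sketch of those three verbs (this example is mine, not from the course, and assumes the gapminder package's country data as a stand-in for the course dataset):

library(dplyr)
library(gapminder)

gapminder %>%
  filter(year == 2007) %>%          # keep the observations for one year
  arrange(desc(gdpPercap)) %>%      # order countries by income per person
  mutate(gdp = gdpPercap * pop)     # add a total-GDP column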

2. Data visualization

You’ve already been able to answer some questions about the data through dplyr, but you’ve engaged with them just as a table (such as one showing the life expectancy in the US each year). Often a better way to understand and present such data is as a graph. Here you’ll learn the essential skill of data visualization, using the ggplot2 package. Visualization and manipulation are often intertwined, so you’ll see how the dplyr and ggplot2 packages work closely together to create informative graphs.
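For instance, a minimal ggplot2 sketch of such a graph (again my own example, assuming the gapminder data) could be:

library(ggplot2)
library(gapminder)

us <- dplyr::filter(gapminder, country == "United States")

# life expectancy in the US over time, as a line plot rather than a table
ggplot(us, aes(x = year, y = lifeExp)) +
  geom_line()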

3. Grouping and summarizing

So far you've been answering questions about individual country-year pairs, but we may be interested in aggregations of the data, such as the average life expectancy of all countries within each year. Here you'll learn to use the group_by and summarize verbs, which collapse large datasets into manageable summaries.
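A hedged sketch of that grouping step (same assumed gapminder data, not taken from the course itself):

library(dplyr)
library(gapminder)

# average life expectancy across all countries within each year
gapminder %>%
  group_by(year) %>%
  summarize(meanLifeExp = mean(lifeExp))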

4. Types of visualizations

You’ve learned to create scatter plots with ggplot2. In this chapter you’ll learn to create line plots, bar plots, histograms, and boxplots. You’ll see how each plot needs different kinds of data manipulation to prepare for it, and understand the different roles of each of these plot types in data analysis.

Master the Tidyverse with our course Introduction to the Tidyverse

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.


Source:: R News

Be a BigBallR in #rstats : Stayeth or Switcheth

By MikeJackTzen

(This article was first published on rstats – MikeJackTzen, and kindly contributed to R-bloggers)

If you’re a Big Baller, you know when to stayeth in your lane but also when to switcheth lanes.

@LangstonKerman four score and seven years ago our fathers brought forth on this continent, a new lane to stayeth in


MikeJackTzen (@MKJCKTZN) November 19, 2017

The Big Baller Brand brothers just inked a deal to play professional basketball in Europe. They switched into a different lane to achieve their original goal.

It’s not about the money for the Ball Brothers. They have a passion to play Basketball and to experience playing as… twitter.com/i/web/status/9…


Big Baller Brand (@bigballerbrand) December 12, 2017

The knee-jerk reaction from the non-globally minded is that this spells doom for any NBA hoop dreams. Not so.

@bigballerbrand @MELOD1P @LiAngeloBall shoutouts to @bigballerbrand
for pursuing other ways 2 achieve goals.
maybe… twitter.com/i/web/status/9…


MikeJackTzen (@MKJCKTZN) December 12, 2017

Four score and ten years ago, your only shot of making it to the NBA was usually through the NCAA. The NBA has adopted a global perspective. You see professionals from different continental leagues finding a way into the NBA. The 2016 to 2017 diagram is interesting, we’re seeing a darker edge from the NBA to the G-League. The NBA Players Association worked hard on installing ‘2 way contracts’ letting NBA players play in (and get paid in) both the NBA and the G-League, this was a great move, a no-brainer.

Options are great, but keep in mind that the NBA is the crème de la crème. You see the lightest edges enter the NBA node while you see many dark edges enter the ‘none’ node.

When I started making the above network plot of basketball league-to-league migrations, I looked into off-the-shelf R packages. There's a fragmented handful of R packages to do this. Kicker: none of them do directed edges the way I wanted.

If you’re an #rstats user and want to adopt the BigBallR philosophy, find multiple ways to achieve your goal.

I went back to the basics and hand-rolled my own network plot with vanilla ggplot2. At the end of the day, a network plot needs two things, nodes and edges, e.g. points (stored in dat_nodes) and lines (stored in dat_edges). Once you have your data in that format, you can make the network migration plot above with the code snippet below

ggplot(data=dat_nodes,aes(X1,X2))+
 geom_point(aes(size=10,label=league),show.legend = FALSE) +
 geom_curve(data=dat_edges,
 arrow=arrow(angle=30,length = unit(0.2, "inches")),
 alpha=0.03,
 aes(x=X1_send,y = X2_send,
 xend = X1_rec,yend = X2_rec)
 )

I hand-rolled the network diagram cuz the other packages didn’t have the custom features I needed. In my hand-rolled plot there is still one thing missing. I want to place the ‘arrow head’ along some other part of the curve (say the mid-point), other than the end-point. This is probably hard to do, since arrow() looks like it just needs the receiving coordinate and plops the arrow head there. For what I want, ggplot2 needs to know the desired placement-coordinates output from geom_curve(). So somehow, the internal curve calculations need to be returned in order to pass into arrow().
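One rough workaround (a sketch of my own, not a feature of ggplot2) is to give up on the exact curve and put the arrow head on a short straight segment that stops at the straight-line midpoint of each edge, which approximates the curve's midpoint for gently bent curves:

library(ggplot2)
library(grid)   # for arrow() and unit()

# straight-line midpoints of each edge (only an approximation of the curve midpoint)
dat_edges$X_mid <- (dat_edges$X1_send + dat_edges$X1_rec) / 2
dat_edges$Y_mid <- (dat_edges$X2_send + dat_edges$X2_rec) / 2

ggplot(data = dat_nodes, aes(X1, X2)) +
  geom_point(size = 10, show.legend = FALSE) +
  geom_curve(data = dat_edges, alpha = 0.03,
             aes(x = X1_send, y = X2_send, xend = X1_rec, yend = X2_rec)) +
  geom_segment(data = dat_edges, alpha = 0.1,
               arrow = arrow(angle = 30, length = unit(0.2, "inches")),
               aes(x = X1_send, y = X2_send, xend = X_mid, yend = Y_mid))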

To leave a comment for the author, please follow the link and comment on their blog: rstats – MikeJackTzen.


Source:: R News

When a Tweet Turns Into an R Package

By Mango Solutions

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Reproduced with the kind permission of our Head of Data Engineering, Mark Sellors and first published on his blog

When a Tweet Turns Into an R Package

Boy, that escalated quickly

I just wanted to write up a brief post about the power of R, its community, and tell the story of how actually putting stuff out into the world can have amazing consequences.

About 24 hours ago I was going to tweet something like this:

Hey Mac #rstats users – system(‘say “hello rstats user”’)

I’d been playing with the MacOS command line tool, ‘say’, with the kids, and just figured it would be funny to make R say stuff.

Before I tweeted though, I thought I’d better check that it worked as intended. While I was doing that I decided it would be fun to expose the say command’s different voices and figured I’d make a gist on github instead, as it was getting too long for a tweet.

So, I started playing around with the different voices and thinking of a nice way to expose that and by this point I’m thinking I might as well just make it into a package. Making a package in RStudio is so easy, it seemed like the easiest way to share what I was doing. And so now we have the rsay package.

Bear in mind that I didn’t start out to create a package. It has one function, which is essentially the thinnest wrapper around an existing command line tool and it only works on MacOS and it’s pretty silly really, so I didn’t expect there to be a great deal of interest.
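To give a flavour of what such a thin wrapper might look like, here is an illustrative sketch (not the actual rsay API; the function name and default voice are my own assumptions):

# hypothetical thin wrapper around the MacOS 'say' command line tool
speak <- function(text, voice = "Alex") {
  # 'say -v <voice> <text>' reads the text aloud; MacOS only
  system(paste("say", "-v", shQuote(voice), shQuote(text)))
}

speak("hello rstats user")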

In the end, instead of tweeting that one command, I tweeted about how I accidentally made this R package.

That in turn got noticed by the awesome Mara Averick, which resulted in this tweet:

🇬🇧😺💬
To illustrate the incredible power of @sellorm‘s new 📦, “rsay”https://t.co/twYQZSU9RJ #rstats pic.twitter.com/DFg94bKsDZ

— Mara Averick (@dataandme) December 9, 2017

Then Mark Edmondson sees Mara’s tweet, and mentions how it might be fun to build it into a sort of Babelfish using some of his existing work around the Google APIs. (Mark knows a lot about Google’s APIs and R!)

A few hours after initially mentioning this, the following tweet appears in my timeline:

Using @sellorm ‘s new rsay package that talks to you on MacOS, enhanced the googleLanguageR demo Shiny app so that it now can transcribe speech to text, translate it, and talk back to you https://t.co/rsip5VltGg pic.twitter.com/DvGzEBicIB

— Mark Edmondson (@HoloMarkeD) December 9, 2017

He’d only gone and done it – How cool is that?!?!

To recap: silly idea, and R's easy-to-use packaging tools led to an accidental (but still pretty silly) R package. Package is popularised by prominent developer advocate. Package is then integrated into a shiny app for automated machine translation, by well known R/Google API developer.

It’s been a fun 24 hours, but the main thing to take away, is just to get your stuff out there, even if you don’t think it’s that great/interesting/useful. When you share an idea, even if it seems frivolous or trivial, that idea takes on a life of its own and you never know where that may lead.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Source:: R News

Point Pattern Analysis using Ecological Methods in R

By James

(This article was first published on R – Spatial.ly, and kindly contributed to R-bloggers)

Here is a quick example of how to get started with some of the more sophisticated point pattern analysis tools that have been developed for ecologists (principally the adehabitatHR package) but that are very useful for human data. Ecologists deploy point pattern analysis to establish the "home range" of a particular animal based on the known locations it has been sighted at (either directly or remotely via camera traps). Essentially it is where the animal spends most of its time. In the case of human datasets the analogy can be extended to identify areas where most crimes are committed (hotspots), or to identify the activity spaces of individuals or the catchment areas of services such as schools and hospitals.

This tutorial offers a rough analysis of crime data in London so the maps should not be taken as definitive – I’ve just used them as a starting point here.

Police crime data have been taken from here: https://data.police.uk/data/. This tutorial uses data from London's Metropolitan Police (September 2017).

For the files used below click here.

#Load in the library and the csv file.

library("rgdal")
library(raster)
library(adehabitatHR)

input <- read.csv("...")  # point this at the downloaded crime csv

At the moment input is a basic data frame. We need to convert the data frame into a spatial object. Note we have first specified our epsg code as 4326 since the coordinates are in WGS84. We then use spTransform to reproject the data into British National Grid – so the coordinate values are in meters.

Crime.Spatial <- SpatialPointsDataFrame(coords = input[, c("Longitude", "Latitude")],  # coordinate columns assumed
                                        data = input,
                                        proj4string = CRS("+init=epsg:4326"))
Crime.Spatial <- spTransform(Crime.Spatial, CRS("+init=epsg:27700"))  # British National Grid
plot(Crime.Spatial)

The plot reveals that we have crimes across the UK, not just in London. So we need an outline of London to help limit the view. Here we load in a shapefile of the Greater London Authority Boundary.

London <- readOGR("...", "...")  # Greater London Authority boundary shapefile (dsn/layer as downloaded)
plot(London)

Thinking ahead we may wish to compare a number of density estimates, so they need to be performed across a consistently sized grid. Here we create an empty grid in advance to feed into the kernelUD function.

Extent <- extent(London)                       # bounding box of London used to build the grid
x <- seq(Extent@xmin, Extent@xmax, by = 500)   # grid spacing in metres (assumed value)
y <- seq(Extent@ymin, Extent@ymax, by = 500)
xy <- SpatialPixels(SpatialPoints(expand.grid(x = x, y = y)))

OK, now run the density estimation. Note we use grid = xy to utilise the grid we just created. This is for all crime in London.

all <- raster(kernelUD(Crime.Spatial, grid = xy))  # kernel density estimate for all crimes
plot(all)

Unsurprisingly we have a hotspot over the centre of London. Are there differences for specific crime types? We may, for example, wish to look at the density of burglaries.

plot(Crime.Spatial[Crime.Spatial$Crime.type=="Burglary",]) # quick plot of burglary points

burglary <- raster(kernelUD(Crime.Spatial[Crime.Spatial$Crime.type == "Burglary",], grid = xy))
plot(burglary)

There’s a slight difference but still it’s tricky to see if there are areas where burglaries concentrate more compared to the distribution of all crimes. A very rough way to do this is to divide one density grid by the other.

both <- burglary / all  # ratio of the two density surfaces
plot(both)

This hasn't worked particularly well since there are edge effects on the density grid that are causing issues due to a few stray points at the edge of the grid. We can solve this by capping the values we map; in this case we are only showing values between 0 and 1. Some more interesting structures emerge, with burglary occurring in more residential areas, as expected.

both2 <- both
both2[both2 >= 1] <- 1  # cap values so only the 0-1 range is shown
plot(both2)

There are many more sophisticated approaches to this kind of analysis; I'd encourage you to look at the adehabitatHR vignette on the package page.

To leave a comment for the author, please follow the link and comment on their blog: R – Spatial.ly.


Source:: R News

A Not So Simple Bar Plot Example Using ggplot2

By INWT-Blog-RBloggers

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

This is a reproduction of the (simple) bar plot of chapter 6.1.1 in Datendesign mit R with ggplot2. To download the data you can use the following lines:


dir.create("data")
writeLines("*", "data/.gitignore")
download.file("http://www.datendesign-r.de/alle_daten.zip",
              "data/alle_daten.zip")
unzip("data/alle_daten.zip", exdir = "data")

And to download the original script and the base R version of this plot:


download.file("http://www.datendesign-r.de/beispielcode.zip",
              "data/beispielcode.zip")
unzip("data/beispielcode.zip", exdir = "data")

After downloading check out the original pdf version of this plot in data/beispielcode/pdf/balkendiagramm_einfach.pdf.

Preparing Data

Here are some steps to modify the data such that it can be easily used with ggplot2.


# remember to adjust the path
ipsos <- openxlsx::read.xlsx("../data/alle_daten/ipsos.xlsx")

ipsos <- ipsos[order(ipsos$Wert),]
ipsos$Land <- ordered(ipsos$Land, ipsos$Land)
ipsos$textFamily <- ifelse(ipsos$Land %in% c("Deutschland","Brasilien"),
                           "Lato Black", "Lato Light")
ipsos$labels <- paste0(ipsos$Land, ifelse(ipsos$Wert < 10, "     ", "  "),
                       ipsos$Wert)
rect <- data.frame(
  ymin = seq(0, 80, 20),
  ymax = seq(20, 100, 20),
  xmin = 0.5, xmax = 16.5,
  colour = rep(c(grDevices::rgb(191,239,255,80,maxColorValue=255),
                 grDevices::rgb(191,239,255,120,maxColorValue=255)),
               length.out = 5))

The Basic Plot

First we add the geoms, then modifications to the scales and flip of the coordinate system. The remaining code is just modifying the appearance.


library("ggplot2")

ggBar <- ggplot(ipsos) +
  geom_bar(aes(x = Land, y = Wert), stat = "identity", fill = "grey") +
  geom_bar(aes(x = Land, y = ifelse(Land %in% c("Brasilien", "Deutschland"), Wert, NA)),
           stat = "identity", fill = rgb(255,0,210,maxColorValue=255)) +
  geom_rect(data = rect,
            mapping = aes(ymin = ymin, ymax = ymax,
                          xmin = xmin, xmax = xmax),
            fill = rect$colour) +
  geom_hline(aes(yintercept = 45), colour = "skyblue3") +
  scale_y_continuous(breaks = seq(0, 100, 20), limits = c(0, 100), expand = c(0, 0)) +
  scale_x_discrete(labels = ipsos$labels) +
  coord_flip() +
  labs(y = NULL,
       x = NULL,
       title = NULL) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        axis.text.y = element_text(
          family = ipsos$textFamily),
        text = element_text(family = "Lato Light"))

ggBar

Annotations and Layout

Of course you can simply add the title and text annotations to the plot using ggplot2, but I didn’t find a way to do the exact placement comparable to the original version without the package grid.


library("grid")

vp_make <- function(x, y, w, h)
  viewport(x = x, y = y, width = w, height = h, just = c("left", "bottom"))

main <- vp_make(0.05, 0.05, 0.9, 0.8)
title <- vp_make(0, 0.9, 0.6, 0.1)
subtitle <- vp_make(0, 0.85, 0.4, 0.05)
footnote <- vp_make(0.55, 0, 0.4, 0.05)
annotation1 <- vp_make(0.7, 0.85, 0.225, 0.05)
annotation2 <- vp_make(0.4, 0.85, 0.13, 0.05)

# To see which space these viewports will use:
grid.rect(gp = gpar(lty = "dashed"))
grid.rect(gp = gpar(col = "grey"), vp = main)
grid.rect(gp = gpar(col = "grey"), vp = title)
grid.rect(gp = gpar(col = "grey"), vp = subtitle)
grid.rect(gp = gpar(col = "grey"), vp = footnote)
grid.rect(gp = gpar(col = "grey"), vp = annotation1)
grid.rect(gp = gpar(col = "grey"), vp = annotation2)

And now we can add the final annotations to the plot:


# pdf_datei <- "balkendiagramme_einfach.pdf"
# cairo_pdf(bg = "grey98", pdf_datei, width = 9, height = 6.5)
grid.newpage()
print(ggBar, vp = main) 

grid.text("'Ich glaube fest an Gott oder ein höheres Wesen'",
          gp = gpar(fontfamily = "Lato Black", fontsize = 14),
          just = "left", x = 0.05, vp = title)
grid.text("...sagten 2010 in:",
          gp = gpar(fontfamily = "Lato Light", fontsize = 12),
          just = "left",
          x = 0.05, vp = subtitle)
grid.text("Quelle: www.ipsos-na.com, Design: Stefan Fichtel, ixtract",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right",
          x = 0.95, vp = footnote)
grid.text("Alle Angaben in Prozent",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right",
          x = 1, y = 0.55, vp = annotation1)
grid.text("Durchschnitt: 45",
          gp = gpar(fontfamily = "Lato Light", fontsize = 9),
          just = "right",
          x = 0.95, y = 0.55, vp = annotation2)


# dev.off()

To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.


Source:: R News

Stock-Recruitment Graphing Questions

By fishR Blog


(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

A fishR user recently asked me

In the book that you published, I frequently use the stock-recruit curve code. The interface that shows both the Ricker/Beverton-Holt figure with the recruit per spawner to spawner figure (i.e., the dynamic plot for srStarts()) has not been working for quite some time. Additionally, I can get the recruits versus spawner plot for the Beverton-Holt or Ricker curve with confidence bounds around the curve, but how do you do the same for the recruit per spawner to spawner curve?

Below I answer both questions. First, however, I load required packages …

library(FSA)      # srStarts, srFuns
library(FSAdata) # PSalmonAK data
library(dplyr) # %>%, filter, mutate, select
library(nlstools) # nlsBoot

… and obtain the same data (in PSalmonAK.csv available from here) used in the Stock-Recruitment chapter of my Introductory Fisheries Analyses with R (IFAR) book. Note that I created two new variables here: retperesc is the “recruits per spawner” and logretperesc is the natural log of the recruits per spawner variable.

pinks <- read.csv("data/PSalmonAK.csv") %>%
  filter(!is.na(esc),!is.na(ret)) %>%
  mutate(esc=esc/1000,ret=ret/1000,logret=log(ret),
         retperesc=ret/esc,logretperesc=log(retperesc)) %>%
  select(-SST)
headtail(pinks)
##    year   esc    ret   logret retperesc logretperesc
## 1  1960 1.418  2.446 0.894454  1.724965    0.5452066
## 2  1961 2.835 14.934 2.703640  5.267725    1.6615986
## 3  1962 1.957 10.031 2.305680  5.125703    1.6342676
## 28 1987 4.289 18.215 2.902245  4.246911    1.4461918
## 29 1988 2.892  9.461 2.247178  3.271438    1.1852298
## 30 1989 4.577 23.359 3.150982  5.103561    1.6299386

Dynamic Plot Issue

The first question about the dynamic plot is due to a change in functionality in the FSA package since IFAR was published. The dynamicPlot= argument was removed because the code for that argument relied on the tcltk package, which I found difficult to reliably support. A similar, though more manual, approach is accomplished with the new fixed= and plot= arguments. For example, using plot=TRUE (without fixed=) will produce a plot of "recruits" versus "stock" with the chosen stock-recruitment model evaluated at the automatically chosen parameter starting values superimposed.

svR <- srStarts(ret~esc,data=pinks,type="Ricker",plot=TRUE)

The user, however, can show the stock-recruitment model evaluated at manually chosen parameter starting values by including those starting values in a list that is supplied to fixed=. These values can be iteratively changed in subsequent calls to srStarts() to manually find starting values that provide a model that reasonably fits (by eye) the stock-recruit data.

svR <- srStarts(ret~esc,data=pinks,type="Ricker",plot=TRUE,
                fixed=list(a=4,b=0.15))

plot of chunk srStarts2

Note however that srStarts() no longer supports the simultaneously plotting of spawners versus recruits and recruits per spawner versus recruits.

Plot of Recruits per Spawner versus Spawners

The first way that I can imagine plotting recruits per spawner versus spawners with the fitted curve and confidence bands is to first follow the code for fitting the stock-recruit function to the stock and recruit data as described in IFAR. In this case, the stock-recruit function is fit on the log scale to adjust for a multiplicative error structure (as described in IFAR).

rckr <- srFuns("Ricker")
srR <- nls(logret~log(rckr(esc,a,b)),data=pinks,start=svR)
bootR <- nlsBoot(srR)
cbind(estimates=coef(srR),confint(bootR))
##    estimates     95% LCI   95% UCI
## a 2.84924206  1.74768271 4.8435727
## b 0.05516673 -0.08443862 0.2158164

As described in the book, the plot of recruits versus spawners is made by (i) constructing a sequence of “x” values that spans the range of observed numbers of spawners, (ii) predicting the number of recruits at each spawner value using the best-fit stock-recruitment model, (iii) constructing lower and upper confidence bounds for the predicted number of recruits at each spawner value from the bootstrap results, (iv) making a schematic plot on which to put (v) a polygon for the confidence band, (vi) the raw data points, and (vii) the best-fit curve. The code below follows these steps and reproduces Figure 12.4 in the book.

x <- seq(0,9,length.out=199)         # many S for prediction
pR <- rckr(x,a=coef(srR))            # predicted mean R
LCI <- UCI <- numeric(length(x))

for(i in 1:length(x)) {              # CIs for mean R @ each S
  tmp <- apply(bootR$coefboot,MARGIN=1,FUN=rckr,S=x[i])
  LCI[i] <- quantile(tmp,0.025)
  UCI[i] <- quantile(tmp,0.975)
}
ylmts <- range(c(pR,LCI,UCI,pinks$ret))
xlmts <- range(c(x,pinks$esc))

plot(ret~esc,data=pinks,xlim=xlmts,ylim=ylmts,col="white",
ylab="Returners (millions)",
xlab="Escapement (millions)")
polygon(c(x,rev(x)),c(LCI,rev(UCI)),col="gray80",border=NA)
points(ret~esc,data=pinks,pch=19,col=col2rgbt("black",1/2))
lines(pR~x,lwd=2)

plot of chunk RickerFit1

These results can be modified to plot recruits per spawner versus spawners by replacing the “recruits” in the code above with “recruits per spawner.” This is straightforward for the observed data, as ret is simply replaced with retperesc. However, the predicted number of recruits (in pR) and the confidence bounds (in LCI and UCI) from above must be divided by the number of spawners (in x). Because the / symbol has a special meaning in R formulas, the division must be wrapped in I() when it appears in a formula (see the lines() call below). Of course, the y-axis range must also be adjusted. Thus, a plot of recruits per spawner versus spawners is produced from the previous results with the following code.

ylmts <- c(0.7,7)
plot(retperesc~esc,data=pinks,
xlim=xlmts,ylim=ylmts,col="white",
ylab="Returners/Escapement",
xlab="Escapement (millions)")
polygon(c(x,rev(x)),c(LCI/x,rev(UCI/x)),col="gray80",border=NA)
points(retperesc~esc,data=pinks,pch=19,col=col2rgbt("black",1/2))
lines(I(pR/x)~x,lwd=2)

plot of chunk RickerFit2

Alternatively, the Ricker stock-recruitment model could be reparameterized by dividing each side of the function by “spawners,” so that the model predicts “recruits per spawner” directly (this is a fairly typical reparameterization of the model). This model can be put into an R function, with the model parameters then estimated with nonlinear regression as above. The results below show that the parameter point estimates are identical and the bootstrapped confidence intervals are similar to those obtained above.
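For reference, the algebra behind that reparameterization is just a division by S, sketched here assuming the standard Ricker form used by srFuns("Ricker"):

R = a * S * exp(-b * S)   =>   R / S = a * exp(-b * S)

so the reparameterized model depends only on a, b, and the number of spawners S, which is exactly what rckr2() below computes.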

rckr2 <- function(S,a,b=NULL) {
  if (length(a)>1) { b <- a[[2]]; a <- a[[1]] }
  a*exp(-b*S)
}
srR2 <- nls(logretperesc~log(rckr2(esc,a,b)),data=pinks,start=svR)
bootR2 <- nlsBoot(srR2)
cbind(estimates=coef(srR2),confint(bootR2))
##    estimates     95% LCI   95% UCI
## a 2.84924202  1.67734916 4.8613811
## b 0.05516673 -0.08776123 0.2040402

With this, a second method for plotting recruits per spawner versus spawners follows the same construction as the main plot from the book, but uses the results from this “new” model.

x <- seq(0,9,length.out=199)          # many S for prediction
pRperS <- rckr2(x,a=coef(srR2))       # predicted mean RperS
LCI2 <- UCI2 <- numeric(length(x))

for(i in 1:length(x)) {               # CIs for mean RperS @ each S
  tmp <- apply(bootR2$coefboot,MARGIN=1,FUN=rckr2,S=x[i])
  LCI2[i] <- quantile(tmp,0.025)
  UCI2[i] <- quantile(tmp,0.975)
}
ylmts <- range(c(pRperS,LCI2,UCI2,pinks$retperesc))
xlmts <- range(c(x,pinks$esc))

plot(retperesc~esc,data=pinks,xlim=xlmts,ylim=ylmts,col="white",
ylab="Returners/Escapement",
xlab="Escapement (millions)")
polygon(c(x,rev(x)),c(LCI2,rev(UCI2)),col="gray80",border=NA)
points(retperesc~esc,data=pinks,pch=19,col=col2rgbt("black",1/2))
lines(pRperS~x,lwd=2)

plot of chunk RickerFit3

The two methods described above for plotting recruits per spawner versus spawners are identical for the best-fit curve and nearly identical for the confidence bounds (the slight differences are likely due to the randomness inherent in bootstrapping). Thus, the two methods produce nearly the same visual.

plot(retperesc~esc,data=pinks,xlim=xlmts,ylim=ylmts,col="white",
ylab="Returners/Escapement",
xlab="Escapement (millions)")
polygon(c(x,rev(x)),c(LCI/x,rev(UCI/x)),col=col2rgbt("red",1/5),border=NA)
lines(I(pR/x)~x,lwd=6,col="red")
polygon(c(x,rev(x)),c(LCI2,rev(UCI2)),col=col2rgbt("blue",1/5),border=NA)
lines(pRperS~x,lwd=2,col="blue")
points(retperesc~esc,data=pinks,pch=19,col=col2rgbt("black",1/2))

plot of chunk RickerFit4

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.


Explaining Predictions of Machine Learning Models with LIME – Münster Data Science Meetup

By Dr. Shirin Glander

(This article was first published on Shirin’s playgRound, and kindly contributed to R-bloggers)

Slides from Münster Data Science Meetup

These are my slides from the Münster Data Science Meetup on December 12th, 2017.

My sketchnotes were collected from these two podcasts:

Sketchnotes: TWiML Talk #7 with Carlos Guestrin – Explaining the Predictions of Machine Learning Models & Data Skeptic Podcast – Trusting Machine Learning Models with Lime


Example Code

  • the following libraries were loaded:
library(tidyverse)  # for tidy data analysis
library(farff)      # for reading arff file
library(missForest) # for imputing missing values
library(dummies)    # for creating dummy variables
library(caret)      # for modeling
library(lime)       # for explaining predictions

Data

The Chronic Kidney Disease dataset was downloaded from UC Irvine’s Machine Learning repository: http://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease

data_file 
  • load data with the farff package
data 
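A minimal sketch of these two steps, assuming the ARFF file was saved locally (the path below is only a placeholder) and read with farff::readARFF():

# placeholder path to the downloaded ARFF file -- adjust to your setup
data_file <- file.path("data", "chronic_kidney_disease_full.arff")
# readARFF() parses the ARFF file into a data.frame
data <- farff::readARFF(data_file)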

Features

  • age – age
  • bp – blood pressure
  • sg – specific gravity
  • al – albumin
  • su – sugar
  • rbc – red blood cells
  • pc – pus cell
  • pcc – pus cell clumps
  • ba – bacteria
  • bgr – blood glucose random
  • bu – blood urea
  • sc – serum creatinine
  • sod – sodium
  • pot – potassium
  • hemo – hemoglobin
  • pcv – packed cell volume
  • wc – white blood cell count
  • rc – red blood cell count
  • htn – hypertension
  • dm – diabetes mellitus
  • cad – coronary artery disease
  • appet – appetite
  • pe – pedal edema
  • ane – anemia
  • class – class

Missing data

  • impute missing data with Nonparametric Missing Value Imputation using Random Forest (missForest package)
data_imp 
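A sketch of the imputation step, assuming missForest() is run on the full data frame with default settings (not necessarily the original call):

set.seed(42)                 # missForest imputation is stochastic, so fix a seed
data_imp <- missForest(data) # returns a list: $ximp (imputed data) and $OOBerror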

One-hot encoding

  • create dummy variables (dummies::dummy.data.frame())
  • scale and center
data_imp_final 
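A sketch of the one-hot encoding and scaling steps, assuming dummies::dummy.data.frame() and base scale() are used (the intermediate object names are illustrative):

# expand factor predictors into dummy columns; keep the response aside
data_dummy <- dummy.data.frame(dplyr::select(data_imp$ximp, -class))
# center and scale the numeric predictors, then re-attach the response
data_imp_final <- as.data.frame(scale(data_dummy, center = TRUE, scale = TRUE))
data_imp_final$class <- data_imp$ximp$class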

Modeling

# training and test set
set.seed(42)
index 
model_rf
## Random Forest 
## 
## 360 samples
##  48 predictor
##   2 classes: 'ckd', 'notckd' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 324, 324, 324, 324, 325, 324, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9922647  0.9838466
##   25    0.9917392  0.9826070
##   48    0.9872930  0.9729881
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
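A sketch of the partitioning and training steps that would be consistent with the summary printed above; the split proportion and the trainControl() settings are assumptions inferred from that output (360 training rows, 10-fold CV repeated 5 times), not the author's exact code:

set.seed(42)
# roughly 90/10 stratified split, consistent with 360 training and 40 test rows
index      <- createDataPartition(data_imp_final$class, p = 0.9, list = FALSE)
train_data <- data_imp_final[index, ]
test_data  <- data_imp_final[-index, ]

# random forest with repeated cross-validation, matching the printed resampling scheme
model_rf <- caret::train(class ~ .,
                         data = train_data,
                         method = "rf",
                         trControl = trainControl(method = "repeatedcv",
                                                  number = 10,
                                                  repeats = 5))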
# predictions
pred %
  mutate(prediction = colnames(.)[2:3][apply(.[, 2:3], 1, which.max)], correct = ifelse(actual == prediction, "correct", "wrong"))
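A sketch of how a comparable pred data frame could be assembled, assuming the train_data/test_data names from the sketch above (the column layout, with actual first and the two class probabilities next, is an assumption that matches the mutate() call):

# observed class plus predicted class probabilities for the test set
pred <- data.frame(actual = test_data$class,
                   predict(model_rf, newdata = test_data, type = "prob")) %>%
  mutate(prediction = colnames(.)[2:3][apply(.[, 2:3], 1, which.max)],
         correct = ifelse(actual == prediction, "correct", "wrong"))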

confusionMatrix(pred$actual, pred$prediction)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction ckd notckd
##     ckd     23      2
##     notckd   0     15
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.8308, 0.9939)
##     No Information Rate : 0.575           
##     P-Value [Acc > NIR] : 1.113e-07       
##                                           
##                   Kappa : 0.8961          
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8824          
##          Pos Pred Value : 0.9200          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5750          
##          Detection Rate : 0.5750          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.9412          
##                                           
##        'Positive' Class : ckd             
## 

LIME

  • LIME needs data without response variable
train_x 
  • build explainer
explainer 
  • run explain() function
explanation_df 
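A sketch of these three LIME steps, assuming the train_data/test_data split from above; the n_labels and n_features values are illustrative choices, not the original ones:

# LIME works on the predictors only, so drop the response column
train_x <- dplyr::select(train_data, -class)
test_x  <- dplyr::select(test_data,  -class)

# build the explainer from the training predictors and the caret model
explainer <- lime(train_x, model_rf)

# explain predictions for the test observations
explanation_df <- explain(test_x, explainer, n_labels = 1, n_features = 8)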
  • model reliability
explanation_df %>%
  ggplot(aes(x = model_r2, fill = label)) +
    geom_density(alpha = 0.5)

  • plot explanations
plot_features(explanation_df[1:24, ], ncol = 1)

Session Info

## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.2 (2017-09-28)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  de_DE.UTF-8                 
##  tz                               
##  date     2017-12-12
## Packages -----------------------------------------------------------------
##  package      * version  date       source        
##  assertthat     0.2.0    2017-04-11 CRAN (R 3.4.0)
##  backports      1.1.1    2017-09-25 CRAN (R 3.4.2)
##  base         * 3.4.2    2017-10-04 local         
##  BBmisc         1.11     2017-03-10 CRAN (R 3.4.0)
##  bindr          0.1      2016-11-13 CRAN (R 3.4.0)
##  bindrcpp     * 0.2      2017-06-17 CRAN (R 3.4.0)
##  blogdown       0.3      2017-11-13 CRAN (R 3.4.2)
##  bookdown       0.5      2017-08-20 CRAN (R 3.4.1)
##  broom          0.4.3    2017-11-20 CRAN (R 3.4.2)
##  caret        * 6.0-77   2017-09-07 CRAN (R 3.4.1)
##  cellranger     1.1.0    2016-07-27 CRAN (R 3.4.0)
##  checkmate      1.8.5    2017-10-24 CRAN (R 3.4.2)
##  class          7.3-14   2015-08-30 CRAN (R 3.4.2)
##  cli            1.0.0    2017-11-05 CRAN (R 3.4.2)
##  codetools      0.2-15   2016-10-05 CRAN (R 3.4.2)
##  colorspace     1.3-2    2016-12-14 CRAN (R 3.4.0)
##  compiler       3.4.2    2017-10-04 local         
##  crayon         1.3.4    2017-09-16 cran (@1.3.4) 
##  CVST           0.2-1    2013-12-10 CRAN (R 3.4.0)
##  datasets     * 3.4.2    2017-10-04 local         
##  ddalpha        1.3.1    2017-09-27 CRAN (R 3.4.2)
##  DEoptimR       1.0-8    2016-11-19 CRAN (R 3.4.0)
##  devtools       1.13.4   2017-11-09 CRAN (R 3.4.2)
##  digest         0.6.12   2017-01-27 CRAN (R 3.4.0)
##  dimRed         0.1.0    2017-05-04 CRAN (R 3.4.0)
##  dplyr        * 0.7.4    2017-09-28 CRAN (R 3.4.2)
##  DRR            0.0.2    2016-09-15 CRAN (R 3.4.0)
##  dummies      * 1.5.6    2012-06-14 CRAN (R 3.4.0)
##  e1071          1.6-8    2017-02-02 CRAN (R 3.4.0)
##  evaluate       0.10.1   2017-06-24 CRAN (R 3.4.0)
##  farff        * 1.0      2016-09-11 CRAN (R 3.4.0)
##  forcats      * 0.2.0    2017-01-23 CRAN (R 3.4.0)
##  foreach      * 1.4.3    2015-10-13 CRAN (R 3.4.0)
##  foreign        0.8-69   2017-06-22 CRAN (R 3.4.1)
##  ggplot2      * 2.2.1    2016-12-30 CRAN (R 3.4.0)
##  glmnet         2.0-13   2017-09-22 CRAN (R 3.4.2)
##  glue           1.2.0    2017-10-29 CRAN (R 3.4.2)
##  gower          0.1.2    2017-02-23 CRAN (R 3.4.0)
##  graphics     * 3.4.2    2017-10-04 local         
##  grDevices    * 3.4.2    2017-10-04 local         
##  grid           3.4.2    2017-10-04 local         
##  gtable         0.2.0    2016-02-26 CRAN (R 3.4.0)
##  haven          1.1.0    2017-07-09 CRAN (R 3.4.0)
##  hms            0.4.0    2017-11-23 CRAN (R 3.4.3)
##  htmltools      0.3.6    2017-04-28 CRAN (R 3.4.0)
##  htmlwidgets    0.9      2017-07-10 CRAN (R 3.4.1)
##  httpuv         1.3.5    2017-07-04 CRAN (R 3.4.1)
##  httr           1.3.1    2017-08-20 CRAN (R 3.4.1)
##  ipred          0.9-6    2017-03-01 CRAN (R 3.4.0)
##  iterators    * 1.0.8    2015-10-13 CRAN (R 3.4.0)
##  itertools    * 0.1-3    2014-03-12 CRAN (R 3.4.0)
##  jsonlite       1.5      2017-06-01 CRAN (R 3.4.0)
##  kernlab        0.9-25   2016-10-03 CRAN (R 3.4.0)
##  knitr          1.17     2017-08-10 CRAN (R 3.4.1)
##  labeling       0.3      2014-08-23 CRAN (R 3.4.0)
##  lattice      * 0.20-35  2017-03-25 CRAN (R 3.4.2)
##  lava           1.5.1    2017-09-27 CRAN (R 3.4.1)
##  lazyeval       0.2.1    2017-10-29 CRAN (R 3.4.2)
##  lime         * 0.3.1    2017-11-24 CRAN (R 3.4.3)
##  lubridate      1.7.1    2017-11-03 CRAN (R 3.4.2)
##  magrittr       1.5      2014-11-22 CRAN (R 3.4.0)
##  MASS           7.3-47   2017-02-26 CRAN (R 3.4.2)
##  Matrix         1.2-12   2017-11-15 CRAN (R 3.4.2)
##  memoise        1.1.0    2017-04-21 CRAN (R 3.4.0)
##  methods      * 3.4.2    2017-10-04 local         
##  mime           0.5      2016-07-07 CRAN (R 3.4.0)
##  missForest   * 1.4      2013-12-31 CRAN (R 3.4.0)
##  mnormt         1.5-5    2016-10-15 CRAN (R 3.4.0)
##  ModelMetrics   1.1.0    2016-08-26 CRAN (R 3.4.0)
##  modelr         0.1.1    2017-07-24 CRAN (R 3.4.1)
##  munsell        0.4.3    2016-02-13 CRAN (R 3.4.0)
##  nlme           3.1-131  2017-02-06 CRAN (R 3.4.2)
##  nnet           7.3-12   2016-02-02 CRAN (R 3.4.2)
##  parallel       3.4.2    2017-10-04 local         
##  pkgconfig      2.0.1    2017-03-21 CRAN (R 3.4.0)
##  plyr           1.8.4    2016-06-08 CRAN (R 3.4.0)
##  prodlim        1.6.1    2017-03-06 CRAN (R 3.4.0)
##  psych          1.7.8    2017-09-09 CRAN (R 3.4.1)
##  purrr        * 0.2.4    2017-10-18 CRAN (R 3.4.2)
##  R6             2.2.2    2017-06-17 CRAN (R 3.4.0)
##  randomForest * 4.6-12   2015-10-07 CRAN (R 3.4.0)
##  Rcpp           0.12.14  2017-11-23 CRAN (R 3.4.3)
##  RcppRoll       0.2.2    2015-04-05 CRAN (R 3.4.0)
##  readr        * 1.1.1    2017-05-16 CRAN (R 3.4.0)
##  readxl         1.0.0    2017-04-18 CRAN (R 3.4.0)
##  recipes        0.1.1    2017-11-20 CRAN (R 3.4.3)
##  reshape2       1.4.2    2016-10-22 CRAN (R 3.4.0)
##  rlang          0.1.4    2017-11-05 CRAN (R 3.4.2)
##  rmarkdown      1.8      2017-11-17 CRAN (R 3.4.2)
##  robustbase     0.92-8   2017-11-01 CRAN (R 3.4.2)
##  rpart          4.1-11   2017-03-13 CRAN (R 3.4.2)
##  rprojroot      1.2      2017-01-16 CRAN (R 3.4.0)
##  rstudioapi     0.7      2017-09-07 CRAN (R 3.4.1)
##  rvest          0.3.2    2016-06-17 CRAN (R 3.4.0)
##  scales         0.5.0    2017-08-24 CRAN (R 3.4.1)
##  sfsmisc        1.1-1    2017-06-08 CRAN (R 3.4.0)
##  shiny          1.0.5    2017-08-23 CRAN (R 3.4.1)
##  shinythemes    1.1.1    2016-10-12 CRAN (R 3.4.0)
##  splines        3.4.2    2017-10-04 local         
##  stats        * 3.4.2    2017-10-04 local         
##  stats4         3.4.2    2017-10-04 local         
##  stringdist     0.9.4.6  2017-07-31 CRAN (R 3.4.1)
##  stringi        1.1.6    2017-11-17 CRAN (R 3.4.2)
##  stringr      * 1.2.0    2017-02-18 CRAN (R 3.4.0)
##  survival       2.41-3   2017-04-04 CRAN (R 3.4.0)
##  tibble       * 1.3.4    2017-08-22 CRAN (R 3.4.1)
##  tidyr        * 0.7.2    2017-10-16 CRAN (R 3.4.2)
##  tidyselect     0.2.3    2017-11-06 CRAN (R 3.4.2)
##  tidyverse    * 1.2.1    2017-11-14 CRAN (R 3.4.2)
##  timeDate       3042.101 2017-11-16 CRAN (R 3.4.2)
##  tools          3.4.2    2017-10-04 local         
##  utils        * 3.4.2    2017-10-04 local         
##  withr          2.1.0    2017-11-01 CRAN (R 3.4.2)
##  xml2           1.1.1    2017-01-24 CRAN (R 3.4.0)
##  xtable         1.8-2    2016-02-05 CRAN (R 3.4.0)
##  yaml           2.1.15   2017-12-01 CRAN (R 3.4.3)
To leave a comment for the author, please follow the link and comment on their blog: Shirin’s playgRound.
