Using R: a function that adds multiple ggplot2 layers

By mrtnj

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Another interesting thing that an R course participant identified: Sometimes one wants to make a function that returns multiple layers to be added to a ggplot2 plot. One could think that just adding them and returning would work, but it doesn’t. I think it has to do with how + is evaluated. There are a few workarounds that achieve similar results and may save typing.

First, some data to play with: this is a built-in dataset of chickens growing:


diet1 <- subset(ChickWeight, Diet == 1)
diet2 <- subset(ChickWeight, Diet == 2)

This is just an example that shows the phenomenon. The first two functions will work, but combining them won’t.

add_line <- function(df) {
  geom_line(aes(x = Time, y = weight, group = Chick), data = df)
}

add_points <- function(df) {
  geom_point(aes(x = Time, y = weight), data = df)
}

add_line_points <- function(df) {
  add_line(df) + add_points(df)
}

## works
(plot1 <- ggplot() + add_line(diet1) + add_points(diet1))

## won't work: non-numeric argument to binary operator
try((plot2 <- ggplot() + add_line_points(diet1)))
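A workaround not shown in the original post: ggplot2 can add a whole list of layers at once, so a function can simply return its layers in a list. A minimal sketch (the function name here is mine):

```r
library(ggplot2)

## return the layers in a list; `+` accepts a list of layers
add_line_points_list <- function(df) {
  list(
    geom_line(aes(x = Time, y = weight, group = Chick), data = df),
    geom_point(aes(x = Time, y = weight), data = df)
  )
}

diet1 <- subset(ChickWeight, Diet == 1)
(plot_l <- ggplot() + add_line_points_list(diet1))  # works
```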

First, you can get the same result by putting the mappings and data in the ggplot() function. This works when all layers are going to plot the same data, which covers many cases:

## bypasses the issue by putting mappings in ggplot()
(plot3 <- ggplot(aes(x = Time, y = weight, group = Chick), data = diet1) +   
  geom_line() + geom_point())

One way is to write a function that takes the plot object as input and returns a modified version of it. If we use the pipe operator %>%, found in the magrittr package, it even gets a ggplot2-like feel:

## bypasses the issue and gives a similar feel with pipes


add_line_points2 <- function(plot, df, ...) {
  plot +
    geom_line(aes(x = Time, y = weight, group = Chick), ..., data = df) +
    geom_point(aes(x = Time, y = weight), ..., data = df)
}

(plot4 <- ggplot() %>%
   add_line_points2(diet1) %>%
   add_line_points2(diet2, colour = "red"))

Finally, in many cases, one can stick all the data in a combined data frame, and avoid building up the plot from different data frames altogether.

## plot the whole dataset at once
(plot5 <- ggplot(aes(x = Time, y = weight, group = Chick, colour = Diet),
                 data = ChickWeight) +
   geom_line() + geom_point())

Okay, maybe that plot is a bit too busy to be good. But note how the difference between plotting a single diet and plotting all diets at the same time is just one more mapping in aes(). No looping or custom functions required.

I hope that was of some use.

Posted in: computer stuff, data analysis Tagged: ggplot2, R

To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Using R as a GIS

By realdataweb

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

In real estate, spatial data is the name of the game. Countless programs
in other domains utilize the power of this data, which is becoming more
prevalent by the day.

In this post I will go over a few simple but powerful tools to get you
started using geographic information in R.

## First, some libraries
# install.packages('GISTools', dependencies = T)
library(GISTools)

GISTools provides an easy-to-use method for creating shading schemes
and choropleth maps. Some of you may have heard of the sp package,
which adds numerous spatial classes to the mix. There are also functions
for analysis and making things look nice.

Let’s get rolling: source the vulgaris dataset, which contains
location information for Syringa Vulgaris (the Lilac) observation
stations and US states. This code plots the states and vulgaris:

data("vulgaris") # load data
par(mar = c(2,0,0,0)) # set margins of plot area
plot(us_states)
plot(vulgaris, add = T, pch = 20)


One thing to note here is the structure of these objects. us_states is
a SpatialPolygonsDataFrame, which stores information for plotting shapes
(like a shapefile) within its attributes. vulgaris by contrast is a
SpatialPointsDataFrame, which contains data for plotting individual
points. Much like a data.frame and $, these objects harbor
information that can be accessed via @.

Station Year Type Leaf Bloom Station.Name State.Prov Lat Long Elev
3695 61689 1965 Vulgaris 114 136 COVENTRY CT 41.8 -72.35 146
3696 61689 1966 Vulgaris 122 146 COVENTRY CT 41.8 -72.35 146
3697 61689 1967 Vulgaris 104 156 COVENTRY CT 41.8 -72.35 146
3698 61689 1968 Vulgaris 97 134 COVENTRY CT 41.8 -72.35 146
3699 61689 1969 Vulgaris 114 138 COVENTRY CT 41.8 -72.35 146
3700 61689 1970 Vulgaris 111 135 COVENTRY CT 41.8 -72.35 146
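A quick sketch of the slot access just described, assuming `vulgaris` has been loaded as above:

```r
## @ extracts slots of S4 spatial objects, much as $ extracts
## columns of a data.frame
class(vulgaris)       # SpatialPointsDataFrame
slotNames(vulgaris)   # the available @ slots, e.g. "data", "coords", "bbox"
head(vulgaris@data)   # the attribute table shown above
```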

Let’s take a look at some functions that use this data.

newVulgaris <- gIntersection(us_states, vulgaris, byid = TRUE)
kable(head(data.frame(newVulgaris)))
x y
3 4896 -67.65 44.65
3 4897 -67.65 44.65
3 4898 -67.65 44.65
3 4899 -67.65 44.65
3 4900 -67.65 44.65
3 4901 -67.65 44.65

gIntersection, as you may have guessed from the name, returns the
intersection of two spatial objects. In this case, we are given the
points from vulgaris that are within us_states. However, the rest of
the vulgaris data has been stripped from the resulting object. We’ve
got to jump through a couple of hoops to get that information back.

newVulgaris <- data.frame(newVulgaris)
tmp <- rownames(newVulgaris)
tmp <- strsplit(tmp, " ")
tmp <- (sapply(tmp, "[[", 2))
tmp <- as.numeric(tmp)
vdf <- data.frame(vulgaris)
newVulgaris <- subset(vdf, row.names(vdf) %in% tmp)
Station Year Type Leaf Bloom Station.Name State.Prov Lat Long Elev Long.1 Lat.1 optional
3695 61689 1965 Vulgaris 114 136 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3696 61689 1966 Vulgaris 122 146 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3697 61689 1967 Vulgaris 104 156 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3698 61689 1968 Vulgaris 97 134 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3699 61689 1969 Vulgaris 114 138 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3700 61689 1970 Vulgaris 111 135 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE

Look familiar? Now we’ve got a data frame with the clipped vulgaris
values and original data preserved.

vulgarisSpatial <- SpatialPointsDataFrame(newVulgaris[, c("Long", "Lat")], newVulgaris)

After storing our clipped data frame as a SpatialPointsDataFrame, we can
again make use of it - in this case we add a shading scheme to the
`vulgaris` points.

shades <- auto.shading(vulgarisSpatial$Elev)
plot(us_states)
choropleth(vulgarisSpatial, vulgarisSpatial$Elev, shading = shades, add = T, pch = 20)


Colors are pretty, but what do they mean? Let’s add a legend.

us_states@bbox # Get us_states bounding box coordinates.
##         min       max
## r1 -124.73142 -66.96985
## r2   24.95597  49.37173
plot(us_states)
choropleth(vulgarisSpatial, vulgarisSpatial$Elev, shading = shades, add = T, pch = 20)
par(xpd = TRUE) # Allow plotting outside of plot area.
choro.legend(-124, 30, shades, cex = .75, title = "Elevation in Meters") # Plot legend in bottom left. Takes standard legend() params.


It looks like there’s a lot going on in the Northeastern states. For a
closer look, create another clipping (like above) and plot it. Using the
structure below, we can create a selection vector. I have hidden the
full code since it is repetitive (check GitHub for the full code.)

index <- '...'
choropleth(vulgarisNE, vulgarisNE$Elev,shading = shades, add = T, pch = 20)
par(xpd = T)
choro.legend(-73, 39.75, shades, cex = .75, title = &quot;Elevation in Meters&quot;)


Hopefully this has been a useful introduction (or refresher) on spatial
data. I always learn a lot in the process of writing these posts. If you
have any ideas or suggestions please leave a comment or feel free to
contact me!

Happy mapping,


To leave a comment for the author, please follow the link and comment on their blog: R – Real Data.

Source:: R News

Shuttering Pies With Retiring Stores

By hrbrmstr

(This article was first published on R –, and kindly contributed to R-bloggers)

I caught this “gem” in the Wall Street Journal tonight:

It’s pretty hard to compare store-to-store, even though it is fairly clear which ones are going-going-gone. If we want to see the relative percentage of each store closing and also want to see how they stack up against each other, then let’s make a column of 100% bars and label total stores in each:


read.table(text='"store","closing","total"
"Radio Shack",550,1500
"The Limited",250,250
"Wet Seal",170,170
"American Apparel",110,110
"Sears",41,695', sep=",", header=TRUE, stringsAsFactors=FALSE) %>% 
  as_tibble() %>% 
  mutate(remaining = total - closing,
         gone = round((closing/total) * 100)/100,
         stay = 1-gone,
         rem_lab = ifelse(remaining == 0, "", scales::comma(remaining))) %>% 
  arrange(desc(stay)) %>% 
  mutate(store=factor(store, levels=store)) -> closing_df


ggplot(closing_df) +
  geom_segment(aes(0, store, xend=gone, yend=store, color="Closing"), size=8) +
  geom_segment(aes(gone, store, xend=gone+stay, yend=store, color="Remaining"), size=8) +
  geom_text(aes(x=0, y=store, label=closing), color="white", hjust=0, nudge_x=0.01) +
  geom_text(aes(x=1, y=store, label=rem_lab), color="white", hjust=1, nudge_x=-0.01) +
  scale_x_percent() +
  scale_color_ipsum(name=NULL) +
  labs(x=NULL, y=NULL, 
       title="Selected 2017 Store closings (estimated)",
       subtitle="Smaller specialty chains such as Bebe and American Apparel are closing their stores,\nwhile larger chains such as J.C. Penney and Sears are scaling back their footprint.") +
  theme_ipsum_rc(grid="X") +
  theme(axis.text.x=element_text(hjust=c(0, 0.5, 0.5, 0.5, 1))) +
  theme(legend.position=c(0.875, 1.025))

One might try circle packing or a treemap to show both relative store count and percentage, but I think the bigger story is the percent reduction for each retail chain. It’d be cool to see what others come up with.

To leave a comment for the author, please follow the link and comment on their blog.

Source:: R News

Using MCA and variable clustering in R for insights in customer attrition

By Mesfin Gebeyaw

Analytical challenges in multivariate data analysis and predictive modeling include identifying redundant and irrelevant variables. A recommended analytics approach is to first address the redundancy, which can be achieved by identifying groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with other variable groups in the same data set. Relevancy, on the other hand, is about potential predictor variables and involves understanding the relationship between the target variable and the input variables.

Multiple correspondence analysis (MCA) is a multivariate data analysis and data mining tool for finding and constructing a low-dimensional visual representation of variable associations among groups of categorical variables. Variable clustering as a tool for identifying redundancy is often applied to get a first impression of variable associations and multivariate data structure.

The motivations of this post are to illustrate the applications of: 1) preparing input variables for analysis and predictive modeling, 2) MCA as a multivariate exploratory data analysis and categorical data mining tool for business insights of customer churn data, and 3) variable clustering of categorical variables for the identification of redundant variables.

Customer churn data in this analysis:
Customer attrition is a metric businesses use to monitor and quantify the loss of customers and/or clients for various reasons. The data set includes customer-level demographic, account and services information, including monthly charge amounts and length of service with the company. Customers who left the company for competitors (Yes) or stayed with the company (No) are identified in the last column, labeled Churn.

Load Packages


Import and Pre-process Data

The data set used in this post was obtained from the watson-analytics-blog site. Click the hyperlink “Watson Analytics Sample Dataset – Telco Customer Churn” to download the file “WA_Fn-UseC_-Telco-Customer-Churn.csv”.

setwd("path to the location of your copy of the saved csv data file")

churn <- read.table("WA_Fn-UseC_-Telco-Customer-Churn.csv", sep=",", header=TRUE)
## inspect data dimensions and structure
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 

The raw data set contains 7043 records and 21 variables. Looking at the data structure, some data columns need recoding. For instance, changing values from “No phone service” and “No internet service” to “No”, for consistency. The following code statements are to recode those observations and more.

## recode selected observations 
churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines, 
                                           from=c("No phone service"),
                                           to=c("No")))
churn$InternetService <- as.factor(mapvalues(churn$InternetService, 
                                             from=c("Fiber optic"),
                                             to=c("Fiberoptic")))

churn$PaymentMethod <- as.factor(mapvalues(churn$PaymentMethod,
                                           from=c("Credit card (automatic)","Electronic check","Mailed check",
                                                  "Bank transfer (automatic)"),
                                           to=c("CreditCard","ElectronicCheck","MailedCheck","BankTransfer")))

churn$Contract <- as.factor(mapvalues(churn$Contract,
                                      from=c("Month-to-month", "Two year", "One year"),
                                      to=c("MtM","TwoYr", "OneYr")))

cols_recode1 <- c(10:15)
for(i in 1:ncol(churn[,cols_recode1])) {
        churn[,cols_recode1][,i] <- as.factor(mapvalues
                                              (churn[,cols_recode1][,i], from =c("No internet service"), to=c("No")))
}
In addition, values in the SeniorCitizen column were entered as 0 and 1. Let’s recode this variable as “No” and “Yes”, respectively, for consistency.

churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen,
                                      from=c(0,1),
                                      to=c("No", "Yes")))

Exclude the customer id and total charges columns from the data analysis.

cols_drop <- c(1, 20)
churn <- churn[,-cols_drop]

Let’s do summary statistics of the two numerical variables to see distribution of the data.

summary(churn$MonthlyCharges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.25   35.50   70.35   64.76   89.85  118.80
summary(churn$tenure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   32.37   55.00   72.00

On the basis of the data distributions above, values in the tenure and monthly charges numerical columns could be coerced to a 3-level categorical value as follows.

churn$tenure <- as.factor(car::recode(churn$tenure, "0:9 = 'ShortTenure'; 
                               10:29 = 'MediumTenure'; else = 'LongTenure'"))

churn$MonthlyCharges <- as.factor(car::recode(churn$MonthlyCharges, "0:35 = 'LowCharge'; 35.01:70 = 'MediumCharge'; else = 'HighCharge'"))

It’s time to check for missing values in the pre-processed data set.

sum(is.na(churn))
## [1] 0

There are no missing values. How about the category levels of each variable?

## check for factor levels in each column
nfactors <- apply(churn, 2, function(x) nlevels(as.factor(x))) 
nfactors
##           gender    SeniorCitizen          Partner       Dependents 
##                2                2                2                2 
##           tenure     PhoneService    MultipleLines  InternetService 
##                3                2                2                3 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                2                2                2                2 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                2                2                3                2 
##    PaymentMethod   MonthlyCharges            Churn 
##                4                3                2

Now, the data set is ready for analysis.

Partitioning the raw data into 70% training and 30% testing data sets

## set random seed to make reproducible results
set.seed(7)
inTrain <- createDataPartition(churn$Churn, p=0.7, list=FALSE)
training <- churn[inTrain,]
testing <- churn[-inTrain,]

Check for the dimensions of the training and testing data sets

dim(training) ; dim(testing)
## [1] 4931   19
## [1] 2112   19

As expected, the training data set contains 4931 observations and 19 columns, whereas the testing data set contains 2112 observations and 19 columns.

Multiple Correspondence Analysis (MCA)

Invoke the FactoMineR & factoextra packages.

library(FactoMineR)
library(factoextra)
res.mca <- MCA(training, quali.sup=c(17,19), graph=FALSE)
fviz_mca_var(res.mca, repel=TRUE)

Here is the plot of the MCA.

A general guide to extrapolating from the figure above for business insights would be to observe and make a note as to how close input variables are to the target variable and to each other. For instance, customers with month to month contract, those with fiber optic internet service, senior citizens, customers with high monthly charges, single customers or customers with no dependents, those with paperless billing are being related to a short tenure with the company and a propensity of risk to churn.

On the other hand, customers with 1 – 2 years contract, those with DSL internet service, younger customers, those with low monthly charges, customers with multiple lines, online security and tech support services are being related to a long tenure with the company and a tendency to stay with company.

Variable Clustering

Load the ClustOfVar package; its hclustvar function produces a tree of variable groups. A paper detailing the ClustOfVar package is here

# run variable clustering excluding the target variable (churn) 
variable_tree <- hclustvar(X.quali = training[,1:18])
# plot the dendrogram of variable groups
plot(variable_tree)

Here is a tree of categorical variable groups.

The dendrogram suggests that the 18 input variables can be combined into approximately 7 – 9 groups of variables. That is one way of going about it. The good news is that the ClustOfVar package offers a function to cut the tree into any number of desired groups (clusters) of variables. So, the syntax below will run 25 bootstrap samples of the trees to produce a plot of the stability of variable cluster partitions.

# requesting for 25 bootstrap samplings and a plot
stability(variable_tree, B=25) 

The plot of stability of variable cluster partitions suggests approximately a 7 to 9-cluster solution. The syntax below will list a consensus list of 9-clusters along with the variables names included in each cluster.

## cut the variable tree into 9 groups of variables 
clus <- cutreevar(variable_tree, 9)
## print the list of variables in each cluster group
clus$var
## $cluster1
##        squared loading
## gender               1
## $cluster2
##               squared loading
## SeniorCitizen               1
## $cluster3
##            squared loading
## Partner          0.7252539
## Dependents       0.7252539
## $cluster4
##               squared loading
## tenure              0.7028741
## Contract            0.7162027
## PaymentMethod       0.4614923
## $cluster5
##               squared loading
## PhoneService        0.6394947
## MultipleLines       0.6394947
## $cluster6
##                 squared loading
## InternetService       0.9662758
## MonthlyCharges        0.9662758
## $cluster7
##                squared loading
## OnlineSecurity       0.5706136
## OnlineBackup         0.4960295
## TechSupport          0.5801424
## $cluster8
##                  squared loading
## DeviceProtection       0.5319719
## StreamingTV            0.6781781
## StreamingMovies        0.6796616
## $cluster9
##                  squared loading
## PaperlessBilling               1

The 9 clusters and the variable names in each cluster are listed above. A practical guide to minimizing redundancy is to select one representative variable from each cluster. However, subject-matter considerations should also have a say in the selection among candidate representatives of each variable cluster group.

Descriptive statistics of customer churn

What was the overall customer churn rate in the training data set?

# overall customer churn rate
round(prop.table(table(training$Churn)) * 100, 1)
##   No  Yes 
## 73.5 26.5

The overall customer attrition rate was approximately 26.5%.
Customer churn rate by demography, account and service information

cols_aggr_demog <- c(1:4,6:7,9:14,16)
variable <- rep(names(training[,cols_aggr_demog]), each=4)
demog_counts <- data.frame()
for(i in 1:ncol(training[,cols_aggr_demog])) {
    demog_count <- ddply(training, .(training[,cols_aggr_demog][,i], training$Churn), "nrow")
    names(demog_count) <- c("class","Churn","count")
    demog_counts <- data.frame(rbind(demog_counts, demog_count))
}

demog_churn_rate <- data.frame(cbind(variable, demog_counts))
demog_churn_rate1 <- dcast(demog_churn_rate, variable + class ~ Churn, value.var="count")
demog_churn_rate2 <- mutate(demog_churn_rate1, churn_rate=round((Yes/(No+Yes)*100)-26.5,1))
demog <- data.frame(paste(demog_churn_rate2$variable, demog_churn_rate2$class))
names(demog) <- c("Category")
demog2 <- data.frame(cbind(demog, demog_churn_rate2))

cols_aggr_nlev3 <- c(5,8,15,18)
variable <- rep(names(training[,cols_aggr_nlev3]), each=6)
nlev3_counts <- data.frame()
for(i in 1:ncol(training[,cols_aggr_nlev3])) {
    nlev3_count <- ddply(training, .(training[,cols_aggr_nlev3][,i], training$Churn), "nrow")
    names(nlev3_count) <- c("class","Churn","count")
    nlev3_counts <- data.frame(rbind(nlev3_counts, nlev3_count))
}

nlev3_churn_rate <- data.frame(cbind(variable, nlev3_counts))
nlev3_churn_rate1 <- dcast(nlev3_churn_rate, variable + class ~ Churn, value.var="count")
nlev3_churn_rate2 <- mutate(nlev3_churn_rate1, churn_rate=round((Yes/(No+Yes)*100)-26.5,1))
nlev3 <- data.frame(paste(nlev3_churn_rate2$variable, nlev3_churn_rate2$class))
names(nlev3) <- c("Category")
nlev3 <- data.frame(cbind(nlev3, nlev3_churn_rate2))

variable <- rep("PaymentMethod", 8)
nlev4_count <- ddply(training, .(training[,17], training$Churn), "nrow")
names(nlev4_count) <- c("class","Churn","count")
nlev4_churn_rate <- data.frame(cbind(variable, nlev4_count))
nlev4_churn_rate1 <- dcast(nlev4_churn_rate, variable + class ~ Churn, value.var="count")
nlev4_churn_rate2 <- mutate(nlev4_churn_rate1, churn_rate=round((Yes/(No+Yes)*100)-26.5,1))
nlev4 <- data.frame(paste(nlev4_churn_rate2$variable, nlev4_churn_rate2$class))
names(nlev4) <- c("Category")
nlev4 <- data.frame(cbind(nlev4, nlev4_churn_rate2))

final_agg <- data.frame(rbind(demog2, nlev3, nlev4))

ggplot(final_agg, aes(Category, churn_rate, color=churn_rate < 0)) +
    geom_segment(aes(x=reorder(Category, -churn_rate), xend = Category,
                     y = 0, yend = churn_rate), 
                 size = 1.1, alpha = 0.7) +
    geom_point(size = 2.5) +
        theme(legend.position="none") +
    xlab("Variable") +
    ylab("Customer Churn (%)") +
    ggtitle("Customer Attrition rate\nDifference from the overall average (%)") +
    theme(axis.text.x=element_text(angle=45, hjust=1))

Looking at the figure above, customers with higher than average attrition rates include those paying by electronic check, on month-to-month contracts, with higher monthly charges, and with paperless billing. On a positive note, customers with low monthly charges, longer-period contracts, online security services, dependents or partners, and those paying by credit card or bank transfer showed much lower than average rates of attrition.

In conclusion

Variables such as contract length, bill payment method, internet service type and even customer demography appeared to play a role in customer attrition and retention. The next step for this company would be to deploy predictive and prescriptive models that score prospective customers for the propensity of risk to churn. Hope this post is helpful. Please leave your comments or suggestions below. Feel free to connect with the author on LinkedIn.

    Related Post

    1. Web Scraping and Applied Clustering Global Happiness and Social Progress Index
    2. Key Phrase Extraction from Tweets
    3. Financial time series forecasting – an easy approach
    4. Outlier detection and treatment with R
    5. Implementing Apriori Algorithm in R

    Source:: R News

    Programming over R

    By John Mount


    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    R is a very fluid language amenable to meta-programming, or alterations of the language itself. This has allowed the late user-driven introduction of a number of powerful features such as magrittr pipes, the foreach system, futures, data.table, and dplyr. Please read on for some small meta-programming effects we have been experimenting with.


    Meta-programming is a powerful tool that allows one to re-shape a programming language or write programs that automate parts of working with a programming language.

    Meta-programming itself has the central contradiction that one hopes nobody else is doing meta-programming, but that they are instead dutifully writing referentially transparent code that is safe to perform transformations over, so that one can safely introduce their own clever meta-programming. For example: one would hate to lose the ability to use a powerful package such as future because we already “used up all the referential transparency” for some minor notational effect or convenience.

    That being said, R is an open system and it is fun to play with the notation. I have been experimenting with different notations for programming over R for a while, and thought I would demonstrate a few of them here.

    Let Blocks

    We have been using let to code over non-standard evaluation (NSE) packages in R for a while now. This allows code such as the following:

    d <- data.frame(x = c(1, NA))
    cname <- 'x'
    rname <- paste(cname, 'isNA', sep = '_')
    let(list(COL = cname, RES = rname),
        d %>% mutate(RES = is.na(COL)))
     #    x x_isNA
     # 1  1  FALSE
     # 2 NA   TRUE

    let is in fact quite handy notation that will work in a non-deprecated manner with both dplyr 0.5 and dplyr 0.6. It is how we are future-proofing our current dplyr workflows.


    dplyr 0.6 is introducing a new execution system (alternately called rlang or tidyeval, see here) which uses a notation more like the following (but with fewer parentheses, and with the ability to control the left-hand side of an in-argument assignment):

    beval(d %>% mutate(x_isNA = is.na((!!cname))))

    The inability to re-map the right-hand side of the apparent assignment is because the “(!! )” notation doesn’t successfully masquerade as a lexical token valid on the left-hand side of assignments or function argument bindings.

    And there was an R language proposal for a notation like the following (but without the quotes, and with some care to keep it syntactically distinct from other uses of “@”):

    ateval('d %>% mutate(@rname = is.na(@cname))')

    beval and ateval are just curiosities implemented to try and get a taste of the new dplyr notation, and we don’t recommend using them in production — their ad-hoc demonstration implementations are just not powerful enough to supply a uniform interface. dplyr itself seems to be replacing a lot of R‘s execution framework to achieve stronger effects.

    Write Arrow

    We are experimenting with “write arrow” (a deliberate homophone of “right arrow”). It allows the convenient storing of a pipe result into a variable chosen by name.

    'x' -> whereToStoreResult
    7 %>% sin %>% cos %->_% whereToStoreResult
     ## [1] 0.7918362

    Notice that the result of the pipeline is stored in the variable “x”, not in a variable named “whereToStoreResult”; “whereToStoreResult” names where to store the value parametrically.

    This allows code such as the following:

    for(i in 1:3) { 
      i %->_% paste0('x',i)
    }
    (Please run the above to see the automatic creation of variables named “x1”, “x2”, and “x3”, storing values 1,2, and 3 respectively.)
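For comparison (this is my sketch, not from the post), base R expresses the same parametric assignment with assign(), where the target variable's name is itself held in a variable:

```r
## store a pipeline result under a name chosen at run time
whereToStoreResult <- 'x'
assign(whereToStoreResult, cos(sin(7)))
x
## [1] 0.7918362
```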

    We know left to right assignment is heterodox; but the notation is very slick if you are consistent with it, and add in some formatting rules (such as insisting on a line break after each pipe stage).


    One wants to use meta-programming with care. In addition to bringing in desired convenience it can have unexpected effects and interactions deeper in a language or when exposed to other meta-programming systems. This is one reason why a “seemingly harmless” proposal such as “user defined unary functions” or “at unquoting” takes so long to consider. This is also why new language features are best tried in small packages first (so users can easily choose to include them or not in their larger workflow) to drive public request for comments (RFC) processes or allow the ideas to evolve (and not be frozen at their first good idea; a great example of community-accepted change being Haskell’s switch from request-chaining IO to monadic IO; the first IO system “seemed inevitable” until it was completely replaced).

    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

    Source:: R News

    R Weekly Bulletin Vol – V

    By R programming

    (This article was first published on R programming, and kindly contributed to R-bloggers)

    This week’s R bulletin will cover topics like how to avoid for-loops, how to add to or shorten an existing vector, and how to play a beep sound in R. We will also cover functions like readSeries, and the with and within functions. Hope you like this R weekly bulletin. Enjoy reading!

    Shortcut Keys

    1. To stop debugging – Shift+F8
    2. To quit an R session (desktop only) – Ctrl+Q
3. To restart an R Session – Ctrl+Shift+F10

    Problem Solving Ideas

    Avoiding For Loop by using “with” function

For loops can be slow in terms of execution speed when we are dealing with large data sets. For faster execution, one can use the “with” function as an alternative. The syntax of the with function is given below:

    with(data, expr)

    where, “data” is typically a data frame, and “expr” stands for one or more expressions to be evaluated using the contents of the data frame. If there is more than one expression, then the expressions need to be wrapped in curly braces.
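As a minimal, self-contained sketch of this syntax (using a made-up two-column price data frame rather than the NIFTY data):

```r
# with() evaluates the expression using the data frame's columns directly,
# so there is no need to repeat the prices$ prefix
prices <- data.frame(OPEN = c(100, 102, 101), CLOSE = c(101, 103, 100))
intraday_move <- with(prices, CLOSE - OPEN)
intraday_move
# [1]  1  1 -1
```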

Example: Consider the NIFTY 1-year price series. Let us find the gap opening for each day using both methods, timing them with the system.time function. Note the time taken to execute the for loop versus the time to execute the with function in combination with the lagpad function.

# Using FOR Loop (Delt comes from the quantmod package)
library(quantmod)
df = read.csv("NIFTY.csv")
df = df[, c(1, 3:6)]
df$GapOpen = double(nrow(df))
for (i in 2:nrow(df)) {
    df$GapOpen[i] = round(Delt(df$CLOSE[i-1], df$OPEN[i]) * 100, 2)
}

# Using with function + lagpad, instead of FOR Loop
dt = read.csv("NIFTY.csv")
dt = dt[, c(1, 3:6)]
lagpad = function(x, k) {
    c(rep(NA, k), x)[1:length(x)]
}
dt$PrevClose = lagpad(dt$CLOSE, 1)
dt$GapOpen_ = with(dt, round(Delt(dt$PrevClose, dt$OPEN) * 100, 2))
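Since NIFTY.csv is not included here, the timing comparison can be sketched with synthetic prices, using a plain percentage-change formula as a stand-in for quantmod's Delt (an assumption on my part):

```r
# Hedged sketch: compare a for loop against a vectorised with() call
set.seed(1)
n <- 1e5
px <- data.frame(OPEN  = 1000 + cumsum(rnorm(n)),
                 CLOSE = 1000 + cumsum(rnorm(n)))
lagpad <- function(x, k) c(rep(NA, k), x)[1:length(x)]

# loop version: compute one gap-open value per iteration
t_loop <- system.time({
  gap_loop <- double(n)
  for (i in 2:n) gap_loop[i] <- round((px$OPEN[i] / px$CLOSE[i - 1] - 1) * 100, 2)
})

# with() version: lag the close once, then compute all gaps in one shot
t_with <- system.time({
  prev <- lagpad(px$CLOSE, 1)
  gap_with <- with(px, round((OPEN / prev - 1) * 100, 2))
})

t_loop["elapsed"]
t_with["elapsed"]
```

On most machines the vectorised version is markedly faster, though exact timings will vary.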

    Adding to an existing vector or shortening it

    Adding or shortening an existing vector can be done by assigning a new length to the vector. When we shorten a vector, the values at the end will be removed, and when we extend an existing vector, missing values will be added at the end.


# Shorten an existing vector
even = c(2,4,6,8,10,12)
length(even)

[1] 6

# The new length equals the number of elements required in the shortened vector.
length(even) = 3
even

[1] 2 4 6

# Add to an existing vector
odd = c(1,3,5,7,9,11)
length(odd)

[1] 6

# The new length equals the number of elements required in the extended vector.
length(odd) = 8
odd[c(7,8)] = c(13,15)
odd

[1] 1 3 5 7 9 11 13 15

    Make R beep/play a sound

If you want R to play a sound/beep upon executing code, you can do this using the “beepr” package. The beep function from the package plays a sound when the code gets executed. One also needs to install the “audio” package along with the “beepr” package.


    One can select from the various sounds using the “sound” argument and by assigning one of the specified values to it.

    beep(sound = 9)

One can keep repeating the message using beepr, as illustrated in the example below:


work_complete <- function() {
  cat("Work complete. Press esc to sound the fanfare!!!\n")
  while (TRUE) {
    beepr::beep(3)  # the sound choice here is illustrative
    Sys.sleep(1)
  }
}
    One can also use the beep function to play a sound if an error occurs during the code execution.

    options(error = function() {beep(sound =5)})

Functions Demystified

new.env function

Environments act as a storehouse for variables. When we create variables from the R command prompt, they get stored in R’s global environment. To access the variables stored in the global environment, one can use the following expression:

    head(ls(envir = globalenv()), 15)

[1] "df" "dt" "even" "i" "lagpad" "odd"

If we want to store variables in a specific environment, we can assign them to that environment, or create a new environment to hold them. To create a new environment we use the new.env function.


    my_environment = new.env()

    Once we create a new environment, assigning a variable to the environment can be done in multiple ways. Following are some of the methods:

    # By using double square brackets
    my_environment[["AutoCompanies"]] = c("MARUTI", "TVSMOTOR", "TATAMOTORS")
    # By using dollar sign operator
    my_environment$AutoCompanies = c("MARUTI", "TVSMOTOR", "TATAMOTORS")
    # By using the assign function
    assign("AutoCompanies", c("MARUTI", "TVSMOTOR", "TATAMOTORS"), my_environment)

The variables existing in an environment can be listed using the ls function and retrieved using the get function.


    ls(envir = my_environment)

[1] "AutoCompanies"

get("AutoCompanies", my_environment)

[1] "MARUTI" "TVSMOTOR" "TATAMOTORS"


    readSeries function

    The readSeries function is part of the timeSeries package, and it reads a file in table format and creates a timeSeries object from it. The main arguments of the function are:

readSeries(file, header = TRUE, sep = ";", format)

    file: the filename of a spreadsheet dataset from which to import the data records.
    header: a logical value indicating whether the file contains the names of the variables as its first line.
    format: a character string with the format in POSIX notation specifying the timestamp format.
    sep: the field separator used in the spreadsheet file to separate columns. By default, it is set as “;”.


    # Reading the NIFTY data using read.csv
    df = read.csv(file = "NIFTY.csv")

    # Reading the NIFTY data and creating a time series object using readSeries
    # function
    df = readSeries(file = "NIFTY.csv", header = T, sep = ",", format = "%Y%m%d")

    with and within functions

The with and within functions apply an expression to a given data set and allow one to manipulate it. The within function additionally keeps track of the changes made, including adding or deleting elements, and returns a new object with these revised contents. The syntax for these two functions is given as:

    with(data, expr)
    within(data, expr)

data – typically a list or data frame, although other options exist for with.
    expr – one or more expressions to evaluate using the contents of data, the commands must be wrapped in braces if there is more than one expression to evaluate.
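A self-contained illustration of the difference, using the built-in airquality data frame instead of the NIFTY file:

```r
aq <- head(airquality, 5)

# with() returns the value of the expression; aq itself is untouched
avg_temp <- with(aq, mean(Temp))

# within() returns a modified copy of the data, with assignments tracked
aq2 <- within(aq, {
  TempC <- round((Temp - 32) * 5 / 9, 1)  # Fahrenheit to Celsius
})

avg_temp      # mean of the first five Temp values
names(aq2)    # aq2 gains a TempC column; aq does not
```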

    # Consider the NIFTY data
df = read.csv(file = "NIFTY.csv")
    print(head(df, 3))

    # Example of with function:
    df$Average = with(df, apply(df[3:6], 1, mean))
    print(head(df, 3))

    # Example of within function:
df = within(df, {
   Average = apply(df[3:6], 1, mean)
})
print(head(df, 3))

    Next Step

    We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


To leave a comment for the author, please follow the link and comment on their blog: R programming.

    Source:: R News

    Reproducible Data Science with R

    By David Smith

    Yesterday, I had the honour of presenting at The Data Science Conference in Chicago. My topic was Reproducible Data Science with R, and while the specific practices in the talk are aimed at R users, my intent was to make a general argument for doing data science within a reproducible workflow. Whatever your tools, a reproducible process:

    • Saves time,
    • Produces better science,
    • Creates more trusted research,
    • Reduces the risk of errors, and
    • Encourages collaboration.

    Sadly there’s no recording of this presentation, but my hope is that the slides are sufficiently self-contained. Some of the images are links to further references, too. You can browse them below, or download (CC-BY) them from the SlideShare page.

    Thanks to all who attended for the interesting questions and discussion during the panel session!

    Source:: R News

    Emails from R

    By aghaynes

    (This article was first published on R – Insights of a PhD student, and kindly contributed to R-bloggers)

    There are a few packages for sending email directly from R, but I work in a place where none of these work due to strict network settings. To at least partially circumvent this, here’s some code to produce a PowerShell script to send email(s) via Outlook. The PowerShell script can then be run either by a shell call (again, not possible in my workplace) or by right clicking the file and selecting run with PowerShell.

# Addresses
add <- c("", "")
subject <- "Testing"

# construct message
# opening
start <- 'Hi,
how are you?'
# main content
body <- '
sent almost exclusively from R
'
# signature
sig <- '
And this is my signature
'
# paste components together
Body <- paste(start, body, sig)

# construct PowerShell code (*.ps1)
email <- function(recip, subject, Body, filename, attachment = NULL, append = FALSE){
  file <- paste0(filename, ".ps1")
  write('$Outlook = New-Object -ComObject Outlook.Application', file, append = append)
  write('$Mail = $Outlook.CreateItem(0)', file, append = TRUE)
  write(paste0('$Mail.To = "', recip, '"'), file, append = TRUE)
  write(paste0('$Mail.Subject = "', subject, '"'), file, append = TRUE)
  write(paste0('$Mail.Body = "', Body, '"'), file, append = TRUE)
  if (!is.null(attachment)) {
    write(paste0('$File = "', attachment, '"'), file, append = TRUE)
    write('$Mail.Attachments.Add($File)', file, append = TRUE)
  }
  write('$Mail.Send()', file, append = TRUE)
  if (append) write('', file, append = TRUE)
}

for (i in 1:length(add)) {
  file <- paste0("email", i)  # email() appends the .ps1 extension
  att <- file.path(getwd(), "blabla.txt")
  email(add[i], subject, Body, file, attachment = att) # with attachment
  # email(add[i], subject, Body, file)                 # without attachment
  # email(add[i], subject, Body, file, append = TRUE)  # multiple emails in a single PS file
}

    Now you can go and run the PowerShell script from within windows explorer.
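For completeness, the “shell call” route mentioned above can be sketched like this (a hypothetical helper of mine; it assumes PowerShell is on the PATH, and the invocation itself only works on Windows):

```r
# build the PowerShell invocation for a generated script
ps_cmd <- function(script) {
  paste("powershell -ExecutionPolicy Bypass -File", shQuote(script))
}

ps_cmd("email1.ps1")
# run it (Windows only, and blocked in locked-down environments):
# system(ps_cmd("email1.ps1"))
```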

To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD student.

    Source:: R News

    Logistic regressions (in R)

    By Steph

    (This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

Logistic regressions are a great tool for predicting categorical outcomes. They model the log-odds of an outcome as a linear function of the predictors, which makes them easy to interpret and implement in other systems.

Logistic regressions can be used to perform classification, for example determining whether someone needs to go for a biopsy. They can also give a more nuanced view by working with the probabilities of an outcome, for things like prioritising interventions based on the likelihood of defaulting on a loan.
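As a quick, self-contained sketch of building one in R (using the built-in mtcars data purely for illustration):

```r
# logistic regression: model the probability that a car has a straight
# engine (vs = 1) as a function of its weight
fit <- glm(vs ~ wt, data = mtcars, family = binomial)

# predicted probabilities, and a 0.5-threshold classification
p <- predict(fit, type = "response")
pred <- as.integer(p > 0.5)

head(p)
```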

    I recently did a remote talk to Plymouth University on logistic regressions, which covers:

    • how they work (not maths heavy!)
    • how you build them in R
• things to think about when preparing your data
    • ways to evaluate a logistic regression

    You can watch the video below, get the slides, and view the slides’ source code.

    This talk is a cut-down version of my community workshop on logistic regressions, which is in itself a cut-down version of a full day of training on them. Get in touch if you’re interested in the talk or workshop for your user group, or if you’d like to discuss in-depth training.

    The post Logistic regressions (in R) appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data.

    Source:: R News

    Rblpapi 0.3.6

    By Thinking inside the box

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Time for a new release of Rblpapi — version 0.3.6 is now on CRAN. Rblpapi provides a direct interface between R and the Bloomberg Terminal via the C++ API provided by Bloomberg Labs (but note that a valid Bloomberg license and installation is required).

    This is the seventh release since the package first appeared on CRAN last year. This release brings a very nice new function lookupSecurity() contributed by Kevin Jin as well as a number of small fixes and enhancements. Details below:

    Changes in Rblpapi version 0.3.6 (2017-04-20)

    • bdh can now store in double preventing overflow (Whit and John in #205 closing #163)

• bdp documentation has another override example

    • A new function lookupSecurity can search for securities, optionally filtered by yellow key (Kevin Jin and Dirk in #216 and #217 closing #215)

    • Added file init.c with calls to R_registerRoutines() and R_useDynamicSymbols(); also use .registration=TRUE in useDynLib in NAMESPACE (Dirk in #220)

    • getBars and getTicks can now return data.table objects (Dirk in #221)

    • bds has improved internal protect logic via Rcpp::Shield (Dirk in #222)

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the Rblpapi page. Questions, comments etc should go to the issue tickets system at the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

    Source:: R News