H2O Benchmark for CSV Import

By statcompute

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

The importFile() function in H2O is extremely efficient due to its parallel reading. The benchmark comparison below shows that it is comparable to read.df() in SparkR and significantly faster than the generic read.csv().

library(SparkR, lib.loc = paste(Sys.getenv("SPARK_HOME"), "/R/lib", sep = ""))
sc 
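As a rough illustration of the comparison (not the original benchmark script), a minimal sketch might look like this; the file path and cluster settings are assumptions:

# Minimal sketch with an assumed file path; not the original benchmark script
library(h2o)
h2o.init(nthreads = -1)                     # start a local H2O cluster using all cores
csv <- "very_big_file.csv"                  # hypothetical large CSV

t_h2o  <- system.time(h2o.importFile(csv))  # parallel read into an H2OFrame
t_base <- system.time(read.csv(csv))        # single-threaded base R read
# (the SparkR read.df() timing would be run inside a SparkR session, as above)

rbind(h2o = t_h2o, base = t_base)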

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.


Source:: R News

Data visuals notes for my talks in 2017

By Bluecology blog

(This article was first published on Bluecology blog, and kindly contributed to R-bloggers)

Data visuals: notes for my talks in 2017

Supplementary notes for CJ Brown’s talks on dataviz in 2017 for Griffith University’s honours students and the UQ Winterschool in Bioinformatics.


Visualising sexual dimorphism in elephant seals

I picked this example to demonstrate a simple barchart for representing
the sizes of different things.

mnsize 
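A minimal sketch of such a barchart, using made-up male and female masses (the real values were in the talk):

# Hypothetical mean masses (kg) for male and female elephant seals; values are made up
mnsize <- c(Male = 3000, Female = 900)
barplot(mnsize, ylab = "Mass (kg)", las = 1, col = "grey40")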

Comparing volume and area

Compare these. Note that if we compare circles we should use area, not
the radius or diameter to scale their size.

n 

rad1 
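A sketch of the point: if one value is twice another, scaling the radius by 2 exaggerates the difference (area grows with the square of the radius), whereas scaling the area keeps the comparison honest:

# Grid sketch: one value is twice the other; compare radius scaling vs area scaling
library(grid)
grid.newpage()
# scaled by radius: the larger circle has 4x the area
grid.circle(x = 0.25, y = 0.7, r = 0.05, gp = gpar(fill = "grey70"))
grid.circle(x = 0.60, y = 0.7, r = 0.10, gp = gpar(fill = "grey70"))
# scaled by area: radius grows by sqrt(2), so the area is exactly doubled
grid.circle(x = 0.25, y = 0.3, r = 0.05, gp = gpar(fill = "tomato"))
grid.circle(x = 0.60, y = 0.3, r = 0.05 * sqrt(2), gp = gpar(fill = "tomato"))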

Exploration of data

Let’s create a point cloud to demonstrate some data exploration
techniques

set.seed(42)
x 

Can’t see a lot here. A linear model might help us explore if there is
any trend going on:

mod1 
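A sketch of the simulated cloud and the exploratory model (the simulation parameters are assumptions):

# Assumed simulation: an x-y cloud with a weak trend
set.seed(42)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200, sd = 1)

plot(x, y, pch = 16, col = grey(0.5, 0.5), las = 1)

# fit a linear model to explore whether there is any trend
mod1 <- lm(y ~ x)
abline(mod1, lwd = 2, col = 'red')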

What about identifying extreme points, that may be worth investigating
further? We can pick out points that are greater than 2SDs from the
trend:

modresid <- resid(mod1)              # residuals from the trend line
sd2 <- 2 * sd(modresid)              # two standard deviations of the residuals
ipt <- which(abs(modresid) > sd2)    # indices of points more than 2 SDs from the trend

plot(x,y, pch = 16, col = grey(0.5,0.5), las = 1)
abline(mod1, lwd = 2, col = 'red')
points(x[ipt], y[ipt], pch = 16, col = rgb(1,0,0, 0.6))

Three ways of visualising the same x-y data

Each of these graphs of the same data has a slightly different
interpretation.

x 

mod1 

library(MASS)
filled.contour(kde2d(x,y), scale = F)

plot(x,y, pch = 16, col = grey(0.5, 0.5))
abline(mod1, lwd = 3, col = 'red')

Make your own Datasaurus dozen

The datasaurus, invented by Alberto Cairo, is a great example of why you
should view your data. See Steph Locke’s code and package on
github for making this in R.

library(datasauRus)
datnames 

Convince yourself that the mean, sd and correlation are the same in all
of these plots:

library(dplyr)
datasaurus_dozen %>% group_by(dataset) %>%
    summarize(meanx = mean(x), meany = mean(y),
              sdx = sd(x), sdy = sd(y),
              corr = cor(x,y))

## # A tibble: 13 × 6
##       dataset    meanx    meany      sdx      sdy        corr
##         
## 1        away 54.26610 47.83472 16.76982 26.93974 -0.06412835
## 2    bullseye 54.26873 47.83082 16.76924 26.93573 -0.06858639
## 3      circle 54.26732 47.83772 16.76001 26.93004 -0.06834336
## 4        dino 54.26327 47.83225 16.76514 26.93540 -0.06447185
## 5        dots 54.26030 47.83983 16.76774 26.93019 -0.06034144
## 6     h_lines 54.26144 47.83025 16.76590 26.93988 -0.06171484
## 7  high_lines 54.26881 47.83545 16.76670 26.94000 -0.06850422
## 8  slant_down 54.26785 47.83590 16.76676 26.93610 -0.06897974
## 9    slant_up 54.26588 47.83150 16.76885 26.93861 -0.06860921
## 10       star 54.26734 47.83955 16.76896 26.93027 -0.06296110
## 11    v_lines 54.26993 47.83699 16.76996 26.93768 -0.06944557
## 12 wide_lines 54.26692 47.83160 16.77000 26.93790 -0.06657523
## 13    x_shape 54.26015 47.83972 16.76996 26.93000 -0.06558334

We can also save these as .png images to make a .gif image (see also
here)

for (ilvs in 1:nlevels){
  i 
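A sketch of such a loop, saving one .png per datasaurus dataset (the file names and plot settings are assumptions):

# Save one scatterplot per dataset as a .png (assumed file naming)
library(datasauRus)
datnames <- unique(datasaurus_dozen$dataset)
for (ilvs in seq_along(datnames)) {
  dat_i <- subset(datasaurus_dozen, dataset == datnames[ilvs])
  png(sprintf("datasaurus_%02d.png", ilvs))
  plot(dat_i$x, dat_i$y, pch = 16, xlab = "x", ylab = "y", main = datnames[ilvs])
  dev.off()
}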

Effect size

When doing confirmatory analysis, we might want to know how strong an
effect is. Data viz is very useful for this task. Let’s compare two
datasets that have similar p-values, but very different effect sizes.

set.seed(42)
## Coefficients from summary(lm(y ~ x)) on the first simulated dataset:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.8054     0.4614   6.080 1.71e-09 ***
## x             4.2096     0.4603   9.145      ...
##
## Coefficients from summary(lm(y ~ x)) on the second simulated dataset:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.98703    0.03076   97.11      ...
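The simulation itself can be sketched as follows (the parameter values here are assumptions, not the originals):

# Two assumed datasets: both slopes are 'significant', but the effect sizes differ greatly
set.seed(42)
n <- 1000
x <- rnorm(n)
y_big   <- 3 + 4 * x   + rnorm(n, sd = 15)    # large slope, noisy data
y_small <- 3 + 0.1 * x + rnorm(n, sd = 0.3)   # tiny slope, very precise data

summary(lm(y_big ~ x))$coefficients
summary(lm(y_small ~ x))$coefficients

par(mfrow = c(1, 2))
plot(x, y_big,   pch = 16, col = grey(0.5, 0.5), main = "Large effect")
plot(x, y_small, pch = 16, col = grey(0.5, 0.5), main = "Small effect")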

Don’t waste digital ink

Plots with fewer ‘add-ons’ tend to communicate the key message more
clearly. Excel-style plots, for instance, don’t:

x 
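A sketch of that cluttered ‘Excel-style’ version, using the ggthemes package (which also provides the theme_base() and theme_tufte() themes used below); the data frame is an assumption:

# Assumed data frame; ggthemes provides an Excel-like theme for the comparison
library(ggplot2)
library(ggthemes)
set.seed(42)
dat <- data.frame(x = rnorm(100))
dat$y <- dat$x + rnorm(100, sd = 0.5)

ggplot(dat, aes(x = x, y = y)) +
  geom_point() +
  theme_excel()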

You can get additional themes for ggplot2 using this excellent package.
A cleaner view:

ggplot(dat, aes(x = x, y = y)) + geom_point() +
    theme_base()

Or simply:

plot(dat$x, dat$y, xlab = "x", ylab = "y", las = 1)

A good principle is to not use ‘ink’ on figures that isn’t needed to
communicate your message. Tufte
takes the ‘less ink’ philosophy to the extreme:

ggplot(dat, aes(x = x, y = y)) + geom_point() +
    theme_tufte()+theme(axis.text=element_text(size=20),
        axis.title=element_text(size=20))

When is ggplot2 appropriate, or when should I use base R?

In general I think ggplot2 is appropriate for
problems of intermediate complexity. Like this:


Base R is great if you just want to plot a barplot quickly, or do an x-y
plot. ggplot2 comes into its own for slightly more complex plots, like
having multiple panels for different groups or colouring lines by a 3rd
factor. But once you move to really complex plots, like overlaying a
subplot on a map, ggplot2 becomes very difficult, if not impossible. At
that point it is better to move back to Base R. ggplot2 can also get
very fiddly if you are very specific about your plots, like having
certain colours, or the labels in a certain way.

As an example, ggplot2 is great for data like this:

x1 
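A sketch of the kind of grouped data meant here, with lines coloured by a grouping factor (the simulation is an assumption):

# Assumed simulation: x-y data measured in several groups
library(ggplot2)
set.seed(42)
grps <- rep(c("A", "B", "C"), each = 50)
x1   <- rep(1:50, 3)
y1   <- as.numeric(factor(grps)) * x1 + rnorm(150, sd = 10)
dat  <- data.frame(x1, y1, grps)

ggplot(dat, aes(x = x1, y = y1, colour = grps)) +
  geom_line() +
  theme_bw()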

It is also pretty handy for faceting:

ggplot(dat, aes(x = x1, y = y1)) +
  geom_point() + facet_wrap(~grps)+
  theme_bw()

The key with ggplot2 is to have your data in a data-frame.

In reality both ggplot2 and base R graphics are worth learning, but I
would start with learning the basics of base R graphics and then move
on to ggplot2 if you want to quickly plot lots of structured data-sets.

Pie graphs vs bar graphs

In Mariani et al. they plot rates of seafood fraud for several European
countries. While it’s a foundational study that establishes improvement
in the accuracy of food labelling, their graphics could be improved in
several ways.

First, they use perspective pies. This makes it incredibly hard to
compare the two groups (fish that are labelled/mislabelled). Humans are
very bad at comparing angles and pretty bad at comparing areas, and with
the perspective you can’t even compare the areas properly. They do provide
the raw numbers, but then, why bother with the pies?
Note that the % pies misrepresent the data slightly because the %
figures are actually odds ratios (mis-labels / correct labels), rather
than percentages (mis-labels / total samples).
Second, the pies are coloured red/green, which will be hard for red-green
colourblind people to see.
Third, they have coloured land blue on their map, so it appears to be
ocean at first glance.
Fourth, the map is not really necessary. There are no spatial patterns
that the authors want to draw attention to, although I guess having a
map does emphasize that the study is in Europe. Finally, the size of
each pie is scaled to the sample size, but the scale bar for the sample
size shows a sample of only 30, whereas most of their data are for much
larger sample sizes (>200). Do you get the impression from the pies
that the UK has 668 samples, whereas Ireland only has 187? From this
graphic we have no idea what sample size was used in each country.

In fact, all the numbers that are difficult to interpret in the figure
are very nicely presented in Table 1.

Below is a start at improving the presentation. For instance, you could
do a simple bar chart, ordering by rate of mislabelling.

cnames 
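A sketch of such a bar chart with made-up country names and rates (the real values are in Table 1 of the paper):

# Made-up mislabelling rates by country, ordered from highest to lowest
cnames <- c("Spain", "Italy", "UK", "Ireland", "Germany", "France")
rate   <- c(0.12, 0.10, 0.07, 0.05, 0.04, 0.03)   # assumed values

ord <- order(rate, decreasing = TRUE)
par(mar = c(5, 8, 2, 2))
barplot(rate[ord], names.arg = cnames[ord], horiz = TRUE, las = 1,
        xlab = "Proportion of samples mislabelled")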

You could add another subfigure to this, showing the rate by different
species too.

The barplot doesn’t communicate the sample size, but then that is
probably not the main point. The sample sizes are probably best reported
in the table.

If we felt the map was essential, then putting barcharts on it would be
more informative. It is not that easy to add barcharts on top of an
existing map in R, so I would recommend creating the barcharts
separately, then adding them on in Illustrator or Powerpoint.

We can make a basic map like this:

library(maps)
library(maptools)
map('world', xlim = c(-20, 20), ylim = c(35, 60), col = 'grey', fill = T)

Then create some nice barcharts. We write a loop so we get one barchart
for each country.

nc 
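A sketch of that loop, producing one small barchart per country (the labelled/mislabelled counts are made up):

# Made-up labelled vs mislabelled counts for each country
counts <- list(UK = c(ok = 620, mis = 48),
               Ireland = c(ok = 170, mis = 17),
               Spain = c(ok = 200, mis = 26))

nc <- length(counts)
par(mfrow = c(1, nc))
for (i in seq_len(nc)) {
  barplot(counts[[i]], main = names(counts)[i], las = 1, ylab = "Number of samples")
}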

Choosing axes

It can be misleading to present % and proportion data on axes that are
not scaled 0 – 100%. For instance, compare these three graphs:

y 
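A sketch of the comparison, plotting the same made-up proportions on three different y-axis ranges:

# The same proportions on a 0-1 axis, a zoomed axis, and a very narrow axis
y <- c(0.52, 0.54, 0.53, 0.55, 0.56)
x <- 1:5

par(mfrow = c(1, 3))
plot(x, y, type = "b", ylim = c(0, 1),         main = "0 - 1 axis")
plot(x, y, type = "b", ylim = c(0.5, 0.6),     main = "Zoomed axis")
plot(x, y, type = "b", ylim = c(0.515, 0.565), main = "Very narrow axis")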

Choosing colour scales

A lot of thought should go into choosing colour scales for graphs. For
instance: will it print OK? Will colour-blind people be able to see it?
Does the scale create artificial visual breaks in the data?
Luckily there is a package to help you make the right decision for a
colour scale; it is called RColorBrewer. Check out colorbrewer.org for
a helpful interactive web interface for choosing colours.
Here I recreate the fish biomass map from Cinner et al. 2016 Nature
(using dummy data, as I don’t have their data).

First let’s specify some dummy data

fishbio 

Using red-green palettes makes it hard for colour blind people. Also,
using a diverging palette makes it look like there is something
important about the middle point (yellow). A better palette to use would
be one of the sequential ones. Using reds (and reversing it using
rev()) emphasises the worst places. We could use greens and emphasise
the best places too. I am using a light grey for the fill so that the
points stand out more.

#Create sequential colours for locations
cols 
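A sketch of colouring map points with a sequential RColorBrewer palette; the fishbio values and site locations are dummy assumptions:

# Dummy fish biomass values at random site locations
library(maps)
library(RColorBrewer)
set.seed(42)
nsites  <- 30
fishbio <- runif(nsites)                       # dummy biomass values, 0-1
lon     <- runif(nsites, -180, 180)
lat     <- runif(nsites, -35, 35)

# sequential reds, reversed so that the darkest red marks the lowest biomass
cols <- rev(brewer.pal(9, 'Reds'))[cut(fishbio, 9)]

map('world', col = grey(0.9), fill = TRUE)
points(lon, lat, pch = 16, col = cols, cex = 1.5)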

To make it easier to understand, let’s look at these again as contour
plots. I will use a more appropriate diverging palette than the red-green
one though.

z 

filled.contour(z, col = brewer.pal(9, 'RdBu'), nlevels = 10)

Notice the diverging palette creates an artificial split at yellow.

One of the only legitimate uses for pie graphs (I think) is visualising
the colour scales. Here is how:

reds 
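A sketch of the trick: a pie with equal slices, one slice per colour, acts as a quick legend for a palette:

# Visualise two palettes as pies with equal slices
library(RColorBrewer)
reds <- brewer.pal(9, 'Reds')
par(mfrow = c(1, 2))
pie(rep(1, 9), col = reds,                  labels = NA, main = 'Reds')
pie(rep(1, 9), col = brewer.pal(9, 'RdBu'), labels = NA, main = 'RdBu')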

Interpreting rates

People are bad at interpreting rates; we just can’t get our heads around
accumulation very well. Here is a numerical example. Check out the below
figure and ask yourself:

At what time is the number of people in the shopping centre declining?

Would you say it is at point A, B, C or D?

Before you proceed with code below, take the poll:


Here is how we made the figure and generated the data:
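The data-generation step is sketched below (the shape and timing of the entry/exit rates are assumptions); the plotting code follows.

# Assumed simulation: people entering and leaving a shopping centre per 10 minutes
set.seed(42)
times   <- seq(9, 18, by = 1/6)                  # 9am to 6pm in 10-minute steps
inrate  <- 8 * exp(-(times - 12.8)^2 / 4)        # entries peak around lunchtime
outrate <- 8 * exp(-(times - 15.4)^2 / 4)        # exits peak mid afternoon
inrate_err  <- pmax(0, inrate  + rnorm(length(times), sd = 0.5))
outrate_err <- pmax(0, outrate + rnorm(length(times), sd = 0.5))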

par(mar = c(4,4.5,2,2), mgp = c(3,1,0))
plot(times, inrate_err, type = 'l', xlab = 'Hour of day', ylab = 'People per 10 minutes', las = 1, cex.axis = 2, lwd = 3, col = 'darkblue', cex.lab = 2, ylim = c(0, 12))
lines(times, outrate_err, lwd = 3, col = 'tomato')

abline(v = 12, lwd = 2, col = grey(0.5,0.5))
text(12, 13, 'A', xpd = NA, cex = 2)
abline(v = 13.5, lwd = 2, col = grey(0.5,0.5))
text(13.5, 13, 'B', xpd = NA, cex = 2)
abline(v = 14.2, lwd = 2, col = grey(0.5,0.5))
text(14.2, 13, 'C', xpd = NA, cex = 2)
abline(v = 15.8, lwd = 2, col = grey(0.5,0.5))
text(15.8, 13, 'D', xpd = NA, cex = 2)

legend('bottomleft', legend = c('Entering', 'Leaving'), lwd = 2, col = c('darkblue','tomato'), cex = 1.5, xpd = NA)

Let’s plot the total number of people:

par(mar = c(5,5.5,2,2), mgp = c(4,1,0))
plot(times, cumsum(inrate_err) - cumsum(outrate_err), type = 'l', xlab = 'Hour of day', ylab = 'People in shopping centre', las = 1, cex.axis = 2, lwd = 3, col = 'darkblue', cex.lab = 2, ylim = c(0, 120))

abline(v = 12, lwd = 2, col = grey(0.5,0.5))
text(12, 130, 'A', xpd = NA, cex = 2)
abline(v = 13.5, lwd = 2, col = grey(0.5,0.5))
text(13.5, 130, 'B', xpd = NA, cex = 2)
abline(v = 14.1, lwd = 2, col = 'tomato')
text(14.2, 130, 'C', xpd = NA, cex = 2)
abline(v = 15.8, lwd = 2, col = grey(0.5,0.5))
text(15.8, 130, 'D', xpd = NA, cex = 2)

Hopefully the answer is obvious now. So the right scales can help make
interpretation much easier.

Anatomy of a simple chart

The construction of a simple chart in R can be a surprisingly long piece
of code. Here is an example to get you started. Don’t be afraid to
experiment!

# --------------- #
# Make some data
# --------------- #
set.seed(42)

n 
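A sketch of that data-generation step (the grouping structure and values are assumptions):

# Assumed data: measurements for three groups, plotted with the default settings
set.seed(42)
n    <- 30
grps <- rep(c("Low", "Medium", "High"), each = n)
y    <- c(rnorm(n, 5, 2), rnorm(n, 7, 2), rnorm(n, 10, 2))
dat  <- data.frame(grps, y)

boxplot(y ~ grps, data = dat)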

Which looks terrible. Let’s build a better chart.

# Package for colours
library(RColorBrewer)

#Set axis limits
ymax 
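And a sketch of a cleaner version, with explicit axis limits and an RColorBrewer palette (all of these choices are assumptions, not the original chart):

# A cleaner chart: set axis limits, order the groups, and use a Brewer palette
library(RColorBrewer)
ymax <- max(dat$y) * 1.1
cols <- brewer.pal(3, 'Set2')

dat$grps <- factor(dat$grps, levels = c("Low", "Medium", "High"))
boxplot(y ~ grps, data = dat, ylim = c(0, ymax), col = cols,
        las = 1, ylab = "Measurement", frame.plot = FALSE)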

Resources and further reading

To leave a comment for the author, please follow the link and comment on their blog: Bluecology blog.


Source:: R News

Data Visualization with googleVis exercises part 4

By Euthymios Kasvikis

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Adding Features to your Charts

In the previous parts we saw some basic and well-known types of charts that googleVis offers to users. Before continuing with other, more sophisticated charts in the next parts, we are going to “dig a little deeper” and see some interesting features of those we already know.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

Package & Data frame

As you already know, the first thing you have to do is install and load the googleVis package with:
install.packages("googleVis")
library(googleVis)

Secondly we will create an experimental data frame which will be used for our charts’ plotting. You can create it with:
df=data.frame(name=c("James", "Curry", "Harden"),
Pts=c(20,23,34),
Rbs=c(13,7,9))

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page.

Customizing Chart

We are going to use the two-axis Line Chart we created in part 1. This is the code we used, in case you forgot it:

LineC2 <- gvisLineChart(df, xvar="name", yvar=c("Pts","Rbs"),
options=list(
series="[{targetAxisIndex: 0},
{targetAxisIndex:1}]",
vAxes="[{title:'Pts'}, {title:'Rbs'}]"
))
plot(LineC2)

Colours

To set the color of every line we can use:
series="[{color:'green', targetAxisIndex: 0,

Exercise 1

Change the colours of your line chart to green and yellow respectively and plot the chart.

Line Width

You can determine the line width of every line with:
series="[{color:'green',targetAxisIndex: 0, lineWidth: 3},

Exercise 2

Change the line width of your lines to 3 and 6 respectively and plot the chart.

Dashed lines

You can transform your lines to dashed with:
series="[{color:'green', targetAxisIndex: 0,
lineWidth: 1, lineDashStyle: [2, 2, 20, 2, 20, 2]},

There are many styles and colours available and you can find them here.

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

  • Work extensively with the GoogleVis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 3

Choose two different styles of dashed lines for every line of your chart from the link above and plot your chart.

Point Shape

With the pointShape option you can choose from a variety of shapes for your points.

We will use the scatter chart we built in part 3 to see how it works. Here is the code:
ScatterCD <- gvisScatterChart(cars,
options=list(
legend="none",
pointSize=3,lineWidth=2,
title="Cars", vAxis="{title:'speed'}",
hAxis="{title:'dist'}",
width=600, height=300))
plot(ScatterCD)

Exercise 4

Change the shape of your scatter chart’s points to ‘square’ and plot it. HINT: Use pointShape.

Exercise 5

Change the shape of your scatter chart’s points to ‘triangle’, their point size to 7 and plot it.

Edit Button

A really useful and easy feature that googleVis provides is the edit button which gives the user the ability to customize the chart in an automated way.
options=list(gvis.editor="Edit!"))
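For example, a sketch reusing the df data frame from above (Line_edit is just a hypothetical object name):

Line_edit <- gvisLineChart(df, xvar="name", yvar=c("Pts","Rbs"),
options=list(gvis.editor="Edit!"))
plot(Line_edit)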

Exercise 6

Add an edit button in the scatter chart you just created. HINT: Use gvis.editor.

Chart with more options

Now let’s see how we can create a chart with many features that can enhance its appearance. We will use again the 2-axis line that we used before.
LineCD2 <- gvisLineChart(df, xvar="name", yvar=c("Pts","Rbs"),
options=list(
series="[{color:'green',targetAxisIndex: 0, lineWidth: 3,
lineDashStyle: [14, 2, 2, 7]},
{color:'yellow',targetAxisIndex:1,lineWidth: 6,
lineDashStyle: [10, 2]}]",
vAxes="[{title:'Pts'}, {title:'Rbs'}]"
))
plot(LineCD2)

Background color

You can decide the background color of your chart with:
backgroundColor="red",

Exercise 7

Set the background color of your line chart to “lightblue” and plot it. HINT: Use backgroundColor.

Title

To give a title and decide its features you can use:
title="Title",
titleTextStyle="{color:'orange',
fontName:'Courier',
fontSize:14}",

Exercise 8

Give a title of your choice to the line chart and set its font to blue Courier of size 16. HINT: Use titleTextStyle.

Curve Type & Legend

Another nice-looking choice that googleVis gives you is to display the lines as curves with:
curveType="function"

You can also move the legend of your chart to the bottom with:
legend="bottom"

Exercise 9

Smooth the lines of your line chart by setting the curveType option to function and move the legend to the bottom. HINT: Use curveType and legend.

Axes features

Finally you can “play” with your axes. This is an example:
vAxis="{gridlines:{color:'green', count:4}}",
hAxis="{title:'City', titleTextStyle:{color:'red'}}",
series="[{color:'yellow', targetAxisIndex: 0},
{color: 'brown',targetAxisIndex:1}]",
vAxes="[{title:'val1'}, {title:'val2'}]",

Exercise 10

Give the title “Name” to your hAxis and color it orange. Separate your vAxis with 3 red gridlines. HINT: Use titleTextStyle and gridlines

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

R Weekly Bulletin Vol – XII

By QuantInsti

(This article was first published on R programming, and kindly contributed to R-bloggers)

This week’s R bulletin will cover topics on how to resolve some common errors in R.

We will also cover functions like do.call, rename, and lapply. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Find and Replace – Ctrl+F
2. Find Next – F3
3. Find Previous – Shift+F3

Problem Solving Ideas

Resolving the ‘: cannot open the connection’ Error

There can be two reasons for this error to show up when we run an R script:

1) A file/connection can’t be opened because R can’t find it (mostly due to an error in the path)
2) Failure in .onLoad() because a package can’t find a system dependency

Example:

symbol = "AXISBANK"
noDays = 1
dirPath = paste(getwd(), "/", noDays, " Year Historical Data", sep = "")
fileName = paste(dirPath, symbol, ".csv", sep = "")
data = as.data.frame(read.csv(fileName))

Warning in file(file, “rt”): cannot open file ‘C:/Users/Madhukar/Documents/
1 Year Historical DataAXISBANK.csv’: No such file or directory
Error in file(file, “rt”): cannot open the connection

We are getting this error because we have specified the wrong path to the “dirPath” object in the code. The right path is shown below. We missed adding a forward slash after “Year Historical Data” in the paste function. This led to the wrong path, and hence the error.

dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "")

After adding the forward slash, we re-ran the code. Below we can see the right dirPath and fileName printed in the R console.

Example:

symbol = "AXISBANK"
noDays = 1
dirPath = paste(getwd(), "/", noDays, " Year Historical Data/", sep = "")
fileName = paste(dirPath, symbol, ".csv", sep = "")
data = as.data.frame(read.csv(fileName))
print(head(data, 3))

Resolving the ‘could not find function’ Error

This error arises when an R package is not loaded properly or due to the misspelling of the function names.

When we run the code shown below, we get a "could not find function" error in the console. This is because we have misspelled the "ymd" function as "ymed". If we do not load the required package (ymd comes from the lubridate package), R will also throw up a "could not find function" error.

Example:

# Read NIFTY price data from the csv file
df = read.csv("NIFTY.csv")

# Format date
dates = ymed(df$DATE)

Error in eval(expr, envir, enclos): could not find function “ymed”
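A corrected version loads the package that provides ymd and spells the function correctly; a minimal sketch:

# Load lubridate, which provides ymd(), and use the correct function name
library(lubridate)

df = read.csv("NIFTY.csv")
dates = ymd(df$DATE)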

Resolving the “replacement has” Error

This error occurs when one tries to assign a vector of values to an existing object and the lengths do not match up.

In the example below, the stock price data of Axis bank has 245 rows. In the code, we created a sequence “s” of numbers from 1 to 150. When we try to add this sequence to the Axis Bank data set, it throws up a “replacement error” as the lengths of the two do not match. Thus to resolve such errors one should ensure that the lengths match.

Example:

symbol = "AXISBANK" ; noDays = 1 ;
dirPath = paste(getwd(),"/",noDays," Year Historical Data/",sep="")
fileName = paste(dirPath,symbol,".csv",sep="")
df = as.data.frame(read.csv(fileName))

# Number of rows in the dataframe "df"
n = nrow(df); print(n);

# create a sequence of numbers from 1 to 150
s = seq(1,150,1)

# Add a new column "X" to the existing data frame "df"
df$X = s
print(head(df,3))

Error in `$<-.data.frame`(`*tmp*`, "X", value = c(1, 2, 3, ...)):
  replacement has 150 rows, data has 245

Functions Demystified

do.call function

The do.call function is used for calling other functions. The function which is to be called is provided as the first argument to the do.call function, while the second argument of the do.call function is a list of arguments of the function to be called. The syntax for the function is given as:

do.call (function_name, arguments)

Example: Let us first define a simple function that we will call later in the do.call function.

numbers = function(x, y) {
sqrt(x^3 + y^3)
}

# Now let us call this 'numbers' function using the do.call function. We provide the function name as # the first argument to the do.call function, and a list of the arguments as the second argument.

do.call(numbers, list(x = 3, y = 2))

[1] 5.91608

rename function

The rename function is part of the dplyr package, and is used to rename the columns of a data frame. The syntax for the rename function is to have the new name on the left-hand side of the = sign, and the old name on the right-hand side. Consider the data frame “df” given in the example below.

Example:

library(dplyr)
Tic = c("IOC", "BPCL", "HINDPETRO", "ABAN")
OP = c(555, 570, 1242, 210)
CP = c(558, 579, 1248, 213)
df = data.frame(Tic, OP, CP)
print(df)

# Renaming the columns as 'Ticker', 'OpenPrice', and 'ClosePrice'. This can be done in the following 
# manner:

renamed_df = rename(df, Ticker = Tic, OpenPrice = OP, ClosePrice = CP)
print(renamed_df)

lapply function

The lapply function is part of the R base package, and it takes a list “x” as an input, and returns a list of the same length as “x”, each element of which is the result of applying a function to the corresponding element of “x”. The syntax of the function is given as:

lapply(x, Fun)
where,
x is a vector (atomic or list)
Fun is the function to be applied

Example 1:

Let us create a list with 2 elements, OpenPrice and the ClosePrice. We will compute the mean of the values in each element using the lapply function.

x = list(OpenPrice = c(520, 521.35, 521.45), ClosePrice = c(521, 521.1, 522))
lapply(x, mean)

$OpenPrice
[1] 520.9333

$ClosePrice
[1] 521.3667

Example 2:

x = list(a = 1:10, b = 11:15, c = 1:50)
lapply(x, FUN = length)

$a
[1] 10

$b
[1] 5

$c
[1] 50

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


To leave a comment for the author, please follow the link and comment on their blog: R programming.


Source:: R News

Using Tweedie Parameter to Identify Distributions

By statcompute

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In the development of operational loss models, it is important to identify which distribution should be used to model operational risk measures, e.g. frequency and severity. For instance, why should we use the Gamma distribution instead of the Inverse Gaussian distribution to model the severity?

In my previous post https://statcompute.wordpress.com/2016/11/20/modified-park-test-in-sas, it is shown how to use the Modified Park test to identify the mean-variance relationship and then decide the corresponding distribution of operational risk measures. Following similar logic, we can also leverage the flexibility of the Tweedie distribution to accomplish the same goal. Based upon the parameterization of a Tweedie distribution, the variance = Phi * (Mu ** P), where Mu is the mean and P is the power parameter. Depending on the specific value of P, the Tweedie distribution can accommodate several important distributions commonly used in operational risk modeling, including the Poisson, Gamma, and Inverse Gaussian. For instance,

  • With P = 0, the variance would be independent of the mean, indicating a Normal distribution.
  • With P = 1, the variance would be in a linear form of the mean, indicating a Poisson-like distribution.
  • With P = 2, the variance would be in a quadratic form of the mean, indicating a Gamma distribution.
  • With P = 3, the variance would be in a cubic form of the mean, indicating an Inverse Gaussian distribution.

In the example below, it is shown that the value of P is in the neighborhood of 1 for the frequency measure and is near 3 for the severity measure and that, given P closer to 3, the Inverse Gaussian regression would fit the severity better than the Gamma regression.

library(statmod)
library(tweedie)

profile1 
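A minimal sketch of estimating P with tweedie.profile(); the data frame, response, and predictors below (df, sev, x1, x2) are hypothetical stand-ins, not the original model:

# Estimate the Tweedie power parameter P for a severity-like measure
# 'df', 'sev', 'x1', 'x2' are hypothetical names used only for illustration
prof <- tweedie.profile(sev ~ 1, data = df,
                        p.vec = seq(1.5, 3.5, by = 0.1), do.plot = FALSE)
prof$p.max   # the value of P that maximises the profile likelihood

# With P near 3, an Inverse Gaussian-like GLM is suggested for the severity:
glm(sev ~ x1 + x2, data = df,
    family = statmod::tweedie(var.power = prof$p.max, link.power = 0))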

Together with the Modified Park test, the estimation of P in a Tweedie distribution is able to help us identify the correct distribution employed in operational loss models in the context of GLM.

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.


Source:: R News

Track changes in data with the lumberjack %>>%

By mark

(This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

So you are using this pipeline to have data treated by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R

> data(retailers, package="validate")
> head(retailers, 3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA        NA      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607        NA      1607         131        1544     63  NA
3  sc3      0.14    NA     6886       -33      6919         324        6493    426  NA

This data is dirty with missings and full of errors. Let us do some imputations with simputation.

> out <- retailers %>%
+   impute_lm(other.rev ~ turnover) %>%
+   impute_median(other.rev ~ size)
> 
> head(out,3)
  size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1  sc0      0.02    75       NA  6114.775      1130          NA       18915  20045  NA
2  sc3      0.14     9     1607  5427.113      1607         131        1544     63  NA
3  sc3      0.14    NA     6886   -33.000      6919         324        6493    426  NA
> 

Ok, cool, we know all that. But what if you’d like to know what value was imputed with which method? That’s where the lumberjack comes in.

The lumberjack operator is a ‘pipe’[1] operator that allows you to track changes in data.

> library(lumberjack)
> retailers$id <- seq_len(nrow(retailers))
> out <- retailers %>>%
+   start_log(log=cellwise$new(key="id")) %>>%
+   impute_lm(other.rev ~ turnover) %>>%
+   impute_median(other.rev ~ size) %>>%
+   dump_log(stop=TRUE)
Dumped a log at cellwise.csv
> 
> read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
  step                     time                      expression key  variable old      new
1    2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size)   1 other.rev  NA 6114.775
2    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   2 other.rev  NA 5427.113
3    1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover)   6 other.rev  NA 6341.683
> 

So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() function calls to the data pipeline. (To be sure: it works with any function, not only with simputation.) The package is on CRAN now; please see the introductory vignette for more examples and ways to customize it.

There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.

If this post got you interested, please install the package using

install.packages('lumberjack')

You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.

As always, I am open to suggestions and comments, for example through the package’s GitHub page.

Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.

And finally, here’s a picture of a lumberjack smoking a pipe.

[1] It really should be called a function composition operator, but potetoes/potatoes.

To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.


Source:: R News

The R community is one of R’s best features

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a community to be successful. And that’s an area where R really shines, as Shannon Ellis explains in this lovely ROpenSci blog post. For software, a thriving community offers developers, expertise, collaborators, writers and documentation, testers, agitators (to keep the community and software on track!), and so much more. Shannon provides links where you can find all of this in the R community:

  • #rstats hashtag — a responsive, welcoming, and inclusive community of R users to interact with on Twitter
  • R-Ladies — a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
  • Local R meetup groups — a google search may show that there’s one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable
  • Rweekly — an incredible weekly recap of all things R
  • R-bloggers — an awesome resource to find posts from many different bloggers using R
  • DataCarpentry and Software Carpentry — a resource of openly available lessons that promote and model reproducible research
  • Stack Overflow — chances are your R question has already been answered here (with additional resources for people looking for jobs)

I’ll add a couple of others as well:

  • R Conferences — The annual useR! conference is the major community event of the year, but there are many smaller community-led events on various topics.
  • Github — there’s a fantastic community of R developers on Github. There’s no directory, but the list of trending R developers is a good place to start.
  • The R Consortium — proposing or getting involved with an R Consortium project is a great way to get involved with the community

As I’ve said before, the R community is one of the greatest assets of R, and is an essential component of what makes R useful, easy, and fun to use. And you couldn’t find a nicer and more welcoming group of people to be a part of.

To learn more about the R community, be sure to check out Shannon’s blog post linked below.

ROpenSci Blog: Hey! You there! You are welcome here

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

Logarithmic Scale Explained with U.S. Trade Balance

By Gregory Kanevsky

(This article was first published on novyden, and kindly contributed to R-bloggers)
Skewed data prevail in real life. Unless you observe trivial or near-constant processes, data is skewed one way or another due to outliers, long tails, errors or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest.
Consider the U.S. 2016 merchandise trade partner balances data set, where each point is a country with two features: U.S. imports and exports against it:
Suppose we decided to visualize the top 30 U.S. trading partners using a bubble chart, which is simply a 2D scatter plot with the third dimension expressed through point size. Then U.S. trade partners become disks with imports and exports for x-y coordinates and trade balance (abs(export - import)) for size:
China, Canada, and Mexico run far larger balances than the other 27 countries, which causes most data points to collapse into the crowded lower-left corner. One way to “solve” this problem is to eliminate the 3 mentioned outliers from the picture:

While this plot does look better, it no longer serves its original purpose of displaying all top trading partners. And the undesirable effect of outliers, though reduced, still presents itself with new ones: Japan, Germany, and the U.K. So let us bring all countries back into the mix by trying a logarithmic scale.
A quick refresher from algebra: the log function (here log base 10, but the same applies to the natural log or log base 2) is commonly used to transform positive real numbers because of its property of mapping multiplicative relationships into additive ones. Indeed, given numbers A, B, and C such that

`A*B=C and A,B,C > 0`

applying log results in additive relationship:

`log(A) + log(B) = log(C)`
For example, let A=100, B=1000, and C=100000 then

`100 * 1000 = 100000`

so that after transformation it becomes

`log(100) + log(1000) = log(100000)` or `2 + 3 = 5`

Observe this on a 1D plane:

A logarithmic scale is simply a log transformation applied to all of a feature’s values before plotting them. In our example we used it on both of the trading partners’ features, imports and exports, which gives the bubble chart a new look:

The same data displayed on a logarithmic scale appear almost uniform, but do not forget that the farther points are from 0, the more orders of magnitude apart they are on the actual scale (observe this by scrolling back to the original plot). The main advantage of using a log scale in this plot is the ability to observe relationships between all top 30 countries without losing the whole picture and without collapsing the smaller points together.
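A sketch of the log-scale bubble chart in ggplot2, using made-up trade figures (the original post used actual U.S. Census trade data):

# Made-up imports/exports (billions USD) for a few partners, plotted on log10 scales
library(ggplot2)
trade <- data.frame(
  country = c("China", "Canada", "Mexico", "Japan", "Germany", "Ireland"),
  imports = c(462, 278, 294, 132, 114, 45),
  exports = c(115, 266, 230, 63, 49, 9)
)
trade$balance <- abs(trade$exports - trade$imports)

ggplot(trade, aes(x = exports, y = imports, size = balance, label = country)) +
  geom_point(alpha = 0.5) +
  geom_text(size = 3, vjust = -1) +
  scale_x_log10() +
  scale_y_log10() +
  scale_size_area(max_size = 15)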
For more detailed discussion of logarithmic scale refer to When Should I Use Logarithmic Scales in My Charts and Graphs? Oh, and how about that trade deficit with China?
This is a re-post from the original blog on LinkedIn.
To leave a comment for the author, please follow the link and comment on their blog: novyden.


Source:: R News

Working With SPSS© Data in R

By Mauricio Vargas S. 帕夏


(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

Introduction

I needed to import SPSS© data for work. There are some options, but I’ve used both the foreign and haven R packages. I prefer haven because it integrates better with R’s tidyverse; I started using it instead of foreign once I verified that it handles factors well and avoids the deprecated factor-labels problems in newer R versions.

The Data

For this post I found the Diego Portales University National Survey. It is a publicly available survey, applied nation-wide since 2005, that asks people about their trust in institutions (e.g. government, police, firefighters) and their opinion on same-sex marriage, restricting spaces for smoking, and more.

Importing Data

#devtools::install_github("ropenscilabs/skimr")

# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)
library(readr)

# Import foreign statistical formats
library(haven)

# Data
url = "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
sav = "2017-06-24_working_with_spss_data_in_r/udp_national_survey_2015.sav"

if(!file.exists(sav)){download.file(url,sav)}

survey = read_sav(sav)

Exploring data

To explore the data, keep in mind that the survey is in Spanish. So, “fecha” means date, “edad” means age, and “sexo” means sex.

# How many surveys do I have by day?
daily = survey %>%
  mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
  rename(date = Fecha) %>% 
  group_by(date) %>%
  summarise(n = n())

ggplot(daily, aes(date, n)) +
  geom_line()
# How is the age distributed?
summary(survey$Edad_Entrevistado)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   32.00   48.00   47.92   61.00   89.00 
age = survey %>%
  mutate(as.integer(Edad_Entrevistado)) %>% 
  rename(age = Edad_Entrevistado) %>% 
  group_by(age) %>%
  summarise(n = n())

ggplot(age, aes(age, n)) +
  geom_line()

plot of chunk exploring_1

# How is the sex distributed?
survey %>%
  rename(sex_id = Sexo_Entrevistado) %>% 
  group_by(sex_id) %>%
  summarise(n = n())
# A tibble: 2 x 2
     sex_id     n
  
1         1   651
2         2   651

Exploring labels

In the last tibble we have no idea what 1 and 2 stand for.

survey %>%
  select(Sexo_Entrevistado) %>% 
  rename(sex_id = Sexo_Entrevistado) %>% 
  distinct() %>% 
  mutate(sex = as_factor(sex_id))
# A tibble: 2 x 2
     sex_id    sex
  
1         2  Mujer
2         1 Hombre

The last column (in Spanish) shows us that in this survey “1 = Male” and “2 = Female”.

I could run

survey %>%
  rename(sex = Sexo_Entrevistado) %>% 
  mutate(sex = as.integer(sex)) %>% 
  mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>% 
  group_by(sex) %>%
  summarise(n = n())
# A tibble: 2 x 2
     sex     n
   
1 Female   651
2   Male   651

The column names are labelled as well. Here sjlabelled helps if I want to know, for example, what “P12” means (see the sketch below). But instead of just translating labels one by one, I’ll describe the complete dataset.
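Looking up a single question would be something like this (a sketch; the exact label text depends on the survey file):

# Inspect the variable label and the value labels of one question
library(sjlabelled)
get_label(survey$P12)    # the question text stored as the variable label
get_labels(survey$P12)   # the text associated with each response code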

Describing the dataset

valid_replies = survey %>% 
  mutate_if(is.labelled,as.numeric) %>% 
  skim() %>%
  filter(stat=="complete") %>% 
  mutate(description = get_label(survey)) %>% 
  select(var,description,everything()) %>% 
  select(-c(stat,level,type)) %>% 
  rename(pcent_valid = value) %>% 
  mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%'))

histograms = survey %>% 
  mutate_if(is.labelled,as.numeric) %>% 
  skim() %>%
  filter(stat=="hist") %>% 
  select(var,level) %>% 
  rename(histogram = level)

survey_description = valid_replies %>% 
  left_join(histograms) %>% 
  write_csv("2017-06-24_working_with_spss_data_in_r/survey_description.csv")

survey_description
# A tibble: 203 x 4
                 var          description pcent_valid  histogram
               
 1        PONDERADOR           Ponderador        100% ▂▇▇▅▅▃▁▁▁▁
 2             Folio                Folio        100% ▇▇▇▇▇▇▇▇▇▇
 3            Región               Región        100% ▁▁▂▁▂▁▁▁▇▁
 4            Comuna               Comuna        100% ▁▁▂▁▁▂▁▁▇▁
 5             Fecha     Fecha entrevista        100%       
 6  Sexo_Encuestador   Sexo Entrevistador         91% ▂▁▁▁▁▁▁▁▁▇
 7               GSE           GSE Visual        100% ▁▁▂▁▇▁▁▆▁▁
 8 Sexo_Entrevistado    Sexo Entrevistado        100% ▇▁▁▁▁▁▁▁▁▇
 9 Edad_Entrevistado    Edad Entrevistado        100% ▇▆▅▆▇▇▅▃▃▂
10       Hora_Inicio Hora Inicio Medición        100%       
# ... with 193 more rows

Exploring the last tibble reveals some interesting questions. For example, P12 refers to “Apoyo a la democracia”, that is, “Do you support democracy?”.

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included).


Source:: R News