## Different demand functions and optimal price estimation in R

By insightr

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Yuri Fonseca

## Demand models

In the previous post about pricing optimization (link here), we discussed a little about linear demand and how to estimate optimal prices in that case. In this post we are going to compare three different types of demand models for homogeneous products and how to find optimal prices for each one of them.

For the linear model, demand is given by:

$\displaystyle d(p) = \alpha p + \beta,$

where $\alpha$ is the slope of the curve and $\beta$ the intercept. For the linear model, the elasticity ranges from zero to infinity along the curve. Another very common demand model is the constant-elasticity model, given by:

$\displaystyle \ln d(p) = \alpha \ln p + \beta,$
or

$\displaystyle d(p) = e^{\beta} p^{\alpha} = Cp^{\alpha},$

where $\alpha$ is the elasticity of the demand and $C$ is a scale factor. A much more interesting demand curve is given by the logistic/sigmoid function:

$\displaystyle d(p) = C\frac{e^{\alpha p + \beta}}{1 + e^{\alpha p + \beta}} = \frac{C}{1+e^{-\alpha(p - p_0)}},$

where $C$ is a scale factor and $\alpha$ measures price sensitivity. We can also identify $p_0 = -\beta/\alpha$ as the inflection point of the demand.

Some books change the signs of the coefficients, assuming that $\alpha$ is a positive constant and placing a minus sign in front of it. This does not change the estimation procedure or the final result; it is just a matter of convention. Here, we expect $\alpha$ to be negative in all three models.

The figure below compares the shapes of the three demand models:

```r
library(ggplot2)
library(reshape2)
library(magrittr)

linear = function(p, alpha, beta) alpha*p + beta
constant_elast = function(p, alpha, beta) exp(alpha*log(p) + beta)
logistic = function(p, c, alpha, p0) c/(1 + exp(-alpha*(p - p0)))

p = seq(1, 100)
y1 = linear(p, -1, 100)
y2 = constant_elast(p, -.5, 4.5)
y3 = logistic(p, 100, -.2, 50)

df = data.frame('Prices' = p, 'Linear' = y1, 'Constant_elast' = y2, 'Logistic' = y3)
df.plot = melt(df, id = 'Prices') %>% set_colnames(c('Prices', 'Model', 'Demand'))

ggplot(df.plot) + aes(x = Prices, y = Demand) +
  geom_line(color = 'blue', alpha = .6, lwd = 1) +
  facet_grid(~Model)
```

Of course, in practice prices do not range from 1 to 100; the idea is just to highlight the main differences in the shapes of the models.

All the models presented above have strengths and weaknesses. Although a local linear approximation may be reasonable for small price changes, this assumption is sometimes too strong and fails to capture the correct sensitivity to larger price changes. The constant-elasticity model, even though it describes a non-linear relationship between demand and price, may be too restrictive in its constant-elasticity assumption; moreover, it tends to overestimate demand at very low and very high prices. At first glance, I would venture to say that the logistic function is the most robust and realistic of the three.

## Pricing with demand models

In a general setting, the total profit function is:

$\displaystyle L(p) = d(p)(p-c),$

where $L$ gives the profit, $d$ is the demand function, which depends on the price, and $c$ is the marginal cost. Taking the derivative with respect to price, we have:

$\displaystyle L'(p) = d'(p)(p - c) + d(p).$

Setting $L'(p) = 0$ to find the optimal price (first-order condition), we have:

$\displaystyle d'(p^\star)(p^\star - c) + d(p^\star) = 0,$
$\displaystyle d'(p^\star)p^\star + d(p^\star) = d'(p^\star)c,$

which is the famous condition that at the optimal price, marginal cost equals marginal revenue. Next, let's see how to calculate the optimal price for each demand function.

#### Linear model

For the linear model, $d'(p) = \alpha$. Hence:

$\displaystyle d'(p^\star)p^\star + d(p^\star) = d'(p^\star)c,$
$\displaystyle \alpha p^\star + \alpha p^\star + \beta = \alpha c,$
$\displaystyle p^\star = \frac{\alpha c - \beta}{2\alpha}.$
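As a quick sanity check, the closed form can be compared against a direct numerical maximization of the profit (a sketch; the parameter values match the synthetic example below):

```r
# Closed-form optimum vs. numerical maximization for linear demand
# (values from the example: alpha = -1.5, beta = 200, marginal cost 75)
alpha <- -1.5; beta <- 200; cost <- 75
profit <- function(p) (alpha*p + beta) * (p - cost)

p_star <- (alpha*cost - beta) / (2*alpha)     # closed-form optimum
opt <- optimize(profit, interval = c(75, 135), maximum = TRUE)

abs(p_star - opt$maximum) < 1e-2              # the two answers agree
```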

Example:

```r
library(tidyverse)

# Synthetic data
p = seq(80, 130)
d = linear(p, alpha = -1.5, beta = 200) + rnorm(length(p), sd = 5)
c = 75
profit = d*(p - c)

# Fit of the demand model
model1 = lm(d ~ p)
profit.fitted = model1$fitted.values*(p - c)

# Pricing optimization
alpha = model1$coefficients[2]
beta = model1$coefficients[1]
p.max.profit = (alpha*c - beta)/(2*alpha)

# Plots
df.linear = data.frame('Prices' = p, 'Demand' = d,
                       'Profit.fitted' = profit.fitted, 'Profit' = profit)

ggplot(select(df.linear, Prices, Demand)) + aes(x = Prices, y = Demand) +
  geom_point() + geom_smooth(method = lm)
```

```r
ggplot(select(df.linear, Prices, Profit)) + aes(x = Prices, y = Profit) +
  geom_point() + geom_vline(xintercept = p.max.profit, lty = 2) +
  geom_line(data = df.linear, aes(x = Prices, y = Profit.fitted), color = 'blue')
```

#### Constant elasticity model

For the constant-elasticity model, since $\lim_{\Delta p \rightarrow 0}\frac{\Delta d}{\Delta p} = d'(p)$, we have that:

$\displaystyle \epsilon = -\frac{\%\Delta d}{\%\Delta p} = -\frac{p\,\Delta d}{d\,\Delta p} = -\frac{d'(p)\,p}{d(p)}.$

Therefore,

$\displaystyle d'(p^\star)p^\star + d(p^\star) = d'(p^\star)c,$
$\displaystyle \frac{d'(p^\star)p^\star}{d(p^\star)} + 1 = \frac{d'(p^\star)c}{d(p^\star)},$
$\displaystyle -\epsilon + 1 = -\epsilon \frac{c}{p^\star},$
$\displaystyle p^\star = \frac{\epsilon c}{\epsilon - 1} = \frac{c}{1-1/\epsilon}.$

Moreover, knowing that $\frac{\%\Delta d}{\%\Delta p} \approx \frac{\Delta \ln d}{\Delta \ln p}$ and using the constant-elasticity model, we have that:

$\displaystyle \epsilon = -\lim_{\Delta p \rightarrow 0} \frac{\Delta \ln d}{\Delta \ln p} = -\frac{d\ln d}{d\ln p} = -\alpha = |\alpha|.$

Thus, the profit-maximizing price for the constant-elasticity model is:

$\displaystyle p^\star = \frac{c}{1 - \frac{1}{|\alpha|}}.$

It is interesting to note that one needs $|\alpha| > 1$; otherwise the profit function is convex with respect to price and the optimal price is $\infty$. In a monopolistic market, this assumption normally holds.
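A quick numerical illustration of this condition (the parameter values here are assumptions for the sketch): with $|\alpha| \le 1$ the profit keeps growing with price, while with $|\alpha| > 1$ it has an interior maximum.

```r
# Profit under constant-elasticity demand: C * p^alpha * (p - cost)
profit_ce <- function(p, alpha, C = 1, cost = 75) C * p^alpha * (p - cost)

p <- seq(100, 10000, by = 100)

# |alpha| <= 1: profit is monotonically increasing, so no finite optimum
all(diff(profit_ce(p, alpha = -0.8)) > 0)   # TRUE

# |alpha| > 1: profit eventually decreases, so an interior optimum exists
all(diff(profit_ce(p, alpha = -3)) > 0)     # FALSE
```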

Example:

```r
# Synthetic data
p = seq(80, 130)
d = constant_elast(p, alpha = -3, beta = 15)*exp(rnorm(length(p), sd = .15))
c = 75
profit = d*(p - c)

# Fitting of the demand model
model2 = lm(log(d) ~ log(p))
profit.fitted = exp(model2$fitted.values)*(p - c)

# Pricing optimization
alpha = model2$coefficients[2]
p.max.profit = c/(1 - 1/abs(alpha))

# Plots
df.const_elast = data.frame('Prices' = p, 'Demand' = d,
                            'Profit.fitted' = profit.fitted, 'Profit' = profit)

ggplot(select(df.const_elast, Prices, Demand)) + aes(x = log(Prices), y = log(Demand)) +
  geom_point() + geom_smooth(method = lm)
```

```r
ggplot(select(df.const_elast, Prices, Profit)) + aes(x = Prices, y = Profit) +
  geom_point() + geom_vline(xintercept = p.max.profit, lty = 2) +
  geom_line(data = df.const_elast, aes(x = Prices, y = Profit.fitted), color = 'blue')
```

#### Logistic model

For the logistic function, one can check that $d'(p) = \alpha d(p)(1 - d(p)/C)$. Thus:

$\displaystyle d'(p^\star)(p^\star - c) + d(p^\star) = 0,$
$\displaystyle \alpha d(p^\star)(1-d(p^\star)/C)(p^\star-c) + d(p^\star) = 0,$
$\displaystyle \alpha(1-d(p^\star)/C)(p^\star-c) + 1 = 0,$
$\displaystyle \frac{\alpha e^{-\alpha(p^\star - p_0)}(p^\star - c) + 1 + e^{-\alpha(p^\star - p_0)}}{1+ e^{-\alpha(p^\star - p_0)}} = 0,$
$\displaystyle [\alpha(p^\star-c)+1]e^{-\alpha(p^\star - p_0)} + 1 = 0.$

Since the last equation above has no analytical solution (at least we couldn't find one), the root can easily be found with a Newton-type algorithm or by solving a minimization problem. We will use the second approach, with the following formulation:

$\displaystyle \min_{p \in \mathbb{R}} \big([\alpha(p-c)+1]e^{-\alpha(p - p_0)} + 1\big)^2.$
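Before optimizing, it is easy to sanity-check the derivative identity $d'(p) = \alpha d(p)(1 - d(p)/C)$ used above against a finite-difference approximation (a sketch with parameter values matching the synthetic example below):

```r
# Check d'(p) = alpha * d(p) * (1 - d(p)/C) by central finite differences
logistic_d <- function(p, C, alpha, p0) C / (1 + exp(-alpha*(p - p0)))

C <- 120; alpha <- -0.15; p0 <- 115; p <- 90; h <- 1e-6
numerical <- (logistic_d(p + h, C, alpha, p0) - logistic_d(p - h, C, alpha, p0)) / (2*h)
analytic  <- alpha * logistic_d(p, C, alpha, p0) * (1 - logistic_d(p, C, alpha, p0)/C)

abs(numerical - analytic) < 1e-6   # the identity holds
```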

Example:

```r
# Objective functions for optimization
demand_objective = function(par, p, d) sum((d - logistic(p, par[1], par[2], par[3]))^2)
price_objective = function(p, alpha, c, p0) (exp(-alpha*(p - p0))*(alpha*(p - c) + 1) + 1)^2

# A cleaner alternative for the pricing optimization is to minimize the negative profit:
price_objective2 = function(p, c, alpha, C, p0) -logistic(p, C, alpha, p0)*(p - c)

# Synthetic data
p = seq(80, 130)
c = 75
d = logistic(p, 120, -.15, 115) + rnorm(length(p), sd = 10)
profit = d*(p - c)

# Demand fitting; we can't use lm anymore
par.start = c(max(d), 0, mean(p)) # initial guess for (C, alpha, p0)

demand_fit = optim(par = par.start, fn = demand_objective, method = 'BFGS',
                   p = p, d = d)

par = demand_fit$par # estimated parameters of the demand function
demand.fitted = logistic(p, c = par[1], alpha = par[2], p0 = par[3])
profit.fitted = demand.fitted*(p - c)

# Pricing optimization; we no longer have a closed-form expression
price_fit = optim(mean(p), price_objective, method = 'BFGS',
                  alpha = par[2], c = c, p0 = par[3])

# or

price_fit2 = optim(mean(p), price_objective2, method = 'BFGS',
                   c = c, C = par[1], alpha = par[2], p0 = par[3])

# both results are almost identical
p.max.profit = price_fit$par

# Graphics
df.logistic = data.frame('Prices' = p, 'Demand' = d, 'Demand.fitted' = demand.fitted,
                         'Profit.fitted' = profit.fitted, 'Profit' = profit)

ggplot(select(df.logistic, Prices, Demand)) + aes(x = Prices, y = Demand) +
  geom_point() +
  geom_line(data = df.logistic, aes(x = Prices, y = Demand.fitted), color = 'blue')
```

```r
ggplot(select(df.logistic, Prices, Profit)) + aes(x = Prices, y = Profit) +
  geom_point() + geom_vline(xintercept = p.max.profit, lty = 2) +
  geom_line(data = df.logistic, aes(x = Prices, y = Profit.fitted), color = 'blue')
```

I hope you liked the examples. In the next post we will discuss choice models, which are demand models for heterogeneous products. Goodbye and good luck!

## References

Phillips, Robert Lewis. Pricing and revenue optimization. Stanford University Press, 2005.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

`rquery` is an `R` package for specifying data transforms using piped Codd-style operators. It has already shown great performance on `PostgreSQL` and `Apache Spark`. `rqdatatable` is a new package that supplies a screaming fast implementation of the `rquery` system in-memory using the `data.table` package.

`rquery` is already one of the fastest and most teachable (due to deliberate conformity to Codd’s influential work) tools to wrangle data on databases and big data systems. And now `rquery` is also one of the fastest methods to wrangle data in-memory in `R` (thanks to `data.table`, via a thin adaption supplied by `rqdatatable`).

Teaching `rquery` and fully benchmarking it is a big task, so in this note we will limit ourselves to a single example and benchmark. Our intent is to use this example to promote `rquery` and `rqdatatable`, but frankly the biggest result of the benchmarking is how far out of the pack `data.table` itself stands at small through large problem sizes. This is already known, but it is a much larger difference and at more scales than the typical non-`data.table` user may be aware of.

The `R` package development candidate `rquery` 0.5.0 incorporates a number of fixes and improvements. One interesting new feature is that the `DBI` package is now suggested (optional) instead of required. This means `rquery` is ready to talk to non-`DBI` big data systems such as `SparkR` (example here), and it let us recruit a very exciting new `rquery` service provider: `data.table`!

`data.table` is, by far, the fastest way to wrangle data at scale in-memory in `R`. Our experience is that it starts to outperform base `R` internals and all other packages at moderate data sizes such as mere tens or hundreds of rows. Of course `data.table` is most famous for its performance in the millions of rows and gigabytes of data range.

However, because of the different coding styles there are not as many comparative benchmarks as one would like. So performance is often discussed as anecdotes or rumors. As a small step we are going to supply a single benchmark based on our “score a logistic regression by hand” problem from “Let’s Have Some Sympathy For The Part-time R User” (what each coding solution looks like can be found here).

In this note we compare idiomatic solutions to the example problem using: `rquery`, `data.table`, base `R` (using `stats::aggregate()`), and `dplyr`. `dplyr` is included due to its relevance and popularity. Full details of the benchmarking can be found here and full results here. One can always do more benchmarking and control for more in experiments. One learns more from a diversity of benchmarks than from critiquing any one benchmark, so we will work this example briefly and provide links to a few others benchmarks. Our measurements confirm the common (correct) observation and conclusion: that `data.table` is very fast. Our primary new observation is that the overhead from the new `rqdatatable` adapter is not too large and `rqdatatable` is issuing reasonable `data.table` commands.
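As a toy stand-in for the flavor of the benchmark task (grouped, per-subject work), the base-`R` control idiom looks like the following. The data and values below are invented for illustration, not the benchmark's actual data or computation:

```r
# Invented toy data, in the general shape of the benchmark's input
d <- data.frame(subjectID = rep(1:3, each = 2),
                surveyCategory = rep(c("withdrawal behavior", "positive re-framing"), 3),
                assessmentTotal = c(5, 2, 3, 4, 1, 6))

# Per-subject aggregation with stats::aggregate(), the base-R idiom benchmarked
agg <- aggregate(assessmentTotal ~ subjectID, data = d, FUN = max)
agg$assessmentTotal   # 5 4 6
```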

Both the `rquery` and `dplyr` solutions can be run in multiple modalities: allowing the exact same code to be used in memory or on a remote big data system (a great feature, critical for low-latency rehearsal and debugging). These two systems can be run as follows.

• `rquery` is a system for describing operator trees. It deliberately does not implement the data operators, but depends on external systems for implementations. Previously any sufficiently standard `SQL92` database that was `R` `DBI` compliant could serve as a back-end or implementation. This already includes the industrial scale database `PostgreSQL` and the big data system `Apache Spark` (via the `SparklyR` package). The 0.5.0 development version of `rquery` relaxes the `DBI` requirement (allowing `rquery` to be used directly with `SparkR`) and admits the possibility of non-`SQL` based implementations. We have a new `data.table` based implementation in development as the `rqdatatable` package.
• `dplyr` also allows multiple implementations (in-memory, `DBI` `SQL`, or `data.table`). We tested all three, and the `dplyr` pipeline worked identically in-memory and with `PostgreSQL`. However, the `dtplyr` pipeline did not generate valid `data.table` commands, due to an issue with window functions or ranking, so we were not able to time it using `data.table`.

We are thus set up to compare the following solutions to the logistic scoring problem:

• A direct `data.table` solution running in memory.
• The base `R` `stats::aggregate()` solution working on in-memory data.frames.
• The `rquery` solution using the `data.table` service provider `rqdatatable` to run in memory.
• The `rquery` solution sending data to `PostgreSQL`, performing the work in database, and then pulling results back to memory.
• The `dplyr` solution working directly on in-memory `data.frame`s.
• The `dplyr` solution working directly on in-memory `tibble::tbl`s (we are not counting any time for conversion).
• The `dplyr` solution sending data to `PostgreSQL`, performing the work in database, and then pulling results back to memory.

Running a 1,000,000 row by 13 column example can be summarized with the following graph.

The vertical dashed line is the median time that repeated runs of the base `R` `stats::aggregate()` solution took. We can consider results to the left of it as “fast” and results to the right of it as “slow.” Or in physical terms: `data.table` and `rquery` using `data.table` each take about 1.5 seconds on average. Whereas `dplyr` takes over 20 seconds on average. These two durations represent vastly different user experiences when attempting interactive analyses.

We have run some more tests to try to see how this is a function of problem scale (varying the number of rows of the data). Due to the large range (2 to 10,000,000 rows) we are using log scales, but they unfortunately are just not as readable as the linear scales.

What we can read off this graph includes:

• `data.table` is always the fastest system (or at worst indistinguishable from the fastest system) for this example, at the scales of problems tested, and for this configuration and hardware.
• The `data.table`-backed version of `rquery` becomes comparable to native `data.table` itself at around 100,000 rows. This is evidence that the translation overhead is not too bad for this example and that the sequence of `data.table` commands issued by `rqdatatable` is fairly good-practice `data.table`.
• The database backed version of `rquery` starts to outperform `dplyr` at around 10,000 rows. Note: all database measurements include the overhead of moving the data to the database and then moving the results back to `R`. This is slower than how one would normally use a database in production: with data starting and ending on the database and no data motion between `R` and the database.
• `dplyr` appears to be slower than the base `R` `stats::aggregate()` solution at all measured scales (it is always above the shaded region).
• It is hard to read, but changes in heights are ratios of runtimes. For example, the `data.table`-based solutions are routinely over 10 times faster than the `dplyr` solutions once we get to 100,000 rows or more. This is an object size of only about 10 megabytes, well below the usual “use `data.table` once you are in the gigabytes range” advice.

Of course benchmarks depend on the example problems, versions, and machines- so results will vary. That being said, large differences often have a good chance of being preserved across variations of tests (and we share another grouped example here, and a join example here; for the join example `dplyr` is faster at smaller problem sizes- so results do depend on task and scale).

We are hoping to submit the `rquery` update to `CRAN` in August and then submit `rqdatatable` as a new `CRAN` package soon after. Until then you can try both packages by a simple application of:

`devtools::install_github("WinVector/rqdatatable")`

These are new packages, but we think they can already save substantial development time, documentation time, debugging time, and machine investment in “R and big data” projects. Our group (Win-Vector LLC) is offering private training in `rquery` to get teams up to speed quickly.

Note: `rqdatatable` is an implementation of `rquery` supplied by `data.table`, not a `data.table` scripting tool (as `rqdatatable` does not support important `data.table` features not found in `rquery`, such as rolling joins).


## Fancy Plot (with Posterior Samples) for Bayesian Regressions

(This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

As Bayesian models usually generate a lot of samples (iterations), one might want to plot them as well, instead of (or along with) the posterior “summary” (with indices like the 90% HDI). This can be done quite easily by extracting all the iterations with `get_predicted` from the `psycho` package.

# The Model

```r
# devtools::install_github("neuropsychology/psycho.R")  # Install the latest psycho version if needed

library(tidyverse)
library(psycho)

# Import data
df <- psycho::affective

# Fit a logistic regression model
fit <- rstanarm::stan_glm(Sex ~ Adjusting, data=df, family = "binomial")
```

We fitted a Bayesian logistic regression to predict sex (W / M) from one's ability to flexibly adjust to his/her emotional reaction.

# Plot

To visualize the model, the neatest way is to extract a “reference grid” (i.e., a theoretical dataframe with balanced data). Our refgrid is made of equally spaced predictor values. With it, we can make predictions using the previously fitted model. This will compute the median of the posterior prediction, as well as the 90% credible interval. However, we're interested in keeping all the prediction samples (iterations). Note that `get_predicted` automatically transforms log odds ratios (the scale on which the model is expressed) into probabilities, which are easier to interpret.

```r
# Generate a new refgrid
refgrid <- df %>%
  psycho::refdata(length.out=10)

# Get predictions and keep iterations
predicted <- psycho::get_predicted(fit, newdata=refgrid, keep_iterations=TRUE)

# Reshape this dataframe to have iterations as factor
predicted <- predicted %>%
  tidyr::gather(Iteration, Iteration_Value, starts_with("iter"))

# Plot all iterations with the median prediction
ggplot(predicted, aes(x=Adjusting)) +
  geom_line(aes(y=Iteration_Value, group=Iteration), size=0.3, alpha=0.01) +
  geom_line(aes(y=Sex_Median), size=1) +
  ylab("Probability of being a man") +
  theme_classic()
```

# Credits

Did this package help you? Don't forget to cite the various packages you used.

You can cite `psycho` as follows:

• Makowski, D. (2018). The psycho Package: an Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470


## Ceteris Paribus Plots – a new DALEX companion

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

If you like magical incantations in Data Science, please welcome the Ceteris Paribus Plots. Otherwise feel free to call them What-If Plots.

Ceteris Paribus (Latin for “all else unchanged”) Plots explain complex Machine Learning models around a single observation. They supplement tools like breakDown, Shapley values, LIME or LIVE. In addition to feature importance/feature attribution, we can now see how the model response changes along a specific variable, keeping all other variables unchanged.

How do cancer risk scores change with age? How do credit scores change with salary? How do insurance costs change with age?

Well, use the ceterisParibus package to generate plots like the one below.
Here we have an explanation for a random forest model that predicts apartment prices. The presented profiles are prepared for a single observation, marked with dashed lines (a 130 m2 apartment on the 3rd floor). From these profiles one can read how the model response is linked with particular variables.
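The mechanics behind such a profile are simple to sketch in base R: vary one variable over a grid while freezing the others at the chosen observation's values, then score the model along the grid. In this sketch a plain `lm` stands in for the random forest, and the data and variable names are invented:

```r
# A minimal what-if / ceteris-paribus profile sketch (invented data)
set.seed(1)
df <- data.frame(surface = runif(200, 20, 150), floor = sample(1:10, 200, TRUE))
df$price <- 5000*df$surface - 2000*df$floor + rnorm(200, sd = 1e4)

model <- lm(price ~ surface + floor, data = df)
obs <- df[1, ]   # the single observation to explain

# Vary 'surface' over a grid, hold 'floor' at the observation's value
grid <- data.frame(surface = seq(20, 150, length.out = 50), floor = obs$floor)
profile <- predict(model, newdata = grid)   # the ceteris-paribus curve
```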

Instead of the original values on the x-axis, one can plot quantiles. This way one can put all variables in a single plot.

And once all variables are in the same scale, one can compare two or more models.

Yes, they are model agnostic and will work for any model!
Yes, they can be interactive (see plot_interactive function or examples below)!
And yes, you can use them with other DALEX explainers!
More examples with R code.


## StatCheck the Game

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you don’t get enough joy from publishing scientific papers in your day job, or simply want to experience what it’s like to be in a publish-or-perish environment where the P-value is the only important part of a paper, you might want to try StatCheck: the board game where the object is to publish two papers before any of your opponents.

As the game progresses, players combine “Test”, “Statistic” and “P-value” cards to form the statistical test featured in the paper (and of course, significant tests are worth more than non-significant ones). Opponents may then have the opportunity to play a “StatCheck” card to challenge the validity of the test, which can then be verified using a companion R package or online Shiny application. Other modifier cards include “Bayes Factor” (which can be used to boost the value of your own papers, or diminish the value of an opponent's), “Post-Hoc Theory” (improving the value of already-published papers), and “Behind the Paywall” (making it more difficult to challenge the validity of your statistics).

StatCheck The Game was created by Sacha Epskamp and Adela Isvoranu, who provide all the code to create the cards as open source on GitHub, along with instructions to print and play with your own game materials. You can find everything you need (except the required 8-sided die and some like-minded friends to play with) at the link below.


## My book ‘Practical Machine Learning in R and Python: Second edition’ on Amazon

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

The second edition of my book ‘Practical Machine Learning with R and Python – Machine Learning in stereo’ is now available in both paperback ($10.99) and Kindle ($7.99/Rs 449) versions. This second edition includes more content, extensive comments and formatting for better readability.

In this book I implement some of the most common, but important, Machine Learning algorithms in R and equivalent Python code.
1. Practical Machine Learning with R and Python: Second Edition – Machine Learning in Stereo (Paperback, $10.99)
2. Practical Machine Learning with R and Python: Second Edition – Machine Learning in Stereo (Kindle, $7.99/Rs 449)

This book is ideal both for beginners and for experts in R and/or Python. Those starting their journey into data science and ML will find the first 3 chapters useful, as they touch upon the most important programming constructs in R and Python and also deal with equivalent statements in R and Python. Those who are expert in either of the languages will find the equivalent code ideal for brushing up on the other language. And finally, those who are proficient in both languages can use the R and Python implementations to internalize the ML algorithms better.

Here is a look at the topics covered

Preface …………………………………………………………………………….4
Introduction ………………………………………………………………………6
1. Essential R ………………………………………………………………… 8
2. Essential Python for Datascience ……………………………………………57
3. R vs Python …………………………………………………………………81
4. Regression of a continuous variable ……………………………………….101
5. Classification and Cross Validation ………………………………………..121
6. Regression techniques and regularization ………………………………….146
7. SVMs, Decision Trees and Validation curves ………………………………191
8. Splines, GAMs, Random Forests and Boosting ……………………………222
9. PCA, K-Means and Hierarchical Clustering ………………………………258
References ……………………………………………………………………..269

Hope you have a great time learning as I did while implementing these algorithms!


## Praise you like I should: Shiny Appreciation Month

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

##### Aimée Gott, Education Practice Lead

Back in the summer of 2012 I was meant to be focusing on one thing: finishing my thesis. But, unfortunately for me, a friend and former colleague came back from a conference (JSM) and told me all about a new package that she had seen demoed.

“You should sign up for the beta testing and try it out,” she said.

So, I did.

That package was Shiny and after just a couple of hours of playing around I was hooked. I was desperate to find a way to incorporate it into my thesis, but never managed to; largely due to the fact it wasn’t available on CRAN until a few months after I had submitted and because, at the time, it was quite limited in its functionality. However, I could see the potential – I was really excited about the ways it could be used to make analytics more accessible to non-technical audiences. After joining Mango I quickly became a Shiny advocate, telling everyone who would listen about how great it was.

Six years on at Mango, not a moment goes by when somebody in the team isn’t using Shiny for something. From prototyping to large scale deployments, we live and breathe Shiny. And we are extremely grateful to the team at RStudio—led by Joe Cheng—for the continued effort that they are putting in to its development. It really is a hugely different tool to the package I beta tested so long ago.

As Shiny has developed and the community around it has grown, so too has the need to teach it, because more people than ever are looking to become Shiny users. For a number of years we have been teaching the basics of Shiny to those who want to get started, and more serious development tools to those who want to deploy apps in production. But increasingly we have seen a demand for more, and as the Shiny team have added more and more functionality, it was time for a major update to our teaching materials.

Over the past six months we have had many long discussions over what functionality should be included. We have debated best practices, we have drawn on all of our combined experiences of both learning and deploying Shiny, and we eventually reached a consensus over what we felt was best for industry users of Shiny to learn.

We are now really pleased to announce an all new set of Shiny training courses.

Our courses cover everything from taking your first steps in building a Shiny application, to building production-ready applications and a whole host of topics in between. For those who want to take a private course we can tailor to your needs, and topics as diverse as getting the most from tables in DT to managing database access in apps can all be covered in just a few days.

For us, an important element of these courses, is that they are all taught by data science consultants who have hands-on experience building and deploying apps for commercial use. These consultants are supported by platform experts who can advise on the best approaches for getting an app out to end users so that you can see the benefits of using Shiny as quickly as possible.

But, one blog post was never going to be enough for all of the Shiny enthusiasts at Mango to share their passion. We needed more time, more than one blog post and more ways to share with the community.

Therefore, Mango are declaring June to be Shiny Appreciation Month!

For the whole of June, we will be talking all things Shiny. Follow us on Twitter where we will be sharing tips, ideas and resources. To get involved, share your own with us and the Shiny community, using #ShinyAppreciation. On the blog we will be sharing, among other things, some of the ways we are using Shiny in industry and some of the technical challenges we have had to overcome.

Watch this space for updates but, for now, if you want to know more about the Shiny training that we offer, take a look at our training pages. If you are based in the UK we will be running public Shiny courses in London (see below for the currently scheduled dates). We will also be offering a snapshot of the materials for intermediate Shiny users at London EARL in September.

#### Public course dates:

Introduction to Shiny: 17th July
Intermediate Shiny: 18th July, 5th September


## Coloring Sudokus

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Someday you will find me
caught beneath the landslide
(Champagne Supernova, Oasis)

I recently read a book called Snowflake Seashell Star: Colouring Adventures in Numberland by Alex Bellos and Edmund Harriss, which is full of mathematical patterns to be coloured. All the images are truly appealing and attract anyone who looks at them, independently of age, gender, education or political orientation. This book demonstrates how maths is an astonishing way to reach beauty.

One of my favourite patterns is the tridoku, a sophisticated colored version of the sudoku. Coloring a sudoku is simple: once it is solved, it is enough to assign a color to each number (from 1 to 9). If you superimpose three colored sudokus in which no cells at the same position share the same color, again using nine colors, the resulting image is a tridoku:

There is something attractive in a tridoku due to the balance of colors, but they also seem quite messy: they are charmingly unbalanced. I wrote a script to generalize the concept to n-dokus. The idea is the same: superimpose n sudokus without cells sharing color and position (I call them disjoint sudokus), using just nine different colors. I didn't prove it, but I think the maximum number of sudokus that can be superimposed under these constraints is 9. This is a complete series from 1-doku to 9-doku (click on any image to enlarge):

I am a big fan of the `colourlovers` package. These tridokus are colored with some of my favourite palettes from there:

Just two technical things to highlight:

• There is a package called sudoku that generates sudokus (of course!). I use it to obtain the first solved sudoku, which forms the base.
• Subsequent sudokus are obtained from this one by two operations: first interchanging groups of columns (there are three groups: columns 1 to 3, 4 to 6 and 7 to 9), and then interchanging columns within each group.
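The column-group swap mentioned above is easy to sketch: permuting whole groups of three columns of a valid sudoku yields another valid sudoku. The base grid below is built with a standard pattern formula (an assumption for the sketch; the post itself uses the sudoku package instead):

```r
# A valid solved sudoku from the classic pattern (i*3 + i %/% 3 + j) mod 9
base_sudoku <- outer(0:8, 0:8, function(i, j) (i*3 + i %/% 3 + j) %% 9 + 1)

# Interchange the three column groups: (1:3, 4:6, 7:9) -> (4:6, 7:9, 1:3)
swapped <- base_sudoku[, c(4:6, 7:9, 1:3)]

# Row/column check: every row and every column is a permutation of 1..9
# (the box constraint also survives whole-group column swaps)
valid <- function(m) all(apply(m, 1, function(r) all(sort(r) == 1:9))) &&
                     all(apply(m, 2, function(cl) all(sort(cl) == 1:9)))
valid(base_sudoku) && valid(swapped)   # TRUE
```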

You can find the code here: make your own colored n-dokus!