Exploring Recursive CTEs with sqldf

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Bob Horton
Sr. Data Scientist at Microsoft

Common table expressions (CTEs, or “WITH clauses”) are a syntactic feature in SQL that makes it easier to write and use subqueries. They act as views or temporary tables that are only available during the lifetime of a single query. A more sophisticated feature is the “recursive CTE”, which is a common table expression that can call itself, providing a convenient syntax for recursive queries. This is very useful, for example, in following paths of links from record to record, as in graph traversal.

This capability is supported in Postgres and Microsoft SQL Server (Oracle has similar capabilities with a different syntax), but not in MySQL. Perhaps surprisingly, it is supported in SQLite, and since SQLite is the default backend for sqldf, this gives R users a convenient way to experiment with recursive CTEs.

Factorials

This is the example from the Wikipedia article on hierarchical and recursive queries in SQL; you just pass it to sqldf and it works.

library('sqldf')

sqldf("WITH RECURSIVE temp (n, fact) AS 
(SELECT 0, 1 -- Initial Subquery
  UNION ALL 
 SELECT n+1, (n+1)*fact FROM temp -- Recursive Subquery 
        WHERE n < 9)
SELECT * FROM temp;")
##    n   fact
## 1  0      1
## 2  1      1
## 3  2      2
## 4  3      6
## 5  4     24
## 6  5    120
## 7  6    720
## 8  7   5040
## 9  8  40320
## 10 9 362880

Other databases may use slightly different syntax (for example, if you want to run this query in Microsoft SQL Server, you need to leave out the word RECURSIVE), but the concept is pretty general. Here the recursive CTE named temp is defined in a WITH clause. As usual with recursion, you need a base case (here labeled “Initial Subquery”) and a recursive case (“Recursive Subquery”) that performs a SELECT on the CTE itself. The two cases are put together with UNION ALL (basically the SQL equivalent of rbind). The last line of the query kicks off the computation by running a SELECT statement against this CTE.
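The same pattern works for any sequence defined in terms of the previous row. For instance, here is a Fibonacci variant along the same lines (a minimal extra sketch) that carries two running values through the recursion:

sqldf("WITH RECURSIVE fib (n, a, b) AS 
(SELECT 1, 0, 1 -- Initial Subquery
  UNION ALL 
 SELECT n+1, b, a+b FROM fib -- Recursive Subquery 
        WHERE n < 10)
SELECT n, a AS fibonacci FROM fib;")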

Family Tree

Let’s make a toy family tree, so we can use recursion to find all the ancestors of a given person.

family <- data.frame(
  person = c("Alice", "Brian", "Cathy", "Danny", "Edgar", "Fiona", "Gregg", "Heidi", "Irene", "Jerry", "Karla"),
  mom    = c(rep(NA, 4), c('Alice', 'Alice', 'Cathy', 'Cathy', 'Cathy', 'Fiona', 'Fiona')),
  dad    = c(rep(NA, 4), c('Brian', 'Brian', 'Danny', 'Danny', 'Danny', 'Gregg', 'Gregg')),
  stringsAsFactors = FALSE)

We can visualize this family tree as a graph:

library(graph)
nodes <- family$person
# each person's outgoing edges point to his or her non-missing parents
edges <- apply(family, 1, function(r) {
  r <- r[c("mom", "dad")]
  r <- r[!is.na(r)]
  list(edges=r)
})
names(edges) <- names(nodes) <- nodes
g <- graphNEL(nodes=nodes, edgeL=edges, edgemode='directed')

library(Rgraphviz) # from Bioconductor
g <- layoutGraph(g, layoutType="dot", attrs=list(graph=list(rankdir="BT")))
renderGraph(g)

Pointing from child to parents is backwards from how family trees are normally drawn, but this reflects how the table is laid out. I built the table this way because everybody always has exactly two biological parents, regardless of family structure.

SQLite only supports a single recursive call, so we can’t recurse on both the mom and dad columns. To be able to trace back through both parents, I put the table in “long form”; now each parent is entered in a separate row, with ‘mom’ and ‘dad’ being values in a new column called ‘parent’.

library(tidyr)

long_family <- gather(family, parent, parent_name, -person)

knitr::kable(head(long_family))

person   parent   parent_name
-------  -------  -----------
Alice    mom      NA
Brian    mom      NA
Cathy    mom      NA
Danny    mom      NA
Edgar    mom      Alice
Fiona    mom      Alice

Now we can use a recursive CTE to find all the ancestors in the database for a given person:

ancestors_sql <- "
WITH ancestors (name, parent, parent_name, level) AS (
  SELECT person, parent, parent_name, 1 FROM long_family WHERE person = '%s'
        UNION ALL
    SELECT A.person, A.parent, A.parent_name, P.level + 1 
        FROM ancestors P
        JOIN long_family A
        ON P.parent_name = A.person)
SELECT * FROM ancestors ORDER BY level, name, parent"

sqldf(sprintf(ancestors_sql, 'Jerry'))
##     name parent parent_name level
## 1  Jerry    dad       Gregg     1
## 2  Jerry    mom       Fiona     1
## 3  Fiona    dad       Brian     2
## 4  Fiona    mom       Alice     2
## 5  Gregg    dad       Danny     2
## 6  Gregg    mom       Cathy     2
## 7  Alice    dad        <NA>     3
## 8  Alice    mom        <NA>     3
## 9  Brian    dad        <NA>     3
## 10 Brian    mom        <NA>     3
## 11 Cathy    dad        <NA>     3
## 12 Cathy    mom        <NA>     3
## 13 Danny    dad        <NA>     3
## 14 Danny    mom        <NA>     3
sqldf(sprintf(ancestors_sql, 'Heidi'))
##    name parent parent_name level
## 1 Heidi    dad       Danny     1
## 2 Heidi    mom       Cathy     1
## 3 Cathy    dad        <NA>     2
## 4 Cathy    mom        <NA>     2
## 5 Danny    dad        <NA>     2
## 6 Danny    mom        <NA>     2
sqldf(sprintf(ancestors_sql, 'Cathy'))
##    name parent parent_name level
## 1 Cathy    dad        <NA>     1
## 2 Cathy    mom        <NA>     1

We can go the other way as well, and find all of a person’s descendants:

descendants_sql <- "
WITH RECURSIVE descendants (name, parent, parent_name, level) AS (
  SELECT person, parent, parent_name, 1 FROM long_family 
    WHERE person = '%s'
    AND parent = '%s'
    UNION ALL
    SELECT F.person, F.parent, F.parent_name, D.level + 1 
        FROM descendants D
        JOIN long_family F
        ON F.parent_name = D.name)
SELECT * FROM descendants ORDER BY level, name
"

sqldf(sprintf(descendants_sql, 'Cathy', 'mom'))
##    name parent parent_name level
## 1 Cathy    mom        <NA>     1
## 2 Gregg    mom       Cathy     2
## 3 Heidi    mom       Cathy     2
## 4 Irene    mom       Cathy     2
## 5 Jerry    dad       Gregg     3
## 6 Karla    dad       Gregg     3

If you work with tree- or graph-structured data in a database, recursive CTEs can make your life much easier. Having them on hand in SQLite, and usable through sqldf, makes it very easy to start learning how to use them.


The Case for a Data Science Lab

By Mark Sellors

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Mark Sellors, Technical Architect – Mango Solutions

As more and more Data Science moves from individuals working alone, with small data sets on their laptops, to more productionised, or analytically mature settings, an increasing number of restrictions are being placed on Data Scientists in the workplace.

Perhaps your organisation has standardised on a particular version of Python or R, or perhaps you’re using a limited subset of the available big data tools. This sort of standardisation can be incredibly empowering for the business. It ensures all analysts are working with a common set of tools and allows analyses to be run anywhere across the organisation. It doesn’t matter whether it’s a laptop, a server, or a large-scale cluster: Data Scientists, and the wider business, can be safe in the knowledge that the versions of the analytic tools are the same in each environment.

While incredibly useful for the business, this can, at times, feel very restricting for the individual Data Scientist. Maybe you want to try a new package that isn’t available for your ‘official’ version of R, or you want to try a new tool or technique that hasn’t made it into your officially supported environment yet. In all of these instances a Data Science Lab or Analytics Lab environment can prove invaluable for keeping pace with the fast-paced data science world outside your organisation.

An effective lab environment should be designed from the ground up to support innovation, both with new tools and with new techniques and approaches. It’s rare that any two labs would be the same from one organisation to the next; however, the principles behind the implementation and operation are universal. The lab should provide a sandbox of sorts, where Data Scientists can work to improve what they do currently, as well as prepare for the challenges of tomorrow. A well-implemented lab can be a source of immense value to its users, as it can be a space for continual professional development. The benefits to the business, however, can be even greater. By giving your Data Scientists the opportunity to help drive the requirements for your future analytic solutions, with those solutions based on solid foundations derived from experiments and testing performed in the lab, the business can achieve and maintain true analytic maturity and meet new analytic challenges head-on.

In order to successfully implement a lab in your business, you must first establish the need. If your Data Scientists are using whatever tools are handy and nobody has a decent grasp on what tools are used, with what additional libraries, and at what versions, then you have bigger fish to fry right now and should come back when that’s sorted out!

If your business’s analytic landscape is well understood and documented, the next task is to identify and distil your existing tool set into a set of core tools. As these tools constitute the day-to-day analytic workhorses of your business, they will form the backbone of the lab. In a lot of cases this may be a particular Hadoop distribution and version, or perhaps a particular version of Python with scikit-learn and numpy, or a combination.

The next step can often be the most challenging, as it usually requires moving outside the Data Science or Advanced Analytics team and working closely with your IT department to provision the environments on which the lab will be based. Naturally, if you’re lucky enough to have a suitable Data Engineer or DataOps professional on your team, then you may avoid this requirement. A lot of that is going to depend on the agility model of your business and how reliant on strict silos it is.

Ideally, any environments provisioned at this stage should be capable of being rapidly re-provisioned and re-purposed as needs arise, so working with a modern infrastructure is a high priority. It’s often wise at this stage to consider some form of image management for containers or VMs, to speed deployment and ensure environments are properly managed. You need to be able to adapt the environment to the changing needs of the user base with the minimum of effort and fuss.

Once you have rapidly deployable environments at your disposal, you’re ready to start work. What form that work takes should be left largely up to your Data Science team, but broadly speaking they should be free to use and evaluate new tools or approaches. Remember, the lab is not a place where production work is done with ad hoc tools, it’s a safe space for experimentation and innovation, just like a real laboratory environment. Using the knowledge gained from running tests or trials in the lab however, can and should inform the evolution of your production tools and techniques.

A final word of warning for the business: A successful lab environment can’t be achieved through lip-service. The business must set aside time for Analysts or Data Scientists to develop the future analytic solutions that are increasingly becoming central to the success of the modern business.

For more information, or to get help building out an Analytics Lab of your own, or even if you’re just starting your journey on the path to analytic maturity, contact info@mango-solutions.com


Data Science Radar – Programmer Profile

By Mango Blogger

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

by Doug Ashton, Mango Solutions @dougashton

Doug Ashton Data Science Radar – Nov 2015

1. Tell us a bit about your background in Data Science.

I was a physicist for 10 years where I used Monte Carlo simulations to solve problems in materials and networks. Anything with lots of interacting components was my domain. Since moving to Data Science I’ve found that the mathematics and tools of statistical physics had set me up perfectly. C++ was my primary tool for raw speed but R and Python were always part of my stack for analysis and plotting.

2. How would you describe what a Programmer is in your own words?

Code needs to make sense months after the project. Writing code to be understood is nearly as important as what it does.

3. Were you surprised at your Data Science Radar profile result? Please explain.

Yes, a little. I’m fairly opinionated about coding style, which probably explains it. I thought technology would rate higher.

4. Is knowing this information beneficial to shaping your career development plan? If so, how?

The radar in general makes you think about which axes you want to push out, rather than aiming for a perfect circle. I’m happier to let Communicator go and push out the things I really care about.

5. How do you apply your skills as a Programmer at Mango Solutions?

We use the full DevOps stack on any size of analysis. By writing good docs and unit tests, and using continuous integration, I feel much more confident delivering the final product.

6. If someone wanted to develop their Programmer skills further, what would you recommend?

Get on GitHub and see how other people are doing it. Use all the tools of unit testing and documentation. Make a pull request and see what happens! Always version control.

7. Which of your other highest scoring skills on the Radar complements a Programmer skill set and why?

Technologist. These days you have to integrate with so many technologies that you need a high-level view of what’s available. Again, DevOps requires managing build servers and automated testing. You can’t just know one language.

8. What’s your favourite Programming Tool?

Git and GitHub. Getting into a rhythm of commits, branches, issue tracking and CI (via Travis) makes you so much better at your job. Equivalents such as GitLab, Jira/Bitbucket and Jenkins are also great choices.

Create your own Data Science Radar here


Hypnotical Fermat

By aschinchon

(This article was first published on Ripples, and kindly contributed to R-bloggers)

You can tell from her voice; inside, she is made of colours (Si te vas, Extremoduro)

This is a GIF generated from 25 plots of Fermat’s spiral, a parabolic curve defined by the following expression:

r = ±a·θ^(1/2)

where r is the radius, θ (theta) is the polar angle and a is simply a compression constant.
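In Cartesian coordinates this is x = a·θ^(1/2)·cos(θ), y = a·θ^(1/2)·sin(θ), which is exactly the parametrisation used in the code below. A quick check of a few points (a minimal sketch, with a = 1):

theta <- seq(0, 4*pi, length.out = 9)
a <- 1
r <- a * sqrt(theta)  # Fermat's spiral: r = a * theta^(1/2)
round(cbind(x = r * cos(theta), y = r * sin(theta)), 2)
# the mirrored branch (-x, -y) is what the rbind(-.) trick in the plotting code adds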

Fermat showed this nice spiral in 1636 in a manuscript called Ad locos planos et solidos Isagoge (I love the title). Instead of using paths, I use a polygon geometry to obtain bullseye-style plots:

Playing with this spiral is quite addictive. Try changing colors, rotating, changing the geometry… You can easily discover cool images like this without any effort:
Enjoy!

library(ggplot2)
library(magrittr)

setwd("YOUR-WORKING-DIRECTORY-HERE")

# a bare theme: no legend, grid, ticks or axis text
opt = theme(legend.position = "none",
            panel.background = element_rect(fill = "white"),
            panel.grid = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            axis.text = element_blank())

for (n in 1:25) {
  t = seq(from = 0, to = n*pi, length.out = 500*n)
  # rbind(-.) appends the point-reflected copy (-x, -y): the second branch of the spiral
  data.frame(x = t^(1/2)*cos(t), y = t^(1/2)*sin(t)) %>% rbind(-.) -> df
  p = ggplot(df, aes(x, y)) + geom_polygon() +
    scale_x_continuous(expand = c(0, 0), limits = c(-9, 9)) +
    scale_y_continuous(expand = c(0, 0), limits = c(-9, 9)) + opt
  ggsave(filename = paste0("Fermat", sprintf("%03d", n), ".jpg"), plot = p, width = 3, height = 3)
}
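As a starting point for such variations, here is the same curve drawn as a single frame with geom_path and a colour gradient instead of a filled polygon (a minimal sketch, not part of the animation loop above):

library(ggplot2)

t <- seq(from = 0, to = 20*pi, length.out = 5000)
df <- data.frame(x = sqrt(t)*cos(t), y = sqrt(t)*sin(t), t = t)
ggplot(df, aes(x, y, colour = t)) +
  geom_path(size = 0.3) +
  scale_colour_gradient(low = "steelblue", high = "tomato") +
  coord_equal() +
  theme_void() +
  theme(legend.position = "none")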


Crowd sourced benchmarks

By csgillespie

(This article was first published on Why? » R, and kindly contributed to R-bloggers)

When discussing how to speed up slow R code, my first question is: what is your computer spec? It always surprises me when complex biological experiments, costing a significant amount of money, are analysed using a six-year-old laptop. A new desktop machine costs around £1000, and that money would be saved within a month in user time. Typically, the more RAM you have, the larger the dataset you can handle. The benefit of upgrading the processor, however, is less obvious.

To quantify the impact of the CPU on an analysis, I’ve created a simple benchmarking package. The aim of this package is to provide a set of benchmark routines, along with data from past runs. You can then compare your machine with other CPUs. The package currently isn’t on CRAN, but you can install it via my drat repository:


install.packages("drat")
drat::addRepo("csgillespie")
install.packages("benchmarkme")

You can load the package in the usual way, and view past results via


library("benchmarkme")
plot_past()

to get a plot of the results from previously benchmarked machines.

Currently, around forty machines have been benchmarked. To benchmark and compare your own system, just run


## On slower machines, reduce runs.
res = benchmark_std(runs=3)
plot(res)

This gives a plot of your timings alongside the past results.

The final step is to upload your benchmarks:


## You can control exactly what is uploaded. See the help page
upload_results(res)

The current record is held by an Intel(R) Core(TM) i7-4712MQ CPU.
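Since the whole point is to compare CPUs, it can be handy to check what your own hardware looks like before uploading (a minimal sketch, assuming your version of benchmarkme provides the get_cpu() and get_ram() helpers):

library("benchmarkme")
get_cpu()  # processor model reported by your system
get_ram()  # total amount of RAM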


Notes from Warsaw R meetup

By Markus Gesmann

(This article was first published on mages’ blog, and kindly contributed to R-bloggers)

I had the great pleasure of attending the Warsaw R meetup last Thursday. The organisers, Olga Mierzwa and Przemyslaw Biecek, had put together an event with a focus on R in Insurance (btw, there is a conference with the same name), discussing examples of pricing and reserving in general and life insurance.

Experience vs. Data

I kicked off with some observations on the challenges in insurance pricing. Accidents are thankfully rare events; that’s why we buy insurance. Hence, there is often not a lot of claims data available for pricing. Combining the information from historical data with the domain knowledge of experts can provide a rich basis for the assessment of risk. I presented some examples using Bayesian analysis to understand the probability of an event occurring. Regular readers of my blog will be familiar with these examples.

Download slides

Non-life insurance in R

Emilia Kalarus from Triple A shared some of her experience of using R in non-life insurance companies. She focused on the challenges in working across teams, with different systems, data sets and mentalities.

As an example, Emilia talked about the claims reserving process, which in her view should be embedded in the full life cycle of insurance, namely product development, claims, risk and performance management. Following this thought, she presented an idea for claims reserving that models the life of a claim from not incurred and not reported (NINR), to incurred but not reported (IBNR), to reported but not settled (RBNS), and finally paid.

Stochastic mortality modelling

The final talk was given by Adam Wróbel from the life insurer Nationale Nederlanden, discussing stochastic mortality modelling. Adam’s talk on analysing mortality made me realise that life and non-life insurance companies may be much closer to each other than I thought.

Although life and non-life companies are usually separated for regulatory reasons, they both share the fundamental challenge of predicting future cash-flows. An example where the two industries meet is product liability.

Over the last century technology has changed our environment fundamentally, more so than ever before. Yet we still don’t know what long-term impact some of the new technologies and products will have on our life expectancy. Some will prolong our lives; others may make us ill.

A classic example is asbestos, initially regarded as a miracle mineral: impossible to set on fire, abundant, cheap to mine, and easy to manufacture. Not surprisingly, it was widely used, until it was linked to cancer. Over the last 35 years the non-life insurance industry has paid well in excess of a hundred billion dollars in compensation.

This post was originally published on mages’ blog.


Installing RStudio Shiny Server on AWS

By gluc

(This article was first published on ipub » R, and kindly contributed to R-bloggers)

In this beginner’s level tutorial, you’ll learn how to install Shiny Server on an AWS cloud instance, and how to configure the firewall. It will take just a few minutes!

Why?

Playing around with Shiny is simple enough: all you need is the R package called shiny, which you can get directly from CRAN.

Making your work available to your mentor is also straightforward: open an account on shinyapps.io and deploy your application directly from RStudio.

Blogging about your Shiny app is a different story: you might have hundreds of hits in a day, and soon enough your application will hit the max hours available for free on shinyapps.io. As a result, your app will stop working.

Another situation in which you might want to deploy your own Shiny server is if you need access to a database behind a firewall (see Shiny Crud), or if you want to restrict access to your app to people within your subnet (e.g. within your intranet).

Prerequisites

This tutorial builds on the following tutorial: Setting up an AWS instance for R, RStudio, OpenCPU, or Shiny Server. So we assume that you have a working AWS EC2 Ubuntu instance with a recent version of R installed.

Also, if you are interested in Shiny in general, I recommend this introductory post.

Other References and Links

Shiny Server Open Source is free and has extensive documentation. However, getting started is not so easy, as it is not always clear which documents apply to Shiny Server Pro (the commercial offering) and which to Shiny Server Open Source.

The official installation instructions from RStudio, the company behind Shiny, can be found at this link.

And here is a similar guide for digitalocean, a competitor of AWS.

Installing Shiny Server Open Source

This section does not depend on AWS, so I am assuming only that you have a running Ubuntu instance that you can access via SSH, and that the most recent version of R is installed. If any of this is not the case, see here.

Otherwise, you should see a window like this:

As a first step, we install the R Shiny package. The following command not only installs the package, but also makes sure that it is available for all users on your machine:

sudo su - -c "R -e "install.packages('shiny', repos='https://cran.rstudio.com/')""

This might take a while, as all shiny dependencies are downloaded as well.

Next, you need to install Shiny server itself, by typing the following commands:

sudo apt-get install gdebi-core
wget https://download3.rstudio.org/ubuntu-12.04/x86_64/shiny-server-1.4.1.759-amd64.deb
sudo gdebi shiny-server-1.4.1.759-amd64.deb

You might want to replace the version number of Shiny Server with the latest available release, as published here. However, leave the Ubuntu version (12.04) as is.

When prompted whether you want to install the software package, press y, of course.

Your Shiny server is now installed. But before we can test it, there are two things missing:

  1. we need to install an app
  2. we need to open the port, so your Shiny server can be accessed from the outside world

Install Sample app

To install the sample app that is provided by the Shiny installer, type the following into your console:

<span class="kw">sudo</span> /opt/shiny-server/bin/deploy-example default

Again, type y if prompted.

Configuring Firewall

In order to be able to connect to Shiny Server, you might need to open the port on which Shiny Server listens. By default, this is port 3838.

On AWS, you can open the port by configuring the Security Group of your instance. Go to Instances, select your instance, and then click on the Security Group in the instance’s detail section:

This will bring you to the Security Groups configuration screen. Click on the Inbound tab. By default, AWS instances launched with the wizard have only port 22 open, which is needed to SSH in. No wonder we cannot access our instance!

Click on Edit and add a custom TCP rule, like so:

Open your favorite browser and enter the following address:

http://54.93.177.63:3838/sample-apps/hello/

Replace the IP address (54.93.177.63, in our example) with the public IP address of your instance, which is the same with which you connected to your instance. If everything went fine, you will see something like this:

And that’s it!

Basic Configuration and Administration

Though not the main goal of this post, let’s look at a few basic configuration options.

Start and Stop

To start and stop a Shiny Server, execute these commands:

<span class="kw">sudo</span> start shiny-server
<span class="kw">sudo</span> stop shiny-server

Configuration

Shiny Server is mainly configured via the following file:

/etc/shiny-server/shiny-server.conf

We can use the minimalistic text editor Nano to edit the configuration file. Type

sudo nano /etc/shiny-server/shiny-server.conf

You should see something like this:
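For reference, the default configuration shipped with Shiny Server Open Source looks roughly like this (details may vary slightly between versions):

# Instruct Shiny Server to run applications as the user "shiny"
run_as shiny;

# Define a server that listens on port 3838
server {
  listen 3838;

  # Define a location at the base URL
  location / {

    # Host the directory of Shiny Apps stored in this directory
    site_dir /srv/shiny-server;

    # Log all Shiny output to files in this directory
    log_dir /var/log/shiny-server;

    # When a user visits the base URL rather than a particular application,
    # an index of the applications available in this directory will be shown.
    directory_index on;
  }
}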

For example, you could now change the port to 80, letting your users connect without specifying a port, e.g. like so:

http://54.93.177.63/sample-apps/hello/

To do that, you need to perform the following steps:

  1. In Nano, change the port to 80
  2. Save the file by hitting Ctrl+X and answering Yes
  3. Restart the server by typing
    sudo restart shiny-server
  4. Open port 80 in the AWS EC2 Security Group by adding a custom TCP rule for port 80, as described above

I hope you enjoyed this tutorial. In the next post, we’ll describe how to enable secure https connections in Shiny Server Open Source, and we’ll explain why you would want to do this.

The post Installing RStudio Shiny Server on AWS appeared first on ipub.


Average Expenses for TV across states of USA

By Pradeep Mavuluri

(This article was first published on Coastal Econometrician Views, and kindly contributed to R-bloggers)
This post attempts to depict the average amount spent on TV channel expenses across the states of a large country (the USA). Although it was developed using sample data belonging to a particular service provider, the interest here is in regional differences in average spend on that service across the country. Note that some US states are economically more important and notably better connected, with multiple service providers, and that geographical location and population density also differ, so the results/insights may be specific to this sample data. All the analysis was carried out with R (R-3.2.2, RStudio (favorite IDE), and ggplot2 for mapping).

Average Amount Spent ($) on TV by State in 2015 (through November):

The figure below maps the 48 US states for which data were available, showing the average TV expense by state for 2015 (up to the end of November), with five different colors (i.e. five different intervals of average spend).
As is evident from the map above, for this sample the North-East states have the highest average spend on TV. The next-highest averages (orange) are seen in the Pacific region and a few Central and Eastern states. As mentioned earlier, this may be due to economic importance or to the service provider’s geographical spread, which this sample data cannot fully capture.
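The original post does not include the mapping code, but a state-level map of this kind can be drawn with ggplot2 along these lines (a minimal sketch using ggplot2’s built-in state polygons, which require the maps package, and synthetic placeholder values, since the provider data are not public):

library(ggplot2)
library(dplyr)

# state polygons for the lower 48 (from the maps package, via ggplot2)
states <- map_data("state")

# synthetic placeholder values; the real average-spend figures are not public
set.seed(1)
avg_spend <- data.frame(region = unique(states$region),
                        avg_tv_spend = runif(length(unique(states$region)), 40, 120))

states %>%
  left_join(avg_spend, by = "region") %>%
  ggplot(aes(long, lat, group = group, fill = avg_tv_spend)) +
  geom_polygon(colour = "white", size = 0.1) +
  scale_fill_gradientn(colours = c("yellow", "orange", "red"),
                       name = "Avg spend ($)") +
  coord_fixed(1.3) +
  theme_void()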

The author has undertaken several projects and programs in data science; the views expressed here are drawn from his industry experience. He can be reached at mavuluri.pradeep@gmail for more details.


ggplot your missing data

By njtierney – rbloggers

(This article was first published on njtierney – rbloggers, and kindly contributed to R-bloggers)

Visualising missing data is important when analysing a dataset. I wanted to make a plot of the presence/absence of data in a dataset. One package, Amelia, provides a function to do this, but I don’t like the way it looks, so I made a ggplot version of what it does.

Let’s make a dataset using the awesome wakefield package, and add random missingness.

library(dplyr)
library(wakefield)
df <- 
  r_data_frame(
  n = 30,
  id,
  race,
  age,
  sex,
  hour,
  iq,
  height,
  died,
  Scoring = rnorm,
  Smoker = valid
  ) %>%
  r_na(prob=.4)

This is what the Amelia package produces by default:

library(Amelia)

missmap(df)

And let’s explore the missing data using my own ggplot function:

# A function that plots missingness
# requires `reshape2`

library(reshape2)
library(ggplot2)

ggplot_missing <- function(x){

  x %>% 
    is.na %>%
    melt %>%
    ggplot(data = .,
           aes(x = Var2,
               y = Var1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme_minimal() + 
    theme(axis.text.x  = element_text(angle=45, vjust=0.5)) + 
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
}

Let’s test it out

ggplot_missing(df)

It’s much cleaner, and easier to interpret.
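If you also want the numbers behind the picture, a quick complement (a minimal extra, not a replacement for the plot) is the proportion of missing values per variable:

# proportion of missing values in each column of df
colMeans(is.na(df))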

This function, along with others, is available in the neato package, where I store a bunch of functions I think are neat.

Quick note – there used to be a function, missing.pattern.plot, that you can see here in the package mi. However, it doesn’t appear to exist anymore, which is a shame, as it was a really nifty plot that clustered the groups of missingness. My friend and colleague Sam Clifford heard me complaining about this and wrote some code that does just that – I shall share it soon; it will likely be added to the neato repository.

Thoughts? Write them below.
