Zillow Rent Analysis

By Anu Rajaram

(This article was first published on R – Journey of Analytics, and kindly contributed to R-bloggers)

Hello Readers,

This is a notification post – did you realize our website has moved? The blog is live at the new JA Blog under the domain http://www.journeyofanalytics.com. You can read the rent analysis post here.

If you received this post AND an email from anu_analytics, then please disregard this post.

If you received this post update from WordPress, but did NOT receive an email from anu_analytics (via MailChimp), then please send us an email at anuprv@journeyofanalytics.com. The email from the main site was sent out 4 hours ago. Alternatively, you can sign up using this contact form.

(Email screenshot below)


JourneyofAnalytics Newsletter

Again, the latest blog posts are available at blog.journeyofanalytics.com, and all code/project files are available under the Projects page.

See you at our new site. Happy Coding!


To leave a comment for the author, please follow the link and comment on their blog: R – Journey of Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Obstacles to performance in parallel programming

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance. They include:

  • Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks;
  • Load balance, where the computing resources aren’t contributing equally to the problem;
  • Impacts from use of RAM and virtual memory, such as cache misses and page faults;
  • Network effects, such as latency and bandwidth, that impact performance and communication overhead;
  • Interprocess conflicts and thread scheduling;
  • Data access and other I/O considerations.
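
To make the first of these concrete, here is a small sketch (my own, not from the chapter) contrasting fine-grained and chunked task decomposition with the parallel package; most of the timing difference is communication overhead rather than computation:

library(parallel)

cl <- makeCluster(2)

# Fine-grained: 100,000 tiny tasks, so scheduling and message passing dominate
system.time(parLapply(cl, 1:1e5, function(x) sqrt(x)))

# Coarse-grained: the same work in 4 large chunks, with far fewer messages
chunks <- split(1:1e5, cut(1:1e5, 4))
system.time(parLapply(cl, chunks, function(ch) sqrt(ch)))

stopCluster(cl)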

The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It’s also worth checking out Norm Matloff’s keynote from the useR!2017 conference, embedded below.

Norm Matloff: Understanding overhead issues in parallel computation

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

Starting a Rmarkdown Blog with Bookdown + Hugo + Github

By mareviv

(This article was first published on R – Tales of R, and kindly contributed to R-bloggers)

Finally, after 24 hours of failed attempts, I got my favourite Hugo theme up and running with RStudio and blogdown.

All the steps I followed are detailed in my new Blogdown entry, which is also a GitHub repo.

After exploring some alternatives, like Shirin’s (with Jekyll) and Amber Thomas’s advice (which involved Git skills beyond my basic abilities), I was able to install Yihui’s hugo-lithium-theme in a new repository.

However, I wanted to explore other blog templates, hosted on GitHub, like:

The first three themes are currently linked in the blogdown documentation as being the simplest and easiest to set up for inexperienced blog programmers, but I hope the list will grow in the following months. For those who are willing to experiment, the complete list is here.

Finally I chose the hugo-tranquilpeak theme, by Thibaud Leprêtre, for which I mostly followed Tyler Clavelle’s entry on the topic. This approach turned out to be easy and effective, given some conditions:

  • Contrary to Yihui Xie’s advice, I chose github.io to host my blog instead of Netlify (I love my desktop integration with GitHub, so I preferred not to move my static content to another service).
  • On my machine, I installed blogdown & Hugo using RStudio (v 1.1.336).
  • On GitHub, it was easier for me to host the blog directly in my main GitHub Pages repository (always named [USERNAME].github.io), in the master branch, following Tyler’s tutorial.
  • I have basic knowledge of HTML, CSS and JavaScript, so I didn’t mind tinkering with the theme.
  • My custom styles didn’t involve theme rebuilding. At this moment they’re simple cosmetic tweaks.

The steps I followed were:

Git & GitHub repos

  • Set up a GitHub repo with the name [USERNAME].github.io (in my case aurora-mareviv.github.io). See this and this.
  • Create a git repo on your machine:
    • Manually create a new directory called [USERNAME].github.io.
    • Run in the terminal (Windows users have to install git first):
    cd /Git/[USERNAME].github.io # your path may be different
    
    git init # initiates repo in the directory
    git remote add origin https://github.com/[USERNAME]/[USERNAME].github.io # connects git local repo to remote Github repo
    
    git pull origin master # in case you have LICENSE and Readme.md files in the GitHub repo, they're downloaded
  • For now, your repo is ready. We will now focus on creating & customising our Blogdown.

RStudio and blogdown

  • Open RStudio (v 1.1.336, the development version as of today).
    • First, you may need to install blogdown in R:
    install.packages("blogdown")
    • In RStudio, select Menu > File > New Project, following the lower half of these instructions. The wizard for setting up a Hugo Blogdown project may not yet be available in your RStudio version (though probably not for much longer); a console alternative is sketched below.
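
If the wizard isn’t available in your build, a minimal console-based sketch is possible with blogdown::new_site(), assuming the theme lives at kakawait/hugo-tranquilpeak-theme on GitHub:

library(blogdown)
# Create the site skeleton in the current (empty) project directory,
# downloading the theme from GitHub
new_site(theme = "kakawait/hugo-tranquilpeak-theme")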

Creating new Project

Selecting Hugo Blogdown format

Selecting Hugo Blogdown theme

A config.toml file appears


Customising paths and styles

Before we build and serve our site, we need to tweak a couple of things in advance, if we want to smoothly deploy our blog into GitHub pages.

Modify config.toml file

To integrate with GitHub Pages, these are the essential modifications at the top of our config.toml file:

  • We need to set up the base URL to the “root” of the web page (https://[USERNAME].github.io/ in this case)
  • By default, the web page is published in the “public” folder. We need it to be published in the root of the repository, to match the structure of the GitHub master branch:
baseurl = "/"
publishDir = "."
  • Other useful global settings:
ignoreFiles = [".Rmd$", ".Rmarkdown$", "_files$", "_cache$"]
enableEmoji = true

Images & styling paths

We can revisit the config.toml file to make changes to the default settings.

The logo that appears in the corner must be in the root folder. To modify it in the config.toml:

picture = "logo.png" # the path to the logo

The cover (background) image must be located in /themes/hugo-tranquilpeak-theme/static/images. To modify it in the config.toml:

coverImage = "myimage.jpg"

We want some custom css and js. We need to locate it in /static/css and in /static/js respectively.

# Custom CSS. Put here your custom CSS files. They are loaded after the theme CSS;
# they have to be referred from static root. Example
customCSS = ["css/my-style.css"]

# Custom JS. Put here your custom JS files. They are loaded after the theme JS;
# they have to be referred from static root. Example
customJS = ["js/myjs.js"]

Custom css

We can add arbitrary classes to our css file (see above).

I started out writing in Bootstrap and I miss it a lot. This theme already has some Bootstrap classes, so I brought in a few others I didn’t find in the theme (they’re available for .md files, but currently not for .Rmd).

Here is my custom css file to date:

/* @import url('https://maxcdn.bootstrapcdn.com/bootswatch/3.3.7/cosmo/bootstrap.min.css'); may conflict with default theme*/
@import url('https://fonts.googleapis.com/icon?family=Material+Icons'); /*google icons*/
@import url('https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css'); /*font awesome icons*/

.input-lg {
  font-size: 30px;
}
.input {
  font-size: 20px;
}
.font-sm {
    font-size: 0.7em;
}
.texttt {
  font-family: monospace;
}
.alert {
  padding: 15px;
  margin-bottom: 20px;
  border: 1px solid transparent;
  border-radius: 4px;
}
.alert-success {
  color: #3c763d;
  background-color: #dff0d8;
  border-color: #d6e9c6;
}
.alert-danger,
.alert-error {
  color: #b94a48;
  background-color: #f2dede;
  border-color: #eed3d7;
}
.alert-info {
  color: #3a87ad;
  background-color: #d9edf7;
  border-color: #bce8f1;
}
.alert-gray {
  background-color: #f2f3f2;
  border-color: #f2f3f2;
}

/*style for printing*/
@media print {
  .noPrint {
    display: none;
  }
}

/*link formatting*/
a:link {
    color: #478ca7;
    text-decoration: none;
} 
a:visited {
    color: #478ca7;
    text-decoration: none;
}
a:hover {
    color: #82b5c9;
    text-decoration: none;
}

Also, we have font-awesome icons!

Site build with blogdown

Once our theme is ready, we can add some content, modifying or deleting the various examples we find in /content/post.

We make use of blogdown & Hugo to compile our .Rmd file and create our HTML post:

blogdown::build_site()
blogdown::serve_site()

In the viewer pane, at the right side of the IDE, you can examine the resulting HTML and check whether anything went wrong.

Deploying the site

Updating the local git repository

This can be done with simple git commands:

cd /Git/[USERNAME].github.io # your path to the repo may be different
git add . # stages all new/modified files for the next commit
git commit -m "Starting my Hugo blog" # commits the staged files to the local repo, with a message

Pushing to GitHub

git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)

Just go to the page https://[USERNAME].github.io and enjoy your blog!


R code

It works just the same as in Rmarkdown. R code is compiled into HTML and published as static web content in a few steps. Welcome to the era of reproducible blogging!

Figure 1 uses the ggplot2 library:

library(ggplot2)
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()

Figure 1: diamonds plot with ggplot2.

Rmd source code

You can download it from here.

I, for one, welcome the new era of reproducible blogging!


To leave a comment for the author, please follow the link and comment on their blog: R – Tales of R.


Source:: R News

ggvis Exercises (Part-2)

By Euthymios Kasvikis

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

The ggvis package is used to make interactive data visualizations. The fact that it combines shiny’s reactive programming model with dplyr’s grammar of data transformation makes it a useful tool for data scientists.

This package allows us to implement features like interactivity, but on the other hand every interactive ggvis plot must be connected to a running R session.
Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to verify your answers.
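
As a quick syntax refresher before you begin, here is a minimal ggvis pipeline (using mtcars rather than Cars93, so as not to give away the answers):

library(ggvis)

# Map two variables to the axes, then draw points
mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points()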

Exercise 1

Create a list which will include the variables “Horsepower” and “MPG.city” of the “Cars93” data set and make a scatterplot. HINT: Use ggvis() and layer_points().

Exercise 2

Add a slider to the scatterplot of Exercise 1 that sets the point size from 10 to 100. HINT: Use input_slider().

Learn more about using ggvis in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:

  • Work extensively with the ggvis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 3

Add a slider to the scatterplot of Exercise 1 that sets the point opacity from 0 to 1. HINT: Use input_slider().

Exercise 4

Create a histogram of the variable “Horsepower” of the “Cars93” data set. HINT: Use layer_histograms().

Exercise 5

Set the width and the center of the histogram bins you just created to 10.

Exercise 6

Add 2 sliders to the histogram you just created, one for width and the other for center with values from 0 to 10 and set the step to 1. HINT: Use input_slider().

Exercise 7

Add the labels “Width” and “Center” to the two sliders respectively. HINT: Use label.

Exercise 8

Create a scatterplot of the variables “Horsepower” and “MPG.city” of the “Cars93” dataset with size = 10 and opacity = 0.5.

Exercise 9

Add to the scatterplot you just created a function which will set the size with the left and right keyboard controls. HINT: Use left_right().

Exercise 10

Add interactivity to the scatterplot you just created using a function that shows the value of the “Horsepower” when you “mouseover” a certain point. HINT: Use add_tooltip().

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

GoTr – R wrapper for An API of Ice And Fire

By Mango Solutions

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Ava Yang

It’s Game of Thrones time again as the battle for Westeros is heating up. There are tons of ideas, ingredients and interesting analyses out there, and I was craving my own flavour. So step zero: where is the data?

Jenny Bryan’s purrr tutorial introduced the list got_chars, representing character information from the first five books, which doesn’t seem like much fun beyond exercising list-manipulation muscles. However, it led me to An API of Ice and Fire, the world’s greatest source for quantified and structured data from the universe of Ice and Fire, including the HBO series Game of Thrones. I decided to create my own API functions, or better, an R package (inspired by the famous rwar package).

The API resources cover three types of endpoint – Books, Characters and Houses. GoTr pulls data in JSON format and parses it into R list objects. httr‘s Best practices for writing an API package by Hadley Wickham is another life saver.

The package contains:

  • One function, got_api()
  • Two ways to specify parameters generally, i.e. endpoint type + id, or url
  • Three endpoint types

## Install GoTr from github
#devtools::install_github("MangoTheCat/GoTr")
library(GoTr)
library(tidyverse)
library(listviewer)

# Retrieve books id 5
books_5 <- got_api(type = "books", id = 5)
# Retrieve characters id 583
characters_583 <- got_api(type = "characters", id = 583)
# Retrieve houses id 378
house_378 <- got_api(type = "houses", id = 378)
# Retrieve the POV characters of books id 5
povData <- books_5$povCharacters %>%
  flatten_chr() %>%
  map(function(x) got_api(url = x))
# Helpful functions to check structure of list object
length(books_5)
## [1] 11
names(books_5)
##  [1] "url"           "name"          "isbn"          "authors"      
##  [5] "numberOfPages" "publisher"     "country"       "mediaType"    
##  [9] "released"      "characters"    "povCharacters"
names(house_378)
##  [1] "url"              "name"             "region"          
##  [4] "coatOfArms"       "words"            "titles"          
##  [7] "seats"            "currentLord"      "heir"            
## [10] "overlord"         "founded"          "founder"         
## [13] "diedOut"          "ancestralWeapons" "cadetBranches"   
## [16] "swornMembers"
str(characters_583, max.level = 1)
## List of 16
##  $ url        : chr "https://anapioficeandfire.com/api/characters/583"
##  $ name       : chr "Jon Snow"
##  $ gender     : chr "Male"
##  $ culture    : chr "Northmen"
##  $ born       : chr "In 283 AC"
##  $ died       : chr ""
##  $ titles     :List of 1
##  $ aliases    :List of 8
##  $ father     : chr ""
##  $ mother     : chr ""
##  $ spouse     : chr ""
##  $ allegiances:List of 1
##  $ books      :List of 1
##  $ povBooks   :List of 4
##  $ tvSeries   :List of 6
##  $ playedBy   :List of 1
map_chr(povData, "name")
##  [1] "Aeron Greyjoy"     "Arianne Martell"   "Arya Stark"       
##  [4] "Arys Oakheart"     "Asha Greyjoy"      "Brienne of Tarth" 
##  [7] "Cersei Lannister"  "Jaime Lannister"   "Samwell Tarly"    
## [10] "Sansa Stark"       "Victarion Greyjoy" "Areo Hotah"
#listviewer::jsonedit(povData)

Another powerful parameter is query, which allows filtering by a specific attribute such as the name of a character, pagination, and so on.

It’s worth knowing about pagination. A first simple request will render a list of 10 elements, since the default number of items per page is 10. The maximum valid pageSize is 50, i.e. if 567 is passed on to it, you still get 50 characters.
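
For instance, a short sketch of requesting a specific page (the exact contents returned depend on the API’s current state):

# Ask for the second page of 50 characters; pageSize above 50 is capped
chars_page2 <- got_api(type = "characters",
                       query = list(page = "2", pageSize = "50"))
length(chars_page2)  # 50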

# Retrieve character by name
Arya_Stark <- got_api(type = "characters", query = list(name = "Arya Stark"))

So how do we get ALL books, characters or houses information? The package does not provide a function for this directly, but here’s an implementation.

# Retrieve all books
booksAll <- got_api(type = "books", query = list(pageSize = "20"))
map_chr(booksAll, "name")
##  [1] "A Game of Thrones"              "A Clash of Kings"              
##  [3] "A Storm of Swords"              "The Hedge Knight"              
##  [5] "A Feast for Crows"              "The Sworn Sword"               
##  [7] "The Mystery Knight"             "A Dance with Dragons"          
##  [9] "The Princess and the Queen"     "The Rogue Prince"              
## [11] "The World of Ice and Fire"      "A Knight of the Seven Kingdoms"
# Retrieve all houses (444 in total, so 9 pages of 50)
houses <- 1:9 %>%
  map(function(x) got_api(type = "houses", query = list(page=x, pageSize="50"))) %>%
  unlist(recursive=FALSE)
map_chr(houses, "name") %>% length()
## [1] 444
map_df(houses, `[`, c("name", "region")) %>% head()
## # A tibble: 6 x 2
##                          name          region
##                         <chr>           <chr>
## 1                House Algood The Westerlands
## 2 House Allyrion of Godsgrace           Dorne
## 3                 House Amber       The North
## 4               House Ambrose       The Reach
## 5  House Appleton of Appleton       The Reach
## 6     House Arryn of Gulltown        The Vale

The houses list is a starting point for a social network analysis: mirror, mirror, tell me, who are the most influential houses in the Seven Kingdoms? Stay tuned; that is the topic of the next blogpost.

Thanks to all the open resources. Please comment, fork, file issues, and star the work-in-progress on our GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Source:: R News

Estimating Gini coefficient when we only have mean income by decile by @ellis2013nz

By Peter's stats stuff – R

(This article was first published on Peter’s stats stuff – R, and kindly contributed to R-bloggers)

Income inequality data

Ideally, the Gini coefficient used to estimate inequality is based on original household survey data with hundreds or thousands of data points. Often this data isn’t available due to access restrictions from privacy or other concerns, and all that is published is some kind of aggregate measure. Some aggregations include the income at the 80th percentile divided by that at the 20th (or the 90th and 10th); the number of people at the top of the distribution whose combined income is equal to that of everyone else; or the income of the top 1% as a percentage of all income. I wrote a little more about this in one of my earliest blog posts.

One way aggregated data are sometimes presented is as the mean income in each decile or quintile. This is not the same as the actual quantile values themselves, which are the boundaries between categories. The 20th percentile is the value for the 20/100th person when they are lined up in increasing order, whereas the mean income of the first quintile is the mean of all the incomes of a “bin” of everyone from 0/100 to 20/100, when lined up in order.
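
A quick toy example makes the distinction concrete; this sketch with simulated log-normal incomes shows that the 20th percentile and the mean of the bottom quintile bin are quite different numbers:

set.seed(123)
income <- sort(rlnorm(1000, meanlog = 10, sdlog = 0.8))

# The 20th percentile: the value for the 200/1000th person in the line-up
quantile(income, 0.2)

# The mean of the first quintile bin: the average income of the bottom 20%
mean(income[1:200])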

To explore estimating Gini coefficients from this type of binned data I used data from the wonderful Lakner-Milanovic World Panel Income Distribution database, which is available for free download. This useful collection contains the mean income by decile bin of many countries from 1985 onwards – the result of some careful and doubtless very tedious work with household surveys from around the world. This is an amazing dataset, and amongst other purposes it can be used (as Milanovic and co-authors have pioneered dating back to his World Bank days) in combination with population numbers to estimate “global inequality”, treating everyone on the planet as part of a single economic community regardless of national boundaries. But that’s for another day.

Here’s R code to download the data (in Stata format) and grab the first ten values, which happen to represent Angola in 1995. These particular data are based on consumption, which in poorer economies is often more sensible to measure than income:

library(tidyverse)
library(scales)
library(foreign) # for importing Stata files
library(actuar)  # for access to the Burr distribution
library(acid)
library(forcats)

# Data described at https://www.gc.cuny.edu/CUNY_GC/media/LISCenter/brankoData/LaknerMilanovic2013WorldPanelIncomeDistributionLMWPIDDescription.pdf
# The database has been created specifically for the
# paper “Global Income Distribution: From the Fall of the Berlin Wall to the Great Recession”,
# World Bank Policy Research Working Paper No. 6719, December 2013, published also in World
# Bank Economic Review (electronically available from 12 August 2015). 
download.file("https://wfs.gc.cuny.edu/njohnson/www/BrankoData/LM_WPID_web.dta", 
              mode = "wb", destfile = "LM_WPID_web.dta")

wpid <- read.dta("LM_WPID_web.dta")

# inc_con whether survey is income or consumption
# group income decile group 1 to 10
# RRinc is average per capita income of the decile in 2005 PPP

# the first 10 rows are Angola in 1995, so let's experiment with them
angola <- wpid[1:10, c("RRinc", "group")]

Here are the resulting 10 numbers.

And this is the Lorenz curve:

Those graphics were drawn with this code:

ggplot(angola, aes(x = group, y = RRinc)) +
  geom_line() +
  geom_point() +
  ggtitle("Mean consumption by decile in Angola in 1995") +
  scale_y_continuous("Annual consumption for each decile group", label = dollar) +
  scale_x_continuous("Decile group", breaks = 1:10) +
  labs(caption = "Source: Lakner/Milanovic World Panel Income Distribution data") +
  theme(panel.grid.minor = element_blank())

angola %>%
  arrange(group) %>%
  mutate(cum_inc_prop = cumsum(RRinc) / sum(RRinc),
         pop_prop = group / max(group)) %>%
  ggplot(aes(x = pop_prop, y = cum_inc_prop)) +
  geom_line() +
  geom_ribbon(aes(ymax = pop_prop, ymin = cum_inc_prop), fill = "steelblue", alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "Cumulative proportion of population",
       y = "Cumulative proportion of consumption",
       caption = "Source: Lakner/Milanovic World Panel Income Distribution data") +
  ggtitle("Inequality in Angola in 1995",
          "Lorenz curve based on binned decile mean consumption")

Calculating Gini directly from deciles?

Now, I could just treat these 10 deciles as a sample of 10 representative people (each observation after all represents exactly 10% of the population) and calculate the Gini coefficient directly. But my hunch was that this would underestimate inequality, because the straight lines in the Lorenz curve above are a simplification of the real, more curved, distribution.

To investigate this issue, I started by creating a known population of 10,000 income observations from a Burr distribution, which is a flexible, continuous non-negative distribution often used to model income. That looks like this:

population <- rburr(10000, 1, 3, 3)

par(bty = "l", font.main = 1)
plot(density(population), main = "Burr(1,3,3) distribution")

Then I divided the data up into between 2 and 100 bins, took the means of the bins, and calculated the Gini coefficient of the bins. Doing this for 10 bins is the equivalent of calculating a Gini coefficient directly from decile data such as in the Lakner-Milanovic dataset. I got this result, which shows that when you have the means of 10 bins, you underestimate inequality slightly:

Here’s the code for that little simulation. I make myself a little function to bin data and return the mean values of the bins in a tidy data frame, which I’ll need for later use too:

#' Quantile averages
#' 
#' Mean value in binned groups
#' 
#' @param y numeric vector to provide summary statistics on
#' @param len number of bins to calculate means for
#' @details this is different from producing the actual quantiles; it is the mean value of y within each bin.
bin_avs <- function(y, len = 10){
  # argument checks:
  if(class(y) != "numeric"){stop("y should be numeric")}
  if(length(y) < len){stop("y should be longer than len")}
  
  # calculation:
  y <- sort(y)
  bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, length.out = len + 1)))
  tmp <- data.frame(bin_number = 1:len,
                    bin_breaks = levels(bins),
                    mean = as.numeric(tapply(y, bins, mean)))
  return(tmp)
}

ginis <- numeric(99)
for(i in 1:99){
  ginis[i] <- weighted.gini(bin_avs(population, len = i + 1)$mean)$Gini
}
ginis_df <- data.frame(
  number_bins = 2:100,
  gini = ginis
)

ginis_df %>%
  mutate(label = ifelse(number_bins < 11 | round(number_bins / 10) == number_bins / 10, number_bins, "")) %>%
  ggplot(aes(x = number_bins, y = gini)) +
  geom_line(colour = "steelblue") +
  geom_text(aes(label = label)) +
  labs(x = "Number of bins",
       y = "Gini coefficient estimated from means within bins") +
  ggtitle("Estimating Gini coefficient from binned mean values of a Burr distribution population",
          paste0("Correct Gini is ", round(weighted.gini(population)$Gini, 3), ". Around 25 bins needed for a really good estimate."))

A better method for Gini from deciles?

Maybe I should have stopped there; after all, there is hardly any difference between 0.32 and 0.34, probably much less than the sampling error from the survey. But I wanted to explore whether there was a better way. The method I chose was to:

  • choose a log-normal distribution that would generate (close to) the 10 decile averages we have;
  • simulate individual-level data from that distribution; and
  • estimate the Gini coefficient from that simulated data.

I also tried this with a Burr distribution but the results were very unstable. The log-normal approach was quite good at generating data with means of 10 bins very similar to the original data, and gave plausible values of the Gini coefficient just slightly higher than when calculated directly from the bins’ means.

Here’s how I did that:

# algorithm will be iterative
# 1. assume the 10 binned means represent the following quantiles: 0.05, 0.15, 0.25 ... 0.65, 0.75, 0.85, 0.95
# 2. pick the best lognormal distribution that fits those 10 quantile values. 
#    Treat as a non-linear optimisation problem and solve with `optim()`.
# 3. generate data from that distribution and work out what the actual quantiles are
# 4. repeat 2, with these actual quantiles

n <- 10000
x <- angola$RRinc

fn2 <- function(params) {
  sum((x - qlnorm(p, params[1], params[2])) ^ 2 / x)
}
p <- seq(0.05, 0.95, length.out = 10)
fit2 <- optim(c(1,1), fn2)
simulated2 <- rlnorm(n, fit2$par[1], fit2$par[2])
p <- plnorm(bin_avs(simulated2)$mean, fit2$par[1], fit2$par[2])
fit2 <- optim(c(1,1), fn2)
simulated2 <- rlnorm(n, fit2$par[1], fit2$par[2])

And here are the results. The first table shows the means of the bins in my simulated log-normal population (mean) compared to the original data for Angola’s actual deciles in 1995 (x). The next two values, 0.415 and 0.402, are the Gini coefficients from the simulated and original data respectively:

> cbind(bin_avs(simulated2), x)
   bin_number         bin_breaks      mean    x
1           1         (40.6,222]  165.0098  174
2           2          (222,308]  266.9120  287
3           3          (308,393]  350.3674  373
4           4          (393,487]  438.9447  450
5           5          (487,589]  536.5547  538
6           6          (589,719]  650.7210  653
7           7          (719,887]  795.9326  785
8           8     (887,1.13e+03] 1000.8614  967
9           9 (1.13e+03,1.6e+03] 1328.3872 1303
10         10  (1.6e+03,1.3e+04] 2438.4041 2528
> weighted.gini(simulated2)$Gini
          [,1]
[1,] 0.4145321
> 
> # compare to the value estimated directly from the data:
> weighted.gini(x)$Gini
         [,1]
[1,] 0.401936

As would be expected from my earlier simulation, the Gini coefficient from the estimated underlying log-normal distribution is very slightly higher than that calculated directly from the means of the decile bins.

Applying this method to the Lakner-Milanovic inequality data

I rolled up this approach into a function to convert means of deciles into Gini coefficients and applied it to all the countries and years in the World Panel Income Distribution data. Here are the results, first over time:

… and then as a snapshot:

Neither of these is great as a polished data visualisation, but it’s difficult data to present in a static snapshot, and will do for these illustrative purposes.

Here’s the code for that function (which depends on the previously defined bin_avs()) and for drawing those charts. Drawing on the convenience of Hadley Wickham’s dplyr and ggplot2, it’s easy to do this on the fly, and in the code below I calculate the Gini coefficients twice, once for each chart. Technically this is wasteful, but with modern computers it isn’t a big deal even though there is quite a bit of computationally intensive work going on under the hood; the code below only takes a minute or so to run.

#' Convert data that is means of deciles into a Gini coefficient
#' 
#' @param x vector of 10 numbers, representing mean income (or whatever) for 10 deciles
#' @param n number of simulated values of the underlying log-normal distribution to generate
#' @details returns an estimate of Gini coefficient that is less biased than calculating it
#' directly from the deciles, which would be slightly biased downwards.
deciles_to_gini <- function(x, n = 1000){
  fn <- function(params) {
    sum((x - qlnorm(p, params[1], params[2])) ^ 2 / x)
  }
  
  # starting estimate of p based on binned means and parameters
  p <- seq(0.05, 0.95, length.out = 10)
  fit <- optim(c(1,1), fn)
  
  # calculate a better value of p
  simulated <- rlnorm(n, fit$par[1], fit$par[2])
  p <- plnorm(bin_avs(simulated)$mean, fit$par[1], fit$par[2])
  
  # new fit with the better p
  fit <- optim(c(1,1), fn)
  simulated <- rlnorm(n, fit$par[1], fit$par[2])
  output <- list(par = fit$par)
  output$Gini <- as.numeric(weighted.gini(simulated)$Gini)
  return(output)
}

# example usage:
deciles_to_gini(x = wpid[61:70, ]$RRinc)
deciles_to_gini(x = wpid[171:180, ]$RRinc)

# draw some graphs:
wpid %>%
  filter(country != "Switzerland") %>%
  mutate(inc_con = ifelse(inc_con == "C", "Consumption", "Income")) %>%
  group_by(region, country, contcod, year, inc_con) %>%
  summarise(Gini = deciles_to_gini(RRinc)$Gini) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = Gini, colour = contcod, linetype = inc_con)) +
  geom_point() +
  geom_line() +
  facet_wrap(~region) +
  guides(colour = FALSE) +
  ggtitle("Inequality over time",
          "Gini coefficients estimated from decile data") +
  labs(x = "", linetype = "",
       caption = "Source: Lakner/Milanovic World Panel Income Distribution data") 

wpid %>%
  filter(country != "Switzerland") %>%
  mutate(inc_con = ifelse(inc_con == "C", "Consumption", "Income")) %>%
  group_by(region, country, contcod, year, inc_con) %>%
  summarise(Gini = deciles_to_gini(RRinc)$Gini) %>%
  ungroup() %>%
  group_by(country) %>%
  filter(year == max(year)) %>%
  ungroup() %>%
  mutate(country = fct_reorder(country, Gini),
         region = fct_lump(region, 5)) %>%
  ggplot(aes(x = Gini, y = country, colour = inc_con, label = contcod)) +
  geom_text(size = 2) +
  facet_wrap(~region, scales = "free_y", nrow = 2) +
  labs(colour = "", y = "", x = "Gini coefficient",
       caption = "Source: Lakner-Milanovic World Panel Income Distribution") +
  ggtitle("Inequality by country",
          "Most recent year available up to 2008; Gini coefficients are estimated from decile mean income.")

There we go – deciles to Gini fun with world inequality data!

# cleanup
unlink("LM_WPID_web.dta")
To leave a comment for the author, please follow the link and comment on their blog: Peter’s stats stuff – R.


Source:: R News

Oil leakage… those old BMW’s are bad :-)

By Longhow Lam

(This article was first published on R – Longhow Lam’s Blog, and kindly contributed to R-bloggers)

Introduction

My first car was a 13-year-old Mitsubishi Colt; I paid 3000 Dutch guilders for it. I can still remember a friend who wouldn’t let me park this car in front of his house because of possible oil leakage.

mitsubishi_colt_turbo_red_1984

Can you get an idea of which cars are likely to leak oil? Well, with open car data from the Dutch RDW you can. RDW is the Netherlands Vehicle Authority in the mobility chain.

RDW Data

There are many data sets that you can download. I have used the following:

  • Observed Defects. This set contains 22 million records on observed defects at the car level (license plate number). Cars in the Netherlands have to be checked yearly, and the findings of each check are submitted to RDW.
  • Basic car details. This set contains 9 million records: all the cars in the Netherlands, with license plate number, brand, make, weight and type of car.
  • Defects code. This little table provides a description of all the possible defect codes, so I know that code ‘RA02’ in the observed defects data set represents ‘oil leakage’.

Simple Analysis in R

I imported the data into R, and with some simple dplyr statements I determined, per car make and age (in years), the number of cars with an observed oil leakage defect. Then I determined how many cars there are per make and age; dividing those two numbers gives a so-called oil leak percentage.

For example, in the Netherlands there are 2043 Opel Astras that are four years old; three of them have an observed oil leak, so the oil leak percentage is 0.15%.
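
A minimal dplyr sketch of that calculation might look like the following; the data frame and column names here are illustrative assumptions, not the actual RDW schema:

library(dplyr)

# defects: one row per observed defect (make, age, defect_code, ...)
# cars:    one row per registered car (make, age, ...)
oil_leaks <- defects %>%
  filter(defect_code == "RA02") %>%        # 'RA02' = oil leakage
  count(make, age, name = "n_leaking")

leak_pct <- cars %>%
  count(make, age, name = "n_total") %>%
  left_join(oil_leaks, by = c("make", "age")) %>%
  mutate(n_leaking = coalesce(n_leaking, 0L),
         leak_percentage = 100 * n_leaking / n_total)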

The graph below shows the oil leak percentages for different car brands and ages. Obviously, the older the car, the higher the leak percentage. But look at BMW: wow, those old BMWs are leaking oil like crazy… 🙂 The few lines of R code can be found here.

Rplot01

Conclusion

There is a lot in the open car data from RDW; you can look at many more aspects and defects of cars. Regarding my old car: according to this data, Mitsubishis have a low oil leak percentage, even the older ones.

Cheers, Longhow

To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam’s Blog.


Source:: R News

RcppArmadillo 0.7.960.1.0

By Thinking inside the box


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The bi-monthly RcppArmadillo release is out with a new version 0.7.960.1.0 which is now on CRAN, and will get to Debian in due course.

And it is a big one. Lots of nice upstream changes from Armadillo, and lots of work on our end as the Google Summer of Code project by Binxiang Ni, plus a few smaller enhancements — see below for details.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 379 other packages on CRAN, an increase of 49 since the last CRAN release in June!

Changes in this release relative to the previous CRAN release are as follows:

Changes in RcppArmadillo version 0.7.960.1.0 (2017-08-11)

  • Upgraded to Armadillo release 7.960.1 (Northern Banana Republic Deluxe)

    • faster randn() when using OpenMP (NB: usually omitted when used from R)

    • faster gmm_diag class, for Gaussian mixture models with diagonal covariance matrices

    • added .sum_log_p() to the gmm_diag class

    • added gmm_full class, for Gaussian mixture models with full covariance matrices

    • expanded .each_slice() to optionally use OpenMP for multi-threaded execution

  • Upgraded to Armadillo release 7.950.0 (Northern Banana Republic)

    • expanded accu() and sum() to use OpenMP for processing expressions with computationally expensive element-wise functions

    • expanded trimatu() and trimatl() to allow specification of the diagonal which delineates the boundary of the triangular part

  • Enhanced support for sparse matrices (Binxiang Ni as part of Google Summer of Code 2017)

    • Add support for dtCMatrix and dsCMatrix (#135)

    • Add conversion and unit tests for dgT, dtT and dsTMatrix (#136)

    • Add conversion and unit tests for dgR, dtR and dsRMatrix (#139)

    • Add conversion and unit tests for pMatrix and ddiMatrix (#140)

    • Rewrite conversion for dgT, dtT and dsTMatrix, and add file-based tests (#142)

    • Add conversion and unit tests for indMatrix (#144)

    • Rewrite conversion for ddiMatrix (#145)

    • Add a warning message for matrices that cannot be converted (#147)

    • Add new vignette for sparse matrix support (#152; Dirk in #153)

    • Add support for sparse matrix conversion from Python SciPy (#158 addressing #141)

  • Optional return of row or column vectors in collapsed form if appropriate #define is set (Serguei Sokol in #151 and #154)

  • Correct speye() for non-symmetric cases (Qiang Kou in #150 closing #149).

  • Ensure tests using Scientific Python and reticulate are properly conditioned on the packages being present.

  • Added .aspell/ directory with small local directory now supported by R-devel.
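
As a small illustration of the enhanced sparse matrix support listed above, a sparse matrix from R’s Matrix package can cross directly into Armadillo’s sp_mat; here is a minimal sketch (my example, not from the release notes):

library(Rcpp)
library(Matrix)

# A dgCMatrix argument arrives on the C++ side as an Armadillo sparse matrix
cppFunction(depends = "RcppArmadillo", '
double sparse_sum(const arma::sp_mat& X) {
    return arma::accu(X);  // sum over all elements
}')

M <- rsparsematrix(100, 100, density = 0.05)
sparse_sum(M)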

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Source:: R News

2017 App Update

By Val Pinskiy

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

As you may have noticed, we have made a few changes to our apps for the 2017 season to bring you a smoother and quicker experience while also adding more advanced and customizable views.

Most visibly, we moved the apps to Shiny so we can continue to build on our use of R and add new features and improvements throughout the season. We expect the apps to better handle high traffic this season, especially during drafts and other peak periods.

In addition to the ability to create and save custom settings, you can also choose the columns you view in our Projections tool. We have also added more advanced metrics such as weekly VOR and Projected Points Per Dollar (ROI) for those of you in auction leagues. With a free account, you’ll be able to create and save one custom setting. If you get an FFA Insider subscription, you’ll be able to create and save unlimited custom settings.

Up next is the ability to upload custom auction values to make it easier to use during auction drafts.

We are also always looking to add new features, so feel free to drop us a suggestion in the Comments section below!

The post 2017 App Update appeared first on Fantasy Football Analytics.

To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.


Source:: R News

20 years of the R Core Group

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The first “official” version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. Sharing this tweet from yesterday, from R Core member Peter Dalgaard:

It was twenty years ago today, Ross Ihaka got the band to play…. #rstats pic.twitter.com/msSpPz2kyA

— Peter Dalgaard (@pdalgd) August 16, 2017

Twenty years ago, on August 16 1997, the R Core Group was formed. Before that date, the committers to R were the project’s founders Ross Ihaka and Robert Gentleman, along with Luke Tierney, Heiner Schwarte and Paul Murrell. The email above was the invitation for Kurt Hornik, Peter Dalgaard and Thomas Lumley to join as well. With the sole exception of Schwarte, all of the above remain members of the R Core Group, which has since expanded to 21 members. These are the volunteers who implement the R language and its base packages; document, build, test and release it; and manage all the infrastructure that makes that possible.

Thank you to all the R Core Group members, past and present!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News