Communications overhead, particularly an issue with fine-grained parallelism consisting of a very large number of relatively small tasks;
Load balance, where the computing resources aren’t contributing equally to the problem;
Impacts from use of RAM and virtual memory, such as cache misses and page faults;
Network effects, such as latency and bandwidth, that impact performance and communication overhead;
Interprocess conflicts and thread scheduling;
Data access and other I/O considerations.
The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It’s also worth checking out Norm Matloff’s keynote from the useR!2017 conference, embedded below.
Finally, after 24 hours of failed attempts, I got my favourite Hugo theme up and running with RStudio and blogdown.
All the steps I followed are detailed in my new Blogdown entry, which is also a GitHub repo.
After exploring some alternatives, like Shirin's (with Jekyll) and Amber Thomas's advice (which involved Git skills beyond my basic abilities), I was able to install Yihui's hugo-lithium-theme in a new repository.
However, I wanted to explore other blog templates hosted on GitHub, like:
The first three themes are currently linked in the blogdown documentation as being the simplest and easiest to set up for inexperienced blog programmers, but I hope the list will grow in the coming months. For those who are willing to experiment, the complete list is here.
Finally, I chose the hugo-tranquilpeak theme, by Thibaud Leprêtre, for which I mostly followed Tyler Clavelle's entry on the topic. This approach turned out to be easy and effective, given some conditions:
Contrary to Yihui Xie's advice, I chose github.io to host my blog instead of Netlify (I love my desktop integration with GitHub, so I preferred not to move to another service for my static content).
On my machine, I installed blogdown & Hugo using RStudio (v 1.1.336).
On GitHub, it was easier for me to host the blog directly in my main GitHub Pages repository (always named [USERNAME].github.io), in the master branch, following Tyler's tutorial.
My custom styles didn’t involve theme rebuilding. At this moment they’re simple cosmetic tricks.
The steps I followed were:
Git & GitHub repos
Setting up a GitHub repo with the name [USERNAME].github.io (in my case, aurora-mareviv.github.io). See this and this.
Create a git repo in your machine:
Manually create a new directory called [USERNAME].github.io.
Run in the terminal (Windows users have to install git first):
cd /Git/[USERNAME].github.io # your path may be different
git init # initiates repo in the directory
git remote add origin https://github.com/[USERNAME]/[USERNAME].github.io # connects git local repo to remote Github repo
git pull origin master # in case you have LICENSE and Readme.md files in the GitHub repo, they're downloaded
For now, your repo is ready. We will now focus on creating & customising our blogdown site.
RStudio and blogdown
We will open RStudio (v 1.1.336, development version as of today).
First, you may need to install blogdown in R:
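A minimal sketch of the installation (blogdown is on CRAN, and it can download Hugo for you):

install.packages("blogdown") # install the package from CRAN
blogdown::install_hugo() # installs Hugo itself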
In RStudio, select Menu > File > New Project, following the lower half of these instructions. The wizard for setting up a Hugo blogdown project may not yet be available in your RStudio version (though probably not for much longer).
Creating new Project
Selecting Hugo Blogdown format
Selecting Hugo Blogdown theme
A config.toml file appears
Customising paths and styles
Before we build and serve our site, we need to tweak a couple of things in advance, if we want to smoothly deploy our blog into GitHub pages.
Modify config.toml file
To integrate with GitHub Pages, these are the essential modifications at the top of our config.toml file:
We can revisit the config.toml file to make changes to the default settings.
The logo that appears in the corner must be in the root folder. To modify it in the config.toml:
picture = "logo.png"# the path to the logo
The cover (background) image must be located in /themes/hugo-tranquilpeak-theme/static/images. To modify it in the config.toml:
coverImage = "myimage.jpg"
We want some custom CSS and JS. We need to place them in /static/css and /static/js, respectively.
# Custom CSS. Put here your custom CSS files. They are loaded after the theme CSS;
# they have to be referred from static root. Example:
customCSS = ["css/my-style.css"]
# Custom JS. Put here your custom JS files. They are loaded after the theme JS;
# they have to be referred from static root. Example:
customJS = ["js/myjs.js"]
We can add arbitrary classes to our css file (see above).
Ever since I started writing in Bootstrap, I've missed it. This theme already has Bootstrap classes, so I brought in some others I didn't find in the theme (they're available for .md files, but currently not for .Rmd).
Once our theme is ready, we can add some content, modifying or deleting the various examples we will find in /content/post.
We need to make use of blogdown & Hugo to compile our .Rmd files and create our HTML posts:
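The standard blogdown calls for this step are as follows (the post doesn't show the exact invocation, but these are the usual ones):

blogdown::build_site() # compiles .Rmd posts into HTML
blogdown::serve_site() # serves a local preview and rebuilds on change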
In the Viewer pane, on the right side of the IDE, you can examine the resulting HTML and check whether anything went wrong.
Deploying the site
Updating the local git repository
This can be done with simple git commands:
cd /Git/[USERNAME].github.io # your path to the repo may be different
git add . # stages all files that will be added to the local repo
git commit -m "Starting my Hugo blog" # commits all staged files to the local repo, with a commit message
Pushing to GitHub
git push origin master # we push the changes from the local git repo to the remote repo (GitHub repo)
The ggvis package is used to make interactive data visualizations. The fact that it combines shiny's reactive programming model with dplyr's grammar of data transformation makes it a useful tool for data scientists.
This package allows us to implement features like interactivity, but on the other hand, every interactive ggvis plot must be connected to a running R session.
Before proceeding, please follow our short tutorial.
Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to verify your answers.
Using the variables “Horsepower” and “MPG.city” of the “Cars93” data set, make a scatterplot. HINT: Use ggvis() and layer_points().
Add a slider to the scatterplot of Exercise 1 that sets the point size from 10 to 100. HINT: Use input_slider().
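If you get stuck, here is one possible solution sketch for both exercises (Cars93 ships with the MASS package):

library(ggvis)
library(MASS) # provides the Cars93 data set

# Exercise 1: scatterplot of Horsepower vs MPG.city
Cars93 %>% ggvis(~Horsepower, ~MPG.city) %>% layer_points()

# Exercise 2: let a slider (range 10 to 100) control the point size
Cars93 %>% ggvis(~Horsepower, ~MPG.city, size := input_slider(10, 100)) %>% layer_points()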
It's Game of Thrones time again, as the battle for Westeros is heating up. There are tons of ideas, ingredients and interesting analyses out there, and I was craving my own flavour. So, step zero: where is the data?
Jenny Bryan's purrr tutorial introduced the list got_chars, representing character information from the first five books, which didn't seem much fun beyond exercising the list-manipulation muscle. However, it led me to An API of Ice and Fire, the world's greatest source of quantified and structured data from the universe of Ice and Fire, including the HBO series Game of Thrones. I decided to create my own API functions, or better, an R package (inspired by the famous rwar package).
The API resources cover 3 types of endpoint: Books, Characters and Houses. GoTr pulls data in JSON format and parses it into R list objects. httr's vignette Best practices for writing an API package, by Hadley Wickham, is another lifesaver.
The package contains:
– One function, got_api()
– Two ways to specify parameters generally, i.e. endpoint type + id, or url
– Three endpoint types
Another powerful parameter is query, which allows filtering by a specific attribute such as the name of a character, pagination, and so on.
It’s worth knowing about pagination. The first simple request will render a list of 10 elements, since the default number of items per page is 10. The maximum valid pageSize is 50, i.e. if 567 is passed on to it, you still get 50 characters.
# Retrieve character by name
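# (a sketch using the query parameter described above; the character name is illustrative)
got_api(type = "characters", query = list(name = "Jon Snow"))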
So how do we get ALL of the books, characters or houses? The package does not provide a function for this directly, but here's an implementation.
# Retrieve all books
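# (hedged sketch: all 12 books fit within the maximum pageSize of 50, so a single
# request suffices; purrr::map_chr then extracts the names)
books <- got_api(type = "books", query = list(pageSize = "50"))
purrr::map_chr(books, "name")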
##  "A Game of Thrones" "A Clash of Kings"
##  "A Storm of Swords" "The Hedge Knight"
##  "A Feast for Crows" "The Sworn Sword"
##  "The Mystery Knight" "A Dance with Dragons"
##  "The Princess and the Queen" "The Rogue Prince"
##  "The World of Ice and Fire" "A Knight of the Seven Kingdoms"
map_chr(houses, "name") %>% length()
##  444
map_df(houses, `[`, c("name", "region")) %>% head()
## # A tibble: 6 x 2
## name region
## 1 House Algood The Westerlands
## 2 House Allyrion of Godsgrace Dorne
## 3 House Amber The North
## 4 House Ambrose The Reach
## 5 House Appleton of Appleton The Reach
## 6 House Arryn of Gulltown The Vale
The houses list is a starting point for a social network analysis: mirror, mirror, tell me, who are the most influential houses in the Seven Kingdoms? Stay tuned: that is the topic of the next blog post.
Thanks to all open resources. Please comment, fork, issue, star the work-in-progress on our GitHub repository.
Ideally, the Gini coefficient used to estimate inequality is based on original household survey data with hundreds or thousands of data points. Often these data aren't available due to access restrictions, for privacy or other concerns, and all that is published is some kind of aggregate measure. Some aggregations include the income at the 80th percentile divided by that at the 20th (or the 90th and 10th); the number of people at the top of the distribution whose combined income equals that of everyone else; or the income of the top 1% as a percentage of all income. I wrote a little more about this in one of my earliest blog posts.
One way aggregated data are sometimes presented is as the mean income in each decile or quintile. This is not the same as the actual quantile values themselves, which are the boundaries between categories. The 20th percentile is the value for the person at position 20/100 when everyone is lined up in increasing order, whereas the mean income of the first quintile is the mean of all the incomes in the "bin" of everyone from 0/100 to 20/100 of that ordering.
To explore estimating Gini coefficients from this type of binned data I used data from the wonderful Lakner-Milanovic World Panel Income Distribution database, which is available for free download. This useful collection contains the mean income by decile bin of many countries from 1985 onwards – the result of some careful and doubtless very tedious work with household surveys from around the world. This is an amazing dataset, and amongst other purposes it can be used (as Milanovic and co-authors have pioneered dating back to his World Bank days) in combination with population numbers to estimate “global inequality”, treating everyone on the planet as part of a single economic community regardless of national boundaries. But that’s for another day.
Here's R code to download the data (in Stata format) and grab the first ten values, which happen to represent Angola in 1995. These particular data are based on consumption, which in poorer economies is often more sensible to measure than income:
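A sketch of that code (haven reads Stata files; the file name below is a placeholder for the actual download location):

library(haven)
library(dplyr)

wpid <- read_dta("LM_WPID_web.dta") # placeholder path; download the file from the database's website first
wpid %>% slice(1:10)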
Here are the resulting 10 numbers:
And this is the Lorenz curve:
Those graphics were drawn with this code:
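The original plotting code isn't reproduced here; a minimal base-R alternative for the Lorenz curve would be (assuming the ten decile means are in a hypothetical vector decile_means; the ineq package provides Lc()):

library(ineq)
plot(Lc(decile_means), main = "Lorenz curve from decile means")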
Calculating Gini directly from deciles?
Now, I could just treat these 10 deciles as a sample of 10 representative people (each observation, after all, represents exactly 10% of the population) and calculate the Gini coefficient directly, as sketched below. But my hunch was that this would underestimate inequality, because the straight lines in the Lorenz curve above are a simplification of the real, more curved, reality.
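A minimal sketch of that direct calculation, again assuming the ten decile means are in a vector decile_means (ineq provides Gini()):

library(ineq)
Gini(decile_means) # treats the 10 bin means as 10 representative incomes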
To investigate this issue, I started by creating a known population of 10,000 income observations from a Burr distribution, which is a flexible, continuous non-negative distribution often used to model income. That looks like this:
Then I divided the data up into between 2 and 100 bins, took the means of the bins, and calculated the Gini coefficient of those means. Doing this for 10 bins is equivalent to calculating a Gini coefficient directly from decile data such as the Lakner-Milanovic dataset. I got the result below, which shows that when you have the means of 10 bins, you underestimate inequality slightly:
Here's the code for that little simulation. I make myself a little function to bin data and return the mean values of the bins in a tidy data frame, which I'll need again later:
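A hedged reconstruction of that simulation (the helper name bin_avg and the Burr parameters are my assumptions, not necessarily the author's):

library(actuar) # for rburr()
library(dplyr)
library(purrr)
library(ineq)

# bin x into nbins equal-count bins and return the bin means in a tidy data frame
bin_avg <- function(x, nbins) {
  tibble(x = x) %>%
    mutate(bin = ntile(x, nbins)) %>%
    group_by(bin) %>%
    summarise(mean_x = mean(x))
}

set.seed(123)
population <- rburr(10000, shape1 = 3, shape2 = 2, scale = 1) # illustrative parameters

# Gini of the bin means for 2 to 100 bins, versus the full-population Gini
ginis <- map_dbl(2:100, function(nb) Gini(bin_avg(population, nb)$mean_x))
Gini(population)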
A better method for Gini from deciles?
Maybe I should have stopped there; after all, there is hardly any difference between 0.32 and 0.34, probably much less than the sampling error from the survey. But I wanted to explore whether there was a better way. The method I chose was to:
choose a log-normal distribution that would generate (close to) the 10 decile averages we have;
simulate individual-level data from that distribution; and
estimate the Gini coefficient from that simulated data.
I also tried this with a Burr distribution, but the results were very unstable. The log-normal approach was quite good at generating data whose 10 bin means were very similar to the original data, and gave plausible values of the Gini coefficient just slightly higher than when calculated directly from the bins' means.
Here’s how I did that:
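A hedged reconstruction of the idea (function names are mine; the author's actual code may differ). It uses the closed-form mean of a log-normal within a quantile bin to fit meanlog and sdlog to the observed decile means, then simulates individual-level data:

library(ineq)

# theoretical mean of a log-normal distribution within the [p_lo, p_hi] quantile bin
lnorm_bin_mean <- function(meanlog, sdlog, p_lo, p_hi) {
  exp(meanlog + sdlog^2 / 2) *
    (pnorm(qnorm(p_hi) - sdlog) - pnorm(qnorm(p_lo) - sdlog)) / (p_hi - p_lo)
}

deciles_to_gini <- function(decile_means, n_sim = 10000) {
  p <- seq(0, 1, by = 0.1)
  # squared distance between observed decile means and those implied by the parameters
  obj <- function(par) {
    implied <- lnorm_bin_mean(par[1], exp(par[2]), head(p, -1), tail(p, -1))
    sum((implied - decile_means)^2)
  }
  par <- optim(c(log(mean(decile_means)), 0), obj)$par
  # simulate individuals from the fitted distribution and estimate Gini from them
  Gini(rlnorm(n_sim, meanlog = par[1], sdlog = exp(par[2])))
}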
And here are the results. The first table shows the means of the bins in my simulated log-normal population (mean) compared to the original data for Angola's actual deciles in 1995 (x). The next two values, 0.415 and 0.402, are the Gini coefficients from the simulated and original data respectively:
As would be expected from my earlier simulation, the Gini coefficient from the estimated underlying log-normal distribution is very slightly higher than that calculated directly from the means of the decile bins.
Applying this method to the Lakner-Milanovic inequality data
I rolled up this approach into a function to convert means of deciles into Gini coefficients and applied it to all the countries and years in the World Panel Income Distribution data. Here are the results, first over time:
… and then as a snapshot:
Neither of these is great as a polished data visualisation, but it’s difficult data to present in a static snapshot, and will do for these illustrative purposes.
Here's the code for that function (which depends on the helper defined previously) and for drawing those charts. Thanks to the convenience of Hadley Wickham's dplyr and ggplot2, it's easy to do this on the fly, and in the code below I calculate the Gini coefficients twice, once for each chart. Technically this is wasteful, but with modern computers it isn't a big deal: even though there is quite a bit of computationally intensive work going on under the hood, the code only takes a minute or so to run.
There we go – deciles to Gini fun with world inequality data!
My first car was a 13-year-old Mitsubishi Colt; I paid 3,000 Dutch guilders for it. I can still remember a friend who wouldn't let me park this car in front of his house because of possible oil leakage.
Can you get an idea of which cars are likely to leak oil? Well, with open car data from the Dutch RDW you can. The RDW is the Netherlands Vehicle Authority in the mobility chain.
There are many data sets that you can download. I have used the following:
Observed defects. This set contains 22 million records on observed defects at the car level (license plate number). Cars in the Netherlands have to be inspected yearly, and the findings of each inspection are submitted to the RDW.
Basic car details. This set contains 9 million records: all the cars in the Netherlands, with license plate number, brand, make, weight and type of car.
Defect codes. This little table provides a description of all the possible defect codes, so I know that code 'RA02' in the observed-defects data set represents 'oil leakage'.
Simple Analysis in R
I imported the data into R and, with some simple dplyr statements, determined per car make and age (in years) the number of cars with an observed oil-leakage defect. I then determined how many cars there are per make and age; dividing those two numbers results in a so-called oil-leak percentage.
For example, in the Netherlands there are 2,043 four-year-old Opel Astras, three of which have an observed oil leak, so the oil-leak percentage is 0.15%.
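In dplyr terms the logic looks something like this (object and column names are illustrative, not the actual RDW schema):

library(dplyr)

leaks <- defects %>%
  filter(defect_code == "RA02") %>% # 'RA02' = oil leakage
  count(make, age, name = "n_leaking")

leak_pct <- cars %>%
  count(make, age, name = "n_total") %>%
  left_join(leaks, by = c("make", "age")) %>%
  mutate(n_leaking = coalesce(n_leaking, 0L),
         leak_pct = 100 * n_leaking / n_total)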
The graph below shows the oil-leak percentages for different car brands and ages. Obviously, the older the car, the higher the leak percentage. But look at BMW: wow, those old BMWs are leaking oil like crazy… The few lines of R code can be found here.
There is a lot in the open car data from the RDW; you can look at many more aspects and defects of cars. As for my old car: according to this data, Mitsubishis have a low oil-leak percentage, even the older ones.
Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab's. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 379 other packages on CRAN, an increase of 49 since the last CRAN release in June!
Changes in this release relative to the previous CRAN release are as follows:
Changes in RcppArmadillo version 0.7.960.1.0 (2017-08-11)
Upgraded to Armadillo release 7.960.1 (Northern Banana Republic Deluxe)
faster randn() when using OpenMP (NB: usually omitted when used from R)
faster gmm_diag class, for Gaussian mixture models with diagonal covariance matrices
added .sum_log_p() to the gmm_diag class
added gmm_full class, for Gaussian mixture models with full covariance matrices
expanded .each_slice() to optionally use OpenMP for multi-threaded execution
Upgraded to Armadillo release 7.950.0 (Northern Banana Republic)
expanded accu() and sum() to use OpenMP for processing expressions with computationally expensive element-wise functions
expanded trimatu() and trimatl() to allow specification of the diagonal which delineates the boundary of the triangular part
Enhanced support for sparse matrices (Binxiang Ni as part of Google Summer of Code 2017)
As you may have noticed, we have made a few changes to our apps for the 2017 season to bring you a smoother and quicker experience while also adding more advanced and customizable views.
Most visibly, we moved the apps to Shiny so we can continue to build on our use of R and add new features and improvements throughout the season. We expect the apps to better handle high traffic this season, especially during drafts and other peak periods.
In addition to the ability to create and save custom settings, you can also choose the columns you view in our Projections tool. We have also added more advanced metrics such as weekly VOR and Projected Points Per Dollar (ROI) for those of you in auction leagues. With a free account, you’ll be able to create and save one custom setting. If you get an FFA Insider subscription, you’ll be able to create and save unlimited custom settings.
Up next is the ability to upload custom auction values to make it easier to use during auction drafts.
We are also always looking to add new features, so feel free to drop us a suggestion in the Comments section below!
The first "official" version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. R Core member Peter Dalgaard shared this tweet yesterday:
Twenty years ago, on August 16, 1997, the R Core Group was formed. Before that date, the committers to R were the project's founders, Ross Ihaka and Robert Gentleman, along with Luke Tierney, Heiner Schwarte and Paul Murrell. The email above was the invitation for Kurt Hornik, Peter Dalgaard and Thomas Lumley to join as well. With the sole exception of Schwarte, all of the above remain members of the R Core Group, which has since expanded to 21 members. These are the volunteers who implement the R language and its base packages; document, build, test and release it; and manage all the infrastructure that makes that possible.
Thank you to all the R Core Group members, past and present!