R-based data maps in PowerBI

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

One benefit of integrating R with PowerBI is access to a rich array of data visualizations not present in the standard PowerBI loadout. R is practically unlimited in the types of graphics it can create (although the amount of programming required can vary from a few lines using an existing R package, to large custom functions for truly bespoke graphics). Some of the visualizations you can create with R include population pyramids, small multiples, annotated time series, calendar heat maps, rank plots and even emoji charts. But perhaps one of the biggest opportunities is the ability to plot data on a geographic surface with choropleths, map projections, topological maps and animated maps.

If you’d like to learn how to use R maps with PowerBI, David Eldersveld from BlueGranite has put together a useful series of tutorials. In R Maps in Microsoft Power BI: Getting Started, Dave walks you through the steps of using R’s maps and mapproj packages to create an interactive PowerBI dashboard to explore the (surprisingly numerous) airfields in the Great Lakes region, based on data provided by the FAA.

In part 2 of the tutorial, Dave explores using R to create small multiples: repeated maps showing data varying by time or data subset (a great way of making comparisons). This example shows the airports by type of owner (public, private, Army or Air Force):

You can learn more about how to create charts like these in PowerBI by following the link below. You can also download the PowerBI PBIX files and the modified dataset to try them out yourself.

DataVeld: R Maps in Microsoft Power BI: Getting Started

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Toying with models: The Game of Life with selection

By mrtnj

oscillators

(This article was first published on R – On unicorns and genes, and kindly contributed to R-bloggers)

Conway’s Game of life is probably the most famous cellular automaton, consisting of a grid of cells that develop according to simple rules. Today, we’re going to add mutation and selection to the game, and let patterns evolve.

The fate of a cell depends on the number of cells that live in the eight neighbouring positions. A cell with fewer than two neighbours dies from starvation. A cell with more than three neighbours dies from overpopulation. If a position is empty and has exactly three neighbours, it will be filled by a cell. These rules lead to some interesting patterns, such as still lifes that never change, oscillators that alternate between states, patterns that eventually die out but take a long time to do so, patterns that keep generating new cells, and so forth.

When I played with the Game of life as a child, I liked one pattern called ”virus” that looked a bit like this. On its own, a grid of four-by-four blocks is a still life, but add one cell (the virus), and the whole pattern breaks. This is a version on a 30 x 30 cell board. It unfolds rather slowly, but in the end, a glider collides with a block, and you are left with some oscillators.

blocks virus

There are probably other interesting ways that evolution could be added to the game of life. We will take a hierarchical approach where the game is taken to describe development, and the unit of selection is the pattern. Each generation, we will create a variable population of patterns, allow them to develop and pick the fittest. So, here the term ”development” refers to what happens to a pattern when applying the rules of life, and the term ”evolution” refers to how the population of patterns changes over the generations. This differs slightly from Game of life terminology, where ”evolution” and ”generation” usually refer to the development of a pattern, but it is consistent with how biologists use the words: development takes place during the life of an organism, and evolution happens over the generations as organisms reproduce and pass on their genes to offspring. I don’t think there’s any deep analogy here, but we can think of the initial state of the board as the heritable material that is being passed on and occasionally mutated. We let the pattern develop, and at some point, we apply selection.

First, we need an implementation of the game of life in R. We will represent the board as a matrix of ones (live cells) and zeroes (empty positions). Here is a function that develops the board one tick in time. After dealing with the corners and edges, it’s very short, but also slow as molasses. The next function does this for a given number of ticks.

## Develop one tick. Return new board matrix.
develop <- function(board_matrix) {
  padded <- rbind(matrix(0, nrow = 1, ncol = ncol(board_matrix) + 2),
                  cbind(matrix(0, ncol = 1, nrow = nrow(board_matrix)), 
                        board_matrix,
                        matrix(0, ncol = 1, nrow = nrow(board_matrix))),
                  matrix(0, nrow = 1, ncol = ncol(board_matrix) + 2))
  new_board <- padded
  for (i in 2:(nrow(padded) - 1)) {
    for (j in 2:(ncol(padded) - 1)) {
      neighbours <- sum(padded[(i-1):(i+1), (j-1):(j+1)]) - padded[i, j]
      if (neighbours < 2 | neighbours > 3) {
        new_board[i, j] <- 0
      }
      if (neighbours == 3) {
        new_board[i, j] <- 1
      }
    }
  }
  new_board[2:(nrow(padded) - 1), 2:(ncol(padded) - 1)]
}

## Develop a board a given number of ticks.
tick <- function(board_matrix, ticks) {
  if (ticks > 0) {
    for (i in 1:ticks) {
      board_matrix <- develop(board_matrix) 
    }
  }
  board_matrix
}
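
As a quick sanity check of develop() and tick() (an example of mine, not from the original post), the classic blinker oscillator should flip between a horizontal and a vertical bar of three cells:

blinker <- matrix(0, nrow = 5, ncol = 5)
blinker[3, 2:4] <- 1   # horizontal bar of three live cells
tick(blinker, 1)       # after one tick the bar is vertical
tick(blinker, 2)       # after two ticks it is horizontal again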

We introduce random mutations to the board. We will use a mutation rate of 0.0011 per cell, which gives us a mean of about one mutation per 30 x 30 board.

## Mutate a board
mutate <- function(board_matrix, mutation_rate) {
  mutated <- as.vector(board_matrix)
  outcomes <- rbinom(n = length(mutated), size = 1, prob = mutation_rate)
  for (i in 1:length(outcomes)) {
    if (outcomes[i] == 1)
      mutated[i] <- ifelse(mutated[i] == 0, 1, 0)
  }
  matrix(mutated, ncol = ncol(board_matrix), nrow = nrow(board_matrix))
}
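
To check that this rate behaves as described (again my addition, not from the post), we can count flipped cells over many mutated boards. This also defines the global mu that next_generation() and evolve() below rely on:

mu <- 0.0011
board <- matrix(0, nrow = 30, ncol = 30)
## average number of flipped cells per board; should be close to 0.0011 * 900 = 0.99
mean(replicate(1000, sum(mutate(board, mutation_rate = mu) != board)))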

I was interested in the virus pattern, so I decided to apply a simple directional selection scheme for the number of cells at tick 80, which is a while after the virus pattern has stabilized itself into oscillators. We will count the number of cells at tick 80 and call that ”fitness”, even if it actually isn’t (it is a trait that affects fitness by virtue of the fact that we select on it). We will allow the top half of the population to produce two offspring each, thus keeping the population size constant at 100 individuals.

## Calculates the fitness of an individual at a given time
get_fitness <- function(board_matrix, time) {
  board_matrix %>% tick(time) %>% sum
}

## Develop a generation and calculate fitness
grow <- function(generation) {
  generation$fitness <- sapply(generation$board, get_fitness, time = 80)
  generation
}

## Select a generation based on fitness, and create the next generation,
## adding mutation.
next_generation <- function(generation) {
  keep <- order(generation$fitness, decreasing = TRUE)[1:50]
  new_generation <- list(board = vector(mode = "list", length = 100),
                         fitness = numeric(100))
  ix <- rep(keep, each = 2)
  for (i in 1:100) new_generation$board[[i]] <- generation$board[[ix[i]]]
  new_generation$board <- lapply(new_generation$board, mutate, mutation_rate = mu)
  new_generation
}

## Evolve a board, with mutation and selection, for a number of generations.
evolve <- function(board, n_gen = 10) { 
  generations <- vector(mode = "list", length = n_gen)

  generations[[1]] <- list(board = vector(mode = "list", length = 100),
                           fitness = numeric(100))
  for (i in 1:100) generations[[1]]$board[[i]] <- board
  generations[[1]]$board <- lapply(generations[[1]]$board, mutate, mutation_rate = mu)

  for (i in 1:(n_gen - 1)) {
    generations[[i]] <- grow(generations[[i]])
    generations[[i + 1]] <- next_generation(generations[[i]])
  }
  generations[[n_gen]] <- grow(generations[[n_gen]])
  generations
}
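
A minimal run might look like the following (a sketch of mine; the exact starting layout used in the post may differ, and, as the post warns, this is slow, so be patient):

library(magrittr)  # get_fitness() uses the pipe

## nine 2 x 2 blocks (36 cells) spread over a 30 x 30 board
board <- matrix(0, nrow = 30, ncol = 30)
for (i in c(5, 14, 23)) {
  for (j in c(5, 14, 23)) {
    board[i:(i + 1), j:(j + 1)] <- 1
  }
}

generations <- evolve(board, n_gen = 10)          # assumes mu <- 0.0011 as above
sapply(generations, function(g) mean(g$fitness))  # mean cell count at tick 80 per generation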

Let me now tell you that I was almost completely wrong about what happens with this pattern once you apply selection. I thought that the initial pattern of nine stable blocks (36 cells) was pretty good, that it would be preserved for a long time, and that virus-like patterns (like the first animation above) would mostly have degenerated by time 80. As this plot of the evolution of the number of cells in one replicate shows, I grossly underestimated this pattern. The y-axis is the number of cells at time 80, and the x-axis shows individuals, with vertical lines separating generations. Already by generation five, most individuals do better than 36 cells in this case:

blocks_trajectory_plot

As one example, here is the starting position and the state at time 80 for a couple of individuals from generation 10 of one of my replicates:

blocks_g10_1 blocks_g10_80

blocks_g10_1b blocks_g10_80b

Here is how the average cell number at time 80 evolves in five replicates. Clearly, things are still going on at generation 10, not only in the replicate shown above.

mean_fitness_blocks

Here is the same plot for the virus pattern I showed above, i.e. the blocks but with one single added cell, fixed in the starting population. Prior genetic architecture matters. Even if the virus pattern has fewer cells than the blocks pattern at time 80, it is apparently a better starting point to quickly evolve more cells:

mean_fitness_virus

And finally, out of curiosity, what happens if we start with an empty 30 x 30 board?

mean_fitness_blank

Not much. The simple still life block evolves a lot. But in my replicate three, this creature emerged. ”Life, uh, finds a way.”

blank_denovo

Unfortunately, many of the selected patterns extended to the edges of the board, making them play not precisely the game of life, but the game of life with edge effects. I’d like to use a much bigger board and see how far patterns extend. It would also be fun to follow them longer. To do that, I would need to implement a more efficient way to update the board (this is very possible, but I was lazy). It would also be fun to select for something more complex, with multiple fitness components, potentially in conflict, e.g. favouring patterns that grow large at a later time while being as small as possible at an earlier time.

Code is on github, including functions to display and animate boards with the animation package and ImageMagick, and code for the plots. Again, the blocks_selection.R script is slow, so leave it running and go do something else.

Posted in: computer stuff, evolution Tagged: evolution, game of life, R, selection

To leave a comment for the author, please follow the link and comment on their blog: R – On unicorns and genes.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Batch Forecasting in R

By atmathew

(This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers)

Given a data frame with multiple columns which contain time series data, let’s say that we are interested in executing an automatic forecasting algorithm on a number of columns. Furthermore, we want to train each model on a particular number of observations and assess how well it forecasts future values. Based upon those testing procedures, we will estimate the full model. This is a fairly simple undertaking, but let’s walk through this task. My preference for such procedures is to loop through each column and append the results into a nested list.

First, let’s create some data.

ddat <- data.frame(date = c(seq(as.Date("2010/01/01"), as.Date("2010/03/02"), by=1)),
                      value1 = abs(round(rnorm(61), 2)),
                      value2 = abs(round(rnorm(61), 2)),
                      value3 = abs(round(rnorm(61), 2)))
head(ddat)
tail(ddat)

We want to forecast future values of the three columns. Because we want to save the results of these models into a list, let’s begin by creating a list with one element for each column of our data frame.

lst.names <- colnames(ddat)
lst <- vector("list", length(lst.names))
names(lst) <- lst.names
lst

I’ve gone ahead and written a user-defined function that handles the batch forecasting process. It takes two arguments: a data frame and an argument (with a default) that specifies the number of observations to be used in the training set. The model estimates, forecasts, and diagnostic measures will be saved as a nested list and categorized under the appropriate variable name.

library(forecast)  # provides auto.arima(), forecast(), ma() and accuracy()

batch <- function(data, n_train=55){

  lst.names <- colnames(data)
  lst <- vector("list", length(lst.names))
  names(lst) <- lst.names

  for( i in 2:ncol(data) ){

    # store the training and test dates under the first (date) element
    lst[[1]][["train_dates"]] <- data[1:n_train, 1]
    lst[[1]][["test_dates"]]  <- data[(n_train+1):nrow(data), 1]

    # fit on the training window and forecast the 6 held-out observations
    est  <- auto.arima(data[1:n_train, i])
    fcas <- forecast(est, h=6)$mean
    acc  <- accuracy(fcas, data[(n_train+1):nrow(data), i])
    fcas_upd <- data.frame(date     = data[(n_train+1):nrow(data), 1],
                           forecast = fcas,
                           actual   = data[(n_train+1):nrow(data), i])

    lst[[i]][["estimates"]]  <- est
    lst[[i]][["forecast"]]   <- fcas
    lst[[i]][["forecast_f"]] <- fcas_upd
    lst[[i]][["accuracy"]]   <- acc

    # fall back to a moving-average forecast when the ARIMA forecast is flat
    # or the test-set MAE (third accuracy column) is too high
    cond1 = diff(range(fcas[1], fcas[length(fcas)])) == 0
    cond2 = acc[,3] >= 0.025

    if(cond1 | cond2){

      mfcas = forecast(ma(data[,i], order=3), h=5)
      lst[[i]][["moving_average"]] <- mfcas

    } else {

      # otherwise re-estimate on the full series and forecast ahead
      est2  <- auto.arima(data[,i])
      fcas2 <- forecast(est2, h=5)$mean

      lst[[i]][["estimates_full"]] <- est2
      lst[[i]][["forecast_full"]]  <- fcas2

    }
  }
  return(lst)
}
 
batch(ddat)

This isn’t the prettiest code, but it gets the job done. Note that lst was populated within a function and won’t be available in the global environment. Instead, I chose to simply print out the contents of the list after the function is evaluated.
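
If you want to work with the results rather than just print them, assign the return value and index into the nested list. A small sketch (the element names simply follow the column names of ddat):

results <- batch(ddat)
names(results)             # "date" "value1" "value2" "value3"
results$value2$accuracy    # test-set accuracy measures for the second series
results$value2$forecast_f  # forecasts alongside the held-out actuals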

To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

covplot supports GRangesList

By R on Guangchuang Yu

(This article was first published on R on Guangchuang Yu, and kindly contributed to R-bloggers)

To answer the issue, I extend the covplot function to support viewing coverage of a list of GRanges objects or bed files.

library(ChIPseeker)
files <- getSampleFiles()
peak=GenomicRanges::GRangesList(CBX6=readPeakFile(files[[4]]),
                                CBX7=readPeakFile(files[[5]]))

p <- covplot(peak)
print(p)

By default, the coverage plots are merged together with different colors. Users can separate them into different panels using facet_grid.

library(ggplot2)
col <- c(CBX6='red', CBX7='green')
p + facet_grid(chr ~ .id) + scale_color_manual(values=col) + scale_fill_manual(values=col)

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang Yu.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

New CRAN package gunsales

By Thinking inside the box

Total Estimated Gun Sales

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

This is based on joint work with Gregor Aisch and Josh Keller of the New York Times.

A new package gunsales is now on the CRAN network for R. It is based on the NYTimes/gunsales repository underlying the excellent New York Times visualizations, first published in December 2015 and updated with more recent data since.

The analysis takes public government data on gun sales from the National Instant Criminal Background Check System (NICS). The original data is scraped from the pdf, included in the package, and analysed in a cross-section and time-series manner. The standard US Census tool X-13ARIMA-SEATS is used to deseasonalize the timeseries at the national or state level. (Note that Buzzfeed also published data and (Python) code in another GitHub repo.)

As an aside, it was the use of X-13ARIMA-SEATS here — and its somewhat awkward and manual installation also seen in the initial versions of the code in the NYTimes/gunsales repo — which led to the recent work by Christoph Sax and myself. We now provide a new package x13binary on CRAN so that Christoph’s excellent seasonal package can simply depend upon it and have a working binary provided and installed ready to use; see the recent blog post for more. The net result is that a package like this new gunsales project can simply depend upon seasonal and also be assured that x13binary “just works”. As Martha would say, “A Good Thing”.
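
To illustrate that point (an example of mine, not from the post): with x13binary on CRAN, a seasonal adjustment with the seasonal package now works out of the box, with no manual download of the X-13 binary.

install.packages(c("x13binary", "seasonal"))  # x13binary ships the X-13ARIMA-SEATS binary
library(seasonal)
m <- seas(AirPassengers)  # seasonally adjust a classic monthly series
summary(m)
plot(m)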

Back to the gunsales project. Following the initial publication of the repository with the data and R code in a simple script, I felt compelled to reorganize it as a package. Packages for R, as we teach our students, colleagues, or anybody else who wants to listen, are really the best way to bundle code, data, documentation (i.e. vignettes) and tests. All of that now exists in the gunsales package.

The package now has one main function, analysis(), which returns a single dataframe object. This dataframe object can then be fed to two plotting functions. The first, plot_gunsales(), will then recreate all the (base R) plots from the original code base. The second, ggplot_gunsales(), does the same but via ggplot2.
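
In practice a session can be as short as this (a sketch built on the function names above; see the package documentation for the exact arguments):

library(gunsales)
gs <- analysis()       # one data frame with the processed series
plot_gunsales(gs)      # the original base R plots
ggplot_gunsales(gs)    # the same plots via ggplot2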

This should give anybody the ability to look at the data, study the transformations done, form and maybe test new hypotheses and visualize in manner comparable to the original publication.

As an amuse gueule, here are the key plots also shown in the main README.md at GitHub:

Total Estimated Gun Sales, Seasonally Adjusted

Total Estimated Gun Sales, Population-Growth Adjusted

Handguns vs Longguns

Six States

DC

We look forward to more remixes and analysis of this data. The plan of the GitHub repository is to keep the data set updated as new data points are published.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

My Baby Boomer Name Might Have Been “Debbie”

By Julia Silge

Shiny App Screenshot

(This article was first published on data science ish, and kindly contributed to R-bloggers)

I have always loved learning and thinking about names, how they are chosen and used, and how people feel about their names and the names around them. We had a traditional baby name book at our house when I was growing up (you know, lists of names with meanings), and I remember poring over it to find unusual or appealing names for my pretend play or the stories I wrote. As an adult, I read Laura Wattenberg’s excellent book on baby names when we were expecting our second baby, and I also discovered the NameVoyager on Wattenberg’s website. I just love that kind of thing.

The data used to make the NameVoyager interactive is from the Social Security Administration based on Social Security card applications, and Hadley Wickham has done the work of taking the same data and making it an R package. Lucky us! Let’s use this package and take a look at how the popularity of my name has changed over time.

library(babynames)
library(ggplot2)
library(dplyr)
data(babynames)
juliejulia <- babynames %>% filter(sex == "F", name %in% c("Julia", "Julie"))
ggplot(juliejulia, aes(x = year, y = prop, color = name)) + 
        geom_line(size = 1.1) + 
        theme(legend.title=element_blank()) + 
        scale_color_manual(values = c("tomato", "midnightblue")) + 
        ggtitle("My Name Is NOT JULIE!") +
        ylab("Proportion of total applicants for year") + xlab("Year")

The babynames package includes data from 1880 through 2014. I was born in 1978; notice that for about 20 years or so before my birth, the name “Julie” was very popular, about 4 times as popular as “Julia”. This means that by the time I was born, there were many more girls and women named Julie walking around than those named Julia. During my childhood, I got called “Julie” all the time by people who misheard or misread my name, and oh, how it rankled! It bothered me so, so much at the time. My actual name started to gain in popularity a little after “Julie” started to decline in popularity, so this doesn’t happen to me much as an adult. In fact, I have known a number of other girls and women named Julia at this point in my life, although they have all been younger than me.

My Parents, the Trendsetters

I’ve always thought it was interesting that my parents somehow picked a name like mine; they picked a name that was on its way to becoming popular again, after its long decline from the 19th century, but wasn’t really popular yet. (They did the same thing for my one sibling, too; her name also was about to become popular when she was born and named.) Let’s look a bit more deeply at my name’s popularity around my birth year.

pickaname <- babynames %>% filter(sex == "F", name == "Julia")
pickaname[pickaname$year == 1978,]
## Source: local data frame [1 x 5]
## 
##    year   sex  name     n        prop
##   (dbl) (chr) (chr) (int)       (dbl)
## 1  1978     F Julia  2592 0.001576993

So that is the proportion of the total applicants for Social Security cards who had the name “Julia” in 1978, a measure of the popularity of a name. How is the popularity changing? Let’s take 5 years before and after my birth year and fit a linear model to just those years.

subsetfitname <- pickaname %>% filter(year %in% seq(1978-5,1978+5))
myfit <- lm(prop ~ year, subsetfitname)
subsetfitname$prop <- myfit$fitted.values

fitname <- pickaname %>% mutate(fit = "data")
subsetfitname <- subsetfitname %>% mutate(fit = "fit")
fitname <- rbind(fitname, subsetfitname)
fitname$fit <- as.factor(fitname$fit)
goalprop <- as.numeric(fitname[fitname$year == 1978 & fitname$fit == "data",'prop'])
goalslope <- myfit$coefficients[2]

ggplot(fitname, aes(x = year, y = prop, color = fit, size = fit, alpha = fit)) + 
        geom_line() + 
        annotate("point", x = 1978, y = goalprop,
                 color = "tomato", size = 4, alpha = .8) +
        theme(legend.title=element_blank()) + 
        scale_color_manual(values = c("black", "blue")) +
        scale_size_manual(values = c(1.1, 2)) +
        scale_alpha_manual(values = c(1, 0.8)) +
        ggtitle("How Was the Popularity of My Name Changing Around 1978?") +
        ylab("Proportion of total applicants for year") + xlab("Year")

How Was the Popularity of My Name Changing Around 1978?

Here we can see the positive slope for the proportion of applicants with year; the popularity of the name “Julia” is increasing around 1978.

Finding Similar Names in a Different Year

Now we have a magnitude and a slope to characterize the popularity of my name in my birth year. Let’s find similar names in other years. This type of analysis was done by Time last year, but they ranked names and then matched names by their rank (i.e. what was the 112th most common name today and in the past decades?). Here, we are approaching the question a little differently. What names in other years have a proportion of the total applicants, and a change in that proportion, similar to my name in my birth year? My oldest daughter was born in 2006; let’s try that year. First, let’s find all the names with about the same proportion. Then, let’s calculate the slope for each of those names.

goalyear <- 2006
findmatches <- babynames %>% filter(sex == "F", year == goalyear, 
                     prop < goalprop*1.1 & prop > goalprop*0.9) %>%
        mutate(slope = 0.00)

for (i in seq_along(findmatches$name)) {
        matchfitname <- babynames %>% filter(sex == "F", 
                                             name == as.character(findmatches[i,'name']))
        matchfitname <- matchfitname %>% filter(year %in% seq(goalyear-5,goalyear+5))
        matchfit <- lm(prop ~ year, matchfitname)
        findmatches[i,'slope'] <- matchfit$coefficients[2]
}

Now, let’s keep only the names that have about the same slope as the original name. For matching purposes, the slopes here are divided into three categories: positive, negative, and mostly flat (between -0.00005 and 0.00005).

if (goalslope >= 0.00005) {
        matchnames <- findmatches %>% filter(slope >= 0.00005) %>% select(name)
} else if (goalslope <= -0.00005) {
        matchnames <- findmatches %>% filter(slope <= -0.00005) %>% select(name)
} else {
        matchnames <- findmatches %>% 
                filter(slope > -0.00005 & slope < 0.00005) %>% select(name) 
}

matchnames <- babynames %>% filter(sex == "F", name %in% matchnames$name)
plotname <- rbind(pickaname, matchnames)

So what do we have?

ggplot(plotname, aes(x = year, y = prop, color = name)) + 
        geom_line(size = 1.1) + 
        annotate("text", x = 1978, y = goalprop*1.3, label = "1978") +
        annotate("point", x = 1978, y = goalprop,
                 color = "blue", size = 4.5, alpha = .8) +
        annotate("text", x = goalyear, y = goalprop*1.3, label = goalyear) +
        annotate("point", x = goalyear, y = goalprop,
                 color = "blue", size = 4.5, alpha = .8) +
        theme(legend.title=element_blank()) + 
        ggtitle("Which Names For a Girl Born in 2006 Are Similar to Julia Born in 1978?") +
        ylab("Proportion of total applicants for year") + xlab("Year")

Which Names For a Girl Born in 2006 Are Similar to Julia Born in 1978?

These names all have about the same proportion of the population in 2006 as “Julia” in 1978, and they are all increasing in popularity.

What If I Was a Baby Boomer?

My mom was born in 1953, so let’s do this one more time.

Which names for a girl born in 1953 are similar to Julia born in 1978?

These names definitely sound different than the 2006 matches. They don’t necessarily sound like 1950s names, which makes sense when we look at the patterns in their popularity. These names were all at the beginning of becoming more popular in 1953, some of them extremely popular; they sound more like 1970s names to me. Coincidentally, the proportion for Julia at 1953 is within the bounds to match the proportion for Julia at 1978, but the slope at 1953 is flat and not increasing.

Explore the Names Yourself

I made a Shiny app to explore the names further. Check out the code for the app, and explore the names in the app itself. The app works pretty much just as described in this blog post. Let’s look at some screenshots for a few cases.

Robert in 1910

There is only one name in 1980 as popular as Robert in 1910, and that is Christopher, which was the second most popular male name in 1980. Robert is further down the list in 1910. This illustrates a trend in U.S. baby naming; in general, more parents now are choosing less common names than in the past. You can see this in the NameVoyager. That visualization includes the top 1000 names for each year; notice that this makes up a smaller proportion of total births in recent years than in earlier years.

It’s not that there aren’t ever very popular, common names in recent decades, though.

Jennifer in 1980

Gosh, Jennifer was just so dominant during my childhood. So much so that there was no name for girls as popular in 2010 as Jennifer was in 1980. I think my younger sister regularly had multiple Jennifers in her classes all through school.

What if we switch those dates?

Jennifer in 2010

By 2010, Jennifer is on the decline. Naming one’s daughter Jennifer in 2010 is like someone my age being named Barbara, Sharon, or Cheryl. I do not know many women my age with these names, and I will say they sound significantly older to me. (Probably my children or grandchildren will revive these names as adorably retro.)

What happens if you choose a name that is rare in your chosen year? Like, say, Leonard in 1990?

Leonard in 1990

There are many more rare names than common names (everything’s a power law?). In fact, I ended up adding a ceiling to how many names the app will display because it would just get too overwhelming. The matches to display for the rare names are chosen randomly so you might get different names if you put in the same rare name twice.

Looking at the rare names can be pretty entertaining, though. Notice how a few of the extremely rare names in 1920 (Austin, Steven, Larry) went on to significant popularity. Some, of course, stayed rare and sound rather hilariously antique (Cornelius, Millard, Gerard), and we can see the rise of Hispanic names like Carlos throughout this app. After playing with the app for a while, I’ve come to the conclusion that this matching process is the least meaningful for very rare names with flat slopes; it’s just a slush of super rare names not changing in popularity down there.

The End

There are a couple of other Shiny apps out there for exploring the babynames data set in different ways, if you just can’t get enough. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!

To leave a comment for the author, please follow the link and comment on their blog: data science ish.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Mapping Birds with Choroplethr

By Ari Lamstein

bird

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

The bird in question. But where does it live? Credit: Wikipedia.

After releasing my course Mapmaking in R with Choroplethr last December I received an interesting email from Dr. Herb Wilson, a biologist at Colby College. Herb studies the Red-breasted Nuthatch, which lives throughout the United States and Canada. He asked if it was possible to use choroplethr to map the location of these birds. Herb’s data was a list of (region, value) pairs, where the regions are US States and Canadian Provinces.

At that time it was possible to use the function ?admin1_choropleth to map the birds in US States or Canadian Provinces, but not both simultaneously. So I created a new function for him, ?admin1_region_choropleth, which solves this exact problem.

The code is now on CRAN. Below is a tutorial on how to use it.

The Code

To get the latest version of choroplethr, simply install it from CRAN and check its version:

install.packages("choroplethr")
packageVersion("choroplethr")
[1] ‘3.5.0’

If you see a lower version (for example 3.4.0), then you need to wait a day or two until your CRAN mirror updates.

The new function is called admin1_region_choropleth. You can see its built-in help like this:

?admin1_region_choropleth

The Data

The bird count data comes from the annual Christmas Bird Count run by the National Audubon Society. I have put it in a github repository which you can access here. Once you download the file you can read it in like this:

library(readr)
rbn = read_csv("~/Downloads/rbn.csv")

head(rbn)
Source: local data frame [6 x 8]

  State AdminRegion Count_yr SpeciesNumber NumberByPartyHours  Year ReportingCounts ReportingObservers
  (chr)       (chr)    (int)         (int)              (dbl) (int)           (int)              (int)
1    AK      alaska       63             1             0.0128  1962               1                  1
2    AK      alaska       64             2             0.0233  1963               1                  2
3    AK      alaska       70             6             0.0513  1969               2                  8
4    AK      alaska       71             4             0.0313  1970               1                  7
5    AK      alaska       72             2             0.0187  1971               2                 18
6    AK      alaska       73             3             0.0328  1972               2                 13

If we wanted to map the NumberByPartyHours in 2013 we could start like this:

library(dplyr)

rbn2013 = rbn %>% 
          rename(region = AdminRegion, value = NumberByPartyHours) %>%
          filter(Year == 2013 & !region %in% c("northwest territories", "alaska")) 

We rename the columns to region and value because choroplethr requires columns with those names. We filter out Alaska and the Northwest Territories because they visually throw off the map a bit, and might look nicer as insets.

Making the Map

To make the map, simply pass the new data frame to the function admin1_region_choropleth:

library(choroplethr)
library(ggplot2) 

admin1_region_choropleth(rbn2013, 
    title  = "2013 Red-breasted Nuthatch Sightings", 
    legend = "Sightings By Party Hours") + coord_map()

And now the visual pattern of the data is clear. Within North America, the Red-breasted Nuthatch has been seen mostly in the northwest and northeast.

Updating the Course

This is the third update I’ve made to choroplethr since launching my course Mapmaking in R with Choroplethr last December. (You can learn about the other updates here and here.) My plan is to update the course with video lectures that demonstrate this material soon. But that will probably have to wait until I finish production of my next course, which I hope to announce soon.

The post Mapping Birds with Choroplethr appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Network Visualization with Plotly and Shiny

By dgrapov

netmaping

(This article was first published on r-bloggers – Creative Data Solutions, and kindly contributed to R-bloggers)


In addition to their more common uses, networks can be used as powerful multivariate data visualizations and exploration tools. Networks not only provide mathematical representations of data but are also one of the few data visualization methods capable of easily displaying multivariate variable relationships. The process of network mapping involves using the network manifold to display a variety of other information, e.g. statistical, machine learning or functional analysis results (see more mapped network examples).


The combination of Plotly and Shiny is awesome for creating your very own network mapping tools. Networkly is an R package which can be used to create 2-D and 3-D interactive networks which are rendered with plotly and can be easily integrated into shiny apps or markdown documents. All you need to get started is an edge list and node attributes which can then be used to generate interactive 2-D and 3-D networks with customizable edge (color, width, hover, etc) and node (color, size, hover, label, etc) properties.
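
networkly wraps these steps for you; as a rough illustration of the underlying idea only (this is not the networkly API, and the toy data are made up), here is how an edge list plus node attributes can become an interactive 2-D network, using igraph for the layout and plotly for rendering:

library(plotly)
library(igraph)

## toy edge list and node attributes
edges <- data.frame(source = c("A", "A", "B", "C"),
                    target = c("B", "C", "C", "D"),
                    stringsAsFactors = FALSE)
nodes <- data.frame(name  = c("A", "B", "C", "D"),
                    group = c("x", "x", "y", "y"),
                    stringsAsFactors = FALSE)

## force-directed 2-D layout for the node positions
g  <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
xy <- layout_with_fr(g)
nodes$x <- xy[, 1]
nodes$y <- xy[, 2]

## look up the coordinates of both ends of every edge
edges$x    <- nodes$x[match(edges$source, nodes$name)]
edges$y    <- nodes$y[match(edges$source, nodes$name)]
edges$xend <- nodes$x[match(edges$target, nodes$name)]
edges$yend <- nodes$y[match(edges$target, nodes$name)]

## edges as segments, nodes as hoverable markers
plot_ly() %>%
  add_segments(data = edges, x = ~x, y = ~y, xend = ~xend, yend = ~yend,
               color = I("grey"), hoverinfo = "none", showlegend = FALSE) %>%
  add_markers(data = nodes, x = ~x, y = ~y, color = ~group,
              text = ~name, hoverinfo = "text", marker = list(size = 20))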


2-Dimensional Network (interactive version)

2dnetwork


3-Dimensional Network (interactive version)

3dnetwork

View all code used to generate the networks above.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – Creative Data Solutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

A Tale of Two Charting Paradigms: Vega-Lite vs R+ggplot2

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

This post comes hot on the heels of the nigh-feature-complete release of vegalite (virtually all the components of Vega-Lite are now implemented and just need real-world user testing). I’ve had, and seen, a few questions about “why Vega-Lite?”. I think my previous post gave some good answers to “why”. However, Vega-Lite and Vega provide different ways to think about composing statistical graphs than folks seem to be used to (which is part of the “why?”).

Vega-Lite attempts to simplify the way charts are specified (i.e. the way you create a “spec”) in Vega. Vega-proper is rich and complex. You interleave data, operations on data, chart aesthetics and chart element interactions all in one giant JSON file. Vega-Lite 1.0 is definitely more limited than Vega-proper and even when it does add more interactivity (like “brushing”) it will still be more limited, on purpose. The reduction in complexity makes it more accessible to both humans and apps, especially apps that don’t grok the Grammar of Graphics (GoG) well.

Even though ggplot2 lets you mix and match statistical operations on data, I’m going to demonstrate the difference in paradigms/idioms through a single chart. I grabbed the FRED data on historical WTI crude oil prices and will show a chart that displays the minimum monthly price per-decade for a barrel of this cancerous, greed-inducing, global-conflict-generating, atmosphere-destroying black gold.

The data consists of records of daily prices (USD) for this commodity. That means we have to:

  1. compute the decade
  2. compute the month
  3. determine the minimum price by month and decade
  4. plot the values

The goal of each idiom is to provide a way to reproduce and communicate the “research”.

Here’s the idiomatic way of doing this with Vega-Lite:

library(vegalite)
library(quantmod)
library(dplyr)
 
getSymbols("DCOILWTICO", src="FRED")
 
data_frame(date=index(DCOILWTICO),
           value=coredata(DCOILWTICO)[,1]) %>%
  mutate(decade=sprintf("%s0", substring(date, 1, 3))) -> oil
 
# i created a CSV and moved the file to my server for easier embedding but
# could just have easily embedded the data in the spec.
# remember, you can pipe a vegalite object to embed_spec() to
# get javascript embed code.
 
vegalite() %>%
  add_data("http://rud.is/dl/crude.csv") %>%
  encode_x("date", "temporal") %>%
  encode_y("value", "quantitative", aggregate="min") %>%
  encode_color("decade", "nominal") %>%
  timeunit_x("month") %>%
  axis_y(title="", format="$3d") %>%
  axis_x(labelAngle=45, labelAlign="left", 
         title="Min price for Crude Oil (WTI) by month/decade, 1986-present") %>%
  mark_tick(thickness=3) %>%
  legend_color(title="Decade", orient="left")

Here’s the “spec” that creates (wordpress was having issues with it, hence the gist embed):

And, here’s the resulting visualization:

(embedded interactive Vega-Lite chart)

The grouping and aggregation operations operate in-chart-craft-situ. You have to carefully, visually parse either the spec or the R code that creates the spec to really grasp what’s going on. A different way of looking at this is that you embed everything you need to reproduce the transformations and visual encodings in a single, simple JSON file.

Here’s what I believe to be the modern, idiomatic way to do this in R + ggplot2:

library(ggplot2)
library(quantmod)
library(dplyr)
 
getSymbols("DCOILWTICO", src="FRED")
 
data_frame(date=index(DCOILWTICO),
           value=coredata(DCOILWTICO)[,1]) %>%
  mutate(decade=sprintf("%s0", substring(date, 1, 3)),
         month=factor(format(as.Date(date), "%B"),
                      levels=month.name)) -> oil
 
filter(oil, !is.na(value)) %>%
  group_by(decade, month) %>%
  summarise(value=min(value)) %>%
  ungroup() -> oil_summary
 
ggplot(oil_summary, aes(x=month, y=value, group=decade)) +
  geom_point(aes(color=decade), shape=95, size=8) +
  scale_y_continuous(labels=scales::dollar) +
  scale_color_manual(name="Decade", 
                     values=c("#d42a2f", "#fd7f28", "#339f34", "#d42a2f")) +
  labs(x="Min price for Crude Oil (WTI) by month/decade, 1986-present", y=NULL) +
  theme_bw() +
  theme(axis.text.x=element_text(angle=-45, hjust=0)) +
  theme(legend.position="left") +
  theme(legend.key=element_blank()) +
  theme(plot.margin=grid::unit(rep(1, 4), "cm"))

(To stave off some comments, yes I do know you can be Vega-like and compute with arbitrary functions within ggplot2. This was meant to show what I’ve seen to be the modern, recommended idiom.)

You really don’t even need to know R (for the most part) to grok what’s going on. Data is acquired and transformed and we map that into the plot. Yes, you can do the same thing with Vega[-Lite] (i.e. munge the data ahead of time and just churn out marks) but you’re not encouraged to. The power of the Vega paradigm is that you do blend data and operations together and they stay together.

To make the R+ggplot2 code reproducible the entirety of the script has to be shipped. It’s really the same as shipping the Vega[-Lite] spec, though, since you need to reproduce either the JSON or the R code in an environment that supports it (R just happens to support both ggplot2 & Vega-Lite now :-).

I like the latter approach but can appreciate both (otherwise I wouldn’t have written the vegalite package). I also think Vega-Lite will catch on more than Vega-proper did (though Vega itself is in use, and you use it under the covers whenever you use ggvis). If Vega-Lite does nothing more than improve visualization literacy—you must understand core vis terms to use it—and foster the notion of the need for serialization, reproduction and sharing of basic statistical charts, it will have been an amazing success in my book.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Upload plots as PNG file to your wordpress

By jletteboer

(This article was first published on R – Networkx, and kindly contributed to R-bloggers)

Synopsis

Note! This post is a addition to Create blog posts from RStudio to WordPress

About a week ago I tried to add my blog to R-Bloggers. I thought I had set everything up correctly. But this week I got a mail from Tal Galili (the site administrator of R-Bloggers) with the message that my blog uses base64 images in its feed.

These images are created by knitr, as the default option, when rendering a standalone HTML document. It would be great if I could use PNG files instead of base64 images, and even better if I could do so while posting from RStudio to WordPress.

And so I did. In this post I will explain how to publish your WordPress post with PNG files and have them uploaded to your blog.

How to recognize a base64 image

Well, you can check the RSS feed of your blog and search for data:image/png;base64, i.e. an img tag whose src attribute starts with that string instead of a regular image URL. It should look something like this (thanks Tal for the image):

Setting your options

To upload your PNG files to your blog you need to set some knitr options first.

As I described in my earlier post, you have to set your login parameters to log in to your WordPress blog. To upload files to WordPress you also need to set the upload.fun option of knitr. This function takes the plot filename as input and returns a link to the uploaded image for use in the R Markdown output.

Let’s set the login parameters and the upload.fun option. I’ve hashed them out because I cannot post with the dummy credentials. Following the earlier post, you can add the upload.fun option after your login credentials and you are good to go.

# Load libraries
library(RWordPress)
library(knitr)

# Login parameters
options(WordPressLogin=c(your_username="your_password"),
        WordPressURL="http://your.blog.com/xmlrpc.php")

# Upload your plots as png files to your blog
opts_knit$set(upload.fun = function(file){library(RWordPress);uploadFile(file)$url;})

After setting the above credentials you are ready to upload your post with PNG image(s) to your blog. For completeness, I will include a PNG image in this post that I create within R.

Example of posting your PNG image

data(iris)
pairs(iris, col = iris$Species)

plot of chunk iris

After this I could start posting to my blog with knit2wpCrayon.

knit2wpCrayon("r2blog_with_png.Rmd", 
        title = "Upload plots as PNG file to your wordpress",
        categories = c("R", "Programming"), 
        publish = FALSE, upload = TRUE)

Again, I hashed it out because the post cannot publish itself.

This code can also be found on my GitHub.

To leave a comment for the author, please follow the link and comment on their blog: R – Networkx.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News