Manipulate Biological Data Using Biostrings Package Exercises (Part 2)

By Stephen James

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Bioinformatics is an amalgamation of biology and computer science: in bioinformatics, biological data is manipulated using computers and computer software. Biological data includes DNA, RNA, and proteins. DNA and RNA are made of nucleotides and carry the genetic material in which we are encoded, while our structure and functions are carried out by proteins, which are built of amino acids.

In these exercises we explore the relationship between DNA, RNA, and protein.

Conversion of DNA to RNA is known as transcription, and conversion of DNA/RNA to protein is known as translation.
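
Both steps can be carried out directly with Biostrings. The snippet below is only a minimal illustration (not the official solutions to the exercises); the example sequence is arbitrary:

library(Biostrings)

dna <- DNAString("ATGGCCATTGTAATGGGCCGC")  # arbitrary coding sequence (21 bases = 7 codons)
rna <- RNAString(dna)                      # "transcription": T is replaced by U
aa  <- translate(dna)                      # translation using the standard genetic code

# three-letter codes of the resulting amino acids
AMINO_ACID_CODE[strsplit(as.character(aa), "")[[1]]]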

Here we also discuss sequence alignment techniques. Sequence alignment compares sequences to measure how closely related two DNA, RNA, or protein sequences are to each other.
There are three types of sequence alignment:
1. Global alignment
2. Local alignment
3. Overlap alignment
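
All three types can be produced with pairwiseAlignment() by changing its type argument. A minimal sketch with arbitrarily chosen sequences:

library(Biostrings)

s1 <- DNAString("ACGTACGTGCA")
s2 <- DNAString("ACGTCGTGC")

aln_global  <- pairwiseAlignment(s1, s2, type = "global")
aln_local   <- pairwiseAlignment(s1, s2, type = "local")
aln_overlap <- pairwiseAlignment(s1, s2, type = "overlap")

score(aln_global)  # each alignment object also carries its score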

In the exercises below, we cover how to manipulate biological data using the Biostrings package from Bioconductor.

Install Packages
Biostrings
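
Biostrings is a Bioconductor (not CRAN) package. One way to install it, assuming a reasonably recent R version (older setups used the biocLite installer instead):

install.packages("BiocManager")
BiocManager::install("Biostrings")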

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a DNA string and find the complement of the DNA.

Exercise 2

Create an RNA string and find the complement of the RNA.

Exercise 3

Create a DNA string and find the reverse complement of the same.

Exercise 4

Create an RNA string and find the reverse complement of the same.

Exercise 5

Create a DNA string, translate it into amino acids using the standard genetic code, and print the three-letter codes of the amino acids.

Exercise 6

Create a DNA string, translate it into amino acids using the standard genetic code, and print the three-letter codes of the amino acids.

Exercise 7

Create two DNA strings, align the sequences using the global alignment technique, and print the score of the alignment.

Exercise 8

Create two DNA strings, align the sequences using the global alignment technique, and print the score of the alignment after specifying your own match and mismatch scores.

Exercise 9

Create two DNA strings, align the sequences using the local alignment technique, and print the score of the alignment.

Exercise 10

Create two DNA strings, align the sequences using the overlap alignment technique, and print the score of the alignment.


L.A. Unconf-idential : a.k.a. an rOpenSci #runconf17 Retrospective

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Last year, I was able to sit back and lazily “RT” Julia Silge’s excellent retrospective on her 2016 @rOpenSci “unconference” experience. Since Julia was not there this year, and the unconference experience is still in primary storage (LMD v2.0 was a success!) I thought this would be the perfect time for a mindful look-back.

And Now, A Word From…

Hosting a conference is an expensive endeavour, and a number of sponsoring organizations made the event possible.

At most “conferences” you are inundated with advertising from event sponsors. These folks provided resources and said “do good work”. That makes them all pretty amazing but is also an indicator of the awesomeness of this particular unconference.

All For “Un” and “Un” For All

Over the years, I’ve become much less appreciative of “talking heads” events. Don’t get me wrong. There’s great benefit in being part of a larger group experiencing the same message(s) and getting inspired to understand and investigate new ideas, concepts and technologies. Shining examples of what great “conferences” look like include OpenVis Conf and RStudio’s inaugural self-titled event.

The @rOpenSci “unconference” model is incredibly refreshing.

It has the “get’er done” feel of a hackathon but places less importance on the competitive aspect that is usually paramount in hackathons and increases emphasis on forging links, partnerships and creativity across the diverse R community. It’s really like the Three Musketeers saying “all for one and one for all” since we were all there to help each other build great things to enable R users to build even greater things.

What Are We Going To Do Tonight, Brain Karthik?

I’ll let you peruse the rOpenSci member list and #runconf17 attendee list at your leisure. Those folks came to Los Angeles to work — not just listen — for two days.

In the grand scheme of things, two days is not much time. It takes many organizations two days to just agree on what conference room they’re going to use for an upcoming internal meeting let alone try to get something meaningful accomplished. In two days, the unconference participants cranked out ~20 working projects. No project had every “i” dotted and every “t” crossed but the vast majority were at Minimum Viable Product status by presentation time on Day 2, and none were “trivial”.

You can read about all of the projects at the aforementioned link. Any that I fail to mention here is no reflection on the project, but more a factor of needing to keep this post to a reasonable length. To that end, I’m not even elaborating on the project I mainly worked on with Rich, Steph, Oliver & Jeroen (though it is getting a separate blog post soon).

Want to inspire Minecraft enthusiasts to learn R? There’s an app for that. The vast functional programming power that’s enabled the modern statistics and machine learning revolution is now at the fingertips of any player. On the flip side, you now have tools to create 3D models in a world you can literally walk through — as in, literally stand and watch models of migratory patterns of laden swallows that you’ve developed. Or, make a 3D scatterblock™ diagram and inspect — or destroy with an obsidian axe — interesting clusters. Eliminating data set outliers never felt so cathartic! Or, even create mazes algorithmically and see if your AI-controlled avatar can find its way out.

Want to connect up live sensor (or other live stream) data into an R Shiny project? There’s an app for that. Websockets are a more efficient & versatile way to wire up clients and servers. Amazon’s IoT platform even uses it as one way to push data out from your connected hairbrush. R now has a lightweight way to grab this data.

The team even live-demoed how to pick up accelerometer data from a mobile device and collect + plot it live.

Want vastly improved summaries of your data frames so you can find errors, normalize columns and get to visualization and model development faster? There’s an app for that.

Yes, I — too — SQUEEd at in-console & in-data frame histograms.

There are many more projects for you to investigate and U.S. folks should be thankful for a long weekend so they have time to dive into each of them.

It’s never about the technology. It’s about the people.

(I trust Doctor Who fans will forgive me for usurping Clara’s best line from the Bells of Saint John)

Stefanie, Karthik, Scott & the rest of the rOpenSci team did a phenomenal job organizing and running the unconference. Their efforts ensured it was an open and safe environment for folks (or 🐕) to just be themselves.

I got to “see” individuals I’ve only ever previously digitally interacted or collaborated with. Their IRL smiles — a very familiar expression on the faces of attendees during the two-day event — are even wider and brighter than those that come through in their tweets and blog posts.

Each and every attendee I met brought fresh perspectives, unique knowledge, incredible talent and unwavering enthusiasm to the event. Teams and individuals traded ideas and code snippets and provided inspiration and encouragement when not hammering out massive quantities of R code.

You can actually get a mini-unconf experience at any time from the comfort of your own glowing rectangle nesting spot. Pick or start a project, connect with the team and dive in.

FIN

It was great meeting new folks, hanging with familiar faces and having two days to just focus on making things for the R community. I hope more conferences or groups explore the “un” model and look forward to seeing the 2017 projects become production-ready and more folks jumping on board rOpenSci.


5 ways to measure running time of R code

By Alexej's blog

Microbenchmark results plot

A reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available.

A quick online search revealed at least three R packages for benchmarking R code (rbenchmark, microbenchmark, and tictoc). Additionally, base R provides at least two methods to measure the running time of R code (Sys.time and system.time). In the following I briefly go through the syntax of using each of the five options, and present my conclusions at the end.

1. Using Sys.time

The run time of a chunk of code can be measured by taking the difference between the time at the start and at the end of the code chunk. Simple yet flexible :sunglasses:.

sleep_for_a_minute <- function() { Sys.sleep(60) }

start_time <- Sys.time()
sleep_for_a_minute()
end_time <- Sys.time()

end_time - start_time
# Time difference of 1.000327 mins

2. Library tictoc

The functions tic and toc are used for benchmarking in the same manner as the Sys.time approach just demonstrated. However, tictoc adds a lot more convenience.

The most recent development version[1] of tictoc can be installed from GitHub:

devtools::install_github("collectivemedia/tictoc")

One can time a single code chunk:

library(tictoc)

tic("sleeping")
print("falling asleep...")
sleep_for_a_minute()
print("...waking up")
toc()
# [1] "falling asleep..."
# [1] "...waking up"
# sleeping: 60.026 sec elapsed

Or nest multiple timers:

tic("total")
tic("data generation")
X <- matrix(rnorm(50000*1000), 50000, 1000)
b <- sample(1:1000, 1000)
y <- runif(1) + X %*% b + rnorm(50000)
toc()
tic("model fitting")
model <- lm(y ~ X)
toc()
toc()
# data generation: 3.792 sec elapsed
# model fitting: 39.278 sec elapsed
# total: 43.071 sec elapsed
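
Beyond printing to the console, tictoc can also record timings in a log that you can collect programmatically afterwards. A small sketch (see ?tic.log for details):

tic.clearlog()
for (i in 1:3) {
  tic(paste("run", i))
  Sys.sleep(1)
  toc(log = TRUE, quiet = TRUE)
}
tic.log(format = TRUE)  # returns the formatted timing messages, one per tic/toc pair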

3. Using system.time

One can time the evaluation of an R expression using system.time. For example, we can use it to measure the execution time of the function sleep_for_a_minute (defined above) as follows.

system.time({ sleep_for_a_minute() })
#   user  system elapsed
#  0.004   0.000  60.051

But what exactly are the reported times user, system, and elapsed? :confused:

Well, clearly elapsed is the wall clock time taken to execute the function sleep_for_a_minute, plus some benchmarking code wrapping it (that’s why it took slightly more than a minute to run I guess).

As for user and system times, William Dunlap has posted a great explanation to the r-help mailing list:

“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.

:grinning:
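
To make the distinction concrete, here is a small contrast between a CPU-bound computation and a call that merely waits (exact numbers will vary by machine):

system.time({ x <- rnorm(5e6); sort(x) })  # CPU-bound: user time close to elapsed time
system.time(Sys.sleep(2))                  # waiting: elapsed ~2 s, user and system near 0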

4. Library rbenchmark

The documentation of the function benchmark from the rbenchmark R package describes it as “a simple wrapper around system.time”. However, it adds a lot of convenience compared to bare system.time calls. For example, it requires just one benchmark call to time multiple replications of multiple expressions, and the returned results are conveniently organized in a data frame.

I installed the development version[1] of the rbenchmark package from GitHub:

devtools::install_github("eddelbuettel/rbenchmark")

For example purposes, let’s compare the time required to compute linear regression coefficients using three alternative computational procedures:

  1. lm,
  2. the Moore-Penrose pseudoinverse,
  3. the Moore-Penrose pseudoinverse but without explicit matrix inverses.

library(rbenchmark)

benchmark("lm" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- lm(y ~ X + 0)$coef
          },
          "pseudoinverse" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- solve(t(X) %*% X) %*% t(X) %*% y
          },
          "linear system" = {
            X <- matrix(rnorm(1000), 100, 10)
            y <- X %*% sample(1:10, 10) + rnorm(100)
            b <- solve(t(X) %*% X, t(X) %*% y)
          },
          replications = 1000,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))

#            test replications elapsed relative user.self sys.self
# 3 linear system         1000   0.167    1.000     0.208    0.240
# 1            lm         1000   0.930    5.569     0.952    0.212
# 2 pseudoinverse         1000   0.240    1.437     0.332    0.612

Here, the meaning of elapsed, user.self, and sys.self is the same as described above in the section about system.time, and relative is simply the ratio of each test’s elapsed time to that of the fastest test. Interestingly, lm is by far the slowest here.

5. Library microbenchmark

The most recent development version of microbenchmark can be installed from github:

devtools::install_github("olafmersmann/microbenchmarkCore")
devtools::install_github("olafmersmann/microbenchmark")

Much like benchmark from the package rbenchmark, the function microbenchmark can be used to compare running times of multiple R code chunks. But it offers a great deal of convenience and additional functionality.

I find that one particularly nice feature of microbenchmark is the ability to automatically check the results of the benchmarked expressions with a user-specified function. This is demonstrated below, where we again compare three methods of computing the coefficient vector of a linear model.

library(microbenchmark)

set.seed(2017)
n <- 10000
p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- X %*% rnorm(p) + rnorm(100)

check_for_equal_coefs <- function(values) {
  tol <- 1e-12
  max_error <- max(c(abs(values[[1]] - values[[2]]),
                     abs(values[[2]] - values[[3]]),
                     abs(values[[1]] - values[[3]])))
  max_error < tol
}

mbm <- microbenchmark("lm" = { b <- lm(y ~ X + 0)$coef },
               "pseudoinverse" = {
                 b <- solve(t(X) %*% X) %*% t(X) %*% y
               },
               "linear system" = {
                 b <- solve(t(X) %*% X, t(X) %*% y)
               },
               check = check_for_equal_coefs)

mbm
# Unit: milliseconds
#           expr      min        lq      mean    median        uq      max neval cld
#             lm 96.12717 124.43298 150.72674 135.12729 188.32154 236.4910   100   c
#  pseudoinverse 26.61816  28.81151  53.32246  30.69587  80.61303 145.0489   100  b
#  linear system 16.70331  18.58778  35.14599  19.48467  22.69537 138.6660   100 a

We used the function argument check to check for equality (up to a maximal error of 1e-12) of the results returned by the three methods. If the results weren’t equal, microbenchmark would return an error message.

Another great feature is the integration with ggplot2 for plotting microbenchmark results.

library(ggplot2)
autoplot(mbm)

Conclusion

The demonstration of the different benchmarking functions given here is surely not exhaustive. Nevertheless, I came to some conclusions for my personal benchmarking needs:

  • The Sys.time approach as well as the tictoc package can be used for timing (potentially nested) steps of a complicated algorithm (that’s often my use case). However, tictoc is more convenient, and (most importantly) foolproof.
  • We saw that microbenchmark returns other types of measurements than benchmark, and I think that in most situations the microbenchmark measurements are of a higher practical significance :stuck_out_tongue:.
  • To my knowledge microbenchmark is the only benchmarking package that has visualizations built in :+1:.

For these reasons I will go with microbenchmark and tictoc. :bowtie:


  1. Though the repository does not seem to be very active, so the GitHub version is probably no different from the stable release on CRAN.


An Interesting Study: Exploring Mental Health Conditions in the Tech Workplace

By Bo Lian

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Contributed by Bo Lian. This blog post is the first class project on Exploratory Visualization & Shiny. Please check the Shiny App here.


Background and Motivation

Mental illness is a global health problem. It has a critically high prevalence, yielding severe health outcomes. One in four adults suffers from a diagnosable mental illness in any given year (National Alliance on Mental Illness, 2013). In the tech industry, 50% of individuals have sought treatment for mental illness (Finkler, 2015).

This important health issue warrants further investigation. This study, based on the 2014 Mental Health in Tech Survey from Open Sourcing Mental Illness, performs data visualization and statistical analyses of mental illness and various associated factors in order to answer the following questions:

  1. What are the strongest groups of predictors of mental illness in the workplace?
  2. What might cause the high prevalence of mental illness in the tech industry, beyond the commonly assumed factors?
  3. How representative is the survey, and are other factors involved (geographic location, age, etc.)?

The Data

The data used in the study is the 2014 Mental Health in Tech Survey from Open Sourcing Mental Illness. With 1259 responses, it is considered the largest survey on mental health in the tech industry.

The dataset contains 27 factors that can be segmented into 5 categories, each of which can be tested with a Chi-squared test against the response variable, treatment (whether one has sought treatment for a mental health condition).

1. Demographics : age, gender, country, state etc.

2. Mental Health Condition : treatment, work interference, family history etc.

3. Employment Background : tech, remote, employee number, size etc.

4. Organizational Policies on Mental Health : benefit, wellness program etc.

5. Openness about Mental Health : self awareness, coworkers, supervisors, openness to discuss etc.

Data Visualization and Analysis

The data is drawn from employees from 46 countries, among which US employees account for 59.2%. Most of those hail from tech-heavy states like California, New York, Washington, etc.

The median age of the surveyees is 31. The distribution of ages is right-skewed, which is expected, as the tech industry tends to have younger employees.

From the box-plots, there is no statistically significant difference in age between the treatment and non-treatment groups.

Below are 6 examples of the data analysis by Chi-squared testing, with the test statistics and p-values. As expected, seeking treatment for mental illness is strongly dependent on family history and work interference. If a company offers mental-illness care options, the treatment rate increases as well, and gender also makes a difference. What is interesting is that the data show that several factors people might assume correlate with mental illness are not factors at all: for example, neither remote work, nor working for a tech company, nor the supervisor’s attitude correlates with seeking treatment.
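
A sketch of one such test, assuming the survey has been read into a data frame survey with columns family_history and treatment (column names taken from the public 2014 survey export; adjust if yours differ):

survey <- read.csv("mental_health_survey_2014.csv")  # hypothetical file name

# contingency table of family history vs. having sought treatment
tbl <- table(survey$family_history, survey$treatment)
chisq.test(tbl)  # prints the X-squared statistic, degrees of freedom and p-value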

Results and Conclusions:

Significant Factors of the Chi-Square tests

They include: gender, family history, work interference, benefits, care options, mental health interview, etc.

Insignificant Factors of the Chi-Square tests

Factors like self_employed, remote work, number of employees, tech company or not, supervisors, etc., did not prove significant.

Among the results, family history, work interference, and gender difference are most likely to be the main causal factors. If a company provides better mental health care services, employees might be more likely to seek treatment or become more aware, and different genders might tend to seek treatment at different rates as well. Interestingly, none of the actual work conditions has a statistically significant impact on the mental illness of employees. Humans are tougher than they think! One thing to consider is that, as the tech industry is primarily located on the US west and east coasts or in other developed countries (as shown in the map), the high cost of living, competitive atmosphere, and high population density might contribute more to the prevalence of mental illness than the working conditions themselves.

Future Work

Keep exploring:

  • More data visualization
  • Data complexity and subjectivity, location specificity
  • Integrating ongoing survey data from 2016
  • Interaction among factors
  • Predictions with multiple variates



Data Science for Business – Time Series Forecasting Part 1: EDA & Data Preparation

By Shirin's playgRound

(This article was first published on Shirin’s playgRound, and kindly contributed to R-bloggers)

Data Science is a fairly broad term and encompasses a wide range of techniques from data visualization to statistics and machine learning models. But the techniques are only tools in a – sometimes very messy – toolbox. And while it is important to know and understand these tools, here, I want to go at it from a different angle: What is the task at hand that data science tools can help tackle, and what question do we want to have answered?

A straight-forward business problem is to estimate future sales and future income. Based on past experience, i.e. data from past sales, data science can help improve forecasts and generate models that describe the main factors of influence. This, in turn, can then be used to develop actions based on what we have learned, like where to increase advertisement, how much of which products to keep in stock, etc.

Data preparation

While it isn’t the most exciting aspect of data science, and is therefore often neglected in favor of fancy modeling techniques, getting the data into the right format and extracting meaningful features is arguably THE most essential part of any analysis!

Therefore, I have chosen to dedicate an entire article to this part and will discuss modeling and time series forecasting in separate blog posts.

Many of the formal concepts I am using when dealing with data in a tidy way come from Hadley Wickham & Garrett Grolemund’s “R for Data Science”.

The central package is tidyverse, which contains tidyr, dplyr, ggplot, etc. Other packages I am using are tidyquant for its nice ggplot theme, modelr, gridExtra and grid for additional plotting functionalities.

library(tidyverse)
library(tidyquant)
library(modelr)
library(gridExtra)
library(grid)

The data

I am again using a dataset from UC Irvine’s machine learning repository.

From the dataset description:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Reading in the data

A fast way to read in data in csv format is to use readr‘s read_csv() function. With a small dataset like this, it makes sense to specifically define what format each column should have (e.g. integers, character, etc.). In our case, this is particularly convenient for defining the date/time column to be read in correctly with col_datetime().

The original data contains the following features (description from UC Irvine’s machine learning repository):

  • InvoiceNo: Invoice number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
  • StockCode: Product (item) code uniquely assigned to each distinct product.
  • Description: Product (item) name.
  • Quantity: The quantities of each product (item) per transaction.
  • InvoiceDate: Invoice Date and time, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Product price per unit in sterling.
  • CustomerID: Customer number uniquely assigned to each customer.
  • Country: Country name. The name of the country where each customer resides.

Because read_csv generates a tibble (a specific dataframe class) and is part of the tidyverse, we can directly create a few additional columns:

  • day: Invoice date, the day when each transaction was generated.
  • time: Invoice time, the time when each transaction was generated.
  • month: The month when each transaction was generated.
  • income: The amount of income generated from each transaction (Quantity * UnitPrice), negative in case of returns
  • income_return: Description of whether a transaction generated income or loss (i.e. purchase or return)

retail <- read_csv("OnlineRetail.csv",
                   col_types = cols(
                      InvoiceNo = col_character(),
                      StockCode = col_character(),
                      Description = col_character(),
                      Quantity = col_integer(),
                      InvoiceDate = col_datetime("%m/%d/%Y %H:%M"),
                      UnitPrice = col_double(),
                      CustomerID = col_integer(),
                      Country = col_character()
                      )) %>%
  mutate(day = parse_date(format(InvoiceDate, "%Y-%m-%d")),
         day_of_week = wday(day, label = TRUE),
         time = parse_time(format(InvoiceDate, "%H:%M")),
         month = format(InvoiceDate, "%m"),
         income = Quantity * UnitPrice,
         income_return = ifelse(Quantity > 0, "income", "return"))

Exploratory Data Analysis (EDA)

In order to decide which features to use in the final dataset for modeling, we want to get a feel for our data. A good way to do this is by creating different visualizations. It also helps with assessing your models later on, because the closer you are acquainted with the data’s properties, the better you’ll be able to pick up on things that might have gone wrong in your analysis (think of it as a kind of sanity-check for your data).

Transactions by country

The online retailer is UK-based, but its customers come from all over the world. However, the plots below tell us very quickly that the main customer base is from the UK, followed by Germany and France.

p1 <- retail %>%
  filter(Country == "United Kingdom") %>%
  ggplot(aes(x = Country, fill = income_return)) +
    geom_bar(alpha = 0.8) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    guides(fill = FALSE) +
    labs(x = "")

p2 <- retail %>%
  filter(Country != "United Kingdom") %>%
  ggplot(aes(x = Country, fill = income_return)) +
    geom_bar(alpha = 0.8) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(legend.position = "right") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "",
         fill = "")

grid.arrange(p1, p2, widths = c(0.2, 0.8))

Transactions over time

To get an idea of the number of transactions over time, we can use a frequency polygon. Here, we can see that the number of purchases slightly increased during the last two months of recording, while the number of returns remained relatively stable.

retail %>%
  ggplot(aes(x = day, color = income_return)) +
    facet_grid(income_return ~ ., scales = "free") +
    geom_freqpoly(bins = 100, size = 1, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    guides(color = FALSE) +
    labs(title = "Number of purchases/returns over time",
         x = "")

Because the number of returns is much smaller than the number of purchases, it is difficult to visualize and compare them in the same plot. While above, I split them into two facets with free scales, we can also compare the density of values. From this plot, we can more easily see the relationship between purchases and returns over time: except for the last month, the proportion of both remained relatively stable.

retail %>%
  ggplot(aes(x = day, y = ..density.., color = income_return)) +
    geom_freqpoly(size = 1, alpha = 0.8, bins = 100) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    labs(title = "Density of purchases/returns over time",
         x = "",
         color = "")

Income/loss from transactions

Let’s look at the income/loss from transactions over time. Here, we plot the sum of income and losses for each day. The income seems to increase slightly during the last month, while losses remained more stable. The only severe outlier is the last day.

retail %>%
  group_by(day, income_return) %>%
  summarise(sum_income = sum(income)) %>%
  ggplot(aes(x = day, y = sum_income, color = income_return)) +
    facet_grid(income_return ~ ., scales = "free") +
    geom_ref_line(h = 0, colour = "grey") +
    geom_line(size = 1, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    guides(color = FALSE) +
    labs(title = "Income/loss from transactions per day",
         x = "",
         y = "sum of income/losses",
         color = "")

We can also look at the sum of income and losses according to time of day of the transaction. Not surprisingly, transactions happen mostly during business hours.

retail %>%
  group_by(time, income_return) %>%
  summarise(sum_income = sum(income)) %>%
  ggplot(aes(x = time, y = sum_income, color = income_return)) +
    facet_grid(income_return ~ ., scales = "free") +
    geom_ref_line(h = 0, colour = "grey") +
    geom_line(size = 1, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    guides(color = FALSE) +
    labs(title = "Income from purchases/returns per time of day",
         x = "time of day",
         y = "sum of income/losses",
         color = "")

Here, we again see the two extreme outliers. Let’s look at them in the dataset. This purchase of 80995 paper craft birdies might have been a mistake, because we can see that the same customer who bought them at 09:15 cancelled the order only 15 minutes later and didn’t order a smaller number either.

retail %>%
  filter(day == "2011-12-09") %>%
  arrange(-Quantity) %>%
  .[1:3, ]
## # A tibble: 3 x 14
##   InvoiceNo StockCode                         Description Quantity
##       <chr>     <chr>                               <chr>    <int>
## 1    581483     23843         PAPER CRAFT , LITTLE BIRDIE    80995
## 2    581476     16008 SMALL FOLDING SCISSOR(POINTED EDGE)      240
## 3    581476     22693  GROW A FLYTRAP OR SUNFLOWER IN TIN      192
## # ... with 10 more variables: InvoiceDate <dttm>, UnitPrice <dbl>,
## #   CustomerID <int>, Country <chr>, day <date>, day_of_week <ord>,
## #   time <time>, month <chr>, income <dbl>, income_return <chr>
retail %>%
  filter(day == "2011-12-09") %>%
  arrange(Quantity) %>%
  .[1:3, ]
## # A tibble: 3 x 14
##   InvoiceNo StockCode                     Description Quantity
##       <chr>     <chr>                           <chr>    <int>
## 1   C581484     23843     PAPER CRAFT , LITTLE BIRDIE   -80995
## 2   C581490     22178 VICTORIAN GLASS HANGING T-LIGHT      -12
## 3   C581490     23144 ZINC T-LIGHT HOLDER STARS SMALL      -11
## # ... with 10 more variables: InvoiceDate <dttm>, UnitPrice <dbl>,
## #   CustomerID <int>, Country <chr>, day <date>, day_of_week <ord>,
## #   time <time>, month <chr>, income <dbl>, income_return <chr>
retail %>%
  filter(CustomerID == 16446)
## # A tibble: 4 x 14
##   InvoiceNo StockCode                 Description Quantity
##       <chr>     <chr>                       <chr>    <int>
## 1    553573     22980      PANTRY SCRUBBING BRUSH        1
## 2    553573     22982         PANTRY PASTRY BRUSH        1
## 3    581483     23843 PAPER CRAFT , LITTLE BIRDIE    80995
## 4   C581484     23843 PAPER CRAFT , LITTLE BIRDIE   -80995
## # ... with 10 more variables: InvoiceDate <dttm>, UnitPrice <dbl>,
## #   CustomerID <int>, Country <chr>, day <date>, day_of_week <ord>,
## #   time <time>, month <chr>, income <dbl>, income_return <chr>

Transactions by day and time

The last plot told us that in general, transactions were done during business hours. We can look at this in even more detail by comparing the day and time of transactions in a 2D-bin-plot where the tile colors indicate transaction numbers.

retail %>%
  ggplot(aes(x = time, y = day)) +
    stat_bin2d(alpha = 0.8, bins = 25, color = "white") +
    scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
    theme_tq() +
    theme(legend.position = "right") +
    labs(title = "Purchases/returns per day and time")

Net income

The net income we can e.g. plot in a similar way by comparing month and day of the month of transactions with a tile plot:

retail %>%
  mutate(day2 = format(InvoiceDate, "%d")) %>%
  group_by(month, day2) %>%
  summarise(sum_income = sum(income)) %>%
  ggplot(aes(x = month, y = day2, fill = sum_income)) +
    geom_tile(alpha = 0.8, color = "white") +
    scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
    theme_tq() +
    theme(legend.position = "right") +
    labs(title = "Net income per month and day",
         y = "day of the month",
         fill = "net sum of income")

Items

Also of interest are the items that are being purchased or returned. Here, we sum up the net quantities for each item.

retail %>%
  group_by(StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  arrange(-sum)
## Source: local data frame [5,749 x 3]
## Groups: StockCode [4,070]
## 
## # A tibble: 5,749 x 3
##    StockCode                        Description   sum
##        <chr>                              <chr> <int>
##  1     84077  WORLD WAR 2 GLIDERS ASSTD DESIGNS 53847
##  2    85099B            JUMBO BAG RED RETROSPOT 47363
##  3     84879      ASSORTED COLOUR BIRD ORNAMENT 36381
##  4     22197                     POPCORN HOLDER 36334
##  5     21212    PACK OF 72 RETROSPOT CAKE CASES 36039
##  6    85123A WHITE HANGING HEART T-LIGHT HOLDER 35025
##  7     23084                 RABBIT NIGHT LIGHT 30680
##  8     22492             MINI PAINT SET VINTAGE 26437
##  9     22616          PACK OF 12 LONDON TISSUES 26315
## 10     21977 PACK OF 60 PINK PAISLEY CAKE CASES 24753
## # ... with 5,739 more rows

As we can see in the plots below, the majority of items are purchased only occasionally, while a few items are purchased a lot.

p1 <- retail %>%
  group_by(StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  ggplot(aes(x = sum)) +
    geom_density(fill = palette_light()[[1]], alpha = 0.8) +
    theme_tq()

p2 <- retail %>%
  group_by(StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  filter(sum > 1) %>%
  ggplot(aes(x = sum)) +
    geom_density(fill = palette_light()[[1]], alpha = 0.8) +
    theme_tq()

p3 <- retail %>%
  group_by(StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  filter(sum > 10000) %>%
  ggplot(aes(x = sum)) +
    geom_density(fill = palette_light()[[1]], alpha = 0.8) +
    theme_tq()
    
grid.arrange(p1, p2, p3, ncol = 3)

We can also calculate on how many different days, items have been purchased.

most_sold <- retail %>%
  group_by(day, StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  group_by(StockCode, Description) %>%
  summarise(n = n()) %>%
  arrange(-n)

head(most_sold)
## Source: local data frame [6 x 3]
## Groups: StockCode [6]
## 
## # A tibble: 6 x 3
##   StockCode                        Description     n
##       <chr>                              <chr> <int>
## 1    85123A WHITE HANGING HEART T-LIGHT HOLDER   304
## 2    85099B            JUMBO BAG RED RETROSPOT   302
## 3     22423           REGENCY CAKESTAND 3 TIER   301
## 4     84879      ASSORTED COLOUR BIRD ORNAMENT   300
## 5     20725            LUNCH BAG RED RETROSPOT   299
## 6     21212    PACK OF 72 RETROSPOT CAKE CASES   299

The item that has been purchased most often in terms of days is the white hanging heart t-light holder. Let’s look at its distribution of sold/returned quantities per day:

retail %>%
  filter(StockCode == "85123A") %>%
  group_by(day, income_return) %>%
  summarise(sum = sum(Quantity)) %>%
  ggplot(aes(x = day, y = sum, color = income_return)) +
    facet_wrap(~ income_return, ncol = 1, scales = "free") +
    geom_line(size = 1, alpha = 0.5) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    labs(x = "",
         y = "sum of quantities",
         color = "",
         title = "Transactions of WHITE HANGING HEART T-LIGHT HOLDER")

Preparing data for modeling by day

There are, of course, infinitely more ways to visualize the data, but for now, I think we have enough of a feel for it that we can start preparing a dataset for modeling. Because we only have data for one year, we might not have enough information to accurately forecast or model time-dependent trends. But we can try by creating a table of features per day.

Which customers are repeat customers?

If the customer ID has been recorded on more than one day (i.e., they have made multiple transactions during the year of recording), they are considered repeat customers. As we can see in the plot, the majority of customers are repeat customers.

rep_customer <- retail %>%
  group_by(day, CustomerID) %>%
  summarise(sum = sum(Quantity)) %>%
  group_by(CustomerID) %>%
  summarise(n = n()) %>%
  mutate(repeat_customer = ifelse(n > 1, "repeat_cust", "one_time_cust"))

length(which(rep_customer$repeat_customer == "repeat_cust"))
## [1] 2992
rep_customer_day <- left_join(retail, rep_customer, by = "CustomerID") %>%
  distinct(day, CustomerID, repeat_customer) %>%
  group_by(day, repeat_customer) %>%
  summarise(n = n()) %>%
  spread(key = repeat_customer, value = n)
rep_customer %>%
  group_by(repeat_customer) %>%
  summarise(n = n()) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = "", y = prop, fill = repeat_customer)) +
    geom_bar(stat = "identity", alpha = 0.8) +
    coord_polar("y", start = 0) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    theme(legend.position = "right") +
    labs(x = "",
         y = "",
         fill = "",
         title = "Proportion of one-time & repeat customers")

Transactions, quantities and items per customer and day

I am calculating the following metrics per day and customer.

  • n: number of different items purchased/returned per customer per day
  • sum_it: net sum of items (quantity) purchased/returned per customer per day
  • sum_in: net sum of income per customer per day

From these, I can further calculate mean numbers per day:

  • mean_in_cust: mean net income from all customers per day
  • mean_quant_cust: mean net quantities of items from all customers per day
  • mean_items_cust: mean number of items from all customers per day

customer_purch <- retail %>%
  group_by(day, CustomerID) %>%
  summarise(n = n(),
            sum_it = sum(Quantity),
            sum_in = sum(income)) %>%
  group_by(day) %>%
  summarise(mean_in_cust = mean(sum_in),
            mean_quant_cust = mean(sum_it),
            mean_items_cust = mean(n))
customer_purch %>%
  gather(x, y, mean_in_cust:mean_items_cust) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    geom_smooth(color = palette_light()[[2]], method = 'loess') +
    theme_tq() +
    labs(x = "", 
         y = "")

Purchases/returns per day

This calculates the quantities of items that are purchased and returned per day:

income_return <- retail %>%
  group_by(day, income_return) %>%
  summarise(sum = sum(Quantity)) %>%
  spread(key = income_return, value = sum)
income_return %>%
  gather(x, y, income:return) %>%
  ggplot(aes(x = day, y = y, color = x)) +
    geom_line(size = 1, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    labs(x = "", 
         y = "quantity of items",
         color = "")

How many items are purchased/returned per country?

Because the majority of transactions came from UK customers, I am dividing customers into UK-based and other nationalities and calculating the net quantities of items purchased and returned by day and country.

country_purch <- retail %>%
  mutate(Country2 = ifelse(Country == "United Kingdom", "uk", "other_country")) %>%
  group_by(day, Country2) %>%
  summarise(sum = sum(Quantity)) %>%
  spread(key = Country2, value = sum) %>%
  mutate(prop_other_country = other_country / sum(other_country + uk),
         prop_uk = uk / sum(other_country + uk))
country_purch %>%
  gather(x, y, prop_other_country:prop_uk) %>%
  ggplot(aes(x = day, y = y)) +
    geom_bar(aes(fill = x), stat = "identity", alpha = 0.6) +
    scale_fill_manual(values = palette_light()) +
    geom_line(data = country_purch, aes(x = day, y = prop_uk), size = 1) +
    theme_tq() +
    labs(x = "", 
         y = "proportion of quantity of items",
         fill = "")

Here, we can now also see that not every day has a corresponding data point: there are regular (i.e. no Saturdays are recorded) and irregular patterns of missing data.

How many different items are purchased/returned per day?

I also want to know how many different items were purchased or returned. Here, we see a similar pattern where the number of different items per day increased during the last two months.

n_items <- retail %>%
  group_by(day, StockCode) %>%
  summarise(n = n()) %>%
  group_by(day) %>%
  summarise(n_items = n())
n_items %>%
  ggplot(aes(x = day, y = n_items)) +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    geom_smooth(color = palette_light()[[2]], method = 'loess') +
    theme_tq() +
    labs(x = "", 
         y = "number of different items",
         color = "")

Net income & quantities summaries

Finally, I am calculating the sum and mean of the net income and net quantities sold per day.

income <- retail %>%
  group_by(day) %>%
  summarise(sum_income = sum(income),
            mean_income = mean(income),
            sum_quantity = sum(Quantity),
            mean_quantity = mean(Quantity))
income %>%
  gather(x, y, sum_income:mean_quantity) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    geom_smooth(color = palette_light()[[2]], method = 'loess') +
    theme_tq() +
    labs(x = "", 
         y = "")

Income from purchases and returns

I am also calculating the same for purchases and returns separately.

purchases <- retail %>%
  filter(income > 0) %>%
  group_by(day) %>%
  summarise(sum_income_purch = sum(income),
            mean_income_purch = mean(income),
            sum_quantity_purch = sum(Quantity),
            mean_quantity_purch = mean(Quantity))
purchases %>%
  gather(x, y, sum_income_purch:mean_quantity_purch) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    geom_smooth(color = palette_light()[[2]], method = 'loess') +
    theme_tq() +
    labs(x = "", 
         y = "")

returns <- retail %>%
  filter(income < 0) %>%
  group_by(day) %>%
  summarise(sum_income_return = sum(income),
            mean_income_return = mean(income),
            sum_quantity_return = sum(Quantity),
            mean_quantity_return = mean(Quantity))
returns %>%
  gather(x, y, sum_income_return:mean_quantity_return) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    theme_tq() +
    labs(x = "", 
         y = "")

Mean price per unit sold per day

This is a bit more complicated than one might think, because some items are sold multiple times per day and unit prices of the same item change between transactions and days. Therefore, I am creating a temporary dataframe where I combine day & StockCode. This, I can then combine with a table of the number of transactions per day and item to find the price per item and day. Based on this, I calculate the mean unit price per item and the mean unit price of all items per day.

temp <- distinct(select(retail, day, StockCode, UnitPrice)) %>%
  mutate(temp = paste(day, StockCode, sep = "_")) %>%
  select(temp, UnitPrice)

mean_unit_price <- retail %>%
  filter(income_return == "income") %>%
  group_by(day, StockCode) %>%
  summarise(n = n()) %>%
  mutate(temp = paste(day, StockCode, sep = "_")) %>%
  left_join(temp, by = "temp") %>%
  group_by(day, StockCode) %>%
  summarise(mean = mean(UnitPrice)) %>%
  group_by(day) %>%
  summarise(mean_unit_price = mean(mean))
mean_unit_price %>%
  ggplot(aes(x = day, y = mean_unit_price)) +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    theme_tq() +
    labs(x = "", 
         y = "mean unit price of sold items")

Purchases of most sold items

The 10 items that have been sold on the most days throughout the year of recording are also introduced as separate features.

most_sold_day <- retail %>%
  filter(StockCode %in% most_sold$StockCode[1:10]) %>%
  group_by(day, StockCode) %>%
  summarise(sum = sum(Quantity)) %>%
  spread(key = StockCode, value = sum)
retail %>%
  filter(StockCode %in% most_sold$StockCode[1:10]) %>%
  group_by(day, StockCode, Description) %>%
  summarise(sum = sum(Quantity)) %>%
  ggplot(aes(x = day, y = sum)) +
    facet_wrap(~ StockCode, ncol = 1, scales = "free") +
    geom_line(color = palette_light()[[1]], size = 1, alpha = 0.8) +
    theme_tq() +
    labs(x = "", 
         y = "net sum of quantites sold")

Combining data

Now we can combine all the different per-day features into one dataframe. I am also calculating the difference in the net sum of income from the previous day using the lag() function, and I am creating a column for seasons.

Additional time-based information will be added in a later part when we actually do forecasting with the timekit package.

retail_p_day <- distinct(select(retail, day, day_of_week, month)) %>%
  left_join(income, by = "day") %>%
  left_join(mean_unit_price, by = "day") %>%
  left_join(purchases, by = "day") %>%
  left_join(returns, by = "day") %>%
  left_join(customer_purch, by = "day") %>%
  left_join(rep_customer_day, by = "day") %>%
  left_join(income_return, by = "day") %>%
  left_join(country_purch, by = "day") %>%
  left_join(n_items, by = "day") %>%
  left_join(most_sold_day, by = "day") %>%
  mutate(diff_sum_income = sum_income - lag(sum_income),
         season = ifelse(month %in% c("03", "04", "05"), "spring",
                         ifelse(month %in% c("06", "07", "08"), "summer",
                                ifelse(month %in% c("09", "10", "11"), "fall", "winter"))))

To keep track of the features, I am creating an additional dataframe/tibble containing the descriptions of every feature column.

meta <- tibble(colnames_retail_p_day = colnames(retail_p_day)) %>%
  mutate(description = c(
    "day in YYYY-MM-DD",
    "weekday (Sun - Fri, there are no Sat in the dataset)",
    "month (as month number)",
    
    "sum of net income per day (all purchases & losses per day combined)",
    "mean net income per day (all purchases & losses per day combined)",
    "sum of net quantities sold/returned per day (all purchases & returns per day combined)",
    "mean net quantities sold/returned per day (all purchases & returns per day combined)",
    
    "mean price per unit sold (returns excluded)",
    
    "sum of income from purchases per day (losses excluded)",
    "mean income from purchases per day (losses excluded)",
    "sum of quantities from purchases per day (losses excluded)",
    "mean quantities from purchases per day (losses excluded)",
    
    "sum of losses from returns per day (purchases excluded)",
    "mean losses from returns per day (purchases excluded)",
    "sum of quantities from returns per day (purchases excluded)",
    "mean quantities from returns per day (purchases excluded)",
    
    "mean net income from all customers per day",
    "mean number of items from all customers per day",
    "mean number of items from all customers per day",

    "number of one-time customers per day",
    "number of repeat customers per day",
    
    "sum of items (quantities) purchased per day (returns excluded)",
    "sum of items (quantities) returned per day (purchases excluded)",
    
    "net sum of items (quantities) purchased per day from countries other than the UK",
    "net sum of items (quantities) purchased per day from the UK",
    "proportion of net sum of items (quantities) purchased per day from countries other than the UK",
    "proportion of net sum of items (quantities) purchased per day from the UK",
    
    "number of different items purchased/returned per day",
    
    "net sum of quantities sold of item with StockCode 20725",
    "net sum of quantities sold of item with StockCode 21212",
    "net sum of quantities sold of item with StockCode 22423",
    "net sum of quantities sold of item with StockCode 22457",
    "net sum of quantities sold of item with StockCode 22666",
    "net sum of quantities sold of item with StockCode 22960",
    "net sum of quantities sold of item with StockCode 47566",
    "net sum of quantities sold of item with StockCode 84879",
    "net sum of quantities sold of item with StockCode 85099B",
    "net sum of quantities sold of item with StockCode 85123A",
    
    "difference in sum of net income (purchases - returns) to previous day",
    "season"
    ))

Now, that we have all kinds of features per day, we can gather them in a final visualization to identify patterns and potential response variables suitable for modeling. I am also now adding points that are colored by day of the week so that we’ll be able to judge potential weekday biases.

retail_p_day %>%
  # remove last day because it is so extreme
  filter(day != max(retail_p_day$day)) %>%
  gather(x, y, sum_income:diff_sum_income) %>%
  ggplot(aes(x = day, y = y)) +
    facet_wrap(~ x, scales = "free", ncol = 2) +
    geom_line(alpha = 0.8, color = palette_light()[[1]]) +
    geom_point(aes(color = day_of_week)) +
    geom_smooth() +
    scale_color_manual(values = palette_light()) +
    theme_tq() +
    labs(x = "",
         y = "",
         color = "day of the week")

From the plots, we can see that there has been a slight increase in sales during the months of November and December. Unfortunately, we don’t have data from additional years to judge whether that has to do with Christmas sales or whether it is marking a general increase in sales that will continue into January and February.

And indeed, for many features we see lower sales on Sundays than on Monday through Friday (Saturdays have not been recorded).

Nevertheless, I will attempt a time-series forecast on this data. First, however, I will be developing some simple models to test for weekday biases and other statistics of the data. Stay tuned for my next post… 🙂


sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] gridExtra_2.2.1               modelr_0.1.0                 
##  [3] tidyquant_0.5.1               quantmod_0.4-8               
##  [5] TTR_0.23-1                    PerformanceAnalytics_1.4.3541
##  [7] xts_0.9-7                     zoo_1.8-0                    
##  [9] lubridate_1.6.0               dplyr_0.5.0                  
## [11] purrr_0.2.2.2                 readr_1.1.1                  
## [13] tidyr_0.6.3                   tibble_1.3.1                 
## [15] ggplot2_2.2.1                 tidyverse_1.1.1              
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2   haven_1.0.0      lattice_0.20-35  colorspace_1.3-2
##  [5] htmltools_0.3.6  yaml_2.1.14      rlang_0.1.1      foreign_0.8-68  
##  [9] DBI_0.6-1        readxl_1.0.0     plyr_1.8.4       stringr_1.2.0   
## [13] Quandl_2.8.0     munsell_0.4.3    gtable_0.2.0     cellranger_1.1.0
## [17] rvest_0.3.2      psych_1.7.5      evaluate_0.10    labeling_0.3    
## [21] knitr_1.16       forcats_0.2.0    parallel_3.4.0   broom_0.4.2     
## [25] Rcpp_0.12.11     scales_0.4.1     backports_1.1.0  jsonlite_1.4    
## [29] mnormt_1.5-5     hms_0.3          digest_0.6.12    stringi_1.1.5   
## [33] rprojroot_1.2    tools_3.4.0      magrittr_1.5     lazyeval_0.2.0  
## [37] xml2_1.1.1       assertthat_0.2.0 rmarkdown_1.5    httr_1.2.1      
## [41] R6_2.2.1         nlme_3.1-131     compiler_3.4.0

tidyquant: R/Finance 2017 Presentation

By business-science.io – Articles

RFinance Presentation

(This article was first published on business-science.io – Articles, and kindly contributed to R-bloggers)

The R/Finance 2017 conference was just held at the UIC in Chicago, and the event was a huge success. There were a ton of high quality presentations really showcasing innovation in finance. We gave a presentation on tidyquant illustrating the key benefits related to financial analysis in the tidyverse. We’ve uploaded the tidyquant presentation to YouTube. Check out the presentation. Don’t forget to check out our announcements and to follow us on social media to stay up on the latest Business Science news, events and information!

R/Finance 2017 Presentation

If you’re interested in financial analysis, check out our short 6-minute presentation from R/Finance 2017 that discusses the benefits of the tidyquant package. The discussion touches on the current state of financial analysis, answers the question “Why tidyquant?”, and discusses the core tq functions along with tidyquant benefits, including financial data science at scale.
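
As a tiny illustration of the kind of tidyverse-style workflow tidyquant enables (function names from the tidyquant package; the ticker and dates are arbitrary):

library(tidyquant)

aapl <- tq_get("AAPL", get = "stock.prices", from = "2016-01-01", to = "2016-12-31")

aapl %>%
  tq_transmute(select = adjusted, mutate_fun = periodReturn, period = "monthly")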

Download Presentation on GitHub

The slide deck from the R/Finance 2017 presentation can be downloaded from the Business Science GitHub site.

Download the R/Finance 2017 Presentation Slides!

Upcoming Events

If you’re interested in meeting the members of Business Science, we’ll be speaking at the EARL San Francisco conference, June 5-7, where we’ll give an expanded presentation.

We can’t wait! The presentation will unveil some really cool capabilities, especially for timekit time series machine learning, that are advantages over existing forecasting packages.

Follow Business Science on Social Media

To leave a comment for the author, please follow the link and comment on their blog: business-science.io – Articles.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Oakland Real Estate – Full EDA

By greenberg.jon@gmail.com

(This article was first published on R Analysis – Analyst at Large, and kindly contributed to R-bloggers)

Living in the Bay Area has led me to think more and more about real estate (and how amazingly expensive it is here…). I’ve signed up for trackers on Zillow and Redfin, but the data analyst in me always wants to dive deeper: to look back historically, to quantify, to visualize the trends, and so on. With that in mind, here is my first look at Oakland real estate prices over the past decade. I’ll only be looking at multi-tenant units (duplexes, triplexes, etc.).

The first plot is simply looking at the number of sales each month:

You can clearly see a strong uptick in the number of units sold from 2003 to 2005 and the following steep decline in sales bottoming out during the financial crisis in 2008. Interestingly, sales pick up again very quickly in 2009 and 2010 (a time when I expected to see low sales figures) before stabilizing at the current rate of ~30 properties sold per month.

The next plot shows the price distribution for multi-tenant buildings sold in Oakland:

Here we can see prices rising pretty steadily right up until 2008, when the financial crisis hit (peaking with median prices around $650K). Prices were still rising in 2006 and 2007 even while the number of sales was dropping rapidly. After prices drop to ~$250K, we can see that they stay low from 2009 to 2013 (even though sales picked up dramatically in 2009). This suggests that a number of investors recognized the low prices and bought up a lot of the available properties. Starting in 2013, prices begin rising again and have continued to climb to the present, where we are looking at record-high median prices of ~$750K.

Does month of year matter?

I’ve started doing some basic analysis on Oakland real estate prices over the past decade (multi-tenant buildings only). There’s still a lot to unpack here, but I’m only able to investigate this 30 minutes at a time (new dad life), so I’m breaking the analysis into small, manageable pieces. The first question I wanted to explore quickly is: does month of year matter? I’ve often heard that summer is the busy season for buying a house, because families with children try not to move during the school year. Does this rule apply to multi-tenant buildings as well (which tend to be purchased by investors)?

I’ve collected the sales over the past decade and grouped them by month of year. We can see that the summer months (May, June, and July) do see more sales than other months. Interestingly, December also tends to see a large number of sales. Maybe people have more time over the holidays to check out investment properties? I’m not sure what else could be driving this odd spike in December sales. Any ideas?
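As a rough illustration of this kind of check (not the actual code behind the plots, which is linked at the end of the post), a dplyr/ggplot2 sketch might look like the following, where sales is assumed to be a data frame with one row per transaction and a sale_date column:

library(dplyr)
library(ggplot2)
library(lubridate)

# count sales by calendar month, pooling all years together
sales %>%
  mutate(sale_month = month(sale_date, label = TRUE)) %>%
  count(sale_month) %>%
  ggplot(aes(x = sale_month, y = n)) +
  geom_col() +
  labs(x = "Month of year", y = "Number of sales")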

Given that the most sales occur in summer months, I wanted to see if this has any impact on sale price.

There doesn’t seem to be much of a relationship at all between month of year and sale price. I had expected to see higher prices in the summer months under the assumption that demand is higher during that period, but maybe supply is also higher during that time (and they rise in roughly equal proportion). This is something that I could theoretically test (since I have the list date as well as the sale date for each property), but I think there are more interesting topics to explore… It’s enough for me to know that month of year doesn’t appear to be correlated with sale price!

Are multi-tenant houses going above or below asking price?

For many people one of the most difficult aspects of the house-buying process is deciding what to bid. I decided to take a look at the relationship between list price and sale price.

Since 2000, the average difference between list and sale price has generally been less than $25K. We can see that in 2004-2006, multi-tenant houses in Oakland were generally selling for an average of $15K – $25K above the asking price. During the financial crisis in 2008, we can see houses going for an average of $25K less than asking. From 2010-2013, list price and sale price averaged close to a $0K difference. Starting in 2013, we see houses selling for more than asking again, with multi-tenant houses in Oakland now averaging $50K more than asking! I know that housing prices have increased over time, so my next step was to look at these figures as a percentage of asking price (attempting to control for the inflation in housing values over the past two decades).

The shape of this plot matches the plot showing absolute dollar figures, but it is helpful to see percentages. The most recent data shows that Oakland multi-tenant houses sell at an average premium of 7.5% over asking price.
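A minimal sketch of how those premium figures could be computed, again assuming hypothetical sale_price, list_price and sale_date columns:

library(dplyr)
library(lubridate)

# average premium over list price, in dollars and as a percentage, by year of sale
premium_by_year <- sales %>%
  mutate(sale_year   = year(sale_date),
         premium_usd = sale_price - list_price,
         premium_pct = (sale_price - list_price) / list_price) %>%
  group_by(sale_year) %>%
  summarise(mean_premium_usd = mean(premium_usd, na.rm = TRUE),
            mean_premium_pct = mean(premium_pct, na.rm = TRUE))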

How are days on market and sales price vs. list price premium related?

I’ve often heard that the longer a house is on the market, the lower the premium vs. asking price. This seems as good a dataset as any to test this theory out!

This first plot just shows the relationship between days on market (x-axis) and difference between sale price and list price (y-axis). This plot isn’t particularly helpful, but it does show me a couple of things. I see a handful of outliers (houses on the market for 1000+ days or selling $1M below asking price). There also seems to be a fairly large cluster of houses selling significantly above asking price between 7-30 days after going on the market. Next step was just to bucket these days on market:

We can see that there does appear to be an inverse relationship between days on market and the sale-price-over-list-price premium. Roughly 75% of houses sold in less than 30 days were sold above asking price (with median premiums of ~$10K). On the other hand, roughly 75% of houses on the market for 60+ days were sold below asking price (with median discounts of ~$10K). Next steps here would be converting to percent of list price and then reviewing these trends over time, but that will have to wait for another day!
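For the bucketing step, a sketch along these lines reproduces the idea; days_on_market is an assumed column name and the bucket edges are chosen purely for illustration:

library(dplyr)
library(ggplot2)

# compare the sale-vs-list premium across days-on-market buckets
sales %>%
  mutate(premium_usd = sale_price - list_price,
         dom_bucket  = cut(days_on_market,
                           breaks = c(0, 14, 30, 60, Inf),
                           labels = c("0-14", "15-30", "31-60", "60+"),
                           include.lowest = TRUE)) %>%
  filter(!is.na(dom_bucket)) %>%
  ggplot(aes(x = dom_bucket, y = premium_usd)) +
  geom_boxplot() +
  labs(x = "Days on market", y = "Sale price minus list price (USD)")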

R code here:

To leave a comment for the author, please follow the link and comment on their blog: R Analysis – Analyst at Large.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Testing the Hierarchical Risk Parity algorithm

By Ilya Kipnis

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be a modified backtest of the Adaptive Asset Allocation backtest from AllocateSmartly, using the Hierarchical Risk Parity algorithm from the last post, because Adam Butler was eager to see my results. On the whole, as Adam Butler had told me he had seen, HRP does not generate outperformance when applied to a small, carefully-constructed, diversified-by-selection universe of asset classes, as opposed to a universe of hundreds or even several thousand assets, where its theoretical properties give it an edge.

First off, I would like to thank Matthew Barry for helping me modify my HRP algorithm so that it does not use the global environment for recursion. You can find his GitHub here.

Here is the modified HRP code.

# getRecBipart() and its helper functions (a sketch is given below)

Here, covMat and corMat come from the last post. In fact, this function could be further modified by encapsulating the clustering order within the getRecBipart function, but in the interest of keeping the code as similar to Marcos Lopez de Prado's as possible, I'll leave it as it is.
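For readers who want to experiment without digging up the earlier post, here is a minimal sketch of a functional (argument-passing, no global environment) recursive bisection. It follows the general HRP algorithm from Lopez de Prado's paper rather than reproducing the exact code above; getIVP, getClusterVar and recurFun are illustrative names, and sortIx is assumed to be the quasi-diagonal ordering produced by the clustering step.

# inverse-variance weights within a cluster
getIVP <- function(covMat) {
  invDiag <- 1 / diag(as.matrix(covMat))
  invDiag / sum(invDiag)
}

# cluster variance under inverse-variance weights
getClusterVar <- function(covMat, cItems) {
  covSlice <- covMat[cItems, cItems, drop = FALSE]
  w <- getIVP(covSlice)
  as.numeric(t(w) %*% as.matrix(covSlice) %*% w)
}

# recursive bisection: split the ordered assets in half, allocate between the
# halves by inverse cluster variance, then recurse into each half
recurFun <- function(w, covMat, sortIx) {
  half    <- seq_len(trunc(length(sortIx) / 2))
  cItems0 <- sortIx[half]
  cItems1 <- sortIx[-half]
  cVar0   <- getClusterVar(covMat, cItems0)
  cVar1   <- getClusterVar(covMat, cItems1)
  alpha   <- 1 - cVar0 / (cVar0 + cVar1)
  w[cItems0] <- w[cItems0] * alpha
  w[cItems1] <- w[cItems1] * (1 - alpha)
  if (length(cItems0) > 1) w <- recurFun(w, covMat, cItems0)
  if (length(cItems1) > 1) w <- recurFun(w, covMat, cItems1)
  w
}

getRecBipart <- function(covMat, sortIx) {
  w <- rep(1, ncol(covMat))
  names(w) <- colnames(covMat)
  recurFun(w, covMat, sortIx)
}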

Anyhow, the backtest will follow. One thing I will mention is that I'm using Quandl's EOD database, as Yahoo has really screwed up their financial database (i.e. some sector SPDRs have broken data, dividends not adjusted, etc.). While this database is a $50/month subscription, I believe free users can access it up to 150 times in 60 days, so that should be enough to run the backtests from this blog, so long as you save your downloaded time series for later use with write.zoo.

This code needs the tseries library for the portfolio.optim function for the minimum variance portfolio (Dr. Kris Boudt has a course on this at datacamp), and the other standard packages.

A helper function for this backtest (and really, any other momentum rotation backtest) is the appendMissingAssets function, which simply appends the assets that were not selected to the final weighting and re-orders the weights into the original ordering.

require(tseries)
require(PerformanceAnalytics)
require(quantmod)
require(Quandl)

Quandl.api_key("YOUR_AUTHENTICATION_HERE") # not displaying my own api key, sorry 

# function to append missing (i.e. assets not selected) asset names and sort into original order
# (a sketch of appendMissingAssets() is given below)
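A possible implementation of that helper, assuming wts is a named weight vector for the selected assets and allAssetNames is the full universe in its original order (both argument names are assumptions):

appendMissingAssets <- function(wts, allAssetNames) {
  # give zero weight to everything that was not selected this period
  missing <- setdiff(allAssetNames, names(wts))
  wts <- c(wts, setNames(rep(0, length(missing)), missing))
  # return the weights sorted back into the original asset ordering
  wts[allAssetNames]
}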

Next, we make the call to Quandl to get our data.

# symbols vector and Quandl download call (see the sketch below)
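Here is a sketch of what that download could look like. The ten ETFs below are a typical Adaptive Asset Allocation universe and Adj_Close is Quandl's EOD adjusted close column; treat both as assumptions rather than the post's exact choices.

symbols <- c("SPY", "VGK", "EWJ", "EEM", "VNQ", "RWX", "IEF", "TLT", "DBC", "GLD")

prices <- list()
for (sym in symbols) {
  raw <- Quandl(paste0("EOD/", sym), type = "xts")
  prices[[sym]] <- raw[, "Adj_Close"]
}
prices <- do.call(cbind, prices)
colnames(prices) <- symbols
returns <- na.omit(Return.calculate(prices))

# save a local copy so the free-tier call quota is not burned on every re-run
write.zoo(prices, "eod_prices.csv", sep = ",")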

While Josh Ulrich fixed quantmod to actually get Yahoo data after Yahoo broke the API, the problem is that the Yahoo data itself is now garbage as well, and I'm not sure how much Josh Ulrich can do about that. I really hope some other provider steps up with free, usable EOD data; my policy for this blog is that readers should be able to replicate the backtests rather than just nod and take my word for it. If you are or know of such a provider, please leave a comment so that I can let the blog readers know all about you.

Next, we initialize the settings for the backtest.

# backtest settings (see the sketch below)
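A sketch of plausible settings, using the returns object from the download sketch above (names and values are assumptions, not the post's exact code):

ep      <- endpoints(returns, on = "months")  # month-end row indices
nMonths <- 6    # momentum lookback in months (roughly 126 trading days)
nVol    <- 20   # volatility lookback in days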

While the AAA backtest actually uses a 126-day lookback rather than a 6-month lookback, the strategy trades at the end of every month, so a 6-month lookback is effectively the same thing, give or take a few days out of 126, and the code is less complex this way.

Next, we have our actual backtest.

# main monthly rebalancing loop: for(i in 1:(length(ep)-nMonths)) { ... }
# get the returns subset, compute absolute momentum, select the top 5 positive-momentum
# assets, and compute the three weighting schemes (see the sketch below)

In a few sentences, this is what happens:

The algorithm takes a subset of the returns (the past six months at every month), and computes absolute momentum. It then ranks the ten absolute momentum calculations, and selects the intersection of the top 5, and those with a return greater than zero (so, a dual momentum calculation).

If no assets qualify, the algorithm invests in nothing. If there's only one asset that qualifies, the algorithm invests in that one asset. If there are two or more qualifying assets, the algorithm computes a covariance matrix using 20-day volatility multiplied with a 126-day correlation matrix (that is, sd_20′ %*% sd_20, element-wise multiplied by cor_126). It then computes normalized inverse volatility weights using the volatility from the past 20 days, a minimum variance portfolio with the portfolio.optim function, and lastly, the hierarchical risk parity weights using the HRP code above from Marcos Lopez de Prado's paper.

Lastly, the program puts together all of the weights, and adds a cash investment for any period without any investments.

# put all of the weights together and add a cash investment for any period without any investments (see the sketch below)
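Putting the description above into code, a hedged sketch of the loop and the final aggregation could look like this. It relies on the returns, ep, nMonths and nVol objects from the earlier sketches, the getRecBipart() sketch, and the appendMissingAssets() helper; none of it is the post's verbatim code.

invVolList <- list(); minVolList <- list(); hrpList <- list()

for (i in 1:(length(ep) - nMonths)) {

  # six months of daily returns ending at the current month-end
  retSubset <- returns[(ep[i] + 1):ep[i + nMonths], ]
  rebalDate <- index(retSubset)[nrow(retSubset)]

  # absolute momentum: cumulative return over the lookback, per asset
  absMom <- apply(retSubset, 2, function(x) prod(1 + x) - 1)

  # dual momentum: intersection of the top 5 by rank and positive momentum
  selected <- intersect(names(sort(absMom, decreasing = TRUE))[1:5],
                        names(absMom)[absMom > 0])

  zeroWts <- setNames(rep(0, ncol(returns)), colnames(returns))

  if (length(selected) == 0) {            # nothing qualifies: sit in cash
    invVolList[[i]] <- minVolList[[i]] <- hrpList[[i]] <- xts(t(zeroWts), order.by = rebalDate)
    next
  }
  if (length(selected) == 1) {            # a single qualifier gets the full weight
    w <- zeroWts; w[selected] <- 1
    invVolList[[i]] <- minVolList[[i]] <- hrpList[[i]] <- xts(t(w), order.by = rebalDate)
    next
  }

  # covariance: 20-day vols (outer product), element-wise times 126-day correlations
  sel    <- retSubset[, selected]
  vol20  <- apply(tail(sel, nVol), 2, sd)
  corMat <- cor(sel)
  covMat <- (vol20 %o% vol20) * corMat

  # normalized inverse-volatility weights
  invVol <- (1 / vol20) / sum(1 / vol20)

  # minimum variance weights via tseries::portfolio.optim
  minVol <- setNames(portfolio.optim(as.matrix(sel), covmat = covMat)$pw, selected)

  # HRP weights: cluster on the correlation distance, then recursive bisection
  clustOrd <- hclust(as.dist(sqrt(0.5 * (1 - corMat))), method = "single")$order
  hrp <- getRecBipart(covMat, clustOrd)

  invVolList[[i]] <- xts(t(appendMissingAssets(invVol, colnames(returns))), order.by = rebalDate)
  minVolList[[i]] <- xts(t(appendMissingAssets(minVol, colnames(returns))), order.by = rebalDate)
  hrpList[[i]]    <- xts(t(appendMissingAssets(hrp,    colnames(returns))), order.by = rebalDate)
}

# stitch the monthly weight rows together; all-zero rows are the cash months
invVolWts <- do.call(rbind, invVolList)
minVolWts <- do.call(rbind, minVolList)
hrpWts    <- do.call(rbind, hrpList)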

Here are the results:

# compare the three weighting schemes:
                             invVol    minVol       HRP
Annualized Return         0.0872000 0.0724000 0.0792000
Annualized Std Dev        0.1208000 0.1025000 0.1136000
Annualized Sharpe (Rf=0%) 0.7221000 0.7067000 0.6968000
Worst Drawdown            0.1548801 0.1411368 0.1593287
Calmar Ratio              0.5629882 0.5131956 0.4968234

In short, in the context of a small, carefully-selected and allegedly diversified (I’ll let Adam Butler speak for that one) universe dominated by the process of deciding which assets to invest in, as opposed to how much, the theoretical upside of an algorithm that exploits the covariance structure without needing to invert a covariance matrix can be lost.

However, this test (albeit from 2007 onwards, thanks to ETF inception dates combined with lookback burn-in) confirms what Adam Butler himself told me, which is that HRP hasn’t impressed him, and from this backtest, I can see why. That said, in the context of dual momentum rank selection, I’m not convinced that any weighting scheme will realize much better performance than any other.

Thanks for reading.

NOTE: I am always interested in networking and hearing about full-time opportunities related to my skill set. My linkedIn profile can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Canada Immigration: Where to settle in? Exercises

By Lauro Silva

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Many people around the globe would like to immigrate to Canada as skilled workers. These candidates must prove language proficiency in French and English, have at least 2 years of work experience after graduation, and more. But many immigrants who arrive in Canada face unemployment rates that are sometimes even higher than in their home countries, so the choice of the province to settle in is very important for those hoping for economic success. In these exercises we will use R to analyze some immigration open data from the Canadian government.

Answers to the exercises are available here.

Exercise 1
Download and read into R a data set from Labour force survey estimates (LFS), by immigrant status, age group, Canada, regions, provinces and Montreal, Toronto, Vancouver census metropolitan areas, 3-month moving average, unadjusted for seasonality.. Then take a look at it using head.

Exercise 2
Load libraries to manipulate data like dplyr. Turn Ref_Date into a Date type variable. (Tip: use as.Date)

Exercise 3
Transform the variable “Value” to a numeric format.

Learn more about Data manipulation in the online course R: Complete Data Analysis Solutions. In this course you will learn how to:

  • Learn in depth how to work with dplyr
  • Get a full introduction to the data.table package
  • And much more

Exercise 4
Create a numeric vector that contains the column indices 1, 2, 4, 5, 6, and 9, and create a new data frame that stores only these columns.

Exercise 5
Create a text vector that contains the province names. Create a new data frame to store only lines with valid province names.

Exercise 6
We are interested in comparing the unemployment rate of people born in Canada with that of recent immigrants. Exclude rows related to other kinds of status.

Exercise 7
Skilled worker immigrants usually need to have a university degree and at least 2 years of professional experience. So, exclude rows where the "agegroup" variable is "15 years and over", and then remove this column.

Exercise 8
Take a look at the summary information of the unemployment rate.

Exercise 9
Use summarize on this data, grouping it by status and province. Take the mean of the unemployment rate as the table content.

Exercise 10
Use qplot from ggplot2 to create a plot and find the best province in terms of the difference in unemployment rate between locally-born people and recent immigrants.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Who is the caretaker? Evidence-based probability estimation with the bnlearn package

By Guest Blogger

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Juan M. Lavista Ferres, Senior Director of Data Science at Microsoft

In what was one of the most viral episodes of 2017, political science Professor Robert E Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad’s interview.

The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe on Facebook and Twitter, and even reporters from reliable sources like Time.com, thought the woman who went after the children was their nanny, when in fact the woman in the video was Robert’s wife, Jung-a Kim, who is Korean.

The confusion over this episode caused a second viral wave, calling out that people who thought she was the nanny should feel bad for relying on a stereotype.

We decided to embrace the uncertainty and take a data-science-based approach to estimating the chances that the person was the nanny or the mother of the children, based on the evidence people had from watching the news.

@David_Waddell What would that mean, please? Re-broadcasting it on BBC TV, or just here on Twitter? Is this kinda thing that goes ‘viral’ and gets weird?

— Robert E Kelly (@Robert_E_Kelly) March 10, 2017

What evidence did viewers of the video have?

  • the husband is a Caucasian American
  • the husband is a professional
  • there are two kids
  • the caretaker is Asian

We then look up probability values for each of these pieces of evidence. (Given that Professor Kelly is American, all statistics are based on US data.)

We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.

library(bnlearn)
set.seed(3)
net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]")

plot(net)

The last step is to attach the parameters to the network structure. Since we are supplying the conditional probability tables ourselves, we pass them to the custom.fit function, which returns the fully specified network for all of the nodes in the graph above.

yn <- c("yes", "no")
ca <- c("caucacian", "other")
ao <- c("asian", "other")
nw <- c("nanny", "wife")

cptHusbandDemographics <- matrix(c(0.85, 0.15), ncol=2, dimnames=list(NULL, ca)) #[1]
cptHusbandIsProfessional <- matrix(c(0.81, 0.19), ncol=2, dimnames=list(NULL, yn)) #[2]
cptNannyDemographics <- matrix(c(0.06, 0.94), ncol=2, dimnames=list(NULL, ao)) #[3]
cptWifeDemographics <- matrix(c(0.01, 0.99, 0.33, 0.67), ncol=2, dimnames=list("WifeDemographics"=ao, "HusbandDemographics"=ca)) #[1]
cptStayAtHomeMom <- c(0.3, 0.7, 0.14, 0.86, 0.125, 0.875, 0.125, 0.875) #[4]

dim(cptStayAtHomeMom) <- c(2, 2, 2)
dimnames(cptStayAtHomeMom) <- list("StayAtHomeMom"=yn, "WifeDemographics"=ao, "HusbandIsProfessional"=yn)

cptHouseholdHasNanny <- c(0.01, 0.99, 0.035, 0.965, 0.00134, 0.99866, 0.00134, 0.99866) #[5]
dim(cptHouseholdHasNanny) <- c(2, 2, 2)
dimnames(cptHouseholdHasNanny) <- list("HouseholdHasNanny"=yn, "StayAtHomeMom"=yn, "HusbandIsProfessional"=yn)

cptCaretaker <- c(0.5, 0.5, 0.999, 0.001, 0.01, 0.99, 0.001, 0.999)
dim(cptCaretaker) <- c(2, 2, 2)
dimnames(cptCaretaker) <- list("Caretaker"=nw, "StayAtHomeMom"=yn, "HouseholdHasNanny"=yn)

cptCaretakerEthnicity <- c(0.99, 0.01, 0.99, 0.01, 0.99, 0.01, 0.01, 0.99, 0.01, 0.99, 0.99, 0.01, 0.01, 0.99, 0.01, 0.99)
dim(cptCaretakerEthnicity) <- c(2, 2, 2, 2)
dimnames(cptCaretakerEthnicity) <- list("CaretakerEthnicity"=ao, "Caretaker"=nw, "WifeDemographics"=ao, "NannyDemographics"=ao)

net.disc <- custom.fit(net, dist=list(HusbandDemographics=cptHusbandDemographics, HusbandIsProfessional=cptHusbandIsProfessional, WifeDemographics=cptWifeDemographics, StayAtHomeMom=cptStayAtHomeMom, HouseholdHasNanny=cptHouseholdHasNanny, Caretaker=cptCaretaker, NannyDemographics=cptNannyDemographics, CaretakerEthnicity=cptCaretakerEthnicity))

Once we have the model, we can query the network using cpquery to estimate the probability that the person is the nanny or the wife, given the evidence we have (the husband is Caucasian and a professional, and the caretaker is Asian). Based on this evidence, the probability that she is the wife is roughly 90%, versus roughly 10% that she is the nanny.

probWife <- cpquery(net.disc, (Caretaker=="wife"), HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian", n=1000000)
probNanny <- cpquery(net.disc, (Caretaker=="nanny"), HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian", n=1000000)

[1] "The probability that the caretaker is his wife  = 0.898718647764449"
[1] "The probability that the caretaker is the nanny = 0.110892031547457"

In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!

The bnlearn package is available on CRAN. You can find the R code behind this post here on GitHub or here as a Jupyter Notebook.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News