The Proof-Calculation Ping Pong

By Theory meets practice…


Abstract

The proof-calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until the results agree. We use the daff package to simplify the comparison of the two results.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) from github.

Introduction

If you are a statistician working in climate science, data-driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or having a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g., version control and unit tests provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames, and we shall illustrate its use by automating the comparison step of the statistical proof-calculation ping-pong process.
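As a minimal illustration of the idea, diff_data compares two data frames and returns a diff object whose get_data method exposes the changed cells (a toy sketch with made-up data, not part of the case story below):

library(daff)
x <- data.frame(id = 1:3, value = c(10, 20, 30))
y <- data.frame(id = 1:3, value = c(10, 25, 30))
## The diff marks row 2, where value changed from 20 to 25
diff_data(x, y)$get_data()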

Case Story

Ada and Bob have to proof-calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. For this post we will ignore the complex survey part of the problem and just pretend that our sample corresponds to the population (complete inventory count). After the questionnaire phase, the following data are available to Ada and Bob.

Id  Version  Region  Sales Volume  Staff Costs  People
01        1       A         23000        10003       5
02        1       B         12200         7200       1
03        1      NA         19500         7609       2
04        1       A         17500        13000       3
05        1       B        119500        90000      NA
05        2       B        119500        95691      19
06        1       B            NA        19990       6
07        1       A         19123        20100       8
08        1       D         25000          100      NA
09        1       D         45860        32555       9
10        1       E         33020        25010       7
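For readers who want to follow along without cloning the repository, the table above can be entered directly in R (a sketch re-creating the values shown; the actual data file is part of the blog's github repository):

## Hypothetical re-creation of the questionnaire data shown above
fitness <- tibble::tribble(
  ~Id,  ~Version, ~Region, ~`Sales Volume`, ~`Staff Costs`, ~People,
  "01", 1, "A",  23000, 10003,  5,
  "02", 1, "B",  12200,  7200,  1,
  "03", 1, NA,   19500,  7609,  2,
  "04", 1, "A",  17500, 13000,  3,
  "05", 1, "B", 119500, 90000, NA,
  "05", 2, "B", 119500, 95691, 19,
  "06", 1, "B",     NA, 19990,  6,
  "07", 1, "A",  19123, 20100,  8,
  "08", 1, "D",  25000,   100, NA,
  "09", 1, "D",  45860, 32555,  9,
  "10", 1, "E",  33020, 25010,  7
)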

Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or contains inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region-stratified estimates of:

  • the overall number of fitness centers
  • the total sales volume
  • the total staff costs
  • the total number of people employed in fitness centers

However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

Ada

Ada loves the tidyverse and in particular dplyr. This is her solution:

library(dplyr)
ada <- fitness %>% na.omit() %>% group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People))
ada
## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
##   <fctr>     <int>          <int>         <int>  <int>
## 1      A         3          59623         43103     16
## 2      B         1         119500         95691     19
## 3      D         1          45860         32555      9
## 4      E         1          33020         25010      7

Bob

Bob has a dark past as a database developer and hence only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite-within-R approach:

library(RSQLite)
## Create an ad-hoc database (filePath is assumed to point to a writable directory)
db <- dbConnect(SQLite(), dbname = file.path(filePath, "Test.sqlite"))
## Move the fitness data into the ad-hoc DB
dbWriteTable(conn = db, name = "FITNESS", fitness, overwrite = TRUE, row.names = FALSE)
## Query using SQL
bob <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) AS NoOfUnits,
           SUM([Sales Volume]) AS [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM FITNESS
    WHERE [Sales Volume] > 17500 GROUP BY Region
")
bob
##   Region NoOfUnits Sales Volume Staff Costs People
## 1   <NA>         1        19500        7609      2
## 2      A         2        42123       30103     13
## 3      B         2       239000      185691     19
## 4      D         2        70860       32655      9
## 5      E         1        33020       25010      7

The Ping-Pong phase

After Ada and Bob each have a result, they compare their resulting data.frames using the daff package, which was recently presented by @edwindjonge at the useR! 2017 conference in Brussels.

library(daff)
diff <- diff_data(ada, bob)
diff$get_data()
##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++   <NA>         1          19500          7609      2
## 2  ->      A      3->2   59623->42123  43103->30103 16->13
## 3  ->      B      1->2 119500->239000 95691->185691     19
## 4  ->      D      1->2   45860->70860  32555->32655      9
## 5          E         1          33020         25010      7

After Ada’s and Bob’s first serves, the two realize that their results agree for only one region (‘E’). Note: daff currently has the semi-annoying feature of printing only n lines of the head and tail of the diff rather than all differences. For the purpose of this post we therefore override the print method such that it always shows all rows with differences.

## Override printing to show all rows of the diff which contain differences
print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")
print(diff)
##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++   <NA>         1          19500          7609      2
## 2  ->      A      3->2   59623->42123  43103->30103 16->13
## 3  ->      B      1->2 119500->239000 95691->185691     19
## 4  ->      D      1->2   45860->70860  32555->32655      9

The two decide to first focus on agreeing on the number of units per region.
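One way to obtain this restricted view is to select the relevant columns from the diff's underlying data (a sketch; not necessarily the code used originally):

diff$get_data() %>%
  filter(`@@` != "") %>%
  select(`@@`, Region, NoOfUnits)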

##    @@ Region NoOfUnits
## 1 +++   <NA>         1
## 2  ->      A      3->2
## 3  ->      B      1->2
## 4  ->      D      1->2

One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Bob therefore takes another look at how he handles missing values and improves his code accordingly:

Pong Bob

bob2 <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) AS NoOfUnits,
           SUM([Sales Volume]) AS [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM FITNESS
    WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
    GROUP BY Region
")
diff2 <- diff_data(ada, bob2, ordered = FALSE, count_like_a_spreadsheet = FALSE)
diff2 %>% print()
##   @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 ->      A      3->2   59623->42123  43103->30103 16->13
## 2 ->      B      1->2 119500->239000 95691->185691     19
## 3 ->      D      1->2   45860->70860  32555->32655      9

Ping Bob

Better. The NA region is gone, but quite a few differences remain. Note: you may at this point want to stop reading and try to fix the analysis yourself – the data and code are available from the github repository.

Pong Bob

Now Bob notices that he forgot to handle the duplicate records and massages the SQL query as follows:

bob3 <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) AS NoOfUnits,
           SUM([Sales Volume]) AS [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM
    -- Keep only the row with the highest version number per center
    (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People
     FROM FITNESS GROUP BY Id)
    WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
    GROUP BY Region
")
diff3 <- diff_data(ada, bob3, ordered = FALSE, count_like_a_spreadsheet = FALSE)
diff3 %>% print()
##    @@ Region NoOfUnits Sales Volume  Staff Costs People
## 1 ...    ...       ...          ...          ...    ...
## 2  ->      D      1->2 45860->70860 32555->32655      9

Ping Ada

Comparing with Ada, Bob is somewhat envious that she could simply use dplyr’s group_by and top_n functions. However, daff shows that one difference still remains. A closer look at Ada’s code reveals that she accidentally leaves out one unit in region D. The reason is her too liberal use of na.omit, which also removes the one entry with an NA in one of the less important columns. This raises the question of whether partial records should be included at all, since the sums in the different columns are then taken over different numbers of units. After consulting the standard operating procedure (SOP) for this kind of survey, the two decide to include the observations where possible. Here is Ada’s modified code:

ada2 <- fitness %>% filter(!is.na(Region)) %>% group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People))
(diff_final <- diff_data(ada2, bob3)) %>% print()
##    @@ Region NoOfUnits ... Staff Costs People
## 1 ...    ...       ... ...         ...    ...
## 2  ->      D         2 ...       32655  NA->9

Pong Ada

Oops, forgot to take care of the NA in the summation:

ada3 <- fitness %>% filter(!is.na(Region)) %>% group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People, na.rm = TRUE))
diff_final <- diff_data(ada3, bob3)
length(diff_final$get_data()) == 0
## [1] TRUE

Conclusion

Finally, their results agree, the analysis moves to production, and the results are published in a nice report.

Question 1: Do you agree with their results?

ada3
## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
##   <fctr>     <int>          <int>         <int>  <int>
## 1      A         3          59623         43103     16
## 2      B         1         119500         95691     19
## 3      D         2          70860         32655      9
## 4      E         1          33020         25010      7

As shown, the ping-pong game is quite manual and becomes particularly annoying if at some point someone steps into the office with a statement like “Btw, I found some extra questionnaires, which need to be added to the analysis asap”. However, the two now-aligned analysis scripts and the corresponding daff overlay could be put into a script that is triggered every time the data change. In case new discrepancies emerge, i.e. length(diff$get_data()) > 0, the two could then be informed automatically; a sketch of such a check follows below.
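As a sketch of what such a trigger could look like (analysis_ada and analysis_bob are hypothetical wrappers around the two final pipelines, ada3 and bob3 above):

## Hypothetical automated re-check, run whenever the data change
check_agreement <- function(fitness) {
  ada <- analysis_ada(fitness)  # Ada's dplyr pipeline as a function (hypothetical)
  bob <- analysis_bob(fitness)  # Bob's SQL pipeline as a function (hypothetical)
  d <- diff_data(ada, bob)
  ## Notify the two if any discrepancy shows up in the diff
  if (length(d$get_data()) > 0) {
    warning("New discrepancies between Ada and Bob - another ping-pong round is needed!")
  }
  invisible(d)
}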

Question 2: Are you aware of any other good ways and tools to structure and automate such a process? If so, please share your experiences as a Disqus comment below.

