So you are using a pipeline to have your data processed by different functions in R. For example, you may be imputing some missing values using the simputation package. Let us first load the only realistic dataset in R:
> data(retailers, package="validate")
> head(retailers, 3)
size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1 sc0 0.02 75 NA NA 1130 NA 18915 20045 NA
2 sc3 0.14 9 1607 NA 1607 131 1544 63 NA
3 sc3 0.14 NA 6886 -33 6919 324 6493 426 NA
This data is dirty: it has missing values and is full of errors. Let us do some imputations with simputation.
> out <- retailers %>%
+   impute_lm(other.rev ~ turnover) %>%
+   impute_median(other.rev ~ size)
> head(out, 3)
size incl.prob staff turnover other.rev total.rev staff.costs total.costs profit vat
1 sc0 0.02 75 NA 6114.775 1130 NA 18915 20045 NA
2 sc3 0.14 9 1607 5427.113 1607 131 1544 63 NA
3 sc3 0.14 NA 6886 -33.000 6919 324 6493 426 NA
Ok, cool, we know all that. But what if you’d like to know which value was imputed by which method? That’s where the lumberjack comes in.
The lumberjack operator is a 'pipe' operator that allows you to track changes in data.
> retailers$id <- seq_len(nrow(retailers))
> out <- retailers %>>%
+ start_log(log=cellwise$new(key="id")) %>>%
+ impute_lm(other.rev ~ turnover) %>>%
+ impute_median(other.rev ~ size) %>>%
+ dump_log()
Dumped a log at cellwise.csv
> read.csv("cellwise.csv") %>>% dplyr::arrange(key) %>>% head(3)
step time expression key variable old new
1 2 2017-06-23 21:11:05 CEST impute_median(other.rev ~ size) 1 other.rev NA 6114.775
2 1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover) 2 other.rev NA 5427.113
3 1 2017-06-23 21:11:05 CEST impute_lm(other.rev ~ turnover) 6 other.rev NA 6341.683
So, to track changes we only need to switch from %>% to %>>% and add the start_log() and dump_log() calls to the data pipeline. (To be sure: it works with any function, not only with simputation.) The package is on CRAN now; please see the introductory vignette for more examples and ways to customize it.
There are many ways to track changes in data. That is why the lumberjack is completely extensible. The package comes with a few loggers, but users or package authors are invited to write their own. Please see the extending lumberjack vignette for instructions.
If this post got you interested, please install the package using
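install.packages("lumberjack")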
You can get started with the introductory vignette or even just use the lumberjack operator %>>% as a (close) replacement of the %>% operator.
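For instance (a trivial made-up example, not from the original post), without a logger attached the lumberjack pipe simply passes data along, just like the magrittr pipe:

library(lumberjack)
women %>>% head(3)    # same result as women %>% head(3)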
As always, I am open to suggestions and comments, for example through the package's GitHub page.
Also, I will be talking at useR2017 about the simputation package, but I will sneak in a bit of lumberjack as well :p.
And finally, here’s a picture of a lumberjack smoking a pipe.
It really should be called a function composition operator, but potatoes/potahtoes.
R is incredible software for statistics and data science. But while the bits and bytes of software are an essential component of its usefulness, software needs a community to be successful. And that’s an area where R really shines, as Shannon Ellis explains in this lovely ROpenSci blog post. For software, a thriving community offers developers, expertise, collaborators, writers and documentation, testers, agitators (to keep the community and software on track!), and so much more. Shannon provides links where you can find all of this in the R community:
#rstats hashtag — a responsive, welcoming, and inclusive community of R users to interact with on Twitter
R-Ladies — a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
Local R meetup groups — a google search may show that there’s one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable
Rweekly — an incredible weekly recap of all things R
R-bloggers — an awesome resource to find posts from many different bloggers using R
Stack Overflow — chances are your R question has already been answered here (with additional resources for people looking for jobs)
I’ll add a couple of others as well:
R Conferences — The annual useR! conference is the major community event of the year, but there are many smaller community-led events on various topics.
Github — there’s a fantastic community of R developers on Github. There’s no directory, but the list of trending R developers is a good place to start.
The R Consortium — proposing or getting involved with an R Consortium project is a great way to get involved with the community
As I’ve said before, the R community is one of the greatest assets of R, and is an essential component of what makes R useful, easy, and fun to use. And you couldn’t find a nicer and more welcoming group of people to be a part of.
To learn more about the R community, be sure to check out Shannon’s blog post linked below.
Skewed data prevail in real life. Unless you observe trivial or near-constant processes, data is skewed one way or another due to outliers, long tails, errors or something else. Such effects create problems in visualizations when a few data elements are much larger than the rest.
Suppose we decided to visualize the top 30 U.S. trading partners using a bubble chart, which is simply a 2D scatter plot with the third dimension expressed through point size. Then U.S. trade partners become disks with imports and exports for the x and y coordinates and trade balance (abs(export - import)) for size:
China, Canada, and Mexico run far larger balances compared to the other 27 countries which causes most data points to collapse into crowded lower left corner. One way to “solve” this problem is to eliminate 3 mentioned outliers from the picture:
While this plot does look better, it no longer serves its original purpose of displaying all top trading partners. And the undesirable effect of outliers, though reduced, still presents itself with new ones: Japan, Germany, and the U.K. So let us bring all countries back into the mix by trying a logarithmic scale.
Quick refresher from algebra. The log function (in this example log base 10, but the same applies to the natural log or log base 2) is commonly used to transform positive real numbers, all because of its property of mapping multiplicative relationships into additive ones. Indeed, given numbers A, B, and C such that A = B × C, taking logs turns the product into a sum: log(A) = log(B) + log(C).
A logarithmic scale is simply a log transformation applied to all of a feature's values before plotting them. In our example we used it on both of the trading partners' features – imports and exports – which gives the bubble chart a new look:
The same data displayed on a logarithmic scale appear almost uniform, but do not forget that the farther points are from 0, the more orders of magnitude apart they are on the actual scale (observe this by scrolling back to the original plot). The main advantage of using a log scale in this plot is the ability to observe relationships between all top 30 countries without losing the whole picture and without collapsing the smaller points together.
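For readers who want to try this themselves, here is a minimal ggplot2 sketch of such a chart; the trade data frame and its values below are made up for illustration and are not the post's actual data:

library(ggplot2)

# hypothetical data: one row per trading partner (values are illustrative only)
trade <- data.frame(
  country = c("China", "Canada", "Mexico", "Japan", "Germany"),
  imports = c(462, 278, 294, 132, 114),
  exports = c(116, 266, 231, 63, 49)
)
trade$balance <- abs(trade$exports - trade$imports)

ggplot(trade, aes(x = imports, y = exports, size = balance, label = country)) +
  geom_point(alpha = 0.5) +
  geom_text(size = 3, vjust = -1) +
  scale_x_log10() +      # log scale on both axes spreads out the crowded corner
  scale_y_log10() +
  scale_size_area(max_size = 15)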
In an important 2005 article in the Australian Journal of Political Science, Simon Jackman set out a statistically-based approach to pooling polls in an election campaign. He describes the sensible intuitive approach of modelling a latent, unobserved voting intention (unobserved except on the day of the actual election) and treats each poll as a random observation based on that latent state space. Uncertainty associated with each measurement comes from sample size and bias coming from the average effect of the firm conducting the poll, as well as of course uncertainty about the state of the unobserved voting intention. This approach allows house effects and the latent state space to be estimated simultaneously, quantifies the uncertainty associated with both, and in general gives a much more satisfying method of pooling polls than any kind of weighted average.
Jackman gives a worked example of the approach in his excellent book Bayesian Analysis for the Social Sciences, using voting intention for the Australian Labor Party (ALP) in the 2007 Australian federal election for data. He provides JAGS code for fitting the model, but notes that with over 1,000 parameters to estimate (most of those parameters are the estimated voting intention for each day between the 2004 and 2007 elections) it is painfully slow to fit in general purpose MCMC-based Bayesian tools such as WinBUGS or JAGS – several days of CPU time on a fast computer in 2009. Jackman estimated his model with Gibbs sampling implemented directly in R.
Down the track, I want to implement Jackman’s method of polling aggregation myself, to estimate latent voting intention for New Zealand to provide an alternative method for my election forecasts. I set myself the familiarisation task of reproducing his results for the Australian 2007 election. New Zealand’s elections are a little complex to model because of the multiple parties in the proportional representation system, so I wanted to use a general Bayesian tool for the purpose to simplify my model specification when I came to it. I use Stan because its Hamiltonian Monte Carlo method of exploring the parameter space works well when there are many parameters – as in this case, with well over 1,000 parameters to estimate.
Stan describes itself as “a state-of-the-art platform for statistical modeling and high-performance statistical computation. Thousands of users rely on Stan for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.” It lets the programmer specify a complex statistical model, and given a set of data will return a range of parameter estimates that were most likely to produce the observed data. Stan isn’t something you use as an end-to-end workbench – it’s assumed that data manipulation and presentation is done with another tool such as R, Matlab or Python. Stan focuses on doing one thing well – using Hamiltonian Monte Carlo to estimate complex statistical models, potentially with many thousands of hierarchical parameters, with arbitrarily set prior distributions.
Caveat! – I’m fairly new to Stan and I’m pretty sure my Stan programs that follow aren’t best practice, even though I am confident they work. Use at your own risk!
Basic approach – estimated voting intention in the absence of polls
I approached the problem in stages, gradually making my model more realistic. First, I set myself the task of modelling latent first-preference support for the ALP in the absence of polling data. If all we had were the 2004 and 2007 election results, where might we have thought ALP support went between those two points? Here are my results:
For this first analysis, I specified that support for the ALP had to follow a random walk, with each daily change drawn from a normal distribution with a standard deviation of 0.25 percentage points. Why 0.25? Just because Jim Savage used it in his rough application of this approach to the US Presidential election in 2016. I’ll be relaxing this assumption later.
Here’s the R code that sets up the session, brings in the data from Jackman’s pscl R package, and defines a graphics function that I’ll be using for each model I create.
Here’s the Stan program that specifies this super simple model of changing ALP support from 2004 to 2007:
And here’s the R code that calls that Stan program and draws the resulting summary graphic. Stan works by compiling a program in C++ that is based on the statistical model specified in the *.stan file. Then the C++ program zooms around the high-dimensional parameter space, moving slower around the combinations of parameters that seem more likely given the data and the specified prior distributions. It can use multiple processors on your machine and works super fast given the complexity of what it’s doing.
Adding in one polling firm
Next I wanted to add a single polling firm. I chose Nielsen’s 42 polls because Jackman found they had a fairly low bias, which removed one complication for me as I built up my familiarity with the approach. Here’s the result:
That model was specified in Stan as set out below. The Stan program is more complex now; I’ve had to specify how many polls I have (y_n), the values for each poll (y_values), and the days since the last election each poll was taken (y_days). This way I only have to specify 42 measurement errors as part of the probability model – other implementations I’ve seen of this approach ask for an estimate of measurement error for each poll on each day, treating the days with no polls as missing values to be estimated. That obviously adds a huge computational load I wanted to avoid.
In this program, I haven’t yet added in the notion of a house effect for Nielsen. Each measurement Nielsen made is assumed to have been an unbiased one. Again, I’ll be relaxing this later. The state model is also the same as before, ie the standard deviation of the day-to-day innovations is still hard-coded as 0.25 percentage points.
Here’s the R code to prepare the data and pass it to Stan. Interestingly, fitting this model is noticeably faster than the one with no polling data at all. My intuition for this is that now the state space is constrained to being reasonably close to some actually observed measurements, it’s an easier job for Stan to know where is good to explore.
Including all five polling houses
Finally, the complete model replicating Jackman’s work:
As well as adding the other four sets of polls, I’ve introduced five house effects that need to be estimated (ie the bias for each polling firm/mode); and I’ve told Stan to estimate the standard deviation of the day-to-day innovations in the latent support for ALP rather than hard-coding it as 0.25. Jackman specified a uniform prior on [0, 1] for that parameter, but I found this led to lots of estimation problems for Stan. The Stan developers give some great practical advice on this sort of issue and I adapted some of that to specify the prior distribution for the standard deviation of day to day innovation as N(0.5, 0.5), constrained to be positive.
Here’s the Stan program:
Building the fact that there are 5 polling firms (or firm-mode combinations, as Morgan is in there twice) directly into the program must be bad practice, but seeing as there are different numbers of polls taken by each firm and on different days, I couldn’t work out a better way to do it. Stan doesn’t support ragged arrays, or objects like R’s lists, or (I think) convenient subsetting of tables, which would be the three ways I’d normally try to do that in another language. So I settled for the approach above, even though it has some ugly bits of repetition.
Here’s the R code that sorts the data and passes it to Stan
Estimates of polling house effects
Here’s the house effects estimated by me with Stan, compared to those in Jackman’s 2009 book:
Basically we got the same results – certainly close enough anyway. Jackman writes:
“The largest effect is for the face-to-face polls conducted by Morgan; the point estimate of the house effect is 2.7 percentage points, which is very large relative to the classical sampling error accompanying these polls.”
Interestingly, Morgan’s phone polls did much better.
Here’s the code that did that comparison:
So there we go – state space modelling of voting intention, with variable house effects, in the Australian 2007 federal election.
What’s that? You’ve heard of R? You use R? You develop in R? You know someone else who’s mentioned R? Oh, you’re breathing? Well, in that case, welcome! Come join the R community!
We recently had a group discussion at rOpenSci's #runconf17 in Los Angeles, CA about the R community. I initially opened the issue on GitHub. After this issue was well-received (check out the emoji-love below!), we realized people were keen to talk about this and decided to have an optional and informal discussion in person.
To get the discussion started I posed two general questions and then just let discussion fly. I prompted the group with the following:
The R community is such an asset. How do we make sure that everyone knows about it and feels both welcome and comfortable?
What are other languages/communities doing that we’re not? How could we adopt their good ideas?
The discussion focused primarily on the first point, and I have to say the group’s answers…were awesome. Take a look!
How to find the community
Everyone seemed to be in agreement that (1) the community is one of R’s biggest strengths and (2) a lot within the R community happens on twitter. During discussion, Julia Lowndes mentioned she joined twitter because she heard that people asked and answered questions about R there, and others echoed this sentiment. Simply, the R community is not just for ‘power users’ or developers. It’s a place for users and people interested in learning more about R. So, if you want to get involved in the community and you are not already, consider getting a twitter account and check out the #rstats hashtag. We expect you’ll be surprised by how responsive, welcoming, and inclusive the community is.
In addition to twitter, there are many resources available within the R community where you can learn more about all things R. Below is a brief list of resources mentioned during our discussion that had helped us feel more included in the community. Feel free to suggest more!
R-Ladies – a world-wide organization focused on promoting gender diversity within the R community, with more than 30 local chapters
Local R meetup groups – a google search may show that there’s one in your area! If not, maybe consider starting one! Face-to-face meet-ups for users of all levels are incredibly valuable
Rweekly – an incredible weekly recap of all things R
R-bloggers – an awesome resource to find posts from many different bloggers using R
Stack Overflow – chances are your R question has already been answered here (with additional resources for people looking for jobs)
No community is perfect, and being willing to consider our shortcomings and think about ways in which we can improve is so important. The group came up with a lot of great suggestions, including many I had not previously thought of personally.
Alice Daish did a great job capturing the conversation and allowing for more discussion online:
Be conscious of your tone. When in doubt, check out tone checker.
If you see someone being belittling in their answers, consider reaching out to the person who is behaving inappropriately. There was some agreement that reaching out privately may be more effective as a first approach than calling them out in public. Strong arguments against that strategy and in favor of a public response from Oliver Keyes can be found here.
Also, it’s often easier to defend on behalf of someone else than it is on one’s own behalf. Keep that in mind if you see negative things happening, and consider defending on someone else’s behalf.
Having a code of conduct is important. rOpenSci has one, and we like it a whole lot.
And, when times get tough, look to your community. Get out there. Be active. Communicate with one another. Tim Phan brilliantly summarized the importance of action and community in this thread:
Thank you to all who participated in this conversation and all who contribute to the community to make R such a fun language in which to work and develop! Thank you to rOpenSci for hosting and giving us all the opportunity to get to know one another and work together. I’m excited to see where this community goes moving forward!
Two hundred and twenty-nine new packages were submitted to CRAN in May. Here are my picks for the “Top 40”, organized into five categories: Data, Data Science and Machine Learning, Education, Miscellaneous, Statistics and Utilities.
bikedata v0.0.1: Download and aggregate data from public bicycle systems from around the world. There is a vignette.
datasauRus v0.1.2 (https://CRAN.R-project.org/package=datasauRus): The Datasaurus Dozen is a set of datasets that have the same summary statistics, despite having radically different distributions. As well as being an engaging variant on Anscombe’s Quartet, the data is generated in a novel way through a simulated annealing process. Look here for details, and in the vignette for examples.
suncalc v0.1: Implements an R interface to the ‘suncalc.js’ library, part of the SunCalc.net project, for calculating sun position, sunlight phases, moon position and lunar phase for a given location and time.
spacyr v0.9.0: Provides a wrapper for the Python spaCy Natural Language Processing library. Look here for help with installation and use.
learnr v0.9: Provides functions to create interactive tutorials for learning about R and R packages using R Markdown, using a combination of narrative, figures, videos, exercises, and quizzes. Look here to get started.
rODE v0.99.4: Contains functions to show students how an ODE solver is made and how classes can be effective for constructing equations that describe natural phenomena. Have a look at the free book Computer Simulations in Physics. There are several vignettes providing brief examples, including one on the Pendulum and another on Planets.
adaptiveGPCA v0.1: Implements the adaptive gPCA algorithm described in Fukuyama. The vignette shows an example using data stored in a phyloseq object.
BayesNetBP v1.2.1: Implements belief propagation methods for Bayesian Networks based on the paper by Cowell. There is a function to invoke a Shiny App.
RPEXE.RPEXT v0.0.1: Implements the likelihood ratio test and backward elimination procedure for the reduced piecewise exponential survival analysis technique described in Han et al. (2012, 2016). The vignette provides examples.
sfdct v0.0.3: Provides functions to construct a constrained ‘Delaunay’ triangulation from simple features objects. There is a vignette.
checkarg v0.1.0: Provides utility functions that allow checking the basic validity of a function argument or any other value, including generating an error and assigning a default in a single line of code.
CodeDepends v0.5-3: Provides tools for analyzing R expressions or blocks of code and determining the dependencies between them. The vignette shows how to use them.
desctable v0.1.0: Provides functions to create descriptive and comparative tables that are ready to be saved as csv, or piped to DT::datatable() or pander::pander() to integrate into reports. There is a vignette to get you started.
processx v2.0.0: Portable tools to run system processes in the background.
printr v0.1: Extends knitr generic function knit_print() to automatically print objects using an appropriate format such as Markdown or LaTeX. The vignette provides an introduction.
RHPCBenchmark v0.1.0: Provides microbenchmarks for determining the run-time performance of aspects of the R programming environment, and packages that are relevant to high-performance computation. There is an Introduction.
rlang v0.1.1: Provides a toolbox of functions for working with base types, core R features like the condition system, and core ‘Tidyverse’ features like tidy evaluation. The vignette explains R’s capabilities for creating Domain Specific Languages.
readtext v0.50: Provides functions for importing and handling text files and formatted text files with additional meta-data, including ‘.csv’, ‘.tab’, ‘.json’, ‘.xml’, ‘.pdf’, ‘.doc’, ‘.docx’, ‘.xls’, ‘.xlsx’ and other file types. There is a vignette.
tangram v0.2.6: Provides an extensible formula system to implement a grammar of tables for creating production-quality tables using a three-step process that involves a formula parser, statistical content generation from data, and rendering. There is a vignette introducing the Grammar, a Global Style for Rmd, and duplicating SAS PROC Tabulate.
tatoo v1.0.6: Provides functions to combine data.frames and to add metadata that can be used for printing and xlsx export. The vignette shows some examples.
mbgraphic v1.0.0: Implements a two-step process for describing univariate and bivariate behavior, similar to the cognostics measures proposed by Paul and John Tukey. First, measures describing variables are computed, and then plots are selected. The vignette describes the details.
polypoly v0.0.2: Provides tools for reshaping, plotting, and manipulating matrices of orthogonal polynomials. The vignette provides an overview.
It would be very tedious and unnecessary to repeat the union statement for any non-trivial number of sets. For a collection of sets $A = \{b_1, b_2, b_3, \ldots\}$, the first few unions would be written as:
$b_1 \cup b_2, \quad b_1 \cup b_2 \cup b_3, \quad b_1 \cup b_2 \cup b_3 \cup b_4, \; \ldots$
Thus a more general operation for performing unions is needed. This operation is denoted by the symbol $\bigcup$. For example, the set above and the desired unions of its member sets can be generalized to the following using the new notation:
$\bigcup A = \bigcup \{b_1, b_2, b_3, \ldots\} = b_1 \cup b_2 \cup b_3 \cup \cdots$
We can then state the following definition: for a set $A$, the union $\bigcup A$ of $A$ is defined by:
$\bigcup A = \{x : (\exists b \in A) \; x \in b\}$
For example, consider three sets $a$, $b$ and $c$ collected in the set $A = \{a, b, c\}$. The union of the three sets is written as:
$\bigcup A = \bigcup \{a, b, c\} = a \cup b \cup c$
Recalling our union axiom from a previous post: for two sets $a$ and $b$, there is a set whose members consist entirely of those belonging to $a$ or $b$, or both. More formally, the union axiom is stated as:
$\forall a \, \forall b \, \exists B \, \forall x \, (x \in B \leftrightarrow x \in a \lor x \in b)$
As we are now dealing with an arbitrary number of sets, we need an updated version of the union axiom to account for the change.
Restating the union axiom: for any set $A$, there exists a set $B$ whose members are exactly the elements of the elements of $A$. Stated more formally:
$\forall A \, \exists B \, \forall x \, \big( x \in B \leftrightarrow (\exists b \in A) \; x \in b \big)$
The definition of $\bigcup A$ can then be stated as:
$x \in \bigcup A \leftrightarrow (\exists b \in A) \; x \in b$
For example, we can demonstrate the updated axiom with the union of four sets $a$, $b$, $c$ and $d$:
$\bigcup \{a, b, c, d\} = a \cup b \cup c \cup d$
We can implement the set operation for an arbitrary number of sets by expanding upon the function we wrote previously.
Perform the set union operation of four sets:
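A sketch of one possible implementation in base R follows (this is illustrative rather than the exact function from the earlier post, and the four sets are made-up examples):

# union over an arbitrary collection of sets: fold base R's union() over a list
set.unions <- function(sets) {
  Reduce(union, sets)
}

s1 <- c(1, 2, 3, 5)
s2 <- c(2, 3, 5, 7)
s3 <- c(3, 5, 7, 9)
s4 <- c(1, 3, 5, 9)

set.unions(list(s1, s2, s3, s4))
# [1] 1 2 3 5 7 9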
Intersections of an Arbitrary Number of Sets
The intersection set operation can also be generalized to any number of sets. Consider the previous set containing an infinite number of sets.
As before, writing out all the intersections would be tedious and not elegant. The intersection can instead be written as:
$\bigcap A = b_1 \cap b_2 \cap b_3 \cap \cdots$
As in our previous post on set intersections, there is no need for a separate axiom for intersections, unlike unions. Instead, we can state the following theorem: for a nonempty set $A$, there exists a set $B$ such that for any element $x$:
$x \in B \leftrightarrow x \in b \text{ for every } b \in A$
Consider again four sets $a$, $b$, $c$ and $d$ collected in $A = \{a, b, c, d\}$. The intersection of the sets is written as:
$\bigcap A = a \cap b \cap c \cap d$
We can write another function to implement the set intersection operation given any number of sets.
Perform set intersections of the four sets specified earlier.
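A matching sketch for intersections, reusing the four illustrative sets defined above:

# intersection over an arbitrary collection of sets
set.intersections <- function(sets) {
  Reduce(intersect, sets)
}

set.intersections(list(s1, s2, s3, s4))
# [1] 3 5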
Enderton, H. (1977). Elements of set theory (1st ed.). New York: Academic Press.
Which policy instruments should we use to cost-effectively reduce greenhouse gas emissions? For a given technological level there are many economic arguments in favour of tradeable emission certificates or a carbon tax: they generate static efficiency by inducing emission reductions in those sectors and for those technologies where it is most cost effective.
Specialized subsidies, like the originally extremely high subsidies on solar energy in Germany and other countries, are often much more costly. Yet, we have seen a tremendous cost reduction for photovoltaics, which may not have been achieved on such a scale without those subsidies. And maybe, in a world where the current president of a major polluting country seems not to care much about the risks of climate change, the development of cheap green technology that can cost-effectively substitute for fossil fuels even without government support is the most decisive factor in fighting climate change.
Yet, the impact of different policy measures on innovation in green technology is very hard to assess. Are focused subsidies or mandates the best way, or can emission trading or carbon taxes also considerably boost innovation in green technologies? That is a tough quantitative question, but we can try to get at least some evidence.
In their article Environmental Policy and Directed Technological Change: Evidence from the European Carbon Market, Review of Economics and Statistics (2016), Raphael Calel and Antoine Dechezlepretre study the impact of the EU carbon emissions trading system on the patent activities of regulated firms. By matching them with unregulated firms, they estimate that emissions trading has increased the regulated firms' innovation activity in low-carbon technologies by 10%.
As part of his master's thesis at Ulm University, Arthur Schäfer has created an RTutor problem set that allows you to replicate the main insights of the paper in an interactive fashion.
Here is a screenshot:
As in previous RTutor problem sets, you can enter free R code in a web-based Shiny app. The code will be automatically checked, and you can get hints on how to proceed. In addition, you are challenged by many multiple-choice quizzes.
To install the problem set locally, first install RTutor as explained here:
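The linked instructions are not reproduced here; a rough sketch of the usual steps looks like the following (the GitHub repository paths are assumptions, and the problem set repository name in particular is hypothetical):

# install RTutor and its dependencies from GitHub (repository path assumed)
install.packages("devtools")
devtools::install_github("skranz/RTutor")

# then install the problem set package from its own repository
# (hypothetical repository name, shown for illustration only)
devtools::install_github("ArthurS90/RTutorEmissionTrading")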
There is also an online version hosted by shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 30 hours of total usage time per month, so it may be greyed out when you click on it.)
Power BI has long had the capability to include custom R charts in dashboards and reports. But in sharp contrast to standard Power BI visuals, these R charts were static. While R charts would update when the report data was refreshed or filtered, it wasn’t possible to interact with an R chart on the screen (to display tool-tips, for example). But in the latest update to Power BI, you can create R custom visuals that embed interactive R charts, like this:
The above chart was created with the plotly package, but you can also use htmlwidgets or any other R package that creates interactive graphics. The only restriction is that the output must be HTML, which can then be embedded into the Power BI dashboard or report. You can also publish reports including these interactive charts to the online Power BI service to share with others. (In this case though, you’re restricted to those R packages supported in Power BI online.)
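As a generic illustration (not code from the Power BI documentation), an interactive chart of this kind can be produced in R with plotly, whose output is an HTML widget:

library(ggplot2)
library(plotly)

# build a basic ggplot, then convert it to an interactive HTML widget
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

ggplotly(p)   # interactive version with tool-tips, zoom and pan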
You can also create your own custom R visuals. The documentation explains how to create custom R visuals from HTML output, and you can also use the code on Github for the provided visuals linked above as a guide. For more on the new custom visuals, take a look at the blog post linked below.
Last Friday marked my two year anniversary working as a data scientist at Stack Overflow. At the end of my first year I wrote a blog post about my experience, both to share some of what I’d learned and as a form of self-reflection.
After another year, I’d like to revisit the topic. While my first post focused mostly on the transition from my PhD to an industry position, here I’ll be sharing what has changed for me in my job in the last year, and what I hope the next year will bring.
Most of my current statistical education has to be self-driven, and I need to be very cautious about my work: if I use an inappropriate statistical assumption in a report, it’s unlikely anyone else will point it out.
This continued to be a challenge, and fortunately in December we hired our second data scientist, Julia Silge.
I have some very exciting news! I am joining the data team at @StackOverflow. ✨✨✨
We started hiring for the position in September, and there were a lot of terrific candidates I got to meet and review during the application and review process. But I was particularly excited to welcome Julia to the team because we’d been working together during the course of the year, ever since we met and created the tidytext package at the 2016 rOpenSci unconference.
Julia, like me, works on analysis and visualization rather than building and productionizing features, and having a second person in that role has made our team much more productive. This is not just because Julia is an exceptional colleague, but because the two of us can now collaborate on statistical analyses or split them up to give each more focus. I did enjoy being the first data scientist at the company, but I’m glad I’m no longer the only one. Julia’s also a skilled writer and communicator, which was essential in achieving the next goal.
Company blog posts
In last year’s post, I shared some of the work that I’d done to explore the landscape of software developers, and set a goal for the following year (emphasis is new):
I’m also just intrinsically pretty interested in learning about and visualizing this kind of information; it’s one of the things that makes this a fun job. One plan for my second year here is to share more of these analyses publicly. In a previous post I looked at which technologies were the most polarizing, and I’m looking forward to sharing more posts like that soon.
I’ve really enjoyed sharing these snapshots of the software developer world, and I’m looking forward to sharing a lot more on the blog this next year.
Teaching R at Stack Overflow
Last year I mentioned that part of my work has been developing data science architecture, and trying to spread the use of R at the company.
This also has involved building R tutorials and writing “onboarding” materials… My hope is that as the data team grows and as more engineers learn R, this ecosystem of packages and guides can grow into a true internal data science platform.
At the time, R was used mostly by three of us on the data team (Jason Punyon, Nick Larsen, and me). I’m excited to say it’s grown since then, and not just because of my evangelism.
“I’ve been thinking of switching to R, do you have any opinions on that?” he asked me at lunch, ill-advisedly
Every Friday since last September, I’ve met with a group of developers to run internal “R sessions”, in which we analyze some of our data to develop insights and models. Together we’ve made discoveries that have led to real projects and features, for both the Data Team and other parts of the engineering department.
There are about half a dozen developers who regularly take part, and they all do great work. But I especially appreciate Ian Allen and Jisoo Shin for coming up with the idea of these sessions back in September, and for following through in the months since. Ian and Jisoo joined the company last summer, and were interested in learning R to complement their development of product features. Their curiosity, and that of others in the team, has helped prove that data analysis can be a part of every engineer’s workflow.
Writing production code
My relationship to production code (the C# that runs the actual Stack Overflow website) has also changed. In my first year I wrote much more R code than C#, but in the second I’ve stopped writing C# entirely. (My last commit to production was more than a year ago, and I often go weeks without touching my Windows partition). This wasn’t really a conscious decision; it came from a gradual shift in my role on the engineering team. I’d usually rather be analyzing data than shipping features, and focusing entirely on R rather than splitting attention across languages has been helpful for my productivity.
Instead, I work with engineers to implement product changes based on analyses and push models into production. One skill I’ve had to work on is writing technical specifications, both for data sources that I need to query and for models that I’m proposing for production. One developer I’d like to acknowledge specifically is Nick Larsen, who works with me on the Data Team. Many of the blog posts I mention above answer questions like “What tags are visited in New York vs San Francisco?” or “What tags are visited at what hour of the day?”, and these wouldn’t have been possible without Nick. Until recently, this kind of traffic data was very hard to extract and analyze, but he developed processes that extract and transform the data into more readily queryable tables. This has made many important analyses possible besides the blog posts, and I can’t appreciate this work enough.
One team that I’ve worked with that I hadn’t in the first year is Display Ads. Display Ads are separate from job ads, and are purchased by companies with developer-focused products and services.
For example, I’ve been excited to work more closely with Steve Feldman on the Display Ad Operations team. If you’re wondering why I’m not ashamed to work on ads, please read Steve’s blog post on how we sell display ads at Stack Overflow – he explains it better than I could. We’ve worked on several new methods for display ad targeting and evaluation, and I think there’s a lot of potential for data to have a positive impact for the company.
There are some goals I didn’t achieve. I’ve had a longstanding interest in getting R into production (and we’ve idly investigated some approaches like Microsoft R Server), but as of now we’re still productionizing models by rewriting them in C#. And there are many teams at Stack Overflow that I’d like to give better support to – prioritizing the Data Team’s time has been a challenge, though having a second data scientist has helped greatly. But I’m still happy with how my work has gone, and excited about the future.