R for Data Science is our first course of the autumn term. It takes place in October 4-5 in a location close to Milano Lima.
If you want to deepen your data analysis knowledge using the most modern R tools, or you want to figure out if R is the right solution for you, this is the your class. This course is for people that are willing to learn about R and would like to get an overview of its capabilities for data science or those that have very little knowledge of R.
R for Data Science Outline
– A bit of R history and online resources – R and R-Studio installation and configuration – Your first R session – Your first R markdown document – R objects: data and functions – Data import from external sources: excel and database connection – Data manipulation with tidyverse – Tidy data with tidyr – Data visualization using ggplot – An overview of statistical models and data mining with R
The course is for max 6 attendees.
R for Data Science is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.
Location
The course location is 550 mt. (7 minutes on walk) from Milano central station and just 77 mt. (1 minute on walk) from Lima subway station.
October 11-12: Statistics for Data Science. Develop a wide variety of statistical models with R, from the simplest Linear Regression to the most sophisticated GLM models. Reserve now!
October 25-26: Machine Learning with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now!
November 7-8: Data Visualization and Dashboard with R. Show the story behind your data: create beautiful effective visualizations and interactive Shiny dashboards. Reserve now!
November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!
In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailor-made course, picking all the topics that are interesting for your organization and dropping the rest.
Every day in New York City, millions of commuters take part in a giant race to determine transportation supremacy. Cars, bikes, subways, buses, ferries, and more all compete against one another, but we never get much explicit feedback as to who “wins.” I’ve previously written about NYC’s public taxi data and Citi Bike share data, and it occurred to me that these datasets can help identify who’s fastest, at least between cars and bikes. In fact, I’ve built an interactive guide that shows when a Citi Bike is faster than a taxi, depending on the route and the time of day.
The methodology and findings will be explained more below, and all code used in this post is available open-source on GitHub.
Interactive guide to when taxis are faster or slower than Citi Bikes
Pick a starting neighborhood and a time. The map shows whether you’d expect get to each neighborhood faster with a taxi (yellow) or a Citi Bike (dark blue).
Alphabet CityBattery ParkBattery Park CityBloomingdaleCentral ParkChinatownClinton EastClinton WestEast ChelseaEast Harlem SouthEast VillageFinancial District NorthFinancial District SouthFlatironGarment DistrictGramercyGreenwich Village NorthGreenwich Village SouthHudson SqKips BayLenox Hill EastLenox Hill WestLincoln Square EastLincoln Square WestLittle Italy/NoLiTaLower East SideManhattan ValleyMeatpacking/West Village WestMidtown CenterMidtown EastMidtown NorthMidtown SouthMurray HillPenn Station/Madison Sq WestSeaportSoHoStuy Town/Peter Cooper VillageSutton Place/Turtle Bay NorthTimes Sq/Theatre DistrictTriBeCa/Civic CenterTwo Bridges/Seward ParkUN/Turtle Bay SouthUnion SqUpper East Side NorthUpper East Side SouthUpper West Side NorthUpper West Side SouthWest Chelsea/Hudson YardsWest VillageWorld Trade CenterYorkville EastYorkville West
BedfordBoerum HillBrooklyn HeightsBrooklyn Navy YardBushwick SouthCarroll GardensClinton HillCobble HillColumbia StreetCrown Heights NorthDowntown Brooklyn/MetroTechDUMBO/Vinegar HillEast WilliamsburgFort GreeneGowanusGreenpointPark SlopeProspect HeightsProspect ParkRed HookSouth WilliamsburgStuyvesant HeightsSunset Park WestWilliamsburg (North Side)Williamsburg (South Side)
Long Island City/Hunters PointLong Island City/Queens PlazaQueensbridge/RavenswoodSunnyside
Or click the map to change neighborhoods
Weekday time window
From Midtown East, weekdays 8:00 AM–11:00 AM
Taxi vs. Citi Bike travel times to other neighborhoods
Turn on javascript (or click through from RSS) to view the interactive taxi vs. Citi Bike map.
Data via NYC TLC and Citi Bike
Based on trips 7/1/2016–6/30/2017
toddwschneider.com
Hover over a neighborhood (tap on mobile) to view travel time stats
40% of weekday taxi trips—over 50% during peak hours—would be faster as Citi Bike rides
I estimate that 40% of weekday taxi trips within the Citi Bike service area would expect to be faster if switched to a Citi Bike, based on data from July 2016 to June 2017. During peak midday hours, more than 50% of taxi trips would expect to be faster as Citi Bike rides.
There are some significant caveats to this estimate. In particular, if many taxi riders simultaneously switched to Citi Bikes, the bike share system would probably hit severe capacity constraints, making it difficult to find available bikes and docks. Increased bike usage might eventually lead to fewer vehicles on the road, which could ease vehicle congestion, and potentially increase bike lane congestion. It’s important to acknowledge that when I say “40% of taxi trips would be faster if they switched to Citi Bikes”, we’re roughly considering the decision of a single able-bodied person, under the assumption that everyone else’s behavior will remain unchanged.
Heading crosstown in Manhattan? Seriously consider taking a bike instead of a car!
Crosstown Manhattan trips are generally regarded as more difficult than their north-south counterparts. There are fewer subways that run crosstown, and if you take a car, the narrower east-west streets often feel more congested than the broad north-south avenues with their synchronized traffic lights. Crosstown buses are so notoriously slow that they’ve been known to lose races against tricycles.
I divided Manhattan into the crosstown zones pictured above, then calculated the taxi vs. Citi Bike win rate for trips that started and ended within each zone. Taxis fare especially badly in the Manhattan central business district. If you take a midday taxi that both starts and ends between 42nd and 59th streets, there’s over a 70% chance that the trip would have been faster as a Citi Bike ride.
Keep in mind that’s for all trips between 42nd and 59th streets. For some of the longest crosstown routes, for example, from the United Nations on the far east side to Hell’s Kitchen on the west, Citi Bikes beat taxis 90% of the time during the day. It’s worth noting that taxis made 8 times as many trips as Citi Bikes between 42nd and 59th streets from July 2016 to June 2017—almost certainly there would be less total time spent in transit if some of those taxi riders took bikes instead.
Hourly graphs for all of the crosstown zones are available here, and here’s a summary table for weekday trips between 8:00 AM and 7:00 PM:
Manhattan crosstown zone
% taxis lose to Citi Bikes
96th–110th
41%
77th–96th
36%
59th–77th
54%
42nd–59th
69%
14th–42nd
64%
Houston–14th
54%
Canal–Houston
60%
Below Canal
55%
A reminder that this analysis restricts to trips that start and end within the same zone, so for example a trip from 23rd St to 57th St would be excluded because it starts and ends in different zones.
Taxis fare better for trips that stay on the east or west sides of Manhattan: 35% of daytime taxi trips that start and end west of 8th Avenue would expect to be faster as Citi Bike trips, along with 38% of taxi trips that start and end east of 3rd Avenue. Taxis also generally beat Citi Bikes on longer trips:
Taxis are losing more to Citi Bikes over time
When the Citi Bike program began in July 2013, less than half of weekday daytime taxi trips would have been faster if switched to Citi Bikes. I ran a month-by-month analysis to see how the taxi vs. Citi Bike calculus has changed over time, and discovered that taxis are getting increasingly slower compared to Citi Bikes:
Note that this month-by-month analysis restricts to the original Citi Bike service area, before the program expanded in August 2015. The initial expansion was largely into Upper Manhattan and the outer boroughs, where taxis generally fare better than bikes, and so to keep things consistent, I restricted the above graph to areas that have had Citi Bikes since 2013.
Taxis are losing more to Citi Bikes over time because taxi travel times have gotten slower, while Citi Bike travel times have remained roughly unchanged. I ran a pair of linear regressions to model travel times as a function of:
trip distance
time of day
precipitation
whether the route crosses between Manhattan and the outer boroughs
month of year
year
The regression code and output are available on GitHub: taxi, Citi Bike
As usual, I make no claim that this is a perfect model, but it does account for the basics, and if we look at the coefficients by year, it shows that, holding the other variables constant, a taxi trip in 2017 took 17% longer than the same trip in 2009. For example, a weekday morning trip from Midtown East to Union Square that took 10 minutes in 2009 would average 11:45 in 2017.
The same regression applied to Citi Bikes shows no such slowdown over time, in fact Citi Bikes got slightly faster. The regressions also show that:
Citi Bike travel times are less sensitive to time of day than taxi travel times. A peak midday taxi trip averages 40% longer than the same trip at off-peak hours, while a peak Citi Bike trip averages 15% longer than during off-peak hours.
Rainy days are associated with 2% faster Citi Bike travel times and 1% slower taxi travel times.
For taxis, fall months have the slowest travel times, but for Citi Bikes, summer has the slowest travel times. For both, January has the fastest travel times.
Taxis are more prone to very bad days
It’s one thing to say that 50% of midday taxi trips would be faster as Citi Bike rides, but how much does that vary from day to day? You could imagine there are some days with severe road closures, where more nimble bikes have an advantage getting around traffic, or other days in the dead of summer, when taxis might take advantage of the less crowded roads.
I ran a more granular analysis to measure win/loss rates for individual dates. Here’s a histogram of the taxi loss rate—the % of taxi trips we’d expect to be faster if switched to Citi Bikes—for weekday afternoon trips from July 2016 to June 2017:
Many days see a taxi loss rate of just over 50%, but there are tails on both ends, indicating that some days tilt in favor of either taxis or Citi Bikes. I was curious if we could learn anything from the outliers on each end, so I looked at individual dates to see if there were any obvious patterns.
The dates when taxis were the fastest compared to Citi Bikes look like dates that probably had less traffic than usual. The afternoon with the highest taxi win rate was Monday, October 3, 2016, which was the Jewish holiday of Rosh Hashanah, when many New Yorkers would have been home from work or school. The next 3 best days for taxis were all Mondays in August, when I’d imagine a lot of people were gone from the city on vacation.
The top 4 dates where Citi Bikes did best against taxis were all rainy days in the fall of 2016. I don’t know why rainy days make bikes faster relative to taxis, maybe rain causes traffic on the roads that disproportionately affects cars, but it’s also possible that there’s a selection bias. I’ve written previously about how the weather predicts Citi Bike ridership, and not surprisingly there are fewer riders when it rains. Maybe the folks inclined to ride bikes when it’s raining are more confident cyclists, who also pedal faster when the weather is nice. It’s also possible that rainy-day cyclists are particularly motivated to pedal faster so they can get out of the rain. I don’t know if these are really the causes, but they at least sound believable, and would explain the observed phenomenon.
When the President is in town, take a bike
June 8, 2016 was a particularly good day for Citi Bikes compared to taxis. President Obama came to town that afternoon, and brought the requisite street closures with him. I poked around a bit looking for the routes that appeared to be the most impacted by the President’s visit, and came to afternoon trips from Union Square to Murray Hill. On a typical weekday afternoon, taxis beat Citi Bikes 57% of the time from Union Square to Murray Hill, but on June 8, Citi Bikes won 90% of the time. An even more dramatic way to see the Obama effect is to look at daily median travel times:
A typical afternoon taxi takes 8 minutes, but on June 8, the median was over 21 minutes. The Citi Bike median travel time is almost always 9 minutes, including during President Obama’s visit.
The same graph shows a similar phenomenon on September 19, 2016, when the annual United Nations General Assembly shut down large swathes of Manhattan’s east side, including Murray Hill. Although the impact was not as severe as during President Obama’s visit, the taxi median time doubled on September 19, while the Citi Bike median time again remained unchanged.
The morning of June 15, 2016 offers another example, this time on the west side, when an overturned tractor trailer shut down the Lincoln Tunnel for nearly seven hours. Taxi trips from the Upper West Side to West Chelsea, which normally take 15 minutes, took over 35 minutes. Citi Bikes typically take 18 minutes along the same route, and June 15 was no exception. Taxis would normally expect to beat Citi Bikes 67% of the time on a weekday morning, but on June 15, Citi Bikes won over 92% of the time.
These are of course three hand-picked outliers, and it wouldn’t be entirely fair to extrapolate from them to say that Citi Bikes are always more resilient than taxis during extreme circumstances. The broader data shows, though, that taxis are more than twice as likely as Citi Bikes to have days when a route’s median time is at least 5 minutes slower than average, and more than 3.5 times as likely to be at least 10 minutes slower, so it really does seem that Citi Bikes are better at minimizing worst-case outcomes.
Why have taxis gotten slower since 2009?
The biggest slowdowns in taxi travel times happened in 2014 and 2015. The data and regression model have nothing to say about why taxis slowed down so much over that period, though it might be interesting to dig deeper into the data to see if there are specific regions where taxis have fared better or worse since 2009.
Uber usage took off in New York starting in 2014, reaching over 10,000 vehicles dispatched per week by the beginning of 2015. There are certainly people who blame Uber—and other ride-hailing apps like Lyft and Juno—for increasing traffic, but the city’s own 2016 traffic report did not blame Uber for increased congestion.
It’s undoubtedly very hard to do an accurate study measuring ride-hailing’s impact on traffic, and I’m especially wary of people on both sides who have strong interests in blaming or exonerating the ride-hailing companies. Nevertheless, if I had to guess the biggest reasons taxis got particularly slower in 2014 and 2015, I would start with the explosive growth of ride-hailing apps, since the timing looks to align, and the publicly available data shows that they account for tens of thousands of vehicles on the roads.
On the other hand, if ride-hailing were the biggest cause of increased congestion in 2014 and 2015, it doesn’t exactly make sense that taxi travel times have stabilized a bit in 2016 and 2017, because ride-hailing has continued to grow, and while taxi usage continues to shrink, the respective rates of growth and shrinkage are not very different in 2016–17 than they were in 2014–15. One explanation could be that starting in 2016 there was a reduction in other types of vehicles—traditional black cars, private vehicles, etc.—to offset ride-hailing growth, but I have not seen any data to support (or refute) that idea.
There are also those who blame bike lanes for worsening vehicle traffic. Again, different people have strong interests arguing both sides, but it seems like there are more data points arguing that bike lanes do not cause traffic (e.g. here, here, and here) than vice versa. I wasn’t able to find anything about the timing of NYC bike lane construction to see how closely it aligns with the 2014–15 taxi slowdown.
Lots of other factors could have contributed to worsening traffic: commuter-adjusted population growth, subway usage, decaying infrastructure, construction, and presidential residences are just a few that feel like they could be relevant. I don’t know the best way to account for all of them, but it does seem like if you want to get somewhere in New York quickly, it’s increasingly less likely that a car is your best option.
How representative are taxis and Citi Bikes of all cars and bikes?
I think it’s not a terrible assumption that taxis are representative of typical car traffic in New York. If anything, maybe taxis are faster than average cars since taxi drivers are likely more experienced—and often aggressive—than average drivers. On the other hand, taxi drivers seem anecdotally less likely to use a traffic-enabled GPS, which maybe hurts their travel times.
Citi Bikes are probably slower than privately-owned bikes. Citi Bikes are designed to be heavy and stable, which maybe makes them safer, but lowers their speeds. Plus, I’d guess that biking enthusiasts, who might be faster riders, are more likely to ride their own higher-performance bikes. Lastly, Citi Bike riders might have to spend extra time at the end of a trip looking for an available dock, whereas privately-owned bikes have more parking options.
Weighing up these factors, I would guess that if we somehow got the relevant data to analyze the broader question of all cars vs. all bikes, the results would tip a bit in favor of bikes compared to the results of the narrower taxi vs. Citi Bike analysis. It’s also worth noting that both taxis and Citi Bikes have additional time costs that aren’t accounted for in trip durations: you have to hail a taxi, and there might not be a Citi Bike station in the near vicinity of your origin or destination.
What are the implications of all this?
One thing to keep in mind is that even though the taxi and Citi Bike datasets are the most conveniently available for analysis, New Yorkers don’t limit their choices to cars and bikes. The subway, despite its poor reputation of late, carries millions of people every day, more than taxis, ride-hailing apps, and Citi Bikes combined, so it’s not like “car vs. bike” is always the most relevant question. There are also legitimate reasons to choose a car over a bike—or vice versa—that don’t depend strictly on expected travel time.
Bike usage in New York has increased dramatically over the past decade, probably in large part because people figured out on their own that biking is often the fastest option. Even with this growth, though, the data shows that a lot of people could still save precious time—and minimize their worse-case outcomes—if they switched from cars to bikes. To the extent the city can incentivize that, it strikes me as a good thing.
When L-mageddon comes, take a bike
For any readers who might be affected by the L train’s planned 2019 closure, if you only remember one thing from this post: Citi Bikes crush taxis when traveling from Williamsburg to just about anywhere in Manhattan during morning rush hour!
GitHub
The code for the taxi vs. Citi Bike analysis is available here as part of the nyc-taxi-data repo. Note that parts of the analysis also depend on loading the data from the nyc-citibike-data repo.
The data
Taxi trip data is available since January 2009, Citi Bike data since July 2013. I filtered each dataset to make the analysis closer to an apples-to-apples comparison—see the GitHub repo for a more complete description of the filtering—but in short:
Restrict both datasets to weekday trips only
Restrict Citi Bike dataset to subscribers only, i.e. no daily pass customers
Restrict taxi dataset to trips that started and ended in areas with Citi Bike stations, i.e. where taking a Citi Bike would have been a viable option
Starting in July 2016, perhaps owing to privacy concerns, the TLC stopped providing latitude and longitude coordinates for every taxi trip. Instead, the TLC now divides the city into 263 taxi zones (map), and provides the pickup and drop off zones for every trip. The analysis then makes the assumption that taxis and Citi Bikes have the same distribution of trips within a single zone, see GitHub for more.
80% of taxi trips start and end within zones that have Citi Bike stations, and the filtered dataset since July 2013 contains a total of 330 million taxi trips and 27 million Citi Bike trips. From July 1, 2016 to June 30, 2017—the most recent 12 month period of available data—the filtered dataset includes 68 million taxi trips and 9 million Citi Bike trips.
Methodology
I wrote a Monte Carlo simulation in R to calculate the probability that a Citi Bike would be faster than a taxi for a given route. Every trip is assigned to a bucket, where the buckets are picked so that trips within a single bucket are fairly comparable. The bucket definitions are flexible, and I ran many simulations with different bucket definitions, but one sensible choice might be to group trips by:
Starting zone
Ending zone
Hour of day
For example, weekday trips from the West Village to Times Square between 9:00 AM and 10:00 AM would constitute one bucket. The simulation iterates over every bucket that contains at least 5 taxi and 5 Citi Bike trips, and for each bucket, it draws 10,000 random samples, with replacement, for each of taxi and Citi Bike trips. The bucket’s estimated probability that a taxi is faster than a Citi Bike, call it the “taxi win rate”, is the fraction of samples where the taxi duration is shorter than the Citi Bike duration. You can think of this as 10,000 individual head-to-head races, with each race pitting a single taxi trip against a single Citi Bike trip.
Different bucketing and filtering schemes allow for different types of analysis. I ran simulations that bucketed by month to see how win rates have evolved over time, simulations that used only days where it rained, and others. There are undoubtedly more schemes to be considered, and the Monte Carlo methodology should be well equipped to handle them.
For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntatically
concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want to get introduced to the data.table package in a more task oriented format.
For this series of posts, I will be working with data that comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occured in the city of Chicago from 2001 to present. You can use the wget command in the terminal to export it as a csv file.
This file can be found in your working directory and was saved as “rows.csv”. We will import the data into R with the fread function and look at the first few rows and structure of the data.
dat = fread("rows.csv")
dat[1:3]
str(dat)
Notice that the each of the string variables in the data set was imported as a character and not a factor. With base R functions like read.csv, we have to set the stringsAsFactors argument to TRUE if we want this result.
names(dat)
Let's say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.
dat[, .N, by=.(Arrest)]
In the code below, you can also see how to chain operations togther. We started by finding the count of each response in the variable, ordered the count in descending order, and then selected only those which occured more than 200,000 times.
Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to modify the date values, filter instances in which the primary type was “prostitution”, and then get a count by each date.
I’ve obviously skipped over a lot and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful and it will pique your interest in the beauty of the data table package. Future posts will cover more of the goodies available in the data.table package such as get(), set(), {}, and so forth.
PS: As of this past week, I’ve decided to re-enter the job market. So if you are looking for a statistical analyst or data scientist with strong experience in programming in R and time series econometrics, please contact me at mathewanalytics@gmail.com or reach me through LinkedIn
To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics.
An interactive infographic can be used to communicate a lot of information in an engaging way. With the right tools, they are also relatively straightforward to create. In this post, I show step-by-step how to create this interactive infographic, using Canva, Displayr and R code. The interactive example is designed so that the user can change the country and have the infographic update automatically.
Tools used to create an interactive infographic: Canva is used to create the base infographic. The calculations, charting, and automatic text-writing are performed using the R language. It is all hooked up with Displayr.
Step 1: Create or download the infographic
I start by going to Canva, and choosing the Neat Interactive Gaming Infographic (tip: use Find Templates on the left-hand panel). You could, of course, design your own infographic, either in Canva or elsewhere. I like Canva, but the key thing is to create an infographic image some way. In Canva, I edited the template by deleting the bits that I wanted to replace with interactive charts and visualizations and then I download the infographic as a PNG file (2,000 pixels high by 800 wide).
Step 2: Import the infographic into Displayr
Create an account in Displayr, and then click the button that says + New Document. Set the page size to the same aspect ratio as the PNG file (Home > Page Layout > Layout > Page Size > Custom). For this example, the page should be 20 inches high and 8 inches wide.
Next, insert the infographic into Displayr (Insert >Image), move and resize it to fit the page (tip: you can use the Properties panel on the right to type in pixels 800 x 2000 to reset the correct aspect ratio of the image).
Step 3: Get the data into Displayr
The data that will make the infographic interactive needs to be hooked up in Displayr. The data used to create the infographic in this example is shown to the right. There are lots of ways to import data into Displayr (e.g., importing a raw data file and creating the tables in Displayr). For this example, the data has been pasted into Displayr from Excel using the steps below.
To paste the data into Displayr:
Insert > Paste Table (Data), click Add data (on the right of the screen).
Paste in the data. Alternatively, you could type it into this screen. I first just pasted in the Age Distribution data and press OK.
Properties > GENERALand type AgeDistribution into the Label field and check the Automatic option (above).
Drag the table so that it is to the left of the infographic.
Hide the table (select it, Appearance > Hide). It will stay visible but will be invisible when you share the infographic.
Repeat this process for AverageAge, Ratio, and Multiplayer. It is important that you give each of these tables these names, as we refer to them later in our R code.
Step 4: Add the country selector
Next, I add a control so that the user can change country:
Insert > Control
Item list: China, US, Europe
Move it to the top of the screen and style as desired (font size, color, border)
Name: Click on the control and select China
Properties > GENERAL > Name: Country
I then insert a text box (“GAMERS”), and placed it to the left of the control (i.e.: font: Impact, size: 48, color: #ffb600).
Step 5: Create the charts and visualizations in R
Finally, create the charts and visualizations in Displayr using the following R code.
The column chart
I created the column chart using my colleague Carmen’s nifty wrapper-function for plotly. Insert an R Output in Displayr (Insert > R Output), and copy and paste the following code, pressing Calculateand resizing moving and resizing the chart.
The average age was also created by inserting an R Output, using the code below. While I could have written the formatting in R, I instead used the various formatting tools built into Displayr (Properties>LAYOUT and Properties> APPEARANCE).
AverageAge[,Country]
The hearts pictograph
This was also done using an R Output in Displayr, with the following code (using R GitHub packages built by my colleagues Kyle and Carmen).
It’s been a big day for the team here at Microsoft, with a flurry of announcements from the Ignite conference in Orlando. We’ll provide more in-depth details in the coming days and weeks, but for now here’s a brief roundup of the news related to data science:
Microsoft ML Server 9.2 is now available. This is the new name for what used to be called Microsoft R Server, and now also includes support for operationalizing the Python language as well as R. This 2-minute video explains how to deploy models with ML Server (and with this release, real-time scoring is now also supported Linux as well). Microsoft R Client 3.4.1, featuring R 3.4.1, was also released today and provides desktop capabilities for R developers with the ability to deploy computations to a production ML Server.
The next generation of Azure Machine Learning is now available. This new service includes the AML Workbench, a cross-platform client for AI-powered data wrangling and experiment management; the AML Experimentation service to help data scientists increase their rate of experimentation with big data and GPUs, and the AML Model Management service to host, version, manage and monitor machine learning models. This TechCrunch article provides an overview.
Microsoft Cognitive Services offers new intelligent APIs for developing AI applications, including general availability of the text analytics API for topic extraction, language detection and sentiment analysis.
Finally, Microsoft announced a new high-level language for quantum computing to be integrated with Visual Studio, and a simulator for quantum computers up to 32 qubits. This 2-minute video provides an overview of Microsoft’s vision for Quantum Computing.
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA “one-hot encoding”). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.
By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.
So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.
For our example, we will implement level coders based on partial pooling, or hierarchical/multilevel models (Gelman and Hill, 2007). We’ll leave the details of how partial pooling works to a subsequent article; for now, just think of it as a score that shrinks the estimate of the conditional mean to be closer to the unconditioned mean, and hence possibly closer to the unknown true values, when there are too few measurements to make an accurate estimate.
We’ll implement our partial pooling encoders using the lmer() (multilevel linear regression) and glmer() (multilevel generalized linear regression) functions from the lme4 package. For our example data, we’ll use radon levels by county for the state of Minnesota (Gelman and Hill, 2007. You can find the original data here).
The Data: Radon levels in Minnesota
library("vtreat")
library("lme4")
library("dplyr")
library("tidyr")
library("ggplot2")
# example data
srrs =read.table("srrs2.dat", header=TRUE, sep=",", stringsAsFactor=FALSE)
# target: log of radon activity (activity)# grouping variable: county
radonMN =filter(srrs, state=="MN") %>%
select("county", "activity") %>%
filter(activity >0) %>%mutate(activity =log(activity),
county = base::trimws(county)) %>%
mutate(critical = activity>1.5)
str(radonMN)
For this example we have three columns of interest:
county: 85 possible values
activity: the log of the radon reading (numerical outcome)
critical: TRUE when activity > 1.5 (categorical outcome)
The goal is to level code county for either the regression problem (predict the log radon reading) or the categorization problem (predict whether the radon level is “critical”).
As the graph shows, the conditional mean of log radon activity by county ranges from nearly zero to about 3, and the conditional expectation of a critical reading ranges from zero to one. On the other hand, the number of readings per county is quite low for many counties — only one or two — though some counties have a large number of readings. That means some of the conditional expectations are quite uncertain.
Implementing Level Coders for Partial Pooling
Let’s implement level coders that use partial pooling to compute the level score.
Regression
For regression problems, the custom coder should be a function that takes as input:
v: a string with the name of the categorical variable
vcol: the actual categorical column (assumed character)
y: the numerical outcome column
weights: a column of row weights
The function should return a column of scores (the level codings). For our example, the function builds a lmer model to predict y as a function of vcol, then returns the predictions on the training data.
# @param v character variable name# @param vcol character, independent or input variable# @param y numeric, dependent or outcome variable to predict# @param weights row/example weights# @return scored training data column
ppCoderN function(v, vcol,
y,
weights) {
# regression case y ~ vcol
d data.frame(x = vcol,
y = y,
stringsAsFactors =FALSE)
m lmer(y ~(1 |x), data=d, weights=weights)
predict(m, newdata=d)
}
Categorization
For categorization problems, the function should assume that y is a logical column, where TRUE is assumed to be the target outcome. This is because vtreat converts the outcome column to a logical while creating the treatment plan.
# @param v character variable name# @param vcol character, independent or input variable# @param y logical, dependent or outcome variable to predict# @param weights row/example weights# @return scored training data column
ppCoderC function(v, vcol,
y,
weights) {
# classification case y ~ vcol
d data.frame(x = vcol,
y = y,
stringsAsFactors =FALSE)
m =glmer(y ~(1 |x), data=d, weights=weights, family=binomial)
predict(m, newdata=d, type='link')
}
You can then pass the functions in as a named list into either designTreatmentsX or mkCrossFrameXExperiment to build the treatment plan. The format of the key is [n|c].levelName[.option]*.
The prefacing picks the model type: numeric or regression starts with ‘n.’ and the categorical encoder starts with ‘c.’. Currently, the only supported option is ‘center,’ which directs vtreat to center the codes with respect to the estimated grand mean. ThecatN and catB level codings are centered in this way.
Our example coders can be passed in as shown below.
Let’s build a treatment plan for the regression problem.
# I only want to create the cleaned numeric variables, the isBAD variables,# and the level codings (not the indicator variables or catP, etc.)
vartypes_I_want =c('clean', 'isBAD', 'catN', 'poolN')
treatplanN =designTreatmentsN(radonMN,
varlist =c('county'),
outcomename ='activity',
codeRestriction = vartypes_I_want,
customCoders = customCoders,
verbose=FALSE)
scoreFrame =treatplanN$scoreFrame
scoreFrame %>%select(varName, sig, origName, code)
## varName sig origName code
## 1 county_poolN 1.343072e-16 county poolN
## 2 county_catN 2.050811e-16 county catN
Note that the treatment plan returned both the catN variable (default level encoding) and the pooled level encoding (poolN). You can restrict to just using one coding or the other using the codeRestriction argument either during treatment plan creation, or in prepare().
Let’s compare the two level encodings.
# create a frame with one row for every county,
measframe =data.frame(county =unique(radonMN$county),
stringsAsFactors=FALSE)
outframe =prepare(treatplanN, measframe)
# If we wanted only the new pooled level coding,# (plus any numeric/isBAD variables), we would# use a codeRestriction:## outframe = prepare(treatplanN, # measframe,# codeRestriction = c('clean', 'isBAD', 'poolN'))gather(outframe, key=scoreType, value=score,
county_poolN, county_catN) %>%
ggplot(aes(x=score)) +geom_density(adjust=0.5) +geom_rug(sides="b") +facet_wrap(~scoreType, ncol=1, scale="free_y") +ggtitle("Distribution of scores")
Notice that the poolN scores are “tucked in” compared to the catN encoding. In a later article, we’ll show that the counties with the most tucking in (or shrinkage) tend to be those with fewer measurements.
Notice that the poolC link scores are even more tucked in compared to the catB link scores, and that the catB scores are multimodal. The smaller link scores mean that the pooled model avoids estimates of conditional expectation close to either zero or one, because, again, these estimates come from counties with few readings. Multimodal summaries can be evidence of modeling flaws, including omitted variables and un-modeled mixing of different example classes. Hence, we do not want our inference procedure to suggest such structure until there is a lot of evidence for it. And, as is common in machine learning, there are advantages to lower-variance estimators when they do not cost much in terms of bias.
Other Considerations
For this example, we used the lme4 package to create custom level codings. Once calculated, vtreat stores the coding as a lookup table in the treatment plan. This means lme4 is not needed to prepare new data. In general, using a treatment plan is not dependent on any special packages that might have been used to create it, so it can be shared with other users with no extra dependencies.
When using mkCrossFrameXExperiment, note that the resulting cross frame will have a slightly different distribution of scores than what the treatment plan produces. This is true even for catB and catN variables. This is because the treatment plan is built using all the data, while the cross frame is built using n-fold cross validation on the data. See the cross frame vignette for more details.
Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding (and testing it for us!), introducing us to the references, and reviewing our articles.
In a follow-up article, we will go into partial pooling in more detail, and motivate why you might sometimes prefer it to vtreat‘s default coding.
References
Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
In Part 1 of this series, we got started by looking at the ts object in R and how it represents time series data. In Part 2, I’ll discuss some of the many time series transformation functions that are available in R. This is by no means an exhaustive catalog. If you feel I left out anything important, please let me know. I compile these posts as a guide in RMarkdown which I plan to make available on the web soon.
Often in time series analysis and modeling, we will want to transform data. There are a number of different functions that can be used to transform time series data such as the difference, log, moving average, percent change, lag, or cumulative sum. These type of function are useful for both visualizing time series data and for modeling time series. For example, the moving average function can be used to more easily visualize a high-variance time series and is also a critical part the ARIMA family of models. Functions such as the difference, percent change, and log difference are helpful for making non-stationary data stationary.
Let’s start with a very common operation. To take a lag we use the lag() function from the stats package. It takes a ts object and a value for the lag argument.
Notice that when we do this, tseries_lag1 is equivalent to tseries offset by 1, such that the value for 2000q1 in tseries become the value for 1999q4 in tseries_lag1 and so on. We can do this for any lag of our choosing. Let’s try 3 instead:
tseries_lag3 tseries tseries_lag3
1999 Q2 NA 7.07667
1999 Q3 NA 10.38876
1999 Q4 NA 23.91096
2000 Q1 7.07667 14.49356
2000 Q2 10.38876 15.90501
2000 Q3 23.91096 28.00545
We can also do this in the opposite direction using a negative value for the lag argument, effectively creating a lead series instead:
Notice that tseries_lead1 now leads the original tseries by one time period.
We can take differences of a time series as well. This is equivalent to taking the difference between each value and its previous value:
As you can see, taking a difference is an effective way to remove a trend and make a time series stationary.
The difference can be calculated over more than one lag. To do so, simply alter the value of the lag argument.
Some time series transformation functions are useful for series in which the variance gets larger over time. These range from the basic logarithm function to the Box-Cox group of transformations (of which the natural logarithm is a special case). We can take the log of a time series using the log function in the same way that we would take the log of a vector. Let’s generate a time series that has increasing variance:
trend
Gives this plot:
As we'll discuss in more detail, the prevalence of increasing, or more generally non-constant, variance is called heteroskedasticity and can cause problems in linear regression. Often, it will need to be corrected before modeling. One way to do this is taking a log:
tseries_log
Gives this plot:
There are other more advanced ways of eliminating non-constant variance, one of which is the Box-Cox transformation, which allows us a bit more control over the transformation. The Box-Cox takes the form (Hyndman and Athanasopoulos, 2013):
In order to demonstrate the Box-Cox transformation, I'll introduce Rob Hyndman's `forecast` package:
library(forecast)
It has two functions that are of use here. The primary function is BoxCox(), which will return a transformed time series given a time series and a value for the parameter lambda:
plot.ts(BoxCox(tseries_h, lambda = 0.5))
Gives this plot:
Notice that this value of lambda here does not entirely take care of the heteroskedasticity problem. We can experiment with different values of lambda, or we can use the BoxCox.lambda() function, which will provide us an optimal value for parameter lambda:
lambda [1] 0.05527724
plot.ts(BoxCox(tseries_h, lambda = lambda))
Gives this plot:
The BoxCox.lambda() function has chosen the value 0.055. If we then use this value in our BoxCox() function, it returns a time series that appears to have constant variance.
Another common calculation that we may want to perform on time series is the percent change from one period to another. Because it seems that R does not have a function for performing this calculation, we can do it ourselves.
We can turn this into a function so that we can easily calculate percent change in the future. We’ll add an argument to specify the number of periods over which we want to calculate the change and set it to 1 by default. Additionally, we’ll add some argument verification.
Now if we wished to calculate the percentage change over more than one period we can do so. Let’s say we wanted to calculate the year over year percent change. We can call our pch() function with a lag of 4.
tseries_pchya
Two of the functions that we have discussed so far, the difference and the log, are often combined in time series analysis. The log difference function is useful for making non-stationary data stationary and has some other useful properties. We can calculate the log difference in R by simply combining the log() and diff() functions.
tseries_dlog
Gives this plot:
Notice that after taking the log return, tseries appears to be stationary. We see some higher than normal volatility in the beginning of the series. This is due largely to the fact that the series levels start off so small. This leads into a nice property of the log return function, which is that it is a close approximation to the percent change:
plot.ts(cbind(pch(tseries), tseries_dlog))
Gives this plot:
This similarity is only approximate. The relationship does break down somewhat when the percent change from one period to the next is particularly large. You can read a good discussion of this topic here.
A moving average is another essential function for working with time series. For series with particularly high volatility, a moving average can help us to more clearly visualize its trend. Unfortunately, base R does not (to my knowledge) have a convenient function for calculating the moving average of a time series directly. However, we can use base R’s filter() function, which allows us to perform general linear filtering. We can set up the parameters of this function to be a moving average (Shumway and Stoffer, 2011). Here we apply the filter() function to tseries to create a 5 period moving average. The filter argument lets up specify the filter, which in this case is just a weighted average of 5 observations. The sides argument allows us to specify whether we want to apply the filter over past values (sides = 1), or to both past and future values (sides = 2). Here I set sides to 1 indicating that I want a trailing moving average:
tseries_lf5
Gives this plot:
The fact that we have to define the linear filter each time we use this function makes it a little cumbersome to use. If we don't mind introducing a dependency to our code, we could use the SMA() function from the TTR package or the ma() function from the forecast package. The SMA() function takes a ts object and a value for n – the window over which we want to calculate the moving average.
library(TTR)
tseries_ma5
The ma() function from the forecast package also performs moving average calculations. We supply the time series and a value for the order argument.
tseries_ma5fore
Let's compare the results. The SMA() function returns a trailing moving average where each value is the mean of the n most recent trailing values. This is equivalent to the results we get from using the filter() function. The ma() function from the forecast package returns a centered moving average. In this case tseries_ma5for is equal to the average of the current observation, the previous two observation, and the next two observations. Which one you use would depend on your application.
tseries tseries_lf5 tseries_ma5 tseries_ma5fore
2000 Q1 7.076670 NA NA NA
2000 Q2 10.388758 NA NA NA
2000 Q3 23.910958 NA NA 14.35499
2000 Q4 14.493559 NA NA 18.54075
2001 Q1 15.905014 14.35499 14.35499 20.50828
2001 Q2 28.005455 18.54075 18.54075 17.55500
2001 Q3 20.226413 20.50828 20.50828 17.49470
2001 Q4 9.144571 17.55500 17.55500 17.68977
2002 Q1 14.192030 17.49470 17.49470 18.00239
2002 Q2 16.880366 17.68977 17.68977 18.86085
That’s it for Part 2. I don’t know about you, but I am sick of working with generated data. I want to work with some real world data. In Part 3 we’ll do just that using Quandl. See you then.
References
Hyndman, Rob J. and George Athanasopoulos (2013) Forecasting: Principles and Practice. https://www.otexts.org/fpp
Shumway, Robert H. and David S. Stoffer (2011) Time Series Analysis and Its Applications With R Examples. New York, NY: Springer.
spiderbar, spiderbar
Reads robots rules from afar.
Crawls the web, any size;
Fetches with respect, never lies.
Look Out!
Here comes the spiderbar.
Is it fast?
Listen bud,
It's got C++ under the hood.
Can you scrape, from a site?
Test with can_fetch(), TRUE == alright
Hey, there
There goes the spiderbar.
(Check the end of the post if you don’t recognize the lyrical riff.)
Face front, true believer!
I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.
But (there’s always a “but“, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my “crawl delay” forthcoming post).
I needed something faster for my bulk Crawl-Delay analysis which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package spiderbar which is now hitting a CRAN mirror near you, soon. (CRAN — rightly so — did not like the unoriginal name rep).
How much faster?
I’m glad you asked!
Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?
library(spiderbar)
library(robotstxt)
library(microbenchmark)
library(tidyverse)
library(hrbrthemes)
rob mb1
update_geom_defaults("violin", list(colour = "#4575b4", fill="#abd9e9"))
autoplot(mb1) +
scale_y_comma(name="nanoseconds", trans="log10") +
labs(title="Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
theme_ipsum_rc(grid="Xx")
As you can see, it’s just a tad bit faster .
Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.
But, I don’t care about Crawl-Delay!
OK, fine. Do you care about fetchability? We can speed that up, too!
rob_txt mb2
autoplot(mb2) +
scale_y_comma(name="nanoseconds", trans="log10") +
labs(title="Microbenchmark results for testing resource 'fetchability'",
subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
theme_ipsum_rc(grid="Xx")
Vectorized or it didn’t happen.
(Gosh, even Spider-Man got more respect!)
OK, this is a tough crowd, but we’ve got vectorization covered as well:
Peter’s package does more than this one since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.
You can check out the code and leave package questions or comments on GitHub.
(Hrm…Peter Parker was Spider-Man and Peter Meissner wrote robotstxt which is all about spiders. Coincidence?! I think not!)
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
(This article was first published on R Views, and kindly contributed to R-bloggers)
With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, survival analysis is one of the oldest subfields of Statistics [1]. Basic life-table methods, including techniques for dealing with censored data, were discovered before 1700 [2], and in the early eighteenth century, the old masters – de Moivre working on annuities, and Daniel Bernoulli studying competing risks for the analysis of smallpox inoculation – developed the modern foundations of the field [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many more application areas. So, it is not surprising that R should be rich in survival analysis functions. CRAN’s Survival Analysis Task View, a curated list of the best relevant R survival analysis packages and functions, is indeed formidable. We all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.
Looking at the Task View on a small screen, however, is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. Newcomers – people either new to R or new to survival analysis or both – must find it overwhelming. So, it is with newcomers in mind that I offer the following narrow trajectory through the task view that relies on just a few packages: survival, ggplot2, ggfortify, and ranger
The survival package is the cornerstone of the entire R survival analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986. The first public release, in late 1989, used the Statlib service hosted by Carnegie Mellon University. Thereafter, the package was incorporated directly into Splus, and subsequently into R.
ggfortify enables producing handsome, one-line survival plots with ggplot2::autoplot.
ranger might be the surprise in my very short list of survival packages. The ranger() function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But ranger() also works with survival data. Benchmarks indicate that ranger() is suitable for building time-to-event models with the large, high-dimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machine-learning age.
Load the data
This first block of code loads the required packages, along with the veteran dataset from the survival package that contains data from a two-treatment, randomized trial for lung cancer.
The variables in veteran are: * trt: 1=standard 2=test * celltype: 1=squamous, 2=small cell, 3=adeno, 4=large * time: survival time in days * status: censoring status * karno: Karnofsky performance score (100=good) * diagtime: months from diagnosis to randomization * age: in years * prior: prior therapy 0=no, 10=yes
Kaplan Meier Analysis
The first thing to do is to use Surv() to build the standard survival object. The variable time records survival time; status indicates whether the patient’s death was observed (status = 1) or that survival time was censored (status = 0). Note that a “+” after the time in the print out of km indicates censoring.
To begin our analysis, we use the formula Surv(futime, status) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time. The times parameter of the summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 30, 60 and 90 days, and then every 90 days thereafter. This is the simplest possible model. It only takes three lines of R code to fit it, and produce numerical and graphical summaries.
#plot(km_fit, xlab="Days", main = 'Kaplan Meyer Plot') #base graphics is always ready
autoplot(km_fit)
Next, we look at survival curves by treatment.
km_trt_fit
And, to show one more small exploratory plot, I’ll do just a little data munging to look at survival by age. First, I create a new data frame with a categorical variable AG that has values LT60 and GT60, which respectively describe veterans younger and older than sixty. While I am at it, I make trt and prior into factor variables. But note, survfit() and npsurv() worked just fine without this refinement.
vet
Although the two curves appear to overlap in the first fifty days, younger patients clearly have a better chance of surviving more than a year.
Note that the model flags small cell type, adeno cell type and karno as significant. However, some caution needs to be exercised in interpreting these results. While the Cox Proportional Hazard’s model is thought to be “robust”, a careful analysis would check the assumptions underlying the model. For example, the Cox model assumes that the covariates do not vary with time. In a vignette [12] that accompanies the survival package Therneau, Crowson and Atkinson demonstrate that the Karnofsky score (karno) is, in fact, time-dependent so the assumptions for the Cox model are not met. The vignette authors go on to present a strategy for dealing with time dependent covariates.
Data scientists who are accustomed to computing ROC curves to assess model performance should be interested in the Concordance statistic. The documentation for the survConcordance() function in the survival package defines concordance as “the probability of agreement for any two randomly chosen observations, where in this case agreement means that the observation with the shorter survival time of the two also has the larger risk score. The predictor (or risk score) will often be the result of a Cox model or other regression” and notes that: “For continuous covariates concordance is equivalent to Kendall’s tau, and for logistic regression is is equivalent to the area under the ROC curve.”
To demonstrate using the survival package, along with ggplot2 and ggfortify, I’ll fit Aalen’s additive regression model for censored data to the veteran data. The documentation states: “The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a time-dependent intercept term, X is the vector of covariates for the subject (possibly time-dependent), and B(t) is a time-dependent matrix of coefficients.”
The plots show how the effects of the covariates change over time. Notice the steep slope and then abrupt change in slope of karno.
aa_fit
## Call:
## aareg(formula = Surv(time, status) ~ trt + celltype + karno +
## diagtime + age + prior, data = vet)
##
## n= 137
## 75 out of 97 unique event times used
##
## slope coef se(coef) z p
## Intercept 0.083400 3.81e-02 1.09e-02 3.490 4.79e-04
## trttest 0.006730 2.49e-03 2.58e-03 0.967 3.34e-01
## celltypesmallcell 0.015000 7.30e-03 3.38e-03 2.160 3.09e-02
## celltypeadeno 0.018400 1.03e-02 4.20e-03 2.450 1.42e-02
## celltypelarge -0.001090 -6.21e-04 2.71e-03 -0.229 8.19e-01
## karno -0.001180 -4.37e-04 8.77e-05 -4.980 6.28e-07
## diagtime -0.000243 -4.92e-05 1.64e-04 -0.300 7.65e-01
## age -0.000246 -6.27e-05 1.28e-04 -0.491 6.23e-01
## priorYes 0.003300 1.54e-03 2.86e-03 0.539 5.90e-01
##
## Chisq=41.62 on 8 df, p=1.6e-06; test weights=aalen
#summary(aa_fit) # provides a more complete summary of results
autoplot(aa_fit)
Random Forests Model
As a final example of what some might perceive as a data-science-like way to do time-to-event modeling, I’ll use the ranger() function to fit a Random Forests Ensemble model to the data. Note however, that there is nothing new about building tree models of survival data. Terry Therneau also wrote the rpart package, R’s basic tree-modeling package, along with Brian Ripley. See section 8.4 for the rpart vignette [14] that contains a survival analysis example.
ranger() builds a model for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots twenty random curves, along with a curve that represents the global average for all of the patients. Note that I am using plain old base R graphics here.
# ranger model
r_fit
The next block of code illustrates how ranger() ranks variable importance.
vi
## importance
## karno 0.0734
## celltype 0.0338
## diagtime 0.0003
## trt -0.0007
## prior -0.0011
## age -0.0027
Notice that ranger() flags karno and celltype as the two most important; the same variables with the smallest p-values in the Cox model. Also note that the importance results just give variable names and not level names. This is because ranger and other tree models do not usually create dummy variables.
But ranger() does compute Harrell’s c-index (See [8] p. 370 for the definition), which is similar to the Concordance statistic described above. This is a generalization of the ROC curve, which reduces to the Wilcoxon-Mann-Whitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.
An ROC value of .68 would normally be pretty good for a first try. But note that the ranger model doesn’t do anything to address the time varying coefficients. This apparently is a challenge. In a 2011 paper [16], Hamad observes:
However, in the context of survival trees, a further difficulty arises when time–varying effects are included. Hence, we feel that the interpretation of covariate effects with tree ensembles in general is still mainly unsolved and should attract future research.
I believe that the major use for tree-based models for survival data will be to deal with very large data sets.
Finally, to provide an “eyeball comparison” of the three survival curves, I’ll plot them on the same graph.The following code pulls out the survival data from the three model objects and puts them into a data frame for ggplot().
# Set up for ggplot
kmi
For this data set, I would put my money on a carefully constructed Cox model that takes into account the time varying coefficients. I suspect that there are neither enough observations nor enough explanatory variables for the ranger() model to do better.
This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The ranger package, which suggests the survival package, and ggfortify, which depends on ggplot2 and also suggests the survival package, illustrate how open-source code allows developers to build on the work of their predecessors. The documentation that accompanies the survival package, the numerous online resources, and the statistics such as concordance and Harrell’s c-index packed into the objects produced by fitting the models gives some idea of the statistical depth that underlies almost everything R.
Some Tutorials and Papers
For a very nice, basic tutorial on survival analysis, have a look at the Survival Analysis in R [5] and the OIsurv package produced by the folks at OpenIntro.
Look here for an exposition of the Cox Proportional Hazard’s Model, and here [11] for an introduction to Aalen’s Additive Regression Model.
For an elementary treatment of evaluating the proportional hazards assumption that uses the veterans data set, see the text by Kleinbaum and Klein [13].
For an exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video.
See the 1995 paper [15] by Intrator and Kooperberg for an early review of using classification and regression trees to study survival data.
References
For convenience, I have collected the references used throughout the post here.
Authors’s note: this post was originally published on April 26, 2017 but was subsequently withdrawn because of an error spotted by Dr. Terry Therneau. He observed that the Cox Portional Hazards Model fitted in that post did not properly account for the time varying covariates. This revised post makes use of a different data set, and points to resources for addressing time varying covariates. Many thanks to Dr. Therneau. Any errors that remain are mine.
To leave a comment for the author, please follow the link and comment on their blog: R Views.
Recent Comments