13 new R Jobs for R users (2017-09-26) – from all over the world

By Tal Galili

r_jobs

To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

  1. Full-Time
    Data Scientist
    Edelweiss Business Services – Posted by Aakash Gupta
    Mumbai
    Maharashtra, India
    20 Sep 2017
  2. Full-Time
    Solutions Engineer for RStudio @ U.S.A. (or anywhere)
    RStudio – Posted by nwstephens
    Anywhere
    8 Sep 2017

More New Jobs

  1. Freelance
    R Programmer & Statistician Empiricus – Posted by empiricus
    Anywhere
    25 Sep 2017
  2. Full-Time
    Data Scientist Edelweiss Business Services – Posted by Aakash Gupta
    Mumbai Maharashtra, India
    20 Sep 2017
  3. Full-Time
    Postdoctoral Data Scientist: GIS and mHealth Harvard T.H. Chan School of Public Health – Posted by Christine Choirat
    Boston Massachusetts, United States
    14 Sep 2017
  4. Full-Time
    Data Scientist @ London Hiscox – Posted by alan.south
    London England, United Kingdom
    13 Sep 2017
  5. Freelance
    Crafty developer: R->C++ SherandoResearch – Posted by feersum_monkey
    Anywhere
    8 Sep 2017
  6. Full-Time
    Solutions Engineer for RStudio @ U.S.A. (or anywhere) RStudio – Posted by nwstephens
    Anywhere
    8 Sep 2017
  7. Full-Time
    Senior Data Scientist @ Boston, Massachusetts, U.S. Colaberry Data Analytics – Posted by Colaberry_DataAnalytics
    Boston Massachusetts, United States
    8 Sep 2017
  8. Part-Time
    Cran-R Programmierer Praktikum @ Zürich, Switzerland Fisch Asset Management AG – Posted by HR_Fisch
    Zürich Zürich, Switzerland
    4 Sep 2017
  9. Freelance
    Data analyst at talmix @ London, UK Talmix – Posted by manikauluck
    London England, United Kingdom
    4 Sep 2017
  10. Full-Time
    R Developer and Open Source Community Manager at UCLA @ Los Angeles, California UCLA – Posted by graemeblair
    Los Angeles California, United States
    4 Sep 2017
  11. Full-Time
    Work in Bodrum: Full-stack web developer with R experience @ Muğla, Turkey PranaGEO Ltd. – Posted by acizmeli
    Muğla Turkey
    4 Sep 2017
  12. Full-Time
    Digital Performance & Data Science Section Manager @ Paris, France Nissan Automotive Europe – Posted by Estefania Rendon
    Paris Île-de-France, France
    31 Aug 2017
  13. Freelance
    Quantitative Analyst (Credit) @ Zürich, Switzerland ipub – Posted by gluc
    Zürich Zürich, Switzerland
    29 Aug 2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(You may also look at previous R jobs posts.)

Source:: R News

R live class | R for Data Science | Oct 4-5 Milan

By Quantide

R for Data Science course poster

(This article was first published on R blog | Quantide – R training & consulting, and kindly contributed to R-bloggers)

R for Data Science is our first course of the autumn term. It takes place on October 4-5 in a location close to Milano Lima.

If you want to deepen your data analysis knowledge using the most modern R tools, or you want to figure out whether R is the right solution for you, this is the class for you. This course is for people who want to learn R and get an overview of its capabilities for data science, as well as for those who have very little knowledge of R.

R for Data Science Outline

– A bit of R history and online resources
– R and R-Studio installation and configuration
– Your first R session
– Your first R markdown document
– R objects: data and functions
– Data import from external sources: excel and database connection
– Data manipulation with tidyverse
– Tidy data with tidyr
– Data visualization using ggplot
– An overview of statistical models and data mining with R

The course is for max 6 attendees.

R for Data Science is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

Location

The course location is 550 m (a 7-minute walk) from Milano Centrale station and just 77 m (a 1-minute walk) from the Lima subway station.

Registration

If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

You can find an overview of all our courses here. Next dates will be:

  • October 11-12: Statistics for Data Science. Develop a wide variety of statistical models with R, from the simplest Linear Regression to the most sophisticated GLM models. Reserve now!
  • October 25-26: Machine Learning with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now!
  • November 7-8: Data Visualization and Dashboard with R. Show the story behind your data: create beautiful effective visualizations and interactive Shiny dashboards. Reserve now!
  • November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
  • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can put together a tailor-made course, picking the topics that are of interest to your organization and dropping the rest.

The post R live class | R for Data Science | Oct 4-5 Milan appeared first on Quantide – R training & consulting.

To leave a comment for the author, please follow the link and comment on their blog: R blog | Quantide – R training & consulting.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

When are Citi Bikes Faster than Taxis in New York City?

By Todd Schneider

midtown east

(This article was first published on Category: R | Todd W. Schneider, and kindly contributed to R-bloggers)

Every day in New York City, millions of commuters take part in a giant race to determine transportation supremacy. Cars, bikes, subways, buses, ferries, and more all compete against one another, but we never get much explicit feedback as to who “wins.” I’ve previously written about NYC’s public taxi data and Citi Bike share data, and it occurred to me that these datasets can help identify who’s fastest, at least between cars and bikes. In fact, I’ve built an interactive guide that shows when a Citi Bike is faster than a taxi, depending on the route and the time of day.

The methodology and findings will be explained more below, and all code used in this post is available open-source on GitHub.

Interactive guide to when taxis are faster or slower than Citi Bikes

Pick a starting neighborhood and a time. The map shows whether you’d expect to get to each neighborhood faster with a taxi (yellow) or a Citi Bike (dark blue).

Interactive map: taxi vs. Citi Bike travel times to other neighborhoods (example shown: from Midtown East, weekdays 8:00 AM–11:00 AM). Data via NYC TLC and Citi Bike, based on trips 7/1/2016–6/30/2017. Turn on javascript (or click through from RSS) to use the interactive version at toddwschneider.com, where you can pick a starting neighborhood and weekday time window, click the map to change neighborhoods, and hover over a neighborhood (tap on mobile) to view travel time stats.

40% of weekday taxi trips—over 50% during peak hours—would be faster as Citi Bike rides

I estimate that 40% of weekday taxi trips within the Citi Bike service area would expect to be faster if switched to a Citi Bike, based on data from July 2016 to June 2017. During peak midday hours, more than 50% of taxi trips would expect to be faster as Citi Bike rides.

hourly win rate

There are some significant caveats to this estimate. In particular, if many taxi riders simultaneously switched to Citi Bikes, the bike share system would probably hit severe capacity constraints, making it difficult to find available bikes and docks. Increased bike usage might eventually lead to fewer vehicles on the road, which could ease vehicle congestion, and potentially increase bike lane congestion. It’s important to acknowledge that when I say “40% of taxi trips would be faster if they switched to Citi Bikes”, we’re roughly considering the decision of a single able-bodied person, under the assumption that everyone else’s behavior will remain unchanged.

Heading crosstown in Manhattan? Seriously consider taking a bike instead of a car!

Crosstown Manhattan trips are generally regarded as more difficult than their north-south counterparts. There are fewer subways that run crosstown, and if you take a car, the narrower east-west streets often feel more congested than the broad north-south avenues with their synchronized traffic lights. Crosstown buses are so notoriously slow that they’ve been known to lose races against tricycles.

crosstown and uptown zones

I divided Manhattan into the crosstown zones pictured above, then calculated the taxi vs. Citi Bike win rate for trips that started and ended within each zone. Taxis fare especially badly in the Manhattan central business district. If you take a midday taxi that both starts and ends between 42nd and 59th streets, there’s over a 70% chance that the trip would have been faster as a Citi Bike ride.

42nd-59th

Keep in mind that’s for all trips between 42nd and 59th streets. For some of the longest crosstown routes, for example, from the United Nations on the far east side to Hell’s Kitchen on the west, Citi Bikes beat taxis 90% of the time during the day. It’s worth noting that taxis made 8 times as many trips as Citi Bikes between 42nd and 59th streets from July 2016 to June 2017—almost certainly there would be less total time spent in transit if some of those taxi riders took bikes instead.

Hourly graphs for all of the crosstown zones are available here, and here’s a summary table for weekday trips between 8:00 AM and 7:00 PM:

Manhattan crosstown zone % taxis lose to Citi Bikes
96th–110th 41%
77th–96th 36%
59th–77th 54%
42nd–59th 69%
14th–42nd 64%
Houston–14th 54%
Canal–Houston 60%
Below Canal 55%

A reminder that this analysis restricts to trips that start and end within the same zone, so for example a trip from 23rd St to 57th St would be excluded because it starts and ends in different zones.

Taxis fare better for trips that stay on the east or west sides of Manhattan: 35% of daytime taxi trips that start and end west of 8th Avenue would expect to be faster as Citi Bike trips, along with 38% of taxi trips that start and end east of 3rd Avenue. Taxis also generally beat Citi Bikes on longer trips:

taxi vs. citi bike by trip length

Taxis are losing more to Citi Bikes over time

When the Citi Bike program began in July 2013, less than half of weekday daytime taxi trips would have been faster if switched to Citi Bikes. I ran a month-by-month analysis to see how the taxi vs. Citi Bike calculus has changed over time, and discovered that taxis are getting increasingly slower compared to Citi Bikes:

taxi vs. Citi Bike by month

Note that this month-by-month analysis restricts to the original Citi Bike service area, before the program expanded in August 2015. The initial expansion was largely into Upper Manhattan and the outer boroughs, where taxis generally fare better than bikes, and so to keep things consistent, I restricted the above graph to areas that have had Citi Bikes since 2013.

Taxis are losing more to Citi Bikes over time because taxi travel times have gotten slower, while Citi Bike travel times have remained roughly unchanged. I ran a pair of linear regressions to model travel times as a function of:

  • trip distance
  • time of day
  • precipitation
  • whether the route crosses between Manhattan and the outer boroughs
  • month of year
  • year

The regression code and output are available on GitHub: taxi, Citi Bike
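To give a flavor of the specification without opening the repo, a model along these lines might look roughly like the sketch below (the column names and the trips data frame are illustrative placeholders, and the log of the duration is used so that year effects read as percentage changes; the actual code is in the links above):

# rough sketch of a travel-time regression on the factors listed above
# (`trips` and all of its column names are illustrative placeholders)
fit_taxi <- lm(log(trip_duration) ~ log(trip_distance) + factor(pickup_hour) +
                 precipitation + crosses_boroughs + factor(month) + factor(year),
               data = trips)

# exponentiating a year coefficient gives the multiplicative slowdown vs. the base year,
# e.g. exp(coef(fit_taxi)["factor(year)2017"])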

As usual, I make no claim that this is a perfect model, but it does account for the basics, and if we look at the coefficients by year, it shows that, holding the other variables constant, a taxi trip in 2017 took 17% longer than the same trip in 2009. For example, a weekday morning trip from Midtown East to Union Square that took 10 minutes in 2009 would average 11:45 in 2017.

taxi multipliers

The same regression applied to Citi Bikes shows no such slowdown over time; in fact, Citi Bikes got slightly faster. The regressions also show that:

  1. Citi Bike travel times are less sensitive to time of day than taxi travel times. A peak midday taxi trip averages 40% longer than the same trip at off-peak hours, while a peak Citi Bike trip averages 15% longer than during off-peak hours.
  2. Rainy days are associated with 2% faster Citi Bike travel times and 1% slower taxi travel times.
  3. For taxis, fall months have the slowest travel times, but for Citi Bikes, summer has the slowest travel times. For both, January has the fastest travel times.

Taxis are more prone to very bad days

It’s one thing to say that 50% of midday taxi trips would be faster as Citi Bike rides, but how much does that vary from day to day? You could imagine there are some days with severe road closures, where more nimble bikes have an advantage getting around traffic, or other days in the dead of summer, when taxis might take advantage of the less crowded roads.

I ran a more granular analysis to measure win/loss rates for individual dates. Here’s a histogram of the taxi loss rate—the % of taxi trips we’d expect to be faster if switched to Citi Bikes—for weekday afternoon trips from July 2016 to June 2017:

histogram

Many days see a taxi loss rate of just over 50%, but there are tails on both ends, indicating that some days tilt in favor of either taxis or Citi Bikes. I was curious if we could learn anything from the outliers on each end, so I looked at individual dates to see if there were any obvious patterns.

The dates when taxis were the fastest compared to Citi Bikes look like dates that probably had less traffic than usual. The afternoon with the highest taxi win rate was Monday, October 3, 2016, which was the Jewish holiday of Rosh Hashanah, when many New Yorkers would have been home from work or school. The next 3 best days for taxis were all Mondays in August, when I’d imagine a lot of people were gone from the city on vacation.

The top 4 dates where Citi Bikes did best against taxis were all rainy days in the fall of 2016. I don’t know why rainy days make bikes faster relative to taxis, maybe rain causes traffic on the roads that disproportionately affects cars, but it’s also possible that there’s a selection bias. I’ve written previously about how the weather predicts Citi Bike ridership, and not surprisingly there are fewer riders when it rains. Maybe the folks inclined to ride bikes when it’s raining are more confident cyclists, who also pedal faster when the weather is nice. It’s also possible that rainy-day cyclists are particularly motivated to pedal faster so they can get out of the rain. I don’t know if these are really the causes, but they at least sound believable, and would explain the observed phenomenon.

When the President is in town, take a bike

June 8, 2016 was a particularly good day for Citi Bikes compared to taxis. President Obama came to town that afternoon, and brought the requisite street closures with him. I poked around a bit looking for the routes that appeared to be the most impacted by the President’s visit, and came to afternoon trips from Union Square to Murray Hill. On a typical weekday afternoon, taxis beat Citi Bikes 57% of the time from Union Square to Murray Hill, but on June 8, Citi Bikes won 90% of the time. An even more dramatic way to see the Obama effect is to look at daily median travel times:

Obama effect

A typical afternoon taxi takes 8 minutes, but on June 8, the median was over 21 minutes. The Citi Bike median travel time is almost always 9 minutes, including during President Obama’s visit.

The same graph shows a similar phenomenon on September 19, 2016, when the annual United Nations General Assembly shut down large swathes of Manhattan’s east side, including Murray Hill. Although the impact was not as severe as during President Obama’s visit, the taxi median time doubled on September 19, while the Citi Bike median time again remained unchanged.

The morning of June 15, 2016 offers another example, this time on the west side, when an overturned tractor trailer shut down the Lincoln Tunnel for nearly seven hours. Taxi trips from the Upper West Side to West Chelsea, which normally take 15 minutes, took over 35 minutes. Citi Bikes typically take 18 minutes along the same route, and June 15 was no exception. Taxis would normally expect to beat Citi Bikes 67% of the time on a weekday morning, but on June 15, Citi Bikes won over 92% of the time.

Lincoln Tunnel

These are of course three hand-picked outliers, and it wouldn’t be entirely fair to extrapolate from them to say that Citi Bikes are always more resilient than taxis during extreme circumstances. The broader data shows, though, that taxis are more than twice as likely as Citi Bikes to have days when a route’s median time is at least 5 minutes slower than average, and more than 3.5 times as likely to be at least 10 minutes slower, so it really does seem that Citi Bikes are better at minimizing worst-case outcomes.

Why have taxis gotten slower since 2009?

The biggest slowdowns in taxi travel times happened in 2014 and 2015. The data and regression model have nothing to say about why taxis slowed down so much over that period, though it might be interesting to dig deeper into the data to see if there are specific regions where taxis have fared better or worse since 2009.

Uber usage took off in New York starting in 2014, reaching over 10,000 vehicles dispatched per week by the beginning of 2015. There are certainly people who blame Uber—and other ride-hailing apps like Lyft and Juno—for increasing traffic, but the city’s own 2016 traffic report did not blame Uber for increased congestion.

It’s undoubtedly very hard to do an accurate study measuring ride-hailing’s impact on traffic, and I’m especially wary of people on both sides who have strong interests in blaming or exonerating the ride-hailing companies. Nevertheless, if I had to guess the biggest reasons taxis got particularly slower in 2014 and 2015, I would start with the explosive growth of ride-hailing apps, since the timing looks to align, and the publicly available data shows that they account for tens of thousands of vehicles on the roads.

On the other hand, if ride-hailing were the biggest cause of increased congestion in 2014 and 2015, it doesn’t exactly make sense that taxi travel times have stabilized a bit in 2016 and 2017, because ride-hailing has continued to grow, and while taxi usage continues to shrink, the respective rates of growth and shrinkage are not very different in 2016–17 than they were in 2014–15. One explanation could be that starting in 2016 there was a reduction in other types of vehicles—traditional black cars, private vehicles, etc.—to offset ride-hailing growth, but I have not seen any data to support (or refute) that idea.

There are also those who blame bike lanes for worsening vehicle traffic. Again, different people have strong interests arguing both sides, but it seems like there are more data points arguing that bike lanes do not cause traffic (e.g. here, here, and here) than vice versa. I wasn’t able to find anything about the timing of NYC bike lane construction to see how closely it aligns with the 2014–15 taxi slowdown.

Lots of other factors could have contributed to worsening traffic: commuter-adjusted population growth, subway usage, decaying infrastructure, construction, and presidential residences are just a few that feel like they could be relevant. I don’t know the best way to account for all of them, but it does seem like if you want to get somewhere in New York quickly, it’s increasingly less likely that a car is your best option.

How representative are taxis and Citi Bikes of all cars and bikes?

I think it’s not a terrible assumption that taxis are representative of typical car traffic in New York. If anything, maybe taxis are faster than average cars since taxi drivers are likely more experienced—and often aggressive—than average drivers. On the other hand, taxi drivers seem anecdotally less likely to use a traffic-enabled GPS, which maybe hurts their travel times.

Citi Bikes are probably slower than privately-owned bikes. Citi Bikes are designed to be heavy and stable, which maybe makes them safer, but lowers their speeds. Plus, I’d guess that biking enthusiasts, who might be faster riders, are more likely to ride their own higher-performance bikes. Lastly, Citi Bike riders might have to spend extra time at the end of a trip looking for an available dock, whereas privately-owned bikes have more parking options.

Weighing up these factors, I would guess that if we somehow got the relevant data to analyze the broader question of all cars vs. all bikes, the results would tip a bit in favor of bikes compared to the results of the narrower taxi vs. Citi Bike analysis. It’s also worth noting that both taxis and Citi Bikes have additional time costs that aren’t accounted for in trip durations: you have to hail a taxi, and there might not be a Citi Bike station in the near vicinity of your origin or destination.

What are the implications of all this?

One thing to keep in mind is that even though the taxi and Citi Bike datasets are the most conveniently available for analysis, New Yorkers don’t limit their choices to cars and bikes. The subway, despite its poor reputation of late, carries millions of people every day, more than taxis, ride-hailing apps, and Citi Bikes combined, so it’s not like “car vs. bike” is always the most relevant question. There are also legitimate reasons to choose a car over a bike—or vice versa—that don’t depend strictly on expected travel time.

Bike usage in New York has increased dramatically over the past decade, probably in large part because people figured out on their own that biking is often the fastest option. Even with this growth, though, the data shows that a lot of people could still save precious time—and minimize their worst-case outcomes—if they switched from cars to bikes. To the extent the city can incentivize that, it strikes me as a good thing.

When L-mageddon comes, take a bike

For any readers who might be affected by the L train’s planned 2019 closure, if you only remember one thing from this post: Citi Bikes crush taxis when traveling from Williamsburg to just about anywhere in Manhattan during morning rush hour!

Williamsburg

GitHub

The code for the taxi vs. Citi Bike analysis is available here as part of the nyc-taxi-data repo. Note that parts of the analysis also depend on loading the data from the nyc-citibike-data repo.

The data

Taxi trip data is available since January 2009, Citi Bike data since July 2013. I filtered each dataset to make the analysis closer to an apples-to-apples comparison—see the GitHub repo for a more complete description of the filtering—but in short:

  • Restrict both datasets to weekday trips only
  • Restrict Citi Bike dataset to subscribers only, i.e. no daily pass customers
  • Restrict taxi dataset to trips that started and ended in areas with Citi Bike stations, i.e. where taking a Citi Bike would have been a viable option

Starting in July 2016, perhaps owing to privacy concerns, the TLC stopped providing latitude and longitude coordinates for every taxi trip. Instead, the TLC now divides the city into 263 taxi zones (map), and provides the pickup and drop off zones for every trip. The analysis then makes the assumption that taxis and Citi Bikes have the same distribution of trips within a single zone, see GitHub for more.

80% of taxi trips start and end within zones that have Citi Bike stations, and the filtered dataset since July 2013 contains a total of 330 million taxi trips and 27 million Citi Bike trips. From July 1, 2016 to June 30, 2017—the most recent 12 month period of available data—the filtered dataset includes 68 million taxi trips and 9 million Citi Bike trips.

Methodology

I wrote a Monte Carlo simulation in R to calculate the probability that a Citi Bike would be faster than a taxi for a given route. Every trip is assigned to a bucket, where the buckets are picked so that trips within a single bucket are fairly comparable. The bucket definitions are flexible, and I ran many simulations with different bucket definitions, but one sensible choice might be to group trips by:

  1. Starting zone
  2. Ending zone
  3. Hour of day

For example, weekday trips from the West Village to Times Square between 9:00 AM and 10:00 AM would constitute one bucket. The simulation iterates over every bucket that contains at least 5 taxi and 5 Citi Bike trips, and for each bucket, it draws 10,000 random samples, with replacement, for each of taxi and Citi Bike trips. The bucket’s estimated probability that a taxi is faster than a Citi Bike, call it the “taxi win rate”, is the fraction of samples where the taxi duration is shorter than the Citi Bike duration. You can think of this as 10,000 individual head-to-head races, with each race pitting a single taxi trip against a single Citi Bike trip.
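The per-bucket calculation is only a few lines of R. Here is a minimal sketch, assuming taxi_durations and citibike_durations hold the trip durations for a single bucket (the full simulation in the repo loops this over every qualifying bucket):

# estimate one bucket's "taxi win rate" via head-to-head resampling
taxi_win_rate <- function(taxi_durations, citibike_durations, n_samples = 10000) {
  taxi_sample <- sample(taxi_durations, n_samples, replace = TRUE)
  bike_sample <- sample(citibike_durations, n_samples, replace = TRUE)
  # fraction of simulated races in which the taxi arrives first
  mean(taxi_sample < bike_sample)
}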

Different bucketing and filtering schemes allow for different types of analysis. I ran simulations that bucketed by month to see how win rates have evolved over time, simulations that used only days where it rained, and others. There are undoubtedly more schemes to be considered, and the Monte Carlo methodology should be well equipped to handle them.


To leave a comment for the author, please follow the link and comment on their blog: Category: R | Todd W. Schneider.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Data.Table by Example – Part 1

By atmathew

(This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers)

For many years, I actively avoided the data.table package and preferred to utilize the tools available in either base R or dplyr for data aggregation and exploration. However, over the past year, I have come to realize that this was a mistake. Data tables are incredible and provide R users with a syntactically concise and efficient data structure for working with small, medium, or large datasets. While the package is well documented, I wanted to put together a series of posts that could be useful for those who want to get introduced to the data.table package in a more task-oriented format.

For this series of posts, I will be working with data that comes from the Chicago Police Department’s Citizen Law Enforcement Analysis and Reporting system. This dataset contains information on reported incidents of crime that occurred in the city of Chicago from 2001 to present. You can use the wget command in the terminal to download it as a csv file.

$ wget --no-check-certificate --progress=dot -O rows.csv "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"

The file will be saved in your working directory as “rows.csv”. We will import the data into R with the fread function and look at the first few rows and structure of the data.

dat = fread("rows.csv")

dat[1:3]
str(dat)

Notice that each of the string variables in the data set was imported as character and not factor. With base R functions like read.csv, we would have to set the stringsAsFactors argument to FALSE to get this result.
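For comparison, a base R import that keeps the strings as character would look something like this (a sketch; on a file this large, fread will also typically be much faster):

# base R equivalent: keep string columns as character rather than factor
dat_base <- read.csv("rows.csv", stringsAsFactors = FALSE)
str(dat_base)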

# replace the spaces in the column names with underscores so that
# columns like Primary_Type can be referenced in the code below
names(dat) <- gsub(" ", "_", names(dat))

Let's say that we want to see the frequency distribution of several of these variables. This can be done by using .N in conjunction with the by operator.

dat[, .N, by=.(Arrest)]

In the code below, you can also see how to chain operations together. We start by finding the count of each response in the variable, order the counts in descending order, and then select only those which occurred more than 200,000 times.

dat[, .N, by=.(Primary_Type)][order(-N)][N>=200000]

dat[, .N, by=.(Location_Description)][order(-N)][N>=200000]

Let’s say that we want to get a count of prostitution incidents by month. To get the desired results, we will need to modify the date values, filter instances in which the primary type was “prostitution”, and then get a count by each date.

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
   Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)][]

If you want to plot the results as a line graph, just add another chain which executes the visualization.

dat[, date2 := paste(substr(as.Date(dat$Date, format="%m/%d/%Y"),1,7), "01", sep="-")][
   Primary_Type=="PROSTITUTION", .N, by=.(date2)][, date2 := as.Date(date2)][order(date2)] %>%
      ggplot(., aes(date2, N)) + geom_line(group=1)

I’ve obviously skipped over a lot and some of the code presented here is more verbose than needed. Even so, beginners to R will hopefully find this useful and it will pique your interest in the beauty of the data table package. Future posts will cover more of the goodies available in the data.table package such as get(), set(), {}, and so forth.

PS: As of this past week, I’ve decided to re-enter the job market. So if you are looking for a statistical analyst or data scientist with strong experience in programming in R and time series econometrics, please contact me at mathewanalytics@gmail.com or reach me through LinkedIn

To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

How to Create an Interactive Infographic

By Tim Bock

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

An interactive infographic can be used to communicate a lot of information in an engaging way. With the right tools, they are also relatively straightforward to create. In this post, I show step-by-step how to create this interactive infographic, using Canva, Displayr and R code. The interactive example is designed so that the user can change the country and have the infographic update automatically.

Tools used to create an interactive infographic: Canva is used to create the base infographic. The calculations, charting, and automatic text-writing are performed using the R language. It is all hooked up with Displayr.


Step 1: Create or download the infographic

I start by going to Canva, and choosing the Neat Interactive Gaming Infographic (tip: use Find Templates on the left-hand panel). You could, of course, design your own infographic, either in Canva or elsewhere. I like Canva, but the key thing is to create an infographic image in some way. In Canva, I edited the template by deleting the bits that I wanted to replace with interactive charts and visualizations, and then I downloaded the infographic as a PNG file (2,000 pixels high by 800 pixels wide).

Interactive infographic example


Step 2: Import the infographic into Displayr

Create an account in Displayr, and then click the button that says + New Document. Set the page size to the same aspect ratio as the PNG file (Home > Page Layout > Layout > Page Size > Custom). For this example, the page should be 20 inches high and 8 inches wide.

Next, insert the infographic into Displayr (Insert > Image), move and resize it to fit the page (tip: you can use the Properties panel on the right to type in pixels 800 x 2000 to reset the correct aspect ratio of the image).


Step 3: Get the data into Displayr

The data that will make the infographic interactive needs to be hooked up in Displayr. The data used to create the infographic in this example is shown to the right. There are lots of ways to import data into Displayr (e.g., importing a raw data file and creating the tables in Displayr). For this example, the data has been pasted into Displayr from Excel using the steps below.

To paste the data into Displayr:

  • Insert > Paste Table (Data), click Add data (on the right of the screen).
  • Paste in the data. Alternatively, you could type it into this screen. I first just pasted in the Age Distribution data and pressed OK.
  • Properties > GENERAL and type AgeDistribution into the Label field and check the Automatic option (above).
  • Drag the table so that it is to the left of the infographic.
  • Hide the table (select it, Appearance > Hide). It will stay visible but will be invisible when you share the infographic.

Repeat this process for AverageAge, Ratio, and Multiplayer. It is important that you give each of these tables these names, as we refer to them later in our R code.


Step 4: Add the country selector

Next, I add a control so that the user can change country:

  • Insert > Control
  • Item list: China, US, Europe
  • Move it to the top of the screen and style as desired (font size, color, border)
  • Name: Click on the control and select China
  • Properties > GENERAL > Name: Country

I then inserted a text box (“GAMERS”) and placed it to the left of the control (font: Impact, size: 48, color: #ffb600).

Image of Let's play on! Gamers: China


Step 5: Create the charts and visualizations in R

Finally, create the charts and visualizations in Displayr using the following R code.

The column chart

I created the column chart using my colleague Carmen’s nifty wrapper-function for plotly. Insert an R Output in Displayr (Insert > R Output), copy and paste the following code, press Calculate, and then move and resize the chart.

 
flipStandardCharts::Chart(AgeDistribution[, Country], 
 type = "Column",
 background.fill.color = "#212121",
 charting.area.fill.color = "#212121",
 colors = "tiel", 
 x.tick.font.color = "white",
 x.tick.font.size = 20,
 x.grid.width = 0,
 y.tick.font.color = "white",
 y.tick.font.size = 20,
 y.title = "%",
 y.title.font.color = "white",
 y.grid.width = 0)

Average age

The average age was also created by inserting an R Output, using the code below. While I could have written the formatting in R, I instead used the various formatting tools built into Displayr (Properties > LAYOUT and Properties > APPEARANCE).

 
AverageAge[,Country]

The hearts pictograph

This was also done using an R Output in Displayr, with the following code (using R GitHub packages built by my colleagues Kyle and Carmen).

 
women = Ratio["Women", Country]
total = sum(Ratio[, Country])
flipPictographs::SinglePicto(women, total,
    layout = "Number of columns",
    number.cols = 5, 
    image = "Heart", 
    hide.base.image = FALSE,
    auto.size = TRUE,
    fill.icon.color = "red",
    base.icon.color = "cyan",
    background.color ="#212121" )

The R code used to create the textbox is below (tip: toggle on Wrap text output at the bottom of the Properties panel on the right)

 
women = Ratio["Women", Country]
total = sum(Ratio[, Country])
paste0(women, " IN ", total, " GAMERS ARE WOMEN")

The pie chart

This is the R code for the pie chart:

 
flipStandardCharts::Chart(Multiplayer[, Country],
    type = "Pie", 
    colors = c(rgb(175/255, 224/255, 170/255), rgb(0/255, 181/255, 180/255)),
    data.label.font.color = "White",
    data.label.font.size = 18,
    data.label.decimals = 0,
    data.label.suffix = "%")

The R code used to create the textbox was:

 
one.player = Multiplayer["1 player", Country]
paste0(one.player, "% OF PEOPLE PLAY VIDEO GAMES ON THEIR OWN")

Create the interactive infographic yourself

Sign in to Displayr and edit the document used to create the interactive infographic here. In Edit mode you can click on each of the charts, pictographs, and text boxes to see the underlying code. The final document was published (i.e., turned into a Dashboard) using Export > Web Page. Or you can view the interactive infographic created using the instructions above.

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

News Roundup from Microsoft Ignite

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It’s been a big day for the team here at Microsoft, with a flurry of announcements from the Ignite conference in Orlando. We’ll provide more in-depth details in the coming days and weeks, but for now here’s a brief roundup of the news related to data science:

Microsoft ML Server 9.2 is now available. This is the new name for what used to be called Microsoft R Server, and now also includes support for operationalizing the Python language as well as R. This 2-minute video explains how to deploy models with ML Server (and with this release, real-time scoring is now also supported on Linux). Microsoft R Client 3.4.1, featuring R 3.4.1, was also released today and provides desktop capabilities for R developers with the ability to deploy computations to a production ML Server.

The next generation of Azure Machine Learning is now available. This new service includes the AML Workbench, a cross-platform client for AI-powered data wrangling and experiment management; the AML Experimentation service to help data scientists increase their rate of experimentation with big data and GPUs, and the AML Model Management service to host, version, manage and monitor machine learning models. This TechCrunch article provides an overview.

SQL Server 2017 is now generally available, bringing support for in-database R and Python to both Windows and Linux.

Azure SQL Database now supports real-time scoring for R and Python, and a preview of general in-database R services is now available as well.

Microsoft Cognitive Services offers new intelligent APIs for developing AI applications, including general availability of the text analytics API for topic extraction, language detection and sentiment analysis.

Visual Studio Code for AI, an extension for the popular open-source code editor, provides new interfaces for developing with Tensorflow, Cognitive Toolkit (CNTK), Azure ML and more.

Finally, Microsoft announced a new high-level language for quantum computing to be integrated with Visual Studio, and a simulator for quantum computers up to 32 qubits. This 2-minute video provides an overview of Microsoft’s vision for Quantum Computing.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Custom Level Coding in vtreat

By Nina Zumel

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

One of the services that the R package vtreat provides is level coding (what we sometimes call impact coding): converting the levels of a categorical variable to a meaningful and concise single numeric variable, rather than coding them as indicator variables (AKA “one-hot encoding”). Level coding can be computationally and statistically preferable to one-hot encoding for variables that have an extremely large number of possible levels.


Level coding is like measurement: it summarizes categories of individuals into useful numbers. Source: USGS

By default, vtreat level codes to the difference between the conditional means and the grand mean (catN variables) when the outcome is numeric, and to the difference between the conditional log-likelihood and global log-likelihood of the target class (catB variables) when the outcome is categorical. These aren’t the only possible level codings. For example, the ranger package can encode categorical variables as ordinals, sorted by the conditional expectations/means. While this is not a completely faithful encoding for all possible models (it is not completely faithful for linear or logistic regression, for example), it is often invertible for tree-based methods, and has the advantage of keeping the original levels distinct, which impact coding may not. That is, two levels with the same conditional expectation would be conflated by vtreat‘s coding. This often isn’t a problem — but sometimes, it may be.
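To make the default coding concrete, here is a small hand-rolled illustration of the idea behind a catN impact code (purely illustrative toy data; vtreat’s real implementation also handles significance estimates, novel levels, and cross-validated scoring):

# toy impact coding: each level's conditional mean of y, minus the grand mean
d <- data.frame(level = c("a", "a", "b", "b", "b", "c"),
                y     = c(1, 3, 10, 12, 14, 5))
grand_mean  <- mean(d$y)
level_means <- tapply(d$y, d$level, mean)
impact_code <- level_means - grand_mean
impact_code["a"]   # the single numeric score that would replace level "a"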

So the data scientist may want to use a level coding different from what vtreat defaults to. In this article, we will demonstrate how to implement custom level encoders in vtreat. We assume you are familiar with the basics of vtreat: the types of derived variables, how to create and apply a treatment plan, etc.

For our example, we will implement level coders based on partial pooling, or hierarchical/multilevel models (Gelman and Hill, 2007). We’ll leave the details of how partial pooling works to a subsequent article; for now, just think of it as a score that shrinks the estimate of the conditional mean to be closer to the unconditioned mean, and hence possibly closer to the unknown true values, when there are too few measurements to make an accurate estimate.

We’ll implement our partial pooling encoders using the lmer() (multilevel linear regression) and glmer() (multilevel generalized linear regression) functions from the lme4 package. For our example data, we’ll use radon levels by county for the state of Minnesota (Gelman and Hill, 2007; you can find the original data here).

The Data: Radon levels in Minnesota

library("vtreat")
library("lme4")
library("dplyr")
library("tidyr")
library("ggplot2")

# example data

srrs = read.table("srrs2.dat", header=TRUE, sep=",", stringsAsFactors=FALSE)

# target: log of radon activity (activity)
# grouping variable: county
radonMN = filter(srrs, state=="MN") %>%
  select("county", "activity") %>%
  filter(activity > 0) %>% 
  mutate(activity = log(activity),
         county = base::trimws(county)) %>%
  mutate(critical = activity>1.5)

str(radonMN)
## 'data.frame':    916 obs. of  3 variables:
##  $ county  : chr  "AITKIN" "AITKIN" "AITKIN" "AITKIN" ...
##  $ activity: num  0.788 0.788 1.065 0 1.131 ...
##  $ critical: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

For this example we have three columns of interest:

  • county: 85 possible values
  • activity: the log of the radon reading (numerical outcome)
  • critical: TRUE when activity > 1.5 (categorical outcome)

The goal is to level code county for either the regression problem (predict the log radon reading) or the categorization problem (predict whether the radon level is “critical”).

Plot: conditional means of log radon activity and of critical readings, by county

As the graph shows, the conditional mean of log radon activity by county ranges from nearly zero to about 3, and the conditional expectation of a critical reading ranges from zero to one. On the other hand, the number of readings per county is quite low for many counties — only one or two — though some counties have a large number of readings. That means some of the conditional expectations are quite uncertain.

Implementing Level Coders for Partial Pooling

Let’s implement level coders that use partial pooling to compute the level score.

Regression

For regression problems, the custom coder should be a function that takes as input:

  • v: a string with the name of the categorical variable
  • vcol: the actual categorical column (assumed character)
  • y: the numerical outcome column
  • weights: a column of row weights

The function should return a column of scores (the level codings). For our example, the function builds a lmer model to predict y as a function of vcol, then returns the predictions on the training data.

# @param v character variable name
# @param vcol character, independent or input variable
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
ppCoderN <- function(v, vcol, 
                     y, 
                     weights) {
  # regression case y ~ vcol
  d <- data.frame(x = vcol,
                  y = y,
                  stringsAsFactors = FALSE)
  m <- lmer(y ~ (1 | x), data = d, weights = weights)
  predict(m, newdata = d)
}

Categorization

For categorization problems, the function should assume that y is a logical column, where TRUE is assumed to be the target outcome. This is because vtreat converts the outcome column to a logical while creating the treatment plan.

# @param v character variable name
# @param vcol character, independent or input variable
# @param y logical, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
ppCoderC <- function(v, vcol, 
                     y, 
                     weights) {
  # classification case y ~ vcol
  d <- data.frame(x = vcol,
                  y = y,
                  stringsAsFactors = FALSE)
  m <- glmer(y ~ (1 | x), data = d, weights = weights, family = binomial)
  predict(m, newdata = d, type = 'link')
}

You can then pass the functions in as a named list into either designTreatmentsX or mkCrossFrameXExperiment to build the treatment plan. The format of the key is [n|c].levelName[.option]*.

The prefix picks the model type: numeric or regression coders start with ‘n.’ and categorical coders start with ‘c.’. Currently, the only supported option is ‘center,’ which directs vtreat to center the codes with respect to the estimated grand mean. The catN and catB level codings are centered in this way.

Our example coders can be passed in as shown below.

customCoders = list('n.poolN.center' = ppCoderN, 
                    'c.poolC.center' = ppCoderC)

Using the Custom Coders

Let’s build a treatment plan for the regression problem.

# I only want to create the cleaned numeric variables, the isBAD variables,
# and the level codings (not the indicator variables or catP, etc.)
vartypes_I_want = c('clean', 'isBAD', 'catN', 'poolN')

treatplanN = designTreatmentsN(radonMN, 
                               varlist = c('county'),
                               outcomename = 'activity',
                              codeRestriction = vartypes_I_want,
                              customCoders = customCoders, 
                              verbose=FALSE)

scoreFrame = treatplanN$scoreFrame
scoreFrame %>% select(varName, sig, origName, code)
##        varName          sig origName  code
## 1 county_poolN 1.343072e-16   county poolN
## 2  county_catN 2.050811e-16   county  catN

Note that the treatment plan returned both the catN variable (default level encoding) and the pooled level encoding (poolN). You can restrict to just using one coding or the other using the codeRestriction argument either during treatment plan creation, or in prepare().

Let’s compare the two level encodings.

# create a frame with one row for every county,
measframe = data.frame(county = unique(radonMN$county),
                       stringsAsFactors=FALSE)

outframe = prepare(treatplanN, measframe)

# If we wanted only the new pooled level coding,
# (plus any numeric/isBAD variables), we would
# use a codeRestriction:
#
# outframe = prepare(treatplanN, 
#                    measframe,
#                    codeRestriction = c('clean', 'isBAD', 'poolN'))


gather(outframe, key=scoreType, value=score, 
       county_poolN, county_catN) %>%
  ggplot(aes(x=score)) + 
  geom_density(adjust=0.5) + geom_rug(sides="b") + 
  facet_wrap(~scoreType, ncol=1, scale="free_y") + 
  ggtitle("Distribution of scores")

Plot: distribution of scores

Notice that the poolN scores are “tucked in” compared to the catN encoding. In a later article, we’ll show that the counties with the most tucking in (or shrinkage) tend to be those with fewer measurements.

We can also code for the categorical problem.

# For categorical problems, coding is catB
vartypes_I_want = c('clean', 'isBAD', 'catB', 'poolC')

treatplanC = designTreatmentsC(radonMN, 
                               varlist = c('county'),
                               outcomename = 'critical',
                               outcometarget= TRUE,
                               codeRestriction = vartypes_I_want,
                               customCoders = customCoders, 
                               verbose=FALSE)

outframe = prepare(treatplanC, measframe)

gather(outframe, key=scoreType, value=linkscore, 
       county_poolC, county_catB) %>%
  ggplot(aes(x=linkscore)) + 
  geom_density(adjust=0.5) + geom_rug(sides="b") + 
  facet_wrap(~scoreType, ncol=1, scale="free_y") + 
  ggtitle("Distribution of link scores")

Plot: distribution of link scores

Notice that the poolC link scores are even more tucked in compared to the catB link scores, and that the catB scores are multimodal. The smaller link scores mean that the pooled model avoids estimates of conditional expectation close to either zero or one, because, again, these estimates come from counties with few readings. Multimodal summaries can be evidence of modeling flaws, including omitted variables and un-modeled mixing of different example classes. Hence, we do not want our inference procedure to suggest such structure until there is a lot of evidence for it. And, as is common in machine learning, there are advantages to lower-variance estimators when they do not cost much in terms of bias.

Other Considerations

For this example, we used the lme4 package to create custom level codings. Once calculated, vtreat stores the coding as a lookup table in the treatment plan. This means lme4 is not needed to prepare new data. In general, using a treatment plan is not dependent on any special packages that might have been used to create it, so it can be shared with other users with no extra dependencies.
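For instance, a plan built with these custom coders can be serialized with base R and applied later in a session where lme4 is not even installed (a sketch; the file name is arbitrary):

# save the treatment plan; the custom level codes are stored as lookup tables
saveRDS(treatplanN, "radon_treatplan.rds")

# later, or on another machine: only vtreat is needed to apply it
library(vtreat)
treatplanN2 <- readRDS("radon_treatplan.rds")
newdata_treated <- prepare(treatplanN2, measframe)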

When using mkCrossFrameXExperiment, note that the resulting cross frame will have a slightly different distribution of scores than what the treatment plan produces. This is true even for catB and catN variables. This is because the treatment plan is built using all the data, while the cross frame is built using n-fold cross validation on the data. See the cross frame vignette for more details.

Thanks to Geoffrey Simmons, Principal Data Scientist at Echo Global Logistics, for suggesting partial pooling based level coding (and testing it for us!), introducing us to the references, and reviewing our articles.

In a follow-up article, we will go into partial pooling in more detail, and motivate why you might sometimes prefer it to vtreat‘s default coding.

References

Gelman, Andrew and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Time Series Analysis in R Part 2: Time Series Transformations

By Troy Walters

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In Part 1 of this series, we got started by looking at the ts object in R and how it represents time series data. In Part 2, I’ll discuss some of the many time series transformation functions that are available in R. This is by no means an exhaustive catalog. If you feel I left out anything important, please let me know. I compile these posts as a guide in RMarkdown which I plan to make available on the web soon.

Often in time series analysis and modeling, we will want to transform data. There are a number of different functions that can be used to transform time series data, such as the difference, log, moving average, percent change, lag, or cumulative sum. These types of functions are useful for both visualizing time series data and for modeling time series. For example, the moving average function can be used to more easily visualize a high-variance time series and is also a critical part of the ARIMA family of models. Functions such as the difference, percent change, and log difference are helpful for making non-stationary data stationary.

Let’s start with a very common operation. To take a lag we use the lag() function from the stats package. It takes a ts object and a value for the lag argument.

tseries_lag1 <- stats::lag(tseries, k = 1)
head(cbind(tseries, tseries_lag1))
         tseries tseries_lag1
1999 Q4       NA      7.07667
2000 Q1  7.07667     10.38876
2000 Q2 10.38876     23.91096
2000 Q3 23.91096     14.49356
2000 Q4 14.49356     15.90501
2001 Q1 15.90501     28.00545

Notice that when we do this, tseries_lag1 is equivalent to tseries offset by 1, such that the value for 2000q1 in tseries becomes the value for 1999q4 in tseries_lag1 and so on. We can do this for any lag of our choosing. Let’s try 3 instead:

tseries_lag3 <- stats::lag(tseries, k = 3)
cbind(tseries, tseries_lag3)
         tseries tseries_lag3
1999 Q2       NA      7.07667
1999 Q3       NA     10.38876
1999 Q4       NA     23.91096
2000 Q1  7.07667     14.49356
2000 Q2 10.38876     15.90501
2000 Q3 23.91096     28.00545

We can also do this in the opposite direction using a negative value for the lag argument, effectively creating a lead series instead:

tseries_lead1 <- stats::lag(tseries, k = -1)
cbind(tseries, tseries_lead1)
         tseries tseries_lead1
2000 Q1  7.07667            NA
2000 Q2 10.38876       7.07667
2000 Q3 23.91096      10.38876
2000 Q4 14.49356      23.91096
2001 Q1 15.90501      14.49356
2001 Q2 28.00545      15.90501

Notice that tseries_lead1 now leads the original tseries by one time period.
We can take differences of a time series as well. This is equivalent to taking the difference between each value and its previous value:

tseries_diff1 <- diff(tseries)
tm <- cbind(tseries, tseries_diff1)
head(tm)
      tseries tseries_diff1
[1,]  7.07667            NA
[2,] 10.38876      3.312087
[3,] 23.91096     13.522201
[4,] 14.49356     -9.417399
[5,] 15.90501      1.411455
[6,] 28.00545     12.100441
plot.ts(tm)

Gives this plot:

As you can see, taking a difference is an effective way to remove a trend and make a time series stationary.
The difference can be calculated over more than one lag. To do so, simply alter the value of the lag argument.

tseries_diff2 <- diff(tseries, lag = 2)
tm <- cbind(tseries, tseries_diff2)
tm
         tseries tseries_diff2
2000 Q1  7.07667            NA
2000 Q2 10.38876            NA
2000 Q3 23.91096     16.834288
2000 Q4 14.49356      4.104801
2001 Q1 15.90501     -8.005944
2001 Q2 28.00545     13.511896
plot(tm)

Gives this plot:

Some time series transformation functions are useful for series in which the variance gets larger over time. These range from the basic logarithm function to the Box-Cox group of transformations (of which the natural logarithm is a special case). We can take the log of a time series using the log function in the same way that we would take the log of a vector. Let’s generate a time series that has increasing variance:

# Generate a series whose variance increases along with its level
trend <- ts(seq(from = 1, to = 100), start = c(2000, 1), frequency = 4)
set.seed(123)
tseries_h <- trend * exp(rnorm(100, mean = 0, sd = 0.25))
plot.ts(tseries_h)

Gives this plot:

As we'll discuss in more detail, the presence of increasing, or more generally non-constant, variance is called heteroskedasticity and can cause problems in linear regression. Often, it will need to be corrected before modeling. One way to do this is by taking a log:

tseries_log <- log(tseries_h)
plot.ts(tseries_log)

Gives this plot:

There are other more advanced ways of eliminating non-constant variance, one of which is the Box-Cox transformation, which allows us a bit more control over the transformation. The Box-Cox takes the form (Hyndman and Athanasopoulos, 2013):

$$w_t = \begin{cases} \log(y_t), & \text{if } \lambda = 0; \\ \frac{y_t^{\lambda} - 1}{\lambda}, & \text{otherwise} \end{cases}$$
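Reading the formula straight into R makes the piecewise definition concrete (just a sketch; the forecast package introduced next provides the implementation actually used below):

box_cox <- function(y, lambda) {
  # log when lambda is zero, power transform otherwise
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

box_cox(c(1, 10, 100), lambda = 0.5)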

In order to demonstrate the Box-Cox transformation, I'll introduce Rob Hyndman's `forecast` package:

library(forecast)

It has two functions that are of use here. The primary function is BoxCox(), which will return a transformed time series given a time series and a value for the parameter lambda:

plot.ts(BoxCox(tseries_h, lambda = 0.5))

Gives this plot:

Notice that this value of lambda does not entirely take care of the heteroskedasticity problem. We can experiment with different values of lambda, or we can use the BoxCox.lambda() function, which will provide an optimal value for the parameter lambda:

lambda <- BoxCox.lambda(tseries_h)
lambda
[1] 0.05527724
plot.ts(BoxCox(tseries_h, lambda = lambda))

Gives this plot:

The BoxCox.lambda() function has chosen the value 0.055. If we then use this value in our BoxCox() function, it returns a time series that appears to have constant variance.

Another common calculation that we may want to perform on time series is the percent change from one period to another. Because it seems that R does not have a function for performing this calculation, we can do it ourselves.

tseries_pch <- (tseries - stats::lag(tseries, k = -1)) / stats::lag(tseries, k = -1)
cbind(tseries, tseries_pch)
         tseries tseries_pch
2000 Q1  7.07667          NA
2000 Q2 10.38876  0.46802901
2000 Q3 23.91096  1.30161865
2000 Q4 14.49356 -0.39385287
2001 Q1 15.90501  0.09738501
2001 Q2 28.00545  0.76079409

We can turn this into a function so that we can easily calculate percent change in the future. We’ll add an argument to specify the number of periods over which we want to calculate the change and set it to 1 by default. Additionally, we’ll add some argument verification.

pch <- function(ts_data, lag = 1) {
  # basic argument verification
  stopifnot(is.ts(ts_data), is.numeric(lag), lag >= 1)
  (ts_data - stats::lag(ts_data, k = -lag)) / stats::lag(ts_data, k = -lag)
}
cbind(tseries, pch(tseries))
         tseries pch(tseries)
2000 Q1  7.07667           NA
2000 Q2 10.38876   0.46802901
2000 Q3 23.91096   1.30161865
2000 Q4 14.49356  -0.39385287
2001 Q1 15.90501   0.09738501
2001 Q2 28.00545   0.76079409

Now if we wished to calculate the percentage change over more than one period we can do so. Let’s say we wanted to calculate the year over year percent change. We can call our pch() function with a lag of 4.

tseries_pchya <- pch(tseries, lag = 4)

Two of the functions that we have discussed so far, the difference and the log, are often combined in time series analysis. The log difference function is useful for making non-stationary data stationary and has some other useful properties. We can calculate the log difference in R by simply combining the log() and diff() functions.

tseries_dlog <- diff(log(tseries))
plot.ts(tseries_dlog)

Gives this plot:

Notice that after taking the log return, tseries appears to be stationary. We see some higher than normal volatility in the beginning of the series. This is due largely to the fact that the series levels start off so small. This leads into a nice property of the log return function, which is that it is a close approximation to the percent change:

plot.ts(cbind(pch(tseries), tseries_dlog))

Gives this plot:

This similarity is only approximate. The relationship does break down somewhat when the percent change from one period to the next is particularly large. You can read a good discussion of this topic here.
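A quick numeric check makes the point:

# For small changes the log difference is very close to the percent change,
# but the gap widens as the change gets larger
log(1.01)  # about 0.00995, versus a true 1% change
log(1.50)  # about 0.405, versus a true 50% change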

A moving average is another essential function for working with time series. For series with particularly high volatility, a moving average can help us to more clearly visualize its trend. Unfortunately, base R does not (to my knowledge) have a convenient function for calculating the moving average of a time series directly. However, we can use base R's filter() function, which allows us to perform general linear filtering. We can set up the parameters of this function to be a moving average (Shumway and Stoffer, 2011). Here we apply the filter() function to tseries to create a 5-period moving average. The filter argument lets us specify the filter, which in this case is just a weighted average of 5 observations. The sides argument allows us to specify whether we want to apply the filter over past values only (sides = 1) or to both past and future values (sides = 2). Here I set sides to 1, indicating that I want a trailing moving average:

tseries_lf5 <- filter(tseries, filter = rep(1/5, 5), sides = 1)
plot.ts(cbind(tseries, tseries_lf5))

Gives this plot:

The fact that we have to define the linear filter each time we use this function makes it a little cumbersome to use. If we don't mind introducing a dependency to our code, we could use the SMA() function from the TTR package or the ma() function from the forecast package. The SMA() function takes a ts object and a value for n – the window over which we want to calculate the moving average.

library(TTR)
tseries_ma5 <- SMA(tseries, n = 5)

The ma() function from the forecast package also performs moving average calculations. We supply the time series and a value for the order argument.

tseries_ma5fore <- ma(tseries, order = 5)

Let's compare the results. The SMA() function returns a trailing moving average, where each value is the mean of the n most recent trailing values. This is equivalent to the results we get from using the filter() function. The ma() function from the forecast package returns a centered moving average. In this case tseries_ma5fore is equal to the average of the current observation, the previous two observations, and the next two observations. Which one you use depends on your application.
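For reference, a side-by-side table like the one below can be produced by binding the four series together (the exact command used originally is not shown, so this cbind() call is an assumption):

cbind(tseries, tseries_lf5, tseries_ma5, tseries_ma5fore)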

          tseries tseries_lf5 tseries_ma5 tseries_ma5fore
2000 Q1  7.076670          NA          NA              NA
2000 Q2 10.388758          NA          NA              NA
2000 Q3 23.910958          NA          NA        14.35499
2000 Q4 14.493559          NA          NA        18.54075
2001 Q1 15.905014    14.35499    14.35499        20.50828
2001 Q2 28.005455    18.54075    18.54075        17.55500
2001 Q3 20.226413    20.50828    20.50828        17.49470
2001 Q4  9.144571    17.55500    17.55500        17.68977
2002 Q1 14.192030    17.49470    17.49470        18.00239
2002 Q2 16.880366    17.68977    17.68977        18.86085

That’s it for Part 2. I don’t know about you, but I am sick of working with generated data. I want to work with some real world data. In Part 3 we’ll do just that using Quandl. See you then.

References

    Hyndman, Rob J. and George Athanasopoulos (2013) Forecasting: Principles and Practice. https://www.otexts.org/fpp
    Shumway, Robert H. and David S. Stoffer (2011) Time Series Analysis and Its Applications With R Examples. New York, NY: Springer.

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Speeding Up Digital Arachnids

    By hrbrmstr

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)
    spiderbar, spiderbar 
    Reads robots rules from afar.
    Crawls the web, any size; 
    Fetches with respect, never lies.
    Look Out! 
    Here comes the spiderbar.
    
    Is it fast? 
    Listen bud, 
    It's got C++ under the hood.
    Can you scrape, from a site?
    Test with can_fetch(), TRUE == alright
    Hey, there 
    There goes the spiderbar. 
    

    (Check the end of the post if you don’t recognize the lyrical riff.)

    Face front, true believer!

    I’ve used and blogged about Peter Meissner’s most excellent robotstxt package before. It’s an essential tool for any ethical web scraper.

    But (there’s always a “but“, right?), it was a definite bottleneck for an unintended package use case earlier this year (yes, I still have not rounded out the corners on my “crawl delay” forthcoming post).

    I needed something faster for my bulk Crawl-Delay analysis which led me to this small, spiffy C++ library for parsing robots.txt files. After a tiny bit of wrangling, that C++ library has turned into a small R package spiderbar which is now hitting a CRAN mirror near you, soon. (CRAN — rightly so — did not like the unoriginal name rep).

    How much faster?

    I’m glad you asked!

    Let’s take a look at one benchmark: parsing robots.txt and extracting Crawl-delay entries. Just how much faster is spiderbar?

    library(spiderbar)
    library(robotstxt)
    library(microbenchmark)
    library(tidyverse)
    library(hrbrthemes)
    
    rob <- get_robotstxt("imdb.com")

    microbenchmark(

      robotstxt = {
        rt <- parse_robotstxt(rob)
        rt$crawl_delay
      },

      spiderbar = {
        sb <- robxp(rob)
        crawl_delays(sb)
      }

    ) -> mb1
    
    update_geom_defaults("violin", list(colour = "#4575b4", fill="#abd9e9"))
    
    autoplot(mb1) +
      scale_y_comma(name="nanoseconds", trans="log10") +
      labs(title="Microbenchmark results for parsing 'robots.txt' and extracting 'Crawl-delay' entries",
           subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
      theme_ipsum_rc(grid="Xx")

    As you can see, it's just a tad bit faster.

    Now, you won’t notice that temporal gain in an interactive context but you absolutely will if you are cranking through a few million of them across a few thousand WARC files from the Common Crawl.

    But, I don’t care about Crawl-Delay!

    OK, fine. Do you care about fetchability? We can speed that up, too!

    rob_txt <- get_robotstxt("imdb.com")
    rob_spi <- robxp(rob_txt)

    microbenchmark(

      robotstxt = {
        paths_allowed("/ShowAll/this/that", "imdb.com")
      },

      spiderbar = {
        can_fetch(rob_spi, "/ShowAll/this/that")
      }

    ) -> mb2
    
    autoplot(mb2) +
      scale_y_comma(name="nanoseconds", trans="log10") +
      labs(title="Microbenchmark results for testing resource 'fetchability'",
           subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
      theme_ipsum_rc(grid="Xx")

    Vectorized or it didn’t happen.

    (Gosh, even Spider-Man got more respect!)

    OK, this is a tough crowd, but we’ve got vectorization covered as well:

    microbenchmark(
    
      robotstxt = {
        paths_allowed(c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"), "imdb.com")
      },
    
      spiderbar = {
        can_fetch(rob_spi, c("/ShowAll/this/that", "/Soundtracks/this/that", "/Tsearch/this/that"))
      }
    
    ) -> mb3
    
    autoplot(mb3) +
      scale_y_comma(name="nanoseconds", trans="log10") +
      labs(title="Microbenchmark results for testing multiple resource 'fetchability'",
           subtitle="Compares performance between robotstxt & spiderbar packages. Lower values are better.") +
      theme_ipsum_rc(grid="Xx")

    Excelsior!

    Peter’s package does more than this one since it helps find the robots.txt files and provides helpful data frames for more robots exclusion protocol content. And, we’ve got some plans for package interoperability. So, stay tuned, true believer, for more spider-y goodness.

    You can check out the code and leave package questions or comments on GitHub.

    (Hrm…Peter Parker was Spider-Man and Peter Meissner wrote robotstxt which is all about spiders. Coincidence?! I think not!)

    To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Survival Analysis with R

    By R Views

    (This article was first published on R Views, and kindly contributed to R-bloggers)

    With roots dating back to at least 1662 when John Graunt, a London merchant, published an extensive set of inferences based on mortality records, survival analysis is one of the oldest subfields of Statistics [1]. Basic life-table methods, including techniques for dealing with censored data, were discovered before 1700 [2], and in the early eighteenth century, the old masters – de Moivre working on annuities, and Daniel Bernoulli studying competing risks for the analysis of smallpox inoculation – developed the modern foundations of the field [2]. Today, survival analysis models are important in Engineering, Insurance, Marketing, Medicine, and many more application areas. So, it is not surprising that R should be rich in survival analysis functions. CRAN’s Survival Analysis Task View, a curated list of the best relevant R survival analysis packages and functions, is indeed formidable. We all owe a great deal of gratitude to Arthur Allignol and Aurielien Latouche, the task view maintainers.

    Looking at the Task View on a small screen, however, is a bit like standing too close to a brick wall – left-right, up-down, bricks all around. It is a fantastic edifice that gives some idea of the significant contributions R developers have made both to the theory and practice of Survival Analysis. As well-organized as it is, however, I imagine that even survival analysis experts need some time to find their way around this task view. Newcomers – people either new to R or new to survival analysis or both – must find it overwhelming. So, it is with newcomers in mind that I offer the following narrow trajectory through the task view that relies on just a few packages: survival, ggplot2, ggfortify, and ranger.

    The survival package is the cornerstone of the entire R survival analysis edifice. Not only is the package itself rich in features, but the object created by the Surv() function, which contains failure time and censoring information, is the basic survival analysis data structure in R. Dr. Terry Therneau, the package author, began working on the survival package in 1986. The first public release, in late 1989, used the Statlib service hosted by Carnegie Mellon University. Thereafter, the package was incorporated directly into Splus, and subsequently into R.
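    As a toy illustration of that basic data structure (made-up values, not the veteran data used below):

    library(survival)
    # Two observed events and one right-censored observation; the censored
    # time is printed with a trailing "+" (for example, 12+)
    Surv(time = c(5, 8, 12), event = c(1, 1, 0))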

    ggfortify enables producing handsome, one-line survival plots with ggplot2::autoplot.

    ranger might be the surprise in my very short list of survival packages. The ranger() function is well-known for being a fast implementation of the Random Forests algorithm for building ensembles of classification and regression trees. But ranger() also works with survival data. Benchmarks indicate that ranger() is suitable for building time-to-event models with the large, high-dimensional data sets important to internet marketing applications. Since ranger() uses standard Surv() survival objects, it’s an ideal tool for getting acquainted with survival analysis in this machine-learning age.

    Load the data

    This first block of code loads the required packages, along with the veteran dataset from the survival package that contains data from a two-treatment, randomized trial for lung cancer.

    library(survival)
    library(ranger)
    library(ggplot2)
    library(dplyr)
    library(ggfortify)
    
    #------------
    data(veteran)
    head(veteran)
    ##   trt celltype time status karno diagtime age prior
    ## 1   1 squamous   72      1    60        7  69     0
    ## 2   1 squamous  411      1    70        5  64    10
    ## 3   1 squamous  228      1    60        3  38     0
    ## 4   1 squamous  126      1    60        9  63    10
    ## 5   1 squamous  118      1    70       11  65    10
    ## 6   1 squamous   10      1    20        5  49     0

    The variables in veteran are:

    * trt: 1=standard, 2=test
    * celltype: 1=squamous, 2=small cell, 3=adeno, 4=large
    * time: survival time in days
    * status: censoring status
    * karno: Karnofsky performance score (100=good)
    * diagtime: months from diagnosis to randomization
    * age: in years
    * prior: prior therapy 0=no, 10=yes

    Kaplan Meier Analysis

    The first thing to do is to use Surv() to build the standard survival object. The variable time records survival time; status indicates whether the patient’s death was observed (status = 1) or that survival time was censored (status = 0). Note that a “+” after the time in the print out of km indicates censoring.

    # Kaplan Meier Survival Curve
    km <- with(veteran, Surv(time, status))
    head(km, 80)
    ##  [1]  72  411  228  126  118   10   82  110  314  100+  42    8  144   25+
    ## [15]  11   30  384    4   54   13  123+  97+ 153   59  117   16  151   22 
    ## [29]  56   21   18  139   20   31   52  287   18   51  122   27   54    7 
    ## [43]  63  392   10    8   92   35  117  132   12  162    3   95  177  162 
    ## [57] 216  553  278   12  260  200  156  182+ 143  105  103  250  100  999 
    ## [71] 112   87+ 231+ 242  991  111    1  587  389   33

    To begin our analysis, we use the formula Surv(time, status) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of survival over time. The times parameter of the summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 30, 60 and 90 days, and then every 90 days thereafter. This is the simplest possible model. It only takes three lines of R code to fit it and produce numerical and graphical summaries.

    km_fit <- survfit(Surv(time, status) ~ 1, data = veteran)
    summary(km_fit, times = c(1, 30, 60, 90 * (1:10)))
    ## Call: survfit(formula = Surv(time, status) ~ 1, data = veteran)
    ## 
    ##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
    ##     1    137       2    0.985  0.0102      0.96552       1.0000
    ##    30     97      39    0.700  0.0392      0.62774       0.7816
    ##    60     73      22    0.538  0.0427      0.46070       0.6288
    ##    90     62      10    0.464  0.0428      0.38731       0.5560
    ##   180     27      30    0.222  0.0369      0.16066       0.3079
    ##   270     16       9    0.144  0.0319      0.09338       0.2223
    ##   360     10       6    0.090  0.0265      0.05061       0.1602
    ##   450      5       5    0.045  0.0194      0.01931       0.1049
    ##   540      4       1    0.036  0.0175      0.01389       0.0934
    ##   630      2       2    0.018  0.0126      0.00459       0.0707
    ##   720      2       0    0.018  0.0126      0.00459       0.0707
    ##   810      2       0    0.018  0.0126      0.00459       0.0707
    ##   900      2       0    0.018  0.0126      0.00459       0.0707
    #plot(km_fit, xlab="Days", main = 'Kaplan Meyer Plot') #base graphics is always ready
    autoplot(km_fit)

    Next, we look at survival curves by treatment.

    km_trt_fit <- survfit(Surv(time, status) ~ trt, data = veteran)
    autoplot(km_trt_fit)

    And, to show one more small exploratory plot, I’ll do just a little data munging to look at survival by age. First, I create a new data frame with a categorical variable AG that has values LT60 and GT60, which respectively describe veterans younger and older than sixty. While I am at it, I make trt and prior into factor variables. But note, survfit() and npsurv() worked just fine without this refinement.

    vet <- mutate(veteran,
                  AG = factor(ifelse(age < 60, "LT60", "GT60")),
                  trt = factor(trt, labels = c("standard", "test")),
                  prior = factor(prior, labels = c("No", "Yes")))

    km_AG_fit <- survfit(Surv(time, status) ~ AG, data = vet)
    autoplot(km_AG_fit)

    Although the two curves appear to overlap in the first fifty days, younger patients clearly have a better chance of surviving more than a year.

    Cox Proportional Hazards Model

    Next, I’ll fit a Cox Proportional Hazards Model that makes use of all of the covariates in the data set.

    # Fit Cox Model
    cox <- coxph(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior, data = vet)
    summary(cox)
    ## Call:
    ## coxph(formula = Surv(time, status) ~ trt + celltype + karno + 
    ##     diagtime + age + prior, data = vet)
    ## 
    ##   n= 137, number of events= 128 
    ## 
    ##                         coef  exp(coef)   se(coef)      z Pr(>|z|)    
    ## trttest            2.946e-01  1.343e+00  2.075e-01  1.419  0.15577    
    ## celltypesmallcell  8.616e-01  2.367e+00  2.753e-01  3.130  0.00175 ** 
    ## celltypeadeno      1.196e+00  3.307e+00  3.009e-01  3.975 7.05e-05 ***
    ## celltypelarge      4.013e-01  1.494e+00  2.827e-01  1.420  0.15574    
    ## karno             -3.282e-02  9.677e-01  5.508e-03 -5.958 2.55e-09 ***
    ## diagtime           8.132e-05  1.000e+00  9.136e-03  0.009  0.99290    
    ## age               -8.706e-03  9.913e-01  9.300e-03 -0.936  0.34920    
    ## priorYes           7.159e-02  1.074e+00  2.323e-01  0.308  0.75794    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##                   exp(coef) exp(-coef) lower .95 upper .95
    ## trttest              1.3426     0.7448    0.8939    2.0166
    ## celltypesmallcell    2.3669     0.4225    1.3799    4.0597
    ## celltypeadeno        3.3071     0.3024    1.8336    5.9647
    ## celltypelarge        1.4938     0.6695    0.8583    2.5996
    ## karno                0.9677     1.0334    0.9573    0.9782
    ## diagtime             1.0001     0.9999    0.9823    1.0182
    ## age                  0.9913     1.0087    0.9734    1.0096
    ## priorYes             1.0742     0.9309    0.6813    1.6937
    ## 
    ## Concordance= 0.736  (se = 0.03 )
    ## Rsquare= 0.364   (max possible= 0.999 )
    ## Likelihood ratio test= 62.1  on 8 df,   p=1.799e-10
    ## Wald test            = 62.37  on 8 df,   p=1.596e-10
    ## Score (logrank) test = 66.74  on 8 df,   p=2.186e-11
    cox_fit <- survfit(cox)
    autoplot(cox_fit)

    Note that the model flags small cell type, adeno cell type and karno as significant. However, some caution needs to be exercised in interpreting these results. While the Cox Proportional Hazards model is thought to be "robust", a careful analysis would check the assumptions underlying the model. For example, the Cox model assumes that the covariates do not vary with time. In a vignette [12] that accompanies the survival package, Therneau, Crowson and Atkinson demonstrate that the Karnofsky score (karno) is, in fact, time-dependent, so the assumptions for the Cox model are not met. The vignette authors go on to present a strategy for dealing with time dependent covariates.
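    One standard diagnostic for that assumption (not part of the original analysis, just a sketch) is the proportional hazards test based on the Schoenfeld residuals:

    # Test the proportional hazards assumption for the fitted Cox model;
    # a small p-value for karno would flag the time dependence discussed above
    cox.zph(cox)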

    Data scientists who are accustomed to computing ROC curves to assess model performance should be interested in the Concordance statistic. The documentation for the survConcordance() function in the survival package defines concordance as “the probability of agreement for any two randomly chosen observations, where in this case agreement means that the observation with the shorter survival time of the two also has the larger risk score. The predictor (or risk score) will often be the result of a Cox model or other regression” and notes that: “For continuous covariates concordance is equivalent to Kendall’s tau, and for logistic regression is is equivalent to the area under the ROC curve.”
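    For reference, the concordance reported in the summary above can also be pulled directly out of the fitted model (a minimal sketch):

    # Concordance and its standard error from the fitted Cox model
    summary(cox)$concordance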

    To demonstrate using the survival package, along with ggplot2 and ggfortify, I’ll fit Aalen’s additive regression model for censored data to the veteran data. The documentation states: “The Aalen model assumes that the cumulative hazard H(t) for a subject can be expressed as a(t) + X B(t), where a(t) is a time-dependent intercept term, X is the vector of covariates for the subject (possibly time-dependent), and B(t) is a time-dependent matrix of coefficients.”

    The plots show how the effects of the covariates change over time. Notice the steep slope and then abrupt change in slope of karno.

    aa_fit <- aareg(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior, data = vet)
    aa_fit
    ## Call:
    ## aareg(formula = Surv(time, status) ~ trt + celltype + karno + 
    ##     diagtime + age + prior, data = vet)
    ## 
    ##   n= 137 
    ##     75 out of 97 unique event times used
    ## 
    ##                       slope      coef se(coef)      z        p
    ## Intercept          0.083400  3.81e-02 1.09e-02  3.490 4.79e-04
    ## trttest            0.006730  2.49e-03 2.58e-03  0.967 3.34e-01
    ## celltypesmallcell  0.015000  7.30e-03 3.38e-03  2.160 3.09e-02
    ## celltypeadeno      0.018400  1.03e-02 4.20e-03  2.450 1.42e-02
    ## celltypelarge     -0.001090 -6.21e-04 2.71e-03 -0.229 8.19e-01
    ## karno             -0.001180 -4.37e-04 8.77e-05 -4.980 6.28e-07
    ## diagtime          -0.000243 -4.92e-05 1.64e-04 -0.300 7.65e-01
    ## age               -0.000246 -6.27e-05 1.28e-04 -0.491 6.23e-01
    ## priorYes           0.003300  1.54e-03 2.86e-03  0.539 5.90e-01
    ## 
    ## Chisq=41.62 on 8 df, p=1.6e-06; test weights=aalen
    #summary(aa_fit)  # provides a more complete summary of results
    autoplot(aa_fit)

    Random Forests Model

    As a final example of what some might perceive as a data-science-like way to do time-to-event modeling, I’ll use the ranger() function to fit a Random Forests Ensemble model to the data. Note however, that there is nothing new about building tree models of survival data. Terry Therneau also wrote the rpart package, R’s basic tree-modeling package, along with Brian Ripley. See section 8.4 for the rpart vignette [14] that contains a survival analysis example.

    ranger() builds a model for each observation in the data set. The next block of code builds the model using the same variables used in the Cox model above, and plots twenty random curves, along with a curve that represents the global average for all of the patients. Note that I am using plain old base R graphics here.

    # ranger model
    r_fit <- ranger(Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
                    data = vet,
                    importance = "permutation",
                    seed = 1234)

    # Plot 20 random patient-level survival curves plus the average curve
    death_times <- r_fit$unique.death.times
    avg_prob <- colMeans(r_fit$survival)

    plot(death_times, avg_prob, type = "n", ylim = c(0, 1),
         xlab = "Days", ylab = "Survival probability")
    for (i in sample(nrow(vet), 20)) {
      lines(death_times, r_fit$survival[i, ], col = "gray")
    }
    lines(death_times, avg_prob, lwd = 2)

    The next block of code illustrates how ranger() ranks variable importance.

    vi <- data.frame(sort(round(r_fit$variable.importance, 4), decreasing = TRUE))
    names(vi) <- "importance"
    head(vi)
    ##          importance
    ## karno        0.0734
    ## celltype     0.0338
    ## diagtime     0.0003
    ## trt         -0.0007
    ## prior       -0.0011
    ## age         -0.0027

    Notice that ranger() flags karno and celltype as the two most important; the same variables with the smallest p-values in the Cox model. Also note that the importance results just give variable names and not level names. This is because ranger and other tree models do not usually create dummy variables.

    But ranger() does compute Harrell’s c-index (See [8] p. 370 for the definition), which is similar to the Concordance statistic described above. This is a generalization of the ROC curve, which reduces to the Wilcoxon-Mann-Whitney statistic for binary variables, which in turn, is equivalent to computing the area under the ROC curve.

    cat("Prediction Error = 1 - Harrell's c-index = ", r_fit$prediction.error)
    ## Prediction Error = 1 - Harrell's c-index =  0.308269

    An ROC value of about .69 would normally be pretty good for a first try. But note that the ranger model doesn't do anything to address the time varying coefficients. This apparently is a challenge. In a 2011 paper [16], Bou-Hamad observes:

    However, in the context of survival trees, a further difficulty arises when time–varying effects are included. Hence, we feel that the interpretation of covariate effects with tree ensembles in general is still mainly unsolved and should attract future research.

    I believe that the major use for tree-based models for survival data will be to deal with very large data sets.

    Finally, to provide an "eyeball comparison" of the three survival curves, I'll plot them on the same graph. The following code pulls out the survival data from the three model objects and puts them into a data frame for ggplot().

    # Set up for ggplot
    km_df  <- data.frame(Time = km_fit$time,  Surv = km_fit$surv,  Model = "KM")
    cox_df <- data.frame(Time = cox_fit$time, Surv = cox_fit$surv, Model = "Cox")
    rf_df  <- data.frame(Time = r_fit$unique.death.times, Surv = avg_prob, Model = "RF")

    plot_df <- rbind(km_df, cox_df, rf_df)

    ggplot(plot_df, aes(x = Time, y = Surv, color = Model)) + geom_line()

    For this data set, I would put my money on a carefully constructed Cox model that takes into account the time varying coefficients. I suspect that there are neither enough observations nor enough explanatory variables for the ranger() model to do better.

    This four-package excursion only hints at the Survival Analysis tools that are available in R, but it does illustrate some of the richness of the R platform, which has been under continuous development and improvement for nearly twenty years. The ranger package, which suggests the survival package, and ggfortify, which depends on ggplot2 and also suggests the survival package, illustrate how open-source code allows developers to build on the work of their predecessors. The documentation that accompanies the survival package, the numerous online resources, and the statistics such as concordance and Harrell's c-index packed into the objects produced by fitting the models give some idea of the statistical depth that underlies almost everything in R.

    Some Tutorials and Papers

    For a very nice, basic tutorial on survival analysis, have a look at the Survival Analysis in R [5] and the OIsurv package produced by the folks at OpenIntro.

    Look here for an exposition of the Cox Proportional Hazard’s Model, and here [11] for an introduction to Aalen’s Additive Regression Model.

    For an elementary treatment of evaluating the proportional hazards assumption that uses the veterans data set, see the text by Kleinbaum and Klein [13].

    For an exposition of the sort of predictive survival analysis modeling that can be done with ranger, be sure to have a look at Manuel Amunategui’s post and video.

    See the 1995 paper [15] by Intrator and Kooperberg for an early review of using classification and regression trees to study survival data.

    References

    For convenience, I have collected the references used throughout the post here.

    [1] Hacking, Ian. (2006) The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference. Cambridge University Press, 2nd ed., p. 11.
    [2] Andersen, P.K. and Keiding, N. (1998) Survival analysis. Encyclopedia of Biostatistics 6. Wiley, pp. 4452-4461.
    [3] Kaplan, E.L. & Meier, P. (1958) Non-parametric estimation from incomplete observations. J American Stats Assn. 53, pp. 457–481, 562–563.
    [4] Cox, D.R. (1972) Regression models and life-tables (with discussion). Journal of the Royal Statistical Society (B) 34, pp. 187–220.
    [5] Diez, David. Survival Analysis in R. OpenIntro.
    [6] Klein, John P. and Moeschberger, Melvin L. (1997) Survival Analysis: Techniques for Censored and Truncated Data. Springer.
    [7] Wright, Marvin & Ziegler, Andreas. (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. JSS, Vol 77, Issue 1.
    [8] Harrell, Frank, Lee, Kerry & Mark, Daniel. (1996) Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors. Statistics in Medicine, Vol 15, pp. 361-387.
    [9] Amunategui, Manuel. Survival Ensembles: Survival Plus Classification for Improved Time-Based Predictions in R.
    [10] NUS Course Notes. Chapter 3: The Cox Proportional Hazards Model.
    [11] Encyclopedia of Biostatistics, 2nd Edition (2005). Aalen's Additive Regression Model.
    [12] Therneau, T., Crowson, C. and Atkinson, E. Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model.
    [13] Kleinbaum, D.G. and Klein, M. (2005) Survival Analysis: A Self-Learning Text. Springer.
    [14] Therneau, T. and Atkinson, E. An Introduction to Recursive Partitioning Using RPART Routines.
    [15] Intrator, O. and Kooperberg, C. (1995) Trees and splines in survival analysis. Statistical Methods in Medical Research.
    [16] Bou-Hamad, I. (2011) A review of survival trees. Statistics Surveys, Vol. 5.

    Author's note: this post was originally published on April 26, 2017 but was subsequently withdrawn because of an error spotted by Dr. Terry Therneau. He observed that the Cox Proportional Hazards Model fitted in that post did not properly account for the time varying covariates. This revised post makes use of a different data set, and points to resources for addressing time varying covariates. Many thanks to Dr. Therneau. Any errors that remain are mine.

    To leave a comment for the author, please follow the link and comment on their blog: R Views.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News