Le Monde puzzle [#1048]

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

An arithmetic Le Monde mathematical puzzle:

A magical integer m is such that the remainder of the division of any prime number p by m is either a prime number or 1. What is the unique magical integer between 25 and 100? And is there any less than 25?

The question is dead easy to code

library(primes)
primz = c(1, generate_primes(2, 1e6))   # 1 plus all primes up to 1e6
for (y in 25:10000)
  # y is magical if every remainder p %% y (for primes p > y) is 1 or a prime
  if (min((primz[primz > y] %% y) %in% primz) == 1) print(y)

and returns m=30 as the only solution. Bon sang but of course!, since 30=2x3x5… (Actually, the result follows by dividing the quotient of the division of a prime number by 2 by 3 and then the resulting quotient by 5: all possible cases produce a remainder that is a prime number.) For the second question, the same code returns 2,3,4,6,8,12,18,24 as further solutions. There is no solution beyond 30.
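
As a quick sanity check (a minimal sketch using the same primes package loaded above, which also provides is_prime()), the property can be verified for m=30 directly:

p <- generate_primes(31, 1e6)   # all primes above the candidate m = 30
r <- unique(p %% 30)            # remainders that actually occur
all(r == 1 | is_prime(r))       # TRUE: every remainder is 1 or a prime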

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.

Source:: R News

Which world leaders are twitter bots?

By R on The Jumping Rivers Blog

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

Set-up

Given that I do quite like twitter, I thought it would be a good idea to write about R’s interface to the twitter API: rtweet. We can grab the package in the usual way. We’re also going to need the tidyverse for the analysis, rvest for some initial webscraping of twitter names, lubridate for some date manipulation and stringr for some minor text mining.

install.packages(c("rtweet", "tidyverse", "rvest", "lubridate"))
library("rtweet")
library("tidyverse")
library("rvest")
library("lubridate")

Getting the tweets

So, I could just write the names of twitter’s 10 most followed world leaders, but what would be the fun in that? We’re going to scrape them from twiplomacy using rvest and a chrome extension called selector gadget:

world_leaders = read_html("https://twiplomacy.com/ranking/the-50-most-followed-world-leaders-in-2017/")
lead_r = world_leaders %>% 
  html_nodes(".ranking-entry .ranking-user-name") %>%
  html_text() %>%
  str_replace_all("\t|\n|@", "")
head(lead_r)
## [1] "realdonaldtrump" "pontifex"        "narendramodi"    "pmoindia"       
## [5] "potus"           "whitehouse"

The string inside html_nodes() is gathered using selector gadget. See this great tutorial on rvest and for more on selector gadget read vignette("selectorgadget"). Tabs (\t) and linebreaks (\n) are removed with str_replace_all() from the stringr package.

Now we can collect the twitter data using rtweet. We can use the function lookup_users() to grab basic user info such as number of tweets, friends, favourites and followers. Obviously analysing all 50 leaders at once would be a pain. So we’re only going to take the top 10 (WARNING: this could take a while)

lead_r_info = lookup_users(lead_r[1:10])
lead_r_info
## # A tibble: 10 x 20
##    user_id  name   screen_name  location description       url   protected
##    <chr>    <chr>  <chr>        <chr>    <chr>             <chr> <lgl>    
##  1 25073877 Donal… realDonaldT… Washing… 45th President o… http… FALSE    
##  2 5007043… Pope … Pontifex     Vatican… Welcome to the o… http… FALSE    
##  3 18839785 Naren… narendramodi India    Prime Minister o… http… FALSE    
##  4 4717417… PMO I… PMOIndia     "India " Office of the Pr… http… FALSE    
##  5 8222156… Presi… POTUS        Washing… 45th President o… http… FALSE    
##  6 8222156… The W… WhiteHouse   Washing… Welcome to @Whit… http… FALSE    
##  7 68034431 Recep… RT_Erdogan   Ankara,… Türkiye Cumhurba… <NA>  FALSE    
##  8 2196174… Sushm… SushmaSwaraj New Del… Minister of Exte… <NA>  FALSE    
##  9 3669871… Joko … jokowi       Jakarta  Akun resmi Joko … <NA>  FALSE    
## 10 44335525 HH Sh… HHShkMohd    Dubai, … Official Tweets … http… FALSE    
## # ... with 13 more variables: followers_count <int>, friends_count <int>,
## #   listed_count <int>, statuses_count <int>, favourites_count <int>,
## #   account_created_at <dttm>, verified <lgl>, profile_url <chr>,
## #   profile_expanded_url <chr>, account_lang <chr>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

We only want the columns of interest (name, followers_count, friends_count, statuses_count and favourites_count) and then we want the data in long format. To do this we’re going to use select() and gather()

(lead_r_info = lead_r_info %>%  
  select(name, followers_count, friends_count, statuses_count, favourites_count, screen_name) %>% 
  gather(type, value, -name, -screen_name))
## # A tibble: 40 x 4
##    name                 screen_name     type               value
##    <chr>                <chr>           <chr>              <int>
##  1 Donald J. Trump      realDonaldTrump followers_count 48426576
##  2 Pope Francis         Pontifex        followers_count 16858642
##  3 Narendra Modi        narendramodi    followers_count 40710139
##  4 PMO India            PMOIndia        followers_count 25156203
##  5 President Trump      POTUS           followers_count 22324998
##  6 The White House      WhiteHouse      followers_count 16713369
##  7 Recep Tayyip Erdoğan RT_Erdogan      followers_count 12513987
##  8 Sushma Swaraj        SushmaSwaraj    followers_count 11461412
##  9 Joko Widodo          jokowi          followers_count  9693534
## 10 HH Sheikh Mohammed   HHShkMohd       followers_count  8764455
## # ... with 30 more rows

Now we can use the fantastic ggplot to plot the respective counts for each world leader

ggplot(data = lead_r_info, 
       aes(x = reorder(name, value), y = value,fill = type, colour = type)) +
  geom_col() +
  facet_wrap(~type, scales = "free") + 
  theme_minimal() + 
  theme(
    strip.background = element_blank(),
    strip.text = element_blank(),
    title = element_blank(),
    axis.text.x = element_blank()
  ) + 
  coord_flip() +
  geom_text(aes(y = value, label = value), colour = "black", hjust = "inward")

Notice Donald trumps everyone in the followers and status area (from what I hear he’s quite a prolific tweeter); however, Sushma Swaraj and Narendra Modi trump everyone when it comes to favourites and friends respectively.

Now, we’re going to use the function get_timelines() to retrieve the last 2000 tweets by each leader. Again this may take a while!

lead_r_tl = get_timelines(lead_r, n = 2000)

Unfortunately get_timelines() only gives us their twitter handle and doesn’t return their actual name. So I’m going to use select() and left_join() to add the column of names to make for easier reading on the upcoming graphs

names = select(lead_r_info, name, screen_name)
lead_r_twitt = left_join(lead_r_tl, names, by = "screen_name")

get_timelines() gives us the source of a person’s tweet, i.e. iPhone, iPad, Android etc. So, what is the most popular tweet source among world leaders?

lead_r_twitt %>% 
  count(source) %>% 
  ggplot(aes(x = reorder(source, n), y = n)) + 
  geom_col(fill = "cornflowerblue") +
  theme_minimal() + 
  theme(
    strip.background = element_blank(),
    axis.text.x = element_blank()
    ) + 
  labs(
    x = NULL,
    y = NULL,
    title = "Tweet sources for world leaders"
  ) + 
  coord_flip() + 
  geom_text(aes(y = n, label = n), hjust = "inward")

Either world leaders really love iPhones or their social media / security teams do. Probably the latter. I can hear you all asking the question: which source is more likely to give a world leader more retweets and favourites? To answer this we’re going to summarise each source by its mean number of retweets and favourites and then gather the data into a long format for plotting

lead_r_twitt %>% 
  group_by(source) %>% 
  summarise(Retweet = mean(retweet_count), Favourite = mean(favorite_count)) %>% 
  gather(type,value,-source) %>%  
  ggplot(aes(x = reorder(source, value), y = value, fill = type)) + 
  geom_col() + 
  facet_wrap(~type, scales = "free") +
  theme_minimal() +  
  labs(
    x = "Source",
    y = NULL,
    title = "Which source is more likely to get more retweets and favourites?", 
    subtitle = "Values are the mean in each group"
  ) + 
  theme(
    legend.position = "none",
    axis.text.x = element_blank()
  ) + 
  geom_text(aes(y = value, label = round(value, 0)), colour = "black", hjust = "inward") + 
  coord_flip()

Naturally this leads me to the question of which leader, over their previous 2000 tweets, has the most overall retweets and favourites, and who has the highest average number of retweets and favourites?

lead_r_twitt %>% 
  group_by(name) %>% 
  summarise(rt_total = sum(retweet_count), fav_total = sum(favorite_count),
            rt_mean = mean(retweet_count), fav_mean = mean(favorite_count)) %>% 
  gather(type, value, -name) %>% 
  ggplot(aes(x = reorder(name, value), y = value, fill = type)) +
  geom_col() + 
  labs(
    x = NULL,
    y = NULL,
    title= "Mean and total retweets/favourites for each world leader"
  ) + 
  coord_flip() + 
  facet_wrap(~type, scales = "free") + 
  theme_minimal()

What about the mean retweets and favourites per month? ts_plot() provides us with a quick way to turn the data into a time series plot. However this wouldn’t work for me, so I’m doing it the dplyr way. I’m going to use a monthly time series, so first we need to aggregate our data into months. The function rollback(), from lubridate, is fantastic for this. It will roll a date back to the first day of that month whilst also getting rid of the time information.

lead_r_twit2 = lead_r_twitt %>% 
  mutate(year_month = rollback(created_at, roll_to_first = TRUE, preserve_hms = FALSE)) %>% 
  group_by(name, year_month) %>% 
  mutate(fav_mean = mean(favorite_count), rt_mean  = mean(retweet_count))

We now have two columns, fav_mean and rt_mean, that contain the mean number of favourites and retweets for each leader in each month. We can use select() and gather() to select the variables we want, then turn this into long data for plotting

lead_r_twit2 = lead_r_twit2 %>% 
  select(name, year_month, fav_mean, rt_mean) %>% 
  gather(type, value, -name, -year_month)

Now we plot

lead_r_twit2 %>% 
  ggplot(aes(x = year_month, y = value, colour = name))  + 
  geom_line() +
  facet_wrap(~type, scales = "free", nrow = 2) + 
  labs(
    x = NULL,
    y = NULL,
    title = "Mean number of favourites/month for world leaders"
  ) +
  theme_minimal() 

Are world leaders actually bots?

botrnot is a fantastic package that uses machine learning to calculate the probability that a twitter user is a bot. So the obvious next question is, are our world leaders bots or not?

We need to install the development package from GitHub and we also need to install the GitHub version of rtweet

devtools::install_github("mkearney/botrnot")
devtools::install_github("mkearney/rtweet")
library("botrnot")
library("rtweet")

The only function, botornot(), works on either given user names or the output of the get_timelines() function from rtweet. To keep in line with the rest of the blog, we’re going to use the output we’ve already created from get_timelines(), stored in lead_r_tl

bot = botornot(lead_r_tl) %>%
  arrange(prob_bot)

For a clearer look at the probabilities I’m going to plot them with their actual names instead of the screen names

bot %>% 
  rename(screen_name =  user) %>% 
  inner_join(distinct(names), by = "screen_name") %>% 
  select(name, prob_bot) %>% 
  arrange(prob_bot) %>% 
  ggplot()  + 
  geom_col(aes(x = reorder(name, -prob_bot), y = prob_bot), fill = "cornflowerblue") + 
  coord_flip() + 
  labs(y = "Probability of being a bot", 
       x = "World leader", 
       title = "Probability of world leaders being a bot") + 
  theme_minimal()

So apparently we are almost certain Donald J. Trump isn’t a bot and very very nearly certain the Pope is a bot!

That’s all for this time, thanks for reading!


To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog.

Source:: R News

Run Python from R

By Deepanshu Bhalla

(This article was first published on ListenData, and kindly contributed to R-bloggers)

This article explains how to call or run Python from R. Both tools have their own advantages and disadvantages, and it’s always a good idea to combine the best packages and functions from each. In the data science world, both tools have a good market share in terms of usage. R is mainly known for data analysis, statistical modeling and visualization, while Python is popular for deep learning and natural language processing.

In a recent KDnuggets Analytics software survey poll, Python and R were ranked as the top 2 tools for data science and machine learning. If you really want to boost your career in data science, these are the languages you need to focus on.

Combine Python and R

RStudio developed a package called reticulate which provides a medium to run Python packages and functions from R.

Install and Load Reticulate Package

Run the command below to get this package installed and imported to your system.

# Install reticulate package
install.packages("reticulate")

# Load reticulate package
library(reticulate)

Check whether Python is available on your system

py_available()

It returns TRUE/FALSE. If it is TRUE, it means python is installed on your system.
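
If py_available() returns FALSE, reticulate can usually be pointed at a specific Python installation before anything is imported. A small sketch (the path below is illustrative only, not from the original article):

# Point reticulate at a particular Python binary and initialize it
use_python("C:/Users/DELL/Anaconda3/python.exe", required = TRUE)
py_available(initialize = TRUE)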

Import a python module within R

You can use the function import( ) to import a particular package or module.

os <- import("os")
os$getcwd()

The above code returns the working directory.
[1] "C:\\Users\\DELL\\Documents"

You can use the listdir() function from the os module to see all the files in the working directory

os$listdir()

 [1] ".conda"                       ".gitignore"                   ".httr-oauth"                 
[4] ".matplotlib" ".RData" ".RDataTmp"
[7] ".Rhistory" "1.pdf" "12.pdf"
[10] "122.pdf" "124.pdf" "13.pdf"
[13] "1403.2805.pdf" "2.pdf" "3.pdf"
[16] "AIR.xlsx" "app.r" "Apps"
[19] "articles.csv" "Attrition_Telecom.xlsx" "AUC.R"

Install Python Package

Step 1 : Create a new environment

conda_create("r-reticulate")

Step 2 : Install a package within a conda environment

conda_install("r-reticulate", "numpy")

Since numpy is already installed, you don’t need to install it again. The above example is just for demonstration.

Step 3 : Load the package

numpy <- import("numpy")
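
If you want the import to pick up the conda environment created in Step 1 (and the numpy installed into it in Step 2), you can activate that environment first. A small sketch, not part of the original article:

# Activate the environment created above, then import
use_condaenv("r-reticulate", required = TRUE)
numpy <- import("numpy")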

Working with numpy array

Let’s create a sample numpy array

y <- array(1:4, c(2, 2))
x <- numpy$array(y)
x

     [,1] [,2]
[1,]    1    3
[2,]    2    4

Transpose the above array

numpy$transpose(x)

     [,1] [,2]
[1,]    1    2
[2,]    3    4

Eigenvalues and eigenvectors

numpy$linalg$eig(x)

[[1]]
[1] -0.3722813  5.3722813

[[2]]
           [,1]       [,2]
[1,] -0.9093767 -0.5657675
[2,]  0.4159736 -0.8245648

Mathematical Functions

numpy$sqrt(x)
numpy$exp(x)

Working with Python interactively

You can create an interactive Python console within R session. Objects you create within Python are available to your R session (and vice-versa).
By using the repl_python() function, you can make it interactive. Download the dataset used in the program below.

repl_python()
# Load Pandas package
import pandas as pd
# Importing Dataset
travel = pd.read_excel("AIR.xlsx")
# Number of rows and columns
travel.shape

# Select random no. of rows
travel.sample(n = 10)

# Group By
travel.groupby("Year").AIR.mean()

# Filter
t = travel.loc[(travel.Month >= 6) & (travel.Year >= 1955),:]

# Return to R
exit

Note : You need to enter exit to return to the R environment.

Run Python from R

How to access objects created in python from R

You can use the py object to access objects created within python.

summary(py$t)

In this case, I am using R’s summary( ) function and accessing dataframe t which was created in python. Similarly, you can create line plot using ggplot2 package.

# Line chart using ggplot2
library(ggplot2)
ggplot(py$t, aes(AIR, Year)) + geom_line()

How to access objects created in R from Python

You can use the r object to accomplish this task.
1. Let’s create an object in R

mydata = head(cars, n=15)

2. Use the R created object within Python REPL

repl_python()
import pandas as pd
r.mydata.describe()
pd.isnull(r.mydata.speed)
exit

Building Logistic Regression Model using sklearn package

The sklearn package is one of the most popular packages for machine learning in Python. It supports various statistical and machine learning algorithms.

repl_python()
# Load libraries
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
# load the iris datasets
iris = datasets.load_iris()
# Developing logit model
model = LogisticRegression()
model.fit(iris.data, iris.target)
# Scoring
actual = iris.target
predicted = model.predict(iris.data)
# Performance Metrics
print(metrics.classification_report(actual, predicted))
print(metrics.confusion_matrix(actual, predicted))

Other Useful Functions

To see configuration of python

Run the py_config() command to find the version of Python installed on your system. It also shows details about Anaconda and NumPy.

py_config()

python:         C:\Users\DELL\ANACON~1\python.exe
libpython:      C:/Users/DELL/ANACON~1/python36.dll
pythonhome:     C:\Users\DELL\ANACON~1
version:        3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:\Users\DELL\ANACON~1\lib\site-packages\numpy
numpy_version:  1.14.2

To check whether a particular package is installed

In the following program, we are checking whether the pandas package is installed or not.

py_module_available("pandas")

About Author:

Deepanshu founded ListenData with a simple objective – Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.

Let’s Get Connected: LinkedIn

To leave a comment for the author, please follow the link and comment on their blog: ListenData.

Source:: R News

7 Jobs for R users from around the world (2018-03-27)

By Tal Galili

To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

  1. Full-Time
    Director, Data Labs Pew Research Center – Posted by oneclickorders@jobtarget.com
    Washington District of Columbia, United States
    12 Mar 2018
  2. Full-Time
    Customer Success Rep RStudio Inc. – Posted by jclemens1
    Anywhere
    26 Feb 2018

More New R Jobs

  1. Full-Time
    Post-Doctoral Fellow – Data Scientist @ CIMMYT (eastern Africa) International Maize and Wheat Improvement Center – Posted by cimmyt-jobs
    Marsabit County Kenya
    27 Mar 2018
  2. Freelance
    Statistician / R Developer – for Academic Statistical Research Academic Research – Posted by empiricus
    Anywhere
    27 Mar 2018
  3. Full-Time
    Graduate Data Scientist JamieAi – Posted by JamieAi
    Anywhere
    27 Mar 2018
  4. Freelance
    Backtesting stock-trading strategy termod
    Anywhere
    14 Mar 2018
  5. Full-Time
    Director, Data Labs Pew Research Center – Posted by oneclickorders@jobtarget.com
    Washington District of Columbia, United States
    12 Mar 2018
  6. Freelance
    R Report Builder DDI – Posted by survey.support@ddiworld.com
    Anywhere
    10 Mar 2018
  7. Freelance
    Statistician/Econometrician – R Programmer for Academic Statistical Research (for 2-3 weeks) Academic Research – Posted by empiricus
    Anywhere
    2 Mar 2018

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts ).

Source:: R News

Steps to Perform Survival Analysis in R

By Perceptive Analytics

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Another way of analysis?

When there are so many tools and techniques of prediction modelling, why do we have another field known as survival analysis? As one of the most popular branches of statistics, survival analysis is a way of making predictions at various points in time. This is to say, while other prediction models predict whether an event will occur, survival analysis predicts whether the event will occur at a specified time. Thus, it requires a time component for prediction and, correspondingly, predicts the time when an event will happen. This helps one understand the expected duration of time until events occur and provides much more useful information. One can think of natural areas of application of survival analysis, which include the biological sciences, where one can predict the time for bacteria or other cellular organisms to multiply to a particular size, or the expected time of decay of atoms. Some interesting applications include prediction of the expected time when a machine will break down and maintenance will be required.

How hard does it get..

It is not easy to apply the concepts of survival analysis right off the bat. One needs to understand the ways it can be used first. This includes Kaplan-Meier curves, creating the survival function through tools such as survival trees or survival forests, and log-rank tests.
Let’s go through each of them one by one in R. We will use the survival package in R as a starting example. The survival package has the Surv() function that is the center of survival analysis.

# install.packages("survival")
# Loading the package
library("survival")


The package contains a sample dataset for demonstration purposes. The dataset is pbc, which contains a 10 year study of 424 patients having Primary Biliary Cirrhosis (pbc) treated at the Mayo Clinic. A point to note here from the dataset description is that out of 424 patients, 312 participated in the trial of the drug D-penicillamine and the remaining 112 consented to have their basic measurements recorded and be followed for survival but did not participate in the trial. 6 of these 112 cases were lost.
We are particularly interested in the ‘time’ and ‘status’ features in the dataset. Time represents the number of days after registration, and status is the patient’s final status (which can be censored, liver transplant or dead). Since it is survival, we will consider the status as dead or not-dead (transplant or censored). Further details about the dataset can be read from the command:

#Dataset description
?pbc


We start with a direct application of the Surv() function and pass it to the survfit() function. The Surv() function will take the time and status parameters and create a survival object out of it. The survfit() function takes a survival object (the one which Surv() produces) and creates the survival curves.

#Fitting the survival model
survival_func=survfit(Surv(pbc$time,pbc$status == 2)~1)
survival_func

Call: survfit(formula = Surv(pbc$time, pbc$status == 2) ~ 1)

      n  events  median 0.95LCL 0.95UCL 
    418     161    3395    3090    3853 


The function gives us the number of observations, the number of events (deaths), the median survival time and its 95% confidence interval. The model can also be plotted.

#Plot the survival model
plot(survival_func)

As expected, the plot shows us decreasing probabilities of survival as time passes. The dashed lines are the upper and lower confidence intervals. In the survfit() function here, we passed the formula as ~ 1, which indicates that we are asking the function to fit the model solely on the basis of the survival object and thus have an intercept. The output, along with the confidence intervals, are actually Kaplan-Meier estimates. This estimate is prominent in medical research survival analysis. The Kaplan-Meier estimate is based on the number of patients (each patient as a row of data) out of the total number at risk who survive for a certain time after treatment (which is the event). We can represent the Kaplan-Meier function by the formula:

Ŝ(t) = ∏ (1 - d_i/n_i) over all i with t_i ≤ t
Here, d_i is the number of events at time t_i and n_i is the total number of people at risk at time t_i.
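
As a quick illustration of this formula (a sketch that is not part of the original article), the product-limit estimate can be computed by hand from the pbc data and compared with the survfit() output:

#Hand-rolled Kaplan-Meier estimate (death = status 2)
event_times <- sort(unique(pbc$time[pbc$status == 2]))
d_i <- sapply(event_times, function(t) sum(pbc$time == t & pbc$status == 2))
n_i <- sapply(event_times, function(t) sum(pbc$time >= t))
S_hat <- cumprod(1 - d_i / n_i)
head(S_hat)   # compare with head(summary(survival_func)$surv)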

What to make of the graph?

Unlike other machine learning techniques, where one uses test samples and makes predictions over them, the survival analysis curve is a self-explanatory curve. From the curve, we see that the probability of surviving about 1000 days after treatment is roughly 0.8 or 80%. We can similarly read off the probability of survival for different numbers of days after treatment. At the same time, we also have the confidence interval ranges which show the margin of expected error. For example, in the case of surviving 1000 days, the upper confidence interval reaches about 0.85 or 85% and the lower one goes down to about 0.75 or 75%. Beyond the data range, which is 10 years or about 3500 days, the probability calculations are very erratic and vague and should not be relied upon. For example, if one wants to know the probability of surviving 4500 days after treatment, then though the Kaplan-Meier graph above shows a range between 0.25 and 0.55 (which is itself a large interval reflecting the lack of data), the data are still not sufficient and better data would be needed to make such an estimate.

Alternative models: Cox Proportional Hazard model

The survival package also contains the Cox proportional hazards function coxph(), which can use other features in the data to make a better survival model. Though the data has untreated missing values, I am skipping the data processing and fitting the model directly. In practice, however, one needs to study the data and look at ways to process it appropriately so that the best possible models are fitted. As the intention of this article is to get the readers acquainted with the function rather than with processing, applying the function directly is the shortcut I am taking.

# Fit Cox Model
Cox_model = coxph(Surv(pbc$time,pbc$status==2) ~.,data=pbc)
summary(Cox_model)

Call:
coxph(formula = Surv(pbc$time, pbc$status == 2) ~ ., data = pbc)

  n= 276, number of events= 111 
   (142 observations deleted due to missingness)

                 coef  exp(coef)   se(coef)      z Pr(>|z|)   
id         -2.729e-03  9.973e-01  1.462e-03 -1.866  0.06203 . 
trt        -1.116e-01  8.944e-01  2.156e-01 -0.518  0.60476   
age         3.191e-02  1.032e+00  1.200e-02  2.659  0.00784 **
sexf       -3.822e-01  6.824e-01  3.074e-01 -1.243  0.21378   
ascites     6.321e-02  1.065e+00  3.874e-01  0.163  0.87038   
hepato      6.257e-02  1.065e+00  2.521e-01  0.248  0.80397   
spiders     7.594e-02  1.079e+00  2.448e-01  0.310  0.75635   
edema       8.860e-01  2.425e+00  4.078e-01  2.173  0.02980 * 
bili        8.038e-02  1.084e+00  2.539e-02  3.166  0.00155 **
chol        5.151e-04  1.001e+00  4.409e-04  1.168  0.24272   
albumin    -8.511e-01  4.270e-01  3.114e-01 -2.733  0.00627 **
copper      2.612e-03  1.003e+00  1.148e-03  2.275  0.02290 * 
alk.phos   -2.623e-05  1.000e+00  4.206e-05 -0.624  0.53288   
ast         4.239e-03  1.004e+00  1.941e-03  2.184  0.02894 * 
trig       -1.228e-03  9.988e-01  1.334e-03 -0.920  0.35741   
platelet    7.272e-04  1.001e+00  1.177e-03  0.618  0.53660   
protime     1.895e-01  1.209e+00  1.128e-01  1.680  0.09289 . 
stage       4.468e-01  1.563e+00  1.784e-01  2.504  0.01226 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

          exp(coef) exp(-coef) lower .95 upper .95
id           0.9973     1.0027    0.9944     1.000
trt          0.8944     1.1181    0.5862     1.365
age          1.0324     0.9686    1.0084     1.057
sexf         0.6824     1.4655    0.3736     1.246
ascites      1.0653     0.9387    0.4985     2.276
hepato       1.0646     0.9393    0.6495     1.745
spiders      1.0789     0.9269    0.6678     1.743
edema        2.4253     0.4123    1.0907     5.393
bili         1.0837     0.9228    1.0311     1.139
chol         1.0005     0.9995    0.9997     1.001
albumin      0.4270     2.3422    0.2319     0.786
copper       1.0026     0.9974    1.0004     1.005
alk.phos     1.0000     1.0000    0.9999     1.000
ast          1.0042     0.9958    1.0004     1.008
trig         0.9988     1.0012    0.9962     1.001
platelet     1.0007     0.9993    0.9984     1.003
protime      1.2086     0.8274    0.9690     1.508
stage        1.5634     0.6397    1.1020     2.218

Concordance= 0.849  (se = 0.031 )
Rsquare= 0.462   (max possible= 0.981 )
Likelihood ratio test= 171.3  on 18 df,   p=0
Wald test            = 172.5  on 18 df,   p=0
Score (logrank) test = 286.1  on 18 df,   p=0


The Cox model output is similar to how a linear regression output comes up. The R-squared is only 46%, which is not high, and we don’t have any feature which is highly significant. The top important features appear to be age, bilirubin (bili) and albumin. Let’s see what the plot looks like.

#Create a survival curve from the cox model
Cox_curve <- survfit(Cox_model)
plot(Cox_curve)

With more variables included, we get a different plot and this one is more volatile. Compared to the Kaplan-Meier curve, the Cox curve is higher for the initial values and lower for the higher values. The major reason for this difference is the inclusion of covariates in the Cox model. The plots are made by similar functions and can be interpreted the same way as the Kaplan-Meier curve.

Going traditional : Using survival forests

Random forests can also be used for survival analysis and the ranger package in R provides the functionality. However, the ranger() function cannot handle missing values, so I will use a smaller dataset with all rows containing NA values dropped. This reduces the data to only 276 observations.

#Using the Ranger package for survival analysis
install.packages("ranger")
library(ranger)

#Drop rows with NA values
pbc_nadrop=pbc[complete.cases(pbc), ]
#Fitting the random forest
ranger_model <- ranger(Surv(time, status == 2) ~ ., data = pbc_nadrop,
                       importance = "permutation", seed = 1)

Let's look at the variable importance values which the random forest model calculates.

#Get the variable importance
data.frame(sort(ranger_model$variable.importance,decreasing = TRUE))
sort.ranger_model.variable.importance..decreasing...TRUE.

bili                                                    0.0762338981
copper                                                  0.0202733989
albumin                                                 0.0165070226
age                                                     0.0130134413
edema                                                   0.0122113704
ascites                                                 0.0115315711
chol                                                    0.0092889960
protime                                                 0.0060215073
id                                                      0.0055867915
ast                                                     0.0049932803
stage                                                   0.0030225398
hepato                                                  0.0029290675
trig                                                    0.0028869184
platelet                                                0.0012958105
sex                                                     0.0010639806
spiders                                                 0.0005210531
alk.phos                                                0.0003291581
trt                                                     -0.0002020952

These numbers may be different for different runs. In my example, we see that bilirubin is the most important feature.
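
If you would rather see these importances as a plot than as a table, a quick sketch (not part of the original code):

#Barplot of the permutation importances
imp <- sort(ranger_model$variable.importance, decreasing = FALSE)
barplot(imp, horiz = TRUE, las = 1, cex.names = 0.7, xlab = "Permutation importance")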

Lessons learned: Conclusion

Though the input data for Survival package’s Kaplan – Meier estimate, Cox Model and ranger model are all different, we will compare the methodologies by plotting them on the same graph using ggplot.

#Comparing models
library(ggplot2)

#Kaplan-Meier curve dataframe
#Add a row of model name
km 
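
A minimal sketch of how such a comparison plot could be put together (this is not the original code; it assumes the survival_func, Cox_curve and ranger_model objects fitted earlier):

#Combine the three survival curves into one dataframe for ggplot
km_df  <- data.frame(time = survival_func$time, surv = survival_func$surv,
                     model = "Kaplan-Meier")
cox_df <- data.frame(time = Cox_curve$time, surv = Cox_curve$surv,
                     model = "Cox")
rf_df  <- data.frame(time = ranger_model$unique.death.times,
                     surv = colMeans(ranger_model$survival),
                     model = "Survival forest")
plot_df <- rbind(km_df, cox_df, rf_df)

ggplot(plot_df, aes(x = time, y = surv, colour = model)) +
  geom_line() +
  labs(x = "Days", y = "Survival probability")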

We see here that the Cox model is the most volatile with the most data and features. It is higher for lower values and drops down sharply when the time increases. The survival forest curve has the lowest range and resembles the Kaplan-Meier curve. The difference might be because the survival forest was fit on fewer rows. The essence of the plots is that there can be different approaches to the same concept of survival analysis, and one may choose the technique based on one's comfort and situation. Better data, with processed data points and treated missing values, might fetch us a better R-squared and more stable curves. At the same time, they will help better in finding time-to-event cases such as knowing the time when a promotion's effect dies down, knowing when tumors will develop and become significant, and lots of other applications, with a significant chunk of them being from medical science. Survival, as the name suggests, relates to surviving objects and is thus related to event occurrence in a completely different way than typical machine learning. It is important to know this technique to know more and more ways data can help us in solving problems, with time involved in this particular case. Hope this article serves the purpose of giving a glimpse of survival analysis and the feature-rich packages available in R.

Here is the complete code for the article:

# install.packages("survival")
# Loading the package
library("survival")

#Dataset description
?pbc

#Fitting the survival model
survival_func=survfit(Surv(pbc$time,pbc$status == 2)~1)
survival_func

#Plot the survival model
plot(survival_func)

# Fit Cox Model
Cox_model = coxph(Surv(pbc$time,pbc$status==2) ~.,data=pbc)
summary(Cox_model)

#Create a survival curve from the cox model
Cox_curve <- survfit(Cox_model)
plot(Cox_curve)

Author Bio:

This article was contributed by Perceptive Analytics. Madhur Modi, Chaitanya Sagar, Vishnu Reddy and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Source:: R News

DoOR – The Database of Odorant Responses

By rOpenSci – open tools for open science

(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)

Olfactory Coding

Detecting volatile chemicals and encoding these into neuronal activity is a vital task for all animals that is performed by their olfactory sensory systems. While these olfactory systems vary vastly between species regarding their numerical complexity, they are amazingly similar in their general structure. The periphery of olfactory systems consists of different classes of olfactory sensory neurons (OSNs). In mammals, OSNs are located in the nose; in insects, OSNs are located on the antenna. OSN classes are tuned to defined but overlapping sets of odors. Thus a single odor usually elicits differential responses across an ensemble of OSNs. This ensemble code is able to encode thousands of odors, even in comparably simple olfactory systems.

To better understand the underlying logic of this ensemble code, one would ideally want to know the specific ensemble codes each and every chemical elicits across an olfactory system, the so-called “olfactome”. Obviously this is not feasible to tackle experimentally, yet many labs work with e.g. the olfactory system of Drosophila and publish studies containing responses of different odor-OSN combinations.

DoOR takes these heterogeneous data measured in different labs, with different techniques in different metrics and maps/merges them into a common response space, creating a consensus data-set of all odor responses. By this, DoOR provides access to the most-complete version of the Drosophila olfactome available.

This consensus data-set is of interest in various aspects: Scientists working in olfaction use DoOR to plan experiments, i.e. to find odors that activate specific OSNs. Others performing physiological recordings use DoOR to identify OSNs based on measured odor responses and modelers use DoOR to explore the combinatorial code and to predict unmeasured odor responses.

The DoOR Development Process

DoOR started as the bachelor thesis of Shouwen Ma, a student in Giovanni Galizia’s lab at the University of Konstanz. Shouwen took on the tedious task of extracting all odor responses from the existing Drosophila olfaction literature. When he was lucky he found digital versions of these data in the supplemental material of the studies, but in many cases he had to manually copy numbers from printed tables or even extract data from plots.

With this data being available we then formed a team consisting of Giovanni Galizia, Shouwen Ma, Martin Strauch, Anja Nissler and me, who started to discuss ways to integrate this heterogeneous data in order to make it usable (see merging heterogeneous data for our solution). We would then build the R code of what we discussed, with Shouwen performing almost all of the programming of this initial version of DoOR, which was published in 2010 [1].

In 2014 we put the DoOR code on GitHub and started to work on a major update which included rewriting major parts of the code, addition of several new features and the integration of new data and eventually resulted in the publication of DoOR 2.0 [2].

Part of the DoOR 2.0 update was also to ask authors to contribute their original data, especially in the cases where our initial data had to be extracted from plots. It is worth mentioning that while the majority of colleagues were very responsive and helpful, there were cases where we failed to get any response from authors or where data could not be provided as it was completely or partially lost. This underlines how important it is to make use of reliable data repositories and to push data availability policies, e.g. by journals.

Two Packages

The DoOR project consists of two packages, DoOR.data and DoOR.functions. While the first mainly provides the raw data-sets, the latter performs the processing, i.e. the actual merging process, data import and export, data analysis and visualization. We chose this structure to be able to update data and functions independently. This also gives the possibility to easily create new “DoORs” by providing DoOR.functions with another heterogeneous data-set, e.g. olfactory responses from another species or maybe something completely different.

Installation & Loading Data

To install the latest version of both packages from the master branch on GitHub run:

devtools::install_github("ropensci/DoOR.functions")
devtools::install_github("ropensci/DoOR.data")

and then load the packages

library(DoOR.functions)
library(DoOR.data)

and call the load_door_data() function to load all data to the current workspace.

DoOR.data

The raw data in DoOR is stored in data.frames that are named according to the functional unit (i.e. OSN) they correspond to. There are a few header columns containing different chemical descriptors followed by data columns named according to the study these data were measured in. The example below contains all data that were measured from OSNs that express the Or22a olfactory receptor.

> head(Or22a)
  Class               Name                    InChIKey   CID       CAS Hallem.2006.EN
1                  sfr                         SFR   SFR       SFR              4
2 other              water XLYOFNOQVPJJNP-UHFFFAOYSA-N   962 7732-18-5             NA
3 amine ammonium hydroxide VHUUQVKOLVNVRT-UHFFFAOYSA-N 14923 1336-21-6             17
4 amine         putrescine KIDHWZJUCRJVML-UHFFFAOYSA-N  1045  110-60-1             16
5 amine         cadaverine VHRGRCVQAFMJIZ-UHFFFAOYSA-N   273  462-94-2             17
6 amine            ammonia QGZKDVFQNNGYKY-UHFFFAOYSA-N   222 7664-41-7             NA
  Dobritsa.2003.EN Stensmyr.2003.WT Schmuker.2007.TR Pelz.2006.ALEC50 Pelz.2006.AntEC50
1               NA               NA                4               NA                NA
2               NA               NA               NA               NA                NA
3               NA               NA               NA               NA                NA
4               NA               NA                3               NA                NA
5               NA               NA               NA               NA                NA
6               NA               NA                3               NA                NA
  Pelz.2005.ALnmr Pelz.2005.Antnmr Gabler.2013.AL Bruyne.2001.WT Bruyne.2010.WT
1               0                0              0              4              0
2              NA               NA             NA             NA             NA
3              NA               NA             NA             NA             NA
4              NA               NA             NA             NA             NA
5              NA               NA             NA             NA             NA
6              NA               NA             NA             NA             NA
  Marshall.2010.WT Hallem.2004.EN Hallem.2004.WT
1            0.000          6.186          6.915
2           60.571             NA             NA
3           44.286             NA             NA
4           28.000             NA             NA
5               NA             NA             NA
6               NA             NA             NA

DoOR.data also contains the pre-computed consensus data-sets (merged raw data) door_response_matrix and door_response_matrix_non_normalized and some additional meta data.

DoOR.functions

DoOR.functions provides the algorithm to compute the consensus matrix by merging the raw response data, as well as some data analysis & visualization (dplot_*()) functions and some smaller helper functions. I’ll give some basic examples below; for more detailed documentation please see the three vignettes that come with the package (DoOR.functions, DoOR tools & DoOR visualizations).

Merging Heterogeneous Data

This algorithm is the actual core of DoOR. The computation of the consensus response matrix is based on the assumption that, no matter the recording technique (spikes, calcium imaging, …) or the differences in the recording setups, the same monotonic relationship between odor responses applies: if odor A is a weaker ligand for an OSN than ligand B, this should be true no matter the recording technique. We then take two data-sets of a given OSN that have at least a minimal overlap of odors tested, rescale each of these to [0,1] and plot the overlapping odor responses against each other. We then fit a function onto the overlapping responses (picking the best fitting one of eight different functions). Next we project the responses that were measured in only one of the studies onto this function and measure the distance of all points along the function. This data is then again rescaled to [0,1], which gives us a new merged data-set. Then we perform the same step between the merged data-set and the next raw data-set and iterate this process until all data is included. As the sequence of merges has an influence on the final data, wherever possible we compute all possible sequences of fits and compare how well the final consensus data fits all raw data.
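
To make the idea more concrete, here is a deliberately over-simplified sketch of a single pairwise merge step. It is not the DoOR implementation (which chooses among eight candidate functions and optimises the merge sequence); it only illustrates the rescale, fit and project logic for two named response vectors:

rescale01 <- function(x) (x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))

merge_pair <- function(a, b) {
  # a, b: named numeric vectors of odor responses from two studies
  a <- rescale01(a); b <- rescale01(b)
  common <- intersect(names(a), names(b))            # odors measured in both studies
  fit <- lm(y ~ x, data = data.frame(x = a[common], y = b[common]))  # stand-in for DoOR's candidate functions
  only_a <- setdiff(names(a), names(b))              # odors measured only in study a
  projected <- setNames(predict(fit, data.frame(x = a[only_a])), only_a)
  rescale01(c(b, projected))                         # merged consensus of the two studies
}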

Working With Response Data

You don’t have to perform the merging process every time you use DoOR, as DoOR.data contains a pre-computed version of it (door_response_matrix). But you might have your own data that you want to add to DoOR, or you might want to exclude certain data from the merging process. In that case you can easily modify the DoOR data.frames (e.g. Or22a) and then initiate a complete merging process by calling create_door_database(), or just update the receptor you modified via update_door_database('Or22a', permutation = FALSE), as sketched below.
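
For example (a sketch only; the column name below is invented for illustration), adding your own Or22a measurements and re-merging just that receptor could look like this:

Or22a$MyLab.2024 <- NA   # hypothetical column; fill with your measured responses
update_door_database("Or22a", permutation = FALSE)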

If you are interested in the raw response values of specific odors you can use get_responses() to list these across all studies:

# here we use trans_id() to translate odor names into InChIKeys for the lookup
odors  
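
A hedged sketch of the idea (the odorant names and the argument names of trans_id() are assumptions, not taken from the original post):

odors <- trans_id(c("2,3-butanedione", "isopentyl acetate"), from = "Name", to = "InChIKey")
get_responses(odors)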

Please refer to the following vignettes for more tools and for a more detailed description of the above functions: DoOR.functions, DoOR tools

Visualizing Data

We created various plotting functions to visualize the data in DoOR. All of the visualizations depend on ggplot2. One very important task of DoOR is to visualize the ensemble responses that odors elicit across the sensory neurons of Drosophila. dplot_al_map() does this job by mapping the responses onto a color-coded map of the antennal lobe. The antennal lobe is the first odor-processing center in the insect brain. All OSNs of the same class converge onto spherical sub-structures (glomeruli) of the antennal lobe, thus the differential activity across glomeruli corresponds to the ensemble response across OSNs.

Here we plot the ensemble response of 2,3-butanedione, the odor that gives margarine its buttery smell:

dplot_al_map("QSJXEFYPDANLFS-UHFFFAOYSA-N", base_size = 8)

Ensemble response of 2,3-butanedione

Of course there are options to theme the output, either by passing parameters to the function or by adding ggplot2 functions to the call:

dplot_al_map("QSJXEFYPDANLFS-UHFFFAOYSA-N", tag = "", main = "",
  legend = FALSE, scalebar = FALSE)

does the same as

dplot_al_map("QSJXEFYPDANLFS-UHFFFAOYSA-N", tag = "", legend = FALSE) +
  ggplot2::theme(legend.position  = "none") +
  ggplot2::ggtitle('')

Ensemble response of 2,3-butanedione

DoOR comes with more built-in visualizations, please see the visualization vignette for a complete list.

The DoOR Website

The DoOR web-site is an export of the odor & OSN profiles with some background information and links to other databases. It provides a quick overview over the latest “olfactome” and can be found at http://neuro.uni-konstanz.de/DoOR.

Contribute!

  • You published a paper with odor responses or know a study we missed?
  • You want to build DoOR for other organisms?
    • Just fork DoOR.functions and start adapting it to your needs. We are happy to assist!

Acknowledgements

A lot of people contributed to DoOR over the last year. First I would like to thank Remko Duursma and Scott Chamberlain for their reviews. Scott already gave important feedback before the initial submission and helped a lot during the whole review process.
I would also like to thank Giovanni Galizia and the whole team (Shouwen Ma, Martin Strauch, Anja Nissler) that was involved in developing DoOR in the first place, especially Shouwen for digging through all the data and writing the initial code.
I am extremely grateful to all authors that contributed their original data to the project, even if that meant to dig out almost forgotten hard-disks (DoOR.data::unique(door_dataset_info$study)). I’d like to thank Veith Grabe for contributing the antennal lobe model for the dplot_al_map() function, Eduard Szöcs for pointing me to his webchem package that saved me days and weeks of work when translating chemical identifiers. I want to thank Greg Jefferis for suggesting to put the DoOR code onto GitHub in the first place. And I am thankful for all the people that use DoOR in their research and provide us with their feedback.


  1. Galizia, C.G., Münch, D., Strauch, M., Nissler, A., Ma, S., 2010. Integrating heterogeneous odor response data into a common response model: A DoOR to the complete olfactome. Chem. Senses 35, 551–563.
  2. Münch, D., Galizia, C.G., 2016. DoOR 2.0 – Comprehensive Mapping of Drosophila melanogaster Odorant Responses. Sci. Rep. 6, 21841. https://doi.org/10.1038/srep21841
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.

Source:: R News

R 3.4.4 released

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R 3.4.4 has been released, and binaries for Windows, Mac and Linux are now available for download on CRAN. This update (codenamed “Someone to Lean On” — likely a Peanuts reference, though I couldn’t find which one with a quick search) is a minor bugfix release, and shouldn’t cause any compatibility issues with scripts or packages written for prior versions of R in the 3.4.x series.

This update improves automatic timezone detection on some systems, and adds fixes for some unusual corner cases in the statistics library. For a complete list of the changes, check the NEWS file for R 3.4.4 or follow the link below.

R-announce mailing list: R 3.4.4 is released

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Source:: R News

Visualizing MonteCarlo Simulation Results: Mean vs Median

By cleschinski

(This article was first published on R – first differences, and kindly contributed to R-bloggers)

Simulation studies are used in a wide range of areas from risk management, to epidemiology, and of course in statistics. The MonteCarlo package provides tools to automate the design of these kinds of simulation studies in R. The user only has to specify the random experiment he or she wants to conduct and the number of replications. The rest is handled by the package.

So far, the main tool to analyze the results was to look at LaTeX tables generated using the MakeTable() function. Now, the new package version 1.0.4 contains the function MakeFrame() that allows one to represent the simulation results in the form of a dataframe. This makes it very easy to visualize the results using standard tools such as dplyr and ggplot2.

Here, I will demonstrate some of these concepts for a simple example that could be part of an introductory statistics course: the comparison of the mean and the median as estimators for the expected value. For an introduction to the MonteCarlo package click here or see the package vignette.

For a symmetric distribution both the mean and the median are consistent estimators of the expected value, but since the mean is the maximum likelihood estimator, it is more efficient. On the other hand, it is usually stressed that in contrast to the mean, the median is not sensitive to outliers.

To demonstrate this, and to explore the relative magnitude of these effects depending on the sample size, we formulate a suitable random experiment.

library(MonteCarlo)
set.seed(123)

mean_vs_median <- function(n, out){
  sample <- rnorm(n)              # generate sample
  sample[1] <- sample[1] + out    # contaminate first observation with outlier of size out
  list(mean = mean(sample), median = median(sample))
}

The mean_vs_median() function generates a sample of n standard normal distributed random variables and contaminates the first observation with a deterministic outlier of size out. We then calculate both the mean and the median and return them as elements of a list. Note that the function has two arguments – the sample size n and the size of the outlier out.

To analyze the effect of these two parameters on the relative performance of the two estimators, we use the MonteCarlo package. All we have to do is to pass the mean_vs_median function to the MonteCarlo() function and to specify the parameter values we are interested in. In the example below, we look at outliers of size 0, 3, or 5 in a sample of size 5, 25, or 250 and the experiment is repeated 1000 times.

erg_mean_median <- MonteCarlo(func = mean_vs_median, nrep = 1000,
                              param_list = list("n" = c(5, 25, 250), "out" = c(0, 3, 5)))
summary(erg_mean_median)
## Simulation of function: 
## 
## function(n, out){
## 
##   # generate sample
##   sample

This is all the programming that is required to run the simulation study. To be able to generate plots from the results we call the MakeFrame() function on the object returned by the simulation.

df <- MakeFrame(erg_mean_median)
head(df)
##     n out        mean      median
## 1  25   0  0.85512352  1.28055488
## 2 250   0 -0.19633118 -0.09100042
## 3   5   0 -0.07376411 -0.06749168
## 4  25   3  0.59734761  0.59949205
## 5 250   3  0.13122706  0.28618942
## 6   5   3  0.07257960  0.12357383

As one can see, the result is a dataframe. Each row contains information on a single repetition of the random experiment. The first two columns contain the values of the parameters and the other two columns contain the estimates of the mean and the median.

To manipulate the dataframe and to make plots, we load the tidyverse package and convert the dataframe to a tibble.

library(tidyverse)
tbl <- as_tibble(df)

To compare the efficiency of the estimates in the absence of an outlier, we focus on the cases where out=0 and then compare estimates of the distribution of the estimators for different sample sizes. For each sample size we use a different colour, and the mean and the median can be distinguished by the linetypes.

ggplot(filter(tbl, out==0)) + 
  geom_density(aes(x=mean, col=factor(n), linetype="mean")) +
  geom_density(aes(x=median, col=factor(n), linetype="median"))

It is clear to see that the distributions of both the mean and the median are centered around the true expected value of zero. This implies that both estimators are unbiased. However, the distribution of the mean tends to be more concentrated than that of the median. The mean has a smaller variance and is therefore more efficient.

This can also be seen if we calculate the corresponding summary statistics. Using the tibble and the dplyr package, this can be done with a single line of code.

tbl %>% filter(out==0) %>% group_by(n) %>% summarise_each("sd") %>% round(digits=2)
## # A tibble: 3 x 4
##       n   out  mean median
##   <dbl> <dbl> <dbl>  <dbl>
## 1     5     0  0.44   0.52
## 2    25     0  0.19   0.24
## 3   250     0  0.06   0.08

We now consider the effect of outliers on the two estimators. To do so, we generate a similar plot, but now we keep the sample size constant at n=25 and different colors represent outliers of different magnitudes.

ggplot(filter(tbl, n==25)) + 
  geom_density(aes(x=mean, col=factor(out), linetype="mean")) +
  geom_density(aes(x=median, col=factor(out), linetype="median"))

It is clear to see that the dashed lines representing the distribution of the medians are centered around the true mean of zero, irrespective of the size of the outlier. The distribution of the means, on the other hand, shifts further to the right the larger the magnitude of the outlier. This shows that the median is robust to outliers, whereas the mean is not.

Finally, we want to explore the interaction of the effect of the size of the outlier with the sample size. We therefore focus on the mean. Different colors now represent different sizes of the outlier and different linetypes represent different sample sizes.

ggplot(filter(tbl)) + 
  geom_density(aes(x=mean, col=factor(out), linetype=factor(n)))

For an outlier of a given size, we can observe that its impact decreases as the sample size increases. While the effect of the outlier can be quite dramatic for n=5, the effect basically vanishes for n=250. The sensitivity of the mean to outliers is therefore a finite sample property that is less important in larger samples.

As can be seen from this example, conducting simulation studies requires minimal effort, if a package such as MonteCarlo or one of its competitors such as simsalapar or SimDesign is used. The programming required to produce this analysis should be simple enough so that simulations are not restricted to be a tool for research, but can even be used for teaching at an undergraduate level.

To leave a comment for the author, please follow the link and comment on their blog: R – first differences.


Source:: R News

USS proposals: Tail wagging the dog?

By David Firth


(This article was first published on R – Let’s Look at the Figures, and kindly contributed to R-bloggers)

In response to my previous post, “Latest USS proposal: Who would lose most?“, someone asked me about doing the same calculation for the USS JNC-supported proposals from January. For a summary of those January proposals and my comments about their fairness, please see my earlier post “USS pension scheme and fairness“.

Anyway, the calculation is quite simple, and it led to the following graph. The black curve is as in my previous post, and the red one is from the same calculation done for the January USS proposal.

The red curve shows just over 5% effective loss of salary for those below the current £55.55k USS threshold, and then a fairly sharp decline to less than 2% lost at the salaries of the very highest-paid professors, managers and administrators. Under the January proposals, higher-paid staff would contribute proportionately less to the “rescue package” for USS — less, even, than under the March proposals. (And if the salary axis were to be extended indefinitely, the red curve would actually cross the zero-line: that’s because in the January proposals the defined-contribution rate from employers would actually have increased from (max) 13% to 13.25%.)

In terms of unequal sharing of the “pain”, then, the January proposal was even worse than the March one.

At the bottom here I’ll give the R code and a few words of explanation for the calculation of the red curve above.

But the main topic of this post arises from a remarkable feature of the above graph! At the current USS threshold salary of £55.55k, the amount lost is the same — it’s 5.08% under both proposals. Which led me to wonder: is that a coincidence, or was it actually a (pretty weird!) constraint used in the recent UUK-UCU negotiations? And then to wonder: might the best solution (i.e., for the same cost) be to do something that gives a better graph than either of the two proposals seen so far?

Tail wagging the dog?

The fact that the loss under the March proposal tops out at 5.08%, exactly the same (to 2 decimal places, anyway) as in the January proposal, seems unlikely to be a coincidence?

If it’s not a coincidence, then a plausible route to the March proposal, at the UUK-UCU negotiating table, could have been along the lines of:

How can we re-work the January proposal to

  • retain defined benefit, up to some (presumably reduced) threshold and with some (presumably reduced) accrual rate,

while at the same time ensuring that

  • nobody loses more than the maximum 5.08% that’s in the January proposal
  • the employer contribution rate to the DC pots of high earners is not reduced below the current standard (i.e., without the “match”) level of 12%

?

Those constraints, coupled with the total cost to employers, would lead naturally to a family of solutions indexed by just two adjustable constants, namely

  • the threshold salary up to which DB pension applies (previously £55.55k)
  • the DB accrual rate (previously 1/75)

— and it seems plausible that the suggested (12 March 2018) new threshold of £42k and accrual rate of 1/85 were simply selected as the preferred candidate (among many such potential solutions) to offer to UUK and UCU members.

But the curve ought to be flat, or even increasing!

The two constraints listed as second and third bullets in the above essentially fix the position of the part of the black curve that applies to salaries over £55.55k. That’s what I mean by “tail wagging the dog”. Those constraints inevitably result in a solution that implies substantial losses for those with low or moderate incomes.

Once this is recognised, it becomes natural to ask: what should the shape of that “percentage loss” curve be?

The answer is surely a matter of opinion.

Those wishing to preserve substantial pension contributions at high salary levels, at the expense of those at lower salary levels, would want a curve that decreases to the right — as seen in the above curves for the January and March proposals.

For myself, I would argue the opposite: The “percent lost” curve should either be roughly constant, or might reasonably even increase as salary increases. (The obvious parallel being progressive rates of income tax: those who can afford to pay more, pay more.)

I had made a specific suggestion along these lines in this earlier post.

The details of any solution that satisfies the “percent loss roughly constant, or even increasing” requirement clearly would need to depend on data that’s not so widely available (mainly, the distribution of all salaries for USS members).

But first the principle of fairness needs to be recognised. And once that is accepted, the constraints underlying future UUK-UCU negotiations would need to change radically — i.e., definitely away from those last two bullets in the above display.

Calculation of the red curve

In the previous post I gave R code for the black curve. Here is the corresponding calculation behind the red curve:

sacrifice.Jan <- function(s) {  # s: salary in £k
    old_threshold <- 55.55
    ## below the threshold: lose DB (the 19/75 part) plus the 1% "match",
    ## gain 21.25% defined contribution in their place
    r2 <- pmin(s, old_threshold) * (19/75 + 1/100 - 21.25/100)
    ## above the threshold: employer DC rate rises from (max) 13% to 13.25%
    r3 <- (s > old_threshold) * (s - old_threshold) * 
                (13 - 13.25)/100
    return(r2 + r3)
}

## A vector of salary values up to £150k
salaries <- (1:1500) / 10

In essence:

  • salary under £55.55k would lose the defined benefit (that’s the 19/75 part) and the 1% “match”, and in its place would get 21.25% as defined contribution. The sum of these parts is the computed loss r2.
  • salary over £55.55k would gain the difference between potential 13% employer contribution and the proposed new rate of 13.25% (that’s the negative value r3 in the code).
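
As a quick check on the function above, the loss at the current £55.55k threshold comes out at the 5.08% of salary quoted earlier, and evaluating the same percentage across the salary vector traces out the red curve. (The plotting call below is only an illustrative sketch, not necessarily the original code.)

## percentage of salary lost at the current threshold: about 5.08
100 * sacrifice.Jan(55.55) / 55.55

## sketch of how the red curve could be drawn
plot(salaries, 100 * sacrifice.Jan(salaries) / salaries, type = "l",
     col = "red", xlab = "salary (£k)", ylab = "% of salary lost")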

© David Firth, March 2018

To cite this entry:
Firth, D (2018). USS proposals: Tail wagging the dog?. Weblog entry at URL https://statgeek.net/2018/03/15/uss-proposals-tail-wagging-the-dog/

To leave a comment for the author, please follow the link and comment on their blog: R – Let’s Look at the Figures.


Source:: R News

Phaedra 1.0.2

By Open Analytics


Phaedra is an open source platform for data capture and analysis of high-content screening data. With the release of Phaedra 1.0.2, we are taking another step towards our goal of unprecedented flexibility in supported setups, ranging from a single small Mac desktop to a cloud-based infrastructure with multiple servers and an array of mixed Windows/Mac/Linux clients.

The initial release of Phaedra supported only the Windows platform. Update 1.0.1 introduced Phaedra on the Mac and Linux desktops, and made it possible to deploy a DataCapture server on Linux as well. With update 1.0.2, stability has been improved across all platforms, and cloud-based infrastructures are now also supported in the form of pluggable filestore backends. Amazon S3 is supported out of the box, as is any S3-compatible store such as Minio. We intend to add more backends, to offer a true choice that can mesh with any corporate infrastructure.

But that’s not all! Take a look at the examples below that showcase other new or improved features in Phaedra 1.0.2.

Creating charts with Python

The Python addon offers a lot of interesting options. It can be used to create advanced calculated features, or for batch operations such as modifying large sets of plates. But with the new Scripted Chart view, you can also use Python scripts to generate charts and display them in Phaedra.

Many views in Phaedra are reportable, which means you can include them in a PDF report. The Scripted Chart view is no exception: you can save it and insert it into a report with a few clicks.

Scripted Python Chart in PDF report

R-based workflows

Phaedra embeds the Knime workflow engine, giving users an advanced data mining and processing toolbox.
While Knime already has a large scientific community that has created hundreds of free addons, there are still scenarios where you'd want to include your own R-based calculations in a workflow.

For that purpose, there is the R Snippet node: a workflow node that can execute arbitrary R code on a set of input data tables.
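
As a rough illustration of the kind of code such a node might execute (the variable and column names below are placeholders rather than Phaedra's actual snippet conventions), a snippet could add a normalised column to an input table:

## hypothetical snippet body: `input`, `output` and the `value` column are
## placeholders, not Phaedra's actual snippet API
output <- input
output$value_z <- (input$value - mean(input$value, na.rm = TRUE)) /
                    sd(input$value, na.rm = TRUE)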

Example workflow with R snippet

This R Snippet node uses Phaedra’s embedded R engine, which was just upgraded to version R-3.4.3.

Image rendering with OpenJPEG

At the core of Phaedra's features are its image rendering capabilities. The ability to view data alongside the image from which it was derived is, we believe, fundamental to the Phaedra experience.

Various views with image rendering

As these screenshots illustrate, Phaedra places heavy demands on its image rendering component. Rendering a huge image in real time (with the possibility of panning and zooming), or rendering dozens of small image crops at once, is no easy task for any image rendering codec.

Phaedra ships with the OpenJPEG codec, which has recently been updated with significant performance improvements.

Documentation has been updated on the project homepage and, as always, community support for this new release is available on our support site.

Source:: R News