Replicating Plots – Boxplot Exercises

By Karolis Koncevicius

Matlab boxplot

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

R’s boxplot function has a lot of useful parameters allowing us to change the behaviour and appearance of boxplot graphs. In these exercises we will use those parameters to replicate the visual style of Matlab’s boxplot. Before trying the exercises, please make sure that you are familiar with the following functions: bxp, boxplot, axis, mtext.

Here is the plot we will be replicating:

We will be using the same iris dataset, which is available in R by default in a variable of the same name – iris. The exercises will require you to make incremental changes to the default boxplot style.
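
For orientation, the default plot that the exercises start from can be produced with a single call like the one below; everything beyond that is left to the exercises:

boxplot(Sepal.Width ~ Species, data = iris)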

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Make a default boxplot of Sepal.Width stratified by Species.

Exercise 2
Change the range of the y-axis so it starts at 2 and ends at 4.5.

Exercise 3
Modify the boxplot call so that it draws neither ticks nor labels on the x and y axes.

Exercise 4
Add notches (triangular dents around the median representing confidence intervals) to the boxes in the plot.

Exercise 5
Increase the distance between boxes in the plot.

Exercise 6
Change the color of the box borders to blue.

Exercise 7
a. Change the color of the median lines to red.
b. Change the line width of the median line to 1.

Exercise 8
a. Change the color of the outlier points to red.
b. Change the symbol of the outlier points to “+”.
c. Change the size of the outlier points to 0.8.

Exercise 9
a. Add a title to the boxplot (try to replicate the style of Matlab’s boxplot).
b. Add a y-axis label to the boxplot (try to replicate the style of Matlab’s boxplot).

Exercise 10
a. Add the x-axis (try to make it resemble the x-axis in Matlab’s boxplot).
b. Add the y-axis (try to make it resemble the y-axis in Matlab’s boxplot).
c. Add the y-axis ticks on the other side.

NOTE: You can use format(as.character(c(2, 4.5)), drop0trailing=TRUE, justify="right") to obtain the text for y-axis labels.
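
As an illustration only (the exact tick positions and styling are part of the exercise), the formatted labels might be passed to axis() along these lines:

ylabs <- format(as.character(c(2, 4.5)), drop0trailing = TRUE, justify = "right")
axis(side = 2, at = c(2, 4.5), labels = ylabs, las = 1)  # las = 1 for horizontal labels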

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

When Trump visits… tweets from his trip to Mexico

By En El Margen – R-English

Trump tweets by Mexican officials, percent

(This article was first published on En El Margen – R-English, and kindly contributed to R-bloggers)

I’m sure many of my fellow Mexicans will remember the historically ill-advised (to say the least) decision by our President to invite Donald Trump for a meeting.

Talking to some colleagues, we couldn’t help but notice that maybe in another era this decision would have been good policy. The problem, some concluded, was the influence of social media today. In fact, the Trump debacle did cause an outcry among leading political voices online.

I wanted to investigate this further, and thankfully for me, I’ve been using R to collect tweets from a catalog of leading political personalities in Mexico for a personal business project.

Here is a short descriptive look at what the 65 twitter accounts I’m following tweeted between August 27th and September 5th (the Donald announced his visit on August the 30th). I’m sorry I can’t share the dataset, but you get the idea with the code…

library(dplyr)
library(stringr)
library(lubridate) # for month(), day(), hour() used below
library(knitr)     # for kable() used below

# 42 of the 65 accounts tweeted between those dates.
d %>% 
  summarise("n" = n_distinct(NOMBRE))
#   n
#  42

We can see how mentions of Trump spike just around the time the visit was announced…

byhour <- d %>% 
  mutate("MONTH" = as.numeric(month(T_CREATED)), 
         "DAY" = as.numeric(day(T_CREATED)), 
         "HOUR" = as.numeric(hour(T_CREATED)), 
         "TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>% 
  group_by(MONTH, DAY, HOUR) %>% 
  summarise("N" = n(), 
            "TRUMP_MENTIONS" = sum(TRUMP_MENTION)) %>%
  mutate("PCT_MENTIONS" = TRUMP_MENTIONS/N*100) %>%
  arrange(desc(MONTH), desc(DAY), HOUR) %>%
  mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))

library(ggplot2)  
library(eem)
ggplot(byhour, 
       aes(x = CHART_DATE, 
           y = PCT_MENTIONS)) + 
        geom_line(colour=eem_colors[1]) + 
        theme_eem()+
        labs(x = "Time", 
             y = "Trump mentions n (% of Tweets)")

The peak of mentions (as a percentage of tweets) was September 1st at 6 am (75%). But in terms of the number of tweets, it is much more obvious that the outcry followed the announcement and later visit of the candidate:

ggplot(byhour, 
       aes(x = CHART_DATE, 
           y = TRUMP_MENTIONS)) + 
        geom_line(colour=eem_colors[1]) + 
        theme_eem()+
        labs(x = "Time", 
             y = "Trump mentions n (# of Tweets)")

Trump tweets by Mexican officials, total

We can also (sort of) identify the effect of these influencers tweeting. I’m going to sum up the followers – the potential viewers – of each tweet mentioning Trump, by hour.

byaudience <- d %>% 
  mutate("MONTH" = as.numeric(month(T_CREATED)), 
         "DAY" = as.numeric(day(T_CREATED)), 
         "HOUR" = as.numeric(hour(T_CREATED)), 
         "TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>% 
  filter(TRUMP_MENTION > 0) %>%
  group_by(MONTH, DAY, HOUR) %>% 
  summarise("TWEETS" = n(), 
            "AUDIENCE" = sum(U_FOLLOWERS)) %>%
  arrange(desc(MONTH), desc(DAY), HOUR) %>%
  mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))


ggplot(byaudience, 
       aes(x = CHART_DATE, 
           y = AUDIENCE)) + 
        geom_line(colour=eem_colors[1]) + 
        theme_eem()+
        labs(x = "Time", 
             y = "Potential audience n (# of followers)")

Total audience of trump tweets

So clearly, I’m stating the obvious. People were talking. But how was the conversation developing? Let’s first look at the type of tweets (RTs vs. individually drafted):

bytype <- d %>% 
  mutate("TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
  # only the tweets that mention trump
  filter(TRUMP_MENTION>0) %>%
  group_by(T_ISRT) %>% 
  summarise("count" = n())
kable(bytype)
T_ISRT count
FALSE 313
TRUE 164
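
For reference, the overall RT share across all collected tweets can be computed the same way; a quick sketch, assuming the same d data frame and T_ISRT flag as above:

d %>% 
  group_by(T_ISRT) %>% 
  summarise("count" = n())
# per the text: 1389 of the 3833 tweets overall are RTs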

About 1 in 3 was an RT. Compared to the overall tweets (1389 RTs out of 3833), this is not much of a difference, so it wasn’t necessarily influencers pushing the discourse. In terms of who was mentioned most, our President was in the spotlight:

bymentionchain <- d %>% 
  mutate("TRUMP_MENTION" = str_count(TXT, pattern = "Trump|TRUMP|trump")) %>%
  # only the tweets that mention trump
  group_by(TRUMP_MENTION, MENTION_CHAIN) %>% 
  summarise("count" = n()) %>% 
  ungroup() %>% 
  mutate("GROUPED_CHAIN" = ifelse(grepl(pattern = "EPN", 
                                        x = MENTION_CHAIN), 
                                  "EPN", MENTION_CHAIN)) %>% 
  mutate("GROUPED_CHAIN" = ifelse(grepl(pattern = "realDonaldTrump", 
                                        x = MENTION_CHAIN), 
                                  "realDonaldTrump", GROUPED_CHAIN))
                                  
ggplot(order_axis(bymentionchain %>% 
                    filter(count>10 & GROUPED_CHAIN!="ND"), 
                  axis = GROUPED_CHAIN, 
                  column = count), 
       aes(x = GROUPED_CHAIN_o, 
           y = count)) + 
  geom_bar(stat = "identity") + 
  theme_eem() + 
  labs(x = "Mention chain n (separated by _|.|_ )", y = "Tweets")

Mentions

How about the actual people who tweeted? It seemed like news anchor Joaquin Lopez-Doriga and security analyst Alejandro Hope were the most vocal about the visit (out of the influencers I’m following).

bytweetstar <- d %>% 
  mutate("TRUMP_MENTION" = ifelse(str_count(TXT, pattern = "Trump|TRUMP|trump")<1,0,1)) %>%
  group_by(TRUMP_MENTION, NOMBRE) %>% 
  summarise("count" = n_distinct(TXT))
## plot with ggplot2
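
The plotting step is omitted above; a rough sketch of what it might look like, restricting to tweets that mention Trump and reusing the eem theme from earlier:

ggplot(bytweetstar %>% filter(TRUMP_MENTION == 1),
       aes(x = reorder(NOMBRE, count), y = count)) +
  geom_bar(stat = "identity") +
  coord_flip() +          # one horizontal bar per account
  theme_eem() +
  labs(x = "Account", y = "Tweets mentioning Trump")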

Mentions

I also grouped each person by their political affiliation, which confirms the notion that the conversation on the eve of the visit, at least among this very small subset of twitter accounts, was driven by those with no party affiliation or in the “PAN” (the opposition party).

byafiliation <- d %>% 
  mutate("MONTH" = as.numeric(month(T_CREATED)), 
         "DAY" = as.numeric(day(T_CREATED)), 
         "HOUR" = as.numeric(hour(T_CREATED)), 
         "TRUMP_MENTION" = ifelse(str_count(TXT, pattern = "Trump|TRUMP|trump")>0,1,0)) %>% 
  group_by(MONTH, DAY, HOUR, TRUMP_MENTION, AFILIACION) %>% 
  summarise("TWEETS" = n()) %>%
  arrange(desc(MONTH), desc(DAY), HOUR) %>%
  mutate("CHART_DATE" = as.POSIXct(paste0("2016-",MONTH,"-",DAY," ", HOUR, ":00")))
  
 ggplot(byafiliation, 
       aes(x = CHART_DATE, 
           y = TWEETS, 
           group = AFILIACION, 
           fill = AFILIACION)) + 
  geom_bar(stat = "identity") + 
  theme_eem() + 
  scale_fill_eem(20) + 
  facet_grid(TRUMP_MENTION ~.) +
  labs(x = "Time", y = "Tweets n (By mention of Trump)")

Mentions

However, it’s interesting to note a small spike from accounts affiliated with the PRI (the party in power) on the day after his visit (Sept. 1st). Maybe they were trying to drive the conversation to another place?

To leave a comment for the author, please follow the link and comment on their blog: En El Margen – R-English.


Source:: R News

Better Model Selection for Evolving Models

By quintuitive

Model Selection Algorithm

(This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)

For quite some time now I have been using R’s caret package to choose the model for forecasting time series data. The approach is satisfactory as long as the model is not an evolving model (i.e. it is not re-trained), or if it evolves rarely. If the model is re-trained often, the approach has significant computational overhead. Interestingly enough, an alternative, more efficient approach also allows for more flexibility in model selection.

Let’s first outline how caret chooses a single model. The high level algorithm is outlined here:

So let’s say we are training a random forest. For this model, a single parameter, mtry, is optimized:

require(caret)
getModelInfo('rf')$rf$parameters
#   parameter   class                         label
# 1      mtry numeric #Randomly Selected Predictors

Let’s assume we are using some form of cross validation. According to the algorithm outline, caret will create a number of subsets. On each subset it will train all candidate models (as many models as there are values of mtry), and finally it will choose the model performing best over all cross-validation folds. So far so good.
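
In code, this single-model selection looks roughly like the sketch below (train_data and its target column are placeholder names; only mtry is tuned):

library(caret)
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(target ~ ., data = train_data,
              method    = "rf",
              trControl = ctrl,
              tuneGrid  = expand.grid(mtry = c(2, 4, 8)))
fit$bestTune   # the mtry value that performed best across the folds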

When dealing with time series, regular cross validation has a future-snooping problem, and from my experience it doesn’t work well in practice for time series data. The results are good on the training set, but the performance on the test set, the hold-out, is bad. To address this issue, caret provides the timeslice cross-validation method:

require(caret)
history = 1000
initial.window = 800
train.control = trainControl(
                    method="timeslice",
                    initialWindow=initial.window,
                    horizon=history-initial.window,
                    fixedWindow=T)

When the above train.control is used in training (via the train call), we will end up using 200 models for each set of parameters (each value of mtry in the random forest case). In other words, for a single value of mtry, we will compute:

Window   Training Points   Test Point
1        1..800            801
2        2..801            802
3        3..802            803
...      ...               ...
200      200..999          1000

The training set for each model is the previous 800 points. The test set for a single model is the single-point forecast. Now, for each value of mtry we end up with 200 forecast points; using accuracy (or any other metric) we select the best-performing model over these 200 points. There is no future-snooping here, because all history points are prior to the points being forecast.
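
The windows in the table can be enumerated with caret’s createTimeSlices; the sketch below uses a horizon of 1 to reproduce the one-step-ahead test points shown above:

slices <- createTimeSlices(1:1000, initialWindow = 800, horizon = 1, fixedWindow = TRUE)
length(slices$train)      # 200 training windows
range(slices$train[[1]])  # 1 800
slices$test[[1]]          # 801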

Granted, this approach (of doing things on a daily basis) may sound extreme, but it’s useful to illustrate the overhead imposed when the model evolves over time, so bear with me.

So far we have dealt with selecting a single model. Once the best model is selected, we can forecast the next data point. Then what? What I usually do is walk the time series forward and repeat these steps at certain intervals. This is equivalent to saying something like: “Let’s choose the best model each Friday, use the selected model to predict each day of the next week, then re-fit it on Friday.” This forward-walking approach has been found useful in trading but, surprisingly, hasn’t been discussed much elsewhere. Abundant time series data is generated everywhere, hence I feel this evolving-model approach deserves at least as much attention as the “fit once, live happily thereafter” approach.

Back to our discussion. To illustrate the inefficiency, consider an even more extreme case – we select the best model every day, using the above parameters, i.e. the best model for each day is selected by tuning the parameters over the previous 200 days. On day n, for a given value of the parameter (mtry), we will train this model over a sequence of 200 sliding windows, each of size 800. Next we move to day n+1 and compute, yet again, this model over a sequence of 200 sliding windows, each of size 800. Most of these operations are repeated (the last 800-point window on day n is the second-to-last 800-point window on day n+1). So just for a single parameter value, we are repeating most of the computation on each step.

At this point, I hope you get the idea. So what is my solution? Simple. For each set of model parameters (each value of mtry), walk the series separately, do the training (no cross validation – we have a single parameter value), do the forecasting and store everything important into, let’s say, an SQLite database. Next, pull out all predictions and walk the combined series. On each step, look at the history and, based on it, decide which model’s prediction to use for the next step. Assuming we are selecting the model over 5 different values of mtry, here is how the combined data may look for a three-class (0, -1 and 1) classification:

Obviously the described approach is going to be orders of magnitude faster, yet deliver very similar results (there are differences depending on the window sizes). It also has an added bonus – once the forecasts are generated, one can experiment with different metrics for model selection on each step, all without re-running the machine learning portion. For instance, instead of model accuracy (the default caret metric for classification), one can compare cumulative returns over the last n days.
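
A rough sketch of the per-parameter walk-forward described above, persisting the forecasts to SQLite (all names are illustrative; series is assumed to be a data.frame with a response y and predictor columns):

library(randomForest)
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "forecasts.sqlite")
for (m in c(2, 4, 6, 8, 10)) {             # the candidate mtry values
  for (day in 801:1000) {                  # walk the series forward
    train_idx <- (day - 800):(day - 1)     # previous 800 points
    fit  <- randomForest(y ~ ., data = series[train_idx, ], mtry = m)
    pred <- predict(fit, newdata = series[day, , drop = FALSE])
    dbWriteTable(con, "forecasts",
                 data.frame(mtry = m, day = day, prediction = as.character(pred)),
                 append = TRUE)
  }
}
dbDisconnect(con)
# Afterwards: read the table back, walk the combined series and, on each day,
# pick the mtry whose recent forecasts score best on the chosen metric.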

Still cryptic or curious about the details? My plan is to keep posting details and code as I progress with my Python implementation, so look for the next installments of this series.

The post Better Model Selection for Evolving Models appeared first on Quintuitive.

To leave a comment for the author, please follow the link and comment on their blog: R – Quintuitive.


Source:: R News

The biggest liars in US politics

By tlfvincent

(This article was first published on Stat Of Mind, and kindly contributed to R-bloggers)

Anyone who follows US politics will be aware of the tremendous changes and volatility that have struck the US political landscape in the past year. In this post, I leverage third-party data to surface the most frequent liars, and show how to build a containerized Shiny app to visualize direct comparisons between individuals.

http://tlfvincent.github.io/2016/06/11/biggest-political-liars/

To leave a comment for the author, please follow the link and comment on their blog: Stat Of Mind.


Source:: R News

FileTable and storing graphs from Microsoft R Server

By tomaztsql

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

FileTable has been around for quite some time now and it is useful for storing files, documents, pictures and binary files in a designated SQL Server table – a FileTable. The best part of FileTable is that one can access the files from Windows or any other application as if they were stored on the file system (because they are), without making any other changes on the client.

And this feature is absolutely handy for using and storing outputs from Microsoft R Server. In this blog post I will focus mainly on persistently storing charts from statistical analysis.

First we need to make sure that FILESTREAM is enabled. Open SQL Server Configuration Manager and navigate to your running SQL Server instance. Right-click, select FILESTREAM, and enable FILESTREAM for T-SQL access and I/O access. In addition, allow remote clients access to FILESTREAM data as well.

Next step is to enable the configurations in Management Studio.

EXEC sp_configure 'filestream_access_level' , 2;
GO
RECONFIGURE;
GO

For this purpose I have decided to have a dedicated database for storing charts created in R. And this database will have FileTable enabled.

USE master;
GO

CREATE DATABASE FileTableRChart 
ON PRIMARY  (NAME = N'FileTableRChart', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FileTableRChart.mdf' , SIZE = 8192KB , FILEGROWTH = 65536KB ),
FILEGROUP FileStreamGroup1 CONTAINS FILESTREAM( NAME = ChartsFG, FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\RCharts')
LOG ON (NAME = N'FileTableRChart_log', FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\FileTableRChart_log.ldf' , SIZE = 8192KB , FILEGROWTH = 65536KB )
GO

ALTER DATABASE FileTableRChart
    SET FILESTREAM ( NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'RCharts' )

So I will have the folder RCharts available as BLOB storage for my FileTableRChart SQL Server database. Next, I add a FileTable to hold all the needed information on my charts.

USE FileTableRChart;
GO

CREATE TABLE ChartsR AS FILETABLE
WITH (
 FileTable_Directory = 'DocumentTable'
,FileTable_Collate_Filename = database_default  
);
GO

With the BLOB storage set, we can now focus on the R code within T-SQL. The following R code will be used to generate histograms with a normal curve for a quick data overview (note, this is just a sample):

library(ggplot2)

x <- data.frame(val = c(1,2,3,6,3,2,3,4,5,6,7,7,6,6,6,5,5,4,8))
y <- data.frame(val = c(1,2,5,8,5,4,2,4,5,6,3,2,3,5,5,6,7,7,8))
x$class <- 'XX'
y$class <- 'YY'
d <- rbind(x,y)

#normal function with counts
gghist <- ggplot(d, aes(x=val)) + geom_histogram(binwidth=2, 
                  aes(y=..density.., fill=..count..))
gghist <- gghist + stat_function(fun=dnorm, args=list(mean=mean(d$val), 
                   sd=sd(d$val)), colour="red")
gghist <- gghist + ggtitle("Histogram of val with normal curve")  + 
                   xlab("Variable Val") + ylab("Density of Val")

This returns a diagram that will be further parametrized when inserted into the T-SQL code.

Histogram of val with normal curve

Besides parametrization, I will add a loop over all the input variables to generate a diagram for each given variable/column in the SQL Server query passed through the sp_execute_external_script stored procedure.

Final code:

DECLARE @SQLStat NVARCHAR(4000)
SET @SQLStat = 'SELECT
                     fs.[Sale Key] AS SalesID
                    ,c.[City] AS City
                    ,c.[State Province] AS StateProvince
                    ,c.[Sales Territory] AS SalesTerritory
                    ,fs.[Customer Key] AS CustomerKey
                    ,fs.[Stock Item Key] AS StockItem
                    ,fs.[Quantity] AS Quantity
                    ,fs.[Total Including Tax] AS Total
                    ,fs.[Profit] AS Profit

                    FROM [Fact].[Sale] AS  fs
                    JOIN dimension.city AS c
                    ON c.[City Key] = fs.[City Key]
                    WHERE
                        fs.[customer key] <> 0'

DECLARE @RStat NVARCHAR(4000)
SET @RStat = 'library(ggplot2)
              library(stringr)
              #library(jpeg)
              cust_data <- Sales
              n <- ncol(cust_data)
              for (i in 1:n) 
                        {
                          path <- 
''\\\\SICN-KASTRUN\\mssqlserver\\RCharts\\DocumentTable\\Plot_''
                          colid   <- data.frame(val=(cust_data)[i])
                          colname <- names(cust_data)[i]
                          #print(colname)
                          #print(colid)
                          gghist <- ggplot(colid, aes(x=val)) + 
geom_histogram(binwidth=2, aes(y=..density.., fill=..count..))
                          gghist <- gghist + stat_function(fun=dnorm, 
args=list(mean=mean(colid$val), sd=sd(colid$val)), colour="red")
                          gghist <- gghist + ggtitle("Histogram of val with 
normal curve")  + xlab("Variable Val") + ylab("Density of Val")
                          path <- paste(path,colname,''.jpg'')
                          path <- str_replace_all(path," ","")
                          #jpeg(file=path)
                          ggsave(path, width = 4, height = 4)
                          plot(gghist)
                          dev.off()
                        }';

EXECUTE sp_execute_external_script
     @language = N'R'
    ,@script = @RStat
    ,@input_data_1 = @SQLStat
    ,@input_data_1_name = N'Sales'

I am using the ggsave function, but the jpeg function from the jpeg package is also an option – a matter of flavour. The variable path should point to your local FileTable directory.
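
If you prefer the jpeg device, the ggsave call inside the loop could be swapped for something along these lines (a sketch; width and height are arbitrary):

jpeg(filename = path, width = 480, height = 480)
print(gghist)   # an explicit print() is needed inside a loop
dev.off()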

Now I have graphs and charts stored persistently in the FileTable and can retrieve information on the files with a simple query:

SELECT FT.Name
,IIF(FT.is_directory=1,'Directory','Files') [File Category]
,FT.file_type [File Type]
,(FT.cached_file_size)/1024.0 [File Size (KB)]
,FT.creation_time [Created Time]
,FT.file_stream.GetFileNamespacePath(1,0) [File Path]
,ISNULL(PT.file_stream.GetFileNamespacePath(1,0),'Root Directory') [Parent Path]
FROM [dbo].[ChartsR] FT
LEFT JOIN [dbo].[ChartsR] PT
ON FT.path_locator.GetAncestor(1) = PT.path_locator

Query results listing the stored chart files in the FileTable

Browsing through the charts is now much easier, for multiple purposes.


There might be some security issues: I have used mklink to create a logical drive pointing to the FileTable directory.

Administrator command prompt (mklink)

You might also want to use the Local Group Policy Editor to grant the MSSQLLaunchpad account access (write permissions) to the FileTable directory.

Local Group Policy Editor

Code is available at GitHub.

Happy R-SQLing!

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.


Source:: R News

Re-introducing Radiant: A shiny interface for R

By R(adiant) news

(This article was first published on R(adiant) news, and kindly contributed to R-bloggers)

Radiant is a platform-independent browser-based interface for business analytics in R. I first introduced Radiant through R-bloggers on 5/2/2015 and, according to Dean Attali, the post was reasonably popular. So I decided to write a post about the changes to the tool since then.

Radiant is back on CRAN and the code and documentation have been moved to a GitHub organization radiant-rstats. Note that the app is only available for R 3.3.0 or later.

There have been numerous changes to the functionality and structure of Radiant. The app is now made up of 5 different menus, each in a separate package:

  • Data (radiant.data): interfaces for loading, saving, viewing, visualizing, summarizing, transforming, and combining data, plus functionality to generate reproducible reports of the analyses conducted in the application.
  • Design (radiant.design): interfaces for design of experiments, sampling, and sample size calculation.
  • Basics (radiant.basics): interfaces for probability calculation, central limit theorem simulation, comparing means and proportions, goodness-of-fit testing, cross-tabs, and correlation.
  • Model (radiant.model): interfaces for linear and logistic regression, neural networks, model evaluation, decision analysis, and simulation.
  • Multivariate (radiant.multivariate): interfaces for perceptual mapping, factor analysis, cluster analysis, and conjoint analysis.

Finally, the radiant package combines the functionality from each of these 5 packages.

More functionality is in the works. For example, naive Bayes, boosted decision trees, random forests, and choice models will be added to the Model menu (radiant.model). I’m also planning to add a Text menu (radiant.text) to provide functionality to view, process, and analyze text.

If you are interested in contributing to, or extending, Radiant, take a look at the code for the radiant.design package on GitHub. This is the simplest menu and should give you a good idea of how to build on the functionality in the radiant.data package, which is the basis for all other packages and menus.

Want to know more about Radiant? Although you could take a look at the original Introducing Radiant blog post, quite a few of the links and references have changed. To make things a bit easier, I’m including an updated version of the original post below.

If you have questions or comments please email me at radiant@rady.ucsd.edu

Key features

  • Explore: Quickly and easily summarize, visualize, and analyze your data
  • Cross-platform: It runs in a browser on Windows, Mac, and Linux
  • Reproducible: Recreate results at any time and share work with others as a state file or an Rmarkdown report
  • Programming: Integrate Radiant’s analysis functions into your own R-code
  • Context: Data and examples focus on business applications

Explore

Radiant is interactive. Results update immediately when inputs are changed (i.e., no separate dialog boxes). This greatly facilitates exploration and understanding of the data.

Cross-platform

Radiant works on Windows, Mac, or Linux. It can run without an Internet connection and no data will leave your computer. You can also run the app as a web application on a server.

Reproducible

Simply saving output is not enough. You need the ability to recreate results for the same data and/or when new data becomes available. Moreover, others may want to review your analyses and results. Save and load the state of the application to continue your work at a later time or on another computer. Share state files with others and create reproducible reports using Rmarkdown.

If you are using Radiant on a server you can even share the url (include the SSUID) with others so they can see what you are working on. Thanks for this feature go to Joe Cheng.

Programming

Although Radiant’s web-interface can handle quite a few data and analysis tasks, you may prefer to write your own code. Radiant provides a bridge to programming in R(studio) by exporting the functions used for analysis. For more information about programming with Radiant see the programming page on the documentation site.

Context

Radiant focuses on business data and decisions. It offers context-relevant tools, examples, and documentation to reduce the business analytics learning curve.

How to install Radiant

  • Required: R version 3.3.0 or later
  • Required: A modern browser (e.g., Chrome or Safari). Internet Explorer (version 11 or higher) or Edge should work as well
  • Recommended: Rstudio

Radiant is available on CRAN. However, to install the latest version of the different packages with complete documentation for offline access open R(studio) and copy-and-paste the command below into the console:

install.packages("radiant", repos = "http://radiant-rstats.github.io/minicran/")

Once all packages and dependencies are installed use the following command to launch the app in your default browser:

radiant::radiant()

If you have a recent version of Rstudio installed you can also start the app from the Addins dropdown. That dropdown will also provide an option to upgrade Radiant to the latest version available on the github minicran repo.

If you currently only have R on your computer and want to make sure you have all supporting software installed as well (e.g., Rstudio, MikTex, etc.) open R, copy-and-paste the command below, and follow along as different dialogs are opened:

source("https://raw.githubusercontent.com/radiant-rstats/minicran/gh-pages/install.R")

More detailed instructions are available on the install radiant page.

Documentation

Documentation and tutorials are available at http://radiant-rstats.github.io/docs/ and in the Radiant web interface (the ? icons and the Help menu).

Want some help getting started? Watch the tutorials on the documentation site

Radiant on a server

If you have access to a server you can use shiny-server to run radiant. First, start R on the server with sudo R and install radiant using install.packages("radiant"). Then clone the radiant repo and point shiny-server to the inst/app/ directory.

If you have Rstudio server running and the Radiant package is installed, you can also start Radiant from the Addins menu. To deploy Radiant using Docker, take a look at the example and documentation at:

https://github.com/radiant-rstats/docker-radiant

Not ready to install Radiant, either locally or on a server? Try it out on shinyapps.io at the link below:

vnijs.shinyapps.io/radiant

Send questions and comments to: radiant@rady.ucsd.edu.



To leave a comment for the author, please follow the link and comment on their blog: R(adiant) news.


Source:: R News

tint 0.0.1: Tint Is Not Tufte

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new experimental package is now on the ghrr drat. It is named tint, which stands for Tint Is Not Tufte. It provides an alternative for Tufte-style HTML presentation. I wrote a bit more on the package page and in the README in the repo — so go read those.

Here is just a little teaser of what it looks like:

and the full underlying document is available too.

For questions or comments use the issue tracker off the GitHub repo. The package may be short-lived as its functionality may end up inside the tufte package.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


Source:: R News

Surveillance Out of the Box – The #Zombie Experiment

By Theory meets practice…

Creative Commons License

(This article was first published on Theory meets practice…, and kindly contributed to R-bloggers)

Abstract

We perform a social experiment to investigate whether zombie-related twitter posts can be used as a reliable indicator for an early warning system. We show how such a system can be set up almost out-of-the-box using R – a free software environment for statistical computing and graphics. Warning: This blog entry contains toxic doses of Danish irony and sarcasm as well as disturbing graphs.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from .

Introduction

Proposing statistical methods is only mediocre fun if nobody applies them. As an act of desperation the prudent statistician has been forced to provide R packages supplemented with a CRAN, github, useR! or word-of-mouth advertising strategy. To underpin these efforts, a reproducibility crisis has been announced in order to scare decent comma-separated scientists away from using Excel. Social media marketing strategies for your R package include hashtag #rstats twitter announcements, possibly enhanced by a picture or animation showing your package at its best:

Introducing gganimate: #rstats package for adding animation to any ggplot2 figure https://t.co/UBWKHmIc0e pic.twitter.com/oQhQaYBqOj

— David Robinson (@drob) February 1, 2016

Unfortunately, little experience with the interactive aspect of this statistical software marketing strategy appears to be available. In order to fill this scientific advertising gap, this blog post constitutes an advertisement for the out-of-the-box functionality of the surveillance package, hidden as a social experiment. It shows what you can do with R when you combine a couple of packages, wrangle the data, cleverly visualize the results and then team up with the fantastic R community.

The Setup: Detecting a Zombie Attack

As previously explained in an useR! 2015 lightning talk, Max Brooks’ Zombie Survival Guide is very concerned about the early warning of Zombie outbreaks.

However, despite extensive research and recommendations, no reliable service appears to be available for the early detection of such upcoming events. Twitter, on the other hand, has become the media darling for staying informed about news as it unfolds. Hence, continuous monitoring of hashtags like #zombie or #zombieattack appears to be an essential component of your zombie survival strategy.

Tight Clothes, Short Hair and R

Extending the recommendations of the Zombie Survival guide we provide an out-of-the-box (OOTB) monitoring system by using the rtweet R package to obtain all individual tweets containing the hashtags #zombie or #zombieattack.

the_query <- "#zombieattack OR #zombie"
geocode <- ""  #To limit the search to Berlin & surroundings: geocode <- "52.520583,13.402765,25km"
#Converted query string which works for storing as file
safe_query <- stringr::str_replace_all(the_query, "[^[:alnum:]]", "X")

In particular, the README of the rtweet package provides helpful information on how to create a twitter app to automatically search tweets using the twitter API. One annoyance of the twitter REST API is that only the tweets of the past 7 days are kept in the index. Hence, your time series are going to be short unless you accumulate data over several queries spread over a time period. Instead of using a fancy database setup for this data collection, we provide a simple R solution based on dplyr and saveRDS – see the underlying R code of this post by clicking on the github logo in the license statement of this post. Basically,

  • all tweets fulfilling the above hashtag search queries are extracted
  • each tweet is extended with a time stamp of the query-time
  • the entire result of each query is stored in a separate RDS file using saveRDS

In a next step, all stored queries are loaded from the RDS files and put together. Subsequently, only the newest time-stamped entry for each tweet is kept – this ensures that the re-tweet counts are up-to-date and no post is counted twice. All these data wrangling operations are easily conducted using dplyr. Of course a full database solution would have been more elegant, but R does the job just as well, as long as it’s not millions of queries. No matter the data backend, at the end of this pipeline we have a database of tweets.
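
The collection step itself is not shown here; a minimal sketch of what it could look like (illustrative file names; rtweet’s search_tweets() and its status_id column are assumed, as is the filePath variable used below):

library(rtweet)
library(dplyr)

#Collect: run this repeatedly over several days
tweets <- search_tweets(the_query, n = 18000) %>%
  mutate(query_at = Sys.time())
saveRDS(tweets, file = paste0(filePath, "Tweets-Database-", safe_query, "-", Sys.Date(), ".RDS"))

#Combine: load all stored queries and keep only the newest entry per tweet
files <- list.files(filePath, pattern = "^Tweets-Database-", full.names = TRUE)
tw <- bind_rows(lapply(files, readRDS)) %>%
  group_by(status_id) %>%
  filter(query_at == max(query_at)) %>%
  ungroup()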

#Read the tweet database
tw <- readRDS(file=paste0(filePath,"Tweets-Database-",safe_query,"-","2016-09-25",".RDS"))
options(width=300,tibble.width = Inf)
tw %>% select(created_at, retweet_count,screen_name,text,hashtags,query_at)
## # A tibble: 10,974 × 6
##             created_at retweet_count    screen_name                                                                                                                                          text  hashtags            query_at
##                 <dttm>         <int>          <chr>                                                                                                                                         <chr>    <list>              <dttm>
## 1  2016-09-25 10:26:28             0       Lovebian                                               The latest #Zombie Nation! https://t.co/8ZkOFSZH2v Thanks to @NJTVNews @MaxfireXSA @Xtopgun901X <chr [1]> 2016-09-25 10:30:44
## 2  2016-09-25 10:25:49             2  MilesssAwaaay RT @Shaaooun: I'm gonna turn to a zombie soon! xdxdxdxd #AlmostSurvived #204Days #ITried #Zombie #StuckInMyRoom #Hahann#MediaDoomsDay #Kame <chr [7]> 2016-09-25 10:30:44
## 3  2016-09-25 10:21:10             6 catZzinthecity          RT @ZombieEventsUK: 7 reasons #TheGirlWithAllTheGifts is the best #zombie movie in years https://t.co/MB82ssxss2 via @MetroUK #Metro <chr [3]> 2016-09-25 10:30:44
## 4  2016-09-25 10:19:41             0  CoolStuff2Get                             Think Geek Zombie Plush Slippers https://t.co/0em920WCMh #Zombie #Slippers #MyFeetAreCold https://t.co/iCEkPBykCa <chr [3]> 2016-09-25 10:30:44
## 5  2016-09-25 10:19:41             4  TwitchersNews    RT @zOOkerx: Nur der frhe Vogel fngt den #zombie also schaut gemtlich rein bei @booty_pax! Now live #dayz on #twitch nnhttps://t.co/OIk6 <chr [3]> 2016-09-25 10:30:44
## 6  2016-09-25 10:17:45             0 ZombieExaminer     Washington mall shooting suspect Arcan Cetin was '#Zombie-like' during arrest - USA TODAY https://t.co/itoDXG3L8T https://t.co/q2mURi24DB <chr [1]> 2016-09-25 10:30:44
## 7  2016-09-25 10:17:44             4       SpawnRTs    RT @zOOkerx: Nur der frhe Vogel fngt den #zombie also schaut gemtlich rein bei @booty_pax! Now live #dayz on #twitch nnhttps://t.co/OIk6 <chr [3]> 2016-09-25 10:30:44
## 8  2016-09-25 10:17:23             0   BennyPrabowo                   bad miku - bad oni-chan... no mercyn.n.n.n.n#left4dead #games #hatsunemiku #fps #zombie #witch https://t.co/YP0nRDFFj7 <chr [6]> 2016-09-25 10:30:44
## 9  2016-09-25 10:12:53            62   Nblackthorne  RT @PennilessScribe: He would end her pain, but he could no longer live in a world that demanded such sacrifice. #zombie #apocalypsenhttps: <chr [2]> 2016-09-25 10:30:44
## 10 2016-09-25 10:06:46             0   mthvillaalva                                                             Pak ganern!!! Kakatapos ko lang kumain ng dugo! n#Zombie https://t.co/Zyd0btVJH4 <chr [1]> 2016-09-25 10:30:44
## # ... with 10,964 more rows

OOTB Zombie Surveillance

We are now ready to prospectively detect changes using the surveillance R package (Salmon, Schumacher, and Höhle 2016).

library("surveillance")

We shall initially focus on the #zombie series as it contains more counts. The first step is to convert the data.frame of individual tweets into a time series of daily counts.

#' Function to convert data.frame to queries. For convenience we store the time series
#' and the data.frame jointly as a list. This allows for easy manipulations later on
#' as we see data.frame and time series to be a joint package.
#'
#' @param tw data.frame containing the linelist of tweets.
#' @param the_query_subset String containing a regexp to restrict the hashtags
#' @return List containing sts object as well as the original data frame.
#'
df_2_timeseries <- function(tw, the_query_subset) {
  tw_subset <- tw %>% filter(grepl(gsub("#","",the_query_subset),hashtags))

  #Aggregate data per day and convert times series to sts object
  ts <- surveillance::linelist2sts(as.data.frame(tw_subset), dateCol="created_at_Date", aggregate.by="1 day")
  #Drop first day with observations, due to the moving window of the twitter index, this count is incomplete
  ts <- ts[-1,]

  return(list(tw=tw_subset,ts=ts, the_query_subset=the_query_subset))
}

zombie <- df_2_timeseries(tw, the_query_subset = "#zombie")

It’s easy to visualize the resulting time series using the plotting functionality of the surveillance package.


We see that the counts on the last day are incomplete. This is because the query was performed at 10:30 CEST and not at midnight. We therefore adjust the counts on the last day using simple inverse probability weighting. This just means that we scale up the counts by the inverse of the fraction of 24h that the query hour (10:30 CEST) represents (see the github code for details). This relies on the assumption that tweets are evenly distributed over the day.
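
The adjustment itself is a one-liner; a sketch of the idea (10.5 of the 24 hours of the last day had been observed at query time):

obs_last <- observed(zombie$ts)[nrow(zombie$ts), ]  # incomplete count of the last day
adjusted <- obs_last * 24 / 10.5                    # scale up by the inverse of the observed fraction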

We are now ready to apply a surveillance algorithm to the pre-processed time series. We shall pick the so-called C1 version of the EARS algorithm documented in Hutwagner et al. (2003) or Fricker, Hegler, and Dunfee (2008). For a monitored time point \(s\) (here: a particular day, say 2016-09-23), this simple algorithm takes the previous seven observations before \(s\) in order to compute the mean and standard deviation, i.e.
\[
\begin{align*}
\bar{y}_s &= \frac{1}{7} \sum_{t=s-7}^{s-1} y_t, \\
\operatorname{sd}_s &= \sqrt{\frac{1}{7-1} \sum_{t=s-7}^{s-1} (y_t - \bar{y}_s)^2}.
\end{align*}
\]
The algorithm then computes the z-statistic \(\operatorname{C1}_s = (y_s - \bar{y}_s)/\operatorname{sd}_s\) for each time point to monitor. Once the value of this statistic is above 3, an alarm is flagged. This means that we assume that the previous 7 observations are what is to be expected when no unusual activity is going on. One can interpret the statistic as a transformation to (standard) normality: once the current observation is too extreme under this model, an alarm is sounded. Such normal approximations are justified given the large number of daily counts in the zombie series we consider, but do not take secular trends or day-of-the-week effects into account. Note that the calculations can also be reversed in order to determine how large the number of observations needs to be in order to generate an alarm.
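
For concreteness, here is a hand-rolled version of the C1 statistic for the most recent day, including the reversed calculation of how many tweets would trigger an alarm (a sketch using the daily counts from zombie$ts):

y <- as.vector(observed(zombie$ts))
s <- length(y)
baseline  <- y[(s - 7):(s - 1)]                    # the previous seven observations
c1        <- (y[s] - mean(baseline)) / sd(baseline)
alarm     <- c1 > 3
threshold <- ceiling(mean(baseline) + 3 * sd(baseline))  # count needed to sound an alarm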

We now apply the EARS C1 monitoring procedure to the zombie time series, starting at the 8th day of the series. It is important to realize that the result of monitoring a time point in the graphic is obtained by only looking into the past. Hence, the relevant time point to consider today is whether an alarm would have occurred on 2016-09-25. We also show the other time points to see if we could have detected potential alarms earlier.

zombie[["sts"]] <- earsC(zombie$ts, control=list(range = 8:nrow(zombie$ts),
                         method = "C1", alpha = 1-pnorm(3)))

What a relief! No suspicious zombie activity appears to be ongoing. Actually, it would have taken 511 tweets before we would have raised an alarm on 2016-09-25. This is quite a number.

As an additional sensitivity analysis we redo the analyses for the #zombieattack hashtag. Here the use of the normal approximation in the computation of the alerts is more questionable. Still, we can get a time series of counts together with the alarm limits.


Also no indication of zombie activity. The number of additional tweets needed before an alarm in this case is: 21. Altogether, it looks safe out there…

Summary

R provides ideal functionality to quickly extract and monitor twitter time series. Combining this with statistical process control methods allows you to prospectively monitor the use of hashtags. Twitter has released a dedicated package for this purpose; however, for low-count time series it is better to use count time series monitoring methods as implemented in the surveillance package. Salmon, Schumacher, and Höhle (2016) contains further details on how to proceed in this case.

The important question, however, remains: Does this really work in practice? Can you sleep tight while your R zombie monitor scans twitter? Here is where the social experiment starts: Please help answer this question by retweeting the post below to create a drill alarm situation. More than 511 (!) and 21 additional tweets, respectively, are needed before an alarm will sound.

(placeholder tweet, this will change in a couple of minutes!!)

Video recording, slides & R code of our (???) MV Time Series webinar now available at https://t.co/XVtLrjbJKZ #biosurveillance #rstats

— Michael Höhle (@m_hoehle) 21. September 2016

I will continuously update the graphs in this post to see how our efforts are reflected in the time series of tweets containing the #zombieattack and #zombie hashtags. Thanks for your help!

References

Fricker, R. D., B. L. Hegler, and D. A. Dunfee. 2008. “Comparing syndromic surveillance detection methods: EARS’ versus a CUSUM-based methodology.” Stat Med 27 (17): 3407–29.

Hutwagner, L., W. Thompson, G. M. Seeman, and T. Treadwell. 2003. “The bioterrorism preparedness and response Early Aberration Reporting System (EARS).” J Urban Health 80 (2 Suppl 1): 89–96.

Salmon, M., D. Schumacher, and M. Höhle. 2016. “Monitoring Count Time Series in R: Aberration Detection in Public Health Surveillance.” Journal of Statistical Software 70 (10). doi:10.18637/jss.v070.i10.

To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice….


Source:: R News

Windows 10 anniversary updates includes a whole Linux layer – this is good news for data scientists

By John Johnson

(This article was first published on Realizations in Biostatistics, and kindly contributed to R-bloggers)

If you are on Windows 10, no doubt you have heard that Microsoft included the bash shell in its 2016 Windows 10 anniversary update. What you may not know is that this is much, much more than just the bash shell. This is a whole Linux layer that enables you to use Linux tools, and does away with a further layer like Cygwin (which requires a special dll). However, you will only get the bash shell out of the box. To enable the whole Linux layer, follow instructions here. Basically, this involves enabling developer mode then enabling the Linux layer feature. In the process, you will download some further software from the Windows store.

Why is this big news? To me, this installs a lot of the Linux tools that have proven useful over the years, such as wc, sed, awk, grep, and so forth. In some cases, these tools work much better than software packages such as R or SAS, and their power comes in combining these tools through pipes. You also get apt-get, which enables you to install and manage packages such as SQLite, octave, and gnuplot. You can even install R through this method, though I don’t know if RStudio works with R installed in this way.

If you’re a Linux buff who uses Windows, you can probably think of many more things you can do with this. The only drawback is that I haven’t tried using any sort of X Windows or other graphical interfaces.

To leave a comment for the author, please follow the link and comment on their blog: Realizations in Biostatistics.


Source:: R News

Radial Stacked Area Chart in R using Plotly

By Riddhiman

(This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

In this post we’ll quickly show how to create radial stacked area charts in plotly. We’ll use the AirPassengers dataset.

Inspired by Mike Bostock’s post: http://bl.ocks.org/mbostock/3048740

#devtools::install_github("ropensci/plotly")

library(plotly)
library(zoo)
library(data.table)

# Load Airpassengers data set
data("AirPassengers")

# Create data frame with year and month
AirPassengers <- zoo(coredata(AirPassengers), order.by = as.yearmon(index(AirPassengers)))
df <- data.frame(month = format(index(AirPassengers), "%b"),
                 year =  format(index(AirPassengers), "%Y"),
                 value = coredata(AirPassengers))

# Get coordinates for plotting
#Angles for each month
nMonths <- length(unique(df$month))
theta <- seq(0, 2*pi, by = (2*pi)/nMonths)[-(nMonths+1)]

# Append these angles to the data frame (one cycle of 12 monthly angles per year)
df$theta <- rep(theta, length(unique(df$year)))

# Cumulatively sum the number of passengers
dt <- as.data.table(df)
dt[,cumvalue := cumsum(value), by = month]
df <- as.data.frame(dt)

# Cartesian coordinates (x, y) space will be value*cos(theta) and value*sin(theta)
df$x <- df$cumvalue * cos(df$theta)
df$y <- df$cumvalue * sin(df$theta)

# Create hovertext
df$hovertext <- paste("Year:", df$year, "<br>",
                      "Month:", df$month, "<br>",
                      "Passegers:", df$value)

# Repeat January values
ddf <- data.frame()
for(i in unique(df$year)){
  temp <- subset(df, year == i)
  temp <- rbind(temp, temp[1,])
  ddf <- rbind(ddf, temp)
}

df <- ddf

# Plot
colorramp <- colorRampPalette(c("#bfbfbf", "#f2f2f2"))
cols <- colorramp(12)

cols <- rep(c("#e6e6e6", "#f2f2f2"), 6)

linecolor <- "#737373"

p <- plot_ly(subset(df, year == 1949), x = ~x, y = ~y, hoverinfo = "text", text = ~hovertext,
             type = "scatter", mode = "lines",
             line = list(shape = "spline", color = linecolor))

k <- 2
for(i in unique(df$year)[-1]){
  p <- add_trace(p, data = subset(df, year == i), 
                 x = ~x, y = ~y, hoverinfo = "text", text = ~hovertext,
                 type = "scatter", mode = "lines",
                 line = list(shape = "spline", color = linecolor),
                 fillcolor = cols[k], fill = "tonexty")
  
  k <- k + 1
}

start <- 100
end <- 4350
axisdf <- data.frame(x = start*cos(theta), y = start*sin(theta),
                     xend = end*cos(theta), yend = end*sin(theta))

p <- add_segments(p = p, data = axisdf, x = ~x, y = ~y, xend = ~xend, yend = ~yend, inherit = F,
                  line = list(dash = "8px", color = "#737373", width = 4),
                  opacity = 0.7)

p <- add_text(p, x = (end + 200)*cos(theta), y = (end + 200)*sin(theta), text = unique(df$month), inherit = F,
              textfont = list(color = "black", size = 18))

p <- layout(p, showlegend = F,
       title = "Radial Stacked Area Chart",
       xaxis = list(showgrid = F, zeroline = F, showticklabels = F, domain = c(0.25, 0.80)),
       yaxis = list(showgrid = F, zeroline = F, showticklabels = F),
       width = 1024,
       height = 600)

p

Radial Stacked Area Chart

To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.


Source:: R News