thinking with data with “Modern Data Science with R”

By Nick Horton

One of the biggest challenges educators face is how to teach statistical thinking integrated with data and computing skills to allow our students to fluidly think with data. Contemporary data science requires a tight integration of knowledge from statistics, computer science, mathematics, and a domain of application. For example, how can one model high earnings as a function of other features that might be available for a customer? How do the results of a decision tree compare to a logistic regression model? How does one assess whether the underlying assumptions of a chosen model are appropriate? How are the results interpreted and communicated?


While there are a lot of other useful textbooks and references out there (e.g., R for Data Science, Practical Data Science with R, Intro to Data Science with Python), we saw a need for a book that incorporates statistical and computational thinking to solve real-world problems with data. The result was Modern Data Science with R, a comprehensive data science textbook for undergraduates that features meaty, real-world case studies integrated with modern data science methods. (Figure 8.2 above was taken from a case study in the supervised learning chapter.)

Part I (introduction to data science) motivates the book and provides an introduction to data visualization, data wrangling, and ethics. Part II (statistics and modeling) begins with fundamental concepts in statistics, supervised learning, unsupervised learning, and simulation. Part III (topics in data science) reviews dynamic visualization, SQL, spatial data, text as data, network statistics, and moving towards big data. A series of appendices cover the mdsr package, an introduction to R, algorithmic thinking, reproducible analysis, multiple regression, and database creation.

We believe that several features of the book are distinctive:

  1. minimal prerequisites: while some background in statistics and computing is ideal, appendices provide an introduction to R, how to write a function, and key statistical topics such as multiple regression
  2. ethical considerations are raised early, to motivate later examples
  3. recent developments in the R ecosystem (e.g., RStudio and the tidyverse) are featured

Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in R/RStudio can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling statistical questions.

This book is intended to help readers with some background in statistics and modest prior experience with coding develop and practice the appropriate skills to tackle complex data science projects. We’ve taught a variety of courses using it, ranging from an introduction to data science to a sophomore-level data science course, and we’ve used it as part of a senior capstone class.
We’ve made three chapters freely available for download: data wrangling I, data ethics, and an introduction to multiple regression. An instructor’s solutions manual is available, and we’re working to create a series of lab activities (e.g., text as data). (The code to generate the above figure can be found in the supervised learning materials at http://mdsr-book.github.io/instructor.html.)


An unrelated note about aggregators:
We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

Source:: R News

Passing two SQL queries to sp_execute_external_script

By tomaztsql

(This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Recently, I got a question on one of my previous blog posts asking whether it is possible to pass two queries in the same run as arguments to the external procedure sp_execute_external_script.

Some of the arguments of the procedure sp_execute_external_script are enumerated. This is true for the input dataset, and as the name of the argument @input_data_1 suggests, one can easily (and quite reasonably) think that there could also be an @input_data_2 argument, and so on. Unfortunately, this is not the case. The external procedure can hold only one T-SQL dataset, passed in through this parameter.

There are many reasons for that; one is the cost of sending several datasets to the external process and back. In effect, this forces the user to rethink and pre-prepare the dataset (meaning, do all the data munging beforehand) prior to sending it into the external procedure.

But there are workarounds for passing an additional query (or queries) to sp_execute_external_script. I am not advocating this, and I strongly disagree with such usage, but here it is.

First, I will create two small datasets using T-SQL:

USE SQLR;
GO

DROP TABLE IF EXISTS dataset;
GO

CREATE TABLE dataset
(ID INT IDENTITY(1,1) NOT NULL
,v1 INT
,v2 INT
,CONSTRAINT pk_dataset PRIMARY KEY (id)
)

SET NOCOUNT ON;
GO

INSERT INTO dataset(v1,v2)
SELECT TOP 1
 (SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOB') ORDER BY NEWID()) AS V1
,(SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOD') ORDER BY NEWID()) AS v2
FROM master..spt_values
GO 50

This dataset will be used directly in the @input_data_1 argument. The next one will be used through the R code:

CREATE TABLE external_dataset
(ID INT IDENTITY(1,1) NOT NULL
,v1 INT
,CONSTRAINT pk_external_dataset PRIMARY KEY (id)
)

SET NOCOUNT ON;
GO

INSERT INTO external_dataset(v1)
SELECT TOP 1
 (SELECT TOP 1 number FROM master..spt_values WHERE type IN ('EOB') ORDER BY NEWID()) AS V1
FROM master..spt_values
GO 50

Normally, one would use a single dataset, like this:

EXEC sp_execute_external_script
     @language = N'R'
    -- script body reconstructed; the original listing was truncated here
    ,@script = N'OutputDataSet <- InputDataSet'
    ,@input_data_1 = N'SELECT * FROM dataset';

But by “injecting” an ODBC connection into the R code, we can allow the external procedure to reach back into your SQL Server and fetch an additional dataset.

This can be done as follows:

EXECUTE AS USER = 'RR';
GO

-- R script reconstructed (the original listing was truncated); the server
-- name and password below are placeholders
DECLARE @Rscript NVARCHAR(MAX)
SET @Rscript = N'
   library(RODBC)
   myconn <- odbcDriverConnect("driver={SQL Server};server=<server>;database=SQLR;uid=RR;pwd=<password>")
   external_dataset <- sqlQuery(myconn, "SELECT v1 FROM external_dataset")
   close(myconn)
   OutputDataSet <- cbind(InputDataSet, external_dataset)'

EXEC sp_execute_external_script @language = N'R', @script = @Rscript,
     @input_data_1 = N'SELECT v1, v2 FROM dataset';
GO

And the result will be the two datasets merged together, three columns in total, corresponding to the two source tables:

-- Check the results!
SELECT * FROM dataset
SELECT * FROM external_dataset

There are, as already mentioned, several factors that weigh against this approach, and I would not recommend it. Some are:

  • validating and keeping R code in one place
  • performance issues
  • additional costs of data transferring
  • using ODBC connectors
  • installing additional R packages (in my case RODBC package)
  • keeping different datasets in one place
  • security issues
  • additional login/user settings
  • firewall inbound/outbound rules setting

This, of course, can also be achieved with the *.XDF file format, if the files are stored locally or on the server.

As always, the code is available on GitHub.

Happy R-SQLing! 🙂

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.


Source:: R News

Effect Size Statistics for Anova Tables #rstats

By Daniel

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

My sjstats-package has been updated on CRAN. The past updates introduced new functions for various purposes, e.g. predictive accuracy of regression models or improved support for the marvelous glmmTMB-package. The current update, however, added some ANOVA tools to the package.

In this post, I want to give a short overview of these new functions, which report different effect size measures. These are useful beyond significance tests (p-values), because they estimate the magnitude of effects, independent of sample size. sjstats provides the following functions:

  • eta_sq()
  • omega_sq()
  • cohens_f()
  • anova_stats()

First, we need a sample model:

library(sjstats)
# load sample data
data(efc)

# fit linear model (the original line was truncated; this formula is
# assumed from the model terms shown in the output below)
fit <- aov(c12hour ~ as.factor(e42dep) + as.factor(c172code) + c160age,
           data = efc)

All functions accept objects of class aov or anova, so you can also use model fits from the car package, which allows fitting ANOVAs with different types of sums of squares. Other objects, like lm, will be coerced to anova internally.

The following functions return the effect size statistic as a named numeric vector, using the model’s term names.

Eta Squared

The eta squared is the proportion of the total variability in the dependent variable that is accounted for by the variation in the independent variable. It is the ratio of the sum of squares for each group level to the total sum of squares. It can be interpreted as percentage of variance accounted for by a variable.

For variables with 1 degree of freedom (in the numerator), the square root of eta squared is equal to the correlation coefficient r. For variables with more than 1 degree of freedom, eta squared equals R². This makes eta squared easily interpretable. Furthermore, these effect sizes can easily be converted into effect size measures that can be, for instance, further processed in meta-analyses.
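
As a quick sketch of that relationship (my own illustration, continuing with the efc data loaded above; the outcome variable is the one assumed in the model fit): for a model with a single continuous predictor (1 numerator df), the square root of eta squared matches the absolute Pearson correlation.

m <- aov(c12hour ~ c160age, data = efc)                   # one predictor, 1 df
sqrt(eta_sq(m))                                           # square root of eta squared
abs(cor(efc$c12hour, efc$c160age, use = "complete.obs"))  # |r|, the same value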

Eta squared can be computed simply with:

eta_sq(fit)
#>   as.factor(e42dep) as.factor(c172code)             c160age 
#>         0.266114185         0.005399167         0.048441046

Partial Eta Squared

The partial eta squared value is the ratio of the sum of squares for each group level to the sum of squares for each group level plus the residual sum of squares. It is more difficult to interpret, because its value strongly depends on the variability of the residuals. Partial eta squared values should be reported with caution, and Levine and Hullett (2002) recommend reporting eta or omega squared rather than partial eta squared.

Use the partial argument to compute partial eta squared values:

eta_sq(fit, partial = TRUE)
#>   as.factor(e42dep) as.factor(c172code)             c160age 
#>         0.281257128         0.007876882         0.066495448

Omega Squared

While eta squared estimates tend to be biased in certain situations, e.g. when the sample size is small or the independent variables have many group levels, omega squared estimates are corrected for this bias.

Omega squared can be simply computed with:

omega_sq(fit)
#>   as.factor(e42dep) as.factor(c172code)             c160age 
#>         0.263453157         0.003765292         0.047586841

Cohen’s F

Finally, cohens_f() computes Cohen’s F effect size for all independent variables in the model:

cohens_f(fit)
#>   as.factor(e42dep) as.factor(c172code)             c160age 
#>          0.62555427          0.08910342          0.26689334

Complete Statistical Table Output

The anova_stats() function takes a model input and computes a comprehensive summary, including the above effect size measures, returned as a tidy data frame (a tibble, to be exact):

anova_stats(fit)
#> # A tibble: 4 x 11
#>                  term    df      sumsq     meansq statistic p.value etasq partial.etasq omegasq cohens.f power
#>                 <chr> <dbl>      <dbl>      <dbl>     <dbl>   <dbl> <dbl>         <dbl>   <dbl>    <dbl> <dbl>
#> 1   as.factor(e42dep)     3  577756.33 192585.444   108.786   0.000 0.266         0.281   0.263    0.626  1.00
#> 2 as.factor(c172code)     2   11722.05   5861.024     3.311   0.037 0.005         0.008   0.004    0.089  0.63
#> 3             c160age     1  105169.60 105169.595    59.408   0.000 0.048         0.066   0.048    0.267  1.00
#> 4           Residuals   834 1476436.34   1770.307        NA      NA    NA            NA      NA       NA    NA

Like the other functions, the input may also be an object of class anova, so you can also use model fits from the car package, which allows fitting ANOVAs with different types of sums of squares:

anova_stats(car::Anova(fit, type = 3))
#> # A tibble: 5 x 11
#>                  term       sumsq     meansq    df statistic p.value etasq partial.etasq omegasq cohens.f power
#>                 <chr>       <dbl>      <dbl> <dbl>     <dbl>   <dbl> <dbl>         <dbl>   <dbl>    <dbl> <dbl>
#> 1         (Intercept)   26851.070  26851.070     1    15.167   0.000 0.013         0.018   0.012    0.135 0.973
#> 2   as.factor(e42dep)  426461.571 142153.857     3    80.299   0.000 0.209         0.224   0.206    0.537 1.000
#> 3 as.factor(c172code)    7352.049   3676.025     2     2.076   0.126 0.004         0.005   0.002    0.071 0.429
#> 4             c160age  105169.595 105169.595     1    59.408   0.000 0.051         0.066   0.051    0.267 1.000
#> 5           Residuals 1476436.343   1770.307   834        NA      NA    NA            NA      NA       NA    NA

References

Levine TR, Hullett CR. Eta Squared, Partial Eta Squared, and Misreporting of Effect Size in Communication Research. Human Communication Research 28(4); 2002: 612–625

Tagged: anova, R, rstats

To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!.


Source:: R News

R⁶ — General (Attys) Distributions

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice):

Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). A daily viz: https://t.co/aJ4KDsC5kC pic.twitter.com/ZoiEV3MhGp

— Matt Stiles (@stiles) July 25, 2017

I thought it would be neat (since Matt did the data scraping part already) to look at AG tenure distribution by party, while also pointing out where Sessions falls.

Now, while Matt did scrape the data, it’s tucked away in a JavaScript variable in an iframe on the page that contains his vis.

It’s still easier to get it from there than to re-scrape Wikipedia (as Matt did), thanks to the V8 package by @opencpu.

The following code:

  • grabs the vis iframe
  • extracts and evaluates the target javascript to get a nice data frame
  • performs some factor re-coding (for better grouping and to make it easier to identify Sessions)
  • plots the distributions using the beeswarm quasirandom algorithm

library(V8)
library(rvest)
library(ggbeeswarm)
library(hrbrthemes)
library(tidyverse)

# reconstructed: these lines were garbled in extraction; the iframe URL is
# elided here (see the original post for it)
pg <- read_html("<URL of the vis iframe>")

ctx <- v8()   # create a V8 JavaScript context
ctx$eval(pg %>% html_nodes("script") %>% html_text())

ctx$get("DATA") %>% 
  as_tibble() %>% 
  readr::type_convert() %>% 
  mutate(party = ifelse(is.na(party), "Other", party)) %>% 
  mutate(party = fct_lump(party)) %>% 
  mutate(color1 = case_when(
    party == "Democratic" ~ "#313695",
    party == "Republican" ~ "#a50026",
    party == "Other" ~ "#4d4d4d")
  ) %>% 
  mutate(color2 = ifelse(grepl("Sessions", label), "#2b2b2b", "#00000000")) -> ags

ggplot() + 
  geom_quasirandom(data = ags, aes(party, amt, color = color1)) +
  geom_quasirandom(data = ags, aes(party, amt, color = color2), 
                   fill = "#ffffff00", size = 4, stroke = 0.25, shape = 21) +
  geom_text(data = data_frame(), aes(x = "Republican", y = 100, label = "Jeff Sessions"), 
            nudge_x = -0.15, family = font_rc, size = 3, hjust = 1) +
  scale_color_identity() +
  scale_y_comma(limits = c(0, 4200)) +
  labs(x = "Party", y = "Tenure (days)", 
       title = "U.S. Attorneys General",
       subtitle = "Distribution of tenure in office, by days & party: 1789-2017",
       caption = "Source data/idea: Matt Stiles ") +
  theme_ipsum_rc(grid = "XY")

I turned the data into a CSV and stuck it in this gist if folks want to play w/o doing the js scraping.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


Source:: R News

SQL Server 2017 release candidate now available

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

SQL Server 2017, the next major release of the SQL Server database, has been available as a community preview for around 8 months, but now the first full-featured release candidate is available for public preview. For those looking to do data science with data in SQL Server, there are a number of new features compared to SQL Server 2016 which might be of interest.

SQL Server 2017 Release Candidate 1 is available for download now. For more details on these and other new features in this release, check out the link below.

SQL Server Blog: SQL Server 2017 CTP 2.1 now available

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-4)

By Guillaume Touzin

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Statistics are often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models, instead of building an intuition for probabilistic problems. But, if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistics problems by writing simple R queries, and how to think in probabilistic terms.

Until now, in this series of exercise sets, we have used only continuous probability distributions, which are functions defined on all the real numbers in a certain interval. As a consequence, random variables that follow those distributions can take an infinity of values. However, a lot of random situations have only a finite number of possible outcomes, and using a continuous probability distribution to analyze them is not really useful. In today’s set, we’ll introduce the concept of discrete probability distributions, which can be used in those situations, along with some examples of problems in which they apply.

Answers to the exercises are available here.

Exercise 1
Just as continuous probability distributions are characterized by a probability density function, discrete probability distributions are characterized by a probability mass function, which gives the probability that a random variable is equal to a particular value.

The first probability mass function we will use today is that of the binomial distribution, which is used to simulate n iterations of a random process that can result either in a success, with a probability of p, or in a failure, with a probability of (1-p). Basically, if you want to simulate something like coin flips, the binomial distribution is the tool you need.
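
For example (a quick illustration of my own, not one of the exercise answers), the probability of getting exactly 4 heads in 10 flips of a fair coin is:

dbinom(4, size = 10, prob = 0.5)
#> [1] 0.2050781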

Suppose you roll a 20-sided die 200 times and you want to know the probability of getting a 20 exactly five times in your rolls. Use the dbinom(x, size, prob) function to compute this probability.

Exercise 2
For the binomial distribution, the possible counts of successes are mutually exclusive outcomes, meaning that the probability that one of several counts is realized can be calculated by adding the probabilities of the individual counts. This principle can be generalized to any number of outcomes. For example, the probability of getting three tails or fewer when you flip a coin 10 times is equal to the probability of getting 0 tails, plus the probability of getting 1 tail, plus the probability of getting 2 tails, plus the probability of getting 3 tails.
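
A quick sketch of that identity (my own illustration):

sum(dbinom(0:3, size = 10, prob = 0.5))
#> [1] 0.171875
pbinom(3, size = 10, prob = 0.5)
#> [1] 0.171875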

Knowing this, use the dbinom() function to compute the probability of getting exactly six correct answers on a test made of 10 true-or-false questions if you answer randomly. Then, use the pbinom() function to compute the cumulative probability function of the binomial distribution in that situation.

Exercise 3
Another consequence is that if we know the probability of a set of outcomes, we can compute the probability of any of its subsets by subtracting the probability of the unwanted outcomes. For example, the probability of getting two or three tails when you flip a coin 10 times is equal to the probability of getting at most three tails minus the probability of getting at most one tail.
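
Again as a sketch (my own numbers):

pbinom(3, 10, 0.5) - pbinom(1, 10, 0.5)
#> [1] 0.1611328
sum(dbinom(2:3, 10, 0.5))
#> [1] 0.1611328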

Knowing this, compute the probability of getting 6 or more correct answers on the test described in the previous exercise.

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • Work with different binomial and logistic regression techniques
  • Know how to compare regression models and choose the right fit
  • And much more

Exercise 4
Let’s say that in an experiment a success is defined as getting a 1 when you roll a 20-sided die. Use the barplot() function to represent the probability of getting from 0 to 10 successes if you roll the die 10 times (a generic template for such a plot is sketched below). What happens to the barplot if you roll a 10-sided die instead? A 3-sided die?
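
A generic template of my own (using a fair coin rather than the dice from the exercise):

barplot(dbinom(0:10, size = 10, prob = 0.5), names.arg = 0:10,
        xlab = "number of successes", ylab = "probability")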

Exercise 5
Another discrete probability distribution close to the binomial distribution is the Poisson distribution, which gives the probability of a number of events occurring during a fixed amount of time if we know the average rate of occurrence. For example, we could use this distribution to estimate the number of visitors who come to a website if we know the average number of visitors per second. In this case, we must assume two things: first, that the website has visitors from around the world, so that the rate of visitors is roughly constant over the day; and second, that a visitor coming to the site is not influenced by the previous visitor, since a process can be described by the Poisson distribution only if the events are independent from each other.
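
For instance (my own numbers, not an exercise answer), the probability of seeing exactly 3 events in an interval whose average rate is 5 events:

dpois(3, lambda = 5)
#> [1] 0.1403739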

Use the dpois() function to estimate the probability of having 85 visitors on a website in the next hour if on average 80 individuals connect to the site per hour. What is the probability of getting 2000 unique visitors on the website in a day?

Exercise 6
The Poisson distribution can also be used to compute the probability of an event occurring in an amount of space, as long as the unit of the average rate is compatible with the unit of measure of the space you use. Suppose that a fishing boat catches 1/2 ton of fish when its net goes through 5 square kilometers of sea. If the boat combs 20 square kilometers, what is the probability that it catches 5 tons of fish?

Exercise 7
Until now, we used the Poisson distribution to compute the probability of observing precisely n occurrences of an event. In practice, we are often interested in knowing the probability that an event occurs n times or less. To do so, we can use the ppois() function to compute the cumulative Poisson distribution. If we are interested in knowing the probability of observing strictly more than n occurrences, we can use this function and set the parameter lower.tail to FALSE.
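
For example (my own numbers):

ppois(4, lambda = 2)                       # P(X <= 4)
#> [1] 0.947347
ppois(4, lambda = 2, lower.tail = FALSE)   # P(X > 4)
#> [1] 0.05265302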

In the situation of exercise 6, what is the probability that the boat catches 5 tons of fish or less? What is the probability that it catches more than 5 tons of fish?

Note that, just as in a binomial experiment, the events in a Poisson process are independent, so you can add or subtract probabilities of outcomes to compute the probability of a particular set of outcomes.

Exercise 8
Draw the Poisson distribution for average rates of 1, 3, 5 and 10.

Exercise 9
The last discrete probability distribution we will use today is the negative binomial distribution, which gives the probability of observing a certain number of successes before observing a fixed number of failures. For example, imagine that a professional football player will retire at the end of the season. This player has scored 495 goals in his career and would really like to reach the 500-goal mark before retiring. If he is set to play 8 games until the end of the season and scores one goal every three games on average, we can use the negative binomial distribution to compute the probability that he will meet his goal in his last game, supposing that he won’t score more than one goal per game.
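
In R, dnbinom(x, size, prob) reads this as the probability of observing x failures before the size-th success. A small illustration of my own: the probability of exactly 2 failures before the 3rd success when each trial succeeds with probability 0.5:

dnbinom(2, size = 3, prob = 0.5)
#> [1] 0.1875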

Use the dnbinom() function to compute this probability. In this case, the number of successes is 5, the probability of success is 1/3 and the number of failures is 3.

Exercise 10
As for the Poisson distribution, R gives us the option to compute the cumulative negative binomial distribution with the function pnbinom(). Again, the lower.tail parameter gives you the option to compute the probability of observing more than n failures if it is set to FALSE.
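
A sketch with my own numbers (x again counts failures before the size-th success):

pnbinom(3, size = 3, prob = 0.5)                       # P(X <= 3)
#> [1] 0.65625
pnbinom(3, size = 3, prob = 0.5, lower.tail = FALSE)   # P(X > 3)
#> [1] 0.34375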

In the situation of the last exercise, what is the probability that the football player will score at most 5 goals before the end of his career?

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

Experiment designs for Agriculture

By Fabio Veronesi

(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
This post is more for personal use than anything else. It is just a collection of code and functions to produce some of the most used experimental designs in agriculture and animal science.
I will not go into details about these designs. If you want to know more about what to use in which situation you can find material at the following links:
Design of Experiments (Penn State): https://onlinecourses.science.psu.edu/stat503/node/5

Statistical Methods for Bioscience (Wisconsin-Madison): http://www.stat.wisc.edu/courses/st572-larget/Spring2007/

R Packages to create several designs are presented here: https://cran.r-project.org/web/views/ExperimentalDesign.html
A very good tutorial about the use of the package Agricolae can be found here:
https://cran.r-project.org/web/packages/agricolae/vignettes/tutorial.pdf

Complete Randomized Design

This is probably the most common design, and it is generally used when conditions are uniform, so we do not need to account for variations due, for example, to soil conditions.
In R we can create a simple CRD with the function expand.grid and then with some randomization:
 TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))  
Data.CRD = TR.Structure[sample(1:nrow(TR.Structure),nrow(TR.Structure)),]

Data.CRD = cbind(PlotN=1:nrow(Data.CRD), Data.CRD[,-1])

write.csv(Data.CRD, "CompleteRandomDesign.csv", row.names=F)
The first line creates a basic treatment structure, with rep identifying the replicate number; it looks like this:

 > TR.Structure  
rep Treatment1 Treatment2
1 1 A A
2 2 A A
3 3 A A
4 1 B A
5 2 B A
6 3 B A
7 1 A B
8 2 A B
9 3 A B
10 1 B B
11 2 B B
12 3 B B
13 1 A C
14 2 A C
15 3 A C
16 1 B C
17 2 B C
18 3 B C

The second line randomizes the whole data.frame to obtain a CRD; then with cbind we add a column at the beginning with an ID for each plot, while also eliminating the rep column.

Add Control

To add a Control we need to write two separate lines, one for the treatment structure and the other for the control:
 TR.Structure = expand.grid(rep=1:3, Treatment1=c("A","B"), Treatment2=c("A","B","C"))  
CR.Structure = expand.grid(rep=1:3, Treatment1=c("Control"), Treatment2=c("Control"))

Data.CCRD = rbind(TR.Structure, CR.Structure)
This will generate the following table:

 > Data.CCRD  
rep Treatment1 Treatment2
1 1 A A
2 2 A A
3 3 A A
4 1 B A
5 2 B A
6 3 B A
7 1 A B
8 2 A B
9 3 A B
10 1 B B
11 2 B B
12 3 B B
13 1 A C
14 2 A C
15 3 A C
16 1 B C
17 2 B C
18 3 B C
19 1 Control Control
20 2 Control Control
21 3 Control Control

As you can see, the control is kept completely separate from the rest. Now we just need to randomize, again using the function sample:

 Data.CCRD = Data.CCRD[sample(1:nrow(Data.CCRD),nrow(Data.CCRD)),]  

Data.CCRD = cbind(PlotN=1:nrow(Data.CCRD), Data.CCRD[,-1])

write.csv(Data.CCRD, "CompleteRandomDesign_Control.csv", row.names=F)

Block Design with Control

The start is the same as before. The difference comes when we need to randomize: in a CRD we randomize over the entire table, but with blocks we need to do it block by block.
 TR.Structure = expand.grid(Treatment1=c("A","B"), Treatment2=c("A","B","C"))  
CR.Structure = expand.grid(Treatment1=c("Control"), Treatment2=c("Control"))

Data.CBD = rbind(TR.Structure, CR.Structure)

Block1 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]
Block2 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]
Block3 = Data.CBD[sample(1:nrow(Data.CBD),nrow(Data.CBD)),]

Data.CBD = rbind(Block1, Block2, Block3)

BlockID = rep(1:nrow(Block1),3)

Data.CBD = cbind(Block = BlockID, Data.CBD)

write.csv(Data.CBD, "BlockDesign_Control.csv", row.names=F)
As you can see from the code above, we’ve created three objects, one for each block, using the function sample to randomize each one.

Other Designs with Agricolae

The package agricolae includes many designs, which I am sure will cover all your needs in terms of setting up field and lab experiments.
We will look at some of them, so first let’s install the package:
 install.packages("agricolae")  
library(agricolae)

The main syntax for designs in agricolae is the following:

 Trt1 = c("A","B","C")  
design.crd(trt=Trt1, r=3)

The result is the output below:

 > design.crd(trt=Trt1, r=3)  
$parameters
$parameters$design
[1] "crd"

$parameters$trt
[1] "A" "B" "C"

$parameters$r
[1] 3 3 3

$parameters$serie
[1] 2

$parameters$seed
[1] 1572684797

$parameters$kinds
[1] "Super-Duper"

$parameters[[7]]
[1] TRUE


$book
plots r Trt1
1 101 1 A
2 102 1 B
3 103 2 B
4 104 2 A
5 105 1 C
6 106 3 A
7 107 2 C
8 108 3 C
9 109 3 B

As you can see, the function takes only one argument for treatments and another for replicates. Therefore, if we need to include a more complex treatment structure, we first need to build it ourselves:

 Trt1 = c("A","B","C")  
Trt2 = c("1","2")
Trt3 = c("+","-")

TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))
TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))
TRT.Control = c(TRT, rep("Control", 3))

As you can see, we now have three treatments, which are merged into unique strings with the function sapply:

 > TRT  
[1] "A1+" "A1-" "A2+" "A2-" "B1+" "B1-" "B2+" "B2-" "C1+" "C1-" "C2+" "C2-"

We then include the control, after which we can use the object TRT.Control with the function design.crd, obtaining the data.frame directly with $book:

 > design.crd(trt=TRT.Control, r=3)$book  
plots r TRT.Control
1 101 1 A2+
2 102 1 B1+
3 103 1 Control
4 104 1 B2+
5 105 1 A1+
6 106 1 C2+
7 107 2 A2+
8 108 1 C2-
9 109 2 Control
10 110 1 B2-
11 111 3 Control
12 112 1 Control
13 113 2 C2-
14 114 2 Control
15 115 1 C1+
16 116 2 C1+
17 117 2 B2-
18 118 1 C1-
19 119 2 C2+
20 120 3 C2-
21 121 1 A2-
22 122 2 C1-
23 123 2 A1+
24 124 3 C1+
25 125 1 B1-
26 126 3 Control
27 127 3 A1+
28 128 2 B1+
29 129 2 B2+
30 130 3 B2+
31 131 1 A1-
32 132 2 B1-
33 133 2 A2-
34 134 1 Control
35 135 3 C2+
36 136 2 Control
37 137 2 A1-
38 138 3 B1+
39 139 3 Control
40 140 3 A2-
41 141 3 A1-
42 142 3 A2+
43 143 3 B2-
44 144 3 C1-
45 145 3 B1-

Other possible designs are:

 #Random Block Design  
design.rcbd(trt=TRT.Control, r=3)$book

#Incomplete Block Design
design.bib(trt=TRT.Control, r=7, k=3)

#Split-Plot Design
design.split(Trt1, Trt2, r=3, design=c("crd"))

#Latin Square
design.lsd(trt=TRT.tmp)$sketch

Others not included above are: alpha designs, cyclic designs, augmented block designs, Graeco-Latin square designs, lattice designs, strip-plot designs and incomplete Latin square designs.

Final Note

For repeated measures and crossover designs, I think we can create designs simply by using the function expand.grid again, including time and subjects, as I did in my previous post about power analysis (a small sketch follows below). However, there is also the package Crossover, which deals specifically with crossover designs, and on this page you can find more packages that deal with clinical designs: https://cran.r-project.org/web/views/ClinicalTrials.html
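
A minimal sketch of that idea (my own illustration, with hypothetical treatment and time labels), in the same style as the code above:

RM = expand.grid(Time=c("T0","T1","T2"), Subject=1:6)
RM$Treatment = rep(c("A","B"), each=3)[RM$Subject]   # subjects 1-3 get A, 4-6 get B
RM[order(RM$Subject, RM$Time),]

Here each subject keeps a single treatment across all time points, which is the structure a repeated-measures analysis expects.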

To leave a comment for the author, please follow the link and comment on their blog: R tutorial for Spatial Statistics.


Source:: R News

Twitter Coverage of the Bioinformatics Open Source Conference 2017

By nsaunders

July 21-22 saw the 18th incarnation of the Bioinformatics Open Source Conference, which generally precedes the ISMB meeting. I had the great pleasure of attending BOSC way back in 2003 and delivering a short presentation on Bioperl. I knew almost nothing in those days, but everyone was very kind and appreciative.

My trusty R code for Twitter conference hashtags pulled out 3268 tweets, and without further ado, here it is:

The ISMB/ECCB meeting wraps today and analysis of Twitter coverage for that meeting will appear here in due course.


Source:: R News

The new MODIStsp website (based on pkgdown) is online!

By Lorenzo Busetto

(This article was first published on Spatial Processing in R, and kindly contributed to R-bloggers)

The MODIStsp website, which had lain abandoned for several months on GitHub Pages, recently underwent a major overhaul thanks to pkgdown. The new site is now available at http://lbusett.github.io/MODIStsp/

We hope that the revised website will make it much easier to navigate MODIStsp-related material than either GitHub or the standard CRAN documentation, and will therefore help users exploit MODIStsp functionality better and more easily.

The restyling was possible thanks to the very nice pkgdown R package (http://hadley.github.io/pkgdown), which makes it easy to create a static documentation website.

Though pkgdown already does quite a good job of creating a bare-bones website from just the material available in a standard devtools-based R package file structure, I was surprised at how simply the standard website could be tweaked to obtain a very nice (IMO) final result, without needing any particular background in HTML, CSS, etc.!
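
For anyone curious, the core workflow is tiny. A minimal sketch (assuming a standard devtools-based package; at the time of writing pkgdown is installed from GitHub):

# devtools::install_github("hadley/pkgdown")
library(pkgdown)
build_site()   # builds the static site, by default into the docs/ folder

From there, enabling GitHub Pages on the docs/ folder is enough to publish the site.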

(On that, I plan to soon post a short guide with a few tips and tricks I learnt over the last few days for setting up and configuring a pkgdown website, so stay tuned!)

To leave a comment for the author, please follow the link and comment on their blog: Spatial Processing in R.


Source:: R News