(This article was first published on **eKonometrics**, and kindly contributed to R-bloggers)

When it comes to undertaking complex data science projects, R is the preferred choice for many. Why? Because handling complex tasks is simpler in R than in other comparable platforms.

Regrettably, the same is not true for simpler tasks, which I would argue are rather complex to perform in base R. Hence the title: R is simple for complex tasks and complex for simple tasks.

Consider a simple, yet mandatory, task of generating summary statistics for a research project involving tabular data. In most social science disciplines, summary statistics for continuous variables include the mean, standard deviation, number of observations, and perhaps the minimum and maximum. One would have hoped to see a function in base R to generate such a table. But there isn’t one.

Of course, several user-written packages, such as **psych**, can generate descriptive statistics in a tabular format. However, this requires one to have advanced knowledge of R and of the capabilities hidden in specialized packages, whose number now exceeds 12,000 (as of March 2018). Keeping abreast of the functionality embedded in user-written packages is time-consuming.

Some would argue that the **summary** command in base R is an option. I humbly disagree.

First, the output from **summary** is not in a tabular format that one could just copy and paste into a document. It would require significant processing before a simple table with summary statistics for more than one continuous variable could be generated. Second, the **summary** command does not report standard deviation.

I teach business analytics to undergraduate and MBA students. While business students need to know statistics, they are not studying to become statisticians. Their goal in life is to be informed and proficient consumers of statistical analysis.

So, imagine an undergraduate class with 150 students learning to generate a simple table that reports summary statistics for more than one continuous variable. The simple task requires knowledge of several R commands. By the time one teaches these commands to students, most have made up their minds to do the analysis in Microsoft Excel instead.

Had there been a simple command to generate descriptive statistics in base R, this would not be a challenge for instructors trying to bring thousands more into R’s fold.

In the following paragraphs, I will illustrate the challenge with an example and identify an R package that generates a simple table of descriptive statistics.

I use the **mtcars** dataset, which is available with R. The following commands load the dataset and display the first few observations with all the variables.

**data(mtcars)**

**head(mtcars)**

As stated earlier, one can use the **summary** command to produce descriptive statistics.

**summary(mtcars)**
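To see why the **summary** output is awkward to reuse, it can help to inspect the object it returns. The sketch below is an illustration added for this purpose, not part of the original analysis. It uses base R only and restricts the call to three of the variables.

```r
# Inspect what summary() returns for a data frame (base R only)
data(mtcars)
s <- summary(mtcars[c("mpg", "disp", "hp")])

class(s)         # "table"
is.character(s)  # TRUE: every cell is a pre-formatted string (label plus value)
dim(s)           # 6 rows of statistics by 3 variables

# None of the cells reports a standard deviation
any(grepl("sd|st\\. ?dev", s, ignore.case = TRUE))  # FALSE
```

Because the cells are text rather than numbers, turning this object into a clean table of means and standard deviations would take extra parsing.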

Let’s say one would like to generate descriptive statistics including mean, standard deviation, and the number of observations for the following continuous variables: **mpg, disp, and hp**. One can use the **sapply** command to generate the three statistics separately and combine them later using the **cbind** command.

The following command will create a vector of means.

**mean.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], mean))**

Note that the above syntax requires someone learning R to know the following:

1. How to attach the dataset or use the **with** command so that variables can be referenced by name (strictly speaking, the explicit **mtcars[...]** subset above works even without **with**)
2. How to subset variables in R
3. How to use **c** to combine variable names
4. That variable names must be enclosed in quotes

We can use similar syntax to determine standard deviation and the number of observations.

**sd.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], sd)); sd.cars**

**n.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], length)); n.cars**

Note that the user needs to know that the command for number of observations is **length** and for standard deviation is **sd**.

Once we have the three vectors, we can combine them using **cbind**, which generates the following table.

**cbind(n.cars, mean.cars, sd.cars)**

         n.cars mean.cars    sd.cars
    mpg      32  20.09062   6.026948
    disp     32 230.72188 123.938694
    hp       32 146.68750  68.562868

Again, one needs to know the **round** command to restrict the output to a specific number of decimals. See below the output with two decimal places.

**round(cbind(n.cars, mean.cars, sd.cars), 2)**

         n.cars mean.cars sd.cars
    mpg      32     20.09    6.03
    disp     32    230.72  123.94
    hp       32    146.69   68.56

One can indeed use a custom function to generate the same with one command. See below.

**round(with(mtcars, t(sapply(mtcars[c("mpg", "disp", "hp")],**

**function(x) c(n=length(x), avg=mean(x),**

**stdev=sd(x))**

**))), 2)**

          n    avg  stdev
    mpg  32  20.09   6.03
    disp 32 230.72 123.94
    hp   32 146.69  68.56

But the question I have for my fellow instructors is the following. How likely is an undergraduate student taking an introductory course in statistical analysis to be enthused about R if the simplest of tasks needs multiple lines of code? A simple function in base R could keep students focussed on interpreting data rather than worrying about missing a comma or a parenthesis.

**stargazer*** is an R package that simplifies this task. Here is the output from **stargazer**.

**library(stargazer)**

**stargazer(mtcars[c("mpg", "disp", "hp")], type="text")**

    ============================================
    Statistic N   Mean   St. Dev.  Min     Max
    --------------------------------------------
    mpg       32 20.091   6.027   10.400 33.900
    disp      32 230.722 123.939  71.100 472.000
    hp        32 146.688  68.563    52     335
    --------------------------------------------

A simple task, I argue, should be accomplished simply. My plea is to include in base R a simple command that generates the above table with a call as simple as the one below:

**descriptives(mpg, disp, hp)**
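As a rough illustration of what such a command might look like, here is a minimal sketch written with base R only. The name `descriptives`, its `data` and `digits` arguments, and the choice of statistics are assumptions made for this example; no such function exists in base R. Unlike the call above, the sketch takes the data frame as its first argument so that the variable names can be looked up without attaching the dataset.

```r
# Hypothetical helper (not part of base R): descriptive statistics for
# selected numeric columns of a data frame, returned as a numeric matrix.
descriptives <- function(data, ..., digits = 2) {
  # Turn the unquoted column names passed in `...` into character strings
  vars <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))

  # If no columns are named, fall back to every numeric column
  if (length(vars) == 0) {
    vars <- names(data)[vapply(data, is.numeric, logical(1))]
  }

  # One row per variable, one column per statistic
  stats <- t(sapply(data[vars], function(x) {
    c(N    = sum(!is.na(x)),
      Mean = mean(x, na.rm = TRUE),
      SD   = sd(x, na.rm = TRUE),
      Min  = min(x, na.rm = TRUE),
      Max  = max(x, na.rm = TRUE))
  }))

  round(stats, digits)
}

# Usage with the variables discussed in this article
res <- descriptives(mtcars, mpg, disp, hp)
res
```

Passing the data frame explicitly is a design choice for the sketch: it avoids relying on **attach** and keeps the call close to the one-line form argued for here.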

\* Hlavac, Marek (2015). *stargazer: Well-Formatted Regression and Summary Statistics Tables.* R package version 5.2. http://CRAN.R-project.org/package=stargazer
