## New courses: Introduction to Statistics

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

This week we are launching four new courses as part of our Introduction to Statistics curriculum. We are taking a modern approach to teaching statistics with the use of simulations and randomization rather than a more traditional theoretical one. We have amazing instructors teaching the following courses:

Introduction to Data by Mine Çetinkaya-Rundel, Director of Undergraduate Studies and Associate Professor of the Practice in the Department of Statistical Science at Duke University.

• Statistics is the study of how best to collect, analyze and draw conclusions from data. In this course, you will focus on identifying a question or problem and collecting relevant data on the topic. Get started today.

Exploratory Data Analysis by Andrew Bray, Assistant Professor of Statistics at Reed College.

• In this course, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data. By the end of this course, you’ll be able to answer which variables suggest interesting relationships and which observations are unusual, all the while generating graphics that are both insightful and beautiful. Get started here.

Correlation and Regression by Ben Baumer, Assistant Professor in the Statistical & Data Sciences Program at Smith College and previously a Statistical Analyst for the NY Mets.

• Ultimately, data analysis is about understanding relationships among variables. In this course, you will learn how to describe relationships between two numerical quantities and characterize these relationships graphically in the form of summary statistics and through linear regression models. Start here.

Foundations of Inference by Jo Hardin, Professor of Mathematics & Statistics at Pomona College.

• Inference, the process of drawing conclusions about a larger population from a sample of data, is a foundational aspect of statistical analysis. In this course, you will learn the standard practice of attempting to disprove a research claim that is not of interest (the null hypothesis), how to quantify the degree of disagreement between the data and that hypothesis (the p-value), and, lastly, confidence intervals. Start today.

Some of our instructors are part of OpenIntro, an organization that aims to make educational products that are free, transparent, and that lower barriers to entry to education. Here’s what they had to say about this new method of teaching statistics:

“In these courses, we introduce foundational statistical topics such as exploratory data analysis, statistical inference, and modeling with a focus on both the why and the how. We use real data examples to introduce the ideas of statistical inference within a randomization and simulation framework. We also walk students through the implementation of each method in R using tools from the tidyverse so that students completing the courses are equipped with both a conceptual understanding of the statistical methods presented and also concrete tools for applying them to data.”

– Mine Çetinkaya-Rundel

“The time commitment (~4 hours) for each of the DataCamp courses is just long enough to really sink your teeth into a topic without having to commit to an entire semester. After taking a course, you will be in a position to move forward either to apply the topic to your own work or to take more courses in order to deepen your knowledge.”

– Jo Hardin

“If you want to build your technical skills for data science, there are many resources online. What makes DataCamp special is the interactive coding environment that offers immediate feedback. This introductory statistics sequence goes even further by coordinating a sequence of courses around a single theme.”

-Ben Baumer

These courses are the first of our upcoming Introduction to Statistics track. Expect four more courses in the near future, so stay tuned. DataCamp is also officially launching DataCamp for the Classroom: teaching staff can now use DataCamp for free with their students. Professors can create assignments, manage due dates, and access all of DataCamp’s premium courses. We believe this Introduction to Statistics track will make an excellent addition to any classroom.

See you in the course!

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Function to Simulate a Simple Parabolic Shot

(This article was first published on Data R Value, and kindly contributed to R-bloggers)

This function simulates a parabolic shot from the origin and does not take air resistance into account.

```
# Script to calculate the most important quantities of a parabolic
# shot without friction, in International System (SI) units.

# To use another system of units, change only the value of g and
# the units shown in the prints and legends.

# Inputs of the function:
# a) initial velocity vo (m/s)
# b) shot angle alfa (degrees)

# Parameter:
# c) gravitational acceleration g = 9.81 m/s^2

# Outputs of the function:
# T_1: ascending time
# T_2: descending time
# H:   maximum height
# T_t: total time
# L:   maximum horizontal range

parabolic <- function(vo, alfa){

  g <- 9.81
  an <- alfa * pi / 180            # convert degrees to radians
  T_1 <- round(vo * sin(an) / g, 2)
  H   <- round(vo^2 * sin(an)^2 / (2 * g), 2)
  T_2 <- T_1                       # without friction the flight is symmetric
  T_t <- round(2 * vo * sin(an) / g, 2)
  L   <- round(vo^2 * sin(2 * an) / g, 2)

  print("Ascending time"); print(paste(T_1, "s"))
  print("Maximum height"); print(paste(H, "m"))
  print("Descending time"); print(paste(T_2, "s"))
  print("Total time"); print(paste(T_t, "s"))
  print("Max. horizontal range"); print(paste(L, "m"))

  # trajectory: y(x) = tan(an)*x - g*x^2 / (2*vo^2*cos(an)^2)
  p <- seq(0, round(L), length.out = 101)
  y <- tan(an) * p - (g / (2 * vo^2 * cos(an)^2)) * p^2

  plot(p, y, xlab = "X", ylab = "Y", type = "o", col = "red", axes = FALSE)
  axis(1, at = seq(0, L, L / 10), labels = seq(0, L, L / 10),
       cex.axis = 0.7)
  axis(2, at = seq(0, H, H / 10), labels = seq(0, H, H / 10),
       cex.axis = 0.7)
  legend(L / 4, H / 3, legend = c(paste("vo =", vo, "m/s"),
                                  paste("alfa =", alfa, "degrees"),
                                  paste("Ascending time", paste(T_1, "s")),
                                  paste("Maximum height", paste(H, "m")),
                                  paste("Descending time", paste(T_2, "s")),
                                  paste("Total time", paste(T_t, "s")),
                                  paste("Max. horizontal range", paste(L, "m"))),
         cex = 0.7, bg = par("bg"))
  title(main = "Parabolic Shot", sub = "From Origin & Frictionless")
}
```

We test the function with an initial velocity of 100 m/s and an angle of 45 degrees:

```
parabolic(100, 45)
```

In a follow-up post, we will run the same simulation including the effect of air friction.

If you want to use the function in another system of units, you only have to change the value of the acceleration of gravity g.
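As a quick sanity check on the formulas above, the closed-form values for vo = 100 m/s and alfa = 45 degrees can be computed directly (a minimal sketch, using the same g = 9.81 m/s^2):

```
g  <- 9.81
vo <- 100
an <- 45 * pi / 180

T_1 <- vo * sin(an) / g               # ascending time
H   <- vo^2 * sin(an)^2 / (2 * g)     # maximum height
L   <- vo^2 * sin(2 * an) / g         # maximum horizontal range

round(c(T_1 = T_1, H = H, L = L), 2)
# T_1 = 7.21 s, H = 254.84 m, L = 1019.37 m
```

These match the values printed by `parabolic(100, 45)`.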



## Simulating genes and counts for DESeq2 analysis

(This article was first published on Let’s talk about science with R, and kindly contributed to R-bloggers)

Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNA-seq expression count data from a uniform distribution with minimum = 0 and maximum = 1200.

```
# Get all human gene symbols from biomaRt
library("biomaRt")
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
my_results <- getBM(attributes = c("hgnc_symbol"), mart = mart)

# Simulate 100 gene names to be used for our cnts matrix
set.seed(32268)
my_genes <- with(my_results, sample(hgnc_symbol, size = 100, replace = FALSE))

# Simulate a cnts matrix: 100 genes x 6 samples
cnts <- matrix(runif(600, min = 0, max = 1200), ncol = 6)
cnts <- apply(cnts, c(1, 2), as.integer)
dim(cnts)
```

Now, say we run DESeq2 to look for differentially expressed genes between our two simulated groups.

```
# Running DESeq2, based on
# https://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/RNA-seqWorkflow.pdf
library("DESeq2")
grp.idx <- rep(c("KO", "WT"), each = 3)
coldat <- DataFrame(grp = factor(grp.idx, levels = c("WT", "KO")))

# Add the column names and gene names
colnames(cnts) <- paste(grp.idx, 1:6, sep = "_")
rownames(cnts) <- my_genes

# Run DESeq2 analysis on the simulated counts
dds <- DESeqDataSetFromMatrix(cnts, colData = coldat, design = ~ grp)
dds <- DESeq(dds)
deseq2.res <- results(dds)
deseq2.fc <- deseq2.res$log2FoldChange
names(deseq2.fc) <- rownames(deseq2.res)
exp.fc <- deseq2.fc

# head(exp.fc):
#       SDAD1       SVOPL     SRGAP2C     MTND1P2      CNN2P8        IL13
# -0.48840808  0.32122109 -0.55584857  0.00184246 -0.15371042  0.11555792
```

Now let’s see how many simulated genes had an absolute log2 fold change of at least 1 by chance.

```
# Sort the fold changes from the DESeq2 analysis in decreasing order
geneList <- sort(exp.fc, decreasing = TRUE)   # log2 FC values

# Keep the genes with an absolute log2 fold change of at least 1
gene <- geneList[abs(geneList) >= 1]
gene
# C1orf216
#-1.129836
```

Now it’s your turn! What other probability distributions could we simulate data from to build a mock RNA-seq experiment and see how many genes differ by chance? You could even take a bootstrap-style approach, estimating a p-value from 1000 permutations of the group labels. Of course, in practice we use adjusted p-values to guard against these problems, but it is always nice to go back to basics and stress the importance of applying sound statistical methods when looking at differentially expressed genes. I encourage you all to leave your answers in the comment section below to inspire others.
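The permutation idea can be sketched without rerunning DESeq2: here is a simplified stand-in that uses log2 fold changes of group means on the same kind of simulated uniform counts, so the numbers are illustrative only.

```
set.seed(32268)
cnts <- matrix(as.integer(runif(600, min = 0, max = 1200)), ncol = 6)

# log2 fold change of group means (columns idx vs the rest), with a pseudocount
log2fc <- function(m, idx) {
  log2((rowMeans(m[, idx]) + 1) / (rowMeans(m[, -idx]) + 1))
}

# how many genes pass |log2 FC| >= 1 with the original grouping (columns 1-3 vs 4-6)
observed <- sum(abs(log2fc(cnts, 1:3)) >= 1)

# permute the sample labels 1000 times and recount
perm_counts <- replicate(1000, {
  idx <- sample(6, 3)
  sum(abs(log2fc(cnts, idx)) >= 1)
})

# fraction of permutations at least as extreme as the observed count
mean(perm_counts >= observed)
```

Since the data are pure noise here, the observed count is itself just one draw from the null, which is exactly the point of the exercise.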

Happy R programming!


## Data Science Virtual Machine updated, now includes RStudio, JuliaPro

The Windows edition of the Data Science Virtual Machine (DSVM) was recently updated on the Azure Marketplace. This update upgrades some existing components and adds some new ones as well.

You now have your choice of integrated development environment to use with R. RStudio Desktop is now included in the Data Science Virtual Machine image — no need to install it manually. And R Tools for Visual Studio has been upgraded to Version 0.5.

Microsoft R Server has also been upgraded to version 9.0. This includes the latest R 3.3.2 language engine, the RevoScaleR package for big-data support in R, and the new MicrosoftML package with several new, high-performance machine learning techniques. If you choose a GPU-enabled NC-class instance for your DSVM, the MicrosoftML package can make use of the GPUs for even more performance.

The DSVM also supports the Julia language with the inclusion of the JuliaPro distribution. This includes the Julia compiler and popular packages, a debugger, and the Juno IDE. If you’re new to Julia, the blog post Julia – A Fresh Approach to Numerical Computing provides an introduction.

There are also improved Deep Learning capabilities, with the latest version of the Microsoft Cognitive Toolkit (formerly called CNTK). And if you use a GPU-enabled instance, the Deep Learning Toolkit extension provides GPU-enabled builds of the Cognitive Toolkit, MXNet, and TensorFlow.

If you want to try out the Data Science Virtual Machine, the blog post linked below provides links to the documentation and several tutorials to get you started, along with information about the Linux edition of the DSVM.

Cortana Intelligence and Machine Learning Blog: New Year & New Updates to the Windows Data Science Virtual Machine


## Importing Data into R, part II

When you go to import data using RStudio, you get a menu like this.

Once that has completed, you’ll see the new import data window (shown below).

Okay, so first let’s make a simple comma delimited data file so we can test out the new import dataset process. I have made a simple file called “x-y-data.txt” as shown below. If you make this same file (no spaces, just a comma to separate the x column from the y column) then we can do this exercise together.

Now, let’s use the RStudio import to bring in the file “x-y-data.txt”. Here’s a screen grab of the import screen with my x-y dataset.

We can see that RStudio has used the first row as names, has recognized that it is a comma delimited file, and has read both x and y values as integers. Everything looks good, so I click “import”.

It was after this import process that I ran into problems while running some of my standard functions, such as making an empirical CDF (cumulative distribution function). So let’s check the type of data we have imported.

```
# get the data structure
typeof(x_y_data)
#[1] "list"
class(x_y_data)
#[1] "tbl_df"     "tbl"        "data.frame"
```

While the old RStudio would have imported this as a matrix by default, this latest version of RStudio imports data as a data frame by default. Apparently RStudio has created their own version of a data frame called a “tbl_df” or tibble data frame. When you use the ‘readr’ package, your data is imported automatically as a “tbl_df”.

Now this isn’t necessarily a bad thing, in fact it seems like there is some nice functionality gained by using the “tbl_df” format. This change just broke some of my previously written code and it’s good to know what RStudio is doing by default.

If we want to get back to the matrix format, we can do this with a simple call to as.matrix(). From there we can verify the conversion using the typeof() and class() functions.

```
# convert to a matrix
data <- as.matrix(x_y_data)
data
#     x  y
#[1,] 1  2
#[2,] 2  4
#[3,] 3  6
#[4,] 4  8
#[5,] 5 10

typeof(data)
#[1] "integer"
class(data)
#[1] "matrix"
```
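For a self-contained version of the same round trip, you can build the x-y data directly in code instead of importing the file:

```
# build the x-y data as a plain data frame (same values as x-y-data.txt)
x_y_data <- data.frame(x = 1:5, y = c(2L, 4L, 6L, 8L, 10L))

# a data frame is stored internally as a list of columns
typeof(x_y_data)    # "list"

# as.matrix() collapses it to a single atomic matrix
data <- as.matrix(x_y_data)
typeof(data)        # "integer"
class(data)         # "matrix" (recent R versions also report "array")
```

The same typeof()/class() checks apply whether the data frame came from readr, read.csv(), or was built by hand.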

You can read more about the new Tibble structure at these websites:

https://blog.rstudio.org/2016/03/24/tibble-1-0-0/

http://www.sthda.com/english/wiki/tibble-data-format-in-r-best-and-modern-way-to-work-with-your-data

Enjoy!


## Data Hacking with RDSTK (part 1)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geo-locations and derive statistics from them. It also allows you to input a body of text and convert it into sentiments.

This package provides an R interface to Pete Warden’s Data Science Toolkit. See www.datasciencetoolkit.org for more information.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Install and load the RDSTK package.

Exercise 2

Convert the IP address "165.124.145.197" to coordinates. Store the result in a variable called stat.

Exercise 3

Derive the elevation of that location using the latitude and longitude. Use the coordinates2statistics() function to achieve this. Once you get the elevation, store it back as one of the features of stat.

Exercise 4

Derive the population density of that location using the latitude and longitude. Use the coordinates2statistics() function to achieve this. Once you get the population density, store it back as one of the features of stat, called pop_den.

Exercise 5

Great, you are getting the hang of it. Let us try getting the mean temperature of that location. You will notice that it returns a list of 12 numbers, one for each month.

Run this code and see for yourself:

```
coordinates2statistics(stat[3], stat[6], "mean_temperature")[1]
```

Exercise 6

We have to transform mean_temperature so we can store it as one of the features in our stat dataset. One way to do this would be to convert it from long to wide format, but that would be redundant. Let’s just find the mean temperature across January-December. You might find the sapply() function useful for converting each element of the list to a number.
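The sapply() step can be sketched on mock data (twelve made-up monthly values standing in for the API's list):

```
# mock list of 12 monthly mean temperatures, as the API returns them
monthly <- as.list(c(2, 4, 8, 12, 17, 21, 24, 23, 19, 13, 7, 3))

# collapse the list to a numeric vector, then average it
mean_temp <- mean(sapply(monthly, as.numeric))
mean_temp
# 12.75
```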

Exercise 7

We decided we do not really need the January-December mean. We actually need the mean temperature from June to December. Make that adjustment to your last code and store the result back in stat under the name mean_temp.

Exercise 8

Okay, great. Now let’s work with more IP-address data. Here is a list of a few IP addresses scraped from a few commenters on my exercises.

```
list <- c("165.124.145.197", "31.24.74.155", "79.129.19.173")
df <- data.frame(list)
df[, 1] <- as.character(df[, 1])
```

Exercise 9

Use an iterator like apply() to go through the list and derive statistics for each address with the ip2coordinates() function. This is the first part; you may get a list-within-list sort of result. Store this in a variable called data.

Exercise 10

Use a method of your choice to convert that list within a list into a data frame with 3 rows and all the columns derived from the ip2coordinates() function.
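Since ip2coordinates() needs network access, the flattening step can be sketched with mock output (the field names here are illustrative, not the API's actual columns):

```
# mock list-within-list result, one element per IP address
data <- list(
  list(ip = "165.124.145.197", latitude = 42.06, longitude = -87.68),
  list(ip = "31.24.74.155",    latitude = 51.50, longitude = -0.12),
  list(ip = "79.129.19.173",   latitude = 37.98, longitude = 23.72)
)

# bind the per-address lists into a 3-row data frame
result <- do.call(rbind, lapply(data, as.data.frame))
dim(result)
# 3 rows, 3 columns
```

The same do.call(rbind, lapply(...)) pattern works on the real ip2coordinates() output, whatever columns it returns.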



## Open Data R Meetup: exploring the Distribution of Traffic Accidents in Belgrade, 2015 in R

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

The R code that accompanies this post is found on GitHub: you will find the R, Rmd, and HTML files that were used during the first Open Data R Meetup, held in Belgrade on 31 January 2017 and organized by Data Science Serbia in Startit Center, Savska 5, Belgrade, Serbia. The Open Data initiative in Serbia is still young and our Open Data Portal is still under development. And guess what: we from Data Science Serbia will join the Working Group for Open Data of the Directorate for eGovernment to help open, standardize, structure, publish, and analyse the many forthcoming open data sets from our country, in R, of course.

The data set under exploration here encompasses data on traffic accidents in Belgrade for 2015 (December 2015 data are missing). The notebook focuses on an exploratory analysis of this test open data set that was provided at the Open Data Portal of the Republic of Serbia (the portal is currently under development). The data set was kindly provided to the Open Data Portal by the Republic of Serbia Ministry of Interior. Many more open data sets will be indexed and uploaded in the forthcoming weeks and months.

The Distribution of Traffic Accidents 2015, Belgrade. Part of the city core is shown on the map produced by ggmap, ggplot2 w. geom_density2d() and stat_density2d().

Besides focusing on the exploration and visualization of this test data set, we have demonstrated the basic usage of {weatherData} to fetch historical weather data to R, {wbstats} to access the rich World Data Bank time series, and {ISOcodes} packages in R.

Some exploratory modeling (Negative Binomial Regression with `glm.nb()` and Ordinal Logistic Regression with `clm()` from {ordinal}) is exercised merely to assess the prima facie effects of the most influential factors.

Predicted vs. Observed number of traffic accidents frequency per day, Belgrade 2015. Negative binomial regression for overdispersed frequency data with glm.nb().
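As a minimal, self-contained illustration of the kind of negative binomial model used above (simulated overdispersed counts, not the meetup's actual accident data; the covariate is hypothetical):

```
library(MASS)  # recommended package shipped with R; provides glm.nb()

set.seed(2017)
n <- 365
temperature <- rnorm(n, mean = 12, sd = 8)   # hypothetical daily covariate
mu <- exp(1.5 + 0.03 * temperature)          # true mean on the log-link scale
accidents <- rnbinom(n, mu = mu, size = 2)   # overdispersed daily counts

fit <- MASS::glm.nb(accidents ~ temperature)
coef(fit)
# the slope estimate should land near the true value of 0.03
```

The size parameter of rnbinom() controls the overdispersion that makes glm.nb() preferable to a plain Poisson GLM here.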

Hopefully, this is just the beginning of our exploratory analyses of open data in R; in the following months, Data Science Serbia will work hard to enable cross-country open data comparisons by elaborating on the forthcoming Serbian open data sets, and to promote R as the lingua franca of the discipline.


## Exact p-values for pairwise comparison of Friedman rank sums

(This article was first published on Rense Nieuwenhuis » R-Project, and kindly contributed to R-bloggers)

BMC Bioinformatics has published a paper by colleagues of mine, about calculating exact p-values for pairwise comparison of Friedman rank sums. The paper provides fast and easy-to-use R code, making it an interesting read for anyone conducting the Friedman test. Fancy full text is available at http://rdcu.be/oOf9

Exact p-values for pairwise comparison of Friedman rank sums, with application to comparing classifiers, Eisinga, Heskes, Pelzer & Te Grotenhuis, BMC Bioinformatics, 2017, 18:68. http://dx.doi.org/10.1186/s12859-017-1486-2


## Enough trying to use LTO

By Avraham

After months and months of working trying to get R on 64 bit Windows to work with link-time-optimization (LTO), I’ve come to the conclusion that GCC 4.9.3’s implementation of LTO just isn’t polished enough to work across the entire R infrastructure. I have been able to get base R, nloptr, Rblas, and Rlapack to compile with LTO, but they are not really any faster than without it, so it isn’t worth the headache. I’ve removed the steps relating to LTO from my instructions on how to build R+OpenBLAS.


## Mapping unemployment data, 2016

The last few posts at Sharp Sight about data analysis have been long and fairly intense.

Let’s do something a little more fun. Let’s make a quick map.

## How to make a compelling map, in a few dozen lines of code

The code to create this map is surprisingly brief.

```
#==============
# LOAD PACKAGES
#==============

library(ggplot2)
library(viridis)

#==========
# LOAD DATA
#==========

url.unemploy_map <- url("http://sharpsightlabs.com/wp-content/datasets/unemployment_map_data_2016_nov.RData")
load(url.unemploy_map)

#===========
# CREATE MAP
#===========

ggplot() +
geom_polygon(data = map.county_unemp, aes(x = long, y = lat, group = group, fill = unemployed_rate)) +
geom_polygon(data = map.states, aes(x = long, y = lat, group = group), color = "#EEEEEE", fill = NA, size = .3) +
coord_map("albers", lat0 = 30, lat1 = 40) +
labs(title = "United States unemployment rate, by county" , subtitle = "November, 2016") +
labs(fill = "% unemployed") +
scale_fill_viridis() +
theme(text = element_text(family = "Gill Sans", color = "#444444")
,plot.title = element_text(size = 30)
,plot.subtitle = element_text(size = 20)
,axis.text = element_blank()
,axis.title = element_blank()
,axis.ticks = element_blank()
,panel.grid = element_blank()
,legend.position = c(.9,.4)
,legend.title = element_text(size = 16)
,legend.background = element_blank()
,panel.background = element_blank()
)

```

Now, I will admit that there was a fair amount of data manipulation that made this map possible. I’ve provided the dataset to you (you can access it and use it immediately), so you don’t have to do the hard work of gathering, wrangling, and shaping this data in order to create this map.

Nevertheless, once you have the dataset ready, the final code to create the map is only about 20 lines. Once again (and for the record) this is why I love ggplot2. ggplot2 gives you the tools to create compelling data visualizations with relative ease and simplicity.

A few other things to note:

## This is a cousin of the heatmap

This type of map is called a choropleth map. The choropleth is sort of a cousin of the heatmap. The primary difference is that to create a heatmap in ggplot2 we use geom_tile(), whereas to create this choropleth map we use geom_polygon(). In either case, though, we’re plotting shapes and shading those shapes in proportion to some metric. Syntactically, in ggplot2, we shade those regions by mapping a variable to the fill = aesthetic.

Having said that, there are clear differences as well. Getting the polygon data for the counties and manipulating the data into shape is much harder for a choropleth than for a typical heatmap. Getting data for heatmaps is typically much easier.

That said, if you want to learn to make a map like this, I recommend starting with the heatmap first. Once again, I’m recommending that you learn the basics first, because basic techniques serve as a foundation for more complicated ones.

If you start with the heatmap, you’ll get some practice with a few critical skills. Namely, you’ll be able to practice working with different color palettes. To create compelling heatmaps (and choropleths), you’ll need to know how to use colors to create the right visual effect.

Learning the heatmap will also give you ggplot syntax practice, and show you how to map variables to the fill = aesthetic.
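If you want to try that practice route, here is a minimal geom_tile() heatmap on made-up data, mapping a value to the fill = aesthetic just as the choropleth does (the grid and values are invented for illustration):

```
library(ggplot2)

# made-up grid of values to shade
df <- expand.grid(x = 1:10, y = 1:10)
df$value <- with(df, sin(x / 2) + cos(y / 2))

p <- ggplot(df, aes(x = x, y = y, fill = value)) +
  geom_tile() +
  scale_fill_viridis_c()   # viridis scale built into recent ggplot2

p
```

Swap geom_tile() for geom_polygon() plus polygon data, and you are most of the way to the county map above.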

## We built this plot in layers

In many of my recent blog posts, I’ve emphasized “layering” as an essential principle of data visualization.

In this visualization, it’s subtle, but the layering principle is still at work.

There are two primary layers here. Go through the code and see if you can identify them.

Next, as an exercise, try removing the “state” layer and see what happens. Why is it useful to have? Also, notice that the states are only outlines (i.e., they aren’t filled in …). Why did I do that, and how did I accomplish it? Leave your answers to these in the comments below …

## Maps aren’t good for making precise comparisons

I’ll admit that I quite like maps. As a data scientist, they’re fun to make. They’re compelling and beautiful to look at (if you execute them well).

But a map like this has limitations.

The big limitation is that when data is encoded as color (as we’ve done in this map), humans aren’t good at making precise distinctions between values. For example, if you live in the US, try identifying your home county. Can you tell the exact unemployment rate? Probably not. Moreover, try to compare one county with another. You’ll quickly realize that you can make general statements like “county X has higher unemployment than county Y,” but you won’t be able to make precise statements on the basis of this map alone.

Another exercise: What visualization technique could you use if you wanted to make precise distinctions? (Leave your answer in the comments below.)

## 80% of this is just formatting

If you’re a beginner with ggplot2, the code might look a little complex.

It’s actually much simpler than it seems at first glance.

About 80% of the code for the finalized chart is just formatting code. 80% just deals with things like the font formatting, the legend position, the text colors, etc.

We can actually strip away a lot of that formatting code and still create a functional map.

Here’s a stripped down version:

```ggplot() +
geom_polygon(data = map.county_unemp, aes(x = long, y = lat, group = group, fill = unemployed_rate))

```

We can strip the map down to two lines of code. Those two lines are the core; they do the “heavy lifting” to create the map, and the rest is just formatting.

So, the good news is that if you want to begin making maps like these, you can get started quickly by learning only a couple lines of code.