gunsales 0.1.2

By Thinking inside the box

An update to the gunsales package is now on CRAN. As with the last update, the changes are mostly internal. We removed the need to import two extra packages that were each used in a single line, easy enough to replace with base R. We also updated the included data sets, and updated a few things for current R packaging standards.

Courtesy of CRANberries, there is also a diffstat report for the most recent release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Source:: R News

List of R conferences and user groups

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

For 8 years now, we’ve maintained a list of local R user groups here at the Revolutions blog. This is a list that began with a single group (the Bay Area RUG, the first and still one of the largest groups), and now includes 360 user groups worldwide (including 27 specifically for women).

As the list has grown in size, it’s become harder to manage. Thankfully, Colin Gillespie of Jumping Rivers Consulting has risen to the task, by creating a new website based on a GitHub repository that anyone can contribute to. I’ve updated the Local R User Group Directory to point to these new pages, specifically the lists of:

If you have a group of your own, contributing to the list is easy. All you need is a GitHub account, and you can click the Edit button to edit one of the R Markdown pages directly. If you’re not familiar with R Markdown, you can also suggest an edit via the Issues page.

(Incidentally, it would be great to automate the process of generating a count and a map of local R user groups. If anyone wants to take up the challenge of writing an R script to process the Rmd pages, please do!)
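For anyone tempted by that challenge, here is a minimal sketch in R. It assumes, hypothetically, that each group appears in the Rmd pages as a Markdown bullet with a link; the page contents and function name below are made up for illustration:

```r
# Count user groups listed as Markdown bullets of the form "* [Name](url)".
# The bullet-link format is an assumption about how the Rmd pages are written.
count_groups <- function(lines) {
  sum(grepl("^\\s*[*-]\\s*\\[.+\\]\\(.+\\)", lines))
}

pages <- c(
  "## North America",
  "* [Bay Area RUG](",
  "* [Example RUG](",
  "Some prose in between.",
  "- [Another RUG]("
)
count_groups(pages)
## [1] 3
```

A map would follow the same idea: extract the group locations from the pages, geocode them, and plot the coordinates.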

As R grows in popularity, it’s awesome to see local communities get together and form these groups. If you’d like to start one yourself, here are some tips on starting up a local user group.

GitHub: A list of R conferences and meetings

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Using R and satellite data to identify marine bioregions

By Christian Marchese


(This article was first published on MilanoR, and kindly contributed to R-bloggers)

R is a powerful statistical and programming language. Despite its reputation for being hard to learn, it is increasingly used in different areas of research and has become an essential tool in oceanography and marine ecology. For instance, R is used to read, process and represent in situ oceanographic data through specific packages (e.g. oce) or, more generally, to manage satellite data in order to produce high temporal and spatial resolution maps, useful for synoptically exploring and monitoring vast areas of the world's oceans.

In this post we briefly describe a practical use of R in conjunction with satellite data to identify marine bioregions of the Labrador Sea (an arm of the North Atlantic Ocean between the Labrador Peninsula and Greenland) with different patterns in the phytoplankton seasonal cycle. Phytoplankton are microscopic plants that occupy the lowest level of the marine food chain. Their presence in surface water is revealed by chlorophyll-a and other photosynthetic pigments, which change the color of ocean waters. Nowadays, satellite ocean color sensors are routinely used to estimate the concentrations of chlorophyll-a and other parameters in the surface waters of the oceans. All these data are freely available for research and educational purposes.

The approach used for the identification of the bioregions is therefore based on the chlorophyll-a concentration, an index of phytoplankton biomass. The Globcolour project, which combines data from several satellites to reduce spatial and temporal gaps, provides a set of different satellite parameters including estimates of chlorophyll-a. The data are provided at several temporal (daily images, 8-day composite images and monthly averages) and spatial (1 km, 25 km, 100 km) resolutions and are stored in NetCDF files, a format that includes metadata in addition to the data sets. In our case, among other information, each file contains latitude and longitude values to identify each pixel on the grid. For our purpose we downloaded 8-day composite images (about one image per week, from 1998 to 2015) with a spatial resolution of 25 km. To work with NetCDF files we used the R package ncdf4, which replaces the former ncdf package.

Once the time series have been downloaded and unzipped (.nc files), to reach our objective several steps were needed:

  1. By using the functions nc_open and ncvar_get from the R package ncdf4, the .nc files were opened and the chlorophyll-a values (pixels) extracted together with the spatial coordinates and dates.

  2. Subsequently, by assigning to each pixel the corresponding values of latitude and longitude, an id-pixel (i.e. each pixel was numbered) and an id-date (i.e. year, month and day of the year), a large data frame was created. Basically, each pixel was identified uniquely within the data frame.

  3. An 8-day climatological time series of chlorophyll-a concentrations was created by averaging each pixel within the area of interest over the period 1998-2015 (i.e. averaging all the first weeks, all the second weeks, etc.).

  4. The resulting time series was normalized in order to scale values between 0 and 1.

  5. On the normalized climatology obtained in steps 3 and 4, a cluster analysis was carried out to identify marine regions of similarity (clusters).
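The first steps can be sketched in R as follows. The file name and NetCDF variable names below are hypothetical placeholders; the actual names vary by GlobColour product:

```r
library(ncdf4)

# Step 1: open a (hypothetical) GlobColour file and extract values and coordinates
nc  <- nc_open("globcolour_chl1_8d_25km.nc")   # placeholder file name
chl <- ncvar_get(nc, "CHL1_mean")              # assumed variable name
lat <- ncvar_get(nc, "lat")
lon <- ncvar_get(nc, "lon")
nc_close(nc)

# Step 2: one row per pixel, uniquely identified
grid <- expand.grid(lon = lon, lat = lat)
df <- data.frame(id.pixel = seq_len(nrow(grid)), grid, chl = as.vector(chl))

# Step 4: rescale a climatological series to [0, 1]
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```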

To perform the cluster analysis we used the kmeans function (package stats). The Calinski-Harabasz index was used to evaluate the optimal number of clusters. More detailed information about the procedure described above can be found in D’Ortenzio and Ribera d’Alcalà 2009 and Lacour et al. 2015.
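A minimal sketch of this clustering step, using random toy data in place of the real climatology, and assuming the fpc package for the Calinski-Harabasz index (the post itself only names kmeans from stats):

```r
library(fpc)  # assumed here for calinhara(); not named in the post

set.seed(1)
# Toy stand-in: 200 pixels, each with 46 normalized 8-day climatological values
pixels <- matrix(runif(200 * 46), nrow = 200)

# Evaluate k = 2..6 with the Calinski-Harabasz index
ch <- sapply(2:6, function(k) {
  km <- kmeans(pixels, centers = k, nstart = 25)
  calinhara(pixels, km$cluster)
})
best.k <- (2:6)[which.max(ch)]

# Final clustering with the selected number of clusters
clusters <- kmeans(pixels, centers = best.k, nstart = 25)$cluster
```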

The final outcome of this analysis is shown in the figure below.

As we can see, two main areas were identified: bioregion 1 (the yellow area) located north of about 60°N and bioregion 2 (the green area) located south of 60°N. The two bioregions present different climatological phytoplankton biomass cycles (blooms). In the northern part of the Labrador Sea (bioregion 1) the bloom starts earlier (around day 102, the dashed line in the figure) and is more intense (more than 1.75 mg/m3). Conversely, in the southern part (bioregion 2) the bloom starts later (day 128) and is less intense (less than 1.75 mg/m3). Note that, for simplicity, the bloom onset (represented by the dashed line and usually used as a warning bell for possible changes in trophic interactions and biogeochemical processes) was identified as the time when the chlorophyll-a concentration rises above the threshold of 1.0 mg/m3. Finally, the figure was created using three R packages: rasterVis, ggplot2 and gridExtra.
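The bloom-onset definition can be illustrated with a toy climatology (the values below are made up for illustration only):

```r
# Toy 8-day climatological chlorophyll-a series (mg/m3) and the mid-dates
# (day of year) of the 8-day bins
chl.clim <- c(0.4, 0.5, 0.6, 0.8, 1.2, 1.6, 1.4, 0.9)
day <- seq(5, by = 8, length.out = length(chl.clim))

# Bloom onset: first period in which chlorophyll-a reaches 1.0 mg/m3
onset <- day[which(chl.clim >= 1.0)[1]]
onset
## [1] 37
```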

Overall, this simple example has shown how statistical methods implemented in R, combined with satellite data, can help to characterize vast oceanic areas and thus to better illustrate ecosystem functioning and possibly its response to environmental changes.


D’Ortenzio F., Ribera d’Alcalà M. (2009) On the trophic regimes of the Mediterranean Sea: a satellite analysis. Biogeosciences, 6, 139-148

Lacour, L., Claustre H., Prieur L., D’Ortenzio F. (2015), Phytoplankton biomass cycles in the North Atlantic subpolar gyre: A similar mechanism for two different blooms in the Labrador Sea, Geophysical Research Letters, 42

The post Using R and satellite data to identify marine bioregions appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

Source:: R News

Building and maintaining exams with dynamic content

By R and Finance

An introduction to package exams

Part of my job as a researcher and teacher is to periodically apply and
grade exams in my classroom. Being constantly in the shoes of an
examiner, you quickly realize that students are clever in finding
ways to do well in an exam without effort. These days, photos and pdf
versions of past exams and exercises are shared online on Facebook,
in WhatsApp groups, on Instagram and what not. As weird as it may sound, the
distribution of information in the digital era creates a problem for
examiners. If you use the same exam as last year, it is likely that
students will simply memorize the answers from a digital record.
Moreover, some students will also cheat by looking for answers during
the test. Either way, keeping the same exam over time and across
students is not advisable.

This issue really bothered me. For large classes, there isn’t a way to
evaluate the work of students as cost-effective as online or printed
exams. I’m strongly in favor of meritocracy in academia, and I think that
a grade in an exam should, on average, be a good indicator of the
knowledge that the students retained during coursework. Otherwise,
what’s the point of doing any of it?

In the past, I manually created different versions of questions and
wrote new ones in order to avoid cheating and memorization of questions.
But, year after year, it became clear to me that this was a time-consuming
task that took more energy than I would like to invest.
Besides teaching, I also do research and work on administrative issues
within my department. Sometimes, especially around deadlines, you simply
don’t have the time and mental energy to come up with different versions
of an existing exam.

Back in 2016 I decided to invest some time to automate this process
and try to come up with an elegant solution. Since I had all my exams in
a LaTeX template called examdesign, I wrote package
RndTexExams, which took
as input a .tex file and created n versions of exams by randomly
defining the order of questions, the answer list and textual content
based on a simple markup language. If you know LaTeX, it is basically a
problem of finding regex patterns and restructuring a character object
that is later saved in a new, compilable LaTeX file.

The package I wrote worked pretty well for me but, as with any first
version of a piece of software, it had missing features. The output was only a
pdf file based on a template, it did not work with standard academic
platforms such as Blackboard and Moodle and, most problematic in my
opinion, it was not designed to run embedded R code that could be parsed
by knitr, as in an R Markdown file.

This is when I tried out the package
exams. While my solution
with RndTexExams was alright for a LaTeX user, package exams is much
better at solving the problem of dynamic content in exams. Using the
knitr and Sweave engines, the level of randomization and creation of
dynamic content is really amazing. By combining R code (and all the
capabilities of CRAN packages), you can do anything you want in an
exam. You can get information from the web, use completely different
datasets for each exam and so on. The limit is set by your imagination.

An example of an exam with dynamic content

As a quick example, I am going to show one question from the exercise
chapter of my book. When it is ready, I will serve the exercises
through a web-based shiny app, meaning that the reader will download a pdf
file with unique questions that is processed on a shiny server.

In this example question, I’m asking the reader to use R to solve the
following problem:

How many packages can you find today (2017-01-30) on CRAN?
Use repository for the solution.

The solution is pretty simple: all you need to do is ask for the
number of rows of the object returned by a call to
available.packages().

Now, let’s build the content of this simple question in a separate file.
You can use either .Rnw or .Rmd files with exams. I will choose the latter
just to keep it simple. Here are the contents of a file called
Question.Rmd, available here.

cat(paste0(readLines('Question.Rmd'), collapse = '\n'))

## ```{r data generation, echo = FALSE, results = "hide"}
## #possible.repo <- getCRANmirrors()$URL  # doesn't work well for all repos
## possible.repo <- c('',
##                   '',
##                   '',
##                   '',
##                   '',
##                   '',
##                   '')
## my.repo <- sample(possible.repo,1)
## n.pkgs <- nrow(available.packages(repos = my.repo))
## sol.q <- n.pkgs
## rnd.vec <- c(0, sample(-5000:-1,4))
## my.answers <- paste0(sol.q+rnd.vec, ' packages')
## ```
## Question
## ========
## How many packages you can find today (`r Sys.Date()`) in CRAN? 
## Use repository `r my.repo` for the solution.
## ```{r questionlist, echo = FALSE, results = "asis"}
## exams::answerlist(my.answers, markup = "markdown")
## ```
## Meta-information
## ================
## extype: schoice
## exsolution: 10000
## exname: number of cran pkgs
## exshuffle: TRUE

For the last piece of code, notice that I’ve set the solution of the
question in object sol.q. Later, in object my.answers, I use it
together with a random vector of integers to create five alternative
answers to the question, where the first one is the correct one. This
operation results in the following objects:

my.repo <- ''
n.pkgs <- nrow(available.packages(repos = my.repo))
sol.q <- n.pkgs
rnd.vec <- c(0, sample(-5000:-1,4))
my.answers <- paste0(sol.q+rnd.vec, ' packages')

## [1] "10001 packages" "8687 packages"  "9883 packages"  "7157 packages" 
## [5] "8513 packages"

To conclude the question, I simply use Sys.Date() to get the system’s
date and later set the correct answers using function answerlist. Some
metadata is also inserted in the last section of Question.Rmd. The
line exshuffle: TRUE sets a random order of the possible answers in each
exam for this question. Do notice that the solution is registered in
line exsolution: 10000, where the 1 in 10000 marks the correct answer in
the first element of my.answers and the 0s mark the incorrect ones.
Now that the file with the content of the question is finished, let’s set
some options and build the exam with exams. For simplicity, we will
repeat the same question five times.


library(exams)

my.f <- 'Question.Rmd'
n.ver <- 1
name.exam <- 'exam_sample'
my.dir <- 'exam-out/'

my.exam <- exams2pdf(file = rep(my.f,5),
                     n = n.ver, 
                     dir = my.dir,
                     name = name.exam, 
                     verbose = TRUE)

## Exams generation initialized.
## Output directory: /home/msperlin/Dropbox/My Blog/exam-out
## Exercise directory: /home/msperlin/Dropbox/My Blog
## Supplement directory: /tmp/RtmpaDj6Ju/file27145b4e0310
## Temporary directory: /tmp/RtmpaDj6Ju/file2714cb4b300
## Exercises: Question, Question, Question, Question, Question
## Generation of individual exams.
## Exam 1: Question (srt) Question_1 (srt) Question_2 (srt) Question_3 (srt) Question_4 (srt) ... w ... done.

f.out <- paste0(my.dir, name.exam, '1', '.pdf')
file.exists(f.out)

## [1] TRUE

The result of the previous code is a pdf file with the following
content:
One interesting observation from this post is that you can find a small
difference in the number of packages between the CRAN mirrors. My
best guess is that they synchronize with the master server at different
times of the day or week.

Looking at the contents of the pdf file, clearly some things are missing
from the exam, such as the title page and the instructions. You can add
all the bells and whistles with the inputs of function exams2pdf or
change the different template files directly. One quick tip for
new users is that the answer sheet can be found by looping over the
values of the output of exams2pdf:

df.answer.key <- data.frame()
n.q <- 5 # number of questions
for (i.ver in seq(n.ver)) {
  exam.now <- my.exam[[i.ver]]   # reconstructed variable name
  for (i.q in seq(n.q)) {
    sol.now <- letters[which(exam.now[[i.q]]$metainfo$solution)]
    temp <- data.frame(i.ver = i.ver, i.q = i.q, solution = sol.now)
    df.answer.key <- rbind(df.answer.key, temp)
  }
}

df.answer.key.wide <- tidyr::spread(df.answer.key, key = i.q, value = solution)

##   i.ver 1 2 3 4 5
## 1     1 d e a c a

By using function exams2pdf from package exams, I can code different questions in the
exams format and not worry whether someone is going to copy them
and distribute them on the internet. Students may know the content of each
question, but they will have to learn how to get to the correct answer
in order to solve it in their own exam. Cheating is also impossible, since
each student will have a different version and a different answer sheet.
If I have a class of 100 students, I will build 100 different exams,
each one with unique answers.

As for maintainability, the time value of my exam questions increases
significantly. I can use them over and over, now that I can effortlessly
create as many versions as I need. Since it is all based on R
code, I can use the code from the class material in my exams. Going
further, I can also automatically grade the exams using the internet
(see the package vignette for information on how to do that with Google
spreadsheets).

In this post I only scratched the surface of exams. Adding to the
description of its capabilities, you can export exams to standard
academic systems such as Moodle, Blackboard and others. You can also
print the exam as pdf, nops (a pdf that allows easy scanning), or html.
If you know a bit of LaTeX or html, it is easy to customize the
templates to the needs of your particular exam.

As with all technical things, not everything is perfect. In my opinion,
the main issue with the exams template is that it requires some knowledge
of R and knitr. While this is OK for most people reading this blog, it
is not the case for the average professor. It may sound surprising to
quantitatively inclined people, but the great majority of professors
still use .docx and .xlsx files to write academic work such as articles
and exams. Why don’t they use or learn better tools? Well, that is a
long answer, best suited for another post.

Package exams had a big and positive impact on how I do my work.
Based on a large database of questions that I’ve built, I can create a
new exam in 5 minutes and grade it for a large class in less than 1
minute. I am very thankful to its authors, and this is one of the reasons
why I love posting packages to CRAN. It is my way of giving back to
the community.

Concluding, package exams is great and I believe that every examiner
and professor should be using it. Thinking about the future, the
question template of exams has the potential of setting a standard
language for exams, a structure that could allow users to output
questions in any format they want, just as you can use Markdown to output
LaTeX or Word.

Sharing questions on a collaborative platform, such as Quora, should be
something for the developers (or the R community) to think about. Questions
could be ranked according to popular vote. Users could contribute by
posting question files for others to use. Users would get feedback on
their work and, at the same time, be able to use other people’s questions.
Students could also have access to it and independently study a
particular topic by building custom-made exams with randomized content.

Summing up, if you are a teacher or examiner, I hope that this post
convinces you to try out package exams.

Source:: R News

Using the Bizarro Pipe to Debug magrittr Pipelines in R

By John Mount

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I have just finished and released a new R video lecture demonstrating how to use the “Bizarro pipe” to debug magrittr pipelines. I think R dplyr users will really enjoy it.

Please read on for the link to the video lecture.

In this video lecture I use the “Bizarro pipe” to debug the example pipeline from RStudio’s purrr announcement.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Source:: R News

New features in World Gender Statistics app

By Shirin's playgRound

(This article was first published on Shirin’s playgRound, and kindly contributed to R-bloggers)

In my last post, I built a shiny app to explore World Gender Statistics.

To make it a bit nicer and more convenient, I added a few more features:

  • The drop-down menu for Years is now reactive, i.e. it only shows options with data (all NA years are removed)
  • You can click on any country on the map to get information about which country it is, its population size, income group and region
  • Below the world map, you can look at timelines for male vs female values for each country
  • The map is now in Mercator projection
To leave a comment for the author, please follow the link and comment on their blog: Shirin’s playgRound.

Source:: R News

Empirical Software Engineering using R: first draft available for download

By Derek Jones

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

A draft of my book Empirical Software Engineering using R is now available for download.

The book essentially comes in two parts:

  • statistical techniques that are useful for analyzing software engineering data. This draft release contains most of the techniques I plan to cover. I am interested in hearing about any techniques you think ought to be covered, but I only cover techniques when real data is available to use in an example,
  • six chapters covering what I consider to be the primary aspects of software engineering. This draft release includes the Human Cognitive Characteristics chapter, and I am hoping to release one of the remaining chapters every few months (Economics is next).

There is a page for making suggestions and problem reports.

All the code+data is available and I am claiming to have a copy of all the important, publicly available, software engineering data. If you know of any I don’t have, please let me know.

I am looking for a publisher. The only publisher I have had serious discussions with decided not to go ahead because of my insistence on releasing a free copy of the pdf. Self-publishing is a last resort.

To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R.

Source:: R News

SCADA spikes in Water Treatment Data

By Peter Prevos

Turbid water

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

SCADA spikes are events in the data stream of water treatment plants or similar installations. These SCADA spikes can indicate problems with the process and could result in an increased risk to public health.

The WSAA Health Based Targets Manual specifies a series of decision rules to assess the performance of filtration processes. For example, this rule assesses the performance of conventional filtration:

“Individual filter turbidity ≤ 0.2 NTU for 95% of month and not > 0.5 NTU for ≥ 15 consecutive minutes.”

Turbidity is a measure of the cloudiness of a fluid, caused by large numbers of individual particles that are otherwise invisible to the naked eye. Turbidity is an important parameter in water treatment because a high level of cloudiness strongly correlates with the presence of microbes. This article shows how to implement this specific decision rule using the R language.
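As a minimal sketch of the first half of that rule (turbidity ≤ 0.2 NTU for at least 95% of the month), using simulated one-minute readings rather than real plant data:

```r
# A month of simulated one-minute turbidity readings (NTU)
set.seed(42)
ntu <- rnorm(30 * 24 * 60, mean = 0.1, sd = 0.05)

# Proportion of readings at or below 0.2 NTU must be at least 95%
mean(ntu <= 0.2) >= 0.95
## [1] TRUE
```

The second half of the rule (no excursion above 0.5 NTU for 15 consecutive minutes or more) needs run lengths rather than a simple proportion, which is what the rest of the post develops.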


To create a minimum working example, I first create a simulated SCADA feed for turbidity. The turbidity data frame contains 24 hours of data. The seq.POSIXt function creates 24 hours of timestamps at a one-minute spacing. In addition, the rnorm function creates 1440 turbidity readings with an average of 0.1 NTU and a standard deviation of 0.01 NTU. The image below visualises the simulated data.

# Simulate data
turbidity <- data.frame(DateTime = seq.POSIXt(as.POSIXct("2017-01-01 00:00:00"), by = "min", length.out = 24 * 60),
                        Turbidity = rnorm(n = 24 * 60, mean = 0.1, sd = 0.01))

The second section simulates five spikes in the data. The first line picks a random start time for the spike. The second line in the for-loop picks a duration between 10 and 30 minutes. In addition, the third line simulates the value of the spike. The mean value of the spike is determined by the rbinom function to create either a low or a high spike. The remainder of the spike simulation inserts the new data into the turbidity data frame.

# Simulate spikes
for (i in 1:5) {
   time <- sample(turbidity$DateTime, 1)
   duration <- sample(10:30, 1)
   value <- rnorm(1, 0.5 * rbinom(1, 1, 0.5) + 0.3, 0.05)
   start <- which(turbidity$DateTime == time)
   turbidity$Turbidity[start:(start + duration - 1)] <- rnorm(duration, value, value/10)
}

The image below visualises the simulated data using the mighty ggplot. Only four spikes are visible because two of them overlap. The next step is to assess this data in accordance with the decision rule.

ggplot(turbidity, aes(x = DateTime, y = Turbidity)) + geom_line(size = 0.2) + 
   geom_hline(yintercept = 0.5, col = "red") + ylim(0,max(turbidity$Turbidity)) + 
   ggtitle("Simulated SCADA data")

Simulated SCADA data with spikes

SCADA Spikes Detection

The following code searches for all spikes over 0.50 NTU using the run-length encoding function rle. This function transforms a vector into a vector of values and run lengths. For example, the run-length encoding of the vector c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6) is:

  • lengths: int [1:5] 2 3 4 2 1
  • values : num [1:5] 1 2 3 5 6

The value 1 has a run length of 2, the value 2 has a run length of 3, and so on. The spike detection code creates the run lengths for turbidity levels greater than 0.5, which results in a boolean vector. The cumsum function calculates the end point of each run, which allows us to calculate the duration of each spike.
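For reference, here is the run-length encoding of the example vector above, and how cumsum recovers the index at which each run ends:

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 5, 5, 6)
runlength <- rle(x)
runlength$lengths
## [1] 2 3 4 2 1
runlength$values
## [1] 1 2 3 5 6
cumsum(runlength$lengths)   # end position of each run in x
## [1]  2  5  9 11 12
```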

The code results in a data frame with all spikes higher than 0.50 NTU and longer than 15 minutes. The spike that occurred at 11:29 was higher than 0.50 NTU and lasted for 24 minutes. The other spikes were either lower than 0.50 NTU or, like the first high spike, lasted less than 15 minutes.

# Spike Detection
spike.detect <- function(DateTime, Value, Height, Duration) {
  runlength <- rle(Value > Height)
  spikes <- data.frame(Spike = runlength$values,
                       times = cumsum(runlength$lengths))
  spikes$Times <- DateTime[spikes$times]
  spikes$Event <- c(0, spikes$Times[-1] - spikes$Times[-nrow(spikes)])
  spikes <- subset(spikes, Spike == TRUE & Event > Duration)
  return(spikes)
}
spike.detect(turbidity$DateTime, turbidity$Turbidity, 0.5, 15)

This approach was used to prototype a software package to assess water treatment plant data in accordance with the Health-Based Targets Manual. The finished product has been written in SQL and is available under an Open Source sharing license.

The post SCADA spikes in Water Treatment Data appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.

Source:: R News

Version Control, File Sharing, and Collaboration Using GitHub and RStudio

By geraldbelton

This is Part 3 of our “Getting Started with R Programming” series. For previous articles in the series, click here: Part 1, Part 2.

This week, we are going to talk about using git and GitHub with RStudio to manage your projects.

Git is a version control system, originally designed to help software developers work together on big projects. Git works with a set of files, which it calls a “repository,” to manage changes in a controlled manner. Git also works with websites like GitHub, GitLab, and BitBucket, to provide a home for your git-based projects on the internet.

If you are a hobbyist, and aren’t working on projects with other programmers, why would you want to bother with any of this? Incorporating version control into your workflow might be more trouble than it’s worth if you never have to collaborate with others or share your files. But most of us will, eventually, need to do this. It’s a lot easier to do if it’s built into your workflow from the start.

More importantly, there are tremendous advantages to using the web-based sites like GitHub. At the very minimum, GitHub serves as an off-site backup for your precious program files.


In addition, GitHub makes it easy to share your files with others. GitHub users can fork or clone your repository. People who don’t have GitHub accounts can still browse your shared files online, and even download the entire repository as a zip file.

And finally, once you learn Markdown (which we will be doing here, very soon) you can easily create a webpage for your project, hosted on GitHub, at no cost. This is most commonly used for documentation, but it’s a simple and easy way to get on the web. Just last week, I met a young programmer who showed me his portfolio, hosted on GitHub.

OK, let’s get started!

Register a GitHub Account

First, register a free GitHub account: For now, just use the free service. You can upgrade to a paid account, create private repositories, join organizations, and other things, later. But one thing you should think about at the very beginning is your username. I would suggest using some variant of your real name. You’ll want something that you feel comfortable revealing to a future potential employer. Also consider that things change; don’t include your current employer, school, or organization as part of your user name.

If you’ve been following along in this series, you’ve already installed R and R Studio. Otherwise, you should do that now. Instructions are in Part 1 of this series.

Installing and Configuring Git

Next, you’ll need to install git. If you are a Windows user, install Git for Windows. Just click on the link and follow the instructions. Accept any default settings that are offered during installation. This will install git in a standard location, which makes it easy for RStudio to find it. And it installs a BASH shell, which is a way to use git from a command line. This may come in handy if you want to use git outside of R/RStudio.

Linux users can install git through their distro’s package manager. Mac users can install git with Homebrew (brew install git) or by installing the Xcode command line tools.
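However you installed it, a quick sanity check from any shell confirms that git is available on your PATH:

```shell
# Confirm git is installed and reachable from the command line
git --version
```

You should see a version string such as “git version 2.39.2” (the exact number will differ).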

Now let’s tell git who you are. Go to a command prompt (or, in R Studio, go to Tools > Shell) and type:

git config --global user.name 'Your Name'

For Your Name, substitute your own name, of course. You could use your GitHub user name, or your actual first and last name. It should be something recognizable to your collaborators, as your commits will be tagged with this name.

git config --global user.email 'your.email@example.com'

The email address you put here must be the same one you used when you signed up for GitHub.

To make sure this worked, type:

git config --global --list

and you should see your name and email address in the output.
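For reference, the whole configuration amounts to just these commands — a minimal sketch using an example name and email that you would replace with your own (the email should match your GitHub account):

```shell
# Example identity -- substitute your own name and your GitHub email
git config --global user.name 'Jane Doe'
git config --global user.email 'jane.doe@example.com'

# Confirm both settings took effect
git config --global --list
```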

Connect Git, GitHub, and RStudio

Let’s run through an exercise to make sure you can pull from, and push to, GitHub from your computer.

Go to GitHub and make sure you are logged in. Then click the green “New Repository” button. Give your repository a name. You can call it whatever you want; we are going to delete it shortly. For demonstration purposes, I’m calling mine “demo.” You have the option of adding a description. You should click the checkbox that says “Initialize this repository with a README.” Then click the green “Create Repository” button. You’ve created your first repository!

Click the green “Clone or download” button, and copy the URL to your clipboard. Go to the shell again, and take note of what directory you are in. I’m going to create my repository in a directory called “tmp,” so at the command prompt I typed “mkdir ~/tmp” followed by “cd ~/tmp”.

To clone the repository on your local computer, type “git clone” followed by the URL you copied from GitHub. The results should look something like this:

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp
$ git clone
Cloning into 'demo'...
remote: Counting objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.

Make this your working directory, list its files, look at the README file, and check how it is connected to GitHub. It should look something like this:

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp
$ cd demo

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ ls
README.md

geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ head README.md
# demo
geral@DESKTOP-0HM18A3 MINGW64 ~/tmp/demo (master)
$ git remote show origin
* remote origin
  Fetch URL:
  Push URL:
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (up to date)

Let’s make a change to a file on your local computer, and push that change to GitHub.

echo "This is a new line I wrote on my computer" >> README.md

git status

And you should see something like this:

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   README.md

no changes added to commit (use "git add" and/or "git commit -a")

Now commit the changes, and push them to GitHub:

git add -A
git commit -m "A commit from my local computer"
git push

Git will ask you for your GitHub username and password if you are a new user. Provide them when asked.

The -m flag on the commit is important. If you don’t include it, git will prompt you for it. You should include a message that will tell others (or yourself, months from now) what you are changing with this commit.
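To see how those messages pay off later, here is a sketch in a throwaway repository (the identity, file name, and message are examples): git log shows every commit labeled with its message.

```shell
#!/bin/sh
set -e
# Work in a throwaway repository so nothing real is touched
cd "$(mktemp -d)"
git init -q
git config user.name 'Jane Doe'
git config user.email 'jane.doe@example.com'

echo "# demo" > README.md
git add -A
git commit -q -m "A commit from my local computer"

# Each line of the history is the abbreviated hash plus the commit message
git log --oneline
```

The output is the abbreviated commit hash followed by “A commit from my local computer.”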

Now go back to your browser, and refresh. You should see the line you added to your README file. If you click on commits, you should see the one with the message “A commit from my local computer.”

Now let’s clean up. You can delete the repository on your local computer just by deleting the directory, as you would any other directory on your computer. On GitHub, (assuming you are still on your repository page) click on “settings.” Scroll down until you see the red “Danger Zone” flag, and click on “Delete This Repository.” Then follow the prompts.

Connecting GitHub to RStudio

We are going to repeat what we did above, but this time we are going to do it using RStudio.

Once again, go to GitHub, click “New Repository,” give it a name, check the box to create a README, and create the repository. Click the “clone or download” button and copy the URL to your clipboard.

In RStudio, start a new project: File > New Project > Version Control > Git

In the “Repository URL” box, paste in the URL that you copied from GitHub. Put something (maybe “demo”) in the box for the Directory Name. Check the box marked “Open in New Session.” Then click the “Create Project” button.

And, just that easy, you’ve cloned your repository!

In the file pane of RStudio, click README.md, and it should open in the editor pane. Add a line, perhaps one that says “This line was added in R Studio.” Click the disk icon to save the file.

Now we will commit the changes and push them to GitHub. In the upper right pane, click the “Git” tab. Click the “Staged” box next to README.md. Click “Commit” and a new box will pop up. It shows you the staged file, and at the bottom of the box you can see exactly what changes you have made. Type a commit message in the box at the top right, something like “Changes from R Studio.” Click the commit button. ANOTHER box pops up, showing the progress of the commit. Close it after it finishes. Then click “Push.” ANOTHER box pops up, showing you the progress of your push. It may ask you for a username and password. When it’s finished, close it. Now go back to GitHub in your web browser, refresh, and you should see your changed README file.

Congratulations, you are now set up to use git and GitHub in R Studio!

Source:: R News

Data Science for Doctors – Part 1 : Data Display

By Vasileios Tsakalos

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is clear that they need a substantial knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related data set, the famous “Pima Indians Diabetes Database.” It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the first part of the series; it covers data display.

Before proceeding, it might be helpful to look over the help pages for table, pie, geom_bar, coord_polar, barplot, stripchart, geom_jitter, density, geom_density, hist, geom_histogram, boxplot, geom_boxplot, qqnorm, qqline, geom_point, and plot.


Please run the code below in order to load the data set and make it into a proper data frame format:

url <- ""
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a frequency table of the class variable.

Exercise 2

class.fac <- factor(data[['class']],levels=c(0,1), labels= c("Negative","Positive"))
Create a pie chart of the class.fac variable.

Exercise 3

Create a bar plot for the age variable.

Exercise 4

Create a strip chart for the mass against class.fac.

Exercise 5

Create a density plot for the preg variable.

Exercise 6

Create a histogram for the preg variable.

Exercise 7

Create a boxplot for the age against class.fac.

Exercise 8

Create a normal QQ plot and a line which passes through the first and third quartiles.

Exercise 9

Create a scatter plot of the age variable against the mass variable.

Exercise 10

Create scatter plots for every variable of the data set against every variable of the data set on a single window.
Hint: it is quite simple; don’t overthink it.

Source:: R News