Brownian Motion GIF with R and ImageMagick

By Javi Fernández-López

(This article was first published on long time ago…, and kindly contributed to R-bloggers)

Hi there!
Last Monday we celebrated a “Scientific Marathon” at the Royal Botanic Garden in Madrid, a kind of mini-conference where we talk about our research. I was talking about the relationship between fungal spore size and environmental variables such as temperature and precipitation. To make my presentation friendlier, I created a GIF to explain the Brownian motion model. In evolutionary biology, we can use this model to simulate the random variation of a continuous trait through time. Under this model, we can see how more closely related species tend to maintain more similar trait values due to their shared evolutionary history. There is plenty of information about Brownian motion models in evolutionary biology everywhere!
Here I will show you how I built the GIF I used to explain Brownian motion in my talk, using R and ImageMagick.
# First, we simulate continuous trait evolution by adding in each iteration
# a random number drawn from a normal distribution with mean 0 and standard
# deviation 1. We simulate a total of 4 processes, to obtain at first
# two species and a speciation event at the middle of the simulation,
# obtaining a total of 3 species at the end. (The assignments below are a
# reconstruction: the original arrows were lost when the post was archived.)
df1 <- data.frame(X = 1:750, Y = NA)     # species 1 (red), generations 1-750
y <- 0
for (g in 1:750){
  df1[g, 2] <- y
  df1[g, 1] <- g
  y <- y + rnorm(1, mean = 0, sd = 1)
}
#plot(df1$X, df1$Y, ylim=c(-100,100), xlim=c(0,1500), cex=0)
#lines(df1$X, df1$Y, col="red")
df2 <- data.frame(X = 1:1500, Y = NA)    # species 2 (blue), generations 1-1500
y <- 0
for (g in 1:1500){
  df2[g, 2] <- y
  df2[g, 1] <- g
  y <- y + rnorm(1, mean = 0, sd = 1)
}
#lines(df2$X, df2$Y, col="blue")
df3 <- data.frame(X = 750:1500, Y = NA)  # species 3 (green), branches off species 1
y <- df1[750, 2]                         # inherit the parent lineage's trait value
for (g in 750:1500){
  df3[g - 749, 2] <- y
  df3[g - 749, 1] <- g
  y <- y + rnorm(1, mean = 0, sd = 1)
}
#lines(df3$X, df3$Y, col="green")
df4 <- data.frame(X = 750:1500, Y = NA)  # species 4 (orange), branches off species 1
y <- df1[750, 2]
for (g in 750:1500){
  df4[g - 749, 2] <- y
  df4[g - 749, 1] <- g
  y <- y + rnorm(1, mean = 0, sd = 1)
}
#lines(df4$X, df4$Y, col="orange")

# Now, we have to plot each simulation lapse and store it on our computer.
# I added some code to make the GIF lighter (plotting just odd generations)
# and to add a label at the speciation time. Note that, since the Brownian
# motion model is a stochastic process, my simulation will be different from
# yours. You should adjust labels or repeat the simulation process if you
# don't like the shape of your plot.
# 'parp' flags the frames that get the "speciation event" label (0 = draw it);
# its original definition was garbled, so here it labels the first 100 frames.
parp <- rep(c(0, 1), times = c(100, 650))
for (q in 1:750){
  if (q %% 2 == 1) {
    id <- sprintf("%04d", q)  # zero-padded so ImageMagick orders the frames
    png(paste("bm", id, ".png", sep=""), width=900, height=570, units="px",
        pointsize=18)
    par(omd = c(.05, 1, .05, 1))
    plot(df1$X, df1$Y, ylim=c(-70,70), xlim=c(0,1500), cex=0,
         main=paste("Brownian motion model\n generation =", 749 + q),
         xlab="generations", ylab="trait value", font.lab=2, cex.lab=1.5)
    lines(df1$X, df1$Y, col="red", lwd=4)
    lines(df2$X[1:(q+749)], df2$Y[1:(q+749)], col="blue", lwd=4)
    lines(df3$X[1:q], df3$Y[1:q], col="green", lwd=4)
    lines(df4$X[1:q], df4$Y[1:q], col="orange", lwd=4)
    if (parp[q] == 0)
      text(750, 65, labels="speciation event", cex=1.5, col="black", font=2)
    if (parp[q] == 0)
      arrows(750, 60, 750, 35, length = 0.20, angle = 30, lwd = 3)
    dev.off()
  }
}

Now, you just have to use ImageMagick to put all the PNG files together in a GIF using a command like this in a terminal:

 convert -delay 10 *.png bm.gif  
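
Or, if you prefer not to leave R at all, you can issue the same command from your R session (assuming ImageMagick’s convert is on your PATH):

system("convert -delay 10 bm*.png bm.gif")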

Et voilà!


Data science trivia from the Basel Data Science (BDS) meetup group

By Blog

(This article was first published on blog, and kindly contributed to R-bloggers)

A few weeks ago, we had our first meetup of the Basel Data Scientists (BDS) here in Basel, Switzerland. As it was our first meeting and I wanted people to get to know each other and have some fun, I decided to have the members play a data science trivia game. I split the group into two teams of 5 and had each team answer 20 data science trivia questions I gathered from a mixture of classic brain teasers from statistics and psychology, some statistical history (thank you Wikipedia!), and a few basic probability calculations. I had no idea if people would be into the game or not, but I was happy to see that after a few questions (and beers), people were engaged in some (at times heated!) debates over questions like the definition of a p-value, and how to best protect an airplane from enemy fire.

As I thought other people might have fun with the game, I am posting the questions here for everyone to enjoy. As you’ll see, the 20 questions are broken down into four categories: “Fun”, “Statistics”, “History”, and “Terminology”. Once you’ve given the questions a shot, you can find (my) answers to the questions at http://ndphillips.github.io/DataScienceTrivia_Answers.html. If you find errors, or have suggestions for better questions, don’t hesitate to write me at Nathaniel.D.Phillips.is@gmail.com. Have fun!

Data Science Trivia

Fun

  1. Abraham is tasked with reviewing damaged planes coming back from sorties over Germany in the Second World War. He has to review the damage to the planes to see which areas must be protected even more. Abraham finds that the fuel systems of returned planes are much more likely to be damaged by bullets than the engines. Which part of the plane should he recommend to receive additional protection, the fuel systems or the engines?

  2. Paul the ___ was an animal that became famous in 2010 for accurately predicting the outcomes of that year’s World Cup. What species was Paul?

  3. Amy and Bob have two children, one of whom is female. What is the probability that their other child is female?

  4. Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Assuming that you do not want a goat, should you stick with door No. 1 or switch to door No. 2?

  5. Imagine the following coin-flipping game. Before the game starts, the pot starts at $2. I then continually flip a coin, and each time a Head appears, the pot doubles. The first time Tails appears, the game ends and you win whatever is in the pot. Thus, if Tails comes on the first flip, the game is over and you get $2. If the first Tails comes on the second flip, you get $4. Formally, you win 2^k dollars, where k is the number of flips. If you played this game an infinite number of times, how much money would you expect to earn on average? How much would you pay me for the opportunity to play this game?
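
Before peeking at the answers, you can get a feel for question 5 by simulating the game in R; a quick sketch:

set.seed(42)
play_game <- function() {
  flips <- 1
  while (runif(1) < 0.5) flips <- flips + 1  # keep flipping while Heads come up
  2^flips                                    # Tails on flip k pays 2^k dollars
}
winnings <- replicate(1e5, play_game())
mean(winnings)  # the sample mean is unstable - try different seeds and sizes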

Statistics

  1. How many people do you need in a room for the probability to be greater than .50 that at least two people in the room have the same birthday?

  2. If you flip a fair coin 4 times, what is the probability that you will get at least one head?

  3. Imagine you are a physician presented with the following problem. A 50-year-old woman, Betty, with no symptoms, participates in routine mammogram screening. She tests positive and wants to know how likely it is that she actually has breast cancer given her positive test result. You know that about 1% of 50-year-old women have breast cancer. If a woman does have breast cancer, the probability that she tests positive is 90%. If she does not have breast cancer, the probability that she nevertheless tests positive is 9%. Based on this information, how likely is it that Betty actually has breast cancer given her positive test result?
  4. What is the definition of a p-value?
  5. Imagine that I flipped a fair coin 5 times: which of the following two sequences is more likely to occur? A) “H, H, T, H, T”, B) “T, T, T, T, T”

History

  1. The ___ ___ theorem, one of the most famous in all of statistics, states that, given enough data, the probability distribution of the sample mean will always be Normal, regardless of the probability distribution of the raw data.
  2. The mathematician ___ developed the method of least squares in 1809.

  3. In 1907, Francis Galton submitted a paper to Nature in which he found that when 787 people guessed the weight of an ox at a county fair, the median estimate of the group was only off by 10 pounds. This is one of the most famous examples of the ___ __ ___.

  4. The .05 significance threshold was introduced by ___ in 1925.

  5. Python is a programming language created by Guido van Rossum and first released in 1991. Where did the name Python come from?

Terminology

  1. A machine learning model that is so complex that no one, at times not even its programmers, knows exactly why it works the way it does, is called a ___ ___ model.

  2. When an algorithm has very high accuracy in fitting a training dataset, but poor accuracy in predicting a new dataset, then the model has ___ the training data.

  3. In order to computationally estimate probability distributions, especially in Bayesian statistics, MCMC methods are often used, which stand for ___ ___ ___ ___ methods.

  4. What does SPSS stand for?

  5. Regression, decision trees, and random forests are known as ___ learning algorithms, while algorithms such as nearest neighbor and principal component analysis are known as ___ learning algorithms.


R 3.4.3 is released (a bug-fix release)

By Tal Galili


(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

R 3.4.3 (codename “Kite-Eating Tree”) was released last week. You can get the latest binary versions from here (or the .tar.gz source code from here).

As mentioned by David Smith, R 3.4.3 is primarily a bug-fix release:

It fixes an issue with incorrect time zones on MacOS High Sierra, and some issues with handling Unicode characters. (Incidentally, representing international and special characters is something that R takes great care in handling properly. It’s not an easy task: a 2003 essay by Joel Spolsky describes the minefield that is character representation, and not much has changed since then.)

The full list of bug fixes and new features is provided below.

Upgrading to R 3.4.3 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install 
setInternet2(TRUE) # only for R versions older than 3.3.0
installr::updateR() # updating R.
# If you wish it to go faster, run: installr::updateR(T)

Running “updateR()” will detect if there is a new R version available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.

CHANGES IN R 3.4.3

INSTALLATION on a UNIX-ALIKE

  • A workaround has been added for the changes in location of time-zone files in macOS 10.13 ‘High Sierra’ and again in 10.13.1, so the default time zone is deduced correctly from the system setting when R is configured with --with-internal-tzcode (the default on macOS).
  • R CMD javareconf has been updated to recognize the use of a Java 9 SDK on macOS.

BUG FIXES

  • raw(0) & raw(0) and raw(0) | raw(0) again return raw(0) (rather than logical(0)).
  • intToUtf8() converts integers corresponding to surrogate code points to NA rather than invalid UTF-8, as well as values larger than the current Unicode maximum of 0x10FFFF. (This aligns with the current RFC3629.)
  • Fix calling of methods on S4 generics that dispatch on ... when the call contains ....
  • Following Unicode ‘Corrigendum 9’, the UTF-8 representations of U+FFFE and U+FFFF are now regarded as valid by utf8ToInt().
  • range(c(TRUE, NA), finite = TRUE) and similar no longer return NA. (Reported by Lukas Stadler.)
  • The self starting function attr(SSlogis, "initial") now also works when the y values have exact minimum zero and is slightly changed in general, behaving symmetrically in the y range.
  • The printing of named raw vectors is now formatted nicely as for other such atomic vectors, thanks to Lukas Stadler.
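
For instance, the first of these fixes is easy to sanity-check from a 3.4.3 session (a quick sketch; the commented values are what the changelog implies you should see):

res <- raw(0) & raw(0)
class(res)   # "raw" in 3.4.3 (earlier versions returned logical(0) here)
length(res)  # 0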


New DataCamp Course: Working with Web Data in R

By gabriel

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Hi there! We just launched Working with Web Data in R by Oliver Keyes and Charlotte Wickham, our latest R course!

Most of the useful data in the world, from economic data to news content to geographic information, lives somewhere on the internet – and this course will teach you how to access it. You’ll explore how to work with APIs (computer-readable interfaces to websites), access data from Wikipedia and other sources, and build your own simple API client. For those occasions where APIs are not available, you’ll find out how to use R to scrape information out of web pages. In the process, you’ll learn how to get data out of even the most stubborn website, and how to turn it into a format ready for further analysis. The packages you’ll use and learn your way around are rvest, httr, xml2 and jsonlite, along with particular API client packages like WikipediR and pageviews.

Take me to chapter 1!

Working with Web Data in R features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in getting information from the Internet!

What you’ll learn

1. Downloading Files and Using API Clients

Sometimes getting data off the internet is very, very simple – it’s stored in a format that R can handle and just lives on a server somewhere, or it’s in a more complex format and perhaps part of an API but there’s an R package designed to make using it a piece of cake. This chapter will explore how to download and read in static files, and how to use APIs when pre-existing clients are available.
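
In the simplest case, that workflow is just a download plus a read; a minimal sketch (the URL is illustrative, not taken from the course):

url <- "https://raw.githubusercontent.com/tidyverse/readr/main/inst/extdata/mtcars.csv"
download.file(url, destfile = "mtcars.csv", mode = "wb")  # fetch the static file
dat <- read.csv("mtcars.csv")                             # read it like any local csv
head(dat)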

2. Using httr to interact with APIs directly

If an API client doesn’t exist, it’s up to you to communicate directly with the API. But don’t worry, the package httr makes this really straightforward. In this chapter, you’ll learn how to make web requests from R, how to examine the responses you get back and some best practices for doing this in a responsible way.
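
As a taste, here is a minimal direct request to the public MediaWiki API with httr (a sketch, not a course exercise):

library(httr)

resp <- GET("https://en.wikipedia.org/w/api.php",
            query = list(action = "query",
                         titles = "R (programming language)",
                         prop   = "info",
                         format = "json"),
            user_agent("httr example for a blog post"))  # identify yourself politely
stop_for_status(resp)                        # fail loudly on HTTP errors
str(content(resp, "parsed"), max.level = 2)  # inspect the parsed response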

3. Handling JSON and XML

Sometimes data is a TSV or nice plaintext output. Sometimes it’s XML and/or JSON. This chapter walks you through what JSON and XML are, how to convert them into R-like objects, and how to extract data from them. You’ll practice by examining the revision history for a Wikipedia article retrieved from the Wikipedia API using httr, xml2 and jsonlite.

4. Web scraping with XPATHs

Now that we’ve covered the low-hanging fruit (“it has an API, and a client”, “it has an API”) it’s time to talk about what to do when a website doesn’t have any access mechanisms at all – when you have to rely on web scraping. This chapter will introduce you to the rvest web-scraping package, and build on your previous knowledge of XML manipulation and XPATHs.

5. CSS Web Scraping and Final Case Study

CSS path-based web scraping is a far-more-pleasant alternative to using XPATHs. You’ll start this chapter by learning about CSS, and how to leverage it for web scraping. Then, you’ll work through a final case study that combines everything you’ve learnt so far to write a function that queries an API, parses the response and returns data in a nice form.
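
For a flavour of the CSS approach, a tiny rvest sketch (the “h2 a” selector is a guess at the target page’s markup, not code from the course):

library(rvest)

page   <- read_html("https://www.r-bloggers.com")
titles <- html_text(html_nodes(page, "h2 a"))  # post titles via a CSS selector
head(titles)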

Master web data in R with our course Working with Web Data in R!


Live Earthquakes App

By Marco Pasin

(This article was first published on analytics for fun, and kindly contributed to R-bloggers)

It’s awesome when you are asked to build a product demo and you end up building something you actually use yourself.

That is what happened to me with the Live Earthquake Shiny App. A few…



Some quirks with R and SQL Server by @ellis2013nz

By Peter's stats stuff – R

(This article was first published on Peter’s stats stuff – R, and kindly contributed to R-bloggers)

I’ve been writing on this blog less frequently in the past few months. Mostly this is because I’ve been working on a very intensive and engaging professional project that is much more hands-on (ie coding) than I’ve been at work for a while. So a lot of energy has been going into that. One interesting side effect for me has been diving deep into Structured Query Language (SQL), and the Microsoft-flavoured Transact SQL in particular. I’ve used SQL for a long time, but usually with the attitude of “get it out of the database as quickly as possible”. In a situation where this didn’t make sense, I’ve been pleasantly surprised at how powerful and flexible SQL is for the right sort of jobs.

The bulk of the work project is in SQL, but there is an R flavour with a Shiny front end and a bunch of testing and semi-automation of the build process going on in R (long story). Here are a couple of quirky and useful things relating to using R and SQL Server in combination.

Using R to schedule SQL scripts

Imagine a project that combines SQL and R. For example, SQL is used to do a bunch of heavy lifting data management and complex queries in the database; and R is used for statistical modelling and producing polished outputs. This is actually a very common scenario. Relational databases are very powerful tools with many decades of optimisation embedded in them. They aren’t going anywhere soon.

“Whether your first data analysis language is R, Python, Julia or SAS, your second language should be SQL”

Quote by me – something I just thought of.

It’s good practice to keep any non-trivial SQL queries in their own files with a .sql suffix, and develop them in a database-aware environment. With SQL Server, this will often mean SQL Server Management Studio. But when you come to doing the final stage of reproducible analysis, you don’t want to be flicking between two applications; certainly not in terms of actually running anything. Although SQL Server since 2016 can include R code in a stored procedure, if it’s basically a statistical project it’s still going to have the workflow of “database first, R second”, with the statistical and presentation stage probably developed in RStudio or similar. So it’s very useful to be able to run a bunch of .sql scripts from R. This is commonly done by reading in the script with readLines() and executing it on the database via RODBC or other database connection software.
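
In its simplest form, that pattern looks like this (a sketch; the DSN and file name are placeholders):

library(RODBC)

ch  <- odbcConnect("my_dsn")                                 # placeholder DSN
sql <- paste(readLines("queries/my_query.sql"), collapse = "\n")
res <- sqlQuery(ch, sql)                                     # run it on the database
odbcClose(ch)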

I developed an R function sql_execute() to make this process efficient. The version below is available in my pssmisc R package (only on GitHub at this point) which is a grab-bag of multi-use functionality associated with this blog. The original version had a few project-specific features as well as a few cut corners. It also had an accompanying simple function that uses sql_execute() to run all the SQL scripts in a given folder in order.

The sql_execute() function provides the following benefits:

  • combines the multiple steps of reading in SQL scripts and executing them in a single command
  • handles a common problem where files developed in Management Studio often aren’t saved in an encoding automatically recognised by R
  • allows the use of the GO batch separator, a Microsoft-specific addition to SQL that will cause problems if included in a ODBC query
  • lets you specify a search-and-replace for a string – very useful sometimes if you’re running lots of SQL scripts to be able to say something like “oh by the way, can you change all references to database X to database Y while you’re at it”
  • lets you specify if an error in one batch should be fatal, or whether to proceed with the rest of the batches from that script
  • logs execution and results in a table in the database.

Dealing with GO was particularly important, because it’s common (and often essential) in T-SQL development. GO divides a single script into batches. The sql_execute() function below embraces this, by splitting the original file into separate queries based on the location of GO, and sending the individual batches one at a time to the server.

# Helper function to convert the output of Sys.time() into a character string
# without the time zone on it
# 
# @details Not exported.
# @keywords internal
# @param dt an object of class \code{POSIXct}
# @examples
# datetime_ch(Sys.time())
datetime_ch <- function(dt){
  dt <- gsub(" [A-Z]*$", "", as.character(dt))
  dt <- paste0("CAST ('", dt, "' AS DATETIME)")
  return(dt)
}



#' Execute SQL
#'
#' Execute T-SQL in a script, split into batches
#' 
#' @export
#' @importFrom RODBC sqlQuery
#' @importFrom stringr str_split str_length 
#' @details Reads a script of SQL, splits it into separate queries on the basis of any occurrences of \code{GO}
#' in the script, and passes it to the database server for execution.  While the initial use case was for SQL Server, there's no
#' reason why it wouldn't work with other ODBC connections.
#' 
#' The case of \code{GO} is ignored but it has to be the first non-space word on its line of code.
#' 
#' If any batch at any point returns rows of data (eg via a \code{SELECT} statement that does not \code{INSERT} the
#' results into another table or variable on the database), the rest of that batch is not executed.  
#' If that batch was the last batch of SQL
#' in the original file, the results are returned as a data.frame, otherwise it is discarded.
#' 
#' Example SQL code for creating a log suitable for this function:
#' \preformatted{
#' CREATE TABLE some_database.dbo.sql_executed_by_r_log
#' (
#'   log_event_code INT NOT NULL IDENTITY PRIMARY KEY, 
#'   start_time     DATETIME, 
#'   end_time       DATETIME,
#'   sub_text       NVARCHAR(200),
#'   script_name    NVARCHAR(1000),
#'   batch_number   INT,
#'   result         NCHAR(30),
#'   err_mess       VARCHAR(8000),
#'   duration       NUMERIC(18, 2)
#' );
#' }
#' 
#' @param channel connection handle as returned by \code{RODBC::odbcConnect()}, of class RODBC
#' @param filename file name of an SQL script
#' @param sub_in character string that you want to be replaced with \code{sub_out}.  Useful if you want to do a bulk search
#' and replace.  This is useful if you have a bunch of scripts that you maybe want
#' to run on one schema sometimes, and on another schema other times - just automate the search and replace.  Use with caution.
#' @param sub_out character string that you want to replace \code{sub_in} with.
#' @param fixed logical.  If TRUE, \code{sub_in} is a string to be matched as is.  Otherwise it is treated as a regular expression 
#' (eg if fixed = FALSE, then . is a wild card)
#' @param error_action should you stop with an error if a batch gets an error message back from the database?  Any alternative
#' to "stop" means we just keep ploughing on, which may or may not be a bad idea.  Use "stop" unless you know that failure
#' in one part of a script isn't fatal.
#' @param log_table table in the database to record a log of what happened.  Set to NULL if no log table available.  The log_table
#' needs to have (at least) the following columns: start_time, end_time, sub_out, script_name, batch_number, result, err_mess and duration. 
#' See Details for example SQL to create such a log table.
#' @param verbose Logical, gives some control over messages
#' @param ... other arguments to be passed to \code{sqlQuery()}, such as \code{stringsAsFactors = FALSE}.
#' @examples
#' \dontrun{
#' ch <- RODBC::odbcConnect("my_dsn")  # reconstructed example; the original was garbled
#' sql_execute(ch, "my_script.sql")
#' }
sql_execute <- function(channel, filename, sub_in = NULL, sub_out = NULL, fixed = TRUE, 
                        error_action = "stop", log_table = NULL, 
                        verbose = TRUE, ...){
  
  # we can't tell in advance what encoding the .sql files are in, so we read it in
  # in two ways (one of which is certain to return gibberish) and choose the version that is recognised as a proper string:
  
  # encoding method 1 (weird Windows encoding):
  file_con <- file(filename, encoding = "UCS-2LE")
  sql1 <- paste(readLines(file_con, warn = FALSE), collapse = "\n")
  close(file_con)
  
  # encoding method 2 (let R work it out - works in most cases):
  file_con <- file(filename)
  sql2 <- paste(readLines(file_con, warn = FALSE), collapse = "\n")
  close(file_con)
  
  # choose between the two encodings, based on which one has a legitimate string length:
  suppressWarnings({
    if(is.na(stringr::str_length(sql2))){
      sql <- sql1
    } else {
      sql <- sql2
    }
  })
  
  # do the find and replace that are needed
  if(!is.null(sub_in)){
    sql <- gsub(sub_in, sub_out, sql, fixed = fixed)
  }
  
  # split the SQL into separate commands wherever there is a "GO" at the beginning of a line
  # ("GO" is not ANSI SQL, only works for SQL Server - it indicates the lines above are a batch)
  sql_split <- stringr::str_split(sql, "\n *[Gg][Oo]", simplify = TRUE)
  
  base_log_entry <- data.frame(
    sub_out          = ifelse(is.null(sub_out), "none", sub_out),
    script_name      = filename,
    stringsAsFactors = FALSE
  )
  
  n_batches <- length(sql_split)
  
  # execute the various separate commands
  for(i in 1:n_batches){
    log_entry              <- base_log_entry
    log_entry$batch_number <- i
    log_entry$result       <- "no error"
    log_entry$err_mess     <- ""
    log_entry$start_time   <- datetime_ch(Sys.time())
    
    if(verbose){message(paste("Executing batch", i, "of", n_batches))}
    
    duration <- system.time({res <- sqlQuery(channel, sql_split[[i]], ...)})
    log_entry$duration <- duration[3]
    
    if(class(res) == "data.frame"){
      txt <- paste("Downloaded a data.frame with", nrow(res), "rows and",
                   ncol(res), "columns in batch", i, ". Any commands left in batch", i, "were not run.")
      if(verbose){message(txt)}
      log_entry$result <- "data.frame"
      
    } 
    if(class(res) == "character" & length(res) > 0){
      message("\n\nI got this error message:")
      cat(res)
      log_entry$result   <- "error"
      log_entry$err_mess <- paste(gsub("'", "", res), collapse = "\n")
      message(paste0("\n\nSomething went wrong with the SQL execution of batch ", i, 
                     " in ", filename, ". \n\nError message from the database is shown above\n\n"))
    }
    
    log_entry$end_time <- datetime_ch(Sys.time())
    
    # Update the log in the database, if we have been given one:
    if(!is.null(log_table)){
      # couldn't get sqlSave to append to a table even when append = TRUE... 
      # see https://stackoverflow.com/questions/36913664/rodbc-error-sqlsave-unable-to-append-to-table
      # so am writing the SQL to update the log by hand:
      sql <- with(log_entry, paste0("INSERT INTO ", 
                                    log_table, 
                                    "(start_time, end_time, sub_out, script_name, batch_number, 
                                    result, err_mess, duration)",
                                    " VALUES (", start_time, ", ", end_time, ", '", 
                                    sub_out, "', '", script_name, "', ", batch_number, ", '", result, "', '",
                                    err_mess, "', ", duration, ");"))
      
      log_res <- sqlQuery(channel, sql)
      
    }
    if(error_action == "stop" && log_entry$result == "error"){
      stop(paste("Stopping due to an error in", filename))
    }
    if(class(res) == "data.frame"){
      if(i == n_batches){
        return(res)  
      } else {
        warning("Downloaded a data frame from batch ", i, " of SQL, which wasn't the \nlast batch in the file.  This data frame is not kept.")
      }
      
    }
  }
}

One of the things to watch out for in this situation is how running a script via ODBC can get different results from hitting F5 in Management Studio. One key thing to trip up on is what happens if the SQL includes a SELECT statement that doesn’t INSERT the results into another table or variable, but returns them as a table. In this case, ODBC considers its work done and will not continue to execute anything else in the batch beyond that SELECT statement.

To clarify how this works, here is a potentially problematic SQL file:

/*
eg-sql.sql

for testing the sql_execute R function

*/

-- this will return five rows but sql_execute discards them
SELECT TOP 5 * FROM some_table
GO

-- this will return an error
some non-legitimate SQL here that causes an error
go

-- next batch will only get as far as the first seven rows
SELECT TOP 7 * FROM some_table
SELECT TOP 10 * FROM some_table
GO

If I run that file via sql_execute(ch, "examples/eg-sql.sql"), it does the following:

  • executes the SELECT TOP 5 statement and returns the results as a data frame, which is discarded as it is not the result of the last batch of the script
  • tries to execute the “some non-legitimate SQL” batch, gets an error and stops.

Alternatively, if I run it via sql_execute(ch, "examples/eg-sql.sql", error_action = "continue"), it does the following:

  • executes the SELECT TOP 5 statement and returns the results as a data frame, which is discarded as it is not the result of the last batch of the script
  • tries to execute the “some non-legitimate SQL” batch, gets an error and prints it to the screen.
  • executes the SELECT TOP 7 statement, returns the results as a data frame, and stops. The SELECT TOP 10 statement is never executed.

An odd quirk with SQL loops cutting short with ODBC

A second quirk that had me puzzled for a while (and indeed I am still puzzled and can’t get a fully reproducible example) seems to relate to the use of SQL WHILE loops in a script executed on the database from R via ODBC. I found many such SQL programs would silently stop after about 20 iterations of the loop under ODBC, even if they worked perfectly in Management Studio. The examples all look like this:

DECLARE @i INT = 1
WHILE @i <= 50
BEGIN
	-- do something that needs looping
	
	
	SET @i = @i + 1
END
GO

BTW, SQL is one of those languages where you avoid loops if you can, and think instead in terms of joining, aggregating and filtering tables. But there are times when it is necessary (for example, performing an action on each table in a database, such as creating a carefully chosen random sample of it in another database – one of the things we had to do in the work project mentioned above).

The solution to this mysterious refusal to go beyond about 20 (it varied) iterations in some loops was to wrap the whole action in a user-defined stored procedure, then execute the procedure. This seems satisfyingly secure in all sorts of ways. The procedure can be kept permanently or blown away depending on what makes sense:

CREATE PROCEDURE do_stuff
AS 
BEGIN
	DECLARE @i INT = 1
	WHILE @i <= 50
	BEGIN
		-- do something that needs looping
	
	
		SET @i = @i + 1
	END
END
GO

EXECUTE do_stuff
DROP PROCEDURE do_stuff

Worth noting – T-SQL distinguishes between its functions and stored procedures, whereas R lumps the two types of functionality together. Functions in SQL are true computer-science-defined functions, that take inputs and return outputs, with strictly no side effects. Stored procedures can have side effects (like creating or modifying tables). In R, functions can have side effects (and frequently do eg drawing plots), not just return outputs based on inputs.

No graphic today…


Bayesian Regression Modelling in R: Choosing informative priors in rstanarm #rstats

By Daniel

(This article was first published on R – Strenge Jacke!, and kindly contributed to R-bloggers)

Yesterday, at the last meeting of the Hamburg R User Group in this year, I had the pleasure to give a talk about Bayesian modelling and choosing (informative) priors in the rstanarm-package.

You can download the slides of my talk here.

Thanks to the Stan team and Tristan for proof reading my slides prior to the talk.


.rprofile: Jenny Bryan

By rOpenSci – open tools for open science

Jenny Bryan

(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)

Jenny Bryan @JennyBryan is a Software Engineer at RStudio and is on leave from being an Associate Professor at the University of British Columbia. Jenny serves in leadership positions with rOpenSci and Forwards and as an Ordinary member of The R Foundation.


KO: What is your name, your title, and how many years have you worked in R?

JB: I’m Jenny Bryan, I am a software engineer at RStudio (still getting used to that title), and I am on leave from being an Associate Professor at the University of British Columbia. I’ve been working with R or its predecessors since 1996. I switched to R from S in the early 2000s.

KO: Why did you make the switch to R from S?

JB: It just seemed like the community was switching over to R and I didn’t have a specific reason to do otherwise, I was just following the communal path of least resistance.

KO: You have a huge following from all the stuff you post about your course. Did you always want to be a teacher? How did you get into teaching?

JB: No, I wouldn’t say that I always wanted to be a teacher, but I think I’ve enjoyed that above average compared to other professors. But it was more that I realized several years ago that I could have a bigger impact on what people did by improving data analysis workflows, thinking, and tooling instead of trying to make incremental progress on statistical methodology. It is a reflection of where I have a comparative advantage with respect to interest and aptitude, so it’s not really a knock on statistical methodology. But I feel we could use more people working on this side of the field – working on knowledge translation.

I was also reacting to what I saw in my collaborative work. I would work with people in genomics and if I’m completely honest with myself, often my biggest contribution to the paper would be getting all the datasets and analyses organized. I didn’t necessarily do some highly sophisticated statistical analysis. It would often boil down to just doing millions of t-tests or something. But the reason I had an impact on the project would be that I got everything organized so that we could re-run it and have more confidence in our results. And I was like, I have a PhD in stats, why is this my main contribution? Why do the postdocs, grad students, and bioinformaticians not know how to do these things? So then I started to make that more and more the focus of my course, instead of squeezing in more statistical methods. Then the teaching sort of changed who I was and what I allowed myself to think about and spend time on. I used to not let myself spend time on those things. Or if I did, I would feel guilty about it because I thought, I can’t get any professional credit for this! It’s not statistically profound, but it seems to be what the world needs me to do, and needs other people to be doing.

You don’t always have to be proving a theorem, you don’t always have to be writing a package, there’s still a lot of space for worthwhile activity in between all of those things.

KO: Do you feel proud of what you’ve accomplished?

JB: I finally in some sense gave myself permission to start teaching what I thought people actually needed to know. And then after spending lots of time on it in the classroom, you realize what gaps there are, you become increasingly familiar with the tooling that you’re teaching and you’re like, hey I could actually improve that. Or no one really talks about how you get the output of this step to flow nicely as the input into the following step, i.e. how to create workflows. It really helped open my mind to different forms of work that are still valuable. You don’t always have to be proving a theorem, you don’t always have to be writing a package, there’s still a lot of space for worthwhile activity in between all of those things. However because we don’t have names for all of it, it can be difficult from a career point of view. But so many people see it, use it, and are grateful for it.

KO: Can you talk about your transition into working for RStudio and what that will look like on a day-to-day basis?

JB: In many ways it looks a lot like my life already did because I had, especially in the last two to three years, decided if I want to work on R packages or on exposition, I’m going to do that. That’s what I think tenure is for! So I had decided to stop worrying about how to sell myself in a framework set up to reward traditional work in statistical methodology. That freed up a lot of mental energy, to pursue these other activities, unapologetically. Which lead to other opportunities, such as RStudio. I was already working mostly from home. The Statistics department is by no means a negative environment for me, but the internet helped me find virtual colleagues around the globe who really share my interests. The physical comfort of home is very appealing. RStudio is also very light on meetings, which is a beautiful thing.

KO: What is your team like at RStudio? How many projects are you juggling at any given time? Do you have an idea of what you want to accomplish while you’re there?

JB: The person I interact with most is Hadley Wickham and he now has a team of five. There’s a fair amount of back and forth with other team members. I might seek their advice on, e.g., development practices, or just put questions out there for everyone. This team is pretty new and the formalization of the tidyverse is pretty new, so everyone has different packages that they’re working on, either from scratch or shifting some of the maintenance burden off of Hadley. There’s a concerted effort to figure out “what does it mean to be an ecosystem of packages that work together?”.

KO: Do you have a well defined road map at this point on the team?

JB: I’ve been on that team since January and before that we had queued up readxl as a good project for me. It was also overdue for maintenance! I was already a “Spreadsheet Lady”, very familiar with the underlying objects, and with the problem space. It was a good opportunity for me to write compiled code which I hadn’t done in a really long time. I had never written C++ so it was a way to kill at least three birds with one stone. So that was an easy selection for the first thing to work on. And even before that was done, it was clear that going back and doing another project in the Google arena made sense. We knew we would do some work with interns. Wrapping the Google Drive API was going to be useful (in general and for a future update of googlesheets) and I knew our intern Lucy McGowan would be a great person to work with on it.

So no, there’s not some detailed 18-month roadmap stretching ahead of me. I think it will cycle between doing something that’s mine or new and doing maintenance on something that already exists. I also continue to do a lot of exposition, training, and speaking.

It actually pisses me off when people criticize “when” people work – like that’s a signifier of a poor work-life balance … their heart is in the right place to encourage balance, but I have a certain amount of work I want to get done.

KO: Day-to-day, do you have regular standups? How do you like your day to be structured?

JB: Oh there’s how I wish my day was structured and how it’s actually structured. I wish I could get up and just work because that’s when I feel by far my most productive. Unfortunately, this coincides with the morning chaos of a household with three kids, who, despite the fact that we’re trying to get them more independent with lunches and getting to school, you cannot completely tune out through this part of the day. So I do not really get up and just work, I sort of work until everyone gets out the door. Then I usually go exercise at that point, get that taken care of. I get more work done in the afternoon until the children all arrive home. I do a lot of work between 9 or 10 at night and 1 in the morning. Not because I love working at that time, but that’s just what I have.

Given that I have this platform, it actually pisses me off when people criticize “when” people work – like that’s a signifier of a poor work-life balance, though it is possible that I have a poor work-life balance, but I feel like it’s usually coming from people who don’t have the same constraints in their life. “You shouldn’t work on the weekends, You shouldn’t work in the evenings”. I’m like, when the heck else do you think I would work? I feel like sometimes people are – their heart is in the right place to encourage balance, but I have a certain amount of work I want to get done. And I have a family and it means that I work when my children are asleep.

They’re happy years but the tension between all the things you want to do is unbelievable because you will not do all of them. You cannot do it all.

KO: This topic is very interesting and personal to me. As I get older I’ve been thinking (nebulously) about starting a family, and I don’t know what that looks like. It’s scary to me, to not want to give up this lifestyle and this career that I’ve started for myself.

JB: My pivoting of thinking about myself as an applied statistician to more of a data scientist, coincided with me reemerging from having little kids. I had all of them pre-tenure and at some point we had “three under three”. I was trying to get tenure, just barely getting it all done and I was kind of in my own little world, just surviving. Then the tenure process completed successfully, the kids got older, they were all in school, and eventually they didn’t need any out of school care. So me being able to string multiple abstract thoughts together and carve out hours at a time to do thought work coincided with me also freeing myself to work on stuff that I found more interesting.

I don’t know how this all would have worked out if the conventional academic statistical work had suited me better. The time where I was most conflicted between doing a decent job parenting and doing decent work was also when I was doing work I wasn’t passionate about. I can’t tell if having more enthusiasm about the work would have made that period harder or easier! I really thought about ditching it all more than a few times.

The reinvigoration that coincided with switching emphasis also coincided with the reinvigoration that comes from the kids becoming more independent. It does eventually happen! There are some very tough years – they’re not dark years, they’re happy years but the tension between all the things you want to do is unbelievable because you will not do all of them. You cannot do it all.

KO: What are your favorite tools for managing your workflow?

JB: In terms of working with R I’ve completely standardized on working with RStudio. Before that I was an Emacs-ESS zealot and I still have more accumulated years in that sphere. But once RStudio really existed and was viable, I started teaching with it. I hate doing R one way when I’m in front of students and another when I’m alone. It got very confusing and mixing up the keyboard shortcuts would create chaos. So now I’ve fully embraced RStudio and have never looked back.

I’m also a git evangelist. Everything I do is in git, everything is on Github and at this point, almost everything is public because I’ve gotten unselfconscious enough to put it up there. Plus there’s enough volume now that no one could be looking at any particular one thing. It’s so much easier for me to find it again later. I just put everything in a public place rather than trying to have this granular access control; it simplifies things greatly. Working in the open has simplified a lot of decisions, that’s nice.

Otherwise I feel like my workflow is very primitive. I have thousands of email in my inbox. I’ve completely given up on managing email and I’m mostly okay with that. It’s out of my control and I can’t commit to a system where I’m forced to get to inbox zero. I’ve just given up on it. And twitter and slack are important ways to feel connected when I’m sitting at home on my sofa.

KO: Do you have any online blogs, personalities or podcasts that you particularly enjoy? It doesn’t have to be R related.

JB: I do follow people on twitter and the rstats hashtag, so that often results in serendipitous one-off links that I enjoy. I don’t follow certain blogs regularly, but there are certain places that I end up at regularly. I like the Not So Standard Deviations podcast. In the end I always listen to every episode, but it’s what I do on an airplane or car drive.

KO: You build up a backlog?

JB: Exactly. Then the next time I need to drive to Seattle in traffic, I’ll power through four episodes.

KO: What are some of your favorite R packages – do you have some that you think are funny, or love?

JB: I live entirely in the tidyverse. I’m not doing primary data analysis on projects anymore. It’s weird that the more involved you become in honing the tools, the less time you spend wielding them. So I’m increasingly focused on the data prep, data wrangling, data input part of the cycle and not on modeling. I did a lot more of that when I was a statistician and now it’s not where my comparative interest and advantage seems to lie. There’s plenty to do on the other end. And also not that many people who like it. I actually do enjoy it. I don’t have to force myself to enjoy it – this is really important, and it pleases me. Given how important I think the work is, it’s a relatively uncrowded field. Whereas machine learning, it seems like everyone wants to make a contribution there. I’m like, you go for it – I’m going to be over here getting data out of Excel spreadsheets.


Writing Excel formatted csv using readr::write_excel_csv2

By Appsilon Data Science Blog

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

Why this post?

Currently, my team and I are building a Shiny app that serves as an interface for a forecasting model. The app allows business users to interact with predictions. However, we keep getting feature requests, such as, “Can we please have this exported to Excel.”

Our client chose to see the results exported to a csv file and wants to open them in Excel. The app is already running on a Linux server, and the csv files that can be downloaded via the app are UTF-8 encoded.

If you are a Linux user, you may not be aware that Windows Excel is not able to recognize UTF-8 encoding automatically. It turns out that a few people have faced this problem in the past.

Obviously, we cannot have a solution where our users are changing options in Excel or opening the file in any other way than double clicking.

We find having a Shiny app that allows for Excel export to be a good compromise between R/Shiny and Excel. It gives the user the power of interactivity and online access, while still preserving the possibility to work with the results in the environment they are most used to. This is a great way to gradually accustom users to working in Shiny.

Current available solution in R

What we want is the following: write a csv file with UTF-8 encoding and a BOM (the byte order mark, a Unicode character at the start of a document that signals its encoding). This has been addressed in R by RStudio in the readr package.

library(readr)

write_excel_csv(mtcars, "assets/data/readr/my_file.csv")

This is great and solves the problem with opening the file in Excel, but… it supports only one type of locale.

Show me your locale

Depending on where you live, you might have a different locale. A locale is a set of parameters that defines the user’s language, region and any special variant preferences that the user wants to see in their user interface.

This means that number formatting can differ between regions: for example, in the USA “.” is used as the decimal separator, but almost the whole of Europe uses “,”. This article shows how countries around the world define their number formats.

This shows that there is a real need to extend readr’s functionality and allow users to save Excel-friendly csv files in a European locale easily and quickly. This is not currently possible, since write_excel_csv() only allows one to write in the US locale.

New addition to readr

We proposed adding write_excel_csv2() to the readr package, which allows the user to write a csv with , as the decimal separator and ; as the column separator. To be consistent with the naming convention in R for functions reading (e.g. read.csv() and read.csv2()) or writing (e.g. write.csv() and write.csv2()) csv files with different delimiters, we decided to simply add 2 to write_excel_csv().

tmp <- tempfile()
on.exit(unlink(tmp))
readr::write_excel_csv2(mtcars, tmp)

To prove that it works, let’s read the first two lines and inspect the output.

readr::read_lines(tmp, n_max = 2)
## [1] "mpg;cyl;disp;hp;drat;wt;qsec;vs;am;gear;carb"
## [2] "21,0;6;160,0;110;3,90;2,620;16,46;0;1;4;4"
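
You can also check that the byte order mark itself is in place; the first three bytes of the file should be the UTF-8 BOM:

readBin(tmp, what = "raw", n = 3)
## should print: ef bb bf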

write_excel_csv2() is already available for download from readr repository and should be available on CRAN with the next release.

devtools::install_github("tidyverse/readr")

We hope you and your business team will find this addition useful.


In case you missed it: November 2017 roundup

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from November of particular interest to R users.

R 3.4.3 “Kite Eating Tree” has been released.

Several approaches for generating a “Secret Santa” list with R.

The “RevoScaleR” package from Microsoft R Server has now been ported to Python.

The call for papers for the R/Finance 2018 conference in Chicago is now open.

Give thanks to the volunteers behind R.

Advice for R user groups from the organizer of R-Ladies Chicago.

Use containers to build R clusters for parallel workloads in Azure with the doAzureParallel package.

A collection of R scripts for interesting visualizations that fit into a 280-character Tweet.

R is featured in a StackOverflow case study at the Microsoft Connect conference.

The City of Chicago uses R to forecast water quality and issue beach safety alerts.

A collection of best practices for sharing data in spreadsheets, from a paper by Karl Broman and Kara Woo.

The MRAN website has been updated with faster package search and other improvements.

The curl package has been updated to use the built-in winSSL library on Windows.

Beginner, intermediate and advanced on-line learning plans for developing AI applications on Azure.

A recap of the EARL conference (Effective Applications of the R Language) in Boston.

Giora Simchoni uses R to calculate the expected payout from a slot machine.

An introductory R tutorial by Jesse Sadler focuses on the analysis of historical documents.

A new RStudio cheat sheet: “Working with Strings“.

An overview of generating distributions in R via simulated gaming dice.

An analysis of StackOverflow survey data ranks R and Python among the most-liked and least-disliked languages.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don’t forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I’m @revodavid). You can find roundups of previous months here.
