Change in temperature in Netherlands over the last century

By Wingfeet

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I read a post ‘race for the warmest year’ at
library(dplyr); library(ggplot2); library(locfit); library(plotrix)  # packages used in this post
# r3 (daily temperatures: yearf, yearn, daynon, TG) and mylegend (month labels)
# are prepared earlier in the original post; grouping by year is assumed here
# so that cummean runs within each year
r4 <- group_by(r3,yearf) %>%
mutate(.,cmtemp = cummean(TG/10))

g1 <- ggplot(r4,aes(x=daynon,y=cmtemp,
    col=yearf))
g1 + geom_line(alpha=.4,show_guide=FALSE) +
    scale_x_continuous('Day',
        breaks=mylegend$daynon,
        labels=mylegend$month,
        expand=c(0,0)) +
    scale_y_continuous('Temperature (C)') +
    geom_line(data=r4[r4$yearf=='2014',],
        aes(x=daynon,y=cmtemp),
        col='black',
        size=2)

2014 compared with 30-year averages

To get a better idea of how 2014 compares to previous years, averages over 30-year periods have been added. We had a warm year, except for August. The warm start suggested an early spring; in hindsight, the second half of August had colder days than the beginning of April or the end of October.

r3$Period <- cut(r3$yearn,c(seq(1900,2013,30),2013,2014),
    labels=c('1901-1930','1931-1960',
        '1961-1990','1991-2013','2014'))
g1 <- ggplot(r3[r3$yearn<2014,],aes(x=daynon,y=TG/10,col=Period))
g1 + geom_smooth(span=.15,method='loess',size=1.5) +
    scale_x_continuous('Day',
        breaks=mylegend$daynon,
        labels=mylegend$month,
        expand=c(0,0)) +
    geom_line(#aes(x=daynon,y=TG/10),
        data=r3[r3$yearn==2014,]) +
    scale_y_continuous('Temperature (C)')

Change by year

Finally, a plot showing how temperature changed within the years. To obtain this plot, I needed a day-corrected baseline temperature. The baseline is the temperature smoothed over days for the years 1901 to 1924. It was used to compute a day-corrected temperature (the difference between observed temperature and baseline), which was subsequently smoothed over years and days.
Smoothers have edge effects. To keep these out of the visible part, January and December have been duplicated at the ends of the year, so that within the year the edge effects are minimal.
The plot shows that in the middle of the last century some parts of the year actually saw a drop in temperature. In contrast, November has gradually been getting warmer since the middle of the last century. The new century has seen quite an increase.

myyears <- r3[r3$yearn<1925,]
# duplicate January after the year and December before it to limit edge effects
# (the January cutoff of 31 below is an assumption)
m13 <- filter(myyears,daynon<31) %>%
mutate(.,daynon=daynon+365)
m0 <- filter(myyears,daynon>335) %>%
mutate(.,daynon=daynon-365)
myyears <- rbind_list(m0,myyears,m13)

nn <- .2
mymod <- locfit(TG ~ lp(daynon,nn=nn),
data=myyears)
topred <- data.frame(daynon=1:365)
topred$pp <- predict(mymod,topred)
#plot(pp~ daynon,data=topred)

# attach the day baseline pp to the data (a merge on daynon is assumed here)
# and compute the day-corrected temperature tdiff
r5 <- merge(r3,topred) %>%
mutate(.,tdiff=(TG-pp)/10) %>%
select(.,tdiff,daynon,yearn)
m13 <- filter(r5,daynon<31) %>%   # January cutoff assumed, as above
mutate(.,daynon=daynon+365,
yearn=yearn-1)
m0 <- filter(r5,daynon>335) %>%
mutate(.,daynon=daynon-365,
yearn=yearn+1)
r6 <- rbind_list(m0,r5,m13)

topred <- expand.grid(
daynon=seq(1:365),
yearn=1901:2014)
topred$pp2 <- locfit(
tdiff ~ lp(yearn,daynon,nn=nn),
data=r6) %>%
predict(.,topred)
topred <- arrange(topred,daynon,yearn) # order rows so yearn varies fastest, matching the matrix() call below

myz <- matrix(topred$pp2,ncol=365)
zmin <- floor(min(topred$pp2)*10)/10
zmax <- ceiling(max(topred$pp2)*10)/10
myseq <- seq(zmin,zmax,.1)
par(mar=c(5,4,4,6))
image(myz,useRaster=TRUE,
    axes=FALSE,frame.plot=TRUE,
    col=colorRampPalette(c('blue','red'))(length(myseq)-1),
    breaks=myseq)
axis(at=(seq(10,114,by=10)-1)/113,labels=seq(1910,2010,by=10),side=1)
axis(at=(mylegend$daynon-1)/365,labels=mylegend$month,side=2)
color.legend(1.1,0,1.2,1,legend=c(zmin,zmax),gradient='y',
    rect.col=colorRampPalette(c('blue','red'))(length(myseq)-1))

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.

Storing Forecasts in a Database

By ivannp

(This article was first published on Quintuitive » R, and kindly contributed to R-bloggers)

In my last post I mentioned that I started using RSQLite to store computed results. No rocket science here, but my feeling is that this might be useful to others, hence, this post. This can be done using any database, but I will use (R)SQLite as an illustration.

Let’s assume we are running a long ARMA/GARCH simulation (see here for an example) on daily data for the S&P 500. For each day, we fit numerous models using our package of choice. We need to store all useful information in a database and use it later for various analyses. A reasonable approach is to create two tables: one to store the models, and one to store the forecasts. Here is the relevant RSQLite code:

require(RSQLite)

driver = dbDriver("SQLite")
connection = dbConnect(driver, dbname=db.path)

query = paste(
           " create table if not exists models ( ",
           "    history integer not null, ",
           "    p integer not null, ",
           "    q integer not null, ",
           "    r integer not null, ",
           "    s integer not null, ",
           "    dist varchar(32) not null)",
           sep="")
dbGetQuery(connection, query)

query = paste(
           " create unique index if not exists models_unique ",
           "    on models(history,p,q,r,s,dist) ",
           sep="")
dbGetQuery(connection, query)
      
query = paste(
           " create table if not exists forecasts ( ",
           "    model integer not null, ",
           "    date datetime not null, ",
           "    mu real, ",
           "    sigma real, ",
           "    ic real) ",
           sep="")
dbGetQuery(connection, query)
      
query = paste(
           " create unique index if not exists forecasts_unique ",
           "    on forecasts(model,date) ",
           sep="")
dbGetQuery(connection, query)

dbDisconnect(connection)

The models table has one row for each model, which is a unique combination of (history,p,q,r,s,distribution). Each forecast refers to its model (this way we avoid repetitions – a standard database normalization) and specifies its date. The other columns of the forecasts table can be pretty much anything that’s needed. In the example, we save the forecasts for the mean and variance and the information criterion (AIC, for instance). The update code is a bit more involved (at least in my implementation):

# The assumption is that we are in a context (function) containing the
# parameters specifying the model, i.e. they are available via the
# history,p,q,r,s and dist variables. We also assume that the forecast(s)
# are contained in fore.df.

driver = DBI::dbDriver("SQLite")
connection = DBI::dbConnect(driver, dbname=db.path)

# First add the model if it doesn't exist already
model.df = data.frame(history, p, q, r, s, dist)
colnames(model.df) = c("history", "p", "q", "r", "s", "dist")
query = paste(
           " insert or ignore into models (history, p, q, r, s, dist) ",
           "    values (@history, @p, @q, @r, @s, @dist) ",
           sep="")
RSQLite::dbGetPreparedQuery(connection, query, bind.data=model.df)

# Get the model id from the rowid column (added automatically by SQLite)
query = paste("select rowid from models where history=? and p=? and q=? and r=? and s=? and dist=?", sep="")
rowid = as.numeric(RSQLite::dbGetPreparedQuery(connection, query, bind.data=model.df))

# Set the model id in the forecast data frame
fore.df[,1] = rowid
colnames(fore.df) = c("model", "date", "mu", "sigma", "ic")

DBI::dbBegin(connection)
# Insert the forecasts
query = paste(
           " insert or replace into forecasts ( ",
           "    model, date, mu, sigma, ic) ",
           " values (@model, @date, @mu, @sigma, @ic) ",
           sep="")
RSQLite::dbGetPreparedQuery(connection, query, bind.data=fore.df)
DBI::dbCommit(connection)

DBI::dbDisconnect(connection)

In summary, insert the model if not already available, get the unique id for the model (using SQLite’s rowid unique column added automatically to each table), add the model id to the forecasts data frame and finally insert the forecasts data frame into the forecasts table.

I have been using this approach for a while and have been quite happy with its usefulness.

Last but not least, if you are performing the simulation in parallel, the update must be wrapped in a critical section (for more details, check my previous post and the flock package).
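
For illustration only, such a critical section could look roughly like the sketch below; the lock-file path is made up, and the lock() / unlock() calls reflect my reading of the flock interface rather than code from this post.

require(flock)

# Acquire an exclusive lock so that only one parallel worker at a time
# performs the database update
lock.handle = flock::lock(paste0(db.path, ".lock"))

driver = DBI::dbDriver("SQLite")
connection = DBI::dbConnect(driver, dbname=db.path)
# ... the model and forecast inserts shown above go here ...
DBI::dbDisconnect(connection)

flock::unlock(lock.handle)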

To leave a comment for the author, please follow the link and comment on his blog: Quintuitive » R.

RcppArmadillo 0.4.550.1.0

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A week ago, Conrad provided another minor release 4.550.0 of Armadillo which has since received one minor correction in 4.550.1.0. As before, I had created a GitHub-only pre-release of his pre-release which was tested against the almost one hundred CRAN dependents of our RcppArmadillo package. This passed fine as usual, and results are as always in the rcpp-logs repository.

Processing and acceptance at CRAN took a little longer, as around the same time a fresh failure in the unit tests had become apparent on an as-yet unannounced new architecture (!!) also tested at CRAN. R-devel has since gotten a new capabilities() test for long double, and we now only run this test (for our rmultinom()) if the test asserts that the given R build has this capability. Phew, so with all that the new version is now on CRAN; Windows binaries have been built and I also uploaded new Debian binaries.

Changes are summarized below; on our end this also includes support for conversion of Armadillo Field types, thanks to a short pull request by Romain.

Changes in RcppArmadillo version 0.4.550.1.0 (2014-11-26)

  • Upgraded to Armadillo release Version 4.550.1 (“Singapore Sling Deluxe”)

    • added matrix exponential function: expmat() (a short usage sketch follows this list)

    • faster .log_p() and .avg_log_p() functions in the gmm_diag class when compiling with OpenMP enabled

    • faster handling of in-place addition/subtraction of expressions with an outer product

    • applied correction to gmm_diag relative to the 4.550 release

  • The Armadillo Field type is now converted in as<>() conversions
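
As a quick, unofficial illustration of the new matrix exponential (this snippet is mine, not from the release notes; it assumes Rcpp and RcppArmadillo are installed):

library(Rcpp)

# inline C++ function calling the new arma::expmat() added in Armadillo 4.550
cppFunction(depends = "RcppArmadillo", '
arma::mat matExp(const arma::mat & X) {
    return arma::expmat(X);
}')

matExp(diag(2))   # identity matrix in, exp(1) on the diagonal out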

Courtesy of CRANberries, there is also a diffstat report for the most recent release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

CRAN Task Views for Finance and HPC now (also) on GitHub

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The CRAN Task View system is a fine project which Achim Zeileis initiated almost a decade ago. It is described in a short R Journal article in Volume 5, Number 1. I have been editor / maintainer of the Finance task view essentially since the very beginning of these CRAN Task Views, and added the High-Performance Computing one in the fall of 2008. Many, many people have helped by sending suggestions or even patches; email continues to be the main venue for the changes.

The maintainers of the Web Technologies task view were, at least as far as I know, the first to make the jump to maintaining the task view on GitHub. Karthik and I briefly talked about this when he was in town a few weeks ago for our joint Software Carpentry workshop at Northwestern.

So the topic had been on my mind, but it was only today that I realized that the near-limitless amount of awesome that is pandoc can probably help with maintenance. The task view code by Achim neatly converts the very regular, very XML, very boring original format into somewhat-CRAN-website-specific html. Pandoc, being as versatile as it is, can then make (GitHub-flavoured) markdown out of this, and with a minimal amount of sed magic, we get what we need.

And hence we now have these two new repos:

Contributions are now most welcome by pull request. You can run the included converter script; it differs between the two repos only in one constant for the task view / file name. As an illustration, the one for Finance is below.

#!/usr/bin/r
## if you do not have /usr/bin/r from littler, just use Rscript

ctv <- "Finance"

ctvfile  <- paste0(ctv, ".ctv")
htmlfile <- paste0(ctv, ".html")
mdfile   <- "README.md"

## load packages
suppressMessages(library(XML))          # called by ctv
suppressMessages(library(ctv))

r <- getOption("repos")                 # set CRAN mirror
r["CRAN"] <- "http://cran.rstudio.com"
options(repos=r)

check_ctv_packages(ctvfile)             # run the check

## create html file from ctv file
ctv2html(read.ctv(ctvfile), htmlfile)

### these look atrocious, but are pretty straight forward. read them one by one
###  - start from the htmlfile
cmd <- paste0("cat ", htmlfile,
###  - in lines of the form  ^<a href="Word">Word.html</a>
###  - capture the 'Word' and insert it into a larger URL containing an absolute reference to task view 'Word'
  " | sed -e 's|^<a href="([a-zA-Z]*).html|<a href="http://cran.rstudio.com/web/views/1.html"|' | ",
###  - call pandoc, specifying html as input and github-flavoured markdown as output
              "pandoc -s -r html -w markdown_github | ",
###  - deal with the header by removing extra ||, replacing |** with ** and **| with **:              
              "sed -e's/||//g' -e's/|**/**/g' -e's/**|/** /g' -e's/|$/  /g' ",
###  - make the implicit URL to packages explicit
              "-e's|../packages/|http://cran.rstudio.com/web/packages/|g' ",
###  - write out mdfile
              "> ", mdfile)

system(cmd)                             # run the conversion

unlink(htmlfile)                        # remove temporary html file

cat("Done.n")

I am quite pleased with this setup, so a quick thanks towards the maintainers of the Web Technologies task view; of course to Achim for creating CRAN Task Views in the first place, and maintaining them all those years; as always to John MacFarlane for the magic that is pandoc; and last but not least of course to anybody who has contributed to the CRAN Task Views.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

Some R Highlights from H2O World

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

H2O.ai held its first H2O World conference over two days at the Computer History Museum in Mountain View, CA. Although the main purpose of the conference was to promote the company’s rich set of Java-based machine learning algorithms and announce their new products Flow and Play, there were quite a few sessions devoted to R and statistics in general.

Before I describe some of these, a few words about the conference itself. H2O World was exceptionally well run, especially for a first try with over 500 people attending (my estimate). The venue is an interesting, accommodating space with plenty of parking that played well with what, I think, must have been an underlying theme of the conference: acknowledging the contributions of past generations of computer scientists and statisticians. There were two stages offering simultaneous talks for at least part of the conference: the Paul Erdős stage and the John Tukey stage. Tukey I got, but why put such an eccentric mathematician as Erdős front and center? I was puzzled until Sri Ambati, H2O.ai’s CEO and co-founder, remarked that he admired Erdős for his great generosity with collaboration. To a greater extent than most similar events, H2O World itself felt like a collaboration. There was plenty of opportunity to interact with other attendees, speakers and H2O technical staff (the whole company must have been there). Data scientists, developers and marketing staff were accessible and gracious with their time. Well done!

R was center stage for a good bit of the hands-on training that occupied the first day of the conference. There were several sessions (Exploratory Data Analysis, Regression, Deep Learning, Clustering and Dimensionality Reduction) on accessing various H2O algorithms through the h2o R package and the H2O API. All of these moved quickly from R to running the custom H2O algorithms on the JVM. However, the message that came through is that R is the right environment for sophisticated machine learning.

Two great pleasures from the second day of the conference were Trevor Hastie’s tutorial on the Gradient Boosting Machine and John Chambers’ personal remembrances of John Tukey. It is unusual for a speaker to announce that he has been asked to condense a two-hour talk into something just under an hour and then go on to speak slowly with great clarity, each sentence beguiling you into imagining that you are really following the details. (It would be very nice if the video of this talk were made available.)

Two notable points from Trevor’s lecture were understanding gradient boosting as minimizing the exponential loss function and the openness of the gbm algorithm to “tinkering”. For the former point see Chapter 10 of the Elements of Statistical Learning or the more extended discussion in Schapire and Freund’s Boosting: Foundations and Algorithms.

John Tukey spent 40 years at Bell Labs (1945 – 1985), and John Chambers’ tenure there overlapped the last 20 years of Tukey’s stay. Chambers, who had the opportunity to observe Tukey over this extended period of time, painted a moving and lifelike portrait of the man. According to Chambers, Tukey could be patient and gracious with customers and staff, provocative with his statistician colleagues and “intellectually intimidating”. John remembered Richard Hamming saying: “John (Tukey) was a genius. I was not.” Tukey apparently delighted in making up new terms when talking with fellow statisticians. For example, he called the top and bottom lines that identify the interquartile range on a box plot “hinges”, not quartiles. I found it particularly interesting that Tukey would describe a statistic in terms of the process used to compute it, and not in terms of any underlying theory. Very unusual, I would think, for someone who earned a PhD in topology under Solomon Lefschetz. For more memories of John Tukey, including more from John Chambers, look here.

Other R-related highlights were talks by Matt Dowle and Erin LeDell. Matt reprised the update on new features in data.table that he recently gave to the Bay Area useR Group and also presented interesting applications of data.table from the UK insurance company Landmark and from KatRisk (look here for the KatRisk part of Matt’s presentation).

Erin, author of the h2oEnsemble package available on GitHub, delivered an exciting and informative talk on using ensembles of learners (combining gbm models and logistic regression models, for example) to create “superlearners”.

Finally, I gave a short talk on Revolution Analytics’ recent work towards achieving reproducibility in R. The presentation motivates the need for reproducibility by examining the use of R in industry and science, and describes how the checkpoint package and Revolution R Open, an open source distribution of R that points to a static repository, can be helpful.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R Object-oriented Programming – Book Review

By Fabio Veronesi

(This article was first published on R Video tutorial for Spatial Statistics, and kindly contributed to R-bloggers)


I have been asked to review the book “R Object-oriented Programming” by Kelly Black, published by Packt Publishing (£14.45 for the E-Book, £27.99 for Print+E-Book).
The scope of the book is “to provide a resource for programming using the R language” and therefore it can be seen as a good and practical introduction to all the most commonly used parts of R. The first two chapters deal with data types and data organization in R. They quickly review how to handle each type of data (such as integers and doubles) and how to organize them into R objects. The third chapter deals with reading data from files and saving them. It gives a pretty good introduction to reading and writing every sort of data, even binaries, from a variety of sources, including the web. Chapter 4 provides an introduction to the R commands for generating random numbers; in particular, it gives a thorough overview of the sample command. Chapters 5 and 6 give a good background in the use of R to manipulate string and time variables. Of particular interest throughout the book is the handling of data gathered from public sources on the web. For such data, skills in string manipulation become crucial, both for handling web addresses and for extracting the actual data from the information returned by the server. For this reason I think this book does a good job of introducing these important aspects of the R language.
Chapter 7 introduces some basic programming concepts, such as if statements and loops. Chapters 8 and 9 provide a complete overview of the S3 and S4 classes, and finally chapters 10 and 11 are two hands-on examples of how to put together all the concepts learned in the book to solve very practical problems. In these examples the reader is guided through the creation of powerful R programs to grade students and to perform a Monte Carlo simulation.
The book is written in a very practical form, meaning that not much time is spent explaining each function in detail; readers can browse the help pages of each function for more information. This means that the book is probably not for complete newcomers to programming. Most of the learning is done by exploring the lines of code provided, and for this reason I think the best readers would be people already familiar with a programming language, even though I do not think readers necessarily need prior familiarity with R. However, as stated on the website, the target audience for this book is beginners who want to become more “fluent” with the language.
Overall, I think this book does a good job of providing the reader with a strong and neat introduction to all the bits of coding required to become comfortable writing advanced scripts. For example, at the end of chapter 2 the author discusses the use of the apply set of commands. These are crucial milestones for every individual who wants to switch from a mundane use of R to a more advanced and rigorous use of the language. In my personal experience, when I began using R I would often create very long scripts full of loops and if statements, which tend to greatly decrease execution speed. As soon as I learned to master the apply set of commands I was able to reduce my code and, crucially, to substantially increase its execution speed. Personally I would have loved to have access to such a book back then! The use of web sources for data manipulation is also a very nice addition that, as far as I know, is not common in other introductory texts. Nowadays gathering data from the web has become the norm, and therefore I think it is important to provide beginners with the tools to handle these types of data.
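
(As a tiny illustration of my own, not an example taken from the book: an explicit loop and its apply-style equivalent.)

x <- 1:10000

# explicit loop, growing the result vector element by element
res <- numeric(0)
for (i in x) res <- c(res, sqrt(i))

# the apply-style equivalent: shorter and usually much faster
res2 <- sapply(x, sqrt)
identical(res, res2)   # TRUE
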
The strength of this book, however, is in chapters 8 and 9, which provide an extensive introduction to the use of the S3 and S4 classes. I think these two chapters alone would justify buying it. As far as I know these concepts are generally not treated with the right attention in books for beginners. Such books may explain that when you load a package the functions you normally use, such as plot, may change their behaviour and options. However, I never found an introductory book that provides such an exhaustive explanation of how to fully control these classes to create advanced programs. Of particular interest are also the two examples provided in chapters 10 and 11. These are practical exercises that put together all the concepts learned in the previous chapters with the purpose of creating R programs that can be easily implemented and shared. Chapter 10, for example, describes a neat and powerful way to create a new R program to grade students. In this chapter the reader uses all the basic programming concepts learned during the course of the book and puts them together to create an R program that imports grades from csv files, manipulates them and produces summary statistics and plots.
In conclusion, I see a variety of uses for this book. Clearly it is targeted at post-beginners who need a short way to unlock the full power of R for their daily statistical routines. However, this book does not lose its purpose once the language has been properly learned. It is written in such a way that even experienced R users will find it a useful way to quickly look up functions and methods that they may not use very often. I sometimes forget how to use certain functions, and having such a book on my office bookshelf will certainly help in those frustrating situations. So I think it will become part of the set of references that future R users will consult on a regular basis.

To leave a comment for the author, please follow the link and comment on his blog: R Video tutorial for Spatial Statistics.

R Package or Library

By Nicola Sturaro Sommacal

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Recently I was talking about an R package I developed with a colleague. He used the word library several times to refer to the R package. So I realized that many R users do not know that package and library are not synonyms when referring to R.

The “Writing R Extensions” manual is clear: “A package is not a library“, although the same manual admits “this is a persistent mis-usage”.

What is a package

An R package is a directory of files which extend R. Some authors say that R packages are a good way to distribute R code, just as papers are a good way to disseminate scientific research. Rossi provides some good reasons to write an R package.

  • We refer to the directory containing the files as a source package: the master files of a package. This directory can be compressed into a tarball (.tar.gz), which contains the files of the source package.
  • An installed package is the result of running R CMD INSTALL or install.packages() at the R console on a source package.
  • On some platforms (notably OS X and Windows) there are also binary packages, a zip file or tarball containing the files of an installed package which can be unpacked rather than installing from sources.

Summarizing: we can refer to the source package as the human readable version of the package and to the installed package or to the binary package as the computer readable version.

What is a library

In R, a library can refer to:

  • A directory into which packages are installed, e.g. /usr/lib/R/library.
  • A shared, dynamic or static library or (especially on Windows) a DLL, where the second L stands for ‘library’. Installed packages may contain compiled code in what is known on Unix-alikes as a shared object and on Windows as a DLL.

Origin of the mis-use

The “Writing R Extensions” manual suggests that the mis-use stems from S, whose analogues of R’s packages were officially known as library sections and later as chapters, but almost always referred to as libraries.

I add to this that the name of the R function used to load packages, i.e. library(), does not help understanding. Remember that before loading a package we have to install it with install.packages(); then we load the package from the directory into which it was installed, that is, from a library.
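
A short session makes the distinction concrete (the paths shown are examples and will differ between systems):

install.packages("ggplot2")   # installs the ggplot2 package into a library, i.e. a directory

.libPaths()                   # the libraries R knows about; the package ends up in the first
                              # one, e.g. "/usr/lib/R/library" or a per-user library

library(ggplot2)              # loads the ggplot2 package from that library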

Summary

At the end of this post, what should a new user remember of all this?

Ramarro, a web book about advanced R programming written by Andrea Spanò, contains a useful summary.

Terms about R packages are often confused. This may help to clarify:

  • Package: a collection of R functions, data, and compiled code in a well-defined format.
  • Library: the directory where packages are installed.
  • Repository: A website providing packages for installation.
  • Source: The original version of a package with human-readable text and code.
  • Binary: A compiled version of a package with computer-readable text and code, may work only on a specific platform.
To leave a comment for the author, please follow the link and comment on his blog: MilanoR.

R, an Integrated Statistical Programming Environment and GIS

By Robin Lovelace – R

(Figure: clustering)

(This article was first published on Robin Lovelace – R, and kindly contributed to R-bloggers)

This article was originally published in Geoinformatics magazine.

R is well known as a powerful, extensible and relatively fast statistical programming language and open software project with a command line interface (CLI). What is less well known is that R also has cutting-edge spatial packages that allow it to behave as a fully featured Geographical Information System in the full sense of the word. In fact, some cutting-edge algorithms for image processing and spatial statistics are implemented in R before any other widely available software product (Bivand et al. 2013). Sophisticated techniques such as geographically weighted regression and spatial interaction models can be custom built around your spatial data in R. But R also works as a general purpose GIS, with mature functions for performing all established techniques of spatial analysis such as spatial selections, buffers and clipping. What is unique about R is that all these capabilities are found in a single programme: R provides a truly integrated modelling environment.

The advantages and drawbacks of R as a GIS

Despite being able to perform the same operations as dedicated GIS software such as ArcGIS and QGIS, R is fundamentally different in the way that the user interacts with it. Most operations are completed by typing (e.g. you type “plot(map1)” to plot the data contained in the map1 object), and the visualisation stage is also different. There is no dynamic canvas which can be used to pan and zoom: instead, R only produces visual or other types of output when commanded to do so, using functions such as plot.

Table 1: A summary of the relative merits of R compared with more traditional GIS software.

  • User interface. Advantage: a command line interface allows rapid description of workflow and reproducibility. Drawback: steep learning curve (eased by RStudio).
  • Visualising data. Advantage: sophisticated and customisable graphics. Drawback: no dynamic zoomable canvas.
  • Selecting data. Advantage: concise and consistent method using square brackets (e.g. “map1[x > y,]”). Drawback: difficult to dynamically select objects from a map.
  • Manipulating data. Advantage: very wide range of functions through additional packages. Drawback: only single core processing.
  • Analysing/modelling data. Advantage: integrated processing, analysis, and modelling framework. Drawback: sometimes more than one solution available.

How to get started with spatial data in R

Here is not the place to go into the details of spatial data analysis in R. Instead, we provide a number of resources that will enable you to get up to speed rapidly with both R and its spatial packages. Note that when we say “packages” for R, we are referring to specific add-ons, analogous to extensions in ArcGIS and QGIS. The range of add-ons is truly vast, as can be seen on the R website: http://cran.r-project.org/web/views/Spatial.html. This diversity can be daunting and it is certainly frustrating when you know that a problem can be solved in several different ways: R teaches you to think about your data and analysis carefully. The most commonly used packages for spatial analysis are probably sp (the basis of spatial functionality in R), rgdal (for loading spatial file formats such as shapefiles) and rgeos (for spatial analysis). Each is installed and loaded in the same way: rgeos, for example, is installed and loaded by typing:


    install.packages("rgeos")
    library(rgeos)

An introductory tutorial on R as a GIS is the working paper “Introduction to visualising spatial data in R” (Lovelace and Cheshire, 2014). This document, along with sample code and data, is available free online. It contains links to many other spatial R resources, so is a good starting point for further explorations of R as a GIS.

R in action as a GIS

To demonstrate where R’s flexibility comes into its own, imagine you have a large number of points that you would like to analyse. These are heavily clustered in a few areas, and you would like to a) know where these clusters are; b) remove the points in these clusters and create single points in their place, which summarise the information contained in the clustered points; c) visualise the output.

We will not actually run through all the steps needed to do this in R. Suffice it to say that it is possible in R and very difficult to do in other GIS packages such as QGIS, ArcGIS or even PostGIS (another command-line GIS, based on the PostgreSQL database): I was asked to tackle this problem by the Spanish GIS service SIGTE after other solutions had been tried.

All of the steps needed to solve the problem, including provision of example data, are provided online. Here I provide an overview of the processes involved and some of the key functions to provide insight into the R way of thinking about spatial data.

First the data must be loaded and converted into yet another spatial data class. This is done using the readOGR function from the rgdal package mentioned above and then using the command as(SpatialPoints(stations), "ppp") to convert the spatial object stations into the ppp class from the spatstat package.

Next the data is converted into a density raster. The value of each pixel corresponds to the interpolated density of points in that area. This is visualised using the plot and contour functions to ensure that the conversion has worked properly.

The raster image is converted into smooth lines using the contourLines function. The lines, one for each cluster zone, must then be converted into polygons using the command gPolygonize(SLDF[5, ]). gPolygonize is a very useful function from the rgeos package which automates the conversion of lines into polygons.
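
The full worked example is available online; purely as a condensed sketch of the steps just described (the readOGR arguments, the density bandwidth and the choice of contour level are placeholders, not the original code):

    library(rgdal)     # readOGR
    library(spatstat)  # kernel density estimation on point patterns
    library(maptools)  # coercions between sp and spatstat classes
    library(rgeos)     # gPolygonize

    stations <- readOGR(dsn = ".", layer = "stations")  # load the points
    sSp  <- as(SpatialPoints(stations), "ppp")          # convert to spatstat's ppp class
    Dens <- density(sSp, adjust = 0.2)                  # kernel density surface
    plot(Dens); contour(Dens, add = TRUE)               # check the interpolation visually

    Dsg  <- as(Dens, "SpatialGridDataFrame")            # density back to an sp grid
    Dim  <- as.image.SpatialGridDataFrame(Dsg)          # image list for contourLines()
    SLDF <- ContourLines2SLDF(contourLines(Dim))        # contour lines as sp lines
    cAg  <- gPolygonize(SLDF[5, ])                      # close one contour level into polygons
    sIn  <- stations[cAg, ]                             # stations inside the cluster polygons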

The results are plotted using the rather lengthy set of commands shown below. This results in the figure displayed above (notice the col = "red" argument creates the red fill of the zones):


    plot(Dens, main = "")
    plot(lnd, border = "grey", lwd = 2, add = T)
    plot(SLDF, col = terrain.colors(8), add = T)
    plot(cAg, col = "red", border = "white", add = T)
    graphics::text(coordinates(cAg) + 1000, labels = cAg$CODE)

Finally the points inside the cluster polygons are extracted using R’s very concise spatial subsetting syntax:


    sIn <- stations[cAg, ]  # select the stations inside the clusters

The code above, translated into English, means “take all the station points within the cluster polygon called cAg and save them as a new object called sIn”.

Conclusion

The purpose of this article has been to introduce the idea that GIS software can take many forms, including the rather unusual command line interface of the statistical programming language R. Hopefully any preconceptions about poor performance have been dispelled by the examples of converting text into spatial objects and of clustering points. Both examples would be difficult to undertake in more traditional GIS software packages. R has a steep learning curve and, for most users, should probably be seen more as a powerful tool to use in harmony with other GIS packages for dealing with particularly tricky/unconventional tasks than as a standalone GIS in its own right. R interfaces to QGIS and ArcGIS should make this easier, although this solution is not yet mature. In addition to new functionalities, R should also provide a new way of thinking about spatial data for many GIS users (Lovelace and Cheshire, 2014). Yes it has a steep learning curve, but it’s a fun curve to be on, whether you are at the beginning of the ascent or in the near vertical phase of the exponential function!

References

Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2013). Applied spatial data analysis with R (Vol. 747248717). Springer.

Lovelace, R., & Cheshire, J. (2014). Introduction to visualising spatial data in R. National Centre for Research Methods, 1403. Retrieved from http://eprints.ncrm.ac.uk/3295/

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10). Retrieved from http://vita.had.co.nz/papers/tidy-data.html

To leave a comment for the author, please follow the link and comment on his blog: Robin Lovelace – R.

Le Monde puzzle [#887quater]

By xi’an

(This article was first published on Xi’an’s Og » R, and kindly contributed to R-bloggers)

And yet another resolution of this combinatorics Le Monde mathematical puzzle: that puzzle puzzled many more people than usual! This solution is by Marco F, using a travelling salesman representation and existing TSP software.

N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?

For instance, take n=199. First calculate the ‘friends’ of each number (the numbers it can sit next to, i.e. those with which it sums to a perfect square) and save them in a symmetric square matrix:

m1 <- matrix(Inf, nrow=199, ncol=199)
diag(m1) <- 0
for (i in 1:199) m1[i,friends[i]] <- 1
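
The friends object itself is not shown in the excerpt. A minimal sketch of how it could be built (my assumption: friends[[i]] lists every j with i + j a perfect square, in which case the assignment above would index it as friends[[i]]):

friends <- lapply(1:199, function(i) {
  j <- setdiff(1:199, i)
  j[sqrt(i + j) %% 1 == 0]   # keep the j for which i + j is a perfect square
})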

Export the distance matrix to a file (in TSPlib format):

library(TSP)
tsp <- TSP(m1)
tsp
image(tsp)
write_TSPLIB(tsp, "f199.TSPLIB")

And use a solver to obtain the results. The best solver for TSP is Concorde. There are online versions where you can submit jobs:

0 2 1000000
2 96 1000000
96 191 1000000
191 168 1000000
  ...

The numbers of the solution are in the second column (2, 96, 191, 168…). And they are 0-indexed, so you have to add 1 to them:

3  97 192 169 155 101 188 136 120  49 176 148 108 181 143 113 112  84  37  63 18  31  33  88168 193  96 160 129 127 162 199  90  79 177 147  78  22 122 167 194 130  39 157  99 190 13491 198  58  23  41 128 196  60  21 100 189 172 152 73 183 106  38 131 125 164 197  59 110 146178 111 145  80  20  61 135 121  75  6  94 195166 123 133 156  69  52 144  81  40   9  72 184  12  24  57  87  82 62  19  45  76 180 109 116 173 151  74  26  95 161 163 126  43 153 17154  27 117 139  30  70  11  89 107 118 138 186103  66 159 165 124 132  93  28   8  17  32  45  44  77 179 182 142  83  86  14  50 175 114 55 141 115  29  92 104 185  71  10  15  34   27  42 154 170 191  98 158  67 102 187 137 119 25  56 65  35  46 150 174  51  13  68  53  47 149 140  85  36  64 105  16  48

To leave a comment for the author, please follow the link and comment on his blog: Xi’an’s Og » R.

Eggnog for Thanksgiving

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It’s Thanksgiving Day here in the US, so we’re taking the day off to spend some time with our families and to eat far too much food. If you’re in the US or celebrating Thanksgiving elsewhere, enjoy the day! And for everyone in this season of joy, here’s a handy app to scale your eggnog recipe, however many people are around your table.

The R code and the Shiny app are from Hadley Wickham. Enjoy!

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
