Basic mapping and attribute joins in R

By Robin Lovelace – R

(This article was first published on Robin Lovelace – R, and kindly contributed to R-bloggers)

This post is based on the free and open source Creating-maps-in-R teaching resource for introducing R as a command-line GIS.

R is well known as a language ideally suited for data processing, statistics and modelling. R has a number of spatial packages, allowing analyses that would require hundreds of lines of code in other languages to be implemented with relative ease. Geographically weighted regression, analysis of space-time data and raster processing are three niche areas where R outperforms much of the competition, thanks to community contributions such as the spgwr, spacetime and wonderfully straightforward raster packages.

What seems to be less well known is that R performs well as a standalone Geographical Information System (GIS) in its own right. Everyday tasks such as reading and writing geographical data formats, reprojecting, joining, subsetting and overlaying spatial objects can be easy and intuitive in R, once you understand the slightly specialist data formats and syntax of spatial R objects and functions. These basic operations are the foundations of GIS. Mastering them will make more advanced operations much easier. Based on the saying ‘master walking before trying to run’, this mini tutorial demonstrates how to load and plot a simple geographical object in R, illustrating the ease with which continuous and binned choropleth map color schemes can be created using ggmap, an extension of the popular ggplot2 graphics package. Crucially, we will also see how to join spatial and non-spatial datasets, resulting in a map of where the Conservative party succeeded and failed in gaining council seats in the 2014 local elections.

As with any project, the starting point is to load the data we’ll be using. In this case we can download all the datasets from a single source: the Creating-maps-in-R GitHub repository, which is designed to introduce R’s basic geographical functionality to beginners. We can use R to download and unzip the files using the following commands (from a Linux-based operating system). This ensures reproducibility:

# load the packages we'll be using for this tutorial
x <- c("rgdal", "dplyr", "ggmap", "RColorBrewer")
lapply(x, library, character.only = TRUE)
# download the repository:
download.file("https://github.com/Robinlovelace/Creating-maps-in-R/archive/master.zip", destfile = "rmaps.zip", method = "wget")
unzip("rmaps.zip") # unzip the files

Once ‘in’ the folder, R has easy access to all the datasets we need for this tutorial. As this is about GIS, the first stage is to load and plot some spatial data: a map of London:

setwd("/home/robin/Desktop/Creating-maps-in-R-master/") # navigate into the unzipped folder
london <- readOGR("data/", layer = "london_sport")
## OGR data source with driver: ESRI Shapefile 
## Source: "data/", layer: "london_sport"
## with 33 features and 4 fields
## Feature type: wkbPolygon with 2 dimensions
plot(london)

The data has clearly loaded correctly and can be visualised, but where is it? If the london object is simply printed, a load of unreadable information appears, including the coordinates defining the geographical extent of each zone and additional non-geographical attributes. A more useful way in is through R's generic functions, which are polymorphic: they behave differently depending on the type of data they are fed. The default mean.default function does not work on dates, for example, so the following command is actually calling mean.Date behind the scenes, allowing R to tell us that the 2nd of July was halfway through the year:

mean(as.Date(c("01/01/2014", "31/12/2014"), format = "%d/%m/%Y"))
## [1] "2014-07-02"
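
To see the dispatch at work, the short sketch below (not part of the original tutorial) inspects the class of the date vector and compares the generic with a direct call to mean.default. The variable names d, via_generic and via_default are illustrative only.

```r
# Two dates marking the start and end of 2014
d <- as.Date(c("01/01/2014", "31/12/2014"), format = "%d/%m/%Y")

class(d)  # the class attribute is what the generic dispatches on

# mean() is a generic: for a Date vector it hands over to mean.Date
via_generic <- mean(d)

# Calling the default method directly bypasses dispatch; it expects
# numeric or logical input, so a Date vector gives NA with a warning
via_default <- suppressWarnings(mean.default(d))

print(via_generic)
print(via_default)
```

Running methods(mean) lists the class-specific versions of mean available in the current session.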

In the same way, we can use the trusty summary function to summarise our R object:

summary(london)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##        min      max
## x 503571.2 561941.1
## y 155850.8 200932.5
## Is projected: TRUE 
## proj4string :
## [+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000
## +y_0=-100000 +ellps=airy +units=m +no_defs]
## Data attributes:
##    ons_label                    name      Partic_Per       Pop_2001     
##  00AA   : 1   Barking and Dagenham: 1   Min.   : 9.10   Min.   :  7181  
##  00AB   : 1   Barnet              : 1   1st Qu.:17.60   1st Qu.:181284  
##  00AC   : 1   Bexley              : 1   Median :19.40   Median :216505  
##  00AD   : 1   Brent               : 1   Mean   :20.05   Mean   :217335  
##  00AE   : 1   Bromley             : 1   3rd Qu.:21.70   3rd Qu.:248917  
##  00AF   : 1   Camden              : 1   Max.   :28.40   Max.   :330584  
##  (Other):27   (Other)             :27

This has output some very useful information: the bounding box of the object, its coordinate reference system (CRS) and even summaries of the attributes associated with each zone. nrow(london) will tell us that there are 33 polygons represented within the object.

To gain a fuller understanding of the structure of the london object, we can use the str function (but only on the first polygon, to avoid an extremely long output):

str(london[1,])
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  1 obs. of  4 variables:
##   .. ..$ ons_label : Factor w/ 33 levels "00AA","00AB",..: 6
##   .. ..$ name      : Factor w/ 33 levels "Barking and Dagenham",..: 5
##   .. ..$ Partic_Per: num 21.7
##   .. ..$ Pop_2001  : int 295535
##   ..@ polygons   :List of 1
##   .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots
##   .. .. .. ..@ Polygons :List of 1
##   .. .. .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots
##   .. .. .. .. .. .. ..@ labpt  : num [1:2] 542917 165647
##   .. .. .. .. .. .. ..@ area   : num 1.51e+08
##   .. .. .. .. .. .. ..@ hole   : logi FALSE
##   .. .. .. .. .. .. ..@ ringDir: int 1
##   .. .. .. .. .. .. ..@ coords : num [1:63, 1:2] 541178 541872 543442 544362 546662 ...
##   .. .. .. ..@ plotOrder: int 1
##   .. .. .. ..@ labpt    : num [1:2] 542917 165647
##   .. .. .. ..@ ID       : chr "0"
##   .. .. .. ..@ area     : num 1.51e+08
##   ..@ plotOrder  : int 1
##   ..@ bbox       : num [1:2, 1:2] 533569 156481 550541 173556
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "x" "y"
##   .. .. ..$ : chr [1:2] "min" "max"
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
##   .. .. ..@ projargs: chr "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs"

This shows us that the fundamental structure of a SpatialPolygonsDataFrame is actually rather complicated. This complexity is useful, allowing R to store the full range of information needed to describe almost any polygon-based dataset. The @ symbol in the structure represents slots which are specific to the S4 object class and contain specific pieces of information within the wider london object. The basic slots within the london object are:

  • @data, which contains the attribute data for the zones
  • @polygons, the geographic data associated with each polygon (this confusingly contains the @Polygons slot: each polygon feature can contain multiple Polygons, e.g. if an administrative zone is non-contiguous)
  • @plotOrder is simply the order in which the polygons are plotted
  • @bbox is a slot associated with all spatial objects, representing its spatial extent
  • @proj4string the CRS associated with the object
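
The slot mechanism itself can be tried without any spatial package. The sketch below defines a made-up S4 class, ToyZone, with just two slots that loosely mimic @data and @bbox; the class and object names are hypothetical, and the values are borrowed from the str output above.

```r
library(methods)  # S4 tools; already attached in an interactive session

# A made-up S4 class with two slots, loosely mimicking a spatial object:
# an attribute table plus a bounding box
setClass("ToyZone", representation(data = "data.frame", bbox = "matrix"))

zone <- new("ToyZone",
  data = data.frame(name = "Bromley", Pop_2001 = 295535),
  bbox = matrix(c(533569, 156481, 550541, 173556), nrow = 2,
                dimnames = list(c("x", "y"), c("min", "max"))))

slotNames(zone)      # lists the slots of the object
zone@data$Pop_2001   # @ reaches into a slot, $ then picks a column
slot(zone, "bbox")   # slot() is the functional equivalent of @
```

The same three calls work on the london object, where slotNames(london) should list the five slots described above.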

Critical for exploring the attributes of london is the data slot. We can look at and modify the attributes of the subdivisions of london easily using the @ notation:

head(london@data)
##   ons_label                 name Partic_Per Pop_2001
## 0      00AF              Bromley       21.7   295535
## 1      00BD Richmond upon Thames       26.6   172330
## 2      00AS           Hillingdon       21.5   243006
## 3      00AR             Havering       17.9   224262
## 4      00AX Kingston upon Thames       24.4   147271
## 5      00BF               Sutton       19.3   179767

Having seen this notation, many (if not most) R beginners will tend to always use it to refer to attribute data in spatial objects. Yet @ is often not needed. To refer to the mean population of the London zones, for example, the following lines of code yield the same result:

mean(london@data$Pop_2001)
## [1] 217335.1
mean(london$Pop_2001)
## [1] 217335.1

Thus we can treat the S4 spatial data classes as if they were regular data frames in some contexts, which is extremely useful for concise code. To plot the population of London zones on a map, the following code works:

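cols <- brewer.pal(n = 4, name = "Greys") # four shades of grey, light to dark
lcols <- cut(london$Pop_2001,
  breaks = quantile(london$Pop_2001),
  labels = cols,
  include.lowest = TRUE) # keep the least populous zone in the first bin
plot(london, col = as.character(lcols))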
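
The binning step can be tried on its own with a few made-up numbers. The sketch below uses invented population counts and hand-written grey hex codes (so no extra package is needed), and also shows how the include.lowest argument of cut decides whether the smallest value lands in the first bin.

```r
# Made-up population counts for seven zones (illustrative values only)
pop <- c(7181, 150000, 181284, 216505, 248917, 300000, 330584)

# Four greys, light to dark, written out by hand
cols <- c("#F7F7F7", "#CCCCCC", "#969696", "#525252")

# quantile() returns five break points, which define four bins
breaks <- quantile(pop)

# With the default include.lowest = FALSE the smallest value sits on the
# open left edge of the first bin and comes back as NA
open_left <- cut(pop, breaks = breaks, labels = cols)
closed    <- cut(pop, breaks = breaks, labels = cols, include.lowest = TRUE)

sum(is.na(open_left))  # number of values left without a colour
sum(is.na(closed))
as.character(closed)   # one colour per zone, ready for plot(..., col = )
```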

Now, how about joining additional variables to the spatial object? To join information to the existing variables, the join functions from dplyr (which replaces and improves on plyr) are a godsend. The following code loads a non-geographical dataset and joins an additional variable to london@data:

ldat <- read.csv("/home/robin/Desktop/Creating-maps-in-R-master/data/london-borough-profiles-2014.csv")
dat <- select(ldat, Code, contains("Anxiety"))
dat <- rename(dat, ons_label = Code, Anxiety = Anxiety.score.2012.13..out.of.10.)
dat$Anxiety <- as.numeric(as.character(dat$Anxiety))
## Warning: NAs introduced by coercion
london@data <- left_join(london@data, dat)
## Joining by: "ons_label"
head(london@data) # the new data has been added
##   ons_label                 name Partic_Per Pop_2001 Anxiety
## 1      00AF              Bromley       21.7   295535    3.20
## 2      00BD Richmond upon Thames       26.6   172330    3.56
## 3      00AS           Hillingdon       21.5   243006    3.34
## 4      00AR             Havering       17.9   224262    3.17
## 5      00AX Kingston upon Thames       24.4   147271    3.23
## 6      00BF               Sutton       19.3   179767    3.34
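
To see what a left join does to the attribute table, here is a toy version in base R that uses match rather than dplyr. The two small data frames are made up for illustration (a few values are borrowed from the output above); the point is that every row of the left table is kept in its original order, which matters because the rows of london@data must stay aligned with the polygons.

```r
# Toy attribute table standing in for london@data
zones <- data.frame(ons_label = c("00AF", "00BD", "00AS"),
                    name = c("Bromley", "Richmond upon Thames", "Hillingdon"),
                    stringsAsFactors = FALSE)

# Toy lookup table, in a different order and with one zone missing
scores <- data.frame(ons_label = c("00AS", "00AF"),
                     Anxiety = c(3.34, 3.20),
                     stringsAsFactors = FALSE)

# match() gives, for each zone, the row of `scores` holding its key (or NA)
idx <- match(zones$ons_label, scores$ons_label)

joined <- zones
joined$Anxiety <- scores$Anxiety[idx]

joined                                   # same rows, same order, one new column
joined$ons_label[is.na(joined$Anxiety)]  # keys that found no match
```

Checking for unmatched keys in this way is a cheap sanity test after any join, whichever function performs it.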

Plotting maps with ggplot

In order to plot the average anxiety scores across London we can use ggplot2:

lf <- fortify(london, region = "ons_label")
## Loading required package: rgeos
## rgeos version: 0.2-19, (SVN revision 394)
##  GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921 
##  Polygon checking: TRUE
lf <- rename(lf, ons_label = id)
lf <- left_join(lf, london@data)
## Joining by: "ons_label"
ggplot(lf) + geom_polygon(aes(long, lat, group = group, fill = Anxiety))

The challenge

Using the skills you have learned in the above tutorial, see if you can replicate the graph below: the proportion of Conservative councillors elected in different parts of London. Hint: the data is contained in ldat, as downloaded from here: http://data.london.gov.uk/dataset/london-borough-profiles.

To leave a comment for the author, please follow the link and comment on his blog: Robin Lovelace – R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Using Markdown and Pandoc for Publication

By suman

(This article was first published on MATHEMATICS IN MEDICINE, and kindly contributed to R-bloggers)

The other day I was involved in an editing job, in which I was supposed to edit 18 articles written in Microsoft Word (doc/docx format) and convert them into pdf format (for printing as a book) and html format (for web publishing). Manuscripts written by people not proficient in the doc(x) format are notorious for heterogeneous formatting and errors, making conversion of documents into different formats a nightmare. I accomplished the task with the help of a few pieces of open source software, using the following steps:

  1. Install the appropriate software.
  2. Make a folder to hold all the markdown files. This folder becomes the working directory for the R project (say, WORK). Make the following subfolders: fig to hold figures, html to hold the final html files, html/fig which is a copy of the fig subfolder and is referenced by the html files, and pdf to hold the final pdf files. Make a folder .pandoc/templates in the HOME folder to hold the Pandoc templates (default.html(5) and default.latex).
  3. Open the doc(x) documents (say, doc1.doc(x)) with LibreOffice Writer. Save any figures in the fig folder in png format.
  4. Save the documents in html format (say, doc1.html).
  5. Convert the html documents into markdown format with Pandoc.
  6. Edit the markdown files in any text editor.
  7. Build a YAML file in the WORK folder holding the variables used throughout all the documents (say, my.yaml). Any document-specific YAML can be inserted in the md file.
  8. Build a css file (say, my.css) in the WORK/html folder, containing all the formatting rules needed for the html output.
  9. Convert the markdown files into pdf and html format with Pandoc.
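
The folder layout from step 2 can be created from R in a few lines. This is only a sketch: it builds the tree under a temporary directory so it is safe to run anywhere, and the names work, subdirs and template_dir are illustrative rather than anything the workflow requires.

```r
# Build the folder layout under a temporary directory; swap `work` for a
# real project path in practice
work <- file.path(tempdir(), "WORK")

subdirs <- c("fig", "html", file.path("html", "fig"), "pdf")
for (d in file.path(work, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# The article keeps its Pandoc templates in .pandoc/templates under the
# home folder; here the path is only constructed, not created
template_dir <- file.path(path.expand("~"), ".pandoc", "templates")

list.dirs(work, full.names = FALSE)  # show what was created
```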

Installing appropriate software

The following software was used (clicking on the hyperlinks will lead to the sites from which each can be downloaded):

  1. Ubuntu 12.04 64 bit
  2. R version 3.1.1
  3. R Studio 0.98.932
  4. LibreOffice Writer 4.1.0.4
  5. Pandoc. It comes pre-installed with the current version of R Studio.
  6. Pandoc templates. Tailor-made templates can also be found on many other sites.

Working on doc(x) in LibreOffice Writer

After opening the doc1.doc(x) file in LibreOffice Writer, we save any pictures it contains in the WORK/fig folder under an appropriate name, preferably in .png format.
We then save the file as doc1.html using LibreOffice Writer.

Converting html into markdown format

We use Pandoc to convert html into markdown format.
We open the terminal, navigate to the WORK folder and enter the following to create doc1.md.

pandoc doc1.html -o doc1.md
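
With many documents, the same command can be issued from R for every html file in a folder. The sketch below is one possible way to do it; html_to_md_args and convert_all are hypothetical helper names, and the loop only runs pandoc if it is found on the PATH.

```r
# Build the pandoc arguments for one file, e.g. doc1.html -> doc1.md;
# shQuote() protects file names that contain spaces
html_to_md_args <- function(html_file) {
  md_file <- paste0(tools::file_path_sans_ext(html_file), ".md")
  c(shQuote(html_file), "-o", shQuote(md_file))
}

# Convert every html file in `dir` (hypothetical helper, not called here)
convert_all <- function(dir = ".") {
  if (!nzchar(Sys.which("pandoc"))) stop("pandoc not found on the PATH")
  html_files <- list.files(dir, pattern = "\\.html$", full.names = TRUE)
  for (f in html_files) {
    system2("pandoc", args = html_to_md_args(f))
  }
  invisible(html_files)
}

html_to_md_args("doc1.html")  # inspect the arguments without running pandoc
```

Calling convert_all() from the WORK folder would then write a markdown file next to each html file.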

Making appropriate Pandoc template

We copy default.html and default.latex into the HOME/.pandoc/templates folder, as described above.
We open default.html in a text editor. The following is an example of the template:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"$if(lang)$ lang="$lang$" xml:lang="$lang$"$endif$>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
$for(author-meta)$
<meta name="author" content="$author-meta$" />
$endfor$
$if(date-meta)$
<meta name="date" content="$date-meta$" />
$endif$
<title>$if(title-prefix)$$title-prefix$ - $endif$$pagetitle$</title>
<style type="text/css">code{white-space: pre;}</style>
$if(quotes)$
<style type="text/css">q { quotes: "“" "”" "‘" "’"; }</style>
$endif$
$if(highlighting-css)$
<style type="text/css">
$highlighting-css$
</style>
$endif$
$for(css)$
<link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
$endfor$
$if(math)$
$math$
$endif$
$for(header-includes)$
$header-includes$
$endfor$
</head>
<body>
$for(include-before)$
$include-before$
$endfor$
$if(title)$
<div id="$idprefix$header">
<h1 class="title">$title$</h1>
$if(subtitle)$
<h1 class="subtitle">$subtitle$</h1>
$endif$

<div class="author"><b>$author$</b></div>

<div class="affil"><i>$affiliation$</i></div>

$if(date)$
<h3 class="date">$date$</h3>
$endif$
</div>
$endif$
$if(toc)$
<div id="$idprefix$TOC">
$toc$
</div>
$endif$
$body$
$for(include-after)$
$include-after$
$endfor$
</body>
</html>

The following characteristics are seen from the above code segment:

  1. $---$: These are the variables, the values of which are provided by the YAML document (described later). When a variable is a collection (like author -> name & address in YAML), the name of the author can be accessed as $author.name$ and the address of the author as $author.address$.
  2. $if(---)$ --- $else$ --- $endif$ construct: This is the branching code for the template. One example is as below:
    $if(date)$
    <h3 class="date">$date$</h3>
    $endif$

    The above construct means that if the date variable is given in YAML then it will be entered in the html document as an h3 with class “date” (whose formatting can be controlled inside the css file).

  3. $for(---)$ --- $endfor$ construct: This is the looping code for the template. One example is as below:
    $for(css)$
    <link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
    $endfor$

    The above construct checks for the css variable, which is a collection of values. It inserts the given html statement once for each element of the css variable.

  4. $body$ construct: This variable contains all the contents of the doc1.md file after conversion into html format by Pandoc. We cannot change anything represented by the $body$ variable inside the template. If we want to assign a new class (or, say, an id) to any element inside the md file, we have to do it by inserting a raw html statement, as depicted below.
    ## header 2
    The normal statement
    <p class="myclass">Content of the special paragraph. It can **contain** markdown codes.</p>
    Another normal statement
    ## Another header 2

A similar template is available for latex, which can also be modified by the user.
The details of Pandoc templates can be found here.
Readers are invited to suggest further resources on the above (I am not aware of any).

Editing markdown file

The doc1.md file is edited manually in the R Studio editor. There are many resources on Pandoc Markdown, such as this and this.

Make a YAML file

The YAML code can be put inside individual markdown files (for variables which differ between markdown files) or in a separate file, as described above.
The minimum content of my.yaml should be as follows:

---
css: my.css
---

The details of the YAML language can be found here.
In summary, the following points are evident:

  1. The YAML construct is delimited with the following
    ---
    YAML CODE
    ---
  2. Each variable (which was denoted as $variable$ in the Pandoc template) is written as variable in YAML, and a value is assigned to it as follows.
    ---
    variable: value
    ---
  3. The following is an example of complex variable (equivalent to list in R).
    ---
    author:
      name: xxx
      address: yyy
    ---

    The name of the author is accessed in the Pandoc template as $author.name$. Note the indentation in front of name and address. Indentation must be made with spaces, not tabs.

  4. The following is an example of collection (equivalent of vector in R).
---
css:
- my1.css
- my2.css
---

The variable css has two values associated with it (my1.css and my2.css).

$for(css)$
<link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
$endfor$

The above code segment in Pandoc Template will access both the values of css and insert a line each for my1.css and my2.css.

Converting resulting md files into html and pdf format

Finally, the resulting md files (together with the fig, css, template and yaml files) can be converted into html and pdf format by using the following commands in the terminal.

pandoc doc1.md my.yaml -s --data-dir=/home/HOME/.pandoc -o html/doc1.html  # for html file output
pandoc doc1.md my.yaml -s --data-dir=/home/HOME/.pandoc -o pdf/doc1.pdf  # for pdf file output

If many md files are present, as in the project I was doing, then the whole process may be automated using an R script with the following code:

library(stringr) # provides str_sub()

# list the markdown files in the working directory
file <- as.list(list.files(pattern = "\\.md$"))

# convert one markdown file to pdf and html with pandoc
foo <- function(x) {
  s.pdf <- paste0("pandoc ", x, " my.yaml -s --data-dir=/home/HOME/.pandoc -o pdf/", str_sub(x, 1L, -4L), ".pdf")
  s.htm <- paste0("pandoc ", x, " my.yaml -s --data-dir=/home/HOME/.pandoc -o html/", str_sub(x, 1L, -4L), ".html")
  system(s.pdf)
  system(s.htm)
}

lapply(file, foo)

Conclusion

The method described above was very efficient, in terms of both time and human effort, for bringing all the documents into a uniform format.
YOUR COMMENTS/CRITICISMS ARE WELCOME.
BYE.

To leave a comment for the author, please follow the link and comment on his blog: MATHEMATICS IN MEDICINE.
