RApiDatetime 0.0.1

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Very happy to announce a new package of mine is now up on the CRAN repository network: RApiDatetime.

It provides six entry points for C-level functions of the R API for Date and Datetime calculations: asPOSIXlt and asPOSIXct convert between long and compact datetime representations, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetime. These six functions are all fairly essential and useful, but not one of them was previously exported by R. Hence the need to put them together in this package to round out the accessible API.

These should be helpful for fellow package authors, as many of us have either kept our own partial copies of some of this code, or had to farm the work back out to R to get it done.

As a simple (yet real!) illustration, here is an actual Rcpp function which could now be implemented at the C level rather than having to call back up into R (via Rcpp::Function()):

    inline Datetime::Datetime(const std::string &s, const std::string &fmt) {
        Rcpp::Function strptime("strptime");    // we cheat and call strptime() from R
        Rcpp::Function asPOSIXct("as.POSIXct"); // and we need to convert to POSIXct
        m_dt = Rcpp::as<double>(asPOSIXct(strptime(s, fmt)));
        update_tm();
    }

I had taken a first brief stab at this about two years ago but never finished. With the recent emphasis on C-level function registration, coupled with a possible use case from anytime, I more or less put this together last weekend.

It currently builds and tests fine on POSIX-alike operating systems. If someone with some skill and patience in working on Windows would like to help complete the Windows side of things then I would certainly welcome help and pull requests.

For questions or comments please use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


QR Decomposition with the Gram-Schmidt Algorithm

By Aaron Schlegel

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

QR decomposition is another technique for decomposing a matrix into a form that is easier to work with in further applications. The QR decomposition technique decomposes a square or rectangular matrix, which we will denote as A, into two components, Q and R.
A = QR

where Q is an orthogonal matrix and R is an upper triangular matrix. Recall that an orthogonal matrix is a square matrix with orthonormal row and column vectors such that Q^T Q = I, where I is the identity matrix. The term orthonormal implies the vectors are of unit length and are perpendicular (orthogonal) to each other.

QR decomposition is often used in linear least squares estimation and is, in fact, the method used by R in its lm() function. Signal processing and MIMO systems also employ QR decomposition. There are several methods for performing QR decomposition, including the Gram-Schmidt process, Householder reflections, and Givens rotations. This post is concerned with the Gram-Schmidt process.

The Gram-Schmidt Process

The Gram-Schmidt process is used to find an orthogonal basis from a non-orthogonal basis. An orthogonal basis has many properties that are desirable for further computations and expansions. As noted previously, an orthogonal matrix has row and column vectors of unit length:

\|a_n\| = \sqrt{a_n \cdot a_n} = \sqrt{a_n^T a_n} = 1

where a_n is a linearly independent column vector of the matrix. The vectors in an orthogonal basis are also mutually perpendicular. The Gram-Schmidt process works by finding an orthogonal vector q_n for each column vector a_n: it subtracts from a_n its projections onto the previously computed vectors (q_j), and the resulting vector is then divided by its length to produce a unit vector.

Consider a matrix A with n column vectors such that:

A = \left[ a_1 \mid a_2 \mid \cdots \mid a_n \right]

The Gram-Schmidt process proceeds by finding the orthogonal projection of the first column vector a_1.

v_1 = a_1, \qquad e_1 = \frac{v_1}{\|v_1\|}

Because a_1 is the first column vector, there are no preceding projections to subtract. For the second column a_2, the projection onto the first direction is subtracted:

v_2 = a_2 - \mathrm{proj}_{v_1}(a_2) = a_2 - (a_2 \cdot e_1)\, e_1, \qquad e_2 = \frac{v_2}{\|v_2\|}

This process continues through all n column vectors, where each incremental step k + 1 is computed as:

v_{k+1} = a_{k+1} - (a_{k+1} \cdot e_1)\, e_1 - \cdots - (a_{k+1} \cdot e_k)\, e_k, \qquad e_{k+1} = \frac{v_{k+1}}{\|v_{k+1}\|}

Here \| \cdot \| denotes the L_2 norm, defined as

\|v\| = \sqrt{\sum^m_{j=1} v_j^2}

where v_j are the m components of v. The projection can also be written as

\mathrm{proj}_{u}(a) = \frac{a \cdot u}{u \cdot u}\, u = (a \cdot e)\, e, \qquad e = \frac{u}{\|u\|}
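
To make these definitions concrete, here is a quick R sketch using the example vectors that appear in the next section:

a2 <- c(-2, 1, 2)
e1 <- c(2, 2, 1) / 3           # unit vector e_1 from the worked example below

sqrt(sum(a2^2))                # L2 norm of a2: 3
as.numeric(a2 %*% e1) * e1     # projection of a2 onto e1 (the zero vector here)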

Thus the matrix A can be factored into Q and R as follows:

A = \left[ a_1 \mid a_2 \mid \cdots \mid a_n \right] = \left[ e_1 \mid e_2 \mid \cdots \mid e_n \right] \begin{bmatrix} a_1 \cdot e_1 & a_2 \cdot e_1 & \cdots & a_n \cdot e_1 \\ 0 & a_2 \cdot e_2 & \cdots & a_n \cdot e_2 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & a_n \cdot e_n \end{bmatrix} = QR

Gram-Schmidt Process Example

Consider the matrix A:

A = \begin{bmatrix} 2 & -2 & 18 \\ 2 & 1 & 0 \\ 1 & 2 & 0 \end{bmatrix}

We would like to orthogonalize this matrix using the Gram-Schmidt process. The resulting orthonormalized matrix is equivalent to Q in the QR decomposition.

The Gram-Schmidt process on the matrix A proceeds as follows:

v_1 = a_1 = \begin{bmatrix} 2 \\ 2 \\ 1 \end{bmatrix} \qquad e_1 = \frac{v_1}{\|v_1\|} = \frac{1}{\sqrt{2^2 + 2^2 + 1^2}} \begin{bmatrix} 2 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix}

v_2 = a_2 - (a_2 \cdot e_1)\, e_1 = \begin{bmatrix} -2 \\ 1 \\ 2 \end{bmatrix} - \left( \begin{bmatrix} -2 \\ 1 \\ 2 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix} \right) \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix}

v_2 = \begin{bmatrix} -2 \\ 1 \\ 2 \end{bmatrix} \qquad e_2 = \frac{v_2}{\|v_2\|} = \frac{1}{\sqrt{(-2)^2 + 1^2 + 2^2}} \begin{bmatrix} -2 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} -\frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix}

v_3 = a_3 - (a_3 \cdot e_1)\, e_1 - (a_3 \cdot e_2)\, e_2

v_3 = \begin{bmatrix} 18 \\ 0 \\ 0 \end{bmatrix} - \left( \begin{bmatrix} 18 \\ 0 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix} \right) \begin{bmatrix} \frac{2}{3} \\ \frac{2}{3} \\ \frac{1}{3} \end{bmatrix} - \left( \begin{bmatrix} 18 \\ 0 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} -\frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix} \right) \begin{bmatrix} -\frac{2}{3} \\ \frac{1}{3} \\ \frac{2}{3} \end{bmatrix}

v_3 = \begin{bmatrix} 2 \\ -4 \\ 4 \end{bmatrix} \qquad e_3 = \frac{v_3}{\|v_3\|} = \frac{1}{\sqrt{2^2 + (-4)^2 + 4^2}} \begin{bmatrix} 2 \\ -4 \\ 4 \end{bmatrix} = \begin{bmatrix} \frac{1}{3} \\ -\frac{2}{3} \\ \frac{2}{3} \end{bmatrix}

Thus, the orthogonalized matrix resulting from the Gram-Schmidt process is:

Q = \begin{bmatrix} \frac{2}{3} & -\frac{2}{3} & \frac{1}{3} \\ \frac{2}{3} & \frac{1}{3} & -\frac{2}{3} \\ \frac{1}{3} & \frac{2}{3} & \frac{2}{3} \end{bmatrix}

The component R of the QR decomposition can also be found from the calculations made in the Gram-Schmidt process as defined above.

R = \begin{bmatrix} a_1 \cdot e_1 & a_2 \cdot e_1 & a_3 \cdot e_1 \\ 0 & a_2 \cdot e_2 & a_3 \cdot e_2 \\ 0 & 0 & a_3 \cdot e_3 \end{bmatrix} = \begin{bmatrix} 3 & 0 & 12 \\ 0 & 3 & -12 \\ 0 & 0 & 6 \end{bmatrix}

The Gram-Schmidt Algorithm in R

We use the same matrix A to verify our results above.

A <- rbind(c(2,-2,18),c(2,1,0),c(1,2,0))
A
##      [,1] [,2] [,3]
## [1,]    2   -2   18
## [2,]    2    1    0
## [3,]    1    2    0

The following function is an implementation of the modified version of the Gram-Schmidt algorithm. A good comparison of the classical and modified versions of the algorithm can be found here. The modified Gram-Schmidt algorithm is used due to its improved numerical stability, which results in more nearly orthogonal columns than the classical algorithm.

gramschmidt <- function(x) {
  x <- as.matrix(x)
  # Get the number of rows and columns of the matrix
  n <- ncol(x)
  m <- nrow(x)
  
  # Initialize the Q and R matrices
  q <- matrix(0, m, n)
  r <- matrix(0, n, n)
  
  for (j in 1:n) {
    v <- x[,j] # Step 1 of the Gram-Schmidt process: v1 = a1
    # Skip the first column
    if (j > 1) {
      for (i in 1:(j-1)) {
        r[i,j] <- t(q[,i]) %*% v # Inner product against the current v (this is the "modified" Gram-Schmidt step)
        # Subtract the projection from v so that v becomes perpendicular to the earlier columns of Q
        v <- v - r[i,j] * q[,i] 
      }      
    }
    # The L2 norm of v becomes the jth diagonal entry of R
    r[j,j] <- sqrt(sum(v^2))
    # The orthonormalized result is stored in the jth column of Q.
    q[,j] <- v / r[j,j]
  }
  
  # Collect the Q and R matrices into a list and return
  qrcomp <- list('Q'=q, 'R'=r)
  return(qrcomp)
}

Perform the Gram-Schmidt orthogonalization process on the matrix A using our function.

gramschmidt(A)
## $Q
##           [,1]       [,2]       [,3]
## [1,] 0.6666667 -0.6666667  0.3333333
## [2,] 0.6666667  0.3333333 -0.6666667
## [3,] 0.3333333  0.6666667  0.6666667
## 
## $R
##      [,1] [,2] [,3]
## [1,]    3    0   12
## [2,]    0    3  -12
## [3,]    0    0    6

The results of our function match those of our manual calculations!
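
As an extra check, we can confirm that Q is orthogonal and that the factors reproduce A:

gs <- gramschmidt(A)

# Q^T Q should be the identity matrix (up to floating point rounding)
round(t(gs$Q) %*% gs$Q, 10)

# Q %*% R should reproduce the original matrix A
gs$Q %*% gs$R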

The qr() function in base R also computes the QR decomposition, though it uses Householder reflections rather than the Gram-Schmidt process; this is why some signs are flipped in its output below. The qr() function does not return the Q and R matrices directly; they must be extracted by calling qr.Q() and qr.R(), respectively, on the qr object.

A.qr <- qr(A)
A.qr.out <- list('Q'=qr.Q(A.qr), 'R'=qr.R(A.qr))
A.qr.out
## $Q
##            [,1]       [,2]       [,3]
## [1,] -0.6666667  0.6666667  0.3333333
## [2,] -0.6666667 -0.3333333 -0.6666667
## [3,] -0.3333333 -0.6666667  0.6666667
## 
## $R
##      [,1] [,2] [,3]
## [1,]   -3    0  -12
## [2,]    0   -3   12
## [3,]    0    0    6

Thus the qr() function in R agrees with our function and manual calculations as well, up to the signs of the first two columns of Q (and the corresponding rows of R); the factorization is unique only up to these sign choices.

References

http://www.calpoly.edu/~jborzell/Courses/Year%2005-06/Spring%202006/304Gram_Schmidt_Exercises.pdf

http://cavern.uark.edu/~arnold/4353/CGSMGS.pdf

https://www.math.ucdavis.edu/~linear/old/notes21.pdf

http://www.math.ucla.edu/~yanovsky/Teaching/Math151B/handouts/GramSchmidt.pdf

The post QR Decomposition with the Gram-Schmidt Algorithm appeared first on Aaron Schlegel.


Announcing R Tools 1.0 for Visual Studio 2015

By Blog Administrator

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Shahrokh Mortazavi, Partner PM, Visual Studio Cloud Platform Tools at Microsoft

I’m delighted to announce the general availability of R Tools 1.0 for Visual Studio 2015 (RTVS). This release will be shortly followed by R Tools 1.0 for Visual Studio 2017 in early May.

RTVS is a free and open source plug-in that turns Visual Studio into a powerful and productive R development environment. Check out this video for a quick tour of its core features:

Core IDE Features

RTVS builds on Visual Studio, which means you get numerous features for free: from using multiple languages to world-class editing and debugging to over 7,000 extensions for every need:

  • A polyglot IDE – VS supports R, Python, C++, C#, Node.js, SQL, etc. projects simultaneously.
  • Editor – complete editing experience for R scripts and functions, including detachable/tabbed windows, syntax highlighting, and much more.
  • IntelliSense – (aka auto-completion) available in both the editor and the Interactive R window.
  • R Interactive Window – work with the R console directly from within Visual Studio.
  • History window – view, search, select previous commands and send to the Interactive window.
  • Variable Explorer – drill into your R data structures and examine their values.
  • Plotting – see all of your R plots in a Visual Studio tool window.
  • Debugging – breakpoints, stepping, watch windows, call stacks and more.
  • R Markdown – R Markdown/knitr support with export to Word and HTML.
  • Git – source code control via Git and GitHub.
  • Extensions – over 7,000 Extensions covering a wide spectrum from Data to Languages to Productivity.
  • Help – use ? and ?? to view R documentation within Visual Studio.

It’s Enterprise-Grade

RTVS includes various features that address the needs of individuals as well as data science teams, for example:

SQL Server 2016

RTVS integrates with SQL Server 2016 R Services and SQL Server Tools for Visual Studio 2015. These separate downloads enhance RTVS with support for syntax coloring and Intellisense, interactive queries, and deployment of stored procedures directly from Visual Studio.

Microsoft R Client

Use the stock CRAN R interpreter, or the enhanced Microsoft R Client and its ScaleR functions that support multi-core and cluster computing for practicing data science at scale.

Visual Studio Team Services

Integrated support for git, continuous integration, agile tools, release management, testing, reporting, bug and work-item tracking through Visual Studio Team Services. Use our hosted service or host it yourself privately.

Remoting

Whether it’s data governance, security, or running large jobs on a powerful server, RTVS workspaces enable setting up your own R server or connecting to one in the cloud.

The road ahead

We’re very excited to officially bring another language to the Visual Studio family! Along with Python Tools for Visual Studio, you have the two main languages for tackling almost any analytics and ML-related challenge. In the near future (~May), we’ll release RTVS for Visual Studio 2017 as well. We’ll also resurrect the “Data Science workload” in VS2017, which gives you R, Python, F# and all their respective package distros in one convenient install. Beyond that, we’re looking forward to hearing from you on what features we should focus on next! R package development? Mixed R+C debugging? Model deployment? VS Code/R for cross-platform development? Let us know at the RTVS GitHub repository!

Thank you!

Bits: http://microsoft.github.io/RTVS-docs/installation
Code: https://github.com/Microsoft/RTVS
Docs: http://microsoft.github.io/RTVS-docs


Survminer Cheatsheet to Create Easily Survival Plots

By Easy Guides


We recently released survminer version 0.3, which includes many new features to help in visualizing and summarizing survival analysis results.

In this article, we present a cheatsheet for survminer, created by Przemysław Biecek, and provide an overview of main functions.

survminer cheatsheet

The cheatsheet can be downloaded from STHDA and from RStudio. It contains selected important functions, such as:

  • ggsurvplot() for plotting survival curves
  • ggcoxzph() and ggcoxdiagnostics() for assessing the assumptions of the Cox model
  • ggforest() and ggcoxadjustedcurves() for summarizing a Cox model

Additional functions, that you might find helpful, are briefly described in the next section.

survminer overview

The main functions in the package are organized in different categories as follows.

Survival Curves


  • ggsurvplot(): Draws survival curves with the ‘number at risk’ table, the cumulative number of events table and the cumulative number of censored subjects table (see the sketch after this list).

  • arrange_ggsurvplots(): Arranges multiple ggsurvplots on the same page.

  • ggsurvevents(): Plots the distribution of event times.

  • surv_summary(): Summary of a survival curve. Compared to the default summary() function, surv_summary() creates a data frame containing a nice summary from survfit results.

  • surv_cutpoint(): Determines the optimal cutpoint for one or multiple continuous variables at once. Provides a cutpoint value that corresponds to the most significant relation with survival.

  • pairwise_survdiff(): Multiple comparisons of survival curves. Calculates pairwise comparisons between group levels with corrections for multiple testing.
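
As a minimal sketch of these functions in action (using the lung data from the survival package, not an example from the cheatsheet):

library(survival)
library(survminer)

# Kaplan-Meier fit on the lung data that ships with the survival package
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Survival curves with a number-at-risk table and a log-rank p-value
ggsurvplot(fit, risk.table = TRUE, pval = TRUE)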

Diagnostics of Cox Model


  • ggcoxzph(): Graphical test of proportional hazards. Displays a graph of the scaled Schoenfeld residuals, along with a smooth curve, using ggplot2. Wrapper around plot.cox.zph() (see the sketch after this list).

  • ggcoxdiagnostics(): Displays diagnostics graphs presenting goodness of Cox Proportional Hazards Model fit.

  • ggcoxfunctional(): Displays graphs of continuous explanatory variables against the martingale residuals of a null Cox proportional hazards model. This helps to properly choose the functional form of continuous variables in a Cox model.
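
A similar minimal sketch for the diagnostics functions, reusing the lung data from above:

library(survival)
library(survminer)

# Fit a Cox model on the lung data and test the proportional hazards assumption
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
ggcoxzph(cox.zph(cox_fit))

# Residual-based diagnostics of the fitted model
ggcoxdiagnostics(cox_fit, type = "deviance")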

Summary of Cox Model


  • ggforest(): Draws forest plot for CoxPH model.

  • ggcoxadjustedcurves(): Plots adjusted survival curves for coxph model.

Competing Risks


  • ggcompetingrisks(): Plots cumulative incidence curves for competing risks.

Find out more at http://www.sthda.com/english/rpkgs/survminer/, and check out the documentation and usage examples of each of the functions in survminer package.

Infos

This analysis has been performed using R software (ver. 3.3.2).



Make the [R] Kenntnis-Tage 2017 your stage

By eoda GmbH


(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

At the [R] Kenntnis-Tage 2017 on November 8 and 9, 2017, you will get the chance to benefit not only from the exchange about the programming language R in a business context and from practical tutorials, but also from the audience: use the [R] Kenntnis-Tage 2017 as your platform and hand in your topic for the call for papers.

[R] Kenntnis-Tage 2017: Call for Papers

The topics can be as diverse as the event itself: share your data science use case with the participants, tell them your personal lessons learned with R in the business environment, or show how your company uses data science and R. As a speaker at the [R] Kenntnis-Tage 2017, you will get free admission to the event.

Companies present themselves as big data pioneers

At last year’s event, many speakers already took their chance: Julia Flad from Trumpf Laser talked about the application possibilities of data science in industry and shared her experiences with the participants. In his talk “Working efficiently with R – faster to the data product”, Julian Gimbel from Lufthansa Industry Solutions gave a vivid example of working with the open source programming language.

Take your chance to become part of the [R] Kenntnis-Tage 2017 and hand in your topic related to data science or R. For more information, e.g. on the length of the presentation or the choice of topic, visit our website.

You don’t want to give a talk but still want to participate? If you register by May 5, you can benefit from our early bird tickets. Data science goes professional – join in.


The Tidyverse Curse

By Bob Muenchen

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

I’ve just finished a major overhaul to my widely read article, Why R is Hard to Learn. It describes the main complaints I’ve heard from the participants to my workshops, and how those complaints can often be mitigated. Here’s the only new section:

The Tidyverse Curse

There’s a common theme in many of the sections above: a task that is hard to perform using a base R function is made much easier by a function in the dplyr package. That package, and its relatives, are collectively known as the tidyverse. Its functions help with many tasks, such as selecting, renaming, or transforming variables, filtering or sorting observations, combining data frames, and doing by-group analyses. dplyr is such a helpful package that Rdocumentation.org shows that it is the single most popular R package (as of 3/23/2017). As much of a blessing as these commands are, they’re also a curse to beginners because they add more to learn. The main packages of dplyr, tibble, tidyr, and purrr contain a few hundred functions, though I use “only” around 60 of them regularly. As people learn R, they often comment that base R functions and tidyverse ones feel like two separate languages. The tidyverse functions are often the easiest to use, but not always; its pipe operator is usually simpler to use, but not always; tibbles are usually accepted by non-tidyverse functions, but not always; grouped tibbles may help do what you want automatically, but not always (e.g. you may need to ungroup or group_by higher levels). Navigating the balance between base R and the tidyverse is a challenge to learn.
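
To make this concrete, here is a small illustrative sketch of the kind of by-group task dplyr makes easy:

library(dplyr)

mtcars %>%
  filter(cyl != 8) %>%             # filter observations
  group_by(cyl) %>%                # set up a by-group analysis
  summarise(mean_mpg = mean(mpg))  # one summary row per group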

A demonstration of the mental overhead required to use tidyverse functions involves the usually simple process of printing data. I mentioned this briefly in the Identity Crisis section above. Let’s look at an example printing the built-in mtcars data set with R’s print function:

> print(mtcars)
                  mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0 6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0 6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8 4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4 6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1 6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

We see the data, but the variable names actually ran off the top of my screen when viewing the entire data set, so I had to scroll backwards to see what they were. The dplyr package adds several nice new features to the print function. Below, I’m taking mtcars and sending it with the pipe operator “%>%” into dplyr’s as_data_frame function to convert it to a special type of tidyverse data frame called a “tibble”, which prints better. From there I send it to the print function (that’s R’s default function, so I could have skipped that step). The output all fits on one screen since it stopped at a default of 10 observations. That allowed me to easily see the variable names that had scrolled off the screen using R’s default print method. It also notes helpfully that there are 22 more rows in the data that are not shown. Additional information includes the row and column counts at the top (32 x 11), and the fact that the variables are stored in double precision (<dbl>).

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   print()
# A tibble: 32 × 11
   mpg   cyl  disp    hp  drat    wt  qsec    vs   am   gear  carb
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 21.0   6   160.0  110  3.90 2.620 16.46    0     1     4     4
 2 21.0   6   160.0  110  3.90 2.875 17.02    0     1     4     4
 3 22.8   4   108.0   93  3.85 2.320 18.61    1     1     4     1
 4 21.4   6   258.0  110  3.08 3.215 19.44    1     0     3     1
 5 18.7   8   360.0  175  3.15 3.440 17.02    0     0     3     2
 6 18.1   6   225.0  105  2.76 3.460 20.22    1     0     3     1
 7 14.3   8   360.0  245  3.21 3.570 15.84    0     0     3     4
 8 24.4   4   146.7   62  3.69 3.190 20.00    1     0     4     2
 9 22.8   4   140.8   95  3.92 3.150 22.90    1     0     4     2
10 19.2   6   167.6  123  3.92 3.440 18.30    1     0     4     4
# ... with 22 more rows

The new print format is helpful, but we also lost something important: the names of the cars! It turns out that row names get in the way of the data wrangling that dplyr is so good at, so tidyverse functions replace row names with 1, 2, 3…. However, the names are still available if you use the rownames_to_column() function:

> library("dplyr")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
Error in function_list[[i]](value) : 
 could not find function "rownames_to_column"

Oops, I got an error message; the function wasn’t found. I remembered the right command, and using the dplyr package did cause the car names to vanish, but the solution is in the tibble package that I “forgot” to load. So let’s load that too (dplyr is already loaded, but I’m listing it again here just to make each example stand alone.)

> library("dplyr")
> library("tibble")
> mtcars %>%
+   as_data_frame() %>%
+   rownames_to_column() %>%
+   print()
# A tibble: 32 × 12
 rowname            mpg   cyl disp    hp   drat   wt   qsec   vs    am   gear carb
  <chr>            <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4         21.0   6   160.0  110  3.90  2.620 16.46   0     1     4     4
2 Mazda RX4 Wag     21.0   6   160.0  110  3.90  2.875 17.02   0     1     4     4
3 Datsun 710        22.8   4   108.0   93  3.85  2.320 18.61   1     1     4     1
4 Hornet 4 Drive    21.4   6   258.0  110  3.08  3.215 19.44   1     0     3     1
5 Hornet Sportabout 18.7   8   360.0  175  3.15  3.440 17.02   0     0     3     2
6 Valiant           18.1   6   225.0  105  2.76  3.460 20.22   1     0     3     1
7 Duster 360        14.3   8   360.0  245  3.21  3.570 15.84   0     0     3     4
8 Merc 240D         24.4   4   146.7   62  3.69  3.190 20.00   1     0     4     2
9 Merc 230          22.8   4   140.8   95  3.92  3.150 22.90   1     0     4     2
10 Merc 280         19.2   6   167.6  123  3.92  3.440 18.30   1     0     4     4
# ... with 22 more rows

Another way I could have avoided that problem is by loading the package named tidyverse, which includes both dplyr and tibble, but that’s another detail to learn.

In the above output, the row names are back! What if we now decided to save the data for use with a function that would automatically display row names? It would not find them because they’re now stored in a variable called rowname, not in the row names position! Therefore, we would need to use either the built-in row.names function or the tibble package’s column_to_rownames function to restore the names to their previous position.
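
Here is that round trip as a short sketch:

library(dplyr)
library(tibble)

# Row names -> a regular column, as above
mtcars_tbl <- mtcars %>%
  as_data_frame() %>%
  rownames_to_column()

# The regular column -> row names, for functions that expect them
mtcars_df <- mtcars_tbl %>%
  as.data.frame() %>%
  column_to_rownames(var = "rowname")

head(rownames(mtcars_df))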

Most other data science software requires row names to be stored in a standard variable, e.g. rowname. You then supply its name to procedures with something like SAS’ “ID rowname;” statement. That’s less to learn.

This isn’t a defect of the tidyverse, it’s the result of an architectural decision on the part of the original language designers; it probably seemed like a good idea at the time. The tidyverse functions are just doing the best they can with the existing architecture.

Another example of the difference between base R and the tidyverse can be seen when dealing with long text strings. Here I have a data frame in tidyverse format (a tibble). I’m asking it to print the lyrics for the song American Pie. Tibbles normally print in a nicer format than standard R data frames, but for long strings, they only display what fits on a single line:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   print()
# A tibble: 1 × 1
 lyrics
 <chr>
1 a long long time ago i can still remember how that music used

The whole song can be displayed by converting the tibble to a standard R data frame by routing it through the as.data.frame function:

> songs_df %>%
+   filter(song == "american pie") %>%
+   select(lyrics) %>%
+   as.data.frame() %>%
+   print()
 ... <truncated>
1 a long long time ago i can still remember how that music used 
to make me smile and i knew if i had my chance that i could make 
those people dance and maybe theyd be happy for a while but 
february made me shiver with every paper id deliver bad news on 
the doorstep i couldnt take one more step i cant remember if i cried 
...

These examples demonstrate a small slice of the mental overhead you’ll need to deal with as you learn base R and the tidyverse packages, such as dplyr. Since this section has focused on what makes R hard to learn, it may make you wonder why dplyr is the most popular R package. You can get a feel for that by reading the Introduction to dplyr. Putting in the time to learn it is well worth the effort.


10 Million Dots: Mapping European Population

By James

Dot density maps are an increasingly popular way of showing population datasets. The technique has its limitations, but it can be very effective, as this experiment shows.

Giant dot density maps with R

I produced this map as part of an experiment to see how R could handle generating millions of dots from spatial data and then plotting them. Even though the final image is 3.8 metres by 3.8 metres – this is as big as I could make it without causing R to crash – many of the grid cells in urban areas are saturated with dots so a lot of detail is lost in those areas. The technique is effective in areas that are more sparsely populated. It would probably work better with larger spatial units that aren’t necessarily grid cells.

Generating the dots (using the spsample function) and then plotting them in one go would require way more RAM than I have access to. Instead I wrote a loop to select each grid cell of population data, generate the requisite number of points and then plot them. This created a much lighter load as my computer happily chugged away for a few hours to produce the plot. I could have plotted a dot for each person (500 million+) but this would have completely saturated the map, so instead I opted for 1 dot for every 50 people.

I have provided full code below. T&Cs on the data mean I can’t share it directly, but you can download it here.

Load rgdal and the spatial object required; in this case it’s the GEOSTAT 2011 population grid. In addition, I am using the rainbow function to generate a palette with one colour for each of the countries.

library(rgdal)

Input<-readOGR(dsn=".", layer="SHAPEFILE")

Cols <- data.frame(Country = unique(Input$Country), Cols = rainbow(length(unique(Input$Country))))

Create the initial plot. This is empty, but it enables the dots to be added iteratively to the plot. I have specified a 380cm by 380cm PNG at 150 dpi – this is as big as I could get it without crashing R.

png("europe_density.png",width=380, height=380, units="cm", res=150)
par(bg='black')
plot(t(bbox(Input)), asp=T, axes=F, ann=F)

Loop through each of the individual spatial units – grid cells in this case – and plot the dots for each. The number of dots is specified in spsample as the grid cell’s population value/50.

for (j in 1:nrow(Cols))
{

Subset<- Input[Input$Country==Cols[j,1],]

for (i in 1:nrow(Subset@data))
  {
  Poly<- Subset[i,]
  pts1<-spsample(Poly, n = 1+round(Poly@data$Pop/50), "random", iter=15)
  #The colours are taken from the Cols object.
  plot(pts1, pch=16, cex=0.03, add=T, col=as.vector(Cols[j,2]))
  }
}
dev.off()


Euler Problem 17: Number Letter Counts

By Peter Prevos


(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Euler Problem 17 asks us to count the letters in numbers written as words. This is a skill we all learnt in primary school, mainly useful when writing cheques, for those that still use them.

Each language has its own rules for writing numbers. My native language Dutch has very different logic to English. Both Dutch and English use distinct single words up to the number twelve. Linguists have theorised that this is evidence that early Germanic numbers were duodecimal. This idea is supported by the importance of a “dozen” as a counting word and by the twelve hours on the clock. There is even a Dozenal Society that promotes the use of a number system based on 12.

The English language changes the rules when reaching the number 21. While we say eighteen in English, we do not say “one-twenty”. Dutch stays consistent and the ones digit is always spoken first. For example, 37 in English is “thirty-seven”, while in Dutch it is written as “zevenendertig” (seven and thirty).

Euler Problem 17 Definition

If the numbers 1 to 5 are written out in words: one, two, three, four, five, then there are 3 + 3 + 5 + 4 + 4 = 19 letters used in total. If all the numbers from 1 to 1000 (one thousand) inclusive were written out in words, how many letters would be used?

NOTE: Do not count spaces or hyphens. For example, 342 (three hundred and forty-two) contains 23 letters and 115 (one hundred and fifteen) contains 20 letters. The use of “and” when writing out numbers is in compliance with British usage.

Solution

The first piece of code provides a function that generates the words for numbers 1 to 999,999. This is more than the problem asks for, but it might be a useful function for another application. The last line concatenates all words together and removes the spaces.

numword.en <- function(x) { if (x > 999999) return("Error: Outside my vocabulary")
    # Vocabulary 
    single <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
    teens <- c( "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen")
    tens <- c("ten", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety")
    # Translation
    numword.10 <- function (y) {
        a <- y %% 100
        if (a != 0) {
            and <- ifelse(y > 100, "and", "")
            if (a < 20)
                return (c(and, c(single, teens)[a]))
            else
                return (c(and, tens[floor(a / 10)], single[a %% 10]))
        }
    }
    numword.100 <- function (y) {
        a <- (floor(y / 100) %% 100) %% 10
        if (a != 0)
            return (c(single[a], "hundred"))
    }
    numword.1000 <- function(y) {
        a <- (1000 * floor(y / 1000)) / 1000
        if (a != 0)
            return (c(numword.100(a), numword.10(a), "thousand"))
    }
    numword <- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse=" ")
    return (trimws(numword))
}

answer <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.en), collapse="")))
print(answer)
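
As a quick sanity check, the function reproduces the letter counts quoted in the problem definition:

numword.en(342)                        # "three hundred and forty two"
nchar(gsub(" ", "", numword.en(342)))  # 23 letters
nchar(gsub(" ", "", numword.en(115)))  # 20 letters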

Writing Numbers in Dutch

I went beyond Euler Problem 17 by translating the code to spell numbers in Dutch. An interesting bit of trivia is that it takes several hundred fewer characters to spell the numbers 1 to 1000 in Dutch than it does in English (the last line of code below computes the exact difference).

It would be good if other people can submit functions for other languages in the comment section. Perhaps we can create an R package with a multi-lingual function for spelling numbers.

numword.nl <- function(x) { if (x > 999999) return("Error: Getal te hoog.")
    single <- c("een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "nenen")
    teens <- c( "tien", "elf", "twaalf", "dertien", "veertien", "fifteen", "zestien", "zeventien", "achtien", "nenentien")
    tens <- c("tien", "twintig", "dertig", "veertig", "vijftig", "zestig", "zeventig", "tachtig", "negengtig")
    numword.10 <- function(y) {
        a <- y %% 100
        if (a != 0) {
            if (a < 20)
                return (c(single, teens)[a])
            else
                return (c(single[a %% 10], "en", tens[floor(a / 10)]))
        }
    }
    numword.100 <- function(y) {
        a <- (floor(y / 100) %% 100) %% 10
        if (a == 1) return ("honderd")
        if (a > 1)
            return (c(single[a], "honderd"))
    }
    numword.1000 <- function(y) {
        a <- (1000 * floor(y / 1000)) / 1000
        if (a == 1) return ("duizend ")
        if (a > 0)
            return (c(numword.100(a), numword.10(a), "duizend "))
    }
    numword<- paste(c(numword.1000(x), numword.100(x), numword.10(x)), collapse="")
    return (trimws(numword))
}

antwoord <- nchar(gsub(" ", "", paste0(sapply(1:1000, numword.nl), collapse="")))
print(antwoord)

print(answer - antwoord)

The post Euler Problem 17: Number Letter Counts appeared first on The Devil is in the Data.


Data Visualization – Part 2

By Scott Stoltzman


(This article was first published on R-Projects – Stoltzmaniac, and kindly contributed to R-bloggers)

A Quick Overview of the ggplot2 Package in R

While it will be important to focus on theory, I want to explain the ggplot2 package because I will be using it throughout the rest of this series. Knowing how it works will keep the focus on the results rather than the code. It’s an incredibly powerful package and once you wrap your head around what it’s doing, your life will change for the better! There are a lot of tools out there which provide better charts, graphs and ease of use (e.g. plot.ly, d3.js, Qlik, Tableau), but ggplot2 is still a fantastic resource and I use it all of the time.

In case you missed it, here’s a link to Data Visualization – Part 1

Why would you use ggplot2?

  1. More robust plotting than the base plot package
  2. Better control over aesthetics – colors, axes, background, etc.
  3. Layering
  4. Variable Mapping (aes)
  5. Automatic aggregation of data
  6. Built in formulas & plotting (geom_smooth)
  7. The list goes on and on…

Basically, ggplot2 allows for a lot more customization of plots with a lot less code (the rest of it is behind the scenes). Once you are used to the syntax, there’s no going back. It’s faster and easier.

Why wouldn’t you use ggplot2?

  1. A bit of a learning curve
  2. Lack of user interactivity with the plots

Fundamentally, ggplot2 gives the user the ability to start a plot and layer everything in. There are many ways to accomplish the same thing, so figure out what makes sense for you and stick to it.

A Basic Example: Unemployment Over Time

library(dplyr)
library(ggplot2)

# Load the economics data from ggplot2
data(economics,package='ggplot2')
# Take a look at the format of the data
head(economics)
## # A tibble: 6 × 6
##         date   pce    pop psavert uempmed unemploy
##       <date> <dbl>  <int>   <dbl>   <dbl>    <int>
## 1 1967-07-01 507.4 198712    12.5     4.5     2944
## 2 1967-08-01 510.5 198911    12.5     4.7     2945
## 3 1967-09-01 516.3 199113    11.7     4.6     2958
## 4 1967-10-01 512.9 199311    12.5     4.9     3143
## 5 1967-11-01 518.1 199498    12.5     4.7     3066
## 6 1967-12-01 525.8 199657    12.1     4.8     3018
# Create the plot
ggplot(data = economics) + geom_line(aes(x = date, y = unemploy))


What happened to get that?

  • ggplot(economics) loaded the data frame
  • + tells ggplot() that there is more to be added to the plot
  • geom_line() defined the type of plot
  • aes(x = date, y = unemploy) mapped the variables

The aes() portion is what typically throws new users off but is my favorite feature of ggplot2. In simple terms, this is what “auto-magically” brings your plot to life. You are telling ggplot2, “I want ‘date’ to be on the x-axis and ‘unemploy’ to be on the y-axis.” It’s pretty straightforward in this case but there are more complex use cases as well.

Side Note: you could have achieved the same result by mapping the variables in the ggplot() function rather than in geom_line():
ggplot(data = economics, aes(x = date, y = unemploy)) + geom_line()
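
One related point worth a quick sketch: arguments placed inside aes() are mapped to variables in the data, while the same arguments placed outside aes() are set to fixed values:

# Mapped inside aes(): color varies with a variable in the data
ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line(aes(col = pop))

# Set outside aes(): one fixed color for the whole line
ggplot(data = economics, aes(x = date, y = unemploy)) +
  geom_line(col = "blue")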

Here’s the basic formula for success:

  • Everything in ggplot2 starts with ggplot(data) and utilizes + to add on every element thereafter
  • Include your data frame (economics) in a ggplot function: ggplot(data = economics)
  • Input the type of plot you would like (i.e. line chart of unemployment over time): + geom_line(aes(x = date, y = unemploy))
    • “geom” stands for “geometric object” and determines the type of object (there can be more than one type per plot)
    • There are a lot of types of geometric objects – check them out here
  • Add in layers and utilize fill and col parameters within aes()

I’ll go through some of the examples from the Top 50 ggplot2 Visualizations Master List. I will be using their examples but I will also explain what’s going on.

Note: I believe the intention of the author of the Top 50 ggplot2 Visualizations Master List was to illustrate how to use ggplot2 rather than doing a full demonstration of what important data visualization techniques are – so keep that in mind as I go through these examples. Some of the visuals do not line up with my best practices addressed in my first post on data visualization.

As usual, some packages must be loaded.

library(reshape2)
library(lubridate)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
library(gridExtra)

The Scatterplot

This is one of the most visually powerful tools for data analysis. However, you have to be careful when using it because it’s primarily used by people doing analysis and not reporting (depending on what industry you’re in).

The author of this chart was looking for a correlation between area and population.

# Use the "midwest"" data from ggplot2
data("midwest", package = "ggplot2")

head(midwest)
## # A tibble: 6 × 28
##     PID    county state  area poptotal popdensity popwhite popblack
##   <int>     <chr> <chr> <dbl>    <int>      <dbl>    <int>    <int>
## 1   561     ADAMS    IL 0.052    66090  1270.9615    63917     1702
## 2   562 ALEXANDER    IL 0.014    10626   759.0000     7054     3496
## 3   563      BOND    IL 0.022    14991   681.4091    14477      429
## 4   564     BOONE    IL 0.017    30806  1812.1176    29344      127
## 5   565     BROWN    IL 0.018     5836   324.2222     5264      547
## 6   566    BUREAU    IL 0.050    35688   713.7600    35157       50
## # ... with 20 more variables: popamerindian <int>, popasian <int>,
## #   popother <int>, percwhite <dbl>, percblack <dbl>, percamerindan <dbl>,
## #   percasian <dbl>, percother <dbl>, popadults <int>, perchsd <dbl>,
## #   percollege <dbl>, percprof <dbl>, poppovertyknown <int>,
## #   percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>

Here’s the most basic version of the scatter plot

This can be called by geom_point() in ggplot2

# Scatterplot
ggplot(data = midwest, aes(x = area, y = poptotal)) + geom_point()  #ggplot


Here’s version with some additional features

While the addition of point size and color doesn’t add value here, it does show the level of customization that’s possible with ggplot2.

ggplot(data = midwest, aes(x = area, y = poptotal)) + 
geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=F) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  labs(subtitle="Area Vs Population", 
       y="Population", 
       x="Area", 
       title="Scatterplot", 
       caption = "Source: midwest")


Explanation:

ggplot(data = midwest, aes(x = area, y = poptotal)) +
Inputs the data and maps x and y variables as area and poptotal.

geom_point(aes(col=state, size=popdensity)) +
Creates a scatterplot and maps the color and size of points to state and popdensity.

geom_smooth(method="loess", se=F) +
Creates a smoothing curve to fit the data. method is the type of fit and se determines whether or not to show the confidence band around the curve.

xlim(c(0, 0.1)) +
Sets the x-axis limits.

ylim(c(0, 500000)) +
Sets the y-axis limits.

labs(subtitle="Area Vs Population",

y="Population",

x="Area",

title="Scatterplot",

caption = "Source: midwest")
Changes the labels of the subtitle, y-axis, x-axis, title and caption.

Notice that the legend was automatically created and placed on the right-hand side. This is also highly customizable and can be changed easily.

The Density Plot

Density plots are a great way to see how data is distributed. They are similar to histograms in a sense, but show values in terms of percentage of the total. In this example, the author used the mpg data set and is looking at the different distributions of City Mileage based on the number of cylinders the car has.

# Examine the mpg data set
head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl      trans   drv   cty   hwy    fl
##          <chr> <chr> <dbl> <int> <int>      <chr> <chr> <int> <int> <chr>
## 1         audi    a4   1.8  1999     4   auto(l5)     f    18    29     p
## 2         audi    a4   1.8  1999     4 manual(m5)     f    21    29     p
## 3         audi    a4   2.0  2008     4 manual(m6)     f    20    31     p
## 4         audi    a4   2.0  2008     4   auto(av)     f    21    30     p
## 5         audi    a4   2.8  1999     6   auto(l5)     f    16    26     p
## 6         audi    a4   2.8  1999     6 manual(m5)     f    18    26     p
## # ... with 1 more variables: class <chr>

Sample Density Plot

g = ggplot(mpg, aes(cty))
g + geom_density(aes(fill=factor(cyl)), alpha=0.8) + 
    labs(title="Density plot", 
         subtitle="City Mileage Grouped by Number of cylinders",
         caption="Source: mpg",
         x="City Mileage",
         fill="# Cylinders")


You’ll notice one immediate difference here. The author decided to create the object g equal to ggplot(mpg, aes(cty)) – this is a nice trick and will save you some time if you plan on keeping ggplot(mpg, aes(cty)) as the fundamental plot and simply exploring other visualizations on top of it. It is also handy if you need to save the output of a chart to an image file.

ggplot(mpg, aes(cty)) loads the mpg data and aes(cty) assumes aes(x = cty)

g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
geom_density kicks off a density plot and the mapping of cyl is used for colors. alpha is the transparency/opacity of the area under the curve.

labs(title="Density plot",

subtitle="City Mileage Grouped by Number of cylinders",

caption="Source: mpg",

x="City Mileage",

fill="# Cylinders")
Labeling is cleaned up at the end.

How would you use your new knowledge to see the density by class instead of by number of cylinders?

Hint: g = ggplot(mpg, aes(cty)) has already been established.

g + geom_density(aes(fill=factor(class)), alpha=0.8) + 
    labs(title="Density plot", 
         subtitle="City Mileage Grouped by Class",
         caption="Source: mpg",
         x="City Mileage",
         fill="Class")

Notice how I didn’t have to write out ggplot() again because it was already stored in the object g.

The Histogram

How could we show the city mileage in a histogram?

g = ggplot(mpg,aes(cty))
g + geom_histogram(bins=20) +
    labs(title="Histogram", 
         caption="Source: mpg",
         x="City Mileage")


geom_histogram(bins=20) plots the histogram. If bins isn’t set, ggplot2 will automatically set one.
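
Alternatively, the width of each bin can be set directly with binwidth (in units of the x variable) instead of the number of bins:

g + geom_histogram(binwidth = 2) +
    labs(title="Histogram", 
         caption="Source: mpg",
         x="City Mileage")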

The Bar/Column Chart

For all intents and purposes, bar and column charts are essentially the same. Technically, the term “column chart” is used when the bars run vertically. The author of this chart was simply looking at the frequency of the vehicles listed in the data set.

#Data Preparation
freqtable <- table(mpg$manufacturer)
df <- as.data.frame.table(freqtable)
head(df)
##        Var1 Freq
## 1      audi   18
## 2 chevrolet   19
## 3     dodge   37
## 4      ford   25
## 5     honda    9
## 6   hyundai   14
#Set a theme
theme_set(theme_classic())

g <- ggplot(df, aes(Var1, Freq))
g + geom_bar(stat="identity", width = 0.5, fill="tomato2") + 
      labs(title="Bar Chart", 
           subtitle="Manufacturer of vehicles", 
           caption="Source: Frequency of Manufacturers from 'mpg' dataset") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6))


The addition of theme_set(theme_classic()) adds a preset theme to the chart. You can create your own or select from a large list of themes. This can help set your work apart from others and save a lot of time.

However, theme_set() is different from the theme(axis.text.x = element_text(angle=65, vjust=0.6)) call used inside the plot itself in this case. The author decided to tilt the text along the x-axis; vjust=0.6 changes how far it is spaced away from the axis line.

Within geom_bar() there is another new piece of information: stat="identity" which tells ggplot to use the actual value of Freq.

You may also notice that ggplot arranged all of the data in alphabetical order based off of the manufacturer. If you want to change the order, it’s best to use the reorder() function. This next chart will use the Freq and coord_flip() to orient the chart differently.

g <- ggplot(df, aes(reorder(Var1,Freq), Freq))
g + geom_bar(stat="identity", width = 0.5, fill="tomato2") + 
      labs(title="Bar Chart", 
           x = 'Manufacturer',
           subtitle="Manufacturer of vehicles", 
           caption="Source: Frequency of Manufacturers from 'mpg' dataset") +
      theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  coord_flip()


Let’s continue with bar charts – what if we wanted to see what hwy looked like by manufacturer and in terms of cyl?

g = ggplot(mpg,aes(x=manufacturer,y=hwy,col=factor(cyl),fill=factor(cyl)))
g + geom_bar(stat='identity', position='dodge') + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))


position='dodge' had to be used because the default setting is to stack the bars, 'dodge' places them side by side for comparison.

Despite the fact that the chart did what I wanted, it is very difficult to read due to how many manufacturers there are. This is where the facet_wrap() feature comes in handy.

theme_set(theme_bw())

g = ggplot(mpg,aes(x=factor(cyl),y=hwy,col=factor(cyl),fill=factor(cyl)))
g + geom_bar(stat='identity', position='dodge') + 
  facet_wrap(~manufacturer)

This created a much nicer view of the information. It “auto-magically” split everything out by manufacturer!

Spatial Plots

Another nice feature of ggplot2 is the integration with maps and spatial plotting. In this simple example, I wanted to plot a few cities in Colorado and draw a border around them. Other than the addition of the map, ggplot simply places the dots directly on the locations via their longitude and latitude “auto-magically.”

This map is created with ggmap, which utilizes the Google Maps API.

library(ggmap)
library(ggalt)

foco <- geocode("Fort Collins, CO")  # get longitude and latitude

# Get the Map ----------------------------------------------
colo_map <- qmap("Colorado, United States", zoom = 7, source = "google")

# Get Coordinates for Places ---------------------
colo_places <- c("Fort Collins, CO",
                    "Denver, CO",
                    "Grand Junction, CO",
                    "Durango, CO",
                    "Pueblo, CO")

places_loc <- geocode(colo_places)  # get longitudes and latitudes


# Plot Open Street Map -------------------------------------
colo_map + geom_point(aes(x=lon, y=lat),
                             data = places_loc, 
                             alpha = 0.7, 
                             size = 7, 
                             color = "tomato") + 
                  geom_encircle(aes(x=lon, y=lat),
                                data = places_loc, size = 2, color = "blue")


Final Thoughts

I hope you learned a lot about the basics of ggplot2 here. It’s extremely powerful yet easy to use once you get the hang of it. The best way to really learn it is to try it out. Find some data on your own and try to manipulate it and get it plotted. Without a doubt, you will have all kinds of errors pop up, data you expect to be plotted won’t show up, colors and fills will be different, etc. However, your visualizations will be leveled-up!

Coming soon:

  • Determining whether or not you need a visualization
  • Choosing the type of plot to use depending on the use case
  • Visualization beyond the standard charts and graphs

I made some modifications to the code, but almost all of the examples here were from Top 50 ggplot2 Visualizations – The Master List .

As always, the code used in this post is on my GitHub
