## CRAN Release of R/exams 2.3-1

By R/exams

(This article was first published on R/exams, and kindly contributed to R-bloggers)

New minor release of the R/exams package to CRAN, containing a wide range of smaller improvements and bug fixes. Notable new features include a dedicated OpenOLAT interface, and a convenience function facilitating the use of TikZ-based graphics.

Version 2.3-1 of the one-for-all exams generator R/exams has been published on the Comprehensive R Archive Network at https://CRAN.R-project.org/package=exams. Over the next few days this will propagate to other CRAN mirrors along with Windows binary packages. The development version of the package is now version 2.3-2 on http://R-Forge.R-project.org/forum/?group_id=1337.

## New features

• Added new interface exams2openolat() for the open-source OpenOLAT learning management system. This is only a convenience wrapper to exams2qti12() or exams2qti21() with some dedicated tweaks for optimizing MathJax output for OpenOLAT.
• New function include_tikz() that facilitates compiling standalone TikZ figures into a range of output formats, especially PNG and SVG (for HTML-based output). This is useful when including TikZ in R/Markdown exercises or when converting R/LaTeX exercises to HTML. Two examples have been added to the package that illustrate the capabilities of include_tikz(): automaton, logic. A dedicated blog post is also planned.

## Written exams (NOPS)

• Following the blog post on Written R/exams around the World several users have been kind enough to add language support for: Croatian (hr.dcf, contributed by Krunoslav Juraić), Danish (da.dcf, contributed by Tue Vissing Jensen and Jakob Messner), Slovak (sk.dcf, contributed by Peter Fabsic), Swiss German (gsw.dcf, contributed by Reto Stauffer), Turkish (tr.dcf, contributed by Emrah Er). Furthermore, Portuguese has been distinguished into pt-PT.dcf (Portuguese Portuguese) vs. pt-BR.dcf (Brazilian Portuguese) with pt.dcf defaulting to the former (contributed by Thomas Dellinger).
• After setting a random seed exams2nops() and exams2pdf() now yield the same random versions of the exercises. Previously, this was not the case because exams2nops() internally generates a single random trial exam first for a couple of sanity checks. Now, the .GlobalEnv$.Random.seed is restored after generating the trial exam.
• Fixed the support for nsamp argument in exams2nops(). Furthermore, current limitations of exams2nops() are pointed out more clearly in error messages and edge cases caught.
• Allow varying points within a certain exercise in nops_eval().

## HTML output and Base64-encoded supplements

• In exams2html() and other interfaces based on make_exercise_transform_html() the option base64 = TRUE now uses Base64 encoding for all file extensions (known to the package) whereas base64 = NULL only encodes image files (previous default behavior).
• Bug fixes and improvements in HTML transformers:
  • Only ="file.ext" (with =") for supplementary files embedded into HTML is replaced now by the corresponding Base64-encoded version.
  • href="file.ext" is replaced by href="file.ext" download="file.ext" prior to Base64 replacement to assure that the file name is preserved for the browser/downloader.
  • alt="file.ext" and download="file.ext" are preserved without the Base64-encoded version of file.ext.
• Include further file URIs for Base64 supplements, in particular .sav for SPSS data files.
• In exams2blackboard(..., base64 = FALSE, ...) the base64 = FALSE was erroneously ignored. No matter how base64 was specified essentially base64 = TRUE was used, it is honored again now.

## Extensions

• exshuffle{} can now also be used for schoice exercises with more than one TRUE answer. In a first step only one of the TRUE answers is selected and then -1 items from the FALSE answers.
• Function include_supplement(..., dir = "foo") – without full path to "foo" – now also works if "foo" is not a local sub-directory but a sub-directory to the exercise directory edir (if specified).
• Enable passing of envir argument from exams2html() to xweave() in case of R/Markdown (.Rmd) exercises.
• When using exams2html(..., mathjax = TRUE) for testing purposes, mathjax.rstudio.com is used now rather than cdn.mathjax.org which is currently redirecting and will eventually be shut down completely.
• Added support for tightlist (as produced by pandoc) in all current LaTeX templates as well as exams2nops().

## Bug fixes

• Fixed a bug in stresstest_exercise() where the “rank” (previously called “order”) of the correct solution was computed incorrectly. Additional enhancements in plots and labels.
• Fixed a bug for tex2image(..., tikz = TRUE) where erroneously usetikzlibrary{TRUE} was included. Also tex2image(..., Sweave = TRUE) (the default) did not run properly on Windows, fixed now.
• Better warnings if exshuffle{} could not be honored due to a lack of sufficiently many (suitable) answer alternatives.
• Bug fix in CSV export of exams2arsnova(). Recent ARSnova versions use “mc” (rather than “MC”) and “abcd” (rather than “SC”) to code multiple-choice and single-choice questions, respectively.

To leave a comment for the author, please follow the link and comment on their blog: R/exams.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Statistics Sunday: Taylor Swift vs. Lorde – Analyzing Song Lyrics

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Last week, I showed how to tokenize text.
Today I’ll use those functions to do some text analysis of one of my favorite types of text: song lyrics. Plus, this is a great opportunity to demonstrate a new R package I discovered: geniusR, which will download lyrics from Genius.

There are two packages – geniusR and geniusr – which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?

To install geniusR, you’ll use a different method than usual – you’ll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.

    install.packages("devtools")
    devtools::install_github("josiahparry/geniusR")

    ## Downloading GitHub repo josiahparry/geniusR@master
    ## from URL https://api.github.com/repos/josiahparry/geniusR/zipball/master
    ## Installing geniusR
    ## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file
    ## --no-environ --no-save --no-restore --quiet CMD INSTALL
    ## '/private/var/folders/85/9ygtlz0s4nxbmx3kgkvbs5g80000gn/T/Rtmpl3bwRx/devtools33c73e3f989/JosiahParry-geniusR-5907d82'
    ## --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library'
    ## --install-tests

Now you’ll want to load geniusR and tidyverse so we can work with our data.

    library(geniusR)
    library(tidyverse)

    ## ── Attaching packages ──────────────────── tidyverse 1.2.1 ──
    ## ggplot2 2.2.1     purrr   0.2.4
    ## tibble  1.4.2     dplyr   0.7.4
    ## tidyr   0.8.0     stringr 1.3.0
    ## readr   1.1.1     forcats 0.3.0
    ## ── Conflicts ─────────────────── tidyverse_conflicts() ──
    ## dplyr::filter() masks stats::filter()
    ## dplyr::lag()    masks stats::lag()

For today’s demonstration, I’ll be working with data from two artists I love: Taylor Swift and Lorde.
Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.

geniusR has a function genius_album that will download lyrics from an entire album, labeling it by track.

    swift_lyrics <- genius_album(artist="Taylor Swift", album="Reputation")

    ## Joining, by = c("track_title", "track_n", "track_url")

    lorde_lyrics <- genius_album(artist="Lorde", album="Melodrama")

    ## Joining, by = c("track_title", "track_n", "track_url")

Now we want to tokenize our datasets, remove stop words, and count word frequency – this code should look familiar, except this time, I’m combining them using the pipeline symbol (%>%) from the tidyverse, which allows you to string together multiple functions without having to nest them.

    library(tidytext)

    tidy_swift <- swift_lyrics %>%
      unnest_tokens(word, lyric) %>%
      anti_join(stop_words) %>%
      count(word, sort=TRUE)

    ## Joining, by = "word"

    head(tidy_swift)

    ## # A tibble: 6 x 2
    ##   word      n
    ##   <chr> <int>
    ## 1 call     46
    ## 2 wanna    37
    ## 3 ooh      35
    ## 4 ha       34
    ## 5 ah       33
    ## 6 time     32

    tidy_lorde <- lorde_lyrics %>%
      unnest_tokens(word, lyric) %>%
      anti_join(stop_words) %>%
      count(word, sort=TRUE)

    ## Joining, by = "word"

    head(tidy_lorde)

    ## # A tibble: 6 x 2
    ##   word         n
    ##   <chr>    <int>
    ## 1 boom        40
    ## 2 love        26
    ## 3 shit        24
    ## 4 dynamite    22
    ## 5 homemade    22
    ## 6 light       22

Looking at the top 6 words for each, it doesn’t look like there will be a lot of overlap. But let’s explore that, shall we?

Lorde’s album is 3 tracks shorter than Taylor Swift’s. To make sure our word comparisons are meaningful, I’ll create new variables that take into account the total number of words, so each word metric will be a proportion, allowing for direct comparisons. And because I’ll be joining the datasets, I’ll be sure to label these new columns by artist name.
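To see why proportions make corpora of different sizes comparable, here is a minimal base R sketch. The words and counts are invented for illustration and are not taken from the lyrics data.

```r
# Toy word counts for two hypothetical corpora of different sizes
# (invented numbers, for illustration only).
counts_a <- c(call = 6, time = 3, love = 1)   # 10 words in total
counts_b <- c(love = 2, light = 2, boom = 1)  # 5 words in total

# Dividing by each corpus total puts both on the same 0-1 scale.
prop_a <- counts_a / sum(counts_a)
prop_b <- counts_b / sum(counts_b)

# "love" has raw counts of 1 vs. 2, but its share of corpus B
# is four times its share of corpus A.
prop_a["love"]
prop_b["love"]
```

The same division by `sum()` is what turns a raw count column into a proportion column in a data frame.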
    tidy_swift <- tidy_swift %>%
      rename(swift_n = n) %>%
      mutate(swift_prop = swift_n/sum(swift_n))

    tidy_lorde <- tidy_lorde %>%
      rename(lorde_n = n) %>%
      mutate(lorde_prop = lorde_n/sum(lorde_n))

There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When one dataset contributes a word not found in the other artist’s set, it will fill those variables in with missing values.

    compare_words <- tidy_swift %>%
      full_join(tidy_lorde, by = "word")

    summary(compare_words)

    ##      word              swift_n         swift_prop         lorde_n
    ##  Length:957         Min.   : 1.000   Min.   :0.00050   Min.   : 1.0
    ##  Class :character   1st Qu.: 1.000   1st Qu.:0.00050   1st Qu.: 1.0
    ##  Mode  :character   Median : 1.000   Median :0.00050   Median : 1.0
    ##                     Mean   : 3.021   Mean   :0.00152   Mean   : 2.9
    ##                     3rd Qu.: 3.000   3rd Qu.:0.00151   3rd Qu.: 3.0
    ##                     Max.   :46.000   Max.   :0.02321   Max.   :40.0
    ##                     NA's   :301      NA's   :301       NA's   :508
    ##    lorde_prop
    ##  Min.   :0.0008
    ##  1st Qu.:0.0008
    ##  Median :0.0008
    ##  Mean   :0.0022
    ##  3rd Qu.:0.0023
    ##  Max.   :0.0307
    ##  NA's   :508

The final dataset contains 957 tokens – unique words – and the NAs tell how many words are only present in one artist’s corpus. Lorde uses 301 words Taylor Swift does not, and Taylor Swift uses 508 words that Lorde does not. That leaves 148 words on which they overlap.

There are many things we could do with these data, but let’s visualize words and proportions, with one artist on the x-axis and the other on the y-axis.

    ggplot(compare_words, aes(x=swift_prop, y=lorde_prop)) +
      geom_abline() +
      geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
      labs(y="Lorde", x="Taylor Swift") +
      theme_classic()

    ## Warning: Removed 809 rows containing missing values (geom_text).

The warning lets me know there are 809 rows with missing values – those are the words only present in one artist’s corpus. Words that fall on or near the line are used at similar rates between artists.
Words above the line are used more by Lorde than Taylor Swift, and words below the line are used more by Taylor Swift than Lorde. This tells us that, for instance, Lorde uses “love,” “light,” and, yes, “shit,” more than Swift, while Swift uses “call,” “wanna,” and “hands” more than Lorde. They use words like “waiting,” “heart,” and “dreams” at similar rates. Rates are low overall: if you look at the max values for the proportion variables, Swift’s most common word only accounts for about 2.3% of her total words; Lorde’s most common word only accounts for about 3.1% of her total words.

This highlights why it’s important to remove stop words for these types of analyses; otherwise, our datasets and chart would be full of words like “the,” “a,” and “and.”

Next Statistics Sunday, we’ll take a look at sentiment analysis!

## Tips for Ellipse Summary Plot

By tomizono

(This article was first published on R – ЯтомизоnoR, and kindly contributed to R-bloggers)

I received some questions privately and am replying here, because the answers may also help others, including me.

## 1. How to specify size

With plot axis parameters.

    > ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], xlim=c(4,8), ylim=c(2,5))

## 2. How to specify color

With plot color parameter.

    > ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], col=c('cyan', 'orange', 'magenta'))

## 3. How to give names

Using builtin iris data.
    > ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')])

## Digging deeper

### about iris data

    > str(iris)
    'data.frame':   150 obs. of  5 variables:
     $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The Species column is used in both the x and y data. It gives the name of each category.

### example

Using fivenum instead of default ninenum.

    > ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], col=c('cyan', 'orange', 'magenta'), SUMMARY=fivenum)

The command above draws the plot.

The output below may help you read the values on each axis. Here, with fivenum, the 3rd row of each category is the (x, y) pair of that category’s medians.
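As a quick check of what fivenum() returns, here is a small base R sketch on the built-in iris data. It does not use ellipseplot() itself; the element names are added only for readability.

```r
# Sepal.Length for one iris species (iris ships with base R).
x <- iris$Sepal.Length[iris$Species == "setosa"]

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum.
fn <- fivenum(x)
names(fn) <- c("min", "lower hinge", "median", "upper hinge", "max")
fn

# The third element is the median, which can differ from the mean.
c(median = median(x), mean = mean(x))
```

These five values are the x column reported for setosa when SUMMARY=fivenum is used.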

    > ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], SUMMARY=fivenum, plot=FALSE)
    $setosa
        x   y
    1 4.3 2.3
    2 4.8 3.2
    3 5.0 3.4
    4 5.2 3.7
    5 5.8 4.4

    $versicolor
        x   y
    1 4.9 2.0
    2 5.6 2.5
    3 5.9 2.8
    4 6.3 3.0
    5 7.0 3.4

    $virginica
        x   y
    1 4.9 2.2
    2 6.2 2.8
    3 6.5 3.0
    4 6.9 3.2
    5 7.9 3.8

## AphA Scotland – it’s a thing

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

Reflections on AphA Scotland launch event –

On Tues 8th May there was only one Scottish based member of the Association of Professional Healthcare Analysts (me) but on Wed 9th May that number rose to around 80 with the launch of the Scotland AphA Branch.

The event took place in the very nice Perth Concert Hall, and consisted of several great speakers plus a series of workshops.

I particularly enjoyed hearing from Mohammed Mohammed, who talked about setting up the NHS-R community, and how we might overcome the resistance to R in some quarters within the NHS. In later discussions after the event, there was some conjecture that this might be because traditional IT depts cannot provide support for R. Thinking back to when I first got it installed, I think I was told that I would get no support for it, which I was totally cool with because I just wanted to ggplot EVERYTHING. (Actually, I based, and latticed, before I ggplotted, but you get the point).

The beauty of R though is that “support” is plentiful and 24/7 in the shape of the #rstats community. I think if someone has got to the point where they want to use R, they are well into “power-user” territory and beyond the scope of regular IT support anyway. Not only that, but to get to that point, they have almost certainly mastered the black arts of Google-Fu and Stack Overflowing like a demon.
Therefore they are the sort of self starters who are not going to be bothering IT in the first place. In other words, if we want to use R, let us use R (responsibly).

By virtue of being member number 1, there was a suggestion early on in the planning stage that I be involved on the actual day. Initially it was suggested that I might have to get up on stage with the Actual Proper Really Important People. Thankfully, this idea got canned and evolved into being able to co-host one of the workshops (I never want to go up on a stage, unless it’s behind a drum-kit), with the very talented Mr Neil Pettinger.

This gave us a chance to demonstrate some of the patient flow graphics we’ve been iterating on – Neil had originally drafted Excel versions and then I tried my hand at replicating them in R. I blogged about this towards the end of last year so go take a look there for some more background.

Our aim was to make the workshop a conversational, interactive affair, and I think we managed it. As Neil is based in Edinburgh, and I’m up in Inverness, most of our communication has been via email or Twitter DMs. We had maybe 2 phonecalls prior to the event. On the day before, we tried doing a rehearsal via internet video conference but my 4 year old twins managed to gatecrash that and it was a bit of a disaster.

There was a fair bit of trepidation on my part before the first session, but it went well – people asked questions, which is always a good thing. In all we had to run the workshop 4 times, which meant we missed out on the other sessions running in parallel. I would also have liked to have been able to participate in the discussion about how AphA Scotland moves forward. My one hope is that future events remain in Perth, which is a lot more ‘central’ than the usual Edinburgh/Glasgow locations, or, another way to look at it, is “equitably inconvenient” for everyone.
The slides that I put together for our workshop are hosted here: DataTransitions: visualising and animating patient flow. This shows the original and revised Excel plots, plus the dplyr and ggplot2 code in a step-by-step guide to creating the R equivalent.

I’m still undecided on making presentations in R. For our purposes, PowerPoint might have been absolutely fine, BUT, for those who were new to R (I’d say it was a 50/50 split in terms of our workshop attendees between those who’d seen/used it, and those who hadn’t), it was quite nice to demonstrate nice graphics, and also say, “Yes, these slides you’re looking at, these were put together using R”. I had also planned to spin up a Leaflet map centred on the concert hall with a big “You are here” sign but didn’t get round to it.

One other cool moment, at least as far as this blog is concerned, was speaking to someone, who upon realising I was from the Highlands, put 2 and 2 together and asked “are you… HighlandR?” Yay! Someone reads this stuff!

Big thanks to Paul Stroner (AphA CEO), Val Perigo (Administrator extraordinaire), Scott Heald, Peter Knight and Neil Pettinger for organising the event, and to all those who attended. You wonderful people.

I’m looking forward to seeing what happens next, roll on the next event and hopefully, some training too.

## logic: Interpretation of Logic Gates (Using TikZ)

By R/exams

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for matching logic gate diagrams (drawn with TikZ) to the corresponding truth table.
• Name: logic
• Description: Gate diagrams for three logical operators (sampled from: and, or, xor, nand, nor) are drawn with TikZ and have to be matched to a truth table for another randomly drawn logical operator. Depending on the exams2xyz() interface the TikZ graphic can be rendered in PNG, SVG, or directly by LaTeX.
• Solution feedback: Yes
• Randomization: Shuffling, text blocks, and graphics
• Mathematical notation: No
• Verbatim R input/output: No
• Images: Yes
• Other supplements: No

Demo code:

    library("exams")

    set.seed(1090)
    exams2html("logic.Rnw")
    set.seed(1090)
    exams2pdf("logic.Rnw")

    set.seed(1090)
    exams2html("logic.Rmd")
    set.seed(1090)
    exams2pdf("logic.Rmd")

## automaton: Interpretation of Automaton Diagrams (Using TikZ)

By R/exams

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for assessing the interpretation of an automaton diagram (drawn with TikZ) based on randomly generated input sequences.

• Name: automaton
• Description: An automaton diagram with four states A-D is drawn with TikZ and is to be interpreted, where A is always the initial state and one state is randomly picked as the accepting state. The acceptance of five binary 0/1 input sequences has to be assessed, with approximately a quarter of all sequences being accepted. Depending on the exams2xyz() interface the TikZ graphic can be rendered in PNG, SVG, or directly by LaTeX.
• Solution feedback: Yes
• Randomization: Random numbers, text blocks, and graphics
• Mathematical notation: No
• Verbatim R input/output: No
• Images: Yes
• Other supplements: No

Demo code:

    library("exams")

    set.seed(1090)
    exams2html("automaton.Rnw")
    set.seed(1090)
    exams2pdf("automaton.Rnw")

    set.seed(1090)
    exams2html("automaton.Rmd")
    set.seed(1090)
    exams2pdf("automaton.Rmd")

## Do know Shake Shack’s locations outside of the US? You’d be surprised

(This article was first published on R – nandeshwar.info, and kindly contributed to R-bloggers)

## Madison Shake

I had heard that the lines to get some food at Shake Shack are long. So when I saw a new location opening in downtown LA, I wondered how many locations it has and how fast they are spreading across the US. The answers surprised me. Using R and previous code, I created a few maps. Read on to learn how I got the data and plotted them.

## Load Libraries

First, let’s load our favourite libraries.

    library(rvest)
    library(readr)
    library(tidyverse)
    library(scales)
    library(ggmap)

## Figure out locations

On its site, Shake Shack fortunately has all the locations and opening dates, going back to April 23, 2012. The archive pages run from 1 to 20 with this URL structure: https://www.shakeshack.com/location/page/

Using SelectorGadget, I figured out the XPath and CSS code to find the opening date, location name, and location page link. Then, I wrote a function to retrieve these values from a given archive page.
    # Scrape one archive page: opening date, location name, and location link
    get_locations <- function(url) {
      page_html <- read_html(url)
      nodes <- page_html %>%
        html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "span4", " " ))]')
      data.frame(
        opdate = html_nodes(x = nodes, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "date", " " ))]') %>%
          html_text(trim = TRUE),
        store_loc_name = html_nodes(x = nodes, css = 'h2') %>%
          html_text(trim = TRUE),
        store_loc_link = html_nodes(x = nodes, css = 'h2 a') %>%
          html_attr("href"),
        stringsAsFactors = FALSE)
    }

I applied this function to retrieve all location opening dates, names, and individual location urls:

    all_loc_pages <- paste0("https://www.shakeshack.com/location/page/", 1:20, "/")
    all_locations <- do.call(rbind, lapply(all_loc_pages, get_locations))

## Find addresses of all locations

If you visit an individual location’s page, such as this Tokyo Dome page, you will see that often the exact address is not listed, or if it is, you can’t directly geocode it. But, luckily, there’s a Google Map right below the location. I thought they must be passing some parameters to the Google Maps API. I spent a good amount of time, but couldn’t figure out how they were getting the map. And. Then. I found out that the “CLICK MAP FOR DIRECTIONS” text block had a valid address as part of the hyperlink!!

I wrote another simple function to get the addresses from the given URL:

    # Fetch one location page and pull the Google Maps link out of it
    get_loc_cords <- function(loc_url) {
      location_html <- read_html(loc_url)
      data.frame(
        loc_url = loc_url,
        goog_map_url = location_html %>%
          html_nodes(xpath = '//a[text()="Click here for directions"]') %>%
          html_attr("href"),
        stringsAsFactors = FALSE)
    }

    location_google_maps_address <- do.call(rbind, lapply(all_locations$store_loc_link, get_loc_cords))

Then I joined the location name with the address data frame:

    all_locations <- left_join(all_locations, location_google_maps_address, by = c("store_loc_link" = "loc_url"))

Using the fantastic ggmap library and mutate_geocode function, I geocoded all the addresses:

    all_locations <- all_locations %>%
      mutate(google_addr_string = str_sub(goog_map_url, start = 36)) %>%
      mutate_geocode(google_addr_string, output = "latlon")
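The hard-coded start = 36 above depends on the exact length of the URL prefix in front of the address. A minimal base R sketch of the same idea follows. The prefix and the example address are assumptions made up for illustration, and the real links on the site may look different.

```r
# Hypothetical directions link; both the prefix and the address are
# invented for illustration and may not match the real site.
prefix <- "https://maps.google.com/maps?daddr="
goog_map_url <- paste0(prefix, "123+Example+St,+Springfield")

# Keep everything after the prefix instead of counting characters by hand.
addr <- substring(goog_map_url, nchar(prefix) + 1)
addr

# Position of the first address character for this assumed prefix.
nchar(prefix) + 1
```

Deriving the offset from the prefix keeps the code working if the prefix ever changes length.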

Here’s what the data frame looks like now:

### Tip

You may want to create a Google developer key for mass geocoding. Since the mutate_geocode function is used by many people, sometimes you may not get all the addresses geocoded. Use the register_google(key = , account_type = 'premium', day_limit = 100000) function to register your key with ggmap functions.

## Data manipulation

Now that we have all the geographical coordinates, we just need to do some clean-up to get the data ready for plotting.

First, get the date field in order and add opening month and year columns:

    all_locations <- all_locations %>%
      mutate(open_date = as.Date(opdate, "%B %d, %Y"),
             open_month = lubridate::month(open_date),
             open_year = lubridate::year(open_date))
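The format string is the part that is easy to get wrong, so here is a small base R sketch of the same parsing on a single date. It uses format() instead of lubridate, and it sets the time locale to "C" because %B is locale-dependent; that locale change is my addition, not part of the original code.

```r
# %B matches full month names in the current locale; use the C (English)
# locale so that "April" is recognised regardless of system settings.
invisible(Sys.setlocale("LC_TIME", "C"))

# An opening date in the "Month day, Year" format.
opdate <- "April 23, 2012"

# %B = full month name, %d = day of month, %Y = four-digit year.
open_date <- as.Date(opdate, format = "%B %d, %Y")

# Base R equivalents of lubridate::month() and lubridate::year().
open_month <- as.integer(format(open_date, "%m"))
open_year  <- as.integer(format(open_date, "%Y"))

open_date
```

If as.Date() returns NA on real data, a locale mismatch or stray whitespace in the date string are the first things to check.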

Second, get the cumulative count of store openings:

    ss_op_data_smry <- all_locations %>%
      count(open_date) %>%
      ungroup() %>%
      arrange(open_date) %>%
      mutate(cumm_n = cumsum(n))
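To make the running total concrete, here is a base R sketch of the same count-then-cumsum step on a few invented opening dates (not the scraped data).

```r
# Toy opening dates (invented): two stores open on the same day.
open_date <- as.Date(c("2012-04-23", "2012-06-01", "2012-06-01", "2013-01-15"))

# table() counts openings per date, ordered by date;
# cumsum() turns those counts into a running store total.
n <- table(open_date)
smry <- data.frame(open_date = as.Date(names(n)),
                   n = as.integer(n),
                   cumm_n = cumsum(as.integer(n)))
smry
```

Each row holds the number of stores opened on that date and the total number of stores open by then.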

Third, join the summary back to the locations data frame:

    all_locations_smry <- inner_join(all_locations, ss_op_data_smry, by = c("open_date" = "open_date"))

Using the ggmap library, I got the US map and a world map:

    us_map <- get_stamenmap(c(left = -125, bottom = 24, right = -67, top = 49), zoom = 5, maptype = "toner-lite")
    ggmap(us_map)

    world_map <- get_stamenmap(bbox = c(left = -180, bottom = -60, right = 179.9999, top = 70), zoom = 3, maptype = "toner-lite")
    ggmap(world_map)

## Create functions to plot each location

Repurposing my code from the Walmart spread across the US, I wrote a similar function to plot locations with two different sizes: big, if the locations opened during the mapped month, and small, if the locations opened before the mapped month. I did this so that the new locations stand out.
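The old/new split can be sketched in base R on a toy data frame. The coordinates and dates are invented. Note that the functions below compare exact opening dates, so "new" here means opened on the plotted date.

```r
# Toy locations (invented coordinates and dates).
df <- data.frame(
  lon = c(-74.0, -118.2, 139.8),
  lat = c(40.7, 34.1, 35.7),
  open_date = as.Date(c("2012-04-23", "2012-06-01", "2013-01-15"))
)
plotdate <- as.Date("2012-06-01")

# Locations opened earlier are drawn small, those opened on the
# plotted date are drawn large; later openings are not drawn at all.
old_df <- df[df$open_date <  plotdate, ]
new_df <- df[df$open_date == plotdate, ]

nrow(old_df)
nrow(new_df)
```

Drawing the two subsets as separate point layers is what lets each get its own size.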

    my_us_plot <- function(df, plotdate, mapid){
      g <- ggmap(us_map, darken = c("0.8", "black"), extent = "device")
      old_df <- filter(df, open_date < plotdate)
      new_df <- filter(df, open_date == plotdate)
      # old locations
      g <- g + geom_point(data = old_df, aes(x = lon, y = lat), size = 5, color = "dodgerblue", alpha = 0.4)
      # new locations
      g <- g + geom_point(data = new_df, aes(x = lon, y = lat), size = 8, color = "dodgerblue", alpha = 0.4)
      g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(), axis.text = element_blank(), plot.title = element_blank(), panel.background = element_rect(fill = "grey20"), plot.background = element_rect(fill = "grey20"))
      # month/year and cumulative store count labels
      g <- g + annotate("text", x = -77, y = 33, label = "MONTH/YEAR:", color = "white", size = rel(5), hjust = 0)
      g <- g + annotate("text", x = -77, y = 32, label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)), color = "white", size = rel(6), fontface = 2, hjust = 0)
      g <- g + annotate("text", x = -77, y = 31, label = "STORE COUNT:", color = "white", size = rel(5), hjust = 0)
      g <- g + annotate("text", x = -77, y = 30, label = comma(unique(new_df$cumm_n)), color = "white", size = rel(6), fontface = 2, hjust = 0)
      # zero-padded file names keep the frames in order
      filename <- paste0("maps/img_" , str_pad(mapid, 7, pad = "0"), ".png")
      ggsave(filename = filename, plot = g, width = 13, height = 7, dpi = 120, type = "cairo-png")
    }

I modified this function to map the world:

    my_world_plot <- function(df, plotdate, mapid){
      g <- ggmap(world_map, darken = c("0.8", "black"), extent = "device")
      old_df <- filter(df, open_date < plotdate)
      new_df <- filter(df, open_date == plotdate)
      g <- g + geom_point(data = old_df, aes(x = lon, y = lat), size = 5, color = "dodgerblue", alpha = 0.4)
      g <- g + geom_point(data = new_df, aes(x = lon, y = lat), size = 8, color = "dodgerblue", alpha = 0.4)
      g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(), axis.text = element_blank(), plot.title = element_blank(), panel.background = element_rect(fill = "grey20"))
      g <- g + annotate("text", x = -130, y = 0, label = "MONTH/YEAR:", color = "white", size = rel(5), hjust = 0)
      g <- g + annotate("text", x = -130, y = -10, label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)), color = "white", size = rel(6), fontface = 2, hjust = 0)
      g <- g + annotate("text", x = -130, y = -20, label = "STORE COUNT:", color = "white", size = rel(5), hjust = 0)
      g <- g + annotate("text", x = -130, y = -30, label = comma(unique(new_df$cumm_n)), color = "white", size = rel(6), fontface = 2, hjust = 0)
      filename <- paste0("maps/img_" , str_pad(mapid, 7, pad = "0"), ".png")
      ggsave(filename = filename, plot = g, width = 12, height = 6, dpi = 150, type = "cairo-png")
    }

## Create maps

Now, the exciting part: create month-by-month maps.

US maps:

    all_locations_smry %>%
      mutate(mapid = group_indices_(all_locations_smry, .dots = 'open_date')) %>%
      group_by(open_date) %>%
      do(pl = my_us_plot(all_locations_smry, unique(.$open_date), unique(.$mapid)))
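The mapid column gives every distinct opening date one integer id, which then becomes the frame number in the image file name. As far as I know, the underscore-suffixed dplyr verbs such as group_indices_() were deprecated in later dplyr releases, so here is a base R sketch of the same numbering on invented dates. It is an alternative illustration, not the call used above.

```r
# Toy opening dates (invented), deliberately unsorted and with a repeat.
open_date <- as.Date(c("2012-06-01", "2012-04-23", "2012-06-01", "2013-01-15"))

# One id per distinct date, numbered in chronological order.
mapid <- match(open_date, sort(unique(open_date)))
mapid

# Zero-padded frame names, in the style of the plotting functions above.
sprintf("maps/img_%07d.png", mapid)
```

Rows sharing an opening date share an id, so each date produces exactly one image.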

World maps:

    all_locations_smry %>%
      mutate(mapid = group_indices_(all_locations_smry, .dots = 'open_date')) %>%
      group_by(open_date) %>%
      do(pl = my_world_plot(all_locations_smry, unique(.$open_date), unique(.$mapid)))

## Create a movie

Using ffmpeg, we can put all the images together to create a movie:

    # works on a mac
    makemovie_cmd <- paste0("ffmpeg -framerate 8 -y -pattern_type glob -i '", paste0(getwd(), "/maps/"), "*.png'", " -c:v libx264 -pix_fmt yuv420p '", paste0(getwd(), "/maps/"), "movie.mp4'")
    system(makemovie_cmd)
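Hand-written quotes inside a shell command are easy to get wrong. As a sketch of one alternative, the same ffmpeg command can be assembled with file.path() and shQuote() from base R. The system() call is left commented out, since it needs ffmpeg installed and the maps folder in place.

```r
# Folder holding the frames, relative to the working directory.
maps_dir <- file.path(getwd(), "maps")

# shQuote() wraps each path in shell quotes, so paths with spaces survive.
makemovie_cmd <- paste(
  "ffmpeg -framerate 8 -y -pattern_type glob",
  "-i", shQuote(file.path(maps_dir, "*.png")),
  "-c:v libx264 -pix_fmt yuv420p",
  shQuote(file.path(maps_dir, "movie.mp4"))
)
makemovie_cmd

# system(makemovie_cmd)  # uncomment to run; requires ffmpeg on the PATH
```

Printing the command before running it makes quoting problems visible early.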

We can use ImageMagick’s convert command to create a gif:

    # https://askubuntu.com/a/43767
    makegif_cmd <- paste0("convert -delay 8 -loop 0 ", paste0(getwd(), "/maps/"), "*.png ", "animated.gif") # loop 0 for forever looping
    system(makegif_cmd)

That’s it! We get nice-looking videos showing location openings month by month. I was surprised to see how fast the company is opening locations, as well as how many locations it has in Asia!

## Post hoc

Using the ggimage library, I tried creating the maps using Shake Shack’s burger icon, but they didn’t turn out as well:

    my_us_icon_plot <- function(df, plotdate, mapid){
      g <- ggmap(us_map, darken = c("0.8", "black"))
      old_df <- filter(df, open_date < plotdate)
      new_df <- filter(df, open_date == plotdate)
      g <- g + geom_image(data = old_df, aes(x = lon, y = lat), image = "ss-app-logo.png", by = "height", size = 0.03, alpha = 0.4)
      g <- g + geom_image(data = new_df, aes(x = lon, y = lat), image = "ss-app-logo.png", by = "height", size = 0.07, alpha = 0.4)
      g <- g + theme(axis.ticks = element_blank(), axis.title = element_blank(),
                     axis.text = element_blank(), plot.title = element_blank())
      g <- g + annotate("text", x = -77, y = 33, label = "MONTH/YEAR:", color = "white", size = rel(5), hjust = 0)
      g <- g + annotate("text", x = -77, y = 32,
                        label = paste0(toupper(month.name[unique(new_df$open_month)]), "/", unique(new_df$open_year)),
                        color = "white", size = rel(6), fontface = 2, hjust = 0)
      g <- g + annotate("text", x = -77, y = 31, label = "STORE COUNT:", color = "white", size = rel(5), hjust = 0)
      g <- g + annotate("text", x = -77, y = 30, label = comma(unique(new_df$cumm_n)),
                        color = "white", size = rel(6), fontface = 2, hjust = 0)
      filename <- paste0("maps/img_" , str_pad(mapid, 7, pad = "0"), ".png")
      ggsave(filename = filename, plot = g, width = 13, height = 7, dpi = 150, type = "cairo-png")
    }

## Fun maps

What do you think? How else would you visualize these data points?

The post Do know Shake Shack’s locations outside of the US? You’d be surprised appeared first on nandeshwar.info.

To leave a comment for the author, please follow the link and comment on their blog: R – nandeshwar.info.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## RStudio:addins part 2 – roxygen documentation formatting made easy

(This article was first published on Jozef’s Rblog, and kindly contributed to R-bloggers)

# Introduction

Code documentation is extremely important if you want to share the code with anyone else, future you included. In this second post in the RStudio:addins series we will pay a part of our technical debt from the previous article and document our R functions conveniently using a new addin we will build for this purpose.

#### The addin we will create in this article will let us create well formatted roxygen documentation easily by using keyboard shortcuts to add useful tags such as \code{} or \link{} around selected text in RStudio.

# Quick intro to documentation with roxygen2

## 1. Documenting your first function

To help us generate documentation easily we will be using the roxygen2 package. You can install it using install.packages("roxygen2"). Roxygen2 works with in-code tags and will generate R’s documentation format .Rd files, create a NAMESPACE, and manage the Collate field in DESCRIPTION (not relevant to us at this point) automatically for our package.

Documenting a function works in 2 simple steps:

[Figure: Documenting a function]

1. Inserting a skeleton – Do this by placing your cursor anywhere in the function you want to document and click Code Tools -> Insert Roxygen Skeleton (default keyboard shortcut Ctrl+Shift+Alt+R).
2. Populating the skeleton with relevant information.
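For illustration, a populated skeleton might look like the following sketch. The function `addTwo` is a made-up example and not part of the package discussed in this post; the roxygen comments are ordinary R comments, so the snippet runs as plain R.

```r
#' Add two numbers
#'
#' @description Adds two numeric values and returns their sum.
#' @param x numeric(1), the first summand
#' @param y numeric(1), the second summand
#' @return numeric(1), the sum of x and y
#' @examples
#' addTwo(1, 2)
#' @export
addTwo <- function(x, y) {
  x + y
}

addTwo(1, 2)
```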
A few important tags are:

• #' @param – describing the arguments of the function
• #' @return – describing what the function returns
• #' @importFrom package function – in case your function uses a function from a different package; roxygen2 will automatically add it to the NAMESPACE
• #' @export – in case you want the function to be exported (mainly for use by other packages)
• #' @examples – showing how to use the function in practice

## 2. Generating and viewing the documentation

[Figure: Generating and viewing the documentation]

1. We generate the documentation files using roxygen2::roxygenise() or devtools::document() (default keyboard shortcut Ctrl+Shift+D)
2. Re-installing the package (default keyboard shortcut Ctrl+Shift+B)
3. Viewing the documentation for a function using ?functionname e.g. ?mean, or placing the cursor on a function name and pressing F1 in RStudio – this will open the Viewer pane with the help for that function

## 3. A real-life example

Let us now document runCurrentRscript a little bit:

    #' runCurrentRscript
    #' @description Wrapper around executeCmd with default arguments for easy use as an RStudio addin
    #' @param path character(1) string, specifying the path of the file to be used as Rscript argument (ideally a path to an R script)
    #' @param outputFile character(1) string, specifying the name of the file, into which the output produced by running the Rscript will be written
    #' @param suffix character(1) string, specifying additional suffix to pass to the command
    #' @importFrom rstudioapi getActiveDocumentContext
    #' @importFrom rstudioapi navigateToFile
    #' @seealso executeCmd
    #' @return side-effects
    runCurrentRscript &1") { cmd

As we can see by looking at ?runCurrentRscript versus ?mean, our documentation does not quite look up to par with documentation for other functions.

What is missing, if we abstract from the richness of the content, is the usage of markup commands (tags) for formatting and linking our documentation.
Some very useful tags of this kind are, for example:

• \code{}, \strong{}, \emph{} for font style
• \link{}, \href{}, \url{} for linking to other parts of the documentation or external resources
• \enumerate{}, \itemize{}, \tabular{} for using lists and tables
• \eqn{}, \deqn{} for mathematical expressions such as equations etc.

For the full list of options regarding text formatting, linking and more see Writing R Extensions’ Rd format chapter

# Our addins to make documenting a breeze

As you can imagine, typing the markup commands in full all the time is quite tedious. The goal of our new addin will therefore be to make this process efficient using keyboard shortcuts – just select a text and our addin will place the desired tags around it. For this time, we will be satisfied with simple 1 line tags.

## 1. Add a selected tag around a character string

    roxyfy

## 2. Apply the tag on a selection in an active document in RStudio

We will make the functionality available for multi-selections as well by lapply-ing over the selection elements retrieved from the active document in RStudio.

    addRoxytag

## 3. Wrappers around addRoxytag to be used as addin for some useful tags

    addRoxytagCode

## 4. Add the addin bindings into addins.dcf and assign keyboard shortcuts

As the final step, we need to add the bindings for our new addins to the inst/rstudio/addins.dcf file and re-install the package.

    Name: addRoxytagCode
    Description: Adds roxygen tag code to current selections in the active RStudio document
    Binding: addRoxytagCode
    Interactive: false

    Name: addRoxytagLink
    Description: Adds roxygen tag link to current selections in the active RStudio document
    Binding: addRoxytagLink
    Interactive: false

    Name: addRoxytagEqn
    Description: Adds roxygen tag eqn to current selections in the active RStudio document
    Binding: addRoxytagEqn
    Interactive: false

[Figure: assigning keyboard shortcuts to addins]

# The addins in action

And now, let’s just select the text we want to format and watch our addins do the work for us!
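What the addins do to each selection boils down to wrapping a string in an Rd markup command. The body of `roxyfy` is not shown above, so the following is only a minimal sketch under assumptions: the argument names and the default tag are hypothetical.

```r
# Hypothetical sketch of a tag-wrapping helper; vectorised over str
roxyfy <- function(str, tag = "code") {
  paste0("\\", tag, "{", str, "}")
}

# cat() shows the strings as they would appear in the source file
cat(roxyfy("mean"), "\n")
cat(roxyfy(c("x", "y"), tag = "link"), "\n")
```

Because `paste0()` is vectorised, the same helper can be applied to several selections at once, which fits the multi-selection use case described in step 2.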
Then document the package, re-install it and view the improved help for our functions:

[Figure: The addins in action]

# What is next – even more automated documentation

Next time we will try to enrich our addins for generating documentation by adding the following functionalities:

• automatic generation of @importFrom tags by inspecting the function code
• allowing for more complex tags such as itemize

# TL;DR – Just give me the package

# References

To leave a comment for the author, please follow the link and comment on their blog: Jozef’s Rblog.

## Enterprise Advocate

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We are looking for our next Enterprise Advocate to join the RStudio team. See what Pete Knast, Global Director of New Business, has to say about working at RStudio and the Enterprise Advocate role.

When did you join RStudio and what made you interested in working here?

I joined in early 2014. I was excited by RStudio since I love helping people and being an open source company RStudio seemed like a great way to reach a lot of people and get to assist with numerous interesting use cases.

What types of projects do you work on?

I get to work on the front lines corresponding directly with our customers. Since my focus is new business this means I am helping open source users take the next step in their use of R. Sometimes this means helping IT organizations understand how RStudio can integrate with corporate security/authentication protocols and sometimes it involves showing off various Shiny applications. The types of projects always vary which keeps me on my toes.
What do you enjoy about working at RStudio?

One would be that my colleagues are not only extremely smart but very genuine so there are always fun conversations in and outside of work. Another big reason would be the various use cases I get to play a part in. Since each customer has a different application area or industry I never get bored when I learn about how RStudio offerings are being applied.

What types of qualities do you look for when hiring an Enterprise Advocate?

Two words that come to mind are humble and smart. Also since RStudio is so popular but our company is small we have a high volume of customers to connect with so high energy is also a must. If you enjoy solving problems not only will you find the role a good fit but you will have the chance to help bring solutions to large corporations and assist in resolving issues that can even cure diseases.

What are the goals for someone new to this role?

To establish themselves in the R and data science community as a trusted consultant. When you wake up from a dream that involves R you have made it.

If you think you or someone you know might be a good fit for this role and want to know more, check it out here.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

## Visualizing graphs with overlapping node groups

(This article was first published on r-bloggers – WZB Data Science Blog, and kindly contributed to R-bloggers)

I recently came across some data about multilateral agreements, which needed to be visualized as network plots. This data had some peculiarities that made it more difficult to create a plot that was easy to understand.
First, the nodes in the graph were organized in groups but each node could belong to multiple groups or to no group at all. Second, there was one “super node” that was connected to all other nodes (while “normal” nodes were only connected within their group). This made it difficult to find the right layout that showed the connections between the nodes as well as the group memberships. However, after digging a little deeper into the R packages igraph and ggraph, it is possible to get satisfying results in such a scenario.

### Example data, nodes & edges

We will use the following packages that we need to load at first:

    library(dplyr)
    library(purrr)
    library(igraph)
    library(ggplot2)
    library(ggraph)
    library(RColorBrewer)

Let’s create some exemplary data. Let’s say we have 4 groups a, b, c, d and 40 nodes with the node IDs 1 to 40. Each node can belong to several groups, but it does not have to belong to any group. An example would be the following data:

    group_a

An excerpt of the data:

    > members
    id group
     1     a
     2     a
    [...]
     5     a
     1     b
     2     b
    [...]
    38    NA
    39    NA
    40    NA

Now we can create the edges of the graph, i.e. the connections between the nodes. All nodes within a group are connected to each other. Additionally, all nodes are connected with one “super node” (as mentioned in the introduction). In our example data, we pick node ID 1 to be this special node. Let’s start to create our edges by connecting all nodes to node 1:

    edges

We also denote here that these edges are not part of any group memberships. We’ll handle these group memberships now:

    within_group_edges % split(.$group) %>%
    map_dfr(function (grp) {
    id2id
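The edge construction described above can be sketched in base R as follows. This is a minimal illustration under assumptions: it uses `lapply()` and `rbind()` instead of dplyr and purrr, and the small membership table is made up rather than taken from the example data.

```r
# Toy membership table: two overlapping groups (illustrative values)
members <- data.frame(
  id    = c(1, 2, 3, 1, 3, 4, 5),
  group = c("a", "a", "a", "b", "b", "b", "b"),
  stringsAsFactors = FALSE
)
all_ids <- 1:8   # nodes 6 to 8 belong to no group

# Edges from the "super node" (ID 1) to every other node; no group attached
edges <- data.frame(from = 1, to = all_ids[all_ids != 1],
                    group = NA_character_, stringsAsFactors = FALSE)

# Pairwise edges within each group via combn()
within_group_edges <- do.call(rbind, lapply(split(members, members$group), function(grp) {
  id2id <- combn(grp$id, 2)   # 2 x choose(n, 2) matrix, one ID pair per column
  data.frame(from = id2id[1, ], to = id2id[2, ],
             group = unique(grp$group), stringsAsFactors = FALSE)
}))

edges <- rbind(edges, within_group_edges)
rownames(edges) <- NULL
print(edges)
```

Note that `combn(x, 2)` treats a single number `x` as `1:x`, so a group with only one member would need special handling.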

At first, we split the members data by their group which produces a list of data frames. We then use map_dfr from the purrr package to handle each of these data frames that are passed as grp argument. grp$id contains the node IDs of the members of this group and we use combn to create the pair-wise combinations of these IDs. This will create a matrix id2id, where the columns represent the node ID pairs. We return a data frame with the from-to ID pairs and a group column that denotes the group to which these edges belong. These “within-group edges” are appended to the already created edges using bind_rows.

    > edges
    from to group
       1  2    NA
       1  3    NA
       1  4    NA
    [...]
      23 24     d
      23 25     d
      24 25     d

### Plotting with ggraph

We have our edges, so now we can create the graph with igraph and plot it using the ggraph package:

    g

Not bad for the first try, but the layout is a bit unfortunate, giving too much space to nodes that don’t belong to any group. We can tell igraph’s layout algorithm to tighten the non-group connections (the gray lines in the above figure) by giving them a higher weight than the within-group edges:

    # give weight 10 to non-group edges
    edges % split(.$group) %>%
    map_dfr(function (grp) {
    id2id
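A minimal sketch of the weighting step, assuming an edge data frame with a `group` column that is `NA` for the super-node edges. The weight 10 for non-group edges comes from the comment above; the weight 1 for within-group edges and the toy edge list are assumptions.

```r
# Toy edge list: three super-node edges and one within-group edge
edges <- data.frame(
  from  = c(1, 1, 1, 2),
  to    = c(2, 3, 4, 3),
  group = c(NA, NA, NA, "a"),
  stringsAsFactors = FALSE
)

# Heavier weight for non-group edges, lighter weight for within-group edges
edges$weight <- ifelse(is.na(edges$group), 10, 1)
print(edges)
```

When such a data frame is passed to igraph's `graph_from_data_frame()`, the extra `weight` column is stored as an edge attribute of the graph.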

We reconstruct the graph g and plot it using the same commands as before and get the following:

The nodes within groups are now much less cluttered and the layout is more balanced.

### Plotting with igraph

A problem with this type of plot is that connections within smaller groups are sometimes hardly visible (for example group a in the above figure). The plotting functions of igraph allow an additional method of highlighting groups in graphs: Using the parameter mark.groups will construct convex hulls around nodes that belong to a group. These hulls can then be highlighted with respective colors.

At first, we need to create a list that maps each group to a vector of the node IDs that belong to that group:

    group_ids % split(.$group), function(grp) { grp$id })

    > group_ids
    $a
    [1] 1 2 3 4 5

    $b
    [1] 1  2  3  4  5  6  7  8  9 10
    [...]
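A list of this shape can also be produced with base R's `split()`. The sketch below is an illustration under assumptions: the `members` table only mirrors the excerpts shown earlier (groups a and b plus three nodes without a group), not the full example data.

```r
# Partial membership table consistent with the excerpts above
members <- data.frame(
  id    = c(1:5, 1:10, 38:40),
  group = c(rep("a", 5), rep("b", 10), rep(NA, 3)),
  stringsAsFactors = FALSE
)

# split() drops rows whose group is NA, so only real groups remain
group_ids <- split(members$id, members$group)
str(group_ids)
```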


Now we can create a color for each group using RColorBrewer:

group_color 
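The plot call further down expects both `group_color` and a lighter `group_color_fill`. One possible way to create them is sketched below; the palette name "Set1" and the alpha value 0.2 are assumptions, and `adjustcolor()` from base R's grDevices is used here to derive the semi-transparent fill colors.

```r
library(RColorBrewer)

group_names <- c("a", "b", "c", "d")

# One qualitative color per group (brewer.pal() needs at least 3 colors)
group_color <- brewer.pal(length(group_names), "Set1")

# Semi-transparent variants of the same colors for filling the hulls
group_color_fill <- adjustcolor(group_color, alpha.f = 0.2)

print(group_color)
print(group_color_fill)
```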

We plot it by using the graph object g that was generated before with graph_from_data_frame:

par(mar = rep(0.1, 4))   # reduce margins

plot(g, vertex.color = 'white', vertex.size = 9,
edge.color = rgb(0.5, 0.5, 0.5, 0.2),
mark.groups = group_ids,
mark.col = group_color_fill,
mark.border = group_color)

legend('topright', legend = names(group_ids),
col = group_color,
pch = 15, bty = "n",  pt.cex = 1.5, cex = 0.8,
text.col = "black", horiz = FALSE)


This option usually works well when you have groups that are more or less well separated, i.e. do not overlap too much. However, in our case there is quite some overlap and we can see that the shapes that encompass the groups also sometimes include nodes that do not actually belong to that group (for example node 8 in the above figure that is encompassed by group a although it does not belong to that group).

We can use a trick that leads the layout algorithm to bundle the groups more closely in a different manner: For each group, we introduce a “virtual node” (which will not be drawn during plotting) to which all the normal nodes in the group are tied with more weight than to each other. Nodes that only belong to a single group will be placed farther away from the center than those that belong to several groups, which will reduce clutter and wrongly overlapping group hulls. Furthermore, a virtual group node for nodes that do not belong to any group will make sure that these nodes will be placed more closely to each other.

We start by generating IDs for the virtual nodes:

# 4 groups plus one "NA-group"
virt_group_nodes 

This will give us the following IDs:

> virt_group_nodes
a    b    c    d   NA
41   42   43   44   45
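IDs like these can be generated by continuing the numbering after the last real node. A minimal sketch, assuming 40 real nodes as in the example; the label "NA" is used here as a plain string for the no-group bucket.

```r
n_nodes <- 40L
group_labels <- c("a", "b", "c", "d", "NA")   # "NA" stands for nodes without a group

# Virtual node IDs start right after the highest real node ID
virt_group_nodes <- setNames(n_nodes + seq_along(group_labels), group_labels)
print(virt_group_nodes)
```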


We start to create the edges again by connecting all nodes to the “super node” with ID 1:

edges_virt 

Then, the edges within the groups will be generated again, but this time we add additional edges to each group’s virtual node:

    within_virt %>% split(.$group) %>% map_dfr(function (grp) { group_name

We add edges from all nodes that don’t belong to a group to another virtual node:

    virt_group_na % filter(is.na(group)))$id
    edges_na_group_virt
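The two steps above can be sketched in base R as follows. Everything concrete here is an assumption for illustration: the toy membership table, the virtual node IDs 7 to 9, and the weights (1 within a group, 5 towards a group's virtual node, 10 towards the no-group virtual node). As noted at the end of this post, suitable weights have to be found empirically.

```r
# Toy data: nodes 1-4 in groups a and b, nodes 5 and 6 without a group
members <- data.frame(
  id    = c(1, 2, 3, 2, 3, 4, 5, 6),
  group = c("a", "a", "a", "b", "b", "b", NA, NA),
  stringsAsFactors = FALSE
)
virt_group_nodes <- c(a = 7, b = 8, "NA" = 9)

# Within-group edges plus edges from each member to the group's virtual node
within_virt <- do.call(rbind, lapply(split(members, members$group), function(grp) {
  group_name <- unique(grp$group)
  virt_id <- virt_group_nodes[[group_name]]
  id2id <- combn(grp$id, 2)
  rbind(
    data.frame(from = id2id[1, ], to = id2id[2, ], weight = 1,
               group = group_name, stringsAsFactors = FALSE),
    data.frame(from = grp$id, to = virt_id, weight = 5,
               group = group_name, stringsAsFactors = FALSE)
  )
}))

# Nodes without a group are tied to their own virtual node
na_ids <- members$id[is.na(members$group)]
edges_na_group_virt <- data.frame(from = na_ids, to = virt_group_nodes[["NA"]],
                                  weight = 10, group = NA_character_,
                                  stringsAsFactors = FALSE)

edges_virt <- rbind(within_virt, edges_na_group_virt)
rownames(edges_virt) <- NULL
print(edges_virt)
```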

This time, we also create a data frame for the nodes, because we want to add an additional property is_virt to each node that denotes if that node is virtual:

nodes_virt 

We’re ready to create the graph now:

g_virt 

To illustrate the effect of the virtual nodes, we can plot the graph directly and get a figure like this (virtual nodes highlighted in turquoise):

We now want to plot the graph without the virtual nodes, but the layout should nevertheless be calculated with the virtual nodes. We can achieve that by running the layout algorithm first and then removing the virtual nodes from both the graph and the generated layout matrix:

# use "auto layout"
lay 
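A minimal sketch of this step on a toy graph, assuming a vertex attribute `is_virt` as described above. Instead of the automatic layout, it calls `layout_with_fr()` so that the edge weights are passed explicitly; the toy nodes, edges and weights are made up for illustration.

```r
library(igraph)

# Toy graph: nodes 1-6 are real, node 7 is a virtual group node
nodes <- data.frame(id = 1:7, is_virt = c(rep(FALSE, 6), TRUE))
edges <- data.frame(
  from   = c(1, 1, 1, 1, 1, 2, 3, 2, 3, 4),
  to     = c(2, 3, 4, 5, 6, 3, 4, 7, 7, 7),
  weight = c(10, 10, 10, 10, 10, 1, 1, 5, 5, 5)
)
g_virt <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Compute the layout on the full graph, including the virtual node
set.seed(1)
lay <- layout_with_fr(g_virt, weights = E(g_virt)$weight)

# Remove the virtual node from both the layout matrix and the graph
virt_idx <- which(V(g_virt)$is_virt)
lay <- lay[-virt_idx, , drop = FALSE]
g_plot <- delete_vertices(g_virt, virt_idx)

# Layout rows and remaining vertices must still line up
dim(lay)
vcount(g_plot)
```

The row order of the layout matrix follows the vertex order of the graph, which is why the same index vector can be used for both removals.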

It’s important to pass the layout matrix now with the layout parameter to produce the final figure:

plot(g_virt, layout = lay, vertex.color = 'white', vertex.size = 9,
edge.color = rgb(0.5, 0.5, 0.5, 0.2),
mark.groups = group_ids, mark.col = group_color_fill,
mark.border = group_color)

legend('topright', legend = names(group_ids), col = group_color,
pch = 15, bty = "n",  pt.cex = 1.5, cex = 0.8,
text.col = "black", horiz = FALSE)


We can see that the output is less cluttered: nodes that belong to the same groups are bundled nicely, while nodes that do not share the same groups are well separated. Note that the respective edge weights were found empirically, and you will probably need to adjust them to achieve a good graph layout for your data.