Mapping to a ‘t’(map)

By HighlandR

[Figure: SIMD 2016 – Highland Council Area by quintile, mapped with tmap]

(This article was first published on HighlandR, and kindly contributed to R-bloggers)

More maps of the Highlands?

Yep, same as last time, but no need to install dev versions of anything, we can get awesome maps courtesy of the tmap package.

Get the shapefile from the last post


library(tmap)
library(tmaptools)
library(viridis)

scot <- read_shape("SG_SIMD_2016.shp", as.sf = TRUE)
highland <- scot[scot$LAName == "Highland", ]


#replicate plot from previous blog post:

quint <- tm_shape(highland) +
  tm_fill(col = "Quintile",
          palette = viridis(n=5, direction = -1,option = "C"),
          fill.title = "Quintile",
          title = "SIMD 2016 - Highland Council Area by Quintile")

quint # plot

ttm() #switch between static and interactive - this will use interactive
quint # or use last_map()
# in R Studio you will find leaflet map in your Viewer tab

ttm() # return to plotting

The results:

One really nice thing is that because the polygons don’t have outlines, the DataZones that are really densely packed still render nicely – so no ‘missing’ areas.
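That is because only tm_fill() is used above; if outlines were ever wanted, a tm_borders() layer could be added back on top, for example:

# Add thin polygon outlines on top of the fill (optional)
quint + tm_borders(lwd = 0.3, col = "grey40")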

A static image of the leaflet map:

[Figure: static snapshot of the interactive leaflet map]

Here I take each domain rank for the Highland data zones, along with the overall SIMD rank, and create a small multiple:

small_mult <- tm_shape(highland) +
  tm_fill(col = c("IncRank","EmpRank","HlthRank","EduRank",
                  "GAccRank","CrimeRank","HouseRank","Rank"),
          palette = viridis(n=5, direction = -1,option = "C"),
          title=c("Income Rank", "Employment Rank","Health Rank","Education Rank",
                  "General Access Rank","Crime Rank", "Housing Rank","Overall Rank"))
small_mult

[Figure: Highland SIMD 2016 – all domain ranks as small multiples]

Let’s take a look at Scotland as a whole, as I assume everyone’s pretty bored of the Highlands by now:

#try a map of scotland
scotplot <- tm_shape(scot) +
  tm_fill(col = "Rank",
          palette = viridis(n=5, direction = -1,option = "C"),
          fill.title = "Overall Rank",
          title = "Overall-Rank")
scotplot # bit of a monster
         

[Figure: Scotland SIMD 2016 – overall rank]

With the interactive plot, we can really appreciate the density of these datazones in the Central belt.

[Figure: interactive leaflet map of Scotland]

Here, I’ve zoomed in a bit on the region around Glasgow, and then zoomed in some more:

[Figure: leaflet map zoomed in on the Glasgow area]

[Figure: leaflet map zoomed in further]

I couldn’t figure out how to host the leaflet map within the page (Jekyll / Github / Leaflet experts please feel free to educate me on that 🙂 ) but, given the size of the file, I very much doubt I could have uploaded it to Github anyway.

Thanks to Roger Bivand (@RogerBivand) for getting in touch and pointing me towards the tmap package! It’s really good fun and an easy way to get interactive maps up and running.

To leave a comment for the author, please follow the link and comment on their blog: HighlandR.


Source:: R News

The Proof-Calculation Ping Pong

By Theory meets practice…


Abstract

The proof-calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.

Introduction

If you are a statistician working in climate science, data-driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or having a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g., version control and unit tests provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames, and we shall illustrate its use by automating the comparison step of the statistical proof-calculation ping-pong process.

Case Story

Ada and Bob have to proof-calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. For this post we will ignore the complex survey part of the problem and just pretend that our sample corresponds to the population (complete inventory count). After the questionnaire phase, the following data are available to Ada and Bob.

Id Version Region Sales Volume Staff Costs People
01 1 A 23000 10003 5
02 1 B 12200 7200 1
03 1 NA 19500 7609 2
04 1 A 17500 13000 3
05 1 B 119500 90000 NA
05 2 B 119500 95691 19
06 1 B NA 19990 6
07 1 A 19123 20100 8
08 1 D 25000 100 NA
09 1 D 45860 32555 9
10 1 E 33020 25010 7

Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

  • the overall number of fitness centres
  • total sales volume
  • total staff cost
  • total number of people employed in fitness centres

However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

Ada

Ada loves the tidyverse and in particular dplyr. This is her solution:

library(dplyr)

ada <- fitness %>% na.omit() %>% group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People))
ada
## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
## 1      A         3          59623         43103     16
## 2      B         1         119500         95691     19
## 3      D         1          45860         32555      9
## 4      E         1          33020         25010      7

Bob

Bob has a dark past as a database developer and, hence, only recently experienced the joys of R. He therefore chooses an SQLite-within-R approach that needs no SQL server:

library(RSQLite)
## Create ad-hoc database
db <- dbConnect(SQLite(), dbname = file.path(filePath, "Test.sqlite"))
##Move fitness data into the ad-hoc DB
dbWriteTable(conn = db, name = "FITNESS", fitness, overwrite=TRUE, row.names=FALSE)
##Query using SQL
bob <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) As NoOfUnits,
           SUM([Sales Volume]) As [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM FITNESS
    WHERE [Sales Volume] > 17500 GROUP BY Region
")
bob
##   Region NoOfUnits Sales Volume Staff Costs People
## 1            1        19500        7609      2
## 2      A         2        42123       30103     13
## 3      B         2       239000      185691     19
## 4      D         2        70860       32655      9
## 5      E         1        33020       25010      7

The Ping-Pong phase

After Ada and Bob each have a result, they compare their resulting data.frames using the daff package, which was recently presented by @edwindjonge at useR! 2017 in Brussels.

library(daff)
diff <- diff_data(ada, bob)
diff$get_data()
##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++            1          19500          7609      2
## 2  ->      A      3->2   59623->42123  43103->30103 16->13
## 3  ->      B      1->2 119500->239000 95691->185691     19
## 4  ->      D      1->2   45860->70860  32555->32655      9
## 5          E         1          33020         25010      7

After Ada’s and Bob’s first serves, the two realize that their results agree for only one region (‘E’). Note: Currently, daff has the semi-annoying feature of not being able to show all the diffs when printing, showing just n lines of the head and tail instead. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")
print(diff)
##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++            1          19500          7609      2
## 2  ->      A      3->2   59623->42123  43103->30103 16->13
## 3  ->      B      1->2 119500->239000 95691->185691     19
## 4  ->      D      1->2   45860->70860  32555->32655      9

The two decide to first focus on agreeing on the number of units per region.
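One way to produce such a column-restricted comparison (a sketch; the post does not show the exact call) is to diff only those columns:

diff_data(ada %>% select(Region, NoOfUnits),
          bob %>% select(Region, NoOfUnits)) %>% print()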

##    @@ Region NoOfUnits
## 1 +++            1
## 2  ->      A      3->2
## 3  ->      B      1->2
## 4  ->      D      1->2

One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Therefore, Bob takes another look at his handling of missing values and decides to improve his code as follows:

Pong Bob

bob2 <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) As NoOfUnits,
           SUM([Sales Volume]) As [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People
    FROM FITNESS
    WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
    GROUP BY Region
")
diff2 <- diff_data(ada, bob2, ordered=FALSE, count_like_a_spreadsheet=FALSE)
diff2 %>% print()
##   @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 ->      A      3->2   59623->42123  43103->30103 16->13
## 2 ->      B      1->2 119500->239000 95691->185691     19
## 3 ->      D      1->2   45860->70860  32555->32655      9

Ping Bob

Better. Now the NA region is gone, but quite a few differences remain. Note: You may at this point want to stop reading and try to fix the analysis yourself – the data and code are available from the github repository.

Pong Bob

Now Bob notices that he forgot to handle the duplicate records and massages the SQL query as follows:

bob3 <- dbGetQuery(db, "
    SELECT Region,
           COUNT(*) As NoOfUnits,
           SUM([Sales Volume]) As [Sales Volume],
           SUM([Staff Costs]) AS [Staff Costs],
           SUM(People) AS People FROM
    (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People FROM FITNESS GROUP BY Id)
    WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
    GROUP BY Region
")
diff3 <- diff_data(ada, bob3, ordered=FALSE, count_like_a_spreadsheet=FALSE)
diff3 %>% print()
##    @@ Region NoOfUnits Sales Volume  Staff Costs People
## 1 ...    ...       ...          ...          ...    ...
## 2  ->      D      1->2 45860->70860 32555->32655      9

Ping Ada

Comparing with Ada, Bob is sort of envious that she was able to just use dplyr‘s group_by and top_n functions. However, daff shows that there is still one difference left. Looking more carefully at Ada’s code, it becomes clear that she accidentally leaves out one unit in region D. The reason is her too liberal use of na.omit, which also removes the one entry with an NA in one of the less important columns. They discuss whether one really wants to include such partial records, because the summation in the different columns is then over a different number of units. After consulting the standard operating procedure (SOP) for these kinds of surveys, they decide to include the observations where possible. Here is Ada’s modified code:

ada2 <- fitness %>% filter(!is.na(Region)) %>% group_by(Region,Id) %>% top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People))
(diff_final <- diff_data(ada2, bob3)) %>% print()
##    @@ Region NoOfUnits ... Staff Costs People
## 1 ...    ...       ... ...         ...    ...
## 2  ->      D         2 ...       32655  NA->9

Pong Ada

Oops, forgot to take care of the NA in the summation:

ada3 <- fitness %>% filter(!is.na(Region)) %>% group_by(Region,Id) %>% top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People,na.rm=TRUE))
diff_final <- diff_data(ada3, bob3)
length(diff_final$get_data()) == 0
## [1] TRUE

Conclusion

Finally, their results agree, the analysis moves into production, and the results are published in a nice report.

Question 1: Do you agree with their results?

ada3
## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
## 1      A         3          59623         43103     16
## 2      B         1         119500         95691     19
## 3      D         2          70860         32655      9
## 4      E         1          33020         25010      7

As shown, the ping-pong game is quite manual and particularly annoying if at some point someone steps into the office with a statement like “Btw, I found some extra questionnaires, which need to be added to the analysis asap.” However, the two now-aligned analysis scripts and the corresponding daff overlay could be put into a script that is triggered every time the data change. In case new discrepancies emerge, detected as length(diff$get_data()) > 0, the two could then be automatically informed.
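A minimal sketch of such a trigger script, where run_ada() and run_bob() are hypothetical wrappers around the two aligned analysis scripts, could look like this:

check_agreement <- function() {
  ada <- run_ada()   # hypothetical wrapper around Ada's dplyr script
  bob <- run_bob()   # hypothetical wrapper around Bob's SQLite script
  diff <- diff_data(ada, bob)
  if (length(diff$get_data()) > 0) {
    # discrepancies found: write the daff overlay and notify Ada and Bob
    render_diff(diff, file = "proof-calculation-diff.html", view = FALSE)
  }
  invisible(diff)
}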

Question 2: Are you aware of any other good ways and tools to structure and automatize such a process? If so, please share your experiences as a Disqus comment below.

Source:: R News

Multiplicative Congruential Generators in R

By Aaron Schlegel

Multiplicative Congruential Generators at several values of n

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)
Part 2 of 2 in the series Random Number Generation

Multiplicative congruential generators, also known as Lehmer random number generators, are a type of linear congruential generator for producing pseudorandom numbers in U(0, 1). The multiplicative congruential generator, often abbreviated as MLCG or MCG, is defined as a recurrence relation similar to the LCG with c = 0:

$$X_{i+1} = a X_i \bmod m$$

Unlike the LCG, the parameters a and m for multiplicative congruential generators are more restricted, and the initial seed X_0 must be relatively prime to the modulus m (that is, the greatest common divisor of X_0 and m is 1). The parameters in common use are m = 2^31 - 1 = 2,147,483,647 and a = 7^5 = 16,807. However, in a correspondence in the Communications of the ACM, Park, Miller and Stockmeyer changed the value of the parameter a, stating:

The minimal standard Lehmer generator we advocated had a modulus of m = 2^31 – 1 and a multiplier of a = 16807. Relative to this particular choice of multiplier, we wrote “… if this paper were to be written again in a few years it is quite possible that we would advocate a different multiplier ….” We are now prepared to do so. That is, we now advocate a = 48271 and, indeed, have done so “officially” since July 1990. This new advocacy is consistent with the discussion on page 1198 of [10]. There is nothing wrong with 16807; we now believe, however, that 48271 is a little better (with q = 44488, r = 3399).

Multiplicative Congruential Generators with Schrage’s Method

When using a large prime modulus m such as 2^31 - 1, the multiplicative congruential generator can overflow. Schrage’s method was invented to overcome this possibility; it can be applied whenever a(m mod a) < m. We can check that the parameters in use satisfy this condition:

a <- 48271
m <- 2147483647
a * (m %% a) < m
## [1] TRUE

Schrage's method restates the modulus m as the decomposition m = aq + r, where r = m mod a and q = ⌊m / a⌋.

$$ax \bmod m = \begin{cases} a(x \bmod q) - r\lfloor x/q \rfloor & \text{if } a(x \bmod q) - r\lfloor x/q \rfloor \ge 0 \\ a(x \bmod q) - r\lfloor x/q \rfloor + m & \text{otherwise} \end{cases}$$
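For the advocated multiplier we can verify in R that this decomposition reproduces the values of q and r quoted above:

m <- 2147483647   # 2^31 - 1
a <- 48271
m %/% a           # q
## [1] 44488
m %% a            # r
## [1] 3399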
Multiplicative Congruential Generator in R

We can implement a Lehmer random number generator in R using the parameters mentioned earlier.

# Lehmer RNG via Schrage's method, using a = 48271, m = 2^31 - 1 (q = 44488, r = 3399).
# Implementation sketch: the seed is taken from the system clock, so the
# output below will differ from run to run.
lehmer.rng <- function(n = 10) {
  m <- 2147483647; a <- 48271
  q <- 44488; r <- 3399
  x <- floor(as.numeric(Sys.time())) %% m     # seed
  rng <- numeric(n)
  for (i in seq_len(n)) {
    x <- a * (x %% q) - r * (x %/% q)         # Schrage: a*x mod m without overflow
    if (x < 0) x <- x + m
    rng[i] <- x / m
  }
  rng
}
# Print the first 10 randomly generated numbers
lehmer.rng()
##  [1] 0.68635675 0.12657390 0.84869106 0.16614698 0.08108171 0.89533896
##  [7] 0.90708773 0.03195725 0.60847522 0.70736551

Plotting our multiplicative congruential generator in three dimensions allows us to visualize the apparent ‘randomness’ of the generator. As before, we generate three random vectors x, y, z with our Lehmer RNG function and plot the points. The plot3D package is used to create the scatterplots and the animation package is used to animate each scatterplot as the length of the random vectors, n, increases.

library(plot3D)
library(animation)

# The values of n below are illustrative; each frame plots three freshly
# generated random vectors from the Lehmer RNG.
saveGIF({
  for (n in c(100, 1000, 10000)) {
    scatter3D(lehmer.rng(n), lehmer.rng(n), lehmer.rng(n),
              colvar = NULL, pch = 20, main = paste("n =", n))
  }
}, movie.name = "lehmer-rng.gif")

The generator appears to be generating suitably random numbers demonstrated by the increasing swarm of points as n increases.

References

Anne Gille-Genest (March 1, 2012). Implementation of the Pseudo-Random Number Generators and the Low Discrepancy Sequences.

Saucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD. Army Research Lab.

Stephen K. Park; Keith W. Miller; Paul K. Stockmeyer (1993). “Technical Correspondence”. Communications of the ACM. 36 (7): 105–110.

The post Multiplicative Congruential Generators in R appeared first on Aaron Schlegel.

To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel.


Source:: R News

Probability functions intermediate

By Francisco Méndez

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises, we are going to explore some of the probability functions in R by using practical applications. Basic probability knowledge is required. In case you are not familiar with the function apply, check the R documentation.

Note: We are going to use random number functions and random process functions in R, such as runif. A problem with these functions is that every time you run them you will obtain a different value. To make your results reproducible you can specify the seed using set.seed(any number) before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random number process.) For this set of exercises we will use set.seed(1). Don’t forget to specify it before every exercise that includes random numbers.
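For example, resetting the seed makes runif return exactly the same numbers each time:

set.seed(1)
runif(3)
## [1] 0.2655087 0.3721239 0.5728534

set.seed(1)
runif(3)   # identical values after resetting the seed
## [1] 0.2655087 0.3721239 0.5728534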

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generating dice rolls. Set your seed to 1 and generate 30 random numbers between 0 and 6 using runif. Save them in an object called random_numbers. Then use the ceiling function to round the values up. These values represent 30 dice rolls.

Exercise 2

Simulate one die roll using the function rmultinom. Make sure n = 1 is inside the function, and save the result in an object called die_result. The matrix die_result contains a single 1 and five 0s, with the position of the 1 indicating which value was obtained. Use the function which to create an output that shows only the value obtained when the die is rolled.

Exercise 3

Using rmultinom, simulate 30 dice rolls. Save the result in a variable called dice_result and use apply to transform the matrix into a vector holding the value of each roll.

Exercise 4

Some gambling games use 2 dice and sum their values after each throw. Simulate throwing 2 dice 30 times and record the sum of each pair: use rmultinom to simulate the 30 throws of 2 dice, and use the function apply to record the sum of the values in each experiment.

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to

  • work with different binomial and logistic regression techniques,
  • compare regression models and choose the right fit,
  • and much more.

Exercise 5

Simulate values from a normal distribution. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1 m. Using rnorm, simulate the height of 100 people and save it in an object called heights.

To get an idea of the values of heights, use the function summary.

Exercise 6

90% of the population is shorter than ____________?

Exercise 7

What percentage of the population is taller than 1.60 m?

Exercise 8

Run the following lines of code before this exercise. They will load a library required for the exercise.

if (!'MASS' %in% installed.packages()) install.packages('MASS')
library(MASS)

Simulate 1000 people with height and weight using the function mvrnorm, with mu = c(1.70, 60) and Sigma = matrix(c(.1, 3.1, 3.1, 100), nrow = 2).

Exercise 9

How many people from the simulated population are taller than 1.70 m and heavier than 60 kg?

Exercise 10

How many people from the simulated population are taller than 1.75 m and lighter than 60 kg?

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

DEADLINE EXTENDED: Last call for Boston EARL abstracts

By Mango Solutions

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Are you solving problems and innovating with R?
Are you working with R in a commercial setting?
Do you enjoy sharing your knowledge?

If you said yes to any of the above, we want your abstract!

Share your commercial R stories with your peers at EARL Boston this November.

EARL isn’t about knowing the most or being the best in your field – it’s about taking what you’ve learnt and sharing it with others, so they can learn from your wins (and sometimes your failures, because we all have them!).

As long as your proposed presentation is focused on the commercial use of R, any topic from any industry is welcome!

The abstract submission deadline has been extended to Sunday 3 September.

Join David Robinson, Mara Averick and Tareef Kawaf on 1-3 November 2017 at The Charles Hotel in Cambridge.

To submit your abstract on the EARL website, click here.

See you in Boston!

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Source:: R News

Text featurization with the Microsoft ML package

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last week I wrote about how you can use the MicrosoftML package in Microsoft R to featurize images: reduce an image to a vector of 4096 numbers that quantify the essential characteristics of the image, according to an AI vision model. You can perform a similar featurization process with text as well, but in this case you have a lot more control of the features used to represent the text.

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here’s one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It’s great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms (“useful”, “indispensable”, “highly recommended”), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the MicrosoftML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one- and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

transformRule = list(
  featurizeText(
    vars = c(Features = "REVIEW_TEXT"),
    # ngramLength=2: include not only "Azure", "AD", but also "Azure AD"
    # skipLength=1 : n-grams may skip one word (e.g. "Azure Active Directory" also yields "Azure Directory")
    wordFeatureExtractor = ngramCount(
      weighting = "tfidf",
      ngramLength = 2,
      skipLength = 1),
    language = "English"
  ),
  selectFeatures(
    vars = c("Features"),
    mode = minCount(500)
  )
)

# train using transforms !
model <- rxFastLinear(
  RATING ~ Features,
  data = train,
  mlTransforms = transformRule,
  type = "regression" # not binary (numeric regression)
)

We can then look at the coefficients associated with these features (presence of n-grams) to assess their impact on the overall rating. By this standard, the top 10 words or word-pairs contributing to a negative rating are:

boring       -7.647399
waste        -7.537471
not          -6.355953
nothing      -6.149342
money        -5.386262
bad          -5.377981
no           -5.210301
worst        -5.051558
poorly       -4.962763
disappointed -4.890280

Similarly, the top 10 words or word-pairs associated with a positive rating are:

will      3.073104
the|best  3.265797
love      3.290348
life      3.562267
wonderful 3.652950
,|and     3.762862
you       3.889580
excellent 3.902497
my        4.454115
great     4.552569

Another option is simply to look at the sentiment score for each review, which can be extracted using the getSentiment function.

sentimentScores <- rxFeaturize(data=data,
                    mlTransforms = getSentiment(vars = 
                                     list(SentimentScore = "REVIEW_TEXT")))

As we expect, a negative sentiment (in the 0–0.5 range) is associated with 1- and 2-star reviews, while a positive sentiment (0.5–1.0) is associated with the 4- and 5-star reviews.
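One quick way to check that relationship, assuming the scores and the original ratings have been combined into a local data frame called reviews (a hypothetical name) with columns RATING and SentimentScore, is a simple aggregation:

# Average sentiment score per star rating
aggregate(SentimentScore ~ RATING, data = reviews, FUN = mean)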

You can find more details on this analysis, including the Microsoft R code, at the link below.

Microsoft Technologies Blog for Enterprise Developers: Analyze your text in R (MicrosoftML)

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Source:: R News

Why to use the replyr R package

By John Mount


Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = "local")
#> * Using Spark: 2.1.0
d <- copy_to(sc, data.frame(x = c(1, 2)))  # a 2-row, 1-column example table

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() may break. nrow() used to return something other than NA, so older work may no longer be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).

Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.

The explanation is: “tibble::truncate uses nrow()” and “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”.

A little digging gets us to this:

The above might make sense if tibble and dbplyr were the only users of dim(), ncol() or nrow().

Frankly if I call nrow() I expect to learn the number of rows in a table.

The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol() and sdf_nrow() (instead of tibble adapting). Even if practical (there are already a lot of existing sparklyr analyses), this prohibits the writing of generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don’t see any obvious, convenient “no, please really calculate the number of rows for me” option (other than “d %>% tally %>% pull“).
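Spelled out, that workaround for an arbitrary dplyr source d is:

# Force an actual row count on a remote dplyr/dbplyr/sparklyr source
d %>% tally() %>% pull()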

I admit, calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables often deliberately carry explicit meta-data to make it possible for nrow() to be a cheap operation.

Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code many places increases the value of the code (without user facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:

library("replyr")
packageVersion("replyr")
#> [1] '0.5.4'

replyr_hasrows(d)
#> [1] TRUE
replyr_dim(d)
#> [1] 2 1
replyr_ncol(d)
#> [1] 1
replyr_nrow(d)
#> [1] 2

spark_disconnect(sc)

Note: the above is only working properly in the development version of replyr, as I only found out about the issue and made the fix recently.

replyr_hasrows() was added as I found in many projects the primary use of nrow() was to determine if there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of dealing with different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).

The point of replyr is to provide re-usable work arounds of design choices far away from our influence.

Source:: R News

Community Call – rOpenSci Software Review and Onboarding

By Stefanie Butland

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

Are you thinking about submitting a package to rOpenSci’s open peer software review? Considering volunteering to review for the first time? Maybe you’re an experienced package author or reviewer and have ideas about how we can improve.

Join our Community Call on Wednesday, September 13th. We want to get your feedback and we’d love to answer your questions!

Agenda

  1. Welcome (Stefanie Butland, rOpenSci Community Manager, 5 min)
  2. guest: Noam Ross, editor (15 min)
    Noam will give an overview of the rOpenSci software review and onboarding, highlighting the role editors play and how decisions are made about policies and changes to the process.
  3. guest: Andee Kaplan, reviewer (15 min)
    Andee will give her perspective as a package reviewer, sharing specifics about her workflow and her motivation for doing this.
  4. Q & A (25 min, moderated by Noam Ross)

Speaker bios

Andee Kaplan is a Postdoctoral Fellow at Duke University. She is a recent PhD graduate from the Iowa State University Department of Statistics, where she learned a lot about R and reproducibility by developing a class on data stewardship for Agronomists. Andee has reviewed multiple (two!) packages for rOpenSci, iheatmapr and getlandsat, and hopes to one day be on the receiving end of the review process.

Andee on GitHub, Twitter

Noam Ross is one of rOpenSci’s four editors for software peer review. Noam is a Senior Research Scientist at EcoHealth Alliance in New York, specializing in mathematical modeling of disease outbreaks, as well as training and standards for data science and reproducibility. Noam earned his Ph.D. in Ecology from the University of California-Davis, where he founded the Davis R Users’ Group.

Noam on GitHub, Twitter

Resources

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.


Source:: R News

Create and Update PowerPoint Reports using R

By Tim Bock

Chart created in PowerPoint using R

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In my sordid past, I was a data science consultant. One thing about data science that they don’t teach you at school is that senior managers in most large companies require reports to be in PowerPoint. Yet, I like to do my more complex data science in R – PowerPoint and R are not natural allies. As a result, creating and updating PowerPoint reports using R can be painful.

In this post, I discuss how to make R and PowerPoint work efficiently together. The underlying assumption is that R is your computational engine and that you are trying to get outputs into PowerPoint. I compare and contrast three tools for creating and updating PowerPoint reports using R: the free ReporteRs package and two commercial products, Displayr and Q.


Option 1: ReporteRs

The first approach to getting R and PowerPoint to work together is to use David Gohel’s ReporteRs. To my mind, this is the most “pure” of the approaches from an R perspective. If you are an experienced R user, this approach works in pretty much the way that you will expect it to work.

The code below creates 250 crosstabs, conducts a significance test on each, and, whenever the p-value is less than 0.05, adds a slide containing the result. And, yes, I know this is p-hacking, but this post is about how to use PowerPoint and R, not how to do statistics…

 
library(devtools)
devtools::install_github('davidgohel/ReporteRsjars')
devtools::install_github('davidgohel/ReporteRs')
install.packages(c('ReporteRs', 'haven', 'vcd', 'ggplot2', 'reshape2'))
library(ReporteRs)
library(haven)
library(vcd)
library(ggplot2)
library(reshape2)
dat = read_spss("http://wiki.q-researchsoftware.com/images/9/94/GSSforDIYsegmentation.sav")
filename = "c://delete//Significant crosstabs.pptx" # the document to produce
document = pptx(title = "My significant crosstabs!")
alpha = 0.05 # The level at which the statistical testing is to be done.
dependent.variable.names = c("wrkstat", "marital", "sibs", "age", "educ")
all.names = names(dat)[6:55] # Variables 6 through 55 in the file.
counter = 0
for (nm in all.names)
    for (dp in dependent.variable.names)
    {
        if (nm != dp)
        {
            v1 = dat[[nm]]
            if (is.labelled(v1))
                v1 = as_factor(v1)
            v2 = dat[[dp]]
            l1 = attr(v1, "label")
            l2 = attr(v2, "label")
            if (is.labelled(v2))
                v2 = as_factor(v2)
            if (length(unique(v1)) < 10 && length(unique(v2)) < 10) # only crosstab variables with few categories (threshold illustrative)
            {
                x = table(v1, v2)
                x = x[rowSums(x) > 0, colSums(x) > 0, drop = FALSE]
                ch = chisq.test(x)
                p = ch$p.value
                if (!is.na(p) && p < alpha)
                {
                    # Significant: add a slide with a title and a simple chart.
                    # The chart style and slide layout below are illustrative choices.
                    counter = counter + 1
                    if (is.null(l1)) l1 = nm
                    if (is.null(l2)) l2 = dp
                    x.df = melt(prop.table(x), varnames = c("v1", "v2"))
                    plt = ggplot(x.df, aes(x = factor(v1), y = value, fill = factor(v2))) +
                        geom_bar(stat = "identity", position = "dodge") +
                        labs(x = l1, y = "Proportion", fill = l2)
                    document = addSlide(document, slide.layout = "Title and Content")
                    document = addTitle(document, paste0(l1, " by ", l2, " (p = ", round(p, 4), ")"))
                    document = addPlot(document, fun = print, x = plt)
                }
            }
        }
    }
writeDoc(document, file = filename)

Below we see one of the admittedly ugly slides created using this code. With more time and expertise, I am sure I could have done something prettier. A cool aspect of the ReporteRs package is that you can edit the file in PowerPoint and later have R update any charts and other outputs originally created in R.


Option 2: Displayr

A completely different approach is to author the report in Displayr, and then export the resulting report from Displayr to PowerPoint.

This has advantages and disadvantages relative to using ReporteRs. First, I will start with the big disadvantage, in the hope of persuading you of my objectivity (disclaimer: I have no objectivity, I work at Displayr).

Each page of a Displayr report is created interactively, using a mouse and clicking and dragging things. In my earlier example using ReporteRs, I only created pages where there was a statistically significant association. Currently, there is no way of doing such a thing in Displayr.

The flip side of using a graphical user interface like Displayr’s is that it is a lot easier to create attractive visualizations. As a result, the user has much greater control over the look and feel of the report. For example, the screenshot below shows a PowerPoint document created by Displayr. All but one of the charts has been created using R, and the first two are based on a moderately complicated statistical model (a latent class rank-ordered logit model).

You can access the document used to create the PowerPoint report with R here (just sign in to Displayr first) – you can poke around and see how it all works.

A benefit of authoring a report using Displayr is that the user can access the report online, interact with it (e.g., filter the data), and then export precisely what they want. You can see this document as it is viewed by a user of the online report here.

Demographics dashboard example from Displayr


Option 3: Q

A third approach for authoring and updating PowerPoint reports using R is to use Q, which is a Windows program designed for survey reporting (same disclaimer as with Displayr). It works by exporting and updating results to a PowerPoint document. Q has two different mechanisms for exporting R analyses to PowerPoint. First, you can export R outputs, including HTMLwidgets, created in Q directly to PowerPoint as images. Second, you can create tables using R and then have these exported as native PowerPoint objects, such as Excel charts and PowerPoint tables.


In Q, a Report contains a series of analyses. Analyses can be created either using R or using Q's own internal calculation engine, which is designed for producing tables from survey data.

The map above (in the Displayr report) is an HTMLwidget created using the plotly R package. It draws data from a table called Region, which would also be shown in the report. (The same R code in the Displayr example can be used in an R object within Q). So when exported into PowerPoint, it creates a page, using the PowerPoint template, where the title is Responses by region and the map appears in the middle of the page.

The screenshot below is showing another R chart created in PowerPoint. The data has been extracted from Google Trends using the gtrendsR R package. However, the chart itself is a standard Excel chart, attached to a spreadsheet containing the data. These slides can then be customized using all the normal PowerPoint tools and can be automatically updated when the data is revised.


Explore the Displayr example

You can access the Displayr document used to create and update the PowerPoint report with R here (just sign in to Displayr first). Here, you can poke around and see how it all works or create your own document.

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.


Source:: R News

Pacific Island Hopping using R and iGraph

By Peter Prevos

Pacific island hopping

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Last month I enjoyed a relaxing holiday in the tropical paradise of Vanuatu. One rainy day I contemplated how to go island hopping across the Pacific Ocean, visiting as many island nations as possible. The Pacific Ocean is a massive body of water between Asia and the Americas, covering about a third of the surface of the earth. The southern Pacific is strewn with island nations from Australia to Chile. In this post, I describe how to use R to plan your next Pacific island hopping journey.

The Pacific Ocean.

Listing all airports

My first step was to create a list of flight connections between each of the island nations in the Pacific Ocean. I am not aware of a publicly available data set of international flights, so unfortunately I created the list manually (if you do know of such a data set, then please leave a comment).

My manual research resulted in a list of international flights from or to island airports. This list might not be complete, but it is a start. My Pinterest board with Pacific island airline route maps was the information source for this list.

The first code section reads the list of airline routes and uses the ggmap package to extract their coordinates from Google maps. The data frame with airport coordinates is saved for future reference to avoid repeatedly pinging Google for the same information.

# Init
library(tidyverse)
library(ggmap)
library(ggrepel)
library(geosphere)

# Read flight list and airport list
flights <- read_csv("PacificFlights.csv")   # file name assumed
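A sketch of the geocoding-and-caching step just described; the airports.csv file name and the From/To column names are assumptions:

# Geocode each airport once and cache the coordinates on disk
airport_names <- unique(c(flights$From, flights$To))
if (file.exists("airports.csv")) {
  airports <- read_csv("airports.csv")
} else {
  airports <- data.frame(airport = airport_names,
                         geocode(airport_names))   # ggmap: one Google Maps query per airport
  write_csv(airports, "airports.csv")
}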

Create the map

To create a map, I modified the code to create flight maps that I published in an earlier post. This code had to be changed to centre the map on the Pacific. Mapping the Pacific Ocean is problematic because the -180 and +180 degree meridians meet around the date line. Longitudes west of the antimeridian are positive, while longitudes east of it are negative.

The world2 data set in the borders function of the ggplot2 package is centred on the Pacific ocean. To enable plotting on this map, all negative longitudes are made positive by adding 360 degrees to them.

# Pacific centric
flights$lon.x[flights$lon.x < 0] <- flights$lon.x[flights$lon.x < 0] + 360
flights$lon.y[flights$lon.y < 0] <- flights$lon.y[flights$lon.y < 0] + 360
airports$lon[airports$lon < 0] <- airports$lon[airports$lon < 0] + 360
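A sketch of the Pacific-centred map itself; the coordinate column names (lon.x/lat.x for origins, lon.y/lat.y for destinations, and airport for the labels) are assumptions:

# World map centred on the Pacific, with routes drawn as straight segments
ggplot() +
  borders("world2", colour = "grey85", fill = "grey90") +
  geom_segment(data = flights,
               aes(x = lon.x, y = lat.x, xend = lon.y, yend = lat.y),
               colour = "darkred", alpha = 0.5) +
  geom_point(data = airports, aes(x = lon, y = lat), colour = "darkblue", size = 1) +
  geom_text_repel(data = airports, aes(x = lon, y = lat, label = airport), size = 3) +
  coord_fixed(xlim = c(110, 300), ylim = c(-50, 30)) +
  theme_void()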

Pacific Island Hopping

This visualisation is aesthetic and full of context, but it is not the best visualisation to solve the travel problem. This map can also be expressed as a graph with nodes (airports) and edges (routes). Once the map is represented mathematically, we can generate travel routes and begin our Pacific Island hopping.

The igraph package converts the flight list to a graph that can be analysed and plotted. The shortest_paths function can then be used to plan routes. If I wanted to travel from Auckland to Saipan in the Northern Mariana Islands, I would have to go through Port Vila, Honiara, Port Moresby, Chuuk and Guam, and then on to Saipan. I am pretty sure there are quicker ways to get there, but this would be an exciting journey through the Pacific.

library(igraph)

# Build an undirected graph from the origin and destination columns
# (column positions assumed) and find a route
g <- graph_from_edgelist(as.matrix(flights[, 1:2]), directed = FALSE)
shortest_paths(g, from = "Auckland", to = "Saipan")

View the latest version of this code on GitHub.

Pacific Flight network

The post Pacific Island Hopping using R and iGraph appeared first on The Devil is in the Data.

To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.


Source:: R News