Association rules using FPGrowth in Spark MLlib through SparklyR

By Longhow Lam

(This article was first published on R – Longhow Lam’s Blog, and kindly contributed to R-bloggers)


Market Basket Analysis or association rules mining can be a very useful technique to gain insights in transactional data sets, and it can be useful for product recommendation. The classical example is data in a supermarket. For each customer we know what the individual products (items) are that he has bought. With association rules mining we can identify items that are frequently bought together. Other use cases for MBA could be web click data, log files, and even questionnaires.

In R there is the package arules to calculate association rules; it makes use of the so-called Apriori algorithm. For data sets that are not too big, calculating rules with arules in R (on a laptop) is not a problem. But when you have very large data sets, you need to do something else. You can:

  • use more computing power (or cluster of computing nodes).
  • use another algorithm, for example FP Growth, which is more scalable. See this blog for some details on Apriori vs. FP Growth.

Or do both of the above by using FPGrowth in Spark MLlib on a cluster. And the nice thing is: you can stay in your familiar RStudio environment!

Spark MLlib and sparklyr

Example Data set

We use the example groceries transactions data in the arules package. It is not a big data set and you would definitely not need more than a laptop, but it is much more realistic than the example given in the Spark MLlib documentation :-).
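For comparison, a minimal arules session on this data looks something like the following sketch (the support and confidence thresholds here are illustrative choices, not the ones used later on Spark):

```r
library(arules)

# the example transactions shipped with arules
data("Groceries")

# classical Apriori rule mining on a laptop
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.3))

# look at a few of the strongest rules
inspect(head(sort(rules, by = "confidence"), 3))
```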

Preparing the data

I am a fan of sparklyr 🙂 It offers a good R interface to Spark and MLlib. You can use dplyr syntax to prepare data on Spark, it exposes many of the MLlib machine learning algorithms in a uniform way. Moreover, it is nicely integrated into the RStudio environment offering the user views on Spark data and a way to manage the Spark connection.

First, connect to Spark, read in the groceries transactional data, and upload it to Spark. I am just using a local Spark install on my Ubuntu laptop.

###### sparklyr code to perform FPGrowth algorithm ############


#### spark connect #########################################
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

#### first create some dummy data ###########################
transactions = readRDS("transactions.RDs")

#### upload to spark #########################################  
trx_tbl  = copy_to(sc, transactions, overwrite = TRUE)

For demonstration purposes, data is copied in this example from the local R session to Spark. For large data sets this is not feasible anymore; in that case data can come from Hive tables (on the cluster).

The figure above shows the products purchased by the first four customers in Spark in an RStudio grid. Although transactional systems will often output the data in this structure, it is not what the FPGrowth model in MLlib expects. It expects the data aggregated by id (customer) and the products inside an array. So there is one more preparation step.

# data needs to be aggregated by id, the items need to be in a list
trx_agg = trx_tbl %>% 
   group_by(id) %>% 
   summarise(
      items = collect_list(item)
   )

The figure above shows the aggregated data: customer 12 has a list of 9 items that he has purchased.

Running the FPGrowth algorithm

We can now run the FPGrowth algorithm, but there is one more thing. Sparklyr does not expose the FPGrowth algorithm (yet); there is no R interface to it. Luckily, sparklyr allows the user to invoke the underlying Scala methods in Spark. We can define a new object with invoke_new:

  uid = sparklyr:::random_string("fpgrowth_")
  jobj = invoke_new(sc, "org.apache.spark.ml.fpm.FPGrowth", uid)

Now jobj is an object of class FPGrowth in Spark.



And by looking at the Scala documentation of FPGrowth, we see that there are more methods we can use. We use the function invoke to specify which column contains the list of items, the minimum confidence, and the minimum support.

FPGmodel = jobj %>% 
    invoke("setItemsCol", "items") %>%
    invoke("setMinConfidence", 0.03) %>%
    invoke("setMinSupport", 0.01)  %>%
    invoke("fit", spark_dataframe(trx_agg))

By invoking fit, the FPGrowth algorithm is fitted and an FPGrowthModel object is returned, on which we can invoke associationRules to get the calculated rules in a Spark data frame:

rules = FPGmodel %>% invoke("associationRules")

The rules in the Spark data frame consist of an antecedent column (the left-hand side of the rule), a consequent column (the right-hand side of the rule) and a column with the confidence of the rule. Note that the antecedent and consequent are lists of items! If needed we can split these lists and collect them to R for plotting or further analysis.
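As a sketch, the rules object can be registered as a sparklyr table, the item arrays flattened with Spark SQL's concat_ws, and the result collected to R; the temporary table name and the separator are arbitrary choices:

```r
library(sparklyr)
library(dplyr)

# register the Spark DataFrame returned by invoke("associationRules")
# so that dplyr verbs can be used on it
rules_tbl <- sdf_register(rules, "assoc_rules")

rules_r <- rules_tbl %>%
  # concat_ws() is passed through to Spark SQL to flatten the item arrays
  mutate(antecedent = concat_ws(" + ", antecedent),
         consequent = concat_ws(" + ", consequent)) %>%
  collect()

head(rules_r)
```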

The invoke statements and rule-extraction statements can of course be wrapped inside functions to make them more reusable. So given the aggregated transactions in a Spark table trx_agg, you can get something like:

# sketch: ml_fpgrowth() and ml_fpgrowth_extract_rules() are user-defined
# wrappers around the invoke calls and rule extraction shown above
GroceryRules = ml_fpgrowth(
  trx_agg
) %>%
  ml_fpgrowth_extract_rules()



The complete R script can be found on my GitHub. If arules in R on your laptop is not workable anymore because of the size of your data, consider FPGrowth in Spark through sparklyr.

cheers, Longhow

To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam’s Blog. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R live class | Professional R Programming | Nov 29-30 Milan

By Quantide

Professional R Programming

(This article was first published on R blog | Quantide – R training & consulting, and kindly contributed to R-bloggers)

Professional R Programming is the sixth and last course of the autumn term. It takes place on November 29-30 at a location close to Milano Lima.
If you have a solid R knowledge and want to boost your programming skills, this course is made for you.
This course will give you an inner perspective of R working mechanisms, as well as tools for addressing your code’s issues and to make it more efficient. Once these concepts are established, you will learn how to create R packages and use them as the fundamental unit of reproducible R code.

Professional R Programming: Outlines

– Base Programming: environments, functions and loops
– Functionals in base R
– The purrr package
– Code style and clarity
– Profiling
– Parallel computation
– Testing and debugging
– Documenting your code: rmarkdown
– Sharing your code: github
– R packages

Professional R Programming is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

This course is limited to a maximum of 6 attendees.


The course location is 550 m (a 7-minute walk) from Milano Centrale station and just 77 m (a 1-minute walk) from the Lima subway station.


If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

Sadly, this is the last course of the autumn term. Our next R classes’ session will be in Spring! Stay in touch for more updates.

In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailor-made course, picking all the topics that are interesting for your organization and dropping the rest.

The post R live class | Professional R Programming | Nov 29-30 Milan appeared first on Quantide – R training & consulting.

To leave a comment for the author, please follow the link and comment on their blog: R blog | Quantide – R training & consulting.


EARL Boston round up

By Mango Solutions

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Now we’ve recovered from over indulging in Boston’s culinary delights, we’re ready to share our highlights from this year’s EARL Boston Conference.

Day 1 highlights

Stack Overflow’s David Robinson kicked off the Conference, using Stack Overflow data to perform all sorts of interesting analyses. Highlights included trends in questions mentioning specific R packages over time, leading to the identification of rival R packages. We found that R is the least disliked language (because it’s the best obviously!); although David cautioned that often people who haven’t used R before haven’t heard of it either.

Richie Cotton’s talk on how DataCamp is a ‘data-inspired’ organisation was particularly entertaining and he was a really engaging speaker. It was also great to hear from Emily Riederer about tidycf; she shared a really good example of the type of data-driven revolution taking place in many financial institutions.

We also enjoyed Patrick Turgeon’s presentation on Quantitative Trading with R. His presentation portrayed quantitative trading as a scientific problem to be investigated using diverse sources of information, open source code and hard work. Free from jargon, Patrick demonstrated that placing bets on the markets does not have to be some mysterious art, but an analytic puzzle like any other.

A brilliant first day was rounded out with an evening reception overlooking the Charles River, where we enjoyed drinks and a chance to catch up with everyone at the Conference. It was a great opportunity to chat with all the attendees to find out which talks they enjoyed and which ones they wanted to catch on day two.

Day 2 highlights

Mara Averick got things moving on day two with a witty and humble keynote talk on the importance of good communication in data science. She may have also confessed to getting R banned in her fantasy basketball league. From having to argue with the internet that Krieger’s most common distinctive phrase is “yep yep”, to always having the correct word for any situation, she gave a fantastic presentation; a key skill for any data scientist (even if she says she isn’t one!).

Keeping up the theme of great communication in data science, Ali Zaidi gave a really clear rundown of what deep learning is, and how existing models can be reused to make pattern recognition practicably applicable even with modest hardware.

Other highlights included both Alex Albright’s and Monika Wahi’s talks. Alex showed us lots of enjoyable small analyses and suggested ideas for finding fun datasets and encouraged us all to share our findings when experimenting. Monika Wahi discussed why SAS still dominates in healthcare and how we can convert people to R. She talked about how R is easier and nicer to read and showed us equivalent code in SAS and R to illustrate her point.

It was tough, but we picked just a few of the many highlights from both days at EARL. Please tweet @EARLConf any of your EARL highlights so we can share them – we would love to see what people enjoyed the most.

We’d like to thank all of our attendees for joining us, our fantastic speakers, and our generous sponsors for making EARL Boston the success it has been.

To hear the latest news about EARL Conferences sign up to our mailing list, and we’ll let you know first when tickets are on sale.

You can now find speaker slides (where available) on the EARL website – just click on the speaker profile and download the file.

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Happy Thanksgiving!

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Today is Thanksgiving Day here in the US, so we’re taking the rest of the week off to enjoy the time with family.

Even if you don’t celebrate Thanksgiving, today is still an excellent day to give thanks to the volunteers who have contributed to the R project and its ecosystem. In particular, give thanks to the R Core Group, whose tireless dedication — in several cases over a period of more than 20 years — was and remains critical to the success and societal contributions of the language we use and love: R. You can contribute financially by becoming a Supporting Member of the R Foundation.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Handling ‘Happy’ vs ‘Not Happy’: Better sentiment analysis with sentimentr in R

By Abdul Majed Raja

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Sentiment analysis is one of the most obvious things data analysts with unlabelled text data (with no score or rating) end up doing in an attempt to extract some insights out of it. Sentiment analysis is also one of the potential research areas for any NLP (Natural Language Processing) enthusiast.

For an analyst, sentiment analysis is a pain in the neck, because most of the primitive packages/libraries handling sentiment analysis perform a simple dictionary lookup and calculate a final composite score based on the number of occurrences of positive and negative words. That often ends up producing a lot of false positives, a very obvious case being ‘happy’ vs ‘not happy’ – negations and, more generally, valence shifters.

Consider this sentence: ‘I am not very happy’. Any Primitive Sentiment Analysis Algorithm would just flag this sentence positive because of the word ‘happy’ that apparently would appear in the positive dictionary. But reading this sentence we know this is not a positive sentence.

While we could build our own way to handle these negations, there are a couple of newer R packages that can do this with ease. One such package is sentimentr, developed by Tyler Rinker.

Installing the package

sentimentr can be installed from CRAN or the development version can be installed from github.
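A typical install looks like the following sketch (the GitHub repository is Tyler Rinker's, as stated above):

```r
# stable version from CRAN
install.packages("sentimentr")

# or the development version from GitHub
# install.packages("devtools")
devtools::install_github("trinker/sentimentr")
```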


Why sentimentr?

The author of the package himself explains what sentimentr does that other packages don't, and why it matters:

“sentimentr attempts to take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup. The next questions address why it matters.”

Sentiment Scoring:

sentimentr offers sentiment analysis with two functions: 1. sentiment_by() 2. sentiment()

Aggregated (Averaged) Sentiment Score for a given text with sentiment_by

sentiment_by('I am not very happy', by = NULL)

   element_id sentence_id word_count   sentiment
1:          1           1          5 -0.06708204

But this might not help much when we have multiple sentences with different polarity; hence sentence-level scoring with sentiment() would help here.

sentiment('I am not very happy. He is very happy')

   element_id sentence_id word_count   sentiment
1:          1           1          5 -0.06708204
2:          1           2          4  0.67500000

Both the functions return a dataframe with four columns:

1. element_id – ID / Serial Number of the given text
2. sentence_id – ID / Serial Number of the sentence and this is equal to element_id in case of sentiment_by
3. word_count – Number of words in the given sentence
4. sentiment – Sentiment Score of the given sentence

Extract Sentiment Keywords

The extract_sentiment_terms() function helps us extract the keywords – both positive and negative – that were part of the sentiment score calculation. sentimentr also supports the pipe operator %>%, which makes it easier to write multiple lines of code with fewer assignments, and cleaner code too.

'My life has become terrible since I met you and lost money' %>% extract_sentiment_terms()
   element_id sentence_id      negative positive
1:          1           1 terrible,lost    money

Sentiment Highlighting:

And finally, the highlight() function coupled with sentiment_by() gives an HTML output with parts of sentences nicely highlighted in green and red to show their polarity. Trust me, this might seem trivial, but it really helps when making presentations to share the results, discuss false positives, and identify the room for improvement in accuracy.

'My life has become terrible since I met you and lost money. But I still have got a little hope left in me' %>% 
  sentiment_by(by = NULL) %>%
  highlight()

Output Screenshot:

Try using sentimentr for your sentiment analysis and text analytics project and do share your feedback in comments. Complete code used here is available on my github.

    Related Post

    1. Creating Reporting Template with Glue in R
    2. Predict Employee Turnover With Python
    3. Making a Shiny dashboard using ‘highcharter’ – Analyzing Inflation Rates
    4. Time Series Analysis in R Part 2: Time Series Transformations
    5. Time Series Analysis in R Part 1: The Time Series Object
    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


    Learnings from 5 months of R-Ladies Chicago (Part 1)

    By David Smith


    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    by Angela Li, founder and organizer of R-Ladies Chicago. This article also appears on Angela’s personal blog.

    It’s been a few months since I launched R-Ladies Chicago, so I thought I’d sit down and write up some things that I’ve learned in the course of organizing this wonderful community. Looking back, there are a few things I wish someone told me at the beginning of the process, which I’ll share here over the course of a few weeks. The hope is that you can use these learnings to organize tech communities in your own area.

    Chicago #RLadies ❤️ R! Group photo from tonight’s Meetup @MicrosoftR (thanks @revodavid for the shirts) @RLadiesGlobal @d4tagirl #rstats

    — R-Ladies Chicago (@RLadiesChicago) October 27, 2017

    Note: some of this information may be more specific to Chicago, a major city in the US that has access to many resources, or to R-Ladies as a women’s group in particular. I tried to write down more generalizable takeaways. If you’re interested in starting a tech meetup in your community, R-related or not, use this series as a resource!

    Part 1: Starting the Meetup

    Sometimes, it just takes someone who’s willing:

    • The reason R-Ladies didn’t exist in Chicago before July wasn’t because there weren’t women using R in Chicago, or because there wasn’t interest in the community in starting a Meetup. It was just that no one had gotten around to it. I had a few people after our first Meetup say to me that they’d been interested in a R-Ladies Meetup for a long time, but had been waiting for someone to organize it. Guess what…that person could be you!
    • I also think people might have been daunted by the prospect of starting a Chicago Meetup, because it’s such a big city and it’s intimidating to organize a group without already knowing a few people. If you’re in a smaller place, consider it a benefit that you’re drawing from a smaller, tight-knit community of folks who use R.
    • If you can overcome the hurdle of starting something, you’ll be amazed at how many people will support you. Start the group and folks will come.

    You don’t have to be the most skilled programmer to lead a successful Meetup:

    • This is a super harmful mindset, and something that plagues women in tech in particular. I struggled with this a lot at the outset: “I’m not qualified to lead this. There’s no way I can explain in great detail EVERY SINGLE THING about R to someone else. Heck, I haven’t used R for 10+ years. Someone else would probably be better at this than I am.”
    • The thing is, the skillset that it takes to organize a community is vastly different than the skillset that it takes to write code. You’re thinking about how to welcome beginners, encourage individuals to contribute, teach new skills, and form relationships between people. The very things that you believe make you “less qualified” as a programmer are the exact things that are valuable in this context — you understand the struggles of learning R, because you were recently going through that process yourself. Or you’re more accessible for someone to ask questions to, because they aren’t intimidated by you.
    • Being able to support and encourage your fellow R users is something you can do no matter what your skill level is. There are women in our group who have scads more experience in R than I do. That’s fantastic! My job as an organizer is to showcase and use the skills of the individuals in our community, and if I can get these amazing women to lead workshops, that’s less work for me AND great for them! Pave the way for people to do awesome stuff.

    Get yourself cheerleaders:

    • I cannot emphasize enough how important it was to have voices cheering me on as I was setting this up. The women in my office who told me they’d help me set up the first Meetup. My friends who told me they’d come to the first meeting (even if they didn’t use R). The R-Ladies across the globe who were so supportive and excited that a Chicago group was starting. When I doubted myself, there was someone there to encourage me.
    • Even better, start a Meetup with a friend and cheer each other on! If I could do this over again, I’d make sure I had co-organizers from the very start. More about this in weeks to come.
    • Especially if you’re starting a R-Ladies group, realize that there’s a wider #rstats and @RLadiesGlobal community in place to support you. Each group is independent and has its own needs and goals, but there are so many people to turn to if you have questions. The beauty of tech communities is that you’re often already connected to people across the globe through online platforms. All you need to know is: you’re not going it alone!
    • The R community itself is incredibly supportive, and I’d be remiss if I didn’t mention how much support R-Ladies Chicago has received from David Smith and the team at Microsoft. Not only did they promote the group, but I reached out to David a month or so after starting the group, and he immediately offered to sponsor the group, get swag, and provide space for us. R-Ladies Chicago would be in a different place without Microsoft’s generous contributions. I’m grateful for their support of the group as we got off the ground.

    Next week: I’ll be talking about Meetup-to-Meetup considerations, or things you should be thinking about in the process of organizing an event for your group!

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


    Arbitrary Data Transforms Using cdata

    By John Mount


    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    We have been writing a lot on higher-order data transforms lately:

    What I want to do now is “write a bit more, so I finally feel I have been concise.”

    The cdata R package supplies general data transform operators.

    • The whole system is based on two primitives or operators cdata::moveValuesToRowsD() and cdata::moveValuesToColumnsD().
    • These operators have pivot, un-pivot, one-hot encode, transpose, moving multiple rows and columns, and many other transforms as simple special cases.
    • It is easy to write many different operations in terms of the cdata primitives.
    • These operators can work in memory or at big data scale (with databases and Apache Spark; for big data we use the cdata::moveValuesToRowsN() and cdata::moveValuesToColumnsN() variants).
    • The transforms are controlled by a control table that itself is a diagram of or picture of the transform.

    We will end with a quick example, centered on pivoting/un-pivoting values to/from more than one column at the same time.

    Suppose we had some sales data supplied as the following table:

    SalesPerson Period BookingsWest BookingsEast
    a 2017Q1 100 175
    a 2017Q2 110 180
    b 2017Q1 250 0
    b 2017Q2 245 0

    Suppose we are interested in adding a derived column: which region the salesperson made most of their bookings in.

    library("cdata")
    ## Loading required package: wrapr

    # reconstructed: d holds the sales table shown above
    d <- data.frame(SalesPerson = c("a","a","b","b"),
                    Period = c("2017Q1","2017Q2","2017Q1","2017Q2"),
                    BookingsWest = c(100,110,250,245),
                    BookingsEast = c(175,180,0,0))
    d <- d %.>% 
      dplyr::mutate(., BestRegion = ifelse(BookingsWest > BookingsEast, "West",
                                    ifelse(BookingsEast > BookingsWest, "East", "Both")))

    Our notional goal is (as part of a larger data processing plan) to reformat the data into a thin/tall table, or an RDF-triple-like form. Further suppose we wanted to copy the derived column into every row of the transformed table (perhaps to make some other step involving this value easy).

    We can use cdata::moveValuesToRowsD() to do this quickly and easily.

    First we design what is called a transform control table.

    cT1 <- data.frame(Region = c("West", "East"),
                      Bookings = c("BookingsWest", "BookingsEast"),
                      BestRegion = c("BestRegion", "BestRegion"),
                      stringsAsFactors = FALSE)
    ##   Region     Bookings BestRegion
    ## 1   West BookingsWest BestRegion
    ## 2   East BookingsEast BestRegion

    In a control table:

    • The column names specify new columns that will be formed by cdata::moveValuesToRowsD().
    • The values specify where to take values from.

    This control table is called “non-trivial”, as it does not correspond to a simple pivot/un-pivot (those control tables all have two columns). The control table is a picture of the mapping we want to perform.

    An interesting fact is cdata::moveValuesToColumnsD(cT1, cT1, keyColumns = NULL) is a picture of the control table as a one-row table (and this one row table can be mapped back to the original control table by cdata::moveValuesToRowsD(), these two operators work roughly as inverses of each other; though cdata::moveValuesToRowsD() operates on rows and cdata::moveValuesToColumnsD() operates on groups of rows specified by the keying columns).

    The mnemonic is:

    • cdata::moveValuesToColumnsD() converts arbitrary grouped blocks of rows that look like the control table into many columns.
    • cdata::moveValuesToRowsD() converts each row into row blocks that have the same shape as the control table.

    Because pivot and un-pivot are fairly common needs, cdata also supplies functions that pre-populate the control tables for these operations (buildPivotControlTableD() and buildUnPivotControlTable()).

    To design any transform you draw out the control table and then apply one of these operators (you can pretty much move from any block structure to any block structure by chaining two or more of these steps).

    We can now use the control table to supply the same transform for each row.

    # reconstructed pipeline: derive Year/Quarter, then apply the control table
    d %.>% 
      dplyr::mutate(., 
                    Quarter = substr(Period,5,6),
                    Year = as.numeric(substr(Period,1,4)))  %.>% 
      dplyr::select(., -Period)  %.>% 
      moveValuesToRowsD(., 
                        controlTable = cT1, 
                        columnsToCopy = c('SalesPerson', 
                                          'Year', 
                                          'Quarter')) %.>% 
      arrange_se(., c('SalesPerson', 'Year', 'Quarter', 'Region'))  %.>% 
      knitr::kable(.)
    SalesPerson Year Quarter Region Bookings BestRegion
    a 2017 Q1 East 175 East
    a 2017 Q1 West 100 East
    a 2017 Q2 East 180 East
    a 2017 Q2 West 110 East
    b 2017 Q1 East 0 West
    b 2017 Q1 West 250 West
    b 2017 Q2 East 0 West
    b 2017 Q2 West 245 West

    Notice we were able to easily copy the extra BestRegion values into all the correct rows.

    It can be hard to figure out how to specify such a transformation in terms of pivots and un-pivots. However, as we have said: by drawing control tables one can easily design and manage fairly arbitrary data transform sequences (often stepping through either a denormalized intermediate where all values per-instance are in a single row, or a thin intermediate like the triple-like structure we just moved into).

    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


    Mapping “world cities” in R

    By Sharp Sight

    (This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

    Here at Sharp Sight, we make a lot of maps.

    There are a few reasons for this.

    First, good maps are typically ‘information dense.’ You can get a lot of information at a glance from a good map. They are good visualization tools for finding and communicating insights.

    Second, it’s extremely easy to get data that you can use to make a map. From a variety of sources, you’ll find data about cities, states, counties, and countries. If you know how to retrieve this data and wrangle it into shape, it will be easy to find data that you can use to make a map.

    Finally, map making is just good practice. To create a map like the one we’re about to make, you’ll typically need to use a variety of data wrangling and data visualization tools. Maps make for excellent practice for intermediate data scientists who have already mastered some of the basics.

    With that in mind, this week we’ll make a map of “world cities.” This set of cities has been identified by the Globalization and World Cities (GaWC) Research Network as being highly connected and influential in the world economy.

    We’re going to initially create a very basic map, but we’ll also create a small multiple version of the map (broken out by GaWC ranking).

    Let’s get started.

    First, we’ll load the packages that we’ll need.


    Next, we’ll input the cities by hard coding them as data frames. To be clear, there is more than one way to do this (e.g., we could scrape the data), but there isn’t that much data here, so doing this manually is acceptable.
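    As a sketch, the hard-coded data frames might look like the following; the city lists shown here are abbreviated and illustrative, not the full GaWC lists:

```r
# one data frame per GaWC tier (abbreviated, illustrative city lists)
df_alpha_plus_plus <- data.frame(city = c("London", "New York"))
df_alpha_plus      <- data.frame(city = c("Singapore", "Hong Kong", "Paris", "Tokyo"))
df_alpha           <- data.frame(city = c("Sydney", "Sao Paulo", "Milan", "Chicago"))
df_alpha_minus     <- data.frame(city = c("Boston", "Johannesburg", "Montreal"))
```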


    Now, we'll create a new variable called rating. This will contain the global city rating.

    Notice that this is a very straightforward use of dplyr::mutate(), one of the tidyverse functions you should definitely master.

    df_alpha_plus_plus <- df_alpha_plus_plus %>% mutate(rating = 'alpha++')
    df_alpha_plus <- df_alpha_plus %>% mutate(rating = 'alpha+')
    df_alpha <- df_alpha %>% mutate(rating = 'alpha')
    df_alpha_minus <- df_alpha_minus %>% mutate(rating = 'alpha-')

    Next, we’ll combine the different data frames into one using rbind().
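The rbind() call is missing from this excerpt. Because the four tier data frames share the same columns after the mutate() step, stacking them is one call; a self-contained sketch (the tiny stand-in data frames are illustrative, and the name alpha_cities matches its later use):

```r
# Minimal stand-ins for the tier data frames built earlier (illustrative)
df_alpha_plus_plus <- data.frame(city = "London",    rating = "alpha++")
df_alpha_plus      <- data.frame(city = "Singapore", rating = "alpha+")
df_alpha           <- data.frame(city = "Frankfurt", rating = "alpha")
df_alpha_minus     <- data.frame(city = "Oslo",      rating = "alpha-")

# rbind() stacks them because all four share the same columns
alpha_cities <- rbind(df_alpha_plus_plus, df_alpha_plus, df_alpha, df_alpha_minus)
```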


    Now that the data are combined into a single data frame, we'll get the longitude and latitude using geocode().


    Once we have the longitude and latitude data, we need to combine it with the original data in the alpha_cities data frame. To do this, we will use cbind().

    data_geo <- geocode(as.character(alpha_cities$city))  # intermediate name assumed
    alpha_cities <- cbind(alpha_cities, data_geo) %>% rename(long = lon)

    Now we have the data that we need, but we’ll need to clean things up a little.

    In the visualization we’ll make, we will need to use the faceting technique from ggplot2. When we do this, we’ll facet on the rating variable, but we will need the levels of that variable to be ordered properly (otherwise the facets will be out of order).

    To reorder the factor levels of rating, we will use fct_relevel().

    # - the global city ratings should be ordered
    #   i.e., alpha++, then alpha+ ....
    # - to do this, we'll use forcats::fct_relevel()
    alpha_cities <- alpha_cities %>%
      mutate(rating = fct_relevel(rating, 'alpha++', 'alpha+', 'alpha', 'alpha-'))

    Because we will be building a map, we'll need to retrieve a map of the world. We can get a world map by using map_data("world").

    map_world <- map_data("world")

    Ok. We basically have everything we need. Now we will make a simple first draft.

    ggplot() +
      geom_polygon(data = map_world, aes(x = long, y = lat, group = group)) +
      geom_point(data = alpha_cities, aes(x = long, y = lat), color = 'red')

    … and now we’ll use the faceting technique to break out our plot using the rating variable.

    ggplot() +
      geom_polygon(data = map_world, aes(x = long, y = lat, group = group)) +
      geom_point(data = alpha_cities, aes(x = long, y = lat), color = 'red') +
      #facet_grid(. ~ rating)
      #facet_grid(rating ~ .)
      facet_wrap(~ rating)

    Once again, this is a good example of an intermediate-level project that you could do to practice your data wrangling and data visualization skills.

    Having said that, before you attempt to do something like this yourself, I highly recommend that you first master the individual tools that we used here (i.e., the tools from ggplot2, dplyr, and the tidyverse).

    Sign up now, and discover how to rapidly master data science

    To master data science, you need to master the essential tools.

    And to make rapid progress, you need to know what to learn, what not to learn, and you need to know how to practice what you learn.

    Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

    Sign up now for our email list, and you’ll receive regular tutorials and lessons.

    You’ll learn:

    • What data science tools you should learn (and what not to learn)
    • How to practice those tools
    • How to put those tools together to execute analyses and machine learning projects
    • … and more

    If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.


    The post Mapping “world cities” in R appeared first on SHARP SIGHT LABS.

    To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News

    Tips for A/B Testing with R

    By INWT-Blog-RBloggers

    Example for the course of the p-value during a test

    (This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

    Which layout of an advertisement leads to more clicks? Would a different color or position of the purchase button lead to a higher conversion rate? Does a special offer really attract more customers – and which of two phrasings would be better?

    For a long time, people trusted their gut feeling to answer these questions. Today, all of them can be answered by conducting an A/B test. For this purpose, visitors of a website are randomly assigned to one of two groups, between which the target metric (e.g., click-through rate or conversion rate) can then be compared. Thanks to this randomization, the groups do not systematically differ in any other relevant dimension. This means: if your target metric takes a significantly higher value in one group, you can be quite sure that it is because of your treatment and not because of any other variable.

    In comparison to other methods, conducting an A/B test does not require extensive statistical knowledge. Nevertheless, some caveats have to be taken into account.

    When making a statistical decision, there are two possible errors (see also table 1): A Type I error means that we observe a significant result although there is no real difference between our groups. A Type II error means that we do not observe a significant result although there is in fact a difference. The Type I error can be controlled and set to a fixed number in advance, e.g., at 5%, often denoted as α or the significance level. The Type II error in contrast cannot be controlled directly. It decreases with the sample size and the magnitude of the actual effect. When, for example, one of the designs performs way better than the other one, it’s more likely that the difference is actually detected by the test in comparison to a situation where there is only a small difference with respect to the target metric. Therefore, the required sample size can be computed in advance, given α and the minimum effect size you want to be able to detect (statistical power analysis). Knowing the average traffic on the website you can get a rough idea of the time you have to wait for the test to complete. Setting the rule for the end of the test in advance is often called “fixed-horizon testing”.
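The sample-size computation described above can be done in base R with power.prop.test(); the rates below are illustrative:

```r
# Sample size per group needed to detect a lift from a 5% to an 8%
# click-through rate at alpha = 0.05 with 80% power (two-sided test)
res <- power.prop.test(p1 = 0.05, p2 = 0.08, sig.level = 0.05, power = 0.8)
ceiling(res$n)  # observations required in EACH group
```

power.prop.test() lives in the stats package that ships with R, so no extra installs are needed for this step.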

    Table 1: Overview of possible errors and correct decisions in statistical tests

                                   Effect really exists
                                   No                               Yes
    Statistical test     No        True negative                    Type II error (false negative)
    is significant       Yes       Type I error (false positive)    True positive

    Statistical tests generally provide the p-value which reflects the probability of obtaining the observed result (or an even more extreme one) just by chance, given that there is no effect. If the p-value is smaller than α, the result is denoted as “significant”.

    When running an A/B test you may not always want to wait until the end but take a look from time to time to see how the test performs. What if you suddenly observe that your p-value has already fallen below your significance level – doesn’t that mean that the winner has already been identified and you could stop the test? Although this conclusion is very appealing, it can also be very wrong. The p-value fluctuates strongly during the experiment and even if the p-value at the end of the fixed-horizon is substantially larger than α, it can go below α at some point during the experiment. This is the reason why looking at your p-value several times is a little bit like cheating, because it makes your actual probability of a Type I error substantially larger than the α you chose in advance. This is called “α inflation”. At best you only change the color or position of a button although it does not have any impact. At worst, your company provides a special offer which causes costs but actually no gain. The more often you check your p-value during the data collection, the more likely you are to draw wrong conclusions. In short: As attractive as it may seem, don’t stop your A/B test early just because you are observing a significant result. In fact you can prove that if you increase your time horizon to infinity, you are guaranteed to get a significant p-value at some point in time.

    The following code simulates some data and plots the course of the p-value during the test. (For the first samples, which are still very small, R returns a warning that the chi-squared approximation may be incorrect.)

    library(timeDate)
    library(ggplot2)

    # Choose parameters:
    pA <- 0.05     # True click-through rate for group A
    pB <- 0.08     # True click-through rate for group B
    nA <- 500      # Number of cases for group A
    nB <- 500      # Number of cases for group B
    alpha <- 0.05  # Significance level

    # Simulate data:
    data <- data.frame(group = rep(c("A", "B"), c(nA, nB)),
                       timestamp = sample(seq(as.timeDate('2016-06-02'),
                                              as.timeDate('2016-06-09'), by = 1), nA + nB),
                       clickedTrue = as.factor(c(rbinom(n = nA, size = 1, prob = pA),
                                                 rbinom(n = nB, size = 1, prob = pB))))

    # Order data by timestamp (data.frame() stores the timeDate column
    # under this auto-generated name)
    data <- data[order(data$GMT.x..i..), ]
    levels(data$clickedTrue) <- c("0", "1")

    # Compute current p-values after every observation:
    pValues <- c()
    index <- c()
    for (i in 50:dim(data)[1]) {
      presentData <- table(data$group[1:i], data$clickedTrue[1:i])
      if (all(rowSums(presentData) > 0)) {
        pValues <- c(pValues, prop.test(presentData)$p.value)
        index <- c(index, i)
      }
    }
    results <- data.frame(index = index, pValue = pValues)

    # Plot the p-values:
    ggplot(results, aes(x = index, y = pValue)) +
      geom_line() +
      geom_hline(aes(yintercept = alpha)) +
      scale_y_continuous(name = "p-value", limits = c(0, 1)) +
      scale_x_continuous(name = "Observed data points") +
      theme(text = element_text(size = 20))

    The figure below shows an example with 500 observations and true rates of 5% in both groups, i.e., no actual difference. You can see that the p-value nevertheless crosses the threshold several times, but finally takes a very high value. By stopping this test early, it would have been very likely to draw a wrong conclusion.

    Example for the course of a p-value for two groups with no actual difference

    The following code shows you how to test the difference between two rates in R, e.g., click-through rates or conversion rates. You can apply the code to your own data by replacing the URL to the example data with your file path. To test the difference between two proportions, you can use the function prop.test, which is equivalent to Pearson's chi-squared test; for small samples you should use Fisher's exact test instead. prop.test returns a p-value and a confidence interval for the difference between the two rates. The interpretation of a 95% confidence interval is as follows: if you conducted such an analysis many times, 95% of the resulting confidence intervals would contain the true difference. Afterwards you can also take a look at the fluctuations of the p-value during the test by using the code from above.

    library(readr)
    # Specify file path:
    dataPath 
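Since the example file path above is truncated in this excerpt, here is a self-contained sketch of the test itself, using made-up aggregated click counts:

```r
# Hypothetical aggregated results: clicks and visitors per variant
clicks   <- c(500, 580)      # successes in groups A and B
visitors <- c(10000, 10000)  # trials in groups A and B

ab_test <- prop.test(clicks, visitors)
ab_test$p.value   # p-value of the two-sided test
ab_test$conf.int  # 95% confidence interval for the difference in rates
```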

    There are some more pitfalls, but most of them can easily be avoided. First, as a counterpart to stopping your test early because of a significant result, you could gather more data after the planned end of the test because the results have not yet become significant. This likewise leads to α inflation. A second, similar problem arises when running several tests at once: the probability of a false-positive result is α for each individual test, so the overall probability that at least one of the results is false-positive is much larger. Always keep in mind that some of the significant results may have been caused by chance. Third, you can also get into trouble when you reach the required sample size very fast and stop the test after only a few hours. You should always consider that the behavior of users in this specific time slot might not be representative of the general case. To avoid this, plan the duration of the test so that it covers at least 24 hours, or a whole week if customers behave differently at the weekend than on a typical work day. A fourth caveat concerns a rather moral issue: when users discover they are part of an experiment and suffer disadvantages as a result, they might rightly become angry. (This problem will probably not arise because of a different-colored button, but it might because of different prices or special offers.)

    If you are willing to invest some more time, you may want to learn about techniques to avoid α inflation when conducting multiple tests or stopping your test as soon as the p-value crosses a certain threshold. In addition, there are techniques to include previous knowledge in your computations with Bayesian approaches. The latter is especially useful when you have rather small samples, but previous knowledge about the values that your target metric usually takes.

    To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.

    October 2017 New Packages

    By R Views

    (This article was first published on R Views, and kindly contributed to R-bloggers)

    Of the 182 new packages that made it to CRAN in October, here are my picks for the “Top 40”. They are organized into eight categories: Engineering, Machine Learning, Numerical Methods, Science, Statistics, Time Series, Utilities and Visualizations. Engineering is a new category, and its appearance may be an early signal for the expansion of R into a new domain. The Science category is well-represented this month. I think this is the result of the continuing trend for working scientists to wrap their specialized analyses into R packages.

    Engineering


    FlowRegEnvCost v0.1.1: Calculates the daily environmental costs of river-flow regulation by dams based on García de Jalon et al. (2017).

    rroad v0.0.4: Computes and visualizes the International Roughness Index (IRI) given a longitudinal road profile for a single road segment, or for a sequence of segments with a fixed length. For details on The International Road Roughness Experiment establishing a correlation and a calibration standard for measurements, see the World Bank technical paper. The vignette shows an example of a Road Condition Analysis. The following scaleogram was produced from a continuous wavelet transform of a 3D accelerometer signal.

    Machine Learning

    detrendr v0.1.0: Implements a method based on an algorithm by Nolan et al. (2017) for detrending images affected by bleaching. See the vignette.

    MlBayesOpt v0.3.3: Provides a framework for using Bayesian optimization (see Shahriari et al.) to tune hyperparameters for support vector machine, random forest, and extreme gradient boosting models. The vignette shows how to set things up.

    rerf v1.0: Implements an algorithm, Randomer Forest (RerF), developed by Tomita (2016), which is similar to the Random Combination (Forest-RC) algorithm developed by Breiman (2001). Both algorithms form splits using linear combinations of coordinates.

    Numerical Methods

    episode v1.0.0: Provides statistical tools for inferring unknown parameters in continuous time processes governed by ordinary differential equations (ODE). See the Introduction.

    KGode v1.0.1: Implements the kernel ridge regression and the gradient matching algorithm proposed in Niu et al. (2016), and the warping algorithm proposed in Niu et al. (2017) for improving parameter estimation in ODEs.

    Science


    adjclust v0.5.2: Implements a constrained version of hierarchical agglomerative clustering, in which each observation is associated with a position, and only adjacent clusters can be merged. The algorithm, which is time- and memory-efficient, is described in Alia Dehman (2015). There are vignettes on Clustering Hi-C Contact Maps, Implementation Notes, and Inferring Linkage Disequilibrium blocks from Genotypes.

    hsdar v0.6.0: Provides functions for transforming reflectance spectra, calculating vegetation indices and red edge parameters, and spectral resampling for hyperspectral remote sensing and simulation. The Introduction offers several examples.

    mapfuser v0.1.2: Constructs consensus genetic maps with LPmerge (See Endelman and Plomion (2014)) and models the relationship between physical distance and genetic distance using thin-plate regression splines (see Wood (2003)). The vignette explains how to use the package.

    mortAAR v1.0.0: Provides functions for the analysis of archaeological mortality data See Chamberlain (2006). There is a vignette on Lifetables and an Extended Discussion.

    skyscapeR v0.2.2: Provides a tool set for data reduction, visualization and analysis in skyscape archaeology, archaeoastronomy and cultural astronomy. The vignette shows how to use the package.

    Statistics


    BayesRS v0.1.2: Fits hierarchical linear Bayesian models, samples from the posterior distributions of model parameters in JAGS, and computes Bayes factors for group parameters of interest with the Savage-Dickey density ratio (see Wetzels et al. (2009)). There is an Introduction.

    CatPredi v1.1: Allows users to categorize a continuous predictor variable in a logistic or a Cox proportional hazards regression setting, by maximizing the discriminative ability of the model. See Barrio et al. (2015) and Barrio et al. (2017).

    CovTools v0.2.1: Provides a collection of geometric and inferential tools for convenient analysis of covariance structures. For an introduction to covariance in multivariate statistical analysis, see Schervish (1987).

    genlogis v0.5.0: Provides basic distribution functions for a generalized logistic distribution proposed by Rathie and Swamee (2006).

    emmeans v0.9.1: Provides functions to obtain estimated marginal means (EMMs) for many linear, generalized linear, and mixed models, and computes contrasts or linear functions of EMMs, trends, and comparisons of slopes. There are twelve vignettes including The Basics, Comparisons and Contrasts, Confidence Intervals and Tests, Interaction Analysis, and Working with Messy Data.

    ESTER v0.1.0: Provides an implementation of sequential testing that uses evidence ratios computed from the Akaike weights of a set of models. For details see Burnham & Anderson (2004). There is a vignette.

    FarmTest v1.0.0: Provides functions to perform robust multiple testing for means in the presence of latent factors. It uses Huber’s loss function to estimate distribution parameters and accounts for strong dependence among coordinates via an approximate factor model. See Zhou et al. (2017) for details. There is a vignette to get you started.

    miic v0.1: Implements an information-theoretic method which learns a large class of causal or non-causal graphical models from purely observational data, while including the effects of unobserved latent variables, commonly found in many datasets. For more information see Verny et al. (2017).

    modcmfitr v0.1.0: Fits a modified version of the Connor-Mosimann distribution (Connor & Mosimann (1969)), a Connor-Mosimann distribution, or a Dirichlet distribution to elicited quantiles of a multinomial distribution. See the vignette for details.

    pense v1.0.8: Provides a robust penalized elastic net S and MM estimator for linear regression as described in Freue et al. (2017).

    paramtest v0.1.0: Enables running simulations or other functions while easily varying parameters from one iteration to the next. The vignette shows how to run a power simulation.

    rENA v0.1.0: Implements functions to perform epistemic network analysis (ENA), a novel method for identifying and quantifying connections among elements in coded data, and representing them in dynamic network models, which illustrate the structure of connections and measure the strength of association among elements in a network.

    rma.exact v0.1.0: Provides functions to compute an exact CI for the population mean under a random-effects model. For details, see Michael, Thornton, Xie, and Tian (2017).

    Time Series

    carfima v1.0.1: Provides a toolbox to fit a continuous-time, fractionally integrated ARMA (CARFIMA) process on univariate and irregularly spaced time-series data, using a general-order CARFIMA(p, H, q) model for p > q as specified in Tsai and Chan (2005).

    colorednoise v0.0.1: Provides tools for simulating populations with white noise (no temporal autocorrelation), red noise (positive temporal autocorrelation), and blue noise (negative temporal autocorrelation) based on work by Ruokolainen et al. (2009). The vignette describes colored noise.

    nnfor v0.9: Provides functions to facilitate automatic time-series modelling with neural networks. Look here for some help getting started.

    Utilities


    hdf5r v1.0.0: Provides an object-oriented wrapper for the HDF5 API using R6 classes. The vignette shows how to use the package.

    geoops v0.1.2: Provides tools for working with the GeoJSON geospatial data interchange format. There is an Introduction.

    linl v0.0.2: Adds a LaTeX Letter class to rmarkdown, using the pandoc-letter template adapted for use with markdown. See the vignette and README for details.

    oshka v0.1.2: Expands quoted language by recursively replacing any symbol that points to quoted language with the language itself. There is an Introduction and a vignette on Non Standard Evaluation Functions.

    rcreds v0.6.6: Provides functions to read and write credentials to and from an encrypted file. The vignette describes how to use the package.

    RMariaDB v1.0-2: Implements a DBI-compliant interface to MariaDB and MySQL databases.

    securitytxt v0.1.0: Provides tools to identify and parse security.txt files to enable the analysis and adoption of the Web Security Policies draft standard.

    usethis v1.1.0: Automates package and project setup tasks, including setting up unit testing, test coverage, continuous integration, Git, GitHub, licenses, Rcpp, RStudio projects, and more that would otherwise be performed manually. README provides examples.

    xltabr v0.1.1: Provides functions to produce nicely formatted cross tabulations to Excel using openxlsx; the package was developed to help automate the process of publishing Official Statistics. Look here for documentation.

    Visualizations


    iheatmapr v0.4.2: Provides a system for making complex, interactive heatmaps. Look at the webpage for examples.

    otvPlots v0.2.0: Provides functions to automate the visualization of variable distributions over time, and compute time-aggregated summary statistics for large datasets. See the README for an introduction.

    To leave a comment for the author, please follow the link and comment on their blog: R Views.