Survey of Kagglers finds Python, R to be preferred tools

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Competitive predictive modeling site Kaggle conducted a survey of participants in prediction competitions, and the 16,000 responses provide some insights about that user community. (Whether those trends generalize to the wider community of all data scientists is unclear, however.) One question of interest asked what tools Kagglers use at work. Python is the most commonly-used tool within this community, and R is second. (Respondents could select more than one tool.)

Interestingly, the rankings varied according to the job title of the respondent. R or Python was the top-ranked tool for every job-title subgroup except one (database administrators, who preferred SQL), split as follows:

  • R: Business Analyst, Data Analyst, Data Miner, Operations Researcher, Predictive Modeler, Statistician
  • Python: Computer Scientist, Data Scientist, Engineer, Machine Learning Engineer, Other, Programmer, Researcher, Scientist, Software Developer

You can find summaries of the other questions in the survey at the link below. An anonymized dataset of survey responses is also available, as is the “Kaggle Kernel” (a kind of notebook) of the R code behind the survey analysis.

Kaggle: The State of Data Science and Machine Learning, 2017


Free Software Foundation “Social Benefit” Award Nominations

By Ari Lamstein

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Ezra Haber Glenn, the author of the acs package in R, recently posted about the Free Software Foundation’s “Social Benefit” Award on the acs mailing list:

acs.R Community:

The Free Software Foundation is now accepting nominations for the 2017
“Project of Social Benefit Award,” presented to the project or team
responsible for applying free software, or the ideas of the free
software movement, in a project that intentionally and significantly
benefits society in other aspects of life.

If anyone is willing to nominate the acs package, the recognition
would be much appreciated — the package has been generously supported
by MIT and the Puget Sound Regional Council, as well as a great deal
of user-feedback and creative development on the part of the
ACS/Census/R community.

The nomination form is quick and easy — see
https://my.fsf.org/projects-of-social-benefit-award-nomination.
Deadline 11/5.

More info at https://www.fsf.org/awards/sb-award/.

Thanks!

I’m reposting this here for a few reasons.

The first is that I only learned about this award from Ezra’s post, and I think that it’s worth raising awareness of the award itself.

The second is that, in my opinion, the acs package does “intentionally and significantly benefit society.” I have used the acs package over several years to learn more about US demographics. Choroplethr, my R package for creating statistical maps, also uses the acs package to retrieve data from the Census Bureau. Several thousand people have taken my free course on Choroplethr, and each of those people has benefitted from the acs package as well.

Finally, I’m mentioning this award to point out that R package developers receive compensation in different ways. None of us receive monetary compensation when people use our packages. However, Ezra has indicated that getting nominated for this award would be useful to him.

For all these reasons, I was happy to nominate the acs package for the Free Software Foundation’s “Social Benefit” Award. It took me less than 5 minutes to fill out the form. If you are a user of choroplethr, and you enjoy its integration with US Census data, then I encourage you to nominate the acs package as well. You can do so here.

The post Free Software Foundation “Social Benefit” Award Nominations appeared first on AriLamstein.com.


linl 0.0.2: Couple improvements

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Following up on the initial 0.0.1 release of linl, Aaron and I are happy to announce release 0.0.2 which reached the CRAN network on Sunday in a smooth ‘CRAN-pretest-publish’ auto-admittance. linl provides a simple-yet-powerful Markdown—and RMarkdown—wrapper around the venerable LaTeX letter class; see below for an expanded example also included as the package vignette.

This version sets a few sensible default values for font, font size, margins, signature (non-)indentation and more; it also expands the documentation.
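
Getting started is a single draft/render cycle from R; a minimal sketch (the template name "pdf" is an assumption here, so check the package documentation):

library(rmarkdown)

# create a new letter from the linl skeleton (template name assumed)
draft("my_letter.Rmd", template = "pdf", package = "linl", edit = FALSE)

# knit the RMarkdown letter to PDF via the LaTeX letter class
render("my_letter.Rmd")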

The NEWS entry follows:

Changes in linl version 0.0.2 (2017-10-29)

  • Set a few defaults for a decent-looking skeleton and template: font, fontsize, margins, left-justify closing (#3)

  • Blockquote display is now a default as well (#4).

  • Updated skeleton.Rmd and vignette source accordingly

  • Documented new default options (#5 and #6).

  • Links are now by default printed as footnotes (#9).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the linl page. For questions or comments use the issue tracker off the GitHub repo.


This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


pinp 0.0.3: More docs, more features

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Our pinp package for snazzier one- or two-column vignettes received its second update. Now at version 0.0.3, it arrived on CRAN on Saturday with minimal fuss as a ‘CRAN-pretest-publish’ transition.

We added more frontmatter options, documented more, and streamlined some internals of the LaTeX class borrowed from PNAS. A screenshot of the (updated) vignette can be seen below. Additional screenshots are at the pinp page.

The NEWS entry for this release follows.

Changes in pinp version 0.0.3 (2017-10-28)

  • Section ‘Acknowledgements’ now conditional on a frontmatter setting, section ‘Matmethods’ has been removed, pnasbreak no longer used which stabilizes LaTeX float formatting. References are now shown in the column just like other content (Dirk in #36).

  • Vignette now uses new numbered sections frontmatter switch which improves the pdf outline.

  • New front-matter options for title/section header colors, and link colors (Dirk in #39).

  • YAML frontmatter options are now documented in the help page for pinp as well (Dirk in #41).

  • Some typos were fixed (Michael in #42 and #43).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the pinp page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


gg_tweet’ing Power Outages

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

As many folks know, I live in semi-rural Maine and we were hit pretty hard with a wind+rain storm Sunday into Monday. The hrbrmstr compound has had no power (besides a generator) and no stable/high-bandwidth internet (Verizon LTE was heavily congested) since 0500 Monday, and still doesn’t as I write this post.

I’ve played with scraping power outage data from Central Maine Power but there’s a great Twitter account — PowerOutage_us — that has done much of the legwork for the entire country. They don’t cover everything and do not provide easily accessible historical data (likely b/c evil folks wld steal it w/o payment or credit) but they do have a site you can poke at and do provide updates via Twitter. As you’ve seen in a previous post, we can use the rtweet package to easily read Twitter data. And, the power outage tweets are regular enough to identify and parse. But raw data is so…raw.

While one could graph data just for one’s self, I decided to marry this power scraping capability with a recent idea I’ve been toying with adding to hrbrthemes or ggalt: gg_tweet(). Imagine being able to take a ggplot2 object and “plot” it to Twitter, fully conforming to Twitter’s stream or card image sizes. By conforming to these size constraints, they don’t get cropped in the timeline view (if you allow images to be previewed in-timeline). This is even more powerful if you have some helper functions for proper theme-ing (font sizes especially need to be tweaked). Enter gg_tweet().

Power Scraping

We’ll cover scraping @PowerOutage_us first, but we’ll start with all the packages we’ll need and a helper function to convert power outage estimates to numeric values:

library(httr)
library(magick)
library(rtweet)
library(stringi)
library(hrbrthemes)
library(tidyverse)

# convert outage estimates like "1.2+ Million" to numeric values
# (the original definition was mangled by the HTML rendering -- everything
#  between '<' and '>' was stripped -- so this is a plausible reconstruction)
words_to_num <- function(x) {
  map_dbl(x, ~{
    val <- as.numeric(stri_match_first_regex(.x, "([[:digit:].]+)")[, 2])
    mul <- case_when(
      stri_detect_regex(.x, "[Mm]illion") ~ 1e6,
      stri_detect_regex(.x, "[Tt]housand") ~ 1e3,
      TRUE ~ 1
    )
    val * mul
  })
}

Now, I can’t cover setting up rtweet OAuth here. The vignette and package web site do that well.

The bot tweets infrequently enough that this is really all we need (though, bump up n as you need to):

outage <- get_timeline("PowerOutage_us", n = 300) # (call restored; the original assignment was eaten by the HTML rendering)

Yep, that gets the last 300 tweets from said account. It’s amazingly simple.

Now, the outage tweets for the east coast / northeast are not individually uniform but collectively they are (there’s a pattern that may change but you can tweak this if they do):

filter(outage, stri_detect_regex(text, "#(EastCoast|NorthEast)")) %>% 
  mutate(created_at = lubridate::with_tz(created_at, 'America/New_York')) %>% 
  mutate(number_out = words_to_num(text)) %>%  
  ggplot(aes(created_at, number_out)) +
  geom_segment(aes(xend=created_at, yend=0), size=5) +
  scale_x_datetime(date_labels = "%Y-%m-%d\n%H:%M", date_breaks="2 hours") +
  scale_y_comma(limits=c(0,2000000)) +
  labs(
    x=NULL, y="# Customers Without Power",
    title="Northeast Power Outages",
    subtitle="Yay! Twitter as a non-blather data source",
    caption="Data via: @PowerOutage_us "
  ) -> gg

That pipe chain looks for key hashtags (for my area), rejiggers the time zone, and calls the helper function to, say, convert 1.2+ Million to 1200000. Finally it builds a mostly complete ggplot2 object (you should make the max Y limit more dynamic).

You can plot that on your own (print gg). We’re here to tweet, so let’s go into the next section.

Magick Tweeting

@opencpu made it possible to shunt plot output to a magick device. This means we have really precise control over ggplot2 output size as well as the ability to add other graphical components to a ggplot2 plot using magick idioms. One thing we need to take into account is “retina” plots. They are — essentially — double resolution plots (72 => 144 pixels per inch). For the best looking plots we need to go retina, but that also means kicking up base plot theme font sizes a bit. Let’s build on hrbrthemes::theme_ipsum_rc() a bit and make a theme_tweet_rc():

# (the original definition was mangled by the HTML rendering, which stripped
#  everything between '<' and '>'; this is a plausible reconstruction, and the
#  font-size presets are assumptions)
theme_tweet_rc <- function(grid = "XY", retina = TRUE) {

  # kick up the base font sizes for "retina" (double-resolution) output
  font_sizes <- if (retina) c(24, 18, 16, 14, 12) else c(12, 10, 9, 8, 7)

  theme_ipsum_rc(
    grid = grid, 
    plot_title_size = font_sizes[1], 
    subtitle_size = font_sizes[2], 
    axis_title_size = font_sizes[3], 
    axis_text_size = font_sizes[4], 
    caption_size = font_sizes[5]
  )
  
}

Now, we just need a way to take a ggplot2 object and shunt it off to twitter. The following gg_tweet() function does not (now) use rtweet as I’ll likely add it to either ggalt or hrbrthemes and want to keep dependencies to a minimum. I may opt-in to bypass the current method since it relies on environment variables vs an RDS file for app credential storage. Regardless, one thing I wanted to do here was provide a way to preview the image before tweeting.

So you pass in a ggplot2 object (likely adding the tweet theme to it) and a Twitter status text (there’s a TODO to check the length for 140c compliance) plus choose a style (stream or card, defaulting to stream) and decide on whether you’re cool with the “retina” default.

Unless you tell it to send the tweet it won’t, giving you a chance to preview the image before sending, just in case you want to tweak it a bit before committing it to the Twitterverse. It als returns the magick object it creates in the event you want to do something more with it:

# (likewise mangled by the HTML rendering; a plausible reconstruction -- the
#  exact Twitter API call from the original post was lost, so the POST() below
#  is only a sketch and omits authentication)
gg_tweet <- function(g, status = "ggplot2 image", style = c("stream", "card"),
                     retina = TRUE, send = FALSE) {

  style <- match.arg(tolower(style), c("stream", "card"))

  # Twitter stream vs card image sizes (values assumed)
  dims <- switch(style, stream = c(w = 1024, h = 512), card = c(w = 800, h = 320))
  dims["res"] <- 72
  if (retina) dims <- dims * 2 # "retina" = double resolution (72 => 144 ppi)

  # "plot" the ggplot2 object onto a magick device at exactly those dimensions
  fig <- image_graph(width = dims[["w"]], height = dims[["h"]], res = dims[["res"]])
  print(g)
  dev.off()

  if (send) {
    tf <- tempfile(fileext = ".png")
    image_write(fig, path = tf, format = "png")
    # upload the image via the Twitter REST API; the status text would be
    # attached in a follow-up call (details lost from the original post)
    res <- POST("https://upload.twitter.com/1.1/media/upload.json",
                body = list(media = upload_file(tf)))
    warn_for_status(res)
    unlink(tf)
  }
  
  fig
  
}

Two Great Tastes That Taste Great Together

We can combine the power outage scraper & plotter with the tweeting code and just do:

gg_tweet(
  gg + theme_tweet_rc(grid="Y"),
  status = "Progress! #rtweet #gg_tweet",
  send=TRUE
)

That was, in-fact, the last power outage tweet I sent.

Next Steps

Ironically, given current levels of U.S. news and public “discourse” on Twitter and some inane machinations in my own area of domain expertise (cyber), gg_tweet() is likely one of the few ways I’ll be interacting with Twitter for a while. You can ping me on Keybase — hrbrmstr — or join the rstats Keybase team via keybase team request-access rstats if you need to poke me for anything for a while.

FIN

Kick the tyres and watch for gg_tweet() ending up in ggalt or hrbrthemes. Don’t hesitate to suggest (or code up) feature requests. This is still an idea in-progress and definitely not ready for prime time without a bit more churning. (Also, words_to_num() can be optimized, it was hastily crafted).


computational methods for numerical analysis with R [book review]

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

This is a book by James P. Howard, II, which I received from CRC Press for review in CHANCE. (As usual, the customary warning applies: most of this blog post will appear later in my book review column in CHANCE.) It consists of a traditional introduction to numerical analysis with backup from R codes and packages. The early chapters set the scenery, from basics on R to notions of numerical errors, before moving to linear algebra, interpolation, optimisation, integration, differentiation, and ODEs. The book comes with a package, cmna, that reproduces algorithms and testing. While I do not find much originality in the book, given its adherence to simple resolutions of the above topics, I could nonetheless use it for an elementary course in our first year classes — with maybe the exception of the linear algebra chapter, which I did not find very helpful.

“…you can have a solution fast, cheap, or correct, provided you only pick two.” (p.27)

The (minor) issue I have with the book, and that a potential mathematically keen student could face as well, is that there is little in the way of justifying a particular approach to a given numerical problem (as opposed to others) and in characterising the limitations and failures of the presented methods (although this happens from time to time, as e.g. for gradient descent, p.191). [Steeped in my Gallic “mal-être”, I am prone to over-criticise methods during class, to the (increased) despair of my students!, but I also feel that avoiding over-rosy presentations is a good way to avoid later disappointments or even disasters.] In the case of this book, finding [more] ways of detecting would-be disasters would have been nice.

An uninteresting and highly idiosyncratic side comment is that the author preferred the French style for long division to the American one, reminding me of my first exposure to the latter, a few months ago! Another comment from a statistician is that mentioning time series inter- or extra-polation without a statistical model sounds close to anathema! And makes extrapolation a weapon without a cause.

“…we know, a priori, exactly how long the [simulated annealing] process will take since it is a function of the temperature and the cooling rate.” (p.199)

Unsurprisingly, the section on Monte Carlo integration is disappointing for a statistician/probabilistic numericist like me, as it fails to give a complete enough picture of the methodology. All simulations seem to proceed there from a large enough hypercube. And recommending the “fantastic” (p.171) R function integrate as a default is scary, given the ability of the selected integration bounds to mislead its users. Similarly, I feel that the simulated annealing section is not providing enough of a cautionary tale about the highly sensitive impact of cooling rates and absolute temperatures. It is only through the raw output of the algorithm applied to the travelling salesman problem that the novice reader can perceive the impact of some of these factors. (The acceptance bound on the jump (6.9) is incidentally wrongly called a probability on p.199, since it can take values larger than one.)
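
To make that integrate warning concrete, here is a sketch adapted from the examples in R’s own help page for integrate, where a poorly chosen finite upper bound silently returns a near-zero answer:

# the standard normal density integrated three ways
integrate(dnorm, 0, 2)      # correct: ~0.477
integrate(dnorm, 0, Inf)    # correct: 0.5 (infinite bounds are transformed internally)
integrate(dnorm, 0, 20000)  # fails on many systems: ~0 with a small reported error,
                            # because the quadrature points miss all the mass near 0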



Recent updates to the Team Data Science Process

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It’s been over a year since we first introduced the Team Data Science Process (TDSP). The data, technology and practices behind Data Science continue to evolve, and the TDSP has evolved in parallel, with several new facets added over the past year.

For an example of applying the TDSP to effective data science projects, check out Buck Woody’s 10-part series walking through every stage of a typical data science project.

As the practice of data science changes, the TDSP continues to evolve. The TDSP is an open project hosted on Github, and your contributions are welcome.

Cortana Intelligence and Machine Learning Blog: The Microsoft Team Data Science Process (TDSP) – Recent Updates


R live class | Data Visualization and Dashboard with R | Nov 7-8 Milan

By Quantide


(This article was first published on R blog | Quantide – R training & consulting, and kindly contributed to R-bloggers)

Data Visualization and Dashboard with R is our fourth course of the autumn term. It takes place on November 7-8 in a location close to Milano Lima.
This course will teach you how to build beautiful, effective and flexible plots using the most modern R tools for data visualization. Then you will discover how to embed visualizations and tables in a powerful Shiny app, to make your data easily navigable and let its insights emerge.
You should take this course if you have some knowledge of R and would like to deepen your knowledge of data visualization with R, both static data visualization and dashboards.

Data Visualization and Dashboard with R: Outlines

– ggplot2 grammar
– Creating plots with ggplot (Scatter Plot, Line Plot, Bar Plot, Histogram, Box Plot, Surface Plot)
– Customizing Plots (aesthetics, legend, axes, faceting and themes)
– Specialised visualisation tools: ggmap and ggally
– Basic shiny concepts
– The structure of a shiny app
– Shiny: the server side and the user side
– Understanding reactivity in shiny
– An overview of html widgets

Data Visualization and Dashboard with R is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

This course is limited to a maximum of 6 attendees.

Location

The course location is 550 m (a 7-minute walk) from Milano Centrale station and just 77 m (a 1-minute walk) from the Lima subway station.

Registration

If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

You can find an overview of all our courses here. Next dates will be:

  • November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
  • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

If you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange a tailor-made course together, picking the topics that are interesting for your organization and dropping the rest.

The post R live class | Data Visualization and Dashboard with R | Nov 7-8 Milan appeared first on Quantide – R training & consulting.


heatmaply: an R package for creating interactive cluster heatmaps for online publishing

By Tal Galili

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

This post on the heatmaply package is based on my recent paper in the journal Bioinformatics (a link to a stable DOI). The paper was published just last week, and since it is released as CC-BY, I am permitted (and delighted) to republish it here in full. My co-authors for this paper are Jonathan Sidi, Alan O’Callaghan, and Carson Sievert.

Summary: heatmaply is an R package for easily creating interactive cluster heatmaps that can be shared online as a stand-alone HTML file. Interactivity includes a tooltip display of values when hovering over cells, as well as the ability to zoom in to specific sections of the figure from the data matrix, the side dendrograms, or annotated labels. Thanks to the synergistic relationship between heatmaply and other R packages, the user is empowered by a refined control over the statistical and visual aspects of the heatmap layout.

Availability: The heatmaply package is available under the GPL-2 Open Source license. It comes with a detailed vignette, and is freely available from: http://cran.r-project.org/package=heatmaply

Contact: Tal.Galili@math.tau.ac.il

Introduction

A cluster heatmap is a popular graphical method for visualizing high dimensional data. In it, a table of numbers is scaled and encoded as a tiled matrix of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms and extra columns of categorical annotation. The ongoing development of this iconic visualization, spanning over more than a century, has provided the foundation for one of the most widely used of all bioinformatics displays (Wilkinson and Friendly, 2009). When using the R language for statistical computing (R Core Team, 2016), there are many available packages for producing static heatmaps, such as: stats, gplots, heatmap3, fheatmap, pheatmap, and others. Recently released packages also allow for more complex layouts; these include gapmap, superheat, and ComplexHeatmap (Gu et al., 2016). The next evolutionary step has been to create interactive cluster heatmaps, and several solutions are already available. However, these solutions, such as the idendro R package (Sieger et al., 2017), are often focused on providing an interactive output that can be explored only on the researcher’s personal computer. Some solutions do exist for creating shareable interactive heatmaps. However, these are either dependent on a specific online provider, such as XCMS Online, or require JavaScript knowledge to operate, such as InCHlib. In practice, when publishing in academic journals, the reader is left with a static figure only (often in a png or pdf format).

To fill this gap, we have developed the heatmaply R package for easily creating a shareable HTML file that contains an interactive cluster heatmap. The interactivity is based on a client-side JavaScript code that is generated based on the user’s data, after running the following command:

install.packages("heatmaply")
library(heatmaply)
heatmaply(data, file = "my_heatmap.html")

The HTML file contains a publication-ready, interactive figure that allows the user to zoom in as well as see values when hovering over the cells. This self-contained HTML file can be made available to interested readers by uploading it to the researcher’s homepage or as supplementary material on the journal’s server. Concurrently, this interactive figure can be displayed in RStudio’s viewer pane, included in a Shiny application, or embedded in knitr/RMarkdown HTML documents.

The rest of this paper offers guidelines for creating effective cluster heatmap visualization. Figure 1 demonstrates the suggestions from this section on data from project Tycho (van Panhuis et al., 2013), while the online supplementary information includes the interactive version, as well as several examples of using the package on real-world biological data.

Fig. 1. The (square root) number of people infected by Measles in 50 states, from 1928 to 2003. Vaccines were introduced in 1963. (Click the image in the original post for the online interactive version of the plot.)

An interactive version of the measles heatmap (embedded in the post using an iframe)

I uploaded the measles_heatmaply.html to github and then used the following code to embed it in the post:
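
The snippet itself was an HTML iframe; as a sketch, an equivalent embed can be generated from R with htmltools — the URL below is a placeholder, not the actual link from the post:

library(htmltools)

# placeholder URL -- substitute the actual GitHub Pages link to the uploaded file
tags$iframe(
  src = "https://<username>.github.io/measles_heatmaply.html",
  width = "100%", height = "600", frameborder = "0"
)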

Here is the result: the rendered interactive heatmap, embedded at this point in the original post.

heatmaply – a simple example

The generation of cluster heatmaps is a subtle process (Gehlenborg and Wong, 2012; Weinstein, 2008), requiring the user to make many decisions along the way. The major decisions to be made deal with the data matrix and the dendrogram. The raw data often need to be transformed in order to have a meaningful and comparable scale, while an appropriate color palette should be picked. The clustering of the data requires us to decide on a distance measure between the observation, a linkage function, as well as a rotation and coloring of branches that manage to highlight interpretable clusters. Each such decision can have consequences on the patterns and interpretations that emerge. In this section, we go through some of the arguments in the function heatmaply, aiming to make it easy for the user to tune these important statistical and visual parameters. Our toy example visualizes the effect of vaccines on measles infection. The output is given in the static Fig. 1, while an interactive version is available online in the supplementary file “measles.html”. Both were created using:

heatmaply(x = sqrt(measles),
           color = viridis, # the default
           Colv = NULL,
           hclust_method = "average", k_row = NA, # ...
           file = c("measles.html", "measles.png") )

The first argument of the function (x) accepts a matrix of the data. In the measles data, each row corresponds with a state, each column with a year (from 1928 to 2003), and each cell with the number of people infected with measles per 100,000 people. In this example, the data were scaled twice – first by not giving the raw number of cases with measles, but scaling them relative to 100,000 people, thus making it possible to more easily compare between states, and second by taking the square root of the values. This was done since all the values in the data represent the same unit of measure, but come from a right-tailed distribution of count data with some extreme observations. Taking the square root helps with bringing extreme observations closer to one another, helping to avoid an extreme observation from masking the general pattern. Other transformations that may be considered come from the Box-Cox or Yeo-Johnson family of power transformations.

If each column of the data were to represent a different unit of measure, then leaving the values unchanged will often result in the entire figure being unusable due to the column with the largest range of values taking over most of the colors in the figure. Possible per-column transformations include the scale function, suitable for data that are relatively normal. The normalize and percentize functions bring data to the comparable 0 to 1 scale for each column. The normalize function preserves the shape of each column’s distribution by subtracting the minimum and dividing by the maximum of all observations for each column. The percentize function is similar to ranking but with the simpler interpretation of each value being replaced by the percent of observations that have that value or below. It uses the empirical cumulative distribution function of each variable on its own values. The sparseness of the dataset can be explored using is.na10.
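
For a concrete sense of these helpers, here is a short sketch using the standard mtcars and airquality datasets:

library(heatmaply)

heatmaply(normalize(mtcars))   # each column rescaled to [0, 1], distribution shape preserved
heatmaply(percentize(mtcars))  # each value replaced by its empirical percentile
heatmaply(is.na10(airquality)) # explore the pattern of missing values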

Once the data are adequately scaled, it is important to choose a good color palette for the data. Other than being pretty, an ideal color palette should have three (somewhat conflicting) properties: (1) Colorful, spanning as wide a palette as possible so as to make differences easy to see; (2) Perceptually uniform, so that values close to each other have similar-appearing colors compared with values that are far away, consistently across the range of values; and (3) Robust to colorblindness, so that the above properties hold true for people with common forms of colorblindness, as well as printing well in grey scale. The default passed to the color argument in heatmaply is viridis, which offers a sequential color palette, offering a good balance of these properties. Divergent color scale should be preferred when visualizing a correlation matrix, as it is important to make the low and high ends of the range visually distinct. A helpful divergent palette available in the package is cool_warm (other alternatives in the package include RdBu, BrBG, or RdYlBu, based on the RColorBrewer package). It is also advisable to set the limits argument to range from -1 to 1.
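
As a short sketch of that advice for correlation matrices:

library(heatmaply)

# divergent palette plus symmetric limits for a correlation matrix
heatmaply(cor(mtcars), colors = cool_warm, limits = c(-1, 1))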

Passing NULL to the Colv argument, in our example, removed the column dendrogram (since we wish to keep the order of the columns, relating to the years). The row dendrogram is automatically calculated using hclust with a Euclidean distance measure and the average linkage function. The user can choose to use an alternative clustering function (hclustfun), distance measure (dist_method), or linkage function (hclust_method), or to have a dendrogram only in the rows/columns or none at all (through the dendrogram argument). Also, the users can supply their own dendrogram objects into the Rowv (or Colv) arguments. The preparation of the dendrograms can be made easier using the dendextend R package (Galili, 2015) for comparing and adjusting dendrograms. These choices are all left for the user to decide.

Setting the k_col/k_row argument to NA makes the function search for the number of clusters (from 2 to 10) by which to color the branches of the dendrogram. The number picked is the one that yields the highest average silhouette coefficient (based on the find_k function from dendextend). Lastly, the heatmaply function uses the seriation package to find an “optimal” ordering of rows and columns (Hahsler et al., 2008). This is controlled using the seriation argument where the default is “OLO” (optimal-leaf-order) – which rotates the branches so that the sum of distances between each adjacent leaf (label) will be minimized (i.e.: optimize the Hamiltonian path length that is restricted by the dendrogram structure). The other arguments in the example were omitted since they are self-explanatory – the exact code is available in the supplementary material.
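
As a sketch of supplying a custom dendrogram through Rowv with dendextend (assuming the measles matrix from the example above; the linkage method and k = 3 here are arbitrary illustrative choices):

library(heatmaply)
library(dendextend)

row_dend <- sqrt(measles) %>%
  dist() %>%
  hclust(method = "average") %>%
  as.dendrogram() %>%
  set("branches_k_color", k = 3) # color branches by three clusters

heatmaply(sqrt(measles), Rowv = row_dend, Colv = NULL)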

In order to make some of the above easier, we created the shinyHeatmaply package (available on CRAN) which offers a GUI to help guide the researcher with the heatmap construction, with the functionality to export the heatmap as an html file and summarise parameter specifications to reproduce the heatmap with heatmaply. For a more detailed step-by-step demonstration of using heatmaply on biological datasets, you should explore the heatmaplyExamples package (at github.com/talgalili/heatmaplyExamples).

The following biological examples are available and fully reproducible from within the package; they can also be viewed online, and the html files include the R code for producing the figures.

Acknowledgements

The heatmaply package was made possible by leveraging many wonderful R packages, including ggplot2 (Wickham, 2009), plotly (Sievert et al., 2016), dendextend (Galili, 2015) and many others. We would also like to thank Yoav Benjamini, Madeline Bauer, and Marilyn Friedes for their helpful comments, as well as Joe Cheng for initiating the collaboration with Tal Galili on d3heatmap, which helped lay the foundation for heatmaply.

Funding: This work was supported in part by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project).

Conflict of Interest: none declared.

References

  • Galili,T. (2015) dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31, 3718–3720.
  • Gehlenborg,N. and Wong,B. (2012) Points of view: Heat maps. Nat. Methods, 9, 213–213.
  • Gu,Z. et al. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32, 2847–2849.
  • Hahsler,M. et al. (2008) Getting Things in Order : An Introduction to the R Package seriation. J. Stat. Softw., 25, 1–27.
  • van Panhuis,W.G. et al. (2013) Contagious Diseases in the United States from 1888 to the Present. N. Engl. J. Med., 369, 2152–2158.
  • R Core Team (2016) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
  • Sieger,T. et al. (2017) Interactive Dendrograms: The R Packages idendro and idendr0. J. Stat. Softw., 76.
  • Sievert,C. et al. (2016) plotly: Create Interactive Web Graphics via ‘plotly.js’.
  • Weinstein,J.N. (2008) BIOCHEMISTRY: A Postgenomic Visual Icon. Science, 319, 1772–1773.
  • Wickham,H. (2009) ggplot2: Elegant Graphics for Data Analysis. Springer.
  • Wilkinson,L. and Friendly,M. (2009) The History of the Cluster Heat Map. Am. Stat., 63, 179–184.

Making a Shiny dashboard using ‘highcharter’ – Analyzing Inflation Rates

By Anish Singh Walia

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Shiny is an amazing R package which lets R developers and users build web apps using R itself. It lets R users analyze and visualize data and deploy their machine learning models directly in the form of a web app. The package lets you host standalone apps on a webpage, embed them in R Markdown documents, or build dashboards and various forecasting applications. You can also extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions. Shiny lets us write both the client-side front-end code and the server-side script in R itself. More details on this package can be found here.

I recently learned Shiny and started developing a web application with it, and since then I have been in love with it and have used it in every data science and analytics project. The syntax is super easy to understand, and there are lots of amazing articles and documentation available for learning it. I personally had a background in developing full-stack web applications with HTML, CSS, JavaScript and other JS-based scripting languages, so I found the syntax easy to pick up.

This article is meant for people who have at least a basic understanding of how to use Shiny in R, or who have some experience developing web apps. Still, it is easy for a beginner or intermediate R user to learn Shiny and its syntax well enough to develop web apps.

In this article I am going to demonstrate how to make a dashboard using the shinydashboard package. shinydashboard is another package, built on top of Shiny, that is specifically used to easily make and deploy dashboards in R. More details on this package can be found here.

Secondly, in this article I am going to demonstrate how to use the highcharter package, an R wrapper for the Highcharts JavaScript visualization library, to develop amazing and beautiful plots. Its syntax is somewhat similar to the qplot function from the ggplot2 package in R, so if you have good experience with ggplot2 you will find highcharter easier to use. Details about highcharter can be found here.

Analyzing Inflation Rates

In this article I am going to demonstrate how to use the shinydashboard and highcharter packages in R to analyze and visualize inflation rates of major economies and other economic and regional trade unions.

What are inflation rates

The inflation rate is the general rate at which the prices of goods and services within a particular economy are rising, with the purchasing power of the currency declining as goods become more expensive. High inflation is definitely not good for an economy because it steadily reduces the value of money. In general, the central bank of an economy tries to work towards reducing the inflation rate and avoiding deflation.

Deflation is the opposite of inflation. It occurs when the inflation rate becomes negative, i.e. falls below 0. Deflation is more harmful and dangerous for an economy because it means that the prices of goods and services are going to decrease. Now, this sounds amazing for consumers like us. But what it actually reflects is that demand for goods and services has declined over a long period of time, which directly indicates that a recession is on its way. That brings job losses, declining wages and a big hit to the stock portfolio. Deflation slows an economy’s growth. As prices fall, people defer (postpone) purchases in the hope of a better, lower-priced deal, and because of this, companies and firms have to cut the prices of their goods and products, which directly affects the wages of their employees, which then have to be lowered.

Now, I won’t explain these economic terms any further; I leave them for the reader to go check out and read up on. I am sure you will find the subject quite interesting.

Let’s start with designing the UI in R

In Shiny you have the choice of writing the UI and server-side code in a single file, but I prefer to write the client-side and server-side code in separate files, for easier understanding and modularity, and to keep things from getting messed up when the code grows long.

#loading the packages
library(shinydashboard)
require(shiny)
require(highcharter)
#layout of the dashboard
#defining character vectors for select inputs
country <- c("India", "USA", "UK", "China", "Germany", "Japan") # (restored; the vector contents are inferred from the series plotted in the server code below)
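
The dashboardPage() layout itself was also lost from the listing above; here is a minimal sketch of its likely shape, inferred from the highchartOutput ids (hc1-hc4) used in the server code below — treat the titles and arrangement as assumptions:

ui <- dashboardPage(
  dashboardHeader(title = "Inflation Rates"),
  dashboardSidebar(
    selectInput("country", "Select a country:", choices = country)
  ),
  dashboardBody(
    fluidRow(box(highchartOutput("hc1"), width = 6),
             box(highchartOutput("hc2"), width = 6)),
    fluidRow(box(highchartOutput("hc3"), width = 6),
             box(highchartOutput("hc4"), width = 6))
  )
)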

Now, in the above code one can notice that the various functions used to design the UI are actually similar to HTML elements; Shiny allows us to use them as R functions, with the HTML attributes as the function arguments. It's hard to explain everything in the above code, so I encourage readers to check the documentation and help files of the above functions, and to check out the links to various resources which I will add at the end.

The highchartOutput(id) function takes an ‘id' as an argument and defines a placeholder for a highcharter plot that will be developed and plotted on the server side. Here we only design and define the UI and layout of the dashboard.

Let's write the Server logic

All the server-side code lives in a separate file named server.R. The server side contains only the logic to reactively and dynamically render the various plots.

The inflation rates dataset is downloaded from the IMF (International Monetary Fund) website and is publicly available. I first had to do some preprocessing and transformations using the tidyr and dplyr packages to convert the dataset to the desired format.

require(shinydashboard)
require(ggplot2)
require(dplyr)
require(highcharter) #to plot amazing time series plots
library(readxl)
require(tidyr)
#loading the dataset
# (the data-loading and preprocessing lines were mangled by the HTML rendering;
#  a plausible reconstruction -- assume `inf` ends up as a long data frame
#  with columns year, country and inflation)
inflation <- read_excel("inflation_rates.xls")

output$hc1 <- renderHighchart({
  # (the original hchart() call was also lost; a likely shape, plotting the
  #  country chosen in the sidebar selectInput)
  inf %>%
    filter(country == input$country) %>%
    hchart("line", hcaes(x = year, y = inflation)) %>%
    hc_exporting(enabled = TRUE) %>% #to enable downloading the plot image
    hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
               shared = TRUE, borderWidth = 2) %>% #designing the hoverable tooltip
    hc_title(text = "Time series plot of Inflation Rates", align = "center") %>% #title
    hc_subtitle(text = "Data Source: IMF", align = "center") %>% #subtitle of the plot
    hc_add_theme(hc_theme_elementary()) #adding theme
    #to add 3-d effects:
    #hc_chart(type = "column",
    #         options3d = list(enabled = TRUE, beta = 15, alpha = 15))
}) # end hcontainer

output$hc2 <- renderHighchart({
    # (chart constructor restored; the original line was mangled, and the
    #  per-country data frames come from the lost preprocessing step)
    highchart() %>% 
        hc_xAxis(categories = inf$year) %>% 
        hc_add_series(name = "India", data = India$inflation) %>% 
        hc_add_series(name = "USA", data = US$inflation) %>%
        hc_add_series(name = "UK", data = UK$inflation) %>%
        hc_add_series(name = "China", data = China$inflation) %>%
        hc_add_series(name = "Germany", data = Ger$inflation) %>%
        hc_add_series(name = "Japan", data = Japan$inflation) %>%
        #to add colors
        hc_colors(c("red", "blue", "green", "purple", "deeppink", "orange")) %>%
        hc_add_theme(hc_theme_elementary())
}) # end hc2
    
output$hc3 <- renderHighchart({
    # (the chart-construction call was mangled; assumed shape, with `unions`
    #  a hypothetical long data frame of the economic/trade-union series)
    hchart(unions, "line", hcaes(x = year, y = inflation, group = union)) %>% 
        hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                   shared = TRUE, borderWidth = 2) %>%
        hc_title(text = "Time series plot of Inflation Rates for Economic Unions", align = "center") %>%
        hc_subtitle(text = "Data Source: IMF", align = "center") %>%
        hc_add_theme(hc_theme_elementary())
}) #end hc3

    output$hc4 <- renderHighchart({
        # (likewise mangled; assumed shape, with `world` a hypothetical data
        #  frame of the aggregate world series)
        hchart(world, "line", hcaes(x = year, y = inflation)) %>% 
        hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                   shared = TRUE, borderWidth = 2) %>%
        hc_title(text = "Time series plot of Inflation Rates for World", align = "center") %>%
        hc_subtitle(text = "Data Source: IMF", align = "center") %>%
        hc_add_theme(hc_theme_elementary())
    }) #end hc4
 }

In the above code we access the plots defined in the UI via their ids, in the form output$id. We use the function renderHighchart({ #write all R code and logic }) to render the plots; all the R code that develops the output goes inside it. Inside this function, as you can notice, is simply the R code to make the plots.

Now, the function hchart(df, type, hcaes(x, y), ...) works in the same fashion as qplot: it takes the data, the type of plot, and the plot aesthetics as its arguments. The remaining elements, such as the title and subtitle, are piped onto it, just as we pipe extra plotting characteristics onto a ggplot. For more details read http://jkunst.com/highcharter/hchart.html
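
For instance, a minimal standalone example of that pattern (using the built-in mtcars data):

library(highcharter)

hchart(mtcars, "scatter", hcaes(x = wt, y = mpg, group = factor(cyl))) %>%
  hc_title(text = "Weight vs Mileage", align = "center") %>%
  hc_add_theme(hc_theme_elementary())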

Screenshots of the dashboard are included in the original post.

Now, deploying a Shiny app is just a matter of minutes. You just need an account on shinyapps.io, which offers infrastructure as a service (IaaS) to run and manage your application. You can deploy your app directly from the RStudio IDE.
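
Deployment can also be scripted with the rsconnect package; a minimal sketch (the name, token and secret placeholders come from your shinyapps.io account page):

install.packages("rsconnect")
library(rsconnect)

# credentials copied from the shinyapps.io dashboard (placeholders here)
setAccountInfo(name = "<account>", token = "<token>", secret = "<secret>")

# deploy the app in the current directory (ui.R + server.R)
deployApp(appDir = ".")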

Conclusion

This was just a short tutorial on how to make a dashboard in R using the shinydashboard and highcharter packages. The article is not aimed at explaining everything in detail, but rather at demonstrating how shinydashboard and highcharter can be used to build awesome dashboards and web apps. I will attach various resources below that helped me out and that will help you learn to develop web apps in R.

Resources

Hope this article is motivating enough to at least get you started with learning and developing your own web apps in R, and with using highcharter to visualize data. For any doubts or queries, feel free to drop a comment; I will be happy to help. Do like and share the article.
Happy coding!
