Conjoint Analysis and the Strange World of All Possible Feature Combinations

By Joel Cadwell

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

The choice modeler looks over the adjacent display of cheeses and sees the joint marginal effects of the dimensions spanning the feature space: milk source, type, origin, moisture content, added mold or bacteria, aging, salting, packaging, price, and much more. Literally, if products are feature bundles, then one needs to specify all the sources of variation generating so many different cheeses. Here are the cheeses from goats, sheep and cows. Some are local, and some are imported from different countries. In addition, we will require columns separating the hard and soft cheeses. The feature list can become quite long. In the end, one accounts for all the different cheeses with a feature taxonomy consisting of a large multidimensional space of all possible feature combinations. Every cheese falls into a single cell in the joint distribution, and the empty cells represent new product possibilities (unless the feature configuration is impossible).

The retailer, on the other hand, was probably thinking more of supply and demand when they filled this cooler with cheeses. It’s an optimization problem that we can simplify as a tradeoff between losing customers because you do not have what they are looking for and losing money when the product spoils. Meanwhile, consumers have their own issues for they are buying for a reason and may infer a means to a desired end from individual features or complex combinations of transformed features. Neither the retailer nor the consumer is a naturalist seeking a feature taxonomy. In fact, except for the connoisseur, most consumers have very limited knowledge of any product category. We are simply not familiar with all the offerings nor could we name all the alternatives in the cheese cooler or the dog food aisle or the shelves filled with condensed soups. Instead, we rely on the physical or online displays to remind ourselves what is available, but even then, we do not consider every alternative or try to differentiate among all the products.

Thus, the conjoint world of all possible feature combinations is strange to a consumer who sees the products from a purposefully restricted perspective. The consumer categorizes products using goal-derived categories, for instance, restocking or running out of slices for your ham and Swiss sandwich. Thus, attention, categorization and preference are situated within the purchase context defined by goals and the purchase process including the physical product display (e.g., a deli counter with attendant is not the same as self-service selection of prepackaged products). Bartels and Johnson summarize this emerging view in their recent article “Connecting Cognition and Consumer Choice” (see Section 3 Learning and Constructing Value in Context).

Speaking of cheese (in particular, Brillat-Savarin cheese), we are reminded of Brillat-Savarin's aphorism, "Tell me what you eat, and I will tell you what you are," popularized by the original Japanese Iron Chef TV show. Can it be this simple? I can tell you what is being discussed if you give me a "bag of words" and the R package topicmodels. R-bloggers shows how to recover the major cuisines from a list of ingredients from different recipes. My claim is that I learn a great deal by asking whether you buy individually wrapped slices of processed American cheese. As Charles de Gaulle quipped, "How can you govern a country which has 246 varieties of cheese?" One can start by identifying the latent goal structure that shapes awareness, familiarity and usage.

Much is revealed by learning what music you listen to, your familiarity with various providers in a product category, which brands of Scotch whiskey you buy, or the food choices you make for breakfast. In each of those posts, the R package NMF was able to discover the underlying latent variables that could reproduce the raw data with many columns and most rows containing only a few responses (e.g., Netflix ratings with viewers in the rows seeing only a small proportion of all the movies in the columns). Nonnegative matrix factorization (NMF), however, is only one method for uncovering the hidden forces that structure consumption activities. You are free to select any latent variable model that can accommodate such high-dimensional sparse data (e.g., the already mentioned topic modeling, the R package HDclassif, the R package bclust, and more on the way). My preference for NMF stems from its ease of use and successful application across a diverse range of marketing research data as reported in prior posts.
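
For readers who want to try this, here is a minimal sketch, not from the original post, of fitting an NMF to a synthetic sparse respondent-by-item matrix; the toy data and the choice of rank 4 are illustrative assumptions only.

library(NMF)

# Toy high-dimensional, sparse "who buys/uses what" matrix:
# 100 respondents by 50 items, mostly zeros (made-up data)
set.seed(1)
V <- matrix(rbinom(100 * 50, 1, 0.15), nrow = 100)
V <- V[rowSums(V) > 0, colSums(V) > 0]   # nmf() cannot handle all-zero rows or columns

fit <- nmf(V, rank = 4, nrun = 5)   # 4 latent "goals", 5 random restarts
W <- basis(fit)                     # respondents x latent variables
H <- coef(fit)                      # latent variables x items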

Unfortunately, in the strange world of all possible feature combinations, consumers are unable to apply the strategies that work so well in the marketplace. Given nothing other than hypothetical products described by lists of orthogonal features, what else can a respondent do but rely on the information provided?

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Time Series Graphs & Eleven Stunning Ways You Can Use Them

By Plotly Blog

(This article was first published on Plotly Blog, and kindly contributed to R-bloggers)

Many graphs use a time series, meaning they measure events over time. William Playfair (1759 – 1823) was a Scottish economist and pioneer of this approach. Playfair invented the line graph. The graph below–one of his most famous–depicts how in the 1750s the Brits started exporting more than they were importing.

This post shows how you can use Playfair’s approach and many more for making a time series graph. To embed Plotly graphs in your applications, dashboards, and reports, check out Plotly Enterprise.

1. By Year

First we’ll show an example of a standard time series graph. The data is drawn from a paper on shaving trends. The author concludes that the “dynamics of taste”, in this case facial hair, are “common expressions of underlying conditions and sequences in social behavior.” Time is on the x-axis. The y-axis shows the respective percentages of men’s facial hair styles.

You can click and drag to move the axis, click and drag to zoom, or toggle traces on and off in the legend. The temperature graph below shows how Plotly adjusts data from years to nanoseconds as you zoom. The first timestamp is 2014-12-15 08:55:13.961347, which is how Plotly reads dates. That is, `yyyy-mm-dd HH:MM:SS.ssssss`. Now that’s drilling down.
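
As a minimal sketch of the idea (not code from this post), a Date column mapped to the x-axis is all plot_ly() needs to produce a zoomable time series; the data here are made up:

library(plotly)

# made-up monthly series over two years
df <- data.frame(
  date  = seq(as.Date("2013-01-01"), by = "month", length.out = 24),
  value = cumsum(rnorm(24))
)

plot_ly(df, x = ~date, y = ~value, type = "scatter", mode = "lines")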

One of the special things about Plotly is that you can translate plots and data between programming languages, file formats, and data types. For example, the multiple-axis plot below uses stacked plots on the same time scale for different economic indicators. This plot was made using ggplot2’s time scale. We can convert the plot into Plotly, allowing anyone to edit the figure from different programming languages or the Plotly web app.

[Stacked time series of the economic indicators pce, pop, psavert, uempmed, and unemploy]
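
A rough sketch of that conversion, not the original figure's code: the economics data set ships with ggplot2 and contains exactly the columns listed above, and ggplotly() turns the ggplot2 figure into an interactive Plotly graph. The reshaping and faceting choices are assumptions.

library(ggplot2)
library(plotly)
library(tidyr)

# reshape the indicators to long form so they stack on one time scale
long <- pivot_longer(economics, -date, names_to = "indicator", values_to = "value")

g <- ggplot(long, aes(date, value)) +
  geom_line() +
  facet_grid(indicator ~ ., scales = "free_y")

ggplotly(g)   # convert the ggplot2 figure into an interactive Plotly graph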

We have a time series tutorial that explains time series graphs, custom date formats, custom hover text labels, and time series plots in MATLAB, Python, and R.

2. Subplots & Small Multiples

Another way to slice your data is by subplots. These histograms were made with R and compare yearly data. Each plot shows the annual number of players who had a given batting average in Major League Baseball.

[Batting-average histograms, one subplot per season from 2004 through 2013]

You can also display your data using small multiples, a concept developed by Edward Tufte. Small multiples are postage-stamp-size illustrations that use the same graph type to index data by a category or label. Using facets, we’ve plotted a dataset of airline passengers. Each subplot shows the overall travel numbers and a reference line for the thousands of passengers travelling that month.

[Small-multiple subplots of monthly airline passengers, January through December]

3. By Month

The heatmap below shows the percentage of people’s birthdays falling on a given date, gleaned from 480,040 life insurance applications. The x-axis shows months, the y-axis shows the day of the month, and the z-axis shows the percentage of birthdays on each date.

How Common is Your Birthday?

To show how values in your data are spaced over different months, we can use seasonal boxplots. The boxes represent how the data is spaced for each month; the dots represent outliers. We’ve used ggplot2 to make our plot and added a smoothed fit with a confidence interval. See our box plot tutorial to learn more.
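
A minimal sketch of this kind of seasonal box plot with a smoothed fit, using made-up monthly data rather than the post's own series:

library(ggplot2)

# made-up monthly observations (20 per month), e.g. snowfall-like values
set.seed(1)
monthly <- data.frame(
  month = factor(rep(month.abb, each = 20), levels = month.abb),
  value = pmax(rnorm(240, mean = rep(c(40, 35, 25, 10, 2, 0, 0, 0, 1, 5, 20, 35), each = 20), sd = 8), 0)
)

ggplot(monthly, aes(month, value)) +
  geom_boxplot(outlier.colour = "grey40") +                 # boxes: spread per month; dots: outliers
  geom_smooth(aes(group = 1), method = "loess", se = TRUE)  # smoothed fit with confidence band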

Box plot with Smoothed Fit

We can use a bar chart with error bars to look at data over a monthly interval. In this case, we’re using R to make a graph with error bars showing snowfall in Montreal.

Snowfall in Montreal by Month

4. A Repeated Event With A Category

We may want to look at data that is not strictly a time series, but still represents changes over time. For example, we may want hourly event data. Below we’re showing the most popular hourly reasons to call 311 in NYC, a number you can call for non-emergency help. The plot is from our pandas and SQLite guide.

The 6 Most Common 311 Complaints by Hour in a Day

We can also show a before and after effect to examine changes from an event. The plot below, made in an IPython Notebook, tracks Conservative and Labour election impacts on Pounds and Dollars.

GBP USD during UK general elections by winning party

5. A 3D Graph

We can also use a 3D chart to show events over time. For example, our surface chart below shows the UK Swaps Term Structure, with historical dates along the x-axis, the term structure on the y-axis, and the swap rates on the z-axis. The message: rates are lower than ever. At the long end of the curve we don’t see a massive increase. This example was made using cufflinks, a Python library by Jorge Santos. For more on 3D graphing see our Python, MATLAB, R, and web tutorials.

UK Swap Rates

Sharing & Deploying Plotly

If you liked this post, please consider sharing. We’re @plotlygraphs, or email us at feedback at plot dot ly. We have tutorials that show how to make and embed graphs in your website, blog, or apps. To learn more about how companies are using Plotly Enterprise across different industries, see our customer stories.

To leave a comment for the author, please follow the link and comment on his blog: Plotly Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Blowing Away the Competition

By Jeffrey Horner

Plots with DISTINCT.1mil Data Set

(This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers)

In February I embarked on a mission to speed up R, and I’m very pleased with the results so far. I redesigned the internal string cache, symbol table, and environments by using a somewhat obscure data structure called an Array Hash. It’s basically a cache conscious hash table designed for performance which you can read more about below.

Here’s the takeaway from this benchmark measuring R hashed environments: R with Array Hashes outperforms Revolution Analytics Enterprise 7.3.0, HP’s new Distributed R, and a version of R-devel at revision 67716.

Since running this benchmark I’ve found a few other commercial R offerings, plus R 3.2.0 has been released and Revolution Analytics has a new release, so I plan to update the results when I get a chance.

Also for the curious, you can check out R-Array-Hash on github, scope out some other benchmarks, and even see the results of the brutal R make check-all which R-Array-Hash passes with flying colors.

Benchmark Design

Construction and search of a hashed R environment were measured against various data sets, and the memory size of the hash table was measured by using calls to the garbage collector as a proxy. Four versions of R were tested with three data sets and six hash sizes ranging from 2^10 to 2^15, with 3 runs of each, for a total of 216 independent tests.

The benchmark was set up similarly to the design in Nikolas Askitis and Justin Zobel’s paper “Redesigning the String Hash Table, Burst Trie, and BST to Exploit Cache.”

NOTE: Hash tables are key-value stores. In R, environments can be used as key-value stores, with a named variable acting as the key and the value of that variable acting as the value of the key. Hashed environments are constructed with new.env(hash=TRUE) and an additional argument ‘size’ specifying the size of the hashed environment, i.e. the number of slots allocated for the hash table.

Data Sets

Large sets of words scraped from Wikipedia articles were created, one word per line.

There are two variants of the data sets, SKEW and DISTINCT. The SKEW data sets obey Zipf’s law, i.e. the most frequent word in the set appears roughly twice as often as the second most frequent word, and so on, as is common in spoken-language corpora. The DISTINCT data sets contain only distinct words, no repeats, and the words appear in the file in the (unordered) sequence in which they were scraped.

Three data sets were tested: SKEW.1mil, DISTINCT.500thou, and DISTINCT.1mil

SKEW.1mil

SKEW.1mil contains one million words with repeats, 63469 distinct words, with an average length of 5.11 letters.

DISTINCT.500thou and DISTINCT.1mil

DISTINCT.500thou contains half a million unique words with an average length of 8.58 letters, and DISTINCT.1mil contains a million unique words with an average length of 8.73 letters.

Construction and Search of the R Environment

Each run of RUN.R constructed a new hashed R environment and a random string value using:

e <- new.env(hash=TRUE, size=hashsize)
val <- paste(sample(letters, size=7), collapse='')

The value was not important and was kept small so as not to add any
overhead cost. Then, for each word in the data file an assignment was
made in the environment:

assign(word, val, envir=e)

Once all words were assigned the file was read again one word at a
time, and the environment was searched with:

get(word, envir=e)
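
Putting those pieces together, here is a self-contained sketch of the construct-and-search timing; the file name, hash size and use of system.time() are assumptions, and RUN.R’s exact bookkeeping is not reproduced here:

words    <- readLines("DISTINCT.1mil")   # one word per line
hashsize <- 2^15

e   <- new.env(hash = TRUE, size = hashsize)
val <- paste(sample(letters, size = 7), collapse = '')

construct <- system.time(for (w in words) assign(w, val, envir = e))["elapsed"]
search    <- system.time(for (w in words) get(w, envir = e))["elapsed"]

c(construct = construct, search = search, runtime = construct + search)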

Measurements

So, what was measured? In Askitis and Zobel’s paper, they measured the
total time it took to construct the hash table, the total time it took
to search the hash table, and the size in memory of the hash table given
varying hash sizes.

In RUN.R we measured those a little differently, since R is an interpreted language with a garbage collector, versus C, a compiled language with no memory-management overhead:

  • ‘Construct’ time was the time it took to call assign(var, value, envir=e) for all words.
  • ‘Search’ time was the time it took to call get(var,envir=e) for all words, and
  • ‘Runtime’ was the time it took to both construct and search the R environment.

The ‘Runtime’ measurement included the overhead of reading the files,
calls to the garbage collector, hash table resizes etc.

Finally,

  • ‘Memory Size’ was measured by using a call of gc(reset=TRUE) and then another call of gc() to create a proxy for the size of the environment in memory (a sketch of this measurement follows below).
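
A rough sketch of that memory proxy, reusing the words vector from the sketch above; the column read from gc()’s output reflects my reading of the approach, not code from RUN.R:

gc(reset = TRUE)                           # reset the "max used" counters
e <- new.env(hash = TRUE, size = 2^15)
for (w in words) assign(w, "x", envir = e)
after <- gc()                              # read the counters back
after[, "max used"]                        # Ncells/Vcells high-water mark as a size proxy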

Results

Results are very promising, with R-Array-Hash construction and search of the one-million-word DISTINCT data set performing faster than a comparative R-devel at SVN revision 67716, faster than Revolution R Enterprise 7.3.0, and faster than HP’s new Distributed R.

Tests were performed in single-user mode on a DELL PRECISION M6800 running Red Hat Enterprise Linux version 6 with a 4-core Intel i7-4800MQ processor with a clock speed of 2.70GHz, 8 GB main memory, and a hybrid SSD.

For the plots below, the y-axis is time in seconds where no label is given, and the x-axis is the size of the hash table (log scaled).

DISTINCT.1mil

DISTINCT.500thou

Plots with DISTINCT.500thou Data Set

SKEW.1mil

Plots with SKEW.1mil Data Set

To leave a comment for the author, please follow the link and comment on his blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Microsoft hiring engineers for R projects

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Are you a talented software engineer who would like to build out the R ecosystem and help more companies access the power of R? Microsoft (Revolution Analytics’ parent) is hiring a new team to do just that:

Our mission is to empower enterprises to easily and cost-effectively build high-scale analytics solutions leveraging R.

Exponential growth has transformed data into a new natural resource. Every industry has focused on exploiting data analytics for competitive advantage. Business applications of advanced analytics abound: consumer companies doing targeted marketing, financial firms scoring customer credit-worthiness, retailers managing product promotions, manufacturers detecting anomalies in sensor data, & many more.

For the uninitiated, R is an open source programming language & environment for statistical computing. More importantly, R is an innovation engine, with applications that run the gamut from quantitative finance to bioinformatics to machine learning. Over the past several years, R has enjoyed tremendous growth in usage & mindshare in the data science community, reaching a user count in the millions.

Within the Information Management & Machine Learning (IMML) organization, we are forming this new team around the Revolution Analytics acquisition to drive the future of R as a tool for enterprise advanced analytics. To achieve this, we are going to make the Microsoft platform a great place to operationalize R analytics workloads, both on-prem and in the cloud. We will democratize the process of deploying R code as production cloud services. We will enable the use of R within compelling in-database analytics scenarios. We will further accelerate & scale R analytics workloads by integrating with modern big data processing frameworks. And we will invest in the open-source R ecosystem in ways that help foster its evolution and add value to the data science community.

We’re seeking strong highly motivated software engineers, program managers, & software engineering leads to join us on this journey.

I’ve only been an official Microsoft employee for a couple of weeks now, but can already tell it’s a great place to work. If you know and love R, you could join us! For details on the positions available and how to apply, please follow the link below.

Github (elliotwmsft): R for enterprise advanced analytics @ Microsoft

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R’s plot function, the 1970's retro look is not cool any more

By Derek Jones

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

Casual users of a system want to learn a few simple rules that enable them to get most things done. Many languages have a design principle of only providing one way of doing things. Members of one language family are known for providing umpteen different ways of doing something and R is no exception.

R comes with the plot function as part of the base system. I am an admirer of plot's ability to take whatever is thrown at it and generally produce a workman-like graphical image; workman-like is a kinder description than 1970's retro look.
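
A small illustration of that flexibility (toy data, not from the original post): plot dispatches on the class of its arguments and still produces something usable.

x <- rnorm(50)
y <- 2 * x + rnorm(50)
g <- factor(sample(c("a", "b"), 50, replace = TRUE))

plot(x, y)                              # numeric vs numeric: scatterplot
plot(g, y)                              # factor vs numeric: boxplots per level
plot(data.frame(x, y, ratio = y - x))   # data frame: scatterplot matrix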

R has a thriving library of add-on packages, and the package ggplot2 is a byword for fancy graphics in the R community. Anybody reading the description of the qplot function in this package would think it is the death knell for plot. They would be wrong: qplot contains a fatal flaw; it does a very poor job of handling the simple stuff (often generating weird error messages in the process).

In the beginning I’m sure Hadley Wickham, the designer and implementor of ggplot/ggplot2, was more concerned with getting his ideas implemented and was not looking to produce a replacement for plot. Unfortunately it looks as if the vision for functions in the ggplot package is as high-end plot replacements (i.e., for power users) and not as universal plot replacements (i.e., support for casual users).

This leaves me pulling my hair out trying to produce beautiful looking graphs for a book I am working on. The readership is likely to be casual users of R, and I am trying to recommend one way of doing something for every task. The source code and data of all the examples will be freely available and I’m eating my own dog food, so it’s plot I have to use.

To leave a comment for the author, please follow the link and comment on his blog: The Shape of Code » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Hash Table Performance in R: Part IV

By Jeffrey Horner

(This article was first published on Jeffrey Horner, and kindly contributed to R-bloggers)

In the last post I introduced the package envestigate that provides the hash table structure and interesting statistics associated with an R environment. Now I want to show you some performance characteristics of the R environment as a hash table.

I’ll be using a synthetic list of distinct words that mimics real world data scraped from Wikipedia which you can scope out here.

Hash Table Construction Time

As I alluded to in Part III, while R allows you to set the size of an environment with new.env(size=X), it will automatically resize the internal hash table once the load factor reaches a certain threshold (.85 to be exact). Let’s see how that affects hash table construction time.

The code below creates a hash table and inserts each word from the DISTINCT.1mil set. We accumulate the time it takes to insert words 1000 at a time. We also note the size of the hash table after every 1000 inserts. This will reveal an interesting spike that occurs every time the hash table increases in size. There are other spikes that happen and I suspect those are due to garbage collection.

library(envestigate)
library(readr); library(dplyr)
library(ggplot2); library(scales)

# Character vector of distinct words
DISTINCT <- read_lines('DISTINCT.1mil')
len_DISTINCT <- length(DISTINCT)

# Accumulate data each ith_insert
ith_insert <- 10^3

# Number of observations
n_observations <- len_DISTINCT/ith_insert

# Collect these items each ith_insert
insert         <- numeric(n_observations)
size           <- numeric(n_observations)
construct_time <- numeric(n_observations)

# New environment with default hash size of 29L
e <- new.env()

i <- j <- 1L
elapsed_time <- 0.0
while(i <= len_DISTINCT){
  t1 <- proc.time()['elapsed']
  # Insert word into environment
  e[[DISTINCT[i]]] <- 1
  t2 <- proc.time()['elapsed']

  # t2 - t1 = time to insert one word
  # elapsed_time = accumulated time of inserts
  elapsed_time <- elapsed_time + (t2 - t1)

  # ith_insert occurs here, collect data
  if (i %% ith_insert == 0){
    ht <- hash_table(e) # gather size of current hash table
    insert[j]         <- i
    size[j]           <- ht$size
    construct_time[j] <- elapsed_time

    elapsed_time <- 0.0
    j <- j + 1L
  }
  i <- i + 1L
}
# Our data frame of results
res <- data_frame(insert, size, construct_time)
res
## Source: local data frame [1,000 x 3]
## 
##    insert size construct_time
## 1    1000  849          0.006
## 2    2000 1758          0.009
## 3    3000 2530          0.004
## 4    4000 3643          0.005
## 5    5000 4371          0.003
## 6    6000 5245          0.005
## 7    7000 6294          0.007
## 8    8000 7552          0.011
## 9    9000 7552          0.003
## 10  10000 7552          0.003
## ..    ...  ...            ...
# Use a bimodal color scheme to denote when a hash size changes
num_uniq_sizes <- length(unique(res$size))
cols <- rep(hue_pal()(2),ceiling(num_uniq_sizes/2))

p <- ggplot(res, aes(x=insert,y=construct_time,color=factor(size))) +
  geom_line(size=2) + scale_color_manual(values=cols) +
  guides(col=guide_legend(ncol=2,title="Hash Size")) +
  labs(x="Number of Elements in Environment", y="Construct Time (seconds)") +
  theme( axis.text = element_text(colour = "black")) +
    ggtitle('Environment Construct Time')
p

Some interesting takeaways:

  • There’s a clear spike in construct time every time the line color changes, signifying a change in the hash size. We presume the hash resize occurs during the spike, and the cost of each resize appears to grow roughly linearly with the size of the table.
  • Other spikes appear even when the hash size doesn’t change. We presume those are due to the garbage collector either getting rid of old R objects or allocating more heap space.
  • The length of the colored lines along the x-axis grows as more inserts occur. It might follow linear growth, but I haven’t looked at the R source code to confirm.

To leave a comment for the author, please follow the link and comment on his blog: Jeffrey Horner.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R for more powerful clustering

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Vidisha Vachharajani
Freelance Statistical Consultant

R showcases several useful clustering tools, but the one that seems particularly powerful is the marriage of hierarchical clustering with a visual display of its results in a heatmap. The term “heatmap” is often confusing, making most wonder – which is it? A “colorful visual representation of data in a matrix” or “a (thematic) map in which areas are represented in patterns (“heat” colors) that are proportionate to the measurement of some information being displayed on the map”? For our sole clustering purpose, the former meaning of a heatmap is more appropriate, while the latter is a choropleth.

The reason why we would want to link the use of a heatmap with hierarchical clustering is the former’s ability to lucidly represent the information in a hierarchical clustering (HC) output, so that it is easily understood and more visually appealing. The heatmap.2() function in the R package gplots also applies HC to both rows and columns of a data matrix, so that it yields meaningful groups that share certain features (within the same group) and are differentiated from each other (across different groups).

Consider the following simple example, which uses the States data set in the car package. States contains the following features:

  • region: U. S. Census regions. A factor with levels: ENC, East North Central; ESC, East South Central; MA, Mid-Atlantic; MTN, Mountain; NE, New England; PAC, Pacific; SA, South Atlantic; WNC, West North Central; WSC, West South Central.
  • pop: Population: in 1,000s.
  • SATV: Average score of graduating high-school students in the state on the verbal component of the Scholastic Aptitude Test (a standard university admission exam).
  • SATM: Average score of graduating high-school students in the state on the math component of the Scholastic Aptitude Test.
  • percent: Percentage of graduating high-school students in the state who took the SAT exam.
  • dollars: State spending on public education, in $1000s per student.
  • pay: Average teacher’s salary in the state, in $1000s.

We wish to account for all but the first column (region) to create groups of states that are common with respect to the different pieces of information we have about them. For instance, what states are similar vis-a-vis exam scores vs. state education spending? Instead of doing just a hierarchical clustering, we can implement both the HC and the visualization in one step, using the heatmap.2() function in the gplots package.

# R CODE (output = "initial_plot.png")
library(gplots)                 # contains the heatmap.2 function
library(car)
States[1:3,]                    # look at the data
scaled <- scale(States[,-1])    # scale all but the first column to make information comparable
heatmap.2(scaled,               # specify the (scaled) data to be used in the heatmap
          cexRow=0.5, cexCol=0.95, # decrease font size of row/column labels
          scale="none",         # we have already scaled the data
          trace="none")         # cleaner heatmap

This initial heatmap gives us a lot of information about the potential state grouping. We have a classic HC dendrogram on the far left of the plot (the output we would have gotten from an hclust() rendering). However, in order to get an even cleaner look, and have groups fall right out of the plot, we can add row and column separators, rendering an “all-the-information-in-one-glance” look. Placement information for the separators comes from the HC dendrograms (both row and column). Let’s also play around with the colors to get a “red-yellow-green” effect for the scaling, which will render the underlying information even more clearly. Finally, we’ll also eliminate the underlying dendrograms, so we simply have a clean color plot with underlying groups (this option can be easily undone from the code below).

# R CODE (output = "final_plot.png")

# Use color brewer
library(RColorBrewer)
my_palette <- colorRampPalette(c('red','yellow','green'))(256)

scaled <- scale(States[,-1])    # scale all but the first column to make information comparable
heatmap.2(scaled,               # specify the (scaled) data to be used in the heatmap
          cexRow=0.5,
          cexCol=0.95,          # decrease font size of row/column labels
          col = my_palette,     # read in custom colors
          colsep=c(2,4,5),      # add the separators that will clarify the plot even more
          rowsep = c(6,14,18,25,30,36,42,47),
          sepcolor="black",
          sepwidth=c(0.01,0.01),
          scale="none",         # we have already scaled the data
          dendrogram="none",    # no need to see dendrograms in this one
          trace="none")         # cleaner heatmap

This plot gives us a nice, clear picture of the groups that come off of the HC implementation, as well as in context of column (attribute) groups. For instance, while Idaho, Oklahoma, Missouri and Arkansas perform well on the verbal and math SAT components, state spending on education and average teacher salary are much lower than in the other states. These attributes are reversed for Connecticut, New Jersey, DC, New York, Pennsylvania and Alaska.

This hierarchical-clustering/heatmap partnership is a useful, productive one, especially when one is digging through massive data, trying to glean some useful cluster-based conclusions, and render the conclusions in a clean, pretty, easily interpretable fashion.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Introducing ghrr: GitHub Hosted R Repository

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Background

R relies on package repositories for initial installation of a package via install.packages(). A crucial second step is update.packages(): for all currently installed packages, a list of available updates is constructed and offered for either one-by-one or bulk updates. This keeps the local packages in sync with upstream, and provides a very convenient way to obtain new features, bug fixes and other improvements. So by installing from a repository, we automatically have the ability to track the repository for updates.

Enter drat

Fairly recently, the drat package was added to the R ecosystem. It makes both aspects of package distribution easy: providing a package (if you are an author) as well as installing it (if you are a user). Now, because drat is at the same time source code (as it is also a package providing the functionality) and a repository (using the features drat provides), the “namespace” becomes a little cluttered.

But because a key feature of drat is the “one variable” unique identification via GitHub, I opted to create a drat repository in the name of a new organisation: ghrr. This is a simple acronym for GitHub Hosted R Repository.
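
For the user side, a minimal sketch of pointing R at the ghrr drat; this assumes the repository is published as a GitHub Pages drat under the ghrr account, and fasttime is one of the packages listed in the use cases below:

install.packages("drat")
drat::addRepo("ghrr")            # adds the ghrr drat to options("repos")
install.packages("fasttime")     # install from ghrr; update.packages() then tracks it too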

Use cases

We can outline several use case for packages in ghrr:

  • packages not published in a repo by their authors: I already use two like that:
    • fasttime, an impeccably fast parser for ISO datetimes by Simon Urbanek which was however never released into a repo by Simon, and
    • RcppR6, a very nice extension to both R6 (by Winston) and Rcpp, by Rich FitzJohn; similarly never released beyond GitHub;
  • packages possibly unsuitable for mainline repos:
    • Rblpapi is a great package by Whit Armstrong and John Laing to which I have been contributing quite a bit of late. As it requires a free-to-use but not open source library and headers from Bloomberg, it will never make it to the mainline repository for R, but hosting it in ghrr is perfect as I can easily update several machines at work once I cut a new development release;
    • winsorize is a small package I needed a few weeks ago; it is spun out of robustHD but does not yet contain new code so Andreas and I are content to keep it in this drat for now;
  • packages in pre-relase mode:
    • RcppArmadillo, where I announced both a release candidate before Armadillo 5.000 came out and the actual RcppArmadillo 0.500.0.0, which is not (yet) on the mainline repository as two affected packages need a small update first. Users, however, can get RcppArmadillo already from the sibling Rcpp drat repo.
    • RcppToml is a new package I am currently working on, implementing a TOML parser based on cpptoml. It works, but it is not quite ready for public announcements yet, and hence perfect for ghrr.

Going forward

ghrr is meant to be open. While anybody can open a drat repository, particularly on GitHub, it may be beneficial to somehow group packages. This is however not something that can be planned ex-ante: it may just happen if others who see similar benefits in this can in fact contribute. In that spirit, I strongly encourage pull requests.

Early on, I made my commit messages conform to a pattern of package version sha1 repourl to make code provenance of every commit very clear. Ideally, subsequent commits would conform to such a scheme, or replace it with a better one.

Some Resources

A few links to learn more about drat and ghrr:

Comments and questions via email or issue tickets are more than welcome. We hope that others find ghrr to be a useful tool for easy repository management and use via GitHub.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Parse and process XML (and HTML) with xml2

By hadleywickham

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I’m pleased to announce that the first version of xml2 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R:

  • Read XML and HTML with read_xml() and read_html().
  • Navigate the tree with xml_children(), xml_siblings() and xml_parent(). Alternatively, use xpath to jump directly to the nodes you’re interested in with xml_find_one() and xml_find_all(). Get the full path to a node with xml_path().
  • Extract various components of a node with xml_text(), xml_attrs(), xml_attr(), and xml_name().
  • Convert to list with as_list().
  • Where appropriate, functions support namespaces with a global url -> prefix lookup table. See xml_ns() for more details.
  • Convert relative urls to absolute with url_absolute(), and transform in the opposite direction with url_relative(). Escape and unescape special characters with url_escape() and url_unescape().
  • Support for modifying and creating XML documents is planned for a future version.

This package owes a debt of gratitude to Duncan Temple Lang, whose XML package has made it possible to use XML with R for almost 15 years!

Usage

You can install it by running:

install.packages("xml2")

(If you’re on a mac, you might need to wait a couple of days – CRAN is busy rebuilding all the packages for R 3.2.0 so it’s running a bit behind.)

Here’s a small example working with an inline XML document:

library(xml2)
x <- read_xml("<foo>
  <bar>text <baz id = 'a' /></bar>
  <bar>2</bar>
  <baz id = 'b' /> 
</foo>")

xml_name(x)
#> [1] "foo"
xml_children(x)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>

# Find all baz nodes anywhere in the document
baz <- xml_find_all(x, ".//baz")
baz
#> {xml_nodeset (2)}
#> [1] <baz id="a"/>
#> [2] <baz id="b"/>
xml_path(baz)
#> [1] "/foo/bar[1]/baz" "/foo/baz"
xml_attr(baz, "id")
#> [1] "a" "b"
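
As a small addition to the example above (not from the original post), xml_text() pulls the text content out of a node set:

bar <- xml_find_all(x, ".//bar")
xml_text(bar)
#> [1] "text " "2"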

Development

Xml2 is still under active development. If you notice any problems (including crashes), please try the development version, and if that doesn’t work, file an issue.

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R: single plot with two different y-axes

By Stephen Turner

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)
I forgot where I originally found the code to do this, but I recently had to dig it out again to remind myself how to draw two different y axes on the same plot to show the values of two different features of the data. This is somewhat distinct from the typical use case of aesthetic mappings in ggplot2 where I want to have different lines/points/colors/etc. for the same feature across multiple subsets of data.
For example, I was recently poking around with some data examining enrichment of a particular set of genes using a hypergeometric test as I was fiddling around with other parameters that included more genes in the selection (i.e., in the classic example, the number of balls drawn from some hypothetical urn). I wanted to show the -log10(p-value) on one axis and some other value (e.g., “n”) on the same plot, using a different axis on the right side of the plot.
Here’s how to do it. First, generate some data:
set.seed(2015-04-13)

d = data.frame(x = seq(1,10),
               n = c(0,0,1,2,3,4,4,5,6,6),
               logp = signif(-log10(runif(10)), 2))
    x n  logp
1   1 0 1.400
2   2 0 0.590
3   3 1 1.200
4   4 2 1.500
5   5 3 0.028
6   6 4 0.380
7   7 4 2.500
8   8 5 0.067
9   9 6 0.041
10 10 6 0.360
The strategy here is to first draw one of the plots, then draw another plot on top of the first one, and manually add in an axis. So let’s draw the first plot, but leave some room on the right hand side to draw an axis later on. I’m drawing a red line plot showing the p-value as it changes over values of x.
par(mar = c(5,5,2,5))
with(d, plot(x, logp, type="l", col="red3",
             ylab=expression(-log[10](italic(p))),
             ylim=c(0,3)))
Now, draw the second plot on top of the first using the par(new=T) call. Draw the plot, but don’t include an axis yet. Put the axis on the right side (axis(...)), and add text to the margin (mtext...). Finally, add a legend.
par(new = T)
with(d, plot(x, n, pch=16, axes=F, xlab=NA, ylab=NA, cex=1.2))
axis(side = 4)
mtext(side = 4, line = 3, 'Number genes selected')
legend("topleft",
       legend=c(expression(-log[10](italic(p))), "N genes"),
       lty=c(1,0), pch=c(NA, 16), col=c("red3", "black"))
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News