reshape: from long to wide format

By Xianjun Dong

(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)

This continues the topic of using the melt/cast functions in reshape2 to convert a data frame between long and wide format. Here is an example I found helpful for generating the covariate table required for a PEER (or Matrix_eQTL) analysis.

Here is my original covariate table:

Let's say we need to convert categorical variables such as condition, cellType, batch, replicate, readLength, and sex into indicator variables (note: this is required by most regression programs like PEER or Matrix-eQTL, since a categorical code like batch 5 is not "greater than" batch 1 in any meaningful way, unlike age or PMI). So we need to convert this long format into wide format. Here is my R code for that:
library(reshape2)
categorical_variables <- c("batch", "sex", "readsLength", "condition", "cellType", "replicate")
for (x in categorical_variables) {
  cvrt <- cbind(cvrt, value = 1); cvrt[, x] <- paste0(x, cvrt[, x])   # prefix values, e.g. "5" -> "batch5"
  cvrt <- dcast(cvrt, as.formula(paste0("... ~ ", x)), fill = 0)      # spread x into 0/1 indicator columns
}

Here is output:
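To see the mechanics on a self-contained toy example (sampleID, batch and sex below are made-up stand-ins for the real columns), the same loop turns a three-row covariate table into 0/1 indicator columns:

library(reshape2)
cvrt <- data.frame(sampleID = c("s1", "s2", "s3"),
                   batch    = c(1, 2, 1),
                   sex      = c("M", "F", "F"),
                   stringsAsFactors = FALSE)
for (x in c("batch", "sex")) {
  cvrt <- cbind(cvrt, value = 1)                                  # the value spread into the indicator cells
  cvrt[, x] <- paste0(x, cvrt[, x])                               # "1" -> "batch1", "M" -> "sexM", ...
  cvrt <- dcast(cvrt, as.formula(paste0("... ~ ", x)), fill = 0)  # ... = all remaining columns
}
cvrt
# -> columns sampleID, batch1, batch2, sexF, sexM holding 0/1 indicators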

To leave a comment for the author, please follow the link and comment on his blog: One Tip Per Day.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Why I think twice before editing plots in Powerpoint, Illustrator, Inkscape, etc.

By Maxwell B. Joseph

(This article was first published on Ecology in silico, and kindly contributed to R-bloggers)

Thanks to a nice post by Meghan Duffy on the Dynamic Ecology blog (How do you make figures?), we have some empirical evidence that many figures made in R by ecologists are secondarily edited in other programs including MS Powerpoint, Adobe Illustrator, Inkscape, and Photoshop.
This may not be advisable* for two reasons: reproducibility and bonus learning.

Reproducibility

R is nice because results are relatively easy to reproduce.
It’s free, and your code serves as a written record of what was done.
When figures are edited outside of R, they can be much more difficult to reproduce.
Independent of whether I am striving to maximize the reproducibility of my work for others, it behooves me to save time for my future self, ensuring that we (I?) can quickly update my own figures throughout the process of paper writing, submission, rewriting, resubmission, and so on.

I had to learn this the hard way.
The following figure was my issue: initially I created a rough version in R, edited it in Inkscape (~30 minutes invested), and ended up with a “final” version for submission.

Turns out that I had to remake the figure three times throughout the revision process (for the better). Eventually I realized that it was far more efficient to make the plot in R than to process it outside of R.

In retrospect, two things are clear:

  1. My energy allocation strategy was not conducive to the revision process. I wasted time trying to make my “final” version look good in Inkscape, when I could have invested time to figure out how to make the figure as I wanted it in R. The payoff from this time investment will be a function of how much manipulation is done outside R, how hard it is to get the desired result in R, and how many times a figure will be re-made.
  2. I probably could have found a better way to display the data. Another post perhaps.

Bonus learning

Forcing myself to remake the figure exactly as I wanted it using only R had an unintended side effect: I learned more about base graphics in R.
Now, when faced with similar situations, I can make similar plots much faster, because I know more graphical parameters and plotting functions.
In contrast, point-and-click programs are inherently slow because I’m manually manipulating elements, usually with a mouse, and my mouse isn’t getting any faster.

*different strokes for different folks, of course

(Figure from Does life history mediate changing disease risk when communities disassemble?)

To leave a comment for the author, please follow the link and comment on his blog: Ecology in silico.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Using and Abusing Data Visualization: Anscombe’s Quartet and Cheating Bonferroni

By Stephen Turner

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)
This classic example really illustrates the importance of looking at your data, not just the summary statistics and model parameters you compute from it.
With that said, you can’t use data visualization to “cheat” your way into statistical significance. I recently had a collaborator who wanted some help automating a data visualization task so that she could decide which correlations to test. This is a terrible idea, and it’s going to get you in serious type I error trouble. To see what I mean, consider an experiment where you have a single outcome and lots of potential predictors to test individually. For example, some outcome and a bunch of SNPs or gene expression measurements. You can’t just visually inspect all those relationships then cherry-pick the ones you want to evaluate with a statistical hypothesis test, thinking that you’ve outsmarted your way around a painful multiple-testing correction.
Here's a simple simulation showing why that doesn't fly. In this example, I'm simulating 100 samples with a single outcome variable y and 64 different predictor variables, x. I might be interested in which x variable is associated with my y (e.g., which of my many gene expression measurements is associated with measured liver toxicity). But in this case, both x and y are random numbers. That is, I know for a fact the null hypothesis is true, because that's what I've simulated. Now we can make a scatterplot for each predictor variable against our outcome, and look at that plot.
library(dplyr)
library(ggplot2)
set.seed(42)
ndset = 64
n = 100
d = data_frame(
set = factor(rep(1:ndset, each = n)),
x = rnorm(n * ndset),
y = rep(rnorm(n), ndset))
d
## Source: local data frame [6,400 x 3]
##
## set x y
## 1 1 1.3710 1.2546
## 2 1 -0.5647 0.0936
## 3 1 0.3631 -0.0678
## 4 1 0.6329 0.2846
## 5 1 0.4043 1.0350
## 6 1 -0.1061 -2.1364
## 7 1 1.5115 -1.5967
## 8 1 -0.0947 0.7663
## 9 1 2.0184 1.8043
## 10 1 -0.0627 -0.1122
## .. ... ... ...
ggplot(d, aes(x, y)) + geom_point() + geom_smooth(method = lm) + facet_wrap(~set)
Now, if I were to go through this data and compute the p-value for the linear regression of each x on y, I’d get a uniform distribution of p-values, my type I error is where it should be, and my FDR and Bonferroni-corrected p-values would almost all be 1. This is what we expect — remember, the null hypothesis is true.
library(dplyr)
results = d %>%
group_by(set) %>%
do(mod = lm(y ~ x, data = .)) %>%
summarize(set = set, p = anova(mod)$"Pr(>F)"[1]) %>%
mutate(bon = p.adjust(p, method = "bonferroni")) %>%
mutate(fdr = p.adjust(p, method = "fdr"))
results
## Source: local data frame [64 x 4]
##
## set p bon fdr
## 1 1 0.2738 1.000 1.000
## 2 2 0.2125 1.000 1.000
## 3 3 0.7650 1.000 1.000
## 4 4 0.2094 1.000 1.000
## 5 5 0.8073 1.000 1.000
## 6 6 0.0132 0.844 0.844
## 7 7 0.4277 1.000 1.000
## 8 8 0.7323 1.000 1.000
## 9 9 0.9323 1.000 1.000
## 10 10 0.1600 1.000 1.000
## .. ... ... ... ...
library(qqman)
qq(results$p)
BUT, if I were to look at those plots above and cherry-pick out which hypotheses to test based on how strong the correlation looks, my type I error will skyrocket. Looking at the plot above, it looks like the x variables 6, 28, 41, and 49 have a particularly strong correlation with my outcome, y. What happens if I try to do the statistical test on only those variables?
results %>% filter(set %in% c(6, 28, 41, 49))
## Source: local data frame [4 x 4]
##
## set p bon fdr
## 1 6 0.0132 0.844 0.844
## 2 28 0.0338 1.000 1.000
## 3 41 0.0624 1.000 1.000
## 4 49 0.0898 1.000 1.000
When I do that, my p-values for those four tests are all below 0.1, with two below 0.05 (and I'll say it again, the null hypothesis is true in this experiment, because I've simulated random data). In other words, my type I error is now completely out of control, with half of the cherry-picked tests coming up "significant" at p < 0.05.
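For contrast, a quick check (a sketch reusing the results data frame computed above) shows that honestly adjusting across all 64 tests keeps the false positives in check:

# How many tests look "significant" before and after multiple-testing correction?
sum(results$p   < 0.05)  # a few raw p-values fall below 0.05 purely by chance
sum(results$bon < 0.05)  # 0 -- nothing survives the Bonferroni correction
sum(results$fdr < 0.05)  # 0 -- nothing survives the FDR correction either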

The moral of the story here is to always look at your data, but don't "cheat" by choosing which statistical tests to perform based solely on that visualization exercise.

Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Announcing shinyapps.io General Availability

By Roger Oberg

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio is excited to announce the general availability (GA) of shinyapps.io.

Shinyapps.io is an easy-to-use, secure, and scalable hosted service already being used by thousands of professionals and students to deploy Shiny applications on the web. Effective today, shinyapps.io has completed beta testing and is generally available as a commercial service for anyone.

As regular readers of our blog know, Shiny is a popular free and open source R package from RStudio that simplifies the creation of interactive web applications, dashboards, and reports. Until today, Shiny Server and Shiny Server Pro were the most popular ways to share shiny apps. Now, there is a commercially supported alternative for individuals and groups who don’t have the time or resources to install and manage their own servers.

We want to thank the nearly 8,000 people who created at least one shiny app and deployed it on shinyapps.io during its extensive alpha and beta testing phases! The service was improved for everyone because of your willingness to give us feedback and bear with us as we continuously added to its capabilities.

For R users developing Shiny applications who haven't yet created a shinyapps.io account, we hope you'll give it a try soon! We did our best to keep the pricing simple and predictable with Free, Basic, Standard, and Professional plans. Each paid plan has features and functionality that we think will appeal to different users and can be purchased with a credit card by month or year. You can learn more about shinyapps.io pricing plans and product features on our website.
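If you haven't deployed an app before, the workflow amounts to a couple of function calls. Here is a minimal sketch, assuming the rsconnect package (the successor of the original shinyapps deployment package) and that you have generated a token in the shinyapps.io dashboard; the account name, token, secret, and app directory below are placeholders:

# install.packages("rsconnect")
library(rsconnect)

# One-time setup: copy these values from the Tokens page of the shinyapps.io dashboard
setAccountInfo(name   = "your-account-name",
               token  = "YOUR_TOKEN",
               secret = "YOUR_SECRET")

# Deploy the Shiny app living in this directory to shinyapps.io
deployApp(appDir = "path/to/your/shiny-app")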

We hope to see your shiny app on shinyapps.io soon!

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Fuzzy String Matching – a survival skill to tackle unstructured information

By Bigdata Doc


(This article was first published on Big Data Doctor » R, and kindly contributed to R-bloggers)

"The amount of information available on the internet grows every day." Thank you, Captain Obvious! By now even my grandma is aware of that. Actually, the internet has increasingly become the first address for data people looking for good and up-to-date data. But this is not, and never has been, an easy task.
Even though the Semantic Web was pushed very hard in academic environments, and in spite of all the efforts of internet visionaries like Sir Tim Berners-Lee, the vast majority of existing sites don't speak RDF, don't expose their data with microformats, and keep giving a hard time to people trying to consume their data programmatically.

I must admit, though, that when I got my hands on angular.js, I was surprised by how MVC JavaScript frameworks are now trying to "semantify" HTML via directives… a really cool idea with a promising future, but still lacking standardization.

And of course you see an ever-increasing number of web platforms opening up APIs… but we both know that there is so much interesting data you cannot just query from an API. So what do you do? You end up scraping, parsing logs, applying regexes, etc.
Let's assume you got access to the information (lucky me, I'm putting that problem aside :) )… So we have managed to retrieve semi-structured data from different internet sources. What do you do next?

Connecting the dots without knowing exactly what a dot is

Well, we know that data is in many cases useful only if it can be combined with other data. But in your retrieved data sets, there is nothing like a matching key, so you don't know how to connect the sources.
The only thing you have in the two different data sets you are trying to match is item names… they actually look quite similar and a human could do the matching… but there are some nasty differences.
For example, you have a product called "Apple iPad Air 16 GB 4G silber" in one source and "iPad Air 16GB 4G LTE" in the other… You know it is the same product… but how can you do the matching?

There are several ways of tackling this problem:

  • The manual approach
    The most obvious one is just manually scanning the fields in both data sources that are supposed to contain the matching keys and creating a mapping table (for example, two columns in a CSV). It works well when the data sources are relatively small, but good luck with larger ones!
  • The “I let others do” approach
    You've certainly heard of crowdsourcing and services like Amazon Mechanical Turk, where you can task a remote team of people to manually do the matching for you. It might not be very reliable, but for certain jobs it's a really good option.
  • The "regex" hell
    If you have a closer look at your data, you might define regular expressions to extract parts of the potential matching keys (e.g. gsub('.*?([0-9]+GB).*', '\\1', 'Apple iPhone 16GB black') to extract the amount of memory in GB from a device name) and then try to match on several fields, not just one. But there are so many special cases to consider that you might well end up in a "regex" hell.

Obviously, you'd love to have a better, or at least a more automatic, method to accomplish this task. Let me say it upfront: there is no automatic, easy-to-implement, 100% reliable approach. Even those of you already thinking of machine learning approaches can never guarantee 100% reliability, because of the nature of the problem (overfitting vs. accuracy). But enough bad news! Let's talk about the art of the possible:

The Fuzzy String Matching approach

Fuzzy string matching basically rephrases the YES/NO question "Are string A and string B the same?" as "How similar are string A and string B?"… And to compute the degree of similarity (called "distance"), the research community has been suggesting new methods for decades. Probably the first and most popular one is the Levenshtein distance, which is, by the way, the one R natively implements in the utils package (adist).

Mark Van der Loo released a package called stringdist with additional popular fuzzy string matching methods, which we are going to use in our example below.
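As a quick illustration (a sketch assuming the stringdist package is installed), here is how the two iPad names from above compare under a few of these distances:

library(stringdist)

a <- "Apple iPad Air 16 GB 4G silber"
b <- "iPad Air 16GB 4G LTE"

adist(a, b, ignore.case = TRUE)                                # base R: generalized Levenshtein edit distance
stringdist(tolower(a), tolower(b), method = "lv")              # Levenshtein
stringdist(tolower(a), tolower(b), method = "jw")              # Jaro-Winkler (0 = identical, 1 = completely different)
stringdist(tolower(a), tolower(b), method = "cosine", q = 2)   # cosine distance on 2-gram profiles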

These fuzzy string matching methods don't know anything about your data, but you might. For example, you may see that in one source the matching keys are kept much shorter than in the other, where further features are included as part of the key. In that case, you want an approximate distance between the shorter key and word windows of similar length taken from the longer key, to decide whether there's a match. Such "semantics" usually need to be implemented on top, but can rely on the previously mentioned stringdist methods.

Let’s have a look at the three variants in R. Basically the process is done in three steps:

  • Reading the data from both sources
  • Computing the distance matrix between all elements
  • Pairing the elements with the minimum distance

The first method, based on R's native approximate distance function, looks like this:

# Method 1: using the native R adist
source1.devices<-read.csv('[path_to_your_source1.csv]')
source2.devices<-read.csv('[path_to_your_source2.csv]')
# To make sure we are dealing with character vectors
source1.devices$name<-as.character(source1.devices$name)
source2.devices$name<-as.character(source2.devices$name)
 
# It creates a matrix with the Standard Levenshtein distance between the name fields of both sources
dist.name<-adist(source1.devices$name,source2.devices$name, partial = TRUE, ignore.case = TRUE)
 
# We now take the pairs with the minimum distance
min.name<-apply(dist.name, 1, min)
 
match.s1.s2<-NULL  
for(i in 1:nrow(dist.name))
{
    s2.i<-match(min.name[i],dist.name[i,])
    s1.i<-i
    match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2.devices[s2.i,]$name, s1name=source1.devices[s1.i,]$name, adist=min.name[i]),match.s1.s2)
}
# and we then can have a look at the results
View(match.s1.s2)

Now let’s make use of all meaningful implementations of string distance metrics in the stringdist package:

# Method 2: applying different string matching methods
# osa:     Optimal string alignment (restricted Damerau-Levenshtein distance)
# lv:      Levenshtein distance (as in R's native adist)
# dl:      Full Damerau-Levenshtein distance
# hamming: Hamming distance (a and b must have the same number of characters)
# lcs:     Longest common substring distance
# qgram:   q-gram distance
# cosine:  cosine distance between q-gram profiles
# jaccard: Jaccard distance between q-gram profiles
# jw:      Jaro, or Jaro-Winkler, distance
 
#install.packages('stringdist')
library(stringdist)
 
distance.methods<-c('osa','lv','dl','hamming','lcs','qgram','cosine','jaccard','jw')
dist.methods<-list()
for(m in 1:length(distance.methods))
{
  dist.name.enh<-matrix(NA, ncol = length(source2.devices$name),nrow = length(source1.devices$name))
  for(i in 1:length(source2.devices$name)) {
    for(j in 1:length(source1.devices$name)) { 
      dist.name.enh[j,i]<-stringdist(tolower(source2.devices[i,]$name),tolower(source1.devices[j,]$name),method = distance.methods[m])      
        #adist.enhance(source2.devices[i,]$name,source1.devices[j,]$name)
    }  
  }
  dist.methods[[distance.methods[m]]]<-dist.name.enh
}
 
match.s1.s2.enh<-NULL
for(m in 1:length(dist.methods))
{
 
  dist.matrix<-as.matrix(dist.methods[[distance.methods[m]]])
  min.name.enh<-apply(dist.matrix, 1, base::min)
  for(i in 1:nrow(dist.matrix))
  {
    s2.i<-match(min.name.enh[i],dist.matrix[i,])
    s1.i<-i
    match.s1.s2.enh<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2.devices[s2.i,]$name, s1name=source1.devices[s1.i,]$name, adist=min.name.enh[i],method=distance.methods[m]),match.s1.s2.enh)
  }
}
# Let's have a look at the results
library(reshape2)
matched.names.matrix<-dcast(match.s1.s2.enh,s2.i+s1.i+s2name+s1name~method, value.var = "adist")
View(matched.names.matrix)

And lastly, let's have a look at what a custom implementation that exploits the known semantics of the data could look like:

# Now my own method applying some more knowledge about the data
 
# First, a small but really helpful function that strips leading/trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# Then we implement our own distance function:
# take the shorter string and slide it over the longer one, applying the Levenshtein distance and keeping the minimum
adist.custom <- function (str1, str2, sliding = TRUE)
{  
  s.str1<-strsplit(trim(str1), split=' ')
  s.str2<-strsplit(trim(str2), split=' ')
  s.str2<-trim(unlist(s.str2))
  s.str1<-trim(unlist(s.str1))
 
  if (length(s.str2)>=length(s.str1))
  {
    short.str<-  s.str1
    long.str<-s.str2
  } else {
    short.str <- s.str2
    long.str<-s.str1
  }
  # sliding
  return.dist<-0
  if (sliding == TRUE)
  {
    min<-99999
    s1<-trim(paste(short.str,collapse = ' '))
    for (k in 1:(length(long.str)-length(short.str)+1))   # +1 so the last full window is included
    {
      s2<-trim(paste(long.str[k:(length(short.str)+(k-1))],collapse = ' '))    
      ads<-adist(s1,s2,partial = TRUE, ignore.case = TRUE)
      min <- ifelse(ads<min,ads,min)
    }
    return.dist<-min
  } else {
    #string start matching  
    s1<-trim(paste(short.str,collapse = ' '))
    s2<-trim(paste(long.str[1:length(short.str)],collapse = ' '))
    return.dist<-adist(s1,s2,partial = TRUE, ignore.case = TRUE)
  }
  return (return.dist)  
}
 
 
dist.name.custom<-matrix(NA, ncol = length(source2.devices$name),nrow = length(source1.devices$name))
for(i in 1:length(source2.devices$name)) {
  for(j in 1:length(source1.devices$name)) { 
    dist.name.custom[j,i]<-adist.custom(tolower(source2.devices[i,]$name),tolower(source1.devices[j,]$name))      
  }  
}
 
min.name.custom<-apply(dist.name.custom, 1, min)
match.s1.s2<-NULL
for(i in 1:nrow(dist.name.custom))
{
  s2.i<-match(min.name.custom[i],dist.name.custom[i,])
  s1.i<-i
  match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=source2.devices[s2.i,]$name, s1name=source1.devices[s1.i,]$name, adist=min.name.custom[i]),match.s1.s2)
}
# let's have a look at the results
View(match.s1.s2)

I ran the code with two different lists of mobile device names, which you can find here: list1, list2.
Even if, depending on the method, you get a correct matching rate of over 80%, you really need to supervise the outcome… but it's still much quicker than the non-fuzzy alternatives, don't you agree?

To leave a comment for the author, please follow the link and comment on his blog: Big Data Doctor » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Adobe Sitecatalyst API and R: integrate reports with the SAINT classification file

By Mec Analytics


(This article was first published on MilanoR, and kindly contributed to R-bloggers)

From original post @ http://analyticsblog.mecglobal.it/en/analytics-tools/adobe-sitecatalyst-api-in-r/

Are you a heavy user of SiteCatalyst and the well-known R package RSiteCatalyst for analyzing your web analytics data and making insightful visualizations?

Or do you wonder why the hell you need to download lots of Excel files to get a simple report out of the Adobe cloud?

Well, today we are going to solve a problem that everyone who's into reporting faces every day. We'll show you what to do so that analysts spend more time concentrating on the data and zero time aggregating it.
All you need is an R environment set up, and we are ready to go.
The first thing to do is obviously to load the libraries we need and log in using our username and token, which you can find here:

library(RSiteCatalyst)
library(sqldf)

SCAuth("[USERNAME:COMPANY]","[TOKEN]")

Next, we simply get our hands dirty with the API.
This is an example of how we can use it to get visits and time on site by tracking code:

elements <- GetElements("[WEBSITE_ID]")

visits_per_day_by_tracking_code <- QueueTrended("[WEBSITE_ID]", "[DATE_BEGIN]", "[DATE_END]", "visits", elements = "trackingcode",
                                                date.granularity = "day", top = "1000", start = "0")

time_per_day_by_tracking_code <- QueueTrended("[WEBSITE_ID]", "[DATE_BEGIN]", "[DATE_END]", "totaltimespent", elements = "trackingcode",
                                              date.granularity = "day", top = "1000", start = "0")

visits_and_time_by_tracking_code <- merge(visits_per_day_by_tracking_code, time_per_day_by_tracking_code,
                                          by = c("name", "datetime"), all.x = TRUE)

The problem with this data is that it is definitely too detailed for our reporting scope. What we really need to know is how campaigns have performed, rather than individual tracking codes.

To accomplish that, we're going to load the SAINT classification file we can download from Adobe, and then we're ready to make our analysis easier. Once you have downloaded it, here is the code to put it all together.

library(xlsx)

saint_data <- read.xlsx("[SAINT_CLASSIFICATION_FILE.xlsx]", 1)

metrics_by_campaign_placement <- merge(visits_and_time_by_tracking_code, saint_data,
                                       by.x = "name", by.y = "Key")

landing_pages <- QueueRanked("[WEBSITE_ID]", "[DATE_BEGIN]", "[DATE_END]",
                             c("visits", "bounces"),
                             elements = c("entrypage", "trackingcode"), top = "50000", start = "0")

landing_pages_by_campaign <- merge(landing_pages, saint_data,
                                   by.x = "trackingcode", by.y = "Key", all.x = TRUE)

As you can see, the code is quite straightforward: all we need to do is basically an inner join over the key, which is unique in the tracking code and in the data retrieved from the API.
The cool thing here is that we can attach every metric we consider crucial for understanding the advertising activities to the SAINT classification, without needing to load it into the platform. What we internally attach are key metrics such as advertising costs and impressions by placement, and campaigns grouped by KPIs such as brand awareness or business performance.
To give you an example of what you can do, here is the code to generate a chart that gives you an idea of which advertising publisher is performing best in terms of visits, time on site, and impressions:

Campaign_performance <- sqldf("select Campaigns, sum(visits) from metrics_by_campaign_placement group by Campaigns order by sum(visits)")

Publisher_performance <- sqldf("select Publisher, sum(visits) as visits, sum(totaltimespent) as time_on_site, sum(impressions) as impressions from metrics_by_campaign_placement group by Publisher order by sum(visits)")

Placement_performance <- sqldf("select Ads, sum(visits), sum(totaltimespent), sum(impressions) from metrics_by_campaign_placement group by Ads order by sum(visits)")

landing_pages_performance <- sqldf("select Campaigns, entrypage, sum(visits), sum(bounces)/sum(visits) as bouncerate from landing_pages_by_campaign group by Campaigns order by sum(visits) desc")

To summarize with a clear chart, here's the output you can expect from these data using the ggplot2 package:

The size of each bubble represents the number of impressions, while the bubbles are the publishers or social platforms you use for your advertising campaigns.

# PLOT THE DATA USING GGPLOT2
library(ggplot2)

ggplot(data = Publisher_performance, aes(x = visits, y = time_on_site)) +
  geom_point(aes(size = impressions, colour = Publisher)) +
  scale_size_continuous(range = c(2, 15)) +
  theme(legend.position = "none") +
  geom_text(aes(label = Publisher), hjust = 1.5)

To leave a comment for the author, please follow the link and comment on his blog: MilanoR.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

RMySQL version 0.10.2: Full SSL Support

By Jeroen Ooms


(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

RMySQL version 0.10.2 has appeared on CRAN. This is a maintenance release to streamline the build process on various platforms. Most importantly, the Windows/OSX binary packages from CRAN are now built with full SSL support. On Linux, the configure script has been updated a bit to automatically find the mysql client library.

A big thanks to epoch.com for sponsoring the development of this important package.

How to install RMySQL

RMySQL is a very old package, and as a result there is a lot of outdated and incorrect information about it on the interwebs. Back in the day (up to version 0.9.3) you had to manually install MySQL on your machine to make the package work. But since the 0.10 series released earlier this year, the package is entirely self-contained. The recommended way to install RMySQL on Windows and OSX is simply:

install.packages("RMySQL")

On Linux the package still links against the system libmysqlclient. On most deb systems (Debian/Ubuntu) you need to install libmysqlclient-dev and on rpm systems such as Fedora/CentOS/RHEL you need mariadb-devel. It should also work with less known variants of MySQL such as Percona but this doesn’t get a lot of testing coverage.

Using SSL with MySQL

MySQL is not always used with SSL because often the client and server run on the same machine, or within a private network. Moreover encryption introduces some performance overhead, which slows down your database connection a bit. But if you are connecting to a MySQL server over the internet, then enabling SSL is probably a good idea if you don’t want everyone to see your data.

Most MySQL servers have been built with SSL support. To configure RMySQL to connect to a server over SSL, you need to set the certificates in your ~/.my.cnf file:

[client]
ssl-ca=c:/ssl_certs/ca-cert.pem
ssl-cert=c:/ssl_certs/client-cert.pem
ssl-key=c:/ssl_certs/client-key.pem
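
With the certificates in place, connecting from R looks like any other RMySQL connection; here is a minimal sketch (the host, database, and credentials are placeholders):

library(DBI)
library(RMySQL)

# The SSL certificates configured in ~/.my.cnf above are picked up by the client library
con <- dbConnect(MySQL(),
                 host = "db.example.com",
                 dbname = "mydb",
                 user = "myuser",
                 password = "mypassword")

dbGetQuery(con, "SHOW STATUS LIKE 'Ssl_cipher'")  # a non-empty Value means the connection is encrypted
dbDisconnect(con)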

I'm not using this myself, but others are, so I'm taking their word that it works. If you're experiencing any problems, open an issue on GitHub.

Future Development

This is likely the final release of the 0.10 series. We (well mostly Hadley) are working on a full rewrite of the package based on Rcpp. The readme on Github contains instructions on how to install the latest version from source (it is really easy, even on Windows).

Past experience has shown that problems in this package are often specific to the operating system and version of MySQL. Therefore we really appreciate feedback and testing of the new version. If you use RMySQL, please check out the development version at some point so that we can make sure everything works as expected when it gets released. Report bugs or suggestions on the GitHub page; please include your OS and RMySQL version.

To leave a comment for the author, please follow the link and comment on his blog: OpenCPU.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Talking about R, Data Science and Microsoft on theCUBE

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It was a pleasure to appear live on theCUBE last week while attending the Strata conference. In my interview with Jeff Kelly and John Furrier, I talked about the rising popularity of R, the applications of data science, and the recent announcement of Microsoft acquiring Revolution Analytics. I also gushed about the President’s shout-out to Data Scientists at Strata. You can watch the interview below, and check out the other theCUBE interviews from Strata at siliconangle.tv.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Announcing: Introduction to Data Science video course

By John Mount


(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

Win-Vector LLC's Nina Zumel and John Mount are proud to announce that their new data science video course, Introduction to Data Science, is now available on Udemy.

We designed the course as an introduction to an advanced topic. The course description is:

Use the R Programming Language to execute data science projects and become a data scientist. Implement business solutions, using machine learning and predictive analytics.

The R language provides a way to tackle day-to-day data science tasks, and this course will teach you how to apply the R programming language and useful statistical techniques to everyday business situations.

With this course, you’ll be able to use the visualizations, statistical models, and data manipulation tools that modern data scientists rely upon daily to recognize trends and suggest courses of action.

Understand Data Science to Be a More Effective Data Analyst

  • Use R and RStudio
  • Master Modeling and Machine Learning
  • Load, Visualize, and Interpret Data

Use R to Analyze Data and Come Up with Valuable Business Solutions

This course is designed for those who are analytically minded and are familiar with basic statistics and programming or scripting. Some familiarity with R is strongly recommended; otherwise, you can learn R as you go.

You’ll learn applied predictive modeling methods, as well as how to explore and visualize data, how to use and understand common machine learning algorithms in R, and how to relate machine learning methods to business problems.

All of these skills will combine to give you the ability to explore data, ask the right questions, execute predictive models, and communicate your informed recommendations and solutions to company leaders.

Contents and Overview

This course begins with a walk-through of a template data science project before diving into the R statistical programming language.

You will be guided through modeling and machine learning. You’ll use machine learning methods to create algorithms for a business, and you’ll validate and evaluate models.

You’ll learn how to load data into R and learn how to interpret and visualize the data while dealing with variables and missing values. You’ll be taught how to come to sound conclusions about your data, despite some real-world challenges.

By the end of this course, you'll be a better data analyst because you'll have an understanding of applied predictive modeling methods, and you'll know how to use existing machine learning methods in R. This will allow you to work with team members on a data science project, find problems, and come up with solutions.

You’ll complete this course with the confidence to correctly analyze data from a variety of sources, while sharing conclusions that will make a business more competitive and successful.

The course will teach students how to use existing machine learning methods in R, but will not teach them how to implement these algorithms from scratch. Students should be familiar with basic statistics and basic scripting/programming.

The course has a different emphasis than our book Practical Data Science with R and does not require the book.

Most of the course materials are freely available from GitHub in the form of pre-prepared knitr workbooks.

To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Using Hadoop Streaming API to perform a word count job in R and C++

By Marek Gągolewski

(This article was first published on Rexamine » Blog/R-bloggers, and kindly contributed to R-bloggers)

by Marek Gagolewski, Maciej Bartoszuk, Anna Cena, and Jan Lasek (Rexamine).

Introduction

In a recent blog post we explained how we managed to set up a working Hadoop environment on a few CentOS7 machines. To test the installation, let’s play with a simple example.

The Hadoop Streaming API allows you to run Map/Reduce jobs with arbitrary programs as the mapper and/or the reducer.

Files are processed line by line. Mappers get appropriate chunks of the input file. Each line is assumed to store a key-value pair. By default, the following form is used:

key1 \t val1 \n
key2 \t val2 \n

If there is no TAB character, then the value is assumed to be NULL.

In fact, the job below is a Hadoop version of a program that rearranges lines in the input file so that duplicate lines appear one after another; the output is always sorted by key.

This is because:

hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
   -input /input/test.txt \
   -output /output \
   -mapper /bin/cat \
   -reducer /bin/cat
hdfs dfs -cat /output/part-00000

This is roughly equivalent to:

cat input | mapper | sort | reducer > output

More specifically, in our case that was:

cat input | cat | sort | cat > output

A sample Map/Reduce job

Let’s run a simple Map/Reduce job written in R and C++ (just for fun – we assume that all the nodes run the same operating system and they use the same CPU architecture).

  1. As we are in the CentOS 7 environment, we will need a newer version of R on all the nodes.
$ su
# yum install readline-devel
# cd
# wget http://cran.rstudio.com/src/base/R-3.1.2.tar.gz
# tar -zxf R-3.1.2.tar.gz
# cd R-3.1.2
# ./configure --with-x=no --with-recommended-packages=no
# make
# make install
# R
R> install.packages('stringi')
R> q()
  2. Edit yarn-site.xml (on all nodes):
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

Without this, Hadoop may complain about excessive virtual memory consumption by R.

  3. Create the script wc_mapper.R:
#!/usr/bin/env Rscript

library('stringi')
stdin <- file('stdin', open='r')

while(length(x <- readLines(con=stdin, n=1024L))>0) {
   x <- unlist(stri_extract_all_words(x))
   xt <- table(x)
   words <- names(xt)
   counts <- as.integer(xt)
   cat(stri_paste(words, counts, sep='\t'), sep='\n')
}
  4. Create a source file wc_reducer.cpp:
#include <iostream>
#include <string>
#include <cstdlib>

using namespace std;

int main()
{
  string line;
  string last_word = "";
  int last_count = 0;

  while(getline(cin,line))
  {
    size_t found = line.find_first_of("\t");
    if(found != string::npos)
    {
      string key = line.substr(0,found);
      string value = line.substr(found);
      int valuei = atoi(value.c_str());
      //cerr << "key=" << key << " value=" << value <<endl;
      if(key != last_word)
      {
              if(last_word != "") cout << last_word << "\t" << last_count << endl;

              last_word = key;
              last_count = valuei;
      }
      else
              last_count += valuei;
    }
  }
  if(last_word != "") cout << last_word << "\t" << last_count << endl;


  return 0;
}

Now it’s time to compile the above C++ source file:

$ g++ -O3 wc_reducer.cpp -o wc_reducer
  5. Let's submit a map/reduce job via the Hadoop Streaming API:
$ chmod 755 wc_mapper.R
$ hadoop jar /opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
   -input /input/test.txt \
   -output /output \
   -mapper wc_mapper.R \
   -reducer wc_reducer \
   -file wc_mapper.R \
   -file wc_reducer

By the way, the Fedora 20 RPM Hadoop distribution provides the Hadoop Streaming API jar file under /usr/share/hadoop/mapreduce/hadoop-streaming.jar.
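
Before submitting a job, it can be handy to sanity-check the mapper logic directly in an R session. Here is a minimal sketch (the sample lines are made up) that mimics what wc_mapper.R emits for a chunk of input:

library(stringi)

x <- c("to be or not to be", "that is the question")   # made-up stand-ins for lines read from stdin
words <- unlist(stri_extract_all_words(x))
xt <- table(words)
cat(stri_paste(names(xt), as.integer(xt), sep = "\t"), sep = "\n")
# Each emitted line is "word<TAB>count"; Hadoop sorts these lines by key and
# wc_reducer then sums the counts for identical words.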

Summary

In this tutorial we showed how to submit a simple Map/Reduce job via the Hadoop Streaming API. Interestingly, we used an R script as the mapper and a C++ program as the reducer. In an upcoming blog post we’ll explain how to run a job using the rmr2 package.

To leave a comment for the author, please follow the link and comment on his blog: Rexamine » Blog/R-bloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News