Bagging, the perfect solution for model instability

By Gabriel Vasconcelos

Motivation

The name bagging comes from bootstrap aggregating. It is a machine learning technique proposed by Breiman (1996) to increase stability in potentially unstable estimators. For example, suppose you want to run a regression with a few variables in two steps. First, you run the regression with all the variables in your data and select the significant ones. Second, you run a new regression using only the selected variables and compute the predictions.

This procedure is not wrong if your problem is forecasting. However, this two step estimation may result in highly unstable models. If many variables are important but individually their importance is small, you will probably leave some of them out, and small perturbations on the data may drastically change the results.

Bagging

Bagging solves the instability problem by giving each variable several chances to be in the model. The same procedure described in the previous section is repeated in many bootstrap samples, and the final forecast is the average of the forecasts across all these samples.

In this post I will show an example of Bagging with regression in the selection context described above, but you should keep in mind that this technique can be used for any unstable model. The most successful case is Bagging applied to regression trees, which results in the famous random forest. Bagging is also very robust to overfitting and produces forecasts with less variance.

Application

I am going to show an application of Bagging from scratch using the boot package for the bootstrap replications. We start by generating some data with the following DGP:

y_i = x_{i,1} beta_1 + dots + x_{i,K} beta_K + e_i,    i=1,dots,N

x_{i,j} = A_j f_i + epsilon_{i,j},    i=1,dots,N; j=1,dots,K

The first equation shows how the variable we want to predict, y_i, is generated. The second equation shows that the independent variables are generated by a common factor, which will induce a correlation structure. Since we need instability, the data will be generated to have an R^2 around 0.33 and K=40 in order to have many variables with small individual importance.

N = 300 # = Number of obs. = #
K = 40 # = Number of Variables = #
set.seed(123) # = Seed for replication = #
f = rnorm(N) # = Common factor = #
A = runif(K,-1,1) # = Coefficients from the second equation = #
X = t(A%*%t(f)) + matrix(rnorm(N*K), N, K) # = Generate xij = #
beta = c((-1)^(1:(K-10)) ,rep(0, 10)) # = Coefficients from the first equation, the first 30 are equal 1 or -1 and the last 10 are 0 = #

# = R2 setup = #
aux = var(X%*%beta)
erro = rnorm(N, 0, sqrt(2*aux)) # = The variance of the error will be twice the variance of X%*%beta = #

y = X %*% beta+erro # = Generate y = #
1-var(erro)/var(y) # = R2 = #
##           [,1]
## [1,] 0.2925866
# = Break data into in-sample and out-of-sample = #
y.in = y[1:(N-100)]
X.in = X[1:(N-100),]
y.out = y[-c(1:(N-100))]
X.out = X[-c(1:(N-100)),]

Now we must define a function to be called by the bootstrap. This function must receive the data and an index argument that tells R which rows to sample. It will run a linear regression with all variables, select those with absolute t-statistics bigger than 1.96, run a new regression with the selected variables and store the coefficients.

bagg = function(data,ind){
  sample = data[ind,]
  y = sample[, 1]
  X = sample[, 2:ncol(data)]
  model1 = lm(y ~ X)
  t.stat = summary(model1)$coefficients[-1, 3]
  selected = which(abs(t.stat)>=1.96)
  model2 = lm(y ~ X[, selected])
  coefficients = coef(model2)
  cons = coefficients[1]
  betas = coefficients[-1]
  aux = rep(0, ncol(X))
  aux[selected] = betas
  res = c(cons, aux)
  return(res)
}

Now we are ready to use the boot function to run the Bagging. The code below does precisely that with 500 bootstrap replications. It also runs the simple OLS procedure for comparison. If you type selected after running this code you will see that many variables were left out and some variables with beta=0 were selected.

library(boot)
# = Bagging = #
bagging = boot(data = cbind(y.in, X.in), statistic=bagg, R=500)
bagg.coef = bagging$t
y.pred.bagg = rowMeans(cbind(1, X.out)%*%t(bagg.coef))

# = OLS = #
ols0 = lm(y.in ~ X.in)
selected = which(abs(summary(ols0)$coefficients[-1, 3])>=1.96)
ols1 = lm(y.in ~ X.in[, selected])
y.pred = cbind(1, X.out[, selected])%*%coef(ols1)

# = Forecasting RMSE = #
sqrt(mean((y.pred-y.out)^2)) # = OLS = #
## [1] 10.43338
sqrt(mean((y.pred.bagg-y.out)^2)) # = Bagging = #
## [1] 9.725554

The results show that Bagging has an RMSE around 7% smaller than OLS. It is also interesting to see what happens to the coefficients. The plot below shows that bagging performs some shrinkage of the coefficients towards zero.

barplot(coef(ols0)[-1])
barplot(colMeans(bagg.coef)[-1], add=TRUE, col="red")

(Plot: OLS coefficients as grey bars with the bagging coefficients overlaid in red.)

Finally, what would happen if we had no instability? If you run the same code again changing the R^2 to around 99% you will find the plot below. To change the R^2 to 99%, just replace 2 with 0.01 in the line erro = rnorm(N, 0, sqrt(2*aux)) in the DGP. As you can see, bagging and the simple procedure are the same in the absence of instability.
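
For concreteness, the only line of the DGP that needs to change for the high R^2 version is the one that generates the error term:

# = Low-noise version: error variance is 1% of the variance of X%*%beta, so R2 is around 0.99 = #
erro = rnorm(N, 0, sqrt(0.01*aux))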

(Plot: OLS and bagging coefficients with R^2 around 0.99; the two sets of coefficients coincide.)

References

Breiman, Leo (1996). “Bagging predictors”. Machine Learning. 24 (2): 123–140

Source:: R News

Sunflowers for COLOURlovers

By @aschinchon

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Walking, as far as walking goes, I always walked on top of the clouds (Del tiempo perdido, Robe)

If you give importance to colours, maybe you already know COLOURlovers. As can be read on their website, COLOURlovers is a creative community where people from around the world create and share colors, palettes and patterns, discuss the latest trends and explore colorful articles… All in the spirit of love.

There is an R package called colourlovers which provides access to the COLOURlovers API. It makes it very easy to choose nice colours for your graphics. I used the clpalettes function to search for the top palettes of the website. Their names are pretty suggestive as well: Giant Goldfish, Thought Provoking, Adrift in Dreams, let them eat cake… Inspired by this post I have built a Shiny app to create coloured flowers using those palettes. Seeds are arranged according to the golden angle. One example:

Some others:








You can play with the app here.

If you want to do your own sunflowers, here you have the code. This is the ui.R file:

library(colourlovers)
library(rlist)
top=clpalettes('top')
sapply(1:length(top), function(x) list.extract(top, x)$title)->titles

fluidPage(
  titlePanel("Sunflowers for COLOURlovers"),
  fluidRow(
    column(3,
           wellPanel(
             selectInput("pal", label = "Palette:", choices = titles),
             sliderInput("nob", label = "Number of points:", min = 200, max = 500, value = 400, step = 50)
           )
    ),
    mainPanel(
      plotOutput("Flower")
    )
  )
  )

And this is the server.R one:

library(shiny)
library(ggplot2)
library(colourlovers)
library(rlist)
library(dplyr)

top=clpalettes('top')
sapply(1:length(top), function(x) list.extract(top, x)$title)->titles

CreatePlot = function (ang=pi*(3-sqrt(5)), nob=150, siz=15, sha=21, pal="LoversInJapan") {
  
  list.extract(top, which(titles==pal))$colors %>% 
    unlist %>% 
    as.vector() %>% 
    paste0("#", .) -> all_colors
  
  colors=data.frame(hex=all_colors, darkness=colSums(col2rgb(all_colors)))
  colors %>% arrange(-darkness)->colors
  
  background=colors[1,"hex"] %>% as.character

  colors %>% filter(hex!=background) %>% .[,1] %>% as.vector()->colors

  ggplot(data.frame(r=sqrt(1:nob), t=(1:nob)*ang*pi/180), aes(x=r*cos(t), y=r*sin(t)))+
    geom_point(colour=sample(colors, nob, replace=TRUE, prob=exp(1:length(colors))), aes(size=(nob-r)), shape=16)+
    scale_x_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+
    scale_y_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+
    theme(legend.position="none",
          panel.background = element_rect(fill=background),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())}

function(input, output) {
 output$Flower=renderPlot({
    CreatePlot(ang=180*(3-sqrt(5)), nob=input$nob, siz=input$siz, sha=as.numeric(input$sha), pal=input$pal)
  }, height = 550, width = 550 )}

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.


Source:: R News

Manipulate Biological Data Using Biostrings Package Exercises (Part 1)

By Stephen James

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Bioinformatics is an amalgamation of biology and computer science: biological data is manipulated using computers and computer software. Biological data includes DNA, RNA and proteins. DNA and RNA are made of nucleotides, which carry our genetic material, while our structure and functions come from proteins, which are built from amino acids.

In the exercises below we cover how to manipulate biological data using the Biostrings package from Bioconductor.

Install Packages
Biostrings
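
Biostrings lives on Bioconductor rather than CRAN, so for readers who have not installed it yet, a minimal installation sketch (using the current BiocManager route) is:

install.packages("BiocManager")
BiocManager::install("Biostrings")
library(Biostrings)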

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Print out the standard Genetic Code table using Biostrings Package

Exercise 2

Print the first codon in the standard genetic code

Exercise 3

Print out the Standard RNA Genetic Code table using Biostrings package

Exercise 4

Print out the Standard RNA Genetic Code of Stop codon using Biostrings package

Exercise 5

Print out the standard Amino acid codon table using Biostrings

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 6

Print the code of the start codon from the standard genetic code

Exercise 7

Print the three letter code of Amino acid Methionine using Biostrings

Exercise 8

Create a DNA string and print the length and dinucleotide frequency of the string

Exercise 9

Create RNA string and print the length and dinucleotide frequency of the string

Exercise 10

Create a Protein string and print the length of the protein
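
If you have never used Biostrings before, here is a small sketch of the kind of objects and helpers the exercises above revolve around (the full worked answers are on the solutions page linked above):

library(Biostrings)
GENETIC_CODE                    # named character vector mapping codons to amino acids
dna <- DNAString("ACGTACGTTT")  # a DNA string object
length(dna)                     # number of nucleotides
dinucleotideFrequency(dna)      # counts of all 16 dinucleotides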

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

A Very Palette-able Post

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Many of my posts seem to begin with a link to a tweet, and this one falls into that pattern:

And @_inundata is already working on a #rstats palette. https://t.co/bNfpL7OmVl

— Timothée Poisot (@tpoi) May 21, 2017

I’d seen the Ars Tech post about the named color palette derived from some training data. I could tell at a glance of the resultant palette:

that it would not be ideal for visualizations (use this site to test the final image in this post and verify that on your own), but this was a neat, quick project to take on, especially since it let me dust off an old GH package, adobecolor, and it was likely I could beat Karthik to creating a palette.

The “B+” goal is to get a color palette that “matches” the one in the Tumblr post. The “A” goal is to get a named palette.

These are all the packages we end up using:

library(tesseract)
library(magick)
library(stringi)
library(adobecolor) # hrbrmstr/adobecolor - may not be Windows friendly
library(tidyverse)

Attempt #1 (B+!!)

I’m a macOS user, so I’ve got great tools like xScope at my disposal. I’m really handy with that app and the Loupe tool makes it easy to point at a color, save it to a palette board and export an ACO palette file.

That whole process took ~18 seconds (first try). I’m not saying that to brag. But we often get hung up on both speed and programmatic reproducibility. I ultimately — as we’ll see in a bit — really went for speed vs programmatic reproducibility.

It’s dead simple to get the palette into R:

aco_fil 

IIRC there may still be a byte-order issue (PRs welcome) I need to deal with on Windows in adobecolor but you likely will never need to use the package again.

A quick eyeball comparison between the Tumblr list and that matrix indicates the colors are off. That could be for many reasons, ranging from the way they were encoded in the PNG by whatever programming language was used to train the neural net and make the image (likely Python), to Tumblr degrading it, to something on my end. You’ll see that the colors are close enough for humans that it likely doesn’t matter.

There, I’ve got a B+ with about a total of 60s of work! Plenty of time left to try shooting for an A!

Attempt #2 (FAIL)

We’ve got the PNG from the Tumblr post and the tesseract package in R. Perhaps this will be super-quick, too:

pal_img_fil 

Ugh.

Perhaps if we crop out the colors:

image_read(pal_img_fil) %>%
  image_crop("+57") %>%
  ocr() %>%
  stri_split_lines()
## [[1]]
##  [1] "Clanfic Fug112113 84"       "Snowhunk 201 199 165"     
##  [3] "Cmbabcl 97 93 as"          "Bunfluwl90174155"          
##  [5] "Kunming Blue 121 114 125"  "Bank Bun 221196199"       
##  [7] "Caring Tan 171 ms 170"     "Slarguun 233 191 141"     
##  [9] "Sinkl76135110"             ""                         
## [11] "SIIImmy Beige 216 200 135" "Durkwuud e1 63 66"        
## [13] "Flower 175 154 196"        ""                         
## [15] "Sand Dan 201 172 143"      "Grade 1m AB 94: 53"       
## [17] ""                          "Light 0mm 175 150 147"    
## [19] "Grass Ba! 17a 99 ms"       "sxndis Poop 204 205 194"  
## [21] "Dupe 219 209 179"          ""                         
## [23] "Tesling 156 101 106"       "SloncrEluc 152 165 159"   
## [25] "Buxblc Simp 226 131 132"   "Sumky Bean 197 162 171"   
## [27] "1mfly 190 164 11a"        ""                         
## [29] ""

Ugh.

I’m woefully unfamiliar with how to use the plethora of tesseract options to try to get better performance and this is taking too much time for a toy post, so we’ll call this attempt a failure 🙁

Attempt #3 (A-!!)

I’m going to go outside of R again to New OCR and upload the Tumblr palette there and crop out the colors (it lets you do that in-browser). NOTE: Never use any free site for OCR’ing sensitive data as most are run by content thieves.

Now we’re talkin’:

ocr_cols 

We can get that into a more useful form pretty quickly:

stri_match_all_regex(ocr_cols, "([[:alpha:] ]+) ([[:digit:]]+) ([[:digit:]]+) ([[:digit:]]+)") %>%
  print() %>%
  .[[1]] -> col_mat
## [[1]]
##       [,1]                         [,2]             [,3]  [,4]  [,5] 
##  [1,] "Clardic Fug 112 113 84"     "Clardic Fug"    "112" "113" "84" 
##  [2,] "Snowbonk 201 199 165"       "Snowbonk"       "201" "199" "165"
##  [3,] "Catbabel 97 93 68"          "Catbabel"       "97"  "93"  "68" 
##  [4,] "Bunfiow 190 174 155"        "Bunfiow"        "190" "174" "155"
##  [5,] "Ronching Blue 121 114 125"  "Ronching Blue"  "121" "114" "125"
##  [6,] "Bank Butt 221 196 199"      "Bank Butt"      "221" "196" "199"
##  [7,] "Caring Tan 171 166 170"     "Caring Tan"     "171" "166" "170"
##  [8,] "Stargoon 233 191 141"       "Stargoon"       "233" "191" "141"
##  [9,] "Sink 176 138 110"           "Sink"           "176" "138" "110"
## [10,] "Stummy Beige 216 200 185"   "Stummy Beige"   "216" "200" "185"
## [11,] "Dorkwood 61 63 66"          "Dorkwood"       "61"  "63"  "66" 
## [12,] "Flower 178 184 196"         "Flower"         "178" "184" "196"
## [13,] "Sand Dan 201 172 143"       "Sand Dan"       "201" "172" "143"
## [14,] "Grade Bat 48 94 83"         "Grade Bat"      "48"  "94"  "83" 
## [15,] "Light Of Blast 175 150 147" "Light Of Blast" "175" "150" "147"
## [16,] "Grass Bat 176 99 108"       "Grass Bat"      "176" "99"  "108"
## [17,] "Sindis Poop 204 205 194"    "Sindis Poop"    "204" "205" "194"
## [18,] "Dope 219 209 179"           "Dope"           "219" "209" "179"
## [19,] "Testing 156 101 106"        "Testing"        "156" "101" "106"
## [20,] "Stoncr Blue 152 165 159"    "Stoncr Blue"    "152" "165" "159"
## [21,] "Burblc Simp 226 181 132"    "Burblc Simp"    "226" "181" "132"
## [22,] "Stanky Bean 197 162 171"    "Stanky Bean"    "197" "162" "171"
## [23,] "Thrdly 190 164 116"         "Thrdly"         "190" "164" "116"

The print() is in the pipe as I can never remember where each stringi function sticks lists but usually guess right, plus I wanted to check the output.

Making those into colors is super-simple:

y 
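
(The snippet above is truncated in this copy; the idea is to turn the OCR'd RGB triples in col_mat into a named vector of hex colours, something along these lines:)

ocr_cols <- rgb(as.numeric(col_mat[, 3]), as.numeric(col_mat[, 4]),
                as.numeric(col_mat[, 5]), maxColorValue = 255)
names(ocr_cols) <- col_mat[, 2]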

If we look at Attempt #1 and Attempt #2 together:

ocr_cols
##    Clardic Fug       Snowbonk       Catbabel        Bunfiow  Ronching Blue 
##      "#707154"      "#C9C7A5"      "#615D44"      "#BEAE9B"      "#79727D" 
##      Bank Butt     Caring Tan       Stargoon           Sink   Stummy Beige 
##      "#DDC4C7"      "#ABA6AA"      "#E9BF8D"      "#B08A6E"      "#D8C8B9" 
##       Dorkwood         Flower       Sand Dan      Grade Bat Light Of Blast 
##      "#3D3F42"      "#B2B8C4"      "#C9AC8F"      "#305E53"      "#AF9693" 
##      Grass Bat    Sindis Poop           Dope        Testing    Stoncr Blue 
##      "#B0636C"      "#CCCDC2"      "#DBD1B3"      "#9C656A"      "#98A59F" 
##    Burblc Simp    Stanky Bean         Thrdly 
##      "#E2B584"      "#C5A2AB"      "#BEA474"

aco_hex
##  [1] "#707055" "#CBC6A6" "#615C49" "#BFAE9C" "#78727C" "#DDC4C7" "#A9A7AB"
##  [8] "#E9BF8F" "#B18A6D" "#D8C8B9" "#3E3F43" "#B2B8C4" "#C7AC92" "#305E53"
## [15] "#AC9891" "#B1646B" "#CBCDC0" "#DBD2B3" "#A2626A" "#98A59E" "#E8B187"
## [22] "#C5A1AB" "#BFA17C"

we can see they’re really close to each other, and I doubt anyone but the most egregiously picky color snobs could tell the difference visually:

par(mfrow=c(1,2))
scales::show_col(ocr_cols)
scales::show_col(aco_hex)
par(mfrow=c(1,1))

(OK, #3D3F43 is definitely hitting my OCD as being annoyingly different than #3D3F42 on my MacBook Pro so count me in as a color snob.)

Here’s the final palette:

structure(c("#707154", "#C9C7A5", "#615D44", "#BEAE9B", "#79727D", 
"#DDC4C7", "#ABA6AA", "#E9BF8D", "#B08A6E", "#D8C8B9", "#3D3F42", 
"#B2B8C4", "#C9AC8F", "#305E53", "#AF9693", "#B0636C", "#CCCDC2", 
"#DBD1B3", "#9C656A", "#98A59F", "#E2B584", "#C5A2AB", "#BEA474"
), .Names = c("Clardic Fug", "Snowbonk", "Catbabel", "Bunfiow", 
"Ronching Blue", "Bank Butt", "Caring Tan", "Stargoon", "Sink", 
"Stummy Beige", "Dorkwood", "Flower", "Sand Dan", "Grade Bat", 
"Light Of Blast", "Grass Bat", "Sindis Poop", "Dope", "Testing", 
"Stoncr Blue", "Burblc Simp", "Stanky Bean", "Thrdly"))

This third attempt took ~5 minutes vs 60s.

FIN

Why “A-“? Well, I didn’t completely verify the colors and values matched 100% in the final submission. They are likely the same, but the best way to get something corrected by others is to put it on the internet, so there it is 🙂

I’d be a better human and coder if I took the time to learn tesseract more, but I don’t have much need for OCR’ing text. It is likely worth the time to brush up on tesseract after you read this post.

Don’t use this palette! I created it mostly to beat Karthik to making the palette (I have no idea if I succeeded), to also show that you should not forego your base R roots (I could have let that be subliminal but I wasn’t trying to socially engineer you in this post) and to bring up the speed/reproducibility topic. I see no issues with manually doing tasks (like uploading an image to a web site) in certain circumstances, but it’d be an interesting topic of debate to see just what “rules” folks use to determine how much effort one should put into 100% programmatic reproducibility.

You can find the ACO file and an earlier, alternate attempt at making the palette in this gist.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


Source:: R News

Using R: When using do in dplyr, don’t forget the dot

By mrtnj

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

library(plyr)
results <- ddply(some_data, "key", function(x) {
  # do something useful with the sub-data frame x and return a data frame
})

Or maybe, if I felt serious and thought the function would ever be used again, I'd write:

calculate <- function(data) {
  # something useful that takes a data frame and returns a data frame
}
results <- ddply(some_data, "key", calculate)

Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

library(dplyr)
grouped <- group_by(some_data, key)
result <- do(grouped, calculate(.))

Or once again with magrittr:

library(magrittr)
some_data %>%
  group_by(key) %>%
  do(calculate(.)) -> result

(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

grouped <- group_by(some_data, key)
result <- do(grouped, calculate)  # no dot: this will not work as intended

Don't forget the dot!


Source:: R News

Shiny Applications Layouts Exercises (Part-9)

By Euthymios Kasvikis

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Shiny Application Layouts – Update Input

In the ninth part of our series we will look at the update input scenario. This is different from the dynamic UI example we met in part 8, where the UI component is generated on the server and sent to the UI, where it replaces an existing UI component.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly, you can revise what you learnt in the “Building Shiny App” series, as we will build basic Shiny components in order to present them in the proper way.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

UI Context

As always let us create the skeleton of the app.
#ui.r
fluidPage(
titlePanel("Title"),
fluidRow(
column(3,
wellPanel()),
column(3,
wellPanel()),
column(3,
wellPanel())
)
)

#server.R
function(input, output, clientData, session) {}

Exercise 1

Build the initial UI of your app. Give it a title, use 4 columns and one row for each one of the three wellPanels that we are going to use.

Applying reactivity

To create a reactive context for our app we will use the observe({}) function in our server side.
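
As a bare-bones sketch (not the exercise solution), the server side with that reactive context looks something like this:

#server.r
function(input, output, clientData, session) {
  observe({
    # code in here re-runs whenever the inputs it reads change;
    # the update*Input() calls from the later exercises go inside this block
  })
}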

Exercise 2

Create reactive context. HINT: Use observe({}).

“Master” Inputs.

Now we are going to create the two inputs that control the rest of the inputs of the app. Look at the example:
#ui.r
textInput("text_input",
"labels:",
"Some Text"),
sliderInput("slider_input",
"values:",
min = 1, max = 100, value = 50)

#server.r
t_input <- input$text_input
s_input <- input$slider_input

Exercise 3

Create a text Input in the ui side that controls labels and then pass it to a new variable on the server side.

Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

Exercise 4

Create a slider Input in the ui side that controls values and then pass it to a new variable on the server side.

Dependent Inputs

Firstly let’s create a text input that changes both the label and the text.
#ui.r
textInput("Text", "Text input:", value = "text")
#server.r
updateTextInput(session, "Text",
label = paste("Sth", t_input),
value = paste("Sth", s_input)
)

Exercise 5

Create a text input, in the second wellPanel, that changes its label and its value according to the two “Master” Inputs you created before. HINT: Use updateTextInput().

Exercise 6

Create a slider input, in the second wellPanel, that changes its label and its value according to the two “Master” Inputs you created before. HINT: Use updateSliderInput().

Exercise 7

Create a numeric input, in the second wellPanel, that changes its label and its value according to the two “Master” Inputs you created before. HINT: Use updateNumericInput().

Exercise 8

Create a date input, in the second wellPanel, that changes its label and its value according to the two “Master” Inputs you created before. HINT: Use updateDateInput().

In order to create a Checkbox group with the same conditions as the rest of the inputs we just created we should first create a list of options like in the example below:
#server.r
options <- list()
options[[paste("option", s_input, "A")]] <-
  paste0("option-", s_input, "-A")
options[[paste("option", s_input, "B")]] <-
  paste0("option-", s_input, "-B")

Exercise 9

Create a list with three choices for your Checkbox Group. HINT: Use list().

Exercise 10

Create a checkbox group input, in the third wellPanel, that changes its label and its value according to the two “Master” Inputs you created before. HINT: Use updateCheckboxGroupInput().

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


Source:: R News

New series: R and big data (concentrating on Spark and sparklyr)

By John Mount

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Win-Vector LLC has recently been teaching how to use R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big-data using R, share some hints, and even admit to some of the warts found in this combination of systems.

The ability to perform sophisticated analyses and modeling on “big data” with R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).

We Can Do It! poster by J. Howard Miller, 1943.

The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using Spark through R and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.

Please read on for a brief description of our new articles series: “R and big data.”

Background

R is a best of breed in-memory analytics platform. R allows the analyst to write programs that operate over their data and bring in a huge suite of powerful statistical techniques and machine learning procedures. Spark is an analytics platform designed to operate over big data that exposes some of its own statistical and machine learning capabilities. R can now be operated “over Spark“. That is: R programs can delegate tasks to Spark clusters and issue commands to Spark clusters. In some cases the syntax for operating over Spark is deliberately identical to working over data stored in R.
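
As a tiny illustration of that last point (not taken from the original article), here is what a minimal local-mode sparklyr session looks like, with dplyr verbs translated to Spark behind the scenes:

library(sparklyr)
library(dplyr)
# assumes a local Spark installation, e.g. via sparklyr::spark_install()
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")  # a handle to a Spark DataFrame, not an R data frame
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()                                  # bring the (small) result back into R
spark_disconnect(sc)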

Why R and Spark

The advantages are:

  • Spark can work at a scale and speed far larger than native R . The ability to send work to Spark increases R‘s capabilities.
  • R has machine learning and statistical capabilities that go far beyond what is available on Spark or any other “big data” system (many of which are descended from report generation or basic analytics). The ability to use specialized R methods on data samples yields additional capabilities.
  • R and Spark can share code and data.

The R/Spark combination is not the only show in town; but it is a powerful capability that may not be safe to ignore. We will also talk about additional tools that can be brought into the mix, such as the powerful large scale machine learning capabilities from h2o.

The warts

Frankly a lot of this is very new, and still on the “bleeding edge.” Spark 2.x has only been available in stable form since July 26, 2016 (or just under a year). Spark 2.x is much more capable than the Spark 1.x series in terms of both data manipulation and machine learning, so we strongly suggest clients insist on Spark 2.x clusters from their infrastructure vendors (such as Cloudera, Hortonworks, MapR, and others), despite these having only become available in packaged solutions recently. The sparklyr adapter itself was first available on CRAN only as of September 24th, 2016. And SparkR only started distributing with Spark 1.4 as of June 2015.

While R/Spark is indeed a powerful combination, nobody seems to be sharing a lot of production experience and best practices with it yet.

Some of the problems are sins of optimism. A lot of people still confuse successfully standing up a cluster with effectively using it. Other people confuse the statistical procedures available in in-memory R (which are very broad and often quite mature) with those available in Spark (which are less numerous and less mature).

Our goal

What we want to do with the “R and big data” series is:

  • Give a taste of some of the power of the R/Spark combination.
  • Share a “capabilities and readiness” checklist you should apply when evaluating infrastructure.
  • Start to publicly document R/Spark best practices.
  • Describe some of the warts and how to work around them.
  • Share fun tricks and techniques that make working with R/Spark much easier and more effective.

The start

Our next article in this series will be up soon and will discuss the nature of data-handles in Sparklyr (one of the R/Spark interfaces) and how to manage your data inventory neatly.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Source:: R News

Sankey charts for swinging voters

By Peter's stats stuff – R

(This article was first published on Peter’s stats stuff – R, and kindly contributed to R-bloggers)

Continuing my examination of the individual level voting behaviour from the New Zealand Election Study, I wanted to look at the way individuals swap between parties, and between “did not vote” and a party, from one election to another. How much and how this happens is obviously an important question for both political scientists and for politicians.

Vote transition visualisations

I chose a Sankey chart as a way of showing the transition matrix from self-reported party vote in the 2011 election to the 2014 election. Here’s a static version:

And here is the more screen-friendly interactive version, with mouseover tooltips to give actual estimates:

The point with these graphics is to highlight the transitions. For example, what were the implications of turnout being higher in 2014 than 2011 (77.9% of enrolled voters in 2014 compared to 74.2% in 2011)? Judging from this survey data, the National Party gained 6.6% of the enrolled population in 2014 by converting them from a 2011 “did not vote” and lost only 3.6% in the other direction. This net gain of three percentage points was enough to win the election for the National-led coalition. In contrast, the Labour party had a net gain from “did not vote” in 2011 of only 0.2 percentage points. Remember though that these are survey-based estimates, and subject to statistical error.

I find setting up and polishing Sankey charts – controlling colours for example – a bit of a pain, so the code at the bottom of this post on how this was done might be of interest.

Weighting, missing data, population and age complications

Those visualisations have a few hidden fishhooks, which careful readers would find if they compare the percentages in the tooltips of the interactive version with percentages reported by the New Zealand Electoral Commission.

  • The 2014 percentages are proportions of the enrolled population. As the 2014 turnout of enrolled voters was 77.9%, the numbers here are noticeably less than the usually cited percentages which were used to translate into seat counts (for example, National Party reported party vote of 47.0% of votes becomes 36.6% of enrolled voters)
  • The 2011 percentages are even harder to explain, because I’ve chosen not only to scale the party vote and “did not vote” to the 2011 enrolled population as reported by the Commission, but also to add in around 5% of the 2014 population that were too young to vote in 2011.

Two things I would have liked to have taken into account but wasn’t able to were:

  • The “leakage” from the 2011 election of people who were deceased or had left the country by 2014
  • Explicit recognition of people who voted in 2014 but not in 2011 because they were out of the country. There is a variable in the survey that picks up the year the respondent came to live in New Zealand if not born here, but for only 10 respondents was this 2012 or later (in contrast to age – there were 58 respondents aged 20 or less in 2014).

I re-weighted the survey so the 2014 and 2011 reported party votes matched the known totals (with the addition of people aged 15 to 17 in 2011). One doesn’t normally re-weight a survey based on answers provided by the respondents, but in this case I think it makes perfect sense to calibrate to the public totals. The biggest impact is that for both years, but particularly 2011, relying on the respondents’ self-report and the published weighting of the NZES, totals for “did not vote” are materially understated, compared to results when calibrated to the known totals.

When party vote in 2011 had been forgotten or was an NA, and this wasn’t explained by being too young in 2011, I used multiple imputation based on a subset of relevant variables to give five instances of probable party vote to each such respondent.

Taken together, all this gives the visualisations a perspective based in 2014. It is better to think of it as “where did the 2014 voters come from” than “where did the 2011 voters go”. This is fairly natural when we consider it is the 2014 New Zealand Election Study, but is worth keeping in mind in interpretation.

Age (and hence the impact new young voters coming in, and of older voters passing on) is important in voting behaviour, as even the most casual observation of politics shows. In New Zealand, the age distribution of party voters in 2014 is seen in the chart below:

Non-voters, Green voters and to a certain extent Labour voters are young; New Zealand First voters are older. If this interests you though, I suggest you look at the multivariate analysis in this blog post or, probably more fun, this fancy interactive web app which lets you play with the predicted probabilities of voting based on a combination of demographic and socio-economic status variables.

Code

Here’s the R code that did that imputation, weighting, and the graphics:

library(tidyverse)
library(forcats)
library(riverplot)
library(sankeyD3)
library(foreign)
library(testthat)
library(mice)
library(grid)
library(survey)
library(nzelect)   # assumed here as the source of the parties_v colour vector used in the final plot

# Caution: networkD3 has its own sankeyNetwork() with fewer features.  Make sure you know which one you're using!
# Don't load up networkD3 in the same session

#============import data======================
# See previous blog posts for where this comes from:
unzip("D:/Downloads/NZES2014GeneralReleaseApril16.sav.zip")

nzes_orig <- read.spss("NZES2014GeneralReleaseApril16.sav", 
                       to.data.frame = TRUE, trim.factor.names = TRUE)

# Five versions of each row, so when we do imputation later on we
# in effect doing multiple imputation:
nzes <- nzes_orig[rep(1:nrow(nzes_orig), each = 5), ] %>%
   # lump up party vote in 2014:
   mutate(partyvote2014 = ifelse(is.na(dpartyvote), "Did not vote", as.character(dpartyvote)),
          partyvote2014 = gsub("M.ori", "Maori", partyvote2014),
          partyvote2014 = gsub("net.Man", "net-Man", partyvote2014),
          partyvote2014 = fct_infreq(partyvote2014)) %>%
   mutate(partyvote2014 = fct_lump(partyvote2014, 5)) 

# party vote in 2011, and a smaller set of columns to do the imputation:
nzes2 <- nzes %>%
   mutate(partyvote2011 = as.factor(ifelse(dlastpvote == "Don't know", NA, as.character(dlastpvote)))) %>%
   # select a smaller number of variables so imputation is feasible:
   select(dwtfin, partyvote2014, partyvote2011, dethnicity_m, dage, dhhincome, dtradeunion, dprofassoc,
          dhousing, dclass, dedcons, dwkpt)

# impute the missing values.  Although we are imputing only a single set of values,
# because we are doing it with each row repeated five times this has the same impact,
# for the simple analysis we do later on, as multiple imputation:
nzes3 <- complete(mice(nzes2, m = 1))

# Lump up the 2011 vote, tidy labels, and work out who was too young to vote:
nzes4 <- nzes3 %>%
   mutate(partyvote2011 = fct_lump(partyvote2011, 5),
          partyvote2011 = ifelse(grepl("I did not vote", partyvote2011), "Did not vote", as.character(partyvote2011)),
          partyvote2011 = ifelse(dage < 21, "Too young to vote", partyvote2011),
          partyvote2014 = as.character(partyvote2014))

#===============re-weighting to match actual votes in 2011 and 2014===================

# This makes a big difference, for both years, but more for 2011. "Did not vote" is only 16% in 2011
# if we only calibrate to 2014 voting outcomes, but in reality was 25.8%.  We calibrate for both
# 2011 and 2014 so that we get better proportions for both elections.

#------------2014 actual outcomes----------------
# http://www.elections.org.nz/news-media/new-zealand-2014-general-election-official-results
actual_vote_2014 <- data_frame(
   partyvote2014 = c("National", "Labour", "Green", "NZ First", "Other", "Did not vote"),
   Freq = c(1131501, 604534, 257356, 208300, 
            31850 + 16689 + 5286 + 95598 + 34095 + 10961 + 5113 + 1730 + 1096 + 872 + 639,
            NA)
)

# calculate the did not vote, from the 77.9 percent turnout
actual_vote_2014[6, 2] <- (100 / 77.9 - 1) * sum(actual_vote_2014[1:5, 2])

# check I did the turnout sums right:
expect_equal(0.779 * sum(actual_vote_2014[ ,2]), sum(actual_vote_2014[1:5, 2]))

#---------------2011 actual outcomes---------------------
# http://www.elections.org.nz/events/past-events-0/2011-general-election/2011-general-election-official-results
actual_vote_2011 <- data_frame(
   partyvote2011 = c("National", "Labour", "Green", "NZ First", "Other", "Did not vote", "Too young to vote"),
   Freq = c(1058636, 614937, 247372, 147511, 
            31982 + 24168 + 23889 + 13443 + 59237 + 11738 + 1714 + 1595 + 1209,
            NA, NA)
)
# calculate the did not vote, from the 74.21 percent turnout at 
# http://www.elections.org.nz/events/past-events/2011-general-election
actual_vote_2011[6, 2] <- (100 / 74.21 - 1) * sum(actual_vote_2011[1:5, 2])

# check I did the turnout sums right:
expect_equal(0.7421 * sum(actual_vote_2011[1:6 ,2]), sum(actual_vote_2011[1:5, 2]))

# from the below, we conclude 4.8% of the 2014 population (as estimated by NZES)
# were too young to vote in 2011:
nzes_orig %>%
   mutate(tooyoung = dage < 21) %>%
   group_by(tooyoung) %>%
   summarise(pop = sum(dwtfin),
             n = n()) %>%
   ungroup() %>%
   mutate(prop = pop / sum(pop))

# this is pretty approximate but will do for our purposes.
actual_vote_2011[7, 2] <- 0.048 * sum(actual_vote_2011[1:6, 2])

# Force the 2011 numbers to match the 2014 population
actual_vote_2011$Freq <- actual_vote_2011$Freq / sum(actual_vote_2011$Freq) * sum(actual_vote_2014$Freq)

#------------create new weights--------------------
# set up survey design with the original weights:
nzes_svy <- svydesign(~1, data = nzes4, weights = ~dwtfin)

# calibrate weights to the known total party votes in 2011 and 2014:
nzes_cal <- calibrate(nzes_svy, 
                      list(~partyvote2014, ~partyvote2011),
                      list(actual_vote_2014, actual_vote_2011))

# extract those weights for use in straight R (not just "survey")
nzes5 <- nzes4 %>%
   mutate(weight = weights(nzes_cal))

# See impact of calibrated weights for the 2014 vote:
prop.table(svytable(~partyvote2014, nzes_svy)) # original weights
prop.table(svytable(~partyvote2014, nzes_cal)) # recalibrated weights

# See impact of calibrated weights for the 2011 vote:
prop.table(svytable(~partyvote2011, nzes_svy)) # original weights
prop.table(svytable(~partyvote2011, nzes_cal)) # recalibrated weights


#=======================previous years vote riverplot version============

the_data <- nzes5 %>%
   mutate(
       # add two spaces to ensure the partyvote2011 does not have any
       # names that exactly match the party vote in 2014
       partyvote2011 = paste(partyvote2011, "  "),
       partyvote2011 = gsub("M.ori", "Maori", partyvote2011)) %>%
   group_by(partyvote2011, partyvote2014) %>%
   summarise(vote_prop = sum(weight)) %>%
   ungroup() 

# change names to the names wanted by makeRiver
names(the_data) <- c("col1", "col2", "Value")

# node ID need to be characters I think
c1 <- unique(the_data$col1)
c2 <- unique(the_data$col2)
nodes_ref <- data_frame(fullname = c(c1, c2)) %>%
   mutate(position = rep(c(1, 2), times = c(length(c1), length(c2)))) %>%
   mutate(ID = LETTERS[1:n()])

edges <- the_data %>%
   left_join(nodes_ref[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>%
   rename(N1 = ID) %>%
   left_join(nodes_ref[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>%
   rename(N2 = ID) %>%
   as.data.frame(stringsAsFactors = FALSE)

rp <- makeRiver(nodes = as.vector(nodes_ref$ID), edges = edges,
                node_labels = nodes_ref$fullname,
                # manual vertical positioning by parties.  Look at
                # nodes_ref to see the order in which positions are set.  
                # This is a pain, so I just let it stay with the defaults:
                #  node_ypos = c(4, 1, 1.8, 3, 6, 7, 5, 4, 1, 1.8, 3, 6, 7),
                node_xpos = nodes_ref$position,
                # set party colours; all based on those in nzelect::parties_v:
                node_styles = list(C = list(col = "#d82a20"), # red labour
                                   D = list(col = "#00529F"), # blue national
                                   E= list(col = "black"),   # black NZFirst
                                   B = list(col = "#098137"), # green
                                   J = list(col = "#d82a20"), # labour
                                   I = list(col = "#098137"), # green
                                   K = list(col = "#00529F"), # national
                                   L = list(col = "black")))  # NZ First

# Turn the text horizontal, and pale grey
ds <- default.style()
ds$srt <- 0
ds$textcol <- "grey95"

mygp <- gpar(col = "grey75")
# using PNG rather than SVG as vertical lines appear in the SVG version

par(bg = "grey40")

# note the plot_area argument - for some reason, defaults to using only half the
# vertical space available, so set this to higher than 0.5!:
plot(rp, default_style = ds, plot_area = 0.9)

title(main = "Self-reported party vote in 2011 compared to 2014", 
      col.main = "white", font.main = 1)

grid.text(x = 0.15, y = 0.1, label = "2011 party vote", gp = mygp)
grid.text(x = 0.85, y = 0.1, label = "2014 party vote", gp = mygp)
grid.text(x = 0.95, y = 0.03, 
          gp = gpar(fontsize = 7, col = "grey75"), just = "right",
          label = "Source: New Zealand Election Study, analysed at https://ellisp.github.io")

#=======================sankeyD3 version====================

nodes_ref2 <- nodes_ref %>%
   mutate(ID = as.numeric(as.factor(ID)) - 1) %>%
   as.data.frame()

edges2 <- the_data %>%
   ungroup() %>%
   left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>%
   rename(N1 = ID) %>%
   left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>%
   rename(N2 = ID) %>%
   as.data.frame(stringsAsFactors = FALSE) %>%
   mutate(Value = Value / sum(Value) * 100)

pal <- 'd3.scaleOrdinal()
         .range(["#DCDCDC", "#098137", "#d82a20", "#00529F",  "#000000",  "#DCDCDC", 
                 "#DCDCDC", "#098137", "#d82a20", "#00529F", "#000000", "#DCDCDC"]);'

sankeyNetwork(Links = edges2, Nodes = nodes_ref2, 
              Source = "N1", Target = "N2", Value = "Value",
              NodeID = "fullname",
              NodeGroup = "fullname",
              LinkGroup = "col2",
              fontSize = 12, nodeWidth = 30,
              colourScale = JS(pal),
              numberFormat = JS('d3.format(".1f")'),
              fontFamily = "Calibri", units = "%", 
              nodeShadow = TRUE,
              showNodeValues = FALSE,
              width = 700, height = 500) 

#=======other by the by analysis==================
# Age density plot by party vote

# Remember to weight by the survey weights - in this case it controls for
# the under or over sampling by age in the original design.
nzes5 %>%
   ggplot(aes(x = dage, fill = partyvote2014, weight = weight / sum(nzes5$weight))) +
   geom_density(alpha = 0.3) +
   facet_wrap(~partyvote2014, scales = "free_y") +
   scale_fill_manual(values = parties_v) +
   theme(legend.position = "none") +
   labs(x = "Age at time of 2014 election",
        caption = "Source: New Zealand Election Study") +
   ggtitle("Age distribution by Party Vote in the 2014 New Zealand General Election")
To leave a comment for the author, please follow the link and comment on their blog: Peter’s stats stuff – R.


Source:: R News

R Weekly Bulletin Vol – IX

By R programming

This week’s R bulletin covers how to list files, extract file names, and create a folder using R.

We will also cover functions like the select function, the filter function, and the arrange function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Run the current chunk – Ctrl+Alt+C
2. Run the next chunk – Ctrl+Alt+N
3. Run the current function definition – Ctrl+Alt+F

Problem Solving Ideas

How to list files with a particular extension

To list files with a particular extension, one can use the pattern argument in the list.files function. For example to list csv files use the following syntax.

Example:

files = list.files(pattern = ".csv$")

This will list all the csv files present in the current working directory. To list files in any other folder, you need to provide the folder path.

 list.files(path = "C:/Users/MyFolder", pattern = ".csv$")

The $ at the end anchors the pattern to the end of the string, and the . before csv helps ensure that you match only files with the extension .csv.
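
Strictly speaking, an unescaped dot in a regular expression matches any single character, so escaping it is a slightly safer habit:

# Escaping the dot matches a literal "." so only true .csv extensions are returned
files = list.files(pattern = "\\.csv$")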

Extracting file name using gsub function

When we download stock data from google finance, the file’s name corresponds to the stock data symbol. If we want to extract the stock data symbol from the file name, we can do it using the gsub function. The function searches for a match to the pattern argument and replaces all the matches with the replacement value given in the replacement argument. The syntax for the function is given as:

 gsub(pattern, replacement, x)

where,

pattern – is a character string containing a regular expression to be matched in the given character vector.
replacement – a replacement for matched pattern.
x – is a character vector where matches are sought.

In the example given below, we extract the file name for files stored in the “Reading MFs” folder. We have downloaded the stock price data into the R working directory for two companies, namely MRF and PAGEIND Ltd.

Example:

folderpath = paste(getwd(), "/Reading MFs", sep = "")
temp = list.files(folderpath, pattern = "*.csv")
print(temp)

[1] “MRF.csv” “PAGEIND.csv”

gsub("*.csv$", "", temp)

[1] “MRF” “PAGEIND”

Create a folder using R

One can create a folder via R with the help of the “dir.create” function. The function creates a folder with the name as specified in the last element of the path. Trailing path separators are discarded.

The syntax is given as:

dir.create(path, showWarnings = FALSE, recursive = FALSE)

Example:

dir.create("D:/RCodes", showWarnings = FALSE, recursive = FALSE)

This will create a folder called “RCodes” in the D drive.

Functions Demystified

select function

The select function comes from the dplyr package and can be used to select certain columns of a data frame which you need. Consider the data frame “df” given in the example.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Suppose we wanted to select the first 2 columns only. We can use the names of the columns in the 
# second argument to select them from the main data frame.

subset_df = select(df, Ticker:OpenPrice)
print(subset_df)

# Suppose we want to omit the OpenPrice column using the select function. We can do so by using
# the negative sign along with the column name as the second argument to the function.

subset_df = select(df, -OpenPrice)
print(subset_df)

# We can also use the 'starts_with' and the 'ends_with' arguments for selecting columns from the
# data frame. The example below will select all the columns which end with the word 'Price'.

subset_df = select(df, ends_with("Price"))
print(subset_df)

filter function

The filter function comes from the dplyr package and is used to extract subsets of rows from a data frame. This function is similar to the subset function in R.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Suppose we want to select stocks with closing prices above 750, we can do so using the filter 
# function in the following manner:

subset_df = filter(df, ClosePrice > 750)
print(subset_df)

# One can also use a combination of conditions as the second argument in filtering a data set.

subset_df = filter(df, ClosePrice > 750 & OpenPrice < 2100) # the OpenPrice cutoff here is illustrative
print(subset_df)

arrange function

The arrange function is part of the dplyr package, and is used to reorder rows of a data frame according to one of the columns. Columns can be arranged in descending order or ascending order by using the special desc() operator.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Arrange in descending order

subset_df = arrange(df, desc(OpenPrice))
print(subset_df)

# Arrange in ascending order.

subset_df = arrange(df, -desc(OpenPrice))
print(subset_df)

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


Source:: R News

Which science is all around? #BillMeetScienceTwitter

By Maëlle Salmon


(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I’ll admit I didn’t really know who Bill Nye was before yesterday. His name sounds a bit like Bill Nighy’s, that’s all I knew. But well science is all around and quite often scientists on Twitter start interesting campaigns. Remember the #actuallylivingscientists whose animals I dedicated a blog post? This time, the Twitter campaign is the #BillMeetScienceTwitter hashtag with which scientists introduce themselves to the famous science TV host Bill Nye. Here is a nice article about the movement.

Since I like surfing on Twitter trends, I decided to download a few of these tweets and to use my own R interface to the Monkeylearn machine learning API, monkeylearn (part of the rOpenSci project!), to classify the tweets in the hope of finding the most represented science fields. So, which science is all around?

Getting the tweets

It might sound a bit like trolling by now, but if you wanna get Twitter data, I recommend using rtweet because it’s a good package and because it’s going to replace twitteR which you might know from other blogs.

I only keep tweets in English, and moreover original ones, i.e. not retweets.

library("rtweet")
billmeet <- search_tweets(q = "#BillMeetScienceTwitter", n = 18000, type = "recent")
billmeet <- unique(billmeet)
billmeet <- dplyr::filter(billmeet, lang == "en")
billmeet <- dplyr::filter(billmeet, is_retweet == FALSE)

I’ve ended up with 2491 tweets.

Classifying the tweets

I’ve chosen to use this taxonomy classifier which classifies text according to generic topics and had quite a few stars on Monkeylearn website. I don’t think it was trained on tweets, and well it wasn’t trained to classify science topics in particular, which is not optimal, but it had the merit of being readily available. I’ve still not started training my own algorithms, and anyway, if I did I’d start by creating a very crucial algorithm for determining animal fluffiness on pictures, not text mining stuff. This was a bit off topic, let’s go back to science Twitter!

When I decided to use my own package I had forgotten that it takes charge of cutting the request vector into groups of 20 tweets, since the API only accepts 20 texts at a time. I thought I’d have to do that splitting myself (a small sketch of what that would look like follows the output below), but no: since I did it once in the code of the package, I’ll never need to write that code ever again. Great feeling! Look at how easy the code is after cleaning up the tweets a bit! One just needs to wait a bit before getting all the results.

output <- monkeylearn::monkeylearn_classify(request = billmeet$text,
                                            classifier_id = "cl_5icAVzKR")
str(output)
## Classes 'tbl_df', 'tbl' and 'data.frame':	4466 obs. of  4 variables:
##  $ category_id: int  64638 64640 64686 64696 64686 64687 64689 64692 64648 64600 ...
##  $ probability: num  0.207 0.739 0.292 0.784 0.521 0.565 0.796 0.453 0.301 0.605 ...
##  $ label      : chr  "Computers & Internet" "Internet" "Humanities" "Religion & Spirituality" ...
##  $ text_md5   : chr  "f7b28f45ea379b4ca6f34284ce0dc4b7" "f7b28f45ea379b4ca6f34284ce0dc4b7" "b95429d83df2cabb9cd701a562444f0b" "b95429d83df2cabb9cd701a562444f0b" ...
##  - attr(*, "headers")=Classes 'tbl_df', 'tbl' and 'data.frame':	0 obs. of  0 variables
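
Just to make that batching concrete, here is a minimal sketch (purely illustrative, not the package's actual internals) of how one could split the tweet vector into groups of at most 20 texts by hand:

# Illustration only: chunk the tweets into batches of at most 20 texts,
# which is what monkeylearn takes care of behind the scenes.
texts   <- billmeet$text
batches <- split(texts, ceiling(seq_along(texts) / 20))
length(batches)          # number of batches, i.e. API calls needed
range(lengths(batches))  # each batch holds at most 20 texts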

In the output, the package creator decided not to put the whole text corresponding to each line, only its digested form, produced by the MD5 algorithm. So to join the output to the tweets again, I first have to digest the tweets myself, which I do by copying the code from the package. After all, I wrote it. Maybe it was the only time I successfully used vapply in my whole life.

billmeet <- dplyr::mutate(billmeet, text_md5 = vapply(X = text,
                                                      FUN = digest::digest,
                                                      FUN.VALUE = character(1),
                                                      USE.NAMES = FALSE,
                                                      algo = "md5"))
billmeet <- dplyr::select(billmeet, text, text_md5)
output <- dplyr::left_join(output, billmeet, by = "text_md5")
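
The sample I peeked at is not reproduced here, but a quick look can be obtained with something along these lines (just a sketch, not the exact table):

# Glance at a few classified tweets with their label and probability
output %>%
  dplyr::select(label, probability, text) %>%
  head(n = 10) %>%
  knitr::kable()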

Looking at this small sample, some things make sense and others make less sense, either because the classification isn’t good or because the tweet looks like spam. Since my own field isn’t text analysis, I’ll consider myself happy with these results, but I’d of course be happy to read any better version of it.

As in my #first7jobs post, I’ll make a very arbitrary decision and keep only the labels to which a probability higher than 0.5 was attributed.

output <- dplyr::filter(output, probability > 0.5)

This covers about 45% of the original tweet sample. I can only hope it’s a representative sample.
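
For reference, here is roughly how that share could be computed from the objects defined above (a sketch, not necessarily the exact code used):

# Share of the original tweets that kept at least one label above the 0.5 cutoff
length(unique(output$text_md5)) / nrow(billmeet)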

How many labels do I have by tweet?

dplyr::group_by(output) %>%
  dplyr::summarise(nlabels = n()) %>%
  dplyr::group_by(nlabels) %>%
  dplyr::summarise(n_tweets = n()) %>%
  knitr::kable()
| nlabels| n_tweets|
|-------:|--------:|
|    1415|        1|

Perfect, only one.

Looking at the results

I know I suck at finding good section titles… At least I like the title of the post, which is a reference to the song that Bill Nighy, not Bill Nye, sings in Love Actually. My husband assumed that science Twitter has more biomedical stuff. Now, even if my results were to support this, note that it could just as well be because it’s easier to classify biomedical tweets.

I’ll first show a few examples of tweets for given labels.

dplyr::filter(output, label == "Chemistry") %>%
  head(n = 5) %>%
  knitr::kable()
| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64701 | 0.530 | Chemistry | e82fc920b07ea9d08850928218529ca9 | Hi @billnye I started off running BLAST for other ppl but now I have all the money I make them do my DNA extractions #BillMeetScienceTwitter |
| 64701 | 0.656 | Chemistry | d21ce4386512aae5458565fc2e36b686 | .@uw’s biochemistry dept – home to Nobel Laureate Eddy Fischer & ZymoGenetics co founder Earl Davie… https://t.co/0nsZW3b3xu |
| 64701 | 0.552 | Chemistry | 1d5be9d1e169dfbe2453b6cbe07a4b34 | Yo @BillNye – I’m a chemist who plays w lasers & builds to study protein interactions w materials #BillMeetScienceTwitter |
| 64701 | 0.730 | Chemistry | 1b6a25fcb66deebf35246d7eeea34b1f | Meow @BillNye! I’m Zee and I study quantum physics and working on a Nobel prize. #BillMeetScienceTwitter https://t.co/oxAZO5Y6kI |
| 64701 | 0.873 | Chemistry | 701d8c53e3494961ee7f7146b28b9c8c | Hi @BillNye, I’m a organic chemist studying how molecules form materials like the liquid crystal shown below.… https://t.co/QNG2hSG8Fw |
dplyr::filter(output, label == "Aquatic Mammals") %>%
  head(n = 5) %>%
  knitr::kable()
| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64609 | 0.515 | Aquatic Mammals | f070a05b09d2ccc85b4b1650139b6cd0 | Hi Bill, I am Anusuya. I am a palaeo-biologist working at the University of Cape Town. @BillNye #BillMeetScienceTwitter |
| 64609 | 0.807 | Aquatic Mammals | bb06d18a1580c28c255e14e15a176a0f | Hi @BillNye! I worked with people at APL to show that California blue whales are nearly recovered #BillMeetScienceTwitter |
| 64609 | 0.748 | Aquatic Mammals | 1ca07aad8bc1abe54836df8dd1ff1a9d | Hi @BillNye! I’m researching marine ecological indicators to improve Arctic marine monitoring and management… https://t.co/pJv8Om4IeI |
| 64609 | 0.568 | Aquatic Mammals | a140320fcf948701cfc9e7b01309ef8b | More like as opposed to vaginitis in dolphins or chimpanzees or sharks #BillMeetScienceTwitter https://t.co/gFCQIASty1 |
| 64609 | 0.520 | Aquatic Mammals | 06d1e8423a7d928ea31fd6db3c5fee05 | Hi @BillNye I study visual function in ppl born w/o largest connection between brain hemispheres #callosalagenesis… https://t.co/WSz8xsP38R |
dplyr::filter(output, label == "Internet") %>%
  head(n = 5) %>%
  knitr::kable()
| category_id | probability | label | text_md5 | text |
|---|---|---|---|---|
| 64640 | 0.739 | Internet | f7b28f45ea379b4ca6f34284ce0dc4b7 | @BillNye #AskBillNye @BillNye join me @AllendaleCFD. More details at https://t.co/nJPwWARSsa #BillMeetScienceTwitter |
| 64640 | 0.725 | Internet | b2b7843dc9fcd9cd959c828beb72182d | @120Stat you could also use #actuallivingscientist #womeninSTEM or #BillMeetScienceTwitter to spread the word about your survey as well |
| 64640 | 0.542 | Internet | a357e1216c5e366d7f9130c7124df316 | Thank you so much for the retweet, @BillNye! I’m excited for our next generation of science-lovers!… https://t.co/B3iz3KVCOQ |
| 64640 | 0.839 | Internet | 61712f61e877f3873b69fed01486d073 | @ParkerMolloy Hi @BillNye, Im an elem school admin who wants 2 bring in STEM/STEAM initiatives 2 get my students EX… https://t.co/VMLO3WKVRv |
| 64640 | 0.924 | Internet | 4c7f961acfa2cdd17c9af655c2e81684 | I just filled my twitter-feed with brilliance. #BIllMeetScienceTwitter |

Based on that, and on the huge number of internet-labelled tweets, I decided to remove those.

library("ggplot2")
library("viridis")

label_counts <- output %>%
  dplyr::filter(label != "Internet") %>%
  dplyr::group_by(label) %>% 
  dplyr::summarise(n = n()) %>% 
  dplyr::arrange(desc(n))

label_counts <- label_counts %>%
  dplyr::mutate(label = ifelse(n < 5, "others", label)) %>%
  dplyr::group_by(label) %>%
  dplyr::summarize(n = sum(n)) %>%
  dplyr::arrange(desc(n))

label_counts <- dplyr::mutate(label_counts,
                        label = factor(label,
                                        ordered = TRUE,
                                        levels = unique(label)))

ggplot(label_counts) +
  geom_bar(aes(label, n, fill = label), stat = "identity")+
  scale_fill_viridis(discrete = TRUE, option = "plasma")+
    theme(axis.text.x = element_text(angle = 90,
                            hjust = 1,
                            vjust = 1),
          text = element_text(size=25),
          legend.position = "none")

In the end, I’m always skeptical when looking at the results of such classifiers, and, well, at the quality of my sample to begin with; but then I doubt there ever was a hashtag used only to answer the question at hand, without spam or commentary (which is what I’m doing). I’d say the results seem to support my husband’s hypothesis about biomedical stuff.

I’m pretty sure Bill Nye won’t have had the time to read all the tweets, but I think he should save them, or at least all the ones he can get via the Twitter API thanks to e.g. rtweet, in order to be able to look through them next time he needs an expert. And in the random sample of tweets he’s read, let’s hope he was exposed to a great diversity of science topics (and of scientists), although, hey, the health and life related stuff is the most interesting of course. Just kidding. I liked reading tweets about various scientists, science rocks! And these last words would be labelled with “performing arts”, perfect way to end this post.


Source:: R News