OpenCPU release 1.4.6: gzip and systemd

By Jeroen Ooms


(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

OpenCPU server version 1.4.6 has been released to launchpad, OBS, and dockerhub (more about docker in a future blog post). I also updated the instructions to install the server or build from source for rpm or deb. If you have a running deployment, you should be able to upgrade with apt-get upgrade or yum update respectively.

Compression

This release enables gzip compression in the default apache2 configuration for ocpu, which was suggested by several smart users. As was explained in an earlier post about the curl package:

Support for compression can make a huge difference when streaming large data. Text based formats such as json are popular because they are human readable, but the main downside of plain-text is inefficiency for storing numbers. However when gzipped, json payloads are often comparable to binary formats, giving you the best of both worlds.

The nice thing about http is that compression is handled entirely on the level of the protocol so it works for all content types and you don’t have to do anything to take advantage of it. Client and server will automatically negotiate a method of compression that they both support via the Accept-Encoding header.

Try playing around with the ocpu test page by looking at the Content-Encoding response header, or just use curl with the --compress flag (use -v to see the headers):

curl https://demo.ocpu.io/MASS/data/Boston/json -v > /dev/null
curl https://demo.ocpu.io/MASS/data/Boston/json --compress -v > /dev/null
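The same request can be made from R with the curl package mentioned above. This is a small sketch based on my reading of the current curl package API (the accept_encoding option and the parse_headers() helper are my additions, not something from this post), so treat the details as indicative:

library(curl)

# ask for gzip; libcurl sends the Accept-Encoding header and decompresses the response
h <- new_handle(accept_encoding = "gzip")
req <- curl_fetch_memory("https://demo.ocpu.io/MASS/data/Boston/json", handle = h)
parse_headers(req$headers)   # should include "Content-Encoding: gzip"
length(req$content)          # size of the (already decompressed) payload in bytes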

As usual, I also updated the library of R packages included with the server, including the latest jsonlite 0.9.14, which adds control over prettify indentation.
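For example, something like this (a quick sketch; I am assuming the pretty argument of toJSON and the indent argument of prettify behave as described in the jsonlite changelog):

library(jsonlite)

# pretty = TRUE gives the default indentation; a number sets how many spaces to indent with
toJSON(head(cars, 2), pretty = 2)

# the same control when reformatting an existing JSON string
prettify('{"foo": 1, "bar": [1, 2, 3]}', indent = 4)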

Support for systemd and docker

Apart from enabling compression and updating the R package library, this release has some internal changes to support systemd on Debian 8 (Jessie), on which the r-base docker images are based.

The introduction of systemd has been quite controversial in the Debian community, to say the least, which is perhaps why things are not yet working as smoothly in Jessie as in Fedora. My current init scripts definitely did not work out of the box with systemd (as advertised), and getting them fixed was quite painful.

However, I did figure everything out eventually, and learned a lot about systemd while debugging it. I can see it being a very powerful system, and definitely a big improvement over the old-style init scripts. The way services are specified has a lot in common with how docker does it, which I’m sure is not a coincidence. I look forward to taking full advantage of it once it has landed in all major distributions.

I really hope the Debian folks will resolve their differences sooner rather than later though, because the current state of Jessie is not very good. Even popular packages such as nginx are currently broken due to the chaos and uncertainty surrounding the transition to systemd, which is not helping anyone. On the other hand, I do admire the commitment of the Debian community to transparent and democratic decision making (even when messy), which is something the R community seems to be missing sometimes…

To leave a comment for the author, please follow the link and comment on his blog: OpenCPU.


Source:: R News

top posts for 2014

By xi’an

(This article was first published on Xi’an’s Og » R, and kindly contributed to R-bloggers)

Here are the most popular entries for 2014:

17 equations that changed the World (#2) 995
Le Monde puzzle [website] 992
“simply start over and build something better” 991
accelerating MCMC via parallel predictive prefetching 990
Bayesian p-values 960
posterior predictive p-values 849
Bayesian Data Analysis [BDA3] 846
Bayesian programming [book review] 834
Feller’s shoes and Rasmus’ socks [well, Karl’s actually…] 804
the cartoon introduction to statistics 803
Asymptotically Exact, Embarrassingly Parallel MCMC 730
Foundations of Statistical Algorithms [book review] 707
a brief on naked statistics 704
In{s}a(ne)!! 682
the demise of the Bayes factor 660
Statistical modeling and computation [book review] 591
bridging the gap between machine learning and statistics 587
new laptop with ubuntu 14.04 574
Bayesian Data Analysis [BDA3 – part #2] 570
MCMC on zero measure sets 570
Solution manual to Bayesian Core on-line 567
Nonlinear Time Series just appeared 555
Sudoku via simulated annealing 538
Solution manual for Introducing Monte Carlo Methods with R 535
future of computational statistics 531

What I appreciate from that list is that (a) book reviews [of stats books] get a large chunk (50%!) of the attention and (b) my favourite topics of Bayesian testing, parallel MCMC and MCMC on zero measure sets made it to the top list. Even

To leave a comment for the author, please follow the link and comment on his blog: Xi’an’s Og » R.


Source:: R News

First Day of the Month, Using R

By The Clerk

(This article was first published on You Know, and kindly contributed to R-bloggers)

Future-proofing is an important concept when designing automated reports. One thing that can get out of hand over time is accumulating so many periods of data that your charts start to look overcrowded. You can solve this by limiting the number of periods to, say, 13 (I like 13 for monthly data, because you get a full year of data plus a month-over-month comparison for the most recent month).

You could approach this by limiting your data to anything in the last 390 days (30 days x 13 months), but your starting period will likely be cut off. You can fix this by finding the first day of the month for each record, then going back far enough to get a full 13 months of data.

Here’s a quick one-liner to get the first day of the month for a given date: subtract the day of the month from the full date, then add 1.

# get some dates for the toy example:
df1 <- data.frame(YourDate = as.Date("2012-01-01") + seq(from = 1, to = 900, by = 11))
df1$DayOne <- df1$YourDate - as.POSIXlt(df1$YourDate)$mday + 1
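From there, a short sketch of the 13-month window described above (the cutoff logic is my own illustration, not part of the original post):

# keep the 13 most recent whole months: the latest month plus the 12 before it
cutoff <- seq(max(df1$DayOne), by = "-12 months", length.out = 2)[2]
df_recent <- df1[df1$DayOne >= cutoff, ]
length(unique(df_recent$DayOne))   # 13 distinct months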

To leave a comment for the author, please follow the link and comment on his blog: You Know.


Source:: R News

Multivariate Medians

By Dave Giles

(This article was first published on Econometrics Beat: Dave Giles’ Blog, and kindly contributed to R-bloggers)

I’ll bet that in the very first "descriptive statistics" course you ever took, you learned about measures of "central tendency" for samples or populations, and these measures included the median. You no doubt learned that one useful feature of the median is that, unlike the (arithmetic, geometric, harmonic) mean, it is relatively "robust" to outliers in the data.

(You probably weren’t told that J. M. Keynes provided the first modern treatment of the relationship between the median and the minimization of the sum of absolute deviations. See Keynes (1911) – this paper was based on his thesis work of 1907 and 1908. See this earlier post for more details.)

At some later stage you would have encountered the arithmetic mean again, in the context of multivariate data. Think of the mean vector, for instance.

However, unless you took a stats. course in Multivariate Analysis, most of you probably didn’t get to meet the median in a multivariate setting. Did you ever wonder why not?

One reason may have been that while the concept of the mean generalizes very simply from the scalar case to the multivariate case, the same is not true for the humble median. Indeed, there isn’t even a single, universally accepted definition of the median for a set of multivariate data!

Let’s take a closer look at this.

The key point to note is that the univariate concept of the median relies on our ability to order (or rank) univariate data. In the case of multivariate data, there is no natural ordering of the data points. In order to develop the concept of the median in this case, we first have to agree on some convention for defining "order".

This gives rise to a host of different multivariate medians, including:

  • The L1 Median
  • The Geometric Median
  • The Vector of Marginal Medians (or coordinate-wise median)
  • The Spatial Median
  • The Oja Median
  • The Liu Median
  • The Tukey Median.
For most of these measures a variety of different numerical algorithms are available. This complicates matters even further. You have to decide on a median definition, and then you have to find an efficient algorithm to compute it. To get an idea of the issues involved, take a look at this interesting paper.

You can compute multivariate medians in R, using the “med” function. However, for the most part the associated algorithms are limited to two-dimensional data.
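For the two simplest definitions a base R sketch is easy to write down. The data matrix X below is made up for illustration, and the Weiszfeld iteration is just one standard way to approximate the geometric (L1/spatial) median, not the algorithm used by any particular package:

set.seed(123)
X <- cbind(rnorm(100), rnorm(100))

# vector of marginal (coordinate-wise) medians
marginal_med <- apply(X, 2, median)

# geometric median via Weiszfeld's algorithm (an iteratively reweighted mean)
geometric_median <- function(X, tol = 1e-8, max_iter = 1000) {
  m <- colMeans(X)                                   # start at the arithmetic mean
  for (i in seq_len(max_iter)) {
    d <- sqrt(rowSums((X - matrix(m, nrow(X), ncol(X), byrow = TRUE))^2))
    w <- 1 / pmax(d, tol)                            # guard against division by zero
    m_new <- colSums(X * w) / sum(w)                 # weighted mean of the rows
    if (sqrt(sum((m_new - m)^2)) < tol) break
    m <- m_new
  }
  m
}
geometric_med <- geometric_median(X)

The depth-based definitions (Tukey, Oja, Liu) require genuinely different algorithms, which is where dedicated packages, and their dimensional limitations, come in.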

If this topic interests you, then a good starting point for further reading is the survey paper by Small (1990).

Finally, it’s worth keeping in mind that the median is just one of the “order statistics” associated with a body of data. The issues associated with defining a median in the case of multivariate data apply equally to other order statistics, or functions of the order statistics (such as the “range” of the data).
References

Keynes, J. M., 1911. The principal averages and the laws of error which lead to them. Journal of the Royal Statistical Society, 74, 322–331.
Small, C. G., 1990. A survey of multidimensional medians. International Statistical Review, 58, 263–277.

© 2014, David E. Giles

To leave a comment for the author, please follow the link and comment on his blog: Econometrics Beat: Dave Giles’ Blog.


Source:: R News

R wins a 2014 Bossie Award

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

I missed this when it was announced back on September 29, but R won a 2014 Bossie Award for best open-source big-data tools from InfoWorld (see entry number 5):

A specialized computer language for statistical analysis, R continues to evolve to meet new challenges. Since displacing lisp-stat in the early 2000s, R is the de-facto statistical processing language, with thousands of high-quality algorithms readily available from the Comprehensive R Archive Network (CRAN); a large, vibrant community; and a healthy ecosystem of supporting tools and IDEs. The 3.0 release of R removes the memory limitations previously plaguing the language: 64-bit builds are now able to allocate as much RAM as the host operating system will allow.

Traditionally R has focused on solving problems that best fit in local RAM, utilizing multiple cores, but with the rise of big data, several options have emerged to process large-scale data sets. These options include packages that can be installed into a standard R environment as well as integrations into big data systems like Hadoop and Spark (that is, RHive and SparkR).

Check out the full list of winners at the link below. (Thanks to RG for the tip!)

InfoWorld: Bossie Awards 2014: The best open source big data tools

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Source:: R News

rfoaas 0.0.5

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version of rfoaas is now on CRAN. The rfoaas package provides an interface for R to the most excellent FOAAS service–which provides a modern, scalable and RESTful web service for the frequent need to tell someone to eff off.

This version aligns the rfoaas version number with the (at long last) updated upstream version number, and brings a change suggested by Richie Cotton to set the encoding of the returned object.

As usual, CRANberries provides a diff to the previous release. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .


Source:: R News

How to extract a data.frame from string data

By Tal Galili

A guest article by Asher Raz, PhD, CareerHarmony

Sometimes, data on subjects are recorded on a server (e.g. an SQL server) as one string record per subject. In some cases we need only part of that string for each subject, and we need it as numerical data (e.g. as a data.frame). How can we get the required data?

In the following post I would like to share my experience with this issue.

Below is a sample of string data:

"03F05DCACF-15BF-4328-BF1B-5D2503B4A18D00004|||03889D4711-4968-45DA-B1EF-E8559EEE43B400001|||03E56E89E7-5EA3-4A5A-B945-BC6B94D982EE00003|||03FEC7049F-B4E2-4D65-833D-59478CC7780D00003|||039BC5FC41-6E83-4880-8531-F0E38A892C2D00002|||035F88A090-E28F-4E6E-B680-33502F95C41D00003|||",20,DD31036F-CB38-4FF6-8DD8-495F42F29185,"032D6E0DDF-B553-4150-83D9-241DFD401E6D00001|||031228E393-88BC-4550-83D2-FBF3427D483600003|||035FA90EE3-8C51-48B9-91A0-8F5D163D6A7A00004|||03A564C47F-0A5C-4DA4-B94E-CEB55AEA947400002|||03BAA8F0A8-5BE3-4BC1-9DAF-C69ECECF1F7400002|||03C7CD867A-6557-4315-AF34-4B3938D6E81700003|||"

Each string is organized according to rules. It could be that the required data follow one rule at the beginning of the string and another rule in the rest of the string.

Firstly, we need to find the rules that organize the required data. That is done by checking the data thoroughly. For instance, in the sample string above, the required data are the digits that appear after four 0’s. The first datum we need is 4, the second is 1, and so on. The first datum appears at the 45th position in the string, the second 46 positions after the first, and so on. To record those rules, we can create a vector that contains the positions of all the required data in the string:

# Creating the counter vector: positions of the digits we need within the string.
# The first digit is at position 45; subsequent positions step forward 46
# characters five times and then 91 characters once, and that block of six
# steps repeats. The final (187th) position is dropped later on.

counter <- c(45)
i <- 45
for (n in 1:6) {
  if (n < 6) {
    i <- i + 46
  } else {
    i <- i + 91
  }
  counter <- c(counter, i)
}

for (block in 1:30) {
  for (n in 1:6) {
    if (n < 6) {
      i <- i + 46
    } else {
      i <- i + 91
    }
    counter <- c(counter, i)
  }
}
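As an aside, a shorter regex-based sketch can pull out the same digits without tracking positions at all, assuming the target digit always sits directly after a literal run of four zeros, as in the sample above (that assumption about the format is mine, not the author’s):

s <- "03F05DCACF-15BF-4328-BF1B-5D2503B4A18D00004|||03889D4711-4968-45DA-B1EF-E8559EEE43B400001|||"
m <- regmatches(s, gregexpr("0000[0-9]", s))[[1]]   # "00004" "00001"
vals <- as.numeric(substring(m, 5, 5))              # 4 1
vals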

Secondly, we need to input into R the string data (in this example from SQL server) and then prepare the data.frame. That can be done by the following code:

# Read the answers (RawAnswer strings) from the SQL server and build the data.frame.
library(RODBC)
myconn <- odbcConnect("Subjects_Data", uid = "ab", pwd = "cde")
NEW_DATA_sql_data_RawAnswer <- sqlQuery(myconn,
  "select top 1000 RawAnswer from New-Data", stringsAsFactors = FALSE)

NEW_DATA_all_sample <- c()
list_item <- 14

for (RawAnswer in NEW_DATA_sql_data_RawAnswer$RawAnswer) {
  # dump the imported strings to a text file so they can be re-read line by line
  save(NEW_DATA_sql_data_RawAnswer, file = "C:/R_DATA/NEW-DATAt.txt", ascii = TRUE)
  NEW_DATA_string <- readLines("C:/R_DATA/NEW-DATAt.txt")
  list_item <- list_item + 3
  NEW_DATA_stringt <- NEW_DATA_string[c(list_item)]
  NEW_DATA_one_subject <- c()
  # pick out the single characters at the positions stored in 'counter'
  for (n in counter) {
    NEW_DATAt <- substring(NEW_DATA_stringt, n, n)
    NEW_DATA_one_subject <- rbind(NEW_DATA_one_subject, NEW_DATAt)
  }

  NEW_DATA_one_subject <- as.numeric(NEW_DATA_one_subject)
  NEW_DATA_one_subject <- NEW_DATA_one_subject[c(-187)]
  NEW_DATA_all_sample <- rbind(NEW_DATA_all_sample, NEW_DATA_one_subject)
}

NEW_DATA_all_sample_df <- as.data.frame(apply(NEW_DATA_all_sample, 2, as.numeric))

names(NEW_DATA_all_sample_df) <- sprintf("q%d", 1:186)

NEW_DATA_sql_data_CandidateID_Username <- sqlQuery(myconn,
  "select top 1000 CandidateID,Username from New-Data", stringsAsFactors = FALSE)

NEW_DATA_sql_data_CandidateID_Username_df <- as.data.frame(NEW_DATA_sql_data_CandidateID_Username)

NEW_DATA_all_sample_df$CandidateID <- NEW_DATA_sql_data_CandidateID_Username_df$CandidateID

NEW_DATA_all_sample_df$Username <- NEW_DATA_sql_data_CandidateID_Username_df$Username

# put CandidateID and Username in the first two columns
NEW_DATA_all_sample_df <- NEW_DATA_all_sample_df[, c(187, 188, 1:186)]

NEW_DATA_all_sample_df

write.csv(NEW_DATA_all_sample_df, file = "C:/R_DATA/NEW-DATA_all_sample.csv")

It should be noted that the string data imported into R had to be saved to a text file before it could be read line by line (see the commands above the call NEW_DATA_string <- readLines("C:/R_DATA/NEW-DATAt.txt")); readLines cannot work directly on the data that was imported from the SQL server.

Source:: R News

RcppArmadillo 0.4.600.0

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Conrad produced another minor release 4.600 of Armadillo. As before, I had created GitHub-only pre-release(s) of his pre-release(s), and tested a pre-release as well as the actual release against the now over one hundred CRAN dependents of our RcppArmadillo package. The tests passed fine as usual, with fewer than a handful of checks not passing, all for known cases, and results are as always in the rcpp-logs repository.

Changes are summarized below based on the NEWS.Rd file.

Changes in RcppArmadillo version 0.4.600.0 (2014-12-27)

  • Upgraded to Armadillo release Version 4.600 (“Off The Reservation”)

    • added .head() and .tail() to submatrix views

    • faster matrix transposes within compound expressions

    • faster accu() and norm() when compiling with -O3 -ffast-math -march=native (gcc and clang)

    • workaround for a bug in GCC 4.4

Courtesy of CRANberries, there is also a diffstat report for the most recent release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .


Source:: R News

A time series contest attempt

By Wingfeet

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I saw the post

First model

As a first step I tried a model with limited variables and only 10000 records. For this, the x data have been compressed in two manners: from a time series perspective, where the ACF is used, and from a trend perspective, where 10 points are used to capture the general shape of the curves, the latter by using local regression (loess). Both these approaches are applied to the true data and to all data. In addition, the summary variables provided in the data are used.
The result: an OOB error rate of 44%. Actually, this was a lucky run; I have had runs with error rates over 50%. I did not bother to check predictive capability.
library(dplyr)
library(randomForest)

# limit to the first 10000 records, as described above
mysel <- filter(train, rowno < 10000) %>%
  select(., -chart_length, -rowno) %>%
  collect()
yy <- factor(mysel$class)
vars <- as.matrix(select(mysel, var.1:var.1000))
leftp <- select(mysel, true_length:high_frq_true_samples)
rm(mysel)
# ACF at lags 1, 5, 10 and 15 (indices 2, 6, 11, 16), matching the column names below
myacf <- function(datain) {
  a1 <- acf(datain$y, plot = FALSE, lag.max = 15)
  a1$acf[c(2, 6, 11, 16)]
}
# general shape of the curve: loess predictions at 10 equally spaced points
myint <- function(datain) {
  ll <- loess(y ~ x, data = datain)
  predict(ll, data.frame(x = seq(0, 1, length.out = 10)))
}

la <- lapply(1:nrow(vars),function(i) {
allvar <- data.frame(x=seq(0,1,length.out=1000),y=vars[i,])
usevar <- data.frame(x=seq(0,1,length.out=leftp$true_length[i]),
y=allvar$y[(1001-leftp$true_length[i]):1000])
c(myacf(allvar),myacf(usevar),myint(allvar),myint(usevar))
})
rm(vars)
rightp <- do.call(rbind,la)
colnames(rightp) <- c(
paste('aacf', c(2, 6, 11, 16), sep = ''),
paste('uacf', c(2, 6, 11, 16), sep = ''),
paste('a', seq(1, 10), sep = ''),
paste('u', seq(1, 10), sep = ''))

xblok <- as.matrix(cbind(leftp,rightp))
rf1 <-randomForest(
x=xblok,
y=yy,
importance=TRUE)
rf1
Call:
randomForest(x = xblok, y = yy, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6

OOB estimate of error rate: 44.21%
Confusion matrix:
0 1 class.error
0 2291 2496 0.5214122
1 1925 3287 0.3693400
The plot shows the variable importance. Besides the variables provided, the ACF seems important. Variables based on all time points seemed to work better than variables based on the true time series.

Second Model

In this model extra detail has been added to the all-data variables. In addition, extra moments of the data have been calculated. It did not help very much.
library(moments)   # for all.moments()

# same selection of the first 10000 records as in the first model
mysel <- filter(train, rowno < 10000) %>%
  select(., -chart_length, -rowno) %>%
  collect()
yy <- factor(mysel$class)
vars <- as.matrix(select(mysel, var.1:var.1000))
leftp <- select(mysel, true_length:high_frq_true_samples)
rm(mysel)
myacf <- function(datain,cc,lags) {
a1 <- acf(datain$y,plot=FALSE,lag.max=max(lags)-1)
a1 <- a1$acf[lags,1,1]
names(a1) <- paste('acf',cc,lags,sep='')
a1
}
myint <- function(datain,cc) {
datain$y <- datain$y/mean(datain$y)
ll <- loess(y ~x,data=datain)
pp <- predict(ll,data.frame(x=seq(0,1,length.out=20)))
names(pp) <- paste(cc,1:20,sep='')
pp
}

la <- lapply(1:nrow(vars),function(i) {
allvar <- data.frame(x=seq(0,1,length.out=1000),y=vars[i,])
usevar <- data.frame(x=seq(0,1,length.out=leftp$true_length[i]),
y=allvar$y[(1001-leftp$true_length[i]):1000])
acm <- all.moments(allvar$y,central=TRUE,order.max=5)[-1]
names(acm) <- paste('acm',2:6)
arm <- all.moments(allvar$y/mean(allvar$y),
central=FALSE,order.max=5)[-1]
names(arm) <- paste('arm',2:6)
ucm <- all.moments(usevar$y,central=TRUE,order.max=5)[-1]
names(ucm) <- paste('ucm',2:6)
urm <- all.moments(usevar$y/mean(usevar$y),
central=FALSE,order.max=5)[-1]
names(urm) <- paste('urm',2:6)
ff <- fft(allvar$y[(1000-511):1000])[1:10]
ff[is.na(ff)] <- 0
rff <- Re(ff)
iff <- Im(ff)
names(rff) <- paste('rff',1:10,sep='')
names(iff) <- paste('iff',1:10,sep='')
c(myacf(allvar, 'a', lags = c(2:10, seq(20, 140, by = 10))),
myint(allvar, 'a'),
acm,
arm,
rff,
iff,
myacf(usevar, 'u', seq(2, 16, 2)),
myint(usevar, 'u')
)
})
#rm(vars)
rightp <- do.call(rbind,la)
xblok <- as.matrix(cbind(leftp,rightp))
rf1 <-randomForest(
x=xblok,
y=yy,
importance=TRUE,
nodesize=5)
rf1
Call:
randomForest(x = xblok, y = yy, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 10

OOB estimate of error rate: 42.76%
Confusion matrix:
0 1 class.error
0 2245 2542 0.5310215
1 1734 3478 0.3326938

SVM

Just to try something other than a randomForest. But I notice some overfitting.
library(e1071)

sv1 <- svm(x = xblok,
           y = yy)
sv1
Call:
svm.default(x = xblok, y = yy)

Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.009259259

Number of Support Vectors: 9998
table(predict(sv1),yy)

yy
0 1
0 4776 2
1 11 5210

A test set (rowno>50000 in the training table) did much worse:
ytest
0 1
0 547 580
1 6254 6149

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.


Source:: R News

[NYC] Featured R experts Meetup, R classes and 12 week Data Science Bootcamp

By Vivian S. Zhang

(This article was first published on SupStat Data Science Blog » r-bloggers, and kindly contributed to R-bloggers)

There are a few exciting announcements I would love to share with the R community. We feel very honored to host a meetup and classes offered by the Kaggle #1 ranked data scientist, Owen Zhang, and the author of Applied Predictive Modeling, Max Kuhn.

Featured R experts meetup

Featured talk given by Kaggle world ranked #1 Owen Zhang

Tuesday, Jan 20, 2015, 7:00 PM

No location yet.


Applied Predictive Modeling by Max Kuhn

Thursday, Jan 22, 2015, 7:00 PM

Thoughtworks NYC
99 Madison Avenue, 15th Floor New York, NY


Max Kuhn, Director of Nonclinical Statistics at Pfizer, has published his long-awaited book, Applied Predictive Modeling. He will join us and share his experience with data mining with R.


Upcoming Data Science Courses

  • Intro to Data Science with R - Jan 15 and 16 (Thurs and Fri). Details
  • R Shiny workshop - Jan 23 and 24 (Thurs and Fri). Details
  • Data Science with R: Data Analysis - Jan 17th, 23rd, 30th, Feb 7th, 14th, 2015 (five Saturdays). Details
  • Data Science with R: Machine learning - Jan 18th, 24th, 31st, Feb 8th, Feb 15th (five Sundays). Details
  • Advanced R: Applied Predictive Modeling with Max Kuhn - Feb 18 and 19 (Wed and Thurs). Details
  • Data Science with R: Data Analysis - Feb 21st, 28th, Mar 7th, 14th, 21st, 2015 (five Saturdays). Details
  • Data Science with R: Machine learning - Feb 22nd, Mar 1st, 8th, 15th, 22nd, 2015 (five Sundays). Details

12-week Data Science Immersive program

Join our full time program to become a data scientist and learn the practical skills needed for your career while building awesome solutions for real business and industry problems.

In this program students will learn beginner and intermediate levels of data science with R, Python & Hadoop, as well as the most popular and useful R packages like Shiny, knitr, rCharts and more. Once the foundation has been set, students work on a 2-week, hands-on project with the instructor, mentored by top chief data scientists in NYC. During the final week, students will have the opportunity to interview with 300+ hiring companies in New York and the Tri-State area. (Apply at http://nycdatascience.com/bootcamp/, deadline: Jan 6th, 2015.)

NYC Data Science Curriculum:

  • Weeks 1 & 2: Data Science With R: Data Analysis & Github
  • Weeks 3 & 4: Data Science With R: Machine Learning &
  • Week 5: Most Popular And Useful R Toolkits
  • Week 6: Data Science With Python: Data Analysis
  • Week 7: Data Science With Python: Machine Learning & Python Flask
  • Week 8: Big Data With Hadoop: Data Engineering Professionals
  • Week 9: Big Data With Hadoop: 5 Real World Applications
  • Weeks 10 & 11: Capstone Project
  • Week 12: Interview Preparation, Job Fair & On-Site Interviews

As you can tell, we cover R heavily in our program. As far as we know, we are the only data science bootcamp that teaches R; the other schools focus on Python.

To leave a comment for the author, please follow the link and comment on his blog: SupStat Data Science Blog » r-bloggers.


Source:: R News