Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

By John Mount

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We here at Win-Vector LLC have some really big news we would please like the R-community’s help sharing.

vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark.

vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.

Thanks to the rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as PostgreSQL, Amazon RedShift, Apache Spark, or Google BigQuery. Or, thanks to the data.table and rqdatatable packages, even fast large in-memory transforms are possible.

We have some basic examples of the new vtreat capabilities here and here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload CAPTCHA.