In the last set of exercises, you saw the basic functionality of RevoScaleR. In this exercise set we will explore RevoScaleR further.
Get the credit card fraud data set from Revolution Analytics and let's get started.
Answers to the exercises are available here. Please check the documentation before starting this exercise set.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
RevoScaleR provides an option to convert a data frame into an XDF file, which you might need when storing temporary data frames that you create during analysis work.
Create an XDF file from the airquality data set.
In the previous exercise set you saw rxHistogram briefly. Now we will see how to get meaningful information from a large data set with a visualization.
Create a scatterplot from the XDF file for Balance vs. numTrans, restricted to rows where numTrans is greater than 50, creditline is greater than 50, and fraudrisk is 1.
A good thing about RevoScaleR is that it comes with a sample data directory. Find the sample data directory using rxGetOption,
then take claims.txt and convert it into an XDF file. You have already used rxImport, which would work, but for plain text data there is a highly optimized method in RevoScaleR. Please use that.
If you check the sample data directory, you can see that there are 10 CSV files named mortDefaultSmall2000.csv and so on. rxImport can create a single XDF file by joining them row-wise.
Please create a new XDF file that contains all the data from mortDefaultSmall2000.csv to mortDefaultSmall2009.csv.
You can write a loop or do it in a functional way. If you have followed my exercises on functional programming (1 & 2), I hope you know how to achieve this without a loop.
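The loop-free idea can be sketched in base R. This is only an illustration of the pattern, not the exercise solution: the file names (demo2000.csv and so on) and the toy data are made up, and read.csv with rbind stands in for the RevoScaleR import step.

```r
# Hypothetical stand-in data: three tiny CSV files in a temporary directory.
tmp_dir <- tempfile("csvdemo")
dir.create(tmp_dir)
years <- 2000:2002

# Build all file names at once from the year vector.
csv_files <- file.path(tmp_dir, sprintf("demo%d.csv", years))

# Write one small file per year without an explicit for loop.
invisible(Map(function(path, y) {
  write.csv(data.frame(year = y, value = 1:3), path, row.names = FALSE)
}, csv_files, years))

# Read every file and stack the results row-wise.
combined <- do.call(rbind, lapply(csv_files, read.csv))
nrow(combined)
```

The same shape (a vector of file names, one function applied to each) carries over when the function being applied is an import call that appends to a single output file.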
We will see how rxDataStep can create new variables with complex calculations. On the credit card fraud XDF file, create a column domTrans, which is numTrans - numIntlTrans, and create another column for balance per domestic transaction, which is Balance / domTrans. This exercise will show you how to use temporary variables to create more complex new columns from existing columns.
An important feature of rxDataStep is transform functions. One thing to remember is that a transform function sees only one chunk of data at a time, so if you need a value computed over the whole data set, you must compute it before calling the transform function, not within it.
Now create the z-score of balance in the credit card fraud XDF file, which you already created in the first set of exercises.
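Why the whole-data values must be computed up front can be seen with a small base R simulation. This is a sketch under assumptions: the simulated balance vector and the four equal chunks are invented for illustration, and lapply over chunks only mimics chunk-wise processing.

```r
set.seed(42)
# Simulated balances, split into four chunks as a chunked reader might deliver them.
balance <- rnorm(1000, mean = 4000, sd = 1500)
chunks <- split(balance, rep(1:4, each = 250))

# Mistake: each chunk is standardized with its own mean and standard deviation.
z_per_chunk <- unlist(lapply(chunks, function(x) (x - mean(x)) / sd(x)),
                      use.names = FALSE)

# Fix: compute the global statistics once, then pass them into the transform.
global_mean <- mean(balance)
global_sd <- sd(balance)
z_global <- unlist(lapply(chunks, function(x, m, s) (x - m) / s,
                          m = global_mean, s = global_sd),
                   use.names = FALSE)

# Reference z-score computed on the full vector in one go.
z_reference <- (balance - global_mean) / global_sd
isTRUE(all.equal(z_global, z_reference))
isTRUE(all.equal(z_per_chunk, z_reference))
```

Only the version that receives the precomputed mean and standard deviation reproduces the full-data z-score; the per-chunk version silently standardizes each chunk against itself.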
The next three exercises will give you a basic idea of how to split an XDF file into training and test sets, and we will also look a bit deeper into rxLinMod. Please remember that the goal of these exercises is not to predict but to make you aware of a few important details of modelling with RevoScaleR.
Split the credit card fraud XDF file into two random parts, with 25 percent as the test data and 75 percent as the training data.
Hint: You need to use splitByFactor and create the factor using transforms. Use a seed to make this reproducible.
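The idea behind the hint (a random factor column that decides which part each row goes to) can be sketched in base R. The data frame, the column name splitVar, and the level names are hypothetical, and base split stands in for the XDF splitting step.

```r
set.seed(123)  # fixed seed so the random assignment is reproducible
n <- 1000
demo <- data.frame(id = seq_len(n))

# Assign each row to "train" with probability 0.75 and "test" with 0.25.
demo$splitVar <- factor(
  sample(c("train", "test"), n, replace = TRUE, prob = c(0.75, 0.25)),
  levels = c("train", "test")
)

# Split the data by the factor; one data frame per level.
parts <- split(demo, demo$splitVar)
sapply(parts, nrow)
```

Because each row is assigned independently, the split is approximately 75/25 rather than exact.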
In the last set you used rxLinMod. Use the same expression, but pass cube = TRUE as a parameter. You may need to tweak the formula to make it work, as the first term should be categorical when using cube. Can you see the difference between the two models?
How do you define a linear regression to analyze whether fraudrisk depends only on the interaction of gender and balance? And how do you define it if you want to check the dependency of fraudrisk on the interaction as well as on the individual contributions of gender and balance?
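In standard R formula notation, the two cases differ only in the operator. The sketch below inspects the formulas with base R's terms function rather than fitting anything; the variable names follow the exercise text, and whether rxLinMod accepts exactly the same operators is best confirmed in its documentation.

```r
# Interaction only: the colon operator.
interaction_only <- fraudrisk ~ gender:balance

# Main effects plus interaction: the star operator expands to a + b + a:b.
main_and_interaction <- fraudrisk ~ gender * balance

labels_interaction <- attr(terms(interaction_only), "term.labels")
labels_full <- attr(terms(main_and_interaction), "term.labels")

labels_interaction  # "gender:balance"
labels_full         # "gender" "balance" "gender:balance"
```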
You might want to check fraudrisk for different segments of balance. Create 5 segments (0-10k, 10k-20k, and so on) and check the linear regression result.
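Segmenting a numeric column into equal-width bands can be sketched with base R's cut. The balance values are made up, and the upper limit of 50k is an assumption derived from "5 segments of 10k each"; values of 50,000 or more would become NA with these breaks.

```r
# Hypothetical balances for illustration.
balance <- c(500, 9999, 10000, 15250, 27000, 38000, 49999)

# Five bands of width 10,000; right = FALSE makes each band [lower, upper).
breaks <- seq(0, 50000, by = 10000)
segment <- cut(
  balance,
  breaks = breaks,
  right = FALSE,
  labels = c("0-10k", "10k-20k", "20k-30k", "30k-40k", "40k-50k")
)

# Count how many balances fall into each band.
segment_counts <- table(segment)
segment_counts
```

The resulting factor can then be used as a categorical term in a regression formula.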