How to add Trend Lines to Visualizations in Displayr

By Carmen Chan

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In Displayr, Visualizations of chart type Column, Bar, Area, Line and Scatter all support trend lines. Trend lines can be linear or non-parametric (cubic spline, Friedman’s super-smoother or LOESS).

Adding a linear trend line

Linear trend lines can be added to a chart by fitting a regression to each series in the data source. In the chart below, the linear trends are shown as dotted lines in the color corresponding to the data series. We see there is considerable fluctuation in the frequency of each search term. But the trend lines clarify that the overall trend for SPSS is downward, whereas the trend for Stata is increasing.

The data for this chart was generated by clicking Insert > More > Data > Google Trends. In the textbox for Topic(s) we typed a comma-separated list of search terms (i.e., “SPSS, Stata”). This creates a table showing the number of times each term was searched for in each week. Any input table with a similar structure can be used to create a chart.

Time series data

If we click on the table, the object inspector showing the properties of this output is shown on the right. Under the Properties tab, expand the General group to see the name of the table, in this case google.trends.

We create a chart by selecting Insert > Visualizations > Line. In the dropdown for Output in ‘Pages’, select the name of the input table (i.e., google.trends). On the Chart tab in the object inspector, look for the Trend lines group and set the Line of best fit dropdown to Linear. We also tick the checkbox for Ignore last data point. This option is useful for ignoring the last time period, which may be incomplete if the data is still being collected.
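Displayr fits these trend lines for you, but the underlying idea is simply a per-series linear regression. Below is a minimal R sketch of the same idea; the google.trends data frame built here is a made-up stand-in with the structure described above (a date column plus one column per search term), not the actual Google Trends output.

# Hypothetical stand-in for the google.trends table: one row per week
set.seed(1)
google.trends <- data.frame(
    Date  = seq(as.Date("2015-01-04"), by = "week", length.out = 150),
    SPSS  = round(60 - 0.10 * (1:150) + rnorm(150, sd = 4)),
    Stata = round(30 + 0.05 * (1:150) + rnorm(150, sd = 3))
)

# Drop the last (possibly incomplete) week, mirroring "Ignore last data point"
trends <- head(google.trends, -1)

# Fit a straight line to each series and extract its slope
slopes <- sapply(trends[-1], function(y) coef(lm(y ~ as.numeric(trends$Date)))[2])
slopes   # negative slope for SPSS, positive slope for Stata (in this toy data)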

Trend lines using non-parametric smoothers

In many cases, we want to estimate a trend that is not constrained to a straight line. To estimate smooth (non-parametric) trend lines, we can use cubic splines, Friedman’s super-smoother or LOESS. Note that LOESS uses a fixed span of 0.75, which may sometimes be overly large. In contrast, the cubic spline and Friedman’s super-smoother use cross-validation to select a span, and they are usually better at identifying the important features. For example, in the figure below, the LOESS trend line suggests there is a gradual decrease in river flow from 1870 to 1900. However, the cubic spline and Friedman’s super-smoother pick up a sharp decrease at around 1900.

This example uses Nile, which is a built-in dataset in R. It is a time-series object, so to load the dataset in Displayr, first create an R Output and type the following code in the textbox for R code:

data(Nile)
x = Nile

The second line is necessary to assign a name to the time series data. You will then be able to use the data set in a chart by selecting x in the Output in ‘Pages’ dropdown (under Data Source in the Inputs tab).

Cubic spline trend line

Super-smoother trend line

LOESS trend line

Trend lines are added by going to the Chart tab in the object inspector and selecting a method (cubic spline, Friedman’s super-smoother or LOESS) for the line of best fit. Once this option is selected, other controls to customize the appearance of the line of best fit are shown under the Trend lines group. You may also want to adjust the opacity or the color palette (under the Data series group) of the data (i.e., the column bars).
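Under the hood, these smoothers correspond to standard R fits. As a rough illustration (not necessarily the exact settings Displayr uses), the three smoothers can be fitted to the Nile series in base R as follows:

data(Nile)
t <- as.numeric(time(Nile))

spline_fit <- smooth.spline(t, Nile)                     # cubic smoothing spline, smoothness by cross-validation
super_fit  <- supsmu(t, Nile)                            # Friedman's super-smoother, span by cross-validation
loess_fit  <- loess(as.numeric(Nile) ~ t, span = 0.75)   # LOESS with its fixed default span

plot(Nile, col = "grey60", ylab = "Annual flow")
lines(spline_fit, col = "red")
lines(super_fit,  col = "blue")
lines(t, predict(loess_fit), col = "darkgreen")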

To find out more about visualizations in Displayr, head to our blog now. Ready to use trend lines for your own data? Grab your free trial of Displayr now.

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

Clean Your Data in Seconds with This R Function

By Naeemah Aliya Small

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

All data needs to be clean before you can explore it and create models. Common sense, right? Cleaning data can be tedious, but I created a function that will help.

The function does the following:

  • Cleans the data of NAs and blanks
  • Separates the clean data into an integer dataframe, a double dataframe, a factor dataframe, a numeric dataframe, and a factor-and-numeric dataframe
  • Views the new dataframes
  • Creates a view of the summary and describe output from the clean data
  • Creates histograms of the data frames
  • Saves all the objects

This will happen in seconds.

Package

First, load the Hmisc package. I always save the original file.
The code below is the engine that cleans the data file.

cleandata 

The function

The function is below. You need to copy the code and save it in an R file. Run the code and the function cleanme will appear.

cleanme 
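The body of the function did not survive aggregation above, so here is a minimal sketch of what a cleanme()-style helper matching the bullet points could look like. This is an illustrative reconstruction under my own assumptions, not the author's original code.

# Sketch of a cleanme()-style helper: drop NAs/blanks, split by column type,
# print summary() and Hmisc::describe(), draw histograms, and save everything.
library(Hmisc)

cleanme <- function(dataname) {
    dataname[dataname == ""] <- NA                                # treat blanks as missing
    cleandata <- dataname[complete.cases(dataname), , drop = FALSE]

    integerdata <- cleandata[, sapply(cleandata, is.integer), drop = FALSE]
    doubledata  <- cleandata[, sapply(cleandata, is.double),  drop = FALSE]
    factordata  <- cleandata[, sapply(cleandata, is.factor),  drop = FALSE]
    numericdata <- cleandata[, sapply(cleandata, is.numeric), drop = FALSE]

    print(summary(cleandata))                                     # summary view
    print(describe(cleandata))                                    # Hmisc::describe view

    for (nm in names(numericdata)) {                              # histograms of numeric columns
        hist(numericdata[[nm]], main = nm, xlab = nm)
    }

    save(cleandata, integerdata, doubledata, factordata, numericdata,
         file = "cleanmework.RData")                              # reload later with load()
    invisible(cleandata)
}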

Type in and run:

cleanme(dataname)

When all the data frames appear, run the following to load the saved workspace objects:

load("cleanmework.RData")

Enjoy

    Related Post

    1. Hands-on Tutorial on Python Data Processing Library Pandas – Part 2
    2. Hands-on Tutorial on Python Data Processing Library Pandas – Part 1
    3. Using R with MonetDB
    4. Recording and Measuring Your Musical Progress with R
    5. Spark RDDs Vs DataFrames vs SparkSQL – Part 4 Set Operators
    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.


    Source:: R News

    pinp 0.0.6: Two new options

    By Thinking inside the box

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A small feature release of our pinp package for snazzier one or two column vignettes got onto CRAN a little earlier.

    It offers two new options. Saghir Bashir addressed a longer-standing help needed! issue and contributed code to select papersize options via the YAML header. And I added support for the collapse option of knitr, also via YAML header selection.
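    For example, a vignette could opt into both new options from its YAML header along these lines (the papersize value shown is only a guess at one accepted setting; check the package documentation for the supported values):

      ---
      title: "A pinp Vignette"
      output: pinp::pinp
      papersize: a4        # new in 0.0.6; the value here is illustrative
      collapse: true       # new in 0.0.6; defaults to false
      ---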

    A screenshot of the package vignette can be seen below. Additional screenshots are at the pinp page.

    The NEWS entry for this release follows.

    Changes in pinp version 0.0.6 (2018-07-16)

    • Added YAML header option ‘papersize’ (Saghir Bashir in #54 and #58 fixing #24).

    • Added YAML header option ‘collapse’ with default ‘false’ (#59).

    Courtesy of CRANberries, there is a comparison to the previous release. More information is on the pinp page. For questions or comments use the issue tracker off the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


    Source:: R News

    10 Jobs for R users from around the world (2018-07-17)

    By Tal Galili


    To post your R job on the next post

    Just visit this link and post a new R job to the R community.

    You can post a job for free (and there are also “featured job” options available for extra exposure).

    Current R jobs

    Job seekers: please follow the links below to learn more and apply for your R job of interest:

    Featured Jobs

    1. Freelance
      Data Analytics Instructor Level Education From Northeastern University – Posted by LilyMeyer
      Boston Massachusetts, United States
      16 Jul 2018
    2. Full-Time
      Financial Systems Analyst National Audit Office – Posted by ahsinebadian
      London England, United Kingdom
      13 Jul 2018
    3. Full-Time
      Data Scientist National Audit Office – Posted by ahsinebadian
      London England, United Kingdom
      13 Jul 2018
    4. Freelance
      Senior Data Scientist Data Science Talent – Posted by damiendeighan
      Frankfurt am Main Hessen, Germany
      6 Jul 2018
    5. Full-Time
      Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
      New York New York, United States
      15 Jun 2018

    All New R Jobs

    1. Freelance
      R programmer & population data scientist United Nations, DESA, Population Division – Posted by pgerland
      Anywhere
      17 Jul 2018
    2. Freelance
      Data Analytics Instructor Level Education From Northeastern University – Posted by LilyMeyer
      Boston Massachusetts, United States
      16 Jul 2018
    3. Freelance
      Regex enthusiast needed for summer job MoritzH
      Anywhere
      13 Jul 2018
    4. Full-Time
      Financial Systems Analyst National Audit Office – Posted by ahsinebadian
      London England, United Kingdom
      13 Jul 2018
    5. Full-Time
      Data Scientist National Audit Office – Posted by ahsinebadian
      London England, United Kingdom
      13 Jul 2018
    6. Full-Time
      Shiny-R-Developer INSERM U1153 – Posted by menyssa
      Paris Île-de-France, France
      9 Jul 2018
    7. Freelance
      Senior Data Scientist Data Science Talent – Posted by damiendeighan
      Frankfurt am Main Hessen, Germany
      6 Jul 2018
    8. Full-Time
      Behavioral Risk Research Associate @ New York EcoHealth Alliance – Posted by EcoHealth Alliance
      New York New York, United States
      23 Jun 2018
    9. Full-Time
      postdoc in psychiatry: machine learning in human genomics University of Iowa – Posted by michaelsonlab
      Anywhere
      18 Jun 2018
    10. Full-Time
      Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
      New York New York, United States
      15 Jun 2018

    On R-users.com you can see all the R jobs that are currently available.

    R-users Resumes

    R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

    (You may also look at previous R jobs posts.)

    Source:: R News

    Hamiltonian tails

    By xi’an

    (This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

    “We demonstrate HMC’s sensitivity to these parameters by sampling from a bivariate Gaussian with correlation coefficient 0.99. We consider three settings (ε,L) = {(0.16; 40); (0.16; 50); (0.15; 50)}” Ziyu Wang, Shakir Mohamed, and Nando De Freitas. 2013

    In an experiment with my PhD student Changye Wu (who wrote all R codes used below), we looked back at a strange feature in a 2013 ICML paper by Wang, Mohamed, and De Freitas. Namely, a rather poor performance of a Hamiltonian Monte Carlo (leapfrog) algorithm on a two-dimensional strongly correlated Gaussian target, for very specific values of the parameters (ε,L) of the algorithm.
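    The R code itself does not appear below (the figures came from Changye Wu’s code), but for concreteness a minimal leapfrog HMC sampler for this bivariate Gaussian target looks roughly like the following sketch; the chain length and seed are my own arbitrary choices.

      # Minimal leapfrog HMC for a bivariate Gaussian with correlation 0.99 (illustrative sketch)
      set.seed(42)
      rho   <- 0.99
      Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
      Omega <- solve(Sigma)                                  # precision matrix

      U     <- function(q) 0.5 * drop(t(q) %*% Omega %*% q)  # potential energy
      gradU <- function(q) drop(Omega %*% q)                 # gradient of the potential

      hmc_step <- function(q, eps, L) {
        p  <- rnorm(2)                                       # refresh momentum
        q1 <- q
        p1 <- p - eps / 2 * gradU(q1)                        # half step for momentum
        for (l in seq_len(L)) {
          q1 <- q1 + eps * p1                                # full step for position
          if (l < L) p1 <- p1 - eps * gradU(q1)              # full step for momentum
        }
        p1 <- p1 - eps / 2 * gradU(q1)                       # final half step
        H0 <- U(q)  + sum(p^2)  / 2                          # Hamiltonian before
        H1 <- U(q1) + sum(p1^2) / 2                          # Hamiltonian after
        if (log(runif(1)) < H0 - H1) q1 else q               # accept / reject
      }

      run_hmc <- function(n, eps, L, q0 = c(0, 0)) {
        out <- matrix(NA_real_, n, 2)
        q <- q0
        for (i in seq_len(n)) { q <- hmc_step(q, eps, L); out[i, ] <- q }
        out
      }

      # The three settings considered by Wang et al.
      s1 <- run_hmc(2000, eps = 0.16, L = 40)
      s2 <- run_hmc(2000, eps = 0.16, L = 50)
      s3 <- run_hmc(2000, eps = 0.15, L = 50)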

    The Gaussian target associated with this sample stands right in the middle of the two clouds, as identified by Wang et al. And the leapfrog integration path for (ε,L)=(0.15,50)

    keeps jumping between the two ridges (or tails), with no stop in the middle. Changing (ε,L) ever so slightly to (ε,L)=(0.16,40) does not modify the path very much

    but the HMC output is quite different since the cloud then sits right on top of the target

    with no clear explanation except for a sort of periodicity in the leapfrog sequence associated with the velocity generated at the start of the code. Looking at the Hamiltonian values for (ε,L)=(0.15,50)

    and for (ε,L)=(0.16,40)

    does not help, except to point at a sequence located far in the tails of this Hamiltonian, surprisingly varying when supposed to be constant. At first, we thought the large value of ε was to blame, but much smaller values still return poor convergence performances, as below for (ε,L)=(0.01,450).

    To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.


    Source:: R News

    Continuous deployment of package documentation with pkgdown and Travis CI

    By filip Schouwenaars

    (This article was first published on DataCamp Community – r programming, and kindly contributed to R-bloggers)

    The problem

    pkgdown is an R package that can create a beautiful-looking website for your own R package. Built and maintained by Hadley Wickham and his gang of prolific contributors, this package can parse the documentation files and vignettes for your package and build a website from them with a single command: build_site(). This is what such a pkgdown-generated website looks like in action.
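    In other words, building the site locally boils down to a single call from your package’s root directory (assuming pkgdown is installed):

      # install.packages("pkgdown")
      pkgdown::build_site()   # renders the site into the docs/ folder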

    The html files that pkgdown generated are stored in a docs folder. If your source code is hosted on GitHub, you just have to commit this folder to GitHub, navigate to the Settings panel of your GitHub repo and enable GitHub Pages to host the docs folder at https://<username>.github.io/<repository>. It’s remarkably easy and a great first step. In fact, this is how the pkgdown-built website for pkgdown itself is hosted.

    Although it’s an elegant flow, there are some issues with this approach. First, you’re committing files that were automatically generated even though the source required to build them is already stored in the package. In general, it’s not good practice to commit automatically generated files to your repo. What if you update your documentation, and commit the changes without rerendering the pkgdown website locally? Your repo files will be out of sync, and the pkgdown website will not reflect the latest changes. Second, there is no easy way to control when you release your documentation. Maybe you want to work off of the master branch, but you don’t want to update the docs until you’ve done a CRAN release and corresponding GitHub release. With the ad-hoc approach of committing the docs folder, this would be tedious.

    The solution

    There’s a quick fix for these concerns though, and that is to use Travis CI. Travis CI is a continuous integration tool that is free for open-source projects. When configured properly, Travis will pick up on any changes you make to your repo. For R packages, Travis is typically used to automatically run the battery of unit tests and check if the package builds on several previous versions of R, among other things. But that’s not all; Travis is also capable of doing deployments. In this case, I’ll show you how you can set up Travis so it automatically builds the pkgdown website for you and commits the web files to the gh-pages branch, which is then subsequently used by GitHub to host your package website. To see how it’s set up for an R package in production, check out the testwhat package on GitHub, which we use at DataCamp to grade student submissions and give useful feedback. In this tutorial, I will set up pkgdown for the tutorial package, another one of DataCamp’s open-source projects to make your blogs interactive.

    The steps

    1. Go to https://travis-ci.org and link your GitHub account.
    2. On your Travis CI profile page, enable Travis for the project repo that you want to build the documentation for. The next time you push a change to your GitHub project, Travis will be notified and will try to build your project. More on that later.

    3. In the DESCRIPTION file of your R package, add pkgdown to the Suggests list of packages. This ensures that when Travis builds/installs your package, it will also install pkgdown so we can use it for building the website.

    4. In the .gitignore file, make sure that the entire docs folder is ignored by git: add the line docs/*.
    5. Add a file with the name .travis.yml to your repo’s root folder, with the following content:

      language: r
      cache: packages
      
      after_success:
        - Rscript -e 'pkgdown::build_site()'
      
      deploy:
        provider: pages
        skip-cleanup: true
        github-token: $GITHUB_PAT
        keep-history: true
        local-dir: docs
        on:
          branch: master
      

      This configuration file is very short, but it’s doing a lot of different things. Jeroen Ooms and Jim Hester are maintaining a default Travis build configuration for R packages that does a lot of things for you out of the box. A Travis config file with only the language: r tag would already build, test and check your package for inconsistencies. Let’s go over the other fields:

      • cache: packages tells Travis to cache the package installs between builds. This will significantly speed up your package build time if you have some package dependencies.
      • after_success tells Travis which steps to take when the R CMD CHECK step has succeeded. In our case, we’re telling Travis to build the pkgdown website, which will create a docs folder on Travis’s servers.
      • Finally, deploy asks Travis to go ahead and upload the files in the docs folder (local-dir) to GitHub pages, as specified through provider: pages. The on field tells Travis to do this deployment step if the change that triggered a build happened on the master branch.

      For a full overview of the settings, you can visit this help article. We do not have to specify the GitHub target branch where the docs have to be pushed to, as it defaults to gh-pages.

    6. Notice that the deploy step also features a github-token field, that takes an environment variable. Travis needs this key to make changes to the gh-pages branch. To get these credentials and make sure Travis can find them:

      • Go to your GitHub profile settings and create a new personal access token (PAT) under the Developer Settings tab. Give it a meaningful description, and make sure to generate a PAT that has either the public_repo (for public packages) or repo (for private packages) scope.

      • Copy the PAT and head over to the Travis repository settings, where you can specify environment variables. Make sure to name the environment variable GITHUB_PAT.

    7. The build should be good to go now! Commit the changes to your repo (DESCRIPTION and .travis.yml) to the master branch of your GitHub repo with a meaningful message.

    8. Travis will be notified and get to work: it builds the package, checks it, and if these steps are successful, it will build the pkgdown website and upload it to gh-pages.

    9. GitHub notices that a gh-pages branch has been created, and immediately hosts it at https://<username>.github.io/<repository>. In our case, that is https://datacamp.github.io/tutorial. Have a look. What a beauty! Without any additional configuration, pkgdown has built a website with the GitHub README as the homepage, a full overview of all exported functions, and vignettes under the Articles section.

    From now on, every time you update the master branch of your package and the package checks pass, your documentation website will be updated automatically. You no longer have to worry about keeping the generated files in sync with your actual in-package documentation and vignettes. You can also easily tweak the deployment process so it only builds the documentation whenever you make a GitHub release. Along the way, you got continuous integration for your R package for free: the next time you make a change, Travis will notify you if it broke any tests or checks.

    Happy packaging!


    To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community – r programming.


    Source:: R News

    New Course Content: DS4B 201 Chapter 7, The Expected Value Framework For Modeling Churn With H2O

    By business-science.io – Articles


    (This article was first published on business-science.io – Articles, and kindly contributed to R-bloggers)

    I’m pleased to announce that we released brand new content for our flagship course, Data Science For Business (DS4B 201). The latest content is focused on transitioning from modeling Employee Churn with H2O and LIME to evaluating our binary classification model using Return-On-Investment (ROI), thus delivering business value. We do this through application of a special tool called the Expected Value Framework. Let’s learn about the new course content available now in DS4B 201, Chapter 7, which covers the Expected Value Framework for modeling churn with H2O!

    Related Articles On Applying Data Science To Business

    If you’re interested in learning data science for business and topics discussed in the article (the expected value framework and the Business Science Problem Framework (BSPF)), check out some of these articles.

    Learning Trajectory

    We’ll touch on the following topics in this article:

    Alright, let’s get started!


    Get The Best Resources In Data Science. Every Friday!

    Sign up for our free “5 Topic Friday” Newsletter. Every week, I’ll send you the five coolest topics in data science for business that I’ve found that week. These could be new R packages, free books, or just some fun to end the week on.

    Sign Up For Five-Topic-Friday!


    Where We Came From (DS4B 201 Chapters 1-6)

    Data Science For Business (DS4B 201) is the ultimate machine learning course for business. Over the course of 10 weeks, the student learns from an end-to-end data science project involving a major issue impacting organizations: Employee Churn. The student:

    • Learns tools to size the Business Problem, to communicate with Executives and C-Suite, and to integrate data science results financially, which is how the organization benefits.

    • Follows our BSPF systematic data science for business framework (the Business Science Problem Framework, get it here, learn about it here)

    • Uses advanced tools including H2O Automated Machine Learning and LIME Feature Explanation (learn about them here)

    • Applies cutting-edge data science using the tidyverse, builds custom functions with Tidy Eval, and works with a host of other packages including fs, recipes, skimr, GGally, cowplot, and more as they complete an integrated data science project.

    Chapter-By-Chapter Breakdown:

    “The ultimate machine learning course for business!”

    • Chapter 0, Getting Started: The student learns how to calculate the true cost of employee attrition, which is a hidden cost (indirect) because lost productivity doesn’t hit the financial statements. The student then learns about the tools in our toolbox including the integrated BSPF + CRISP-DM data science project framework for applying data science to business.

    • Chapter 1, Business Understanding: BSPF And Code Workflows: From here-on-out, lectures are 95% R code. The student begins his/her transition into sizing the problem financially using R code. The student creates his/her first custom function using Tidy Eval and develops custom plotting functions with ggplot2 to financially explain turnover by department and job role.

    Chapter 1: Code Workflow and Custom Functions for Understanding the Size of the Problem

    • Chapter 2, Data Understanding: By Data Type And Feature-Target Interactions: The student learns two methods for Exploratory Data Analysis (EDA). The first is exploration by data type using the skimr package. The second is visualizing the Feature-Target Interactions using GGally.

    • Chapter 3, Data Preparation: Getting Data Ready For People And Machines: The student first learns about wrangling and merging data using map() and reduce() to create a preprocessing pipeline for the human-readable data format. Next the student uses the recipes package to prepare the data for a pre-modeling Correlation Analysis. The pre-modeling Correlation Analysis is performed to tell us if we’re ready to move into Modeling.

    • Chapter 4, Modeling Churn: Using Automated Machine Learning With H2O: This is the first of two chapters on H2O. The student learns how to use h2o.automl() to generate an H2O Leaderboard containing multiple models including deep learning, stacked ensembles, gradient boosted machine, and more! (A minimal sketch of this workflow appears after this list.)

    • Chapter 5, Modeling Churn: Assessing H2O Performance: The student works with the h2o.performance() function output using various H2O performance functions. The student then learns how to measure performance for different audiences. The ROC Plot and Precision Vs Recall Plots are developed for data scientist evaluation. The Cumulative Gain and Lift Plots are built for executive / C-suite evaluation. The student ends this chapter building the “Ultimate Model Performance Comparison Dashboard” for evaluating multiple H2O models.

    Chapter 5: Ultimate Performance Dashboard for Comparing H2O Models

    • Chapter 6, Modeling Churn: Explaining Black-Box Models With LIME: Prediction is important, but it’s even more important in business to understand why employees are leaving. In this chapter, students learn LIME, a useful package for explaining deep learning and stacked ensembles, which often have the best performance.
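    For orientation, the core H2O workflow covered in Chapters 4 and 5 boils down to a handful of calls. The sketch below uses a hypothetical attrition_tbl data frame and arbitrary settings of my own, not the course’s actual setup:

      library(h2o)
      h2o.init()

      data_h2o <- as.h2o(attrition_tbl)                        # attrition_tbl is a hypothetical data frame
      splits   <- h2o.splitFrame(data_h2o, ratios = 0.85, seed = 1234)
      train    <- splits[[1]]
      test     <- splits[[2]]

      y <- "Attrition"
      x <- setdiff(names(train), y)

      automl <- h2o.automl(x = x, y = y, training_frame = train,
                           max_runtime_secs = 30)               # Chapter 4: AutoML leaderboard
      automl@leaderboard

      perf <- h2o.performance(automl@leader, newdata = test)    # Chapter 5: performance assessment
      h2o.auc(perf)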

    OK, now that we understand where we’ve been, let’s take a sneak peek at the new content!

    New Content (Chapter 7): Calculating The Expected ROI (Savings) Of A Policy Change

    This is where the rubber meets the road with ROI-Driven Data Science! You’ll learn how to use the Expected Value Framework to calculate savings for two policy changes:

    • Policy #1: “No Overtime”: Savings of 13% ($2.6M savings for 200 high-risk employees)!!!

    • Policy #2: “Targeted Overtime”: Savings of 16% ($3.2M savings for 200 high-risk employees)!!!!!

    Here’s the YouTube Video of the Expected Value Framework for Delivering ROI.

    Students implement two overtime reduction policies. The first is a “No Overtime Policy”, which results in a 13% savings versus the baseline (do nothing). The second is a “Targeted Overtime Reduction Policy”, which increases the savings to 16% versus the baseline (do nothing). The targeted policy uses the F1 score, showing the performance boost over both the “Do Nothing Policy” and the “No Overtime Policy”.

    The targeted policy requires working with the expected rates. It’s an un-optimized strategy that treats false positives and false negatives as equally costly (it uses the F1 score, which does not account for the business cost of false negatives). This occurs at a threshold of 28%, which can be seen in the Expected Rates graph below.

    Chapter 7: Working With Expected Rates

    Calculating the Expected Value at the threshold that balances false negatives and false positives yields a 16% savings over a “Do Nothing Policy”. This targeted policy applies an overtime reduction policy to anyone with greater than a 28% class probability of quitting.

    Chapter 7: Calculating Expected Savings Vs Baseline (Do Nothing)
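    To make the expected value idea concrete, here is a toy version of the savings calculation. Every number in it (attrition cost, policy cost, policy effect) is an illustrative assumption of mine, not a figure from the course:

      # prob: per-employee predicted probability of quitting (e.g. from the H2O model)
      expected_cost <- function(prob, threshold,
                                cost_attrition = 75000,    # assumed cost of losing one employee
                                cost_policy    = 5000,     # assumed per-employee cost of the policy
                                policy_effect  = 0.30) {   # assumed cut in quit probability
        targeted <- prob >= threshold
        ifelse(targeted,
               cost_policy + prob * (1 - policy_effect) * cost_attrition,
               prob * cost_attrition)
      }

      expected_savings_pct <- function(prob, threshold, ...) {
        do_nothing <- sum(expected_cost(prob, threshold = Inf, ...))   # target no one
        policy     <- sum(expected_cost(prob, threshold, ...))
        1 - policy / do_nothing
      }

      # e.g. target everyone with more than a 28% class probability of quitting:
      # expected_savings_pct(prob, threshold = 0.28)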

    We end Chapter 7 with a brief discussion on False Positives and False Negatives. The problem with using the threshold that maximizes F1 is that False Negatives are typically 3X to 5X more costly than False Positives. With a little extra work, we can do even better than a 16% savings, and that’s where Chapter 8 comes in.

    Where We’re Going (Chapter 8): Threshold Optimization and Sensitivity Analysis

    Chapter 8 picks up where Chapter 7 left off by focusing on using the purrr library to iteratively calculate savings. Two analyses are performed:

    1. Threshold Optimization Using Cost/Benefit and Expected Value Framework – Maximizes profit (savings)

    2. Sensitivity Analysis to adjust parameters that are “assumptions”, grid-searching best/worst case scenarios to see their effect on expected savings.

    The threshold optimization is the first step, which can be performed by iteratively calculating the expected savings at various thresholds using the purrr package.

    Chapter 8: Threshold Optimization With `purrr`
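    Reusing the toy expected_savings_pct() helper sketched in the Chapter 7 section above, the iteration over candidate thresholds is short with purrr:

      library(purrr)
      library(tibble)

      # prob: predicted quit probabilities, as before (assumed to exist)
      thresholds <- seq(0, 1, by = 0.01)
      results <- tibble(
        threshold = thresholds,
        savings   = map_dbl(thresholds, ~ expected_savings_pct(prob, threshold = .x))
      )
      results[which.max(results$savings), ]   # threshold with the maximum expected savings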

    Next, the student visualizes the threshold optimization results using ggplot2.

    Chapter 8: Visualizing Optimization Results With `ggplot2`

    Sensitivity analysis is the final step. The student goes through a similar process, but this time uses purrr’s partial(), cross_df(), and pmap_dbl() to calculate a range of potential values for inputs that are not completely known. For example, the percentage of overtime worked in the future is unlikely to be the same as in the current year. How does that affect the model? How does the future overtime interact with other assumptions like the future net revenue per employee? Find out how to handle this by taking the course. 🙂
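    A compact sketch of that pattern, with a toy stand-in for the savings function (all values below are illustrative assumptions, not the course’s numbers):

      library(purrr)

      # Toy stand-in for the savings calculation as a function of uncertain assumptions
      calculate_savings <- function(threshold, avg_overtime_pct, net_rev_per_employee,
                                    n_employees = 200, base_attrition = 0.15) {
        attrition_avoided <- base_attrition * min(1, avg_overtime_pct / 0.10) * (1 - threshold)
        n_employees * attrition_avoided * net_rev_per_employee
      }

      # Fix the parameter we know (the chosen threshold), leave the assumptions free
      savings_partial <- partial(calculate_savings, threshold = 0.28)

      # Grid of best/worst case values for the uncertain inputs
      grid <- cross_df(list(
        avg_overtime_pct     = c(0.05, 0.10, 0.15),
        net_rev_per_employee = c(200000, 250000, 300000)
      ))

      grid$savings <- pmap_dbl(grid, savings_partial)
      grid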

    Next Steps: Take The DS4B 201 Course!

    If interested in learning more, definitely check out Data Science For Business (DS4B 201). In 10 weeks, the course covers all of the steps to solve the employee turnover problem with H2O in an integrated end-to-end data science project.

    The students love it. Here’s a comment we just received last Sunday morning from one of our students, Siddhartha Choudhury, Data Architect at Accenture.

    Testimonial

    “To be honest, this course is the best example of an end to end project I have seen from business understanding to communication.”

    Siddhartha Choudhury, Data Architect at Accenture

    See for yourself why our students have rated Data Science For Business (DS4B 201) a 9.0 of 10.0 for Course Satisfaction!


    Get Started Today!

    Learning More

    Check out our other articles on Data Science For Business!

    Business Science University

    Business Science University is a revolutionary new online platform that gets you results fast.

    Why learn from Business Science University? You could spend years trying to learn all of the skills required to confidently apply Data Science For Business (DS4B). Or you can take the first course in our integrated Virtual Workshop, Data Science For Business (DS4B 201). In 10 weeks, you’ll learn:

    • A 100% ROI-driven Methodology – Everything we teach is to maximize ROI.

    • A clear, systematic plan that we’ve successfully used with clients

    • Critical thinking skills necessary to solve problems

    • Advanced technology: H2O Automated Machine Learning

    • 95% of the skills you will need when wrangling data, investigating data, building high-performance models, explaining the models, evaluating the models, and building tools with the models

    You can spend years learning this information, or you can learn it in 10 weeks (at a one-chapter-per-week pace). Get started today!


    Sign Up Now!

    DS4B Virtual Workshop: Predicting Employee Attrition

    Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.


    Get Started Today!

    HR 301 Shiny Application: Employee Prediction

    Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301

    Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

    The Virtual Workshop is code intensive (like these articles) but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework and many data science tools in an integrated fashion. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.

    Here’s what one of our students, Jason Aizkalns, Data Science Lead at Saint-Gobain had to say:

    “In an increasingly crowded data science education space, Matt and the Business Science University team have found a way to differentiate their product offering in a compelling way. BSU offers a unique perspective and supplies you with the materials, knowledge, and frameworks to close the gap between just “doing data science” and providing/creating value for the business. Students will learn how to formulate their insights with a value-creation / ROI-first mindset which is critical to the success of any data science project/initiative in the “real world”. Not only do students work a business problem end-to-end, but the icing on the cake is “peer programming” with Matt, albeit virtually, who codes clean, leverages best practices + a good mix of packages, and talks you through the why behind his coding decisions – all of which lead to a solid foundation and better habit formation for the student.”

    Jason Aizkalns, Data Science Lead at Saint-Gobain


    Get Started Today!

    Don’t Miss A Beat

    Connect With Business Science

    If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:

    To leave a comment for the author, please follow the link and comment on their blog: business-science.io – Articles.


    Source:: R News

    Twitter coverage of the useR! 2018 conference

    By nsaunders

    (This article was first published on R – What You’re Doing Is Rather Desperate, and kindly contributed to R-bloggers)

    In summary:

    The code that generated the report (which I’ve used heavily and written about before) is at Github too. A few changes were required compared with previous reports, due to changes in the rtweet package and a weird issue with kable tables breaking markdown headers.

    I love that the most popular media attachment is a screenshot of a Github repo.

    To leave a comment for the author, please follow the link and comment on their blog: R – What You’re Doing Is Rather Desperate.


    Source:: R News

    Smoothing Time Series Data

    By Carmen Chan


    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    Trends in time series data can be estimated with both global methods, which involve fitting a regression over the whole time series, and more flexible local methods, where we relax the constraint of a single parametric function. Further details about how to construct estimated smooths in R can be found here.

    1. Global trends over time

    i. Linear

    One of the simplest methods to identify trends is to fit a linear regression model to the time series.

    ii. Quadratic

    For more flexibility, we can also fit the time series to a quadratic expression; that is, we use linear regression with the expanded basis functions (predictors) 1, x and x².

    iii. Polynomial

    If the linear model is not flexible enough, it can be useful to try a higher-order polynomial. In practice, polynomials of degree higher than three are rarely used. As demonstrated in the example below, changing from a quadratic to a cubic trend line does not always significantly improve the goodness of fit.
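    As a rough base-R sketch of these three global fits, using the Nile series introduced in the next section purely for illustration:

      data(Nile)
      year <- as.numeric(time(Nile))
      flow <- as.numeric(Nile)

      fit_linear    <- lm(flow ~ year)                        # straight-line trend
      fit_quadratic <- lm(flow ~ poly(year, 2, raw = TRUE))   # basis 1, x, x^2
      fit_cubic     <- lm(flow ~ poly(year, 3, raw = TRUE))   # degree-3 polynomial

      plot(year, flow, type = "l", col = "grey60", xlab = "Year", ylab = "Annual flow")
      lines(year, fitted(fit_linear),    col = "red")
      lines(year, fitted(fit_quadratic), col = "blue")
      lines(year, fitted(fit_cubic),     col = "darkgreen")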

    2. Local smoothers

    The first three approaches assume that the time series follows a single trend. Often, we want to relax this assumption. For example, we do not want variation at the beginning of the time series to affect estimates near the end of the time series. In the following section, we demonstrate the use of local smoothers using the Nile data set (included in R’s built-in data sets). It contains measurements of the annual river flow of the Nile over 100 years and is less regular than the data set used in the first example.

    i. Moving averages

    The easiest local smoother to grasp intuitively is the moving average (or running mean) smoother. It consists of taking the mean of a fixed number of nearby points. As we only use nearby points, adding new data to the end of the time series does not change estimated values of historical results. Even with this simple method we see that the question of how to choose the neighborhood is crucial for local smoothers. Increasing the bandwidth from 5 to 20 suggests that there is a gradual decrease in annual river flow from 1890 to 1905, instead of a sharp decrease at around 1900.

    Moving average smoothers
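    In base R, a centered moving average can be computed with stats::filter(); the two window widths below correspond to the bandwidths discussed above:

      data(Nile)
      ma5  <- stats::filter(Nile, rep(1 / 5, 5),   sides = 2)   # bandwidth 5
      ma20 <- stats::filter(Nile, rep(1 / 20, 20), sides = 2)   # bandwidth 20

      plot(Nile, col = "grey60", ylab = "Annual flow")
      lines(ma5,  col = "red")
      lines(ma20, col = "blue")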

    ii. Running line

    The running-line smoother reduces this bias by fitting a linear regression in a local neighborhood of the target value xᵢ. A popular algorithm using the running-line smoother is Friedman’s super-smoother, which uses cross-validation to find the best span. As seen in the plot below, Friedman’s super-smoother with the cross-validated span is able to detect the sharp decrease in annual river flow at around 1900.

    Running line smoothers
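    Friedman’s super-smoother is available in base R as supsmu(), with the span chosen by cross-validation:

      data(Nile)
      fit_supsmu <- supsmu(time(Nile), Nile)
      plot(Nile, col = "grey60", ylab = "Annual flow")
      lines(fit_supsmu, col = "red")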

    iii. Kernel smoothers

    An alternative approach to specifying a neighborhood is to decrease weights further away from the target value. In the figure below, we see that the continuous Gaussian kernel gives a smoother trend than a moving average or running-line smoother.

    Kernel smoothers
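    A Gaussian kernel smoother can be fitted with ksmooth(); the bandwidth below is an arbitrary choice for illustration:

      data(Nile)
      fit_kernel <- ksmooth(time(Nile), Nile, kernel = "normal", bandwidth = 10)
      plot(Nile, col = "grey60", ylab = "Annual flow")
      lines(fit_kernel, col = "red")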

    iv. Smoothing splines

    Splines consist of a piece-wise polynomial with pieces defined by a sequence of knots where the pieces join smoothly. It is most common to use cubic splines. Higher order polynomials can have erratic behavior at the boundaries of the domain.

    The smoothing spline avoids the problem of over-fitting by using regularized regression. This involves minimizing a criterion that includes both a penalty for the least squares error and a roughness penalty. Knots are initially placed at all of the data points. But the smoothing spline avoids over-fitting because the roughness penalty shrinks the coefficients of some of the basis functions towards zero. The smoothing parameter lambda controls the trade-off between goodness of fit and smoothness. It can be chosen by cross-validation.

    Smoothing spline
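    In base R, smooth.spline() fits the regularized spline, with the smoothing parameter chosen by cross-validation:

      data(Nile)
      fit_spline <- smooth.spline(time(Nile), Nile, cv = TRUE)   # leave-one-out CV for lambda
      plot(Nile, col = "grey60", ylab = "Annual flow")
      lines(fit_spline, col = "red")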

    v. LOESS

    LOESS (locally estimated scatterplot smoother) combines local regression with kernels by using locally weighted polynomial regression (by default, quadratic regression with tri-cubic weights). It is one of the most frequently used smoothers because of its flexibility. However, unlike Friedman’s super smoother or the smoothing spline, LOESS does not use cross-validation to select a span.

    LOESS
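    And the corresponding base-R fit, with LOESS’s default span of 0.75 and locally quadratic regression:

      data(Nile)
      nile_df   <- data.frame(year = as.numeric(time(Nile)), flow = as.numeric(Nile))
      fit_loess <- loess(flow ~ year, data = nile_df, span = 0.75, degree = 2)
      plot(nile_df$year, nile_df$flow, type = "l", col = "grey60",
           xlab = "Year", ylab = "Annual flow")
      lines(nile_df$year, predict(fit_loess), col = "red")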

    Find out more about data visualizations here.

    To leave a comment for the author, please follow the link and comment on their blog: R – Displayr.


    Source:: R News

    RcppClassic 0.9.11

    By Thinking inside the box

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new maintenance release, now at version 0.9.11, of the RcppClassic package arrived earlier today on CRAN. This package provides a maintained version of the otherwise deprecated initial Rcpp API, which no new projects should use as the normal Rcpp API is so much better.

    Per another request from CRAN, we updated the source code in four places to no longer use dynamic exception specifications. This is something C++11 deprecated, and g++-7 and above now complain about each use. No other changes were made.

    CRANberries also reports the changes relative to the previous release.

    Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


    Source:: R News