Create your Machine Learning library from scratch with R! (1/3)

By Antoine Guillot


(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

When dealing with Machine Learning problems in R, most of the time you rely on existing libraries. This speeds up the analysis, but do you really understand what is behind the algorithms? Could you implement a logistic regression from scratch in R?
The goal of this post is to create our own basic machine learning library from scratch with R. We will only use the linear algebra tools available in R. There will be three posts:

  1. Linear and logistic regression (this one)
  2. PCA and k-nearest neighbors classifiers and regressors
  3. Tree-based methods and SVM

Linear Regression (Least-Square)

The goal of linear regression is to estimate a continuous variable given a matrix of observations X. Before dealing with the code, we need to derive the solution of the linear regression.

Solution derivation of linear regression

Given a matrix of observations X and a target vector Y, the goal of linear regression is to minimize the L2 norm between Y and a linear estimate of Y: \hat{Y}=XW. Hence, linear regression can be rewritten as an optimization problem: \arg\min_W \|Y-XW\|^2. Setting the gradient -2X^T(Y-XW) to zero gives the normal equations X^T X W=X^T Y, so a closed-form solution is easily derived and the optimal W is \hat{W}=(X^T X)^{-1}X^T Y.
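As a quick sanity check (not part of the library itself), the closed-form formula recovers the coefficients of a simulated linear model; the data below is illustrative:

```r
##Simulated data: y = 1 + 2*x + small noise
set.seed(42)
x=rnorm(100)
y=1+2*x+rnorm(100,sd=0.1)
##Add an intercept column and apply the closed-form solution
X=cbind(1,x)
W_hat=solve(t(X)%*%X)%*%t(X)%*%y   ##(X^T X)^{-1} X^T Y
as.vector(W_hat)                    ##close to c(1, 2)
```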

Linear regression in R

Using the closed-form solution, we can easily code the linear regression. Our linear model object will have three methods: a fit method (the constructor) where the model is estimated, a predict method to work with new data, and a plot method to visualize the distribution of the residuals.

###Linear model

The fit function is simple: the first few lines transform the data to matrices and add an intercept if required. Then, the 'my_lm' object is created and the coefficients are computed. The solve() function is used to invert the matrix and %*% denotes matrix multiplication. At the end, the residuals and the estimates are computed and the class of the object is set to 'my_lm'.
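The listing itself did not survive in this copy of the post; here is a minimal sketch consistent with the description above (the argument names x, y and intercept are assumptions):

```r
##Fit a linear model via the closed-form solution
my_lm=function(x,y,intercept=TRUE)
{
  ##Type conversion to matrices
  if (!is.matrix(x)) x=as.matrix(x)
  if (!is.matrix(y)) y=as.matrix(y)
  ##Add an intercept column if required
  if (intercept) x=cbind(Intercept=1,x)
  ##Closed-form coefficients: solve() inverts t(x)%*%x
  coeffs=solve(t(x)%*%x)%*%t(x)%*%y
  ##Estimates and residuals
  fitted=x%*%coeffs
  residuals=y-fitted
  ##Build the object and set its class to 'my_lm'
  my_lm=list(coeffs=coeffs,fitted=fitted,residuals=residuals,intercept=intercept)
  class(my_lm)='my_lm'
  my_lm
}
```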
Now let's implement the predict and plot methods for the my_lm class:
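The original listing is missing here; a sketch of the two S3 methods, assuming the fitted object stores its coefficients as coeffs, its residuals as residuals, and an intercept flag (the field names are assumptions):

```r
##Predict method: linear estimate on new data
predict.my_lm=function(object,newdata,...)
{
  if (!is.matrix(newdata)) newdata=as.matrix(newdata)
  ##Add the intercept column if the model used one
  if (object$intercept) newdata=cbind(1,newdata)
  newdata%*%object$coeffs
}

##Plot method: distribution of the residuals
plot.my_lm=function(x,...)
{
  hist(x$residuals,main='Residuals distribution',xlab='Residual')
}
```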


You can test the code on a preinstalled R dataset such as cars. The code gives the same coefficient estimates as the lm function. For instance, on the cars dataset:
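The listing for this example is not preserved here; a self-contained check, applying the closed-form solution directly to cars and comparing with lm, could look like this:

```r
##Closed-form fit of stopping distance on speed
X=cbind(1,cars$speed)
Y=cars$dist
W_hat=solve(t(X)%*%X)%*%t(X)%*%Y
##lm returns the same estimates
rbind(closed_form=as.vector(W_hat),lm=coef(lm(dist~speed,data=cars)))
```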


Logistic regression

Previously, we worked on regression and the estimation of a continuous variable. Now, with logistic regression, we try to estimate a binary outcome (for instance, ill vs healthy, pass vs failed, …). Again, let’s deal with the maths first:

The mathematics of logistic regression

The goal is to estimate a binary outcome given the observations X. We assume that Y|X follows a Bernoulli distribution with parameter \theta(X)=\frac{e^{XW}}{1+e^{XW}}=\sigma(XW); \sigma is called the sigmoid function.
Hence, we have P(y|x)=\theta(x)^y(1-\theta(x))^{1-y}.
We want to maximize the log-likelihood of the observed sample (over \theta and hence over W):

[ l(\theta)=\sum_{x,y} y\log(\theta(x)) + (1-y)\log(1-\theta(x)) ]

This maximization will be done using Newton's method. Newton's method is a variant of gradient descent in which the curvature of the function is used to increase the speed of convergence. If you are not familiar with Newton's method, you can just see it as a variant of batch gradient descent. The weight update has the following form:

[ W_{t+1}=W_t-(\nabla^2 l(W_t))^{-1} \nabla l(W_t) ]

with the Hessian

[ \nabla^2 l(W_t)=-X^T \mathrm{Diag}\big(\sigma(XW_t)_i(1-\sigma(XW_t))_i\big) X ]

and the gradient

[ \nabla l(W_t)=X^T(y-\sigma(XW_t)) ]

The algorithm in R will apply this update until the termination criterion is met. Here, the algorithm stops when the mean squared weight update falls below a user-defined tolerance, or when the maximum number of iterations is reached.
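Before wrapping these formulas in a loop, a single Newton step can be sketched on simulated data (all names below are illustrative, not taken from the original listing):

```r
sigmoid=function(x) {1/(1+exp(-x))}
set.seed(1)
X=cbind(1,rnorm(20))                  ##20 observations with an intercept column
y=rbinom(20,1,sigmoid(X%*%c(0.5,1))) ##simulated binary outcomes
W=matrix(1,2)                         ##weights initialized to 1
p=sigmoid(X%*%W)
gradient=t(X)%*%(y-p)
hessian=-t(X)%*%diag(as.vector(p*(1-p)))%*%X
##Newton step: W_{t+1} = W_t - (Hessian)^{-1} gradient
W_new=W-solve(hessian,gradient)
```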

Logistic regression in R

###Sigmoid function
sigmoid=function(x) {1/(1+exp(-x))}

###Fit logistic regression
my_logistic=function(x,y,intercept=TRUE,tol=1e-6,max_iter=100)
{
  ##Type conversion
  if (!is.matrix(x)) x=as.matrix(x)
  if (!is.matrix(y)) y=as.matrix(y)
  ##Add intercept if required
  if (intercept) x=cbind(1,x)
  ##Algorithm initialization: weights are initialized to 1
  coeffs=matrix(1,ncol(x)); mse=Inf; iterations=0
  ##Updates the weights until the max number of iterations
  ##or the termination criterion is met
  while (iterations<max_iter & mse>tol)
  {
    preds=sigmoid(x%*%coeffs)
    ##Newton step: W_{t+1} = W_t - (Hessian)^{-1} gradient
    update=solve(-t(x)%*%diag(as.vector(preds*(1-preds)))%*%x,t(x)%*%(y-preds))
    coeffs=coeffs-update
    mse=mean(update^2); iterations=iterations+1
  }
  structure(list(coeffs=coeffs,intercept=intercept),class='my_logistic')
}

###Predict method: class 1 when the probability is above 0.5
predict.my_logistic=function(object,newdata,...)
{
  if (!is.matrix(newdata)) newdata=as.matrix(newdata)
  if (object$intercept) newdata=cbind(1,newdata)
  as.numeric(sigmoid(newdata%*%object$coeffs)>0.5)
}

The code is split into two parts:

  • The fit part, where the logistic model is fitted using Newton's method.
    This part has three main components. First, the data is put in the proper matrix format and, if required, the intercept is added. Then the algorithm parameters are initialized (all the weights are set to one).
    Finally, the algorithm updates the weights until the MSE goes below the tolerance. The most important line is the weight update, where the update formula derived above is applied.
  • The predict method, where the outcome is estimated using the weights computed previously.

We can now use our logistic regression to predict the class of a flower from the iris dataset:
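As a self-contained sketch (the Newton fit is inlined here so the example runs on its own; the zero initialization and the fixed iteration cap are choices made here to keep the step well-conditioned on the linearly separable setosa class):

```r
sigmoid=function(x) {1/(1+exp(-x))}
##Binary target: is the flower a setosa?
X=cbind(1,as.matrix(iris[,1:4]))
y=as.numeric(iris$Species=='setosa')
W=matrix(0,ncol(X))
for (i in 1:4)
{
  p=sigmoid(X%*%W)
  ##Newton step with the gradient and Hessian derived above
  W=W-solve(-t(X)%*%diag(as.vector(p*(1-p)))%*%X,t(X)%*%(y-p))
}
preds=as.numeric(sigmoid(X%*%W)>0.5)
mean(preds==y)   ##the setosa flowers are identified almost perfectly
```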


As expected, the algorithm can efficiently predict whether a flower is a setosa or not.

If you like this post, follow us to learn how to create your Machine Learning library from scratch with R!
