## Simulating Water Consumption to Develop Analysis and Reporting

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

I am currently working on developing analytics for a digital water metering project. Over the next five years, we are enabling 70,000 customer water meters with digital readers and transmitters. The data is not yet available, but we don't want to wait until the data is live to start building reporting systems. The R language comes to the rescue with its magnificent capabilities to simulate data. Simulating data is a useful technique for progressing a project while the data is still being collected. Simulated data also helps because the outcomes of the analysis are known in advance, which helps to validate the results.

The raw data that we will eventually receive from the digital customer meters has the following basic structure:

• DevEUI: Unique device identifier.
• Timestamp: Date and time in (UTC) of the transmission.
• Cumulative count: The number of revolutions the water meter makes. Each revolution is a pulse which equates to five litres of water.

Every device will send an hourly data burst which contains the cumulative meter read in pulse counts. The transmitters are set at a random offset from the whole hour, to minimise the risk of congestion at the receivers. The time stamp for each read is set in Coordinated Universal Time (UTC). Using this time zone prevents issues with daylight saving time. All analysis will be undertaken in the Australian Eastern (Daylight) Time zone.
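As a small illustration of that offset logic (a sketch with made-up device count and start date, not the project code), hourly UTC timestamps plus a constant per-device random offset can be generated like this:

```
# Sketch: hourly UTC timestamps with a random per-device offset so that
# transmissions don't all arrive on the whole hour. Values are illustrative.
start <- as.POSIXct("2020-01-01 00:00", tz = "UTC")
timestamps <- start + 3600 * (0:47)           # two days of hourly reads
offset <- sample(0:3599, 3, replace = TRUE)   # seconds past the hour, 3 devices
device_times <- lapply(offset, function(o) timestamps + o)
# Every device keeps a constant offset from the whole hour:
unique(as.numeric(device_times[[1]]) %% 3600)
```

Because the offset is fixed per device, each meter still reports exactly once per hour, just not on the hour.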

This article explains how we simulated test data to assist with developing reporting and analysis. The analysis of digital metering data follows in a future post. The code and the data can be found on GitHub. I have recently converted to using the Tidyverse for all my R coding. It has made my working life much easier and I will use it for all future posts.

## Simulating water consumption

For simplicity, this simulation assumes a standard domestic diurnal curve (average daily usage pattern) for indoor water use. Diurnal curves are an important piece of information in water management. The curve shows water consumption over the course of a day, averaged over a fixed period. The example below is sourced from a journal article. This generic diurnal curve consists of 24 data points based on measured indoor water consumption, shown in the graph below.

Source: Gurung et al. (2014) Smart meters for enhanced water supply network modelling and infrastructure planning. Resources, Conservation and Recycling (90), 34-50.

This diurnal curve only includes indoor water consumption and is assumed to be independent of seasonal variation. This is not a realistic assumption, but the purpose of this simulation is not to model water consumption accurately but to provide a data set to validate the reporting and analyses.

## Simulating water consumption in R

The first code snippet sets the parameters used in this simulation. The unique device identifiers (DevEUI) are simulated as six-digit random numbers. The timestamps vector consists of hourly date-time variables in UTC. For each individual transmitter, this timestamp is offset by a random time. Each transmitter is also associated with the number of people living in each house. This number is based on a Poisson distribution.

```
# Libraries
library(tidyverse)

# Boundary conditions
# (most of this block was lost in extraction; the values below are
# illustrative reconstructions -- the original parameters are on GitHub)
n <- 100                                    # number of simulated RTUs
d <- 100                                    # number of days to simulate
s <- as.POSIXct("2020-01-01", tz = "UTC")   # start of simulation
set.seed(1969)

rtu <- sample(1E5:(1E6 - 1), n)                  # six-digit DevEUIs
offset <- sample(0:3599, n, replace = TRUE)      # random offset in seconds
occupants <- rpois(n, 1.5) + 1                   # occupants per connection
timestamps <- seq.POSIXt(s, by = "hour", length.out = 24 * d)

data_frame(occupants = occupants) %>%
  ggplot(aes(occupants)) + geom_bar(fill = "dodgerblue2", alpha = 0.5) +
  xlab("Occupants") + ylab("Connections") + ggtitle("Occupants per connection")
```

Simulated number of occupants per connection.

The diurnal curve is based on actual data which includes leaks as the night time use shows a consistent flow of about one litre per hour. For that reason, the figures are rounded and reduced by one litre per hour, to show a zero flow when people are usually asleep. The curve is also shifted by eleven hours because the raw data is stored in UTC.

```
# Diurnal curve in litres per hour. The published values were lost in
# extraction; the numbers below are illustrative stand-ins, already rounded,
# reduced by one litre per hour and shifted by eleven hours to UTC.
diurnal <- c(4, 5, 6, 4, 2, 1, 0, 0, 0, 0, 0, 1,
             3, 6, 5, 4, 3, 2, 2, 2, 2, 2, 3, 4)
```

This simulation only aims to produce a realistic data set, not an accurate depiction of reality. It could be enhanced by using different diurnal curves for various customer segments and by including outdoor watering, temperature dependencies and so on.

## Simulating Water Consumption

A leak is defined as a constant flow through the meter, in addition to the idealised diurnal curve. A weighted binomial distribution (θ = 0.1) models approximately one in ten properties with a leak. The size of each leak is a random number between 10 and 50 litres per hour.

The data is stored in a matrix through a loop that cycles through each connection. The DevEUI is repeated over the simulated time period (24 times the number of days). The second variable is the time stamp plus the predetermined offset for each RTU. The meter count is defined by the cumulative sum of the diurnal flow, multiplied by the number of occupants. Each point in the diurnal curve deviates from the model curve by ±10%. Any predetermined leakage is added to each meter read over the whole period of 100 days. The hourly volumes are summed cumulatively to simulate meter reads. The flow is divided by five because each meter revolution indicates five litres.

The next code snippet simulates the digital metering data using the assumptions and parameters outlined above.
```
# Leak simulation: roughly one in ten properties leaks 10 to 50 litres
# per hour (logic reconstructed from the description above)
leaks <- rbinom(n, 1, prob = 0.1) * sample(10:50, n, replace = TRUE)
data_frame(DevEUI = rtu, Leak = leaks) %>%
  filter(Leak > 0)

# Digital metering data simulation
meter_reads <- matrix(ncol = 3, nrow = 24 * d * n,
                      dimnames = list(NULL, c("DevEUI", "TimeStampUTC", "Count")))
for (i in 1:n) {
  r <- ((i - 1) * 24 * d + 1):(i * 24 * d)
  flow <- rep(diurnal, d) * occupants[i] * runif(24 * d, 0.9, 1.1) + leaks[i]
  meter_reads[r, ] <- cbind(rep(rtu[i], 24 * d),
                            as.numeric(timestamps) + offset[i],
                            cumsum(flow) / 5)
}
meter_reads <- meter_reads %>%
  as_data_frame() %>%
  mutate(TimeStampUTC = as.POSIXct(TimeStampUTC, origin = "1970-01-01", tz = "UTC"))
```

## Missing Data Points

The data transmission process is not 100% reliable and the base station will not receive some reads. This simulation identifies reads to be removed from the data through the temporary variable remove. This simulation includes two types of failures:

• Faulty RTUs (2% of RTUs, each missing 95% of its data)
• Randomly missing data points (1% of all reads)
```
# Initialise temp variable that marks reads for removal
# (the start of this block was lost; the logic below is reconstructed
# from the description above)
meter_reads <- mutate(meter_reads, remove = 0)

# Faulty RTUs: 2% of devices lose 95% of their reads
faulty <- rtu[rbinom(n, 1, prob = 0.02) == 1]
meter_reads$remove[meter_reads$DevEUI %in% faulty] <-
  rbinom(sum(meter_reads$DevEUI %in% faulty), 1, prob = 0.95)

# Randomly missing data points: 1% of all reads
missing <- sample(1:nrow(meter_reads), 0.01 * nrow(meter_reads))
meter_reads$remove[missing] <- 1

meter_reads <- meter_reads %>%
  filter(remove == 0) %>%
  select(-remove)

# Visualise a few devices over two days
meter_reads %>%
  filter(DevEUI %in% rtu[1:3]) %>%
  mutate(TimeStampAEST = as.POSIXct(format(TimeStampUTC,
                                           tz = "Australia/Melbourne"))) %>%
  filter(TimeStampAEST >= as.POSIXct("2020-02-06") &
         TimeStampAEST <= as.POSIXct("2020-02-08")) %>%
  arrange(DevEUI, TimeStampAEST) %>%
  ggplot(aes(x = TimeStampAEST, y = Count, colour = factor(DevEUI))) +
  geom_line() + geom_point()
```

The graph shows an example of the cumulative reads and some missing data points.

## Analysing Digital Metering Data

Data simulation is a good way to develop your analysis algorithms before you have real data. I also used this technique when I was waiting for survey results during my dissertation. When the data finally arrived, I simply had to plug it in and fine-tune the code. R has great capabilities to simulate reality to help you understand the data.

In next week’s article, I will outline how I used R and the Tidyverse package to develop libraries to analyse digital metering data.

The post Simulating Water Consumption to Develop Analysis and Reporting appeared first on The Devil is in the Data.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

## Analysis of Trump’s State of the Union Speech, with R

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

President Trump’s State of the Union speech was last night, and it seemed to me it dragged on a bit. That’s because it was apparently the slowest SOTU speech in history, based on average number of words spoken per minute. The chart below, based on these data, was created by Josh Katz. (The R code for the chart is here. Note the use of his needs package to load the required R packages.)

Peter Aldhous at Buzzfeed analyzed last night’s State of the Union speech as well. Not only was the speech delivered slowly, the script was at a 9-year-old’s reading level, following a long-term decline in the complexity of State of the Union speeches.

You can find more analysis of this and past SOTU speeches at the link below. The R code behind the analysis can be found here. (You may also want to check out this analysis of Trump’s Twitter activity and the associated R code.)


## València summer school

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

In another continuation of the summer of Bayesian conferences in Europe, the Universitat de València is organising a summer school on Bayesian statistics, from 16 till 20 July 2018. Which thus comes right after our summer school on computational statistics at Warwick. With a basic course on Bayesian learning (2 days). And a more advanced course on Bayesian modeling with BayesX. And a final day workshop.


## t-sne dimension reduction on Spotify mp3 samples

(This article was first published on R – Longhow Lam’s Blog, and kindly contributed to R-bloggers)

## Introduction

Not long ago I was reading about t-Distributed Stochastic Neighbor Embedding (t-sne), a very interesting dimension reduction technique, and about the Mel frequency cepstrum, a sound processing technique. Details of both techniques can be found here and here. Can we combine the two in a data analysis exercise? Yes, and with not too much R code you can quickly create some visuals to get 'musical' insights.

## Spotify Data

Where can you get some sample audio files? Spotify! There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. Moreover, for many songs (not all though) Spotify provides downloadable preview mp3’s of 30 seconds. The link to the preview mp3 can be retrieved from the API. I am going to use some of these mp3’s for analysis.

In the web interface of Spotify you can look for interesting playlists. In the search field type in for example ‘Bach‘ (my favorite classical composer). In the search results go to the playlists tab, you’ll find many ‘Bach’ playlists from different users, including the ‘user’ Spotify itself. Now, given the user_id (spotify) and the specific playlist_id (37i9dQZF1DWZnzwzLBft6A for the Bach playlist from Spotify) we can extract all the songs using the API:

``` GET https://api.spotify.com/v1/users/{user_id}/playlists/{playlist_id}
```

You will get the 50 Bach songs from the playlist; most of them have a preview mp3. Let's also get the songs from a Heavy Metal playlist and a Michael Jackson playlist. In total, I have 146 songs with preview mp3's in three 'categories':

• Bach,
• Heavy Metal,
• Michael Jackson.
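As a sketch of that request from R using the httr package (the `token` value is a placeholder, and the request itself is left commented out because it needs valid Spotify credentials):

```
library(httr)

token <- "YOUR_SPOTIFY_ACCESS_TOKEN"     # placeholder: supply your own token
user_id <- "spotify"
playlist_id <- "37i9dQZF1DWZnzwzLBft6A"  # the Spotify 'Bach' playlist

# Build the endpoint URL shown above
url <- paste0("https://api.spotify.com/v1/users/", user_id,
              "/playlists/", playlist_id)
# resp   <- GET(url, add_headers(Authorization = paste("Bearer", token)))
# tracks <- content(resp)$tracks$items   # each item may carry a preview_url
url
```

The `preview_url` field of each returned track is what points at the downloadable 30-second mp3.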

## Transforming audio mp3’s to features

The mp3 files need to be transformed into data that I can use for machine learning. I am going to use the Python librosa package to do this. It is easy to call it from R using the reticulate package.

```
library(reticulate)
librosa = import("librosa")

#### python environment with librosa module installed
use_python(python = "/usr/bin/python3")
```

The downloaded preview mp3's have a sample rate of 22,050 Hz, so a 30-second audio file has 661,500 raw audio data points in total.

```
onemp3 = librosa$load("mp3songs/bach1.mp3")

length(onemp3[[1]])
length(onemp3[[1]])/onemp3[[2]]  # ~30 seconds sound

## 5 seconds plot
pp = 5*onemp3[[2]]
plot(onemp3[[1]][1:pp], type = "l")
```

A line plot of the raw audio values looks like this:

For sound processing, features extraction on the raw audio signal is often applied first. A commonly used feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC). We can calculate the MFCC for a song with librosa.

```
ff = librosa$feature
mel = librosa$logamplitude(
  ff$melspectrogram(
    onemp3[[1]],
    sr = onemp3[[2]],
    n_mels = 96
  ),
  ref_power = 1.0
)
image(mel)
```

Each mp3 is now a matrix of MFC coefficients, as shown in the figure above. We have fewer data points than the original 661,500, but still quite a lot: in our example the MFCC form a 96 by 1292 matrix, so 124,032 values. We apply t-sne dimension reduction to the MFCC values.

## Calculating t-sne

A simple and easy approach: each matrix is just flattened, so a song becomes a vector of length 124,032. The data set on which we apply t-sne consists of 146 records with 124,032 columns, which we reduce to 3 columns with the Rtsne package:

```
library(Rtsne)
tsne_out = Rtsne(AllSongsMFCCMatrix, dims = 3)
```
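The flattening step itself can be sketched as follows; `song_mels` is a hypothetical list of (toy-sized) per-song MFCC matrices standing in for the real 96 × 1292 matrices:

```
# Flatten a list of MFCC matrices into one row per song
song_mels <- list(matrix(rnorm(6 * 8), nrow = 6),
                  matrix(rnorm(6 * 8), nrow = 6),
                  matrix(rnorm(6 * 8), nrow = 6))
AllSongsMFCCMatrix <- do.call(rbind, lapply(song_mels, as.vector))
dim(AllSongsMFCCMatrix)  # 3 songs, 48 flattened coefficients each
```

With the real data, the same two lines produce the 146 × 124,032 matrix fed to Rtsne.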

The output object contains the 3 columns; I joined them back with the artist and song names so that I can create an interactive 3D scatter plot with R plotly. Below is a screenshot; the interactive version can be found here.

## Conclusion

It is obvious that Bach's music, heavy metal and Michael Jackson are different; you don't need machine learning to hear that. So, as expected, a straightforward dimension reduction on these songs with MFCC and t-sne clearly shows the differences in 3D space. Some Michael Jackson songs are very close to heavy metal. The complete R code can be found here.

Cheers, Longhow


## Last call for the course on Advanced R programming

(This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers)

Last call for the course on Advanced R programming scheduled in Leuven, Belgium on February 20-21, 2018. Register at: https://lstat.kuleuven.be/training/coursedescriptions/AdvancedprogramminginR.html

You’ll learn during that course:

• The apply family of functions, basic parallel programming for these functions and commonly needed data manipulation skills
• Making a basic reproducible report using Sweave and knitr, including tables, graphs and literate programming
• How to create an R package
• Understanding how S3 programming works: generics, environments and namespaces
• Basic tips on how to organise, develop and test R code

Need other training? Visit http://bnosac.be/index.php/training


## Connecting to SQL Server on shinyapps.io

(This article was first published on R on Locke Data Blog, and kindly contributed to R-bloggers)

If you use SQL Server (or Azure SQL DB) as your data store and you need to connect to the database from shinyapps.io, you're presently stuck with FreeTDS. If you have any control over your infrastructure, I cannot recommend the actual ODBC Driver on Linux highly enough for ease of use. Alas, shinyapps.io does not let you control the infrastructure, so we have to make do with FreeTDS, and it can be pretty painful to get right.
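A hedged sketch of what such a FreeTDS connection can look like via the odbc package; the server, database and credential values are placeholders, the `TDS_Version` value is a common choice rather than a universal one, and the actual `dbConnect()` call is commented out because it needs a reachable database:

```
library(odbc)  # assumption: DBI/odbc are listed among the app's dependencies

server   <- "mydb.example.com"   # placeholder values
database <- "mydatabase"

# FreeTDS usually needs the TDS protocol version spelled out explicitly
con_string <- paste0("Driver=FreeTDS;TDS_Version=7.2;",
                     "Server=", server, ";Port=1433;",
                     "Database=", database, ";",
                     "UID=myuser;PWD=mypassword;")
# con <- dbConnect(odbc::odbc(), .connection_string = con_string)
con_string
```

Getting the `TDS_Version` right for your SQL Server version is typically where most of the pain lives.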


## Where do you run to? Map your Strava activities on static and Leaflet maps.

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

So, Strava’s heatmap made quite a stir the last few weeks. I decided to give it a try myself. I wanted to create some kind of “personal heatmap” of my runs, using Strava’s API. Also, combining the data with Leaflet maps allows us to make use of the beautiful map tiles supported by Leaflet and to zoom and move the maps around – with the runs on it, of course.

So, let’s get started. First, you will need an access token for Strava’s API. I found all the necessary information for this in this helpful “Getting started” post. As soon as you have the token, you have access to your own data.

Now, let’s load some packages and define functions for getting and handling the data. For the get.activities() function, I adapted code from here.

```
library(httr)
library(rjson)
library(OpenStreetMap)
library(leaflet)
library(scales)
library(dplyr)

token <- "<your Strava API access token>"
```

```
# Note: parts of these function bodies were mangled in extraction and are
# reconstructed from the surviving fragments; see the original post for
# the exact code.
get.coord.df.from.stream <- function (stream.obj) {
  data.frame(lat = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[1]]),
             lon = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[2]]))
}

get.stream.from.activity <- function (act.id, token) {
  stream <- GET("https://www.strava.com/",
                path = paste0("api/v3/activities/", act.id, "/streams/latlng"),
                query = list(access_token = token))
  content(stream)
}

get.activities <- function (token) {
  activities <- GET("https://www.strava.com/", path = "api/v3/athlete/activities",
                    query = list(access_token = token, per_page = 200))
  activities <- content(activities)
  activities <- lapply(activities, function (x) {
    x[sapply(x, is.null)] <- NA
    unlist(x)
  })
  data.frame(do.call("rbind", activities))
}

get.multiple.streams <- function (act.ids, token) {
  res.list <- list()
  for (act.id.i in 1:length(act.ids)) {
    if (act.id.i %% 5 == 0) cat("Activity no.", act.id.i, "of", length(act.ids), "\n")
    stream <- get.stream.from.activity(act.ids[act.id.i], token)
    coord.df <- get.coord.df.from.stream(stream)
    res.list[[length(res.list) + 1]] <- list(act.id = act.ids[act.id.i],
                                             coords = coord.df)
  }
  res.list
}
```

We now have all the functions we need to get and parse the API's output. Let's apply them. The logic is: first, we get all activities. This dataframe has a column called 'id', which we can use to get the raw data for all activities (called 'streams' in the Strava API). The function get.coord.df.from.stream() creates a dataframe with lat/lon coordinates for one stream.

```
activities <- get.activities(token)
stream.list <- get.multiple.streams(activities$id, token)
```

We might want to get the boundaries of the combined set of all streams. We can use these boundaries as a bounding box for plotting the data, so that all activities fall within the plotted map section.

```
all.lats <- unlist(lapply(stream.list, USE.NAMES = F, FUN = function (x) {
  x$coords$lat
}))
all.lons <- unlist(lapply(stream.list, USE.NAMES = F, FUN = function (x) {
  x$coords$lon
}))
lats.range <- range(all.lats)
lons.range <- range(all.lons)
```

Alternatively, you can set your own bounding box. These are the boundaries for Stuttgart, Germany. One suggestion: to find your own boundaries, you can plot your first map and use the locator() function in RStudio; this is a very convenient way of getting coordinates by clicking.

```
# Illustrative Stuttgart bounding box (the exact values from the post were lost)
lons.range <- c(9.0, 9.3)
lats.range <- c(48.7, 48.9)
```

We start by plotting the tracks in red against a black background. You can play around with the alpha value and the lwd parameter to change the appearance. By plotting several tracks over each other, thicker lines represent routes I took more often.

```
# Setting up the plot
par(bg = "black")
plot(x = lons.range, y = lats.range, type = "n", bty = "n",
     xlab = "", ylab = "", xaxt = "n", yaxt = "n")

# Plotting tracks one by one
for (el in stream.list) {
  lines(el$coords$lon, el$coords$lat,
        col = alpha("darkred", .4), lwd = 2)
}
```

A black-and-red plot of my runs through Stuttgart.
Now, this already looks quite nice and we see some kind of network through the city. But the city itself is missing. We need to get some map below the tracks. I am using the OpenStreetMap package for this. I already used it in an earlier post. Note that getting the map tiles from the servers might take a long time and might fail in some cases. Loading time (and the resolution of the final map) will depend heavily on the ‘zoom’ parameter.
```
map <- openmap(c(max(lats.range), min(lons.range)),
               c(min(lats.range), max(lons.range)),
               type = "maptoolkit-topo", zoom = 14)
transmap <- openproj(map, projection = "+proj=longlat")
plot(transmap, raster = T)

for (el in stream.list) {
  lines(el$coords$lon, el$coords$lat,
        col = alpha("darkred", .5), lwd = 3)
}
```
My runs through Stuttgart on an OpenStreetMap, click to enlarge

I also like a simple satellite map. I am using the ‘bing’ map type for this. There is a high-def version available here.
```
map <- openmap(c(max(lats.range), min(lons.range)),
               c(min(lats.range), max(lons.range)),
               type = "bing", zoom = 15)
transmap <- openproj(map, projection = "+proj=longlat")
plot(transmap, raster = T)

for (el in stream.list) {
  lines(el$coords$lon, el$coords$lat,
        col = alpha("yellow", 1/3), lwd = 3)
}
```
My runs through Stuttgart on a Bing satellite map, click to enlarge or click here for an HD version

Now, for the final step. These static maps already look quite nice, and with sufficient resolution (as in the HD case with the satellite map) we can zoom the map without losing too much quality. But a more dynamic map would also be nice. Let's use the wonderful leaflet package for this. I already did this in another post, with only a single track wrapped in a Shiny app. I am using some pipe notation from the dplyr package to adapt the map and get the tracks onto it.
```
# The tile provider below is an assumption; the original provider choice
# was lost in extraction
map <- leaflet() %>%
  addProviderTiles("CartoDB.DarkMatter",
                   options = providerTileOptions(noWrap = T)) %>%
  fitBounds(lng1 = min(lons.range), lat1 = max(lats.range),
            lng2 = max(lons.range), lat2 = min(lats.range))

for (el in stream.list) {
  map <- addPolylines(map, lng = el$coords$lon, lat = el$coords$lat,
                      color = "red", opacity = 1/3, weight = 2)
}
```

With the function saveWidget(), we can save the resulting HTML map. You can move the map to Basel or Zürich to find some more tracks I ran there.

With the leaflet functions, we could even associate each track with a little mouseover text (like total distance or the date). I did not include this here because quite a few tracks have been plotted over each other and mouseover texts might just confuse us here.

Have fun running and plotting.


## Fair communication requires mutual consent

(This article was first published on DanielPocock.com – r-project, and kindly contributed to R-bloggers)

Given the militaristic theme used in my own post, I was also somewhat amused to see news this week of the Strava app leaking locations and layouts of secret US military facilities like Area 51. What a way to mark International Data Privacy Day. Maybe rather than inadvertently misleading people to wonder if I was suggesting that Gmail users don’t make their beds, I should have emphasized that Admiral McRaven’s boot camp regime for Navy SEALS needs to incorporate some of my suggestions about data privacy?

A highlight of Agarwal's blog is his comment, "I usually wait for a day or more when I feel myself getting inflamed/heated", and I wish this had occurred in some of the other places where my ideas were discussed. Even though my ideas are sometimes provocative, I would kindly ask people to keep point 2 of the Debian Code of Conduct in mind: "Assume good faith".

One thing that became clear to me after reading Agarwal's blog is that some people saw my example one-line change to Postfix's configuration as a suggestion that people need to run their own mail server. In fact, I had seen such comments before, but I hadn't realized why people were reaching the conclusion that I expect everybody to run a mail server. The purpose of that line was simply to emphasize the content of the proposed bounce message, to help people understand that the receiver of an email may never have agreed to Google's non-privacy policy; but if you do use Gmail, you impose that surveillance regime on them, and not just on yourself, whenever you send them a message from a Gmail account.

Communication requires mutual agreement about the medium. Think about it another way: if you go to a meeting with your doctor and some stranger in a US military uniform is in the room, you might choose to leave and find another doctor rather than communicate under surveillance.

As it turns out, many people are using alternative email services, even if they only want a web interface. There is already a feature request discussion in ProtonMail about letting users choose to opt-out of receiving messages monitored by Google and send back the bounce message suggested in my blog. Would you like to have that choice, even if you didn’t use it immediately? You can vote for that issue or leave your own feedback comments in there too.


## Create your Machine Learning library from scratch with R ! (1/3)

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

When dealing with Machine Learning problems in R, most of the time you rely on already existing libraries. This speeds up the analysis process, but do you really understand what is behind the algorithms? Could you implement a logistic regression from scratch in R?
The goal of this post is to create our own basic machine learning library from scratch with R. We will only use the linear algebra tools available in R. There will be three posts:

1. Linear and logistic regression (this one)
2. PCA and k-nearest neighbors classifiers and regressors
3. Tree-based methods and SVM

## Linear Regression (Least-Square)

The goal of linear regression is to estimate a continuous variable $y$ given a matrix of observations $X$. Before dealing with the code, we need to derive the solution of the linear regression.

### Solution derivation of linear regression

Given a matrix of observations $X$ and the target $y$, the goal of linear regression is to minimize the norm between $y$ and a linear estimate of $y$: $X\beta$. Hence, linear regression can be rewritten as an optimization problem: $\beta^* = \arg\min_\beta \lVert y - X\beta \rVert^2$. A closed-form solution can easily be derived, and the optimal $\beta$ is $\beta^* = (X^T X)^{-1} X^T y$.
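The intermediate steps of that derivation (standard least-squares algebra, using the same symbols as above) go as follows:

```
\lVert y - X\beta \rVert^2 = (y - X\beta)^T (y - X\beta)
% Setting the gradient with respect to beta to zero:
\nabla_\beta \, \lVert y - X\beta \rVert^2 = -2 X^T y + 2 X^T X \beta = 0
% which gives the normal equations and the closed-form solution:
X^T X \beta = X^T y \quad\Longrightarrow\quad \beta^* = (X^T X)^{-1} X^T y
```

The inversion of $X^T X$ is exactly what the solve() call performs in the code below.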

### Linear regression in R

Using the closed-form solution, we can easily code the linear regression. Our linear model object will have three methods, an init method where the model is fitted, a predict method to work with new data and a plot method to visualize the residuals’ distribution.

```
###Linear model
# Note: the original body of this function was lost in extraction; the
# code below is reconstructed from the description that follows.
fit_lm <- function (x, y, intercept = T) {
  if (!is.matrix(x)) x <- as.matrix(x)
  if (!is.matrix(y)) y <- as.matrix(y)
  if (intercept) x <- cbind(x, 1)
  my_lm <- list()
  ##Closed-form solution
  my_lm[['coeffs']] <- solve(t(x) %*% x) %*% t(x) %*% y
  my_lm[['estimates']] <- x %*% my_lm[['coeffs']]
  my_lm[['residuals']] <- y - my_lm[['estimates']]
  class(my_lm) <- 'my_lm'
  my_lm
}
```

The fit function is simple: the first few lines transform the data to matrices and add an intercept if required. Then the 'my_lm' object is created and the coefficients are computed. The solve() function is used to invert the matrix and %*% denotes matrix multiplication. At the end, the residuals and the estimates are computed and the class of the object is set to 'my_lm'.

Now let's implement the predict and plot methods for the my_lm class:
```
# Reconstructed from the description; the original code may differ in details
predict.my_lm <- function (object, x, intercept = T) {
  if (!is.matrix(x)) x <- as.matrix(x)
  if (intercept) x <- cbind(x, 1)
  x %*% object[['coeffs']]
}

plot.my_lm <- function (x, ...) {
  hist(x[['residuals']], main = "Residuals distribution", xlab = "Residuals")
}
```

You can test the code on a preinstalled R dataset, such as cars. The code will give you the same coefficient estimates as the lm function. For instance, on the cars dataset:
```
my_lm1 = fit_lm(cars[, 1], cars[, 2])
vanilla_lm = lm(dist ~ speed, cars)
print(vanilla_lm[['coefficients']])
print(my_lm1[['coeffs']])
```

## Logistic regression

Previously, we worked on regression and the estimation of a continuous variable. Now, with logistic regression, we try to estimate a binary outcome (for instance, ill vs healthy, pass vs fail, …). Again, let's deal with the maths first:

### The mathematics of logistic regression

The goal is to estimate a binary outcome $y$ given the observations $X$. We assume that $y$ follows a Bernoulli distribution of parameter $\sigma(x\theta)$, where $\sigma(t) = 1/(1+e^{-t})$ is called the sigmoid function.
Hence, we have $P(y=1|x) = \sigma(x\theta)$ and $P(y=0|x) = 1 - \sigma(x\theta)$.
We want to maximize the log-likelihood of the observed sample (over $\theta$, and hence over the Bernoulli parameters):

$\ell(\theta) = \sum_i y_i \log \sigma(x_i\theta) + (1-y_i) \log\big(1-\sigma(x_i\theta)\big)$

This maximization will be done using Newton's method. Newton's method is a variant of gradient descent in which we use the curvature of the function to increase the speed of convergence. If you are not familiar with Newton's method, you can just see it as a variant of batch gradient descent. The weight update has the following form:

$\theta \leftarrow \theta + (X^T W X)^{-1} X^T (y - p)$

with $p_i = \sigma(x_i\theta)$, where $X^T W X$ comes from the Hessian of the log-likelihood and $W = \mathrm{diag}\big(p_i(1-p_i)\big)$.

The algorithm in R will update the weights using this update until the termination criterion is met. Here, the termination criterion is met when the mean square error is below the user-defined tolerance.

### Logistic regression in R

```
###Sigmoid function
sigmoid=function(x) {1/(1+exp(-x))}

###Fit logistic regression
fit_logit=function(x,y,intercept=T,tol=10e-5,max_it=100)
{
##Type conversion
if (!is.matrix(x))
{
x=as.matrix(x)
}
if (!is.matrix(y))
{
y=as.matrix(y)
}
if (intercept)
{
x=cbind(x,1)
}
##Algorithm initialization
iterations=0
converged=F
##Weights are initialized to 1
coeffs=matrix(1,dim(x)[2])

##Updates the weights until the max number of iterations
##or the termination criterion is met
##(the loop body was lost in extraction and is reconstructed
##from the description above)
while (iterations<max_it & !converged)
{
iterations=iterations+1
p=sigmoid(x%*%coeffs)
W=diag(as.vector(p*(1-p)))
##Newton's update: theta <- theta + (X'WX)^-1 X'(y-p)
update=solve(t(x)%*%W%*%x)%*%t(x)%*%(y-p)
coeffs=coeffs+update
##Stop when the mean squared update is below the tolerance
converged=mean(update^2)<tol
}
my_logit=list(coeffs=coeffs,fitted=sigmoid(x%*%coeffs)>0.5)
class(my_logit)='my_logit'
my_logit
}

###Predict: probabilities above 0.5 are classified as TRUE
predict.my_logit=function(object,x,intercept=T)
{
if (!is.matrix(x)) x=as.matrix(x)
if (intercept) x=cbind(x,1)
sigmoid(x%*%object[['coeffs']])>0.5
}
```

The code is split into two parts:

• The fit part, where the logistic model is fitted using Newton's method.
This part has three main components. First, the data is put in the proper matrix format and, if required, the intercept is added. Then the algorithm parameters are initialised (all the weights are set to one).
Finally, the algorithm updates the weights until the MSE goes below the tolerance. The most important line is probably the weight update, where the update formula is applied to the weights of the model.
• The predict method, where the outcome is estimated using the weights computed previously.

We can now use our logistic regression to predict the class of a flower from the iris dataset:

```
fit_logit(iris[, 1:4], iris[, 5] == 'setosa')
```

As expected, the algorithm can predict efficiently if a flower is a setosa or not.

If you like this post, follow us to learn how to create your Machine Learning library from scratch with R!

The post Create your Machine Learning library from scratch with R ! (1/3) appeared first on Enhance Data Science.


## Deep Learning from first principles in Python, R and Octave – Part 3

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

“Once upon a time, I, Chuang Tzu, dreamt I was a butterfly, fluttering hither and thither, to all intents and purposes a butterfly. I was conscious only of following my fancies as a butterfly, and was unconscious of my individuality as a man. Suddenly, I awoke, and there I lay, myself again. Now I do not know whether I was then a man dreaming I was a butterfly, or whether I am now a butterfly dreaming that I am a man.”
from The Brain: The Story of you – David Eagleman

“Thought is a great big vector of neural activity”
Prof Geoffrey Hinton

# Introduction

This is the third part in my series on Deep Learning from first principles in Python, R and Octave. In the first part, Deep Learning from first principles in Python, R and Octave-Part 1, I implemented logistic regression as a 2-layer neural network. The second part, Deep Learning from first principles in Python, R and Octave-Part 2, dealt with the implementation of 3-layer neural networks with 1 hidden layer to perform classification tasks where the 2 classes cannot be separated by a linear boundary. In this third part, I implement a multi-layer Deep Learning (DL) network of arbitrary depth (any number of hidden layers) and arbitrary height (any number of activation units in each hidden layer). The implementations of these Deep Learning networks, in all 3 parts, are based on vectorized versions in Python, R and Octave. The implementation in this third part is for an L-layer deep network, but without any regularization, early stopping, momentum or learning rate adaptation techniques. Even so, the barebones multi-layer DL network is a handful and has enough hyperparameters to fine-tune and adjust.

The implementation of the vectorized L-layer Deep Learning network in Python, R and Octave was both exhausting and exacting! Keeping track of the indices, layer numbers and matrix dimensions required quite a bit of focus. While the implementation was demanding, it was also very exciting to get the code to work. The trick was to be able to shift gears between the slight quirks of each language. Here are some of the challenges I faced:

1. Python and Octave allow multiple return values to be unpacked in a single statement. With R, the returned list has to be unpacked into separate variables. I did see that there is a package, gsubfn, which does this. I hope this feature becomes a base R feature.
2. Python and R allow dissimilar elements to be saved and returned from functions, using dictionaries and lists respectively. However there is no real equivalent in Octave. The closest I got to this functionality in Octave was the ‘cell array’. But a cell array can be accessed only by index, and not by key as in a Python dictionary or R list. This makes things just a bit more difficult in Octave.
3. Python and Octave include implicit broadcasting. In R, broadcasting is not implicit, but R has a nifty function, sweep(), with which we can broadcast either by columns or by rows.
4. The closest equivalent of Python’s dictionary, or R’s list, in Octave is the cell array. However I had to manage separate cell arrays for the weights and biases, and during gradient descent separate cell arrays for the gradients dW and dB.
5. In Python, the rank-1 numpy arrays can be annoying at times. This issue is not present in R and Octave.
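Point 3 above can be made concrete with a small numpy sketch of my own (not code from the Github repository): the bias vector b is broadcast implicitly across the columns of the matrix product, which is exactly what R's sweep() does explicitly.

```python
import numpy as np

# Z = W.A + b : b is implicitly broadcast across the columns of W.A
W = np.ones((3, 2))                   # weights: 3 units, 2 input features
A = np.ones((2, 4))                   # activations: 2 features x 4 samples
b = np.array([[1.0], [2.0], [3.0]])   # one bias per unit, shape (3, 1)

Z = np.dot(W, A) + b                  # numpy stretches b to shape (3, 4)
print(Z.shape)                        # (3, 4)
print(Z[:, 0])                        # [3. 4. 5.]
```

In R the equivalent step would be `sweep(W %*% A, 1, b, '+')`, broadcasting b by rows.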

Though the Deep Learning functions in Python, R and Octave each run to about 350 lines of code, they have been some of the most difficult code I have implemented. The current vectorized implementation supports the relu, sigmoid and tanh activation functions. I will be adding other activation functions, like the ‘leaky relu’ and ‘softmax’, to the implementation in the weeks to come.

While testing with different hyper-parameters, namely i) the number of hidden layers, ii) the number of activation units in each layer, iii) the activation function and iv) the number of iterations, I found the L-layer Deep Learning Network to be very sensitive to these hyper-parameters, and they are not easy to tune. Adding more hidden layers, or more units per layer, does not necessarily help and mostly results in gradient descent getting stuck in some local minimum. It takes a fair amount of trial and error, and very close observation of how the DL network performs as the hyper-parameters change, to zero in on the optimal solution. Feel free to download/fork my code from Github DeepLearning-Part 3 and play around with the hyper-parameters for your own problems.

#### Derivation of a Multi Layer Deep Learning Network

Let’s take a simple Neural Network with 2 hidden layers and an output layer (3 layers of weights in all)

In the forward propagation cycle the equations are

$Z_{1} = W_{1}A_{0} +b_{1}$ and $A_{1} = g(Z_{1})$
$Z_{2} = W_{2}A_{1} +b_{2}$ and $A_{2} = g(Z_{2})$
$Z_{3} = W_{3}A_{2} +b_{3}$ and $A_{3} = g(Z_{3})$

The loss function is given by
$L = -(Y \log A_{3} + (1-Y) \log(1-A_{3}))$
and $\partial L/\partial A_{3} = -(Y/A_{3} - (1-Y)/(1-A_{3}))$
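As a quick numerical sanity check of this loss (a sketch of my own, not code from the post): the loss is small when the prediction agrees with the label y, and large when it does not.

```python
import numpy as np

def cross_entropy(y, a3):
    """L = -(y*log(a3) + (1-y)*log(1-a3)) for a single prediction a3 in (0,1)."""
    return -(y * np.log(a3) + (1 - y) * np.log(1 - a3))

print(cross_entropy(1, 0.9))   # ~0.105 : confident and correct, small loss
print(cross_entropy(1, 0.1))   # ~2.303 : confident and wrong, large loss
```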

For a binary classification the output activation function is the sigmoid function given by
$A_{3} = 1/(1+ e^{-Z_{3}})$. It can be shown that
$\partial A_{3}/\partial Z_{3} = A_{3}(1-A_{3})$ – see equation 2 in Part 1

$\partial L/\partial Z_{3} = \partial L/\partial A_{3} \cdot \partial A_{3}/\partial Z_{3} = A_{3}-Y$ – see equation (f) in Part 1
and since
$\partial L/\partial A_{2} = \partial L/\partial Z_{3} \cdot \partial Z_{3}/\partial A_{2} = (A_{3}-Y) \cdot W_{3}$ because $\partial Z_{3}/\partial A_{2} = W_{3}$ – (1a)
and $\partial L/\partial Z_{2} = \partial L/\partial A_{2} \cdot \partial A_{2}/\partial Z_{2} = (A_{3}-Y) \cdot W_{3} \cdot g'(Z_{2})$ – (1b)
$\partial L/\partial W_{2} = \partial L/\partial Z_{2} \cdot A_{1}$ – (1c)
since $\partial Z_{2}/\partial W_{2} = A_{1}$
and
$\partial L/\partial b_{2} = \partial L/\partial Z_{2}$ – (1d)
because
$\partial Z_{2}/\partial b_{2} = 1$

Also

$\partial L/\partial A_{1} = \partial L/\partial Z_{2} \cdot \partial Z_{2}/\partial A_{1} = \partial L/\partial Z_{2} \cdot W_{2}$ – (2a)
$\partial L/\partial Z_{1} = \partial L/\partial A_{1} \cdot \partial A_{1}/\partial Z_{1} = \partial L/\partial A_{1} \cdot g'(Z_{1})$ – (2b)
$\partial L/\partial W_{1} = \partial L/\partial Z_{1} \cdot A_{0}$ – (2c)
$\partial L/\partial b_{1} = \partial L/\partial Z_{1}$ – (2d)

Inspecting the above equations (1a – 1d & 2a-2d), our ‘Uber Deep (bottomless)’ brain can easily discern the pattern in these equations. The equation for any layer ‘l’ is of the form
$Z_{l} = W_{l}A_{l-1} +b_{l}$ and $A_{l} = g(Z_{l})$
The equations for backward propagation have the general form
$\partial L/\partial A_{l} = \partial L/\partial Z_{l+1} \cdot W_{l+1}$
$\partial L/\partial Z_{l} = \partial L/\partial A_{l} \cdot g'(Z_{l})$
$\partial L/\partial W_{l} = \partial L/\partial Z_{l} \cdot A_{l-1}$
$\partial L/\partial b_{l} = \partial L/\partial Z_{l}$
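These general equations translate almost line for line into code. Below is a minimal, hypothetical numpy sketch of one backward step for a layer l; the function and variable names are my own and not those of the Github implementation (the division by m averages the gradients over the batch).

```python
import numpy as np

def layer_backward(dA, Z, A_prev, W, g_prime):
    """One backward step for layer l, following the general equations:
    dZ_l = dA_l * g'(Z_l);  dW_l = dZ_l . A_{l-1}^T / m;
    db_l = row-sums of dZ_l / m;  dA_{l-1} = W_l^T . dZ_l"""
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                        # dL/dZ_l
    dW = np.dot(dZ, A_prev.T) / m               # dL/dW_l
    db = np.sum(dZ, axis=1, keepdims=True) / m  # dL/db_l
    dA_prev = np.dot(W.T, dZ)                   # dL/dA_{l-1}
    return dA_prev, dW, db

# Tiny numeric check with a linear activation g(z)=z, so g'(z)=1
dA = np.array([[1.0, 1.0]])
Z = np.array([[0.5, -0.5]])
A_prev = np.array([[2.0, 4.0]])
W = np.array([[3.0]])
dA_prev, dW, db = layer_backward(dA, Z, A_prev, W, lambda z: np.ones_like(z))
print(dW, db)   # dW = [[3.]], db = [[1.]]
```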

Some other important results are the derivatives of the activation functions used in the implemented Deep Learning network:
g(z) = sigmoid(z) = $1/(1+e^{-z})$ = a, with g'(z) = a(1-a) – see Part 1
g(z) = tanh(z) = a, with g'(z) = $1 - a^{2}$
g(z) = relu(z) = z when z > 0 and 0 when z ≤ 0, with g'(z) = 1 when z > 0 and 0 when z ≤ 0
While there is a discontinuity in the derivative at 0, assigning it the value 0 at that point does not present a problem in practice.
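As a quick check of these derivatives, here is a small numpy sketch of my own (the helper names are mine, not the post's):

```python
import numpy as np

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def dsigmoid(z): a = sigmoid(z); return a * (1 - a)   # a(1-a)
def dtanh(z):    a = np.tanh(z); return 1 - a**2      # 1 - a^2
def drelu(z):    return (z > 0).astype(float)         # 1 if z > 0, else 0

print(dsigmoid(0.0))                        # 0.25, the maximum of sigmoid's derivative
print(dtanh(0.0))                           # 1.0
print(drelu(np.array([-1.0, 0.0, 2.0])))    # [0. 0. 1.]
```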

The implementation of the multi layer vectorized Deep Learning Network for Python, R and Octave is included below. For all these implementations, I initially create the size and configuration of the Deep Learning network with a layer-dimensions vector ‘V’ of length L, indicating ‘L’ layers, where

V (in Python) = $[v_{0}, v_{1}, v_{2}, \ldots, v_{L-1}]$
V (in R) = $c(v_{1}, v_{2}, v_{3}, \ldots, v_{L})$
V (in Octave) = $[v_{1}\; v_{2}\; v_{3} \ldots v_{L}]$

In all of these implementations the first element is the number of input features to the Deep Learning network, and the last element is always a single output unit with a ‘sigmoid’ activation, since all the problems deal with binary classification.

The number of elements between the first and the last element are the number of hidden layers and the magnitude of each $v_{i}$ is the number of activation units in each hidden layer, which is specified while actually executing the Deep Learning network using the function L_Layer_DeepModel(), in all the implementations Python, R and Octave
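As an illustration, here is a hypothetical numpy sketch of how such a layer-dimensions vector determines the shapes of the weight and bias matrices; the actual initialization in the Github code may differ:

```python
import numpy as np

def initialize_parameters(layer_dims, seed=2):
    """W_l has shape (units in layer l, units in layer l-1);
    b_l has shape (units in layer l, 1)."""
    np.random.seed(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

# 2 input features, two hidden layers of 9 units, 1 sigmoid output unit
params = initialize_parameters([2, 9, 9, 1])
print(params['W1'].shape, params['W3'].shape)  # (9, 2) (1, 9)
```

The forward and backward passes then simply loop l from 1 to L-1 over these parameters.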

## 1a. Classification with Multi layer Deep Learning Network – Relu activation(Python)

In the code below a 4 layer Neural Network is trained to generate a non-linear boundary between the classes. In the code below the ‘Relu’ Activation function is used. The number of activation units in each layer is 9. The cost vs iterations is plotted in addition to the decision boundary. Further the accuracy, precision, recall and F1 score are also computed

``````import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
import sklearn
import sklearn.datasets

#from DLfunctions import plot_decision_boundary
os.chdir("C:\\software\\DeepLearning-Posts\\part3")
execfile("./DLfunctions34.py") # Python 2; in Python 3 use exec(open("./DLfunctions34.py").read())

# Create clusters of 2 classes
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,
cluster_std = 1.3, random_state = 4)
#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T
# Set the dimensions of DL Network
#  Below we have
#  2 - 2 input features
#  9,9 - 2 hidden layers with 9 activation units per layer and
#  1 - 1 sigmoid activation unit in the output layer as this is a binary classification
# The activation in the hidden layer is the 'relu' specified in L_Layer_DeepModel

layersDimensions = [2, 9, 9,1] #  4-layer model
parameters = L_Layer_DeepModel(X2, Y2, layersDimensions,hiddenActivationFunc='relu', learning_rate = 0.3,num_iterations = 2500, fig="fig1.png")
#Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), X2,Y2,str(0.3),"fig2.png")

# Compute the confusion matrix
yhat = predict(parameters,X2)
from sklearn.metrics import confusion_matrix
a=confusion_matrix(Y2.T,yhat.T)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: {:.2f}'.format(accuracy_score(Y2.T, yhat.T)))
print('Precision: {:.2f}'.format(precision_score(Y2.T, yhat.T)))
print('Recall: {:.2f}'.format(recall_score(Y2.T, yhat.T)))
print('F1: {:.2f}'.format(f1_score(Y2.T, yhat.T)))``````
``````## Accuracy: 0.90
## Precision: 0.91
## Recall: 0.87
## F1: 0.89``````

For more details on metrics like Accuracy, Recall and Precision used in classification, take a look at my post Practical Machine Learning with R and Python – Part 2. More details about these and other metrics, besides implementations of the most common machine learning algorithms, are available in my book ‘Practical Machine Learning with R and Python’ on Amazon.

## 1b. Classification with Multi layer Deep Learning Network – Relu activation(R)

In the code below, binary classification is performed on the same data set as above using the Relu activation function. The DL network is the same as above.

``````library(ggplot2)
source("DLfunctions33.R")
# Read the data saved from the Python example above (assumed saved as data.csv,
# as in the Octave version): features in columns 1 and 2, class label in column 3
z <- as.matrix(read.csv("data.csv", header = FALSE))
x <- z[, 1:2]
y <- z[, 3]
X1 <- t(x)
Y1 <- t(y)

# Set the dimensions of the Deep Learning network
# No of input features = 2, 2 hidden layers with 9 activation units and 1 output layer
layersDimensions <- c(2, 9, 9, 1)
# Execute the Deep Learning Neural Network
retvals <- L_Layer_DeepModel(X1, Y1, layersDimensions,
                             hiddenActivationFunc = 'relu',
                             learningRate = 0.3,
                             numIterations = 5000,
                             print_cost = TRUE)``````
``````library(ggplot2)
source("DLfunctions33.R")
# Get the computed costs
costs <- retvals[['costs']]
# Create a sequence of iterations
numIterations <- 5000
iterations <- seq(0, numIterations, by = 1000)
df <- data.frame(iterations, costs)
# Plot the Costs vs number of iterations
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
  xlab('No of iterations') + ylab('Cost') + ggtitle("Cost vs No of iterations")``````

``````# Plot the decision boundary
plotDecisionBoundary(z,retvals,hiddenActivationFunc="relu",0.3)``````

``library(caret)``
``````# Predict the output for the data values
yhat <- predict(retvals$parameters, X1, hiddenActivationFunc = "relu")
yhat[yhat == FALSE] <- 0
yhat[yhat == TRUE] <- 1
# Compute the confusion matrix
confusionMatrix(yhat, Y1)``````
``````## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 201  10
##          1  21 168
##
##                Accuracy : 0.9225
##                  95% CI : (0.8918, 0.9467)
##     No Information Rate : 0.555
##     P-Value [Acc > NIR] : ``````

## 1c. Classification with Multi layer Deep Learning Network – Relu activation(Octave)

Included below is the code for performing classification. Incidentally, Octave does not seem to have implemented the confusion matrix, though confusionmat is available in Matlab.

```
# Read the data
data=csvread("data.csv");
X=data(:,1:2);
Y=data(:,3);
# Set layer dimensions
layersDimensions = [2 9 7 1] #tanh=-0.5(ok), #relu=0.1 best!
# Execute Deep Network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
                       hiddenActivationFunc='relu',
                       learningRate = 0.1,
                       numIterations = 10000);
plotCostVsIterations(10000,costs);
plotDecisionBoundary(data,weights, biases,hiddenActivationFunc="relu")
```

## 2a. Classification with Multi layer Deep Learning Network – Tanh activation(Python)

Below the Tanh activation function is used to perform the same classification. I found the Tanh activation required a simpler Neural Network of 3 layers.

``````# Tanh activation
import os
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
import sklearn
import sklearn.datasets

#from DLfunctions import plot_decision_boundary
os.chdir("C:\\software\\DeepLearning-Posts\\part3")
execfile("./DLfunctions34.py")
# Create the dataset
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,
cluster_std = 1.3, random_state = 4)
#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T
# Set the dimensions of the Neural Network
layersDimensions = [2, 4, 1] #  3-layer model
# Compute the DL network
parameters = L_Layer_DeepModel(X2, Y2, layersDimensions, hiddenActivationFunc='tanh', learning_rate = .5,num_iterations = 2500,fig="fig3.png")
#Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), X2,Y2,str(0.5),"fig4.png")
``````

## 2b. Classification with Multi layer Deep Learning Network – Tanh activation(R)

R performs better with a Tanh activation than the Relu as can be seen below

`````` #Set the dimensions of the Neural Network
layersDimensions <- c(2, 9, 9, 1)
library(ggplot2)
# Read the data saved from the Python example (assumed saved as data.csv)
z <- as.matrix(read.csv("data.csv", header = FALSE))
x <- z[, 1:2]
y <- z[, 3]
X1 <- t(x)
Y1 <- t(y)
# Execute the Deep Model
retvals <- L_Layer_DeepModel(X1, Y1, layersDimensions,
                             hiddenActivationFunc = 'tanh',
                             learningRate = 0.3,
                             numIterations = 5000,
                             print_cost = TRUE)``````
``````# Get the costs
costs <- retvals[['costs']]
numIterations <- 5000
iterations <- seq(0, numIterations, by = 1000)
df <- data.frame(iterations, costs)
# Plot Cost vs number of iterations
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
  xlab('No of iterations') + ylab('Cost') + ggtitle("Cost vs No of iterations")``````

``````#Plot the decision boundary
plotDecisionBoundary(z,retvals,hiddenActivationFunc="tanh",0.3)``````

## 2c. Classification with Multi layer Deep Learning Network – Tanh activation(Octave)

The code below uses the Tanh activation in the hidden layers for Octave.

```
# Read the data
data=csvread("data.csv");
X=data(:,1:2);
Y=data(:,3);
# Set layer dimensions
layersDimensions = [2 9 7 1] #tanh=-0.5(ok), #relu=0.1 best!
# Execute Deep Network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
                       hiddenActivationFunc='tanh',
                       learningRate = 0.1,
                       numIterations = 10000);
plotCostVsIterations(10000,costs);
plotDecisionBoundary(data,weights, biases,hiddenActivationFunc="tanh")
```

## 3. Bernoulli’s Lemniscate

To make things more interesting, I create a 2D figure of Bernoulli’s lemniscate to perform non-linear classification. The lemniscate is given by the equation
$(x^{2} + y^{2})^{2} = 2a^{2}(x^{2}-y^{2})$
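A point (x, y) lies inside the lemniscate when $(x^{2}+y^{2})^{2} - 2a^{2}(x^{2}-y^{2}) < 0$. Here is a small Python sketch of my own of this membership test, which is essentially what the classification code below does after centring the points:

```python
def inside_lemniscate(x, y, a):
    """True when (x^2 + y^2)^2 < 2*a^2*(x^2 - y^2), i.e. the point is inside the curve."""
    return (x**2 + y**2)**2 - 2 * a**2 * (x**2 - y**2) < 0

# A point on the x-axis inside a lobe, and a point on the y-axis (always outside)
print(inside_lemniscate(1.0, 0.0, 2.0))   # True:  1 - 8 < 0
print(inside_lemniscate(0.0, 1.0, 2.0))   # False: 1 + 8 > 0
```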

## 3a. Classifying a lemniscate with Deep Learning Network – Relu activation(Python)

``````import os
import numpy as np
import matplotlib.pyplot as plt
os.chdir("C:\\software\\DeepLearning-Posts\\part3")
execfile("./DLfunctions33.py")
x1=np.random.uniform(0,10,2000).reshape(2000,1)
x2=np.random.uniform(0,10,2000).reshape(2000,1)

X=np.append(x1,x2,axis=1)
X.shape

# Create the lemniscate: select the points for which
# (x^2 + y^2)^2 - 2a^2*(x^2 - y^2) < 0, i.e. the points inside the curve
# (the points are first centred on (5,5))
a=np.power(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2),2)
b=np.power(X[:,0]-5,2) - np.power(X[:,1]-5,2)
c= (a - (b*np.power(4,2))) < 0
Y=c.reshape(2000,1)
# Create a scatter plot of the lemniscate
plt.scatter(X[:,0], X[:,1], c=Y, marker= 'o', s=15,cmap="viridis")
Z=np.append(X,Y,axis=1)
plt.savefig("fig50.png",bbox_inches='tight')
plt.clf()

# Set the data for classification
X2=X.T
Y2=Y.T
# These settings work the best
# Set the Deep Learning layer dimensions for a Relu activation
layersDimensions = [2,7,4,1]
#Execute the DL network
parameters = L_Layer_DeepModel(X2, Y2, layersDimensions, hiddenActivationFunc='relu', learning_rate = 0.5,num_iterations = 10000, fig="fig5.png")
#Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(2.2),"fig6.png")

# Compute the Confusion matrix
yhat = predict(parameters,X2)
from sklearn.metrics import confusion_matrix
a=confusion_matrix(Y2.T,yhat.T)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: {:.2f}'.format(accuracy_score(Y2.T, yhat.T)))
print('Precision: {:.2f}'.format(precision_score(Y2.T, yhat.T)))
print('Recall: {:.2f}'.format(recall_score(Y2.T, yhat.T)))
print('F1: {:.2f}'.format(f1_score(Y2.T, yhat.T)))``````
``````## Accuracy: 0.93
## Precision: 0.77
## Recall: 0.76
## F1: 0.76``````

We could get better performance by tuning further. Do play around if you fork the code.
Note: The lemniscate data is saved as a CSV file and then read in R and in Octave, instead of recreating the lemniscate shape in each language.

## 3b. Classifying a lemniscate with Deep Learning Network – Tanh activation(R code)

The R decision boundary for the Bernoulli’s lemniscate is shown below

``````Z <- as.matrix(read.csv("lemniscate.csv", header = FALSE))
Z1 <- data.frame(Z)
# Create a scatter plot of the lemniscate
ggplot(Z1,aes(x=V1,y=V2,col=V3)) + geom_point()``````
``````# Set the data for the DL network
X <- Z[, 1:2]
Y <- Z[, 3]

X1 <- t(X)
Y1 <- t(Y)

# Set the layer dimensions for the tanh activation function
layersDimensions <- c(2, 5, 4, 1)
# Execute the Deep Learning network with Tanh activation
retvals <- L_Layer_DeepModel(X1, Y1, layersDimensions,
                             hiddenActivationFunc = 'tanh',
                             learningRate = 0.3,
                             numIterations = 20000, print_cost = TRUE)``````
``````# Plot cost vs iteration
costs <- retvals[['costs']]
numIterations <- 20000
iterations <- seq(0, numIterations, by = 1000)
df <- data.frame(iterations, costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
  xlab('No of iterations') + ylab('Cost') + ggtitle("Cost vs No of iterations")``````

``````#Plot the decision boundary
plotDecisionBoundary(Z,retvals,hiddenActivationFunc="tanh",0.3)``````

## 3c. Classifying a lemniscate with Deep Learning Network – Relu activation(Octave code)

Octave is used to generate the non-linear lemniscate boundary.
```
# Read the data
data=csvread("lemniscate.csv");
X=data(:,1:2);
Y=data(:,3);
# Set the dimensions of the layers
layersDimensions = [2 9 7 1]
# Compute the DL network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
                       hiddenActivationFunc='relu',
                       learningRate = 0.20,
                       numIterations = 10000);
plotCostVsIterations(10000,costs);
plotDecisionBoundary(data,weights, biases,hiddenActivationFunc="relu")
```

## 4a. Binary Classification using MNIST – Python code

Finally I perform a simple classification using the MNIST handwritten digits, which according to Prof Geoffrey Hinton is “the Drosophila of Deep Learning”.

The Python code for reading the MNIST data is taken from Alex Kesling’s github link MNIST.

In the Python code below, I perform a simple binary classification between the handwritten digit ‘5’ and ‘not 5’, which is all other digits. I will perform the proper classification of all digits using the Softmax classifier some time later.

``````import os
import numpy as np
import matplotlib.pyplot as plt
os.chdir("C:\\software\\DeepLearning-Posts\\part3")
execfile("./DLfunctions34.py")
# `training` and `test` are assumed to have been loaded beforehand with the
# MNIST reader from Alex Kesling's github link referenced above
lbls=[]
pxls=[]
print(len(training))

# Select the first 10000 training data and the labels
for i in range(10000):
    l,p=training[i]
    lbls.append(l)
    pxls.append(p)
labels= np.array(lbls)
pixels=np.array(pxls)

#  Set y=1 when labels == 5 and 0 otherwise
y=(labels==5).reshape(-1,1)
X=pixels.reshape(pixels.shape[0],-1)

# Create the necessary feature and target variable
X1=X.T
Y1=y.T

# Create the layer dimensions. The number of features are 28 x 28 = 784 since the 28 x 28
# pixels is flattened to single vector of length 784.
layersDimensions=[784, 15,9,7,1] # Works very well
parameters = L_Layer_DeepModel(X1, Y1, layersDimensions, hiddenActivationFunc='relu', learning_rate = 0.1,num_iterations = 1000, fig="fig7.png")

# Test data
lbls1=[]
pxls1=[]
for i in range(800):
    l,p=test[i]
    lbls1.append(l)
    pxls1.append(p)

testLabels=np.array(lbls1)
testData=np.array(pxls1)

ytest=(testLabels==5).reshape(-1,1)
Xtest=testData.reshape(testData.shape[0],-1)
Xtest1=Xtest.T
Ytest1=ytest.T

yhat = predict(parameters,Xtest1)
from sklearn.metrics import confusion_matrix
a=confusion_matrix(Ytest1.T,yhat.T)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: {:.2f}'.format(accuracy_score(Ytest1.T, yhat.T)))
print('Precision: {:.2f}'.format(precision_score(Ytest1.T, yhat.T)))
print('Recall: {:.2f}'.format(recall_score(Ytest1.T, yhat.T)))
print('F1: {:.2f}'.format(f1_score(Ytest1.T, yhat.T)))

probs=predict_proba(parameters,Xtest1)
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(Ytest1.T, probs.T)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.savefig("fig8.png",bbox_inches='tight')
``````
``````
## Accuracy: 0.99
## Precision: 0.96
## Recall: 0.89
## F1: 0.92``````

In addition to plotting the Cost vs Iterations, I also plot the Precision-Recall curve, to show how Precision and Recall, which are complementary to each other, vary with respect to one another. To know more about Precision-Recall, please check my post Practical Machine Learning with R and Python – Part 4. You could also check out my book ‘Practical Machine Learning with R and Python’ on Amazon for details on the key metrics and algorithms for classification and regression problems. A physical copy of the book is much better than scrolling down a webpage. Personally, I tend to use my own book quite frequently to refer to R and Python constructs, subsetting, machine learning function calls, the necessary parameters etc. It is useless to commit any of this to memory, and a physical copy of a book is much easier to thumb through for the relevant code snippet. Pick up your copy today!

## 4b. Binary Classification using MNIST – R code

In the R code below the same binary classification of the digit ‘5’ and ‘not 5’ is performed. The code to read and display the MNIST data is taken from Brendan O’Connor’s github link MNIST.

``````source("mnist.R")
#show_digit(train$x[2,])
layersDimensions <- c(784, 7, 7, 3, 1) # Works at 1500
x <- t(train$x)
# Choose only the first 5000 training data
x2 <- x[, 1:5000]
y <- train$y
# Set labels for all digits that are 'not 5' to 0
y[y != 5] <- 0
# Set labels of digit 5 as 1
y[y == 5] <- 1
# Set the data
y1 <- as.matrix(y)
y2 <- t(y1)
# Choose the 1st 5000 labels
y3 <- y2[, 1:5000]

# Execute the Deep Learning Model
retvals <- L_Layer_DeepModel(x2, y3, layersDimensions,
                             hiddenActivationFunc = 'tanh',
                             learningRate = 0.3,
                             numIterations = 3000, print_cost = TRUE)``````
``````# Plot cost vs iteration
costs <- retvals[['costs']]
numIterations <- 3000
iterations <- seq(0, numIterations, by = 1000)
df <- data.frame(iterations, costs)
ggplot(df,aes(x=iterations,y=costs)) + geom_point() + geom_line(color="blue") +
  xlab('No of iterations') + ylab('Cost') + ggtitle("Cost vs No of iterations")``````

``````# Compute probability scores
library(PRROC)  # pr.curve() is from the PRROC package
scores <- computeScores(retvals$parameters, x2, hiddenActivationFunc = 'tanh')
a <- y3 == 1
b <- y3 == 0

# Compute probabilities of class 0 and class 1
class1 <- scores[a]
class0 <- scores[b]

# Plot the precision-recall curve
pr <- pr.curve(scores.class0 = class1,
               scores.class1 = class0,
               curve = T)

plot(pr)``````

The precision-recall curve hugs the upper portion of the plot, and hence the performance of the classifier is quite good.

## 4c. Binary Classification using MNIST – Octave code

This code to load MNIST data was taken from Daniel E blog.
Precision recall curves are available in Matlab but are yet to be implemented in Octave’s statistics package.
```
load('./mnist/mnist.txt.gz'); % load the dataset
# Subset the 'not 5' digits
a=(trainY != 5);
# Subset '5'
b=(trainY == 5);
# Make a copy of trainY; set 'not 5' as 0 and '5' as 1
y=trainY;
y(a)=0;
y(b)=1;
X=trainX(1:5000,:);
Y=y(1:5000);
# Set the dimensions of the layers
layersDimensions=[784, 7,7,3,1];
# Compute the DL network
[weights biases costs]=L_Layer_DeepModel(X', Y', layersDimensions,
                       hiddenActivationFunc='relu',
                       learningRate = 0.1,
                       numIterations = 5000);
```

# Conclusion

It was quite a challenge coding a Deep Learning Network in Python, R and Octave. The Deep Learning network implementation in this post is the base Deep Learning network, without any of the regularization methods included. Here are some key learnings I gained while playing with different multi-layer networks on different problems:

a. Deep Learning Networks come with many levers, the hyper-parameters:
– learning rate
– activation unit
– number of hidden layers
– number of units per hidden layer
– number of iterations while performing gradient descent
b. Deep Networks are very sensitive. A change in any of the hyper-parameters makes the network perform very differently.
c. Initially I thought adding more hidden layers, or more units per hidden layer, would make the DL network better at learning. On the contrary, there is a performance degradation beyond the optimal DL configuration.
d. At a sub-optimal number of hidden layers or number of hidden units, gradient descent seems to get stuck at a local minimum.
e. There were occasions when the cost came down, only to increase slowly as the number of iterations was increased. Probably early stopping would have helped.
f. I also came across situations of ‘exploding/vanishing gradients’, where the cost went to Inf/-Inf. Here I would think the inclusion of the ‘momentum method’ would have helped.

I intend to add L1 and L2 regularization, the momentum method, early stopping etc. to the code in my future posts.
Feel free to fork/clone the code from Github Deep Learning – Part 3, and take the DL network apart and play around with it.

I will be continuing this series with more hyper-parameters to handle vanishing and exploding gradients, early stopping and regularization in the weeks to come. I also intend to add some more activation functions to this basic Multi-Layer Network.
Hang around, there are more exciting things to come.

Watch this space!

To see all posts see Index of posts