tfestimators – Package: Embeddings for Categorical Variables

By Florian Teschner

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

In my last posts (here and here), I explored how to use embeddings to represent categorical variables and how to combine those embeddings with other variables to build a more complex model. Both posts focused on the Keras (R) functionality. I concluded that representing categorical variables with embeddings feels artificial in Keras; concatenating multiple input layers in particular is quite cumbersome with the current Keras interface.

This week, I watched the official release video by J.J. Allaire and learned about the tfestimators package. It turns out that the awesome RStudio team has built a very handy interface to access TensorFlow and train models with multiple parameters and embeddings.

In this post, I will show how to use the package to quickly fit a model in which categorical variables are represented as embeddings.

As in the previous posts, I work with the NYC Citi Bike count data from Kaggle. It contains daily bicycle counts for four major bridges in NYC. To get a longer dataset, I stack the counts of all bridges into a single column and use it as the dependent variable.

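library(lubridate)  # provides wday()

df <- read.csv("data/nyc-east-river-bicycle-counts.csv")
# Reshape from wide (one column per bridge) to long (one row per date and bridge).
# reshape2::melt() is used here because df is a plain data.frame.
dflong <- reshape2::melt(df[c("Date", "Brooklyn.Bridge", "Manhattan.Bridge", "Williamsburg.Bridge", "Queensboro.Bridge")], id.vars = "Date")
dflong$date <- as.Date(dflong$Date)
dflong$weekday <- wday(dflong$date, label = TRUE)
# The temperature column name is what read.csv() produced here; it may differ with other encodings.
dflong <- merge(dflong, df[, c("Date", "Precipitation", "Low.Temp..Â.F.")], by = "Date")
# scale() returns a one-column matrix; as.numeric() keeps plain numeric columns
dflong$ScaledUsers <- as.numeric(scale(dflong$value))
dflong$lowTemp <- as.numeric(scale(dflong[, "Low.Temp..Â.F."]))
# rain = 1 if any precipitation was recorded, 0 otherwise
dflong$rain <- ifelse(dflong$Precipitation != 0, 1, 0)
dflong$Bridge <- factor(dflong$variable)
# Relabel the factor levels as 1..n so they can be converted to integers later
levels(dflong$Bridge) <- seq_along(levels(dflong$Bridge))
levels(dflong$weekday) <- seq_along(levels(dflong$weekday))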
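
The wide-to-long reshaping used to build the longer dataset can be illustrated on toy data. The following is a minimal sketch in base R only; the data frame `wide` and its counts are made up for illustration and are not taken from the Kaggle file.

```r
# Toy data: two days, two bridges (the counts are invented for illustration)
wide <- data.frame(
  Date = c("2016-04-01", "2016-04-02"),
  Brooklyn.Bridge = c(606, 504),
  Manhattan.Bridge = c(1446, 1188),
  stringsAsFactors = FALSE
)

bridge_cols <- c("Brooklyn.Bridge", "Manhattan.Bridge")

# stack() puts all bridge columns below each other:
# `values` holds the counts, `ind` the name of the source column
stacked <- stack(wide[bridge_cols])

# Repeat the dates once per bridge so every count keeps its date
long <- data.frame(
  Date = rep(wide$Date, times = length(bridge_cols)),
  variable = stacked$ind,
  value = stacked$values
)

print(long)
```

Each date now appears once per bridge, so the bridge name becomes an ordinary categorical column that can be embedded later.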

The goal of our toy model is to predict the number of bicycles per day on a given bridge, depending on the weekday, the bridge (“Brooklyn.Bridge”, “Manhattan.Bridge”, “Williamsburg.Bridge”, “Queensboro.Bridge”), whether it rains, and the temperature. Overall, we have two categorical variables, one binary variable, and one continuous variable.

library(tfestimators)

## convert the factors to integers -- tfestimators is strict with input types.
dflong$Bridge <- as.integer(dflong$Bridge)
dflong$weekday <- as.integer(dflong$weekday)
embedding_dimension_bridges = 2
embedding_dimension_weekdays = 3
cols <- feature_columns(
  column_numeric("lowTemp", "rain"),
  column_embedding(column_categorical_with_vocabulary_list("weekday", vocabulary_list = c(1:7)), embedding_dimension_weekdays),
  column_embedding(column_categorical_with_vocabulary_list("Bridge", vocabulary_list = c(1:4)), embedding_dimension_bridges)
)

The first step is to define the input variables and their types. Let’s start with the simple numeric variables lowTemp and rain: we just define them as column_numeric("lowTemp", "rain"). The two categorical variables that we want to embed need a bit more work. First, each needs a list of all possible values (the vocabulary_list parameter). Second, we need to define the embedding dimension for each categorical variable.
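
To make the idea of an embedding column more concrete, here is a minimal sketch in base R of what an embedding does conceptually: it is a lookup table with one row per category and one column per embedding dimension. The matrix `weekday_embedding` and the ids in `weekday_ids` are hypothetical and only serve as illustration; in the real model, TensorFlow creates the table and learns its values during training.

```r
set.seed(42)

n_weekdays <- 7
embedding_dimension_weekdays <- 3

# Hypothetical embedding table: one row per weekday, one column per dimension.
# Random values stand in for weights that would be learned during training.
weekday_embedding <- matrix(
  rnorm(n_weekdays * embedding_dimension_weekdays),
  nrow = n_weekdays,
  ncol = embedding_dimension_weekdays
)

# Integer-coded weekdays for three observations
weekday_ids <- c(1L, 7L, 1L)

# The lookup is a plain row selection: each id is replaced by its vector
embedded <- weekday_embedding[weekday_ids, , drop = FALSE]
print(dim(embedded))

# Assumed width of the concatenated model input:
# 3 (weekday embedding) + 2 (bridge embedding) + 2 (lowTemp, rain)
input_width <- embedding_dimension_weekdays + 2 + 2
print(input_width)
```

Observations with the same weekday id receive the same vector, which is why the vocabulary list has to cover every value that can occur in the data.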

Next, we write a short function that defines the input and output of the model as well as the batch size and the number of epochs.

bridge_input_fn <- function(data, num_epochs = 1) {
  input_fn(data,
           features = c("lowTemp", "rain", "weekday", "Bridge"),
           response = "ScaledUsers",
           batch_size = 2,
           num_epochs = num_epochs)
}
############ Train and Test Dataset #############
indices <- sample(1:nrow(dflong), size = 0.80 * nrow(dflong))
train <- dflong[indices, ]
test  <- dflong[-indices, ]
############## Define the model ############
model <- dnn_regressor(feature_columns = cols, hidden_units = c(32, 10), dropout = 0.15)
# train the model
history <- model %>% train(bridge_input_fn(train[, c("ScaledUsers", "lowTemp", "rain", "weekday", "Bridge")], num_epochs = 1))
## eval the output
model %>% evaluate(bridge_input_fn(test))
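
To get a feeling for how the 80/20 split, the batch size, and the number of epochs interact, the arithmetic can be sketched in base R. The row count `n_rows` is a hypothetical value, not the size of the actual dataset, and the step count assumes that a final partial batch still counts as one step.

```r
set.seed(1)

n_rows <- 100L         # hypothetical number of rows, not the real dataset size
train_fraction <- 0.80
batch_size <- 2L
num_epochs <- 1L

# Number of training rows (rounded down if the product is not an integer)
n_train <- floor(train_fraction * n_rows)

# Random split into disjoint training and test row indices
indices <- sample(seq_len(n_rows), size = n_train)
test_rows <- setdiff(seq_len(n_rows), indices)

# One step processes one batch; one epoch passes over all training rows once
steps_per_epoch <- ceiling(n_train / batch_size)
total_steps <- steps_per_epoch * num_epochs

print(c(n_train = n_train, n_test = length(test_rows), total_steps = total_steps))
```

With such a small batch size the loss is recorded for many steps even with a single epoch, which is what makes a plot of loss against training steps informative.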

To evaluate the model, we split the data into a training set and a test set. We define the model as a deep neural network (DNN) regression model with two hidden layers (one with 32 nodes, the other with 10). Compared to the Keras version, in which one needs to concatenate different input layers, this interface is straightforward.
Finally, we check the model’s performance on the test set and plot the learning history.

library(ggplot2)

df <- data.frame(losses = history$losses$mean_losses, steps = history$step)
ggplot(df, aes(steps, losses)) + geom_point() + geom_smooth() + theme_bw() + ylab("Loss") + xlab("Training Steps") + ggtitle("Testing: TF Estimators")

To conclude, the package is a great step forward for applying deep neural nets to everyday problems and for quickly using embeddings for categorical variables. Big kudos to the RStudio team for their efforts. If you have time left and want a quick update on deep learning for the R community, check out J.J. Allaire’s video.
