By Joel Cadwell
The nodes have been assigned a color by the author so that the underlying distinctions are more pronounced. Cars that are perceived as Economical (in aquamarine) are not seen as Sporty or Powerful (in cyan). The red edges connecting these attributes indicate negative relationships. Similarly, a Practical car (in light goldenrod) is not Technically Advanced (in light pink). This network of feature associations replicates both the economical to luxury and the practical to advanced differentiations so commonly found in the car market. North Americans living in the suburbs may need to be reminded that Europe has many older cities with less parking and narrower streets, which explains the inclusion of the city focus feature.
The data come from the R package plfm, as I explained in an earlier post where I ran a correspondence analysis using the same dataset and where I described the study in more detail. The input to the correspondence analysis was a cross tabulation of the number of respondents checking which of the 27 features (the nodes in the above graph) were associated with each of 14 different car models (e.g., Is the VW Golf Sporty, Green, Comfortable, and so on?).
I will not repeat those details, except to note that the above graph was not generated from a car-by-feature table with 14 car rows and 27 feature columns. Instead, as you can see from the R code at the end of this post, I reformatted the original long vector with 29,484 binary entries and created a data frame with 1092 rows, a stacking of the 14 cars rated by each of the 78 respondents. The 27 columns, on the other hand, remain binary yes/no associations of each feature with each car. One can question the independence of the 1092 rows given that respondent and car are grouping factors with nested observations. However, we will assume, in order to illustrate the technique, that cars were rated independently and that there is one common structure for the 14-car European market. Now that we have the data matrix, we can move on to the analysis.
As in the last post, we will model the associative net underlying these ratings using the IsingFit R package. I would argue that it is difficult to assert any causal ordering among the car features. Which comes first in consumer perception, Workmanship or High Trade-In Value? Although objectively trade-in value depends on workmanship, it may be more likely that the consumer learns first that the car maintains its value and then infers high quality. A possible resolution is to treat each of the 27 nodes as a dependent variable in their own regression equation with the remaining nodes as predictors. In order to keep the model sparse, IsingFit fits the logistic regressions with the R package glmnet.
For instance, when Economical is the outcome, we estimate the impact of the other 26 nodes including Powerful. Then, when Powerful is the outcome, we fit the same type of model with coefficients for the remaining 26 features, one of which is Economical. There is nothing guaranteeing that the two effects will be the same (i.e., Powerful’s effect on Economical = Economical’s effect on Powerful, controlling for all the other features). Since an undirected graph needs a symmetric affinity matrix as input, IsingFit checks to determine if both coefficients are nonzero (remember that sparse modeling yields lots of zero weights) and then averages the coefficients when Economical is in the Powerful model and Powerful is in the Economic model (called the AND rule).
Hastie, Tibshirani and Wainwright refer to this approach as “neighborhood-based” in their chapter on graph and model selection. Two nodes are in the same neighborhood when mutual relationships remain after controlling for everything else in the model. The red edge between Economical and Powerful indicates that each was in the other’s equation and that their average was negative. IsingFit output the asymmetric weights in a data matrix called asymm.weights (Res$weiadj is symmetric after averaging). It is always a good idea to check this matrix and determine if we are justified in averaging the upper and lower triangles.
It should be noted that the undirected graph is not a correlation network because the weighted edges represent conditional independence relationships and not correlations. You need only go back to the qgraph() function and replace Res$weiadj with cor(rating) or cor_auto(rating) in order to plot the correlation network. The qgraph documentation explains how cor_auto() checks to determine if a Pearson correlation is appropriate and substitutes a polychoric when all the variables are binary.
Sacha Epskamp provides a good introduction to the different types of network maps in his post on Network Model Selection Using qgraph. Larry Wasserman covers similar topics at an advanced level in this course on Statistical Machine Learning. There is a handout on Undirected Graphical Models along with two YouTube video lectures (#14 and #15). Wasserman raises some concerns about our ability to estimate conditional independence graphs when the data does not have just the right dependence structure (not too much and not too little), which is an interesting point-of-view given that he co-teaches the class with Ryan Tibshirani, whose name is associated with the lasso and sparse modeling.
# R code needed to reproduce the undirected graph
# car$data$rating is length 29,484
# 78 respondents x 14 cars x 27 attributes
# restructure as a 1092 row data frame with 27 columns
rating<-data.frame(t(matrix(car$data$rating, nrow=27, ncol=1092)))
# fits conditional independence model
Res <- IsingFit(rating, family='binomial', plot=FALSE)
# Plot results:
# creates grouping of variables to be assigned different colors.
gr<-list(c(1,3,8,20,25), c(2,5,7,23,26), c(4,10,16,17,21,27),
qgraph(Res$weiadj, fade = FALSE, layout="spring", groups=gr,
color=node_color, labels=names(rating), label.scale=FALSE,
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…
Source:: R News