Shiny and Leaflet for Visualization Awards

By grumble10

(This article was first published on R – biologyforfun, and kindly contributed to R-bloggers)

Next week will be the meeting of the German (and Swiss and Austrian) ecologists in Marburg, and the organizing team launched a visualization contest based on spatial data of the stores present in the city. Nadja Simons and I decided to enter the contest; our idea was to link the store data to the city bus network, promoting a sustainable and safe way of moving through the town (we are ecologists after all).

We used leaflet to prepare the map and plot the bus lines, bus stops and stores. On top of that, because all this information is rather overwhelming to grasp (more than 200 stores and 50 bus stops), we implemented a shiny App to allow the user to narrow their search for the best bars in Marburg.

All the code is on GitHub and the App can be accessed by clicking here.

Go there, enjoy, and if you want to learn a bit about the process of developing the App come back here.

First a few words on what is so cool about leaflet:

  • There are a lot of basemaps available
  • With the AwesomeMarker plugin, available in the GitHub repo of the package, you can directly tap into three awesome icon libraries: Font Awesome, Glyphicons and Ionicons
  • Leaflet understands HTML, which we used to provide clickable links to the bus timetables
  • Using groups makes it easy to interactively add or remove groups of objects from the map (a minimal sketch follows this list)
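
Putting those pieces together, here is a minimal, self-contained sketch (not the contest code; the coordinates, labels and timetable URLs below are invented for illustration) using a recent version of the leaflet package:

library(leaflet)

# Invented example data: two bus stops with links to their timetables
stops <- data.frame(
  name      = c("Rudolphsplatz", "Hauptbahnhof"),
  lng       = c(8.768, 8.774),
  lat       = c(50.804, 50.820),
  timetable = c("http://example.org/line1.pdf", "http://example.org/line2.pdf"),
  stringsAsFactors = FALSE
)

# AwesomeMarkers icon drawn from the Font Awesome library
stop_icon <- awesomeIcons(icon = "bus", library = "fa", markerColor = "orange")

leaflet() %>%
  addProviderTiles("CartoDB.Positron") %>%     # one of the many basemaps on offer
  addAwesomeMarkers(data = stops, lng = ~lng, lat = ~lat, icon = stop_icon,
                    # leaflet renders HTML, so popups can carry clickable links
                    popup = ~paste0("<b>", name, "</b><br/>",
                                    "<a href='", timetable, "'>timetable</a>"),
                    group = "Bus stops") %>%
  addLayersControl(overlayGroups = c("Bus stops"),
                   options = layersControlOptions(collapsed = FALSE))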

Having this was already nice, but putting it into a shiny App was even more exciting. Here are some of the main points:

  • A very important concept in shiny is reactivity. The way I understood it, a reactive object is a small function that gets re-evaluated every time the user changes an input; see this nice tutorial for more on this.
  • Our idea for the App was that stores should appear when the mouse passes over their nearest bus stop. The catch is that the stores must then disappear when the mouse moves out. The trick is to create a set of reactiveValues that are NULL as long as the event (mouse over a bus stop) has not occurred AND return to NULL when the event is over (mouse out). Helped by this post, we were able to implement what we wanted (a stripped-down sketch follows this list).
  • Shiny gives you a lot of freedom to create and customize the design of the App; what I found very cool was the possibility to have tabs.
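
Here is a stripped-down sketch of that hover logic (not our actual app): it assumes bus-stop markers carry a layerId and uses the mouseover/mouseout marker events exposed by leaflet's shiny integration; all data below are invented.

library(shiny)
library(leaflet)

# Invented example data: two bus stops and a few stores keyed to their nearest stop
stops  <- data.frame(name = c("Rudolphsplatz", "Hauptbahnhof"),
                     lng  = c(8.768, 8.774),
                     lat  = c(50.804, 50.820),
                     stringsAsFactors = FALSE)
stores <- data.frame(name         = c("Bar A", "Bar B", "Cafe C"),
                     nearest_stop = c("Rudolphsplatz", "Rudolphsplatz", "Hauptbahnhof"),
                     lng          = c(8.769, 8.767, 8.775),
                     lat          = c(50.805, 50.803, 50.821),
                     stringsAsFactors = FALSE)

ui <- fluidPage(leafletOutput("map", height = 500))

server <- function(input, output, session) {

  output$map <- renderLeaflet({
    leaflet() %>%
      addTiles() %>%
      addCircleMarkers(data = stops, lng = ~lng, lat = ~lat,
                       layerId = ~name, group = "Bus stops")
  })

  # NULL while no bus stop is hovered
  rv <- reactiveValues(stop = NULL)

  # Mouse enters a bus-stop marker: remember which one
  observeEvent(input$map_marker_mouseover, {
    rv$stop <- input$map_marker_mouseover$id
  })

  # Mouse leaves the marker: reset to NULL
  observeEvent(input$map_marker_mouseout, {
    rv$stop <- NULL
  })

  # Show the stores nearest to the hovered stop, or clear them
  observe({
    proxy <- leafletProxy("map") %>% clearGroup("Stores")
    if (!is.null(rv$stop)) {
      near <- stores[stores$nearest_stop == rv$stop, ]
      proxy %>% addMarkers(data = near, lng = ~lng, lat = ~lat,
                           popup = ~name, group = "Stores")
    }
  })
}

shinyApp(ui, server)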

Be sure to check the leaflet tutorial page (90% of my questions were answered there), and also check out the many articles on shiny.

See you in Marburg if you’ll be around!


To leave a comment for the author, please follow the link and comment on their blog: R – biologyforfun.


Source:: R News

IBM Data Science Experience:  First steps with yorkr

By Tinniam V Ganesh


(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Fresh, and slightly dizzy, from my foray into Quantum Computing with IBM’s Quantum Experience, I now turn my attention to IBM’s Data Science Experience (DSE).

I am on the verge of completing a really great 3-module ‘Data Science and Engineering with Spark XSeries‘ from the University of California, Berkeley, and I have been thinking of trying out some form of integrated delivery platform for performing analytics for quite some time. Coincidentally, IBM came out with its Data Science Experience a month back. There are a couple of other collaborative platforms available for playing around with Apache Spark or data analytics, namely Jupyter notebooks, Databricks and Data.world.

I decided to go ahead with IBM’s Data Science Experience as the GUI is a lot cooler, it includes shared data sets, and it integrates with Object Storage, Cloudant DB, etc., which seemed a lot closer to the cloud, literally! IBM’s DSE is an interactive, collaborative, cloud-based environment for performing data analysis with Apache Spark. DSE is hosted on IBM’s PaaS environment, Bluemix, so it should be possible to access from DSE the plethora of cloud services available on Bluemix. IBM’s DSE uses Jupyter notebooks for creating and analyzing data; these can be easily shared, and DSE has access to a few hundred publicly available datasets.

Disclaimer: This article represents the author’s viewpoint only and doesn’t necessarily represent IBM’s positions, strategies or opinions

In this post, I use IBM’s DSE and my R package yorkr to analyze the performance in one ODI match (Australia vs India, 2 Feb 2012) and the batting performance of Virat Kohli in IPL matches. These are my ‘first’ steps in DSE, so I use plain old R for the analysis together with my R package yorkr. I intend to do more interesting stuff on machine learning with SparkR, Sparklyr and PySpark in the weeks and months to come.

Working with Jupyter notebooks is fairly straightforward; they can handle code in R, Python and Scala. Each cell can contain either code, Markdown text, NBConvert or a heading. The code is written into the cells and can be executed sequentially. Here is a screenshot of the notebook.

The ‘File’ menu can be used for ‘saving and checkpointing’ or ‘reverting’ to a checkpoint. The ‘Kernel’ menu can be used to start, interrupt and restart the kernel, run all cells, etc. The Data Sources icon can be used to load data sources into your code. The data is uploaded to Object Storage with appropriate credentials, and you then import it from Object Storage into your code using those credentials. In my notebook with yorkr I load the data directly from GitHub. You can use the ‘Sharing’ icon to share the notebook; the shared notebook has the extension ‘.ipynb’. You can import a shared notebook directly into your environment and get started with the code available in it.

You can import existing R, Python or Scala notebooks as shown below. My notebook ‘Using R package yorkr – A quick overview‘ can be downloaded by following the link ‘yorkrWithDSE‘ and clicking the green download icon in the top right corner.

I have also uploaded the file to GitHub, and you can download it from there too: ‘yorkrWithDSE‘. This notebook can be imported into your DSE as shown below.


Jupyter notebooks have been integrated with GitHub and are rendered directly from GitHub. You can view my Jupyter notebook on GitHub: ‘Using R package yorkr – A quick overview‘. You can also view it on NBviewer: ‘Using R package yorkr – A quick overview‘.

So there it is. You can download my notebook, import it into IBM’s Data Science Experience and then use data from ‘yorkrData’ as shown. As already mentioned, yorkrData contains converted data for ODIs, T20s and the IPL. For details on how to use my R package yorkr, please see my posts on yorkr at ‘Index of posts’.
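
If you just want to warm up with the package outside DSE, a minimal sketch follows. The .RData file name and URL here are placeholders; substitute a real file from the yorkrData repository.

# Install yorkr from CRAN and load it
install.packages("yorkr")
library(yorkr)

# Pre-converted match data can be pulled straight from GitHub.
# The URL below is a placeholder; point it at an actual file in yorkrData.
url <- "https://github.com/tvganesh/yorkrData/raw/master/ODI/ODI-matches/Australia-India-2012-02-02.RData"
download.file(url, destfile = "aus-ind.RData", mode = "wb")
load("aus-ind.RData")   # the name of the object this creates depends on the file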

Hope you have fun playing with IBM’s Data Science Experience and my package yorkr.

I will be exploring IBM’s DSE in the weeks and months to come, in the areas of machine learning with SparkR, Sparklyr and PySpark.

Watch this space!!!


Also see

1. Introducing QCSimulator: A 5-qubit quantum computing simulator in R
2. Natural Language Processing: What would Shakespeare say?
3. Introducing cricket package yorkr: Part 1 - Beaten by sheer pace!
4. A closer look at “Robot horse on a Trot! in Android”
5. Re-introducing cricketr! : An R package to analyze performances of cricketers
6. What’s up Watson? Using IBM Watson’s QAAPI with Bluemix, NodeExpress – Part 1
7. Deblurring with OpenCV: Wiener filter reloaded

To see all my posts check
Index of posts

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….


Source:: R News

Land Conflict, Property Rights and Deforestation in Brasil

By Thiemo Fetzer


(This article was first published on fReigeist » R, and kindly contributed to R-bloggers)

Weak institutions and natural resource abundance are a recipe for civil conflict and the overexploitation of natural resources. In our forthcoming paper with the title “Take what you can: property rights, contestability and conflict”, Samuel Marden and I present evidence from Brazil indicating that insecure property rights are a major cause of land related conflict and a contributing factor in Amazon deforestation.

Brazil is a unique context. The contestability of property rights over land is enshrined in the Brazilian constitution. Land not in ‘productive use’ is vulnerable to invasion by squatters, who can develop the land and appeal to the government for title. This setup may cause violent conflict between squatters and other groups claiming title to the land. As such, weak property rights may distinctly affect conflict by directly and indirectly shaping the underlying incentives.

Protected Areas and Tribal Reserve Areas redefine the “productive use” of land, rendering that land no longer contestable. We obtain variation in the municipal share of contestable land by exploiting the dramatic expansion of land under ecological or indigenous protection between 1997 and 2010: an increase in the municipal share of land under protection reduces the share of land in that municipality that is contestable. Using this variation in the share of land with contestable title in the Brazilian Amazon, we show that (in)secure property rights are strongly associated with land-related conflict, and that assigning property rights reduces such conflict. We obtain these results using classical panel data regression methods, but also employ a novel quasi-event-study approach.
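
A stylized version of such a panel regression, with hypothetical variable names and simulated data (this is not the paper's actual specification), could look like the following, using lfe::felm for municipality and year fixed effects with standard errors clustered by municipality:

library(lfe)

# Simulated stand-in for a municipality-by-year panel (names are hypothetical)
set.seed(1)
conflict_panel <- expand.grid(municipality = factor(1:50), year = factor(1997:2010))
conflict_panel$share_protected <- runif(nrow(conflict_panel))
conflict_panel$land_conflicts  <- rpois(nrow(conflict_panel),
                                        lambda = exp(1 - conflict_panel$share_protected))

m <- felm(land_conflicts ~ share_protected   # outcome and regressor of interest
          | municipality + year              # two-way fixed effects
          | 0                                # no instruments
          | municipality,                    # cluster standard errors by municipality
          data = conflict_panel)
summary(m)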

Effects on deforestation and land use change. Our empirical results on land conflict suggest that the creation of protected areas and indigenous areas should have distinct effects on land use: since it is not attractive to develop a land claim on protected land, or on land where title has been given to indigenous groups, there should be limited incentive to invest in the land through deforestation.

There are two distinct channels: permanent deforestation of the type associated with cropland conversion should be dramatically reduced, while temporary deforestation, such as that due to illegal logging, may actually increase if protection is not accompanied by significant increases in enforcement capacity.

Measuring deforestation and land use change. On the Amazon frontier, it is difficult to obtain reliable statistics about crop production or simple cattle ranching, as a lot of the farming is subsistence farming at a small scale. Hence, official agricultural statistics are unlikely to be very useful for documenting changes in land use. In our paper, we use a very simple machine-learning method to infer land-use change: k-means clustering of land-use sequences built from pixel-level classifications in the MODIS land-use product. We sampled roughly 800,000 randomly selected points across the Amazon and developed a pixel-level panel for these points, exploring the pattern of land use. Over a five-year period, a sequence of land use such as

Forest – Forest – Shrubland – Shrubland – Cropland

is indicative of permanent land use change towards farmland, while a sequence such as

Forest – Shrubland – Shrubland – Forest – Forest

may be indicative of temporary deforestation followed by forest regrowth. The larger the state space (i.e. the larger the set of land-use classes), the more difficult it becomes to classify individual sequences of land use by hand. For example, with 10 different states there are 10^5 possible five-year sequences. This is where dimensionality reduction as carried out by k-means clustering can help, by grouping sequences that are similar based on some numeric features.

While in the end we decided to use a very simplified state space with only three states (Forest, Cropland and Shrubland), the method is nevertheless instructive and useful for simple applications like this.

The features that we use are simple counts, such as the number of transitions into or out of the Forested state (MODIS land-use category <= 5), the number of transitions into the Cropland state (MODIS land-use categories 12/14), or the length of time a pixel is classified as shrubland (in between cropland and forested). In addition, we count the length of repeat patterns, such as SCSC, which may indicate crop rotation.

The separation achieved using R’s built-in kmeans() function, for different numbers of clusters and the different numeric features, is illustrated below; a toy version of this clustering is sketched after the figure.

Numeric Features used for k-means clustering, cluster results and separation achieved.
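
As a rough, self-contained illustration of the approach (not our actual feature construction; the toy sequences below are invented), one can turn each pixel's land-use sequence into a few counts and hand them to kmeans():

set.seed(42)

# Toy five-year land-use sequences: F = forest, S = shrubland, C = cropland
seqs <- c("FFSSC", "FSSFF", "FFFFF", "FSCSC", "FFFSF", "FFFFF")

# Simple numeric features per sequence
features <- t(sapply(strsplit(seqs, ""), function(s) {
  c(n_forest    = sum(s == "F"),                              # years classified as forest
    n_crop      = sum(s == "C"),                              # years classified as cropland
    forest_loss = sum(s[-length(s)] == "F" & s[-1] != "F"),   # transitions out of forest
    ever_crop   = as.numeric(any(s == "C")))                  # ever converted to cropland
}))

# Group similar sequences; the number of clusters is a researcher choice
km <- kmeans(features, centers = 3, nstart = 25)
split(seqs, km$cluster)   # inspect which sequences end up together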

We can look at the individual clusters, and then it is up to the researcher to decide how to interpret them. In our case, there was a clear separation into two clusters indicating permanent deforestation, as 100% of the sequences falling into them had been classified as cropland at least once. There is also a class that clearly indicates temporary deforestation, with patterns of forest regrowth, and a category that rests somewhere in between temporary and permanent.

Lastly, the big cluster consists of the bulk of pixels that are always forested.

Permanent vs Temporary and Forest Use

Our results, based on a simple difference-in-differences analysis as well as a matching difference-in-differences (as carried out by e.g. Chris Nolte and many others), suggest that pixels that become protected are more likely to remain in a forested state after a protected area is established.

In terms of permanent and temporary deforestation, our results suggest that, while permanent deforestation goes down, temporary land-use patterns actually seem to become more likely. This could suggest that environmental protection induces some behavioural changes at the margin, whereby settlers, rather than trying to develop legal claims over the land, simply extract the natural resources on it and move on. Naturally, such behaviour is much more difficult to prevent through policing and enforcement efforts.

References

Assunção, J., & Rocha, R. (2014). Getting Greener by Going Black : The Priority Municipalities in Brazil. Mimeo.

Hidalgo, F. D., Naidu, S., Nichter, S., & Richardson, N. (2010). Economic Determinants of Land Invasions. Review of Economics and Statistics, 92(3), 505–523. http://doi.org/10.1162/REST_a_00007

Espírito-Santo, F. D. B., Gloor, M., Keller, M., Malhi, Y., Saatchi, S., Nelson, B., … Phillips, O. L. (2014). Size and frequency of natural forest disturbances and the Amazon forest carbon balance. Nature Communications, 5, 3434. http://doi.org/10.1038/ncomms4434

Fetzer, T. R. (2014). Social Insurance and Conflict: Evidence from India. Mimeo.

Fetzer, T. R., & Marden, S. (2016). Take what you can: property rights, contestability and conflict.

Morton, D. C., DeFries, R. S., Shimabukuro, Y. E., Anderson, L. O., Arai, E., del Bon Espirito-Santo, F., … Morisette, J. (2006). Cropland expansion changes deforestation dynamics in the southern Brazilian Amazon. Proceedings of the National Academy of Sciences of the United States of America, 103(39), 14637–14641.

To leave a comment for the author, please follow the link and comment on their blog: fReigeist » R.


Source:: R News

Did she know we were writing a book?

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge.

Nina Zumel and I definitely troubled over possibilities for some time before deciding to write Practical Data Science with R, Nina Zumel, John Mount, Manning 2014.

In the end we worked very hard to organize and share a lot of good material in what we feel is a very readable manner. But I think the first author may have been signaling and preparing a bit earlier than I was aware. Please read on to see some of her prefiguring work.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Source:: R News

The importance of Data Visualization

By Fisseha Berhane


(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Before we perform any analysis and make any assumptions about the distributions of and relationships between variables in our datasets, it is always a good idea to visualize the data in order to understand their properties and identify appropriate analytic techniques. In this post, let’s see the dramatic differences in the conclusions we can reach based on (1) simple summary statistics only, and (2) data visualization.

The four data sets

The Anscombe dataset, which is found in the base R datasets package, is handy for showing the importance of data visualization in data analysis. It consists of four datasets, each of which contains eleven (x, y) points.

anscombe 

   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89 

Let’s do some massaging to make the data more convenient for analysis and plotting.

Create four groups: setA, setB, setC and setD.

library(ggplot2)
library(dplyr)
library(reshape2)

setA=select(anscombe, x=x1,y=y1)
setB=select(anscombe, x=x2,y=y2)
setC=select(anscombe, x=x3,y=y3)
setD=select(anscombe, x=x4,y=y4)

Add a third column which can help us to identify the four groups.

setA$group ='SetA'
setB$group ='SetB'
setC$group ='SetC'
setD$group ='SetD'

head(setA,4)  # showing sample data points from setA
   x    y group
1 10 8.04  SetA
2  8 6.95  SetA
3 13 7.58  SetA
4  9 8.81  SetA

Now, let’s merge the four datasets.

all_data=rbind(setA,setB,setC,setD)  # merging all the four data sets
all_data[c(1,13,23,43),]  # showing sample

    x    y group
1  10 8.04  SetA
13  8 8.14  SetB
23 10 7.46  SetC
43  8 7.91  SetD

Compare their summary statistics

summary_stats =all_data%>%group_by(group)%>%summarize("mean x"=mean(x),
                                       "Sample variance x"=var(x),
                                       "mean y"=round(mean(y),2),
                                       "Sample variance y"=round(var(y),1),
                                       'Correlation between x and y '=round(cor(x,y),2)
                                      )
models = all_data %>% 
      group_by(group) %>%
      do(mod = lm(y ~ x, data = .)) %>%
      do(data.frame(var = names(coef(.$mod)),
                    coef = round(coef(.$mod),2),
                    group = .$group)) %>%
dcast(., group~var, value.var = "coef")

summary_stats_and_linear_fit = cbind(summary_stats, data_frame("Linear regression" =
                                    paste0("y = ",models$"(Intercept)"," + ",models$x,"x")))

summary_stats_and_linear_fit

group mean x Sample variance x mean y Sample variance y Correlation between x and y 
1  SetA      9                11    7.5               4.1                         0.82
2  SetB      9                11    7.5               4.1                         0.82
3  SetC      9                11    7.5               4.1                         0.82
4  SetD      9                11    7.5               4.1                         0.82
  Linear regression
1      y = 3 + 0.5x
2      y = 3 + 0.5x
3      y = 3 + 0.5x
4      y = 3 + 0.5x

If we look only at the simple summary statistics shown above, we would conclude that these four data sets are identical.

What if we plot the four data sets?

 ggplot(all_data, aes(x=x,y=y)) +geom_point(shape = 21, colour = "red", fill = "orange", size = 3)+
    ggtitle("Anscombe's data sets")+geom_smooth(method = "lm",se = FALSE,color='blue') + 
    facet_wrap(~group, scales="free")

As we can see from the figures above, the datasets are very different from each other. Anscombe’s quartet is a good example showing that we have to visualize the relationships, distributions and outliers in our data, and that we should not rely only on simple summary statistics.

Summary

We should look at the data graphically before we start an analysis. Further, we should keep in mind that summary statistics cannot capture all of the complexities of real-world data, such as outliers, nonlinear relationships and unusual distributions.

    Related Post

    1. ggplot2 themes examples
    2. Map the Life Expectancy in United States with data from Wikipedia
    3. What can we learn from the statistics of the EURO 2016 – Application of factor analysis
    4. Visualizing obesity across United States by using data from Wikipedia
    5. Plotting App for ggplot2 – Part 2
    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


    Source:: R News

    The X-Factors: Where 0 means 1

    By Murtaza Haider

    (This article was first published on eKonometrics, and kindly contributed to R-bloggers)
    Hadley Wickham in a recent blog post mentioned that “Factors have a bad rap in R because they often turn up when you don’t want them.” I believe Factors are an even bigger concern. They not only turn up where you don’t want them, but they also turn things around when you don’t want them to.

    Consider the following example, where I present a data set with two variables: x and y. I represent age in years as ‘y’ and gender as a binary (0/1) variable ‘x’, where 1 represents males.

    I compute the means for the two variables as follows:
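
    (The original post shows these steps as screenshots and does not include the data set, so the snippets that follow are a stand-in with simulated data; the exact numbers will differ slightly from those quoted.)

    set.seed(123)
    n  <- 500
    df <- data.frame(
      y = round(rnorm(n, mean = 44, sd = 12)),   # age in years
      x = rbinom(n, 1, 0.45)                     # gender: 1 = male, 0 = female
    )

    mean(df$y)   # average age
    mean(df$x)   # proportion of males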

    The average age is 43.6 years, and 0.454 suggests that 45.4% of the sample comprises males. So far so good.
    Now let’s see what happens when I convert x into a factor variable using the following syntax:
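
    Continuing with the simulated data, a stand-in for that syntax:

    df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))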

    The above code adds a new variable male to the data set, and assigns labels female and male to the categories 0 and 1 respectively.

    I compute the average age for males and females as follows:
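
    The original code is again a screenshot; one way to do this with the stand-in data:

    tapply(df$y, df$male, mean)   # average age by gender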

    See what happens when I try to compute the mean for the variable ‘male‘.
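
    With the stand-in data:

    mean(df$male)   # returns NA, with a warning that the argument is not numeric or logical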

    Once you factor a variable, you can’t compute statistics such as mean or standard deviation. To do so, you need to declare the factor variable as numeric. I create a new variable gender that converts the male variable to a numeric one.
    I recompute the means below.
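
    A stand-in for that conversion and the recomputed means:

    df$gender <- as.numeric(df$male)   # underlying factor codes: 1 and 2
    mean(df$y)
    mean(df$gender)                    # roughly 1.45, not 0.45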

    Note that the average is now 1.45 and not 0.45. Why? When we created the factor variable, it turned the zeros into ones and the ones into twos. Let’s look at the data set below:
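
    With the stand-in data, the recoding is easy to see:

    head(df[, c("x", "male", "gender")])
    table(df$x, df$gender)   # 0 is recoded to 1, and 1 to 2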
    Several algorithms in R expect a binary variable to be coded as 0/1. If this condition is not satisfied, the command returns an error. For instance, when I try to estimate a logit model with gender as the dependent variable and y as the explanatory variable, R generates the following error:
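
    With the stand-in data the call and the error look like this; the last two lines are my addition, showing one common way to recover the original 0/1 coding:

    # glm(gender ~ y, data = df, family = binomial)
    # Error: y values must be 0 <= y <= 1

    df$gender01 <- as.numeric(df$male) - 1   # back to 0/1
    mean(df$gender01)                        # about 0.45 again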

    Factor or no factor, I would prefer my zeros to stay as zeros!

    To leave a comment for the author, please follow the link and comment on their blog: eKonometrics.


    Source:: R News

    An introduction to R, RStudio, and R Markdown with GIFs!

    By chesterismay

    (This article was first published on R – Chester’s R blog, and kindly contributed to R-bloggers)

    The development of the bookdown package from RStudio in the summer of 2016 has greatly facilitated the ability of educators to create open-source materials for their students to use. It extends beyond academic settings, though, and it encourages the sharing of resources and knowledge in a free and reproducible way.

    As more and more students and faculty begin to use R in their courses and their research, I wanted to create a resource for the complete beginner to programming and statistics to more easily learn how to work with R. Specifically, the book includes GIF screen recordings that show the reader what specific panes do in RStudio and also the formatting of an R Markdown document and the resulting HTML file.

    Folks that have used a programming language for a while often forget about all the troubles they had when they initially got started with it. To help with this, I’ll be working on updating the book (specifically Chapter 6) with examples of common R errors, what they mean, and how to remedy them.

    The book is entitled “Getting Used to R, RStudio, and R Markdown” and can be found at http://ismayc.github.io/rbasics-book. All of the source code for the book is available at http://ismayc.github.io/rbasics. You can also request edits to the book by clicking on the Edit button near the top of the page. You’ll also find a PDF version of the book there with links to the GIFs (since PDFs can’t have embedded GIFs like webpages can).

    Chapter 5 of the book walks through some of the basics of R in working with a data set corresponding to the elements of the periodic table. To expand on this book and on using R in an introductory statistics setting, I’ve also embarked on creating a textbook using bookdown focused on data visualization and resampling techniques to build inferential concepts. The book uses dplyr and ggplot2 packages and focuses on two main data sets in the nycflights13 and ggplot2movies packages. Chapters 8 and 9 are in development, but the plan is for an introduction to the broom package to also be given there. Lastly, there will be expanded Learning Checks throughout the book and Review Questions at the end of each chapter to help the reader better check their understanding of the material. This book is available at http://ismayc.github.io/moderndiver-book with source code available here.

    Feel free to email me or tweet to me on Twitter @old_man_chester.

    To leave a comment for the author, please follow the link and comment on their blog: R – Chester’s R blog.


    Source:: R News

    Updated R Markdown thesis template

    By chesterismay

    (This article was first published on R – Chester’s R blog, and kindly contributed to R-bloggers)

    In October of 2015, I released an R Markdown senior thesis template R package and discussed it in the blogpost here. It was well received by the students and faculty that worked with it, and this past summer I worked on updating it to make it even nicer for students. The big addition is the ability for students to export their senior thesis to a webpage (example here) and also to label and cross-reference figures and tables more easily. These additions and future revisions will be in the new thesisdown package, in the spirit of the bookdown package developed and released by RStudio in summer 2016.

    I encourage you to look over my blog post from last year to get an idea of why R Markdown is such a friendly environment to work in. Markdown, specifically, keeps the typesetting of the finished document transparent inside the source document itself. Down the road, it is my hope that students will be able to write a single set of R Markdown source files that then export into many formats. These currently include LaTeX, to produce a PDF following Reed’s senior thesis guidelines, and an HTML version in gitbook style. Eventually, this will also include a Word document following Reed’s guidelines and an ePub (electronic book) version; these last two are available at the moment but are not fully functional.

    By allowing senior theses in a variety of formats, seniors will be more easily able to display their work to potential employers, other students, faculty members, and potential graduate schools. This will allow them to get the word out about their studies and research while still encouraging reproducibility in their computations and in their analyses.

    Install the template generating package

    To check out the package yourself, make sure you have RStudio and LaTeX installed and then direct your browser to the GitHub page for the template: http://github.com/ismayc/thesisdown. The README.md file near the bottom of the page below the files gives directions on installing the template package and getting the template running. As you see there, you’ll want to install the thesisdown package via the following commands in the RStudio console:

    install.packages("devtools")
    devtools::install_github("ismayc/thesisdown")
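
    Once the package is installed, the template is used through an R Markdown draft. One way to create a thesis skeleton from the console is sketched below; note that the template name "thesis" is my assumption here, and the README linked above is the authoritative reference:

    # Create a new thesis skeleton in the current working directory
    # ("thesis" is assumed to be the template name shipped with thesisdown;
    #  check the package README if this differs)
    rmarkdown::draft("index.Rmd", template = "thesis", package = "thesisdown")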
    

    If you have any questions, feedback, or would like to report any issues, please email me.

    (The generating R Markdown file for this HTML document—saved in the .Rmd extension—is available here.)

    To leave a comment for the author, please follow the link and comment on their blog: R – Chester’s R blog.


    Source:: R News

    Variables can synergize, even in a linear model

    By John Mount

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    Introduction

    Suppose we have the task of predicting an outcome y given a number of variables v1,..,vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explain-ability, even reduce over-fit, and improve the quality of the resulting model.

    For some informative discussion on such issues please see the following:

    In this article we are going to deliberately (and artificially) find and test one of the limits of the technique. We recommend simple variable pruning, but also think it is important to be aware of its limits.

    To be truly effective in applied fields (such as data science) one often has to use (with care) methods that “happen to work” in addition to methods that “are known to always work” (or at least be aware that you are always competing against such); hence the interest in mere heuristics.

    The pruning heuristics

    Let L(y;m;v1,...,vk) denote the estimated loss (or badness of performance, so smaller is better) of a model for y fit using modeling method m and the variables v1,...,vk. Let d(a;L(y;m;),L(y;m;a)) denote the portion of L(y;m;)-L(y;m;a) credited to the variable a. This could be the change in loss, something like effectsize(a), or -log(significance(a)); in all cases larger is considered better.

    For practical variable pruning (during predictive modeling) our intuition often implicitly relies on the following heuristic arguments.

    • L(y;m;) is monotone decreasing: we expect L(y;m;v1,...,vk,a) to be no larger than L(y;m;v1,...,vk). Note this may be achievable “in sample” (or on training data), but is often false if L(y;m;) accounts for model complexity or is estimated on out-of-sample data (itself a good practice).
    • If L(y;m;v1,...,vk,a) is significantly lower than L(y;m;v1,...,vk) then we will be lucky enough to have d(a;L(y;m;),L(y;m;a)) not too small.
    • If d(a;L(y;m;),L(y;m;a)) is not too small then we will be lucky enough to have d(a;L(y;lm;),L(y;lm;a)) non-negligible (where the modeling method lm is one of linear regression or logistic regression).

    Intuitively we are hoping variable utility has a roughly diminishing-returns structure and that at least some non-vanishing fraction of a variable’s utility can be seen in simple linear or generalized linear models. Obviously this can not be true in general (interactions in decision trees being a well known situation where variable utility can increase in the presence of other variables, and there are many non-linear relations that escape detection by linear models).

    However, if the above were true (or often nearly true) we could effectively prune variables by keeping only the set of variables { a | d(a;L(y;lm;),L(y;lm;a)) is non-negligible }. This is a (user controllable) heuristic built into our vtreat R package and proves to be quite useful in practice (a small sketch of this is shown below).
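
    For concreteness, here is a minimal sketch of that kind of significance-based pruning with vtreat, on a small synthetic frame; the 0.05 cutoff is an arbitrary illustration, not a recommendation:

    library(vtreat)

    # Tiny synthetic frame: y depends on x1 but not on x2
    set.seed(2016)
    dTrain <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    dTrain$y <- dTrain$x1 + rnorm(200)

    treatments <- designTreatmentsN(dTrain, c("x1", "x2"), "y", verbose = FALSE)

    # Each derived variable gets a significance estimate; keep the "non negligible" ones
    sf <- treatments$scoreFrame
    sf[, c("varName", "sig")]
    keep <- sf$varName[sf$sig < 0.05]   # arbitrary cutoff, for illustration only
    keep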

    I’ll repeat: we feel in real world data you can use the above heuristics to usefully prune variables. Complex models do eventually get into a regime of diminishing returns, and real world engineered useful variables usually (by design) have a hard time hiding. Also, remember data science is an empirical field- methods that happen to work will dominate (even if they do not apply in all cases).

    Counter-examples

    For every heuristic you should crisply know if it is true (and is in fact a theorem) or it is false (and has counter-examples). We stand behind the above heuristics, and will show their empirical worth in a follow-up article. Let’s take some time and show that they are not in fact laws.

    We are going to show that per-variable coefficient significances and effect sizes are not monotone in that adding more variables can in fact improve them.

    First example

    First (using R) we build a data frame where y = a xor b. This is a classic example of y being a function of two variables but not a linear function of them (at least over the real numbers; it is a linear relation over the field GF(2)).

    d <- data.frame(a=c(0,0,1,1),b=c(0,1,0,1))
    d$y <- as.numeric(d$a == d$b)

    We look at the (real) linear relations between y and a, b.

    summary(lm(y~a+b,data=d))
    ## 
    ## Call:
    ## lm(formula = y ~ a + b, data = d)
    ## 
    ## Residuals:
    ##    1    2    3    4 
    ##  0.5 -0.5 -0.5  0.5 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)    0.500      0.866   0.577    0.667
    ## a              0.000      1.000   0.000    1.000
    ## b              0.000      1.000   0.000    1.000
    ## 
    ## Residual standard error: 1 on 1 degrees of freedom
    ## Multiple R-squared:  3.698e-32,  Adjusted R-squared:     -2 
    ## F-statistic: 1.849e-32 on 2 and 1 DF,  p-value: 1
    anova(lm(y~a+b,data=d))
    ## Analysis of Variance Table
    ## 
    ## Response: y
    ##           Df Sum Sq Mean Sq F value Pr(>F)
    ## a          1      0       0       0      1
    ## b          1      0       0       0      1
    ## Residuals  1      1       1

    As we expect, linear methods fail to find any evidence of a relation between y and a, b. This clearly violates our hoped-for heuristics.

    For details on reading these summaries we strongly recommend Practical Regression and Anova using R, Julian J. Faraway, 2002.

    In this example the linear model fails to recognize a and b as useful variables (even though y is a function of a and b). From the linear model’s point of view variables are not improving each other (so that at least looks monotone), but it is largely because the linear model can not see the relation unless we were to add an interaction of a and b (denoted a:b).

    Second example

    Let us develop this example a bit more to get a more interesting counterexample.

    Introduce new variables u = a and b, v = a or b. By the rules of logic we have y == 1+u-v, so there is a linear relation.

    d$u <- as.numeric(d$a & d$b)
    d$v <- as.numeric(d$a | d$b)
    print(d)
    ##   a b y u v
    ## 1 0 0 1 0 0
    ## 2 0 1 0 0 1
    ## 3 1 0 0 0 1
    ## 4 1 1 1 1 1
    print(all.equal(d$y,1+d$u-d$v))
    ## [1] TRUE

    We can now see the counter-example effect: together the variables work better than they did alone.

    summary(lm(y~u,data=d))
    ## 
    ## Call:
    ## lm(formula = y ~ u, data = d)
    ## 
    ## Residuals:
    ##          1          2          3          4 
    ##  6.667e-01 -3.333e-01 -3.333e-01 -1.388e-16 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   0.3333     0.3333       1    0.423
    ## u             0.6667     0.6667       1    0.423
    ## 
    ## Residual standard error: 0.5774 on 2 degrees of freedom
    ## Multiple R-squared:  0.3333, Adjusted R-squared:  5.551e-16 
    ## F-statistic:     1 on 1 and 2 DF,  p-value: 0.4226
    anova(lm(y~u,data=d))
    ## Analysis of Variance Table
    ## 
    ## Response: y
    ##           Df  Sum Sq Mean Sq F value Pr(>F)
    ## u          1 0.33333 0.33333       1 0.4226
    ## Residuals  2 0.66667 0.33333
    summary(lm(y~v,data=d))
    ## 
    ## Call:
    ## lm(formula = y ~ v, data = d)
    ## 
    ## Residuals:
    ##          1          2          3          4 
    ##  5.551e-17 -3.333e-01 -3.333e-01  6.667e-01 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   1.0000     0.5774   1.732    0.225
    ## v            -0.6667     0.6667  -1.000    0.423
    ## 
    ## Residual standard error: 0.5774 on 2 degrees of freedom
    ## Multiple R-squared:  0.3333, Adjusted R-squared:      0 
    ## F-statistic:     1 on 1 and 2 DF,  p-value: 0.4226
    anova(lm(y~v,data=d))
    ## Analysis of Variance Table
    ## 
    ## Response: y
    ##           Df  Sum Sq Mean Sq F value Pr(>F)
    ## v          1 0.33333 0.33333       1 0.4226
    ## Residuals  2 0.66667 0.33333
    summary(lm(y~u+v,data=d))
    ## Warning in summary.lm(lm(y ~ u + v, data = d)): essentially perfect fit:
    ## summary may be unreliable
    ## 
    ## Call:
    ## lm(formula = y ~ u + v, data = d)
    ## 
    ## Residuals:
    ##          1          2          3          4 
    ## -1.849e-32  7.850e-17 -7.850e-17  1.849e-32 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error    t value Pr(>|t|)    
    ## (Intercept)  1.00e+00   1.11e-16  9.007e+15   <2e-16 ***
    ## u            1.00e+00   1.36e-16  7.354e+15   <2e-16 ***
    ## v           -1.00e+00   1.36e-16 -7.354e+15   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 1.11e-16 on 1 degrees of freedom
    ## Multiple R-squared:      1,  Adjusted R-squared:      1 
    ## F-statistic: 4.056e+31 on 2 and 1 DF,  p-value: < 2.2e-16
    anova(lm(y~u+v,data=d))
    ## Warning in anova.lm(lm(y ~ u + v, data = d)): ANOVA F-tests on an
    ## essentially perfect fit are unreliable
    ## Analysis of Variance Table
    ## 
    ## Response: y
    ##           Df  Sum Sq Mean Sq    F value    Pr(>F)    
    ## u          1 0.33333 0.33333 2.7043e+31 < 2.2e-16 ***
    ## v          1 0.66667 0.66667 5.4086e+31 < 2.2e-16 ***
    ## Residuals  1 0.00000 0.00000                         
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    In this example we see synergy instead of diminishing returns. Each variable becomes better in the presence of the other. This is on its own good, but it indicates variable pruning is harder than one might expect, even for a linear model.

    Third example

    We can get around the above warnings by adding some rows to the data frame that don’t follow the designed relation. We can even draw rows from this frame to show the effect on a “more row independent looking” data frame.

    d0 <- d
    d0$y <- 0
    d1 <- d
    d1$y <- 1
    dG <- rbind(d,d,d,d,d0,d1)
    set.seed(23235)
    dR <- dG[sample.int(nrow(dG),100,replace=TRUE),,drop=FALSE]
    
    summary(lm(y~u,data=dR))
    ## 
    ## Call:
    ## lm(formula = y ~ u, data = dR)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -0.8148 -0.3425 -0.3425  0.3033  0.6575 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  0.34247    0.05355   6.396 5.47e-09 ***
    ## u            0.47235    0.10305   4.584 1.35e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.4575 on 98 degrees of freedom
    ## Multiple R-squared:  0.1765, Adjusted R-squared:  0.1681 
    ## F-statistic: 21.01 on 1 and 98 DF,  p-value: 1.349e-05
    anova(lm(y~u,data=dR))
    ## Analysis of Variance Table
    ## 
    ## Response: y
    ##           Df  Sum Sq Mean Sq F value    Pr(>F)    
    ## u          1  4.3976  4.3976   21.01 1.349e-05 ***
    ## Residuals 98 20.5124  0.2093                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    summary(lm(y~v,data=dR))
    ## 
    ## Call:
    ## lm(formula = y ~ v, data = dR)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -0.7619 -0.3924 -0.3924  0.6076  0.6076 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   0.7619     0.1049   7.263 9.12e-11 ***
    ## v            -0.3695     0.1180  -3.131   0.0023 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.4807 on 98 degrees of freedom
    ## Multiple R-squared:  0.09093,    Adjusted R-squared:  0.08165 
    ## F-statistic: 9.802 on 1 and 98 DF,  p-value: 0.002297
    anova(lm(y~v,data=dR))
    ## Analysis of Variance Table
    ## 
    ## Response: y
    ##           Df Sum Sq Mean Sq F value   Pr(>F)   
    ## v          1  2.265 2.26503  9.8023 0.002297 **
    ## Residuals 98 22.645 0.23107                    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    summary(lm(y~u+v,data=dR))
    ## 
    ## Call:
    ## lm(formula = y ~ u + v, data = dR)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -0.8148 -0.1731 -0.1731  0.1984  0.8269 
    ## 
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  0.76190    0.08674   8.784 5.65e-14 ***
    ## u            0.64174    0.09429   6.806 8.34e-10 ***
    ## v           -0.58883    0.10277  -5.729 1.13e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.3975 on 97 degrees of freedom
    ## Multiple R-squared:  0.3847, Adjusted R-squared:  0.3721 
    ## F-statistic: 30.33 on 2 and 97 DF,  p-value: 5.875e-11
    anova(lm(y~u+v,data=dR))
    ## Analysis of Variance Table
    ## 
    ## Response: y
    ##           Df  Sum Sq Mean Sq F value    Pr(>F)    
    ## u          1  4.3976  4.3976  27.833 8.047e-07 ***
    ## v          1  5.1865  5.1865  32.826 1.133e-07 ***
    ## Residuals 97 15.3259  0.1580                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Conclusion

    Consider the above counter-example as exceptio probat regulam in casibus non exceptis (“the exception confirms the rule in cases not excepted”). Or, roughly, it outlines the (hopefully labored and uncommon) structure needed to break the otherwise common and useful heuristics.

    In later articles in this series we will show more about the structure of model quality and show the above heuristics actually working very well in practice (and adding a lot of value to projects).

    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


    Source:: R News

    Mapping Traffic Fatalities

    By lucaspuente.github.io

    (This article was first published on lucaspuente.github.io/, and kindly contributed to R-bloggers)

    On Monday, August 29, DJ Patil, the Chief Data Scientist in the White House Office of Science and Technology Policy, and Mark Rosekind, the Administrator of the National Highway Traffic Safety Administration (NHTSA), announced the release of a data set documenting all traffic fatalities occurring in the United States in 2015. As part of their release, they issued a “call to action” for data scientists and analysts to “jump in and analyze it.” This post does exactly that by plotting these fatalities and providing the code for others to reproduce and extend the analysis.

    Step 1: Download and Clean the Data

    The NHTSA made downloading this data set very easy. Simply visit ftp://ftp.nhtsa.dot.gov/fars/2015/National/, download the FARS2015NationalDBF.zip file, unzip it, and load it into R (one scripted way to do this is shown below).
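
    If you prefer to script the download rather than grab the file by hand, something along these lines should work (the FTP server can be slow; downloading manually from the URL above is equally fine):

    # Download and unzip the FARS 2015 archive (file name as listed on the NHTSA FTP site)
    download.file("ftp://ftp.nhtsa.dot.gov/fars/2015/National/FARS2015NationalDBF.zip",
                  destfile = "FARS2015NationalDBF.zip", mode = "wb")
    unzip("FARS2015NationalDBF.zip", exdir = "FARS2015NationalDBF")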

    library(foreign)
    accidents <- read.dbf("FARS2015NationalDBF/accident.dbf")
    

    Since the goal here is to map the traffic fatalities, I also recommend subsetting the data to only include rows that have valid coordinates:

    accidents <- subset(accidents, LONGITUD!=999.99990 &  LONGITUD!=888.88880 & LONGITUD!=777.77770)
    

    Also, the map we’ll be producing will only include the lower 48 states, so we want to further subset the data to exclude Alaska and Hawaii:

    cont_us_accidents<-subset(accidents, STATE!=2 & STATE!=15)
    

    We also need to load in data on state and county borders to make our map more interpretable – without this, there would be no borders on display. Fortunately, the map_data function that’s part of the ggplot2 package makes this step very easy:

    library(ggplot2)
    county_map_data<-map_data("county")
    state_map <- map_data("state")
    

    Step 2: Plot the Data

    Plotting the data using ggplot is also not particularly complicated. The most important thing is to use layers. We’ll first add a polygon layer to a blank ggplot object to map the county borders in light grey and then subsequently add polygons to map the state borders. Then, we’ll add points to show exactly where in the (lower 48) United States traffic fatalities occurred in 2015, plotting these in red, but with a high level of transparency (alpha=0.05) to help prevent points from obscuring one another.

    map<-ggplot() + 
      #Add county borders:
      geom_polygon(data=county_map_data, aes(x=long,y=lat,group=group), colour = alpha("grey", 1/4), size = 0.2, fill = NA) +
      #Add state borders:
      geom_polygon(data = state_map, aes(x=long,y=lat,group=group), colour = "grey", fill = NA) +
      #Add points (one per fatality):
      geom_point(data=cont_us_accidents, aes(x=LONGITUD, y=LATITUDE), alpha=0.05, size=0.5, col="red") +
      #Adjust the map projection
      coord_map("albers",lat0=39, lat1=45) +
      #Add a title:
      ggtitle("Traffic Fatalities in 2015") +
      #Adjust the theme:
      theme_classic() +
      theme(panel.border = element_blank(),
            axis.text = element_blank(),
            line = element_blank(),
            axis.title = element_blank(),
            plot.title = element_text(size=40, face="bold", family="Avenir Next"))
    

    Step 3: View the Finished Product

    With this relatively simple code, we produce a map that clearly displays the location of 2015’s traffic fatalities:
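
    To actually render the figure, or write it to a file, you still need to print or save the ggplot object; for example (the file name and dimensions here are my own choices):

    print(map)   # render in the active graphics device
    ggsave("traffic_fatalities_2015.png", plot = map, width = 16, height = 10, dpi = 300)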

    Hopefully with this post you’ll be well on the way to making maps of your own and can start exploring this data set and others like it. If you have any questions, please reach out on twitter. I’m available @lucaspuente.

    To leave a comment for the author, please follow the link and comment on their blog: lucaspuente.github.io/.


    Source:: R News