Identify & Analyze Web Site Tech Stacks With rappalyzer

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (often with special capabilities), cache managers and a wide array of front-end web components. Unless a site goes to great lengths to cloak what they are using, most of these stack components leave a fingerprint — bits and pieces that we can piece together to identify them.

Wappalyzer is one tool that we can use to take these fingerprints and match them against a database of known components. If you’re not familiar with that service, go there now and enter in the URL of your own blog or company and come back to see why I’ve mentioned it.

Back? Good.

If you explored that site a bit, you likely saw a link to their GitHub repository. If you poke around on their repo, you’ll find the repo’s wiki and eventually see that folks have made various unofficial ports to other languages.

By now, you’ve likely guessed that this blog post is introducing an R port of “wappalyzer” and if you have made such a guess, you’re correct! The rappalyzer package is a “work-in-progress” but it’s usable enough to get some feedback on the package API, solicit contributors and show some things you can do with it.

Just rappalyze something already!

For the moment, one real function is exposed: rappalyze(). Feed it a URL string or an httr response object and — if the analysis was fruitful — you’ll get back a data frame with all the identified tech stack components. Let’s see what jquery.com is made of:

devtools::install_github("hrbrmstr/rappalyzer")

library(rappalyzer)

rappalyze("https://jquery.com/")
## # A tibble: 8 x 5
##         tech              category match_type version confidence
##        <chr>                 <chr>      <chr>   <chr>      <dbl>
## 1 CloudFlare                   CDN    headers            100
## 2     Debian     Operating Systems    headers            100
## 3  Modernizr JavaScript Frameworks     script            100
## 4      Nginx           Web Servers    headers            100
## 5        PHP Programming Languages    headers  5.4.45        100
## 6  WordPress                   CMS       meta   4.5.2        100
## 7  WordPress                 Blogs       meta   4.5.2        100
## 8     jQuery JavaScript Frameworks     script  1.11.3        100

If you played around on the Wappalyzer site you saw that it “spiders” many URLs to try to get a very complete picture of components that a given site may use. I just used the main jQuery site in the above example and we managed to get quite a bit of information from just that interaction.

Wappalyzer (and, hence rappalyzer) works by comparing sets of regular expressions against different parts of a site’s content. For now, rappalyzer checks:

  • the site URL (the one after a site’s been contacted & content retrieved)
  • HTTP headers (an invisible set of metadata browsers/crawlers & servers share)
  • HTML content
  • scripts
  • meta tags

Wappalyzer-proper runs in a browser-context, so it can pick up DOM elements and JavaScript environment variables which can provide some extra information not present in the static content.

As you can see from the returned data frame, each tech component has one or more associated categories as well as a version number (if it could be guessed) and a value indicating how confident the guess was. It will also show where the guess came from (headers, scripts, etc).

A peek under the hood

I started making a pure R port of Wappalyzer but there are some frustrating “gotchas” with the JavaScript regular expressions that would have meant spending time identifying all the edge cases. Since the main Wappalyzer source is in JavaScript it was trivial to down-port it to a V8-compatible version and expose it via an R function. A longer-term plan is to deal with the regular expression conversion, but I kinda needed the functionality sooner than later (well, wanted it in R, anyway).

On the other hand, keeping it in JavaScript has some advantages, especially with the advent of chimera (a phantomjs alternative that can put a full headless browser into a V8 engine context). Using that would mean getting the V8 package ported to a more modern version of the V8 source code, which isn’t trivial, but doable. Using chimera would make it possible to identify even more tech stack components.

Note also that these tech stack analyses can be dreadfully slow. That’s not due to using V8; the analyses that are slow are slow in all the language ports. The slowness comes from the way the regular expressions are checked against the site content, and from relying on regular expressions in the first place. I’ve got some ideas to speed things up, and may also introduce some “guide rails” that skip certain checks when page characteristics meet certain criteria, to avoid lengthy operations. Drop an issue if you have ideas as well.

Why is this useful?

Well, you can interactively see what a site uses like we did above. But, we can also look for commonalities across a number of sites. We’ll do this in a small example now. You can skip the expository and just work with the RStudio project if that’s how you’d rather roll.

Crawling 1,000 orgs

I have a recent-ish copy of a list of Fortune 1000 companies and their industry sectors along with their website URLs. Let’s see how many of those sites we can still access and what we identify.

I’m going to pause for just a minute to revisit some “rules” in the context of this operation.

Specifically, I’m not:

  • reading each site’s terms of service/terms & conditions
  • checking each site’s robots.txt file
  • using a uniquely identifying non-browser string crawler user-agent (since we need the site to present content like it would to a browser)

However, I’m also not:

  • using the site content in any way except to identify tech stack components (something dozens of organizations do)
  • scraping past the index page HTML content, so there’s no measurable resource usage

I believe I’ve threaded the ethical needle properly (and, I do this for a living), but if you’re uncomfortable with this you should use a different target URL list.

Back to the code. We’ll need some packages:

library(hrbrthemes)
library(tidyverse)
library(curl)
library(httr)
library(rvest)
library(stringi)
library(urltools)
library(rappalyzer) # devtools::install_github("hrbrmstr/rappalyzer")
library(rprojroot)
# I'm also using a user agent shortcut from the splashr package which is on CRAN

Now, I’ve included data in the RStudio project GH repo since it can be used later on to test out changes to rappalyzer. That’s the reason for the if/file.exists tests. As I’ve said before, be kind to sites you scrape and cache content whenever possible to avoid consuming resources that you don’t own.

Let’s take a look at our crawler code:

rt 

If you were expecting read_html() and/or GET() calls you’re likely surprised at what you see. We’re trying to pull content from a thousand web sites, some of which may not be there anymore (for good or just temporarily). Sequential calls would need many minutes with error handling. We’re using the asynchronous/parallel capabilities in the curl package to setup all 1,000 requests with the ability to capture the responses. There’s a hack-ish “progress” bar that uses . for “good” requests and X for ones that didn’t work. It still takes a little time (and you may need to tweak the total_con value on older systems) but way less than sequential GETs would have, and any errors are implicitly handled (ignored).
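The full crawler lives in the RStudio project; a minimal sketch of the asynchronous fetch, assuming the Fortune 1000 URLs are in the website column of an f1k data frame and that responses are collected into a list called results (these names are my assumptions, not necessarily the project's), might look like this:

pool <- curl::new_pool(total_con = 100)   # tweak total_con on older systems

results <- list()

purrr::walk(f1k$website, ~{
  curl::curl_fetch_multi(
    .x,
    done = function(res) { results[[res$url]] <<- res; cat(".") },  # "good" request
    fail = function(msg) cat("X"),                                  # request that didn't work
    pool = pool
  )
})

curl::multi_run(pool = pool)

The done/fail callbacks double as the hack-ish . / X progress indicator described above.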

The next block does the tech stack identification:

if (!file.exists(file.path(rt, "data", "rapp_results.rds"))) {

  results  res
    class(res)  results

  # this takes a *while*
  pb  0) rap_df  rapp_results

  write_rds(rapp_results, file.path(rt, "data", "rapp_results.rds"))

} else {
  rapp_results 

We filter out non-“OK” (HTTP 200-ish response) curl results and turn them into just enough of an httr response object to be able to work with them.

Then we let rappalyzer work. I had time to kill so I let it run sequentially. In a production or time-critical context, I’d use the future package.
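A sketch of that filter-and-identify step, assuming the fetched responses sit in the results list from the sketch above (again, the object names are illustrative rather than the project's exact code), could be:

# keep only OK-ish responses and fake up just enough of an httr response
# object for rappalyze() to accept
keep(results, ~.x$status_code < 300) %>%
  map(~{
    res <- list(
      url         = .x$url,
      status_code = .x$status_code,
      headers     = curl::parse_headers_list(.x$headers),
      content     = .x$content
    )
    class(res) <- "response"
    res
  }) -> ok_responses

# this takes a *while* when run sequentially
map_df(ok_responses, ~{
  rap_df <- rappalyze(.x)
  if (nrow(rap_df) > 0) rap_df <- mutate(rap_df, url = .x$url)
  rap_df
}) -> rapp_results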

We’re almost down to data we can play with! Let’s join it back up with the original metadata:

left_join(
  mutate(rapp_results, host = domain(rapp_results$url)) %>%
    bind_cols(suffix_extract(.$host))
  ,
  mutate(f1k, host = domain(website)) %>%
    bind_cols(suffix_extract(.$host)),
  by = c("domain", "suffix")
) %>%
  filter(!is.na(name)) -> rapp_results

length(unique(rapp_results$name))
## [1] 754

Note that some orgs seem to have changed hands/domains so we’re left with ~750 sites to play with (we could have tried to match more, but this is just an exercise). If you do the same curl-fetching exercise on your own, you may get fewer or more total sites. Internet crawling is fraught with peril and it’s not really possible to conveniently convey 100% production code in a blog post.

Comparing “Fortune 1000” org web stacks

Let’s see how many categories we picked up:

xdf 
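One plausible way xdf gets built (a sketch based on how it is used below, not necessarily the original code): keep one row per organization/technology combination, then count the categories.

distinct(rapp_results, name, sector, category, tech) -> xdf

count(xdf, category, sort = TRUE)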

Plenty to play with!

Now, what do you think the most common tech component is across all these diverse sites?

count(xdf, tech, sort=TRUE)
## # A tibble: 115 x 2
##                        tech     n
##                       <chr> <int>
##  1                   jQuery   572
##  2       Google Tag Manager   220
##  3                Modernizr   197
##  4        Microsoft ASP.NET   193
##  5          Google Font API   175
##  6        Twitter Bootstrap   162
##  7                WordPress   150
##  8                jQuery UI   143
##  9             Font Awesome   118
## 10 Adobe Experience Manager    69
## # ... with 105 more rows

I had suspected it was jQuery (and hinted at that when I used it as the initial rappalyzer example). In a future blog post we’ll look at just how vulnerable orgs are based on which CDNs they use (many use similar jQuery and other resource CDNs, and use them insecurely).

Let’s see how broadly each tech stack category is used across the sectors:

group_by(xdf, sector) %>%
  count(category) %>%
  ungroup() %>%
  arrange(category) %>%
  mutate(category = factor(category, levels=rev(unique(category)))) %>%
  ggplot(aes(category, n)) +
  geom_boxplot() +
  scale_y_comma() +
  coord_flip() +
  labs(x=NULL, y="Tech/Services Detected across Sectors",
       title="Usage of Tech Stack Categories by Sector") +
  theme_ipsum_rc(grid="X")

That’s a lot of JavaScript. But, there’s also large-ish use of CMS, Web Frameworks and other components.

I wonder what the CMS usage looks like:

filter(xdf, category=="CMS") %>%
  count(tech, sort=TRUE)
## # A tibble: 15 x 2
##                        tech     n
##                       <chr> <int>
##  1                WordPress    75
##  2 Adobe Experience Manager    69
##  3                   Drupal    68
##  4               Sitefinity    26
##  5     Microsoft SharePoint    19
##  6                 Sitecore    14
##  7                      DNN    11
##  8                Concrete5     6
##  9     IBM WebSphere Portal     4
## 10        Business Catalyst     2
## 11                    Hippo     2
## 12                TYPO3 CMS     2
## 13               Contentful     1
## 14              Orchard CMS     1
## 15             SilverStripe     1

Drupal and WordPress weren’t shocking to me (again, I do this for a living) but they may be to others when you consider this is the Fortune 1000 we’re talking about.

What about Web Frameworks?

filter(xdf, category=="Web Frameworks") %>%
  count(tech, sort=TRUE)
## # A tibble: 7 x 2
##                          tech     n
##                         <chr> <int>
## 1           Microsoft ASP.NET   193
## 2           Twitter Bootstrap   162
## 3             ZURB Foundation    61
## 4               Ruby on Rails     2
## 5            Adobe ColdFusion     1
## 6                    Kendo UI     1
## 7 Woltlab Community Framework     1

Yikes! Quite a bit of ASP.NET. But, many enterprises haven’t migrated to more modern, useful platforms (yet).

FIN

As noted earlier, the data & code are on GitHub. There are many questions you can ask and answer with this data set, so definitely share any findings you discover!

We’ll continue exploring different aspects of what you can glean from looking at web sites in a different way in future posts.

I’ll leave you with one interactive visualization that lets you explore the tech stacks by sector.

devtools::install_github("jeromefroe/circlepackeR")
library(circlepackeR)
library(data.tree)
library(treemap)

cpdf 
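cpdf has to be a hierarchy circlepackeR can traverse; one way it might be assembled (an illustrative sketch using data.tree's pathString convention, not necessarily the original code) is:

count(xdf, sector, category, tech) %>%
  mutate(pathString = paste("techstack", sector, category, tech, sep = "/")) -> cpdf

circlepackeR(as.Node(cpdf), size = "n")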

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.


Data.Table by Example – Part 3

By atmathew

(This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers)

For this final post, I will cover some advanced topics and discuss how to use data tables within user generated functions. Once again, let’s use the Chicago crime data.

dat = fread("rows.csv")
names(dat) 

Let's start by subsetting the data. The following code takes the first 50000 rows of the dat dataset, selects four columns, creates three new columns derived from the date, and then removes the original date column. The output is saved to a new variable, and the user can see the first few rows of the new data table using brackets or the head function.

ddat = dat[1:50000, .(Date, value1, value2, value3)][, 
               c("year", "month", "day") := 
                        .(year(mdy_hms(Date)), 
                          month(mdy_hms(Date)),
                          day(mdy_hms(Date)))][,-c("Date")]

ddat[1:3]  # same as head(ddat, 3)

We can now do some intermediate calculations and suppress their output by using braces.

unique(ddat$month)
ddat[, { avg_val1 = mean(value1)
         new_val1 = mean(abs(value2-avg_val1))
         new_val2 = new_val1^2 }, by=month][order(month)]

In this simple sample case, we have taken the mean value1 for each month, subtracted it from value2, and then squared that result. The output shows only the final calculation from the braces, which is the result of the squaring. Note that I also ordered the results by month with the chaining process.

That’s all very nice and convenient, but what if we want to create user-defined functions that essentially automate these tasks for future work? Let us start with a function for extracting each component from the date column.

ddat = dat[1:50000, .(Date, value1, value2, value3)]

add_engineered_dates 

We've created a new function that takes two arguments: the data table variable name and the name of the column containing the date values. The first step is to copy the data table so that we are not directly making changes to the original data. Because the goal is to add three new columns that extract the year, month, and day from the date column, we've used the lubridate package to parse the date column and then extract the desired values. Furthermore, each new column has been labeled so that it distinctly represents the original date column name and the component that it contains. The final step is to remove the original date column, so we've set that column to NULL. The final line prints the results back to the screen. You can see from the code above that the get function is used to take a variable holding a column name and extract that column within the function.
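Reconstructed from that description (data.table and lubridate assumed to be loaded; the exact original definition may differ), the function might look like:

library(data.table)
library(lubridate)

add_engineered_dates <- function(dt, date_col = "Date") {

  dt <- copy(dt)   # work on a copy so the original table is untouched

  # extract year/month/day and name the new columns after the date column
  new_cols <- paste0(date_col, c("_year", "_month", "_day"))
  dt[, (new_cols) := .(year(mdy_hms(get(date_col))),
                       month(mdy_hms(get(date_col))),
                       day(mdy_hms(get(date_col))))]

  dt[, (date_col) := NULL]   # drop the original date column

  dt[]   # print the result back to the screen
}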

result = add_engineered_dates(ddat)
result

Let’s now apply the previous code with braces to find the squared difference between value2 and the mean value1 metric.

result[, { avg_val1 = mean(value1)
           new_val1 = mean(abs(value2-avg_val1))
           new_val2 = new_val1^2 }, by=.(Date_year,Date_month)][
                  order(Date_year,Date_month)][1:12]

Perfect!

I’m hoping that these three posts on the data.table package have convinced beginners to R of its beauty. The syntax may be scary at first, but this is in my opinion one of the most important packages in R that everyone should become familiar with. So pull your sleeves up and have fun.

If you have any comments or would like to cover other specific topics, feel free to comment below. You can also contact me at mathewanalytics@gmail.com or reach me through LinkedIn

To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics.


tidytext 0.1.4

By Rstats on Julia Silge

(This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

I am pleased to announce that tidytext 0.1.4 is now on CRAN!

This release of our package for text mining using tidy data principles has an excellent collection of delightfulness in it. First off, all the important functions in tidytext now support non-standard evaluation through the tidyeval framework.

library(janeaustenr)
library(tidytext)
library(dplyr)

input_var <- quo(text)    # input column name reconstructed; the output column must be `word` to match below
output_var <- quo(word)

data_frame(text = prideprejudice) %>%
    unnest_tokens(!! output_var, !! input_var)
## # A tibble: 122,204 x 1
##         word
##        <chr>
##  1     pride
##  2       and
##  3 prejudice
##  4        by
##  5      jane
##  6    austen
##  7   chapter
##  8         1
##  9        it
## 10        is
## # ... with 122,194 more rows

I have found the tidyeval framework useful already in my day job when writing functions using dplyr for complex data analysis tasks, so we are glad to have this support in tidytext. The older underscored functions (like unnest_tokens_()) that took only strings as arguments are still in the package for now, but tidyeval is the way to go, everybody!

I also used pkgdown to build a website to explore tidytext’s documentation and vignettes.

Our book website of course contains a lot of information about how to use tidytext, but the pkgdown site has a bit of a different focus in that you can explicitly see all the function documentation and such. Getting this site up and running went extremely smoothly, and I have not worked hard to customize it; this is just all the defaults. In my experience here, the relative bang for one’s buck in setting up a pkgdown site is extremely good.

Another exciting addition to this release of tidytext is tidiers and support for Structural Topic Models from the stm package using tidy data principles. I am becoming a real fan of this implementation of topic modeling in R after experimenting with it for a while (no rJava! so fast!) and soon I’ll have a complete code-through with some example text, The Adventures of Sherlock Holmes.
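For a flavor of what that looks like, here is a minimal sketch with a hypothetical document-feature matrix (sherlock_dfm is a placeholder of mine, not an object from the release):

library(stm)
library(tidytext)

# assume `sherlock_dfm` is a quanteda document-feature matrix built elsewhere
topic_model <- stm(sherlock_dfm, K = 6, verbose = FALSE)

tidy(topic_model)                    # per-topic word probabilities ("beta")
tidy(topic_model, matrix = "gamma")  # per-document topic proportions ("gamma")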


There are a few other minor changes and bug fixes in this release as well. Get the new version of tidytext and let us know on GitHub if you have any issues!

To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.


The Kelly Criterion — Does It Work?

By Ilya Kipnis

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be about implementing and investigating the running Kelly Criterion — that is, a constantly adjusted Kelly Criterion that changes as a strategy realizes returns.

For those not familiar with the Kelly Criterion, it’s the idea of adjusting a bet size to maximize a strategy’s long-term growth rate. Both Wikipedia (https://en.wikipedia.org/wiki/Kelly_criterion) and Investopedia have entries on the Kelly Criterion. Essentially, it’s about maximizing your long-run expectation of a betting system by sizing bets higher when the edge is higher, and vice versa.

There are two formulations for the Kelly criterion: the Wikipedia result presents it as mean over sigma squared. The Investopedia definition is P-[(1-P)/winLossRatio], where P is the probability of a winning bet, and the winLossRatio is the average win over the average loss.

In any case, here are the two implementations.

investoPediaKelly  0]  0] 
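As a rough illustration of those two formulations (the function names, TTR helpers, and 30-period window below are my assumptions, not the post's original implementations), rolling versions might be sketched as:

library(TTR)  # runMean / runSD / runSum

# Wikipedia formulation: rolling mean return over rolling variance
wikiKelly <- function(R, n = 30) {
  runMean(R, n = n) / runSD(R, n = n)^2
}

# Investopedia formulation: P - (1 - P) / winLossRatio
invKellySketch <- function(R, n = 30) {
  P       <- runMean((R > 0) * 1, n = n)                        # rolling win probability
  avgWin  <-  runSum(R * (R > 0), n = n) / runSum((R > 0) * 1, n = n)
  avgLoss <- -runSum(R * (R < 0), n = n) / runSum((R < 0) * 1, n = n)
  P - (1 - P) / (avgWin / avgLoss)
}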

Let's try this with some data. At this point in time, I'm going to show a non-replicable volatility strategy that I currently trade.

For the record, here are its statistics:

                              Close
Annualized Return         0.8021000
Annualized Std Dev        0.3553000
Annualized Sharpe (Rf=0%) 2.2574000
Worst Drawdown            0.2480087
Calmar Ratio              3.2341613

Now, let’s see what the Wikipedia version does:

badKelly 

badKelly

The results are simply ridiculous. And here would be why: say you have a mean return of .0005 per day (5 bps/day), and a standard deviation equal to that (that is, a Sharpe ratio of 1). You would have 1/.0005 = 2000. In other words, a leverage of 2000 times. This clearly makes no sense.

The other variant is the more particular Investopedia definition.

invKelly 

invKelly

Looks a bit more reasonable. However, how does it stack up against not using it at all?

compare 

kellyCompare

Turns out, the fabled Kelly Criterion doesn't really change things all that much.

For the record, here are the statistical comparisons:

                               Base     Kelly
Annualized Return         0.8021000 0.7859000
Annualized Std Dev        0.3553000 0.3588000
Annualized Sharpe (Rf=0%) 2.2574000 2.1903000
Worst Drawdown            0.2480087 0.2579846
Calmar Ratio              3.2341613 3.0463063

Thanks for reading.

NOTE: I am currently looking for my next full-time opportunity, preferably in New York City or Philadelphia relating to the skills I have demonstrated on this blog. My LinkedIn profile can be found here. If you know of such opportunities, do not hesitate to reach out to me.

To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.


R 3.4.2 is released (with several bug fixes and a few performance improvements)

By Tal Galili


(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

R 3.4.2 (codename “Short Summer”) was released yesterday. You can get the latest binaries version from here. (or the .tar.gz source code from here).

As mentioned by David Smith, R 3.4.2 includes a performance improvement for names:

c() and unlist() are now more efficient in constructing the names(.) of their return value, thanks to a proposal by Suharto Anggono. (PR#17284)

The full list of bug fixes and new features is provided below.

Thank you Duncan Murdoch !

On a related note, following the announcement on R 3.4.2, Duncan Murdoch wrote yesterday:

I’ve just finished the Windows build of R 3.4.2. It will make it to CRAN and its mirrors over the next few hours.

This is the last binary release that I will be producing. I’ve been building them for about 15 years, and it’s time to retire. Builds using different tools and scripts are available from https://mran.microsoft.com/download/. I’ll be putting my own scripts on CRAN soon in case anyone wants to duplicate them.

Nightly builds of R-patched and R-devel will continue to run on autopilot for the time being, without maintenance.

I will also be retiring from maintenance of the Rtools collection.

I am grateful to Duncan for contributing so much of his time and expertise throughout the years. And I am confident that other R users, using the binaries for the Windows OS, share this sentiment.

Upgrading to R 3.4.2 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install 
setInternet2(TRUE) # only for R versions older than 3.3.0
installr::updateR() # updating R.
# If you wish it to go faster, run: installr::updateR(T)

Running “updateR()” will detect if there is a new R version available, and if so it will download+install it (etc.). There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package. If you only see the option to upgrade to an older version of R, then change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.

CHANGES IN R 3.4.2

NEW FEATURES

  • Setting the LC_ALL category in Sys.setlocale() invalidates any cached locale-specific day/month names and the AM/PM indicator for strptime() (as setting LC_TIME has since R 3.1.0).
  • The version of LAPACK included in the sources has been updated to 3.7.1, a bug-fix release.
  • The default for tools::write_PACKAGES(rds_compress=) has been changed to "xz" to match the compression used by CRAN.
  • c() and unlist() are now more efficient in constructing the names(.) of their return value, thanks to a proposal by Suharto Anggono. (PR#17284)

UTILITIES

  • R CMD check checks for and R CMD build corrects CRLF line endings in shell scripts configure and cleanup (even on Windows).

INSTALLATION on a UNIX-ALIKE

  • The order of selection of OpenMP flags has been changed: Oracle Developer Studio 12.5 accepts -fopenmp and -xopenmp but only the latter enables OpenMP so it is now tried first.

BUG FIXES

  • within(List, rm(x1, x2)) works correctly again, including when List[["x2"]] is NULL.
  • regexec(pattern, text, *) now applies as.character(.) to its first two arguments, as documented.
  • write.table() and related functions, writeLines(), and perhaps other functions writing text to connections did not signal errors when the writes failed, e.g. due to a disk being full. Errors will now be signalled if detected during the write, warnings if detected when the connection is closed. (PR#17243)
  • rt() assumed the ncp parameter was a scalar. (PR#17306)
  • menu(choices) with more than 10 choices which easily fit into one getOption("width")-line no longer erroneously repeats choices. (PR#17312)
  • length() on a pairlist succeeds. (https://stat.ethz.ch/pipermail/r-devel/2017-July/074680.html)
  • Language objects such as quote(("\n")) or R functions are correctly printed again, where R 3.4.1 accidentally duplicated the backslashes.
  • Construction of names() for very large objects in c() and unlist() now works, thanks to Suharto Anggono’s patch proposals in PR#17292.
  • Resource leaks (and similar) reported by Steve Grubb fixed. (PR#17314, PR#17316, PR#17317, PR#17318, PR#17319, PR#17320)
  • model.matrix(~1, mf) now gets the row names from mf also when they differ from 1:nrow(mf), fixing PR#14992 thanks to the suggestion by Sebastian Meyer.
  • sigma(fm) now takes the correct denominator degrees of freedom for a fitted model with NA coefficients. (PR#17313)
  • hist(x, "FD") no longer “dies” with a somewhat cryptic error message when x has extreme outliers or IQR() zero: nclass.FD(x) tries harder to find a robust bin width h in the latter case, and hist.default(*, breaks) now checks and corrects a too large breaks number. (PR#17274)
  • callNextMethod() works for ... methods.
  • qr.coef(qd, y) now has correct names also when qd is a complex QR or stems from qr(*, LAPACK=TRUE).
  • Setting options(device = *) to an invalid function no longer segfaults when plotting is initiated. (PR#15883)
  • encodeString() no longer segfaults. (PR#15885)
  • It is again possible to use configure --enable-maintainer-mode without having installed notangle (it was required in R 3.4.[01]).
  • S4 method dispatch on ... calls the method by name instead of .Method (for consistency with default dispatch), and only attempts to pass non-missing arguments from the generic.
  • readRDS(textConnection(.)) works again. (PR#17325)
  • (1:n)[-n] no longer segfaults for n <- 2.2e9 (on a platform with enough RAM).
  • x now correctly returns NA. (PR#17333)
  • Running of finalizers after explicit GC request moved from the R interface do_gc to the C interface R_gc. This helps with reclaiming inaccessible connections.
  • help.search(topic) and ??topic matching topics in vignettes with multiple file name extensions (e.g., ‘*.md.rsp‘ but not ‘*.Rmd‘) failed with an error when using options(help_type = "html").
  • The X11 device no longer uses the Xlib backing store (PR#16497).
  • array(character(), 1) now gives (a 1D array with) NA as has been documented for a long time as in the other cases of zero-length array initialization and also compatibly with matrix(character(), *). As mentioned there, this also fixes PR#17333.
  • splineDesign(.., derivs = 4) no longer segfaults.
  • fisher.test(*, hybrid=TRUE) now (again) will use the hybrid method when Cochran’s conditions are met, fixing PR#16654.

To leave a comment for the author, please follow the link and comment on their blog: R – R-statistics blog.


Convert hand-drawn equations to LaTeX with the mathpix package

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Statistics involves a lot of mathematics, so one of the nice things about report-generation systems for R like Rmarkdown is that it makes it easy to include nicely-formatted equations by using the LaTeX syntax. So, if we want to include the density function of the Gaussian Normal distribution:

$$ \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

we can just add the following markup to the document:

\[
\frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\]

Creating that markup can be a little tricky, but it generally follows naturally from the mathematics, and it’s much easier than other methods like including a screenshot of the equation. Still, a new R package makes it even easier: the mathpix package will convert an image containing an equation to its LaTeX markup equivalent. That image might be a photo of a handwritten equation on paper or a whiteboard, or even a “stattoo”:

Mostly! Only propto that didn’t get recognized. Curious about results from @kopshtik & @emhrt_ pic.twitter.com/1aoXq1je7h

— Mikhail Popov (@bearloga) September 27, 2017

The resulting LaTeX isn’t quite perfect: it mistakes the proportionality symbol for the Greek letter alpha (a mistake I’ve seen a few typesetters make). With that correction, the rendered equation — used for Bayesian inference — looks like:

$$ p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta) $$

The mathpix package was created by Jonathan Carroll and is an interface to the Mathpix API. (It’s recommended you sign up for a free API key if you intend to use this package regularly.) The package is available now on CRAN, and you can find more details on its Github page, linked below.
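Usage is essentially a one-liner; a hedged sketch (the image file name is hypothetical, and the call assumes your Mathpix API credentials are set up as the package documents):

library(mathpix)

# convert a snapshot of a handwritten equation into LaTeX markup
mathpix("handwritten_equation.jpg")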

Github (jonocarroll): mathpix : Query the mathpix API to convert math images to LaTeX

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Real time Yelp reviews analysis and response solutions for restaurant owners

By Yu-Han Chen

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Motivation

Before trying a new restaurant, we frequently consult with review platforms, such as Yelp, Zomato, or Google, where we can read comments from previous diners. Reading those reviews helps to make more informed decisions and can lead to a better dining experience with friends and family. It is clear that reviews are influential with diners, but how powerful can a single review be?

If a business receives one more star in overall rating, it can generate a 5 to 9% increase in revenue according to Michael Luca‘s research, “Reviews, Reputation, and Revenue: The Case of Yelp.com”. On the flip side, a single negative review may cost a business 30 future customers, which implies that an unfavorable comment can significantly damage reputation, profitability, and trustworthiness of a business.

Given that a single bad review can harm a business, how can the business owner mitigate the negative impact? One simple solution is to improve the business’s customer relationship management system, by raising review complaints to the business’s attention and developing an automated real-time complaint response engine.

Our code is available on GitHub, and we have also created a Shiny App to demonstrate the responses in action.

Workflow

After selecting the problem and possible business resolution, we chose the Yelp Open Dataset published on September 1, 2017 as our dataset. After inspecting and cleaning the data, we conducted sentiment analysis to separate the positive reviews from the complaints. We then used the spaCy package to preprocess the complaint review text, and we used the gensim package to train n-gram phrase models and Latent Dirichlet Allocation topic models. Finally, we built an automated response chatbot to demonstrate the response to different types of complaints.

Sentiment Analysis

Determining the sentiment of a review is a deceptively complex problem. As our dataset was unlabeled, it was necessary for us to use an unsupervised method to identify the complaint reviews. We considered several options to achieve this goal.

First, we considered using the number of review stars as a pseudo-label, assuming all one and two star reviews were complaints. However, upon further inspection it became clear that different reviewers have different standards; a three-star review from one reviewer may describe a negative experience, while a two-star review from a different reviewer may describe a balanced experience.

Second, we considered normalizing each review against the reviewer’s own standards to attempt to control for individual preferences and biases. To do this, we calculated the difference between the number of stars given for each review and that reviewer’s average star rating across all reviews. For example, if a reviewer’s average star rating is 3.5, a 2 star rating for an individual review would generate a “stars_dif” rating of -1.5.

However, this approach raised problems as well. The mean average star rating across all reviewers is 3.75, which means that the distribution of stars_dif will be skewed negative. We also found a large grouping of data points at stars_dif = 0, due to reviewers with a single review or reviewers giving the same star rating across all reviews. Thus we believe that stars_dif may not be a reliable indicator of sentiment.

Finally, we decided to look into the actual text of the reviews to determine sentiment. We created a sentiment score by tokenizing and processing the text of the reviews using NLTK, and calculating the ratio of positive words vs. negative words in the text using dictionaries of sentiment words. We used the following formula for each review:

(pos_words – neg_words) / total_words

Each review received a score bounded by -1 and 1; a more negative review is closer to -1, while a more positive review is closer to 1.
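The authors computed this in Python with NLTK; purely as an illustration, the same ratio could be sketched in R with tidytext's "bing" lexicon (the function and object names here are mine):

library(dplyr)
library(tidytext)

score_review <- function(review_text) {
  words  <- tibble(text = review_text) %>% unnest_tokens(word, text)
  scored <- inner_join(words, get_sentiments("bing"), by = "word")
  (sum(scored$sentiment == "positive") - sum(scored$sentiment == "negative")) /
    nrow(words)
}

score_review("The steak was wonderful but the service was painfully slow")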

Topic Modeling

After we separated the complaints from the positive reviews, we built topic models to categorize the complaint reviews based on the subjects discussed in the text. We trained bigram and trigram models using the Gensim package, creating phrases by linking individual words based on a ratio of the frequency of their appearance together versus the frequency apart.

After completing n-gram transformations, we trained a Latent Dirichlet Allocation model, which is designed to uncover some of the latent topic structure within a corpus of documents. The user selects the desired number of topics as a hyperparameter, and the model will then assign to each topic an assortment of the individual tokens from the ‘vocabulary’ of the corpus. The LDA model assumes that the combination of topics within each document and the combination of tokens within each topic follow a Dirichlet probability distribution, which means that each document will contain just a few topics, and each topic will contain a small number of tokens.

In practice, the LDA model returns a group of the most frequent words for each topic. Each review in the corpus can be rated with a percentage for each topic, and the LDA model can also be used to classify new reviews by the same topic categories.

After testing a variety of inputs, we determined that three topics generated the most interpretable results with the greatest separation between topics. We also adjusted the models to eliminate some words which appeared high in the output for all topics, in order to improve separation. While the meaning of each topic necessarily involves an element of human interpretation, we believe the three topics most closely correspond to complaints about price, service quality, and food quality.
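The authors fit their topic models with gensim in Python; as an R-flavored sketch of the same idea (hypothetical object names, three topics as chosen above):

library(dplyr)
library(tidytext)
library(topicmodels)

# assume `complaint_words` has one row per review/word with a count column `n`
dtm <- complaint_words %>% cast_dtm(review_id, word, n)

lda_fit <- LDA(dtm, k = 3, control = list(seed = 1234))

# most probable terms per topic, for human interpretation of each complaint type
tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(topic, desc(beta))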

Generating Results on New Text

In order to classify and respond to new Yelp reviews, we wrote code to apply these models to a new input. The code uses the models to assign a primary complaint category and generate an appropriate response to the reviewer based on the complaint category. We then returned these responses to the reviewer via chatbot for demonstration purposes (see below).

Functions of the Shiny App (Check it out)

Interactive Map

In this map, we plot all of the restaurants with negative sentiment score reviews. Users can select the complaint type, and slider inputs can adjust the sentiment score, star rating, and the number of reviews. Notably, the clusters are not spread broadly across the country, because Yelp has only included reviews from 12 states in the dataset. Detailed restaurant information, including restaurant name, category, average star rating, and the number of reviews, is available by clicking on the location tag.

LDA Visualization

We divided negative reviews into three categories. By using this interactive graph, users can manually select each topic to view its most frequent terms. The slider input allows the user to select either the terms that appear in the topic most frequently or the terms that are most distinctive in that topic vs. the other two topics. This can help in assigning an interpretable name or “meaning” to each topic.

Complainteller Chatbot

As the chatbot is built based on LDA results, we classify complaints into three categories — price, food quality, and service quality. Once the chatbot receives a message, it runs the text through the NLP preprocessing and classification algorithms, ranks the percentage of possible complaint groups and then replies to the review immediately. For the demo video, we selected three-star reviews from NYY Steak (Yankee Stadium), which are new information to the chatbot and are viewed as a neutral comment without applying sentiment analysis. Our chatbot, Complainteller, successfully replies to those negative reviews with appropriate responses.

If you would like to talk to Complainteller, please feel free to register an account on Telegram and then you can start your conversation with @YelpReviews_bot. Telegram is a free, non-profit cloud-based instant messaging service; bots are represented as Telegram accounts operated by an automated program. They can respond to messages, be invited into groups and be integrated into other programs.

Future Work

For NLP and LDA modeling, we could run more tests on removing different common words to improve the topic groups. We could test more advanced sentiment score algorithms and models to improve the separation of complaint reviews.

The chatbot currently responds with a single reply to the most prominent topic. However, sometimes, customer complaints are mixed problems. We can further improve the chatbot to reply with an appropriate response incorporating multiple solutions. Additionally, the accuracy of the result depends on the volume of the text. Therefore, creating a common word database and tagging those words into different categories may improve the performance.

The post Real time Yelp reviews analysis and response solutions for restaurant owners appeared first on NYC Data Science Academy Blog.

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.


bupaR: Business Process Analysis with R

By Gert Janssenswillen

(This article was first published on R – Research group Business Informatics, and kindly contributed to R-bloggers)

Organizations are nowadays storing huge amounts of data related to various business processes. Process mining provides different methods and techniques to analyze and improve these processes. This allows companies to gain a competitive advantage. Process mining initiated with the discovery of work-flow models from event data. However, over the past 20 years, the process mining field has evolved into a broad and diverse research discipline.

bupaR is an open-source suite for the handling and analysis of business process data in R. It was developed by the Business Informatics research group at Hasselt University, Belgium. The central package includes basic functionality for creating event log objects in R. It contains several functions to get information about an event log and also provides specific event log versions of generic R functions. Together with the related packages, each of which has their own specific purpose, bupaR aims at supporting each step in the analysis of event data with R, from data import to online process monitoring.

The table below shows an example event log. Each row is an event which belongs to a case (a patient). Different events together can form an activity instance, or execution (e.g. event 2-4 belong to surgery 2). Each event in such an execution will have a different transactional lifecycle status. Note that there can be different instances of a specific activity (e.g. there are two surgeries in the example). Furthermore, each event has a timestamp, indicating when it happened, and a resource, indicating who performed it.

Given that the data shown above is stored in a data.frame, it can be turned into an event log object by indicating all the relevant data fields.

library(bupaR) 
data %>%
 eventlog(
   case_id = "patient",
   activity_id = "activity",
   activity_instance_id = "activity_instance",
   lifecycle_id = "status",
   timestamp = "timestamp",
   resource_id = "resource"
 )

Alternatively, event data can be read from XES-files. XES, eXtensible Event Stream notation, is the IEEE standard for storing and sharing event data. The xesreadR package, which is part of bupaR, provides the functions read_xes and write_xes as an interface between R and XES-files. The following statement shows how to read an event log from a xes-file, in this case with data on an order-to-cash (otc) process.

log_otc <- read_xes("order_to_cash.xes")  # illustrative file name/path

Event log objects can be visualized with processmapR. It allows the user to create a customizable dotted chart, showing all the events by time and case identifier in one graph. Precedence relations between activities can also be shown with a process map. Frequent traces, i.e. activity sequences, can be explored with the trace_explorer.

log_otc %>% 
 dotted_chart

log_otc %>%
 filter_trace_frequency(perc = 0.9) %>%
 process_map()

log_otc %>% 
  trace_explorer(coverage = 0.9)

edeaR stands for Exploratory and Descriptive Event-Data Analysis. This package provides several metric functions for in-depth analysis of event logs, as well as a diverse set of subsetting methods. The metrics can be calculated at different levels of granularity, allowing you to drill down into the data and focus on a specific part. Furthermore, all metrics are compatible with dplyr::group_by. The generic plot functions can be used to create predefined graphs, which can be customized using ggplot2.

The example below shows in how many cases each of the activities is present. This shows that in the given event log, there is a set of very common activities, and a set of very rare activities.

log_otc %>% activity_presence %>% plot

Next to the metrics, a varied set of event-data-specific subsetting methods is also provided. All of these functions are designed to work together with the piping symbol, as in the sketch below.
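For example, one of the subsetting functions used earlier can be piped straight into a metric (an illustrative combination of functions already shown above):

# keep the cases that cover the 80% most frequent traces,
# then look at how often each activity is present in that subset
log_otc %>%
  filter_trace_frequency(perc = 0.8) %>%
  activity_presence() %>%
  plot()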

Next to the packages discussed above, there is also the eventdataR package which contains example event datasets and the processmonitR package which provides predefined dashboards for online process monitoring. For more information about bupaR, you can visit the website where you can also find a cheat sheet.

To leave a comment for the author, please follow the link and comment on their blog: R – Research group Business Informatics.


Multicollinearity in R

By Bidyut Ghosh

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

One of the assumptions of the Classical Linear Regression Model is that there is no exact collinearity between the explanatory variables. If the explanatory variables are perfectly correlated, you will face these problems:

  • Parameters of the model become indeterminate
  • Standard errors of the estimates become infinitely large

However, perfect collinearity is very rare in practice. Imperfect or less-than-perfect multicollinearity is the more common problem, and it arises when, in multiple regression modelling, two or more of the explanatory variables are approximately linearly related.
The consequences are:

  • OLS estimators are still unbiased, but they have large variances and covariances, making precise estimation difficult
  • As a result, the confidence intervals tend to be wider. Therefore, we may not reject the “zero null hypothesis” (i.e. the true population coefficient is zero)
  • Because of the first consequence, the t ratios of one or more coefficients tend to be statistically insignificant
  • Even though some regression coefficients are statistically insignificant, the \(R^2\) value may be very high
  • The OLS estimators and their standard errors can be sensitive to small changes in the data

So, it is a must to detect collinearity as well as to remove it. Collinearity can be detected in the following ways:

  • The easiest way to detect multicollinearity is to examine the correlation between each pair of explanatory variables. If two of the variables are highly correlated, then this may be a possible source of multicollinearity. However, pair-wise correlation between the explanatory variables may be considered a sufficient, but not a necessary, condition for multicollinearity.
  • The second easy way of detecting multicollinearity is to estimate the multiple regression and then examine the output carefully. The rule of thumb for suspecting multicollinearity is a very high \(R^2\) while most of the coefficients are not significant according to their p-values. However, this cannot be considered an acid test for detecting multicollinearity; it only gives a rough indication of its presence.
  • The variance inflation factor provides a third check: for regressor \(X_j\), \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) is the coefficient of determination in the regression of \(X_j\) on the remaining regressors in the model. As \(R_j^2\) increases toward unity, that is, as the collinearity of \(X_j\) with the other regressors increases, \(VIF_j\) also increases, and in the limit it can be infinite. Therefore, we can use the \(VIF\) as an indicator of multicollinearity: the larger the value of \(VIF_j\), the more “troublesome” or collinear the variable \(X_j\). As a rule of thumb, if the \(VIF\) of a variable exceeds 10, which will happen if the multiple correlation coefficient for the j-th variable, \(R_j^2\), exceeds 0.90, that variable is said to be highly collinear (a quick VIF check in R is sketched just after this list).
  • The Farrar-Glauber test (F-G test) for multicollinearity is the best way to deal with the problem of multicollinearity.
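As a quick aside, VIFs are easy to obtain from any fitted lm object with the car package (the model and data names below are hypothetical):

library(car)

# hypothetical regression; replace with your own model
m <- lm(y ~ x1 + x2 + x3, data = mydata)

vif(m)  # values above 10 (i.e. R_j^2 above 0.90) flag highly collinear regressors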

The F-G test is, in fact, a set of three tests for testing multicollinearity

  • Firstly, a Chi-square test for detecting the existence and severity of multicollinearity in a model with several explanatory variables. The hypothesis to be tested in the first step is:
    $$ H_0 : \text{the X's are orthogonal} \qquad \text{against} \qquad H_1 : \text{the X's are not orthogonal} $$
    The variables are standardised using the sample size and standard deviation. From the standardised explanatory variables, a standardised determinant, also called the correlation determinant, is formed: its main diagonal elements are equal to unity and its off-diagonal elements are the simple correlation coefficients among the explanatory variables. For the three-variable case the standardised determinant is written as
    $$ \begin{vmatrix} 1 & r_{x_1 x_2} & r_{x_1 x_3} \\ r_{x_1 x_2} & 1 & r_{x_2 x_3} \\ r_{x_1 x_3} & r_{x_2 x_3} & 1 \end{vmatrix} $$
    Now, in the case of perfect multicollinearity the simple correlation coefficients are equal to unity, so the above determinant becomes zero. That is,
    $$ \begin{vmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{vmatrix} = 0 $$
    On the other hand, in the case of orthogonality of the explanatory variables the simple correlation coefficients are all equal to zero, so the standardised determinant becomes unity. That is,
    $$ \begin{vmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{vmatrix} = 1 $$
    In practice, however, as either the perfect multicollinearity or orthogonality is very rare, the above determinant lies between zero and unity, and there is some degree of multicollinearity in the model.
    Thus, the problem of multicollinearity may be considered as the departure from the orthogonality. The stronger the departure from the orthogonality (or the closer the value of the standardized determinant to zero), the stronger the degree of multicollinearity and vice versa.
    From this notion Farrar and Glauber have developed the Chi-square test for detecting the strength of the multicollinearity over the whole set of explanatory variables. The Chi-square test statistic is given by
    $$ \chi^2 = -\left[ n - 1 - \frac{1}{6}\left(2k + 5\right) \right] \cdot \log_e \left( \text{value of the standardised determinant} \right) $$
    Under the assumption of a true null hypothesis, the test statistic will follow a Chi-square distribution with \(df = \frac{1}{2} k(k-1)\).

    If the observed value of the Chi-square test statistic is found to be greater than the critical value of Chi-square at the desired level of significance, we reject the assumption of orthogonality and accept the presence of multicollinearity in the model.
    If the observed value of the Chi-square test statistic is found to be less than the critical value of Chi-square at the desired level of significance, we accept that there is no problem of multicollinearity in the model.

  • The second test in the Farrar-Glauber test is an F test for the location of multicollinearity. To do this, they compute the multiple correlation coefficients among the explanatory variables and test the statistical significance of these multiple correlation coefficients using an F test. The test statistic is given as
    $$ F = \frac{\left( R_{x_1 . x_2 x_3 \ldots x_k}^2 \right) / (k-1)}{\left( 1 - R_{x_1 . x_2 x_3 \ldots x_k}^2 \right) / (n-k)} $$
    where \(n = \text{sample size}\) and \(k = \text{number of explanatory variables}\)

    The hypothesis to be tested is given as
    $$ H_0 : R_{x_1 . x_2 x_3 \ldots x_k}^2 = 0 \qquad \text{against} \qquad H_1 : R_{x_1 . x_2 x_3 \ldots x_k}^2 \neq 0 $$

    If the observed value of \(F\) is found to be greater than the theoretical value of \(F\) with \((k-1, n-k)\) degrees of freedom at the desired level of significance, we accept that the variable \(X_i\) is multicollinear.
    On the other hand, if the observed value of \(F\) is less than the theoretical value of \(F\), we accept that the variable \(X_i\) is not multicollinear.
    You should note that there will be as many \(F\) tests as there are explanatory variables in the model.

  • Finally, the Farrar-Glauber test concludes with a t-test for the pattern of multicollinearity. In fact, this is a t-test which aims at detecting the variables which cause multicollinearity. To do this, the partial correlation coefficients among the explanatory variables are computed and their statistical significance is tested with a t test. For the three-variable model above, the partial correlation coefficients are given by
    $$ r_{x_1 x_2 . x_3}^2 = \frac{\left( r_{12} - r_{13} r_{23} \right)^2}{\left( 1 - r_{23}^2 \right)\left( 1 - r_{13}^2 \right)}, \qquad r_{x_1 x_3 . x_2}^2 = \frac{\left( r_{13} - r_{12} r_{23} \right)^2}{\left( 1 - r_{23}^2 \right)\left( 1 - r_{12}^2 \right)}, \qquad r_{x_2 x_3 . x_1}^2 = \frac{\left( r_{23} - r_{12} r_{13} \right)^2}{\left( 1 - r_{13}^2 \right)\left( 1 - r_{12}^2 \right)} $$
    For a statistical model involving more than three explanatory variables, we can develop similar formulas for the partial correlation coefficients.
    The hypothesis to be tested at this step is:
    $$ H_0 : r_{x_i x_j . x_1 x_2 \ldots x_k} = 0 \qquad \text{against} \qquad H_1 : r_{x_i x_j . x_1 x_2 \ldots x_k} \neq 0 $$
    After calculating all the partial correlation coefficients, their statistical significance is tested individually using the following test statistic
    $$ t = \frac{ r_{x_i x_j . x_1 x_2 \ldots x_k} \sqrt{n-k} }{ \sqrt{ 1 - r_{x_i x_j . x_1 x_2 \ldots x_k}^2 } } $$
    The above test statistic follows the t-distribution with \((n-k)\) degrees of freedom. Thus, if the computed value of the t-statistic is greater than the theoretical value of t with \((n-k)\) degrees of freedom at the desired level of significance, we accept that the variables \(X_i\) and \(X_j\) are responsible for the multicollinearity in the model; otherwise, the variables are not the cause of multicollinearity, since their partial correlation coefficient is not statistically significant.

Data Description
The datafile (wagesmicrodata.xls) is downloaded from http://lib.stat.cmu.edu/datasets/. It contains 534 observations on 11 variables sampled from the Current Population Survey of 1985. The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. We wish to determine whether wages are related to these characteristics.
In particular, we estimate the following model:

$$wage = \beta_0 + \beta_1 occupation + \beta_2 sector + \beta_3 union + \beta_4 education + \beta_5 experience + \beta_6 age + \beta_7 sex + \beta_8 marital\_status + \beta_9 race + \beta_{10} south + u$$

After estimating the above model and running post-estimation diagnostics in R, it turns out that using the log of wages as the dependent variable makes the variance noticeably more stable. Hence the log-transformed wage is used in the subsequent estimation, that is,
$$\ln(wage) = \beta_0 + \beta_1 occupation + \beta_2 sector + \beta_3 union + \beta_4 education + \beta_5 experience + \beta_6 age + \beta_7 sex + \beta_8 marital\_status + \beta_9 race + \beta_{10} south + u$$

Data Analysis in R

Import the data and attach it, so that the variables can be referenced directly in the code below.

library(readxl)
# the original assignment was truncated; an assumed read of the downloaded file
wagesmicrodata <- read_excel("wagesmicrodata.xls", sheet = 1)
attach(wagesmicrodata)

Fitting the Linear Model:
Assuming no multicollinearity, the model is estimated using the following code:

fit1 <- lm(log(WAGE) ~ OCCUPATION + SECTOR + UNION + EDUCATION +
             EXPERIENCE + AGE + SEX + MARR + RACE + SOUTH)

To get the model summary:

summary(fit1)

Call:
lm(formula = log(WAGE) ~ OCCUPATION + SECTOR + UNION + EDUCATION + 
    EXPERIENCE + AGE + SEX + MARR + RACE + SOUTH)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.16246 -0.29163 -0.00469  0.29981  1.98248 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.078596   0.687514   1.569 0.117291    
OCCUPATION  -0.007417   0.013109  -0.566 0.571761    
SECTOR       0.091458   0.038736   2.361 0.018589 *  
UNION        0.200483   0.052475   3.821 0.000149 ***
EDUCATION    0.179366   0.110756   1.619 0.105949    
EXPERIENCE   0.095822   0.110799   0.865 0.387531    
AGE         -0.085444   0.110730  -0.772 0.440671    
SEX         -0.221997   0.039907  -5.563 4.24e-08 ***
MARR         0.076611   0.041931   1.827 0.068259 .  
RACE         0.050406   0.028531   1.767 0.077865 .  
SOUTH       -0.102360   0.042823  -2.390 0.017187 *  
---
Signif. codes:  0 ‘***' 0.001 ‘**' 0.01 ‘*' 0.05 ‘.' 0.1 ‘ ' 1

Residual standard error: 0.4398 on 523 degrees of freedom
Multiple R-squared:  0.3185,	Adjusted R-squared:  0.3054 
F-statistic: 24.44 on 10 and 523 DF,  p-value: < 2.2e-16

Looking at the model summary, an R-squared of 0.31 is not bad for cross-sectional data with 534 observations. The F-value is highly significant, implying that the explanatory variables together significantly explain the log of wages. However, turning to the individual regression coefficients, as many as four variables (occupation, education, experience, age) are not statistically significant and two (marital status and south) are significant only at the 10% level.
Further, we can look at the diagnostic plots to check for other problems such as non-normality of the error term, heteroscedasticity, etc.

par(mfrow=c(2,2))
plot(fit1)

Gives this plot:

Thus, the diagnostic plots also look fair. So multicollinearity is possibly the reason why so many of the regression coefficients are insignificant.

For further diagnosis of the problem, let us first look at the pair-wise correlation among the explanatory variables.

# build the data frame of explanatory variables (the original line was truncated;
# column order follows the model above) and plot the pairwise correlations
X <- data.frame(OCCUPATION, SECTOR, UNION, EDUCATION, EXPERIENCE,
                AGE, SEX, MARR, RACE, SOUTH)
corrplot::corrplot(cor(X))   # assumed plotting call

Gives this plot:

The correlation plot shows that the pairwise correlations among the explanatory variables are not very high, except for the pair age-experience. The high correlation between age and experience might be the root cause of the multicollinearity.
Looking at the partial correlation coefficient matrix below, it is also clear that the partial correlations for experience-education, age-education and age-experience are quite high.
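As a quick check on this trio, the relevant entries of the correlation matrix can be pulled out directly. A minimal sketch, assuming X is the data frame of regressors constructed above:

# pairwise correlations among the suspect variables
round(cor(X[, c("EDUCATION", "EXPERIENCE", "AGE")]), 3)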

library(corpcor)
cor2pcor(cov(X))
             [,1]         [,2]         [,3]         [,4]        [,5]        [,6]         [,7]         [,8]         [,9]        [,10]
 [1,]  1.000000000  0.314746868  0.212996388  0.029436911  0.04205856 -0.04414029 -0.142750864 -0.018580965  0.057539374  0.008430595
 [2,]  0.314746868  1.000000000 -0.013531482 -0.021253493 -0.01326166  0.01456575 -0.112146760  0.036495494  0.006412099 -0.021518760
 [3,]  0.212996388 -0.013531482  1.000000000 -0.007479144 -0.01024445  0.01223890 -0.120087577  0.068918496 -0.107706183 -0.097548621
 [4,]  0.029436911 -0.021253493 -0.007479144  1.000000000 -0.99756187  0.99726160  0.051510483 -0.040302967  0.017230877 -0.031750193
 [5,]  0.042058560 -0.013261665 -0.010244447 -0.997561873  1.00000000  0.99987574  0.054977034 -0.040976643  0.010888486 -0.022313605
 [6,] -0.044140293  0.014565751  0.012238897  0.997261601  0.99987574  1.00000000 -0.053697851  0.045090327 -0.010803310  0.021525073
 [7,] -0.142750864 -0.112146760 -0.120087577  0.051510483  0.05497703 -0.05369785  1.000000000  0.004163264  0.020017315 -0.030152499
 [8,] -0.018580965  0.036495494  0.068918496 -0.040302967 -0.04097664  0.04509033  0.004163264  1.000000000  0.055645964  0.030418218
 [9,]  0.057539374  0.006412099 -0.107706183  0.017230877  0.01088849 -0.01080331  0.020017315  0.055645964  1.000000000 -0.111197596
[10,]  0.008430595 -0.021518760 -0.097548621 -0.031750193 -0.02231360  0.02152507 -0.030152499  0.030418218 -0.111197596  1.000000000

Farrar – Glauber Test

The ‘mctest’ package in R provides the Farrar-Glauber test and other relevant tests for multicollinearity. Two of its functions, ‘omcdiag’ and ‘imcdiag’, provide the overall and the individual diagnostic checking for multicollinearity, respectively.

library(mctest)
omcdiag(X,WAGE)
Call:
omcdiag(x = X, y = WAGE)


Overall Multicollinearity Diagnostics

                       MC Results detection
Determinant |X'X|:         0.0001         1
Farrar Chi-Square:      4833.5751         1
Red Indicator:             0.1983         0
Sum of Lambda Inverse: 10068.8439         1
Theil's Method:            1.2263         1
Condition Number:        739.7337         1

1 --> COLLINEARITY is detected 
0 --> COLLINEARITY in not detected by the test

===================================
Eigvenvalues with INTERCEPT
                   Intercept OCCUPATION SECTOR  UNION EDUCATION EXPERIENCE    AGE    SEX    MARR
Eigenvalues:          7.4264     0.9516 0.7635 0.6662    0.4205     0.3504 0.2672 0.0976  0.0462
Condition Indeces:    1.0000     2.7936 3.1187 3.3387    4.2027     4.6035 5.2719 8.7221 12.6725
                      RACE    SOUTH
Eigenvalues:        0.0103   0.0000
Condition Indeces: 26.8072 739.7337

The value of the standardized determinant is 0.0001, which is very small. The calculated value of the chi-square test statistic is 4833.5751 and it is highly significant, implying the presence of multicollinearity in the model specification.
This leads us to the next step of the Farrar-Glauber test, the F-test for the location of the multicollinearity.
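Before moving on, the overall chi-square statistic can be reproduced by hand from the correlation matrix of the regressors using the standard Farrar-Glauber formula, chi-square = -[n - 1 - (2k + 5)/6] * ln|R|. A minimal sketch, assuming X as constructed above:

# reproduce the overall Farrar-Glauber chi-square statistic by hand
n <- nrow(X); k <- ncol(X)
R <- cor(X)
det(R)                                      # the standardized determinant |R|
chi_sq <- -(n - 1 - (2 * k + 5) / 6) * log(det(R))
chi_sq                                      # should be close to the 'Farrar Chi-Square' row above
pchisq(chi_sq, df = k * (k - 1) / 2, lower.tail = FALSE)   # p-value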

imcdiag(X,WAGE)
Call:
imcdiag(x = X, y = WAGE)


All Individual Multicollinearity Diagnostics Result

                 VIF    TOL          Wi          Fi Leamer      CVIF Klein
OCCUPATION    1.2982 0.7703     17.3637     19.5715 0.8777    1.3279     0
SECTOR        1.1987 0.8343     11.5670     13.0378 0.9134    1.2260     0
UNION         1.1209 0.8922      7.0368      7.9315 0.9445    1.1464     0
EDUCATION   231.1956 0.0043  13402.4982  15106.5849 0.0658  236.4725     1
EXPERIENCE 5184.0939 0.0002 301771.2445 340140.5368 0.0139 5302.4188     1
AGE        4645.6650 0.0002 270422.7164 304806.1391 0.0147 4751.7005     1
SEX           1.0916 0.9161      5.3351      6.0135 0.9571    1.1165     0
MARR          1.0961 0.9123      5.5969      6.3085 0.9551    1.1211     0
RACE          1.0371 0.9642      2.1622      2.4372 0.9819    1.0608     0
SOUTH         1.0468 0.9553      2.7264      3.0731 0.9774    1.0707     0

1 --> COLLINEARITY is detected 
0 --> COLLINEARITY in not detected by the test

OCCUPATION , SECTOR , EDUCATION , EXPERIENCE , AGE , MARR , RACE , SOUTH , coefficient(s) are non-significant may be due to multicollinearity

R-square of y on all x: 0.2805 

* use method argument to check which regressors may be the reason of collinearity
===================================

The VIF, TOL and Wi columns provide the diagnostic output for the variance inflation factor, the tolerance and the Farrar-Glauber F-test, respectively. The F statistic (Wi) for the variable ‘experience’ is extremely large (301771.2445), followed by ‘age’ (270422.7164) and ‘education’ (13402.4982); the corresponding VIFs (5184.0939, 4645.6650 and 231.1956) tell the same story. The degrees of freedom are ((k-1, n-k)), i.e. (9, 524), and at the 5% level of significance the theoretical value of F is 1.89774. Thus, the F test indicates that ‘experience’, ‘age’ and/or ‘education’ are at the root of the multicollinearity. Although the F-value for ‘education’ is also significant, this may simply be a consequence of including the highly collinear variables ‘age’ and ‘experience’.
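Both the critical value quoted above and the link between the Wi column and the VIFs can be checked in a couple of lines. Since VIF = 1/(1 - R^2) for the auxiliary regression, the F statistic from the formula given earlier equals (VIF - 1)(n - k)/(k - 1):

# critical F value for the location test at the 5% level, df = (k - 1, n - k)
n <- 534; k <- 10
qf(0.95, df1 = k - 1, df2 = n - k)          # roughly 1.90

# recover the Wi (F) statistic for EXPERIENCE from its VIF
vif_experience <- 5184.0939
(vif_experience - 1) * (n - k) / (k - 1)    # about 301771, matching the Wi entry in the table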
Finally, to examine the pattern of multicollinearity, we conduct t-tests on the partial correlation coefficients. With ten explanatory variables there are 45 pairs of partial correlation coefficients. In R, several packages provide partial correlation coefficients along with the t-test for their significance level; we’ll use the ‘ppcor’ package to compute the partial correlation coefficients together with the t-statistics and corresponding p-values.

library(ppcor)
pcor(X, method = "pearson")
$estimate
             OCCUPATION       SECTOR        UNION    EDUCATION  EXPERIENCE         AGE          SEX         MARR         RACE        SOUTH
OCCUPATION  1.000000000  0.314746868  0.212996388  0.029436911  0.04205856 -0.04414029 -0.142750864 -0.018580965  0.057539374  0.008430595
SECTOR      0.314746868  1.000000000 -0.013531482 -0.021253493 -0.01326166  0.01456575 -0.112146760  0.036495494  0.006412099 -0.021518760
UNION       0.212996388 -0.013531482  1.000000000 -0.007479144 -0.01024445  0.01223890 -0.120087577  0.068918496 -0.107706183 -0.097548621
EDUCATION   0.029436911 -0.021253493 -0.007479144  1.000000000 -0.99756187  0.99726160  0.051510483 -0.040302967  0.017230877 -0.031750193
EXPERIENCE  0.042058560 -0.013261665 -0.010244447 -0.997561873  1.00000000  0.99987574  0.054977034 -0.040976643  0.010888486 -0.022313605
AGE        -0.044140293  0.014565751  0.012238897  0.997261601  0.99987574  1.00000000 -0.053697851  0.045090327 -0.010803310  0.021525073
SEX        -0.142750864 -0.112146760 -0.120087577  0.051510483  0.05497703 -0.05369785  1.000000000  0.004163264  0.020017315 -0.030152499
MARR       -0.018580965  0.036495494  0.068918496 -0.040302967 -0.04097664  0.04509033  0.004163264  1.000000000  0.055645964  0.030418218
RACE        0.057539374  0.006412099 -0.107706183  0.017230877  0.01088849 -0.01080331  0.020017315  0.055645964  1.000000000 -0.111197596
SOUTH       0.008430595 -0.021518760 -0.097548621 -0.031750193 -0.02231360  0.02152507 -0.030152499  0.030418218 -0.111197596  1.000000000

$p.value
             OCCUPATION       SECTOR        UNION EDUCATION EXPERIENCE       AGE         SEX      MARR       RACE      SOUTH
OCCUPATION 0.000000e+00 1.467261e-13 8.220095e-07 0.5005235  0.3356824 0.3122902 0.001027137 0.6707116 0.18763758 0.84704000
SECTOR     1.467261e-13 0.000000e+00 7.568528e-01 0.6267278  0.7615531 0.7389200 0.010051378 0.4035489 0.88336002 0.62243025
UNION      8.220095e-07 7.568528e-01 0.000000e+00 0.8641246  0.8146741 0.7794483 0.005822656 0.1143954 0.01345383 0.02526916
EDUCATION  5.005235e-01 6.267278e-01 8.641246e-01 0.0000000  0.0000000 0.0000000 0.238259049 0.3562616 0.69337880 0.46745162
EXPERIENCE 3.356824e-01 7.615531e-01 8.146741e-01 0.0000000  0.0000000 0.0000000 0.208090393 0.3482728 0.80325456 0.60962999
AGE        3.122902e-01 7.389200e-01 7.794483e-01 0.0000000  0.0000000 0.0000000 0.218884070 0.3019796 0.80476248 0.62232811
SEX        1.027137e-03 1.005138e-02 5.822656e-03 0.2382590  0.2080904 0.2188841 0.000000000 0.9241112 0.64692038 0.49016279
MARR       6.707116e-01 4.035489e-01 1.143954e-01 0.3562616  0.3482728 0.3019796 0.924111163 0.0000000 0.20260170 0.48634504
RACE       1.876376e-01 8.833600e-01 1.345383e-02 0.6933788  0.8032546 0.8047625 0.646920379 0.2026017 0.00000000 0.01070652
SOUTH      8.470400e-01 6.224302e-01 2.526916e-02 0.4674516  0.6096300 0.6223281 0.490162786 0.4863450 0.01070652 0.00000000

$statistic
           OCCUPATION     SECTOR      UNION    EDUCATION   EXPERIENCE          AGE         SEX        MARR       RACE      SOUTH
OCCUPATION  0.0000000  7.5906763  4.9902208    0.6741338    0.9636171   -1.0114033 -3.30152873 -0.42541117  1.3193223  0.1929920
SECTOR      7.5906763  0.0000000 -0.3097781   -0.4866246   -0.3036001    0.3334607 -2.58345399  0.83597695  0.1467827 -0.4927010
UNION       4.9902208 -0.3097781  0.0000000   -0.1712102   -0.2345184    0.2801822 -2.76896848  1.58137652 -2.4799336 -2.2436907
EDUCATION   0.6741338 -0.4866246 -0.1712102    0.0000000 -327.2105031  308.6803174  1.18069629 -0.92332727  0.3944914 -0.7271618
EXPERIENCE  0.9636171 -0.3036001 -0.2345184 -327.2105031    0.0000000 1451.9092015  1.26038801 -0.93878671  0.2492636 -0.5109090
AGE        -1.0114033  0.3334607  0.2801822  308.6803174 1451.9092015    0.0000000 -1.23097601  1.03321563 -0.2473135  0.4928456
SEX        -3.3015287 -2.5834540 -2.7689685    1.1806963    1.2603880   -1.2309760  0.00000000  0.09530228  0.4583091 -0.6905362
MARR       -0.4254112  0.8359769  1.5813765   -0.9233273   -0.9387867    1.0332156  0.09530228  0.00000000  1.2757711  0.6966272
RACE        1.3193223  0.1467827 -2.4799336    0.3944914    0.2492636   -0.2473135  0.45830912  1.27577106  0.0000000 -2.5613138
SOUTH       0.1929920 -0.4927010 -2.2436907   -0.7271618   -0.5109090    0.4928456 -0.69053623  0.69662719 -2.5613138  0.0000000

$n
[1] 534

$gp
[1] 8

$method
[1] "pearson"

As expected, the high partial correlation between ‘age’ and ‘experience’ is found to be highly statistically significant, and the same holds for ‘education’-‘experience’ and ‘education’-‘age’. Note that even some of the small partial correlation coefficients turn out to be highly significant. Thus, the Farrar-Glauber test points to ‘age’, ‘experience’ and ‘education’ as the root of the multicollinearity problem.
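The $statistic matrix above is simply the t transform of the $estimate matrix: ppcor uses n - 2 - gp degrees of freedom, which here equals 534 - 2 - 8 = 524, the same (n - k) as in the Farrar-Glauber t-test. A quick hand check for the age-experience entry:

# recover the t statistic for the AGE-EXPERIENCE partial correlation by hand
r  <- 0.99987574            # partial correlation from the $estimate matrix above
n  <- 534; gp <- 8          # sample size and number of variables partialled out
df <- n - 2 - gp            # 524 degrees of freedom
r * sqrt(df) / sqrt(1 - r^2)   # roughly 1451.9, matching the $statistic entry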

Remedial Measures

There are several remedial measures for dealing with multicollinearity, such as principal component regression, ridge regression and stepwise regression.
However, in the present case, I’ll simply exclude a variable whose VIF is far above 10 and which also seems logically redundant. Age and experience will certainly be correlated, so why use both? If we use ‘age’ (or ‘age-squared’), it will also reflect the experience of the respondent. Thus, we build a model excluding ‘experience’, estimate it and run the multicollinearity diagnostics again.

fit2 <- lm(log(WAGE) ~ OCCUPATION + SECTOR + UNION + EDUCATION +
             AGE + SEX + MARR + RACE + SOUTH)
summary(fit2)

Call:
lm(formula = log(WAGE) ~ OCCUPATION + SECTOR + UNION + EDUCATION + 
    AGE + SEX + MARR + RACE + SOUTH)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.16018 -0.29085 -0.00513  0.29985  1.97932 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.501358   0.164794   3.042 0.002465 ** 
OCCUPATION  -0.006941   0.013095  -0.530 0.596309    
SECTOR       0.091013   0.038723   2.350 0.019125 *  
UNION        0.200018   0.052459   3.813 0.000154 ***
EDUCATION    0.083815   0.007728  10.846  

Now, looking at the significance levels, eight of the nine regression coefficients are statistically significant. The R-squared value is 0.31 and the F-value is again large and significant.
The VIF values of the explanatory variables have also come down to much lower values.

library(car)   # assumed source of vif(); other packages also provide a vif() function
vif(fit2)
OCCUPATION     SECTOR      UNION  EDUCATION        AGE        SEX       MARR       RACE      SOUTH 
  1.295935   1.198460   1.120743   1.125994   1.154496   1.088334   1.094289   1.037015 1.046306 

So, the model is now free from multicollinearity.


    New Package GetITRData

    By R and Finance

    (This article was first published on R and Finance, and kindly contributed to R-bloggers)

    Downloading Quarterly Financial Reports from Bovespa –

    Financial statements of companies traded at B3 (formerly Bovespa), the
    Brazilian stock exchange, are available on its website. Accessing the
    data for a single company is straightforward: the website offers a
    simple interface for accessing this dataset. An example is given here.
    However, gathering and organizing the data for large scale research,
    with many companies and many quarters, is painful. Quarterly reports
    must be downloaded or copied individually and later aggregated. Changes
    in the accounting format throughout time can make this process slow,
    unreliable and irreproducible.

    Package GetITRData provides an R interface to all financial statements
    available on the website. It not only downloads the data but also
    organizes it in a tabular format and allows the use of inflation
    indexes. Users can simply select companies and a time period to download
    all available data. Information about companies, such as sector and
    available quarters, is also within reach. The main purpose of the
    package is to make it easy to access quarterly financial statements in
    large scale research, facilitating the reproducibility of corporate
    finance studies with Bovespa data.

    Installation

    The package is available in CRAN (release version) and in Github
    (development version). You can install any of those with the following
    code:

    # Release version in CRAN 
    install.packages('GetITRData') 
    
    # Development version in Github
    devtools::install_github('msperlin/GetITRData') 
    

    How to use GetITRData

    The starting point of GetITRData is to find the official names of
    companies in Bovespa. Function gitrd.search.company serves this
    purpose. Given a string (text), it will search for partial matches in
    company names. As an example, let’s find the official name of
    Petrobras, one of the largest companies in Brazil:

    library(GetITRData)
    
    gitrd.search.company('petrobras')
    
    ## 
    ## Reading info file from github
    ## Found 24012 lines for 603 companies  [Actives =  457  Inactives =  146 ]
    ## Last file update:  2017-09-19
    ## 
    ## Found 1 companies:
    ## PETROBRAS | situation = ATIVO | first date = 1998-09-30 | last date - 2017-06-30
    

    Its official name in the Bovespa records is PETROBRAS. Data for quarterly
    statements is available from 1998 to 2017. The situation of the company,
    active or canceled, is also given. This helps verify the availability
    of data.

    The content of all available quarterly statements can be accessed with
    function gitrd.get.info.companies. It will read and parse a .csv file
    from my github repository, which will be periodically updated for new
    quarterly statements. Let’s try it out:

    df.info <- gitrd.get.info.companies()   # original call truncated; default arguments assumed

    This file includes several pieces of information gathered from Bovespa:
    company names, sectors, dates of the quarterly statements and, most
    importantly, the links to download the files. The resulting dataframe
    can be used to filter and gather information for large scale research,
    such as downloading financial data for a specific sector.
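    For instance, one could keep only the companies of a single sector before
    downloading anything. A minimal sketch, using hypothetical column names
    ('sector' and 'name.company'); check names(df.info) for the actual ones:

    # hypothetical column names; inspect names(df.info) before running
    library(dplyr)
    df.banks <- filter(df.info, sector == 'Bancos')
    unique(df.banks$name.company)   # candidate names for the download step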

    Downloading financial information for ONE company

    All you need to download financial data with GetITRData are the
    official names of companies, which can be found with
    gitrd.search.company, the desired starting and ending dates and the
    type of financial information (individual or consolidated). Let’s try it
    for PETROBRAS:

    name.companies <- 'PETROBRAS'
    # the rest of this chunk, the call that downloads the quarterly reports into
    # df.reports, was truncated in the original post

    The resulting object is a tibble, a data.frame type of object that
    allows for list columns. Let’s have a look at its content:

    str(df.reports)
    
    ## Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  10 variables:
    ##  $ company.name: chr "PETROBRAS"
    ##  $ company.code: int 9512
    ##  $ type.info   : chr "individual"
    ##  $ min.date    : Date, format: "2005-03-31"
    ##  $ max.date    : Date, format: "2005-09-30"
    ##  $ n.quarters  : int 3
    ##  $ assets      :List of 1
    ##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 191 obs. of  5 variables:
    ##   .. ..$ acc.number  : chr  "1" "1.01" "1.01.01" "1.01.01.01" ...
    ##   .. ..$ acc.desc    : chr  "Ativo Total" "Ativo Circulante" "Disponibilidades" "Caixa e Bancos" ...
    ##   .. ..$ acc.value   : int  147026464 41052623 15146496 1498194 13648302 9450575 3488650 4955111 1090180 -83366 ...
    ##   .. ..$ ref.date    : Date, format: "2005-09-30" ...
    ##   .. ..$ company.name: chr  "PETROBRAS" "PETROBRAS" "PETROBRAS" "PETROBRAS" ...
    ##  $ liabilities :List of 1
    ##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 160 obs. of  5 variables:
    ##   .. ..$ acc.number  : chr  "2" "2.01" "2.01.01" "2.01.01.01" ...
    ##   .. ..$ acc.desc    : chr  "Passivo Total" "Passivo Circulante" "Empréstimos e Financiamentos" "Financiamentos" ...
    ##   .. ..$ acc.value   : int  147026464 44603494 1154258 1039841 114417 0 5205735 7429086 1156875 935603 ...
    ##   .. ..$ ref.date    : Date, format: "2005-09-30" ...
    ##   .. ..$ company.name: chr  "PETROBRAS" "PETROBRAS" "PETROBRAS" "PETROBRAS" ...
    ##  $ income      :List of 1
    ##   ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 100 obs. of  5 variables:
    ##   .. ..$ acc.number  : chr  "3.01" "3.02" "3.03" "3.04" ...
    ##   .. ..$ acc.desc    : chr  "Receita Bruta de Vendas e/ou Serviços" "Deduções da Receita Bruta" "Receita Líquida de Vendas e/ou Serviços" "Custo de Bens e/ou Serviços Vendidos" ...
    ##   .. ..$ acc.value   : int  37870949 -9778549 28092400 -15030559 13061841 -4270217 -1221668 -895230 -899 -894331 ...
    ##   .. ..$ ref.date    : Date, format: "2005-09-30" ...
    ##   .. ..$ company.name: chr  "PETROBRAS" "PETROBRAS" "PETROBRAS" "PETROBRAS" ...
    ##  $ df.cashflow :List of 1
    ##   ..$ :'data.frame': 0 obs. of  5 variables:
    ##   .. ..$ acc.desc    : Named list()
    ##   .. ..$ acc.value   : logi 
    ##   .. ..$ acc.number  : chr 
    ##   .. ..$ company.name: chr 
    ##   .. ..$ ref.date    :Class 'Date'  num(0) 
    ##   .. ..- attr(*, "na.action")=Class 'omit'  Named int [1:3] 1 2 3
    ##   .. .. .. ..- attr(*, "names")= chr [1:3] "1" "2" "3"
    

    Object df.reports only has one row since we asked for data on a single
    company. The number of rows increases with the number of companies, as
    we will see in the next example. All financial statements for the
    different quarters are available within df.reports. For example, the
    income statements for all desired quarters of PETROBRAS are:

    df.income.long <- df.reports$income[[1]]   # original assignment truncated;
                                               # the income statements are a list column

    The resulting dataframe is in the long format, ready for processing. In
    the long format, financial statements of different quarters are stacked.
    In the wide format, we have the columns as dates and rows as account names/ids. If you want the wide
    format, which I believe is most common in financial analysis, you can
    use function gitrd.convert.to.wide. See an example next:

    df.income.wide <- gitrd.convert.to.wide(df.income.long)   # original call truncated
    acc.number acc.desc company.name 2005-03-31 2005-06-30 2005-09-30
    3.01 Receita Bruta de Vendas e/ou Serviços PETROBRAS 31355183 35425584 37870949
    3.02 Deduções da Receita Bruta PETROBRAS -8788723 -9321322 -9778549
    3.03 Receita Líquida de Vendas e/ou Serviços PETROBRAS 22566460 26104262 28092400
    3.04 Custo de Bens e/ou Serviços Vendidos PETROBRAS -12052044 -14530594 -15030559
    3.05 Resultado Bruto PETROBRAS 10514416 11573668 13061841
    3.06 Despesas/Receitas Operacionais PETROBRAS -2869703 -5249799 -4270217
    3.06.01 Com Vendas PETROBRAS -858170 -820899 -1221668
    3.06.02 Gerais e Administrativas PETROBRAS -768830 -880185 -895230
    3.06.02.01 Honor.Diretoria e Cons. Administração PETROBRAS -984 -886 -899
    3.06.02.02 De Administração PETROBRAS -767846 -879299 -894331
    3.06.03 Financeiras PETROBRAS -53787 -480281 -282706
    3.06.03.01 Receitas Financeiras PETROBRAS 525452 106753 272290
    3.06.03.02 Despesas Financeiras PETROBRAS -579239 -587034 -554996
    3.06.04 Outras Receitas Operacionais PETROBRAS 0 0 0
    3.06.05 Outras Despesas Operacionais PETROBRAS -2104923 -3155693 -1956858
    3.06.05.01 Custos Explot.p/Extração Petróleo/Gás PETROBRAS -107010 -101527 -334116
    3.06.05.02 Custo c/ Pesq. Desenv. Tecnológico PETROBRAS -192741 -221813 -247456
    3.06.05.03 Tributárias PETROBRAS -185581 -290086 -114519
    3.06.05.04 Variações Monetárias e Cambiais Líquidas PETROBRAS -117821 -921626 -401757
    3.06.05.05 Outras Despesas/Receitas Oper. Liquidas PETROBRAS -1501770 -1620641 -859010
    3.06.06 Resultado da Equivalência Patrimonial PETROBRAS 916007 87259 86245
    3.07 Resultado Operacional PETROBRAS 7644713 6323869 8791624
    3.08 Resultado Não Operacional PETROBRAS -151498 -64670 1064
    3.08.01 Receitas PETROBRAS 1256 8805 450552
    3.08.02 Despesas PETROBRAS -152754 -73475 -449488
    3.09 Resultado Antes Tributação/Participações PETROBRAS 7493215 6259199 8792688
    3.10 Provisão para IR e Contribuição Social PETROBRAS -1847762 -1151342 -3002751
    3.11 IR Diferido PETROBRAS -538132 -408725 -111709
    3.12 Participações/Contribuições Estatutárias PETROBRAS 0 0 0
    3.12.01 Participações PETROBRAS 0 0 0
    3.12.01.01 Partic. de Empregados e administradores PETROBRAS 0 NA NA
    3.12.02 Contribuições PETROBRAS 0 0 0
    3.13 Reversão dos Juros sobre Capital Próprio PETROBRAS 0 0 0
    3.15 Lucro/Prejuízo do Período PETROBRAS 5107321 4699132 5678228

    Downloading financial information for SEVERAL companies

    If you are doing serious research, it is likely that you need financial
    statements for more than one company. Package GetITRData is specially
    designed for handling large scale download of data. Let’s build a case
    with 5 randomly selected companies:

    set.seed(5)
    # original lines truncated: sample five company names from df.info (the column
    # name 'name.company' is an assumption) and rebuild df.reports for them with
    # the same download call used in the single-company example above
    my.companies <- sample(unique(df.info$name.company), 5)

    And now we can check the resulting tibble:

    head(df.reports )
    
    ## # A tibble: 4 x 10
    ##                  company.name company.code  type.info   min.date
    ##                         
    ## 1                BANESTES S/A         1155 individual 2005-03-31
    ## 2 CIA. IGUAÇU DE CAFÉ SOLÚVEL         3336 individual 2005-03-31
    ## 3                     MUNDIAL         5312 individual 2005-03-31
    ## 4   TELEMAR PARTICIPAÇÔES S.A        18678 individual 2005-03-31
    ## # ... with 6 more variables: max.date , n.quarters ,
    ## #   assets , liabilities , income , df.cashflow 

    Every row of df.reports will provide information for one company.
    Metadata about the corresponding dataframes such as min/max dates is
    available in the first columns. Keeping a tabular structure facilitates
    the organization and future processing of all financial data. We can use
    tibble df.reports for creating other dataframes in the long format
    containing data for all companies. See next, where we create dataframes
    with the assets and liabilities of all companies:

    df.assets      <- do.call(rbind, df.reports$assets)        # original lines truncated;
    df.liabilities <- do.call(rbind, df.reports$liabilities)   # one way to stack the list columns

    As an example, let’s use the resulting dataframe for calculating and
    analyzing a simple liquidity index of a company, the total of current
    (liquid) assets (Ativo circulante) divided by the total of current
    short term liabilities (Passivo Circulante), over time.

    library(dplyr)
    
    ## 
    ## Attaching package: 'dplyr'
    
    ## The following objects are masked from 'package:stats':
    ## 
    ##     filter, lag
    
    ## The following objects are masked from 'package:base':
    ## 
    ##     intersect, setdiff, setequal, union
    
    my.tab <- rbind(df.assets, df.liabilities) %>%   # combine both statements (original line truncated)
      group_by(company.name, ref.date) %>%
      summarise(Liq.Index = acc.value[acc.number == '1.01']/ acc.value[acc.number == '2.01'])
    
    my.tab
    
    ## # A tibble: 12 x 3
    ## # Groups:   company.name [?]
    ##                   company.name   ref.date Liq.Index
    ##                          
    ##  1                BANESTES S/A 2005-03-31 0.7985816
    ##  2                BANESTES S/A 2005-06-30 0.8393188
    ##  3                BANESTES S/A 2005-09-30 0.9151590
    ##  4 CIA. IGUAÇU DE CAFÉ SOLÚVEL 2005-03-31 2.3459918
    ##  5 CIA. IGUAÇU DE CAFÉ SOLÚVEL 2005-06-30 2.7291873
    ##  6 CIA. IGUAÇU DE CAFÉ SOLÚVEL 2005-09-30 2.1644541
    ##  7                     MUNDIAL 2005-03-31 0.7903105
    ##  8                     MUNDIAL 2005-06-30 0.8555824
    ##  9                     MUNDIAL 2005-09-30 0.9052814
    ## 10   TELEMAR PARTICIPAÇÔES S.A 2005-03-31 1.5650393
    ## 11   TELEMAR PARTICIPAÇÔES S.A 2005-06-30 0.9419386
    ## 12   TELEMAR PARTICIPAÇÔES S.A 2005-09-30 1.3275712
    

    Now we can visualize the information using ggplot2:

    library(ggplot2)
    
    # original plot code truncated; a minimal sketch of one possible chart
    p <- ggplot(my.tab, aes(x = ref.date, y = Liq.Index, fill = company.name)) +
      geom_col(position = 'dodge')
    print(p)

    As we can see, CIA IGUACU is the company with the highest liquidity,
    able to pay its short term debt with its current assets in every
    quarter. We can certainly do a lot more interesting studies based on
    this dataset.

    Exporting financial data

    The package includes function gitrd.export.ITR.data for exporting the
    financial data to an Excel file. Users can choose between the long and
    wide format. See next:

    my.basename <- 'GetITRData_export'   # base name for the Excel file (assumed value)
    # the call to gitrd.export.ITR.data that writes df.reports to the file was
    # truncated in the original post

    The resulting Excel file contains all data available in df.reports.
