Electric Power System simulations using R

By Tal Galili


The field of electric power systems engineering relies heavily on computer simulations for analysis, given the scale and complexity of the networks involved. These simulations aid the planning, operation and management of the system.
Computer simulations have been implemented using several scientific computing tools. However, I had not yet seen any implementations using R, and this inspired my thesis at a German institution. I am now privileged to have implemented several power systems analysis simulations using R. In the process, I started by creating standalone scripts for several topics of interest in this domain. Examples are:

      • Power flow simulation (Gauss-Seidel & Newton-Raphson techniques)
      • Time domain simulation of the Single-Machine Infinite Bus (SMIB) system using the Modified Euler numerical integration method.
      • Multi-machine transient stability simulation using the 'deSolve' package for solving the n ODEs that describe the dynamics of interconnected generating machines.

      See screenshots below.

      Figure 1: Power flow simulation of the IEEE 14-bus system (Newton-Raphson technique)

      The power flow simulation consists of evaluating the flow of power and voltages of a network for a specified terminal or bus condition. The results of a power flow solution are used to evaluate loadability of lines or transformers and the acceptability of voltage profiles.
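      To make the idea concrete, here is a minimal Gauss-Seidel iteration for a small three-bus system. This is not the thesis code; the admittance matrix and scheduled injections are illustrative textbook-style values:

```r
# Illustrative 3-bus Gauss-Seidel power flow: bus 1 is the slack bus,
# buses 2 and 3 are PQ (load) buses. All values are made up for illustration.
Y <- matrix(c( 20-50i, -10+20i, -10+30i,
              -10+20i,  26-52i, -16+32i,
              -10+30i, -16+32i,  26-62i), nrow = 3, byrow = TRUE)  # bus admittance matrix
S <- c(NA, -2.566-1.102i, -1.386-0.452i)  # scheduled complex power injections (pu)
V <- c(1.05+0i, 1+0i, 1+0i)               # slack voltage fixed, flat start elsewhere
for (iter in 1:100) {
  Vold <- V
  for (i in 2:3) {  # V_i = (conj(S_i)/conj(V_i) - sum_{k != i} Y_ik V_k) / Y_ii
    V[i] <- (Conj(S[i]) / Conj(V[i]) - sum(Y[i, -i] * V[-i])) / Y[i, i]
  }
  if (max(abs(V - Vold)) < 1e-8) break    # stop once voltages settle
}
round(V, 4)
```

      Because each updated voltage is used immediately within the same sweep, this in-place loop is exactly the Gauss-Seidel (as opposed to Jacobi) scheme.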

      Figure 2: Time domain simulation of an SMIB system

      The infinite bus has a great advantage in explaining many of the problems associated with power systems. When analytical solutions are sought, the infinite busbar provides a reference for all angular changes in power system parameters. This simulation concerns the transient stability performance of a single generator that is connected to the infinite bus (or grid).
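      As a rough illustration of the Modified Euler (Heun) scheme applied to the classical swing equation, here is a minimal sketch. The system constants and the initial disturbance are invented for illustration; this is not the thesis code:

```r
# Modified Euler integration of the SMIB swing equation (illustrative constants)
f0 <- 60; H <- 5                 # system frequency (Hz), inertia constant (s)
Pm <- 0.8; Pmax <- 2.0           # mechanical input and peak electrical power (pu)
derivs <- function(delta, dw) {  # returns c(d delta/dt, d dw/dt)
  c(dw, (pi * f0 / H) * (Pm - Pmax * sin(delta)))
}
dt <- 0.01; t <- seq(0, 2, dt)
delta <- asin(Pm / Pmax) + 0.5   # rotor angle perturbed from equilibrium (rad)
dw <- 0                          # initial speed deviation (rad/s)
angles <- numeric(length(t))
for (k in seq_along(t)) {
  angles[k] <- delta
  k1 <- derivs(delta, dw)                            # predictor slope
  k2 <- derivs(delta + dt * k1[1], dw + dt * k1[2])  # corrector slope
  delta <- delta + dt * (k1[1] + k2[1]) / 2          # average the two slopes
  dw    <- dw    + dt * (k1[2] + k2[2]) / 2
}
plot(t, angles, type = "l", xlab = "time (s)", ylab = "rotor angle (rad)")
```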


      Figure 3: Transient stability simulation of the IEEE 39-bus 10 generator system

      Transient stability analysis deals with the effects of large, sudden disturbances such as the outage of a line, the occurrence of a fault or the sudden loss of load. Studying the transient stability of the power system is required to understand whether the system can withstand the transient situation following a major perturbation. It is very insightful in the planning and operation of power systems.

      I also tried out web-based implementations of these simulations, with great results. The 'shiny' package developed by RStudio was used, with interactivity achieved through reactive programming. The use of tab panels made it possible to integrate many simulation views (e.g. one can view power flow and stability simulations on one web page). Reactive programming removed the need to go back to the input data file to modify system parameters before re-simulating. For example, in the web-based multi-machine transient stability program, the circuit breaker clearing time, simulation time, faulted bus and line to be cleared are all reactive inputs. See the screenshot (Figure 4) below:


      Figure 4: Web-based Power flow and Multi-machine transient stability simulation
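      A minimal sketch of how such a reactive, tab-panelled app fits together (the UI layout and the simulate_stability() helper are invented for illustration; this is not the RPowerLABS code):

```r
library(shiny)
ui <- fluidPage(
  tabsetPanel(
    tabPanel("Power flow", plotOutput("pf")),
    tabPanel("Transient stability",
             numericInput("tclear", "Breaker clearing time (s)", value = 0.1),
             plotOutput("stab"))
  )
)
server <- function(input, output) {
  output$stab <- renderPlot({
    # input$tclear is reactive: editing it in the browser re-runs the
    # simulation, with no need to touch the input data file
    plot(simulate_stability(tclear = input$tclear))  # hypothetical helper
  })
}
# shinyApp(ui, server)
```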

      These simulations, used as a case study for my thesis, have since grown into 'The RPowerLABS Project', with diverse simulations and potential. In a subsequent post, I shall highlight the offerings of this project and its impact for developing nations.

      Some reasons why R is suitable for electrical power system engineering analysis and simulation are:

          • R can handle complex number operations
          • R is great at basic mathematical operations
          • Vector/matrix arithmetic is straightforward in R
          • Numerical linear algebra is well taken care of
          • Sparse matrices are catered for by R (via add-on packages)
          • R has very attractive visualization facilities
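          A few one-line illustrations of these points (values are arbitrary):

```r
z <- complex(real = 0.95, imaginary = -0.12)  # a bus voltage phasor, per unit
Mod(z); Arg(z) * 180 / pi                     # magnitude and angle (degrees)
A <- matrix(c(4, 1, 1, 3), 2, 2)
solve(A, c(1, 2))                             # dense linear solve
library(Matrix)                               # sparse matrices via an add-on package
B <- sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 2), x = c(2, -1, 3))
solve(B, c(1, 1))                             # sparse linear solve
```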

          If R is proven suitable for electrical power system engineering analysis, how about other engineering sub-domains?

          Note: The major idea of this post is to inform the R community of the application of R to electric power system engineering, and that there exists a lot of potential for this in academia (especially e-learning). Hence, we have not shown the nitty-gritty of the simulation examples in this post. Watch out for related upcoming posts!

          This is a guest post by Ben Ubah.

          Source:: R News


          States In Two Words v1

          By aschinchon

          (This article was first published on Ripples, and kindly contributed to R-bloggers)

          Romeo, Juliet, balcony in silhouette, makin o’s with her cigarette, it’s juliet (Flapper Girl, The Lumineers)

          Two weeks ago I published this post, for which I designed two different visualizations. In the end, I decided to place words on the map of the United States. The discarded visualization was this other one, where I placed the words over the silhouette of each state:

          I do not want to set aside this chart, because I really like it and also because I think it is a nice example of the possibilities one has when working with R.

          Here you have the code. It substitutes the fragment of the code headed by “Visualization” of the original post:

          opt = theme(panel.background = element_blank(),
                      panel.grid = element_blank(),
                      axis.text = element_blank(),
                      plot.title = element_text(size = 28))
          vplayout = function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
          jpeg(filename = "States In Two Words.jpeg", width = 1200, height = 600, quality = 100)
          pushViewport(viewport(layout = grid.layout(6, 8)))
          for (i in 1:nrow(table)) {
            wd = subset(words, State == as.character(table$"State name"[i]))
            p = ggplot() + geom_polygon(data = subset(map_data("state"), region == tolower(table$"State name"[i])), aes(x = long, y = lat, group = group), colour = "white", fill = "gold", alpha = 0.6, linetype = 0) + opt
            print(p, vp = vplayout(floor((i - 1)/8) + 1, i%%8 + (i%%8 == 0)*8))
            txt = paste(as.character(table$"State name"[i]), "\n is", wd$word1, "\n and", wd$word2, sep = " ")
            grid.text(txt, gp = gpar(font = 1, fontsize = 16, col = "midnightblue", fontfamily = "Humor Sans"), vp = viewport(layout.pos.row = floor((i - 1)/8) + 1, layout.pos.col = i%%8 + (i%%8 == 0)*8))
          }
          dev.off()

          To leave a comment for the author, please follow the link and comment on his blog: Ripples.

          R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…


          R Markdown Tutorial by RStudio and DataCamp

          By DataCamp


          (This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

          In collaboration with Garrett Grolemund, RStudio's teaching specialist, DataCamp has developed a new interactive course to facilitate reproducible reporting of your R analyses. R Markdown enables you to generate reports straight from your R code, documenting your work as an HTML, PDF or Microsoft Word document. This course is part of DataCamp's R training path, but can also be taken as a separate course.

          Check out the R Markdown tutorial, and take the free preview.

          A typically frustrating aspect of working in R is the 'for my eyes only' problem: once you have finished development (be it an analysis of customer data, a predictive model to judge the effectiveness of a new drug, or some fancy visualizations of your company's monthly revenue), how do you share those results with both technical and non-technical people? R Markdown provides the solution. On top of the core markdown syntax, R users can embed R code chunks in their report. These code chunks generate actual R output that is included in the final document. In addition, R Markdown documents are fully reproducible: when your underlying R code or data changes, you simply regenerate the output document.
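          For instance, a minimal R Markdown document might look like this (the title and data are illustrative):

````markdown
---
title: "Monthly Revenue Report"
output: html_document
---

Revenue continues to grow, as the summary below shows.

```{r revenue-summary, echo=FALSE}
revenue <- c(12.1, 13.4, 15.0)  # illustrative data
summary(revenue)
```
````

          Rendering this file produces an HTML page in which the code chunk has been executed and its output embedded.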

          On the R Markdown tutorial

          DataCamp has developed a brand new learning interface specifically for R Markdown exercises. You can edit R Markdown documents and render various output formats in the comfort of your browser, so there is no need to install additional software such as RStudio and LaTeX on your personal computer. Through a combination of instructional videos by Garrett Grolemund, interactive markdown exercises, and multiple choice challenges, you will become familiar with this revolutionary way of reporting quickly and easily.

          The R Markdown tutorial features several chapters that cover different topics. First you will be introduced to R Markdown in general and details on narration using markdown (an easy-to-write plain text format). Next, the focus will be on ways to embed your R code directly in your report and the specification of chunk options to customize how your code appears in your rendered document. The third chapter introduces pandoc, a tool that enables you to generate numerous types of output documents, among them HTML documents, beamer presentations, etc. After an introduction to interactive reports based on Shiny, Garrett will conclude by explaining how you can set up R Markdown on your own system.

          With R Markdown you have the perfect tool to report on your insights in a reproducible way and share these results straightforwardly. This R Markdown tutorial helps you to easily master its technicalities in the comfort of your browser so you can get started right away!

          Check it out.


          The post R Markdown Tutorial by RStudio and DataCamp appeared first on The DataCamp Blog .

          To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.


          Using Tables for Statistics on Large Vectors

          By strictlystat

          (This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers)

          This is the first post I’ve written in a while. I have been somewhat radio silent on social media, but I’m jumping back in.

          Now, I work with brain images, which can have millions of elements (referred to as voxels). Many of these elements are zero (the background). We usually want to calculate basic statistics on the data, and I wanted to describe how you can speed up operations or reduce memory requirements when calculating many statistics on a large vector of integer values, by using summary tables.

          Why use Tables

          Tables are relatively computationally expensive to calculate: they must operate over the entire vector, find the unique values, and bin the data into these values. Let n be the length of the vector. For integer vectors (i.e. whole numbers), the number of unique values is typically much less than n. Therefore, the table is stored much more efficiently than the entire vector.

          Tables are sufficient statistics

          You can think of the frequencies and bins as summary statistics for the entire distribution of the data. I will not discuss a formal proof here, but you can easily re-create the entire vector using the table (see epitools::expand.table for a function to do this), and thus the table is a sufficient (but not likely a minimal) statistic.
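          As a quick check of that claim, a table can be expanded back into (a sorted copy of) the original vector with rep, mirroring what epitools::expand.table does:

```r
# Rebuild a vector from its frequency table
x <- c(2, 2, 5, 5, 5, 9)
tab <- table(x)
rebuilt <- rep(as.numeric(names(tab)), times = as.vector(tab))
all.equal(sort(x), rebuilt)  # TRUE: the table loses only the ordering
```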

          As a sufficient statistic, we can compute any statistic we'd like relatively easily. Now, R has very efficient functions for many statistics, such as the median and quantiles, so it may not be obvious why we'd want to rewrite some of these functions using tables.

          I can think of 2 reasons: 1) you want to calculate many statistics on the data and don’t want to pass the vector in multiple times, and 2) you want to preprocess the data to summarize the data into tables to only use these in memory versus the entire vector.

          Here are some examples when this question has been asked on stackoverflow: 1, 2 and the R list-serv: 1. What we’re going to do is show some basic operations on tables to get summary statistics and show they agree.

          R Implementation

          Let’s make a large vector:

          vec = sample(-10:100, size= 1e7, replace = TRUE)

          Quantile function for tables

          I implemented a quantile function for tables (of only type 1). The code takes in a table, creates the cumulative sum, extracts the unique values of the table, then computes and returns the quantiles.

          quantile.table = function(tab, probs = c(0, 0.25, 0.5, 0.75, 1)){
            n = sum(tab)
            #### get CDF
            cs = cumsum(tab)
            ### get values (x)
            uvals = unique(as.numeric(names(tab)))
            #  can add different types of quantile, but using default (type 1)
            m = 0
            qs = sapply(probs, function(prob){
              np = n * prob
              j = floor(np) + m
              g = np + m - j
              # type == 1
              gamma = as.numeric(g != 0)
              uvals[min(which(cs >= j + gamma))]
            })
            dig <- max(2L, getOption("digits"))
            names(qs) <- paste0(if (length(probs) < 100)
              formatC(100 * probs, format = "fg", width = 1, digits = dig)
              else format(100 * probs, trim = TRUE, digits = dig),
              "%")
            qs
          }

          Quantile Benchmarks

          Let's benchmark the quantile functions: 1) creating the table and then getting the quantiles, 2) creating an empirical CDF function and then getting the quantiles, 3) computing the quantiles directly on the original data.

          library(microbenchmark)
          qtab = function(vec){
            tab = table(vec)
            quantile.table(tab)
          }
          qcdf = function(vec){
            cdf = ecdf(vec)
            quantile(cdf, type = 1)
          }
          microbenchmark(qtab(vec), qcdf(vec), quantile(vec, type = 1), times = 10L)
          Unit: relative
                              expr       min        lq     mean    median       uq      max neval cld
                         qtab(vec) 12.495569 12.052644 9.109178 11.589662 7.499691 5.481202    10   c
                         qcdf(vec)  5.407606  5.802752 4.375459  5.553492 3.708795 2.653728    10  b
           quantile(vec, type = 1)  1.000000  1.000000 1.000000  1.000000 1.000000 1.000000    10 a

          More realistic benchmarks

          Not surprisingly, simply running quantile on the vector beats the other 2 methods, by far. So computational speed may not be beneficial for using a table. But if tables or CDFs are already created in a previous processing step, we should compare that procedure:

          tab = table(vec)
          cdf = ecdf(vec)
          all.equal(quantile.table(tab), quantile(cdf, type=1))
          [1] TRUE
          all.equal(quantile.table(tab), quantile(vec, type=1))
          [1] TRUE
          microbenchmark(quantile.table(tab), quantile(cdf, type=1), quantile(vec, type = 1), times = 10L)
          Unit: relative
                              expr      min       lq     mean   median       uq      max neval cld
               quantile.table(tab)    1.000    1.000   1.0000    1.000   1.0000   1.0000    10 a
           quantile(cdf, type = 1)  774.885 1016.172 596.3217 1144.063 868.8105 198.1590    10  b
           quantile(vec, type = 1) 1029.696 1122.550 653.2146 1199.143 910.3743 206.5936    10  b

          As we can see, if you have already computed the table, then you get the same quantiles as performing the operation on the vector, with much faster results. Using quantile on an ecdf object is not much better, mainly because the quantile.ecdf method rebuilds the full data vector and then calculates the quantiles:

          function (x, ...) 
          quantile(evalq(rep.int(x, diff(c(0, round(nobs * y)))), environment(x)), 
              ...)
          <bytecode: 0x107493e28>
          <environment: namespace:stats>

          Median for tables

          Above we showed the quantile.table function, so a median function is trivial: it is just the quantile with probs = 0.5:

          median.table = function(tab){
            quantile.table(tab, probs = 0.5)
          }

          Mean of a table

          Other functions can be used to calculate statistics on the table, such as the mean:

          mean.table = function(tab){
            uvals = unique(as.numeric(names(tab)))
            sum(uvals * tab) / sum(tab)
          }
          mean.table(tab)
          [1] 44.98991
          mean(tab)
          [1] 44.98991
          mean(cdf)
          Warning in mean.default(cdf): argument is not numeric or logical:
          returning NA
          [1] NA

          As we see, once mean.table is defined we can simply call mean: since mean is an S3 generic, mean(tab) dispatches to mean.table for objects of class table, so we do not need to call the table-specific function explicitly.

          mean(vec)
          [1] 44.98991
          all.equal(mean(tab), mean(vec))
          [1] TRUE

          Subsetting tables

          One problem with using mean vs. mean.table is when you subset the table or perform an operation that causes it to lose the class attribute of table. For example, let's say I want to estimate the mean of the data for values > 0:

          mean(vec[vec > 0])
          [1] 50.50371
          over0 = tab[as.numeric(names(tab)) > 0]
          mean(over0)
          [1] 90065.98
          mean.table(over0)
          [1] 50.50371
          class(over0)
          [1] "array"

          We see that after subsetting, over0 is an array and not a table, so mean computes the mean using the array method, treating the frequencies as data, and the estimated mean is not correct. mean.table calculates the correct value, as it does not depend on the class of its argument. Another way to circumvent this is to reassign the table class to over0:

          class(over0) = "table"
          mean(over0)
          [1] 50.50371

          This process requires the user to know what the class is of the object passed to mean, and may not be correct if the user changes the class of the object.

          Aside on NA values

          Let’s see what happens when there are NAs in the vector. We’ll put in 20 NA values:

          navec = vec
          navec[sample(length(navec), 20)] = NA
          natab = table(navec, useNA = "ifany")
          nacdf = ecdf(navec)
          mean(natab)
          [1] NA
          mean(navec)
          [1] NA
          # mean(nacdf)

          We see that if we table the data with NA as a category, then any operation that propagates NA will return NA. For example, if we build the table with useNA="always", then the mean will be NA even though no NA values are present in the original vector. Also, ecdf objects do not keep track of NA values after they are computed.

          tab2 = table(vec, useNA = "always")
          mean(tab2)
          [1] NA
          nonatab = table(navec, useNA = "no")
          mean(nonatab)
          [1] 44.98993
          mean(navec, na.rm = TRUE)
          [1] 44.98993

          If you are using tables for statistics, the equivalent of na.rm=FALSE is table(..., useNA="ifany"), and the equivalent of na.rm=TRUE is table(..., useNA="no"). We also see that an ecdf object never shows NAs. So although we said tables are sufficient statistics, that may not be entirely correct depending on how you make the table when the data contain missing values.

          Mean benchmark

          Let’s benchmark the mean function, assuming we have pre-computed the table:

          microbenchmark(mean(tab), mean(vec), times = 10L)
          Unit: relative
                expr      min       lq     mean   median       uq      max neval cld
           mean(tab)   1.0000   1.0000   1.0000   1.0000   1.0000  1.00000    10  a 
           mean(vec) 374.0648 132.3851 111.2533 104.7355 112.7517 75.21185    10   b

          Again, if we have the table pre-computed, then estimating means is much faster using the table.

          Getting standard deviation

          The mean example may be misleading when we try sd on the table:

          sd(vec)
          [1] 32.04476
          sd(tab)
          [1] 302.4951

          These are not even remotely close. This is because sd is operating on the table as if it were a vector of data and not a frequency table.

          Note, we cannot calculate sd from the ecdf object:

          sd(cdf)
          Error in as.double(x): cannot coerce type 'closure' to vector of type 'double'

          SD and Variance for frequency table

          We will create a function to run sd on a table:

          var.table = function(tab){
            m = mean(tab)
            uvals = unique(as.numeric(names(tab)))
            n = sum(tab)
            sq = (uvals - m)^2
            ## sum of squared terms
            sum(sq * tab) / (n - 1)
          }
          sd.table = function(tab){
            sqrt(var.table(tab))
          }
          sd.table(tab)
          [1] 32.04476

          We compute the mean, get the squared differences, weight and sum them (sum(sq * tab)), and divide by n - 1 to get the variance; the sd is the square root of the variance.

          Benchmarking SD

          Let’s similarly benchmark the data for sd:

          microbenchmark(sd.table(tab), sd(vec), times = 10L)
          Unit: relative
                    expr      min       lq    mean   median       uq      max neval
           sd.table(tab)   1.0000   1.0000   1.000    1.000   1.0000   1.0000    10
                 sd(vec) 851.8676 952.7785 847.225 1142.225 732.3427 736.2757    10

          Mode of distribution

          Another statistic we may want for tabular data is the mode. We can simply find the maximum frequency in the table. The multiple option returns multiple values if there is a tie for the maximum frequency.

          mode.table = function(tab, multiple = TRUE){
            uvals = unique(as.numeric(names(tab)))
            ind = which.max(tab)
            if (multiple){
              ind = which(tab == max(tab))
            }
            uvals[ind]
          }
          mode.table(tab)
          [1] 36

          Memory of each object

          We wish to simply show the memory profile of using a table versus the entire vector:

          format(object.size(vec), "Kb")
          [1] "39062.5 Kb"
          format(object.size(tab), "Kb")
          [1] "7.3 Kb"
          round(as.numeric(object.size(vec) / object.size(tab)))
          [1] 5348

          We see that the table is much smaller than the vector. Therefore, computing and storing summary tables for integer data can be much more efficient.


          Tables are computationally expensive to create. If tables are pre-computed for integer data, however, then statistics can be calculated quickly and accurately, even if NAs are present. These tables are also much smaller in memory, so they can be stored with less space. This may be an important consideration for the computation and storage of large vectors in the future.

          To leave a comment for the author, please follow the link and comment on his blog: A HopStat and Jump Away » Rbloggers.

          R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

          Source:: R News

          drat 0.0.2: Improved Support for Lightweight R Repositories

          By Thinking inside the box

          (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

          A few weeks ago we introduced the drat package. Its name stands for drat R Archive Template, and it helps with easy-to-create and easy-to-use repositories for R packages. Two early blog posts describe drat: First Steps Towards Lightweight Repositories, and Publishing a Package.

          A new version 0.0.2 is now on CRAN. It adds several new features:

          • beginnings of native git support via the excellent new git2r package,
          • a new helper function to prune a repo of older versions of packages (as R repositories only show the newest release of a package),
          • improved core functionality for inserting a package and adding a repo.
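          In rough terms, the workflow looks like this (package, path and account names are made up, and exact argument names may differ slightly from the released API):

```r
library(drat)
# maintainer side: add a freshly built package to a local repo checkout,
# then prune older versions so only the newest release is advertised
insertPackage("myPkg_0.1.0.tar.gz", repodir = "~/git/drat")
pruneRepo(repopath = "~/git/drat")
# user side: register the GitHub-hosted repo and install from it
addRepo("someGithubUser")
install.packages("myPkg")
```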

          Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

          This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

          To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .


          Should I use premium Diesel? Setup

          By Wingfeet

          (This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

          Since I drive quite a lot, I have some interest in getting the most kilometers out of every Euro spent on fuel. One thing I could change is the fuel itself. The oil companies sell a premium fuel, which is supposed to be better for both the engine and fuel consumption. On the other hand, it is easy to find counter-claims saying it is not beneficial for fuel consumption, in which case the extra cost would be a waste. In this post I am creating a setup to check the claims.

          Current data

          The current information which I have are fuel consumption of last year. I have taken a set of those data from March and April of 2014.
          r1 <- read.table(text='l km
          28.25 710.6
          22.93 690.4
          28.51 760.5
          23.22 697.9
          31.52 871.2
          24.68 689.6
          30.85 826.9
          23.04 699
          29.96 845.3
          30.16 894.7
          25.71 696
          23.6 669.8
          28.57 739
          27.23 727.4
          18.31 499.9
          24.28 689.5', header=TRUE)
          # plotting call reconstructed; the original snippet was truncated
          hist(100*r1$l/r1$km,
              main='Observed normal diesel usage',
              xlab='l/100 km')

          The data are from a distribution with a mean around 3.6 l/100 km.
                  mean            sd
            3.59517971    0.19314598
          (0.04828649)  (0.03414371)


          Analysis will be a hypothesis test and an estimate of premium diesel usage.
          The assumptions which I will make are similar driving patterns and weather as last year. I think that should be possible, given my driving style. A cross check may be made, especially regarding obtaining similar speed. Data with serious traffic jams may be discarded in the analysis.
          A check for outliers is not planned; however, obviously faulty data will be corrected or removed. No intermediate analysis is planned, unless the data seem to be pointing to a marked increase in fuel usage.

          Power for hypothesis test

          The advised price levels of premium and standard diesel are 1.433 and 1.363 Euro/liter according to the internet: about a 5% price increase. It should be noted that prices at the pump vary wildly from these values; non-brand unmanned fuel stations in particular may be significantly cheaper. Last year's data was from such non-brand fuel. Competition can force the price of both standard and premium fuel down a bit. I will take the 5% price increase as the target for finding value in premium diesel. Given a significance level of 10% and power of 90%, I arrive at 17 samples for each group. This means I will have to take a bit more data from last year, which is not a problem. The choice of alpha and beta reflects that I find both kinds of error equally bad.
          power.t.test(delta = 0.18, sd = 0.2, sig.level = 0.1, power = 0.9,
              alternative = 'one.sided')

               Two-sample t test power calculation

          n = 16.66118
          delta = 0.18
          sd = 0.2
          sig.level = 0.1
          power = 0.9
          alternative = one.sided

          NOTE: n is number in *each* group

          Estimating usage of premium diesel

          Besides a significance test, I desire an estimate of usage. This way I can extrapolate the data to other scenarios. I will use a Bayesian analysis to obtain these estimates. The prior to be used is a mixture of three beliefs: either it does not make a difference, there is indeed a 5% gain to be made, or something else entirely. The latter is an uninformed prior between 3 and 4 l/100 km. The combined density is plotted below.
          usage <- seq(3.2,3.8,.01)
          # prior: equal mixture of 'no difference' (3.6), '5% gain' (0.95*3.6) and
          # an uninformed uniform on (3, 4); components reconstructed from the text above
          dens <- (dnorm(usage,3.6,.05)+
              dnorm(usage,0.95*3.6,.05)+
              dunif(usage,3,4))/3
          plot(usage, dens, type='l',
              xlab='l/100 km',
              ylab='density')

          To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.


          DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis

          By ygc


          (This article was first published on YGC » R, and kindly contributed to R-bloggers)

          My R/Bioconductor package, DOSE, has been published in Bioinformatics.

          Summary: Disease ontology (DO) annotates human genes in the context of disease. DO is an important annotation for translating molecular findings from high-throughput data to clinical relevance. DOSE is an R package providing semantic similarity computations among DO terms and genes, which allows biologists to explore the similarities of diseases and of gene functions from a disease perspective. Enrichment analyses, including the hypergeometric model and gene set enrichment analysis, are also implemented to support discovering disease associations in high-throughput biological data. This allows biologists to verify disease relevance in a biological experiment and identify unexpected disease associations. Comparison among gene clusters is also supported.

          Availability and implementation: DOSE is released under Artistic-2.0 License. The source code and documents are freely available through Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/DOSE.html).

          More importantly, DOSE provides several visualization functions to produce highly customizable, publication-quality figures of similarity and enrichment analyses. With these visualization tools, the results obtained by DOSE are more interpretable.


          To leave a comment for the author, please follow the link and comment on his blog: YGC » R.


          Book Review: Mastering Scientific Computing with R

          By Arnold Salvacion

          (This article was first published on Data Analysis and Visualization in R, and kindly contributed to R-bloggers)

          The PACKT marketing folks contacted me again, this time to review their new book Mastering Scientific Computing with R. The 432-page book (including covers) consists of 10 chapters, starting from basic R and ending with advanced data management. In between are topics on linear and non-linear models, nonparametric techniques, linear and matrix algebra, principal component analysis, and structural equation modeling and confirmatory factor analysis.

Aside from code samples (accessible even to R beginners) that include explanatory comments, there is also an online tutorial on Structural Equation Modeling on the book's webpage, which makes the content easier to learn. In addition, the book extends the discussion beyond the methodological aspects of using R in scientific computing by providing technical and practical guidance for some of the methodologies (e.g. “choosing the number of components to retain in Principal Component Analysis”, “Rasch analysis using linear algebra and a paired comparisons matrix”, “Exploratory factor analysis and reflective …”).

          Highly recommended book!

          To leave a comment for the author, please follow the link and comment on his blog: Data Analysis and Visualization in R.



          One weird trick to compile multipartite dynamic documents with Rmarkdown

          By biochemistries

          (This article was first published on biochemistries, and kindly contributed to R-bloggers)

This afternoon I stumbled across this one weird trick: an undocumented part of the YAML headers that gets processed when you click the ‘knit’ button in RStudio. Knitting turns an Rmarkdown document into a specified format, using the rmarkdown package’s render function to call pandoc (a universal document converter written in Haskell).

          If you specify a knit: field in an Rmarkdown YAML header you can replace the default function (rmarkdown::render) that the input file and encoding are passed to with any arbitrarily complex function.
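As a minimal illustration (my own sketch, not from the original post), a header like this would redirect knitting so the output gets a chosen file name; note the function has to sit on one line in the YAML:

```yaml
---
title: "Example"
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, output_file = "custom-name.md") })
output: md_document
---
```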

For example, the developer of slidify passed in a totally different function in place of rmarkdown::render: slidify::knit2slides.

I thought it’d be worthwhile to modify what is triggered upon clicking that button – whether as simply as specifying an output file name (see StackOverflow here), or essentially running a sort of make to compose a multi-partite document.

          Here’s an exemplar Rmarkdown YAML header which threads together a document from 3 component Rmarkdown subsection files:

          • The title: field becomes a top-level header (#) in the output markdown
          • The knit: field (a currently undocumented hook) replaces rmarkdown::render with a custom function specifying parameters for the rendering.
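The original header isn’t reproduced here; based on the description above, a plausible stand-in (file names invented for illustration) would be:

```yaml
---
title: "My project"
knit: (function(inputFile, encoding) { sapply(c("01-intro.Rmd", "02-methods.Rmd", "03-results.Rmd"), function(f) rmarkdown::render(f, quiet = TRUE)); rmarkdown::render(inputFile, encoding = encoding) })
output:
  md_document:
    variant: markdown_github
    includes:
      after_body:
        - 01-intro.md
        - 02-methods.md
        - 03-results.md
---
```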

            Yes, unfortunately it does have to be an unintelligible one-liner unless you have your own R package [with the function exported to the namespace] to source it from (as package::function). Here’s the above more clearly laid out:

            • Firstly, every section’s Rmarkdown file is rendered into markdown [with the same name by default]
            • Each of these files are ‘included’ after the ‘body’ (cf. the header) of this README, if they’re in the includes: after_body:[...] list.
            • The quiet=TRUE parameter silences the standard “Output created: …” message following render() which would otherwise trigger the RStudio file preview on the last of the intermediate markdown files created.
            • After these component files are processed, the final README markdown is rendered (includes appends their processed markdown contents), and this full document is previewed.
          • All Rmd files here contain a YAML header, the constituent files having only the `output: md_document: variant` field:
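A constituent file’s minimal header would then be just (a sketch matching that description):

```yaml
---
output:
  md_document:
    variant: markdown_github
---
```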

            …before their sub-section contents:

          ## Comparison of cancer types surveyed
          Comparing cancer types in this paper to CRUK's most up to date prevalence statistics...

          Alternative modular setup

          One of the problems custom knit functions can also solve is the time it takes for large manuscripts to compile – a huge obstacle to my own use of Rmarkdown, which I’m delighted to overcome, and which has stopped me from recommending it to others as a practical writing tool despite its clear potential.

          E.g., if using knitcitations, each reference is downloaded even if the bibliographic metadata has already been obtained. Along with generating individual figures etc., the time to ‘compile’ an Rmarkdown document can therefore scale exorbitantly when writing a moderately sized manuscript (rising from seconds to tens of minutes in the extreme as I saw on a recent essay), breaking the proper flow of writing and review, and imposing a penalty on extended Rmarkdown compositions.

          A modular structure is the only rational way of doing this, but isn’t described anywhere for Rmarkdown’s dynamic documents (to my knowledge?).

          In such a framework, the ‘main’ document’s knit function would be as above, but without the first step of compiling each .Rmd to .md (these having been done separately upon each edit), so that pre-made .md files would just be included (instantly) in the final document:
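The original example didn’t survive extraction; a sketch of such a main-document header (file names invented, relying on the includes: mechanism described earlier to pull in the pre-made markdown) could be:

```yaml
---
title: "My project"
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding) })
output:
  md_document:
    variant: markdown_github
    includes:
      after_body:
        - 01-intro.md
        - 02-methods.md
---
```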

          Much more sensibly, the edited Rmarkdown component files (subsections) wouldn’t need to be re-processed – e.g., have all references and figures regenerated – rather, this would be done per file, and each file could in turn have its own custom knit: hook (though note that the example below only works to prevent the file preview; there’s scope to do much more with it).
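A component-file knit: hook that only suppresses the preview might look like this (my sketch; quiet = TRUE silences the “Output created” message that triggers RStudio’s preview):

```yaml
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, quiet = TRUE) })
```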

          via Software Carpentry

          The idea would be to follow what this Software Carpentry video describes regarding makefiles for reproducible research papers. In theory, the initially described knit: function could generate a full paper including analyses from component section files, each of which could in turn have their own knit: hooks.

          The example above creates a README.md file suitable for display in a standard GitHub repository, though it’s not advisable to write sprawling READMEs: it could easily be tweaked to give a paper.pdf as in the Software Carpentry example, using a PDF YAML output header instead for the final .md to .pdf step after including the component parts.

          For what it’s worth, my current YAML header for a manuscript in PDF is:
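The header itself didn’t survive the page extraction; a plausible stand-in (all field values invented for illustration) would be:

```yaml
---
title: "Manuscript title"
author: "Author Name"
output:
  pdf_document:
    latex_engine: xelatex
    keep_tex: true
bibliography: references.bib
---
```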

          … and in the top matter (after the YAML, before the markdown, for the LaTeX engine & R):
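Again the original snippet is lost; typical top matter mixing raw LaTeX with a knitr setup chunk (illustrative only, not the author’s actual settings) looks like:

````markdown
\setlength{\parskip}{6pt}

```{r setup, include=FALSE}
# knitr chunk defaults for the whole manuscript
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```
````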

          A minor limitation I see here is that it’s not possible to provide subsection titles through metadata – at present the title is written to markdown with a hardcoded ‘# ‘ prefix. In a reproducible-manuscript utopia the title: field could still be specified and a markdown header prefix of the appropriate level generated accordingly (which might also allow for procedural sub-section numbering – 1.2, 1.2.1, etc.).

          The above can also be found on my GitHub development notes Wiki, but it’s not possible to leave comments there. Feedback and more tips and tricks for Rmarkdown workflows are welcome.

          ✎ Check out the rmarkdown package here, and general Rmd documentation here.

          To leave a comment for the author, please follow the link and comment on his blog: biochemistries.

