## Electric Power System simulations using R

The field of electric power systems engineering relies heavily on computer simulations for analysis, given the scale and complexity of the systems involved. These simulations aid the planning, operation and management of the system.
Such simulations have been implemented in several scientific computing tools; however, I had not yet seen any implementations in R. This inspired my thesis at a German institution, and I am now privileged to have implemented several power systems analysis simulations using R. In the process, I started by creating standalone scripts for several topics of interest in this domain. Examples are:

• Power flow simulation (Gauss-Seidel & Newton-Raphson techniques)
• Time domain simulation of the Single-Machine Infinite Bus (SMIB) system using the Modified Euler numerical integration method.
• Multi-machine transient stability simulation using the `deSolve` package for solving the n ODEs that describe the dynamics of interconnected generating machines.
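As a flavor of how little machinery this takes, here is a minimal sketch of the Modified Euler (Heun) method applied to the classical SMIB swing equation in base R; all parameter values below are illustrative, not taken from the thesis models.

```r
# Classical SMIB swing equation integrated with the Modified Euler (Heun)
# method -- a minimal sketch; parameter values are illustrative.
H  <- 5; f0 <- 50                 # inertia constant (s), system frequency (Hz)
M  <- H / (pi * f0)               # inertia coefficient (pu)
Pm <- 0.8                         # mechanical input power (pu)
Pmax <- 2.0                       # maximum electrical power transfer (pu)

derivs <- function(delta, omega) {
  c(ddelta = omega,                          # rotor angle rate (rad/s)
    domega = (Pm - Pmax * sin(delta)) / M)   # rotor acceleration
}

dt <- 0.01; steps <- 500
delta <- asin(Pm / Pmax) + 0.5    # start perturbed 0.5 rad from equilibrium
omega <- 0
out <- matrix(NA, steps, 2, dimnames = list(NULL, c("delta", "omega")))
for (i in seq_len(steps)) {
  k1 <- derivs(delta, omega)                     # predictor slope
  k2 <- derivs(delta + dt * k1[1],               # corrector slope at the
               omega + dt * k1[2])               #   predicted point
  delta <- delta + dt * (k1[1] + k2[1]) / 2      # averaged (modified Euler) update
  omega <- omega + dt * (k1[2] + k2[2]) / 2
  out[i, ] <- c(delta, omega)
}
plot(seq_len(steps) * dt, out[, "delta"], type = "l",
     xlab = "time (s)", ylab = "rotor angle (rad)")
```

The rotor angle oscillates around its equilibrium rather than diverging, which is the stable-case behavior the time-domain simulation is meant to exhibit.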

See screenshots below.

Figure 1: Power flow simulation of the IEEE 14-bus system (Newton-Raphson technique)

The power flow simulation consists of evaluating the flow of power and the voltages of a network for specified terminal or bus conditions. The results of a power flow solution are used to evaluate the loadability of lines or transformers and the acceptability of voltage profiles.
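To make the bus-condition idea concrete, here is a minimal Gauss-Seidel sketch in base R for a two-bus system (slack bus plus one PQ load bus). The line impedance and load values are invented for illustration; this is not the IEEE 14-bus data.

```r
# Gauss-Seidel power flow for a 2-bus system (slack + PQ bus).
# All values are illustrative, in per unit.
z12 <- 0.05 + 0.2i                 # line impedance
y12 <- 1 / z12
Ybus <- matrix(c( y12, -y12,
                 -y12,  y12), 2, 2, byrow = TRUE)

V1 <- 1.05 + 0i                    # slack bus voltage (fixed)
P2 <- -0.5; Q2 <- -0.2             # load at bus 2 (negative injection)
V2 <- 1 + 0i                       # flat start

for (iter in 1:100) {
  # standard Gauss-Seidel voltage update for a PQ bus
  V2new <- (1 / Ybus[2, 2]) *
    ((P2 - 1i * Q2) / Conj(V2) - Ybus[2, 1] * V1)
  if (Mod(V2new - V2) < 1e-8) break
  V2 <- V2new
}
cat(sprintf("Bus 2: |V| = %.4f pu, angle = %.3f deg (%d iterations)\n",
            Mod(V2), Arg(V2) * 180 / pi, iter))
```

The same update rule, swept over all PQ buses of a larger network, is the core of the Gauss-Seidel technique; Newton-Raphson instead solves the mismatch equations with a Jacobian.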

Figure 2: Time domain simulation of an SMIB system

The infinite bus has a great advantage in explaining many of the problems associated with power systems. When analytical solutions are sought, the infinite busbar provides a reference for all angular changes in power system parameters. This simulation concerns the transient stability performance of a single generator that is connected to the infinite bus (or grid).

Figure 3: Transient stability simulation of the IEEE 39-bus 10 generator system

Transient stability analysis deals with the effects of large, sudden disturbances such as the outage of a line, the occurrence of a fault or the sudden loss of load. Studying the transient stability of the power system is required to understand whether the system can withstand the transient conditions following a major perturbation, and it is very insightful in the planning and operation of power systems.

I also went ahead and tried out web-based implementations of these simulations, and the results were great. The `shiny` package developed by RStudio was used to implement this, with interactivity achieved through reactive programming. The use of tab panels allowed many simulation views to be integrated (e.g. one can view power flow and stability simulations on one web page). The reactive programming model solved the problem of always going back to the input data file to modify system parameters before re-simulating. For example, in the web-based multi-machine transient stability simulation program, the circuit breaker clearing time, simulation time, faulted bus and line to be cleared are all reactive. See the screenshot (Figure 4) below:

Figure 4: Web-based Power flow and Multi-machine transient stability simulation

These few simulations, used as a case study for my thesis, have since grown into 'The RPowerLABS Project', with diverse simulations and potential. In a subsequent post, I shall highlight the offerings of this project and its impact for developing nations.

Some reasons why R is suitable for electrical power system engineering analysis and simulation are:

• R can handle complex-number operations
• R is great at basic mathematical operations
• Vector/matrix arithmetic operations are well supported in R
• Numerical linear algebra is well taken care of
• Sparse matrices are catered for by R (or add-on packages)
• R has very attractive visualization facilities
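A quick taste of the first few points in plain base R (the impedance values are invented for illustration; for sparse matrices one would typically reach for the `Matrix` package):

```r
# Complex arithmetic and linear algebra in base R.
Z <- matrix(c(0.10 + 0.50i, 0.02 + 0.10i,
              0.02 + 0.10i, 0.10 + 0.40i), nrow = 2)  # illustrative impedance matrix (pu)
I <- c(1 + 0i, 0.5 - 0.2i)                            # current injections (pu)
V <- Z %*% I                                          # V = Z I: complex matrix product
Ihat <- solve(Z, V)                                   # solve() works on complex systems
Mod(V)                                                # voltage magnitudes
Arg(V) * 180 / pi                                     # voltage angles in degrees
```

Everything needed for phasor arithmetic (magnitudes, angles, complex solves) ships with base R, which is why power system computations translate so directly.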

If R is proven suitable for electrical power system engineering analysis, how about other engineering sub-domains?

Note: The major idea of this post is to inform the R community of the application of R to electric power system engineering, and that there exists a lot of potential for this in academia (especially e-learning). Hence, we have not shown the nitty-gritty of the simulation examples in this post. Watch out for related posts coming up!

This is a guest post by Ben Ubah.

Source:: R News

## Silhouettes

(This article was first published on Ripples, and kindly contributed to R-bloggers)

Romeo, Juliet, balcony in silhouette, makin o’s with her cigarette, it’s juliet (Flapper Girl, The Lumineers)

Two weeks ago I published this post, for which I designed two different visualizations. In the end, I decided to place words on the map of the United States. The discarded visualization was this other one, where I placed the words over the silhouette of each state:

I do not want to set aside this chart, because I really like it and also because I think it is a nice example of the possibilities one has when working with R.

Here you have the code. It replaces the fragment headed by "Visualization" in the original post:

```r
library(ggplot2)
library(maps)
library(grid)       # viewport(), grid.newpage(), grid.text()
library(gridExtra)
library(extrafont)
opt = theme(legend.position = "none",
            panel.background = element_blank(),
            panel.grid = element_blank(),
            axis.ticks = element_blank(),
            axis.title = element_blank(),
            axis.text  = element_blank(),
            plot.title = element_text(size = 28))
vplayout = function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
grid.newpage()
jpeg(filename = "States In Two Words.jpeg", width = 1200, height = 600, quality = 100)
pushViewport(viewport(layout = grid.layout(6, 8)))
for (i in 1:nrow(table))
{
  wd = subset(words, State == as.character(table$"State name"[i]))
  p = ggplot() +
    geom_polygon(data = subset(map_data("state"), region == tolower(table$"State name"[i])),
                 aes(x = long, y = lat, group = group),
                 colour = "white", fill = "gold", alpha = 0.6, linetype = 0) + opt
  print(p, vp = vplayout(floor((i-1)/8)+1, i%%8+(i%%8==0)*8))
  txt = paste(as.character(table$"State name"[i]), "\n is", wd$word1, "\n and", wd$word2, sep = " ")
  grid.text(txt, gp = gpar(font = 1, fontsize = 16, col = "midnightblue", fontfamily = "Humor Sans"),
            vp = viewport(layout.pos.row = floor((i-1)/8)+1, layout.pos.col = i%%8+(i%%8==0)*8))
}
dev.off()
```

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…


## R Markdown Tutorial by RStudio and DataCamp

By DataCamp

(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

In collaboration with Garrett Grolemund, RStudio’s teaching specialist, DataCamp has developed a new interactive course to facilitate reproducible reporting of your R analyses. R Markdown enables you to generate reports straight from your R code, documenting your work as an HTML, PDF or Microsoft Word document. This course is part of DataCamp’s R training path, but can also be taken as a separate course.

Check out the R Markdown tutorial, and take the free preview.

A typically frustrating feature of the R language is the ‘for my eyes only’ aspect: once you have finished development (be it an analysis of customer data, a predictive model to judge the effectiveness of a new drug, or some fancy visualizations of your company’s monthly revenue), how do you share those results with both technical and non-technical people? R Markdown provides the solution. On top of the core markdown syntax, R users are able to embed R code chunks in their report. These R code chunks generate actual R output that is included in the final document. In addition, R Markdown documents are fully reproducible: once your underlying R code or data changes, you can simply regenerate the output document.
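As a minimal, hypothetical illustration of such a document (written out from R here so the chunk syntax is visible; actually rendering it would additionally require the `rmarkdown` package):

```r
# Write a minimal R Markdown document to show the embedded-chunk syntax.
# The file name and contents are illustrative.
rmd <- c(
  "---",
  'title: "Monthly revenue"',
  "output: html_document",
  "---",
  "",
  "Revenue summary:",
  "",
  "```{r revenue-plot, echo=FALSE}",
  "revenue <- c(10, 12, 15, 14)  # illustrative data",
  'plot(revenue, type = "b")',
  "```"
)
writeLines(rmd, "report.Rmd")
# rmarkdown::render("report.Rmd")  # regenerates the report whenever data change
```

The ```` ```{r} ```` chunk is executed at render time, so the plot in the output document always reflects the current data.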

On the R Markdown tutorial

DataCamp has developed a brand new learning interface specifically for R Markdown exercises. You can edit R Markdown documents and render various output formats in the comfort of your browser. Thus, there is no need for you to install additional software such as RStudio and LaTeX on your personal computer. Through a combination of instructional videos given by Garrett Grolemund, interactive markdown exercises, and multiple choice challenges, you will quickly and easily become familiar with this revolutionary way of reporting.

The R Markdown tutorial features several chapters that cover different topics. First you will be introduced to R Markdown in general and details on narration using markdown (an easy-to-write plain text format). Next, the focus will be on ways to embed your R code directly in your report and the specification of chunk options to customize how your code appears in your rendered document. The third chapter introduces pandoc, a tool that enables you to generate numerous types of output documents, among them HTML documents, beamer presentations, etc. After an introduction to interactive reports based on Shiny, Garrett will conclude by explaining how you can set up R Markdown on your own system.

With R Markdown you have the perfect tool to report on your insights in a reproducible way and share these results straightforwardly. This R Markdown tutorial helps you to easily master its technicalities in the comfort of your browser so you can get started right away!

Check it out.

The post R Markdown Tutorial by RStudio and DataCamp appeared first on The DataCamp Blog .



## Using Tables for Statistics on Large Vectors

(This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers)

This is the first post I’ve written in a while. I have been somewhat radio silent on social media, but I’m jumping back in.

Now, I work with brain images, which can have millions of elements (referred to as voxels). Many of these elements are zero (the background). We usually want to calculate basic statistics on these data, and I wanted to describe how you can speed up operations or reduce memory requirements when calculating many statistics on a large integer-valued vector by using summary tables.

## Why to use Tables

Tables are relatively computationally expensive to calculate: they must operate over the entire vector, find the unique values, and bin the data into those values. Let n be the length of the vector. For integer vectors (i.e. whole numbers), the number of unique values is much less than n. Therefore, the table is stored much more efficiently than the entire vector.

### Tables are sufficient statistics

You can think of the frequencies and bins as summary statistics for the entire distribution of the data. I will not discuss a formal proof here, but you can easily re-create the entire vector using the table (see `epitools::expand.table` for a function to do this), and thus the table is a sufficient (but not likely a minimal) statistic.
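The round trip from vector to table and back can be done with `rep()`, in the spirit of `epitools::expand.table`:

```r
# A table is a lossless summary of an integer vector (up to ordering):
# repeat each unique value by its frequency to rebuild the data.
x <- c(0, 0, 0, 1, 2, 2, 5)
tx <- table(x)
rebuilt <- rep(as.numeric(names(tx)), times = tx)
all.equal(sort(x), rebuilt)   # TRUE: the original data are recovered
```

Since only the ordering is lost, any statistic that does not depend on order can be computed from the table alone.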

As a sufficient statistic, we can create any statistic that we’d like relatively easily. Now, `R` has very efficient functions for many statistics, such as the median and quantiles, so it may not be obvious why we’d want to rewrite some of these functions using tables.

I can think of 2 reasons: 1) you want to calculate many statistics on the data and don’t want to pass the vector in multiple times, and 2) you want to preprocess the data to summarize the data into tables to only use these in memory versus the entire vector.

Here are some examples where this question has been asked on StackOverflow (1, 2) and the R list-serv (1). What we’re going to do is show some basic operations on tables to get summary statistics and show that they agree.

## R Implementation

Let’s make a large vector:

```r
set.seed(20150301)
vec = sample(-10:100, size = 1e7, replace = TRUE)
```

### Quantile function for tables

I implemented a quantile function for tables (type 1 only). The code takes in a table, creates the cumulative sum, extracts the unique values of the table, then computes and returns the quantiles.

```r
quantile.table = function(tab, probs = c(0, 0.25, 0.5, 0.75, 1)){
  n = sum(tab)
  #### get CDF
  cs = cumsum(tab)
  ### get values (x)
  uvals = unique(as.numeric(names(tab)))

  # can add different types of quantile, but using default
  m = 0
  qs = sapply(probs, function(prob){
    np = n * prob
    j = floor(np) + m
    g = np + m - j
    # type == 1
    gamma = as.numeric(g != 0)
    quant = uvals[min(which(cs >= j))]
    return(quant)
  })
  dig <- max(2L, getOption("digits"))
  names(qs) <- paste0(if (length(probs) < 100)
    formatC(100 * probs, format = "fg", width = 1, digits = dig)
    else format(100 * probs, trim = TRUE, digits = dig),
    "%")
  return(qs)
}
```

### Quantile Benchmarks

Let’s benchmark the quantile functions: 1) creating the table and then getting the quantiles, 2) creating an empircal CDF function then creating the quantiles, 3) creating the quantiles on the original data.

```r
library(microbenchmark)
options(microbenchmark.unit='relative')
qtab = function(vec){
  tab = table(vec)
  quantile.table(tab)
}
qcdf = function(vec){
  cdf = ecdf(vec)
  quantile(cdf, type=1)
}
# quantile(vec, type = 1)
microbenchmark(qtab(vec), qcdf(vec), quantile(vec, type = 1), times = 10L)
```
```
Unit: relative
                    expr       min        lq     mean    median       uq      max neval cld
               qtab(vec) 12.495569 12.052644 9.109178 11.589662 7.499691 5.481202    10   c
               qcdf(vec)  5.407606  5.802752 4.375459  5.553492 3.708795 2.653728    10  b
 quantile(vec, type = 1)  1.000000  1.000000 1.000000  1.000000 1.000000 1.000000    10 a
```

### More realistic benchmarks

Not surprisingly, simply running `quantile` on the vector beats the other 2 methods, by far. So computational speed may not be beneficial for using a table. But if tables or CDFs are already created in a previous processing step, we should compare that procedure:

```r
options(microbenchmark.unit="relative")
tab = table(vec)
cdf = ecdf(vec)
all.equal(quantile.table(tab), quantile(cdf, type=1))
```
```
[1] TRUE
```
```r
all.equal(quantile.table(tab), quantile(vec, type=1))
```
```
[1] TRUE
```
```r
microbenchmark(quantile.table(tab), quantile(cdf, type=1), quantile(vec, type = 1), times = 10L)
```
```
Unit: relative
                    expr      min       lq     mean   median       uq      max neval cld
     quantile.table(tab)    1.000    1.000   1.0000    1.000   1.0000   1.0000    10  a
 quantile(cdf, type = 1)  774.885 1016.172 596.3217 1144.063 868.8105 198.1590    10   b
 quantile(vec, type = 1) 1029.696 1122.550 653.2146 1199.143 910.3743 206.5936    10   b
```

As we can see, if you had already computed the table, then you get the same quantiles as performing the operation on the vector, and much faster results. Using `quantile` on an `ecdf` object is not much better, which is mainly because the `quantile` method rebuilds the vector and then calculates the quantiles:

```r
stats:::quantile.ecdf
```
```
function (x, ...)
quantile(evalq(rep.int(x, diff(c(0, round(nobs * y)))), environment(x)),
    ...)
<bytecode: 0x107493e28>
<environment: namespace:stats>
```

### Median for tables

Above we show the `quantile.table` function, so a median function is trivial: just set `probs = 0.5`:

```r
median.table = function(tab){
  quantile.table(tab, probs = 0.5)
}
```

## Mean of a table

Other functions can be used to calculate statistics on the table, such as the mean:

```r
mean.table = function(tab){
  uvals = unique(as.numeric(names(tab)))
  sum(uvals * tab)/sum(tab)
}
mean.table(tab)
```
```
[1] 44.98991
```
```r
mean(tab)
```
```
[1] 44.98991
```
```r
mean(cdf)
```
```
Warning in mean.default(cdf): argument is not numeric or logical:
returning NA
```
```
[1] NA
```

As we see, we can simply use `mean` and do not need to define a new function for tables.

```r
mean(vec)
```
```
[1] 44.98991
```
```r
all.equal(mean(tab), mean(vec))
```
```
[1] TRUE
```

### Subsetting tables

One problem with using `mean` vs. `mean.table` arises when you subset the table or perform an operation that causes it to lose the `table` class attribute. For example, let’s say I want to estimate the mean of the data for values > 0:

```r
mean(vec[vec > 0])
```
```
[1] 50.50371
```
```r
over0 = tab[as.numeric(names(tab)) > 0]
mean(over0)
```
```
[1] 90065.98
```
```r
mean.table(over0)
```
```
[1] 50.50371
```
```r
class(over0)
```
```
[1] "array"
```

We see that after subsetting, `over0` is an `array` and not a `table`, so `mean` computes the mean using the `array` method, treating the frequencies as data, and the estimated mean is not correct. `mean.table` calculates the correct value, as it does not depend on the class of `tab`. Another way to circumvent this is to reassign the class `table` to `over0`:

```r
class(over0) = "table"
mean(over0)
```
```
[1] 50.50371
```

This process requires the user to know what the class is of the object passed to `mean`, and may not be correct if the user changes the class of the object.

### Aside on NA values

Let’s see what happens when there are `NA`s in the vector. We’ll put in 20 `NA` values:

```r
navec = vec
navec[sample(length(navec), 20)] = NA
natab = table(navec, useNA="ifany")
nacdf = ecdf(navec)
mean(navec)
```
```
[1] NA
```
```r
mean(natab)
```
```
[1] NA
```
```r
# mean(nacdf)
```

We see that if we `table` the data with `NA` being a category, then any operation that returns `NA` if `NA` are present will return `NA`. For example, if we do a table on the data with the `table` option `useNA="always"`, then the mean will be `NA` even though no `NA` are present in the original vector. Also, `ecdf` objects do not keep track of `NA` values after they are computed.

```r
tab2 = table(vec, useNA="always")
mean(tab2)
```
```
[1] NA
```
```r
nonatab = table(navec, useNA="no")
mean(nonatab)
```
```
[1] 44.98993
```
```r
mean(navec, na.rm=TRUE)
```
```
[1] 44.98993
```

If you are using tables for statistics, the equivalent of `na.rm=FALSE` is `table(..., useNA="ifany")`, and of `na.rm=TRUE` is `table(..., useNA="no")`. We also see that `ecdf` objects never show `NA`s. Although we said tables are sufficient statistics, that may not be entirely correct depending on how you make the table when the data contain missing values.

### Mean benchmark

Let’s benchmark the mean function, assuming we have pre-computed the table:

```r
options(microbenchmark.unit="relative")
microbenchmark(mean(tab), mean(vec), times = 10L)
```
```
Unit: relative
      expr      min       lq     mean   median       uq      max neval cld
 mean(tab)   1.0000   1.0000   1.0000   1.0000   1.0000  1.00000    10  a
 mean(vec) 374.0648 132.3851 111.2533 104.7355 112.7517 75.21185    10   b
```

Again, if we have the table pre-computed, then estimating means is much faster using the table.

## Getting standard deviation

The `mean` example may be misleading when we try `sd` on the table:

```r
sd(vec)
```
```
[1] 32.04476
```
```r
sd(tab)
```
```
[1] 302.4951
```

These are not even remotely close. This is because `sd` operates on the table as if it were a `vector` and not a frequency table.

Note, we cannot calculate `sd` from the `ecdf` object:

```r
sd(cdf)
```
```
Error in as.double(x): cannot coerce type 'closure' to vector of type 'double'
```

## SD and Variance for frequency table

We will create a function to run `sd` on a table:

```r
var.table = function(tab){
  m = mean(tab)
  uvals = unique(as.numeric(names(tab)))
  n = sum(tab)
  sq = (uvals - m)^2
  ## sum of squared terms
  var = sum(sq * tab) / (n-1)
  return(var)
}
sd.table = function(tab){
  sqrt(var.table(tab))
}
sd.table(tab)
```
```
[1] 32.04476
```

We compute the mean, get the squared differences, sum these weighted by their frequencies (`sum(sq * tab)`), and divide by `n-1` to get the variance; the `sd` is the square root of the variance.

### Benchmarking SD

Let’s similarly benchmark the data for `sd`:

```r
options(microbenchmark.unit="relative")
microbenchmark(sd.table(tab), sd(vec), times = 10L)
```
```
Unit: relative
          expr      min       lq    mean   median       uq      max neval cld
 sd.table(tab)   1.0000   1.0000   1.000    1.000   1.0000   1.0000    10  a
       sd(vec) 851.8676 952.7785 847.225 1142.225 732.3427 736.2757    10   b
```

## Mode of distribution

Another statistic we may want for tabular data is the mode. We can simply find the maximum frequency in the table. The `multiple` option returns multiple values if there is a tie for the maximum frequency.

```r
mode.table = function(tab, multiple = TRUE){
  uvals = unique(as.numeric(names(tab)))
  ind = which.max(tab)
  if (multiple){
    ind = which(tab == max(tab))
  }
  uvals[ind]
}
mode.table(tab)
```
```
[1] 36
```

## Memory of each object

We wish to simply show the memory profile of using a `table` versus the entire vector:

```r
format(object.size(vec), "Kb")
```
```
[1] "39062.5 Kb"
```
```r
format(object.size(tab), "Kb")
```
```
[1] "7.3 Kb"
```
```r
round(as.numeric(object.size(vec) / object.size(tab)))
```
```
[1] 5348
```

We see that the table is much smaller than the vector. Therefore, computing and storing summary tables for integer data can be much more efficient.

# Conclusion

Tables are computationally expensive to create. If tables are pre-computed for integer data, however, then statistics can be calculated quickly and accurately, even when `NA`s are present. These tables are also much smaller in memory, so they can be stored with less space. This may be an important consideration for the computation and storage of large vectors in the future.



## drat 0.0.2: Improved Support for Lightweight R Repositories

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A few weeks ago we introduced the drat package. Its name stands for drat R Archive Template, and it helps with easy-to-create and easy-to-use repositories for R packages. Two early blog posts describe drat: First Steps Towards Lightweight Repositories, and Publishing a Package.

A new version 0.0.2 is now on CRAN. It adds several new features:

• beginnings of native git support via the excellent new git2r package,
• a new helper function to prune a repo of older versions of packages (as R repositories only show the newest release of a package),
• improved core functionality in inserting a package, and adding a repo.

Courtesy of CRANberries, there is a comparison to the previous release. More detailed information is on the drat page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



## Should I use premium Diesel? Setup

By Wingfeet

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

Since I drive quite a lot, I have some interest in getting the most kilometers out of every Euro spent on fuel. One thing to change is the fuel itself. The oil companies offer a premium fuel, which is supposed to be better for both the engine and fuel consumption. On the other hand, it is easy to find counter-claims saying it is not beneficial for fuel consumption, in which case the extra cost would be a waste. In this post I am creating a setup to check the claims.

### Current data

The current information which I have is the fuel consumption of last year. I have taken a set of those data from March and April of 2014; the two columns below are liters (`l`) and kilometers (`km`) per fill-up, read into the data frame `r1`.
library(MASS)
28.25 710.6
22.93 690.4
28.51 760.5
23.22 697.9
31.52 871.2
24.68 689.6
30.85 826.9
23.04 699
29.96 845.3
30.16 894.7
25.71 696
23.6 669.8
28.57 739
27.23 727.4
18.31 499.9
r1$usage=100*r1$l/r1$km
plot(density(r1$usage),
    main='Observed normal diesel usage',
    xlab='l/100 km')

The data are from a distribution with a mean around 3.6 l/100 km.
fitdistr(r1$usage,'normal')
      mean          sd
  3.59517971   0.19314598
 (0.04828649) (0.03414371)

### Approach

Analysis will be a hypothesis test and an estimate of premium diesel usage.
The assumptions which I will make are similar driving patterns and weather as last year. I think that should be possible, given my driving style. A cross-check may be made, especially regarding obtaining similar speeds. Data with serious traffic jams may be discarded in the analysis.
A check for outliers is not planned; however, obviously faulty data will be corrected or removed. No intermediate analysis is planned, unless the data seem to be pointing to a marked increase in fuel usage.

### Power for hypothesis test

The advice price levels of premium and standard diesel are 1.433 and 1.363 Euro/liter according to the internet. This is about a 5% price increase. It should be noted that prices at the pump vary widely from these values; non-brand, non-manned fuel stations especially may be significantly cheaper. Last year’s data were from such non-brand fuel. Competition can force the price of both standard and premium fuel down a bit. I will take the 5% price increase as the target for finding value in premium diesel. Given a significance level of 10% and power of 90%, I arrive at 17 samples for each group. This means I will have to take a bit more data from last year, which is not a problem. The choice of alpha and beta reflects that I find both kinds of error equally bad.
power.t.test(delta=3.6*.05,
    sd=0.2,
    sig.level=.1,
    power=.9,
    alternative='one.sided')

     Two-sample t test power calculation

              n = 16.66118
          delta = 0.18
             sd = 0.2
      sig.level = 0.1
          power = 0.9
    alternative = one.sided

NOTE: n is number in *each* group

### Estimating usage of premium diesel

Besides a significance test, I desire an estimate of usage. In this manner I can extrapolate the data to other scenarios. I will use a Bayesian analysis to obtain these estimates. The prior to be used is a mixture of three beliefs: either premium makes no difference, or there is indeed a 5% gain to be made, or something else entirely. The latter is an uninformed prior between 3 and 4 l/100 km. The combined density is plotted below.
usage <- seq(3.2,3.8,.01)
dens <- (dnorm(usage,3.6,.05)+
    dnorm(usage,3.6/1.05,.05)+
    dnorm(usage,(3.6+3.6/1.05)/2,.15))/3
plot(x=usage,y=dens,type='l',
    ylim=c(0,4),
    ylab='density',
    xlab='l/100 km',
    main='prior')



## DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis

By ygc

(This article was first published on YGC » R, and kindly contributed to R-bloggers)

My R/Bioconductor package, DOSE, has been published in Bioinformatics.

Summary: Disease ontology (DO) annotates human genes in the context of disease. DO is important annotation in translating molecular findings from high-throughput data to clinical relevance. DOSE is an R package providing semantic similarity computations among DO terms and genes which allows biologists to explore the similarities of diseases and of gene functions in disease perspective. Enrichment analyses including hypergeometric model and gene set enrichment analysis are also implemented to support discovering disease associations of high-throughput biological data. This allows biologists to verify disease relevance in a biological experiment and identify unexpected disease associations. Comparison among gene clusters is also supported.

Availability and implementation: DOSE is released under Artistic-2.0 License. The source code and documents are freely available through Bioconductor (http://www.bioconductor.org/packages/release/bioc/html/DOSE.html).

More importantly, DOSE provides several visualization functions to produce highly customizable, publication-quality figures of similarity and enrichment analyses. With these visualization tools, the results obtained by DOSE are more interpretable.




## Book Review: Mastering Scientific Computing with R

(This article was first published on Data Analysis and Visualization in R, and kindly contributed to R-bloggers)

The PACKT marketing team again contacted me, this time to review their new book Mastering Scientific Computing with R. The 432-page book (including covers) consists of 10 chapters, starting from basic R and ending with advanced data management. Between those chapters are topics on linear and non-linear models, nonparametric techniques, linear and matrix algebra, principal component analysis, and structural equation modeling and confirmatory factor analysis.

Aside from code samples (suitable even for R beginners) that include additional comments, there is also an online tutorial on Structural Equation Modeling on the book webpage, which makes the book’s content easier to learn. In addition, the book extends the discussion beyond the methodological aspects of using R in scientific computing by providing a technical and practical approach to some of the methodologies (e.g. “choosing the number of components to retain in Principal Component Analysis”, “Rasch analysis using linear algebra and a paired comparisons matrix”, “Exploratory factor analysis and reflective constructs”).

Highly recommended book!



## One weird trick to compile multipartite dynamic documents with Rmarkdown

(This article was first published on biochemistries, and kindly contributed to R-bloggers)

This afternoon I stumbled across this one weird trick: an undocumented part of the YAML headers that get processed when you click the ‘knit’ button in RStudio. Knitting turns an Rmarkdown document into a specified format, using the rmarkdown package’s render function to call pandoc (a universal document converter written in Haskell).

If you specify a `knit:` field in an Rmarkdown YAML header you can replace the default function (`rmarkdown::render`) that the input file and encoding are passed to with any arbitrarily complex function.

For example, the developer of slidify passed in a totally different function, `slidify::knit2slides`, rather than `render`.

I thought it’d be worthwhile to modify what was triggered upon clicking that button – as simply as using a specified output file name (see StackOverflow here), or essentially running a sort of `make` to compose a multi-partite document.

Here’s an exemplar Rmarkdown YAML header which threads together a document from 3 component Rmarkdown subsection files:

• The `title:` field becomes a top-level header (`#`) in the output markdown
• The `knit:` field (a currently undocumented hook) replaces `rmarkdown::render` with a custom function specifying parameters for the rendering.
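A sketch of what such a header might look like, based on the fields described above (the file names are hypothetical, and the exact function body is my own illustration, not the original post’s):

```yaml
---
title: "Project README"
knit: (function(inputFile, encoding) { sapply(c("intro.Rmd", "methods.Rmd", "results.Rmd"), function(f) rmarkdown::render(f, quiet = TRUE)); rmarkdown::render(inputFile, encoding = encoding) })
output:
  md_document:
    variant: markdown_github
    includes:
      after_body: [intro.md, methods.md, results.md]
---
```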

Yes, unfortunately it does have to be an unintelligible one-liner unless you have your own R package [with the function exported to the namespace] to source it from (as `package::function`). Here’s the above more clearly laid out:
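Laid out over multiple lines, the one-liner’s logic is roughly the following sketch (the section file names are assumptions, not from the original):

```r
# A readable, multi-line version of the knit: one-liner
function(inputFile, encoding) {
  # 1. Render each sub-section Rmd to markdown first, quietly,
  #    so intermediate files don't trigger the RStudio preview
  sections <- c("01-intro.Rmd", "02-methods.Rmd", "03-results.Rmd")
  for (f in sections) {
    rmarkdown::render(f, encoding = encoding, quiet = TRUE)
  }
  # 2. Render the main file; its includes: after_body list pulls in
  #    the freshly made .md files, and this final render is previewed
  rmarkdown::render(inputFile, encoding = encoding)
}
```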

• Firstly, every section’s Rmarkdown file is rendered into markdown [with the same base name by default]
• Each of these files is ‘included’ after the ‘body’ (cf. the header) of this README, if it’s in the `includes: after_body: [...]` list.
• The `quiet = TRUE` parameter silences the standard “Output created: …” message following `render()`, which would otherwise trigger the RStudio file preview on the last of the intermediate markdown files created.
• After these component files are processed, the final README markdown is rendered (`includes` appends their processed markdown contents), and this full document is previewed.
• All Rmd files here contain a YAML header, the constituent files having only the `output: md_document: variant` field:
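A constituent file’s header is then just this fragment (the variant value is an assumption):

```yaml
output:
  md_document:
    variant: markdown_github
```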

…before their sub-section contents:

```
## Comparison of cancer types surveyed

Comparing cancer types in this paper to CRUK's most up to date prevalence statistics...
```

## Alternative modular setup

Custom knit functions can also solve another problem: the time it takes large manuscripts to compile – a huge obstacle to my own use of Rmarkdown, which I’m delighted to overcome, and which has stopped me from recommending it to others as a practical writing tool despite its clear potential.

E.g., if using `knitcitations`, each reference is downloaded afresh even when the bibliographic metadata has already been obtained. Along with generating individual figures etc., the time to ‘compile’ an Rmarkdown document can therefore scale exorbitantly when writing a moderately sized manuscript (rising from seconds to tens of minutes in the extreme, as I saw on a recent essay), breaking the proper flow of writing and review, and imposing a penalty on extended Rmarkdown compositions.

A modular structure is the only rational way of doing this, but isn’t described anywhere for Rmarkdown’s dynamic documents (to my knowledge?).

In such a framework, the ‘main’ document’s knit function would be as above, but lacking the first step of compiling each `.Rmd` to `.md` (these having been done separately upon each edit), so that pre-made `.md` files would just be `included` (instantly) in the final document:
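A sketch of that stripped-down main-document header (file names again my own placeholders):

```yaml
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding) })
output:
  md_document:
    variant: markdown_github
    includes:
      after_body: ["01-intro.md", "02-methods.md", "03-results.md"]
```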

Much more sensibly, the edited Rmarkdown component files (subsections) wouldn’t need to be re-processed as a whole – e.g. have all references and figures regenerated – rather this would be done per file, and each file could in turn have its own custom `knit:` hook (though note that the example below only works to prevent the file preview; there’s scope to do much more with it)
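Such a per-file hook could be as simple as rendering quietly, which suppresses the “Output created: …” message that triggers the RStudio preview:

```yaml
knit: (function(inputFile, encoding) { rmarkdown::render(inputFile, encoding = encoding, quiet = TRUE) })
```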


The idea would be to follow what this Software Carpentry video describes regarding makefiles for reproducible research papers. In theory, the initially described `knit:` function could generate a full paper including analyses from component section files, each of which could in turn have their own `knit:` hooks.

The example above creates a `README.md` file suitable for display in a standard GitHub repository, though it’s not advisable to write sprawling READMEs: it could easily be tweaked to give a `paper.pdf` as in the Software Carpentry example, using a PDF output in the YAML header instead for the final `.md` to `.pdf` step after including the component parts.

For what it’s worth, my current YAML header for a manuscript in PDF is:
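The original header wasn’t preserved in this copy; a PDF-oriented header of roughly the right shape (every value here is a placeholder, not the author’s actual configuration) would be:

```yaml
title: "Manuscript title"
output:
  pdf_document:
    latex_engine: xelatex
    keep_tex: true
bibliography: references.bib
```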

… and in the top matter (after the YAML, before the markdown, for the LaTeX engine & R):
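That snippet also didn’t survive here; typically this region holds raw LaTeX commands for the engine plus a knitr setup chunk, along these assumed lines:

````markdown
\renewcommand{\familydefault}{\sfdefault}

```{r setup, include=FALSE}
# global chunk options: hide code, suppress warnings, cache results
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, cache = TRUE)
```
````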

A minor limitation I see here is that it’s not possible to provide subsection titles through metadata – at present the title is written to markdown with a hardcoded ‘# ’ prefix. In a reproducible-manuscript utopia the `title:` field could still be specified, with a markdown header prefix of the appropriate level generated accordingly (which might also allow for procedural sub-section numbering – 1.2, 1.2.1, etc.).

The above can also be found on my GitHub development notes Wiki, but it’s not possible to leave comments there. Feedback and more tips and tricks for Rmarkdown workflows are welcome.

✎ Check out the `rmarkdown` package here, and general Rmd documentation here.