#6: Easiest package registration

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the sixth post in the really random R riffs series, or R4 for short.

Posts #1 and #2 discussed how to get the now de rigueur package registration information computed. In essence, we pointed to something which R 3.4.0 would have, and provided tricks for accessing it while R 3.3.3 was still the current release.

But now R 3.4.0 is out, and life is good! Or at least this is easier. For example, a few days ago I committed this short helper script pnrrs.r to littler:

#!/usr/bin/r

if (getRversion() < "3.4.0") stop("Not available for R (< 3.4.0). Please upgrade.", call.=FALSE)

tools::package_native_routine_registration_skeleton(".")

So with this example script pnrrs.r soft-linked to /usr/local/bin (or ~/bin) as I commonly do with littler helpers, all it takes is

cd some/R/package/source
pnrrs.r

and the desired file usable as src/init.c is on stdout. Editing NAMESPACE is quick too, and we’re all done. See the other two posts for additional context. If you don’t have littler, the above also works with Rscript.
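For completeness, the NAMESPACE edit typically amounts to enabling registration in the useDynLib() directive — a sketch, with mypackage standing in for your actual package name:

useDynLib(mypackage, .registration = TRUE)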

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Shiny Application Layouts exercises part 3

By Euthymios Kasvikis

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Shiny Application Layouts-Navbar Page

In the third part of our series we will build another small Shiny app, but with a different UI.

Specifically, we are going to create a Shiny application that includes more than one distinct sub-component, each with its own characteristics.

For our case we are going to use the cars dataset to create a small app.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly, you can review what you learned in the "Building Shiny App" series, as we will build basic Shiny stuff in order to present it in the proper way.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

Shiny Installation

In order to create the app we have to install and load the package “shiny”.

Exercise 1

Install and load shiny.

Navbar Pages

The navbarPage() function creates an application with a standard Navbar at the top. For example:
#ui.R
shinyUI(navbarPage("Application",
tabPanel("Panel 1"),
tabPanel("Panel 2"),
tabPanel("Panel 3")
))

#server.R
function(input, output, session) {}

Exercise 2

Create a Navbar page with three panels. Name the app “Cars” and the panels “Plot”, “Summary” and “Other” respectively.

Now we can work distinctly in each panel. First of all let’s put a sidebar in the first one, like the example below.
#ui.R
library(shiny)
library(markdown)

shinyUI(navbarPage("App",
tabPanel("Panel 1",
sidebarLayout(
sidebarPanel(
),
mainPanel(
)
)),
tabPanel("Panel 2"),
tabPanel("Panel 3")
))

Exercise 3

Add a sidebar in the tabPanel “Plot”.

Now we will add to the sidebar a radio button from which the user can choose to see either the scatter plot or the line chart. Look at the example below.
#ui.R
radioButtons("plote", "Plot ",
c("Scatter"="s", "Line"="l","Bar"="b")
)

Exercise 4

Add radioButtons() with two buttons inside the sidebar. Name it “Plot”. The first will be a scatter chart and the second a line chart.

It is time to create the two charts, as in the example:
#ui.R
mainPanel(
plotOutput("plot")
)

#server.R
output$plot <- renderPlot({
plot(cars, type=input$plote)
})

Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

Exercise 5

In the mainPanel() create the two plots. HINT: Use plotOutput().

In the tabPanel() “Summary” we will print the descriptive statistics of our dataset via verbatimTextOutput() like the example below.
#ui.R
tabPanel("Summary",
verbatimTextOutput("summary")
)

#server.R
output$summary <- renderPrint({
summary(cars)
})

Exercise 6

Print the summary statistics of the cars dataset in the tabPanel() "Summary". HINT: Use renderPrint().

Sub-Components

Sometimes you may want to add "sub-components" to your tabPanel(). This can be done with the navbarMenu() function. Look at the example.
#ui.R
shinyUI(navbarPage("Application",
tabPanel("Panel 1"),
tabPanel("Panel 2"),
navbarMenu("More",
tabPanel("Sub-Panel A"),
tabPanel("Sub-Panel B"))
))

Exercise 7

In the tabPanel() "Other" create two additional tabPanel() items. Name the first "Data Table" and the second "Description".

Exercise 8

In the first additional tabPanel() "Data Table" place the data table of your dataset. HINT: Use dataTableOutput() and renderDataTable().
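If you get stuck, here is a minimal sketch of the idea (the names are illustrative, not the official solution):
#ui.R
tabPanel("Data Table",
dataTableOutput("table")
)

#server.R
output$table <- renderDataTable({
cars
})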

Exercise 9

In the "Description" panel, describe your dataset: "The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s."

Navlists

The navlistPanel() may be a good alternative to tabsetPanel() when you have many panels. A navlist presents the menu items as a sidebar list. Here's an example of a navlistPanel():
#ui.R
shinyUI(fluidPage(

titlePanel("Application"),

navlistPanel(
"Header A",
tabPanel("Item 1"),
tabPanel("Item 2"),
"Header B",
tabPanel("Item 3"),
tabPanel("Item 4"),
"-----",
tabPanel("Item 5")
)
))

Exercise 10

Transform your app by using navlistPanel() instead of navbarPage().


Shiny server series part 4: adding user authentication using Auth0

By Jasper Ginn

(This article was first published on Jasper Ginn’s blog, and kindly contributed to R-bloggers)

This guide is part of a series on setting up your own private server running shiny apps. There are many guides with great advice on how to set up an R shiny server and related software. I try to make a comprehensive guide based in part on these resources as well as my own experiences. I always aim to properly attribute information to their respective sources. If you notice an issue, please contact me.

Part 1 of this series is available here

Part 2 of this series is available here

Part 3 of this series is available here

We’re nearly at the end of the shiny server series, and that means that we’ll be getting into the really useful parts. In this tutorial, we’ll use Auth0 to set up user authentication to regulate access to our private shiny server.

Resources used for this part

This guide draws on this excellent tutorial produced by Auth0.

Where were we?

In the previous part, we set up SSL encryption on our server. This makes user authentication safer by encrypting traffic to and from the server. Remember, however, that we didn’t yet set up nginx to use this SSL certificate. So although we set up all the prerequisites, SSL encryption is not yet operational.

As always, the first thing we’ll do is log into the server

# Log into the server
ssh root@<your-VPS-ip>

Before you do anything else, take note of the location where Let's Encrypt stored your certificate. It's easiest to just navigate to the following folder

cd /etc/letsencrypt/live

And execute

ls

Write down the name of this folder; you'll need it to complete the next step.

Now, we can switch to the shiny user:

su shiny
# Navigate to shiny home
cd

Back up the nginx configuration in case something goes wrong. Then, delete the config and open a new default config file:

sudo cp /etc/nginx/sites-available/default /etc/nginx/sites-available/default2
# Delete default
sudo rm /etc/nginx/sites-available/default
# Make a new default config
sudo nano /etc/nginx/sites-available/default

The next part can be a bit tricky. Take a look at the config file below. You need to make the following replacements:

  1. In lines 19-20, replace the text with the name of the folder you just wrote down.
  2. In lines 39 and 91, replace the text with your custom domain name, with and without 'www'.

I’ve added some helpful screenshots below for clarification. It’s easiest to just copy-paste this into a notepad file and replace the values before pasting it into the nano editor.

(The full nginx configuration file appears here in the original post.)

The image below shows what this looks like in the nano editor.

(here’s the top part of the config file)

(and here’s the bottom part)

After pasting the new configuration, save the file by hitting control+x and then Y and enter. Then, restart the nginx service:

sudo service nginx restart

In your browser, navigate to your domain name. If all went well, you should now see the lock symbol indicating that the setup is a success!

What does this new config file do?

Glad you asked. This config file makes sure that we use the Let's Encrypt SSL certificate we created earlier. We also keep the locations we've been using so far, but we changed the config for the private-apps shiny server. This configuration file re-routes all traffic through the Auth0 authentication service we'll set up in a second. If the user successfully authenticates, they will be forwarded to the private shiny environment.

Setting up your Auth0 account

Navigate to the Auth0 website and sign up for an account. A free account works with up to 7,000 users, so this should be more than enough.

Navigate to connections and remove all social logins. We won’t be needing them for this application as we’ll be setting up an email whitelist:

Go to ‘clients’ and click on ‘create client’

Name your application and choose ‘regular web app’:

Now go to ‘settings’, note down your domain, client ID and secret ID, and add a callback url in the following format:

https://www.<your-domain-name>.<extension>/private-apps/callback,
https://<your-domain-name>.<extension>/private-apps/callback

Scroll down and save your changes. Then, navigate to ‘rules’ and click on ‘create first rule’:

From this list, select ‘whitelist for a specific app’:

Change the name of the app (red box) to your application name, then add the email addresses that are allowed to make an account (gold box):

Setting up the Auth0 service

Go back to the terminal and clone my fork of the auth0 shiny-auth0 repository:

cd ~ && git clone https://github.com/JasperHG90/shiny-auth0.git

Install nodejs and npm

sudo apt-get install -y nodejs npm

Enter the shiny-auth0 repository and let npm install the dependencies

cd shiny-auth0 && npm install

Go to this website and generate a random string of 40 characters. Copy-paste this string in a temporary document. You will need it a little later.

Go back to the terminal and create a file called ‘.env’ in the shiny-auth0 directory:

 nano .env

Copy the lines below into a temporary document and add the required information

AUTH0_CLIENT_SECRET=yoursecretid
AUTH0_CLIENT_ID=yourclientid
AUTH0_DOMAIN=yourdomain
AUTH0_CALLBACK_URL=https://<your-domain-name>.<extension>/private-apps/callback
COOKIE_SECRET=random string generated in the previous step
SHINY_HOST=localhost
SHINY_PORT=4949
PORT=3000

Save the file by hitting control+x and then Y and enter. Make a symlink for nodejs so it can also be called with ‘node’:

sudo ln -s `which nodejs` /usr/bin/node

Run the app using:

npm start

Now, go to https://www.<your-domain-name>.<extension>/private-apps/ to see if it works.

Now that we know that user authentication works, we need to make sure that the Auth0 service can quietly run in the background like a regular service. Hit control + C in the terminal to stop the nodejs server. Then, execute

sudo nano /etc/systemd/system/shiny-auth0.service

and copy-paste the following:

[Service]
ExecStart=/usr/bin/node /home/shiny/shiny-auth0/bin/www
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=shiny
User=shiny
Group=shiny
WorkingDirectory=/home/shiny/shiny-auth0/
Environment=NODE_ENV=production

[Install]
WantedBy=multi-user.target

Hit Control + X, Y and then enter. Then, execute the next two lines:

sudo systemctl enable shiny-auth0
sudo systemctl start shiny-auth0

This automatically starts up the auth0 server every time the server restarts.

Wrapping up

Congratulations, user authentication is now set up! This wraps up part 4 of the shiny server series. In the last instalment, we’ll be adding a simple static website created using Jekyll.


R⁶ — Using pandoc from R + A Neat Package For Reading Subtitles

By hrbrmstr

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Once I realized that my planned, larger post would not come to fruition today I took the R⁶ post (i.e. "minimal expository, keen focus") route, prompted by a Twitter discussion with some R mates who needed to convert "lightly formatted" Microsoft Word (docx) documents to markdown. Something like this:

to:

Does pandoc work?
=================

Simple document with **bold** and *italics*.

This is definitely a job that pandoc can handle.

pandoc is a Haskell (yes, Haskell) program created by John MacFarlane and is an amazing tool for transcoding documents. And, if you’re a “modern” R/RStudio user, you likely use it every day because it’s ultimately what powers rmarkdown / knitr.

Yes, you read that correctly. Your beautiful PDF, Word and HTML R reports are powered by — and would not be possible without — Haskell.

Doing the aforementioned conversion from docx to markdown is super-simple from R:

rmarkdown::pandoc_convert("simple.docx", "markdown", output="simple.md")

Give the help on rmarkdown::pandoc_convert() a read as well as the very thorough and helpful documentation over at [pandoc.org](http://pandoc.org) to see the power available at your command.
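If your Word files contain images or long paragraphs, extra command-line flags can be passed through the options argument — a hedged sketch, not the only way to do it:

# A sketch only: these flags are handed straight to pandoc, so check that your
# pandoc version supports them (older releases used --no-wrap instead of --wrap=none)
rmarkdown::pandoc_convert("simple.docx", "markdown", output = "simple.md",
                          options = c("--extract-media=.", "--wrap=none"))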

Just One More Thing

This section — technically — violates the R⁶ principle so you can stop reading if you’re a purist 🙂

There's a neat, not-on-CRAN package by François Keck called subtools (https://github.com/fkeck/subtools) which can slice, dice and reformat digital content subtitles. There are multiple formats for these subtitle files and it seems to be able to handle them all.

There was a post (earlier in April) about Ranking the Negativity of Black Mirror Episodes. That post is in Python and I've never had time to fully replicate it in R.

Here's a snippet (sans expository) that can get you started pulling subtitles into R and tidytext. I would have written scraper code, but the various subtitle aggregation sites make that a task suited for something like my splashr package and I just had no cycles to write the code. So, I grabbed the first season of "The Flash" and used the Bing sentiment lexicon from tidytext to see how the season looked.

The overall scoring for a given episode is naive and can definitely be improved upon.

Definitely drop a link to anything you create in the comments!

# devtools::install_github("fkeck/subtools")

library(subtools)
library(tidytext)
library(hrbrthemes)
library(tidyverse)

data(stop_words)

bing <- get_sentiments("bing")
afinn <- get_sentiments("afinn")

fils <- list.files("flash/01", pattern = "srt$", full.names = TRUE)

pb <- progress_estimated(length(fils))

map_df(1:length(fils), ~{

  pb$tick()$print()

  read.subtitles(fils[.x]) %>%
    sentencify() %>%
    .$subtitles %>%
    unnest_tokens(word, Text) %>%
    anti_join(stop_words, by="word") %>%
    inner_join(bing, by="word") %>%
    inner_join(afinn, by="word") %>%
    mutate(season = 1, ep = .x)

}) %>% as_tibble() -> season_sentiments


count(season_sentiments, ep, sentiment) %>%
  mutate(pct = n/sum(n),
         pct = ifelse(sentiment == "negative", -pct, pct)) -> bing_sent

ggplot() +
  geom_ribbon(data = filter(bing_sent, sentiment=="positive"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  geom_ribbon(data = filter(bing_sent, sentiment=="negative"),
              aes(ep, ymin=0, ymax=pct, fill=sentiment), alpha=3/4) +
  scale_x_continuous(expand=c(0,0.5), breaks=seq(1, 23, 2)) +
  scale_y_continuous(expand=c(0,0), limits=c(-1,1),
                     labels=c("100%\nnegative", "50%", "0", "50%", "positive\n100%")) +
  labs(x="Season 1 Episode", y=NULL, title="The Flash — Season 1",
       subtitle="Sentiment balance per episode") +
  scale_fill_ipsum(name="Sentiment") +
  guides(fill = guide_legend(reverse=TRUE)) +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.y=element_text(vjust=c(0, 0.5, 0.5, 0.5, 1)))


«smooth» package for R. es() function. Part VI. Parameters optimisation

By Ivan Svetunkov

(This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

Now that we have looked into the basics of the es() function, we can discuss how the optimisation mechanism works, how the parameters are restricted and what the initial values of the parameters are in the optimisation of the function. This will be a fairly technical post for the researchers who are interested in the inner (darker) parts of es().

NOTE. In this post we will discuss initial values of parameters. Please don't confuse this with initial values of components. The former is a wider term than the latter, because in general it includes the latter. Here I will explain how initialisation is done before the parameters are optimised.

Let’s get started.

Before the optimisation, we need to have some initial values of all the parameters we want to estimate. The number of parameters and the initialisation principles depend on the selected model. Let's go through all of them in detail.

The smoothing parameters \(\alpha\), \(\beta\) and \(\gamma\) (for the level, trend and seasonal components) for the model with additive error term are set to be equal to 0.3, 0.2 and 0.1, while for the multiplicative one they are equal to 0.1, 0.05 and 0.01 respectively. The motivation here is that we want to have parameters closer to zero in order to smooth the series (although we don't always get these values), and in the case of multiplicative models the parameters need to be very low, otherwise the model may become too sensitive to the noise.

The next important value is the initial value of the damping parameter \(\phi\), which is set to be equal to 0.95. We don't want to start from one, because the damped trend model in that case loses the property of damping, but we want to be close to one in order not to enforce excessive damping.

As for the vector of states, its initial values are set depending on the type of model. First, the following simple model is fit to the first 12 observations of the data (if we don't have 12 observations, then to the whole sample we have):
\begin{equation} \label{eq:simpleregressionAdditive}
y_t = a_0 + a_1 t + e_t .
\end{equation}
In the case of multiplicative trend we use a different model:
\begin{equation} \label{eq:simpleregressionMulti}
\log(y_t) = a_0 + a_1 t + e_t .
\end{equation}
In both cases \(a_0\) is the intercept, which is used as the initial value of the level component, and \(a_1\) is the slope of the trend, which is used as the initial value of the trend component. In the case of a multiplicative model, exponents of \(a_0\) and \(a_1\) are used. For the case of no trend, a simple average (of the same sample) is used as the initial value of the level component.

In the case of a seasonal model, the classical seasonal decomposition ("additive" or "multiplicative", depending on the type specified by the user) is done using the decompose() function, and the seasonal indices are used as the initial values for the seasonal component.
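Just to illustrate the idea, here is a sketch of the principle (not the actual es() internals):

# Sketch of the initialisation principle only, not the actual es() code
y <- as.numeric(AirPassengers)          # any univariate series will do
n <- min(12, length(y))
tt <- 1:n
fit <- lm(y[1:n] ~ tt)                  # additive case: y_t = a_0 + a_1 t + e_t
level0 <- unname(coef(fit)[1])          # intercept a_0 -> initial level
trend0 <- unname(coef(fit)[2])          # slope a_1     -> initial trend
seas0 <- decompose(AirPassengers, type = "additive")$figure  # seasonal initials
c(level0 = level0, trend0 = trend0)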

All the values are then packed in a vector called C (I know that this is not a good name for the vector of parameters, but that's life) in the following order:

  1. Vector of smoothing parameters \(\mathbf{g}\) (persistence);

  2. Damping parameter \(\phi\) (phi);

  3. Initial values of the non-seasonal part of the vector of states \(\mathbf{v}_t\) (initial);

  4. Initial values of the seasonal part of the vector of states \(\mathbf{v}_t\) (initialSeason);

  5. After that, the parameters of exogenous variables are added to the vector (we will cover the topic of exogenous variables separately in one of the upcoming posts). The sequence is:

  6. Vector of parameters of exogenous variables \(\mathbf{a}_t\) (initialX);

  7. Transition matrix for exogenous variables (transitionX);

  8. Persistence vector for exogenous variables (persistenceX).

Obviously, if we use predefined values for some of those elements, then they are not optimised and are skipped during the formation of the vector C. For example, if the user specifies the parameter initial, then step (3) is skipped.

There are some restrictions on the estimated parameter values. They are defined in the vectors CLower and CUpper, which have the same length as C and correspond to the same elements as in C (persistence vector, then damping parameter, etc.). They may vary depending on the value of the parameter bounds. These restrictions are needed in order to find the optimal value of the vector C faster. The majority of them are fairly technical, making sure that the resulting model has meaningful components (for example, a multiplicative component should be greater than zero). The only parameter that is worth mentioning separately is the damping parameter \(\phi\). It is allowed to take values between zero and one (including the boundary values). In this case the forecasting trajectories do not exhibit explosive behaviour.

Now, the vectors CLower and CUpper define general regions for all the parameters, but the bounds of the smoothing parameters need finer regulation, because they are connected with each other. That is why they are regulated in the cost function itself. If the user defines "usual" bounds, then they are restricted to make sure that:
\begin{equation} \label{eq:boundsUsual}
\alpha \in [0, 1]; \beta \in [0, \alpha]; \gamma \in [0, 1-\alpha]
\end{equation}
This way exponential smoothing has the property of an averaging model, meaning that the weights are distributed over time in an exponential fashion, they are all positive and add up to one, plus the weights of the newer observations are higher than the weights of the older ones.
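Purely as an illustration of this region (this helper is not part of es()):

# Illustrative check of the "usual" bounds, not part of the smooth package
usual_ok <- function(alpha, beta = 0, gamma = 0) {
  alpha >= 0 && alpha <= 1 &&
    beta >= 0 && beta <= alpha &&
    gamma >= 0 && gamma <= 1 - alpha
}
usual_ok(0.3, 0.2, 0.1)  # TRUE  - the default starting values lie inside
usual_ok(1.99, 0.018)    # FALSE - outside the traditional region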

If the user defines bounds="admissible", then the eigenvalues of the discount matrix are calculated on each iteration. The function makes sure that the selected smoothing parameters guarantee that the eigenvalues lie in the unit circle. This way the model has the property of being stable, which means that the weights decrease over time and add up to one. However, on each separate observation they may become negative or greater than one, meaning that the model is no longer an "averaging" model.
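For intuition, here is a sketch of such a check for an ETS(A,A,N) model, assuming the standard single-source-of-error state-space form with discount matrix \(D = F - g w'\); this mirrors the idea rather than the actual smooth internals:

# Sketch of an admissibility (stability) check for ETS(A,A,N), assuming the
# standard state-space form; not taken from the smooth package itself
alpha <- 1.99; beta <- 0.018         # example values outside the usual region
Fmat <- matrix(c(1, 0, 1, 1), 2, 2)  # transition matrix for (level, trend)
w <- c(1, 1)                         # measurement vector
g <- c(alpha, beta)                  # persistence (smoothing) vector
D <- Fmat - outer(g, w)              # discount matrix D = F - g w'
max(Mod(eigen(D)$values)) < 1        # TRUE -> eigenvalues inside the unit circle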

In the extreme case of bounds="none", the bounds of the smoothing parameters are not checked at all.

In case of a violation of the bounds for the smoothing parameters, the cost function returns a very high number, so the optimiser "hits the wall" and moves on to the next value.

In order to optimise the model we use the function nloptr() from the nloptr package. This function implements non-linear optimisation algorithms written in C. The smooth functions use two algorithms: BOBYQA and Nelder-Mead. This is done in two stages: the parameters are estimated using the former, after which the returned parameters are used as the initial values for the latter. In the case of mixed models we also check whether the parameters returned from the first stage differ from the initial values. If they don't, then the optimisation failed and BOBYQA is repeated, but with different initial values of the vector of parameters C (smoothing parameters that failed during the optimisation are set equal to zero).

Overall this optimisation mechanism guarantees that the parameters are close to the optimal values, lie in the meaningful region and satisfy the predefined bounds.

As an example, we will apply es() to the time series N41 from the M3 competition.

First, here's how ETS(A,A,N) with the usual bounds looks on that time series:

es(M3$N0041$x,"AAN",bounds="u",h=6)
Time elapsed: 0.1 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
    0     0 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 397.628
Cost function type: MSE; Cost function value: 101640.73

Information criteria:
     AIC     AICc      BIC 
211.1391 218.6391 214.3344

As we see, in this case the optimal smoothing parameters are equal to zero. This means that we do not take any information into account and just produce the straight line (deterministic trend). See for yourselves:

Series №41 and ETS(A,A,N) with traditional (aka “usual”) bounds

And here’s what happens with the admissible bounds:

es(M3$N0041$x,"AAN",bounds="a",h=6)
Time elapsed: 0.11 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
1.990 0.018 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 327.758
Cost function type: MSE; Cost function value: 69059.107

Information criteria:
     AIC     AICc      BIC 
205.7283 213.2283 208.9236

The smoothing parameter of the level, \(\alpha\), is greater than one. It is almost two. This means that the exponential smoothing is no longer an averaging model, but I can assure you that the model is still stable. Such a high value of the smoothing parameter means that the level in the time series changes drastically. This is not common and usually not desired, but it is possible behaviour of exponential smoothing. Here is how it looks:

Series №41 and ETS(A,A,N) with admissible bounds

Here I would like to note that the model can be stable even with negative smoothing parameters. So don't be scared. If the model is not stable, the function will warn you.

Last but not least, the user can regulate the values of the C, CLower and CUpper vectors for the first optimisation stage. Model selection does not work with the provided vectors of initial parameters, because the length of the C, CLower and CUpper vectors is fixed, while in the case of model selection it would vary from model to model. The user also needs to make sure that the length of each of the vectors is correct and corresponds to the selected model. The values are passed via the ellipsis, in the following way:

Cvalues 
Time elapsed: 0.1 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
    1     0 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 429.923
Cost function type: MSE; Cost function value: 118821.938

Information criteria:
     AIC     AICc      BIC 
213.3256 220.8256 216.5209

In this case we reached the boundary values for both the level and trend smoothing parameters. The resulting model has a constantly changing level (random walk level) and a deterministic trend. This is a weird, but possible combination. The fit and forecast look similar to the model with the admissible bounds, but not as reactive:

Series №41 and ETS(A,A,N) with traditional bounds and non-standard initials

Using this functionality, you may end up with ridiculous and meaningless models, so be aware and be careful. For example, the following does not make any sense from a forecasting perspective:

Cvalues 
Time elapsed: 0.12 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta 
2.483 1.093 
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 193.328
Cost function type: MSE; Cost function value: 24027.222

Information criteria:
     AIC     AICc      BIC 
190.9475 198.4475 194.1428 
Warning message:
Model ETS(AAN) is unstable! Use a different value of 'bounds' parameter to address this issue!

Although the fit is very good and the model approximates the data better than all the others (the MSE value is 24027 versus 70000 – 120000 for the other models), the model is unstable (the function warns us about that), meaning that the weights are distributed in an unreasonable way: the older observations become more important than the newer ones. The forecast of such a model is meaningless and most probably biased and inaccurate. Here is how it looks:

Series №41 and ETS(A,A,N) with crazy bounds

So be careful with manual tuning of the optimiser.

Have fun but be reasonable!


Accessing and Manipulating Biological Databases Exercises (Part 1)

By Stephen James

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the exercises below we cover how to access and manipulate biological databases through the rentrez and seqinr packages.

Install Packages
rentrez
seqinr
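For instance, a minimal setup sketch (it assumes an internet connection for the later NCBI queries):

# Install (once) and load the two packages used in these exercises
install.packages(c("rentrez", "seqinr"))
library(rentrez)
library(seqinr)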

Answers to the exercises are available here

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Print all the available databases which you can access through the rentrez package.

Exercise 2

Print all the searchable terms in a database

Exercise 3

Display the details of any database of your choice

Exercise 4

Retrieve and print 10 IDs of nucleotide sequences from the nuccore database related to humans.

Exercise 5

Retrieve and print 20 IDs of protein sequences from the protein database related to humans.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways, while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 6

Create a FASTA file for a particular human protein sequence from the listed IDs.

Exercise 7

Create a FASTA file for a particular human nucleotide sequence from the listed IDs.

Exercise 8

Open the nucleotide FASTA file and print the details using the seqinr package.

Exercise 9

Open the protein FASTA file and print the details using the seqinr package.

Exercise 10

Open the nucleotide FASTA file and print only the sequence from the created FASTA file, stripping all other information.


Plotting Data Online via Plotly and Python

By Mike Driscoll

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

I don’t do a lot of plotting in my job, but I recently heard about a website called Plotly that provides a plotting service for anyone’s data. They even have a plotly package for Python (among others)! So in this article we will be learning how to plot with their package. Let’s have some fun making graphs!

Getting Started

You will need the plotly package to follow along with this article. You can use pip to get the package and install it:

pip install plotly

Now that you have it installed, you’ll need to go to the Plotly website and create a free account. Once that’s done, you will get an API key. To make things super simple, you can use your username and API key to create a credentials file. Here’s how to do that:

import plotly.tools as tls

tls.set_credentials_file(
        username="your_username", 
        api_key="your_api_key")

# to get your credentials
credentials = tls.get_credentials_file()

If you don’t want to save your credentials, then you can also sign in to their service by doing the following:

import plotly.plotly as py
py.sign_in('your_username','your_api_key')

For the purposes of this article, I’m assuming you have created the credentials file. I found that makes interacting with their service a bit easier to use.

Creating a Graph

Plotly seems to default to a Scatter Plot, so we’ll start with that. I decided to grab some data from a census website. You can download any US state’s population data, along with other pieces of data. In this case, I downloaded a CSV file that contained the population of each county in the state of Iowa. Let’s take a look:

import csv
import plotly.plotly as py

#----------------------------------------------------------------------
def plot_counties(csv_path):
    """
    http://census.ire.org/data/bulkdata.html
    """
    counties = {}
    county = []
    pop = []
    
    counter = 0
    with open(csv_path) as csv_handler:
        reader = csv.reader(csv_handler)
        for row in reader:
            if counter  == 0:
                counter += 1
                continue
            county.append(row[8])
            pop.append(row[9])
            
    trace = dict(x=county, y=pop)
    data = [trace]
    py.plot(data, filename='ia_county_populations')
    
if __name__ == '__main__':
    csv_path = 'ia_county_pop.csv'
    plot_counties(csv_path)

If you run this code, you should see a graph that looks like this:

You can also view the graph here. Anyway, as you can see in the code above, all I did was read the CSV file and extract out the county name and the population. Then I put that data into two different Python lists. Finally I created a dictionary of those lists and then wrapped that dictionary in a list. So you end up with a list that contains a dictionary that contains two lists! To make the Scatter Plot, I passed the data to plotly’s plot method.

Converting to a Bar Chart

Now let’s see if we can change the ScatterPlot to a Bar Chart. First off, we’ll play around with the plot data. The following was done via the Python interpreter:

>>> scatter = py.get_figure('driscollis', '0')
>>> print scatter.to_string()
Figure(
    data=Data([
        Scatter(
            x=[u'Adair County', u'Adams County', u'Allamakee County', u'..', ],
            y=[u'7682', u'4029', u'14330', u'12887', u'6119', u'26076', '..'  ]
        )
    ])
)

This shows how we can grab the figure using the username and the plot’s unique number. Then we printed out the data structure. You will note that it doesn’t print out the entire data structure. Now let’s do the actual conversion to a Bar Chart:

from plotly.graph_objs import Bar, Data, Figure, Layout

scatter_data = scatter.get_data()
trace_bar = Bar(scatter_data[0])
data = Data([trace_bar])
layout = Layout(title="IA County Populations")
fig = Figure(data=data, layout=layout)
py.plot(fig, filename='bar_ia_county_pop')

This will create a bar chart at the following URL: https://plot.ly/~driscollis/1.

Here’s the image of the graph:

This code is slightly different than the code we used originally. In this case, we explicitly created a Bar object and passed it the scatter plot’s data. Then we put that data into a Data object. Next we created a Layout object and gave our chart a title. Then we created a Figure object using the data and layout objects. Finally we plotted the bar chart.

Saving the Graph to Disk

Plotly also allows you to save your graph to your hard drive. You can save it in the following formats: png, svg, jpeg, and pdf. Assuming you still have the Figure object from the previous example handy, you can do the following:

py.image.save_as(fig, filename='graph.png')

If you want to save using one of the other formats, then just use that format’s extension in the filename.

Wrapping Up

At this point you should be able to use the plotly package pretty well. There are many other graph types available, so be sure to read Plotly’s documentation thoroughly. They also support streaming graphs. As I understand it, Plotly allows you to create 10 graphs for free. After that you would either have to delete some of your graphs or pay a monthly fee.

Additional Reading

If you have question you can leave a comment below.


    Luke-warm about micromaps

    By Peter's stats stuff – R

    Continuing my exploration of methods for spatial visualisation of data in R, today I'm looking at linked micromaps. Micromaps are a way of showing statistical graphics for a small subset of regions at a time, with a small map indicating which regions are being looked at in each of the small multiples. Alberto Cairo has some nice discussion in this blog post.

    Poverty and education in the USA

    It’s easier to show than explain. Here’s one I’ve adapted from the helpfiles of the R micromaps package by Quinn Payton, Tony Olsen, Marc Weber, Michael McManus and Tom Kincaid of the US Environmental Protection Agency and generously open-sourced by that agency. Under the hood, micromaps uses Hadley Wickham’s ggplot2 to create the graphic assets and (I think) Paul Murrell’s grid to lay them out in aligned fashion.

    I used this USA example as my test case for understanding the micromaps API. Some of the changes I’ve introduced here include:

    • Set my own font
    • Pale blue background with white gridlines to make them go into the background, ggplot2-style.
    • White borders for the grey-fill states that have been previously drawn, so the border lines don’t distract the eye from the five states being referred to in each cluster.
    • Tidied up the code for readability

    Here's the code that draws this, and sets up the overall session. Note that the nzcensus (by me) and mbiemaps (by New Zealand's Ministry of Business, Innovation and Employment, my previous employer) packages are only available on GitHub, not CRAN. Code to install them is available in a previous post. They're needed for New Zealand data and maps later on.

    library(micromap)
    library(nzcensus)
    library(extrafont)
    library(mbiemaps) # for the territorial authority map
    library(tidyverse)
    library(testthat)
    library(scales)
    library(ggrepel)
    
    the_font <- "Calibri"
    theme_set(theme_minimal(base_family = the_font))
    
    #============polish up the example from the ?lmplot helpfile===============
    data("USstates")
    data("edPov")
    
    # convert from SpatialPolygonsDataFrame into a data frame (similar to fortify)
    statePolys <- create_map_table(USstates, 'ST')
    
    # draw graphic
    lmplot(stat.data = edPov,
           map.data = statePolys,
           panel.types = c('labels', 'dot', 'dot','map'),
           panel.data = list('state','pov','ed', NA),
           ord.by = 'pov',   
           grouping = 5,
           colors = brewer.pal(5, "Spectral"),
           median.row = TRUE,
           # how to merge the two data frames:
           map.link = c('StateAb', 'ID'),
           # stroke colour of the borders of previously-drawn areas:
           map.color2 = "white",
           # how to save the result:
           print.file = "0095-eg1.png",
           print.res = 600,
           plot.width = 7,
           plot.height = 9,
           # attributes of the panels:
           panel.att = list(
              list(1, header = "States", 
                   panel.width = 0.8, 
                   text.size = 0.8, 
                   align = 'right', 
                   panel.header.font = the_font,
                   panel.header.face = "bold",
                   text.font = the_font,
                   left.margin = 2,
                   right.margin = 1),
              list(2, header = "Percent living below\npoverty level", 
                   xaxis.title = 'Percent',
                   xaxis.ticks = list(10, 15, 20),
                   xaxis.labels = list(10, 15, 20),
                   graph.bgcolor = 'lightblue',
                   graph.grid.color = "grey99",
                   graph.border.color = "white", 
                   panel.header.font = the_font,
                   panel.header.face = "bold"),
              list(3, header = "Percent adults with\n4+ years of college", 
                   xaxis.title = 'Percent',
                   xaxis.ticks = list(0, 20, 30, 40),
                   xaxis.labels = list(0, 20, 30, 40),
                   graph.bgcolor = 'lightblue',
                   graph.grid.color = "grey99",
                   graph.border.color = "white", 
                   panel.header.font = the_font,
                   panel.header.face = "bold"),
              list(4, header = 'Light gray means\nhighlighted above', 
                   panel.width = 0.8,
                   panel.header.font = the_font,
                   panel.header.face = "italic")
           ))

    Most of the polishing code is in the list of lists passed to the panel.att argument. Each list refers to the attributes of one of the four panels (state names, poverty dot charts, education dot charts, maps). I had to do a bit of digging to find out how to control things like grid colour; while doing this it was useful to run one of the three lines of code below to see what attributes are within your control for the different panel types:

    unlist(labels_att(TRUE) )
    unlist(dot_att(TRUE) )
    unlist(map_att(TRUE) )

    Note that the lmplot function specifies a file to print the graphic to. I don't like this, as it breaks some commonly accepted R workflows. For example, for this blog I usually create in advance all the graphics for each post in SVG format, which scales up nicely if people zoom in on it and is generally the best format (in my view) for web graphics. That can't be done when lmplot restricts you to particular device types. The pattern also doesn't work well with Yixuan Qiu's showtext R package that I normally use for fonts (it lets me access Google fonts, including the Poppins font I use for most graphics).

    New Zealand census example

    To be sure I understood how to use the technique, I wanted to apply it to some New Zealand maps and data. I’m used to presenting data at the Territorial Authority level in New Zealand by means of a choropleth map like this one:

    … which was drawn with this code, using the TA2013 data frame of 2013 census data from my nzcensus package:

    # pre-fortified data frame version of TA level map, from mbiemaps package:
    data(ta_simpl_gg)
    
    # change name in census data to match the name used in ta_simpl_gg
    TA2013$short_name <- gsub(" District", "", TA2013$TA2013_NAM)
    
    # filter out some visually inconvenient data:
    ta_data <- TA2013 %>%
       filter(!short_name %in% c("Chatham Islands Territory", "Area Outside Territorial Authority")) %>%
       mutate(PercNoReligion = PropNoReligion2013 * 100)
    
    # draw map:   
    ta_simpl_gg %>%
       left_join(ta_data, by = c("NAME" = "short_name")) %>%
       arrange(order) %>% 
       ggplot(aes(x = long, y = lat, group = group, fill = MedianIncome2013)) +
       geom_polygon(colour = "grey50") +
       ggmap::theme_nothing(legend = TRUE) +
       theme(legend.position = c(0.2, 0.7)) +
       scale_fill_gradientn("Median individual\nincome", 
                            colours = brewer.pal(11, "Spectral"), label = dollar) +
       coord_map(projection = "sinusoidal") +
       ggtitle("Example choropleth map") +
       labs(caption = "Source: Statistics New Zealand, 2013 Census")

    Doing this with a linked micromap instead of a choropleth map lets us look at more variables at once, but I can’t say I’m happy with the result:

    My reservations about this graphic:

    • It feels like there are just too many territorial authorities for this to be a really friendly graphic.
    • Also, my map of New Zealand is probably too crinkly and individual districts and cities too small to show up well.
    • There’s an awkwardness of New Zealand being tall rather than wide – a smaller aspect ratio than USA. This seems to make the New Zealand map less well suited to the technique than the USA map.
    • It’s hard for the reader to move their eyes back and forth from the district or city name to the dots and to the map.
    • I couldn’t work out how (if it is possible) to control the projection of the map, and hence New Zealand looks a little bit stretched and rotated.

    Here’s the code that drew this, anyway:

    # change names of the map data frame to meet lmplot's expectations
    ta_polys2 <- ta_simpl_gg %>%
       rename(coordsx = long, 
              coordsy = lat, 
              ID = NAME) %>%
       mutate(hole = as.numeric(hole),
              plug = 0,
              region = as.numeric(as.factor(ID)))
    
    # check merge will be ok
    a <- unique(ta_polys2$ID)       # names of map polygons
    b <- unique(ta_data$short_name) # names of TAs with census data
    expect_equal(a[!a %in% b], b[!b %in% a])
    
    # draw graphic:
    lmplot(stat.data = ta_data,
           map.data = ta_polys2,
           panel.types = c('labels', 'dot', 'dot','map'),
           panel.data = list('short_name','MedianIncome2013','PercNoReligion', NA),
           ord.by = 'MedianIncome2013',   
           grouping = 6,
           median.row = FALSE,
           # how to merge the two data frames:
           map.link = c('short_name', 'ID'),
           # stroke colour of the borders of previously-drawn areas:
           map.color2 = "white",
           # how to save the result:
           print.file = "0095-eg2.png",
           print.res = 600,
           plot.width = 14,
           plot.height = 18,
           # attributes of the panels:
           panel.att = list(
              list(1, header = "Territorial Authorities", 
                   panel.width = 1.2, 
                   panel.header.size = 2,
                   text.size = 1.6, 
                   align = 'right', 
                   panel.header.font = the_font,
                   panel.header.face = "bold",
                   text.font = the_font,
                   left.margin = 2,
                   right.margin = 1,
                   xaxis.title.size = 2),
              list(2, header = "Median\nincome", 
                   panel.header.size = 2,
                   xaxis.title = 'Dollars',
                   xaxis.title.size = 2,
                   xaxis.labels.size = 2,
                   graph.bgcolor = 'grey80',
                   graph.grid.color = "grey99",
                   graph.border.color = "white", 
                   panel.header.font = the_font,
                   panel.header.face = "bold",
                   point.border = FALSE,
                   point.size = 2),
              list(3, header = "Percent individuals\nwith no religion", 
                   panel.header.size = 2,
                   xaxis.title = 'Percent',
                   xaxis.title.size = 2,
                   xaxis.labels.size = 2,
                   graph.bgcolor = 'grey80',
                   graph.grid.color = "grey99",
                   graph.border.color = "white", 
                   panel.header.font = the_font,
                   panel.header.face = "bold",
                   point.border = FALSE,
                   point.size = 2),
              list(4, header = '\n', 
                   panel.header.size = 2,
                   panel.width = 0.5,
                   panel.header.font = the_font,
                   panel.header.face = "italic",
                   active.border.size = 0.2,
                   withdata.border.size = 0.2,
                   nodata.border.size = 0.2)
           ))

    Now, there’s a standard way of showing two variables against each other. We lose the spatial element, but for most purposes I think the good old scatter plot is better for this data:

    …drawn with:

    ta_data %>%
       ggplot(aes(x = MedianIncome2013, y = PropNoReligion2013, label = short_name)) +
       scale_x_continuous("Median income", label = dollar) +
       scale_y_continuous("No religion", label = percent) +
       geom_smooth(method = "lm", colour = "white") +
       geom_point(aes(size = CensusNightPop2013), shape = 19, colour = "grey60") +
       geom_point(aes(size = CensusNightPop2013), shape = 1, colour = "black") +
       geom_text_repel(colour = "steelblue", size = 2.5) +
       scale_size_area("Census night population", max_size = 10, label = comma) +
       theme(legend.position = "bottom") +
       labs(caption = "Source: Statistics New Zealand, census 2013")

    Maybe better with a smaller number of areas?

    New Zealand has 66 districts and cities (not counting Chatham Islands), but only 16 Regional Councils. Perhaps the method works better with a smaller number of areas to show:

    … and I think that probably is ok. But really, all we are showing here is 32 numbers. It's an expensive graphic for something that could almost be as meaningfully shown in a table. All up, my reaction to linked micromaps is one of caution. Like any visualisation tool, I think they'll be good in some circumstances, but in others they just won't communicate easily.

    Code for the regional council micromap:

    #================regions example============
    # convert from SpatialPolygonsDataFrame into a data frame (similar to fortify)
    data(region_simpl) # map from mbiemaps
    reg_polys <- create_map_table(region_simpl, 'NAME')
    
    reg_data <- REGC2013 %>%
       mutate(PercNoReligion = PropNoReligion2013 * 100) %>%
       filter(REGC2013_N != "Area Outside Region")
    
    lmplot(stat.data = reg_data,
           map.data = reg_polys,
           panel.types = c('labels', 'dot', 'dot','map'),
           panel.data = list('REGC2013_N','MedianIncome2013','PercNoReligion', NA),
           ord.by = 'MedianIncome2013',   
           grouping = 4,
           median.row = FALSE,
           # how to merge the two data frames:
           map.link = c('REGC2013_N', 'ID'),
           # stroke colour of the borders of previously-drawn areas:
           map.color2 = "white",
           # how to save the result:
           print.file = "0095-eg3.png",
           print.res = 600,
           plot.width = 8,
           plot.height = 9,
           # attributes of the panels:
           panel.att = list(
              list(1, header = "Territorial Authorities", 
                   panel.width = 1.2, 
                   text.size = 0.8, 
                   align = 'right', 
                   panel.header.font = the_font,
                   panel.header.face = "bold",
                   text.font = the_font,
                   left.margin = 2,
                   right.margin = 1),
              list(2, header = "Median\nincome", 
                   xaxis.title = 'Dollars',
                   panel.width = 0.8,
                   left.margin = .5, 
                   right.margin = .5, 
                   graph.bgcolor = 'grey90',
                   graph.grid.color = "grey99",
                   graph.border.color = "white", 
                   panel.header.font = the_font,
                   panel.header.face = "bold",
                   point.border = FALSE,
                   point.size = 2.5),
              list(3, header = "Percent individuals\nwith no religion", 
                   panel.width = 0.8,
                   left.margin = .5, 
                   right.margin = .5, 
                   xaxis.title = 'Percent',
                   graph.bgcolor = 'grey90',
                   graph.grid.color = "grey99",
                   graph.border.color = "white", 
                   panel.header.font = the_font,
                   panel.header.face = "bold",
                   point.border = FALSE,
                   point.size = 2.5),
              list(4, header = '\n', 
                   panel.width = 1.2,
                   panel.header.font = the_font,
                   panel.header.face = "italic",
                   active.border.size = 0.5,
                   withdata.border.size = 0.5,
                   nodata.border.size = 0.5)
           ))

    Bonus – area unit cartograms

    Finally, an aside. My last two blog posts have been about what you might call population-weighted carto-choro-pletho-grams… I’ve been gradually enhancing this Shiny web app which lets people play around with visualisations of 2013 census data. Latest addition is data at the detailed area unit level. Here’s a screenshot showing which area units have higher proportions of people with a background from the Pacific islands:

    Check out the full app for more.


    R Weekly Bulletin Vol – VI

    By R programming

    This week’s R bulletin will cover topics like how to parse milliseconds in R, how to format dates and method to extract specific parts of date and time.

    We will also cover functions like %within%, %m+%, the to.period function, and the period.max and period.min functions. Hope you like this R weekly bulletin. Enjoy reading!

    Shortcut Keys

    1. Indent – hit Tab button at beginning of the line
    2. Outdent – Shift+Tab
    3. Go to a specific line – Shift+Alt+G

    Problem Solving Ideas

    How to parse milliseconds in R

    When R reads dates from a text or spreadsheet file, it will typically store them as a character vector or a factor. To convert them to dates, we need to parse these strings. Parsing can be done using the strptime function, which returns POSIXlt dates.

    To parse the dates, you must tell strptime which bits of the string correspond to which bits of the date. The date format is specified using a string, with components specified with a percent symbol followed by a letter. See the example below. These components are combined with other fixed characters, such as colons in times, or dashes and slashes in dates to form a full specification.

    Example:

    strptime("25/06/2016 09:50:24", "%d/%m/%Y %H:%M:%S")

    [1] “2016-06-25 09:50:24 IST”

    If a string does not match the format in the format string, it takes the value NA. For example, specifying dashes instead of slashes makes the parsing fail:

    Example:

    strptime("25-06-2016 09:50:24", "%d/%m/%Y %H:%M:%S")

    [1] NA

    To parse milliseconds in R we use the following format string:

    Example:

    strptime("25-06-2016 09:50:24.975", "%d-%m-%Y %H:%M:%OS")

    [1] “2016-06-25 09:50:24.975 IST”
    op = options(digits.secs=3)  # print up to 3 digits of fractional seconds
    options(op)                  # restore the previous options

    The options function allows the user to set and examine a variety of global options which affect the way in which R computes and displays its results. The argument “digits.secs” to the options function controls the maximum number of digits to print when formatting time values in seconds. Valid values are 0 to 6 with default 0.

    How to format dates

    We can format a date as per our requirement using the strftime (“string format time”) function. This function works on both POSIXct and POSIXlt date classes, and it turns a date variable into a string.

    Example:

    now_time = Sys.time()
    
# Let us first check the class of the now_time variable by calling the
    # class function.
    
    class(now_time)

    [1] “POSIXct” “POSIXt”

    As can be seen, it is a “POSIXct” variable. To correctly format a date, we need to use the right naming conventions. For example, %B signifies the full name of the month, %p signifies the AM/PM time indicator, and so on. For the full list of the conventions, check the help page for the function. The required formatting conventions are placed in quotes and form the second argument of the function. An example of the strftime function usage is shown below.

    strftime(now_time, "%I:%M%p %A, %d %B %Y")

    [1] “09:16AM Saturday, 29 April 2017”

    Extract specific parts of a date and time

    R has two standard date-time classes, POSIXct and POSIXlt. The POSIXct class stores dates as the number of seconds since the start of 1970, while the POSIXlt stores dates as a list, with components for seconds, minutes, hours, day of month, etc. One can use list indexing to access individual components of a POSIXlt date. These components can be viewed using the unclass function.

    Example: In this example, we use the Sys.time function, which gives the current date and time. We convert this into POSIXlt class using the as.POSIXlt function. Now we can extract the required components.

    time_now = as.POSIXlt(Sys.time())
    print(time_now)

    [1] “2017-04-29 09:16:06 IST”

    unclass(time_now)

    $sec
    [1] 6.642435

    $min
    [1] 16

    $hour
    [1] 9

    $mday
    [1] 29

    $mon
    [1] 3

    $year
    [1] 117

    $wday
    [1] 6

    $yday
    [1] 118

    $isdst
    [1] 0

    $zone
    [1] “IST”

    $gmtoff
    [1] 19800

    attr(,”tzone”)
    [1] “” “IST” “IST”

    # To extract minutes
    time_now$min

    [1] 16

    # To extract day of the month
    time_now$mday

    [1] 29
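
For contrast, here is a minimal sketch of the POSIXct representation: unclassing a POSIXct value yields a single number of seconds since the start of 1970, plus a time-zone attribute.

ct = as.POSIXct("2017-04-29 09:16:06", tz = "Asia/Kolkata")
unclass(ct)

[1] 1493437566
attr(,"tzone")
[1] "Asia/Kolkata"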

    Functions Demystified

    %within% and %m+% functions

%within% function: This function from the lubridate package checks whether a date-time object falls within a given date-time interval. It returns a logical TRUE or FALSE. The function requires a date-time interval, which is created using the interval function, and we use the ymd function to convert a date given as a character string into a date-time object.

    Example:

    library(lubridate)
    dates = interval("2016-03-03", "2016-06-03")
    d = ymd("2016-04-21")
    d %within% dates

    [1] TRUE

    %m+% function: This function is used to add or subtract months to/from a given date-time object.

    Examples:

    # To add a month to a date-time objects
    library(lubridate)
    d = ymd("2016-04-21")
    d %m+% months(1)

    [1] “2016-05-21”

    # To create a sequence of months from a given date-time object
    library(lubridate)
    d = ymd("2016-04-21")
    d %m+% months(1:3)

    [1] “2016-05-21” “2016-06-21” “2016-07-21”

    # To subtract a year from a given date-time object
    d = ymd("2016-04-21")
    d %m+% years(-1)

    [1] “2015-04-21”
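
lubridate also provides the %m-% operator for subtraction; a minimal sketch equivalent to the last example above:

# To subtract a year using %m-%
library(lubridate)
d = ymd("2016-04-21")
d %m-% years(1)

[1] "2015-04-21"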

    to.period function

The to.period function is part of the xts package. It converts an OHLC or univariate object to a specified periodicity lower (that is, coarser) than that of the given data object.

For example, the function can convert a daily series to a monthly series, a monthly series to a yearly one, or a one-minute series to an hourly series.

    Example:

    library(quantmod)
    data = getSymbols("AAPL", src = "yahoo", from = "2016-01-01", to = "2016-01-15",
           auto.assign = FALSE)
    nrow(data)

    [1] 10

    # Convert the above daily data series to weekly data.
    to.period(data, period = "weeks")

    Valid period character strings include: “seconds”, “minutes”, “hours”, “days”, “weeks”, “months”, “quarters”, and “years”.

    To convert the daily data to monthly data, the syntax will be:

df = to.period(data, period = "months")

    The result will contain the open and close for the given period, as well as the maximum and minimum over the new period, reflected in the new high and low, respectively. If volume for a period was available, the new volume will also be calculated.
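
As a sketch using the data object from the getSymbols example above, the optional name argument can be used to prefix the output column names (here assuming the prefix "AAPL"):

# weekly OHLCV bars; columns typically come out as AAPL.Open, AAPL.High, AAPL.Low, AAPL.Close, AAPL.Volume
weekly = to.period(data, period = "weeks", name = "AAPL")
head(weekly)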

    period.max and period.min functions

    The period.max and period.min functions are part of the xts package. period.max is used to calculate a maximum value per period given an arbitrary index of sections to be calculated over. The period.min function works in a similar manner to compute the minimum values.

    The syntax is given as:

    period.max(x, index)
    period.min(x, index)

    where,
    x – represents a univariate data object
    index – represents a numeric vector of endpoints

    Example:

    library(xts)
    # compute the period maximum
    period.max(c(1, 1, 4, 2, 2, 6, 7, 8, -1, 20), c(0, 3, 5, 8, 10))

    3 5 8 10
    4 2 8 20

    # compute the period minimum
    period.min(c(1, 1, 4, 2, 2, 6, 7, 8, -1, 20), c(0, 3, 5, 8, 10))

    3 5 8 10
    1 2 6 -1
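
In practice, the index argument is usually obtained from the endpoints function of the xts package; a minimal sketch on a hypothetical daily xts series:

library(xts)
x = xts(c(1, 1, 4, 2, 2, 6, 7, 8, -1, 20), order.by = as.Date("2016-01-01") + 0:9)
ep = endpoints(x, on = "weeks")    # positions of the last observation in each week (always starts with 0)
period.max(as.numeric(x), ep)      # weekly maxima
period.min(as.numeric(x), ep)      # weekly minima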

    Next Step

    We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

    The post R Weekly Bulletin Vol – VI appeared first on .

    Source:: R News

    Performance differences between RevoScaleR, ColumnStore Table and In-Memory OLTP Table

    By tomaztsql

    (This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

Running RevoScaleR computational functions against *.XDF files, versus having the dataset available in a ColumnStore table or an In-Memory OLTP table, will be the focus of comparison in this blog post.

For this test, I will use the AirLines dataset, available here. Deliberately, I have picked a 200 MB sample (of the 13 GB dataset) in order to properly test the differences and see which approach works best.

After unzipping the file, I will use the following T-SQL query to import it into SQL Server.

With this example, you can import an XDF file directly into a SQL Server table (note that I first transform the CSV file into XDF and then import the XDF file into the SQL table):

    -- must have a write permissions on folder: C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData
    DECLARE @RScript nvarchar(max)
    SET @RScript = N'library(RevoScaleR)
                    rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                    inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                    of <-  rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
                                 transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
                                ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
                                ,overwrite = TRUE
                                ,maxRowsByCols = 10000000
                                ,rowsPerRead = 200000)
                    OutputDataSet <- rxXdfToDataFrame(of)'
    
    DECLARE @SQLScript nvarchar(max)
    SET @SQLScript = N'SELECT 1 AS N'
    
    EXECUTE sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
        ,@input_data_1 = @SQLScript
    WITH RESULT SETS ((ArrDelay INT
                        ,CRSDepTime DECIMAL(6,4)
                        ,DofWeek NVARCHAR(20)))
    GO

So the whole process consists of creating a table, wrapping the above sp_execute_external_script call in a stored procedure, and inserting the results from the external procedure into the table.

    --Complete process
    CREATE TABLE AirFlights_small 
    (id INT IDENTITY(1,1)
    ,ArrDelay INT
    ,CRSDepTime DECIMAL(6,4)
    ,DofWeek NVARCHAR(20) 
    );
    GO
    
    CREATE Procedure ImportXDFtoSQLTable
    AS
    DECLARE @RScript nvarchar(max)
    SET @RScript = N'library(RevoScaleR)
                    rxOptions(sampleDataDir = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData")
                    inFile <- file.path(rxGetOption("sampleDataDir"), "airsample.csv")
                    of <-  rxDataStep(inData = inFile, outFile = "airline20170428_2.xdf", 
                    transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
                ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
                ,overwrite = TRUE
                ,maxRowsByCols = 10000000)
                 OutputDataSet <- data.frame(rxReadXdf(file=of, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))'
    DECLARE @SQLScript nvarchar(max)
    SET @SQLScript = N'SELECT 1 AS N'
    EXECUTE sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
        ,@input_data_1 = @SQLScript
    WITH RESULT SETS ((ArrDelay INT,CRSDepTime DECIMAL(6,4),DofWeek NVARCHAR(20)));
    GO
    
    INSERT INTO AirFlights_small
    EXECUTE ImportXDFtoSQLTable;
    GO

There you go. The data are now in a T-SQL table and we can start with the comparisons. I will be measuring the time it takes to compute the average arrival delay per day of the week.

    RevoScaleR

Using the RevoScaleR package, I will run the rxCrossTabs function, using the transforms argument to convert the day of the week into a factor:

    #importing data
    outFile2 <- rxDataStep(inData = inFile, outFile = "C:/Program Files/Microsoft SQL Server/130/R_SERVER/library/RevoScaleR/SampleData/airline20170428_2.xdf", 
                transformVars = c("ArrDelay", "CRSDepTime","DayOfWeek")
               ,transforms = list(ArrDelay = as.integer(ArrDelay), CRSDepTime = as.numeric(CRSDepTime), DayOfWeek = as.character(DayOfWeek))
               ,overwrite = TRUE
               ,maxRowsByCols = 10000000)
    
    of2 <- data.frame(rxReadXdf(file=outFile2, varsToKeep=c("ArrDelay", "CRSDepTime","DayOfWeek")))
    
    summary(rxCrossTabs(ArrDelay~DayOfWeek
                        ,data = of2  #outFile2
                        ,transforms = transforms
                        ,blocksPerRead=300000), output="means")
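
Note that the transforms object passed above is not defined in the post; based on the stated goal of converting the day of the week into a factor, a hypothetical definition (using expression(), which RevoScaleR accepts for transforms defined outside the call) could look like this:

# hypothetical reconstruction -- the original post does not show this definition
transforms <- expression(list(DayOfWeek = factor(DayOfWeek,
                   levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                              "Friday", "Saturday", "Sunday"))))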

    Now get those times:

    # Getting times
    system.time({ 
      summary(rxCrossTabs(ArrDelay~DayOfWeek
                          ,data = of2
                          ,transforms = transforms
                          ,blocksPerRead=300000), output="means")
      })

This gives an elapsed time of about 7.9 seconds and a computation time of about 3.8 seconds.

    Rows Read: 8400013, Total Rows Processed: 8400013, Total Chunk Time: 3.825 seconds 
    Computation time: 3.839 seconds.
       user  system elapsed 
       2.89    0.37    7.89 

    T-SQL query without any specifics

    To have a baseline, let’s run the following query:

    SET STATISTICS TIME ON;
    SELECT 
    [DofWeek]
    ,AVG(ArrDelay) AS [means]
    FROM
        AirFlights_small
    GROUP BY 
        [DofWeek]
    SET STATISTICS TIME OFF;

And check the time statistics:

    SQL Server Execution Times:
    CPU time = 6124 ms, elapsed time = 2019 ms.
    Warning: Null value is eliminated by an aggregate or other SET operation.

Obviously the CPU/computation time is higher, although the elapsed time is shorter.

    ColumnStore Table

    Let’s create a nonclustered column store index.

    CREATE TABLE AirFlights_CS
    (id INT IDENTITY(1,1)
    ,ArrDelay INT
    ,CRSDepTime DECIMAL(6,4)
    ,DofWeek NVARCHAR(20) 
    );
    GO
    INSERT INTO AirFlights_CS(ArrDelay, CRSDepTime, DofWeek)
    SELECT ArrDelay, CRSDepTime, DofWeek FROM AirFlights_small 
    
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_AirFlight
    ON AirFlights_CS
    (id, ArrDelay, CRSDepTime, DofWeek);
    GO

    With the execution of the same query

    SET STATISTICS TIME ON;
    SELECT 
    [DofWeek]
    ,AVG(ArrDelay) AS [means]
    FROM
      AirFlights_CS
GROUP BY 
    [DofWeek]
SET STATISTICS TIME OFF;

    The following time statistics are in

    SQL Server Execution Times:
    CPU time = 202 ms, elapsed time = 109 ms.
    Warning: Null value is eliminated by an aggregate or other SET operation.

    In-Memory OLTP

To get a memory-optimized table, we need to add a memory-optimized filegroup to the database and create a table with MEMORY_OPTIMIZED turned on:

    CREATE TABLE dbo.AirFlight_M   
    (  
      id INT NOT NULL PRIMARY KEY NONCLUSTERED
     ,ArrDelay INT
     ,CRSDepTime DECIMAL(6,4) 
     ,DofWeek NVARCHAR(20)
    ) WITH (MEMORY_OPTIMIZED=ON, DURABILITY = SCHEMA_AND_DATA);
    GO

    And insert the data

    INSERT INTO AirFlight_M
    SELECT * FROM AirFlights_small

    Running the simple query

    SET STATISTICS TIME ON;
    SELECT 
    [DofWeek]
    ,AVG(ArrDelay) AS [means]
    FROM
        AirFlight_M
    GROUP BY 
        [DofWeek]
    SET STATISTICS TIME OFF;

The results are:

    SQL Server Execution Times:
    CPU time = 6186 ms, elapsed time = 1627 ms.
    Warning: Null value is eliminated by an aggregate or other SET operation.

These results were somewhat expected, mostly because the ColumnStore table is the only one with an index, so its reads (as the execution plans also show) are optimized in comparison to the others. The degree of parallelism and clustered or non-clustered indexes could be pushed further, but the idea was to keep the tests comparable to the one in the RevoScaleR and R environment. With R, we cannot push any index onto the XDF file.

    In R we run:

    system.time({ 
    LMResults <- rxLinMod(ArrDelay ~ DayOfWeek, data = outFile2, transforms = transforms)
    LMResults$coefficients
    })
    
    

    And in SSMS we run:

    SET STATISTICS TIME ON;
    -- 1. T-SQL
    DECLARE @RScript nvarchar(max)
    SET @RScript = N'library(RevoScaleR)
                    LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                    OutputDataSet <- data.frame(LMResults$coefficients)'
    DECLARE @SQLScript nvarchar(max)
    SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_small]'
    EXECUTE sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
        ,@input_data_1 = @SQLScript
    WITH RESULT SETS ((
                --DofWeek NVARCHAR(20)
            --    ,
                Coefficient DECIMAL(10,5)
                ));
    GO
    SET STATISTICS TIME OFF;
    
    
    SET STATISTICS TIME ON;
    -- 2. ColumnStore
    DECLARE @RScript nvarchar(max)
    SET @RScript = N'library(RevoScaleR)
                    LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                    OutputDataSet <- data.frame(LMResults$coefficients)'
    DECLARE @SQLScript nvarchar(max)
    SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlights_CS]'
    EXECUTE sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
        ,@input_data_1 = @SQLScript
    WITH RESULT SETS ((
                --DofWeek NVARCHAR(20)
            --    ,
                Coefficient DECIMAL(10,5)
                ));
    GO
    SET STATISTICS TIME OFF;
    
    
    SET STATISTICS TIME ON;
    -- 3. Memory optimized
    DECLARE @RScript nvarchar(max)
    SET @RScript = N'library(RevoScaleR)
                    LMResults <- rxLinMod(ArrDelay ~ DofWeek, data = InputDataSet)
                    OutputDataSet <- data.frame(LMResults$coefficients)'
    DECLARE @SQLScript nvarchar(max)
    SET @SQLScript = N'SELECT ArrDelay, DofWeek FROM [dbo].[AirFlight_M]'
    EXECUTE sp_execute_external_script
         @language = N'R'
        ,@script = @RScript
        ,@input_data_1 = @SQLScript
    WITH RESULT SETS ((
                --DofWeek NVARCHAR(20)
            --    ,
                Coefficient DECIMAL(10,5)
                ));
    GO
    SET STATISTICS TIME OFF;

    Conclusion

Gathering the statistics on CPU time and elapsed time when running the simple linear regression, this is the comparison:

    df_LR_comparison <- data.frame (
      method = c("T-SQL", "ColumnStore", "Memory Optimized", "RevoScaleR")
      ,CPUtime = c(3000,1625,2156,7689)
      ,ElapsedTime = c(14323,10851,10600,7760)
      )
    library(ggplot2)
    
    ggplot(df_LR_comparison, aes(method, fill=method)) + 
      geom_bar(aes(y=ElapsedTime), stat="identity") +
      geom_line(aes(y=CPUtime, group=1), color="white", size=3) +
      scale_colour_manual(" ", values=c("d1" = "blue", "d2" = "red"))+
      #scale_fill_manual("",values="red")+
      theme(legend.position="none")
    
    

This shows that the elapsed time for the R environment with RevoScaleR (reading the data from XDF) is the fastest, whereas the simple T-SQL run with sp_execute_external_script using RevoScaleR gives the slowest response.

[Plot: elapsed time per method (bars) with CPU time overlaid as a white line]

In terms of CPU time (the white line), the ColumnStore table with the RevoScaleR call through the external procedure outperforms all the others.

Final conclusion: when running statistical analysis (using RevoScaleR or any other R library), use columnstore and index-optimized tables/views to get the best CPU and elapsed times. It is also important to remember that any aggregations and calculations that can be done within SQL Server are better performed there.

    As always, code is available at GitHub.

    Happy coding! 🙂

    To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

    Source:: R News