The Sierpinski Triangle: Visualising infinity in R

By Peter Prevos

[Figure: Sierpinski triangle]

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Wacław Sierpiński was a mathematical genius who developed several of the earliest fractals. The Sierpiński triangle is an easy geometrical figure to conceptualise, but it hides a fascinating mathematical complexity. Start by drawing an equilateral triangle and draw an inverted one in its centre, dividing it into four smaller triangles. Then repeat the process in the three remaining upright triangles, and so on, ad infinitum.

The original Sierpinski triangle will eventually disappear into Cantor dust, a cloud of ever-shrinking triangles of infinitesimal size. The triangle is self-similar: no matter how far you zoom in, the basic geometry remains the same.

The Chaos Game

A fascinating method to create a Sierpinski Triangle is the chaos game. This method uses random numbers and some simple arithmetic rules. A Sierpinski Triangle can be created using the following six steps:

  1. Define three points in a plane to form a triangle.
  2. Randomly select any point on the plane.
  3. Randomly select any one of the three triangle points.
  4. Move half the distance from your current position to the selected vertex.
  5. Plot the current position.
  6. Repeat from step 3.

This fractal is an implementation of chaos theory, as this random process is attracted to a complex, ordered geometry. The game only works when the vertices are selected at random.

Sierpinski Triangle Code

This code implements the six rules in R. The code first initialises the triangle, defines a random starting point and then runs a loop to place random dots. The R plot engine does not draw pixels but characters, which means the diagram is not as accurate as it could be, but the general principle is clear. The x11() and Sys.sleep() commands are used to show the plot develop during the for-loop.

# Sierpinski Triangle

# Initialise triangle (the vertex coordinates are illustrative)
p <- matrix(c(0, 0, 1, 0, 0.5, sqrt(0.75)), ncol = 2, byrow = TRUE)
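A minimal sketch of steps 2 to 6, continuing from the triangle p above (the iteration count and plotting choices are mine, not the author's):

# Randomly select a starting point on the plane (step 2)
x <- runif(2)

# Empty canvas; x11() and Sys.sleep() can be added to watch
# the triangle appear while the loop runs
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "", ylab = "", axes = FALSE)

for (i in 1:20000) {
    v <- p[sample(3, 1), ]        # select a random vertex (step 3)
    x <- (x + v) / 2              # move halfway towards it (step 4)
    points(x[1], x[2], pch = ".") # plot the current position (step 5)
}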

This algorithm demonstrates how a seemingly chaotic process can result in order. Many other versions of chaos games exist, which I leave to the reader to play with. If you create your own versions then please share the code in the comment box below.


Prediction intervals for NNETAR models

By R on Rob J Hyndman

The nnetar function in the forecast package for R fits a neural network model to a time series, with lagged values of the time series as inputs (and possibly some other exogenous inputs). It is thus a nonlinear autoregressive model, and it is not possible to derive prediction intervals analytically. Therefore, we use simulation.
Suppose we fit a NNETAR model to the famous Canadian lynx data:
library(forecast)
(fit <- nnetar(lynx))
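The post then simulates future sample paths from the fitted network and reads prediction intervals off their quantiles. A minimal sketch of that idea (not necessarily the author's exact code; the horizon and the number of paths are my choices):

h <- 20       # forecast horizon
nsim <- 1000  # number of simulated sample paths
sim <- matrix(NA_real_, nrow = h, ncol = nsim)
for (i in seq_len(nsim))
    sim[, i] <- simulate(fit, nsim = h)

# pointwise 80% prediction intervals from the simulated quantiles
lower <- apply(sim, 1, quantile, probs = 0.1)
upper <- apply(sim, 1, quantile, probs = 0.9)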


Subplots in maps with ggplot2

By Ilya Kashnitsky

[Figure: the final map of urb/rur typology with subplots]

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

Following the surprising success of my latest post, I decided to show yet another use case of the handy ggplot2::annotation_custom(). Here I will show how to add small graphical information to maps – just like putting a stamp on an envelope.

The example comes from my current work on a paper, in which I study the effect of urban/rural differences on the relative differences in population ageing (I plan to tell a bit more in one of the next posts). Let’s have a look at the map we are going to reproduce in this post:

So, with this map I want to show the location of more and less urbanized NUTS-2 regions of Europe. But I also want to show – with subplots – how I defined the three subregions of Europe (Eastern, Southern, and Western) and the relative frequency of the three categories of regions (Predominantly Urban, Intermediate, and Predominantly Rural) within each of the subregions. The logic of actions is simple: first prepare all the components, then assemble them in a composite plot. Let's go!

The code to prepare R session and load the data.

# additional packages
library(tidyverse)
library(ggthemes)
library(rgdal)
library(viridis)
library(extrafont)
myfont 

Now, I prepare the spatial objects to be plotted with ggplot2 and create a blank map of Europe – our canvas.

# fortify spatial objects
bord 

Okay, now the envelope is ready. It’s time to prepare the stamps. Let’s create a nice mosaic plot showing the distribution of NUTS-2 regions by subregions and the urb/rur categories. I found the simplest way to create a nice mosaic plot on Stack Overflow.

# create a nice mosaic plot; solution from SO:
# http://stackoverflow.com/a/19252389/4638884
# (the makeplot_mosaic() helper is defined in the linked answer;
# the object names `typ_mosaic` and `df` are assumed)
typ_mosaic <- makeplot_mosaic(data = df %>% mutate(type = as.numeric(type)), 
                              x = subregion, y = type)+
        theme_void()+
        scale_fill_viridis(option = "B", discrete = T, end = .8)+
        scale_y_continuous(limits = c(0, 1.4))+
        annotate("text", x = c(27, 82.5, 186), y = 1.05, 
                 label = c("EAST", "SOUTH", "WEST"), 
                 size = 4, fontface = 2, 
                 vjust = 0.5, hjust = 0,
                 family = myfont) + 
        coord_flip()+
        theme(legend.position = "none")

[Figure: mosaic plot of NUTS-2 regions by subregion and urb/rur category]

Just what we needed. The next step is to build a small map showing the three subregions of Europe. But before we proceed to the maps, one thing has to be fixed: ggplot2 fails to render nested polygons correctly. With our regional dataset, London, for example, would not be shown if we did not account for this unpleasant feature. Luckily, there is quite a simple way to fix that problem.

# a nice small function to overcome some mapping problems with nested polygons
# see more at SO
# https://stackoverflow.com/questions/21748852
# (the body below is a sketch following the linked answer: split a fortified
# data frame into polygons with holes and the holes themselves, to be
# drawn as separate layers)
gghole <- function(fort) {
        poly <- fort[fort$id %in% fort[fort$hole, ]$id, ]
        hole <- fort[!fort$id %in% fort[fort$hole, ]$id, ]
        list(poly = poly, hole = hole)
}
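The fortified polygons can then be drawn in two layers along these lines (object names are assumed):

ggplot() +
        geom_polygon(data = gghole(bord)[[1]], 
                     aes(long, lat, group = group), fill = "grey90") +
        geom_polygon(data = gghole(bord)[[2]], 
                     aes(long, lat, group = group), fill = "grey90")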

Now I build the small map of subregions.

# annotate a small map of the subregions of Europe
an_sub 

[Figure: small map of the three subregions of Europe]

Finally, everything is ready to build the main map and stick the two subplots on top of it.

# finally the map of Urb/Rur typology

caption 
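The stamping itself relies on ggplot2::annotation_custom(). A minimal sketch, assuming `basemap` is the blank map of Europe created earlier and reusing the subplots built above (the placement coordinates are illustrative, in projected map units):

final.map <- basemap +
        annotation_custom(grob = ggplotGrob(typ_mosaic),
                          xmin = 2.5e6, xmax = 4.0e6,
                          ymin = 4.5e6, ymax = 6.0e6) +
        annotation_custom(grob = ggplotGrob(an_sub),
                          xmin = 5.5e6, xmax = 7.0e6,
                          ymin = 2.5e6, ymax = 4.0e6)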

Done!

Of course, it takes several iterations to position each element in its proper place. Then, one also needs to play with export parameters to finally get the desired output.

The full R script for this post is here


Euler Problem 22 : Names Scores

By Peter Prevos


(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

[Figure: R logo in ASCII art by picascii.com]

Euler problem 22 is another trivial one that takes us to the realm of ASCII codes. ASCII is a method to convert symbols into numbers, with roots in telegraph codes.

Back in the 8-bit days, ASCII art was a method to create images without using lots of memory. Each image consists of a collection of text characters that give the illusion of an image. Euler problem 22 is, unfortunately, a bit less poetic.

Euler Problem 22 Definition

Using names.txt, a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.

For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 × 53 = 49,714.

What is the total of all the name scores in the file?

Solution

This code reads and cleans the file and sorts the names alphabetically. The charToRaw function determines the numerical value of each character in each name. The output of this function is the ASCII code of each character as hexadecimal raw values. The letter A is number 65, so we subtract 64 from each value before summing to obtain the name scores.

# ETL: reads the file and converts it to an ordered vector.
names <- readLines("names.txt", warn = FALSE)
names <- sort(unlist(strsplit(gsub("\"", "", names), ",")))
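The scoring step can then look like this (a sketch of the method described above, not necessarily the author's exact code):

# alphabetical value: charToRaw() gives the ASCII codes and "A" is 65,
# so subtracting 64 maps A..Z onto 1..26
scores <- sapply(names, function(n) sum(as.integer(charToRaw(n)) - 64))

# name score = alphabetical value times position in the sorted list
answer <- sum(scores * seq_along(scores))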

We can have a bit more fun with this problem by comparing this list with the most popular baby names in 2016. The first section of the code extracts the list of popular names from the website. The rest of the code counts the number of matches between the lists.

# Most popular baby names
library(rvest)
url <- "..." # address of the baby names page (not preserved here)
babynames <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="babyNameList"]/table') %>%
    html_table()


A recipe for dynamic programming in R: Solving a “quadruel” case from PuzzlOR

By serhatcevikel

(This article was first published on R – Serhat’s Blog, and kindly contributed to R-bloggers)

As also described in Cormen et al. (2009, p. 65), in algorithm design the divide-and-conquer paradigm incorporates a recursive approach in which the main problem is:

  • Divided into smaller sub-problems (divide),
  • The sub-problems are solved (conquer),
  • And the solutions to sub-problems are combined to solve the original and “bigger” problem (combine).

Instead of constructing an indefinite number of nested loops, destroying the readability of the code and the performance of execution, the "recursive" way utilizes just one block of code which calls itself (hence the term "recursive") for the smaller problem. The main point is to define a "stop" rule, so that the function does not sink into an infinite recursion depth. While nested loops modify the same object (or address space, in the low-level sense), recursion moves the "stack pointer", so each recursion depth uses a different part of the stack (a copy of the objects is created for each recursion). This illustrates a well-known trade-off in algorithm design, memory versus performance: recursion enhances performance at the expense of using more memory.

Given the benefits of the recursive divide-and-conquer paradigm, however, the performance of naive recursion may still not be as good as we want, since there may be too many smaller problems to navigate through recursion. Sometimes the same or a similar sub-problem is visited multiple times, and each sub-problem paves the way to its own sub-problems down to the stop rule – so a naive recursive approach solves the same sub-problems over and over again. The recipe for enhancing performance is to "memoize" solved sub-problems: simply keep the solutions in a look-up table, and check the table before commencing with the "divide-and-conquer". The divide-and-conquer paradigm on steroids with memoization is known as "dynamic programming", or "dp" for short (Cormen et al. 2009, p. 359).
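As a toy illustration of the difference (my own example, not from the post), compare a naive recursive Fibonacci with a memoized one:

# naive recursion: solves the same sub-problems over and over
fib <- function(n) if (n < 2) n else fib(n - 1) + fib(n - 2)

# memoization: keep solved sub-problems in a look-up table
memo <- c(0, 1, rep(NA, 98))
fib.dp <- function(n) {
    if (is.na(memo[n + 1]))
        memo[n + 1] <<- fib.dp(n - 1) + fib.dp(n - 2)
    memo[n + 1]
}

fib.dp(30) # each sub-problem is now solved only once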

Now we will illustrate an application of the dynamic programming approach using a question from the PuzzlOR site – a site which publishes bi-monthly decision support puzzles for applied mathematicians. Since the answers to expired questions are disclosed by the website itself, there is no problem in describing a solution in detail. The question that appeared in October 2014 is called "Fighters". The question involves a quadruel: a duel-like fight among four participants instead of the classical two.

The question goes as follows:

Four different fighters are having an all-out battle to determine who among them is the strongest. The image above shows those four fighters: Allan, Barry, Charles, and Dan.
Each fighter has varying attack and health abilities. At the start of the battle, they have differing health points: Allan has 10, Barry has 12, Charles has 16, and Dan has 18. Also, each fighter has differing attack points: Allan has 4, Barry has 3, Charles has 2, and Dan has 1.
The battle takes place over multiple rounds, each round consisting of a single attack. In each round, one random attacker and one random defender are chosen. When the attacker attacks a defender, the defender loses health points in the amount equivalent to the attacker’s attack points. For example, if Allan is the attacker and Barry is the defender, Barry would lose 4 health points.
The fighters continue to randomly attack and defend in subsequent rounds until there is only one fighter left, who is then declared the winner. A fighter is removed from the battle when his life points become zero (or less).
Question: Which fighter is most likely to win the battle?

Starting with the initial health levels of 10, 12, 16, and 18, we can divide the main problem of finding the most likely winner into smaller ones, each of which is the ex-post health level configuration after a possible and equally likely attack between any “alive” fighters – “alive” meaning having a non-zero health level. We can divide the problem until we have only one alive fighter – which is the winner. We calculate the actual probability of winning here in this deepest recursion step at the stop rule (only one winner and the probability is zero for other fighters), so this is the “conquer” phase in which the smaller problem is solved.

Each recursion paves the way into smaller problems – each one involving a different pair of fighters – the number of which is the count of two-member permutations of alive fighters without replacement. We calculate permutations instead of combinations since Fighter X attacking Fighter Y is different from the other way around – the attacker injures the attacked one, so the attack is one-way and not symmetric. We calculate the permutations without replacement, since an attacker cannot attack herself/himself! So in the case of four alive fighters, there are 12 possible attacks, and each attack has a probability of 1/12. The winning probabilities of each fighter calculated from each of these 12 attacks are summed together in a bottom-up manner (from the deepest recursion level involving the last alive fighter, up to the initial question), and this step is the "combine" phase: in each step involving n ordered pairs of fighters, the 1/n probability is multiplied with the bottom-up probabilities to carry the joint probabilities up to the topmost level (the initial question).

We first start with the divide-and-conquer paradigm without memoization and see the performance:


## for non-repeated permutations
library(gtools) 

### non memoized version

# get the winning probabilities of each player
# win means only one player remains with non-zero health

# (the snippet is completed here as a sketch; the line numbers quoted in
# the bullet points below refer to the author's full script)
puzzlor.fighters.nm <- function() {
    attack <<- c(4, 3, 2, 1)         # Allan, Barry, Charles, Dan
    health <- c(10, 12, 16, 18)      # initial health points
    pairs <- permutations(4, 2, 1:4) # ordered (attacker, defender) pairs
    res <- sapply(seq_len(nrow(pairs)), function(i)
        fight.rec.nm(pairs[i, 1], pairs[i, 2], health, 1 / nrow(pairs)))
    rowSums(res)
}

fight.rec.nm <- function(attacker, attacked, health, prob) {

    # the defender loses health equal to the attacker's attack points
    health[attacked] <- max(health[attacked] - attack[attacker], 0)

    players <- which(health > 0) # index of alive players

    # if only one player survives, so is the winner
    if (length(players) == 1) {
        result <- numeric(4)
        result[players] <- prob
        return(result)
    }
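    # (continuation of the sketch)
    # non-repeated ordered pairs of the remaining alive players
    pairs <- permutations(length(players), 2, players)

    # recurse into each equally likely attack, carrying prob / n down,
    # and combine the winning probabilities bottom-up
    res <- sapply(seq_len(nrow(pairs)), function(i)
        fight.rec.nm(pairs[i, 1], pairs[i, 2], health, prob / nrow(pairs)))
    rowSums(res)
}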

The key points here in the above code are:

  • In each recursion, the sub-problem is defined by the attacker, attacked, current health points, and the carried probability of that sub-problem (1 / the number of sub-problems generated one level up). The attack points of fighters do not change (line 35).
  • In each sub-problem the health points are updated by subtracting the attack point of the attacker from the health point of the attacked. Note that, the health level can be as low as 0 (line 48).
  • New sub-problems are defined for the non-repeated ordered-pairs of remaining alive fighters – provided that more than one fighter is alive. The permutations are generated by the use of “gtools” package, a handy package to generate combinations and permutations with different options. Note that the base function “expand.grid” generates only the repeated permutations which is not useful for our purpose, since a fighter is not supposed to attack herself/himself. The carried probability is 1 / the number of new sub-problems. The same updated health levels are carried to the new recursion level in each new sub-problem (lines 62-68). If only one fighter with non-zero health level remains, the recursion stops and the carried probability is recorded for the alive fighter (0 probability for the others) (lines 53-56).
  • Each sub-problem returns a 4-tuple vector of winning probabilities for each fighter. The probabilities account for the carried probability (1 / number of ordered pairs). All results from sub-problems at a certain level return a matrix and the columns are summed up and multiplied by the carried probability to yield a vector of combined probabilities to be carried one level up until the original problem (lines 69-71).

We construct two functions: The first one (“puzzlor.fighters.nm”) (line 9) is a wrapper to initiate necessary objects and to start the initial problem calling the recursive function. The second function (“fight.rec.nm”) (line 35) is the recursive function which does all the steps described in detail above.

The performance of this non-memoized version is… well, it is still running as I am writing these sentences.

The reason for the performance bottleneck is the fact that too many sub-problems are generated until only one winner remains, and the number of "tree branches" grows geometrically with each recursion (multiplied by the number of possible ordered attack pairs in each step). And those sub-problems are revisited many times, since the fights can yield the same health level configuration at different instances. We can enhance the performance without redefining the problem, but by augmenting it with an additional data structure:

  • Each sub-problem, after the attack is realized, is defined by the 4-tuple of health levels (line 48).
  • And each of these sub-problems yields a 4-tuple of the winning probabilities of each fighter (line 56, line 71).
  • So for each sub-problem defined by the 4-tuple configuration of health levels, we can memoize the 4-tuple winning probabilities of each fighter.
  • We must create an object to record the winning probabilities for each health level configuration. But what should be the R object type for this task? We can subset a matrix with at most two indices, row and column, but we need four indices for each health level; and in order to return a vector (the 4-tuple probabilities), we can use at most one index in a matrix. The solution is to use a higher-dimensional object, called an "array" in R. Arrays can have a dimension higher than two. In our example, we need a 5D array to index each health configuration and still yield a 1D vector object of probabilities.

The new code, which combines recursion and memoization and hence is an application of "dynamic programming" (dp), is as follows:

## for non-repeated permutations
library(gtools) 

# get the winning probabilities of each player
# win means only one player remains with non-zero health

# (completed as a sketch, mirroring the non-memoized version above; the
# line numbers quoted in the bullet points below refer to the full script)
puzzlor.fighters.5D <- function() {
    attack <<- c(4, 3, 2, 1)
    health <- c(10, 12, 16, 18)
    # 5D array for memoization: one dimension per fighter's health
    # (0..initial health, hence the + 1) plus one dimension for the four
    # probabilities; superassigned so that every recursion level writes
    # into the same object
    probs.arr <<- array(NA_real_, dim = c(health + 1, 4))
    pairs <- permutations(4, 2, 1:4)
    res <- sapply(seq_len(nrow(pairs)), function(i)
        fight.rec.5D(pairs[i, 1], pairs[i, 2], health, 1 / nrow(pairs)))
    rowSums(res)
}

fight.rec.5D <- function(attacker, attacked, health, prob) {

        health[attacked] <- max(health[attacked] - attack[attacker], 0)

        # return the memoized probabilities for this health configuration,
        # if they exist (the author subsets the 5D array with the "["
        # operator as a function; plain indexing is shown for brevity)
        idx <- health + 1 # shift: zero health maps onto index 1
        memo <- probs.arr[idx[1], idx[2], idx[3], idx[4], ]
        if (!any(is.na(memo))) return(memo * prob)

        players <- which(health > 0) # index of alive players

        # if only one player survives, so is the winner
        # (no need to memoize at this deepest level)
        if (length(players) == 1) {

            result <- numeric(4)
            result[players] <- prob
            return(result)
        }
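        # (continuation of the sketch)
        # recurse as before, but with unit probability at this node, so
        # that the memoized values are conditional on the configuration
        pairs <- permutations(length(players), 2, players)
        res <- sapply(seq_len(nrow(pairs)), function(i)
            fight.rec.5D(pairs[i, 1], pairs[i, 2], health, 1 / nrow(pairs)))
        combined <- rowSums(res)

        # memoize before applying the probability carried from above
        probs.arr[idx[1], idx[2], idx[3], idx[4], ] <<- combined
        combined * prob
}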

The main points of the modified code are as follows:

  • We create a 5D array, to hold the memoized values. The first four dimensions are for the health values and the last dimension is for the memoized probabilities. Note that, each dimension is 1 + the initial health in order to account for 0 health level. When subsetting the array, the health values will be incremented (line 13).
  • And also note that the array is superassigned ("<<-") so that the recursive function can write memoized values into the object created in the enclosing wrapper function.
  • We can subset a 5D object with 4 indices (leaving the last index so that a vector is returned) just like we subset a matrix for rows and/or columns: array.object[x,y,z,a,]. However, we should first subset each index from the health level configuration vector – which does not look very pretty considering higher dimension objects. A more elegant and generalized way to subset the 5D array is to use the subsetting operator “[” as a function: Note that every operator in R is a function following the convention of LISP (one of the languages that inspired S and R) (line 48). We should first create a matrix containing the index values and then use the “[” operator as a function (line 46). Note that we can follow the same subsetting approach for modifying the array for memoization purposes (line 81).
  • In each sub-problem, the array will be indexed first to get the memoized probabilities – if any. If the probabilities for the given health level configuration is already memoized, those values will be returned (after multiplying with the probability carried from one higher level) (lines 51-55).
  • For the last level of recursion (where there is only one fighter left), there is no need to memoize the value, since just returning a vector of three 0's and one value for carried probability without further recursion should be as efficient as indexing the 5D array (lines 62-66).
  • For the cases in which the configuration is not yet memoized and there are more than one alive fighters, the recursion will be followed as in the first code (lines 72-85), however, this time the results will be memoized into the 5D array before multiplying with the carried probability (line 81).
Now let's execute this new function and compare its results with the first one. Please note that the first code is still running after several hours on an octacore machine at 2.7 GHz current and 3.5 GHz max clock speed!
> microbenchmark(puzzlor.fighters.5D(), times = 1)
Unit: seconds
                  expr      min       lq     mean   median       uq      max
 puzzlor.fighters.5D() 7.983507 7.983507 7.983507 7.983507 7.983507 7.983507
 neval
     1

The code executes in less than 8 seconds. You can check the correct solutions from the PuzzlOR web page.

A downside to the dp approach is that it cannot be utilized with "mcmapply" and similar R functions of the "parallel" package for multicore parallelism, since these functions create copies of the global objects as well as the local objects for each thread, so not all values can be memoized into the same object. However, the gains from memoization outweigh the disadvantage of using a single core.

You can view the whole code (with an additional option with a 4D array and a matrix, which slightly underperforms the 5D case) from my Github page.

Notes:

  • In cases where R returns an error for reaching the maximum recursion depth, the limit for recursion depth up to 5e5 levels can be defined with:
    > options(expressions= 500000)
    
  • Sometimes, despite a sufficient maximum recursion depth, the usage of the stack may reach the maximum available size. Answers to a Stack Overflow question summarize how to tackle this issue. From the unix shell, to check the current stack size:
    $ ulimit -s # print default
    8192
    

    Now you can increase the stack size by:

    $ ulimit -s 16384 # enlarge stack limit to 16 megs
    

    And now, you can check the new stack size available to R by calling a new R session from the shell that you increased the stack size with and calling the following:

    
    > Cstack_info()["size"]
        size 
    15938355
    

How brute is your force?: Three ways to get radicals in batch using R

By serhatcevikel

(This article was first published on R – Serhat’s Blog, and kindly contributed to R-bloggers)

The brute force approach to solving a problem usually means running through all possible combinations and repeating a certain calculation. Until a more elegant solution is devised – for example, finding a numerical formula – brute forcing is the only tool at hand. Provided that the nature of the calculation is not infeasibly inefficient and the universe of combinations/possibilities to search through is not infeasibly large, brute forcing may be sufficient. However, in many instances the brute force algorithm can be enhanced so that it bears more "elegance" and "intelligence" to some extent.

The methods to enhance a brute force algorithm may be to restrict the set of values to be searched, to rewrite the code in more efficient ways that run with fewer instructions at the hardware level, or to incorporate multicore or GPU parallelism.

In this post, I will illustrate three "brute force" methods to solve a problem, refining each successive method so that the third runs more than 100x faster than the first. A knowledge of the low-level mechanics of the R interpreter is important in selecting the optimal method. The problem I chose is the 124th problem in Project Euler. Since Project Euler requests participants not to disclose solutions publicly, I will just share portions of the algorithm in order to emphasize the efficiency gains from choosing the right tools and methods:

Ordered radicals

Problem 124

The radical of n, rad(n), is the product of the distinct prime factors of n. For example, 504 = 2^3 × 3^2 × 7, so rad(504) = 2 × 3 × 7 = 42.

If we calculate rad(n) for 1 ≤ n ≤ 10, then sort them on rad(n), sorting on n if the radical values are equal, we get:

Unsorted         Sorted
n    rad(n)      n    rad(n)   k
1    1           1    1        1
2    2           2    2        2
3    3           4    2        3
4    2           8    2        4
5    5           3    3        5
6    6           9    3        6
7    7           5    5        7
8    2           6    6        8
9    3           7    7        9
10   10          10   10       10

Let E(k) be the kth element in the sorted n column; for example, E(4) = 8 and E(6) = 9.

If rad(n) is sorted for 1 ≤ n ≤ 100000, find E(10000).

So the main point is to calculate the product of the distinct, or unique, prime factors of each number up to 1e5. Prime sieving is the trivial part, since it is a common task in many Project Euler problems. Although I have my own sieve algorithms, I personally prefer to use the "Primes" function in the "numbers" package. Another method is making a system call to the open-source "primesieve" tool. The last part (ordering and getting the number at a certain rank) is also trivial.

The common approach in all three versions is to check the divisibility of all numbers up to 1e5 by each prime and to multiply the radicals of the divisible numbers by that prime.

In the naivest first version, we check the divisibility of each number by each prime using the modulo operator:

    # whether numbers are divisible by the prime factor, by modulo
    bool.new <- nums %% prime == 0

    # update the radicals of the divisible numbers
    radicals[bool.new] <- radicals[bool.new] * prime

In this code snippet, "nums" is the global vector holding the sequence of numbers 1 to 1e5. Although this option is the "brutest" of the three, it still enjoys the efficiency benefit of indexing a vector with a boolean vector of the same length – which takes linear time – instead of searching for the numbers. As long as the vector is not too large and the numbers to be indexed are not too sparse, we do not need an efficient hash function without collisions.

The performance of the first version that runs the above snippet for radical updating is 12.8 seconds:

> microbenchmark(ordered.radicals.3(), times = 10)
Unit: seconds
                 expr      min       lq     mean   median       uq      max
 ordered.radicals.3() 12.61865 12.64935 12.80029 12.75633 12.85513 13.39739
 neval
    10

However, we do not need to probe the divisibility of the numbers in the sequence with the prime using the modulo operator “%%”. We know that the numbers divisible by a prime are the multiples of that prime! Using the vector indexing approach, we can get a sequence of the multiples of the prime upto 1e5, and toggle those indices in a clean boolean vector of FALSE values to TRUE, yielding the same result in a much faster manner:

    # empty F vector
    bool.new <- bool.new1

    # sequence of the multiples of the prime below 1e5
    inds <- seq(prime, 1e5, prime)

    # toggle the multiples to TRUE
    bool.new[inds] <- TRUE

    # update the radicals of the divisible numbers
    radicals[bool.new] <- radicals[bool.new] * prime

In the above snippet, first an empty boolean vector of 1e5 FALSE values (bool.new) is copied from a template vector (bool.new1) in line 2. In the most critical part, a sequence of the multiples of the prime below 1e5 is formed (line 5). Then the boolean vector is indexed with the multiples and those values are toggled to TRUE (line 8). The last line, updating the radicals, is the same as in the first option (line 11).

The performance of this second option is 3 seconds, a more than 4x gain over the first option:

> microbenchmark(ordered.radicals.2(), times = 10)                                                                                                                                            
Unit: seconds
                 expr      min       lq     mean   median       uq      max
 ordered.radicals.2() 2.933848 2.982204 3.022639 3.016654 3.066471 3.118391
 neval
    10

However, this performance is still not good enough. There are 9592 primes below 1e5, and both versions check all of them. To enhance the performance, we should first review sieving algorithms and reflect on the problem at hand:

A number which is not prime is a composite number: it has at least two prime factors. A composite number must have at least one prime factor less than or equal to its square root. We can arrive at this conclusion by contradiction: if the smallest prime factor were larger than the square root, the other factor(s) would have to be larger still, but then the product of those factors would be larger than the number itself. In prime sieving up to some limit, it is therefore enough to probe the primes up to the square root of that limit to factor all numbers (and in the case of prime numbers, we will be left with the primes themselves). The square root of 1e5 is 316, and there are only 65 primes below 316. So instead of 9592 primes, we will probe for only 65. The maximum power of a prime that stays below 1e5 is 16, for the prime 2. So the gains from probing 147 times fewer primes justify the extra work of probing the powers of those prime factors.

# empty F vector
bool.new <- bool.new1

# interim vector of 1's to collect the highest prime powers
radicals.new <- rep(1, 1e5)

# maximum power of the prime that stays below 1e5
max.power <- floor(log(1e5, base = prime))
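# (continuation sketch, following the description below; the author
# populates radicals.new with a *ply call, a plain loop is shown here)
inds <- seq(prime, 1e5, prime)
bool.new[inds] <- TRUE
radicals[bool.new] <- radicals[bool.new] * prime

# record the highest power of the prime dividing each number
for (pw in seq_len(max.power)) {
    radicals.new[seq(prime^pw, 1e5, prime^pw)] <- prime^pw
}

# divide the numbers vector by the collected prime powers
nums <- nums / radicals.new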

In this last version, we have an additional data structure: an interim vector of 1's (radicals.new) to collect the maximum power of the prime by which each number is divisible (line 5). We get the maximum power of the prime that stays below 1e5 using the logarithm (line 8). After updating the radicals as before, we repeat the procedure for each power of the prime to populate the interim vector of prime powers (radicals.new) using a ply formula (line 20). After the vector is populated, the numbers vector is divided by this prime powers vector (line 23).

After all 65 primes are probed and the numbers vector is updated for each of them, the remaining numbers vector holds the last prime factors, which are multiplied with the vector collecting the distinct prime factors (radicals) to get the resulting radicals.

The performance is 109 milliseconds for this last version:

> microbenchmark(ordered.radicals.1(), times = 10)
Unit: milliseconds
                 expr      min      lq     mean   median       uq      max
 ordered.radicals.1() 106.1128 107.258 110.9804 109.0543 110.4271 129.5838
 neval
    10

We sped up the code nearly 30 times compared with the second version and nearly 120 times compared with the first "raw" version. Though we are still in the realm of "brute forcing", by incorporating a little number theory and a better way to check divisibility, our force is now much more elegant and intelligent, and it manipulates the low-level mechanics of the higher-level R language better.

Note: The code snippets given are just critical excerpts from the original code given in order to illustrate the progression in algorithm enhancement.


Data wrangling : I/O (Part-1)

By Vasileios Tsakalos

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Data wrangling is a task of great importance in data analysis: it is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process, estimated to take about 60-80% of an analyst's time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader's skills on the data wrangling task. This is the first part of the series and it covers importing and exporting data locally.
Please download the data from here

Before proceeding, it might be helpful to look over the help pages for the read.csv, read.table, read_excel, fromJSON, as.data.frame, xmlParse, xmlToList, write.csv, write.table, write.xlsx, toJSON, write, write.xml.

Moreover, please load the following libraries:
install.packages("readxl")
library(readxl)
install.packages("xlsx")
library(xlsx)
install.packages("rjson")
library(rjson)
install.packages("plyr")
library(plyr)
install.packages("XML")
library(XML)
install.packages("kulife")
library(kulife)
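For orientation, typical import calls look like this (a sketch; the file names follow the exercises and the files are assumed to sit in the working directory):

csv_file <- read.csv("data.csv", stringsAsFactors = FALSE)
txt_file <- read.table("data.txt", header = TRUE)
xls_file <- read_excel("data.xls")
json_file <- fromJSON(file = "data.json")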

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Import the data.csv file to the csv_file object.

Exercise 2

Import the data.txt file to the txt_file object.

Exercise 3

Import the data.xls file to the xls_file object.

Exercise 4

Import the data.xlsx file to the xlsx_file object.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 5

Import the data.json file to the json_file object.

Exercise 6

Import the data.xml file to the xml_file object.

Exercise 7

Export the csv_file as “data.csv” and txt_file as “data.txt”.

Exercise 8

Export the xls_file as “data.xls” and xlsx_file as “data.xlsx”.

Exercise 9

Export the json_file as “data.json”.

Exercise 10

Export the xml_file as “data.xml”.


Microsoft R Open 3.4.0 now available

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.0 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.4.0, reduces the size of the installer image, and updates the bundled packages.

R 3.4.0 (upon which MRO 3.4.0 is based) is a major update to the R language, with many fixes and improvements. Most notably, R 3.4.0 introduces a just-in-time (JIT) compiler to improve performance of the scripts and functions that you write. There have been a few minor tweaks to the language itself, but in general functions and packages written for R 3.3.x should work the same in R 3.4.0. As usual, MRO points to a fixed CRAN snapshot from May 1 2017, but you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).
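For example, pinning the package library to a different snapshot date takes one call (the date below is illustrative):

library(checkpoint)
checkpoint("2017-06-01") # install and use CRAN packages as of this date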

MRO is supported on Windows, Mac and Linux (Ubuntu, RedHat/CentOS, and SUSE). MRO 3.4.0 is 100% compatible with R 3.4.0, and you can use it with any of the 10,000+ packages available on CRAN. Here are some highlights of new packages released since the last MRO update.

We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

MRAN: Download Microsoft R Open


Counterintuitive probability problem of the day

By dan

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

TWO SHOT RUSSIAN ROULETTE WITH 2 BULLETS OR AN EQUAL CHANCE OF 1 OR 3 BULLETS

With p as the probability of dying on one shot, the probability of living through the game is (1 − p)²: you must survive the first shot and then survive the second.

Peter Ayton is giving a talk today at the London Judgement and Decision Making Seminar:

Imagine being obliged to play Russian roulette – twice (if you are lucky enough to survive the first game). Each time you must spin the chambers of a six-chambered revolver before pulling the trigger. However you do have one choice: You can choose to either (a) use a revolver which contains only 2 bullets or (b) blindly pick one of two other revolvers: one revolver contains 3 bullets; the other just 1 bullet. Whichever particular gun you pick you must use every time you play. Surprisingly, option (b) offers a better chance of survival. We discuss a general theorem implying, with some specified caveats, that a system’s probability of surviving repeated ‘demands’ improves as uncertainty concerning the probability of surviving one demand increases. Nonetheless our behavioural experiments confirm the counterintuitive nature of the Russian roulette and other kindred problems: most subjects prefer option (a). We discuss how uncertain probabilities reduce risks for repeated exposure, why people intuitively eschew them and some policy implications for safety regulation.

We can see how many people would think the choice between (a) and (b) doesn't matter: naively, 2 bullets looks like it gives the same survival probability as a blind choice between 1 and 3 bullets. But this is not true. The survival probability (1 − p)² is convex in p, so by Jensen's inequality an equal mixture of p = 1/6 and p = 3/6 beats the certain p = 2/6. Plug different values of p into the formula above to see it.

Or be lazy and let us do it for you. First, convince yourself that this R function is correct:


prob_live_game = function(bullets) {
prob_die = bullets / 6
prob_live = 1 - prob_die
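# expected payoff: 0 whenever you die, 1 only if you survive
# both rounds, hence effectively prob_live * prob_live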
prob_die * 0 + prob_live * (prob_die * 0 + prob_live * 1)
}

Now see below that the probability of living with 2 bullets is .44 while the probability of living with an equal chance of 1 or 3 bullets is .47


> #probability living with 2 bullets
> prob_live_game(2)
[1] 0.4444444
>
> #probability living with 1 bullet
> prob_live_game(1)
[1] 0.6944444
>
> #probability living with 3 bullets
> prob_live_game(3)
[1] 0.25
>
> #probability living with an equal chance of 1 or 3 bullets
> .5 * (prob_live_game(1) + prob_live_game(3))
[1] 0.4722222


Win a ticket to San Francisco EARL

By Karis Bouher


(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

This is your chance to win one of two 2-day passes to the first San Francisco EARL Conference on June 6 and 7.

To win: All you need to do is tell us why you want to come!

How to enter: Tweet us using #EARLConf2017 and #rstats OR send an email to earl-team@mango-solutions.com

The most interesting answers will win!

Competition closes 11:59pm GMT-7 on Sunday May 28. We will let the winners know on Monday May 29.

Prize is not redeemable for cash.
