Discount R courses at Simplilearn

By Tal Galili

Guest post by Simplilearn

Simplilearn is offering access to its R language courses at reduced prices. The offer is good until January 31st, 2016, with the coupon code GetAhead.

Check out the R-courses they offer:

Certified Data Scientist with R Language

At the end of the training, you will be technically competent in key R programming language concepts such as data visualization and exploration, as well as in statistical concepts like linear and logistic regression, cluster analysis, and forecasting.

Certified Data Scientist with R, SAS and Excel

This is a comprehensive package for budding data analysts wanting to learn about SAS software and the statistical techniques essential to decode extensive data. Once you’re done with this course, you will be technically competent in data analytics methods like reporting, clustering, predictive modeling, and optimization – so you can manage huge volumes of data.

With both these courses, Simplilearn offers:

  • A 100% money-back guarantee
  • A flexible learning format, which includes self-paced learning as well as online classroom training
  • 5 industry projects
  • Course completion exam included
  • Over 20 hours of hands-on project work drawn from the healthcare, retail, ecommerce, and insurance industries

You can also check out their other courses in Big Data and Analytics.

(for other paid/free R courses check out the extensive learn R article)

Source:: R News

Obama 2008 received 3x more media coverage than Sanders 2016

By Francis Smart

(This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers)
Many supporters of presidential hopeful Bernie Sanders have claimed that there is a media blackout in which Sanders has, for whatever reason, been blocked from communicating his campaign message. Combined with a dramatically reduced Democratic debate schedule (from 18 debates in 2008 with Obama to 4 in 2016 with Sanders), with debates scheduled on the days of the week least likely to draw a wide audience, this is seen as a significant attempt to rig the primary and ensure that Clinton gets the nomination.

Despite strongly supported petitions demanding more debates, one with nearly 120 thousand signatories and another with 30 thousand, Debbie Wasserman Schultz, chair of the Democratic National Committee (DNC) and former co-chair of Hillary Clinton’s 2008 campaign, has repeatedly refused to consider adding more debates.

This came on top of a fiasco earlier in the year dubbed “DataGate,” in which the DNC temporarily cut off the Sanders campaign’s access to critical voter information two days before the third debate, based on information presented by Schultz and refuted by the vendor. Access to the data was quickly restored after a petition demanding action gathered 285 thousand signatures in less than 48 hours.

With these two scandals in mind, Sanders supporters have become increasingly paranoid about what they view as the “establishment” acting to protect its candidate, Hillary Clinton. In this light, they have been very frustrated by the lack of media coverage of Sanders, claiming that he and his views are almost entirely unrepresented by the news media.

I have been wary of jumping on this bandwagon. It seems natural that the Democratic front-runner would get more coverage than a lesser-known rival. Clinton naturally attracts media attention, as she seems to have a new scandal every day, while Sanders seems to be a boy scout who, apart from being jailed for protesting segregation in the 60s, not enriching himself from private speaking fees and book deals, adamantly defending the rights of the downtrodden, and standing up to the most powerful people in the world, really has little that is “newsworthy” about him.

Setting aside the difficult question of what the media considers “newsworthy”, I would like to ask: “Is Sanders getting more or less media coverage than Obama got in 2007/2008?”

In order to answer this question, I look back at the front pages of online news sites from 2015 and 2007. Starting on January 1st of each year and going up to yesterday’s date, I scraped the headlines of Google News, Yahoo News, Huffington Post, Fox News, NPR, and the New York Times.
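
For a sense of how the counting works, here is a minimal R sketch. The data frame headlines (with columns race, web, and title) is a hypothetical stand-in for the scraped archive, and the regular expressions simply look for either form of each candidate's name:

library(dplyr)

# count headlines matching either form of a candidate's name
count_name <- function(title, pattern) sum(grepl(pattern, title, ignore.case = TRUE))

coverage <- headlines %>%
  group_by(race, web) %>%
  summarise(N       = n(),
            Sanders = count_name(title, "Sanders|Bernie"),
            Obama   = count_name(title, "Obama|Barack"),
            Clinton = count_name(title, "Clinton|Hillary")) %>%
  mutate(Sanders_Clinton = Sanders / Clinton,
         Obama_Clinton   = Obama / Clinton)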

Table 1: This table shows how frequently the names “Sanders”, “Obama”, and “Clinton” (or “Bernie”, “Barack”, and “Hillary”) came up in each of the news sources for which headlines were recorded, in the current race compared with the 2008 race. The columns Sanders/Clinton and Obama/Clinton show the relative frequencies: a number less than 1 indicates that fewer headlines featured the challenger than featured Clinton.

Race  Web        N      Sanders  Obama  Clinton  Sanders/Clinton  Obama/Clinton
2008  NYT        25902        1    100      138             0.01           0.72
2008  Fox        39132       10    167      357             0.03           0.47
2008  Google      8452        0    103      131             0.00           0.79
2008  HuffPost    1281        0     40       60             0.00           0.67
2008  NPR        20878        0     90       94             0.00           0.96
2008  Yahoo      27308        3    266      334             0.01           0.80
2016  NYT        36703      142    592      531             0.27           1.11
2016  Fox        32971       78   1284      898             0.09           1.43
2016  Google     21036       67    378      253             0.26           1.49
2016  HuffPost   45131      236    925      549             0.43           1.68
2016  NPR         9216       52    259      106             0.49           2.44
2016  Yahoo      19844       44    346      206             0.21           1.68

From Table 1, we can see that NPR is the news source with the most balanced coverage of Obama in 2008 and Sanders in 2016. Fox is the least balanced, with almost no coverage of Sanders. It is worth noting that the coverage of Sanders is abysmal in general, with no outlet reporting on Sanders even half as often as on Clinton. This is a significant deviation from Obama’s race against Clinton, in which Fox was the only outlet covering him at less than 50% of Clinton’s rate.

Table 2: This table shows the total number of news reports across all agencies for each candidate in each race.

Race       N  Sanders  Obama  Clinton  Sanders/Clinton  Obama/Clinton
2016  164901      619   3784     2543             0.24           1.49
2008  122953       14    766     1114             0.01           0.69

From Table 2 we can see that neither Sanders nor Obama received nearly as much media coverage as their rival Hillary Clinton. Sanders, however, seems to be at a significant disadvantage compared with Obama at the same point in the previous race: Obama on average had about two articles written about him for every three written about Clinton, while Sanders has only one article written about him for every four written about Clinton.

By this point in the 2008 primary race, Senator Obama had received about 2.8 times as much coverage, relative to his rival Hillary Clinton, as Senator Sanders has now. This is despite Sanders doing better than Obama on many key metrics (crowds, donations, and polling).
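
The factor can be checked directly from the counts in Table 2 with a quick back-of-the-envelope calculation in R:

# relative coverage ratios computed from the Table 2 counts
obama_clinton_2008   <- 766 / 1114   # about 0.69
sanders_clinton_2016 <- 619 / 2543   # about 0.24
obama_clinton_2008 / sanders_clinton_2016   # about 2.8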

With Sanders taking the lead in New Hampshire and running neck and neck with Clinton in Iowa, we might wonder whether coverage is improving for the Sanders campaign.

Figure 1: The top curve is the frequency of Obama coverage relative to that of Clinton, while the bottom curve is that of Senator Sanders relative to Clinton. A value of 1 on the y-axis represents coverage of the challenger equal to that of Clinton.

From Figure 1 we can see that, despite a remarkable performance in energizing large crowds, doing well in polls, and collecting an immense quantity of donations, media coverage of Sanders remains dreadful: even at the current peak, for every two stories about Clinton there is only one story about Sanders.

This is probably due in part to how the DNC and the Clinton camp (it is doubtful there is any difference) appear to have whitewashed the primary, restricting the debate structure and constantly adjusting Clinton’s positions so that they appear indistinguishable from those of Sanders.

#WeAreBernie look at how MSM tried to have a Media blackout on Sanders – We have to win Iowa for the Revolution pic.twitter.com/Vv7EXVesmo

— VoteForBernie-org (@0ggles) January 23, 2016

Figure 2: A popular Twitter meme conveying the frustration many have with the media.

In the number of written articles, Sanders has suffered from an apparent media blackout. He has also suffered from a lack of airtime, as Figure 2 shows in the number of minutes of television coverage he had received as of December 20th.

The criticisms that the DNC has rigged the debate process, and the bias in which candidates the media chooses to follow, are significant concerns for any democracy. This all fits well within a “systemic corruption” framework of thinking. However, that framework might not accurately describe what is actually happening in the media and within the DNC; additional investigation is required before further conclusions can be drawn.

But even given the uncertainty as to the true nature of the presidential campaign, accusations such as these and others levied against Hillary Clinton and the DNC should be investigated with due diligence, as they represent a fundamental threat to democracy far more pernicious and dangerous than anything Middle Eastern terrorists can muster.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics by Simulation.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

Source:: R News

R User Groups on GitHub

By Joseph Rickert

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Quite a few times over the past few years I have highlighted presentations posted by R user groups on their websites and recommended these sites as a source for interesting material, but I have never thought to see what the user groups were doing on GitHub. As you might expect, many people who make presentations at R user group meetings make their code available on GitHub. However, as best I can tell, only a few R user groups are maintaining GitHub sites under the user group name.

The Indy UseR Group is one that seems to be making very good use of its GitHub site. Here is the link to a very nice tutorial from Shankar Vaidyaraman on using the rvest package to do some web scraping with R. The following code, which scrapes the first page of Springer’s Use R! series to produce a short list of books, comes from Shankar’s simple example.

# load libraries
library(rvest)
library(dplyr)
library(stringr)
 
# link to Use R! titles at Springer site
useRlink = "http://www.springer.com/?SGWID=0-102-24-0-0&series=Use+R&sortOrder=relevance&searchType=ADVANCED_CDA&searchScope=editions&queryText=Use+R"
 
# Read the page
userPg = useRlink %>% read_html()
 
## Get info of books displayed on the page
booktitles = userPg %>% html_nodes(".productGraphic img") %>% html_attr("alt")
bookyr = userPg %>% html_nodes(xpath = "//span[contains(@class,'renditionDescription')]") %>% html_text()
bookauth = userPg %>% html_nodes("span[class = 'displayBlock']") %>% html_text()
bookprice = userPg %>% html_nodes(xpath = "//div[@class = 'bookListPriceContainer']//span[2]") %>% html_text()
pgdf = data.frame(title = booktitles, pubyr = bookyr, auth = bookauth, price = bookprice)
pgdf

This plot, which shows a list of books ranked by number of downloads, comes from Shankar’s extended recommender example.

The Ann Arbor R User Group meetup site has done an exceptional job of creating an aesthetically pleasing and informative web property on their GitHub site.

[Screenshot: the Ann Arbor R User Group GitHub site]

I am particularly impressed with the way they have integrated news, content, and commentary into their “News” section. Scroll down the page and have a look at the care taken to describe and document the presentations made to the group. I found the introduction and slides for Bob Carpenter’s RStan presentation very well done.

[Slide: Stan vs. alternatives]

Other RUGs active on GitHub include:

If your R user group is on GitHub and I have not included you in my short list, please let me know about it. I think RUG GitHub sites have the potential to create a rich code-sharing experience among user groups. If you would like some help getting started with GitHub, have a look at the tutorials on the Murdoch University R User Group webpage.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Source:: R News

A Million Text Files And A Single Laptop

By Randy Zwitch

[Image: a directory listing of over a million tiny text files]

(This article was first published on R – randyzwitch.com, and kindly contributed to R-bloggers)

Wait…What? Why?

More often than I would like, I receive datasets where the data has only been partially cleaned, such as in the picture on the right: hundreds, thousands…even millions of tiny files. Usually when this happens, the data all have the same format (such as having been generated by sensors or other memory-constrained devices).

The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces, 2) the data in aggregate are too large to hold in RAM, but 3) the data are small enough that using Hadoop or even a relational database seems like overkill.

Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.

Data Generation

For this blog post, I used a combination of R and Python to generate the data: the “Groceries” dataset from the arules package for sampling transactions (with replacement), and the Python Faker (fake-factory) package for generating fake customer profiles and creating the 1MM+ text files.

The content of the data itself isn’t important for this blog post, but the data-generation code is posted as a GitHub gist should you want to run these commands yourself.

Problem 1: Concatenating (cat * >> out.txt ?!)

The cat utility in Unix-y systems is familiar to most anyone who has ever opened up a Terminal window. Take some or all of the files in a folder, concatenate them together….one big file. But something funny happens once you get enough files…

$ cat * >> out.txt
-bash: /bin/cat: Argument list too long

That’s a fun thought…too many files for the computer to keep track of. As it turns out, Unix systems place a limit on the size of a command’s argument list; the asterisk in the `cat` command gets expanded by the shell before the command runs, so the statement above tries to pass 1,234,567 file names to `cat` and you get an error message.

One (naive) solution would be to loop over every file (a completely serial operation):

for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done

Roughly 10,093 seconds later, you’ll have your concatenated file. Three hours is quite a coffee break…

Solution 1: GNU Parallel & Concatenation

Above, I mentioned that looping over each file gets you past the error condition of too many arguments, but it is a serial operation. If you look at your computer usage during that operation, you’ll likely see that only a fraction of a core of your computer’s CPU is being utilized. We can greatly improve that through the use of GNU Parallel:

ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"

The `$f` argument in the code is to highlight that you can choose the level of parallelism; however, you will not get infinitely linear scaling, as shown below (graph code, Julia):

Given that the graph represents a single run at each level of parallelism, it’s a bit difficult to say exactly where the parallelism maxes out, but at roughly 10 concurrent jobs there’s no additional benefit. It’s also worth pointing out what the `-m` argument does: by specifying `-m`, you allow multiple arguments (i.e., multiple text files) to be passed to each invocation of `cat`. This alone leads to an 8x speedup over the naive loop solution.

Problem 2: Data > RAM

Now that we have a single file, we’ve removed the “one million files” cognitive dissonance, but now we have a second problem: at 19.93GB, the amount of data exceeds the RAM in my laptop (2014 MBP, 16GB of RAM). So in order to do analysis, either a bigger machine is needed or processing has to be done in a streaming or “chunked” manner (such as using the “chunksize” keyword in pandas).

But continuing on with our use of GNU Parallel, suppose we wanted to answer the following types of questions about our transactions data:

  1. How many unique products were sold?
  2. How many transactions were there per day?
  3. How many total items were sold per store, per month?

If it’s not clear from the list above, in all three questions there is an “embarrassingly parallel” portion of the computation. Let’s take a look at how to answer all three of these questions in a time- and RAM-efficient manner:

Q1: Unique Products

Given the format of the data file (transactions in a single column array), this question is the hardest to parallelize, but using a neat trick with the `tr` (transliterate) utility, we can map our data to one product per row as we stream over the file:
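
(As a rough sketch of the idea, assuming each transaction is a single comma-delimited line of product names, the serial and GNU Parallel versions look something like this:)

# serial version: one product per line, de-duplicate, count
cat transactions.csv | tr ',' '\n' | sort -u | wc -l

# GNU Parallel version: --pipe splits stdin into chunks and runs tr on each chunk
cat transactions.csv | parallel --pipe "tr ',' '\n'" | sort -u | wc -l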

The trick here is that we swap the comma-delimited transactions with the newline character; the effect of this is taking a single transaction row and returning multiple rows, one for each product. Then we pass that down the line, eventually using `sort -u` to de-dup the list and `wc -l` to count the number of unique lines (i.e. products).

In a serial fashion, it takes quite some time to calculate the number of unique products. Incorporating GNU Parallel, just using the defaults, gives nearly a 4x speedup!

Q2. Transactions By Day

If the file format could be considered undesirable in question 1, for question 2 the format is perfect. Since each row represents a transaction, all we need to do is perform the equivalent of a SQL `Group By` on the date and count the rows:
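
(A rough sketch, assuming the transaction date is the first comma-separated field; the real layout may differ:)

# serial version: count rows per date
cut -d',' -f1 transactions.csv | sort | uniq -c

# chunked version: count per chunk with GNU Parallel, then sum the partial counts by date
parallel --pipe "cut -d',' -f1 | sort | uniq -c" < transactions.csv |
  awk '{counts[$2] += $1} END {for (d in counts) print d, counts[d]}'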

Using GNU Parallel starts to become complicated here, but you do get a 9x speed-up by calculating rows by date in chunks, then “reducing” again by calculating total rows by date (a trick I picked up at this blog post).

Q3. Total Items Per Store, Per Month

For this example, it could be that my command-line fu is weak, but the serial method actually turns out to be the fastest. Of course, at a 14 minute run time, the real-time benefits to parallelization aren’t that great.

It may be that one of you out there knows how to do this better, but an interesting thing to note is that the serial version already uses 40-50% of the available CPU. So parallelization might yield a 2x speedup, but saving seven minutes per run isn’t worth spending hours trying to find the optimal settings.

But, I’ve got MULTIPLE files…

The three examples above showed that it’s possible to process datasets larger than RAM in a realistic amount of time using GNU Parallel. However, the examples also showed that working with Unix utilities can become complicated rather quickly. Shell scripts can help move beyond the “one-liner” syndrome, when the pipeline gets so long you lose track of the logic, but eventually problems are more easily solved using other tools.

The data that I generated at the beginning of this post represented two concepts: transactions and customers. Once you get to the point where you want to do joins, summarize by multiple columns, estimate models, etc., loading data into a database or an analytics environment like R or Python makes sense. But hopefully this post has shown that a laptop is capable of analyzing WAY more data than most people believe, using many tools written decades ago.

To leave a comment for the author, please follow the link and comment on their blog: R – randyzwitch.com.

Source:: R News

love-hate Metropolis algorithm

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

Hyungsuk Tak, Xiao-Li Meng and David van Dyk just arXived a paper on a multiple choice proposal in Metropolis-Hastings algorithms aimed at dealing with multimodal targets. It is called “A repulsive-attractive Metropolis algorithm for multimodality” [although I wonder why XXL did not jump at the opportunity to use the “love-hate” denomination!]. The proposal distribution includes a [forced] downward Metropolis-Hastings move that uses the inverse of the target density π as its own target, namely 1/{π(x)+ε}, followed by a [forced] Metropolis-Hastings upward move whose target is {π(x)+ε}. The +ε is just there to avoid handling ratios of zeroes (although I wonder why using the convention 0/0=1 would not work); it is chosen as 10⁻³²³ by default, in connection with R’s smallest positive number. Whether or not the “downward” move is truly downwards and the “upward” move is truly upwards obviously depends on the generating distribution: I find it rather surprising that the authors consider the same random walk density in both cases, as I would have imagined relying on a more dispersed distribution for the downward move in order to reach other modes more easily. For instance, the downward move could have been based on an anti-Langevin proposal, relying on the gradient to proceed further down…

This special choice of a single proposal, however, simplifies the acceptance ratio (and keeps the overall proposal symmetric). The final acceptance ratio still requires a ratio of intractable normalising constants, which the authors bypass via the auxiliary variable trick of Møller et al. (2006). While the authors mention the alternative pseudo-marginal approach of Andrieu and Roberts (2009), they do not try to implement it, although this would be straightforward here, since the normalising constants are the probabilities of accepting a downward and an upward move, respectively. Those can easily be evaluated at a cost similar to the use of the auxiliary variables. That is,

– generate a few moves from the current value and record the proportion p of accepted downward moves;
– generate a few moves from the final proposed value and record the proportion q of accepted downward moves;

and replace the ratio of intractable normalising constants with p/q. It is not even clear that one needs those extra moves since the algorithm requires an acceptance in the downward and upward moves, hence generate Geometric variates associated with those probabilities p and q, variates that can be used for estimating them. From a theoretical perspective, I also wonder if forcing the downward and upward moves truly leads to an improved convergence speed. Considering the case when the random walk is poorly calibrated for either the downward or upward move, the number of failed attempts before an acceptance may get beyond the reasonable.

As XXL and David pointed out to me, the unusual aspect of the approach is that here the proposal density is intractable, rather than the target density itself. This makes using Andrieu and Roberts (2009) seemingly less straightforward. However, as I was reminded this afternoon at the statistics and probability seminar in Bristol, the argument for the pseudo-marginal based on an unbiased estimator is that w Q(w|x) has a marginal in x equal to π(x) when the expectation of w is π(x). In the current problem, the proposal in x can be extended into a proposal in (x,w), w P(w|x), whose marginal is the proposal on x.

If we complement the target π(x) with the conditional P(w|x), the acceptance probability would then involve

{π(x’) P(w’|x’) / π(x) P(w|x)} / {w’ P(w’|x’) / w P(w|x)} = {π(x’) / π(x)} {w/w’}

so it seems the pseudo-marginal (or auxiliary variable) argument also extends to the proposal. Here is a short experiment that shows no discrepancy between target and histogram:

nozero=1e-300
#love-hate move: a forced downward move followed by a forced upward move
move<-function(x){
  bacwa=1;prop1=prop2=rnorm(1,x,2)
  #downward move: propose until accepted against the inverted target 1/{pi(.)+nozero}
  while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){
    prop1=rnorm(1,x,2);bacwa=bacwa+1}
  #upward move: propose until accepted against the target pi(.)+nozero
  while (runif(1)>{pi(prop2)+nozero}/{pi(prop1)+nozero})
    prop2=rnorm(1,prop1,2)
  y=x
  #final acceptance, correcting by the attempt counts bacwa (new point) and fowa (current point)
  if (runif(1)<pi(prop2)*bacwa/pi(x)/fowa){
    y=prop2;fowa<<-bacwa} #<<- updates the global count (a bare assign() only binds locally)
  return(y)}
#arbitrary bimodal target (this masks the built-in constant pi)
pi<-function(x){.25*dnorm(x)+.75*dnorm(x,mean=5)}
#running the chain
T=1e5
x=5*rnorm(1);luv8=rep(x,T)
#initial count of downward attempts from the starting value
fowa=1;prop1=rnorm(1,x,2)
while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){
  fowa=fowa+1;prop1=rnorm(1,x,2)}
for (t in 2:T)
  luv8[t]=move(luv8[t-1])

Filed under: Books, pictures, R, Statistics, Travel Tagged: auxiliary variable, doubly intractable problems, Metropolis-Hastings algorithm, Monte Carlo Statistical Methods, multimodality, normalising constant, parallel tempering, pseudo-marginal MCMC, The night of the hunter, unbiased estimation

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.

Source:: R News

In-depth analysis of Twitter activity and sentiment, with R

By David Smith

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Astronomer and budding data scientist Julia Silge has been using R for less than a year but, judging from the posts on her blog, has already become very proficient at using R to analyze some interesting data sets. She has posted detailed analyses of water consumption data and health care indicators from the Utah Open Data Catalog, religious affiliation data from the Association of Statisticians of American Religious Bodies, and demographic data from the American Community Survey (that’s the same dataset we mentioned on Monday).

In a two-part series, Julia analyzed another interesting dataset: her own archive of 10,000 tweets. (Julia provides all the R code for her analyses, so you can download your own Twitter archive and follow along.) In part one, Julia uses just a few lines of R to import her Twitter archive into R — in fact, that takes just one line of R code:

tweets <- read.csv("./tweets.csv", stringsAsFactors = FALSE)

She then uses the lubridate package to clean up the timestamps, and the ggplot2 package to create some simple charts of her Twitter activity. This chart takes just a few lines of R code and shows her Twitter activity over time categorized by type of tweet (direct tweets, replies, and retweets).
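
Here is a rough sketch of those few lines (the column names below are the ones Twitter’s archive export used at the time, so your export may differ):

library(lubridate)
library(ggplot2)

# parse the timestamp column into proper date-times
tweets$timestamp <- ymd_hms(tweets$timestamp)

# classify each tweet; the helper handles both NA and empty-string encodings of "no value"
present <- function(x) !is.na(x) & x != ""
tweets$type <- "tweet"
tweets$type[present(tweets$retweeted_status_id)] <- "RT"
tweets$type[present(tweets$in_reply_to_status_id)] <- "reply"

# tweets over time, filled by type of tweet
ggplot(tweets, aes(x = timestamp, fill = type)) +
  geom_histogram(bins = 60) +
  labs(x = "Time", y = "Number of tweets")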

The really interesting part of the analysis comes in part two, where Julia uses the tm package (which provides a number of text mining functions to R) and the syuzhet package (which includes the NRC Word-Emotion Association Lexicon) to analyze the sentiment of her tweets. Categorizing all 10,000 tweets as representing “anger”, “fear”, “surprise”, and other sentiments, and generating a positive and negative sentiment score for each, is as simple as this one line of R code:

mySentiment <- get_nrc_sentiment(tweets$text)
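
The result is a data frame with one column per NRC emotion plus overall negative and positive scores, so summarizing the emotions is a one-liner as well (a small sketch using the column names syuzhet returns):

# total counts for each of the eight NRC emotions across all tweets
colSums(mySentiment[, c("anger", "anticipation", "disgust", "fear",
                        "joy", "sadness", "surprise", "trust")])

# attach the scores to the tweets for plotting them over time
tweets <- cbind(tweets, mySentiment)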

Using those sentiment scores, Julia was easily able to summarize the sentiments expressed in her tweet history:

and create this time series chart showing her negative and positive sentiment scores over time:

Sentiment time series

If you’ve been thinking about applying sentiment analysis to some text data, you might find that with R it’s easier than you think! Try it using your own Twitter archive by following along with Julia’s posts linked below.

data science ish: Ten Thousand Tweets; Joy to the World, and also Anticipation, Disgust, Surprise…

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Source:: R News

Materials for NYU Shortcourse “Data Science and Social Science”

By Alex

(This article was first published on R – Bad Hessian, and kindly contributed to R-bloggers)

Pablo Barberá, Dan Cervone, and I prepared a short course at New York University on Data Science and Social Science, sponsored by several institutes at NYU. The course was intended as an introduction to R and basic data science tasks, including data visualization, social network analysis, textual analysis, web scraping, and APIs. The workshop is geared towards social scientists with little experience in R but with experience in other statistical packages.

You can download and tinker around with the materials on GitHub.

To leave a comment for the author, please follow the link and comment on their blog: R – Bad Hessian.

Source:: R News

Intro to Sound Analysis with R

By Tal Galili

Guest post by Christopher Johnson from www.codeitmagazine.com

Some of my articles cover getting started with a particular software, and some cover tips and tricks for seasoned users. This article, however, is different. It does demonstrate the usage of an R package, but the main purpose is for fun.

In an article in Time, Matt Peckham described how French researchers were able to use four microphones and a single snap to model a complex room to within 1mm accuracy (Peckham). I decided that I wanted to attempt this (on a smaller scale) with one microphone and an R package, and I was amazed at the results. Since the purpose of this article is not to teach anyone to write code for working with sound clips, I will give a general overview rather than walk through the code line by line, and I will present the code in full at the end for anyone who would like to recreate it.

The basic idea comes from the fact that sound travels at a constant speed in air, so when it bounces off an object, it returns after a predictable time. A microphone also records at a consistent sampling rate, which can be determined from the specs on the mic.

I placed a mic on my desk in a small office, pressed record, and snapped my fingers one time. I had an idea of what to expect from my surroundings: the mic sat on a desk with a monitor about a foot away, two walls about three feet away, and two more walls and a ceiling about six feet away.

Next, I imported the sound clip into R using the tuneR package, which enables us to work with sound clips. The following shows the initial image of the sound. Right away, we can see several peaks that are larger than the others, which we assume are the features we are interested in; the remaining peaks presumably correspond to smaller, less important features of the room.

I wrote two functions to process the sound further. The first simply takes a duration in samples and the sampling rate of the mic and, using the speed of sound, determines the distance traveled. The second function uses the first function to process a dataset of observations.

The output of this second function is a dataset of time and distances. Graphing this, we can more clearly see the results of our snap.

I have indicated the major features of the room, and they do indeed correspond to the expected distances from the room’s dimensions.

install.packages("tuneR", repos = "http://cran.r-project.org")
library(tuneR)

#Functions
sound_dist <- function(duration, samplingrate) {
  #Speed of sound is 1125 ft/sec
  return((duration/samplingrate)*1125/2)
}

sound_data <- function(dataset, threshold, samplingrate) {
  data <- data.frame()
  max = 0
  maxindex = 0
  for (i in 1:length(dataset)) {
    #treat the loudest sample seen so far as the snap itself and restart from it
    if (dataset[i] > max) {
      max = dataset[i]
      maxindex = i
      data <- data.frame()
    }
    #keep samples above the threshold, with their distance from the snap
    if (abs(dataset[i]) > threshold) {
      data <- rbind(data, c(i, dataset[i], sound_dist(i - maxindex, samplingrate)))
    }
  }
  colnames(data) <- c("x", "y", "dist")
  return(data)
}

#Analysis
snap <- readWave("Data/snap.wav")
print(snap)
play(snap)
plot(snap@left[30700:31500], type = "l", main = "Snap",
     xlab = "Time (samples)", ylab = "Amplitude")

data <- sound_data(snap@left, 4000, 44100)
plot(data[,3], data[,2], type = "l", main = "Snap",
     xlab = "Distance (ft)", ylab = "Amplitude")

References

Peckham, Matt. “We Can Now Map Rooms Down to the Millimeter with a Finger Snap.” Time, 19 June 2013. Web. Accessed 11 December 2015.

Source:: R News

How To Import Data Into R – New Course

By DataCamp Blog

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Importing your data into R to start your analyses: it should be the easiest step. Unfortunately, this is almost never the case. Data is stored in all sorts of formats, ranging from flat files to files from other statistical software to databases and web data. A skilled data scientist knows which techniques to use in order to proceed with the analysis of the data.

In our latest course, Importing Data Into R, you will learn the basics of how to get up and running in no time! Start the new Importing & Cleaning Data course for free today.

What you’ll learn

This 4-hour course includes 5 chapters and covers the topics below (a short sketch of the packages involved follows the list):

  • Chapter 1: Learn how to import data from flat files without hesitation using the readr and data.table packages, in addition to harnessing the power of the fread function.

  • Chapter 2: You will excel at loading .xls and .xlsx files with help from packages such as: readxl, gdata, and XLConnect.

  • Chapter 3: Help out your friends that are still paying for their statistical software and import their datasets from SAS, STATA, and SPSS using the haven and foreign packages.

  • Chapter 4: Pull data in style from popular relational databases using SQL.

  • Chapter 5: Learn the valuable skill of importing data from the web.
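
To give a flavour of the packages the chapters cover, here is a minimal sketch (the file names are placeholders, and these are only a few of the functions taught in the course):

library(readr)      # flat files
library(data.table) # fread
library(readxl)     # Excel files
library(haven)      # SAS, Stata and SPSS files

flat  <- read_csv("my_data.csv")               # readr
fast  <- fread("my_data.csv")                  # data.table's fast reader
sheet <- read_excel("my_data.xlsx", sheet = 1)
sas   <- read_sas("my_data.sas7bdat")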

Start your data importing journey here!

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.

Source:: R News

R typos

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

At MCMskv, Alexander Ly (from Amsterdam) pointed out to me some R programming mistakes I made in the introduction to Metropolis-Hastings algorithms I wrote a few months ago for the Wiley on-line encyclopedia! While the outcome (the Monte Carlo posterior) of the corrected version is only moderately changed, this is nonetheless embarrassing! The example (if not the R code) was a mixture of a Poisson and a Geometric distribution borrowed from our testing-as-mixture paper. Among other things, I used a flat prior on the mixture weights instead of a Beta(1/2,1/2) prior, and a simple log-normal random walk on the mean parameter instead of the more elaborate second-order expansion discussed in the text. I also inverted the probabilities of success and failure for the Geometric density. The new version is now available on arXiv, and hopefully soon on the Wiley site, but one (the?) fact worth mentioning here is that the (right) corrections in the R code first led to overflows, because I was using the Beta random walk Be(εp,ε(1-p)), whose major drawback I discussed here a few months ago: values of the weight parameter nearly equal to zero or one produced infinite values of the density… Adding 1 (or 1/2) to each parameter of the Beta proposal solved the problem, and led to a posterior on the weight still concentrating on the correct corner of the unit interval. In any case, a big thank you to Alexander for testing the R code and spotting the several mistakes…
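
A minimal illustration of the overflow issue (not the encyclopedia code itself): with a Be(εp,ε(1-p)) proposal and a small ε, draws collapse numerically to 0 or 1, where the Beta density is infinite, while shifting both parameters by 1 keeps everything finite.

eps=1e-3; p=.5
prop=rbeta(5,eps*p,eps*(1-p))          #draws numerically equal to 0 or 1
dbeta(prop,eps*p,eps*(1-p))            #infinite density values
prop=rbeta(5,1+eps*p,1+eps*(1-p))      #the fix: add 1 to each parameter
dbeta(prop,1+eps*p,1+eps*(1-p))        #finite values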

Filed under: Books, Kids, R, Statistics, Travel, University life Tagged: Amsterdam, Bayesian Analysis, MCMskv, Metropolis-Hastings algorithm, mixtures, Monte Carlo Statistical Methods, R, random walk, testing as mixture estimation

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.

Source:: R News