What is the best online dating site and the best way to use it?

By Daniel.Pocock

(This article was first published on DanielPocock.com – r-project, and kindly contributed to R-bloggers)

Somebody recently shared this with me: this is what happens when you attempt to access Parship, an online dating site, from the anonymous Tor Browser.

Experian is basically a private spy agency. Their website boasts about how they can:

  • Know who your customers are regardless of channel or device
  • Know where and how to reach your customers with optimal messages
  • Create and deliver exceptional experiences every time

Is that third objective, an “exceptional experience”, what you were hoping for with their dating site honey trap? You are out of luck: you are not the customer, you are the product.

When the Berlin wall came down, people were horrified at what they found in the archives of the Stasi. Don’t companies like Experian and Facebook gather far more data than this?

So can you succeed with online dating?

There are only three strategies that are worth mentioning:

  • Access sites you can’t trust (which includes all dating sites, whether free or paid for) using anonymous services like Tor Browser and anonymous email addresses. Use fake photos and fake all other data. Don’t send your real phone number through the messaging or chat facility in any of these sites, because they can use it to match your anonymous account to a real identity: instead, get an extra SIM card that you pay for and top up with cash. One person told me they tried this for a month as an experiment, expediently cutting and pasting a message to each contact to arrange a meeting for coffee. At each date they would give the other person a card that apologized for the completely fake profile photos and offered to start over, now that they could communicate beyond the prying eyes of the corporation.
  • Join online communities that are not primarily about dating and if a relationship comes naturally, it is a bonus.
  • If you really care about your future partner and don’t want your photo to be a piece of bait used to exploit and oppress them, why not expand your real-world activities?
To leave a comment for the author, please follow the link and comment on their blog: DanielPocock.com – r-project.


Edwina Dunn to keynote at enterprise focused R language conference

By Mango Solutions

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

MEDIA RELEASE

14 February 2018

Mango Solutions are delighted to announce that loyalty programme pioneer and data science innovator, Edwina Dunn, will keynote at the 2018 Enterprise Applications of the R Language (EARL) Conference in London on 11-13 September.

Mango Solutions’ Chief Data Scientist, Richard Pugh, has said that it is a privilege to have Ms Dunn address Conference delegates.

“Edwina helped to change the data landscape on a global scale while at dunnhumby; Tesco’s Clubcard, My Kroger Plus and other loyalty programmes have paved the way for data-driven decision making in retail,” Mr Pugh said.

“Having Edwina at EARL this year is a win for delegates, who attend the Conference to find inspiration in their use of analytics and data science using the R Language.

“In this centenary year of the 1918 Suffrage Act, Edwina’s participation is especially appropriate, as she is the founder of The Female Lead, a non-profit organization dedicated to giving women a platform to share their inspirational stories,” he said.

Ms Dunn is currently CEO at Starcount, a consumer insights company that combines the science of purchase and intent and brings the voice of the customer into the boardroom.

The EARL Conference is a cross-sector conference focusing on the commercial use of the R programming language with presentations from some of the world’s leading practitioners.

More information and tickets are available on the EARL Conference website: earlconf.com

END

For more information, please contact:
Karis Bouher, Marketing Manager: marketing@mango-solutions.com or +44 (0)1249 705 450

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.


Mandalas

By @aschinchon

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Mathematics is a place where you can do things which you can’t do in the real world (Marcus Du Sautoy, mathematician)

From time to time I have a look at some of my previous posts: it’s like seeing them through another’s eyes. One of my first posts was this one, where I drew fractals using the Multiple Reduction Copy Machine (MRCM) algorithm. Back then I was not clever enough to write efficient code able to generate deep fractals. Now I am pretty sure I could do it using ggplot, and I started to do it when I came across the idea of mixing this kind of fractal pattern with Voronoi tessellations, which I have explored in some of my previous posts, like this one. Mixing both techniques, the mandalas appeared.

I will not explain in depth the mathematics behind these patterns; I will just give a brief explanation:

  • I start by obtaining n equidistant points on a unit circle centered at (0,0)
  • I repeat the process with all these points, obtaining again n points around each of them; the radius is scaled by a factor
  • I discard the previous (parent) n points

I repeat these steps iteratively. If I start with n points and iterate k times, at the end I obtain n^k points. After that, I calculate the Voronoi tessellation of them, which I represent with ggplot.
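To make the recipe concrete, here is a minimal sketch of the construction followed by a Voronoi tessellation; the package choice (deldir) and the scale factor are my own assumptions and not necessarily what the original code linked below uses.

library(deldir)
library(ggplot2)

n     <- 6     # points per circle
k     <- 3     # iterations, giving n^k final points
ratio <- 0.4   # radius shrink factor per iteration (assumed value)

points <- data.frame(x = 0, y = 0)
angles <- seq(0, 2 * pi, length.out = n + 1)[-1]
for (i in seq_len(k)) {
  r <- ratio^(i - 1)   # unit radius at the first step, scaled afterwards
  points <- do.call(rbind, lapply(seq_len(nrow(points)), function(j) {
    data.frame(x = points$x[j] + r * cos(angles),
               y = points$y[j] + r * sin(angles))
  }))                  # the children replace their parent points
}

# Voronoi tessellation of the n^k points, drawn with ggplot
tiles <- tile.list(deldir(points$x, points$y))
polys <- do.call(rbind, lapply(seq_along(tiles), function(i) {
  data.frame(id = i, x = tiles[[i]]$x, y = tiles[[i]]$y)
}))

ggplot(polys, aes(x, y, group = id)) +
  geom_polygon(colour = "black", fill = "white") +
  coord_equal() +
  theme_void()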

This is an example:

Some others:

You can find the code here. Enjoy it.

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.


BH 1.66.0-1

By Thinking inside the box

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release of the BH package arrived on CRAN a little earlier: now at release 1.66.0-1. BH provides a sizeable portion of the Boost C++ libraries as a set of template headers for use by R, possibly with Rcpp as well as other packages.

This release upgrades the version of Boost to the Boost 1.66.0 version released recently, and also adds one exciting new library: Boost compute, which provides a C++ interface to multi-core CPU and GPGPU computing platforms based on OpenCL.

Besides the usual small patches we need to make (i.e., cannot call abort() etc pp to satisfy CRAN Policy) we made one significant new change in response to a relatively recent CRAN Policy change: compiler diagnostics are not suppressed for clang and g++. This may make builds somewhat noisy so we all may want to keep our ~/.R/Makevars finely tuned suppressing a bunch of warnings…

Changes in version 1.66.0-1 (2018-02-12)

  • Upgraded to Boost 1.66.0 (plus the few local tweaks)

  • Added Boost compute (as requested in #16)

Via CRANberries, there is a diffstat report relative to the previous release.

Comments and suggestions are welcome via the mailing list or the issue tracker at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .


jamovi for R: Easy but Controversial

By Bob Muenchen

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

jamovi is software that aims to simplify two aspects of using R. It offers a point-and-click graphical user interface (GUI). It also provides functions that combine the capabilities of many others, bringing a more SPSS- or SAS-like method of programming to R.

The ideal researcher would be an expert at their chosen field of study, data analysis, and computer programming. However, staying good at programming requires regular practice, and data collection on each project can take months or years. GUIs are ideal for people who only analyze data occasionally, since they only require you to recognize what you need in menus and dialog boxes, rather than having to recall programming statements from memory. This is likely why GUI-based research tools have been widely used in academic research for many years.

Several attempts have been made to make the powerful R language accessible to occasional users, including R Commander, Deducer, Rattle, and Bluesky Statistics. R Commander has been particularly successful, with over 40 plug-ins available for it. As helpful as those tools are, they lack the key element of reproducibility (more on that later).

jamovi’s developers designed its GUI to be familiar to SPSS users. Their goal is to have the most widely used parts of SPSS implemented by August of 2018, and they are well on their way. To use it, you simply click on Data>Open and select a comma-separated values file (other formats will be supported soon). It will guess at the type of data in each column, which you can check and/or change by choosing Data>Setup and picking from: Continuous, Ordinal, Nominal, or Nominal Text.

Alternately, you could enter data manually in jamovi’s data editor. It accepts numeric, scientific notation, and character data, but not dates. Its default format is numeric, but when given text strings, it converts automatically to Nominal Text. If that was a typo, deleting it converts it immediately back to numeric. I missed some features such as finding data values or variable names, or pinning an ID column in place while scrolling across columns.

To analyze data, you click on jamovi’s Analysis tab. There, each menu item contains a drop-down list of various popular methods of statistical analysis. In the image below, I clicked on the ANOVA menu, and chose ANOVA to do a factorial analysis. I dragged the variables into the various model roles, and then chose the options I wanted. As I clicked on each option, its output appeared immediately in the window on the right. It’s well established that immediate feedback accelerates learning, so this is much better than having to click “Run” each time, and then go searching around the output to see what changed.

The tabular output is done in academic journal style by default, and when pasted into Microsoft Word, it’s a table object ready to edit or publish:

You have the choice of copying a single table or graph, or a particular analysis with all its tables and graphs at once. Here’s an example of its graphical output:

Interaction plot from jamovi using the “Hadley” style. Note how it automatically offsets the confidence intervals for each workshop to make them easier to read when they overlap.

jamovi offers four styles for graphics: default, a simple one with a plain background; minimal, which – oddly enough – adds a grid at the major tick-points; I♥SPSS, which copies the look of that software; and Hadley, which follows the style of Hadley Wickham’s popular ggplot2 package.

At the moment, nearly all graphs are produced through analyses. A set of graphics menus is in the works. I hope the developers will be able to offer full control over custom graphics similar to Ian Fellows’ powerful Plot Builder used in his Deducer GUI.

The graphical output looks fine on a computer screen, but when using copy-paste into Word, it is a fairly low-resolution bitmap. To get higher resolution images, you must right click on it and choose Save As from the menu to write the image to SVG, EPS, or PDF files. Windows users will see those options on the usual drop-down menu, but a bug in the Mac version blocks that. However, manually adding the appropriate extension will cause it to write the chosen format.

jamovi offers full reproducibility, and it is one of the few menu-based GUIs to do so. Menu-based tools such as SPSS or R Commander offer reproducibility via the programming code the GUI creates as people make menu selections, but the settings in their dialog boxes are not saved from session to session, and since point-and-click users are often unable to understand that code, it’s not reproducible to them. A jamovi file contains the data, the dialog-box settings, the syntax used, and the output. When you re-open one, it is as if you just performed all the analyses and never left. So if your data collection process came up with a few more observations, or if you found a data entry error, making the changes will automatically recalculate the analyses that would be affected (and no others).

While jamovi offers reproducibility, it does not offer reusability. Variable transformations and analysis steps are saved, and can be changed, but the input data set cannot be changed. This is tantalizingly close to full reusability; if the developers allowed you to choose another data set (e.g. apply last week’s analysis to this week’s data) it would be a powerful and fairly unique feature. The new data would have to contain variables with the same names, of course. At the moment, only workflow-based GUIs such as KNIME offer re-usability in a graphical form.

As nice as the output is, it’s missing some very important features. In a complex analysis, it’s all too easy to lose track of what’s what. It needs a way to change the title of each set of output, and all pieces of output need to be clearly labeled (e.g. which sums of squares approach was used). The output needs the ability to collapse into an outline form to assist in finding a particular analysis, and also allow for dragging the collapsed analyses into a different order.

Another output feature that would be helpful would be to export the entire set of analyses to Microsoft Word. Currently you can find Export>Results under the main “hamburger” menu (upper left of screen). However, that saves only PDF and HTML formats. While you can force Word to open the HTML document, the less computer-savvy users that jamovi targets may not know how to do that. In addition, Word will not display the graphs when the output is exported to HTML. However, opening the HTML file in a browser shows that the images have indeed been saved.

Behind the scenes, jamovi’s menus convert its dialog box settings into a set of function calls from its own jmv package. The calculations in these functions are borrowed from the functions in other established packages. Therefore the accuracy of the calculations should already be well tested. Citations are not yet included in the package, but adding them is on the developers’ to-do list.

If functions already existed to perform these calculations, why did jamovi’s developers decide to develop their own set of functions? The answer is sure to be controversial: to develop a version of the R language that works more like the SPSS or SAS languages. Those languages provide output that is optimized for legibility rather than for further analysis. It is attractive, easy to read, and concise. For example, comparing the t-test and non-parametric analyses on two variables using base R functions would look like this:

> t.test(pretest ~ gender, data = mydata100)

Welch Two Sample t-test

data: pretest by gender
t = -0.66251, df = 97.725, p-value = 0.5092
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.810931 1.403879
sample estimates:
mean in group Female mean in group Male 
 74.60417 75.30769

> wilcox.test(pretest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: pretest by gender
W = 1133, p-value = 0.4283
alternative hypothesis: true location shift is not equal to 0

> t.test(posttest ~ gender, data = mydata100)

Welch Two Sample t-test

data: posttest by gender
t = -0.57528, df = 97.312, p-value = 0.5664
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.365939 1.853119
sample estimates:
mean in group Female mean in group Male 
 81.66667 82.42308

> wilcox.test(posttest ~ gender, data = mydata100)

Wilcoxon rank sum test with continuity correction

data: posttest by gender
W = 1151, p-value = 0.5049
alternative hypothesis: true location shift is not equal to 0

The same comparison using the jamovi GUI, or its jmv package, would look like this:

Output from jamovi or its jmv package.

Behind the scenes, the jamovi GUI was executing the following function call from the jmv package. You could type this into RStudio to get the same result:

library("jmv")
ttestIS(
 data = mydata100,
 vars = c("pretest", "posttest"),
 group = "gender",
 mann = TRUE,
 meanDiff = TRUE)

In jamovi (and in SAS/SPSS), there is one command that does an entire analysis. For example, you can use a single function to get: the equation parameters, t-tests on the parameters, an anova table, predicted values, and diagnostic plots. In R, those are usually done with five functions: lm, summary, anova, predict, and plot. In jamovi’s jmv package, a single linReg function does all those steps and more.
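To make that contrast concrete, the piecemeal base-R version of such a regression workflow is sketched below; the model formula is only an illustration, reusing the mydata100 example data from the t-tests above.

# Piecemeal base R: one function per step
fit <- lm(posttest ~ pretest + gender, data = mydata100)
summary(fit)                      # parameter estimates and t-tests
anova(fit)                        # ANOVA table
predicted <- predict(fit)         # predicted values
par(mfrow = c(2, 2)); plot(fit)   # diagnostic plots
# jamovi's jmv::linReg() wraps these steps (and more) in a single call.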

The impact of this design is very significant. By comparison, R Commander’s menus match R’s piecemeal programming style. So for linear modeling there are over 25 relevant menu choices spread across the Graphics, Statistics, and Models menus. Which of those apply to regression? You have to recall. In jamovi, choosing Linear Regression from the Regression menu leads you to a single dialog box, where all the choices are relevant. There are still over 20 items from which to choose (jamovi doesn’t do as much as R Commander yet), but you know they’re all useful.

jamovi has a syntax mode that shows you the functions that it used to create the output (under the triple-dot menu in the upper right of the screen). These functions come with the jmv package, which is available on the CRAN repository like any other. You can use jamovi’s syntax mode to learn how to program R from memory, but of course it uses jmv’s all-in-one style of commands instead of R’s piecemeal commands. It will be very interesting to see if the jmv functions become popular with programmers, rather than just GUI users. While it’s a radical change, R has seen other radical programming shifts such as the use of the tidyverse functions.

jamovi’s developers recognize the value of R’s piecemeal approach, but they want to provide an alternative that would be easier to learn for people who don’t need the additional flexibility.

As we have seen, jamovi’s approach has simplified its menus and R functions, but it offers a third level of simplification: by combining the functions from 20 different packages (displayed when you install jmv), you can install them all in a single step and control them through jmv function calls. This is a controversial design decision, but one that makes sense given their overall goal.

Extending jamovi’s menus is done through add-on modules that are stored in an online repository called the jamovi Library. To see what’s available, you simply click on the large “+ Modules” icon at the upper right of the jamovi window. There are only nine available as I write this (2/12/2018) but the developers have made it fairly easy to bring any R package into the jamovi Library. Creating a menu front-end for a function is easy, but creating publication quality output takes more work.

A limitation in the current release is that data transformations are done one variable at a time. As a result, setting measurement level, taking logarithms, recoding, etc. cannot yet be done on a whole set of variables. This is on the developers’ to-do list.

Other features I miss include group-by (split-file) analyses and output management. For a discussion of this topic, see my post, Group-By Modeling in R Made Easy.

Another feature that would be helpful is the ability to correct p-values wherever dialog boxes encourage multiple testing by allowing you to select multiple variables (e.g. t-test, contingency tables). R Commander offers this feature for correlation matrices (one that I contributed) and it helps people understand that the problem with multiple testing is not limited to post-hoc comparisons (for which jamovi does offer to correct p-values).

Though jamovi is only at version 0.8.1.2.0, I found only two minor bugs in quite a lot of testing. After asking for post-hoc comparisons, I later found that un-checking the selection box would not make them go away. The other bug is the one I described above when discussing the export of graphics. The developers consider jamovi to be “production ready” and a number of universities are already using it in their undergraduate statistics programs.

In summary, jamovi offers both an easy-to-use graphical user interface and a set of functions that combine the capabilities of many others. If its developers, Jonathon Love, Damian Dropmann, and Ravi Selker, complete their goal of matching SPSS’ basic capabilities, I expect it to become very popular. The only skill you need to use it is the ability to use a spreadsheet like Excel. That’s a far larger population of users than those who are good programmers. I look forward to trying jamovi 1.0 this August!

Acknowledgements

Thanks to Jonathon Love, Josh Price, and Christina Peterson for suggestions that significantly improved this post.

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.


Supervised vs. Unsupervised Learning: Exploring Brexit with PLS and PCA

By Computational Social Science

(This article was first published on Computational Social Science, and kindly contributed to R-bloggers)

Outcome Supervision

Yesterday I was part of an introductory session on machine learning and, unsurprisingly, the issue of supervised vs. unsupervised learning came up. In social sciences, there is a definite tendency for the former; there is more or less always a target outcome or measure that we want to optimise the performance of our models for. This reminded me of a draft whose code I had written a couple of months ago but for some reason never converted into a blog post until now. This will also allow me to take a break from conflict forecasting for a bit and go back to my usual topic of the UK. My frequent usage of all things UK is at such a level that my Google Search Console insights lists r/CasualUK as the top referrer. Cheers, mates!

UK General Elections and Brexit

I think I have exploited rvest enough recently, so I will rely on a couple of good old csv’s this time. The Electoral Commission website provides the EU referendum voting totals by region. I want to combine electoral data with other socio-economic indicators so I will subset the data to London only (this is how privilege perpetuates itself). Within that subset, I will only keep the area code (for matching purposes later), and the raw number of votes for each option in the referendum. I will also create a variable indicating ‘the outcome’ based on which side got more votes:

##EU Referendum data for London (file and column names assumed)
eu <- read.csv("EU-referendum-result-data.csv", stringsAsFactors = FALSE)
eu <- subset(eu, Region == "London", select = c(Area_Code, Leave, Remain))
eu$Outcome <- ifelse(eu$Leave > eu$Remain, "Leave", "Remain")

I supplement the referendum outcomes using data supplied by the London Datastore. The data structure is quite hostile to R (i.e. death by government-issue free-form Excel), so I cleaned it up manually a bit first. Let’s read in that version:

#London Borough data (manually cleaned-up csv; file name assumed)
london <- read.csv("london-borough-profiles.csv", stringsAsFactors = FALSE)

According to the website,

“…The London Borough Profiles help paint a general picture of an area by presenting a range of headline indicator data in both spreadsheet and map form to help show statistics covering demographic, economic, social and environmental datasets for each borough, alongside relevant comparator areas.”

Excellent. Now, one of the downsides of resurrecting months-old code is that sometimes, you don’t remember why you did something the way you did…

#Checking which columns have missing values (loop variable names assumed)
have.na <- c()
for (i in 1:ncol(london)) {
  na.flag <- ifelse(sum(is.na(london[, i])) > 0, 1, 0)
  have.na <- c(have.na, na.flag)
}
colnames(london)[have.na == 1]
## [1] "GAP"      "Male_GAP" "Fem_GAP"
#Remove columns with NAs
london <- london[, have.na == 0]

At least I do add comments even for trivial things. A bit convoluted for the task at hand, perhaps. Anyway, I will leave it in for posterity. Moving on, we can now combine the two datasets and name it accordingly:

#Merge the datasets by region code (key and object names assumed)
colnames(eu)[1] <- "Code"
london.eu <- merge(london, eu, by = "Code")

Rumour has it that the above code chunk is my current Facebook cover image. Where’s my biscuit?

Most of the variables are actually percentages rather than counts. We can write up a basic helper function to identify which ones have values bounded by 0-100. Naturally, we will misidentify some along the way so we’ll just remove them manually afterwards:

#Function to check whether a column is in the 0-100 range
in_range <- function(x) {
  if (!is.numeric(x)) return(FALSE)
  holder <- range(x, na.rm = TRUE)
  holder[1] >= 0 & holder[2] <= 100
}
percentage <- colnames(london.eu)[sapply(london.eu, in_range)]  # candidate columns (assumed)
percentage
##  [1] "Mean_Age"                "Population_0_15"        
##  [3] "Working_Age_Population"  "Population_65"          
##  [5] "Born_Abroad"             "Largest_Mig_Pop"        
##  [7] "Second_Largest_Mig_Pop"  "Third_Largest_Mig_Pop"  
##  [9] "BAME_Population"         "English_Not_First_Lang" 
## [11] "Employment_Rate"         "Male_Employment"        
## [13] "Female_Employment"       "Unemployment"           
## [15] "Youth_Unemployment"      "Youth_NEET"             
## [17] "Working_Age_Benefits"    "Working_Age_Disability" 
## [19] "No_Qualifications"       "Degree"                 
## [21] "Volunteer"               "Employed_Public_Sector" 
## [23] "Job_Density"             "Business_Survival"      
## [25] "Fires_Per_Thousand"      "Ambulance_Per_Hundred"  
## [27] "House_Owned"             "House_Bought"           
## [29] "House_Council"           "House_Landlord"         
## [31] "Greenspace"              "Recycling"              
## [33] "Cars_Per_Household"      "Cycle"                  
## [35] "Public_Trans_Access"     "High_Grades"            
## [37] "Child_Care"              "Pupil_English_Not_First"
## [39] "Male_Life_Expectancy"    "Female_Life_Expectancy" 
## [41] "Teenage_Conception"      "Life_Satisfaction"      
## [43] "Worthwhileness"          "Happiness"              
## [45] "Anxiety"                 "Childhood_Obesity"      
## [47] "Diabetes"                "Tories"                 
## [49] "Labour"                  "Lib_Dems"               
## [51] "Turnout"
#Remove age and other non-% variables from the list, then tag the rest with a "_%" suffix
#(the full exclusion list and the renaming step are assumed; Mean_Age is dropped per the text)
percentage <- setdiff(percentage, c("Mean_Age"))
idx <- match(percentage, colnames(london.eu))
colnames(london.eu)[idx] <- paste0(percentage, "_%")
percentage <- paste0(percentage, "_%")
head(percentage)
## [1] "Population_0_15_%"        "Working_Age_Population_%"
## [3] "Population_65_%"          "Born_Abroad_%"           
## [5] "Largest_Mig_Pop_%"        "Second_Largest_Mig_Pop_%"

Before moving on, let’s also check which variables are correlated with each other. Also, everyone loves correlation plots. To me, they are akin to Apple products—they always look cool, but their utility is situational. They also get increasingly difficult to interpret when n > 15ish. However, they have another useful function: you can specify the number of ‘blocks’ into which you want to divide the correlation plot if you opt for hierarchical clustering. Say we want to identify three such blocks in the data:

#Plot correlation matrix with three hclust rectangles (corrplot arguments assumed)
M <- cor(london.eu[, sapply(london.eu, is.numeric)], use = "pairwise.complete.obs")
corrplot::corrplot(M, order = "hclust", addrect = 3, tl.cex = 0.5)

Larger version here. The first rectangle (top left), let’s call it the Tory block as the first row/column is Tory voting percentage. We see that it is positively correlated with indicators such as voting Leave, high income/employment, old age, happiness/worthwhileness (?) etc. In other words, it passes the eye test for potential conservative party leanings. Conversely, the middle block is the ‘Labour’ counterpart. The correlated indicators revolve around issues such as voting Remain, young immigrant population, English not being the first language, benefits, unemployment. Again, passes the sanity check (hold your horses Tories, reverse causality and all). Finally, we have a smaller block (bottom right), which I’ll call the ‘Non-Aligned’. This cluster is curious: Basically, it represents the people living in the City—as indicated by high population/job density, median income, and police/ambulance/fire emergencies per thousand. Note that not everything in a block is necessarily (positively) correlated with each other; only the blue shades are.

We can also just rely on numbers to identify correlations above a specified threshold (albeit without clustering so less useful):

#Numerical version: columns correlated above a cut-off (findCorrelation and the .8 cut-off assumed)
correlated <- caret::findCorrelation(M, cutoff = .8)
colnames(M)[correlated]
##  [1] "Household_Estimate"       "Inland_Area"             
##  [3] "Population_Density"       "Mean_Age"                
##  [5] "Working_Age_Population_%" "Population_65_%"         
##  [7] "International_Migration"  "English_Not_First_Lang_%"
##  [9] "Overseas_National"        "New_Migrant_Rate"        
## [11] "Employment_Rate_%"        "Job_Density_%"           
## [13] "Active_Businesses"        "Crime_Per_Thousand"      
## [15] "Ambulance_Per_Hundred"    "House_Owned_%"           
## [17] "House_Bought_%"           "House_Council_%"         
## [19] "Carbon_Emissions"         "Cars"                    
## [21] "Cars_Per_Household"       "Public_Trans_Access_%"   
## [23] "Male_Life_Expectancy_%"   "Preventable_Deaths"      
## [25] "Labour_%"

Principal Component Analysis (PCA)

Probably the most commonly used unsupervised learning technique, alongside k-means clustering, is Principal Component Analysis. It’s so common that it’s probably your best bet for de-mystifying unsupervised learning; many have utilised the technique using commercial software such as SPSS/STATA. PCA is well-liked because it is pretty efficient at reducing dimension and creating uncorrelated variables (components), which helps with model stability. On the other hand, it is susceptible to (or drawn to) high-variance variables; if you measure the same phenomenon in days and in months, the former will be picked up in the earlier components. The downside is that the model might focus on sorting variance rather than identifying the underlying data structure. Second, as the technique is unsupervised, the components maximise captured variance without regard to an outcome. Meaning, your uncorrelated components only work in supervised settings if the captured variance is correlated with the outcome. Let’s unpack that a bit.
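As a quick, purely illustrative aside (the toy data below are made up for this point, not taken from the London data), prcomp() without scaling lets the large-variance column dominate the first component, which is why the variables are centred and scaled before the PCA:

set.seed(1895)
toy <- data.frame(days = rnorm(100, sd = 30), months = rnorm(100, sd = 1))
prcomp(toy)$rotation[, 1]                 # PC1 is essentially the "days" column
prcomp(toy, scale. = TRUE)$rotation[, 1]  # loadings even out once columns are scaled

With that in mind, first we will allocate the columns from our aforementioned blocks and apply some diagnostics: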

#Get colnames for splitting PCs: one character vector per correlation block
#(membership follows the corrplot above; the vectors are abbreviated here and the
# skewness diagnostic is an assumption)
tories      <- c("Population_65_%", "House_Owned_%", "Cars_Per_Household",
                 "Mean_Age", "Greenspace_%", "Cars")                               # etc.
labour      <- c("Overseas_National", "BAME_Population_%", "Net_Immigration",
                 "Childhood_Obesity_%", "New_Migrant_Rate", "Labour_%")            # etc.
non.aligned <- c("Population_Density", "Job_Density_%", "Jobs", "Active_Businesses",
                 "Crime_Per_Thousand", "Ambulance_Per_Hundred")                    # etc.
irony       <- london.eu[, sapply(london.eu, is.numeric)]

#Check the most skewed predictors
sort(apply(irony, 2, e1071::skewness), decreasing = TRUE)[1:10]
##              Lib_Dems_%           Job_Density_%                    Jobs 
##                3.358732                3.250213                3.043243 
##       Active_Businesses      Crime_Per_Thousand             House_Price 
##                2.862152                2.140319                2.057096 
##   Ambulance_Per_Hundred        Carbon_Emissions Third_Largest_Mig_Pop_% 
##                2.005706                1.606982                1.592895 
##             Council_Tax 
##                1.465253
#Look for non-zero variance predictors
nearZeroVar(irony)
## integer(0)

Looking good. Now we can actually create the PCs. For this exercise I will not split the data into train/test as we won’t be forecasting. I will rely on caret to do the transformation; however there are many other packages that you can use. We will center and scale the variables and perform PCA capturing at least 95% of the variance:

#Create PCs based on correlated blocks: centre, scale, and keep 95% of the variance
#(preProcess call reconstructed; argument values assumed from the text)
tory.trans <- preProcess(london.eu[, tories], method = c("center", "scale", "pca"), thresh = .95)
tory.pcs   <- predict(tory.trans, london.eu[, tories])
#Top loadings on the first Tory component
sort(tory.trans$rotation[, 1], decreasing = TRUE)[1:6]
##    Population_65_%      House_Owned_% Cars_Per_Household 
##          0.2764259          0.2763916          0.2714639 
##           Mean_Age       Greenspace_%               Cars 
##          0.2581242          0.2363438          0.2349750

Let’s quickly do the same for Labour and the non-aligned block:

#Repeat for other parties (same reconstruction as above)
labour.trans <- preProcess(london.eu[, labour], method = c("center", "scale", "pca"), thresh = .95)
labour.pcs   <- predict(labour.trans, london.eu[, labour])
sort(labour.trans$rotation[, 1])[1:6]
##   Overseas_National   BAME_Population_%     Net_Immigration 
##          -0.2721566          -0.2675754          -0.2600342 
## Childhood_Obesity_%    New_Migrant_Rate            Labour_% 
##          -0.2552952          -0.2504425          -0.2461091
non.trans <- preProcess(london.eu[, non.aligned], method = c("center", "scale", "pca"), thresh = .95)
non.pcs   <- predict(non.trans, london.eu[, non.aligned])

Finally, we can subset the PCs themselves and merge them with the main dataset:

#Get rid of unnecessary columns and rename the PCs
#(naming follows the Tory_PC1/Labour_PC1 labels used in the map below; the cbind is assumed)
colnames(tory.pcs)   <- paste0("Tory_",   colnames(tory.pcs))
colnames(labour.pcs) <- paste0("Labour_", colnames(labour.pcs))
colnames(non.pcs)    <- paste0("Non_",    colnames(non.pcs))
pca.data <- cbind(london.eu[, c("Code", "Outcome")], tory.pcs, labour.pcs, non.pcs)

Partial Least Squares (PLS)

In contrast, Partial Least Squares is a PCA that is told the outcome so it has something to calibrate to. If you also happen to have a printed copy of Applied Predictive Modeling, Kuhn and Johnson touch upon this on page 37. Paragraph three, towards the end. Done reading? Good. On that note, Max is on Twitter now:

Talk about an impossible task.

PLS is a supervised technique as we are supplying an outcome found in the data; in our case, whether the borough voted Leave or Remain. Let’s set up a standard caret operation using repeated 5-fold cross-validation, and specify that we want upsampling (to offset the class imbalance, as most London boroughs voted Remain) and the one-standard-error rule for model selection:

#Partial Least Squares set-up (fold and control arguments reconstructed from the text)
folds <- createMultiFolds(pca.data$Outcome, k = 5, times = 10)
ctrl  <- trainControl(method = "repeatedcv", number = 5, repeats = 10, index = folds,
                      sampling = "up", selectionFunction = "oneSE")

Let’s set the tune length to 20 and use Kappa as the accuracy metric. Don’t forget to remove the outcome from the right-hand side and apply the standard pre-processing procedures (remove zero-variance indicators, center and scale):

#Train Tory model (train call reconstructed; data assembly and arguments assumed)
set.seed(1895)
tory.data <- data.frame(Outcome = as.factor(london.eu$Outcome), london.eu[, tories])
tory.fit  <- train(Outcome ~ ., data = tory.data,
                   method = "pls", metric = "Kappa", tuneLength = 20,
                   preProcess = c("zv", "center", "scale"), trControl = ctrl)

caret also makes it easy to extract and visualise the top n variables (in terms of their predictive accuracy):

plot(varImp(tory.fit), 10)

Lib Dems incursion into Tory territory! Let’s repeat the same for the Labour side:

#Train Labour model (same reconstruction as above)
set.seed(1895) #Let's not take chances
labour.data <- data.frame(Outcome = as.factor(london.eu$Outcome), london.eu[, labour])
labour.fit  <- train(Outcome ~ ., data = labour.data,
                     method = "pls", metric = "Kappa", tuneLength = 20,
                     preProcess = c("zv", "center", "scale"), trControl = ctrl)

plot(varImp(labour.fit), 10)

So in our case, the issue of supervision does not seem to radically change the findings. Both the PCA and the PLS seem to agree: if we simplify greatly, older, happier, richer people vote Tory and Leave, whereas young, diverse, less financially able people vote Labour and Remain. Admittedly, neither of these is a ground-breaking insight. But for our purposes, we show that in this case, maximising variance in the data actually lends itself well to predicting our outcome of interest.

Mapping the Loadings

Keeping up with the data visualisation focus of this blog, let’s create a map of London boroughs highlighting the above findings. Similar to the setup of mapping Westeros in R, you need to have all the shape files in your current directory for the "." argument to work. We can then match the boroughs by their area code:

#London Boroughs shapefile (layer and key names assumed from the standard GLA download)
boroughs <- rgdal::readOGR(dsn = ".", layer = "London_Borough_Excluding_MHW")
geo.data <- sp::merge(boroughs, pca.data, by.x = "GSS_CODE", by.y = "Code")

We have several options for visualisation here. We can colour the boroughs by the loadings of each PC for each party, which would help us see the variation between them. We can also add a basic histogram showing the distribution of the top variables (in each component) and overlay coloured indicators within each borough to see whether we can find something interesting there:

#Visualisation
tm_shape(geo.data) +
#PC loadings data
tm_fill(c("Tory_PC1", "Tory_PC2", "Labour_PC1", "Labour_PC2"),
        palette = list("-Blues", "-Blues", "-Reds", "-Reds"),
        style = "order", auto.palette.mapping = TRUE, contrast = .8,
        legend.is.portrait = FALSE, legend.show = FALSE) +
#Titles
tm_layout(main.title = "Principal Components & Their Top Variables  |  London Boroughs", legend.position = c(0, 0), 
          title = list("% Population over 65",
                       "% House Owned Outright",
                       "Overseas Nationals (NINo)",
                       "Children Looked After per 10K"),
          panel.show = TRUE, panel.label.size = 1.4,
          panel.labels = c("Tory PC1 values in blue, darker shades are negative",
                           "Tory PC2 values in blue, darker shades are negative",
                           "Labour PC1 values in red, darker shades are negative",
                           "Labour PC2 values in red, darker shades are negative"), 
          fontfamily = "Roboto Condensed", between.margin = 0, asp = 1.78, scale = .6) +
#Overlay data options
tm_symbols(col = c("Population_65_%", "House_Bought_%", "Overseas_National", "Child_Care"),
           title.col = "", n = 4,
           scale = 1, shape = 24, border.col = "black", border.lwd = .6, border.alpha = .5,
           legend.hist = TRUE, legend.hist.title = "Distribution") +
#Border transperancy
tm_borders(alpha = .4)

Click here for a larger pdf version. Let’s analyse what we see in the top left panel, which visualises the first Tory PC loadings and the highest ranked variable within that component; the percentage of population over the age of 65. The blues represent the loading values; darker shades are negative whereas lighter shades are positive. One way to look at this is that ‘inner’ London is different than ‘outer’ London for this PC (remember, PCA maximises variance for its own sake). In this case, we find that this overlaps well with increasing share of senior citizens, who mostly live on the periphery (darker red triangles) and going by the histogram, constitute the smallest segment in the population. As usual, the code is available on GitHub, so good luck replicating comrade!


To leave a comment for the author, please follow the link and comment on their blog: Computational Social Science.


Sentiment Analysis of 5 popular romantic comedies

By Appsilon Data Science Blog

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

Background

With Valentine’s Day coming up, I was thinking about a fun analysis that I could convert into a blog post.

Inspired by a beautiful visualization “Based on a True True story?” I decided to do something similar and analyze the sentiment in the most popular romantic comedies.

After searching for romantic comedies Google suggests a list of movies, where the top 5 are: “When Harry Met Sally”, “Love Actually”, “Pretty Woman”, “Notting Hill”, and “Sleepless in Seattle”.

How to do text analysis in R?

We can use the subtools package to analyze the movies’ sentiment in R by loading the movie subtitles into R and then use tidytext to work with the text data.

library(subtools)
library(tidytext)
library(dplyr)
library(plotly)
library(purrr)
library(lubridate)
library(methods)

Working with movie data

I downloaded the srt subtitles for 5 comedies from Open Subtitles before the analysis.

Now let’s load them into R and have a sneak peek at what the data looks like.

romantic_comedies_titles <- c("Love Actually", "Notting Hill",
                              "Pretty Woman", "Sleepless in Seattle", "When Harry Met Sally")
subtitles_path <- "../assets/data/valentines/"

romantic_comedies <- romantic_comedies_titles %>% map(function(title){
  title_no_space <- gsub(" ", "_", tolower(title))
  title_file_name <- paste0(subtitles_path, title_no_space, ".srt")
  
  read.subtitles(title_file_name)$subtitles %>%
    mutate(movie_title = title)
})
head(romantic_comedies[[1]])
##   ID  Timecode.in Timecode.out
## 1  1 00:01:12.292 00:01:14.668
## 2  2 00:01:14.753 00:01:18.130
## 3  3 00:01:18.339 00:01:19.715
## 4  4 00:01:19.799 00:01:22.134
## 5  5 00:01:22.218 00:01:23.802
## 6  6 00:01:23.887 00:01:26.430
##                                                   Text   movie_title
## 1   Whenever I get gloomy with the state of the world, Love Actually
## 2 I think about the arrivals gate at Heathrow airport. Love Actually
## 3                  General opinion started to make out Love Actually
## 4         that we live in a world of hatred and greed, Love Actually
## 5                                but I don't see that. Love Actually
## 6                 Seems to me that love is everywhere. Love Actually

Subtitles preprocessing

The next step is tokenization, chopping up the subtitles into single words. At this stage I also perform a minor cleaning task, which is removing stop words and adding information about the line and its duration.

tokenize_clean_subtitles <- function(subtitles, stop_words) {
  subtitles %>%
    unnest_tokens(word, Text) %>%
    anti_join(stop_words, by = "word") %>%
    left_join(subtitles %>% select(ID, Text), by = "ID") %>%
    mutate(
      line = paste(Timecode.in, Timecode.out),
      duration = as.numeric(hms(Timecode.out) - hms(Timecode.in)))
}

data("stop_words")

head(stop_words)
## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART
tokenize_romantic_comedies <- romantic_comedies %>%
  map(~tokenize_clean_subtitles(., stop_words))

After tokenizing the data I need to classify the word sentiment. In this analysis I simply want to know if a word has positive or negative sentiment. The tidytext package comes with 3 lexicons. The bing lexicon categorizes words as positive or negative, so I use it to assign the extracted words to the desired classes.

bing <- sentiments %>%
  filter(lexicon == "bing") %>%
  select(-score)

assign_sentiment <- function(tokenize_subtitles, bing) {
  tokenize_subtitles %>%
    left_join(bing, by = "word") %>% 
    mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment)) %>%
    mutate(score = ifelse(sentiment == "positive", 1,
                          ifelse(sentiment == "negative", -1, 0)))
}

tokenize_romantic_comedies_with_sentiment <- tokenize_romantic_comedies %>%
  map(~assign_sentiment(., bing))

Since I am interested in deciding the sentiment of each movie line, I need to aggregate the scores at the line level. I create a simple rule: if the overall sentiment score is >= 1 we classify the line as positive, negative when it is <= -1, and neutral in the other cases.

summarized_movie_sentiment <- function(tokenize_subtitles_with_sentiment) {
  tokenize_subtitles_with_sentiment %>%
    group_by(line) %>%
    summarise(sentiment_per_minute = sum(score),
              sentiment_per_minute = ifelse(sentiment_per_minute >= 1, 1,
                                            ifelse(sentiment_per_minute <= -1, -1, 0)),
              line_duration = max(duration),
              line_text = dplyr::first(Text),
              movie_title = dplyr::first(movie_title)) %>% 
    ungroup() %>%
    mutate(perc_line_duration = line_duration/sum(line_duration))
}
summarized_sentiment_romantic_comedies <- tokenize_romantic_comedies_with_sentiment %>%
  map(~summarized_movie_sentiment(.))

Crème de la crème – data viz

After I am done with data preparation and munging, the fun begins and I get to visualize the data. In order to achieve a look similar to “Based on a True True Story?” I use stacked horizontal bar charts in plotly. The bar length represents the movie duration in minutes.

Hint: Hover over the chart to see the actual line and time. This only works in the original post.


plot_sentiment <- function(summarized_sentiment) {
  sentiment_freq <- round(
    summarized_sentiment %>%
      group_by(factor(sentiment_per_minute)) %>%
      summarize(duration = sum(perc_line_duration)) %>% .$duration * 100, 0)

  # (HTML markup in the title is assumed: bold movie title plus a line break)
  plot_title <- paste('<b>', summarized_sentiment$movie_title[1], '</b>', '<br>',
                      'Positive', paste0(sentiment_freq[3], '%'),
                      'Negative', paste0(sentiment_freq[1], '%'))

  plot_ly(summarized_sentiment, y = ~movie_title, x = ~perc_line_duration,
          type = "bar", orientation = 'h', color = ~sentiment_per_minute,
          text = ~paste("Time:", line, "<br>", "Line:", line_text),
          hoverinfo = 'text', colors = c("#01A8F1", "#f7f7f7", "#FA0771"),
          width = 800, height = 200) %>%
    layout(xaxis = list(title = "", showgrid = FALSE, showline = FALSE,
                        showticklabels = FALSE, zeroline = FALSE, domain = c(0, 1)),
           yaxis = list(title = "", showticklabels = FALSE),
           barmode = "stack", title = ~plot_title) %>%
    hide_colorbar()
}

htmltools::tagList(summarized_sentiment_romantic_comedies %>% map(~plot_sentiment(.)))

Next steps

Recently, I learned about the sentimentr package, which lets you analyze sentiment at the sentence level. It would be interesting to conduct the analysis that way and see what sentiment scores we would get.

If you enjoyed this post spread the ♥ and share this post with someone who loves R as much as you!

Read the original post at
Appsilon Data Science Blog.

To leave a comment for the author, please follow the link and comment on their blog: Appsilon Data Science Blog.


Teaching Luxembourgish to my computer

By rdata.lu Blog | Data science with R

(This article was first published on rdata.lu Blog | Data science with R, and kindly contributed to R-bloggers)


How we taught a computer to understand Luxembourgish

Today we reveal a project that Kevin and I have been working on for the past 2 months: Liss. Liss is a sentiment analysis artificial intelligence; you can let Liss read single words or whole sentences, and Liss will tell you whether the overall sentiment is positive or negative. Such tools are used in marketing, for instance to determine how people perceive a certain brand or new product. The originality of Liss is that it works on Luxembourgish texts.

How to develop a basic sentiment analysis AI

Machine learning algorithms need data in order to be trained. Training a machine means showing it hundreds, thousands, or even more examples so that it learns patterns; once the machine has learned the patterns, we can use it for predictions. For example, we can train a machine learning algorithm to determine whether a given picture shows a cat or a dog. For this, we show the machine thousands of pictures of cats and dogs, until it has learned some patterns. For instance, the machine will learn that cats are on average smaller than dogs, and have eyes that look more almond-shaped than dogs’ eyes. It is not always possible (although it sometimes is) to know which patterns, or features, the machine is using to learn the difference between cats and dogs. Then, once we show the machine a new picture, it will be able to predict, with a certain confidence, whether the picture is of a dog or a cat.

For sentiment analysis, you proceed in a similar manner: you show the algorithm thousands of texts that are labeled as either positive or negative, and then, once you show the machine new texts, it will predict their sentiment. However, where can you find such example data, also called the training set, to train your AI with? One solution is to scrape movie reviews. Movie reviews are texts written by humans, with a final score attached to them. A reviewer might write the following review about Interstellar:

“One of the best movies I have ever seen. The actors were great and the soundtrack amazing! 9/10”

Here, the AI will learn to associate the score of 9, very positive, to words such as best and amazing.

Another possible review, for, say, The Room could be:

“Wiseau put a lot of time and effort into this movie and it was utter crap. 2/10”

Here, the AI will learn to associate words such as crap with a low score, 2.

This is the gist of it, but of course feeding all these training examples to the AI requires a lot of thought and data pre-processing. This blog post will not deal with the technicalities, but rather with how we tackled a serious problem: where do we find movie reviews written in Luxembourgish?

Where we found Luxembourgish comments

Luxembourg is a small country with a small population. As such, the Luxembourgish-language part of the internet is quite small. Add to that the fact that most people only speak Luxembourgish and don’t know how to write it, and you have a big problem: as far as we are aware, it is not possible to find, say, movie reviews to train a machine on. So how did we tackle this problem? Because there were no comments in Luxembourgish lying around for us to work with, we scraped German comments. Linguistically, Luxembourgish is very close to western German dialects, but with some French influences too. Putting German and Luxembourgish sentences side by side clearly shows the similarities:

Luxembourgish:

  • “Hallo, wéi geet et dir?” (Hello, how are you?)

  • “Ganz gudd, merci!” (Very well, thank you!)

German:

  • “Wie geht es dir?”

  • “Ganz gut, danke!”

The only word in the Luxembourgish sentences that comes from French is merci, meaning thank you. Of course, this is a simple example, but if we look at more complicated sentences, for example from the Bible, we still see a lot of similarities between Luxembourgish and German:

Wéi d’Elisabeth am sechste Mount war, ass den Engel Gabriel vum Herrgott an eng Stad a Galiläa geschéckt ginn, déi Nazareth heescht,

bei eng Jongfra, déi engem Mann versprach war, dee Jouseph geheescht huet an aus dem Haus vum David war. Dës Jongfra huet Maria geheescht.

Den Engel ass bei si eragaang a sot: “Free dech, [Maria], ganz an der Gnod! Den Här ass mat dir.”

Und im sechsten Monat ward der Engel Gabriel gesandt von Gott in eine Stadt in Galiläa, die heißt Nazareth,

zu einer Jungfrau, die vertraut war einem Manne mit Namen Joseph, vom Hause David; und die Jungfrau hieß Maria.

Und der Engel kam zu ihr hinein und sprach: Gegrüßet seist du, Hochbegnadete! Der Herr ist mit dir!

-Lk 1,26-38

In these sentences, most differences come from the use of different tenses or different choices in what to include in the translation. For instance, the first sentence in Luxembourgish starts with Wéi d’Elisabeth am sechste Mount war (As Elisabeth was six months pregnant) while the German translation starts with Und im sechsten Monat (And in the sixth month). Same meaning, but the reference to Elisabeth is implicit.

So German and Luxembourgish are very close, but what good does that do us? Training a model on German movie reviews and trying to predict the sentiment of Luxembourgish texts will not work. So the solution was to translate the comments we scraped from German to Luxembourgish. We scraped 50,000 comments; obviously we could not translate them ourselves, so we used Google’s Translate API to do it. There’s a nice R package that makes it easy to work with this API, called {translate}.

The translation quality is not bad; the longer and more complicated the comments, the spottier the translation, but overall the quality seems to be good enough. Let’s translate the German sentences from above back into Luxembourgish using Google Translate:

An am sechsten Mount huet de Engel Gabriel vu Gott geschéckt an eng Stad an Galiläa geschéckt, déi Nazareth genannt gëtt

zu engem Kiischte, dee vum Josephsjäreger bekannt gouf, aus dem Haus vum David; an den Numm vun der Jungfra Maria war .

De Engel ass si komm an huet gesot: “A Blann, geeschteg bass! Den Här ass mat iech!

The translation here is really not that great, but Bible verses are written in a pretty unusual way. What about a more standard text? Let’s try with the first paragraph on Luxembourg from the German version of Wikipedia:

German:

Das Großherzogtum Luxemburg ist ein Staat und eine Demokratie in Form einer parlamentarischen Monarchie im Westen Mitteleuropas. Es ist das letzte Großherzog- bzw. Großfürstentum (von einst zwölf) in Europa. Das Land gehört zum mitteldeutschen Sprachraum. Landessprache ist Luxemburgisch, Verwaltungs- und Amtssprachen sind Französisch, Deutsch und Luxemburgisch. Gemeinsam mit seinem Nachbarn Belgien und mit den Niederlanden bildet Luxemburg die Beneluxstaaten.

Luxembourgish (from Google Translate):

D’Groussherzogtum Lëtzebuerg ass e Staat an eng Demokratie an der Form vun enger parlamentarescher Monarchie an der Westeuropa. Et ass de leschte Grand-Duché oder Grand-Duché (e puer Joeren) an Europa. Dëst Land gehéiert zu der zentrale germanescher Sprooch. D’Nationalsprooch ass Lëtzebuergesch, administrativ an offizielle Sproochen sinn franséisch, däitsch a lëtzebuergesch. Zesumme mat sengem Noper Belgien an Holland ass Lëtzebuerg de Benelux.

English (from Google Translate):

(The Grand Duchy of Luxembourg is a state and a democracy in the form of a parliamentary monarchy in western Central Europe. It is the last Grand Duchy or Grand Duchy (once twelve) in Europe. The country belongs to the central German language area. The national language is Luxembourgish, administrative and official languages are French, German and Luxembourgish. Together with its neighbor Belgium and the Netherlands, Luxembourg is the Benelux.)

Knowing both German and Luxembourgish, I can tell that the translation is pretty good, and would require minimal human editing to make it perfect.

So we were pretty confident that this strategy would be worth trying, and that’s what we did. We translated the comments from German using the Google Translate API from R; a sketch of that step is shown below.
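This is only a minimal sketch of the translation step, using the {translate} package mentioned above; the API key, the german_comments object, and the exact argument order and return shape of translate() are assumptions rather than our original code.

library(translate)
library(purrr)

set.key("YOUR_GOOGLE_API_KEY")   # placeholder key

# german_comments: character vector of scraped German reviews (assumed object)
luxembourgish_comments <- map_chr(german_comments, function(txt) {
  # source language "de", target language "lb" (Luxembourgish)
  as.character(translate(txt, "de", "lb"))
})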

Once we had translated everything, we started training a model.

The sentiment analysis tool we built

To train the model, we use the R programming language and Keras, a deep learning library. The comments had to be preprocessed, which is what took the most time. Building a model with Keras is quite simple, and we did not do anything special to it; actually, we did not spend much time tuning the model and, to our astonishment, it worked quite well! To share the results with anyone, we also created a web app that you can access by clicking here.
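For readers who want a feel for what such a model looks like, here is a minimal sketch of a binary sentiment classifier with the keras R package; the architecture, vocabulary size, and padding length are illustrative choices, not the configuration actually used for Liss.

library(keras)

max_words <- 10000   # vocabulary size (illustrative)
max_len   <- 100     # padded comment length (illustrative)

# x_train: integer-encoded, padded comments; y_train: 1 = positive, 0 = negative
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = 64, input_length = max_len) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  loss      = "binary_crossentropy",
  optimizer = "adam",
  metrics   = "accuracy"
)

model %>% fit(x_train, y_train, epochs = 5, batch_size = 64, validation_split = 0.2)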
Try writing words and sentences, and most importantly give us feedback! See you in the next post.

To leave a comment for the author, please follow the link and comment on their blog: rdata.lu Blog | Data science with R.


Support for hOCR and Tesseract 4 in R

By rOpenSci – open tools for open science


(This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers)

Earlier this month we released a new version of the tesseract package to CRAN. This package provides R bindings to Google’s open source optical character recognition (OCR) engine Tesseract.

Two major new features are support for HOCR and support for the upcoming Tesseract 4.

hOCR output

Support for HOCR output was requested by one of our users on Github. The ocr() function gains a parameter HOCR which allows for returning results in hOCR format:

library(tesseract)

# Text output (the image path is a placeholder)
text <- ocr("test.png")
cat(text)

hOCR is an open standard of data representation for formatted text obtained from OCR (wikipedia). The definition encodes text, style, layout information, recognition confidence metrics and other information using XML.

Every word in the hOCR output includes meta data such as bounding box, confidence metrics, etc. With a little xml2 and regular expression magic we can extract a beautiful data frame:

library(tesseract)
library(xml2)
library(stringr)
library(tibble)
# hOCR output, parsed into a data frame
# (field extraction shown here is one way to do it; attribute names follow the hOCR spec)
xml   <- ocr("test.png", HOCR = TRUE)
doc   <- read_html(xml)
nodes <- xml_find_all(doc, "//span[@class='ocrx_word']")
words <- xml_text(nodes)
meta  <- xml_attr(nodes, "title")
bbox  <- str_replace(str_extract(meta, "bbox [\\d ]+"), "bbox ", "")
conf  <- as.numeric(str_replace(str_extract(meta, "x_wconf \\d+"), "x_wconf ", ""))
tibble(confidence = conf, word = words, bbox = bbox)
# A tibble: 60 x 3
   confidence word  bbox          
                 
 1       89.0 This  36 92 96 116  
 2       89.0 is    109 92 129 116
 3       92.0 a     141 98 156 116
 4       93.0 lot   169 92 201 116
 5       91.0 of    212 92 240 116
 6       91.0 12    251 92 282 116
 7       92.0 point 296 92 364 122
 8       89.0 text  374 93 427 116
 9       93.0 to    437 93 463 116
10       90.0 test  474 93 526 116
# ... with 50 more rows

So this gives us a little more information about the OCR results than just the text.

Upcoming Tesseract 4

The Google folks and contributors are working very hard on the next generation of the Tesseract OCR engine, which uses a new neural network system based on LSTMs, with major accuracy gains. The release of Tesseract 4 is scheduled for later this year, but an alpha release is already available.

Our latest CRAN release of the tesseract package now has the required changes to support Tesseract 4. On MacOS you can already give this a try by installing tesseract from the master branch:

brew remove tesseract
brew install tesseract --HEAD

After updating tesseract you need to reinstall the R package from source:

install.packages("tessract", type = "source")

This is still alpha, things may break. Report problems in our github repository.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science.


weakly informative reparameterisations

By xi’an

(This article was first published on R – Xi’an’s Og, and kindly contributed to R-bloggers)

Our paper, weakly informative reparameterisations of location-scale mixtures, with Kaniav Kamary and Kate Lee, got accepted by JCGS! Great news, which comes at the perfect time for Kaniav as she is currently applying for positions. The paper proposes a Bayesian modelling of unidimensional mixtures based on first and second moment constraints, since these turn the remainder of the parameter space into a compact set. While we had already developed an associated R package, Ultimixt, the current editorial policy of JCGS requires the R code used to produce all results to be attached to the submission, and it took us a few more weeks than it should have to produce directly executable code, due to internal library incompatibilities. (For this entry, I was looking for a link to our special JCGS issue with my picture of Edinburgh but realised I did not have this picture.)

To leave a comment for the author, please follow the link and comment on their blog: R – Xi’an’s Og.
