heatmaply: interactive heat maps (with R)

By Tal Galili


(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

I am pleased to announce heatmaply, my new R package for generating interactive heat maps, based on the plotly R package.

tl;dr

By running the following 3 lines of code:

install.packages("heatmaply")   # install from CRAN
library(heatmaply)              # heatmaply re-exports the %>% pipe; layout() comes from plotly
heatmaply(mtcars, k_col = 2, k_row = 3) %>% layout(margin = list(l = 130, b = 40))  # color 2 column / 3 row clusters; widen margins so labels fit

You will get this output in your browser (or in the RStudio viewer):

You can see more examples in the online vignette on CRAN. For issue reports or feature requests, please visit the GitHub repo.

Introduction

A heatmap is a popular graphical method for visualizing high-dimensional data, in which a table of numbers is encoded as a grid of colored cells. The rows and columns of the matrix are ordered to highlight patterns, and are often accompanied by dendrograms. Heatmaps are used in many fields for visualizing observations, correlations, missing-value patterns, and more.

Interactive heatmaps allow the inspection of specific values by hovering the mouse over a cell, as well as zooming into a region of the heatmap by dragging a rectangle around the relevant area.

This work is based on ggplot2 and the plotly.js engine. It produces heatmaps similar to those of d3heatmap, with the advantages of speed (plotly.js can handle larger matrices), the ability to zoom from the dendrogram (thanks to the dendextend R package), and the possibility of new features in the future (such as side color bars).

Why heatmaply

The heatmaply package is designed to offer features and a user interface familiar from heatmap, gplots::heatmap.2, and other functions for static heatmaps: you can specify dendrogram, clustering, and scaling options in the same way (a short sketch follows the feature list below). heatmaply includes the following features:

  • Shows the row/column/value under the mouse cursor (and includes a legend on the side)
  • Drag a rectangle over the heatmap image, or the dendrograms, in order to zoom in (the dendrogram coloring relies on integration with the dendextend package)
  • Works from the R console, in RStudio, with R Markdown, and with Shiny
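
For instance, a minimal sketch of these heatmap.2-style options (the argument names follow the heatmaply documentation; the specific values are only illustrative):

library(heatmaply)

heatmaply(
  mtcars,
  scale      = "column",   # scale each column before plotting, as in heatmap.2
  dendrogram = "both",     # draw both row and column dendrograms
  k_row      = 3,          # color 3 clusters in the row dendrogram
  k_col      = 2           # color 2 clusters in the column dendrogram
)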

The package is similar to the d3heatmap package (developed by the brilliant Joe Cheng), but is based on the plotly R package. Performance-wise, it can handle larger matrices. Furthermore, since it is based on ggplot2 + plotly, it is expected to gain more features in the future (as it is more easily extendable, also by non-JavaScript experts). I chose to build heatmaply on top of plotly.js since it is a free, open-source JavaScript library that can translate ggplot2 figures into self-contained interactive JavaScript objects (which can be viewed in your browser or in RStudio).

The default color palette for the heatmap is based on the beautiful viridis package. Also, by using the dendextend package (see the open-access two-page Bioinformatics paper), you can customize dendrograms before sending them to heatmaply (via Rowv and Colv).
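
A minimal sketch of that workflow (closely following the pattern used in the package vignette):

library(heatmaply)
library(dendextend)

# build a row dendrogram, color 3 clusters with dendextend,
# then hand it to heatmaply via Rowv
row_dend <- as.dendrogram(hclust(dist(mtcars)))
row_dend <- color_branches(row_dend, k = 3)
heatmaply(mtcars, Rowv = row_dend)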

You can see some more eye candy in the online vignette on CRAN, for example:

For issue reports or feature requests, please visit the GitHub repo.


Happy New Year, Mr. President. Data and Sentiment Analysis of Presidential New Year Speeches

By Salvino A. Salvaggio

(This article was first published on RSS Feed – SaS in #R#, and kindly contributed to R-bloggers)

Salvino A. Salvaggio [1] [2] [3]

At a moment when many are preparing for the December 31st evening cocktail, the end-of-year speech of the President of the Italian Republic is broadcast right on time at 8:30pm. A tradition born with the constitutional establishment of the Italian Republic itself (or almost [4]), the year-end message has outlasted every trend and survived the technological refoundation of the mediascape [5]. From the beginning it also imposed itself as a television format [6] able to bring together millions of TV viewers [7], even though they are often engrossed in the final preparations for the evening's celebrations.

The point here is not to question the appropriateness of such an annual event nor, much less, the reasons that sustain or exhaust this republican ritual [8]. We simply take advantage of the fact that 67 years of presidential year-end messages now form a documentary corpus rich enough to become the object of data analysis.

For over a decade, the statistical analysis of written documents has been a common practice and a consolidated scientific instrument (Jockers, 2014; Gries, 2013; Baayen, 2008). Although data analysis has been fully incorporated into the field of literary studies, little has been done to investigate the textual or verbal production of public administrations with equal quantitative rigor. This work contributes to filling that gap by exploring all the Italian Presidents' New Year speeches from 1949 to 2015, bringing the tools and methodologies of data science into the field of political studies.

1. Style and elocution

First of all, we must note that not all Presidents of the Italian Republic addressed the Nation seven times with year-end wishes. Luigi Einaudi remained in office for 6 years, Antonio Segni for 2, Giorgio Napolitano for 9 (a first seven-year mandate and a second he resigned from after 2 years), and Sergio Mattarella has taken part in the New Year's Eve ritual only once to date. Three types of data are of immediate help:

  • While the total number of words spoken is affected by the number of years in office, the average number of words per speech provides a better indication of the more concise and the more loquacious Presidents.
  • The average number of words per sentence helps to characterize the linguistic style of each President.
  • The average number of words spoken per minute gives a good idea of the rhythm and speed of elocution. (A hypothetical sketch of these computations follows the table below.)
President | Speeches | Total words | Words/speech | Words/sentence | Words/minute
Luigi Einaudi | 6 | 1,210 | 202 | 42 | 116
Giovanni Gronchi | 7 | 5,821 | 832 | 46 | 139
Antonio Segni | 2 | 1,797 | 898 | 49 | 126
Giuseppe Saragat | 7 | 8,471 | 1,210 | 32 | 132
Giovanni Leone | 7 | 7,333 | 1,048 | 29 | 142
Sandro Pertini | 7 | 14,705 | 2,101 | 19 | 95
Francesco Cossiga | 7 | 14,131 | 2,019 | 38 | 115
Oscar Luigi Scalfaro | 7 | 24,588 | 3,513 | 18 | 110
Carlo Azeglio Ciampi | 7 | 12,605 | 1,801 | 18 | 99
Giorgio Napolitano | 9 | 20,377 | 2,264 | 30 | 122
Sergio Mattarella | 1 | 2,117 | 2,117 | 17 | 106

[Figure: descriptive statistics by President]
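
A hypothetical sketch of these computations. The speeches data frame and its column names (president, year, text, duration_min) are assumptions for illustration, not part of the original analysis; the word and sentence counts are simple approximations.

library(dplyr)
library(stringr)

# speeches: one row per year-end address (assumed columns:
# president, year, text, duration_min)
desc_stats <- speeches %>%
  mutate(
    n_words     = str_count(text, "\\S+"),    # words = runs of non-spaces
    n_sentences = str_count(text, "[.!?]+")   # rough sentence count
  ) %>%
  group_by(president) %>%
  summarise(
    speeches           = n(),
    total_words        = sum(n_words),
    words_per_speech   = round(mean(n_words)),
    words_per_sentence = round(sum(n_words) / sum(n_sentences)),
    words_per_minute   = round(sum(n_words) / sum(duration_min))
  )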

Two significant elements can then be drawn from the previous basic data:

  • The mutations of the stylistic forms unique to each President
  • The changes in elocution rhythms of each of them

1.1. Stylistic mutations
The stylistic mutations over the course of the history of the year-end speeches appear with great clarity. [9] First of all, a progressive shortening of the sentences, almost halved, is evident: from approximately 42 words per sentence in 1949-1968 to approximately 22 words in the seven-year terms of Presidents Ciampi and Scalfaro (with the exception of President Cossiga, who used much longer sentences). Later, President Napolitano reversed the trend, proposing a more sophisticated language built on longer sentences than those of his immediate predecessors (approximately 28 words per sentence on average). Finally, in his first year-end address, President Mattarella constructed his speech on very short sentences (17-18 words on average), sharp and of great impact, in a style more journalistic than institutional (similar to Presidents Ciampi, Scalfaro and Pertini).

[Figure: stylistic mutations (words per sentence) over time]
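
A hypothetical sketch of the change-point idea mentioned in note [9], here using the changepoint package rather than the methods cited there; the input vector is an assumed name, one value per year from 1949 on.

library(changepoint)

# words_per_sentence_by_year: assumed numeric vector of yearly averages
cpt <- cpt.mean(words_per_sentence_by_year, method = "PELT")
cpts(cpt)   # indices of the years where the mean level shifts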

1.2. Elocutionary evolutions
The profound mutation of the media ecosystem over the last decades, and its impact on the forms of oral and written expression in general, could lead one to believe that a similar effect was impressed on the rhythm and speed of the Presidents' elocution. The progressive shift of the dominant media (from newspapers to radio, to TV, to the Internet, to social media), as well as the changes undergone by formats that increasingly favor brief communication, have spread so pervasively into every fold of society that it seems natural to expect a similar transformation of presidential elocution, reflecting the passage from a sacred conception of the Presidency to one more in line with the zeitgeist. In fact, none of this seems to apply to the presidential year-end addresses: instead of an acceleration of the elocution, the contrary is observed. Presidents Gronchi, Segni, Saragat and Leone spoke much more rapidly than Presidents Pertini, Cossiga, Scalfaro, Ciampi, Napolitano and Mattarella. [10] [11]

[Figures: evolution of elocution speed (words per minute) over time and by President]

Perhaps it is worth noting a detail: although all of the Presidents were seated when delivering their year-end wishes, not all of them were seated in the same manner. For most of the addresses the Presidents sat at their desk; Presidents Pertini, Scalfaro and Mattarella at times used an armchair in a Quirinale sitting room, without a desk.[12] President Pertini in 1984 even sat next to a fireplace, in a simple setting far from the formality and the gilded palace adornments. In particular, in 2011 and 2012, President Napolitano sat at the desk in an informal way, sideways (and with his jacket unbuttoned in 2011). On these two occasions the average number of words spoken per minute rose from 110 to 120 [13], suggesting (but it is only a general indication, whose evolution should be followed over the years) that changes in the setting of the year-end speech may have an impact on the elocution.


2. Length

Although President Scalfaro's seven-year term features the most rambling year-end speeches in the whole history of the Italian Republic (with an all-time record of 5,013 words spoken on December 31, 1997), a trend toward longer messages is undeniable: from an average of fewer than 500 words per speech before 1960 [14] to an average of more than 1,500 words after 1980 [15]. This increasing trend, however, shows a setback in 1999-2000 and a slight about-turn in the last 15 years, with speeches that mostly count between approximately 1,800 and 2,250 words.

[Figure: speech length over time]

3. Frequency

With this type of public speech, which television broadcasting turns into a sort of formal and recurrent celebration available to all (defined by President Saragat in 1964 as a “spiritual communion for the entire nation”), the choice and use of terms spoken by the Presidents is certainly not left to chance. After all, since the early years the year-end presidential addresses have been the object of fine-grained interpretation and analysis by journalists, analysts and politicians, besides being followed by a very wide audience. It is therefore useful to highlight the words, and behind them the themes, used most frequently by the Presidents, in an attempt to outline a map of the topics that are historically relevant or of personal interest to each speaker.

Across the totality of year-end messages delivered by all the Presidents, it is not surprising to see that Italy represents the main and dominant theme [16], together with the balance of the closing year and with suggestions, proposals and wishes for the new one. Peace and young people also occupy a prominent position, but politics, Europe, freedom and employment are likewise topics that many Presidents have taken up on December 31.

[Figure: most frequent terms across all speeches]

The study of the absolute frequency of words also provides clear indications on the overall landscape of the presidential messages. They appear as an expression of unity for the country, actual or hoped for, of gathering around shared patriotic symbols, and of celebration, if not construction, of the national narrative [17]. But not all Presidents insist in the same way on the same themes. Beyond the points of convergence, the lexical ecosystem of each President varies greatly with respect to the overall ‘average’, illustrating specific and multifaceted thematic geographies. The word clouds highlight the personal concerns and the current events that attracted, in different ways, the attention of each President.
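
A hypothetical sketch of the word counts behind these word clouds; tidytext is used here only for illustration, and the speeches data frame is the same assumed input as in the earlier sketch.

library(dplyr)
library(tidytext)
library(stopwords)

term_freq <- speeches %>%
  unnest_tokens(word, text) %>%                       # one row per word
  filter(!word %in% stopwords::stopwords("it")) %>%   # drop Italian stop words
  count(president, word, sort = TRUE)                 # word frequency per President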

[Figures: word clouds for Presidents Einaudi, Gronchi, Segni, Saragat, Leone, Pertini, Cossiga, Scalfaro, Ciampi, Napolitano and Mattarella]

4. Words association

The identification of association structures (words that are often used together), in pairs (bigrams) or in longer structures (trigrams or n-grams), is a pillar of natural language processing and content analysis.[18]

[Figure: graph of the most frequent token associations]
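
A hypothetical sketch of the n-gram extraction. The original trigram counts were obtained with AntConc (see note [20]); tidytext is shown here only as one possible way to reproduce them, on the same assumed speeches data frame.

library(dplyr)
library(tidytext)

trigrams <- speeches %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE) %>%
  filter(n >= 10)   # trigrams uttered at least 10 times, as in the table below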

Besides the obvious expressions and clichés typical of year-end speeches (such as ‘happy year’, ‘new year’, ‘this year’) and the references to the nation (‘Italian people’, ‘national unity’, ‘common good’, ‘constitutional charter’), some unexpected expressions appear among the most cited. Without a quantitative analysis of the language it is difficult, for example, to realize how self-referential these messages are, with a strong recurrence of expressions such as ‘president republic’ [19], ‘head of state’ and ‘I would like to say’ (vorrei dire), or how often ‘armed forces’ and ‘law enforcement’ appear in the messages. Additionally, the analysis of the n-grams removes every doubt about the themes of central importance for Italy in these almost seventy years of contemporary history: ‘european union’, ‘united nations’, ‘international community’ and ‘middle east’ recall Italy's place in the network of international relationships and in large-scale geopolitical problems. In the same way, expressions like ‘social justice’, ‘social economic’, ‘public debt’ or ‘organized crime’ allude to internal problems that have marked the recent history of the country. Finally, we can reasonably ask whether the presence of ‘john paul’ and ‘paul ii’ among the most used bigrams does not suggest a proximity, perhaps more spiritual than geographical, between the Quirinale and the Vatican. After all, over the 67 years analysed here, the trigram ‘help from god’ appears more often than ‘jobs’…

By way of example, here is a partial and selective list of trigrams [20] uttered at least 10 times.

Frequency Trigram
45 forze dell ordine
44 capo dello stato
43 il popolo italiano
36 del nostro paese
36 per la pace
25 presidente della repubblica
24 tutti gli italiani
24 tutti i cittadini
21 senso di responsabilità
19 la libertà e
16 il capo dello
14 il mio pensiero
14 le forze politiche
13 contro il terrorismo
13 economico e sociale
13 giovanni paolo ii
13 presidente del consiglio
12 della nostra società
12 il bene comune
11 aiuto di dio
11 pace nel mondo
11 per la giustizia
10 dell unione europea
10 della persona umana
10 delle nazioni unite
10 posti di lavoro


5. Themes

As already seen, the historical period, the major national or international events, the most acute problems, the plights faced by government institutions or by the population, and the particular interests of each President all have a considerable impact on the terminology adopted in the year-end speeches. To illustrate this selection, let us focus on seven specific themes [21]:

  • unemployment
  • work and workers
  • young people
  • culture
  • terrorism and terror
  • reforms
  • homeland [22]

Presented as graphs (titled ‘Frequency of theme in each speech’), how, and how often, the Presidents evoke a specific theme across the 67 year-end addresses becomes easier to perceive. Each graph captures one topic and indicates the number of times that theme is cited in each particular message. For example, the first graph, on unemployment, shows that the term was used six times in the year-end message of 1984 and five times in 1979, 1981 and 1983 (President Sandro Pertini), or four times in 1992 (President Scalfaro).

This, however, provides only half of the analysis. The significance of a theme is not measured only by the overall absolute frequency of the relevant term in the single speeches, but also by the relative frequency of the term in the message compared to the average use of the same term in the Italian language in general in the same historical period or year.[23] That comparison is given in the second graph (entitled ‘Relative frequencies of theme’).
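
A hypothetical sketch of the two measures for a single theme; the regular expression follows the spirit of note [22] (English patterns shown here, whereas the original worked on the Italian text), and the column names are assumptions.

library(dplyr)
library(stringr)

unemployment <- speeches %>%
  mutate(
    n_theme  = str_count(tolower(text), "unemployment|underemployment"),  # absolute count per speech
    n_words  = str_count(text, "\\S+"),
    rel_freq = n_theme / n_words   # relative frequency, comparable to a language-wide baseline
  )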

5.1. Unemployment

[Figures: absolute and relative frequencies of the unemployment theme]


5.2. Work and workers

[Figures: absolute and relative frequencies of the work theme]


5.3. Young people

[Figures: absolute and relative frequencies of the young people theme]


5.4. Culture

[Figures: absolute and relative frequencies of the culture theme]


5.5. Terrorism and terror

[Figures: absolute and relative frequencies of the terrorism theme]


5.6. Reforms

[Figures: absolute and relative frequencies of the reforms theme]


5.7. Homeland

[Figures: absolute and relative frequencies of the homeland theme]

It should be stressed that, in each of these seven examples, the difference in the relative frequency of the theme (lemma) between the Italian language in general (in those years) and the presidential year-end speeches is not fortuitous: the p-value of the independence t-test [24] is always lower than 0.001.
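
A hypothetical sketch of that check: Welch's t-test (note [24]) comparing the theme's relative frequencies in the speeches with its yearly baseline frequencies in Italian. Both input vectors are assumed names, one value per year.

# rel_freq_speeches, rel_freq_baseline: assumed numeric vectors of yearly relative frequencies
t.test(rel_freq_speeches, rel_freq_baseline, var.equal = FALSE)   # Welch's t-test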

From this type of analysis it is possible to verify, for example, that the theme of terrorism occupies a position of particular relevance for President Pertini, surely because his seven-year term was, from this point of view, a difficult and tragic period for Italy, just as the issues of unemployment and young people were central to him. But it is also noticeable that, overall, for the Italian Presidents terrorism is written into the national historical memory as a phenomenon associated as much, if not more, with the late 1970s and early 1980s as with the episodes of the last 10-15 years, from 2001 on. President Napolitano seems instead to be preoccupied not only with the theme of employment but also, and above all, with the question of State reform, whereas he refers to the homeland only 4 times in 9 years; a theme that, instead, was dear to Presidents Cossiga and Scalfaro. The theme of culture has mainly attracted the attention of Presidents Cossiga and Scalfaro, although the statistical analysis would need further refinement to understand whether they meant culture in the anthropological or the artistic sense.

Once the algorithm is established, this type of analysis has the advantage of being extendable to any topic, offering a historical reading rooted in quantified data and not only in the researcher's intuition. A comparison between the “traditional” socio-political analysis and the statistical reading of the data on themes such as the mafia, north-south inequalities, migratory flows or innovation, to name just a few, would surely be instructive.

6. Sentiment analysis

The statistical analysis of sentiment [25] was developed from a combination of quantitative methodologies aimed at measuring, quantifying and classifying the opinions and sentiments expressed in documents (a textual corpus) through words that have a positive, negative or neutral semantic connotation. For example, depending on the lexical composition of the text, the sentence “Last quarter the European economy has benefitted from the favorable price of oil” would be classified as positive, while “In the same period the labor market has continued to suffer, penalizing particularly young people.” would be considered negative. In recent years the development of the discipline has accelerated strongly, fuelled by the desire to better understand the content published by hundreds of millions of social media users (notwithstanding the difficulties of automatically tracking sentiment in short texts; Thelwall, 2010) and by the countless websites that allow users to publish reviews of products and services (Galitsky, 2009; Cataldi et al., 2013).

Normally used for scientific purposes [26] but also to improve marketing efficiency and expand business opportunities [27], sentiment analysis applied to historical, administrative or institutional documents remains embryonic, above all in the case of documents written in Italian. [28] Consequently, asking whether the “truth” of the Presidential year-end speeches in Italy lies as much in the analysis of the data behind them as in more traditional political analysis can only stimulate an innovative line of research and trigger considerations rooted in the extension of data science to the study of governments and institutions.

The first step in the sentiment analysis of the Presidential messages consists of reducing lexical complexity through lemmatization of the text, that is, the substitution of each term with its lemma of reference. For example, the last sentence of the first year-end address by President Einaudi in 1949, “such I am sure is the common vote and such is my personal wish which is directed heartfelt and with affection at this hour to each Italian in and out of the boundaries of the country”, becomes “such to be sure to be the common vote and such to be my personal wish which to direct heartfelt and with affection at the hour to each Italian in and out of the boundary of the country”. This simplifies the management of the dictionaries (word lists), which no longer need to include plurals, conjugated or derived forms but can be limited to infinitives and simple forms. For a basic sentiment analysis, once the speeches are lemmatized, a positive value (+1) is associated with each term (lemma) that expresses a positive sentiment, opinion, attitude or concept; a negative value (-1) with each lemma that conveys a negative sentiment, opinion, attitude or concept; and a neutral value (0) with neutral terms (Vryniotis, 2013).[29] After summing the values of the single words grouped by sentence, each sentence of each presidential speech can be characterized by 3 absolute values: the sum of the “positive sentiments” (a positive integer), the sum of the “negative sentiments” (a negative integer), and the overall sum of the sentiments expressed in that sentence (a positive or negative integer depending on the dominant sentiments).
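
A hypothetical sketch of this dictionary scoring. The lemmatized_sentences and sentiment_dict objects, and their column names, are assumptions for illustration; the actual Italian dictionary used is discussed in note [28].

library(dplyr)
library(tidytext)

# lemmatized_sentences: one row per sentence, columns speech_id, sentence_id, text (already lemmatized)
# sentiment_dict: columns lemma, value (+1, -1 or 0)
sentence_scores <- lemmatized_sentences %>%
  unnest_tokens(lemma, text) %>%
  left_join(sentiment_dict, by = "lemma") %>%
  mutate(value = coalesce(value, 0)) %>%   # lemmas missing from the dictionary count as neutral
  group_by(speech_id, sentence_id) %>%
  summarise(
    n_words  = n(),
    positive = sum(value[value > 0]),
    negative = sum(value[value < 0]),
    overall  = sum(value),
    .groups  = "drop"
  )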

The distribution of the overall sentiment sums shows that the year-end messages predominantly transmit positive concepts, opinions and sentiments. From 1949 to the present day, the sentences of the speeches sit mostly on the positive segment of the axis, with an average sentiment value of +3.7 and a median of +3.

[Figure: distribution of sentence-level sentiment sums]

For greater accuracy, however, these raw sums (positive, negative and overall) are weighted by the total number of words in the sentence, which yields 3 percentage values (positive, negative and overall). By combining the values of the single sentences per message and, later, combining the messages per President, some granularity is lost but a clearer vision of the whole picture is gained (particularly if a polynomial function is used to smooth the coarseness of the single values).
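
A hypothetical continuation of the previous sketch: weight the sentence sums by sentence length to obtain percentages, then smooth with a polynomial fit (the degree and variable names are illustrative assumptions).

library(dplyr)

sentence_pct <- sentence_scores %>%
  mutate(
    pct_positive = 100 * positive / n_words,
    pct_negative = 100 * abs(negative) / n_words,
    pct_overall  = 100 * overall / n_words
  )

# e.g. a polynomial fit of the overall percentage against the position of the
# sentence in the speech (sentence_id assumed to be a numeric index),
# to smooth out single-sentence noise
fit <- lm(pct_overall ~ poly(sentence_id, 4), data = sentence_pct)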

For example, President Mattarella’s year-end speech appears to be mainly positive even though there are distinctly negative statements.[30]

[Figure: sentence-by-sentence sentiment in President Mattarella's year-end speech]

The sentiment analysis (which in this case should perhaps be called opinion mining) can also help make visible underlying structures in the construction of the speeches.

The following two graphs illustrate President Napolitano's year-end addresses of 2007 and 2011. A similar narrative structure can be observed: after the opening wishes (which translate into a high positive sentiment index), the President evokes negative facts, thoughts, opinions and situations (which significantly lowers the overall index), then marks a high point by expressing clearly positive sentiments and opinions that slowly fade towards the final wishes. In other words: after taking off at top speed, the bad news suddenly arrives and is immediately counterbalanced, first by a few strong, very positive opinions, then by a long series of still positive expressions, up to the conclusion of the message, which expresses a pronounced hope for the year to come.

[Figures: sentence-by-sentence sentiment in President Napolitano's 2007 and 2011 year-end speeches]

Long-term analysis of data aggregated per speech can also benefit from this approach. Overall, the level of positive sentiment expressed in the year-end messages of all the Presidents does not vary much. Although some significant oscillations are observed year over year (with a range of variation between +28.5% in 2003 and +16.8% in 1981), the overall trend remains stable; a relative stability likely due, at least in part, to the nature of this particular communication exercise, which tends to highlight the positive wishes and greetings repeated over and over during the speech.

On the other hand, the proportion of negative sentiment grows over the years, with oscillations that seem [31] to follow the evolution of economic and social crises. In fact, a rapid deterioration of the sentiments expressed can be observed between 1959 and 1980-1981 (practically a doubling of the negative lemmas), followed by a decrease in pessimism from 1981 to 2000, and again a progressive darkening of the skies from 2001-2002 on.

[Figures: positive and negative sentiment over time]

The sentiments expressed by each individual President show some variability, with President Pertini ranking, to date, at the extremes both for the highest percentage of negative lemmas and for the lowest percentage of positive ones. Beyond differences in the Presidents' personalities, the historical period surely influenced the volume of negative sentiment expressed by Presidents Leone and Pertini. For the same reason, it would be useful to look closely at the next year-end addresses to understand whether the negative sentiment expressed by President Mattarella confirms a trend that seems to begin with President Napolitano, or whether it is only a single instance, given that President Mattarella has so far had only one occasion to deliver his year-end wishes to the nation.

[Figure: positive and negative sentiment by President]

7. Conclusion

This work provides unique insights into the institution's textual production and its variation over time in three ways:

  • Descriptive statistics were used to quantify the way(s) each Italian President speaks to the Nation. Among other things, this allowed elocutionary styles to be differentiated: concise (202 words/speech) or verbose (3,513 words/speech), direct (17 words/sentence) or convoluted (49 words/sentence), slow (95 words/minute) or fast (142 words/minute), as well as variations around these means. Applied to the time series, the descriptive analysis shows how elocutionary styles mutate over time and that they are not always in line with the zeitgeist.
  • Natural language processing methods highlighted the frequency and associations of single words or groups of words. This was useful to extract the features of the New Year speeches overall, but also the main interests of each President (with the oldest President in the history of the Italian Republic being the most worried about the future of the young generation). Quantified examples are given for 7 themes: unemployment, work/job, youth, culture, terrorism, reform, and homeland. Absolute and relative frequencies of these themes were computed and compared to the average frequency of the same themes in the language overall for the same period. Supported by significant independence t-tests and confidence intervals, this approach showed the comparative evolution of the recurrence of the 7 topics, and also that it can be generalized to any theme.
  • After building a “sentiment dictionary”, quantitative sentiment analysis (opinion mining) was applied to score the expression of ideas, opinions and statements as positive or negative based on the wording. Relevant differences between Presidents emerge, with, at the two extremes, President Pertini (18% positive sentiments against 9% negative) and President Gronchi (27% positive sentiments against 4.5% negative). Historical trends also become more visible: towards more pessimism in the 1980s, followed by slightly stronger optimism in the 1990s, and again more negative sentiment from 2000 onward. Sentiment analysis also made clear that some Presidents built their narratives on recurrent “sentiment/opinion patterns”. The most evident case is President Napolitano, who alternates good and bad news in such a specific manner that it becomes a signature pattern structuring some of his speeches.

Although little has been done to date to incorporate data science into the textual analysis of political content, this work shows the early benefits of such an approach. The textual and verbal production of public administrations can be investigated with quantitative rigor, opening new lines of research. Quantitative methods such as descriptive statistics, natural language processing, and sentiment analysis (opinion mining) prove to be highly valuable tools, capable of making a strong contribution to enriching and enhancing political science.


References

Baayen, R.H., (2008). Analyzing Linguistic Data, Cambridge University Press.

Baccianella, S., Esuli, A. and Sebastiani, F. (2010). “Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining”, in Calzolari, N. et al. (eds.), Proceedings of LREC, 2200–2204. http://is.gd/VLTKqB

Basile, V. and Nissim, M. (14 June 2013). “Sentiment analysis on Italian tweets”, Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, 100–107. http://is.gd/io82rB

Casotto, P. (2012). Sentiment Analysis for the Italian Language, Tesi di Dottorato, Dipartimento di matematica e Informatica, Universita’ degli Studi di Udine.

Cataldi, M., Ballatore, A., Tiddi, I., Aufaure, M.-A. (22 June 2013). “Good location, terrible food: detecting feature sentiment in user-generated reviews”, Social Network Analysis and Mining, 3 (4): 1149–1163. http://is.gd/Tlp0Fc

Charpentier, A. (22 February 2016). “Clusters of Texts”, Freakonometrics. http://is.gd/32mB13

Davidov, D., Tsur, O. and Rappoport, A. (2010). “Enhanced sentiment learning using twitter hashtags and smileys”, Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, 241–249, Stroudsburg, PA, USA.

Galitsky, B. and McKenna, E.W. (12 November 2009). “Sentiment Extraction from Consumer Reviews for Providing Product Recommendations”, Patent US–20090282019-A1. http://is.gd/ioVBsb

Gonzales-Ibanez, R., Muresan, S. and Wacholder, N. (June 2011). “Identifying sarcasm in twitter: A closer look”, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 581–586, Portland, OR, USA.

Gries, S.Th. (2013). Statistics for Linguistics with R, Second Edition, De Gruyter Mouton, Berlin.

Jockers, M.L., (2014). Text Analysis with R for Students of Literature, Springer.

Kennette, L.N., Wurm, L.H. and Van Havermaet, L.R. (2010). “Change detection: The effects of linguistic focus, hierarchical word level and proficiency”, The Mental Lexicon, 5(1), 47–86.

Kulkarni V., Rfou R., Perozzi B. and Skiena S. (2015). “Statistically Significant Detection of Linguistic Change”, Proceedings of the 24th International Conference on World Wide Web, 2015.

Lebeau, J. (20 January 2016). “State of the Union Speeches and Data”, More or Less Numbers. http://is.gd/NrJIRS

Liu, B. (2012). Sentiment Analysis and Opinion Mining, Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers.

Lapowsky, I. (13 January 2016). “The True Message of the State of the Union Is in the Data”, WIRED. http://is.gd/7ULEdC

Mani, I. (2010). The Imagined Moment. Time, Narrative, and Computation, University of Nebraska Press.

Mejova, Y. (16 November 2009). Sentiment Analysis: An Overview. http://is.gd/C1U9OJ

Taboada, M., Brooke, J., Tofiloski, M., Voll, K. and Stede, M. (June 2011). “Lexicon-based methods for sentiment analysis”, Comput. Linguist., 37(2):267–307.

Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A, (2010). “Sentiment strength detection in short informal text”, Journal of the American Society for Information Science and Technology, 61 (12): 2544–2558. http://is.gd/yRnRCq

Vryniotis, V. (23 September 2013). “The importance of Neutral Class in Sentiment Analysis”, Machine Learning Blog & Software Development News. http://is.gd/FidS0k


Notes

[1] This document is the result of an analysis carried out by the author and reflects only and exclusively the opinions of the author. Therefore, this document does not involve in any way, directly or indirectly, any of the author's employers, past or present. The author confirms that he has no conflict of interest, has never worked, whether in a remunerated capacity or gratuitously, for the President of the Italian Republic, and has never been appointed in any role or capacity by the Quirinale. Nevertheless, the author informs the readers that he was honored with the title of Commander by the President of the Italian Republic (Decree of August 1, 2007).

[2] eMail: salvino [dot] salvaggio [at] gmail [dot] com

[3] The author thanks Paolo Barbesino, Paolo Gasparini and Gaetano Palumbo for their comments on previous versions.

[4] The first year-end message was delivered by President Luigi Einaudi on December 31, 1948. Enrico de Nicola – elected interim President on June 28, 1946, re-elected on June 26, 1947 after his resignation, and then President of the Italian Republic from January 1, 1948 – did not deliver any year-end wishes to the nation during his two years at the Quirinale.

[5] Print, radio, and film distribution by the Istituto Luce in the early years; live television from 1954 onward.

[6] In fact, since its first television broadcast in 1954, the program has often opened with a wide shot of a room in the Quirinale palace, then progressively zoomed in to frame the bust and face of the President, who speaks to the nation looking directly into the camera.

[7] The record is held by President Giorgio Napolitano, watched by approximately 13 million viewers on December 31, 2014.

[8] A highly formalized ritual, with few (though significant) surprises or formal variations, as shown by text clustering, which is unable to isolate clear or specific groups (Charpentier, 2016).

[9] Highlighted by change point detection, which identifies not only one or more potential changes but also the moment at which they occur (Kulkarni et al., 2015; Kennette et al., 2010).

[10] The variation from the average is statistically significant and therefore cannot be attributed to chance (t-test).

[11] It would be useful, but unfortunately the information is not available, to know the average speaking rate of the Italian population in general, year by year, from 1949 onward, to compare with the data from the presidential speeches. Data on the duration of President Einaudi's speeches are also not available.

[12] It is therefore not clear whether President Scalfaro was at the Quirinale for the message of December 31, 1997.

[13] The p-value of the t-test is approximately 0.00016.

[14] Precisely 481.8 words.

[15] With the notable exception of 1991, the year in which President Cossiga went on TV to deliver his wishes by saying, in essence, that he would not be saying anything else, all in just 419 words and in less than 4 minutes. It should be noted that, from a statistical point of view, the messages by Presidents Pertini and Scalfaro are the main contributors to the general trend (regression).

[16] The terms “italy” and “italians” are pronounced almost 700 times across the 67 messages.

[17] With words like italy, italians, young people, population, country, life, liberty, citizens, democracy, trust, responsibility.

[18] Natural Language Processing (NLP).

[19] Which in the statistical analysis stands for ‘president of the republic’.

[20] Obtained with the AntConc 3.4.3 software.

[21] The themes were chosen with no claim of representativeness.

[22] The following regular expressions were used to search for the occurrences: ‘unemployment|underemployment’, ‘work’, ‘culture’, ‘terror’, ‘reform’, ‘\bhomeland|patriotism|patriot|patriots’, ‘\byoung people|youth|youthful\b’

[23] This second analysis is made possible by Google Books, which provides researchers with the frequency of all the words used in millions of books, year by year, language by language, from approximately 1500 to 2009. It is necessary to underline that the terminological quantification of the semantic-linguistic corpus built by Google Books does not include texts published in the mass media (newspapers, the Internet), which might modify the frequency data by adding a more pronounced component of modernity.

[24] Welch Method.

[25] Usually called sentiment analysis or opinion mining. See Mejova, 2009 for an overall description.

[26] For example, to add quantitative depth to literary analysis or, in a political setting, to better understand the evolution of the electorate's orientation.

[27] For example, to evaluate the level of support for advertisements or specific brands.

[28] The easy availability of numerous English dictionaries of words associated with values of various sentiments and opinions has energized research on English-language content (Baccianella et al., 2010; Gonzales-Ibanez et al., 2011; Liu, 2012). The same is taking place for Spanish and, to a lesser degree, for other languages. In Italian, unfortunately, the availability of such instruments still suffers from significant shortcomings, and the laudable efforts of isolated researchers are not enough to bridge the double gap of, on one side, lists of words with the corresponding quantification of their positive or negative connotation and, on the other, classifications of the same words by type of sentiment or opinion. Having a base dictionary available which signals the positive (+1) or negative (-1) value of an adjective is the first essential step for this approach. But also knowing whether the same adjective belongs to a specific category of sentiment (for example anger, joy, sadness, fear, surprise, trust, etc.) makes it possible to enrich the analysis. In Italian, see for example the software sentiment-Italian-lang by Giuseppe Ragusa (https://github.com/gragusa?tab=repositories), Casotto (2012), and Basile and Nissim (2013).

[29] For greater precision, the lemmas can be quantified on a real-valued scale that includes intermediate values between 0 and 1 and between 0 and -1.

[30] In this type of graph, the order of the phrases in the speech acts as a temporal axis (x-axis); this approach is usually referred to as novelistic time (Mani, 2010).
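As a purely illustrative aside (not part of the original analysis), the lexicon-based scoring described in notes [28]–[30] might be sketched in R as follows; the mini-lexicon and the sentences are invented for the sketch:

# toy lexicon: +1 for positive words, -1 for negative words (invented entries)
lexicon <- c(speranza = 1, fiducia = 1, crisi = -1, paura = -1)

sentences <- c("la fiducia e la speranza", "la crisi genera paura", "resta la speranza")

score_sentence <- function(s, lex) {
  words <- unlist(strsplit(tolower(s), "[^[:alpha:]]+"))
  sum(lex[words], na.rm = TRUE)   # words missing from the lexicon contribute 0
}

scores <- vapply(sentences, score_sentence, numeric(1), lex = lexicon)

# "novelistic time": the order of the phrases is the x-axis
plot(seq_along(scores), scores, type = "b",
     xlab = "phrase order in the speech", ylab = "sentiment score")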

[31] This should be verified in detail.

To leave a comment for the author, please follow the link and comment on their blog: RSS Feed – SaS in #R#.


Principal Components Regression in R: Part 3

By Joseph Rickert


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by John Mount, Ph.D.
Data Scientist at Win-Vector LLC

In her series on principal components analysis for regression in R, Win-Vector LLC’s Dr. Nina Zumel broke the demonstration down into the following pieces:

  • Part 1: the proper preparation of data and use of principal components analysis (particularly for supervised learning or regression).
  • Part 2: the introduction of y-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
  • And now Part 3: how to pick the number of components to retain for analysis.

In the earlier parts Dr. Zumel demonstrates common poor practice versus best practice and quantifies the degree of available improvement. In part 3, she moves beyond the usual “pick the number of components by eyeballing it” non-advice and teaches decisive, repeatable decision procedures. For picking the number of components to retain for analysis there are a number of standard techniques in the literature, including:

  • Pick 2, as that is all you can legibly graph.
  • Pick enough to cover some fixed fraction of the variation (say 95%); a short sketch of this rule follows the list.
  • (for variance scaled data only) Retain components with singular values at least 1.0.
  • Look for a “knee in the curve” (the curve being the plot of the singular value magnitudes).
  • Perform a statistical test to see which singular values are larger than we would expect from an appropriate null hypothesis or noise process.
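As a quick, hedged illustration of the “cover a fixed fraction of the variation” rule (using the built-in mtcars data rather than the example from the post):

# keep the smallest number of principal components covering 95% of the variance
pc <- prcomp(mtcars, scale. = TRUE)
var_explained <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
k <- which(var_explained >= 0.95)[1]
k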

Dr. Zumel shows that the last method (designing a formal statistical test) is particularly easy to encode as a permutation test in the y-aware setting (there is also an obvious similarly good bootstrap test). This is well-founded and pretty much state of the art. It is also a great example of why to use a scriptable analysis platform (such as R): it is easy to wrap arbitrarily complex methods into functions and then directly perform empirical tests on these methods. Her “broken stick” type test yields a graph that identifies five principal components as being significant.

However, Dr. Zumel goes on to show that in a supervised learning or regression setting we can further exploit the structure of the problem and replace the traditional component magnitude tests with simple model fit significance pruning. The significance method in this case gets the stronger result of finding the two principal components that encode the known even and odd loadings of the example problem:

[Figure: the two principal components identified by model-fit significance pruning]

In fact that is sort of her point: significance pruning either on the original variables or on the derived latent components is enough to give us the right answer. In general, we get much better results when (in a supervised learning or regression situation) we use knowledge of the dependent variable (the “y” or outcome) and do all of the following:

  • Fit model and significance prune incoming variables.
  • Convert incoming variables into consistent response units by y-aware scaling.
  • Fit model and significance prune resulting latent components.

The above will become much clearer and much more specific if you click here to read part 3.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


Predictive Bookmaker Consensus Model for the UEFA Euro 2016

By Tal Galili

[Figure: bookmaker consensus winning probabilities by team]

(By Achim Zeileis)

From 10 June to 10 July 2016 the best European football teams will meet in France to determine the European Champion in the UEFA European Championship 2016 tournament. For the first time 24 teams compete, expanding the format from 16 teams as in the previous five Euro tournaments. For forecasting the winning probability of each team a predictive model based on bookmaker odds from 19 online bookmakers is employed. The favorite is the host France with a forecasted winning probability of 21.5%, followed by the current World Champion Germany with a winning probability of 20.1%. The defending European Champion Spain follows after some gap with 13.7% and all remaining teams are predicted to have lower chances with England (9.2%) and Belgium (7.7%) being the “best of the rest”. Furthermore, by complementing the bookmaker consensus results with simulations of the whole tournament, predicted pairwise probabilities for each possible game at the Euro 2016 are obtained along with “survival” probabilities for each team proceeding to the different stages of the tournament. For example, it can be determined that it is much more likely that top favorites France and Germany meet in the semifinal (7.8%) rather than in the final at the Stade de France (4.2%) – which would be a re-match of the friendly game that was played on 13 November 2015 during the terrorist attacks in Paris and that France won 2-0. Hence it is maybe better that the tournament draw favors a match in the semifinal at Marseille (with an almost even winning probability of 50.1% for France). The most likely final is then that either of the two teams plays against the defending champion Spain with a probability of 5.7% for France vs. Spain and 5.4% for Germany vs. Spain, respectively.

All of these forecasts are the result of a bookmaker consensus rating proposed in Leitner, Hornik, and Zeileis (International Journal of Forecasting, 26(3), 471-481, 2010). This technique correctly predicted the winner of the FIFA World Cup 2010 and Euro 2012 tournaments while missing the winner but correctly predicting the final for the Euro 2008 and three out of four semifinalists at the FIFA World Cup 2014. A new working paper about the UEFA Euro 2016, upon which this blog post is based, applies the same technique and is introduced here.

The core idea is to use the expert knowledge of international bookmakers. These have to judge all possible outcomes in a sports tournament such as the UEFA Euro and assign odds to them. Doing a poor job (i.e., assigning too high or too low odds) will cost them money. Hence, in our forecasts we solely rely on the expertise of 19 such bookmakers. Specifically, we (1) adjust the quoted odds by removing the bookmakers’ profit margins (or overround, typically around 15%), (2) aggregate and average these to a consensus rating, and (3) infer the corresponding tournament-draw-adjusted team abilities using the Bradley-Terry model for pairwise comparisons.

For step (1), it is assumed that the quoted odds are derived from the underlying “true” odds as: quoted odds = odds · α + 1, where + 1 is the stake (which is to be paid back to the bookmakers’ customers in case they win) and α is the proportion of the bets that is actually paid out by the bookmakers. The so-called overround is the remaining proportion 1 – α and the main basis of the bookmakers’ profits (see also Wikipedia and the links therein). For the 19 bookmakers employed in this analysis, the median overround is a sizeable 15.1%. Subsequently, in step (2), the overround-adjusted odds are transformed to the log-odds (or logit) scale, averaged for each team, and transformed back to winning probabilities (displayed in the barchart above).
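To make steps (1) and (2) concrete, here is a small, hedged R sketch; the odds are made-up numbers for a toy three-outcome market, and the margin is removed by simple proportional normalization rather than the α-based adjustment described above:

# toy quoted decimal odds: rows = outcomes, cols = bookmakers (invented numbers)
quoted_odds <- matrix(c(1.90, 2.00, 1.95,
                        3.40, 3.30, 3.50,
                        4.80, 5.00, 4.60),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(c("France", "Germany", "Spain"), NULL))

# (1) implied probabilities per bookmaker sum to more than 1 (the overround);
#     normalize each column so that it sums to 1
p_raw <- 1 / quoted_odds
colSums(p_raw)                       # each column exceeds 1 by the overround
p_adj <- sweep(p_raw, 2, colSums(p_raw), "/")

# (2) average on the log-odds (logit) scale, then map back to probabilities
logit     <- function(p) log(p / (1 - p))
inv_logit <- function(x) 1 / (1 + exp(-x))
consensus <- inv_logit(rowMeans(logit(p_adj)))
consensus                            # consensus probabilities (need not sum exactly to 1)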

Finally, step (3) of the analysis uses the following idea:

  1. If team abilities are available, pairwise winning probabilities can be derived for each possible match using a Bradley-Terry approach.
  2. Given pairwise winning probabilities, the whole tournament can be easily simulated to see which team proceeds to which stage in the tournament and which team finally wins.
  3. Such a tournament simulation can then be run sufficiently often (here 100,000 times) to obtain relative frequencies for each team winning the tournament.

Using an iterative approach we calibrate the team abilities so that the implied winning probabilities (when simulating the tournament repeatedly) closely match the bookmaker consensus probabilities (reported above). Thus, we obtain abilities for each team that are adjusted for the tournament draw, because the bookmakers’ odds already factor this in (i.e., account for the fact that some teams compete in relatively weak or strong groups respectively). Moreover, these abilities imply winning probabilities for each conceivable match between two teams, reported in the color-coded display below.

[Figure: color-coded pairwise winning probabilities for each possible match]

Light gray signals that either team is almost equally likely to win a match between Teams A and B (probability between 40% and 60%). Light, medium, and dark blue/red corresponds to small, moderate, and high probabilities of winning/losing a match between Team A and Team B. All probabilities are obtained from the Bradley-Terry model using the following equation for the winning probability:

Pr(A beats B) = ability_A / (ability_A + ability_B).
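As a hedged illustration of how such abilities translate into match probabilities (the ability numbers below are invented for the sketch and are not the paper's estimates):

# Bradley-Terry pairwise winning probability
p_beats <- function(ability_A, ability_B) ability_A / (ability_A + ability_B)

# simulate a single knockout match between two named teams
play_match <- function(teamA, teamB, abilities) {
  if (runif(1) < p_beats(abilities[[teamA]], abilities[[teamB]])) teamA else teamB
}

# toy abilities (invented for illustration)
abilities <- c(France = 3.0, Germany = 2.9, Spain = 2.0, England = 1.6)
winners <- replicate(10000, play_match("France", "Germany", abilities))
mean(winners == "France")   # close to p_beats(3.0, 2.9), about 0.508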

Clearly, the bookmakers perceive France and Germany to be the strongest teams in the tournament and almost on par (with a probability of only 50.5% that France beats Germany), while having moderate (70–80%) to high (> 80%) probabilities of beating almost any other team in the tournament. The only teams that get close to having even chances against them are Spain (with probabilities of 43.7% and 44.2% of beating France and Germany, respectively), England (with 38.7% and 39.1%), and Belgium (with 37.4% and 37.9%). Behind these two groups of the strongest teams there are several larger clusters of teams that have approximately the same strength (i.e., yielding approximately even chances in a pairwise comparison). Interestingly, two of the teams with very low strengths (Romania and Albania) compete in the same group A together with the favorite team France.

Additionally, the tournament simulation can be used to infer an estimated probability not only for the outcome of each individual match but also for the whole course of the tournament. The plot below shows the relative frequencies from the simulation for each team to “survive” over the tournament, i.e., proceed from the group phase to the round of 16, quarter- and semi-finals, and the final. France and Germany are the clear favorites within their respective groups A and C, with almost 100% probability to make it to the round of 16 and also rather small drops in probability to proceed through the subsequent rounds. All remaining teams have much poorer chances to proceed to the later stages of the Euro 2016. Group B also has a rather clear favorite in England, with all remaining teams following after a certain gap. In contrast, groups D and E each have a favorite (Spain and Belgium, respectively) but with a second strong contender (Croatia and Italy, respectively). Group F is a weaker group but much more balanced compared with the previous groups. Due to the new tournament system, where 16 out of 24 teams proceed from the group phase to the next stage, even the weakest teams have probabilities of about 40% to reach at least the round of 16. However, many of these weak teams then have rather poor chances to make it to the quarterfinals, resulting in clear downward kinks in the survival curves.

[Figure: simulated “survival” probabilities for each team across the tournament stages]

Needless to say, all of these predictions are probabilities that are far from certain. While France taking the home victory is the most likely single outcome in the bookmakers’ expert opinion, it is still far more likely that one of the other teams wins. This is one of the two reasons why we recommend refraining from placing bets based on our analyses. The more important second reason, though, is that the bookmakers have a sizeable profit margin of about 15%, which assures that the best chances of making money from sports betting lie with them. Hence, this should be kept in mind when placing bets. We, ourselves, will not place bets but will focus on enjoying the exciting football tournament that the UEFA Euro 2016 will be with 100% predicted probability!

Working paper: Zeileis A, Leitner C, Hornik K (2016). “Predictive Bookmaker Consensus Model for the UEFA Euro 2016″, Working Paper 2016-16, Working Papers in Economics and Statistics, Research Platform Empirical and Experimental Economics, Universität Innsbruck. URL http://EconPapers.RePEc.org/RePEc:inn:wpaper:2016-15.


QGIS, Open Source GIS & R

By Kurt Menke


(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Today’s post is by Kurt Menke, the owner of Bird’s Eye View GIS, a GIS consultancy. Kurt also wrote the book Mastering QGIS. In my latest course (Shapefiles for R Programmers) I briefly introduce people to QGIS. Kurt’s post below gives you a roadmap for learning more.

I come to this blog from a slightly different, but related, perspective. I am a GIS specialist. I run my own GIS consulting business, Bird’s Eye View GIS, out of Albuquerque, NM. I have been a user and advocate of open source GIS for a long time. I’ve long been aware of the power of R and have dabbled in it from time to time. I discovered Ari and his great Learn to Map Census Data in R course via Twitter back in February. When I saw his free course, I thought it would be a great opportunity to get more comfortable with R.

QGIS

What I want to do here is briefly introduce all of you R users to QGIS, the leading free and open source (FOSS) desktop GIS. I’ll also cover some QGIS resources. R and QGIS can be a beautiful marriage. What began in 2002 as a simple data viewer has evolved into a fully functional GIS package, with a very dedicated and active user community. Currently a new version of QGIS is released every four months! Due to this rapid release schedule, a long-term release (LTR) version is created each spring. The LTR works better for production environments. QGIS is multi-platform, running on Windows, Mac and Linux. There is also an Android version, and with crouton it can be installed on a Chromebook!

Under the hood, QGIS leverages several other FOSS packages. One of those (GDAL/OGR) allows it to read 42 vector file formats and 88 raster formats! It can also connect to several spatial databases, including PostgreSQL/PostGIS, SpatiaLite, Microsoft SQL Server and Oracle. If that weren’t enough, it can also connect to OGC web services such as WMS, WCS and WFS. The image below shows the basic layout and a map of total population by county. The basemap is being streamed from CartoDB. This particular population dataset is a shapefile. QGIS has fantastic cartographic capabilities, including some that are not available in any other GIS package: Blending Modes and Live Layer Effects.

QGIS Desktop

Spatial Analysis with QGIS

I know R users are a technical group, so I’m happy to say QGIS also has a robust set of geoprocessing functionality. The Processing Toolbox (shown on the right above) has native QGIS tools for processing spatial data, along with tools from numerous other FOSS providers, including R!

NOTE: R needs to be installed on your computer and the PATH has to be correctly set up in the Processing menu -> Processing Options settings.

The image below shows the Toolbox with the R section highlighted.

QGIS Processing Toolbox

You can write your own R scripts, and there is an online R scripts collection for QGIS. To access that, simply open the Get R scripts from on-line scripts collection tool. This opens the Get scripts and models dialog shown below. Here you can choose the scripts you’d like to install and click OK.

Get R Scripts dialog

Once a script is loaded you can run it like any other tool, or open it in the QGIS Script Editor. Below, the histogram script from the QGIS collection is shown open in the editor.

QGIS Script Editor
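To give a flavor of what such a script looks like, here is a minimal sketch in the style of the QGIS 2.x Processing R provider; the ## header directives and the injected Layer/Field objects are recalled from that provider and should be treated as assumptions (check the provider documentation for exact syntax), not as the exact script from the collection:

##Example scripts=group
##Layer=vector
##Field=Field Layer
##showplots
# Layer and Field are provided by the Processing framework at run time:
# Layer is the chosen vector layer, Field the name of one of its (numeric) attribute fields
hist(Layer[[Field]], main = paste("Histogram of", Field), xlab = Field)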

The Processing framework also includes a graphical modeler. This allows you to set up an entire workflow from simple to complex. A model can include any datasets and tools from any provider, including other models. The model below is a site selection model for a heliport in Corpus Christi.

QGIS Graphical Modeler

Models can also be saved as Python scripts. QGIS has a Python console giving you another powerful scripting environment to work with. There’s obviously a lot to learn here.

Phew! There is way more to QGIS than I can do justice to in a blog post. So below I’m going to give you some information on resources.

QGIS Resources (#shamelessplug)

One way I’ve become involved in the QGIS community is writing curricula and authoring two books on QGIS. Two years ago, several colleagues and I authored the GeoAcademy, the first ever GIS curriculum based on a national standard – the U.S. Department of Labor’s Geospatial Competency Model (GTCM). The GTCM describes the knowledge, skills and abilities needed to be a working GIS professional. The GeoAcademy consists of 5 complete college courses.

  • Introduction to Geospatial Technology Using QGIS
  • Spatial Analysis Using QGIS
  • Data Acquisition and Management Using QGIS
  • Cartography Using QGIS
  • Remote Sensing Using QGIS

This winter I converted the curriculum to fit into a convenient workbook format with Locate Press. The book is called Discover QGIS and is hot off the presses. I mentioned how fast QGIS is developing: originally written for QGIS 2.4, the GeoAcademy material in this workbook has been updated for use with QGIS 2.14. It therefore represents the most up-to-date version of the GeoAcademy curriculum.

At the moment the digital version of the book is available as a Preview Edition. Purchasing this ebook entitles you to the full version when it is released. We are just working on a few formatting issues.

Discover QGIS

Prior to this I authored Mastering QGIS with Packt Publishing, which is a more thorough treatment of QGIS. Beyond these two books there are many other resources available as well.

With this extremely short introduction I encourage you to download QGIS and begin experimenting. It is fantastic software. Meanwhile, I have a lot of R to learn! See you in the twittersphere (@geomenke).

The post QGIS, Open Source GIS & R appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.


How to use data analysis for machine learning (example, part 1)

By Sharp Sight Labs


(This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers)

In my last article, I stated that for practitioners (as opposed to theorists), the real prerequisite for machine learning is data analysis, not math.

One of the main reasons for making this statement, is that data scientists spend an inordinate amount of time on data analysis. The traditional statement is that data scientists “spend 80% of their time on data preparation.” While I think that this statement is essentially correct, a more precise statement is that you’ll spend 80% of your time on getting data, cleaning data, aggregating data, reshaping data, and exploring data using exploratory data analysis and data visualization. (From this point forward, I’ll use the term “data analysis” as a shorthand for getting data, reshaping it, exploring it, and visualizing it.)

And ultimately, the importance of data analysis applies not only to data science generally, but machine learning specifically.

The fact is, if you want to build a machine learning model, you’ll spend huge amounts of time just doing data analysis as a precursor to that process.

Moreover, you’ll use data analysis to explore the results of your model after you’ve applied an ML algorithm.

Additionally, in industry, you’ll also need to rely heavily on data visualization techniques to present the results after you’ve finalized them. This is one of the practical details of working as a data scientist that many courses and teachers never tell you about. Creating presentations to communicate your results will take large amounts of your time. And to create these presentations, you should rely heavily on data visualization to communicate the model results visually.

Data analysis and data visualization are critical at almost every part of the machine learning workflow.

So, to get started with ML (and to eventually master it) you need to be able to apply visualization and analysis.

In this post, I’ll show you some of the basic data analysis and visualization techniques you’ll need to know to build a machine learning model.

We’ll work on a toy problem, for simplicity and clarity

One note before we get started: the problem that we’ll work through is just linear regression and we’ll be using an easy-to-use, off-the-shelf dataset. It’s a “toy problem,” which is intentional. Whenever you try to learn a new skill, it is extremely helpful to isolate different details of that skill.

The skill that I really want you to focus on here is data visualization (as it applies to machine learning).

We’ll also be performing a little bit of data manipulation, but it will be in service of analyzing and visualizing the data. We won’t be doing any data manipulation to “clean” the data.

So just keep that in mind. We’re working on a very simplified problem. I’m removing or limiting several other parts of the ML workflow so we can strictly focus on preliminary visualization and analysis for machine learning.

Step 1: get the data

The first step almost of any analysis or model building effort is getting the data.

For this particular analysis, we’ll use a relatively “off the shelf” dataset that’s available in R within the MASS package.

data(Boston, package = "MASS")

The Boston dataset contains data on median house price for houses in the Boston area. The variable that we’ll try to predict is the medv variable (median house price). The dataset has roughly a dozen other predictors that we’ll be investigating and using in our model.

This is a simplified “toy example”; the data doesn’t require much cleaning

As I already mentioned, the example we’ll be working through is a bit of a “toy” example, and as such, we’re working with a dataset that’s relatively “easy to use.” What I mean is that I’ve chosen this dataset because it’s easy to obtain and it doesn’t require much data cleaning.

However, keep in mind that in a typical business or industry setting, you’ll probably need to get your data from a database using SQL or possibly from a spreadsheet or other file.

Moreover, it’s very common for data to be “messy.” The data may have lots of missing values; variable names and class names that need to be changed; or other details that need to be altered.

Again, I’m intentionally leaving “data cleaning” out of this blog post for the sake of simplicity.

Just keep in mind that in many cases, you’ll have some data cleaning to do.

Step 2: basic data exploration

After getting the dataset, the next step in the model building workflow is almost always data visualization. Specifically, we’ll perform exploratory data analysis on the data to accomplish several tasks:

1. View data distributions
2. Identify skewed predictors
3. Identify outliers

Visualize data distributions

Let’s begin our data exploration by visualizing the data distributions of our variables.

We can start by visualizing the distribution of our target variable, medv.

To do this, we’ll first use a basic histogram.

I strongly believe that the histogram is one of the “core visualization techniques” that every data scientists should master. If you want to be a great data scientist, and if you ultimately want to build machine learning models, then mastering the histogram is one of your “first steps.”

By “master”, I mean that you should be able to write this code “with your eyes closed.” A good data scientist should be able to write the code to create a histogram (or scatterplot, or line chart ….) from scratch, without any reference material and without “copying and pasting.” You should be able to write it from memory almost as fast as you can type.

One of the reasons that I believe the histogram is so important is because we use it frequently in this sort of exploratory data analysis. When we’re performing an analysis or building a model, it is extremely common to examine the distribution of a variable. Because it’s so common to do this, you should know this technique cold.

Here’s the code to create a histogram of our target variable medv.

############################
# VISUALIZE TARGET VARIABLE
############################

require(ggplot2)

#~~~~~~~~~~~
# histogram
#~~~~~~~~~~~

ggplot(data = Boston, aes(x = medv)) +
  geom_histogram()

[Figure: histogram of medv]

If you don’t really understand how this code works, I’d highly recommend that you read my blog post about how to create a histogram with ggplot2. That post explains how the histogram code works, step by step.

Let’s also create a density plot of medv.

Here’s the exact code to create a density plot of medv.

#~~~~~~~~~~~~~~
# density plot
#~~~~~~~~~~~~~~

ggplot(data = Boston, aes(x = medv)) +
  stat_density()

The density plot is essentially a variation of the histogram. The code to create a density plot is essentially identical to the code for a histogram, except that the second line is changed from geom_histogram() to stat_density(). Speaking in terms of ggplot2 syntax, we’re replacing the histogram geom with a statistical transformation.

[Figure: density plot of medv]

If you’ve been working with data visualization for a while, you might want to learn a little bit about the differences between histograms and density plots, and how we use them. This is more of an intermediate data visualization topic, but it’s relevant to this blog post on ‘data visualization for machine learning’, so I’ll mention it.

Between histograms and density plots, some people strongly prefer histograms. The primary reason for this is that histograms tend to “provide better information on the exact location of data” (which is good for detecting outliers). This is true in particular when you use a relatively larger number of histogram bins; a histogram with a sufficiently large number of bins can show you peaks and unusual data details a little better, because it doesn’t smooth that information away. (Density plots and histograms with a small number of bins can smooth that information out too much.) So, when we’re visualizing a single variable, the histogram might be the better option.

However, when we’re visualizing multiple variables at a time, density plots are easier to work with. If you attempt to plot several histograms at the same time by using a small multiple chart, it can be very difficult to select a single binwidth that properly displays all of your variables. Because of this, density plots are easier to work with when you’re visualizing multiple variables in a small multiple chart. Density plots show the general shape of the data and we don’t have to worry about choosing the number of bins.

Ok, so having plotted the distribution of our target variable, let’s examine the plots. We can immediately see a few important details:

1. It’s not perfectly normal.

This is one of the things we’re looking for when we visualize our data (particularly for linear regression). I’ll save a complete explanation of why we test for normality in linear regression and machine learning for another post, but in brief, we examine this because many machine learning techniques require normally distributed variables (a quick Q-Q plot sketch appears after these points).

2. It appears that there may be a few minor outliers in the far right tail of the distribution. For the sake of simplicity, we’re not going to deal with those outliers here; we’ll be able to build a model (imperfect though it might be) without worrying about those outliers right now.

Keep in mind, however, that you’ll be looking for them when you plot your data, and in some cases, they may be problematic enough to warrant some action.
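As a small aside (not part of the original post), a normal Q-Q plot is another quick way to eyeball how far the target strays from normality:

# normal Q-Q plot of medv; points that bend away from the line suggest non-normality
qqnorm(Boston$medv)
qqline(Boston$medv)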

Now that we’ve examined our target variable, let’s look at the distributions of all of the variables in our dataset.

We’re going to visualize all of our predictors in a single chart, like this:

[Figure: small multiples of density plots for all variables]

This is known as a small multiple chart (sometimes also called a “trellis chart”). Basically, the small multiple chart allows you to plot many charts in a grid format, side by side. It allows you to use the same basic graphic or chart to display different slices of a data set. In this case, we’ll use the small multiple to visualize different variables.

You might be tempted to visualize each predictor individually – one chart at a time – but that can get cumbersome and tedious very quickly. When you’re working with more than a couple of variables, the small multiple will save you lots of time. This is particularly true if you work with datasets with dozens, even hundreds of variables.

Although the small multiple is perfect for a task like this, we’ll have to do some minor data wrangling to use it.

Reshape the data for the small multiple chart

To use the small multiple design to visualize our variables, we’ll have to manipulate our data into shape.

Before we reshape the data, let’s take a look at the data as it currently exists:

head(Boston)

[Output of head(Boston)]

Notice that the variables are currently located as columns of the data frame. For example, the crim variable is the first column. This is how data is commonly formatted in a data frame; typical data frames have variables as columns, and data observations as rows. This format is commonly called “wide-format” data.

However, to create a “small multiple” to plot all of our variables, we need to reshape our data so that the variables are along the rows; we need to reshape our data into “long-format.”

To do this, we’re going to use the melt() function from the reshape2 package. This function will change the shape of the data from wide-format to long-format.

require(ggplot2)
require(reshape2)
melt.boston <- melt(Boston)
head(melt.boston)

[Output of head(melt.boston)]

After using melt(), notice that the crim variable (which had been a column) is now dispersed along the rows of our reshaped dataset, melt.boston. In fact, if you examine the whole dataset, all of our dataset features are now along rows.

Now that we’ve melted our data into long-format, we’re going to use ggplot2 to create a small multiple chart. Specifically, we’ll use facet_wrap() to implement the small-multiple design.

ggplot(data = melt.boston, aes(x = value)) +
  stat_density() +
  facet_wrap(~variable, scales = "free")

[Figure: small multiples of density plots for all variables]

So now that we’ve visualized the distributions of all of our variables, what are we looking for?

We’re looking primarily for a few things:

1. Outliers
2. Skewness
3. Other deviations from normality

Let’s examine skewness first (simply because that seems to be one of the primary issues with these features).

Just to refresh your memory, skewness is a measure of asymmetry of a data distribution.

You can immediately see that several of the variables are highly skewed.

In particular, crim, zn, chas, dis, and black are highly skewed. Several of the others appear to have moderate skewness.

We’ll quickly confirm this by calculating the skewness:

#~~~~~~~~~~~~~~~~~~~~
# calculate skewness
#~~~~~~~~~~~~~~~~~~~~

require(e1071)
sapply(Boston, skewness)

[Output of sapply(Boston, skewness): skewness of each variable]

To be clear, skewness can be a bit of a slippery concept, but note that a skewness of zero indicates a symmetrical distribution. Ideally, we’re looking for variables with a skewness of zero.

Also, a very rough rule of thumb is if the absolute value of skewness is above 1, then the variable has high skewness.

Looking at these numbers, a few of the variables have relatively low skewness, including rm, indus, and age (although rm is much more symmetrical).
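As a quick sketch (not in the original post), the rule of thumb can be applied programmatically to flag the highly skewed columns:

# flag variables whose absolute skewness exceeds the rule-of-thumb threshold of 1
skew_vals <- sapply(Boston, e1071::skewness)
names(skew_vals)[abs(skew_vals) > 1]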

Note: how we (normally) use this info

Part of the task of data exploration for ML is knowing what to do with the information uncovered in the data exploration process. That is, once you’ve identified potential problems and salient characteristics of your data, you need to be able to transform your data in ways that will make the machine learning algorithms work better. As I said previously, “data transformation” is a separate skill, and because we’re focusing on the pure “data exploration” process in this post, we won’t discuss data transformations. (Transformations are a big topic.)

Be aware, however, that in a common ML workflow, your learnings from EDA will serve as an input to the “data transformation” step of your workflow.

Recap

Let’s recap what we just did:

1. Plotted a histogram of our target variable using ggplot2
2. Reshaped our dataset using melt()
3. Plotted the variables using the small multiple design
4. Examined our variables for skewness and outliers

To be honest, this was actually an abbreviated list of things to do. We could also have looked at correlation, among other things.

This is more than enough to get you started though.

Tools that we leveraged

By now, you should have some indication of what skills you need to know to get started with practical machine learning in R:

1. Learn ggplot2
– master basic techniques like the histogram and scatterplot
– learn how to facet your data in ggplot2 to perform multivariate data exploration

2. Learn basic data manipulation. I’ll suggest dplyr (which we didn’t really use here), and also reshape2.

Want to see part 2?

In the next post, we’ll continue our use of data analysis in the ML workflow.

If you want to see part 2, sign up for the email list, and the next blog post will be delivered automatically to your inbox as soon as it’s published.

The post How to use data analysis for machine learning (example, part 1) appeared first on SHARP SIGHT LABS.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – SHARP SIGHT LABS.


Principal Components Regression, Pt. 3: Picking the Number of Components

By Nina Zumel

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.

Before starting the discussion, let’s quickly redo our y-aware PCA. Please refer to our previous post for a full discussion of this data set and this approach.

library(vtreat)    # provides designTreatmentsN() and prepare(), used below
library(ggplot2)   # used for the plots later in this note

#
# make data
# (mkData() is defined in the previous post in this series)
#
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)

#
# design treatment plan
#
treatmentsN <- designTreatmentsN(dTrain,
                                 setdiff(colnames(dTrain),'y'),'y',
                                 verbose=FALSE)

#
# prepare the treated frames, with y-aware scaling
#
examplePruneSig = 1.0 
dTrainNTreatedYScaled <- prepare(treatmentsN,dTrain,
                                 pruneSig=examplePruneSig,scale=TRUE)
dTestNTreatedYScaled <- prepare(treatmentsN,dTest,
                                pruneSig=examplePruneSig,scale=TRUE)

#
# do the principal components analysis
#
vars <- setdiff(colnames(dTrainNTreatedYScaled),'y')
# prcomp defaults to scale. = FALSE, but we already 
# scaled/centered in vtreat- which we don't want to lose.
dmTrain <- as.matrix(dTrainNTreatedYScaled[,vars])
dmTest <- as.matrix(dTestNTreatedYScaled[,vars])
princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)

If we examine the magnitudes of the resulting singular values, we see that we should use from two to five principal components for our analysis. In fact, as we showed in the previous post, the first two singular values accurately capture the two unobservable processes that contribute to y, and a linear model fit to these two components captures most of the explainable variance in the data, both on training and on hold-out data.

We picked the number of principal components to use by eye; but it’s tricky to implement code based on the strategy “look for a knee in the curve.” So how might we automate picking the appropriate number of components in a reliable way?

X-Only Approaches

Jackson (1993) and Peres-Neto et al. (2005) are two excellent surveys and evaluations of the different published approaches to picking the number of components in standard PCA. Those methods include:

  1. Look for a “knee in the curve” — the approach we have taken, visually.
  2. Only for data that has been scaled to unit variance: keep the components corresponding to singular values greater than 1.
  3. Select enough components to cover some fixed fraction (generally 95%) of the observed variance. This is the approach taken by caret::preProcess.
  4. Perform a statistical test to see which singular values are larger than we would expect from an appropriate null hypothesis or noise process.

The papers also cover other approaches, as well as different variations of the above.

Kabacoff (R In Action, 2nd Edition, 2015) suggests comparing the magnitudes of the singular values to those extracted from random matrices of the same shape as the original data. Let’s assume that the original data has k variables, and that PCA on the original data extracts the k singular values s_i and the k principal components PC_i. To pick the appropriate number of principal components:

  1. For a chosen number of iterations N (choose N >> k):
  • Generate a random matrix of the correct size
  • Do PCA and extract the singular values
  2. Then, for each of the k principal components:
  • Find the mean of the i-th singular value over the random matrices, r_i
  • If s_i > r_i, then keep PC_i

The idea is that if there is more variation in a given direction than you would expect at random, then that direction is probably meaningful. If you assume that higher variance directions are more useful than lower variance directions (the usual assumption), then one handy variation is to find the first i such that s_i < r_i, and keep the first i-1 principal components.
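As a hedged sketch of this stopping rule for a generic numeric matrix X (plain PCA, not the y-aware version developed below):

# keep components whose singular values exceed the average obtained from
# random matrices of the same shape; stop at the first one that falls below
# (assumes X is already centered/scaled, as the treated frames above are)
pickNumComponents <- function(X, N = 100) {
  s_obs  <- prcomp(X, center = FALSE, scale. = FALSE)$sdev
  s_rand <- replicate(N, prcomp(matrix(rnorm(length(X)), nrow(X), ncol(X)),
                                center = FALSE, scale. = FALSE)$sdev)
  r <- rowMeans(s_rand)
  below <- which(s_obs < r)
  if (length(below) == 0) length(s_obs) else below[[1]] - 1
}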

This approach is similar to what the authors of the survey papers cited above refer to as the broken-stick method. In their research, the broken-stick method was among the best performing approaches for a variety of simulated and real-world examples.

With the proper adjustment, all of the above heuristics work as well in the y-adjusted case as they do with traditional x-only PCA.

A Y-Aware Approach: The Permutation Test

Since in our case we know y, we can — and should — take advantage of this information. We will use a variation of the broken-stick method, but rather than comparing our data to a random matrix, we will compare our data to alternative datasets where x has no relation to y. We can do this by randomly permuting the y values. This preserves the structure of x — that is, the correlations and relationships of the x variables to each other — but it changes the units of the problem, that is, the y-aware scaling. We are testing whether or not a given principal component appears more meaningful in a metric space induced by the true y than it does in a random metric space, one that preserves the distribution of y, but not the relationship of y to x.

You can read a more complete discussion of permutation tests and their application to variable selection (significance pruning) in this post.

In our example, we’ll use N=100, and rather than using the means of the singular values from our experiments as the thresholds, we’ll use the 98th percentiles. This represents a threshold value that is likely to be exceeded by a singular value induced in a random space only 1/(the number of variables) (1/50=0.02) fraction of the time.

#
# Resample y, do y-aware PCA, 
# and return the singular values
#
getResampledSV = function(data,yindices) {
  # resample y
  data$y = data$y[yindices]
  
  # treatment plan
  treatplan = vtreat::designTreatmentsN(data, 
                                setdiff(colnames(data), 'y'), 
                                'y', verbose=FALSE)
  # y-aware scaling
  dataTreat = vtreat::prepare(treatplan, data, pruneSig=1, scale=TRUE)
  
  # PCA
  vars = setdiff(colnames(dataTreat), 'y')
  dmat = as.matrix(dataTreat[,vars])
  princ = prcomp(dmat, center=FALSE, scale=FALSE)
  
  # return the magnitudes of the singular values
  princ$sdev
}

#
# Permute y, do y-aware PCA, 
# and return the singular values
#
getPermutedSV = function(data) {
  n = nrow(data)
  getResampledSV(data,sample(n,n,replace=FALSE))
}

#
# Run the permutation tests and collect the outcomes
#
niter = 100 # should be >> nvars
nvars = ncol(dTrain)-1
# matrix: 1 column for each iter, nvars rows
svmat = vapply(1:niter, FUN=function(i) {getPermutedSV(dTrain)}, numeric(nvars))
rownames(svmat) = colnames(princ$rotation) # rows are principal components
colnames(svmat) = paste0('rep',1:niter) # each col is an iteration

# plot the distribution of values for the first singular value
# compare it to the actual first singular value
ggplot(as.data.frame(t(svmat)), aes(x=PC1)) + 
  geom_density() + geom_vline(xintercept=princ$sdev[[1]], color="red") +
  ggtitle("Distribution of magnitudes of first singular value, permuted data")

Here we show the distribution of the magnitude of the first singular value on the permuted data, and compare it to the magnitude of the actual first singular value (the red vertical line). We see that the actual first singular value is far larger than the magnitude you would expect from data where x is not related to y. Let’s compare all the singular values to their permutation test thresholds. The dashed line is the mean value of each singular value from the permutation tests; the shaded area represents the 98th percentile.

# transpose svmat so we get one column for every principal component
# Get the mean and empirical confidence level of every singular value
library(dplyr)   # for the %>% pipe, summarize_each(), and funs()
as.data.frame(t(svmat)) %>% dplyr::summarize_each(funs(mean)) %>% as.numeric() -> pmean
confF <- function(x) as.numeric(quantile(x,1-1/nvars))
as.data.frame(t(svmat)) %>% dplyr::summarize_each(funs(confF)) %>% as.numeric() -> pupper

pdata = data.frame(pc=seq_len(length(pmean)), magnitude=pmean, upper=pupper)

# we will use the first place where the singular value falls 
# below its threshold as the cutoff.
# Obviously there are multiple comparison issues on such a stopping rule,
# but for this example the signal is so strong we can ignore them.
below = which(princ$sdev < pdata$upper)
lastSV = below[[1]] - 1

This test suggests that we should use 5 principal components, which is consistent with what our eye sees. This is perhaps not the “correct” knee in the graph, but it is undoubtedly a knee.

Bootstrapping

Empirically estimating the quantiles from the permuted data so that we can threshold the non-informative singular values will have some undesirable bias and variance, especially if we do not perform enough experiment replications. This suggests that instead of estimating quantiles ad-hoc, we should use a systematic method: The Bootstrap. Bootstrap replication breaks the input to output association by re-sampling with replacement rather than using permutation, but comes with built-in methods to estimate bias-adjusted confidence intervals. The methods are fairly technical, and on this dataset the results are similar, so we don’t show them here, although the code is available in the R markdown document used to produce this note.

Significance Pruning

Alternatively, we can treat the principal components that we extracted via y-aware PCA simply as transformed variables — which is what they are — and significance prune them in the standard way. As our article on significance pruning discusses, we can estimate the significance of a variable by fitting a one variable model (in this case, a linear regression) and looking at that model’s significance value. You can pick the pruning threshold by considering the rate of false positives that you are willing to tolerate; as a rule of thumb, we suggest one over the number of variables.

In regular significance pruning, you would take any variable with estimated significance value lower than the threshold. Since in the PCR situation we presume that the variables are ordered from most to least useful, you can again look for the first position i where the variable appears insignificant, and use the first i-1 variables.
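As a hedged sketch of the one-variable significance idea using plain lm (the note itself uses vtreat for this below):

# significance of a single numeric variable from a one-variable linear model
sigOfVariable <- function(x, y) {
  f <- summary(lm(y ~ x))$fstatistic
  pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
}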

We’ll use vtreat to get the significance estimates for the principal components. We’ll use one over the number of variables (1/50 = 0.02) as the pruning threshold.

# get all the principal components
# not really a projection as we took all components!
projectedTrain <- as.data.frame(predict(princ,dTrainNTreatedYScaled),
                                 stringsAsFactors = FALSE)
vars = colnames(projectedTrain)
projectedTrain$y = dTrainNTreatedYScaled$y

# designing the treatment plan for the transformed data
# produces a data frame of estimated significances
tplan = designTreatmentsN(projectedTrain, vars, 'y', verbose=FALSE)

threshold = 1/length(vars)
scoreFrame = tplan$scoreFrame
scoreFrame$accept = scoreFrame$sig < threshold

# pick the number of variables in the standard way:
# the number of variables that pass the significance prune
nPC = sum(scoreFrame$accept)

Significance pruning picks 2 principal components, again consistent with our visual assessment. This time, we picked the correct knee: as we saw in the previous post, the first two principal components were sufficient to describe the explainable structure of the problem.

Conclusion

Since one of the purposes of PCR/PCA is to discover the underlying structure in the data, it’s generally useful to examine the singular values and the variable loadings on the principal components. However, an analysis should also be repeatable, and hence automatable, and it’s not straightforward to automate something as vague as “look for a knee in the curve” when selecting the number of principal components to use. We’ve covered two ways to programmatically select the appropriate number of principal components in a predictive modeling context.

To conclude this entire series, here is our recommended best practice for principal components regression:

  1. Significance prune the candidate input variables.
  2. Perform a Y-Aware principal components analysis.
  3. Significance prune the resulting principal components.
  4. Regress.

Thanks to Cyril Pernet, who blogs at NeuroImaging and Statistics, for requesting this follow-up post and pointing us to the Jackson reference.

References

  • Jackson, Donald A. “Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches”, Ecology Vol 74, no. 8, 1993.

  • Kabacoff, Robert I. R In Action, 2nd edition, Manning, 2015.

  • Efron, Bradley and Robert J. Tibshirani. An Introduction to the Bootstrap, Chapman and Hall/CRC, 1998.

  • Peres-Neto, Pedro, Donald A. Jackson and Keith M. Somers. “How many principal components? Stopping rules for determining the number of non-trivial axes revisited”, Computational Statistics & Data Analysis, Vol 49, no. 4, 2005.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


On ranger respect.unordered.factors

By John Mount


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

It is often said that “R is its packages.”

One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran-based randomForest package. Actually this appearance is due to the strange choice of default value respect.unordered.factors=FALSE in ranger::ranger(), which we strongly advise overriding to respect.unordered.factors=TRUE in applications.

To illustrate the issue we build a simple data set (split into training and evaluation) where the dependent (or outcome) variable y is the number of input level codes that end in an odd digit minus the number that end in an even digit.

Some example data is given below

print(head(dTrain))
##          x1      x2      x3      x4 y
## 77  lev_008 lev_004 lev_007 lev_011 0
## 41  lev_016 lev_015 lev_019 lev_012 0
## 158 lev_007 lev_019 lev_001 lev_015 4
## 69  lev_010 lev_017 lev_018 lev_009 0
## 6   lev_003 lev_014 lev_016 lev_017 0
## 18  lev_004 lev_015 lev_014 lev_007 0
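For readers who want to experiment, here is a hypothetical generator consistent with that description; it is a sketch, not the authors' actual mkData() code (which is shared via the link mentioned later in the post):

# hypothetical generator: y = (# of level codes ending in an odd digit) -
#                             (# of level codes ending in an even digit)
mkExampleData <- function(n, nlevels = 20) {
  lev <- sprintf("lev_%03d", seq_len(nlevels))
  d <- data.frame(x1 = sample(lev, n, replace = TRUE),
                  x2 = sample(lev, n, replace = TRUE),
                  x3 = sample(lev, n, replace = TRUE),
                  x4 = sample(lev, n, replace = TRUE),
                  stringsAsFactors = FALSE)
  isOdd <- function(x) as.numeric(substr(x, nchar(x), nchar(x))) %% 2 == 1
  d$y <- rowSums(sapply(d[, c("x1", "x2", "x3", "x4")],
                        function(col) ifelse(isOdd(col), 1, -1)))
  d
}

dToy <- mkExampleData(100)
head(dToy)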

Given enough data this relation is easily learnable. In our example we have only 100 training rows and 20 possible levels for each input variable- so we at best get a noisy impression of how each independent (or input) variable affects y.

What ranger’s default training setting respect.unordered.factors=FALSE does is decide that string-valued variables (such as we have here) are to be treated as “ordered”. This allows ranger to skip any of the expensive re-encoding of such variables as contrasts, dummies or indicators. This is achieved in ranger by only using ordered cuts in its underlying trees and is equivalent to re-encoding the categorical variable as numeric order codes. These variables are thus essentially treated as numeric, and ranger appears to run faster over fairly complicated variables.

The above is fine if all of your categorical variables are in fact known to have ordered relations with the outcome. We must emphasize that this is very rarely the case in practice: one of the main reasons for using categorical variables is that we do not know a priori the relation between the variable levels and the outcome, and we would like the downstream machine learning to estimate that relation. The default respect.unordered.factors=FALSE in fact weakens the expressiveness of the ranger model (which is why it is faster).

This is easier to see with an example. Consider fitting a ranger model on our example data (all code/data shared here).

If we try to build a ranger model on the data using the default settings we get the following:

library(ranger)

# default ranger model: treat categoricals as ordered (a very limiting treatment)
m1 <- ranger(y ~ x1 + x2 + x3 + x4, data = dTrain, write.forest = TRUE)

Keep in mind the 0.24 R-squared on test.
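
The evaluation code is not shown here; a test R-squared like this could be computed along the following lines (a sketch only; the column name rangerDefaultPred is made up):

# Sketch: out-of-sample R-squared for the default-settings model m1
dTest$rangerDefaultPred <- predict(m1, data = dTest)$predictions
1 - sum((dTest$y - dTest$rangerDefaultPred)^2) /
    sum((dTest$y - mean(dTest$y))^2)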

If we set respect.unordered.factors=TRUE, ranger takes a lot longer to run (as it is doing more work in actually respecting the individual levels of our categorical variables), but it gets a much better result (test R-squared of 0.54).

m2 <- ranger(y ~ x1 + x2 + x3 + x4, data = dTrain, write.forest = TRUE,
             respect.unordered.factors = TRUE)


The loss of modeling power seen with the default respect.unordered.factors=FALSE is similar to the undesirable loss seen if one hash-encodes categorical levels. Everyone claims they would never do such a thing, but we strongly suggest inspecting your team’s work for these bad but tempting shortcuts.

If even one of the variables had 64 or more levels, ranger would throw an exception and fail to complete training (as the randomForest library also does).

The correct way to feed large categoricals to a random forest model remains to explicitly introduce the dummy/indicator variables yourself, or to re-encode them as impact/effect sub-models. Both of these are services supplied by the vtreat package, so we demonstrate the technique here.

# vtreat re-encoded model: mkCrossFrameNExperiment builds impact-coded ('catN')
# versions of the categorical variables using cross-validated sub-models
ct <- vtreat::mkCrossFrameNExperiment(dTrain, c('x1','x2','x3','x4'), 'y')
# keep the impact-coded numeric variables
newvars <- ct$treatments$scoreFrame$varName[(ct$treatments$scoreFrame$code=='catN') &
                                            (ct$treatments$scoreFrame$sig<1)]
# fit ranger on the cross-frame using only the treated variables
m3 <- ranger(paste('y', paste(newvars, collapse=' + '), sep=' ~ '), data = ct$crossFrame,
             write.forest = TRUE)
# apply the same treatment plan to the test data and predict
dTestTreated <- vtreat::prepare(ct$treatments, dTest,
                                pruneSig = c(), varRestriction = newvars)
dTest$rangerNestedPred <- predict(m3, data = dTestTreated)$predictions
WVPlots::ScatterHist(dTest, 'rangerNestedPred', 'y',
                     'ranger vtreat nested prediction on test',
                     smoothmethod = 'identity', annot_size = 3)

(Figure: ScatterHist of rangerNestedPred versus y, “ranger vtreat nested prediction on test”)

The point is that a test R-squared of 0.6 or 0.54 is a lot better than an R-squared of 0.24. You do not want to settle for 0.24 when 0.6 is within easy reach. So, at the very least, set respect.unordered.factors=TRUE when using ranger; for unordered factors (the most common kind) the default makes things easy for ranger at the expense of model quality.

Instructions explaining the use of vtreat can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


zoo time series exercises

By Siva Sunku

(Image: Dallas Zoo entry plaza)

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The zoo package provides methods for totally ordered, indexed observations. It is aimed at performing calculations on irregular time series of numeric vectors, matrices and factors. zoo interfaces with the other time series packages on CRAN, which makes it easy to pass time series objects between zoo and those packages. zoo is infrastructure that tries to do all the basic things well, but it does not provide modeling functionality. The set of exercises below will help you understand some zoo concepts.
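
As a quick, made-up illustration of the core idea (a data payload plus a totally ordered index) before starting the exercises:

library(zoo)
# three observations on an irregular set of dates
z <- zoo(c(1.2, 3.4, 2.5),
         order.by = as.Date(c("2015-01-01", "2015-01-02", "2015-01-05")))
z
index(z)     # the (possibly irregular) time index
coredata(z)  # the data, stripped of its index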

The initial environment setup required to work on these exercises is as follows.

Install the zoo package – install.packages("zoo")
Attach the package – require("zoo")
Download the dataset – ZooData
Read it in – inData <- read.table(Filepath, header = TRUE, sep = ",", stringsAsFactors = FALSE)

Answers to the exercises are available here.

Exercise 1
Coerce the inData object into a zoo object wZ.
Check the class of the object wZ.
Observe the index of the object wZ.

Exercise 2
Create a zoo object Z from inData, with the ‘Date’ column as the index.

Exercise 3
Get the ratio of Z$DeliveryVolume to Z$TotalVolume.
Did you get a non-numeric-operation error? There is a small catch to remember: a zoo object stores its data as a matrix, so if even one of the values is a character string, the whole zoo object becomes a matrix of type ‘character’ (a short demonstration follows this exercise). Now make a numeric zoo object: create the zoo object Z with only the numeric columns from inData.

Create a zoo object Z from inData[3:10], with the Date column as the index.
Extract only the data (without the index) of the zoo object Z.
Get the ratio of Z$Deliverable.Qty to Z$Total.Traded.Quantity.
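
A short demonstration of that catch, with made-up values rather than the exercise data:

# a character column forces the whole zoo object to mode "character"
dfMixed <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)
zMixed  <- zoo(as.matrix(dfMixed), order.by = as.Date("2015-01-01") + 0:2)
mode(coredata(zMixed))   # "character" -- so zMixed$a / 2 would fail

# with numeric columns only, arithmetic works as expected
dfNum <- data.frame(a = 1:3, c = 4:6)
zNum  <- zoo(as.matrix(dfNum), order.by = as.Date("2015-01-01") + 0:2)
zNum$a / zNum$c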

Exercise 4
Get the mean of the ratio of Z$DeliveryVolume to Z$TotalVolume per quarter, using the aggregate function.
as.yearqtr – returns the year-quarter of a given date.
Get the mean of the ratio of Z$DeliveryVolume to Z$TotalVolume per month, using the aggregate function.
as.yearmon – returns the year-month of a given date.

Exercise 5
Create an object Z1 with only the price-related columns (Date as index). Cols – 3:6.
Create Z2 with all the other, quantity-related columns (Date as index). Cols – 3, 8:10.
Merge the objects Z1 & Z2 into Z3.
Check whether the merged output Z3 is the same as Z.

Exercise 6
Extract only the rows from 2015-02-01 to 2015-02-15 from the zoo object Z.

Exercise 7

Volume Weighted Average Price (VWAP) = sum of (Price * Volume) / sum of Volume

myVwap <- function(x) {
  # x: price in column 1, volume in column 2
  sum(x[, 1] * x[, 2]) / sum(x[, 2])
}

Find the VWAP with Close as the price and TotalVolume as the volume, using the data of the 5 prior days. Store the result in the object ZT.
e.g. find the VWAP of 2015-01-07 using the data from 2015-01-01 to 2015-01-07.
Find the VWAP with Close as the price and DeliveryVolume as the volume, using the data of the 5 prior days. Store the result in the object ZD.
Merge the object Z with ZT and ZD, and save the result in R.

Exercise 8
Replace the NAs in R[,"ZT"] with the monthly mean of ZT.
e.g. the NAs for January in R[,"ZT"] should be replaced with the January mean of R[,"ZT"].

Exercise 9
Subtract each month’s mean VWAP (ZT) from the close price of each row of R.

Exercise 10
Check the rownames of the object Z.
Surprised to see NULL? A zoo object stores its index separately; to retrieve it, use the index function. Now that you know how to get both the index and the data, make a data frame DT from the zoo object Z, with all the columns of Z plus a new column dt containing the date of each row.

Image by Kevin1086 (Own work) [CC BY-SA 3.0], via Wikimedia Commons.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


satRday Event in Cape Town

By Daniel Emaasit

(Image: University of Cape Town)

(This article was first published on R – Data Science Africa, and kindly contributed to R-bloggers)

This blog post was first published on the EXEGETIC ANALYTICS blog and kindly re-posted on Data Science Africa.

We are planning to host one of the three inaugural satRday conferences in Cape Town during 2017. The R Consortium has committed to funding three of these events: one will be in Hungary, another will be somewhere in the USA and the third will be at an international destination. At present Cape Town is dicing it out with Monterrey (Mexico) for the third location. We just need your votes to make Cape Town’s plans a reality.

The satRday will probably happen in late February or early March 2017. This is the end of southern hemisphere Summer and the Cape is at its best, with glorious weather and the peak Summer tourist rush over. You could easily factor satRday into a vacation in sunny South Africa.

Why Cape Town?

Cape Town is literally the jewel of Southern Africa:

– Table Mountain (spectacular view, great hiking, cable car);
– Wine farms (too many to mention, but all within a short drive of the city and most offering free tastings);
– Boulders Beach (pristine beach in Simonstown with large colony of Jackass Penguins);
– Camps Bay, Muizenberg and many other idyllic beaches;
– Robben Island (return boat trip across Table Bay, tour of the Maximum Security Prison, and a bus tour of the Island);
– the Victoria & Alfred Waterfront and the Two Oceans Aquarium;
– the Kirstenbosch National Botanical Garden; and
– lots, lots more.

Did I mention the wine and Table Mountain? Ah, yes, I did.

This is what the weather looked like in Cape Town at the end of February 2016: temperatures around 20 to 25 °C (68 to 77 °F), light breezes and zero precipitation.

For International Tourists

Cape Town is well connected with the rest of the World. There are direct flights to Cape Town International Airport from Amsterdam, Buenos Aires, Doha, Dubai, Frankfurt, Istanbul, London and Munich.

The exchange rate is extremely favourable (see below), making South Africa rather affordable for the international traveller. A healthy meal will cost you around 100 ZAR and a great bottle of wine can be had for about the same price. Decent accommodation with sweeping views of the sea or mountain costs between 700 and 1000 ZAR per night, but you can find more affordable (or more lavish) options.

Public transport is a little sparse, but Uber will take you anywhere you need to go.

We know that there are some security concerns about South Africa, but Cape Town is a very safe city. The two venues that we are considering for the conference are secure and easy to access:

– the campus of the University of Cape Town;
– the trendy suburb of Green Point, close to the Waterfront.

For useRs from the USA, Cape Town is somewhat further away than Monterrey, but it’s a trip you won’t regret making. For European useRs it’s closer than Mexico and there is no time zone change.

We look forward to hosting you next year at satRday in Cape Town, the Mother City. Please vote for Cape Town now.

(Image: penguins, Cape Town)


To leave a comment for the author, please follow the link and comment on their blog: R – Data Science Africa.
