The CRAN package XML has some problems when parsing html pages encoded in shift-jis. In this report, I will show these issues and also their solutions, which are workable on the user side.
I found the issues at least on Windows 7 with the Japanese locale. Though other versions and languages are not checked, the issues may be common on Windows worldwide with non-European multibyte languages encoded in national locales, not in UTF-8.
Versions on my machines:

Windows 7 + R 3.2.3 + XML 3.98-1.3
Mac OS X 10.9.5 + R 3.2.0 + XML 3.98-1.3

```r
# Windows
Sys.getlocale('LC_CTYPE')
# "Japanese_Japan.932"

# Mac
Sys.getlocale('LC_CTYPE')
# "ja_JP.UTF-8"
```
```r
# Mac
library(XML)
src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'
t1 <- as.character(readHTMLTable(src, which = 4, trim = T, header = F,
                                 skip.rows = 2:48, encoding = 'shift-jis')[1, 1])
t1  # good
# "最新の観測情報 （2016年1月17日 8時）"
```
I wrote the small R script above when I improved my PM 2.5 script in the previous article. It was working on my Mac, but not on the Windows PC at my office. Of course, a small change was needed for Windows to handle the difference of locales.
```r
# Windows
t2 <- iconv(as.character(readHTMLTable(src, which = 4, trim = T, header = F,
                                       skip.rows = 2:48, encoding = 'shift-jis')[1, 1]),
            from = 'utf-8', to = 'shift-jis')
t2  # bad
# NA
```
It completely failed.
I found this problem occurs depending on the text in the html. So we must know when and how to avoid the error. This report shows the solutions; technical details will be shown in the next report.
2-1. No-Break Space (U+00A0)
The Unicode character No-Break Space (U+00A0) is encoded as \xc2\xa0 in utf-8, and written as &#160; or &#xa0; in html.
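The failure can be reproduced offline with iconv alone. This is a minimal sketch; the literal string below is an assumption standing in for text extracted from a page, not data from the site:

```r
# A string containing a no-break space (U+00A0), as produced by the
# &nbsp; entity after the html is decoded to UTF-8.
x <- "8\u00a0PM"

# shift-jis has no no-break space, so the conversion fails and
# R's iconv() returns NA for the whole string.
iconv(x, from = "utf-8", to = "shift-jis")  # NA
```
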
When a shift-jis encoded html page contains this character as an html entity such as &nbsp;, the package XML brings an issue. More strictly, it does not originate from the package XML but from the function iconv, which returns NA when it tries to convert U+00A0 into shift-jis. But we must be aware of this issue when using the package XML, because html pages always come with the famous html entity &nbsp;.
A solution is to use the option sub= in the function iconv, which can convert unconvertible characters into a specified string instead of returning NA. The candidates are:

sub=''
sub=' '
sub='byte'
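The three candidates behave as follows. Again a small offline sketch; the sample string is an assumption:

```r
x <- "8\u00a0PM"  # contains a no-break space (U+00A0)

iconv(x, "utf-8", "shift-jis")                # NA: no substitution
iconv(x, "utf-8", "shift-jis", sub = "")      # "8PM": the character is dropped
iconv(x, "utf-8", "shift-jis", sub = " ")     # "8 PM": replaced by a normal space
iconv(x, "utf-8", "shift-jis", sub = "byte")  # "8<c2><a0>PM": bytes shown as hex escapes
```
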
```r
# Windows
t3 <- iconv(as.character(readHTMLTable(src, which = 4, trim = T, header = F,
                                       skip.rows = 2:48, encoding = 'shift-jis')[1, 1]),
            from = 'utf-8', to = 'shift-jis', sub = ' ')
t3  # bad
```

The result is a broken string and is not shown here.
This can be a solution to the U+00A0 issue in a shift-jis encoded page. But unfortunately, the above t3 still fails, because there is another issue on that html page.
2-2. Option trim=TRUE

The option trim= is commonly used in the text functions of the package XML, such as readHTMLTable. With trim=TRUE, the node text is returned with space characters such as \n and \r removed from both ends. This option is very useful when treating html pages, because they usually have plenty of spaces and line feeds.
However, trim=TRUE is not safe when a shift-jis encoded html is read on a Windows PC with the shift-jis (cp932) locale. This issue is serious: the text string is completely destroyed. Additionally, we must be aware of the default value of this option; it is trim=TRUE.
A solution is to use trim=FALSE and to remove the spaces with the function gsub after we get a pure string.
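The trimming step itself can be sketched offline, since gsub works on the already-converted string. The sample string below is an assumption, not text taken from the page:

```r
# A stand-in for a node text surrounded by whitespace.
x <- "\n  最新の観測情報 8時 \r\n"

# '\\s' matches any space character (space, tab, \n, \r, ...),
# so every one of them is removed, even inside the text.
gsub("\\s", "", x)  # "最新の観測情報8時"
```
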
```r
# Windows
t4 <- gsub('\\s', '',
           iconv(readHTMLTable(src, which = 4, trim = F, header = F,
                               skip.rows = 2:48, encoding = 'shift-jis')[1, 1],
                 from = 'utf-8', to = 'shift-jis', sub = ' '))
t4  # good
# "最新の観測情報（2016年1月17日8時）"
```
The regular expression used by gsub is safe regardless of the platform locale. More precisely, the t4 above is not the same as the result of trim=TRUE: that regular expression removes all spaces in the sentence, not only those at both ends, although it does not matter in the Japanese language.
We may want to improve this as:

gsub('(^\\s+)|(\\s+$)', '', x)

which removes the spaces only at both ends.
```r
# Windows
t5 <- gsub('(^\\s+)|(\\s+$)', '',
           iconv(readHTMLTable(src, which = 4, trim = F, header = F,
                               skip.rows = 2:48, encoding = 'shift-jis')[1, 1],
                 from = 'utf-8', to = 'shift-jis', sub = ' '))
t5  # very good
# "最新の観測情報 （2016年1月17日 8時）"
```
Finally, the two issues are solved, and we get a script workable on Windows. Strictly speaking, t1 and t5 are different: the spaces in t5 are U+0020, substituted by iconv, while those in t1 are the original U+00A0.
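This difference can be checked directly. A minimal sketch, with short stand-in strings (assumptions) instead of the full results:

```r
s_mac <- "情報\u00a08時"  # like t1: the original no-break space survives
s_win <- "情報 8時"        # like t5: iconv substituted a normal space

identical(s_mac, s_win)         # FALSE: they differ only in the space character
utf8ToInt(substr(s_mac, 3, 3))  # 160 (U+00A0)
utf8ToInt(substr(s_win, 3, 3))  # 32  (U+0020)
```
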