Pitfall of XML package: issues specific to cp932 locale, Japanese Shift-JIS, on Windows

By tomizono

(This article was first published on R – ЯтомизоnoR, and kindly contributed to R-bloggers)

The CRAN package XML has problems parsing HTML pages encoded in cp932 (Shift-JIS). In this report, I show these issues along with solutions that work on the user side.

I found that the issues occur on at least Windows 7 and Windows 10 with the Japanese language. Although I have not checked other versions and languages, the issues may be common to Windows worldwide wherever a non-European multibyte language is encoded in a national locale rather than in UTF-8.

Versions on my machines:

Windows 7 + R 3.2.3 + XML 3.98-1.3
Mac OS X 10.9.5 + R 3.2.0 + XML 3.98-1.3

Locales:

Windows
> Sys.getlocale('LC_CTYPE')
[1] "Japanese_Japan.932"
Mac
> Sys.getlocale('LC_CTYPE')
[1] "ja_JP.UTF-8"
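
Whether a session is in the situation described here can be checked from R itself. The following is a minimal sketch using base R's l10n_info(); the variable names ctype, info and is_utf8 are only illustrative.

```r
# Inspect the current locale with base R only.
ctype <- Sys.getlocale("LC_CTYPE")   # e.g. "Japanese_Japan.932" or "ja_JP.UTF-8"
info  <- l10n_info()                 # list with elements such as MBCS and `UTF-8`

# TRUE in a UTF-8 locale (the Mac above), FALSE in a cp932 locale (the Windows PC above).
is_utf8 <- isTRUE(info[["UTF-8"]])

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", is_utf8, "\n")
```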

1. incident

# Mac
library(XML)
src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'
t1 <- as.character(
        readHTMLTable(src, which=4, trim=T, header=F, 
        skip.rows=2:48, encoding='shift-jis')[1,1]
      )
> t1 # good
[1] "最新の観測情報  (2016年1月17日  8時)"

I wrote the small R script above while improving my PM 2.5 script from the previous article. It worked on my Mac, but not on the Windows PC at my office.

Of course, a small change was needed on Windows to handle the difference in locales.

# Windows
t2 <- iconv(as.character(
        readHTMLTable(src, which=4, trim=T, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      ), from='utf-8', to='shift-jis')
> t2 # bad
[1] NA

It completely failed.

I found that whether this problem occurs depends on the text in the HTML, so we must know when and how to avoid the error. This report shows the solutions; technical details will follow in the next report.

2. solutions

2-1. No-Break Space (U+00A0)

Unicode character No-Break Space (U+00A0): 
    \xc2\xa0 in utf-8
    &nbsp;, &#160; or &#xa0; in html

When a Shift-JIS encoded HTML page contains U+00A0 as an HTML entity such as &nbsp;, using the XML package leads to a problem. Strictly speaking, the problem originates not in the XML package but in the function iconv, which returns NA when it tries to convert U+00A0 into Shift-JIS. Still, we must be aware of it when using the XML package, because it inevitably comes with the ubiquitous HTML entity &nbsp;.

A solution is to use the sub= option of iconv, which replaces non-convertible characters with a specified string instead of returning NA. Possible values include:

sub=''
sub=' '
sub='byte'
# Windows
t3 <- iconv(as.character(
        readHTMLTable(src, which=4, trim=T, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      ), from='utf-8', to='shift-jis', sub=' ')
> t3 # bad
(The result is a broken string, so it is not shown here.)

This solves the U+00A0 issue for Shift-JIS encoded pages. Unfortunately, t3 above still fails because the HTML page triggers another issue.
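
The effect of sub= can be tried in isolation, without the XML package or the web page. The following is a minimal sketch under stated assumptions: the input string is made up, and the target encoding "ASCII" is only a portable stand-in that, like Shift-JIS, has no U+00A0. On a cp932 Windows machine one would pass to='shift-jis' as in the scripts above.

```r
# Illustrative input: "A", a no-break space (U+00A0), then "B", stored as UTF-8.
x <- "A\u00a0B"

# Assumption: "ASCII" stands in for 'shift-jis' so the sketch runs on any platform.
target <- "ASCII"

# Without sub=, a non-convertible character usually turns the whole element into NA.
plain <- iconv(x, from = "UTF-8", to = target)

# With sub=, non-convertible bytes are replaced and the rest of the string survives.
dropped  <- iconv(x, from = "UTF-8", to = target, sub = "")      # removed
spaced   <- iconv(x, from = "UTF-8", to = target, sub = " ")     # replaced by spaces
as_bytes <- iconv(x, from = "UTF-8", to = target, sub = "byte")  # shown as hex codes

print(list(plain = plain, dropped = dropped, spaced = spaced, as_bytes = as_bytes))
```

Because the replacement is applied per non-convertible byte and U+00A0 occupies two bytes in UTF-8, sub=' ' may insert two spaces for a single no-break space.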

2-2. trim

The trim= option is common to the text functions of the XML package, such as readHTMLTable and xmlValue. With trim=TRUE, the node text is returned with whitespace characters such as \t or \r removed from both ends. This option is very useful for HTML pages, which usually contain plenty of spaces and line feeds.

But trim=TRUE is not safe when a Shift-JIS encoded HTML page is read on a Windows PC with a Shift-JIS (cp932) locale. The issue is serious: the text string is completely destroyed.

Additionally, we must be aware of the default value of this option; trim=FALSE for xmlValue, and trim=TRUE for readHTMLTable.

A solution is to use trim=FALSE and to remove spaces with function gsub after we get a pure string.

# Windows
t4 <- gsub('\\s', '', iconv(
        readHTMLTable(src, which=4, trim=F, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      , from='utf-8', to='shift-jis', sub=' '))
> t4 # good
[1] "最新の観測情報(2016年1月17日8時)"

The regular expression in gsub is safe under the platform locale.

More precisely, t4 above is not the same as the result of trim=TRUE. That regular expression removes every whitespace character in the sentence, although this does not matter in Japanese.

We may want to improve this as:

gsub('(^\\s+)|(\\s+$)', '', x)
# Windows
t5 <- gsub('(^\\s+)|(\\s+$)', '', iconv(
        readHTMLTable(src, which=4, trim=F, header=F, 
          skip.rows=2:48, encoding='shift-jis')[1,1]
      , from='utf-8', to='shift-jis', sub=' '))
> t5 # very good
[1] "最新の観測情報 (2016年1月17日 8時)"
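
The difference between the two regular expressions can be seen on a small string that does not depend on the web page. This is only a sketch: the sample text and the helper name trim_ends are made up for illustration and are not part of the XML package.

```r
# Hypothetical helper: strip whitespace at both ends only, like trim=TRUE is meant to.
trim_ends <- function(x) gsub("(^\\s+)|(\\s+$)", "", x)

# Illustrative node text surrounded by line breaks, tabs and spaces.
raw <- "\r\n\t latest data  (8:00) \t\r\n"

all_removed <- gsub("\\s", "", raw)  # removes every whitespace character
ends_only   <- trim_ends(raw)        # keeps the whitespace inside the text

print(all_removed)  # "latestdata(8:00)"
print(ends_only)    # "latest data  (8:00)"
```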

Finally, both issues are solved, and we have a script that works on Windows.

Strictly speaking, t1 and t5 are different: the spaces in t5 are U+0020, while those in t1 are U+00A0.
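
One way to see this kind of difference is to look at the Unicode code points. The sketch below uses base R's utf8ToInt on two illustrative strings, not the actual page text, and the helper has_nbsp is hypothetical. On Windows, a string held in the Shift-JIS locale would presumably need converting to UTF-8 first, for example with enc2utf8.

```r
with_nbsp  <- "8\u00a0h"  # "8", NO-BREAK SPACE (U+00A0), "h"
with_space <- "8 h"       # "8", SPACE (U+0020), "h"

print(utf8ToInt(with_nbsp))   # 56 160 104
print(utf8ToInt(with_space))  # 56 32 104

# Hypothetical helper: does a single UTF-8 string contain U+00A0 (decimal 160)?
has_nbsp <- function(x) 0xA0 %in% utf8ToInt(x)

print(has_nbsp(with_nbsp))    # TRUE
print(has_nbsp(with_space))   # FALSE
```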
