By Eric Cai – The Chemical Statistician

(This article was first published on **R programming – The Chemical Statistician**, and kindly contributed to R-bloggers)

I often create character variables (i.e. variables with strings of text as their values) in SAS, and they sometimes don’t render as expected. Here is an example involving the built-in data set SASHELP.CLASS.

Here is the code:

    data c1;
        set sashelp.class;
        * define a new character variable to classify someone as tall or short;
        if height > 60 then height_class = 'Tall';
        else height_class = 'Short';
    run;

    * print the results for the first 5 rows;
    proc print data = c1 (obs = 5);
    run;

Here is the result:

Obs | Name | Sex | Age | Height | Weight | height_class |
---|---|---|---|---|---|---|
1 | Alfred | M | 14 | 69.0 | 112.5 | Tall |
2 | Alice | F | 13 | 56.5 | 84.0 | Shor |
3 | Barbara | F | 13 | 65.3 | 98.0 | Tall |
4 | Carol | F | 14 | 62.8 | 102.5 | Tall |
5 | Henry | M | 14 | 63.5 | 102.5 | Tall |

What happened? Why does the word “Short” render as “Shor”?

This occurred because SAS fixes the length of a new character variable when the DATA step is compiled, using the first value assigned to it in the code. My code assigned “Tall” to “height_class” first, which has a length of 4, so “height_class” was created as a character variable with a length of 4. Any longer value assigned later is truncated to 4 characters, which is why “Short” became “Shor”.
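For readers who think in R, the effect can be imitated with a short sketch. This is only an analogy and not how SAS is implemented: the helper `store_fixed()` and the variable `width` are hypothetical names, and the width of 4 is simply the number of characters in the first value assigned, “Tall”.

```r
# Hypothetical imitation of a fixed-length character variable.
# The width is set once, from the first value assigned ('Tall').
width <- nchar("Tall")

# Keep only the first `width` characters of each value,
# mimicking the truncation of longer strings.
store_fixed <- function(value, width) {
  substr(value, 1, width)
}

stored <- store_fixed(c("Tall", "Short"), width)
print(stored)
```

The second element comes back as “Shor”, matching what PROC PRINT displayed for Alice.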

How can we circumvent this? You can pre-set the length of a new character variable with the LENGTH statement, placed before the first statement that assigns a value to it. In the revised code below, I correct the problem by setting the length of “height_class” to 5 before defining its possible values.

    data c2;
        set sashelp.class;
        * define a new character variable to classify someone as tall or short;
        length height_class $ 5;
        if height > 60 then height_class = 'Tall';
        else height_class = 'Short';
    run;

    * print the results for the first 5 rows;
    proc print data = c2 (obs = 5);
    run;

Here is the result:

Obs | Name | Sex | Age | Height | Weight | height_class |
---|---|---|---|---|---|---|
1 | Alfred | M | 14 | 69.0 | 112.5 | Tall |
2 | Alice | F | 13 | 56.5 | 84.0 | Short |
3 | Barbara | F | 13 | 65.3 | 98.0 | Tall |
4 | Carol | F | 14 | 62.8 | 102.5 | Tall |
5 | Henry | M | 14 | 63.5 | 102.5 | Tall |

Notice that “height_class” for Alice is “Short”, as it should be.

An alternative solution is to re-write the code so that the first value assigned to “height_class” is the longest possible value. This does not require the use of the LENGTH statement.

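    data c3;
        set sashelp.class;
        * define a new character variable to classify someone as tall or short;
        * the condition is reversed so that the longest value is assigned first;
        if height <= 60 then height_class = 'Short';
        else height_class = 'Tall';
    run;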

By the way, I don't notice this problem in R. Here is some code to illustrate this observation.

    > set.seed(235)
    >
    > # randomly generate 3 values
    > x = rnorm(3, 60, 5)
    >
    > # add a value to the beginning of "x" so that the first value is above 60
    > # add a value to the end of "x" so that the last value is below 60
    > x = c(63, x, 57)
    > x
    [1] 63.00000 70.68902 61.36082 56.62601 57.00000
    >
    > # pre-allocate a vector for classifying "x" as "tall" or "short"
    > y = 0 * x
    >
    >
    > for (i in 1:length(x))
    + {
    +     if (x[i] > 60)
    +     {
    +         y[i] = 'Tall'
    +     }
    +     else
    +     {
    +         y[i] = 'Short'
    +     }
    + }
    >
    >
    > # display "y"
    > y
    [1] "Tall"  "Tall"  "Tall"  "Short" "Short"

Notice that the value “Short” renders fully with a length of 5. I did not need to pre-set the length of “y” first.
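A more compact way to check this, offered as a sketch rather than as part of the original session, is to classify the same five values with the vectorised `ifelse()` and count the characters in each result with `nchar()`. The values of `x` are copied from the output shown above, so no random number generation is involved.

```r
# Values of x copied from the session shown earlier
x <- c(63, 70.68902, 61.36082, 56.62601, 57)

# Vectorised classification; no loop or pre-allocation is needed
y <- ifelse(x > 60, "Tall", "Short")
print(y)

# Number of characters stored in each element
print(nchar(y))
```

Each element of an R character vector keeps its own full string, so `nchar(y)` reports 4 for every “Tall” and 5 for every “Short”, regardless of which value was assigned first.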
