Sergiy Radyakin wrote:
It has just occurred to me that string variables do not have extended
missing codes. A colleague of mine argues that this is perfectly fine,
because:
1) one can use any text to stand for particular situations ("not
applicable", "not responded",...)
2) for numerical values there are operations defined, which require
that they yield missing values if any argument is missing.
In a situation where I classify, say, firms by the first letter of their
name, I will have "Not applicable" and "No response" as instances in
section "N", which is not what I want. Hence every time I deal with
strings like that, I have to check specifically for particular string
values. Since a different data entry operator inevitably chooses a
different coding, the programs become highly dependent on a particular
dataset; it is also quite tedious and annoying. One
solution I see is to create a masking variable, which for each
observation will hold an agreed-upon code, e.g. 0=not
applicable; 1=valid observation; 2=applicable, but refused to answer;
3=applicable, but respondent doesn't know; etc.
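As a rough sketch of the masking-variable idea just described: the names make_status and MakeStatus are hypothetical, and mapping "No response" to code 2 is an assumption made only for illustration.

```stata
* Sketch only: make_status and MakeStatus are hypothetical names.
clear
input str20 make
"MyFirm"
"YourFirm"
"Not applicable"
"No response"
end

* Agreed-upon codes, as listed in the paragraph above
label define MakeStatus 0 "not applicable" 1 "valid observation" ///
    2 "applicable, but refused to answer" ///
    3 "applicable, but respondent doesn't know"

* Assumption: "No response" is treated as code 2 (refused to answer)
generate byte make_status = 1
replace make_status = 0 if make == "Not applicable"
replace make_status = 2 if make == "No response"
label values make_status MakeStatus

* Only valid observations contribute a first letter
generate str1 make_group = substr(make, 1, 1) if make_status == 1
tabulate make_group
tabulate make_status
```

The cost is the one noted above: the -replace- lines still hard-code the particular strings used in this dataset.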
I don't see this as a good solution, and I wonder whether there is
any technical possibility to instruct Stata that a particular string
value should be treated as a missing value in some operations. I see
it along these lines:
char define make[extmiss_a] Not applicable
char define make[extmiss_b] No response
And later
gen make_group=substr(make,1,1)
will create empty values for those observations that had "Not
applicable" or "No response"
(however I still want to be able to distinguish between the two in
some cases, like -tabulate-)
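Stata does not act on such characteristics by itself, but the intended effect can be emulated with existing commands. The following is only a sketch: the characteristic names extmiss_a and extmiss_b are the hypothetical ones from the lines above, and the loop over the codes a and b is an assumption about how they would be enumerated.

```stata
* Sketch only: emulate the proposed behaviour with -char- and a loop.
clear
input str20 make
"MyFirm"
"YourFirm"
"Not applicable"
"No response"
end

* Store the "missing" strings as characteristics of the variable
char define make[extmiss_a] "Not applicable"
char define make[extmiss_b] "No response"

generate str1 make_group = substr(make, 1, 1)

* Blank out the group wherever make equals a stored "missing" string
foreach code in a b {
    local missing_text : char make[extmiss_`code']
    replace make_group = "" if make == "`missing_text'"
}

tabulate make_group
tabulate make if make_group == ""
```

The original strings stay in make, so the two cases can still be told apart, for example in -tabulate-.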
What do you think about it? Are there extended missing string codes in
other statistical packages?
--------------------------------------------------------------------------------
I'm not aware of other statistical packages' having extended missing-value
codes for string variables. The other two packages that have extended
missing-value codes for numerical variables, SAS and SPSS, also are able to
apply value labels to string variables. But to my knowledge neither of them
has a string-variable analogue of their extended missing-value codes for
numerical variables.
I would approach the task along the lines of setting up alternative sets of
value labels after -encode-, something like what is illustrated below with a
dataset and set of desired missing value codes that are modeled after what
you show. (One note about efficiency: usually you'll have a lot of firms
and only a few missing-value codes, and so it would make sense to evert the
nested loops from what I have hastily done below.) If I were doing this
sort of thing routinely, I would probably take advantage of Stata's class
programming to make life easier.
Joseph Coveney
clear *
set more off
/* Create demonstration dataset */
set obs 4
generate str make = "MyFirm"
replace make = "YourFirm" in 2
replace make = "Not Applicable" in 3
replace make = "No Response" in l
/* Create starting list of value labels */
encode make, generate(encoded_make) label(Makes)
/* Create list of desired missing-value labels
and corresponding extended missing values
to use (in alphabetical order) */
label define Missings .a "Not Applicable" .b "No Response"
/* Substitute extended missings for current
value labels' values, and do same for encoded
variable's integers */
// Create program for substitutions
program define MatchAndSubstitute
    version 10.1
    syntax varname, label_index(integer) ///
        extended_missing(string) ///
        missing_string(string)
    local value_labels : value label `varlist'
    // Act only if the label at this index is the missing-value text
    if ("`: label `value_labels' `label_index''" == "`missing_string'") {
        // Drop the integer's label, attach the text to the extended
        // missing value, and recode the matching observations
        label define `value_labels' `label_index' "", modify
        label define `value_labels' `extended_missing' ///
            "`missing_string'", modify
        quietly replace `varlist' = `extended_missing' ///
            if `varlist' == `label_index'
    }
    else {
        exit 0
    }
end
// Traverse both value label lists, substituting
// where matches are found
local label_index 1
local label_string : label Makes `label_index'
// -: label- returns the number itself when no label is defined, so the
// loop ends at the first undefined index (assumes no name begins with 1-9)
while ( indexnot("`label_string'", "123456789") == 1 ) {
    foreach letter in `c(alpha)' {
        local missing_string : label Missings .`letter'
        if ( "`missing_string'" != ".`letter'" ) {
            MatchAndSubstitute encoded_make, ///
                label_index(`label_index') ///
                extended_missing(.`letter') ///
                missing_string(`missing_string')
        }
        else {
            // First unlabeled extended missing code: nothing more to check
            continue, break
        }
    }
    local ++label_index
    local label_string : label Makes `label_index'
}
label drop Missings
// Results
tabulate encoded_make
tabulate encoded_make, missing
/* Create initial value label list,
preserving missing value labels */
label copy Makes InitialMakes
local --label_index
forvalues InitialMakes_index = `label_index'(-1)1 {
    local label_string : label InitialMakes `InitialMakes_index'
    // Skip indices whose labels were removed (they come back as numbers)
    if ( regexm("`label_string'", "^[1-9]+") == 0 ) {
        local label_initial = substr("`label_string'", 1, 1)
        label define InitialMakes `InitialMakes_index' ///
            "`label_initial'", modify
    }
}
// Results
label values encoded_make InitialMakes
tabulate encoded_make
tabulate encoded_make, missing
exit
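
The efficiency note above (few missing-value codes, many firms) might be sketched as follows. This is not the do-file above rearranged line for line: it assumes the original string variable make is still in memory and uses it to locate the matching observations, so that only the short list of missing-value codes is looped over.

```stata
* Sketch only: outer loop over missing-value codes, no loop over firms.
clear
input str20 make
"MyFirm"
"YourFirm"
"Not Applicable"
"No Response"
end
encode make, generate(encoded_make) label(Makes)
label define Missings .a "Not Applicable" .b "No Response"

foreach letter in `c(alpha)' {
    local missing_string : label Missings .`letter'
    * An unlabeled code comes back as itself: no more codes to process
    if ("`missing_string'" == ".`letter'") continue, break
    * Find the integer that -encode- assigned to this string, if any
    quietly summarize encoded_make if make == "`missing_string'", meanonly
    if (r(N) > 0) {
        local old_value = r(min)
        label define Makes `old_value' "", modify
        label define Makes .`letter' "`missing_string'", modify
        quietly replace encoded_make = .`letter' ///
            if encoded_make == `old_value'
    }
}
label drop Missings

tabulate encoded_make
tabulate encoded_make, missing
```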