Re: st: -label define- and -replace- when a variable may be missing
From: Phil Schumm <[email protected]>
To: Statalist Statalist <[email protected]>
Date: Sun, 9 Mar 2014 15:11:44 -0500
On Mar 8, 2014, at 8:30 PM, Michael McCulloch <[email protected]> wrote:
> I am cleaning data for a survey, in which for one question, respondents were asked where they got their training.
> There are 9 possible answers for this question, but as I monitor survey data coming in, some may have missing values.
> The online survey instrument creates the variables as they are filled in:
> what_types_of_training_did___1
> what_types_of_training_did___2, and so on up to
> what_types_of_training_did___9
>
> In order to create reports but not have my do-file stopped by a missing (not yet defined) variable, how can I modify the following code?
>
> gen training=.
> replace training=1 if what_types_of_training_did___1==1
> replace training=2 if what_types_of_training_did___2==1
> replace training=3 if what_types_of_training_did___3==1
You can do this in one of two ways: (1) write the code to do the desired translation (i.e., from 9 vars into 1) in a way that can accommodate fewer than 9 input variables, or (2) fill in any missing variables first, and then perform the translation. I tend to prefer the latter, which results in a workflow like
             standardized      transformed
    raw ---> dataset      ---> dataset
         A                 B
where A is a series of steps which yield a dataset in "standard" form, and B includes whatever transformations of the data are necessary prior to analysis, distribution, or whatever. Thus, in the example above, A might include something like
    forv i = 1/9 {
        cap gen byte what_types_of_training_did___`i' = 0
    }
which you might then follow with a few tests, such as ensuring that the 9 items are truly mutually exclusive (as required if you want to collapse them into a single variable). Note that this would even handle the case where none of the variables exists (e.g., if none of the first batch of respondents provided an answer to the question).
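As a rough illustration of step A followed by one such test, here is a minimal sketch. The toy data (only items 1 and 3 present) and the helper variable n_checked are assumptions made for the example, not part of the original survey. Note that -capture- suppresses any error from -generate-, not only the "already defined" one.

```stata
* Toy data standing in for the survey export (hypothetical values):
* only items 1 and 3 have been created by the instrument so far.
clear
input what_types_of_training_did___1 what_types_of_training_did___3
1 0
0 1
0 0
end

* Step A: create any item that is not in the dataset yet.
* -capture- lets the loop continue when the variable already exists.
forvalues i = 1/9 {
    capture generate byte what_types_of_training_did___`i' = 0
}

* Test: no respondent may have more than one item checked.
* n_checked is a hypothetical helper; rowtotal() treats missing as 0.
egen byte n_checked = rowtotal(what_types_of_training_did___*)
assert n_checked <= 1
drop n_checked
```

If the -assert- fails, the do-file stops at that point, which is the desired behavior for a check that guards the later collapse into a single variable.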
Separation of A and B (typically in different set(s) of do-files) in the data management context has two important advantages:
1) It allows you to write simpler code in B, which makes it more readable and maintainable and cuts down on errors, and
2) It makes it easier to reuse the code in B in different contexts (as long as you pass it a dataset in standard form, which is where automated testing comes in handy).
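To make the first point concrete, here is a sketch of what B could look like once it can assume a dataset in standard form. The toy data, the label name training_lbl, and the label texts are placeholders invented for the example; the actual answer texts are not given in this thread.

```stata
* Toy standardized dataset (hypothetical values): all 9 items exist.
clear
set obs 4
forvalues i = 1/9 {
    generate byte what_types_of_training_did___`i' = 0
}
replace what_types_of_training_did___2 = 1 in 1
replace what_types_of_training_did___9 = 1 in 2
replace what_types_of_training_did___1 = 1 in 3

* Step B: collapse the 9 indicators into one variable.
* No existence checks are needed because A guarantees all 9 items.
generate byte training = .
forvalues i = 1/9 {
    replace training = `i' if what_types_of_training_did___`i' == 1
}

* Placeholder value labels (hypothetical texts).
forvalues i = 1/9 {
    label define training_lbl `i' "Training type `i'", modify
}
label values training training_lbl
```

Because the loop always runs over 1/9, the same B code works unchanged as more respondents come in and more of the variables start to exist in the raw data.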
-- Phil