Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: -label define- and -replace- when a variable may be missing
From: "Joseph Coveney" <[email protected]>
To: <[email protected]>
Subject: Re: st: -label define- and -replace- when a variable may be missing
Date: Mon, 10 Mar 2014 10:44:29 +0900
Squaring the dataset to a standardized form can also be done by -append-ing it
to an empty standardized template dataset (see below); this avoids the use of
-capture- in production code, if that's a concern. Regardless, if I were the OP,
I would want to understand why the datasets arrive in an inconsistent format.
And I would worry whether the dataset supplier could accidentally (or otherwise)
flag more than one type of training as positive, because then the collection of
training types into a single variable with nine value labels (one for each
possibility) would fail. Perhaps it's time for the OP to wade upstream closer
to the source in order to take a look at how the data are recorded and
processed.
Joseph Coveney
. clear *
. set more off
. set seed `=date("2014-03-10", "YMD")'
.
. *
. * Incoming dataset
. *
. quietly set obs 20
. foreach i in 1 3 5 9 {
  2.     generate byte what_types_of_training_did___`i' = 0
  3. }
. replace what_types_of_training_did___1 = 1
(20 real changes made)
. generate byte pid = _n
. generate str1 sex = cond(runiform() < 0.5, "F", "M")
. generate int age = floor(20 + 20 * runiform())
. set linesize 79
. describe, fullnames
Contains data
  obs:            20
 vars:             7
 size:           160
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
what_types_of_training_did___1
                byte    %8.0g
what_types_of_training_did___3
                byte    %8.0g
what_types_of_training_did___5
                byte    %8.0g
what_types_of_training_did___9
                byte    %8.0g
pid             byte    %8.0g
sex             str1    %9s
age             int     %8.0g
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
. tempfile incoming
. quietly save `incoming'
.
. *
. * Standardized template empty dataset
. *
. drop _all
. forvalues i = 1/9 {
  2.     quietly generate byte what_types_of_training_did___`i' = .
  3. }
.
. *
. * Squaring incoming dataset(s) by appending to template
. *
. append using `incoming'
.
. describe, fullnames
Contains data
  obs:            20
 vars:            12
 size:           260
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
what_types_of_training_did___1
                byte    %8.0g
what_types_of_training_did___2
                byte    %8.0g
what_types_of_training_did___3
                byte    %8.0g
what_types_of_training_did___4
                byte    %8.0g
what_types_of_training_did___5
                byte    %8.0g
what_types_of_training_did___6
                byte    %8.0g
what_types_of_training_did___7
                byte    %8.0g
what_types_of_training_did___8
                byte    %8.0g
what_types_of_training_did___9
                byte    %8.0g
pid             byte    %8.0g
sex             str1    %9s
age             int     %8.0g
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
. list pid-age *1 in -2/l, abbreviate(30) noobs
  +--------------------------------------------------+
  | pid   sex   age   what_types_of_training_did___1 |
  |--------------------------------------------------|
  |  19     M    38                                1 |
  |  20     F    39                                1 |
  +--------------------------------------------------+
.
. exit
end of do-file
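
One difference from the -capture generate- approach quoted below: indicators that -append- fills in from the template are missing (.) rather than 0. If an absent indicator should be read as "not selected", a follow-up step along these lines could recode them. This is only a sketch under that assumption, and it would also overwrite any genuinely missing values in indicators that were present in the incoming file.

```stata
* Sketch only: assumes a missing indicator means "not selected".
* The ? wildcard matches exactly one character, i.e. suffixes 1 to 9.
foreach v of varlist what_types_of_training_did___? {
    quietly replace `v' = 0 if missing(`v')
}
```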
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Phil Schumm
Sent: Monday, March 10, 2014 05:12
To: Statalist Statalist
Subject: Re: st: -label define- and -replace- when a variable may be missing
[OP redacted for brevity]
You can do this two ways: (1) write the code to do the desired translation
(i.e., from 9 vars into 1) in a way that can accommodate fewer than 9 input
variables, or (2) fill in any missing variables first, and then perform the
translation. I tend to prefer the latter, which results in a workflow like
             standardized      transformed
    raw --->   dataset    --->   dataset
         A                 B
where A is a series of steps which yield a dataset in "standard" form, and B
includes whatever transformations of the data are necessary prior to analysis,
distribution, or whatever. Thus, in the example above, A might include
something like
    forv i = 1/9 {
        cap gen byte what_types_of_training_did___`i' = 0
    }
which you might then follow with a few tests, such as ensuring that the 9 items
are truly mutually exclusive (as required if you want to collapse them into a
single variable). Note that this would even handle the case where none of the
variables exists (e.g., if none of the first batch of respondents provided an
answer to the question).
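
The mutual-exclusivity test described above, followed by the collapse into a single variable, might look like the following minimal sketch. It assumes step A has already run, so that all nine indicators exist and are coded 0/1. The names n_types and training_type and the placeholder label text are made up for illustration and do not come from the original post.

```stata
* Sketch only: assumes all nine what_types_of_training_did___# variables
* exist (step A) and are coded 0/1. n_types and training_type are
* hypothetical names.

* Count how many training types are flagged for each respondent.
* The ? wildcard matches exactly one character, i.e. suffixes 1 to 9.
egen byte n_types = rowtotal(what_types_of_training_did___?)

* Stop with an error if any respondent has more than one type flagged.
assert n_types <= 1

* Only then collapse the nine indicators into a single variable.
generate byte training_type = .
forvalues i = 1/9 {
    quietly replace training_type = `i' if what_types_of_training_did___`i' == 1
    * Placeholder label text; the real wording comes from the questionnaire.
    label define training_type `i' "Training type `i'", add
}
label values training_type training_type
drop n_types
```

Respondents with no type flagged are left with training_type missing. Whether that is acceptable depends on the questionnaire.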
Separation of A and B (typically in different set(s) of do-files) in the data
management context has two important advantages:
1) It allows you to write simpler code in B, which makes the code more readable
and maintainable and cuts down on errors, and
2) It makes it easier to reuse the code in B in different contexts (as long as
you pass it a dataset in standard form, which is where automated testing comes
in handy).
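
As one possible illustration of such automated testing, a do-file for B could begin with a check that the dataset it receives really is in standard form. The sketch below assumes that "standard form" means nine byte indicators coded 0/1, as produced by the -cap gen- loop above; that definition is an assumption, not something stated in the thread.

```stata
* Sketch only: a hypothetical gatekeeper to run at the top of a B do-file.
* Assumes "standard form" means nine byte indicators coded 0/1.
forvalues i = 1/9 {
    * Exits with an error if the variable is absent or not stored as byte.
    confirm byte variable what_types_of_training_did___`i'
    * Exits with an error if any value is something other than 0 or 1.
    assert inlist(what_types_of_training_did___`i', 0, 1)
}
```

If the standard form is instead built by appending to an empty template, absent indicators are missing rather than 0, and the -assert- condition would need to allow missing values as well.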