Re: st: -label define- and -replace- when a variable may be missing

From: "Joseph Coveney" <[email protected]>
To: <[email protected]>
Subject: Re: st: -label define- and -replace- when a variable may be missing
Date: Mon, 10 Mar 2014 10:44:29 +0900

Squaring the dataset to a standardized form can also be done by -append-ing it
to an empty, standardized template dataset (see below); this avoids the use of
-capture- in production code, if that's a concern.  Regardless, if I were the
OP, I would want to understand why the datasets arrive in an inconsistent
format.  I would also worry about whether the dataset supplier could
accidentally (or otherwise) flag more than one type of training as positive,
because then collecting the training types into a single variable with nine
value labels (one for each possibility) would fail.  Perhaps it's time for the
OP to wade upstream, closer to the source, and take a look at how the data are
recorded and processed.

Joseph Coveney

. clear *

. set more off

. set seed `=date("2014-03-10", "YMD")'

. 
. *
. * Incoming dataset
. *
. quietly set obs 20

. foreach i in 1 3 5 9 {
  2.     generate byte what_types_of_training_did___`i' = 0
  3. }

. replace what_types_of_training_did___1 = 1
(20 real changes made)

. generate byte pid = _n

. generate str1 sex = cond(runiform() < 0.5, "F", "M")

. generate int age = floor(20 + 20 * runiform())

. set linesize 79

. describe, fullnames

Contains data
  obs:            20                          
 vars:             7                          
 size:           160                          
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
what_types_of_training_did___1
                byte   %8.0g                  
what_types_of_training_did___3
                byte   %8.0g                  
what_types_of_training_did___5
                byte   %8.0g                  
what_types_of_training_did___9
                byte   %8.0g                  
pid             byte   %8.0g                  
sex             str1   %9s                    
age             int    %8.0g                  
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

. tempfile incoming

. quietly save `incoming'

. 
. *
. * Standardized template empty dataset
. *
. drop _all

. forvalues i = 1/9 {
  2.     quietly generate byte what_types_of_training_did___`i' = .
  3. }

. 
. *
. * Squaring incoming dataset(s) by appending to template
. *
. append using `incoming'

. 
. describe, fullnames

Contains data
  obs:            20                          
 vars:            12                          
 size:           260                          
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
what_types_of_training_did___1
                byte   %8.0g                  
what_types_of_training_did___2
                byte   %8.0g                  
what_types_of_training_did___3
                byte   %8.0g                  
what_types_of_training_did___4
                byte   %8.0g                  
what_types_of_training_did___5
                byte   %8.0g                  
what_types_of_training_did___6
                byte   %8.0g                  
what_types_of_training_did___7
                byte   %8.0g                  
what_types_of_training_did___8
                byte   %8.0g                  
what_types_of_training_did___9
                byte   %8.0g                  
pid             byte   %8.0g                  
sex             str1   %9s                    
age             int    %8.0g                  
-------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

. list pid-age *1 in -2/l, abbreviate(30) noobs

  +--------------------------------------------------+
  | pid   sex   age   what_types_of_training_did___1 |
  |--------------------------------------------------|
  |  19     M    38                                1 |
  |  20     F    39                                1 |
  +--------------------------------------------------+

. 
. exit

end of do-file
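
Once the dataset is squared, collecting the nine indicators into a single
labeled variable could look roughly like the sketch below.  The names
`training` and `training_lbl`, and the label text, are placeholders of mine
(the thread does not give the real category names), and the toy data merely
stand in for the appended dataset.

```stata
* Toy stand-in for the squared dataset: indicators 1-9, only some populated
clear
set obs 4
forvalues i = 1/9 {
    generate byte what_types_of_training_did___`i' = .
}
replace what_types_of_training_did___1 = 1 in 1/2
replace what_types_of_training_did___3 = 1 in 3

* Hypothetical value labels; the real category names are not in the thread
forvalues i = 1/9 {
    label define training_lbl `i' "Training type `i'", add
}

* Collapse the nine indicators into one variable
generate byte training = .
forvalues i = 1/9 {
    quietly replace training = `i' if what_types_of_training_did___`i' == 1
}
label values training training_lbl

list training, noobs
```

Note that if a row had more than one indicator flagged, the loop would
silently keep the highest-numbered one, which is why the concern about
multiple positive flags matters before this step is run.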

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Phil Schumm
Sent: Monday, March 10, 2014 05:12
To: Statalist Statalist
Subject: Re: st: -label define- and -replace- when a variable may be missing

[OP redacted for brevity]


You can do this two ways: (1) write the code to do the desired translation
(i.e., from 9 vars into 1) in a way that can accommodate fewer than 9 input
variables, or (2) fill in any missing variables first, and then perform the
translation.  I tend to prefer the latter, which results in a workflow like

               standardized        transformed
    raw  --->    dataset     --->    dataset
          A                   B

where A is a series of steps which yield a dataset in "standard" form, and B
includes whatever transformations of the data are necessary prior to analysis,
distribution, or whatever.  Thus, in the example above, A might include
something like

    forv i = 1/9 {
        cap gen byte what_types_of_training_did___`i' = 0
    }

which you might then follow with a few tests, such as ensuring that the 9 items
are truly mutually exclusive (as required if you want to collapse them into a
single variable).  Note that this would even handle the case where none of the
variables exists (e.g., if none of the first batch of respondents provided an
answer to the question).
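
A minimal sketch of such a test, run right after the fill-in loop, might be
the following.  The toy data and the helper variable `n_flagged` are my own
assumptions for illustration; only the indicator names come from the thread.

```stata
* Toy raw dataset with only some of the nine indicators present
clear
set obs 5
foreach i in 1 3 5 9 {
    generate byte what_types_of_training_did___`i' = 0
}
replace what_types_of_training_did___1 = 1 in 1/3
replace what_types_of_training_did___5 = 1 in 4

* Step A: fill in any indicators that are absent
forvalues i = 1/9 {
    capture generate byte what_types_of_training_did___`i' = 0
}

* Test: each respondent has at most one type of training flagged
egen byte n_flagged = rowtotal(what_types_of_training_did___*)
assert n_flagged <= 1
drop n_flagged
```

If any row has two or more indicators set to 1, -assert- stops the do-file
with an error, so the problem surfaces in A rather than downstream.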

Separation of A and B (typically in different set(s) of do-files) in the data
management context has two important advantages:

1) It allows you to write simpler code in B, which makes it more readable and
maintainable and cuts down on errors, and

2) It makes it easier to reuse the code in B in different contexts (as long as
you pass it a dataset in standard form, which is where automated testing comes
in handy).
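
As a rough illustration of the automated testing mentioned in (2), B could
begin by calling a small checker like the one sketched here.  The program name
`assert_standard_form` is hypothetical, and it assumes that "standard form"
means all nine indicators exist and are coded 0/1; a standard form that allows
missing values would need a looser check.

```stata
* Hypothetical checker: errors out if the data in memory are not in
* the assumed standard form (nine numeric 0/1 indicators)
capture program drop assert_standard_form
program define assert_standard_form
    forvalues i = 1/9 {
        confirm numeric variable what_types_of_training_did___`i'
        assert inlist(what_types_of_training_did___`i', 0, 1)
    }
end

* Toy dataset in standard form passes silently
clear
set obs 3
forvalues i = 1/9 {
    generate byte what_types_of_training_did___`i' = 0
}
assert_standard_form

* A dataset missing an indicator fails the check with a nonzero return code
drop what_types_of_training_did___4
capture noisily assert_standard_form
display "return code: " _rc
```

Keeping the check in its own program means the same call can sit at the top of
every do-file in B, whichever raw source the dataset came from.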

