Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: -label define- and -replace- when a variable may be missing
From: Michael McCulloch <[email protected]>
To: [email protected]
Subject: Re: st: -label define- and -replace- when a variable may be missing
Date: Sun, 9 Mar 2014 20:50:40 -0700
Thanks Joseph and Phil, for the elegant suggestions.
Best wishes,
Michael McCulloch
--
Pine Street Foundation, since 1989
124 Pine Street | San Anselmo | California | 94960-2674
P: (415) 407-1357 | F: (206) 338-2391 | http://www.PineStreetFoundation.org
On Mar 9, 2014, at 6:44 PM, Joseph Coveney wrote:
> Squaring the dataset to a standardized form can also be done by -append-ing to a
> standardized template empty dataset (see below); this would avoid the use of
> -capture- in production code if that's a concern. Regardless, if I were the OP,
> I would want to understand why the datasets arrive in an inconsistent format.
> And I would worry whether the dataset supplier could accidentally (or otherwise)
> flag more than one type of training as positive, because then the collection of
> training types into a single variable with nine value labels (one for each
> possibility) would fail. Perhaps it's time for the OP to wade upstream closer
> to the source in order to take a look at how the data are recorded and
> processed.
>
> Joseph Coveney
>
> . clear *
>
> . set more off
>
> . set seed `=date("2014-03-10", "YMD")'
>
> .
> . *
> . * Incoming dataset
> . *
> . quietly set obs 20
>
> . foreach i in 1 3 5 9 {
>   2.     generate byte what_types_of_training_did___`i' = 0
>   3. }
>
> . replace what_types_of_training_did___1 = 1
> (20 real changes made)
>
> . generate byte pid = _n
>
> . generate str1 sex = cond(runiform() < 0.5, "F", "M")
>
> . generate int age = floor(20 + 20 * runiform())
>
> . set linesize 79
>
> . describe, fullnames
>
> Contains data
>   obs:            20
>  vars:             7
>  size:           160
> -------------------------------------------------------------------------------
>               storage   display    value
> variable name   type    format     label      variable label
> -------------------------------------------------------------------------------
> what_types_of_training_did___1
>                 byte    %8.0g
> what_types_of_training_did___3
>                 byte    %8.0g
> what_types_of_training_did___5
>                 byte    %8.0g
> what_types_of_training_did___9
>                 byte    %8.0g
> pid             byte    %8.0g
> sex             str1    %9s
> age             int     %8.0g
> -------------------------------------------------------------------------------
> Sorted by:
>      Note:  dataset has changed since last saved
>
> . tempfile incoming
>
> . quietly save `incoming'
>
> .
> . *
> . * Standardized template empty dataset
> . *
> . drop _all
>
> . forvalues i = 1/9 {
>   2.     quietly generate byte what_types_of_training_did___`i' = .
>   3. }
>
> .
> . *
> . * Squaring incoming dataset(s) by appending to template
> . *
> . append using `incoming'
>
> .
> . describe, fullnames
>
> Contains data
>   obs:            20
>  vars:            12
>  size:           260
> -------------------------------------------------------------------------------
>               storage   display    value
> variable name   type    format     label      variable label
> -------------------------------------------------------------------------------
> what_types_of_training_did___1
>                 byte    %8.0g
> what_types_of_training_did___2
>                 byte    %8.0g
> what_types_of_training_did___3
>                 byte    %8.0g
> what_types_of_training_did___4
>                 byte    %8.0g
> what_types_of_training_did___5
>                 byte    %8.0g
> what_types_of_training_did___6
>                 byte    %8.0g
> what_types_of_training_did___7
>                 byte    %8.0g
> what_types_of_training_did___8
>                 byte    %8.0g
> what_types_of_training_did___9
>                 byte    %8.0g
> pid             byte    %8.0g
> sex             str1    %9s
> age             int     %8.0g
> -------------------------------------------------------------------------------
> Sorted by:
>      Note:  dataset has changed since last saved
>
> . list pid-age *1 in -2/l, abbreviate(30) noobs
>
>   +--------------------------------------------------+
>   | pid   sex   age   what_types_of_training_did___1 |
>   |--------------------------------------------------|
>   |  19     M    38                                1 |
>   |  20     F    39                                1 |
>   +--------------------------------------------------+
>
> .
> . exit
>
> end of do-file
>
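One consequence of the template approach worth noting: the indicators that were absent from the incoming file (2, 4, 6, 7 and 8 in the log above) hold missing values after -append-, whereas the -capture generate- variant quoted further down sets them to 0. If, and only if, an absent variable can safely be read as "nobody selected this type" (an assumption on my part, not something established in this thread), a sketch along these lines would put both approaches on the same 0/1 footing:

```stata
* Sketch only. Assumes a missing indicator means "not selected",
* which should be confirmed with the data supplier first.
foreach v of varlist what_types_of_training_did___* {
    quietly replace `v' = 0 if missing(`v')
}
```

Note that this also overwrites genuine nonresponse in indicators that did arrive with the data, so it may be safer to restrict the loop to the variables known to have been absent.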
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Phil Schumm
> Sent: Monday, March 10, 2014 05:12
> To: Statalist Statalist
> Subject: Re: st: -label define- and -replace- when a variable may be missing
>
> [OP redacted for brevity]
>
>
> You can do this two ways: (1) write the code to do the desired translation
> (i.e., from 9 vars into 1) in a way that can accommodate fewer than 9 input
> variables, or (2) fill in any missing variables first, and then perform the
> translation. I tend to prefer the latter, which results in a workflow like
>
>                  standardized          transformed
>     raw   --->     dataset     --->      dataset
>            A                     B
>
> where A is a series of steps which yield a dataset in "standard" form, and B
> includes whatever transformations of the data are necessary prior to analysis,
> distribution, or whatever. Thus, in the example above, A might include
> something like
>
>     forv i = 1/9 {
>         cap gen byte what_types_of_training_did___`i' = 0
>     }
>
> which you might then follow with a few tests, such as ensuring that the 9 items
> are truly mutually exclusive (as required if you want to collapse them into a
> single variable). Note that this would even handle the case where none of the
> variables exists (e.g., if none of the first batch of respondents provided an
> answer to the question).
>
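To make the "few tests" and the subsequent collapse concrete, here is a minimal sketch. It assumes step A has already run, so all nine indicators exist and are coded 0/1. The names n_flagged, training_type and training_lbl are hypothetical, as are the "Type #" labels, since the actual category names are not given in this thread.

```stata
* Sketch only: test mutual exclusivity, then collapse 9 indicators into 1 variable.
* rowtotal() treats missing values as 0.
egen byte n_flagged = rowtotal(what_types_of_training_did___*)

* Halts the do-file with an error if any respondent has two or more types flagged
assert n_flagged <= 1

generate byte training_type = .
forvalues i = 1/9 {
    replace training_type = `i' if what_types_of_training_did___`i' == 1
    * Placeholder label text; substitute the real category names
    label define training_lbl `i' "Type `i'", add
}
label values training_type training_lbl
drop n_flagged
```

Respondents with no type flagged are left with training_type missing. Because -assert- stops execution on failure, a batch that violates mutual exclusivity is caught before the collapsed variable is ever created.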
> Separation of A and B (typically in different set(s) of do-files) in the data
> management context has two important advantages:
>
> 1) It allows you to write simpler code in B, which makes it more readable,
> maintainable and cuts down on errors, and
>
> 2) It makes it easier to reuse the code in B in different contexts (as long as
> you pass it a dataset in standard form, which is where automated testing comes
> in handy).
>
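As one possible form of the automated testing mentioned in point 2, a B do-file could begin by checking that the dataset it has been handed really is in standard form. The sketch below assumes "standard form" means that all nine indicators exist, are numeric, and contain only 0, 1 or missing. That definition is my assumption for illustration, not one stated in the thread.

```stata
* Sketch only: guard at the top of a B do-file.
* Each command exits with an error if its check fails.
forvalues i = 1/9 {
    confirm numeric variable what_types_of_training_did___`i'
    assert inlist(what_types_of_training_did___`i', 0, 1) | ///
        missing(what_types_of_training_did___`i')
}
```

If the standard form is meant to exclude missing values (as it would after the -capture generate- step with = 0), dropping the missing() clause tightens the check accordingly.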
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/