I think you need to clean up at source.
Some of the problems look fairly clear
and can be fixed with a -subinstr()-
function in a -replace-. Some look more
difficult to diagnose.
For example, "998" as an element looks
like a miscoding for "9 98" and the action would
then be
replace myvar = subinstr(myvar, "998", "9 98", .)
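
As a rough illustration of that kind of clean-up, here is a minimal sketch on toy data. The values are made up to resemble a few rows of the posted table, and the use of -trim()-, -itrim()- and -regexm()- to standardise spacing and flag what still looks wrong is an assumption about what would help, not something checked against the real data.

```stata
* Toy data only: values imitate the posted table, not the real file
clear
input str22 myvar
"1 2 3 4 5 7 8 998"
"1 298"
"1 2 81098"
"    88"
end

* Remove leading/trailing blanks and collapse runs of internal blanks,
* so left- and right-justified values compare equal
replace myvar = trim(itrim(myvar))

* The fix for "998": separate code 9 from the closing code 98
replace myvar = subinstr(myvar, "998", "9 98", .)

* Flag values that still contain three or more digits in a row;
* valid codes have at most two digits, so these need a closer look
list myvar if regexm(myvar, "[0-9][0-9][0-9]")
```

On this toy data the -list- shows the values that need a further rule, such as "298" and "81098".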
Once you have cleaned up, some of your
questions can be answered using -tabsplit-
from -tab_chi- on SSC.
Others will require a different data structure
based on a -split- and then a -reshape-.
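
A minimal sketch of the -split- and -reshape- route follows, assuming the variable has already been cleaned so that every code is separated by a single space. The toy values and the names id, code, seq, anygamble, internet and first are hypothetical, chosen only for illustration.

```stata
* Toy data only: assumes codes are already cleanly space-separated
clear
input str22 myvar
"1 2 98"
"1 5 10 98"
"10 98"
"88"
"5 98"
end

gen long id = _n

* One numeric variable per element: code1, code2, ...
split myvar, gen(code) destring

* Long layout: one observation per respondent and reported code
reshape long code, i(id) j(seq)
drop if missing(code)

* Respondent-level indicators: any gambling (codes 1-11), Internet (code 10)
egen anygamble = max(inrange(code, 1, 11)), by(id)
egen internet = max(code == 10), by(id)

* Tag one observation per respondent so proportions count people, not codes
egen first = tag(id)
summarize anygamble internet if first

* Forms of gambling ordered by frequency
tabulate code if inrange(code, 1, 11), sort
```

Once tab_chi is installed (ssc install tab_chi), typing tabsplit myvar on the cleaned string variable should give a similar frequency table of the parts without changing the data structure.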
Nick
Honey, Wayne, DOH
> We have a data set with a poorly designed string variable of
> the form str%22s. This variable allowed for multiple
> responses to be coded in the following manner:
>
> 01. Cards (21, Black Jack, Poker, etc.)
> 02. Animals (Roosters, dogs, horses, frogs, ducks)
> 03. Sports (football, baseball, pool, golf)(incl. pools,
> w/friends or bookie)
> 04. Dice games of any type (Craps, etc.)
> 05. Lottery or numbers (Quick Pick, Road Runner, scratch cards, etc.)
> 06. Bingo
> 07. Raffles or sweepstakes
> 08. Slot machines, video machines or other gambling machines
> 09. Pull Tabs, punch cards
> 10. Internet Gambling
> 11. Other, please specify: ______________________________
> SAM (575-594)
>
> 88. Never Gamble GO TO NEXT MODULE
> 98. No other
> 77. Don't Know/Not Sure
> 99. Refused GO TO NEXT MODULE
>
> The respondent was free to respond in any way they chose and
> the interviewers were trained to select from among 15
> possible response codes. Codes 01 through 10 were assigned
> to particular forms of gambling. Code 11 was used to
> identify types of gambling that couldn't be coded according
> to the 10 identified responses.
> Codes 77, 88, and 99 are self-explanatory. If the respondent
> reported one or more types of gambling, the interviewer coded
> as many forms as were relevant, then entered 98 to indicate
> that no additional types of gambling were reported.
>
> Consequently, we have a variable with a wide variety of
> responses (see frequency table, below, showing the first and
> last few rows).
>
> 1 2 3 4 5 7 8 998 | 1 0.03 7.19
> 1 2 3 4 5 898 | 1 0.03 7.22
> 1 2 3 51098 | 1 0.03 7.25
> 1 2 4 5 7 898 | 1 0.03 7.28
> 1 2 498 | 1 0.03 7.31
> 1 2 81098 | 1 0.03 7.34
> 1 2 898 | 1 0.03 7.37
> 1 298 | 7 0.21 7.58
> 1 3 898 | 1 0.03 7.61
> 1 398 | 3 0.09 7.70
> 1 4 5 898 | 1 0.03 7.73
> 1 4 598 | 2 0.06 7.79
> 1 4 8 9 5 798 | 1 0.03 7.82
> 1 4 898 | 1 0.03 7.85
> 1 498 | 3 0.09 7.94
> 1 5 2 798 | 1 0.03 7.97
> 50 85998 | 1 0.03 40.16
> 5898 | 1 0.03 40.19
> 77 | 1 0.03 40.22
> 88 | 1 0.03 40.25
> 88 | 1,974 59.39 99.64
> 89 898 | 1 0.03 99.67
> 99 | 11 0.33 100.00
>
>
> Ultimately, we would like to summarize the results in a few
> simple ways:
> 1. Proportion of adults participating in gambling of any form
> 2. Proportion of adults participating in Internet gambling
> (as a new form that should be monitored)
> 3. Most common form of gambling
> 4. 3 most common forms of gambling
>
> Clearly, the structure of the variable does not lend itself
> to efficient use. Note that, in addition to the problem of
> multiple responses stored in a single variable, spacing does
> not appear to be consistent and some records even have a
> right justification while most appear to be left justified
> within the 22 columns. I don't know if this justification is
> real or only apparent.
>
> Any advice on how to work with this variable using Stata 9.2
> (generate other variables summarizing responses, etc.) would
> be greatly appreciated.