Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: creating a new variable
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: creating a new variable
Date: Wed, 18 Jul 2012 13:17:41 +0100
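That is not surprising. You are not asking exactly the same question.
The -egen- command ignores missings on -bw- when it calculates each
group mean, but it assigns that mean to every observation in the
group, including those with -bw- missing. -tabstat- ignores the
missings on -bw- out of hand.
Evidently -tabstat- leaves out 2991456 - 2972666 = 18790 observations.
Note also the row for 3309.638 with 9,763 observations, which matches
no week in your -tabstat- table: -egen- with by() treats missing
-gestwk- as a group of its own, while -tabstat- omits it unless you
specify the -missing- option.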
This is the sort of discrepancy that you can investigate yourself, if
only with a smaller dataset.
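A quick way to run that kind of check, sketched here on the auto data
rather than the original dataset (where -bw- and -gestwk- would stand
in for -rep78- and -foreign-):

```stata
* Sketch on the auto data shipped with Stata
sysuse auto, clear

* How many observations are missing on the variable being averaged?
count if missing(rep78)

* Which groups do those observations fall in?
tab foreign if missing(rep78)

* Overview of missingness for the variables involved
misstable summarize rep78 foreign
```

In the auto data -count- reports 5, which is the gap between 74
observations and the 69 with nonmissing -rep78-.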
To ensure identical results, always exclude the missings, e.g. by
-drop-ping them first.
. sysuse auto

. tabstat rep78, by(foreign) s(n mean)

Summary for variables: rep78
     by categories of: foreign (Car type)

 foreign |         N      mean
---------+--------------------
Domestic |        48  3.020833
 Foreign |        21  4.285714
---------+--------------------
   Total |        69  3.405797
------------------------------

. egen mean_rep78 = mean(rep78), by(foreign)

. tab mean_rep78

 mean_rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
   3.020833 |         52       70.27       70.27
   4.285714 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
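
If you would rather keep the observations than -drop- them, an -if-
qualifier gives the same agreement. A minimal sketch on the auto data,
assuming the new variable name mean_rep78_ok (not in the original
session):

```stata
sysuse auto, clear

* Restrict -egen- to observations with nonmissing rep78;
* all other observations get missing for the new variable
egen mean_rep78_ok = mean(rep78) if !missing(rep78), by(foreign)

* The frequencies now count only the observations -tabstat- uses
tab mean_rep78_ok
tabstat rep78, by(foreign) s(n mean)
```

Here the frequencies should be 48 and 21, matching the N column from
-tabstat-. With the original data the qualifier would presumably be
if !missing(bw, gestwk).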
On Wed, Jul 18, 2012 at 1:02 PM, Amal Khanolkar <[email protected]> wrote:
> Thank you Nick, Maarten & Steve for your suggestions.
>
> The tabstat command is the perfect way to get a descriptive take on what I wanted.
>
> I tried the following and find a discrepancy in the number of subjects:
>
> . egen mean_bw = mean(bw), by(gestwk)
>
> . tab mean_bw
>
> mean_bw | Freq. Percent Cum.
> ------------+-----------------------------------
> 559.5574 | 134 0.00 0.00
> 616.5096 | 387 0.01 0.02
> 699.3734 | 738 0.02 0.04
> 790.9377 | 1,235 0.04 0.08
> 902.7249 | 1,688 0.06 0.14
> 1014.961 | 2,125 0.07 0.21
> 1138.658 | 2,723 0.09 0.30
> 1295.815 | 3,415 0.11 0.42
> 1461.302 | 4,481 0.15 0.57
> 1655.637 | 5,876 0.20 0.76
> 1858.227 | 8,533 0.29 1.05
> 2092.705 | 12,958 0.43 1.48
> 2325.826 | 21,420 0.72 2.20
> 2592.584 | 36,710 1.23 3.42
> 2837.138 | 70,297 2.35 5.77
> 3081.272 | 151,310 5.06 10.83
> 3309.638 | 9,763 0.33 11.16
> 3313.268 | 373,660 12.49 23.65
> 3488.345 | 660,536 22.08 45.73
> 3627.659 | 1,648 0.06 45.78
> 3637.902 | 822,376 27.49 73.28
> 3698.833 | 5,470 0.18 73.46
> 3755.764 | 542,442 18.13 91.59
> 3791.726 | 31,928 1.07 92.66
> 3826.705 | 219,603 7.34 100.00
> ------------+-----------------------------------
> Total | 2,991,456 100.00
>
> . tabstat bw, by(gestwk) stat (mean n sd)
>
> Summary for variables: bw
> by categories of: gestwk
>
> gestwk | mean N sd
> ---------+------------------------------
> 22 | 559.5574 122 209.6139
> 23 | 616.5096 365 134.5845
> 24 | 699.3734 691 135.2207
> 25 | 790.9377 1171 147.066
> 26 | 902.7248 1610 189.5523
> 27 | 1014.961 2024 201.809
> 28 | 1138.658 2613 238.724
> 29 | 1295.815 3316 278.1803
> 30 | 1461.302 4367 299.6202
> 31 | 1655.637 5732 345.8412
> 32 | 1858.227 8369 359.1699
> 33 | 2092.704 12771 402.861
> 34 | 2325.826 21149 416.8742
> 35 | 2592.584 36451 458.3818
> 36 | 2837.138 69940 464.2042
> 37 | 3081.272 150767 465.5551
> 38 | 3313.268 372601 453.221
> 39 | 3488.345 658969 445.2462
> 40 | 3637.902 820460 453.1178
> 41 | 3755.764 541160 467.3571
> 42 | 3826.705 219074 485.0738
> 43 | 3791.726 31859 507.7569
> 44 | 3698.833 5454 512.7899
> 45 | 3627.659 1631 531.2405
> ---------+------------------------------
> Total | 3502.912 2972666 575.2709
> ----------------------------------------
>
>
> As one can see from above the N for each gestational week isn't the same for the two tabs. I get the same problem when using:
>
> bys gestwk : egen mean1 = mean(bw)
>
> The N's are almost the same for most gestwk thus giving the same mean BW. But in some cases the N's differ quite a bit giving larger differences in mean BW.
>
>
> Thanks,
> /Amal
>
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: 18 July 2012 13:40
> To: [email protected]
> Subject: Re: st: creating a new variable
>
> Here are five solutions for a similar problem.
>
> . sysuse auto
>
> . tab rep78, su(mpg)
>
> Repair | Summary of Mileage (mpg)
> Record 1978 | Mean Std. Dev. Freq.
> ------------+------------------------------------
> 1 | 21 4.2426407 2
> 2 | 19.125 3.7583241 8
> 3 | 19.433333 4.1413252 30
> 4 | 21.666667 4.9348699 18
> 5 | 27.363636 8.7323849 11
> ------------+------------------------------------
> Total | 21.289855 5.8664085 69
>
> . tabstat mpg , by(rep78)
>
> Summary for variables: mpg
> by categories of: rep78 (Repair Record 1978)
>
> rep78 | mean
> ---------+----------
> 1 | 21
> 2 | 19.125
> 3 | 19.43333
> 4 | 21.66667
> 5 | 27.36364
> ---------+----------
> Total | 21.28986
> --------------------
>
> . graph dot (mean) mpg, over(rep78) vertical
>
> . egen mean_mpg = mean(mpg), by(rep78)
>
> . scatter mean_mpg rep78
>
> . dotplot mpg, over(rep78) bar
>
>
> On Wed, Jul 18, 2012 at 11:34 AM, Amal Khanolkar <[email protected]> wrote:
>
>> I have a very simple problem that I'm unable to find a simple solution for:
>>
>> Below is the data concerned:
>>
>> Gestational age in weeks:
>>
>> tab gestwk
>>
>> gestwk | Freq. Percent Cum.
>> ------------+-----------------------------------
>> 22 | 134 0.00 0.00
>> 23 | 387 0.01 0.02
>> 24 | 738 0.02 0.04
>> 25 | 1,235 0.04 0.08
>> 26 | 1,688 0.06 0.14
>> 27 | 2,125 0.07 0.21
>> 28 | 2,723 0.09 0.30
>> 29 | 3,415 0.11 0.42
>> 30 | 4,481 0.15 0.57
>> 31 | 5,876 0.20 0.76
>> 32 | 8,533 0.29 1.05
>> 33 | 12,958 0.43 1.49
>> 34 | 21,420 0.72 2.20
>> 35 | 36,710 1.23 3.44
>> 36 | 70,297 2.36 5.79
>> 37 | 151,310 5.07 10.87
>> 38 | 373,660 12.53 23.40
>> 39 | 660,536 22.15 45.55
>> 40 | 822,376 27.58 73.13
>> 41 | 542,442 18.19 91.33
>> 42 | 219,603 7.37 98.69
>> 43 | 31,928 1.07 99.76
>> 44 | 5,470 0.18 99.94
>> 45 | 1,648 0.06 100.00
>> ------------+-----------------------------------
>> Total | 2,981,693 100.00
>>
>>
>> Mean birth weight of my study sample:
>>
>> . sum bw
>>
>> Variable | Obs Mean Std. Dev. Min Max
>> -------------+--------------------------------------------------------
>> bw | 2980093 3502.431 575.7603 300 6780
>>
>> . sum bw if gestwk==26
>>
>> Variable | Obs Mean Std. Dev. Min Max
>> -------------+--------------------------------------------------------
>> bw | 1610 902.7248 189.5523 350 1970
>>
>>
>> Below, if I would like to look at the mean birth weight for a particular gestational week:
>>
>> . sum bw if gestwk==27
>>
>> Variable | Obs Mean Std. Dev. Min Max
>> -------------+--------------------------------------------------------
>> bw | 2024 1014.961 201.809 380 1920
>>
>> . sum bw if gestwk==28
>>
>> Variable | Obs Mean Std. Dev. Min Max
>> -------------+--------------------------------------------------------
>> bw | 2613 1138.658 238.724 370 2000
>>
>> . sum bw if gestwk==29
>>
>> Variable | Obs Mean Std. Dev. Min Max
>> -------------+--------------------------------------------------------
>> bw | 3316 1295.815 278.1803 370 2480
>>
>>
>> What I would like to do is to create a single continuous variable that would give me the mean birth weight for each gestational week so that I don't have to look at it individually as above. I would like to ideally be able to use this variable in scatter plots.
>>
>> If I plot as follows:
>>
>> twoway scatter bw gestwk
>>
>> I of course don't get a single estimate for each gestational week, but instead the entire range of birth weight for a particular week is plotted.
>>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/