
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Using sampling/probability weights for mixed design ANOVA in STATA


From   Steven Samuels <[email protected]>
To   [email protected]
Subject   Re: st: Using sampling/probability weights for mixed design ANOVA in STATA
Date   Fri, 17 Jun 2011 09:57:40 -0400

Meg-

You do interpret the coefficients in the same way.  Most illustrations of -anova- have very few subjects, so it is necessary to pool interactions to get reasonable tests.  With as many observations as you have, pooling is unnecessary.  
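
A minimal sketch of that point on simulated data. The data-generating lines and every number in them are invented for illustration; only the names brainvol, time, sex, hemisphere, subject and yourweight echo this thread. With two-level factors each interaction has a single coefficient, so -testparm- after -regress- returns F = t^2 for that coefficient, with denominator degrees of freedom equal to the number of clusters minus one under vce(cluster subject):

```stata
* Simulated stand-in data: all values are hypothetical
clear
set seed 12345
set obs 110
gen subject    = _n
gen sex        = 1 + (runiform() < 0.5)
gen yourweight = 1 + 19*runiform()
gen u          = rnormal(0, 50)          // subject-level random intercept
expand 4                                 // 2 times x 2 hemispheres per subject
bysort subject: gen time       = 1 + (_n > 2)
bysort subject: gen hemisphere = 1 + mod(_n, 2)
gen brainvol = 3000 + 40*(time==2) + 25*(sex==2) + u + rnormal(0, 30)

regress brainvol i.time##i.sex##i.hemisphere ///
    [pweight = yourweight], vce(cluster subject)

* Wald tests of interaction terms, the regression analogue of ANOVA F tests
testparm i.time#i.sex#i.hemisphere
testparm i.time#i.sex
display "F from testparm = " r(F)
display "t^2 of coefficient = " (_b[2.time#2.sex]/_se[2.time#2.sex])^2
```

The two displayed numbers should agree, which is one way to see that the coefficient test and the single-degree-of-freedom F test are the same test.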


Steve

On Jun 17, 2011, at 1:36 AM, Meg Dennison wrote:

Thanks Steve,

I didn't see your comments as criticism at all! We expected a low
uptake, and the numbers we got were what we needed.


This might be a silly question, but the p values for the regression
aren't the same as for the anova; would this be due to differences
in degrees of freedom? (But is the test of the coefficient
essentially the same test as the anova, i.e., would you interpret
them the same way?)

Thanks for your help,

Meg




On Thu, Jun 16, 2011 at 8:51 PM, Steven Samuels <[email protected]> wrote:
> Meg-
> 
> 
> I have to apologize for criticizing your response rate.  I have seen studies that had much lower rates despite strong recruitment efforts.  I really know nothing of what you tried and 25% might be as good as it could get.
> 
> If response-weighted estimates are similar to the probability weighted estimates, you are  probably better off using the latter.  The reason is that response-weighting often leads to greater variation in the weights (larger CVs) and this can increase standard errors enough to offset the possible minimal reduction in bias.
> 
> 
> The second -regress- statement might be more understandable as:
> 
> regress brainvol  i.time i.sex i.hemisphere ///
>                  i.time#i.sex i.time#i.hemisphere  i.sex#i.hemisphere ///
>                  [pweight=yourweight], vce(cluster subject)
> 
> You could also add the three-way interaction i.sex#i.time#i.hemisphere to the above regression.
> 
> But the three-way model is most parsimoniously written as:
> regress brainvol i.time##i.sex##i.hemisphere [pweight = yourweight], vce(cluster subject)
> 
> Steve
> [email protected]
> 
> Meg-
> 
> To answer your question, Example 17 in the Stata Manual entry for -anova- shows a repeated measures design with two within-subject factors.
> 
> But on further consideration, I don't think you will be able to emulate the -anova- calculations with -regress-, as you would have to calculate weighted mean squares, which are not available in the regression results.  Moreover, -anova- relies on assumptions, such as the absence of certain interactions, which might not be reasonable.
> 
> Since you have 110 subjects, you can still use -regress- directly and analyze hemisphere and time as within-subject effects. I should have realized this sooner, and I apologize.  Create a numeric hemisphere variable coded 1 & 2.  (String variables won't work with -regress-.)
> 
> Here is the code:
> 
> ****************************
> /* Main Effects */
> regress brainvol  i.time i.sex i.hemisphere ///
>  [pweight = yourweight], vce(cluster subject)
> /* Interactions */
> regress brainvol i.time##i.sex i.time##i.hemisphere i.sex##i.hemisphere ///
>  [pweight = yourweight], vce(cluster subject)  // interactions
> ***************************
> 
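> With the vce(cluster subject) option, the standard errors from -regress- allow for correlation among the repeated observations on each subject, so time and hemisphere are handled appropriately as within-subject factors.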
> 
> To compare to -anova- run the above without the [pweight=] option.
> 
> This approach ignores variability imposed by the sampling design and so from a sampling standpoint the standard errors will be incorrect. You can get an idea of how incorrect, by substituting "postcode" for "subject" in the vce() option.
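> 
> As an illustration only, the comparison of the two clustering choices might be sketched as follows. The data are simulated, and the number of postcodes (20) and all other values are made-up assumptions, not features of the real study:
> 
> ```stata
> * Simulated stand-in data: all values are hypothetical
> clear
> set seed 90
> set obs 110
> gen subject    = _n
> gen postcode   = ceil(20*runiform())     // assumed: about 20 postcodes
> gen sex        = 1 + (runiform() < 0.5)
> gen yourweight = 1 + 19*runiform()
> gen u          = rnormal(0, 50)          // subject-level random intercept
> expand 4                                 // 2 times x 2 hemispheres per subject
> bysort subject: gen time       = 1 + (_n > 2)
> bysort subject: gen hemisphere = 1 + mod(_n, 2)
> gen brainvol = 3000 + 40*(time==2) + 25*(sex==2) + u + rnormal(0, 30)
> 
> * Same model, two clustering choices; coefficients are identical, SEs differ
> regress brainvol i.time i.sex i.hemisphere ///
>     [pweight = yourweight], vce(cluster subject)
> estimates store by_subject
> regress brainvol i.time i.sex i.hemisphere ///
>     [pweight = yourweight], vce(cluster postcode)
> estimates store by_postcode
> 
> * Side-by-side coefficients with standard errors
> estimates table by_subject by_postcode, se
> ```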
> 
> Now the warning:  If you do not correct your probability weights for non-participation, you risk serious bias, except under conditions which cannot be tested.   You would be better off using no weights at all.  If you do not make such a correction, then whether you use probability weights or not, you will have to state that the children are not representative.
> 
> From a statistical point of view, a protocol which attempted to recruit 100 children intensively  and got 75 would have been superior. This might have happened if you had been able to do a pilot study on, say, 20 children, to test recruitment strategies.
> 
> You can also use the contributed -gllamm- package ("findit") which would allow you to fit a nested repeated measures model with postcode and school.  This would allow you to assess the variation due to post-code and schools-within-postcode.
> 
> Steve
> 
> Steven J. Samuels
> Consulting Statistician
> 18 Cantine's Island
> Saugerties, NY 12477 USA
> Voice: 845-246-0774
> Fax:   206-202-4783
> [email protected]
> 
> 
> 
> 
> On Jun 14, 2011, at 11:25 PM, Meg Dennison wrote:
> 
> Hi Steven,
> 
> I am still familiarizing myself with STATA and have looked at this
> syntax below using anova, regress and pweights commands - can this
> handle repeated measures (within-subjects) variables?
> 
> Thanks again,
> 
> Meg
> 
> On Wed, Jun 1, 2011 at 9:58 AM, Steven Samuels <[email protected]> wrote:
>> --
>> 
>> Meg, here's an example of how to do an -anova- analysis with a probability weight.  It exploits the fact that -anova- is equivalent to an ordinary multiple regression command (Stata -regress-, for example) and Stata's -regress- _will_ take a probability weight. The trick is to run -regress- without options after -anova-  to show the implied predictors that you need to put on the right hand side of the equation. You will need to prefix the categorical factors with "i.". The example is from the -anova- help illustration of a split-plot design:
>> 
>> *******START******************
>> capture log close   // closes an open log; no error if none is open
>> log using aovtest, replace text
>> webuse reading
>> 
>> anova score prog / class|prog skill prog#skill / class#skill|prog / group|class#skill|prog /, dropemptycells
>> regress   //
>> 
>> //regression equivalent from -regress- statement after -anova-: add the "i." in front of the categorical variables
>> 
>> regress score i.prog i.class#i.prog i.skill i.program#i.skill ///
>> i.class#i.skill#i.program  i.group#i.class#i.skill#i.program
>> 
>> // regression with probability weight
>> set seed 4816033
>> gen finalwt =20*(uniform() +.05)  // weights>1
>> 
>> regress score i.prog i.class#i.prog i.skill i.program#i.skill ///
>> i.class#i.skill#i.program  i.group#i.class#i.skill#i.program ///
>> [pweight = finalwt]   // <-- probability weight goes in square brackets, before any comma
>> *********END**************
>> 
>> Meg,
>> I'm not sure that you should or can do a full-fledged survey analysis.
>> From your description, you selected post-codes at random, then schools within post-codes at random. (The term "randomly selected" actually is not informative, as this can describe many different procedures.) In survey parlance, the post-codes would be primary sampling units (PSUs) and the schools would be secondary sampling units, with no sampling thereafter.  You made a further selection on the basis of test questions.
>> 
>> An idle question: Did you round some numbers or leave out some details? It would be odd to get exactly 2500 and 400 as the result of a sampling/testing process.
>> 
>> What you are calling "sampling bias" is, in fact, selection bias. You say that the factors on which the selection took place have nothing to do with gender. But the issue is: would it have any relation to the development of brain volume? If so, the 400 children would not be representative of the larger population.
>> 
>> Your more critical concern should be the large non-response bias at the final stage.  The only way to (partly) alleviate that is to do response weighting: predict the probability of participation from variables you know for all 400 children who were invited; use the inverse as a response weight and multiply this by the selection weight.  See: Sharon Lohr. 2009. Sampling: Design and Analysis. 2nd ed. Boston, MA: Cengage Brooks/Cole.
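>> 
>> A rough sketch of that weight construction, on a simulated frame of 400 invitees. The variable names (sel_weight, x1, female, participated, p_resp, resp_weight, final_wt) and the use of -logit- for the participation model are assumptions made for illustration; the real predictors would be whatever is known for all invited children:
>> 
>> ```stata
>> * Hypothetical frame of the 400 invited children; all values are made up
>> clear
>> set seed 2011
>> set obs 400
>> gen sel_weight   = 1 + 9*runiform()       // stand-in selection weight
>> gen x1           = rnormal()              // covariate known for all invitees
>> gen female       = runiform() < 0.5
>> gen participated = runiform() < invlogit(-1.1 + 0.4*x1)
>> 
>> * Model the probability of participation among everyone invited
>> logit participated x1 i.female
>> predict p_resp, pr
>> 
>> * Response weight = inverse of the predicted participation probability
>> gen resp_weight = 1/p_resp
>> 
>> * Final weight = selection weight x response weight (participants only)
>> gen final_wt = sel_weight*resp_weight if participated
>> 
>> * Check how much the correction spreads out the weights
>> summarize final_wt, detail
>> display "CV of final weights = " r(sd)/r(mean)
>> ```
>> 
>> A large coefficient of variation in final_wt is a sign that the correction may inflate standard errors.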
>> 
>> Here's what I suggest
>> 
>> 1. Ignoring the  weights, run the -anova- command. I'm not that familiar with the -anova- command, but I would guess that the standard error for hemisphere should be the subject x hemisphere interaction term. You can add terms for postal code, nesting of schools within postal code, and of subjects within schools.
>> 
>> 2. Run the -regress- post-anova command to see how to set up the problem using Stata's -regress-.
>> 
>> 3. Now compute the final weight with the non-response correction.
>> 
>> 4. Run the -regress- equivalent of  the anova  but incorporating your final weight with a [pweight = ] option.
>> 
>> 5. Do the same, but omit the within-subject terms and add as a final option (after the comma) "vce(cluster subject)"
>> 
>> 
>> 
>> These analyses are defensible because they allow for the possibility of postal code and school differences but do not use the design to determine standard errors.
>> 
>> 6. Try a full survey analysis:  -svyset- your data:
>> 
>> ************************************************
>> svyset post_code  [pweight=final_wt] || school
>> ************************************************
>> 
>> Try variations of the following model
>> 
>> ************************************************************************************
>> svy: regress brainvol  i.time i.sex i.hemisphere  //main effects. Add interactions as desired
>> ************************************************************************************
>> (See the -help- for "factor variables", assuming you have Stata 11)
>> 
>> Steve
>> [email protected]
>> 
>> 
>> 
>> On May 26, 2011, at 8:36 PM, Meg Dennison wrote:
>> 
>> Hi All,
>> 
>> Steven thanks for your reply. I have inserted my answers below.
>> 
>> 
>> But the description of your data is unclear.  You refer to one
>> between-subject and two within-subject "variables", but to "the"
>> (single?) repeated measures variable with two levels. Isn't this a
>> within-subject variable?  By two levels do you mean two occasions (if
>> longitudinal)?  Which, if any, variables (besides subject) do you
>> consider to be "random effects"?
>> 
>> - I am looking at brain development over time. I have collected data
>> on brain measures at two time points for each subject (the repeated
>> measure - baseline and follow up). Additionally, these brain measures
>> involve collecting from both the left and the right hemisphere within
>> a single person - and are not independent, so they are being treated
>> as another within subjects variable (hemisphere - left and right).
>> 
>> The between subjects variable is sex (obviously, male and female).
>> 
>> So please clarify what the variables are and list the data for some
>> subjects, so that we can see where you are starting from.
>> 
>> So, the data would look like this:
>> 
>> Subject BrainVol Time Hemisphere
>> 1               1345      1        left
>> 1               2345      2        left
>> 1               3546      1        right
>> 1               3457      2        right
>> etc
>> 
>> 
>> In any case, for complex survey data, the standard errors for
>> estimates are governed by variation of primary sampling units (PSUs,
>> first-stage clusters) within strata, so the usual ANOVA formulas would
>> not ordinarily apply. Stata can analyze some mixed model designs with
>> survey data.
>> 
>> Some other questions that will help us suggest analyses:
>> 1. What is the sampling design? If there were strata, do they
>> correspond to the "between-subject" variable?
>> The sampling design involved postcodes being randomly selected across
>> a metropolitan city. Within these postcodes (strata?), schools were
>> randomly selected to participate (clusters?). All Grade 5 classes
>> within these schools were asked to complete a survey (obviously not
>> all consented or were present at school that day etc). The survey they
>> completed consisted of four factors. Two of these factors were used to
>> select subjects for further participation - the probability of being
>> selected is the probability weights that I have based on this sampling
>> bias. From this initial sample of about 2500, 400 were invited to
>> participate in the research, and from those who were invited, I have
>> 101 who participated in my study. The variable on which they were
>> initially sampled does not correspond to sex - the BS variable in my
>> study. I am not interested in the variable on which the sampling bias
>> was introduced - my data is derived from a larger research project for
>> which this initial sampling bias was desirable.
>> 
>> 2. Are replicate (bootstrap, jackknife, BRR) weights available? Did
>> the survey distributor provide SAS or SPSS macros to compute them?
>> No, the selection was not done using these programs.
>> 
>> 3. What questions are you trying to answer?  What parameters do you
>> hope to estimate or test in your analysis?
>> I am interested in describing typical brain development - how it
>> changes over time by sex and hemisphere, and their interaction. I
>> believe that the initial sample of 2500 was reasonably representative
>> of normally developing children (obviously with the caveats of
>> living in a certain country, being at school, living in a city, etc.).
>> I would like to correct for the sampling bias that was introduced.
>> 
>> Thanks in advance
>> 
>> Meg
>> 
>> 
>> 4. What version of Stata do you have? Version 11.
>> 
>> 
>> On Tue, May 24, 2011 at 11:54 PM, Steven Samuels <[email protected]> wrote:
>>> 
>>> Hi, Meg.
>>> 
>>> Welcome to Stata!  You will find that Stata's regression and survey capabilities are both far superior to those of SPSS.
>>> 
>>> But the description of your data is unclear.  You refer to one between-subject and two within-subject "variables", but to "the" (single?) repeated measures variable with two levels. Isn't this a within-subject variable?  By two levels do you mean two occasions (if longitudinal)?  Which, if any, variables (besides subject) do you consider to be "random effects"?
>>> 
>>> So please clarify what the variables are and list the data for some subjects, so that we can see where you are starting from.
>>> 
>>> In any case, for complex survey data, the standard errors for estimates are governed by variation of primary sampling units (PSUs, first-stage clusters) within strata, so the usual ANOVA formulas would not ordinarily apply. Stata can analyze some mixed model designs with survey data.
>>> 
>>> Some other questions that will help us suggest analyses:
>>> 1. What is the sampling design? If there were strata, do they correspond to the "between-subject" variable?
>>> 2. Are replicate (bootstrap, jackknife, BRR) weights available? Did the survey distributor provide SAS or SPSS macros to compute them?
>>> 3. What questions are you trying to answer?  What parameters do you hope to estimate or test in your analysis?
>>> 4. What version of Stata do you have?
>>> 
>>> Steve
>>> [email protected]
>>> 
>>> 
>>> On May 23, 2011, at 9:20 AM, Meg Dennison wrote:
>>> 
>>> Hi,
>>> 
>>> I have a complex sample, for which I need to use sampling weights
>>> (probability weights). I already have these values derived from the
>>> initial sampling selection. I wanted to then perform a mixed design
>>> ANOVA (with 2 within subjects variables and one between subjects
>>> variable). The repeated measures variable only has 2 levels.
>>> 
>>> I have only used SPSS before and the Complex Sampling Add-on module
>>> only allows for univariate ANOVA. Can STATA perform this type of
>>> analysis? From what I could see from looking at the GUI and reading
>>> the manual, probability weights (pweights) could not be used for mixed
>>> ANOVA?
>>> 
>>> Is there another way I should be thinking about this?
>>> 
>>> Thanks in advance for your help,
>>> 
>>> 
>>> Kind regards,
>>> 
>>> Meg
>>> 
>>> --
>>> 
>>> Meg Dennison BA(Hons) MPsych(Clin)/PhD Candidate
>>> School of Psychological Sciences, University of Melbourne
>>> [email protected]
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/statalist/faq
>>> *   http://www.ats.ucla.edu/stat/stata/
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 




