From: Rebecca Pope <rebecca.a.pope@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: reliability with -icc- and -estat icc-
Date: Wed, 27 Feb 2013 10:57:53 -0600
Sorry guys, was offline most of yesterday while this discussion was occurring. Jay, I don't know which Rabe-Hesketh & Skrondal text you are referring to, but if it is MLM Using Stata (2012) you'll want Ch. 9. I just pulled it down off my shelf and flipped through. They have a brief discussion of Lenny's problem at the very end (exclusive of the "crazy" rater, who, as Nick points out, may just be the only sane one in the bunch).

In the interest of full disclosure, like Nick, psychometrics is not my specialty. The following is as much for my edification as to add to the group discussion.

Joseph has used two random effects rather than one (leaving aside the whole target/rater issue for now). This corresponds to crossed effects (advised by R-H & S, so I should have read the book yesterday instead of adapting UCLA's code to match -icc-) and will reduce the ICC. This differs by design from what is implemented with -icc-, which treats the target as fixed, as does the code I posted originally. In short, Jay, while _all: R isn't wrong, my use of fixed effects for part of the model was. Does that sum it up appropriately? (A sketch of the crossed-effects specification in -mixed- syntax is at the end of this post.)

On Wed, Feb 27, 2013 at 10:06 AM, <jcoveney@bigplanet.com> wrote:
> "Just by inspection, raters are not reliable--if your sample is
> representative, then a quarter of the population of raters disagrees
> dramatically from the rest..."

If dramatic disagreement is a mark of unreliable raters, what does that say for elections, user reviews, or for that matter faculty search committees? Don't take this to mean I don't understand the general concept that you want raters to concur. However, we're talking about smartphone apps, not e.g. radiology, where there is a "true" answer (yes/no lung cancer). Hypothetically at least, you could legitimately have strong and divergent opinions.

I would argue here that the issue of rater reliability is not an issue of disagreement but rather Rater 2's utter inability to distinguish between applications. Now, perhaps this means that the applications really aren't substantively different from each other and you found 3 people who just wanted to accommodate you by completing the ranking task, and Rater 2 happened to be honest. Who knows. I'd say it's unlikely, but I've seen some pretty unlikely things happen...

On Wed, Feb 27, 2013 at 9:33 AM, JVerkuilen (Gmail) <jvverkuilen@gmail.com> wrote:
> I think my interpretation is that Rater 4 is an outlier. The question
> is what does that outlier status mean? It's very clear that she's
> driving most of the ICC estimate so the notion of a good ICC estimate
> from these data themselves is suspect.

This is a useful study outcome. Useful, if not pleasant. I share Joseph's objections to dropping Rater 4, though.

>> My take on all that would be that your volunteers need better
>> training on evaluating smartphone software in the manner that you
>> want it done. Perhaps you and your colleagues could provide more
>> explicit instructions on what you’re looking for in measuring the
>> characteristic(s) of the software that you’re trying to measure.
>
> Absolutely true, and is probably the best takeaway of this exercise.

FWIW, I concur.

Rebecca
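A minimal sketch, in -mixed- syntax, of the crossed-effects specification described above. It assumes a long-format dataset with one row per target-rater pair and placeholder variable names rating, target, and rater; these names are illustrative only, not Lenny's actual variables.

* Sketch only: rating, target, and rater are placeholder names for a
* long-format dataset with one row per target-rater pair.

* Crossed (nonnested) random effects for target and rater:
mixed rating || _all: R.target || _all: R.rater, reml

* Absolute-agreement ICC built from the variance components above:
*   var(target) / (var(target) + var(rater) + var(Residual))
* List the parameter names with -mixed, coeflegend- and combine them
* with -nlcom- if a standard error is wanted.

* For comparison, the two-way ICC from the official command
* (see -help icc- for the model and agreement/consistency options):
icc rating target rater

So far as I know, -estat icc- is set up for nested levels, which is why the agreement ICC for the crossed model is assembled from the variance components by hand rather than read off -estat icc-.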