
Re: st: reliability with -icc- and -estat icc-


From: Rebecca Pope <[email protected]>
To: [email protected]
Subject: Re: st: reliability with -icc- and -estat icc-
Date: Wed, 27 Feb 2013 10:57:53 -0600

Sorry guys, was offline most of yesterday while this discussion was occurring.

Jay, I don't know which Rabe-Hesketh & Skrondal text you are referring
to, but if it is MLM Using Stata (2012) you'll want Ch 9. I just
pulled it down off my shelf and flipped through. They have a brief
discussion of Lenny's problem at the very end (exclusive of the
"crazy" rater who as Nick points out may just be the only sane one in
the bunch).

In the interest of full disclosure, like Nick, I'm no specialist in
psychometrics. The following is as much for my edification as to add to
the group discussion. Joseph has used two random effects rather than
one (leaving aside the whole target/rater issue for now). This
corresponds to crossed effects (advised by R-H & S, so I should have
read the book yesterday instead of adapting UCLA's code to match
-icc-) and will reduce the ICC. This differs by design from what is
implemented with -icc-, which treats the target as fixed, as does the
code I posted originally. In short, Jay, while _all: R isn't wrong, my
use of fixed effects for part of the model was. Does that sum it up
appropriately?
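
To make sure we're comparing like with like, here is a rough sketch of
the crossed-effects fit I have in mind. The variable names (rating,
target, rater) are made up for illustration, not Lenny's actual ones,
and the lns* parameter names assume -mixed-'s default
log-standard-deviation metric and the order the random-effects
equations appear in the output, so check them against e(b) before
trusting the number.

* crossed random effects for target and rater
* (data in long form, one row per target-rater pair)
mixed rating || _all: R.target || rater:, reml

* absolute-agreement ICC from the variance components:
* var(target) / (var(target) + var(rater) + var(residual))
nlcom (icc: exp(2*_b[lns1_1_1:_cons]) /                            ///
            (exp(2*_b[lns1_1_1:_cons]) + exp(2*_b[lns2_1_1:_cons])  ///
             + exp(2*_b[lnsig_e:_cons])))

Dropping the rater variance from the denominator would give the
consistency-type version instead.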

On Wed, Feb 27, 2013 at 10:06 AM, <[email protected]> wrote:
"Just by inspection, raters are not reliable--if your sample is
representative, then a quarter of the population of raters disagrees
dramatically from the rest..."

If dramatic disagreement is a mark of unreliable raters, what does
that say for elections, user reviews, or for that matter faculty
search committees? Don't take this to mean I don't understand the
general concept that you want raters to concur. However, we're talking
about smartphone apps, not e.g. radiology where there is a "true"
answer (yes/no lung cancer). Hypothetically at least, you could
legitimately have strong and divergent opinions.

I would argue here that the issue of rater reliability is not an issue
of disagreement but rather Rater 2's utter inability to distinguish
between applications. Now, perhaps this means that the applications
really aren't substantively different from each other and you found 3
people who just wanted to accommodate you by completing the ranking
task and Rater 2 happened to be honest. Who knows. I'd say it's
unlikely, but I've seen some pretty unlikely things happen...

On Wed, Feb 27, 2013 at 9:33 AM, JVerkuilen (Gmail)
<[email protected]> wrote:
> I think my interpretation is that Rater 4 is an outlier. The question
> is what does that outlier status mean? It's very clear that she's
> driving most of the ICC estimate so the notion of a good ICC estimate
> from these data themselves is suspect. This is a useful study outcome.

Useful, if not pleasant. I share Joseph's objections to dropping Rater 4 though.

>> My take on all that would be that your volunteers need better
>> training on evaluating smartphone software in the manner that you
>> want it done.  Perhaps you and your colleagues could provide more
>> explicit instructions on what you’re looking for in measuring the
>> characteristic(s) of the software that you’re trying to measure.
>
> Absolutely true, and is probably the best takeaway of this exercise.

FWIW, I concur.

Rebecca


