Re: st: using Stata to detect interviewer fraud

From	Tim Wade <[email protected]>
To	[email protected]
Subject	Re: st: using Stata to detect interviewer fraud
Date	Sat, 1 May 2010 12:00:59 -0400

I have no idea whether this would be any faster, but you could save each
observation as a temporary file, then loop over the questionnaires,
read each file in, use -cf- to compare its responses with every other
file, and save the percentage of differing items with -postfile-.
Something like the code below, which is actually quite inefficient:
the loop compares questionnaire 2 with 4 and then 4 with 2 again, which
is unnecessary, but I am sure some more elegant programming could take
care of that. Also, I am not sure how the approach below would restrict
the comparison to non-missing data, but if missing values are all
stored in a similar manner this might not be a problem.

clear

*made up data

input quesid a b c d
1 1 2 3 0
2 1 2 3 0
3 4 4 4 0
4 1 4 5 0
5 2 4 3 0
end
qui desc
local nvars=r(k)-1
qui count
local totques=r(N)

levelsof quesid, local(levels)

*save temporary files one per observation

foreach i in `levels' {
    tempfile file`i'
    preserve
    keep if quesid==`i'
    save "`file`i''"
    restore
}

*set up postfile

tempname testx
tempfile diffs

postfile `testx' quesid1 quesid2 double(pctdiff) numdiffs nvars using `diffs'

*compare each questionnaire with the other

*loop over the actual quesid values, since the tempfiles are named after them
foreach a of local levels {
    foreach b of local levels {
        if `a'~=`b' {
            use "`file`a''", clear
            *-cf- exits with an error when it finds differences, hence -capture-
            capture cf _all using "`file`b''"
            *subtract one because quesid will always differ
            local diffcount=r(Nsum)
            scalar pctdiffs=((`diffcount'-1)/`nvars')*100
            post `testx' (`a') (`b') (pctdiffs) (`diffcount'-1) (`nvars')
            di "differences in ques `a' and ques `b'=" `diffcount'-1
        }
    }
}
postclose `testx'

clear
use `diffs'
list


     +------------------------------------------------+
     | quesid1   quesid2   pctdiff   numdiffs   nvars |
     |------------------------------------------------|
  1. |       1         2         0          0       4 |
  2. |       1         3        75          3       4 |
  3. |       1         4        50          2       4 |
  4. |       1         5        50          2       4 |
  5. |       2         1         0          0       4 |
     |------------------------------------------------|
  6. |       2         3        75          3       4 |
  7. |       2         4        50          2       4 |
  8. |       2         5        50          2       4 |
  9. |       3         1        75          3       4 |
 10. |       3         2        75          3       4 |
     |------------------------------------------------|
 11. |       3         4        50          2       4 |
 12. |       3         5        50          2       4 |
 13. |       4         1        50          2       4 |
 14. |       4         2        50          2       4 |
 15. |       4         3        50          2       4 |
     |------------------------------------------------|
 16. |       4         5        50          2       4 |
 17. |       5         1        50          2       4 |
 18. |       5         2        50          2       4 |
 19. |       5         3        50          2       4 |
 20. |       5         4        50          2       4 |
     +------------------------------------------------+
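
The two caveats mentioned at the top (each pair being compared twice, and missing values not being handled) could be addressed along the following lines. This is only a sketch, not a tested solution for the real survey: it drops -cf- and the temporary files in favor of explicit subscripting within a single dataset, it reports the percentage of identical items rather than the percentage of differences, and the names pairs, pairdiffs, nsame, nboth and pctsame are made up for illustration. It assumes missing answers are stored as Stata missing values.

```stata
clear

* same made-up data as above
input quesid a b c d
1 1 2 3 0
2 1 2 3 0
3 4 4 4 0
4 1 4 5 0
5 2 4 3 0
end

* results file; all names here are illustrative
tempname pairs
tempfile pairdiffs
postfile `pairs' quesid1 quesid2 nsame nboth double(pctsame) using `pairdiffs'

local N = _N
forvalues i = 1/`=`N'-1' {
    * start j above i so that each pair is visited only once
    forvalues j = `=`i'+1'/`N' {
        local nsame = 0
        local nboth = 0
        foreach v of varlist a b c d {
            * count an item only if both questionnaires answered it
            if !missing(`v'[`i'], `v'[`j']) {
                local ++nboth
                if `v'[`i'] == `v'[`j'] {
                    local ++nsame
                }
            }
        }
        * pctsame is missing when the pair shares no answered items
        post `pairs' (quesid[`i']) (quesid[`j']) (`nsame') (`nboth') (100*`nsame'/`nboth')
    }
}
postclose `pairs'

use `pairdiffs', clear
list if pctsame > 80 & pctsame < .
```

On the made-up data this should list only questionnaires 1 and 2, which are identical on all four items. The loops are still interpreted one observation pair at a time, so with several hundred questionnaires and many items this may remain slow; it only removes the duplicate comparisons and the file input/output.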



On Fri, Apr 30, 2010 at 11:16 PM, Michelson, Ethan <[email protected]> wrote:
> I'd be deeply grateful for help writing a more efficient, more parsimonious .do file to help detect interviewer fraud. After completing a survey of 2,500 households, I discovered that a few interviewers copied each other's questionnaires. I decided to write some code that calculates, for each questionnaire, the proportion of all nonmissing items that are identical to those in every other questionnaire. Although my .do file accomplishes this task, I strongly suspect I'm making Stata do tons of unnecessary work. It takes Stata about 12 hours to process 505 questionnaires (from a single survey site, since I can rule out the possibility that interviewers conspired across different survey sites).
>
> In the following code, "id" is the unique questionnaire id. There are 505 questionnaires in this batch. The final command at the bottom asks Stata to list combinations of questionnaires with >80% identical content. I have no doubt there's a far more efficient way to do this. I'd really appreciate any advice anyone can offer.
>
> ********************
> sort id
> gen order=0
> gen add=-1
> replace order=1 if _n==1
> levels id, local(levels)
> foreach l of local levels {
>    gen same_`l'=0
>    gen all_`l'=0
> }
> forv n = 1(1)504 {
>    foreach l of local levels {
>       foreach var of varlist a1* a2* a3* b* d* c1 c12 c23 c34 c44 c55 c67 c77 c88 c100 c107 c116 c126 c136 c144 c155 c165 c176 c185 c195 {
>          quietly replace same_`l'=same_`l'+1 if `var'==`var'[_n+`n']&`var'~=.&id[_n+`n']==`l'
>          quietly replace all_`l'=all_`l'+1 if `var'~=.&`var'[_n+`n']~=.&id[_n+`n']==`l'
>          display "`l' `n'"
>      }
>    }
>    quietly replace order=add if order==1
>    quietly replace add=add-1
>    gsort -order id
>    quietly replace order=1 if _n==1
> }
> foreach l of local levels {
>    gen prop_`l'=same_`l'/all_`l'*100
> }
> foreach l of local levels {
>    list id prop_`l' same_`l' all_`l' if prop_`l'>80&prop_`l'<.
> }
>
> ******************
>
> Ethan Michelson
> Departments of Sociology and East Asian Languages & Cultures, Associate Professor
> Maurer School of Law, Associate Professor of Sociology and Law
> mail address:
> Department of Sociology
> Indiana University
> 744 Ballantine Hall
> 1020 E. Kirkwood Ave.
> Bloomington, IN 47405
> Phone: (812) 856-1521
> Fax: (812) 855-0781
> Email: [email protected]
> URL: http://www.indiana.edu/~emsoc/
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
