Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: using Stata to detect interviewer fraud
From: Tim Wade <[email protected]>
To: [email protected]
Subject: Re: st: using Stata to detect interviewer fraud
Date: Sat, 1 May 2010 12:00:59 -0400
I have no idea if this would be any faster, but you could save each
observation as a temporary file, then loop over the questionnaires,
reading in each file and using -cf- to compare its responses with
every other file, and save the percentage of differing items with
-postfile-. Something like the code below.

Two caveats. First, the loop is really inefficient: it compares
questionnaire 2 with 4 and then again 4 with 2, which is unnecessary,
but I am sure some more elegant programming could take care of it.
Second, I am not sure how the approach below handles the requirement
to compare only non-missing data, but if missing values are all stored
in a similar manner this might not be a problem.
clear
* made-up data
input quesid a b c d
1 1 2 3 0
2 1 2 3 0
3 4 4 4 0
4 1 4 5 0
5 2 4 3 0
end

* number of items compared (every variable except quesid)
qui desc
local nvars=r(k)-1
levelsof quesid, local(levels)

* save temporary files, one per observation
foreach i of local levels {
    tempfile file`i'
    preserve
    keep if quesid==`i'
    save "`file`i''"
    restore
}

* set up postfile
tempname testx
tempfile diffs
postfile `testx' quesid1 quesid2 double(pctdiff) numdiffs nvars using `diffs'

* compare each questionnaire with every other one
* (looping over the actual ids, so they need not run 1, 2, 3, ...)
foreach a of local levels {
    foreach b of local levels {
        if `a'~=`b' {
            use "`file`a''", clear
            * -cf- exits with an error when the files differ, hence -capture-
            capture cf _all using "`file`b''"
            * subtract one because quesid will always differ
            local diffcount=r(Nsum)
            scalar pctdiffs=((`diffcount'-1)/`nvars')*100
            post `testx' (`a') (`b') (pctdiffs) (`diffcount'-1) (`nvars')
            di "differences in ques `a' and ques `b'=" r(Nsum)
        }
    }
}
postclose `testx'
clear
use `diffs'
list
     +------------------------------------------------+
     | quesid1   quesid2   pctdiff   numdiffs   nvars |
     |------------------------------------------------|
  1. |       1         2         0          0       4 |
  2. |       1         3        75          3       4 |
  3. |       1         4        50          2       4 |
  4. |       1         5        50          2       4 |
  5. |       2         1         0          0       4 |
     |------------------------------------------------|
  6. |       2         3        75          3       4 |
  7. |       2         4        50          2       4 |
  8. |       2         5        50          2       4 |
  9. |       3         1        75          3       4 |
 10. |       3         2        75          3       4 |
     |------------------------------------------------|
 11. |       3         4        50          2       4 |
 12. |       3         5        50          2       4 |
 13. |       4         1        50          2       4 |
 14. |       4         2        50          2       4 |
 15. |       4         3        50          2       4 |
     |------------------------------------------------|
 16. |       4         5        50          2       4 |
 17. |       5         1        50          2       4 |
 18. |       5         2        50          2       4 |
 19. |       5         3        50          2       4 |
 20. |       5         4        50          2       4 |
     +------------------------------------------------+
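
The two caveats raised before the code (each pair being compared
twice, and missing values not being excluded) could be handled along
the following lines. This is only a minimal sketch, not a tested
replacement for the code above: it skips -cf- and the temporary files
altogether and compares observations in memory with explicit
subscripts. The two missing values in the data, and the names
memhold, pairs, nsame, nboth and pctsame, are made up for
illustration. I have not timed it on anything like 505 questionnaires.

```stata
clear
* made-up data, with two missing values added for illustration
input quesid a b c d
1 1 2 3 0
2 1 2 3 0
3 4 4 4 0
4 1 4 5 .
5 2 . 3 0
end

* items to compare: every variable except the id
ds quesid, not
local items `r(varlist)'

tempname memhold
tempfile pairs
postfile `memhold' quesid1 quesid2 nsame nboth double(pctsame) using `pairs'

* visit each unordered pair of observations exactly once (j > i)
local N = _N
forvalues i = 1/`=`N'-1' {
    forvalues j = `=`i'+1'/`N' {
        local nsame = 0
        local nboth = 0
        foreach v of local items {
            * count an item only if both questionnaires answered it
            if !missing(`v'[`i'], `v'[`j']) {
                local ++nboth
                if `v'[`i'] == `v'[`j'] local ++nsame
            }
        }
        * pctsame is missing when the pair has no jointly answered items
        post `memhold' (quesid[`i']) (quesid[`j']) (`nsame') (`nboth') (100*`nsame'/`nboth')
    }
}
postclose `memhold'

* pairs with more than 80% identical answers among jointly answered items
use `pairs', clear
list if pctsame > 80 & !missing(pctsame)
```

With the made-up values above, only questionnaires 1 and 2 exceed the
80% threshold. Note that this reports the percentage of identical
items, as in the original question, rather than the percentage of
differences posted by the -cf- version.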
On Fri, Apr 30, 2010 at 11:16 PM, Michelson, Ethan <[email protected]> wrote:
> I'd be deeply grateful for help writing a more efficient, more parsimonious .do file to help detect interviewer fraud. After completing a survey of 2,500 households, I discovered that a few interviewers copied each others' questionnaires. I decided to write some code that calculates the proportion of all nonmissing questionnaire items that are identical across every other questionnaire. Although my .do file accomplishes this task, I strongly suspect I'm making Stata do tons of unnecessary work. It takes Stata about 12 hours to process 505 questionnaires (from a single survey site, since I can rule out the possibility that interviewers conspired across different survey sites).....
>
> In the following code, "id" is the unique questionnaire id. There are 505 questionnaires in this batch. The final command at the bottom asks Stata to list combinations of questionnaires with >80% identical content. I have no doubt there's a far more efficient way to do this. I'd really appreciate any advice anyone can offer.
>
> ********************
> sort id
> gen order=0
> gen add=-1
> replace order=1 if _n==1
> levels id, local(levels)
> foreach l of local levels {
>     gen same_`l'=0
>     gen all_`l'=0
> }
> forv n = 1(1)504 {
>     foreach l of local levels {
>         foreach var of varlist a1* a2* a3* b* d* c1 c12 c23 c34 c44 c55 c67 c77 c88 c100 c107 c116 c126 c136 c144 c155 c165 c176 c185 c195 {
>             quietly replace same_`l'=same_`l'+1 if `var'==`var'[_n+`n']&`var'~=.&id[_n+`n']==`l'
>             quietly replace all_`l'=all_`l'+1 if `var'~=.&`var'[_n+`n']~=.&id[_n+`n']==`l'
>             display "`l' `n'"
>         }
>     }
>     quietly replace order=add if order==1
>     quietly replace add=add-1
>     gsort -order id
>     quietly replace order=1 if _n==1
> }
> foreach l of local levels {
>     gen prop_`l'=same_`l'/all_`l'*100
> }
> foreach l of local levels {
>     list id prop_`l' same_`l' all_`l' if prop_`l'>80&prop_`l'<.
> }
>
> ******************
>
> Ethan Michelson
> Departments of Sociology and East Asian Languages & Cultures, Associate Professor
> Maurer School of Law, Associate Professor of Sociology and Law
> mail address:
> Department of Sociology
> Indiana University
> 744 Ballantine Hall
> 1020 E. Kirkwood Ave.
> Bloomington, IN 47405
> Phone: (812) 856-1521
> Fax: (812) 855-0781
> Email: [email protected]
> URL: http://www.indiana.edu/~emsoc/
>
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>