st: Re: using Stata to detect interviewer fraud
From: Mike Lacy <[email protected]>
To: [email protected]
Subject: st: Re: using Stata to detect interviewer fraud
Date: Sat, 01 May 2010 13:10:25 -0600
>Date: Fri, 30 Apr 2010 23:16:14 -0400
>From: "Michelson, Ethan" <[email protected]>
>Subject: st: using Stata to detect interviewer fraud
>
>I'd be deeply grateful for help writing a more efficient, more
>parsimonious .do file to help detect interviewer fraud. After
>completing a survey of 2,500 households, I discovered that a few
>interviewers copied each others' questionnaires. I decided to write
>some code that calculates the proportion of all nonmissing
>questionnaire items that are identical across every other
>questionnaire. Although my .do file accomplishes this task, I
>strongly suspect I'm making Stata do tons of unnecessary work. It
>takes Stata about 12 hours to process 505 questionnaires (from a
>single survey site, since I can rule out the possibility that
>interviewers conspired across different survey sites).....
>
>In the following code, "id" is the unique questionnaire id. There
>are 505 questionnaires in this batch. The final command at the
>bottom asks Stata to list combinations of questionnaires with >80%
>identical content. I have no doubt there's a far more efficient way
>to do this. I'd really appreciate any advice anyone can offer.
... snip, snip
A generalized version of -matrix dissimilarity- would solve this,
since it would return a matrix of matching coefficients between all
pairs of respondents. Unfortunately, the existing command computes
matching coefficients only for binary variables. I recently needed a
replacement of this kind and wrote what is doubtless a clumsy bit of
Mata code. It will do Ethan's problem in 30 seconds or so on my old
Wintel laptop. I'd welcome comments or improvements on the code
below, because this is part of what I need to do in another context,
and because I think a good program to accomplish this end would serve
a larger purpose.
clear all
// Create some simulated questionnaire data to work on.
set obs 505
local nvars = 100 // number of variables
local ncat = 2 // number of response categories for each variable
forval i = 1/`nvars' {
    gen byte q`i' = 1 + trunc(runiform() * `ncat')
}
//
// Mata program that returns a Stata matrix (Respondent X Respondent)
// of the proportion of matches across a list of variables. This is
// essentially a replacement for -matrix dissim-, which can only do
// matching coefficients for binary variables.
//
mata mata clear
mata:
void mat_match(string scalar varlist,     // list of variables across which to match
               string scalar stmatname)   // name of Stata matrix for results
{
    real matrix X, M
    real scalar nsubj, nvar, j, ego, alter

    // tokens() splits the string into a row vector of variable names
    st_view(X=., ., tokens(varlist))
    nsubj = rows(X)
    nvar = cols(X)
    M = J(nsubj, nsubj, 0)
    // Count, for every pair of cases, the variables on which they agree.
    // Note that two missing values compare as equal and so count as a match.
    for (j = 1; j <= nvar; j++) {
        for (ego = 1; ego <= nsubj; ego++) {
            for (alter = 1; alter <= nsubj; alter++) {
                if (X[ego,j] == X[alter,j]) {
                    M[ego,alter] = M[ego,alter] + 1
                }
            }
        }
    }
    M = M/nvar // proportion of all variables that match
    st_matrix(stmatname, M)
}
end
//
//
// Illustrate use: Feed the list of variables created above to
// mat_match, return matrix of matching proportions in Stata
// matrix "M".
quiet unab varlist: q*
mata: mat_match("`varlist'", "M")
//
// Inspect the matching matrix to find excessive matches. This could
// be included in the Mata program, but I only need the matrix. Cases
// here are ID'd by case number, not by a true id number.
clear
svmat M
gen str HighMatch = ""
local toomuch = 0.8
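// Each case matches itself perfectly (the diagonal of M is 1), so a
// case's own column always appears in its HighMatch list.
foreach M of varlist M* {
    quiet replace HighMatch = HighMatch + "`M'" + " " if (`M' > `toomuch')
}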
edit HighMatch
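One possible variation, offered as an untested sketch rather than as part of the code above: Ethan asked for the proportion of nonmissing items that are identical, whereas mat_match divides by the total number of variables and counts two missing values as a match. The function below (mat_match_nm is a hypothetical name) assumes that an item should count for a pair only when both cases have a nonmissing answer, and it compares a whole column at a time instead of looping over pairs.

```stata
mata:
// Hypothetical variant of mat_match. Assumption: the denominator for each
// pair is the number of items nonmissing for BOTH cases, not the total
// number of variables.
void mat_match_nm(string scalar varlist,     // variables across which to match
                  string scalar stmatname)   // name of Stata matrix for results
{
    real matrix    X, M, N, both
    real colvector x, ok
    real scalar    n, j

    st_view(X=., ., tokens(varlist))
    n = rows(X)
    M = J(n, n, 0)                  // number of matching items per pair
    N = J(n, n, 0)                  // number of items answered by both cases
    for (j = 1; j <= cols(X); j++) {
        x    = X[., j]
        ok   = (x :< .)             // 1 if nonmissing, 0 otherwise
        both = ok * ok'             // n x n: 1 where both cases answered item j
        N    = N + both
        // J(n, 1, x') stacks x' into an n x n matrix, so the comparison
        // below is column vector against matrix, which conforms.
        M    = M + (x :== J(n, 1, x')) :* both
    }
    st_matrix(stmatname, M :/ N)    // missing where a pair shares no items
}
end
```

It would be called the same way as mat_match, for example mata: mat_match_nm("`varlist'", "M"), at the point where the questionnaire variables are still in memory, that is, before the -clear- that precedes -svmat-.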
Regards,
Mike Lacy
Dept. of Sociology
Colorado State University
Fort Collins CO 80521