Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
From: Rebecca Pope <[email protected]>
To: [email protected]
Date: Sat, 12 Mar 2011 13:00:28 -0600

Many thanks, Nick and Robert. I'm using a combination of your
approaches. Nick is absolutely right that some persons could be eligible
for the full 15 years, but Robert's code handles this situation fine:
it keeps looping for the maximum number of sets in the data,
even if someone is eligible for all 15 years.

The real problem arises when the person was never eligible. In that
situation, Robert's code (pasted in at the end of this reply without
his example data) always assigns a value of 1 to "where1". This has to
do, I think, with how Stata handles a missing (empty) string in
strpos(): if maxspan == "", the empty string is treated as a match at
position 1 of "es". You can test it with:
clear
input patid str12 estring
1 XXXXX-------
2 --XXX---XXXX
3 -XXXXXX-----
4 -XXX-XXX-XXX
5 XXXX-XX-XXXX
6 XXXXXXXXXXXX
7 ------------
end
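
Before running the full code on this test data, it may help to look at the suspected behavior in isolation by calling strpos() with an empty search string directly. This is only a sketch: the result may depend on the Stata version, so no expected output is shown, and the variable names never and pos_empty are made up for illustration.

```stata
* Direct check: what does strpos() return for an empty search string?
display strpos("------------", "")

* Same check on the test data above
* -never- flags patients whose estring contains no X at all (patid 7)
gen byte never = strpos(estring, "X") == 0
gen pos_empty = strpos(estring, "")
list patid estring never pos_empty
```

If pos_empty comes back as 1 for every row, that would be consistent with the explanation given above for the spurious 1s in where1.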
After a couple of slight modifications so that Robert's code will only
produce 1s when the first match of a set of Xs occurs in position 1, I
took a 10% sample of the data and ran both sets of code (Nick's and
Robert's). Robert's code does run substantially faster. I changed the
-replace where`i'- line in Robert's code so that the results preserve
the 0s in where1 & are thus directly comparable to Nick's so everyone
can see how the results compare.
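
The exact edits are not shown in this message, so the following is only one possible way to write such a modification, not necessarily the one actually used. It assumes that es, estring, and maxspan already exist from the first loop of Robert's code (quoted at the end of this message), and it guards explicitly against an empty maxspan.

```stata
* Assumes -es-, -estring- and -maxspan- exist from the first loop
gen smask = subinstr(maxspan, "X", "_", .)
replace es = estring
local i 0
local more 1
while `more' {
    local i = `i' + 1
    * never-eligible rows (empty maxspan) get 0 instead of a spurious 1
    gen where`i' = cond(maxspan == "", 0, strpos(es, maxspan))
    * keep the 0s in where1; in later sets a 0 only means "no further span"
    if `i' > 1 {
        replace where`i' = . if where`i' == 0
    }
    * mask the span just found so that the next pass finds the next one
    replace es = subinstr(es, maxspan, smask, 1) if maxspan != ""
    count if where`i' > 0 & where`i' < .
    local more = r(N)
}
* the last variable created holds no matches; where1 is always kept
if `i' > 1 {
    drop where`i'
}
drop es smask
```

With this version a count such as rownonmiss(where*) would also count the 0s in where1, so any span counter built on the where variables would need the same guard.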
1 = Nick, 2 = Robert (order received, nothing else should be inferred)
<omitted output>
. tabstat where*, statistics( count min max ) columns(statistics)
    variable |         N       min       max
-------------+------------------------------
      where1 |    502964         0       180
      where2 |      1572        29       178
      where3 |        57        74       178
      where4 |        12       142       169
      where5 |         3       152       175
      where6 |         1       173       173
--------------------------------------------
<omitted output>
    variable |         N       min       max
-------------+------------------------------
      where1 |    502964         0       180
      where2 |      1572        29       178
      where3 |        57        74       178
      where4 |        12       142       169
      where5 |         3       152       175
      where6 |         1       173       173
--------------------------------------------
. timer list
1: 184.85 / 1 = 184.8470
2: 36.50 / 1 = 36.5020
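
For anyone who wants to reproduce a timing comparison like this, a minimal sketch using Stata's timer commands follows. The do-file names nick_approach.do and robert_approach.do are hypothetical placeholders for wherever each set of code is saved; the timer numbers follow the 1 = Nick, 2 = Robert convention used above.

```stata
* Reset all timers before starting
timer clear

* Timer 1: first approach (hypothetical do-file name)
timer on 1
do nick_approach
timer off 1

* Timer 2: second approach (hypothetical do-file name)
timer on 2
do robert_approach
timer off 2

* Report elapsed seconds for each timer
timer list
```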
I also merged the sets on pat_id and the where`i' variables to make
sure the values were the same & not just the counts & ranges. The
results are identical.
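
A check of this kind can be sketched as below. The file names nick_results and robert_results are hypothetical, and the sketch assumes each file holds one row per pat_id with variables where1 through where6. Because the where variables are part of the merge key, any patient whose values differ between the two files fails to match, and the assert stops with an error.

```stata
* Hypothetical file names; one row per pat_id in each file
use nick_results, clear

* Match on the id AND on every where variable
merge 1:1 pat_id where1-where6 using robert_results

* Every row must be found in both files, otherwise some value differs
assert _merge == 3
```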
For those who, like Nick, haven't worked with eligibility data like this,
and in case someone who has wonders why I'm counting Xs instead of doing
something "logical" like subtracting the start date from the end date:
these data only have the earliest start date and the last end date, so
that subtraction would hide any gaps in between. I
would expect a full 15 years of insurance coverage only _very_ rarely. It
doesn't happen at all in the sample I used for testing. Here in the
US, people tend to gain and lose insurance with their jobs.
Compounding the issue, the employer could change the company they
contract with several times over the years. If one company isn't
covered by my data, that will cause apparent "gaps" as well. Our
public insurance for the poor has the same problem of "gaps"--people
constantly go in and out of the program with marginal changes in
financial situation or moving across state lines.
Best,
Rebecca
On Sat, Mar 12, 2011 at 8:05 AM, Nick Cox <[email protected]> wrote:
> Interesting point. Clearly fast beats slow whenever nothing else is at issue.
>
> In this case, the data are patients' eligibility for health insurance
> benefits over a period of 15 years. I've never worked with such data
> but it does seem possible that at least some people among 6 million
> might have been eligible for the entire time.
>
> Nick
<omitted text here>
> On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <[email protected]> wrote:
>> * -------------------------- begin example ----------------
>> * Find the longest period of continuous eligibility
>> clonevar es = estring
>> gen maxspan = ""
>> local more 1
>> while `more' {
>>     gen s = regexs(1) if regexm(es,"(X+)")
>>     replace maxspan = s if length(s) > length(maxspan)
>>     replace es = subinstr(es,s,"",1)
>>     count if s != ""
>>     local more = r(N)
>>     drop s
>> }
>>
>>
>> * Identify the start of each span
>> gen smask = subinstr(maxspan,"X","_",.)
>> replace es = estring
>> local i 0
>> local more 1
>> while `more' {
>>     local i = `i' + 1
>>     gen where`i' = strpos(es,maxspan)
>>     replace where`i' = . if where`i' == 0
>>     replace es = subinstr(es,maxspan,smask,1)
>>     count if where`i' != .
>>     local more = r(N)
>> }
>> drop where`i'
>> egen nmaxspan = rownonmiss(where*)
>> drop es smask
>>
>> * -------------------------- end example ------------------