Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From	Robert Picard <[email protected]>
To	[email protected]
Subject	Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date	Sun, 13 Mar 2011 13:25:07 +0100

Turns out that finding the longest span can be done faster without
string manipulations. Here's a new version:

* -------------------------- begin example ----------------

clear all
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
6          X-XX-XX-XXXX
7          X-XXXXX-XXXX
8          X-XXX---XXX-
9          XXXXXXXXXXXX
10         ------------
end

local len = 12

* Find the longest period of continuous eligibility.
gen maxspan = 0
gen currspan = 0
gen isX = 0
qui forvalues i = 1/`len' {
	replace isX = substr(estring,`i',1) == "X"
	replace currspan = currspan + 1 if isX
	replace maxspan = currspan if !isX & ///
		currspan > maxspan
	replace currspan = 0 if !isX
}
replace maxspan = currspan if currspan > maxspan
drop currspan isX

* Identify the start of each span
gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
gen blanks = subinstr(spanX,"X"," ",.)
gen es = estring
local i 0
local more 1
qui while `more' {
	local i = `i' + 1
	gen where`i' = strpos(es,spanX)
	replace where`i' = . if where`i' == 0
	replace es = subinstr(es,spanX,blanks,1)
	count if where`i' != .
	local more = r(N)
}
drop where`i'
replace where1 = . if maxspan == 0
egen nmaxspan = rownonmiss(where*)
drop es blanks spanX

* -------------------------- end example ------------------
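
The running-counter idea in the first step can also be written with two -replace- statements per character instead of four. What follows is a sketch of that variant on a three-row toy dataset, not the code posted above, and it has not been timed against it. The toy data and the use of -cond()- and -max()- are assumptions made for illustration.

```stata
* Sketch: longest run of "X" via a running counter (toy data, width 12 assumed)
clear
input str12 estring
"XXXX-XX-XXXX"
"XXXXXXXXXXXX"
"------------"
end

gen maxspan  = 0
gen currspan = 0
forvalues i = 1/12 {
	* extend the current run on an X, otherwise reset it to zero
	quietly replace currspan = cond(substr(estring, `i', 1) == "X", currspan + 1, 0)
	* keep the longest run seen so far
	quietly replace maxspan = max(maxspan, currspan)
}
drop currspan
list estring maxspan
```

Because maxspan is updated at every position, no extra -replace- is needed after the loop to catch a run that ends at the last character.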



On Sun, Mar 13, 2011 at 10:33 AM, Nick Cox <[email protected]> wrote:
> Thanks for the very detailed report. A few small footnotes
>
> 1. My comment
>
> it does seem possible that at least some people among 6 million might
> have been eligible for the entire time
>
> was definitely not a suggestion that Robert's code assumed otherwise.
> I could see that it was elegantly general enough to cope fine. It was
> a comment that going all the way as my code did might be needed,
> although no doubt my code still tries patterns never found in the
> data, to no good purpose.
>
> 2. It is striking that, to put it directly, an empty string has
> position 1 within another string
>
> . di strpos("--------", "")
> 1
>
> That is certainly something that could bite when, by accident, one
> searches for an empty string.
>
> 3. My original code could be speeded up a bit by not using a variable
> X but my guess would be that Robert's is still definitely faster.
>
> Nick
>
> On Sat, Mar 12, 2011 at 7:00 PM, Rebecca Pope <[email protected]> wrote:
>> Many thanks Nick and Robert. I'm using a combination of your
>> approaches. Nick is absolutely right; some persons could be eligible
>> for the full 15 years, but Robert's code handles this situation fine.
>> It will continue to loop for the maximum number of sets in the data,
>> even if someone is eligible for all 15 years.
>>
>> The true problem is when the person was never eligible. In that
>> situation, Robert's code (pasted in at the end of this reply without
>> his example data) always assigns a value of 1 in "where1". This has to
>> do, I think, with how Stata matches missing values when implementing
>> strpos(). If maxspan=="" then that gets treated as a match to position 1 of
>> "es" when using strpos(). You can test it with:
>>
>> clear
>> input patid str12 estring
>> 1          XXXXX-------
>> 2          --XXX---XXXX
>> 3          -XXXXXX-----
>> 4          -XXX-XXX-XXX
>> 5          XXXX-XX-XXXX
>> 6          XXXXXXXXXXXX
>> 7          ------------
>> end
>>
>> After a couple of slight modifications so that Robert's code will only
>> produce 1s when the first match of a set of Xs occurs in position 1, I
>> took a 10% sample of the data and ran both sets of code (Nick's and
>> Robert's). Robert's code does run substantially faster. I changed the
>> -replace where`i'- line in Robert's code so that the results preserve
>> the 0s in where1 & are thus directly comparable to Nick's so everyone
>> can see how the results compare.
>>
>> 1 = Nick, 2 = Robert (order received, nothing else should be inferred)
>>
>> <omitted output>
>> . tabstat where*, statistics( count min max ) columns(statistics)
>>
>>     variable |         N       min       max
>> -------------+------------------------------
>>       where1 |    502964         0       180
>>       where2 |      1572        29       178
>>       where3 |        57        74       178
>>       where4 |        12       142       169
>>       where5 |         3       152       175
>>       where6 |         1       173       173
>> --------------------------------------------
>>
>> <omitted output>
>>     variable |         N       min       max
>> -------------+------------------------------
>>       where1 |    502964         0       180
>>       where2 |      1572        29       178
>>       where3 |        57        74       178
>>       where4 |        12       142       169
>>       where5 |         3       152       175
>>       where6 |         1       173       173
>> --------------------------------------------
>>
>> . timer list
>>    1:    184.85 /        1 =     184.8470
>>    2:     36.50 /        1 =      36.5020
>>
>> I also merged the sets on pat_id and the where`i' variables to make
>> sure the values were the same & not just the counts & ranges. The
>> results are identical.
>>
>> For those like Nick who haven't worked with eligibility data like this
>> & in case someone who has wonders why I'm counting Xs instead of
>> something "logical" like subtracting the start date from the end date:
>> This data only has the earliest start date and the last end date. I
>> would expect a full 15-year insurance coverage only _very_ rarely. It
>> doesn't happen at all in the sample I used for testing. Here in the
>> US, people tend to gain and lose insurance with their jobs.
>> Compounding the issue, the employer could change the company they
>> contract with several times over the years. If one company isn't
>> covered by my data, that will cause apparent "gaps" as well. Our
>> public insurance for the poor has the same problem of "gaps"--people
>> constantly go in and out of the program with marginal changes in
>> financial situation or moving across state lines.
>>
>> Best,
>> Rebecca
>>
>> On Sat, Mar 12, 2011 at 8:05 AM, Nick Cox <[email protected]> wrote:
>>> Interesting point. Clearly fast beats slow whenever nothing else is at issue.
>>>
>>> In this case, the data are patients' eligibility for health insurance
>>> benefits over a period of 15 years. I've never worked with such data
>>> but it does seem possible that at least some people among 6 million
>>> might have been eligible for the entire time.
>>>
>>> Nick
>>
>> <omitted text here>
>>
>>> On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <[email protected]> wrote:
>>>> * -------------------------- begin example ----------------
>>>> * Find the longest period of continuous eligibility
>>>> clonevar es = estring
>>>> gen maxspan = ""
>>>> local more 1
>>>> while `more' {
>>>>         gen s = regexs(1) if regexm(es,"(X+)")
>>>>         replace maxspan = s if length(s) > length(maxspan)
>>>>         replace es = subinstr(es,s,"",1)
>>>>         count if s != ""
>>>>         local more = r(N)
>>>>         drop s
>>>> }
>>>>
>>>>
>>>> * Identify the start of each span
>>>> gen smask = subinstr(maxspan,"X","_",.)
>>>> replace es = estring
>>>> local i 0
>>>> local more 1
>>>> while `more' {
>>>>         local i = `i' + 1
>>>>         gen where`i' = strpos(es,maxspan)
>>>>         replace where`i' = . if where`i' == 0
>>>>         replace es = subinstr(es,maxspan,smask,1)
>>>>         count if where`i' != .
>>>>         local more = r(N)
>>>> }
>>>> drop where`i'
>>>> egen nmaxspan = rownonmiss(where*)
>>>> drop es smask
>>>>
>>>> * -------------------------- end example ------------------
>> *
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
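
One way to sidestep the empty-string match raised in Nick's footnote 2 and in Rebecca's never-eligible case is to search only where there is something to search for. The sketch below assumes a numeric maxspan, as in the new version at the top of this message. The toy data and the `if maxspan > 0` qualifier are illustrative and are not taken from anyone's posted code.

```stata
* Sketch: skip the search entirely when the longest span is empty
clear
input str12 estring maxspan
"XXXXX-------" 5
"-XXX-XXX-XXX" 3
"------------" 0
end

* build the search pattern: maxspan copies of "X" (empty when maxspan is 0)
gen spanX = substr("XXXXXXXXXXXX", 1, maxspan)

* the qualifier leaves where1 missing for never-eligible rows,
* whatever strpos() would return for an empty search string
gen where1 = strpos(estring, spanX) if maxspan > 0
list
```

With the qualifier in place, where1 never depends on how strpos() treats an empty second argument, so no later clean-up of spurious 1s is needed for those rows.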

