Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Robert Picard <picard@netbox.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date: Sun, 13 Mar 2011 13:25:07 +0100
Turns out that finding the longest span can be done faster without
string manipulations. Here's a new version:

* -------------------------- begin example ----------------
clear all
input patid str12 estring
1 XXXXX-------
2 --XXX---XXXX
3 -XXXXXX-----
4 -XXX-XXX-XXX
5 XXXX-XX-XXXX
6 X-XX-XX-XXXX
7 X-XXXXX-XXXX
8 X-XXX---XXX-
9 XXXXXXXXXXXX
10 ------------
end

local len = 12

* Find the longest period of continuous eligibility.
gen maxspan = 0
gen currspan = 0
gen isX = 0
qui forvalue i = 1/`len' {
        replace isX = substr(estring,`i',1) == "X"
        replace currspan = currspan + 1 if isX
        replace maxspan = currspan if !isX & ///
                currspan > maxspan
        replace currspan = 0 if !isX
}
replace maxspan = currspan if currspan > maxspan
drop currspan isX

* Identify the start of each span
gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
gen blanks = subinstr(spanX,"X"," ",.)
gen es = estring
local i 0
local more 1
qui while `more' {
        local i = `i' + 1
        gen where`i' = strpos(es,spanX)
        replace where`i' = . if where`i' == 0
        replace es = subinstr(es,spanX,blanks,1)
        count if where`i' != .
        local more = r(N)
}
drop where`i'
replace where1 = . if maxspan == 0
egen nmaxspan = rownonmiss(where*)
drop es blanks spanX

* -------------------------- end example ------------------

On Sun, Mar 13, 2011 at 10:33 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> Thanks for the very detailed report. A few small footnotes
>
> 1. My comment
>
> it does seem possible that at least some people among 6 million might
> have been eligible for the entire time
>
> was definitely not a suggestion that Robert's code assumed otherwise.
> I could see that it was elegantly general enough to cope fine. It was
> a comment that going all the way as my code did might be needed,
> although no doubt my code still tries patterns never found in the
> data, to no good purpose.
>
> 2. It is striking that, to put it directly, an empty string has
> position 1 within another string
>
> . di strpos("--------", "")
> 1
>
> That is certainly something that could bite, as when by accident one
> tries to find this.
>
> 3. My original code could be speeded up a bit by not using a variable
> X but my guess would be that Robert's is still definitely faster.
>
> Nick
>
> On Sat, Mar 12, 2011 at 7:00 PM, Rebecca Pope <rebecca.a.pope@gmail.com> wrote:
>> Many thanks Nick and Robert. I'm using a combination of your
>> approaches. Nick is absolutely right; some persons could be eligible
>> for the full 15 years, but Robert's code handles this situation fine.
>> It will continue to loop for the maximum number of sets in the data,
>> even if someone is eligible for all 15 years.
>>
>> The true problem is when the person was never eligible. In that
>> situation, Robert's code (pasted in at the end of this reply without
>> his example data) always assigns a value of 1 in "where1". This has to
>> do, I think, with how Stata matches missing values when implementing
>> strpos(). If maxspan=="" then that gets treated as a match to pos 1 of
>> "es" when using strpos(). You can test it with:
>>
>> clear
>> input patid str12 estring
>> 1          XXXXX-------
>> 2          --XXX---XXXX
>> 3          -XXXXXX-----
>> 4          -XXX-XXX-XXX
>> 5          XXXX-XX-XXXX
>> 6          XXXXXXXXXXXX
>> 7          ------------
>> end
>>
>> After a couple of slight modifications so that Robert's code will only
>> produce 1s when the first match of a set of Xs occurs in position 1, I
>> took a 10% sample of the data and ran both sets of code (Nick's and
>> Robert's). Robert's code does run substantially faster. I changed the
>> -replace where`i'- line in Robert's code so that the results preserve
>> the 0s in where1 & are thus directly comparable to Nick's so everyone
>> can see how the results compare.
>>
>> 1 = Nick, 2 = Robert (order received, nothing else should be inferred)
>>
>> <omitted output>
>> . tabstat where*, statistics( count min max ) columns(statistics)
>>
>>     variable |         N       min       max
>> -------------+------------------------------
>>       where1 |    502964         0       180
>>       where2 |      1572        29       178
>>       where3 |        57        74       178
>>       where4 |        12       142       169
>>       where5 |         3       152       175
>>       where6 |         1       173       173
>> --------------------------------------------
>>
>> <omitted output>
>>     variable |         N       min       max
>> -------------+------------------------------
>>       where1 |    502964         0       180
>>       where2 |      1572        29       178
>>       where3 |        57        74       178
>>       where4 |        12       142       169
>>       where5 |         3       152       175
>>       where6 |         1       173       173
>> --------------------------------------------
>>
>> . timer list
>>    1:    184.85 /        1 =     184.8470
>>    2:     36.50 /        1 =      36.5020
>>
>> I also merged the sets on pat_id and the where`i' variables to make
>> sure the values were the same & not just the counts & ranges. The
>> results are identical.
>>
>> For those like Nick who haven't worked with eligibility data like this
>> & in case someone who has wonders why I'm counting Xs instead of
>> something "logical" like subtracting the start date from the end date:
>> This data only has the earliest start date and the last end date. I
>> would expect a full 15-year insurance coverage only _very_ rarely. It
>> doesn't happen at all in the sample I used for testing. Here in the
>> US, people tend to gain and lose insurance with their jobs.
>> Compounding the issue, the employer could change the company they
>> contract with several times over the years. If one company isn't
>> covered by my data, that will cause apparent "gaps" as well.
>> Our public insurance for the poor has the same problem of
>> "gaps"--people constantly go in and out of the program with marginal
>> changes in financial situation or moving across state lines.
>>
>> Best,
>> Rebecca
>>
>> On Sat, Mar 12, 2011 at 8:05 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> Interesting point. Clearly fast beats slow whenever nothing else is at issue.
>>>
>>> In this case, the data are patients' eligibility for health insurance
>>> benefits over a period of 15 years. I've never worked with such data
>>> but it does seem possible that at least some people among 6 million
>>> might have been eligible for the entire time.
>>>
>>> Nick
>>
>> <omitted text here>
>>
>>> On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <picard@netbox.com> wrote:
>>>> * -------------------------- begin example ----------------
>>>> * Find the longest period of continuous eligibility
>>>> clonevar es = estring
>>>> gen maxspan = ""
>>>> local more 1
>>>> while `more' {
>>>>         gen s = regexs(1) if regexm(es,"(X+)")
>>>>         replace maxspan = s if length(s) > length(maxspan)
>>>>         replace es = subinstr(es,s,"",1)
>>>>         count if s != ""
>>>>         local more = r(N)
>>>>         drop s
>>>> }
>>>>
>>>>
>>>> * Identify the start of each span
>>>> gen smask = subinstr(maxspan,"X","_",.)
>>>> replace es = estring
>>>> local i 0
>>>> local more 1
>>>> while `more' {
>>>>         local i = `i' + 1
>>>>         gen where`i' = strpos(es,maxspan)
>>>>         replace where`i' = . if where`i' == 0
>>>>         replace es = subinstr(es,maxspan,smask,1)
>>>>         count if where`i' != .
>>>>         local more = r(N)
>>>> }
>>>> drop where`i'
>>>> egen nmaxspan = rownonmiss(where*)
>>>> drop es smask
>>>>
>>>> * -------------------------- end example ------------------

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
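
A footnote on the strpos() point raised in the thread: one way to keep never-eligible patients out of where1 is to skip the search whenever the span string is empty. What follows is only a sketch and is not code from any of the posters. The two-row data set is made up for illustration, and spanX and where1 simply reuse the variable names from the example at the top of this message.

```stata
* Sketch only: toy data, not taken from the thread
clear
input str12 estring maxspan
XXXXX------- 5
------------ 0
end

* spanX is "" when maxspan == 0, because a substring of length 0 is empty
gen spanX = substr("XXXXXXXXXXXX", 1, maxspan)

* Search only where there is something to search for; observations
* that fail the -if- condition are left missing
gen where1 = strpos(estring, spanX) if spanX != ""
replace where1 = . if where1 == 0

list estring maxspan where1
```

For where1 this should have the same effect as the -replace where1 = . if maxspan == 0- line in the new version, with the guard applied at the point of the search rather than afterwards.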