Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Eric Booth <eric.a.booth@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: calculating cumulative values of other observations |
Date | Sat, 7 Apr 2012 16:42:21 -0500 |
<> Or you can do it in one line with the cond() function: ******************! //input example data// clear inp ID str18 (date_start date_end) str4 activity hours 1 20071230 20071230 A 10 1 20071122 20071129 A 11 2 20071120 20071120 A 5 3 20071210 20071210 B 6 3 20071111 20071111 B 24 3 20071130 20071130 F 20 3 20070511 20070511 B 49 4 20070611 20070611 A 50 end **nick's original code bysort ID activity (date_start): gen ph = sum(hours) - hours **ind bysort ID activity (date_start): gen ind = 1 if /// (date_start>=date_end[_n-1] & !mi(date_end[_n-1])) **priorhours bysort ID activity (date_start): gen priorhrs = /// sum(hours) - hours replace priorhrs = 0 if priorhrs>0 & /// (date_start<date_end[_n-1] & !mi(date_end[_n-1])) **with cond() bys ID activity (date_start) : g oneline = /// cond((date_start>=date_end[_n-1] & /// !mi(date_end[_n-1])), sum(hours) - hours, 0, .) assert ph == oneline //check ********************! - Eric __ Eric A. Booth Public Policy Research Institute Texas A&M University ebooth@ppri.tamu.edu +979.845.6754 On Apr 7, 2012, at 4:36 PM, Eric Booth wrote: > <> > > Your new [if] condition is preventing the adapted version of Nick's code from creating the running sum properly. To see how the [if] condition you added is "filtering" the observations that are eligible for the running sum, create an indicator/flag for your [if] conditions and inspect the results: > *** > bysort ID activity (date_start): gen ind = 1 if /// > (date_start>=date_end[_n-1] & !mi(date_end[_n-1])) > *** > > You should be able to see that this stops Stata from including the observations you wanted included in the sum using 'ind'. > > Without seeing your new variables - you could probably break your new code into two steps like: > *** > bysort ID activity (date_start): gen priorhrs = /// > sum(hours) - hours > replace priorhrs = 0 if priorhrs>0 & /// > (date_start<date_end[_n-1] & !mi(date_end[_n-1])) > *** > to get what you need. > > - Eric > __ > Eric A. Booth > Public Policy Research Institute > Texas A&M University > ebooth@ppri.tamu.edu > +979.845.6754 > > On Apr 7, 2012, at 4:12 PM, KOTa wrote: > >> thanks Eric, >> i did adapt your code for real dataset without problems (see my >> previous response) >> >> i am just interested how to do it using Nick's code, which works much >> faster (dataset is pretty large). >> i am probably missing some thing about how "sum" works or about >> boundary conditions >> >> my rewrite of Nick was: >> >> bysort ID activity (date_start): gen priorhrs = sum(hours) - hours if >> (date_start>=date_end[_n-1] & !mi(date_end[_n-1])) >> >> which does not work :/ >> >> regards >> K. >> >> El día 7 de abril de 2012 14:09, Eric Booth <eric.a.booth@gmail.com> escribió: >>> <> >>> >>> On Apr 7, 2012, at 9:42 AM, KOTa wrote: >>> >>>> thanks Erik, Nick >>>> >>>> both ways work on this sample, and i managed to adjust for a real database also. >>>> >>>> few followup question: >>>> >>>> 1. Eric, any specific reason why you convert time into date format? I >>>> used it without conversion and it looks fine. >>>> Also on real data your solution somehow goes into infinite loop after >>>> finishing all calculations. >>>> >>> I don't know how you edited the code, so I don't know how you produced an infinite loop? It certainly doesn't do that with the example I gave you. Also, I created a time/date format version of "date" just out of habit, I find that it makes data management easier. If you don't need/want it, then skip it. >>> >>> >>>> 2. Nick, the solution you proposed with "sum" was the one i tried >>>> initially. The problem is that in real data I have 2 columns for date >>>> (start_date end_date) and i want to compare start_date [_n] >= >>>> end_date[all previous] >>>> Eric's solution works longer, but i could adjust it to work with >>>> different date columns. >>>> your solution is works much faster in case there is one date column, >>>> but i could not figure out how to do the same with 2 different >>>> columns. >>> >>> El día 7 de abril de 2012 04:08, Nick Cox <njcoxstata@gmail.com> escribió: >>>> Why not >>>> bysort id activity (time): gen priorhrs = sum(time) - time >>>> Nick >>> >>> >>> Nick's -sum- solution does work (and is more straightforward than my -replace line) with a slight tweak - changing the first "time" to "date2" (if you create the time/date var from my example). Also, if you are using the same variable names from your original example you'd change "time" to "hours". Adding this to my example works: >>> >>> bysort ID activity (date2): gen njc = sum(hours) - hours >>> >>> >>> Adapting mine and Nick's code to your "real" dataset is another issue completely - if you cannot figure how to adapt our tips to your data/code, then you should post a data & code example/question that is closer to reality. >>> >>> >>> - Eric >>> __ >>> Eric A. Booth >>> Public Policy Research Institute >>> Texas A&M University >>> ebooth@ppri.tamu.edu >>> Office: +979.845.6754 >>> >>> >>> >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/statalist/faq >>> * http://www.ats.ucla.edu/stat/stata/ >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/statalist/faq >> * http://www.ats.ucla.edu/stat/stata/ > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/