Hi Statalist,
I have a question about speeding up a program on a large dataset, if someone can help.
I have household data set up as in the following example:
HHID MEMID MONTH SALARY
481 1 1 2000
481 1 2 2500
481 1 3 2000
481 2 2 4000
482 1 2 7400
482 1 3 3600
482 2 1 5000
482 2 2 5500
482 2 3 2000
483 3 1 7000
483 3 2 7500
483 3 3 8000
In other words, I have monthly salary data on each individual (memid) in each
household (hhid). My dataset has about 60,000 observations.
I have written the following command to sum salary across months for each
individual:
by hhid memid,sort: egen quart_inc=sum(salary)
This works fine, except that I then have to get rid of all the duplicate totals
created for each individual. So I then wrote:
program define dupinc, byable(recall)
syntax [varlist] [if] [in]
marksample touse
duplicates drop `varlist' if `touse',force
end
by hhid memid: dupinc quart_inc
This also works fine, except that it takes forever: it ran for several hours
yesterday! I'm running it on a Centrino 1.7 GHz notebook with 512 MB of RAM.
Is there any way I can speed this up considerably? I also forgot to put -qui-
in front, which might have helped.
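
For what it is worth, here is a minimal sketch of two possible alternatives, run only on the example data above and not timed on the full dataset. It assumes that only one row per individual is needed afterwards. Because quart_inc is constant within each hhid-memid group, keeping the first observation of each group should leave the same rows as -dupinc-, and -collapse- would produce the totals directly.

```stata
* Sketch only: rebuild the example data from the post
clear
input hhid memid month salary
481 1 1 2000
481 1 2 2500
481 1 3 2000
481 2 2 4000
482 1 2 7400
482 1 3 3600
482 2 1 5000
482 2 2 5500
482 2 3 2000
483 3 1 7000
483 3 2 7500
483 3 3 8000
end

* Option A: one row per individual, holding only hhid, memid and the total.
* -preserve-/-restore- is used here only so Option B can start from the
* same data; it is not part of the technique.
preserve
collapse (sum) quart_inc = salary, by(hhid memid)
list, sepby(hhid)
restore

* Option B: keep the other variables and retain the first row per individual.
* sum() is the egen function used in the post; newer Stata versions
* document the same function as total().
bysort hhid memid: egen quart_inc = sum(salary)
by hhid memid: keep if _n == 1
list, sepby(hhid)
```

Option A discards month and salary, while Option B keeps them for whichever observation happens to come first in each group, so the choice depends on what is needed later.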
Hope someone can help, please.
Patricia