Re: st: RE: RE: joinby command and memory issues
From: Eric Booth <[email protected]>
To: "<[email protected]>" <[email protected]>
Subject: Re: st: RE: RE: joinby command and memory issues
Date: Fri, 8 Oct 2010 20:33:14 +0000
On Oct 8, 2010, Weichle, Thomas wrote:
> Maybe this is beyond the capabilities of
> Stata.
No, Stata can easily handle a join/merge of this size; it is only limited by your system's memory.
If you've dropped all the variables you can drop, -compress-ed everything, and set your memory allocation to its maximum (see this thread for some tips on maxing out memory: http://www.stata.com/statalist/archive/2009-07/msg00899.html ), then the most likely solution is to get a system with more memory.
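For example, here's a minimal sketch of those pre-join steps (the variable names are placeholders for your own, the 2g allocation is hypothetical, and -set memory- applies to Stata 11 and earlier, which manage memory manually):
****************!
**raise the memory allocation first (Stata 11 and earlier; needs no data in memory)**
clear
set memory 2g             // hypothetical amount -- use whatever your system allows
**then trim and compress the datasets before the join**
u "epo.dta", clear
keep study_id rec_date    // keep only the join key and the receipt date (placeholder names)
compress                  // store each variable in its smallest possible type
sa "epo.dta", replace
****************!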
> Any other suggestions would be welcome. Would doing a
> many-to-many merge result in the same dataset (all pairwise
> combinations)?
No, here's an example of how -joinby- and -merge- would merge your datasets (as you've described them):
****************!
clear
**create master (hgb0209)**
inp study_id str11(ord_date) result
1 "01/02/2009" 1
2 "01/02/2009" 0
2 "01/04/2009" 0
3 "01/05/2009" 2
3 "01/06/2009" 1
3 "01/07/2009" 1
3 "01/08/2009" 0
4 "01/02/2009" 1
5 "01/01/2009" 1
6 "01/07/2009" 0
7 "01/07/2009" 0
end
g ord_date2 = date(ord_date, "MDY")
format ord_date2 %td
drop ord_date
unique study_id  // -unique- is a user-written command (ssc install unique)
sa hgb.dta, replace
**create using (epo0209)**
clear
inp study_id str11(rec_date)
1 "01/02/2009"
1 "01/04/2009"
2 "01/05/2009"
2 "01/06/2009"
3 "02/24/2009"
3 "01/25/2009"
4 "01/12/2009"
5 "01/05/2009"
5 "01/10/2009"
98 "01/20/2009"
99 "01/20/2009"
100 "01/20/2009"
end
unique study_id
g rec_date2 = date(rec_date, "MDY")
format rec_date2 %td
drop rec_date
compress
sa epo.dta, replace
**joinby v. merge m:m**
u hgb.dta, clear
joinby study_id using "epo.dta", unmatched(none) _merge(_merge)
ta _merge
g datediff = rec_date2 - ord_date2
sort stu
l
u hgb.dta, clear
merge m:m study_id using "epo.dta"
ta _m
g datediff = rec_date2 - ord_date2
sort stu
l
****************!
Notice how study_ids 2 and 3 are treated by -joinby- vs. -merge m:m-. My guess is that you want to use -joinby-.
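For instance, if I've read the example data right, study_id 2 has two observations in each file, and the two commands pair them differently; a quick way to see this is to -count- the rows after each run:
****************!
**sketch: compare row counts for study_id 2**
u hgb.dta, clear
joinby study_id using "epo.dta", unmatched(none)
count if study_id == 2    // 4 rows: all 2 x 2 pairwise combinations of dates
u hgb.dta, clear
merge m:m study_id using "epo.dta"
count if study_id == 2    // 2 rows: observations are paired off in the order they appear
****************!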
Finally, judging by my system performance (e.g., Task Manager in Windows or Activity Monitor in OS X) during a -joinby-, Stata appears to take a lot of overhead (nearly double) during this process. I think it loads the "using" dataset into memory in addition to the already open "master" dataset, and that's when you get the r(910) error -- others can correct me if I'm wrong about what Stata is doing during a -joinby-.
So, if you've got enough memory to open a joined hgb & epo dataset, but not enough memory to open hgb and then join epo to it, then a work-around *might* be to
(1) break hgb and epo up into sub-sections based on study_id
(2) join these smaller sections by study_id, and then
(3) append the joined datasets.
E.g.,
***************!
*** continued from example data above ***
**create new empty MASTER**
clear
sa "MASTER.dta", replace emptyok
**this is your rangelist of study_ids
**so your first range is 0 to 2
**next is 3 to 4, and then 5 to 100
local rangelist 0 2 3 4 5 100
token "`rangelist'"
while "`1'" != "" {
di "Range: `1' -- `2'"
foreach d in hgb epo {
**create subset data**
u "`d'.dta", clear
keep if inrange(study_id, `1', `2')
ta study_id
sa "`d'_`1'_`2'.dta", replace emptyok
}
**joinby**
u hgb_`1'_`2'.dta, clear
joinby study_id using "epo_`1'_`2'.dta", unmatched(none) _merge(_merge)
ta _merge
sort stu
drop _merge
sa "joined_`1'_`2'.dta", replace
**append to MASTER**
u "MASTER.dta", clear
append using "joined_`1'_`2'.dta"
sa "MASTER.dta", replace
mac shift 2
}
u MASTER.dta, clear
g datediff = rec_date2 - ord_date2
l
**this matches the -joinby- above**
***************!
You'll want to think about how to select subsets of study_ids from your datasets...if there are lots of study_ids, then it's probably better to create your ranges using -numlist- or a -forvalues- loop. Again, this will only work if your system has enough memory to open a completely joined version of your dataset (plus overhead).
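For example, here's a minimal sketch of building the rangelist with a -forvalues- loop instead of typing it by hand (the maximum study_id and the subset width are placeholders you would set for your own data):
***************!
local maxid 100000                    // placeholder: largest study_id in your data
local width 1000                      // placeholder: how many ids per subset
local rangelist ""
forvalues lo = 1(`width')`maxid' {
    local hi = min(`lo' + `width' - 1, `maxid')
    local rangelist `rangelist' `lo' `hi'
}
di "`rangelist'"                      // feed this into the -tokenize-/-while- loop above
***************!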
- Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754
Fax: +979.845.0249
http://ppri.tamu.edu
On Oct 8, 2010, at 12:02 PM, Weichle, Thomas wrote:
> Sorry for the confusion. I was successful in dropping the other
> variables in epo.dta resulting in 2 variables that remain (study_id,
> receipt date). Then, I tried to join epo.dta with hgb.dta (which is in
> memory) containing 3 variables (study_id, test date, test result).
> After trying the -joinby- command, I still received the error
> saying that there is no room to add more observations.
>
> --- From Nick Cox ---
> This is unclear to me:
>
> 1. What does "try to drop" mean? Did -drop- work? If not, what error
> messages did you get? Or why say "try"?
>
> 2. This description doesn't seem to match your earlier one just a few
> minutes ago. Here epo.dta has 2 variables. A few minutes ago it included
> other variables you didn't want and wanted to know how to ignore.
>
> I don't think people can easily give good advice if the precise problem
> is not clear.
>
> As in an earlier thread today, the limits of Stata are given by -help
> limits-. Almost always, the memory available on your machine is what
> really bites.
>
> Nick
> [email protected]
>
>
> Tom Weichle
> Math Statistician
> Center for Management of Complex Chronic Care (CMC3)
> Hines VA Hospital, Bldg 1, C202
> 708-202-8387 ext. 24261
> [email protected]
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/