st: Re: Unix stata big dataset
...
I can't really comment on the CPU and memory usage report, but I would guess that
you could save a large fraction of the time for this operation if you told us
more about the joinby you want to do:
1) How many observations are in each of the two files?
2) What type of merge do you need: one-to-one, one-to-many, many-to-one, or
many-to-many? Only the last type needs -joinby-.
3) What proportion of the observations in each file do you expect to match?
Does the large table contain lots of observations you don't need?
4) Are there any variables you don't need in either file that could be dropped
first?
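Several of these questions can be checked from within Stata before attempting the join. The following is only a minimal sketch: the key variable id, the variables a, b and c, and the two toy files standing in for the small and large datasets are all hypothetical.

```stata
* Toy stand-ins for the two files (hypothetical names and variables)
tempfile small big
clear
input id a
1 10
1 11
2 20
end
save `small'

clear
input id b c
1 100 0
1 101 0
2 200 0
3 300 0
end
save `big'

* 1) Observation and variable counts, without loading the data into memory
describe using `big', short

* 2) Is id unique in the smaller file? -isid- returns a nonzero code if not
use `small', clear
capture isid id
local id_is_unique = (_rc == 0)
display "id unique in small file: `id_is_unique'"

* 4) Load only the variables that are needed from the large file
use id b using `big', clear
```

If id turns out to be unique in at least one of the files, an ordinary -merge- is enough and -joinby- is not needed.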
I think the biggest question is -- Are you sure that you need -joinby- rather
than -merge-? Even if you need joinby, you may be able to do this much more
quickly in three steps: first, extract the unique identifiers from the smaller
file; then -merge- them with the large file, using the nokeep option, to grab
only the useful observations; and finally go back to the smaller file and do a
joinby against this subset file.
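The subset-then-join idea can be sketched as follows. This is an illustration under assumptions, not a tested recipe for the actual data: the key variable id, the variables a and b, and the toy files are hypothetical, and the sketch uses the newer -merge 1:m- syntax with keep(match) (Stata 11 and later) as a stand-in for the older nokeep option.

```stata
* Toy stand-ins for the two files (hypothetical names and variables)
tempfile small big ids subset
clear
input id a
1 10
1 11
2 20
end
save `small'

clear
input id b
1 100
1 101
2 200
3 300
4 400
5 500
end
save `big'

* Step 1: unique identifiers from the smaller file
use `small', clear
keep id
duplicates drop
save `ids'

* Step 2: pull only the matching observations out of the large file.
* The id list is the master, so the large file is read from disk
* and only its matching observations are kept.
merge 1:m id using `big', keep(match) nogenerate
save `subset'

* Step 3: the many-to-many join now runs against the reduced file
use `small', clear
joinby id using `subset'
```

Note that keep(match) also drops identifiers from the small file that have no match in the large file, whereas nokeep only drops unmatched observations from the using file. With the default unmatched(none) setting of -joinby- the final result is the same either way.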
Also, do you have enough physical memory and an operating system that can
allocate 2GB+ to Stata for loading the large dataset? If you are using virtual
memory things can be very slow.
If you describe more about the data, there may be other approaches that reduce
the memory requirements and speed the process.
Michael Blasnik
----- Original Message -----
From: <[email protected]>
To: <[email protected]>
Sent: Thursday, November 29, 2007 4:36 PM
Subject: st: Unix stata big dataset
Dear Statalist,
I have a problem using joinby on 2 datasets in Unix. One dataset is about 1.8 GB
and the other about 30 MB. I want to join these two datasets, but in Unix the
process is very slow, and after 4 days I still didn't have a final dataset (two
months ago I joined two datasets of more or less the same size in only 1 day).
I use a do-file in which I write my joinby command.
I looked at the processor on the local machine with the top command, and my
process is in the sleep state. I use batch mode.
11258 franz  1  20  0  0K  0K  cpu/0  35.8H  24.41%  stata
15566 ncd    1  20  0  0K  0K  cpu/1  11:52  23.41%  xstata
16084 ncd    1  60  0  0K  0K  sleep  27:00   0.22%  stata
So, if my process is in the sleep state, does that mean it is not running? There
is another user connected; can he affect my process?
Thanks in advance for your help.
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/