Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Overriding a loop if 0 observations using tabstat
From: Richard Goldstein <[email protected]>
To: [email protected]
Subject: Re: st: Overriding a loop if 0 observations using tabstat
Date: Thu, 29 Apr 2010 08:32:39 -0400
Vince,
thank you very much for this detailed explanation; since I do a lot of
work on files of very different sizes where almost all the work is data
management, it appears that I might be able to save some time on some of
these multi-hour runs.
Rich
On 4/28/10 5:27 PM, Vince Wiggins, StataCorp wrote:
> Short question
> --------------
>
> In a thread that started out about looping then became about relative speed
> with differing amounts of memory, Richard Goldstein <[email protected]>,
> Robert Picard <[email protected]>, and Kit Baum <[email protected]> have written
> about substantial differences in runtimes when performing many iterations of
> some very simple computations on a very small dataset. They attribute the
> difference in runtimes to differing amounts of memory allocated to Stata.
>
>
> Short answer
> ------------
>
> First, I believe their timings. Second, I am not concerned.
>
> There is nothing to worry about here. These timings are an artifact of a
> toy-size example combined with the way modern computers handle cache memory.
> You will see timing differences of this sort only on small datasets and only
> when doing easy computations that are performed repeatedly -- Robert, Rich,
> and Kit (RRK) are computing means and counts over a 100,000-observation
> dataset comprising a single variable. The only reason they observe any
> measurable time at all is that they are iterating the task 10,000 times.
>
>
> Long question
> -------------
>
> The task that RRK are timing came from Nick Cox <[email protected]>
> (who at the time was merely trying to show that -count- was faster than
> -summarize, meanonly- and that both were faster than -summarize-). Here is
> the do-file.
>
> -------------------------------- BEGIN --- timings.do --- CUT HERE -------
> clear
> set obs 100000
> set seed 2803
> gen y = runiform()
> set rmsg on
>
> qui forval i = 1/10000 {
>         count if y > 0.5
> }
>
> qui forval i = 1/10000 {
>         summarize y if y > 0.5, meanonly
> }
>
> qui forval i = 1/10000 {
>         summarize y if y > 0.5
> }
> -------------------------------- END --- timings.do --- CUT HERE -------
>
> RRK are timing each of the loops -- the count loop, the summarize meanonly
> loop, and the full summarize loop. They are performing the timings with
> different amounts of memory allocated to Stata.
>
> Here are some of the timings reported by RRK (rounded to seconds).
>
> summarize
> memory count meanonly summarize
> ---------------+-------------------------------
> Rich 10 MB | 6 50 56
> Rich 500 MB | 6 99 104
> |
> Robert 10 MB | 11 43 47
> Robert 1000 MB | 17 64 71
> |
> Kit 500 MB | 14 61 72
>
> Kit did not report timings for other memory configurations, but he
> surprisingly noted that reducing the memory allocated to Stata increased
> the runtimes -- the opposite of what Rich and Robert observed.
>
> Robert and Rich note the big differences in some of the timings when different
> amounts of memory are allocated.
>
> All of our intrepid timers are running on MacPro computers, though the
> computers have different numbers of cores and different clock speeds. None of
> that matters; this could happen on any modern computer.
>
>
> Really, really, short answer
> ----------------------------
>
> Computers are weird.
>
> Or, more accurately, computers are weird when you operate at the boundary of
> what will fit in their cache memories.
>
>
> Long answer
> -----------
>
> First, we need to talk a bit about computer architectures. The processor (or
> core) performs very simple tasks, such as add 2 numbers from two of its
> internal registers and put them in another register (or more likely in one of
> the original registers).
>
> core
>
> register 1 +
> register 2 =
> register 3
>
> A simple instruction like this can occur in a single clock cycle and many
> modern processors can execute on the order of 3 billion of these per second.
> MacPro computers use Intel Xeon processors which can run at speeds up to 3.3
> GHz or 3.3 billion instructions per second. At this point Bill Gould might
> draw a hand-crank that could be turned 3 billion times per second, but that is
> well beyond my limited ASCII art abilities.
>
> The numbers, unfortunately, are often not in the registers and the computer
> must fetch them from memory. All modern CPUs have a large cache memory from
> which they can pull numbers very quickly. The cache is not the main memory on
> the computer -- the memory we allocate in Stata -- it is the processor's own
memory for storing data and instructions that it is using repeatedly. On the
> MacPros that RRK are running, each CPU has 8 MB of on-chip cache.
>
> core cache
>
> +------+
> register 1 + <---->| |
> register 2 = <---->| |
> register 3 <---->| |
> | |
> ...
>
> Getting numbers from the cache to the core registers is extremely fast. We
> will ignore the time required, if only because after quite a bit of searching
> I still cannot find the L3 cache latency for the Intel Xeon 3500 processors
> and because with modern pipelining the delays between the core and cache truly
> are minimal.
>
> If the core cannot find the numbers it wants in cache, then it must pull the
> numbers from standard memory into cache. Standard memory is where Stata
> stores your data.
>
> incredibly
> fast slower
> | |
> core | cache | memory
> | |
> V +------+ V +---------------+
> register 1 + <---->| |<---->| |
> register 2 = <---->| |<---->| |
> register 3 <---->| |<---->| |
> | | | |
> ... ...
>
> The problem with pulling something from standard memory is that it is
> relatively slow. The memory on RRK's MacPros runs at 1066 MHz, meaning we can
> access just over 1 billion numbers per second, and that is the best we can
> hope for. It takes over three times as long to pull a number from standard
> memory as it does from cache. So, if the core must wait for a number from
> memory, it will be idle for 2 or 3 cycles. It is as if your computer were only
> one-third as fast while it waits for the cache to be filled with what it needs.
>
> There is one other thing that we need to know. Single numbers are not moved
> from memory to cache -- that would be too inefficient. Instead, a fixed
> sequential number of bytes is moved. This sequence of bytes is called a cache
> line.
> More correctly, a line in cache represents a fixed number of sequential bytes
> in memory and there are a fixed number of cache lines. The last time I looked
> seriously at cache management, cache lines were 128 or 256 bytes and that
> number had been growing. Having just now stared pretty hard at the 550 page
> Intel® Xeon® Processor C5500/C3500 Series Datasheet I am still not sure what
> the size of a cache line is on RRK's computers. There is some discussion
> about 64 bytes for the QPI cache lines, but those are for intercore
> communication, though they might also be used for cache management.
>
> I am inclined to think the cache lines are at least 128 bytes. That is the
> smallest sequential chunk of memory that can be moved into cache. With 8 MB of
> cache, that would imply 8*2^20/128 = 65,536 cache lines. We will need that
> number later.
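> That cache-line arithmetic is easy to verify (the 128-byte line size is, as
> noted, my assumption):

```python
# Number of cache lines in an 8 MB cache, assuming 128-byte lines.
cache_bytes = 8 * 2**20  # 8 MB of on-chip cache
line_bytes = 128         # assumed cache-line size

n_lines = cache_bytes // line_bytes
print(n_lines)  # 65536
```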
>
> None of this discussion changes for multi-core CPUs.
>
> What does that have to do with RRK's timings? RRK's dataset is small and
> unusual. It is a single 4-byte variable, y, with 100,000 observations. It
> takes just 0.4 MB to store y itself. Stata is going to add twice that amount
> for its own purposes, but that is still only 1.2 MB for the entire
> dataset. We can think of the data as being stored in a rectangular memory
> area with 8+4=12 byte records.
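> The footprint works out like this (the doubling for Stata's own overhead is
> the rule of thumb from the paragraph above):

```python
# Footprint of RRK's dataset: one 4-byte variable over 100,000
# observations, with an assumed 8 bytes of per-record overhead.
obs = 100_000
y_bytes = 4 * obs                 # 400,000 bytes, roughly 0.4 MB
record_bytes = 8 + 4              # 12-byte records including overhead
total_bytes = record_bytes * obs  # 1,200,000 bytes, roughly 1.2 MB

print(round(y_bytes / 2**20, 2))      # 0.38
print(round(total_bytes / 2**20, 2))  # 1.14
```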
>
> Stata y
> +------------+
> |1235678|1234|
> |1235678|1234|
> |1235678|1234|
> | ... | ...|
>
> Stata datasets usually are not stored this densely. Normally, there would be
> free space at the end of each record where more variables can be added.
>
> Stata y free space
> +-----------------------------
> |1235678|1234| ...
> |1235678|1234|
> |1235678|1234|
> | ... | ...| ...
>
> Moreover, you are likely to have even more free space at the end of each
> record if you have allocated more memory to Stata. This lets Stata add and
> drop variables quickly.
>
> So, with 10 MB allocated, RRK's data might look like
>
> Stata y free space
> +---------------------+
> |1235678|1234|12345678|
> |1235678|1234|12345678| (1)
> |1235678|1234|12345678|
> | ... | ...| ... |
>
> And, with 1000 MB allocated, it might look like
>
> Stata y free space
> +-------------------------------------
> |1235678|1234|123456789... ... ...
> |1235678|1234|123456789... ... ... (2)
> |1235678|1234|123456789... ... ...
> | ... | ...| ...
>
> With the dataset organized as in (1), each record is 20 characters wide,
> including free space, and so there is enough room to store all of the data,
> including free space in the cache. With the dataset organized as in (2), that
> might not be true. Since we have 100,000 records and 8 MB of cache, if the
> records are wider than 8*2^20/100000 = 83.9 characters, then the entire data
> area will not fit into cache.
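> A quick check of that threshold (8 MB of cache spread over 100,000 records):

```python
# Widest record, in bytes, for which 100,000 records still fit in
# an 8 MB cache.
cache_bytes = 8 * 2**20
obs = 100_000

max_record_bytes = cache_bytes / obs
print(round(max_record_bytes, 1))  # 83.9
```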
>
> We could think about it differently and assume that each record is longer than
> a cache line. We would then need 100,000 cache lines to get all of the y
> values into cache, but we computed earlier that these CPUs are likely to have
> only 65,536 cache lines (or maybe half that). Even if they had more, many
> lines would be dedicated to the instructions required to run Stata and many
> others to the data and code required by the operating system.
>
> Rich's and Robert's runs with 10 MB allocated produced a data organization
> that fit entirely into cache. Their runs with 500 MB and 1000 MB did not.
>
> This task was generated nearly at random, but it would be difficult to
> construct a task better suited to confound the cache management of many modern
> computers. The organization of this dataset (1 x 100,000) and the fact that the
> commands spend little time processing each record are tailor-made to expose the
> cache issues discussed above. On my Linux computer with an Intel Core2 Quad
> 9550 CPU and 6 MB of cache, reducing the observations to 30,000 means that I
> have enough cache lines, and I therefore get consistent timings regardless of
> how large I set memory.
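> The same line-count arithmetic shows why 30,000 observations are safe on that
> machine (again assuming 128-byte lines):

```python
# With a 6 MB cache and assumed 128-byte lines, is there one cache
# line available per record at 30,000 observations?
cache_bytes = 6 * 2**20
line_bytes = 128
obs = 30_000

n_lines = cache_bytes // line_bytes
print(n_lines)         # 49152
print(n_lines >= obs)  # True
```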
>
> Also, do not think that you would be better off if Stata kept the data
> organized densely. That would make this one unusual problem sometimes faster
> with large memory allocation, but in general would slow Stata down on other
> operations far more than this small increase.
>
> To elaborate on my earlier comment that "Computers are weird", it applies
> primarily to cases where you are operating on the boundary of what can and
> cannot fit into cache. With Stata, that is particularly true when we are
> talking about the dataset fitting into cache. Most of the time the data for
> whatever you are doing in Stata is in standard memory. One or a few
> observations are pulled into cache where substantial computation is done on
> each record -- such as forming a cross product or evaluating a likelihood.
> When we do more serious computations on each observation (rather than just
> summing or counting) the effect of not having the entire dataset in cache is
> mitigated.
>
> It is unusual to have a problem where ALL of the data for the task can fit
> into cache and you care how long it takes to run your task on this small
> dataset. This might be true if you are bootstrapping or simulating using a
> small dataset. If you have a problem like this, you may be able to speed it
> up by tuning the amount of memory allocated to Stata.
>
> 1) Drop all variables and observations that are not part of
> your analysis
>
> 2) Save the resulting dataset
>
> 3) Set memory to the minimum amount to hold your data and
> perform your analysis. This may require some trial and
> error.
>
> 4) Perform your analysis using the smaller dataset.
>
> These steps ensure that Stata has no more free space at the end of
> observations than is needed for any temporary variables used to perform
> your analysis.
>
> You might benefit from this recommendation if you regularly work with datasets
> that have only one or two variables and you are doing fast commands (say up to
> and including linear regression). Otherwise, this is rarely worth the trouble.
>
>
> -- Vince
> [email protected]
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/