Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: xt: unit-specific trends
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: xt: unit-specific trends
Date: Tue, 24 Apr 2012 10:45:41 -0400
Getting back to this: I must thank Bill for his explanation, clear as always.
Yet I want to point out what I learnt from this: All of us (a)do-file
authors should be careful with by-loops. When we use this device to
loop over a few values, there is no problem. Yet if we use it for some
panel-like setting, it can be "treacherous." If there is no way out of
this but Mata, at least we should be aware that commands like
-egen- should be high on our priority list to rewrite in Mata.
In my experience, people use -egen- to generate (many-many) variables
in a panel, or "worse", leave-out means and the like. There the loops are
definitely on the order of N, which might be a high price in large
panels.
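
As an illustration of the leave-out-mean case, here is a minimal sketch that avoids calling a command once per panel unit. It assumes hypothetical variables id (panel identifier) and y (outcome), built here as toy data, and relies only on -generate- and -replace- under -by-. Whether this is faster than an -egen- based version in a given setting is something to time rather than assume.

```stata
* Toy panel; id and y are hypothetical names used only for illustration
clear
input id y
1 2
1 4
1 6
2 10
2 .
2 20
end

* Group total of y: running sum, then copy the last value to every row
bysort id: generate double sum_y = sum(y)
by id: replace sum_y = sum_y[_N]

* Group count of nonmissing y, built the same way
by id: generate long n_y = sum(!missing(y))
by id: replace n_y = n_y[_N]

* Leave-out mean: mean of y over the other observations in the same unit
generate double loo_mean = (sum_y - y)/(n_y - 1) if !missing(y) & n_y > 1

* Spot check: for id 1 the other two values of y=2 are 4 and 6
assert loo_mean == 5 if id == 1 & y == 2

list, sepby(id)
```

The point of the design is that each statement is one pass over the sorted data, so nothing is evaluated once per unit with an -if- over the whole dataset.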
Laszlo
On Fri, Apr 20, 2012 at 12:30 PM, William Gould, StataCorp LP
<[email protected]> wrote:
>
> Laszlo [email protected] wrote,
>
> > I am just a bit surprised that the "if" checks slow down operations
> > this much. Esp. by-loops. [...]
> > But exactly these are the sorts of trade-offs that you are experts in.
>
> I would like to show Laszlo and the many others who I suspect would
> express the same sentiment that they should not be surprised.
>
> Let's imagine that we want to perform operations on 20 observations
> of a 200,000 observation dataset, the 20 observations selected by
> -if-.
>
> Let's analyze execution time.
>
> As a first approximation, let's assume the time necessary to perform
> a linear operation on a set of observations is
>
> T = t_f + t_o*N
>
> By a linear operation, I mean an operation whose execution time is
> linear in the number of observations. -generate- and -replace- are
> examples of linear operations. -sort- is an example of a non-linear
> operation.
>
> In the above formula, t_f is the time to parse the user's input and
> set up the problem, which is to say, t_f is small. t_o is the time to
> perform the operation on a single observation, which is to say, t_o is
> small, too. Obviously different operations require different amounts
> of time, but this is an approximation, so let's just assume t_o is the
> same across operations. We'll speculate later about the effects of
> the assumption on our results.
>
> We are going to compare the total time it takes to operate on 20
> observations in a 20-observation dataset,
>
> T_0 = t_f + 20*t_o
>
> and the time it takes to operate on 20 observations in a
> 200,000-observation dataset, such as a -generate- statement with an
> additional -if-. The total time for that would be
>
> T_1 = t_f + 20*t_o + 200,000*t_o
>
> For small datasets, it is approximately the case that t_f = t_o*N --
> the time to parse and set up the problem is about equal to performing
> the work of the problem itself. In that case, the equations can be
> rewritten as
>
> T_0 = (20+1)*t_o
>
> T_1 = (20+1)*t_o + 200,000*t_o
>
> The ratio of T_1 to T_0 is then
>
>
>      T_1     (20+1)*t_o + 200,000*t_o
>     ----- = --------------------------
>      T_0            (20+1)*t_o
>
>           = 1 + 200,000/(20+1)
>
>           = (approximately) 9,525
>
>
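
A rough way to check the model above against run times is sketched below. It is not a careful benchmark: the variable names id and x, the sqrt() expression, and the 200 repetitions are assumptions chosen only for illustration, and measured ratios will depend on machine and Stata version.

```stata
* Case T_0: operate on 20 observations in a 20-observation dataset
clear all
set obs 20
generate long id = _n
generate double x = .

timer clear
timer on 1
forvalues r = 1/200 {
    quietly replace x = sqrt(id)
}
timer off 1

* Case T_1: operate on 20 observations selected by -if- out of 200,000
clear
set obs 200000
generate long id = _n
generate double x = .

timer on 2
forvalues r = 1/200 {
    quietly replace x = sqrt(id) if id <= 20
}
timer off 2

* -timer list- displays the timings and leaves them in r(t1) and r(t2)
timer list
display "measured T_1/T_0: " r(t2)/r(t1)

* Ratio implied by the model in the message: 1 + 200,000/(20+1)
display "model    T_1/T_0: " 1 + 200000/(20 + 1)
```

If timer 1 is too short to register, the measured ratio will be missing; raising the repetition count is one way around that.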
> Many of you -- perhaps Laszlo among them -- think that we "experts" at
> StataCorp can achieve results "mere" users cannot. Sometimes,
> however, being an expert is about knowing when to give up. At
> StataCorp, we make calculations like the above and then check run
> times, and that's one way that we determine which problems deserve
> more work.
>
> In the above calculation, we assumed all operations take roughly the
> same time. In particular, in
>
> . generate x = <exp1> if <exp2>
>
> we assumed that <exp1> takes the same amount of time as <exp2>.
> Clearly an <exp2> such as -if `touse'- is lightweight. The ratio
> above might be better written by distinguishing between the execution
> times for <exp1> and <exp2>:
>
>
>      T_1     (20+1)*t_exp1 + 200,000*t_exp2
>     ----- = --------------------------------
>      T_0             (20+1)*t_exp1
>
>           = 1 + 200,000*t_exp2/(21*t_exp1)
>
> Actually, the ratio t_exp2/t_exp1 is probably much closer to 1
> than you expect, at least in interpretive languages like ado.
> Nonetheless, if it pleases you, substitute 1/2 for the ratio and get
> approximately T_1/T_0 = 4,763.
>
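
The effect of the relative cost t_exp2/t_exp1 on the ratio can be tabulated directly from the formula above. This is plain arithmetic; the value 0.1 is an extra illustrative assumption, not a figure from the message.

```stata
* T_1/T_0 = 1 + 200,000*(t_exp2/t_exp1)/21 for a few assumed cost ratios
foreach r of numlist 1 0.5 0.1 {
    display "t_exp2/t_exp1 = " `r' "   T_1/T_0 = " %9.0fc 1 + 200000*`r'/21
}
```

Even when the -if- expression is assumed to be ten times cheaper than the main expression, the formula still gives a ratio in the hundreds.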
> By the way, t_exp1 might be approximately equal to t_exp2 in
> interpretive languages, but in compiled languages like Mata,
> they can be whoppingly different. Had we been analyzing
> run times in compiled languages and you were bothered by the
> assumption that t_exp1 == t_exp2, you would have been right.
>
>
> Laszlo also wrote,
>
> > I would have guessed that the extra cost of not allowing re-sorting
> > would have justified a dramatic speedup of the -by- which is pretty
> > commonly used.
>
> The choice we made in this particular issue is something about which
> reasonable people can disagree. Let me outline our thinking in general.
>
> When we make such decisions, our view of ado-files is that
> ease-of-programming and likelihood-of-correctness trump performance
> in most cases. I am not saying that ado-files perform poorly or that
> it is pure luck that they don't. We work to make them perform well,
> but when there is a tradeoff between speed of execution and ease of
> programming (which includes likelihood of correctness), we usually make
> the decision in favor of ease of programming.
>
> Simultaneously, we provide a second programming language, Mata,
> in which the trade-off is reversed.
>
> That does not mean Mata is better than ado. We at StataCorp write
> lots of ado code. We choose the language according to the problem. In
> some problems, there is little speed difference between Mata and ado
> because of the nature of the problem, so we choose ado. In other
> problems, there is a difference, but the speed really doesn't matter.
> We choose ado. In still other problems, there is a difference in speed
> that does matter, and we choose Mata. There's one more case in which
> we choose Mata, which is when the problem is complex and the
> organizational aspects of Mata such as structures and classes make it
> easy for us to write readable code, meaning the code will require
> less debugging and will be more modifiable in the
> future.
>
> -- Bill
> [email protected]
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/