Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Random simulations of flow matrix with constraints
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Random simulations of flow matrix with constraints
Date
Fri, 22 Feb 2013 02:10:02 +0000
Local macro references start and end with different single quotation
marks. The pattern is `macname'. The first is char(96). This is
explained in the manuals and e.g. at
http://www.stata.com/support/faqs/programming/correct-delimiter-for-local-macros/
I have not tried to understand all your code, but I fear that it will
be very slow. The major problem is repeated loops, some 50 billion of
them. As this is interpreted code, you may need Mata instead.
Testing expressions such as
total_out=X_out_clone
for equality will fail because Stata uses == not = for that purpose.
You could get small speed-ups by replacing
gen constrained_out = 0
replace constrained_out = 1 if total_out=X_out_clone
gen constrained_in = 0
replace constrained_in = 1 if total_in=X_in_clone
gen constrained = max(constrained_out, constrained_in)
with one line
gen constrained = max(total_out == X_out_clone, total_in==X_in_clone)
so that there is no need for the variables -constrained_in- and
-constrained_out-. In fact that variable -constrained- can be omitted
too.
Copying -X`j'- to -Y- is also unnecessary.
-if _n == 1- will be faster as -in 1- for a large dataset.
So -- a quick and unchecked hack --
forvalues j = 1/100 {
gen X`j' = 0
forvalues i = 1/533361588 {
sort startstationid
egen X_out=total(X`j'), by(startstationid)
sort endstationid
egen X_in=total(X`j'), by(endstationid)
generate double u = runiform() if max(total_out ==
X_out_clone, total_in==X_in_clone) !=1
sort u
replace X`j' = X`j' + 1 in 1
drop X_out X_in X_out_clone X_in_clone u
}
}
However, after that it still looks buggy. At the end of the inner
loop, you drop your -*clone- variables, so that they are not there
second time round. So this will fail.
What else can we do?
if max(total_out == X_out_clone, total_in==X_in_clone) !=1
looks like
if total_out != X_out_clone & total_in != X_in_clone
which may be faster.
Repeated -egen- calls will be slower than you want. There is a lot of
interpreted code inside -egen- that is irrelevant to your purpose. You
can cut out all the -egen- calls by replacing
sort startstationid
egen X_out=total(X`j'), by(startstationid)
sort endstationid
egen X_in=total(X`j'), by(endstationid)
with
bysort startstationid : gen X_out = sum(X`j')
by startstationid : replace X_out = X_out[_N]
bysort endstationid : gen X_in = sum(X`j')
by endstationid : replace X_in = X_in[_N]
That may not look much different, but it will be much faster.
What remains as obvious problems are overall speed and the bug
mentioned above. I've only looked at this code, not tried it. Above
all, I didn't read the paper you cited.
Nick
On Fri, Feb 22, 2013 at 1:26 AM, Gordon Lee <[email protected]> wrote:
> I have data on traffic flows amongst a set of stations.
>
> What I would like to do on Stata is to create 100 random simulations
> that preserve the total number of incoming and outgoing links for each
> station. This is in the footsteps of Roth et al (2011) under "the null
> model". Link: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0015923
>
> In other words, this is a matrix with constrained row sums and column sums.
>
> ---My data set (illustrative only)---
>
> startstationid endstationid total_out total_in
> ---------------------------------------------------------------------------
> A B oA dB
> A C oA dC
> A D oA dD
> B A oB dA
> B C oB dC
> B D oB dD
> C A oC dA
> C B oC dB
> C D oC dD
> D A oD dA
> D B oD dB
> D C oD dC
>
> ...where total_out is the sum of traffic flows from the startstation,
> and total_in is the sum of traffic flows into the endstation. All
> items take on integer values.
>
>
> ****
> dofile: https://www.dropbox.com/s/c9s4gmqtd6ycmnz/do%20file.txt
> dataset: https://www.dropbox.com/s/zx9oufzekusnsuj/dataset.dta
>
> I tried the following commands, but stata tells me "'invalid name
> r(198);". I can't seem to be able to fix this.
>
> (note: 533361588 is the sum of all flows in the data)
>
> forvalues j = 1/100 {
> gen X'j' = 0
> forvalues i = 1/533361588 {
> gen Y = X'j'
> sort startstationid
> egen X_out=total(X'j'), by(startstationid)
> sort endstationid
> egen X_in=total(X'j'), by(endstationid)
>
> gen constrained_out = 0
> replace constrained_out = 1 if total_out=X_out_clone
> gen constrained_in = 0
> replace constrained_in = 1 if total_in=X_in_clone
>
> gen constrained = max(constrained_out, constrained_in)
>
> generate double u = runiform() if constrained!=1
> sort u
>
> replace X'j' = Y + 1 if _n == 1
>
> drop Y X_out X_in X_out_clone X_in_clone constrained_out
> constrained_in constrained u
> }
> }
>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/