Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Random simulations of flow matrix with constraints |
Date | Fri, 22 Feb 2013 02:10:02 +0000 |
Local macro references start and end with different single quotation marks. The pattern is `macname'. The first is char(96). This is explained in the manuals and e.g. at http://www.stata.com/support/faqs/programming/correct-delimiter-for-local-macros/ I have not tried to understand all your code, but I fear that it will be very slow. The major problem is repeated loops, some 50 billion of them. As this is interpreted code, you may need Mata instead. Testing expressions such as total_out=X_out_clone for equality will fail because Stata uses == not = for that purpose. You could get small speed-ups by replacing gen constrained_out = 0 replace constrained_out = 1 if total_out=X_out_clone gen constrained_in = 0 replace constrained_in = 1 if total_in=X_in_clone gen constrained = max(constrained_out, constrained_in) with one line gen constrained = max(total_out == X_out_clone, total_in==X_in_clone) so that there is no need for the variables -constrained_in- and -constrained_out-. In fact that variable -constrained- can be omitted too. Copying -X`j'- to -Y- is also unnecessary. -if _n == 1- will be faster as -in 1- for a large dataset. So -- a quick and unchecked hack -- forvalues j = 1/100 { gen X`j' = 0 forvalues i = 1/533361588 { sort startstationid egen X_out=total(X`j'), by(startstationid) sort endstationid egen X_in=total(X`j'), by(endstationid) generate double u = runiform() if max(total_out == X_out_clone, total_in==X_in_clone) !=1 sort u replace X`j' = X`j' + 1 in 1 drop X_out X_in X_out_clone X_in_clone u } } However, after that it still looks buggy. At the end of the inner loop, you drop your -*clone- variables, so that they are not there second time round. So this will fail. What else can we do? if max(total_out == X_out_clone, total_in==X_in_clone) !=1 looks like if total_out != X_out_clone & total_in != X_in_clone which may be faster. Repeated -egen- calls will be slower than you want. There is a lot of interpreted code inside -egen- that is irrelevant to your purpose. You can cut out all the -egen- calls by replacing sort startstationid egen X_out=total(X`j'), by(startstationid) sort endstationid egen X_in=total(X`j'), by(endstationid) with bysort startstationid : gen X_out = sum(X`j') by startstationid : replace X_out = X_out[_N] bysort endstationid : gen X_in = sum(X`j') by endstationid : replace X_in = X_in[_N] That may not look much different, but it will be much faster. What remains as obvious problems are overall speed and the bug mentioned above. I've only looked at this code, not tried it. Above all, I didn't read the paper you cited. Nick On Fri, Feb 22, 2013 at 1:26 AM, Gordon Lee <gordon@gordonlee.me> wrote: > I have data on traffic flows amongst a set of stations. > > What I would like to do on Stata is to create 100 random simulations > that preserve the total number of incoming and outgoing links for each > station. This is in the footsteps of Roth et al (2011) under "the null > model". Link: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0015923 > > In other words, this is a matrix with constrained row sums and column sums. > > ---My data set (illustrative only)--- > > startstationid endstationid total_out total_in > --------------------------------------------------------------------------- > A B oA dB > A C oA dC > A D oA dD > B A oB dA > B C oB dC > B D oB dD > C A oC dA > C B oC dB > C D oC dD > D A oD dA > D B oD dB > D C oD dC > > ...where total_out is the sum of traffic flows from the startstation, > and total_in is the sum of traffic flows into the endstation. All > items take on integer values. > > > **** > dofile: https://www.dropbox.com/s/c9s4gmqtd6ycmnz/do%20file.txt > dataset: https://www.dropbox.com/s/zx9oufzekusnsuj/dataset.dta > > I tried the following commands, but stata tells me "'invalid name > r(198);". I can't seem to be able to fix this. > > (note: 533361588 is the sum of all flows in the data) > > forvalues j = 1/100 { > gen X'j' = 0 > forvalues i = 1/533361588 { > gen Y = X'j' > sort startstationid > egen X_out=total(X'j'), by(startstationid) > sort endstationid > egen X_in=total(X'j'), by(endstationid) > > gen constrained_out = 0 > replace constrained_out = 1 if total_out=X_out_clone > gen constrained_in = 0 > replace constrained_in = 1 if total_in=X_in_clone > > gen constrained = max(constrained_out, constrained_in) > > generate double u = runiform() if constrained!=1 > sort u > > replace X'j' = Y + 1 if _n == 1 > > drop Y X_out X_in X_out_clone X_in_clone constrained_out > constrained_in constrained u > } > } > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/