Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Random simulations of flow matrix with constraints
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Random simulations of flow matrix with constraints
Date
Fri, 22 Feb 2013 02:11:45 +0000
For the manual material see http://www.stata.com/help.cgi?quotes i.e.
[U] 18.3.1.
On Fri, Feb 22, 2013 at 2:10 AM, Nick Cox <[email protected]> wrote:
> Local macro references start and end with different single quotation
> marks. The pattern is `macname'. The first is char(96). This is
> explained in the manuals and e.g. at
>
> http://www.stata.com/support/faqs/programming/correct-delimiter-for-local-macros/
>
> I have not tried to understand all your code, but I fear that it will
> be very slow. The major problem is repeated loops, some 50 billion of
> them. As this is interpreted code, you may need Mata instead.
>
> Testing expressions such as
>
> total_out=X_out_clone
>
> for equality will fail because Stata uses == not = for that purpose.
>
> You could get small speed-ups by replacing
>
> gen constrained_out = 0
> replace constrained_out = 1 if total_out=X_out_clone
> gen constrained_in = 0
> replace constrained_in = 1 if total_in=X_in_clone
> gen constrained = max(constrained_out, constrained_in)
>
> with one line
>
> gen constrained = max(total_out == X_out_clone, total_in==X_in_clone)
>
> so that there is no need for the variables -constrained_in- and
> -constrained_out-. In fact that variable -constrained- can be omitted
> too.
>
> Copying -X`j'- to -Y- is also unnecessary.
>
> -if _n == 1- will be faster as -in 1- for a large dataset.
>
> So -- a quick and unchecked hack --
>
> forvalues j = 1/100 {
> gen X`j' = 0
> forvalues i = 1/533361588 {
> sort startstationid
> egen X_out=total(X`j'), by(startstationid)
> sort endstationid
> egen X_in=total(X`j'), by(endstationid)
> generate double u = runiform() if max(total_out ==
> X_out_clone, total_in==X_in_clone) !=1
> sort u
> replace X`j' = X`j' + 1 in 1
> drop X_out X_in X_out_clone X_in_clone u
> }
> }
>
> However, after that it still looks buggy. At the end of the inner
> loop, you drop your -*clone- variables, so that they are not there
> second time round. So this will fail.
>
> What else can we do?
>
> if max(total_out == X_out_clone, total_in==X_in_clone) !=1
>
> looks like
>
> if total_out != X_out_clone & total_in != X_in_clone
>
> which may be faster.
>
> Repeated -egen- calls will be slower than you want. There is a lot of
> interpreted code inside -egen- that is irrelevant to your purpose. You
> can cut out all the -egen- calls by replacing
>
> sort startstationid
> egen X_out=total(X`j'), by(startstationid)
> sort endstationid
> egen X_in=total(X`j'), by(endstationid)
>
> with
>
> bysort startstationid : gen X_out = sum(X`j')
> by startstationid : replace X_out = X_out[_N]
> bysort endstationid : gen X_in = sum(X`j')
> by endstationid : replace X_in = X_in[_N]
>
> That may not look much different, but it will be much faster.
>
> What remains as obvious problems are overall speed and the bug
> mentioned above. I've only looked at this code, not tried it. Above
> all, I didn't read the paper you cited.
>
> Nick
>
> On Fri, Feb 22, 2013 at 1:26 AM, Gordon Lee <[email protected]> wrote:
>
>> I have data on traffic flows amongst a set of stations.
>>
>> What I would like to do on Stata is to create 100 random simulations
>> that preserve the total number of incoming and outgoing links for each
>> station. This is in the footsteps of Roth et al (2011) under "the null
>> model". Link: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0015923
>>
>> In other words, this is a matrix with constrained row sums and column sums.
>>
>> ---My data set (illustrative only)---
>>
>> startstationid endstationid total_out total_in
>> ---------------------------------------------------------------------------
>> A B oA dB
>> A C oA dC
>> A D oA dD
>> B A oB dA
>> B C oB dC
>> B D oB dD
>> C A oC dA
>> C B oC dB
>> C D oC dD
>> D A oD dA
>> D B oD dB
>> D C oD dC
>>
>> ...where total_out is the sum of traffic flows from the startstation,
>> and total_in is the sum of traffic flows into the endstation. All
>> items take on integer values.
>>
>>
>> ****
>> dofile: https://www.dropbox.com/s/c9s4gmqtd6ycmnz/do%20file.txt
>> dataset: https://www.dropbox.com/s/zx9oufzekusnsuj/dataset.dta
>>
>> I tried the following commands, but stata tells me "'invalid name
>> r(198);". I can't seem to be able to fix this.
>>
>> (note: 533361588 is the sum of all flows in the data)
>>
>> forvalues j = 1/100 {
>> gen X'j' = 0
>> forvalues i = 1/533361588 {
>> gen Y = X'j'
>> sort startstationid
>> egen X_out=total(X'j'), by(startstationid)
>> sort endstationid
>> egen X_in=total(X'j'), by(endstationid)
>>
>> gen constrained_out = 0
>> replace constrained_out = 1 if total_out=X_out_clone
>> gen constrained_in = 0
>> replace constrained_in = 1 if total_in=X_in_clone
>>
>> gen constrained = max(constrained_out, constrained_in)
>>
>> generate double u = runiform() if constrained!=1
>> sort u
>>
>> replace X'j' = Y + 1 if _n == 1
>>
>> drop Y X_out X_in X_out_clone X_in_clone constrained_out
>> constrained_in constrained u
>> }
>> }
>>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/