Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Random simulations of flow matrix with constraints

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: Random simulations of flow matrix with constraints
Date	Fri, 22 Feb 2013 02:11:45 +0000

For the manual material see http://www.stata.com/help.cgi?quotes i.e.
[U] 18.3.1.

On Fri, Feb 22, 2013 at 2:10 AM, Nick Cox <[email protected]> wrote:
> Local macro references start and end with different single quotation
> marks. The pattern is `macname'. The first is char(96). This is
> explained in the manuals and e.g. at
>
> http://www.stata.com/support/faqs/programming/correct-delimiter-for-local-macros/
>
> I have not tried to understand all your code, but I fear that it will
> be very slow. The major problem is repeated loops, some 50 billion of
> them. As this is interpreted code, you may need Mata instead.
>
> Testing expressions such as
>
> total_out=X_out_clone
>
> for equality will fail because Stata uses == not = for that purpose.
>
> You could get small speed-ups by replacing
>
> gen constrained_out = 0
> replace constrained_out = 1 if total_out=X_out_clone
> gen constrained_in = 0
> replace constrained_in = 1 if total_in=X_in_clone
> gen constrained = max(constrained_out, constrained_in)
>
> with one line
>
> gen constrained = max(total_out == X_out_clone, total_in==X_in_clone)
>
> so that there is no need for the variables -constrained_in- and
> -constrained_out-. In fact that variable -constrained- can be omitted
> too.
>
> Copying -X`j'- to -Y- is also unnecessary.
>
> -if _n == 1- will be faster as -in 1- for a large dataset.
>
> So -- a quick and unchecked hack --
>
> forvalues j = 1/100 {
>        gen X`j' = 0
>        forvalues i = 1/533361588 {
>              sort startstationid
>              egen X_out=total(X`j'), by(startstationid)
>              sort endstationid
>              egen X_in=total(X`j'), by(endstationid)
>              generate double u = runiform() if max(total_out ==
> X_out_clone, total_in==X_in_clone) !=1
>              sort u
>              replace X`j' = X`j' + 1 in 1
>              drop X_out X_in X_out_clone X_in_clone u
> }
> }
>
> However, after that it still looks buggy. At the end of the inner
> loop, you drop your -*clone- variables, so that they are not there
> second time round. So this will fail.
>
> What else can we do?
>
> if max(total_out == X_out_clone, total_in==X_in_clone) !=1
>
> looks like
>
> if total_out != X_out_clone & total_in != X_in_clone
>
> which may be faster.
>
> Repeated -egen- calls will be slower than you want. There is a lot of
> interpreted code inside -egen- that is irrelevant to your purpose. You
> can cut out all the -egen- calls by replacing
>
> sort startstationid
> egen X_out=total(X`j'), by(startstationid)
> sort endstationid
> egen X_in=total(X`j'), by(endstationid)
>
> with
>
> bysort startstationid : gen X_out = sum(X`j')
> by startstationid : replace X_out = X_out[_N]
> bysort endstationid : gen X_in = sum(X`j')
> by endstationid : replace X_in = X_in[_N]
>
> That may not look much different, but it will be much faster.
>
> What remains as obvious problems are overall speed and the bug
> mentioned above. I've only looked at this code, not tried it. Above
> all, I didn't read the paper you cited.
>
> Nick
>
> On Fri, Feb 22, 2013 at 1:26 AM, Gordon Lee <[email protected]> wrote:
>
>> I have data on traffic flows amongst a set of stations.
>>
>> What I would like to do on Stata is to create 100 random simulations
>> that preserve the total number of incoming and outgoing links for each
>> station. This is in the footsteps of Roth et al (2011) under "the null
>> model". Link: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0015923
>>
>> In other words, this is a matrix with constrained row sums and column sums.
>>
>> ---My data set (illustrative only)---
>>
>> startstationid     endstationid     total_out     total_in
>> ---------------------------------------------------------------------------
>> A                     B                     oA              dB
>> A                     C                     oA              dC
>> A                     D                     oA              dD
>> B                     A                     oB              dA
>> B                     C                     oB              dC
>> B                     D                     oB              dD
>> C                     A                     oC              dA
>> C                     B                     oC              dB
>> C                     D                     oC              dD
>> D                     A                     oD              dA
>> D                     B                     oD              dB
>> D                     C                     oD              dC
>>
>> ...where total_out is the sum of traffic flows from the startstation,
>> and total_in is the sum of traffic flows into the endstation. All
>> items take on integer values.
>>
>>
>> ****
>> dofile: https://www.dropbox.com/s/c9s4gmqtd6ycmnz/do%20file.txt
>> dataset: https://www.dropbox.com/s/zx9oufzekusnsuj/dataset.dta
>>
>> I tried the following commands, but stata tells me "'invalid name
>> r(198);". I can't seem to be able to fix this.
>>
>> (note: 533361588 is the sum of all flows in the data)
>>
>> forvalues j = 1/100 {
>> gen X'j' = 0
>> forvalues i = 1/533361588 {
>> gen Y = X'j'
>> sort startstationid
>> egen X_out=total(X'j'), by(startstationid)
>> sort endstationid
>> egen X_in=total(X'j'), by(endstationid)
>>
>> gen constrained_out = 0
>> replace constrained_out = 1 if total_out=X_out_clone
>> gen constrained_in = 0
>> replace constrained_in = 1 if total_in=X_in_clone
>>
>> gen constrained = max(constrained_out, constrained_in)
>>
>> generate double u = runiform() if constrained!=1
>> sort u
>>
>> replace X'j' = Y + 1 if _n == 1
>>
>> drop Y X_out X_in X_out_clone X_in_clone constrained_out
>> constrained_in constrained u
>> }
>> }
>>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: Random simulations of flow matrix with constraints
  - From: Gordon Lee <[email protected]>
- Re: st: Random simulations of flow matrix with constraints
  - From: Nick Cox <[email protected]>

Prev by Date: Re: st: Random simulations of flow matrix with constraints
Next by Date: RE: st: RE: How to perform a Wald test between the fixed effect...
Previous by thread: Re: st: Random simulations of flow matrix with constraints
Index(es):
- Date
- Thread