[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: RE: re: statsby slowness

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	st: RE: re: statsby slowness
Date	Mon, 20 Aug 2007 09:46:05 +0100

Interesting. You may get a bit more speed if 
you replace this 

egen rank_1 = rank(expression), by(ssrownum)
egen rank_2 = rank(iso_VSV), by(ssrownum)
egen corr = corr(rank_1 rank_2), by(ssrownum)

by this: 

sort ssrownum
by ssrowsum : egen rank_1 = rank(expression)
by ssrowsum : egen rank_2 = rank(iso_VSV)
by ssrowsum : egen corr = corr(rank_1 rank_2)

The two code segments are equivalent in what 
you end with, but not in when they -sort-. 

SImilarly 

keep if _n >= `start' & _n <= `stop'

should be faster as

keep in `start'/`stop'

and I would always use the built-in -sqrt()- 
when it applies, rather than powering to 0.5. 

Nick 
[email protected] 

David Airey
 
> Nick Cox asked how egen rank and egen corr compares to using 
> spearman  
> in the context of statsby. It turns out to be at least 10 times  
> faster. Thanks Nick.
> 
> 
> // Break file into subfiles of 500 "by groups" (genes),
> // because Stata "statsby" command is so slow across 22K by groups,
> // then calculate and save out spearman correlation stats
> // for iso_VSV and gene expression and append into a file 22K 
> rows long
> // and apply FDR
> 
> // This version comments out statsby and uses egen rank and egen corr
> // and consequently is 10 times fast, doing 22K spearman correlations
> // in about 50 seconds (along with file creation, etc.).
> 
> set more off
> set rmsg on
> clear
> set memory 200M
> 
> // sabatini_merged is a long format file
> // merging expression data and editing data
> use "/Users/dairey/Desktop/primate_editing/sabatini_merged.dta"
> 
> preserve
> local start = 1
> forvalues stop = 6000(6000)267396 {
> 	keep if _n >= `start' & _n <= `stop'
> 	save sabatini_`stop'.dta, replace
> 	egen rank_1 = rank(expression), by(ssrownum)
> 	egen rank_2 = rank(iso_VSV), by(ssrownum)
> 	egen corr = corr(rank_1 rank_2), by(ssrownum)
> 	collapse (mean) spearman=corr (count) count=corr, by(ssrownum)
> 	gen t = spearman*((count-2)/(1-spearman^2))^0.5
> 	gen p = 2*ttail(count-2,abs(t))
> 	save spearman_`stop'.dta, replace
> 	local start = `stop'+1
> 	restore, preserve
> }
> 
> clear
> use "/Users/dairey/Desktop/primate_editing/sabatini_merged.dta"
> keep if _n > 264000 & _n <= 267396
> save sabatini_267396.dta, replace
> keep ssrow expression iso_VSV
> egen rank_1 = rank(expression), by(ssrownum)
> egen rank_2 = rank(iso_VSV), by(ssrownum)
> egen corr = corr(rank_1 rank_2), by(ssrownum)
> collapse (mean) spearman=corr (count) count=corr, by(ssrownum)
> gen t = spearman*((count-2)/(1-spearman^2))^0.5
> gen p = 2*ttail(count-2,abs(t))
> save spearman_267396.dta, replace
> 
> 
> // append the statsby results into one file
> use spearman_6000.dta
> forvalues stop = 12000(6000)267396 {
> 	append using spearman_`stop'.dta
> 	erase spearman_`stop'.dta
> }
> append using spearman_267396.dta
> erase spearman_267396.dta
> save iso_VSV.dta, replace
> erase spearman_6000.dta
> 
> // see if there is significance after FDR
> multproc, pvalue(p) method(simes) puncor(0.05) rank(rank)
> 
> // keep top 100 and bring in gene information against ssrownum key id
> sort rank
> keep if rank < 101
> sort ssrownum
> merge ssrownum using "/Users/dairey/Desktop/primate_editing/ 
> sabatini_genes.dta"
> keep if _merge == 3
> 
> // list gene information
> sort rank
> list gene_title
> set rmsg off
> set more on
> 
> /*
> 
> // use a forvalues loop to make smaller files and do statsby
> preserve
> local start = 1
> forvalues stop = 6000(6000)267396 {
> 	keep if _n >= `start' & _n <= `stop'
> 	save sabatini_`stop'.dta, replace
> 	statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): ///
> 		spearman iso_VSV expression
> 	save spearman_`stop'.dta, replace
> 	local start = `stop'+1
> 	restore, preserve
> 	}
> 
> // above, the forvalues misses data after row 264000 so
> // those are picked up here
> clear
> use "/Users/dairey/Desktop/primate_editing/sabatini_merged.dta"
> keep if _n > 264000 & _n <= 267396
> save sabatini_267396.dta, replace
> statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): ///
> 	spearman iso_VSV expression
> save spearman_267396.dta, replace
> 
> */
> 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- st: RE: RE: re: statsby slowness
  - From: "Nick Cox" <[email protected]>

References:
- st: re: statsby slowness
  - From: David Airey <[email protected]>

Prev by Date: Re: st: selection and clustered s.e.
Next by Date: st: RE: RE: re: statsby slowness
Previous by thread: st: re: statsby slowness
Next by thread: st: RE: RE: re: statsby slowness
Index(es):
- Date
- Thread