
Re: st: Number of variables limit

From	Phil Schumm <[email protected]>
To	[email protected]
Subject	Re: st: Number of variables limit
Date	Fri, 22 Aug 2008 08:10:17 -0500

On Aug 21, 2008, at 2:03 PM, Constantine Daskalakis wrote:

> According to the Stata website, Stata has a limit of 32,767 variables (I've never personally had to go that high). Yet, in today's world of genomics, we may have a dataset with 100,000+ variables -- for example, in a microarray dataset, each gene would be a variable.
>
> How do people do such analyses in Stata?
>
> Is there a trick to handle such data?

Let me start by saying that I haven't (yet) done any manipulation of microarray data in Stata -- hopefully, if others have, they'll speak up. As you've already figured out, if you have, say, data for 500,000 SNPs (or even just 100k), you're not going to be able to load those as individual variables in Stata all at one time. There are alternatives: for example, (1) you could load the data in long format (which is the way such data are often delivered), (2) you could load the data transposed (i.e., with SNPs as observations and one variable for each individual), or (3) you could load the data into a matrix in Mata. However, even these strategies may prove inadequate to load the entire dataset, especially if you don't have sufficient memory available. Moreover, depending on what you want to do with the data, working with them in forms (1) or (2) in Stata (as opposed to Mata) may prove to be quite slow.
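To make layouts (1) and (2) concrete, here is a small language-neutral sketch in Python (not Stata code). The SNP names, individual IDs, and 0/1/2 genotype coding are all made up for illustration. It starts from the "wide" layout that runs into the variable limit and rearranges it both ways.

```python
# Toy genotype table in "wide" form: one row per individual, one column per SNP.
# All names and values here are hypothetical.
snp_names = ["rs1", "rs2", "rs3"]
wide = {
    "ind1": [0, 1, 2],
    "ind2": [1, 1, 0],
}


def to_long(wide, snp_names):
    """Layout (1): one record per (individual, SNP) pair."""
    return [
        (ind, snp, genotype)
        for ind, genotypes in wide.items()
        for snp, genotype in zip(snp_names, genotypes)
    ]


def to_transposed(wide, snp_names):
    """Layout (2): one row per SNP, one column per individual."""
    individuals = list(wide)
    rows = {
        snp: [wide[ind][j] for ind in individuals]
        for j, snp in enumerate(snp_names)
    }
    return individuals, rows


long_records = to_long(wide, snp_names)
individuals, transposed = to_transposed(wide, snp_names)
```

With n individuals and p SNPs, the long layout has n x p rows and only three columns, while the transposed layout has p rows and n columns. So the transposed form only gets around the variable limit when the number of individuals is itself below that limit.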

The real question is: Do you really need to have all of the data in Stata (or Mata) at one time? For example, if you are doing something which involves working with just one SNP at a time or with just the SNPs within a relatively small region of the chromosome, then you could pull in the data in chunks, and write the results of your analyses out to a file as you go. When you're finished, you can then read in the results file for summarizing and plotting.
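The chunk-at-a-time idea can be sketched roughly as follows. This is illustrative Python rather than Stata. The `load_chunk` function, the toy data, and the per-SNP "analysis" (a simple allele frequency, assuming 0/1/2 coding with `None` for missing) are placeholders I made up, not part of any real pipeline.

```python
import csv
import io


def iter_chunks(names, size):
    """Yield successive slices of `names`, each with at most `size` items."""
    for start in range(0, len(names), size):
        yield names[start:start + size]


def analyze_snp(genotypes):
    """Placeholder analysis: frequency of the counted allele.

    Assumes genotypes are coded 0/1/2, with None for missing calls.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return None
    return sum(observed) / (2 * len(observed))


def run_in_chunks(load_chunk, snp_names, chunk_size, out):
    """Load one chunk at a time, analyze each SNP, and write results to `out`."""
    writer = csv.writer(out)
    writer.writerow(["snp", "allele_freq"])
    for chunk in iter_chunks(snp_names, chunk_size):
        data = load_chunk(chunk)  # only this chunk is held in memory
        for snp in chunk:
            writer.writerow([snp, analyze_snp(data[snp])])


# Toy stand-in for the real data source (hypothetical values).
_all_data = {"rs1": [0, 1, 2], "rs2": [1, None, 1], "rs3": [2, 2, 2]}


def load_chunk(names):
    return {name: _all_data[name] for name in names}


buf = io.StringIO()  # stands in for the results file
run_in_chunks(load_chunk, list(_all_data), chunk_size=2, out=buf)
results = list(csv.reader(io.StringIO(buf.getvalue())))
```

In Stata itself, the usual tool for writing results out as you go is `postfile`/`post`/`postclose`, with the loop body loading each chunk in turn.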

One of the first problems in working with gene array data is figuring out how to store and access the data efficiently. When we first encountered such data, we wrote a tool to re-organize and index a gene array dataset so that you could export just a portion of the dataset very quickly. You can now use PLINK (an emerging standard for whole-genome association analyses) to do this. In fact, if you want to work with gene array data in Stata and don't want to handle the data management yourself (doing it yourself is something I wouldn't suggest unless you have a good reason and the capability to do it), an excellent strategy would be to read the data into PLINK, save them as a PLINK-format dataset, and then use PLINK to extract portions of the dataset for use in Stata.
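As a rough sketch of that workflow, the Python helpers below only build PLINK command lines and do not run anything. The file stems (`mydata`, `chunk`) and the region are made up. The flags shown (`--file`, `--make-bed`, `--bfile`, `--chr`, `--from-kb`, `--to-kb`, `--recodeA`, `--out`) are the ones I understand PLINK 1.x to accept, but please check them against the documentation for your PLINK version.

```python
def plink_make_binary_args(text_stem, out_stem):
    """Command line to convert a text PED/MAP fileset to PLINK binary format."""
    return ["plink", "--file", text_stem, "--make-bed", "--out", out_stem]


def plink_extract_region_args(bfile_stem, chrom, from_kb, to_kb, out_stem):
    """Command line to export one chromosomal region with additive (0/1/2) coding."""
    return [
        "plink",
        "--bfile", bfile_stem,
        "--chr", str(chrom),
        "--from-kb", str(from_kb),
        "--to-kb", str(to_kb),
        "--recodeA",
        "--out", out_stem,
    ]


# Hypothetical example: a 5 Mb window on chromosome 2.
convert_cmd = plink_make_binary_args("mydata", "mydata")
extract_cmd = plink_extract_region_args("mydata", 2, 5000, 10000, "chunk")
```

Either list could be passed to `subprocess.run(..., check=True)`. As I understand it, the extraction step writes a whitespace-delimited text file with a header row, which Stata can then read with its usual text-import commands.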

If someone were motivated to write a Stata plugin to read data directly from a PLINK binary file (I believe the file format is open), that would be a great first step toward facilitating working with gene array data in Stata.
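For anyone curious what such a reader would involve, here is a minimal Python sketch of decoding the PLINK binary genotype (.bed) layout as I understand it from the PLINK documentation. The file has a two-byte magic number and a mode byte, followed by two bits per genotype packed four to a byte, low-order bits first. Each SNP's block is padded to a whole byte. This is an illustration under those assumptions, not a Stata plugin, and the genotype labels are my own.

```python
BED_MAGIC = bytes([0x6C, 0x1B])  # first two bytes of a PLINK .bed file
SNP_MAJOR = 0x01                 # third byte: genotypes are stored SNP by SNP

# Two-bit genotype codes; the labels are illustrative names, not PLINK's.
GENOTYPE = {0b00: "hom_a1", 0b01: None, 0b10: "het", 0b11: "hom_a2"}


def decode_bed(raw, n_individuals):
    """Decode SNP-major .bed bytes into one list of calls per SNP.

    `n_individuals` must be supplied by the caller (in practice it is the
    number of lines in the accompanying .fam file).
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    if raw[:2] != BED_MAGIC:
        raise ValueError("not a PLINK .bed file (bad magic number)")
    if len(raw) < 3 or raw[2] != SNP_MAJOR:
        raise ValueError("only SNP-major .bed files are handled here")

    bytes_per_snp = (n_individuals + 3) // 4  # each SNP is padded to whole bytes
    body = raw[3:]
    if len(body) % bytes_per_snp != 0:
        raise ValueError("file size does not match the number of individuals")

    snps = []
    for start in range(0, len(body), bytes_per_snp):
        block = body[start:start + bytes_per_snp]
        calls = []
        for i in range(n_individuals):
            # Individual i sits in byte i // 4, at bit offset 2 * (i % 4).
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            calls.append(GENOTYPE[code])
        snps.append(calls)
    return snps


# One SNP, five individuals: codes 00, 01, 10, 11 in the first byte, 10 in the second.
example = BED_MAGIC + bytes([SNP_MAJOR, 0xE4, 0x02])
decoded = decode_bed(example, 5)
```

The SNP and individual identifiers are not in the .bed file. They come from the companion .bim and .fam text files, so a real reader would parse those first to get the dimensions and labels.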

-- Phil

