Re: st: Number of variables limit
On Aug 21, 2008, at 2:03 PM, Constantine Daskalakis wrote:
> According to the Stata website, Stata has a limit of 32,767
> variables (I've never personally had to go that high). Yet, in
> today's world of genomics, we may have a dataset with 100,000+
> variables -- for example, in a microarray dataset, each gene would
> be a variable.
>
> How do people do such analyses in Stata?
> Is there a trick to handle such data?
Let me start by saying that I haven't (yet) done any manipulation of
microarray data in Stata -- hopefully, if others have, they'll speak
up. As you've already figured out, if you have, say, data for 500,000
SNPs (or even just 100k), you're not going to be able to load those as
individual variables in Stata all at one time. There are
alternatives: for example, (1) you could load the data in long format
(which is the way such data are often delivered), (2) you could load
the data transposed (i.e., with SNPs as observations and one variable
for each individual), or (3) you could load the data into a matrix in
Mata. However, even these strategies may prove inadequate to load the
entire dataset, especially if you don't have sufficient memory
available. Moreover, depending on what you want to do with the data,
working with them in forms (1) or (2) in Stata (as opposed to Mata)
may prove to be quite slow.
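
To put rough numbers on the memory point, here is a back-of-the-envelope
calculation. The 1,000 individuals is purely an assumed sample size for
illustration; the 500,000 SNPs is the figure used above. It compares one
byte per genotype in a Stata dataset with 8 bytes per element in a real
Mata matrix.

```stata
* Rough memory needs for holding every genotype at once.
* nind is an assumed sample size; adjust it to your own data.
local nsnp = 500000
local nind = 1000

* Stata dataset with genotypes stored as type byte (1 byte each)
display "Stata, byte storage: " %6.1f `nsnp'*`nind'*1/2^20 " MB"

* Mata real matrix (8 bytes per element)
display "Mata, real matrix:   " %6.1f `nsnp'*`nind'*8/2^30 " GB"
```

That works out to roughly 477 MB in Stata and 3.7 GB in Mata, before
counting any identifiers or working copies. Long format would need more
still, since each row also has to carry individual and SNP identifiers.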
The real question is: Do you really need to have all of the data in
Stata (or Mata) at one time? For example, if you are doing something
which involves working with just one SNP at a time or with just the
SNPs within a relatively small region of the chromosome, then you
could pull in the data in chunks, and write the results of your
analyses out to a file as you go. When you're finished, you can then
read in the results file for summarizing and plotting.
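
As a minimal sketch of that chunk-at-a-time approach: everything named
here is hypothetical. I am assuming the data have already been split into
20 files, chunk1.dta through chunk20.dta, each holding a phenotype
variable pheno plus a manageable number of SNP variables named snp_*, and
I use -regress- only as a stand-in for whatever per-SNP analysis you
actually want.

```stata
* Sketch only: file names, variable names, the number of chunks, and
* the choice of -regress- are all assumptions for illustration.
tempname out
postfile `out' str32 snp double(b se) long n using snp_results, replace

forvalues c = 1/20 {
    use chunk`c', clear
    foreach v of varlist snp_* {
        * skip SNPs for which the model cannot be fit
        capture regress pheno `v'
        if _rc continue
        * -capture- also skips SNPs dropped from the model
        capture post `out' ("`v'") (_b[`v']) (_se[`v']) (e(N))
    }
}
postclose `out'

* read the results back in for summarizing and plotting
use snp_results, clear
summarize b se
```

Only one chunk is ever in memory, and the results file has one
observation per SNP, so it stays small no matter how many SNPs you
started with.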
One of the first problems in working with gene array data is figuring
out how to store and access the data efficiently. When we first
encountered such data, we wrote a tool to re-organize and index a gene
array dataset so that you could export just a portion of the dataset
very quickly. You can now use PLINK (an emerging standard for
whole-genome association analyses) to do this. In fact, if you want to work
with gene array data in Stata and don't want to handle the data
management yourself (and I wouldn't suggest handling it yourself unless you
have a good reason and the capability to do so), an excellent strategy would
be to read the data into PLINK, save them as a PLINK-format dataset,
and then use PLINK to extract portions of the dataset for use in Stata.
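
For example, the PLINK steps could be run from within Stata along these
lines. This is a sketch under stated assumptions: PLINK is installed and
on your path, your data are in a PLINK text fileset that I have called
mydata (a hypothetical name), and the chromosome and kb range are
arbitrary placeholders.

```stata
* Sketch only: "mydata", "region", and the region bounds are made up.

* one-time conversion of a PLINK text fileset (mydata.ped/mydata.map)
* to PLINK binary format (mydata.bed/.bim/.fam)
shell plink --file mydata --make-bed --out mydata

* extract the SNPs in one region, recoded as additive 0/1/2 allele
* counts, and write them to the text file region.raw
shell plink --bfile mydata --chr 6 --from-kb 29000 --to-kb 33000 --recodeA --out region
```

The resulting region.raw is a plain-text file with a header row and one
column per SNP, which you can then read into Stata (e.g., with -insheet-).
As far as I know, missing genotypes are written as NA, so expect to
convert those to Stata missing values after reading the file in.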
If someone were motivated to write a Stata plugin to read data
directly from a PLINK binary file (I believe the file format is open),
that would be a great first step toward facilitating working with gene
array data in Stata.
-- Phil