Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Creating a smaller dataset from a larger one.
From
Richard Williams <[email protected]>
To
[email protected], [email protected]
Subject
Re: st: Creating a smaller dataset from a larger one.
Date
Mon, 13 Aug 2012 16:04:00 -0500
At 10:47 AM 8/13/2012, Le Wang wrote:
Dear Amal,
Stata has a built-in program called -sample- to draw a random sample.
See the link below for a detailed tutorial for this command.
http://www.ats.ucla.edu/stat/stata/faq/sample.htm
Hope that helps.
Le
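For readers of the archive, here is a minimal sketch of what a call to -sample- might look like. It is not from the original posts: it uses Stata's shipped auto.dta as a stand-in for the real data, and the seed and output filename are arbitrary placeholders.

```stata
* Sketch only: auto.dta (74 observations) stands in for the real dataset.
sysuse auto, clear

* Set a seed so the same random draw can be reproduced later
* (the value 12345 is arbitrary).
set seed 12345

* Keep exactly 20 randomly chosen observations and drop the rest.
* Without the -count- option the number is read as a percentage.
sample 20, count

* Save the extract under a new (hypothetical) name so the full
* dataset on disk is left untouched.
save "auto_subsample.dta", replace
```

For the dataset in the question, something like -sample 100000, count- would be the analogous call. -sample- also accepts a -by()- option, so -sample 10, by(motherland)- should keep 10% of the observations within each country of birth, assuming the variable is named as in the quoted code.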
I'll add a caution here -- if the data are -svyset-, I don't think
you are supposed to create extracts. Stata needs all the cases in
order to get the standard errors right. I've never fully understood
why, but Statalist has had various threads explaining why you should
use -subpop- rather than -if- for selecting cases (and presumably the
same logic applies to extracts).
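As I understand it, the usual reasoning is that design-based variance estimation needs the full set of strata and PSUs, and dropping cases can remove some of them. A hedged sketch of the -subpop()- approach follows. It is not from the thread: it assumes internet access for -webuse- and uses the nhanes2 example data, whose design variables (psuid, stratid, finalwgt) and analysis variables (female, bpsystol) are stand-ins for a real survey.

```stata
* Sketch only: nhanes2 is an example dataset downloaded from Stata's site.
webuse nhanes2, clear

* Declare the survey design explicitly (PSU, sampling weight, strata).
svyset psuid [pweight = finalwgt], strata(stratid)

* Estimate for women only while keeping every case in memory,
* so the full design is available for the standard errors.
svy, subpop(female): mean bpsystol
```

The contrast is with dropping the men first, or running -svy: mean bpsystol if female == 1-, where the point estimate is typically the same but the standard error may not be.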
On Mon, Aug 13, 2012 at 10:31 AM, Amal Khanolkar <[email protected]> wrote:
> Hello all,
>
> I have a very large dataset with almost 3 million subjects. It is great
> to work with, but a bit difficult to transport or carry with me. I would
> prefer to create a smaller sub-dataset with, say, 100,000 subjects chosen
> at random. As I'm interested in studying ethnic differences, I use the
> variable 'Motherland', which denotes country of birth, in the code below
> to help create my sub-dataset. However, with the code I'm currently using
> I get (I think) the first 100,000 subjects, which is not a random
> selection. How may I change the code below to choose 100,000 (or any
> number I wish) subjects at random?
>
> I use the following code to create a subset of my original dataset:
>
> *Creating a subsample of the dataset with say 100,000 subjects*
>
> // create random variable
> gen x = runiform()
>
> // sort by country and x
> sort motherland x
>
> // create a variable within country identifying the first 10%
> // (change this proportion as you wish)
>
> by motherland: gen subsamp = _n <= (_N+0.5)*0.10
>
> tab motherland subsamp, col
>
> // -if- must come before the comma that introduces the options
> tab motherland kon if magecat!=. & education!=. & famsit_new!=. & ///
>     smoke1!=. & parity!=. & zscore_gest!=. & MBMI2!=. & mlangd!=. & ///
>     multibirth==2 & subsamp==1, col
>
>
> Thanks for any help,
>
> /Amal.
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
--
~~~~~~~~~~~~~~~~~~~~~~~~
Le Wang, Ph.D
Assistant Professor
Department of Economics
University of New Hampshire
-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME: (574)289-5227
EMAIL: [email protected]
WWW: http://www.nd.edu/~rwilliam