From | "Martin Weiss" <martin.weiss1@gmx.de> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | AW: st: AW: combining datasets |
Date | Thu, 19 Aug 2010 18:30:35 +0200 |
"The choice between append and merge is more important for large datasets because you need the right variable naming scheme."

I do not really understand the meaning of this sentence. Why would the situation change given the size of the dataset at hand? -append- and -merge- are not slight variations of each other, IMHO. The manual entry for -merge- does make clear the many variations _within_ -merge- itself, but the choice between -append- and -merge- is more fundamental still...

Also note [D], p. 397:

"merge is for adding new variables from a second dataset to existing observations. You use merge, for instance, when combining hospital patient and discharge datasets. If you wish to add new observations to existing variables, then see [D] append. You use append, for instance, when adding current discharges to past discharges."

HTH
Martin

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Anders Alexandersson
Sent: Thursday, August 19, 2010 17:56
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: AW: combining datasets

Martine,

Also see [U] 22 Combining datasets. Maarten provided an excellent append solution with this being the main line:

. append using `a'

Here is the equivalent merge solution:

. merge 1:1 source id using `a', nogen

The choice between append and merge is more important for large datasets because you need the right variable naming scheme. Michael Mitchell gave a good tip in his data management book, described at http://www.stata.com/bookstore/dmus.html : If you will append datasets, you want the variable names to be the same, but if you will merge datasets, you want the variable names to be different.

Anders Alexandersson
andersalex@gmail.com

On Thu, Aug 19, 2010 at 4:34 AM, Maarten buis <maartenbuis@yahoo.co.uk> wrote:
> --- On Wed, 18/8/10, martine etienne wrote:
>> firstly, person 1 in dataset A is NOT same person as person
>> 1 in dataset B, measurements are also taken at different times
>> secondly, I would like the final dataset to look like Final 1
>
> Here is an example of how to do that:
>
> *------------ begin example ------------
> // create the two datasets
> tempfile a b
>
> drop _all
> input id x
> 1 3
> 2 4
> end
> save `a'
>
> drop _all
> input id x
> 1 5
> 2 6
> end
> save `b'
>
> // create a new variable in each dataset
> // that identifies the source of those
> // observations
> use `a'
> gen source = "a"
>
> save `a', replace
>
> use `b'
> gen source = "b"
> save `b', replace
>
> // use -append- to stack the datasets
> append using `a'
>
> // create an extra id variable, which contains
> // a unique integer for each source-id combination
> // and attaches the values of the source and id
> // variables to the value label
> egen long new_id = group(source id), label
>
> // for display purposes I put the three id variables
> // to the left of the dataset
> order id source new_id
>
> // display the result
> list
> *--------------- end example ----------------
> (For more on examples I sent to the Statalist see:
> http://www.maartenbuis.nl/example_faq )
>
> Hope this helps,
> Maarten
>
> --------------------------
> Maarten L. Buis
> Institut fuer Soziologie
> Universitaet Tuebingen
> Wilhelmstrasse 36
> 72074 Tuebingen
> Germany
>
> http://www.maartenbuis.nl
> --------------------------

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
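
The distinction discussed in this thread can be sketched with a small do-file. This is only an illustration under stated assumptions: the toy values, the variable y, and the tempfile name first are hypothetical and do not come from the posters' data. The first two parts contrast -append- (same variable names, new observations) with -merge- (same observations, new variable names). The third part suggests why a 1:1 merge on source and id can reproduce the append result in the example above: no source-id pair occurs in both files, so every observation is kept as unmatched.

```stata
* Toy data; all values and the names y and first are assumptions for illustration
clear
input id x
1 3
2 4
end
tempfile first
save `first'

* Part 1: same variable names, different observations: -append- stacks rows
clear
input id x
3 5
4 6
end
append using `first'
assert _N == 4              // 2 + 2 observations
list

* Part 2: same observations, different variable names: -merge- adds a column
clear
input id y
1 50
2 60
end
merge 1:1 id using `first'
assert _N == 2              // still 2 observations
assert _merge == 3          // every id was matched
list

* Part 3: a key with no values in common, as with source and id in the thread;
* every observation is unmatched, so the merged data has all four rows
clear
input id x
1 3
2 4
end
gen source = "a"
tempfile a
save `a'

clear
input id x
1 5
2 6
end
gen source = "b"
merge 1:1 source id using `a', nogen
assert _N == 4
list
```

Run as a do-file, since tempfile and // comments are not meant for interactive use. If some source-id pair did appear in both files, Part 3 would no longer match the appended result, because matched observations are combined into one row rather than stacked.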