Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: Collapsing with strings


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: Collapsing with strings
Date   Sun, 15 Jan 2006 21:08:22 -0000

You're correct. I misread the question. Sorry. 

Nick 
[email protected] 

Gary Longton
 
> Daphna Bassok wrote:
> 
> > I have several duplicate observations in my data set.  
> However, they are 
> > not perfect duplicates.  Only the id # is the same. So 
> there might be 
> > two observations with id#16 for instance, the first will 
> have values for 
> > some variables, and missing values for others. The second 
> also have some 
> > values filled and some missing.  There are no cases in 
> which both have 
> > values- that is... either the first in the pair has the 
> value OR the 
> > second has a value (or neither).
> > 
> > For example: suppose I have two observations with id# 16... 
>  The first 
> > has values for var1 and  2 and not 3.   The second ONLY has 
> values for 
> > var 3.   What i would like to do is simply collapse these 
> into a single 
> > observation with all the relevant info. meaning, 1 observation with 
> > id#16 that has values for all three variables.
> > 
> > I am trying to do this with the collapse command with no success.
> > 
> > My code is:
> > 
> > collapse (min) var1-var3, by(id)
> > 
> > I thought this would create a new observation that has all 
> the data in it.
> > 
> > I am getting a "type mismatch" error.
> > 
> > Is this because some of my variables are string variables?
> 
> Nick Cox suggested:
> 
> > What you can do is -- if your description is correct -- 
> > 
> > egen nmiss = rowmiss(<insert variable names>) 
> > bysort id (nmiss) : keep if _n == 1
> > 
> > as the sort will sort the observation with 
> > more missings to second place. 
> 
> and Austin Nichols suggested:
> 
> > foreach v of varlist put all the relevant varnames here {
> >  bys id (`v'): qui replace `v'=`v'[_n-1] if mi(`v')
> > }
> > bys id: drop if _n>1
> 
> It is a rare day when one can make a correction to a 
> typically accurate and 
> elegant Nick Cox solution, so I make this one fearing that 
> I've probably missed 
> somthing obvious.
> 
> If I understand the problem correctly, I think this solution 
> will discard 
> non-missing data for some variables.
> 
> Eg. a simplified dataset like this seems consistent with 
> Daphna's description:
> 
> obs   id   var1   var2  var3  nmiss
> 
> 1     16    a      .      3     1
> 2     16    .      2      .     2
> 3     17    .      .      7     2
> 4     17    c      3      .     1
> 
> sorting on nmiss will discard observations 2 & 3, throwing 
> away non-missing data 
> for var2
> 
> As Austin's solution suggests, one needs to sort separately 
> for each variable in 
> the list and carry out the replace for that variable.  
> However missings sort 
> differently for string and numeric variables, taking first 
> place for strings and 
> last for numerics, so need to be handled differently in a 
> sort solution. 
> Austin's solution won't sort correctly for string variables.
> 
> There is probably a shorter approach, but I think this will do it:
> 
> ********
> foreach var of varlist var1-var3 {
>       if substr("`:type `var''",1,3) == "str" {
> 		bysort id (`var') replace `var' = `var'[_N]
> 	}
> 	else {
> 		bysort id (`var') replace `var' = `var'[1]
> 	}
> }
> bysort id: drop if _n>1
> ************
 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2025 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index