Statalist



st: RE: disregarding duplicate observations in a variable list


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: disregarding duplicate observations in a variable list
Date   Wed, 7 Jan 2009 15:58:50 -0000

The problem is that of counting duplicate _values_ across a varlist and
within each observation. (The terminology of duplicate observations
would imply a problem for -duplicates-, but that command does not help
here.) 

Jessica's code borrowed from the -egenmore- package is to do with
counting values that are positive and non-missing. That won't help
either, as the values would be counted regardless of whether they are
distinct, as Jessica realises. There isn't a very easy way to go further
down that path, although it would be possible. 
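A minimal sketch of what going further down that path might look like, under stated assumptions: the toy data and the variable names njobs, new and seen are invented for illustration, loosely following the example in the original question. Each value is counted only if it is non-missing and differs from every variable already visited.

```stata
* toy data for illustration only
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

gen njobs = 0
local seen
quietly foreach v of varlist emp1_97-emp3_98 {
	* candidate: non-missing value in this variable
	gen byte new = !missing(`v')
	* not new if the same value appeared in an earlier variable
	foreach w of local seen {
		replace new = 0 if `v' == `w'
	}
	replace njobs = njobs + new
	drop new
	local seen `seen' `v'
}
list id njobs
```

The number of comparisons grows with the square of the number of variables, which is one reason this route is not very attractive for long variable lists.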

Note that the -egenmore- package is on SSC. (Please remember to explain
where programs you use come from.) 

The problem is however very close to that discussed in an FAQ 

FAQ     . . . . .  Counting distinct strings across a set of variables
        . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        7/04    How do I count the number of distinct strings
                across a set of variables?
 
<http://www.stata.com/support/faqs/data/distinctstrings.html>

One strategy discussed there starts with a -reshape-. Scott Merryman has
followed a similar line in his suggestions. 
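As a rough illustration of the -reshape- strategy (not the FAQ's own code), the sketch below stacks the job variables into one long variable, tags the first occurrence of each distinct value within each respondent, and totals the tags. The toy data and the names job, slot, first and njobs are assumptions made for this example.

```stata
* toy data for illustration only
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

* give the job variables a common stub so that -reshape- can stack them
local j = 0
foreach v of varlist emp1_97-emp3_98 {
	local ++j
	rename `v' job`j'
}
reshape long job, i(id) j(slot)

* tag one observation for each distinct non-missing job within id
egen byte first = tag(id job)
egen njobs = total(first), by(id)

* back to one observation per respondent (variables are now job1-job6)
drop first
reshape wide job, i(id) j(slot)
list id njobs
```

By default -egen, tag()- does not tag observations with missing values, so missing jobs are not counted.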

Since that FAQ was written, writing an -egen- function based on a Mata
workhorse has come to seem a good way to do this. In fact, the
-rowmedian()- function for -egen- in the -egenmore- package has most of
the code needed. As the problem arises for numeric variables as well as
for string variables, two functions could be useful. 

* -------------------- put in _grownvals.ado on your adopath 
* number of distinct non-missing numeric values in each observation 
* NJC 1.0.0 7 Jan 2009
program _grownvals 
	version 9
	gettoken type 0 : 0
	gettoken h    0 : 0 
	gettoken eqs  0 : 0

	syntax varlist(numeric) [if] [in] [, BY(string)]
	if `"`by'"' != "" {
		_egennoby rownvals() `"`by'"'
		/* NOTREACHED */
	}

	marksample touse, novarlist 
	quietly { 
		mata : row_nvals("`varlist'", "`touse'", "`h'", "`type'") 
	}
end

mata : 

void row_nvals(string scalar varnames, 
		string scalar tousename,
		string scalar nvalsname,
		string scalar type)
{ 
	real matrix y 
	real colvector nvals, row
	real scalar i

        st_view(y, ., tokens(varnames), tousename)    
	nvals = J(rows(y), 1, .) 

	for(i = 1; i <= rows(y); i++) { 
		row = y[i,]'        
		nvals[i] = length(uniqrows(select(row, (row :< .))))
        }

	st_addvar(type, nvalsname)
	st_store(., nvalsname, tousename, nvals) 
}	

end
* end of _grownvals.ado 
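
To see what the key line inside the loop does, the expression can be tried step by step in interactive Mata on a single made-up row. The values and the names kept, distinct and n are assumptions for illustration.

```stata
mata:
// one observation's values as a column vector (made-up numbers)
row = (9701 \ 9702 \ 9703 \ 9801 \ 9701 \ .)
// keep non-missing values only: "." is the smallest missing value
kept = select(row, row :< .)
// sorted distinct values
distinct = uniqrows(kept)
// number of distinct non-missing values
n = length(distinct)
n
end
```

Here the repeated 9701 and the missing value are dropped, so n is 4.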

* -------------------- put in _growsvals.ado on your adopath 
* number of distinct non-missing string values in each observation
* NJC 1.0.0 7 Jan 2009
program _growsvals 
	version 9
	gettoken type 0 : 0
	gettoken h    0 : 0 
	gettoken eqs  0 : 0

	syntax varlist(string) [if] [in] [, BY(string)]
	if `"`by'"' != "" {
		_egennoby rowsvals() `"`by'"'
		/* NOTREACHED */
	}

	marksample touse, novarlist 
	quietly { 
		mata : row_svals("`varlist'", "`touse'", "`h'", "`type'") 
	}
end

mata : 

void row_svals(string scalar varnames, 
		string scalar tousename,
		string scalar svalsname,
		string scalar type)
{ 
	string matrix y 
	string colvector row
	real colvector svals
	real scalar i

        st_sview(y, ., tokens(varnames), tousename)    
	svals = J(rows(y), 1, .) 

	for(i = 1; i <= rows(y); i++) { 
		row = y[i,]'        
		svals[i] = length(uniqrows(select(row, (row :!= ""))))
        }

	st_addvar(type, svalsname)
	st_store(., svalsname, tousename, svals) 
}	

end
* end of _growsvals.ado


You can invoke these functions, once the program files are in place, by 

egen nvals = rownvals(<numeric varlist>) 

egen svals = rowsvals(<string varlist>)
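
For instance, with data laid out as in the question quoted below, a call might look like the following sketch. It assumes that _grownvals.ado has been saved on your adopath exactly as given above; the toy data and the name njobs are invented for illustration.

```stata
* assumes _grownvals.ado (above) is installed on the adopath
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

* number of distinct non-missing job IDs in each observation
egen njobs = rownvals(emp1_97-emp3_98)
list id njobs
```

If the function behaves as described, njobs should be 4 for the first respondent and 1 for the second.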

I'll add those functions to -egenmore- in due course. 

Nick 
[email protected] 

Jessica Looze

I am trying to create a variable that indicates the number of jobs an
individual has held during a period of years. The dataset I am using,
NLSY97, records each respondent's work history in a roster format.
This roster assigns each job a unique ID indicating the year the job
began. For example, the roster for respondent #1 might look like:

ID     Year     Job1     Job2     Job3
1      1997     9701     9702     9703
1      1998     9801     9701     .

So, during these two years, this respondent held four different jobs
(9701 extending over into 1998).

My data looks something like this:

ID     EMP1_97     EMP2_97     EMP3_97     EMP1_98     EMP2_98     EMP3_98
1      9701        9702        9703        9801        9701        .
2      9701        .           .           9701        .           .

I have been working with the row operations suggested in the egenmore
help entry. My current working code looks like that on this manual
page:

gen any = 0
gen all = 1
gen count = 0
foreach v of varlist emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98 {
     replace any = max(any, inrange(`v', 0, .))
     replace all = min(all, inrange(`v', 0, .))
     replace count = count + inrange(`v', 0, .)
}



