The problem is that of counting duplicate _values_ across a varlist and
within each observation. (The terminology of duplicate observations
would imply a problem for -duplicates-, but that command does not help
here.)
Jessica's code, borrowed from the -egenmore- package, counts values that
are non-negative and non-missing. That won't help either, as the values
would be counted regardless of whether they are distinct, as Jessica
realises. There isn't a very easy way to go further down that path,
although it would be possible.
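To see why a running count is not enough, here is a minimal sketch using two toy observations modelled on the example in the question. The -input- data are illustrative only, not real NLSY97 records.

```stata
* toy data: respondent 1 holds job 9701 in both years
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

* running count of non-negative, non-missing values across the row
gen count = 0
foreach v of varlist emp1_97-emp3_98 {
        replace count = count + inrange(`v', 0, .)
}

* count is 5 and 2 here, although the distinct jobs number 4 and 1
list id count
```

The repeated 9701 is counted once for each variable in which it appears, which is exactly the double counting that needs to be avoided.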
Note that the -egenmore- package is on SSC. (Please remember to explain
where programs you use come from.)
The problem is however very close to that discussed in an FAQ:
FAQ     . . . . . . . . .  Counting distinct strings across a set of variables
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        7/04    How do I count the number of distinct strings
                across a set of variables?
One strategy discussed there starts with a -reshape-. Scott Merryman has
followed a similar line in his suggestions.
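A -reshape- strategy along those lines might be sketched as follows. This is not the code from the FAQ or from Scott's posting, just one way to do it under assumed names: the toy data and the variables slot, first and njobs are all hypothetical.

```stata
* toy data modelled on the example in the question
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

* stack the job IDs: one observation per respondent and roster slot
reshape long emp, i(id) j(slot) string

* ignore empty slots, then flag one observation per distinct id-job pair
drop if missing(emp)
egen byte first = tag(id emp)

* number of distinct jobs per respondent
egen njobs = total(first), by(id)

* back to one observation per respondent
bysort id: keep if _n == 1
keep id njobs
list
```

Note that a respondent with no jobs recorded at all would disappear after the -drop-, so the result would need to be merged back onto the original data if such cases matter.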
Since that FAQ was written, writing an -egen- function based on a Mata
workhorse has come to seem a good way to do this. In fact, the
-rowmedian()- function for -egen- in the -egenmore- package has most of
the code needed. As the problem arises for numeric variables as well as
for string variables, two functions could be useful.
* -------------------- put in _grownvals.ado on your adopath
* number of distinct non-missing numeric values in each observation
* NJC 1.0.0 7 Jan 2009
program _grownvals
        version 9
        * -egen- passes the new variable's type, its name and "=" first
        gettoken type 0 : 0
        gettoken h 0 : 0
        gettoken eqs 0 : 0
        syntax varlist(numeric) [if] [in] [, BY(string)]
        if `"`by'"' != "" {
                _egennoby rownvals() `"`by'"'
        }
        marksample touse, novarlist
        quietly {
                mata : row_nvals("`varlist'", "`touse'", "`h'", "`type'")
        }
end

mata :
void row_nvals(string scalar varnames,
        string scalar tousename,
        string scalar nvalsname,
        string scalar type)
{
        real matrix y
        real colvector nvals, row
        real scalar i
        st_view(y, ., tokens(varnames), tousename)
        nvals = J(rows(y), 1, .)
        for(i = 1; i <= rows(y); i++) {
                // distinct non-missing values in observation i
                row = y[i,]'
                nvals[i] = length(uniqrows(select(row, (row :< .))))
        }
        st_addvar(type, nvalsname)
        st_store(., nvalsname, tousename, nvals)
}
end
* end of _grownvals.ado
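The heart of the Mata function is a single expression applied to each row. As a small check of what it does, here it is on its own with a made-up vector of job IDs containing one repeat and one missing value:

```stata
mata:
// one observation's values as a column vector
row = (9701, 9702, 9703, 9801, 9701, .)'

// select() drops missings, uniqrows() leaves the distinct values,
// length() counts them
nvals = length(uniqrows(select(row, row :< .)))
nvals
end
```

This displays 4, as 9701 is counted once and the missing value not at all.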
* -------------------- put in _growsvals.ado on your adopath
* number of distinct non-missing string values in each observation
* NJC 1.0.0 7 Jan 2009
program _growsvals
        version 9
        * -egen- passes the new variable's type, its name and "=" first
        gettoken type 0 : 0
        gettoken h 0 : 0
        gettoken eqs 0 : 0
        syntax varlist(string) [if] [in] [, BY(string)]
        if `"`by'"' != "" {
                _egennoby rowsvals() `"`by'"'
        }
        marksample touse, novarlist
        quietly {
                mata : row_svals("`varlist'", "`touse'", "`h'", "`type'")
        }
end

mata :
void row_svals(string scalar varnames,
        string scalar tousename,
        string scalar svalsname,
        string scalar type)
{
        string matrix y
        string colvector row
        real colvector svals
        real scalar i
        st_sview(y, ., tokens(varnames), tousename)
        svals = J(rows(y), 1, .)
        for(i = 1; i <= rows(y); i++) {
                // distinct non-empty strings in observation i
                row = y[i,]'
                svals[i] = length(uniqrows(select(row, (row :!= ""))))
        }
        st_addvar(type, svalsname)
        st_store(., svalsname, tousename, svals)
}
end
* end of _growsvals.ado
You can invoke these functions, once the program files are in place, by
        egen nvals = rownvals(<numeric varlist>)
        egen svals = rowsvals(<string varlist>)
I'll add those functions to -egenmore- in due course.
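For the job roster example, usage might look like this. The sketch assumes that _grownvals.ado as listed above has been saved somewhere on your adopath. The toy data and the name njobs are illustrative only.

```stata
* toy data modelled on the example in the question
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

* number of distinct non-missing job IDs held by each respondent
egen njobs = rownvals(emp1_97-emp3_98)
list id njobs
```

With these data njobs should be 4 for the first respondent and 1 for the second.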
Jessica Looze
I am trying to create a variable that indicates the number of jobs an
individual has held during a period of years. The dataset I am using,
NLSY97, records each respondent's work history in a roster format.
This roster assigns each job a unique ID indicating the year the job
began. For example, the roster for respondent #1 might look like:
ID   Year   Job1   Job2   Job3
1    1997   9701   9702   9703
1    1998   9801   9701   .
So, during these two years, this respondent held four different jobs
(9701 extending over into 1998).
My data looks something like this:
ID   EMP1_97   EMP2_97   EMP3_97   EMP1_98   EMP2_98   EMP3_98
1    9701      9702      9703      9801      9701      .
2    9701      .         .         9701      .         .
I have been working with the row operations suggested in the egenmore
help entry. My current working code, based on that entry, looks like this:
gen any = 0
gen all = 1
gen count = 0
foreach v of varlist emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98 {
        replace any = max(any, inrange(`v', 0, .))
        replace all = min(all, inrange(`v', 0, .))
        replace count = count + inrange(`v', 0, .)
}