My -egen- function works for me. Any alleged pickiness with Mata and
-if- cannot possibly bite you as my Mata code makes no use of -if-.
What's much (enormously) more likely is that the code has got mangled
somewhere en route. For example, the version in the Statalist archives
at Harvard has a split line, as has the version below.
As SSC is frozen in Kit's absence (see earlier today) I will send copies
of files directly to Jessica.
Nick
[email protected]
Jessica Looze
Thank you Nick and Scott for your suggestions. I tried Nick's
suggestion first, as an egen command seems the more efficient of the
two. However, when I entered the command
egen nvals = rownvals(emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp_98)
(after saving Nick's ado files of course) I received the error message
unexpected end of line
<istmt> incomplete
r(3000);
Unsure what this meant, I did a search and found a reference to this
message in an archived Statalist conversation.
http://www.stata.com/statalist/archive/2006-04/msg00434.html
This discussion seems to indicate that this message has to do with the
pickiness of Mata when "if" is involved. I am not very advanced at
writing programs, so looking through your programs Nick, I am
uncertain how to tweak it (if tweaking is even the issue). Maybe there
is something else I need to be doing here?
On Wed, Jan 7, 2009 at 10:58 AM, Nick Cox <[email protected]> wrote:
> The problem is that of counting duplicate _values_ across a varlist and
> within each observation. (The terminology of duplicate observations
> would imply a problem for -duplicates-, but that command does not help
> here.)
>
> Jessica's code borrowed from the -egenmore- package is to do with
> counting values that are positive and non-missing. That won't help
> either, as the values would be counted regardless of whether they are
> distinct, as Jessica realises. There isn't a very easy way to go further
> down that path, although it would be possible.
>
> Note that the -egenmore- package is on SSC. (Please remember to explain
> where programs you use come from.)
>
> The problem is however very close to that discussed in an FAQ
>
> FAQ . . . . . . . . . Counting distinct strings across a set of variables
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
> 7/04 How do I count the number of distinct strings
> across a set of variables?
>
> <http://www.stata.com/support/faqs/data/distinctstrings.html>
>
> One strategy discussed there starts with a -reshape-. Scott Merryman has
> followed a similar line in his suggestions.
>
> Since that FAQ was written, writing an -egen- function based on a Mata
> workhorse has come to seem a good way to do this. In fact, the
> -rowmedian()- function for -egen- in the -egenmore- package has most of
> the code needed. As the problem arises for numeric variables as well as
> for string variables, two functions could be useful.
>
> * -------------------- put in _grownvals.ado on your adopath
> * number of distinct non-missing numeric values in each observation
> * NJC 1.0.0 7 Jan 2009
> program _grownvals
>         version 9
>         gettoken type 0 : 0
>         gettoken h    0 : 0
>         gettoken eqs  0 : 0
>
>         syntax varlist(numeric) [if] [in] [, BY(string)]
>         if `"`by'"' != "" {
>                 _egennoby rownvals() `"`by'"'
>                 /* NOTREACHED */
>         }
>
>         marksample touse, novarlist
>         quietly {
> mata : row_nvals("`varlist'", "`touse'", "`h'",
> "`type'")
>         }
> end
>
> mata :
>
> void row_nvals(string scalar varnames,
>                string scalar tousename,
>                string scalar nvalsname,
>                string scalar type)
> {
>         real matrix y
>         real colvector nvals, row
>
>         st_view(y, ., tokens(varnames), tousename)
>         nvals = J(rows(y), 1, .)
>
>         for(i = 1; i <= rows(y); i++) {
>                 row = y[i,]'
>                 nvals[i] = length(uniqrows(select(row, (row :< .))))
>         }
>
>         st_addvar(type, nvalsname)
>         st_store(., nvalsname, tousename, nvals)
> }
>
> end
> * end of _grownvals.ado
>
> * -------------------- put in _growsvals.ado on your adopath
> * number of distinct non-missing string values in each observation
> * NJC 1.0.0 7 Jan 2009
> program _growsvals
>         version 9
>         gettoken type 0 : 0
>         gettoken h    0 : 0
>         gettoken eqs  0 : 0
>
>         syntax varlist(string) [if] [in] [, BY(string)]
>         if `"`by'"' != "" {
>                 _egennoby rowsvals() `"`by'"'
>                 /* NOTREACHED */
>         }
>
>         marksample touse, novarlist
>         quietly {
> mata : row_svals("`varlist'", "`touse'", "`h'",
> "`type'")
>         }
> end
>
> mata :
>
> void row_svals(string scalar varnames,
>                string scalar tousename,
>                string scalar svalsname,
>                string scalar type)
> {
>         string matrix y
>         string colvector row
>         real colvector svals
>
>         st_sview(y, ., tokens(varnames), tousename)
>         svals = J(rows(y), 1, .)
>
>         for(i = 1; i <= rows(y); i++) {
>                 row = y[i,]'
>                 svals[i] = length(uniqrows(select(row, (row :!= ""))))
>         }
>
>         st_addvar(type, svalsname)
>         st_store(., svalsname, tousename, svals)
> }
>
> end
> * end of _growsvals.ado
>
>
> You can invoke these functions, once the program files are in place, by
>
> egen nvals = rownvals(<numeric varlist>)
>
> egen svals = rowsvals(<string varlist>)
>
> I'll add those functions to -egenmore- in due course.
>
> Nick
> [email protected]
>
> Jessica Looze
>
> I am trying to create a variable that indicates the number of jobs an
> individual has held during a period of years. The dataset I am using,
> NLSY97, records each respondent's work history in a roster format.
> This roster assigns each job a unique ID indicating the year the job
> began. For example, the roster for respondent #1 might look like:
>
> ID   Year   Job1   Job2   Job3
> 1    1997   9701   9702   9703
> 1    1998   9801   9701   .
>
> So, during these two years, this respondent held four different jobs
> (9701 extending over into 1998).
>
> My data looks something like this:
>
> ID   EMP1_97   EMP2_97   EMP3_97   EMP1_98   EMP2_98   EMP3_98
> 1    9701      9702      9703      9801      9701      .
> 2    9701      .         .         9701      .         .
>
> I have been working with the row operations suggested in the egenmore
> help entry. My current working code looks like that on this manual
> page:
>
> gen any = 0
> gen all = 1
> gen count = 0
> foreach v of varlist emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98 {
>         replace any = max(any, inrange(`v', 0, .))
>         replace all = min(all, inrange(`v', 0, .))
>         replace count = count + inrange(`v', 0, .)
> }
>
> From here, I cannot figure out how to modify the variable count, so
> that it disregards duplicate IDs.
>
> Any suggestions would be much appreciated.
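
For anyone who wants to check the logic outside Stata, here is a minimal
sketch in Python of the row-wise count discussed in this thread: the number
of distinct non-missing values within each observation. It is not Nick's
code and is not part of -egenmore-. The function name count_distinct, the
use of None to stand in for Stata's numeric missing value, and the
dictionary layout are all assumptions made for illustration. The two rows
are the example observations from the original question.

```python
def count_distinct(row, missing=(None, "")):
    """Count distinct non-missing values in one observation.

    `missing` lists the values treated as missing: None stands in for
    Stata's numeric missing (.), "" for an empty string. Both choices
    are assumptions of this sketch.
    """
    # A set keeps each value once, so duplicates within the row collapse.
    return len({value for value in row if value not in missing})


# Example observations from the question, keyed by respondent ID
# (columns: EMP1_97 EMP2_97 EMP3_97 EMP1_98 EMP2_98 EMP3_98).
rows = {
    1: [9701, 9702, 9703, 9801, 9701, None],
    2: [9701, None, None, 9701, None, None],
}

nvals = {person: count_distinct(values) for person, values in rows.items()}
print(nvals)  # {1: 4, 2: 1}
```

Respondent 1 comes out with four jobs, matching the count given in the
question, because job 9701 is counted once even though it appears in both
years. The same function handles strings if empty strings are the missing
value.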