Re: st: RE: disregarding duplicate observations in a variable list

From	"Jessica Looze" <[email protected]>
To	[email protected]
Subject	Re: st: RE: disregarding duplicate observations in a variable list
Date	Fri, 9 Jan 2009 07:07:50 -0500

A split line in the code was indeed the problem. The files you sent
work nicely. Thank you Nick.

Jessica

On Thu, Jan 8, 2009 at 2:04 PM, Nick Cox <[email protected]> wrote:
> My -egen- function works for me. Any alleged pickiness with Mata and
> -if- cannot possibly bite you as my Mata code makes no use of -if-.
>
> What's much (enormously) more likely is that the code has got mangled
> somewhere en route. For example, the version in the Statalist archives
> at Harvard has a split line in the call to Mata, as did the version
> quoted below as it was first posted.
>
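A minimal sketch of the failure mode described above, assuming a do-file run in Stata 9 or later. The scalar name demo is made up for illustration and is not part of Nick's code.

```stata
* A Mata one-liner has to be complete on one physical line.
* Written on a single line, this call runs:
mata : st_numscalar("demo", 1 + 2)
assert scalar(demo) == 3

* If mail software wraps the same call after the comma, Stata hands Mata
* an unfinished statement, and the run should stop with the
* "unexpected end of line" message and r(3000) quoted in this thread:
*
*     mata : st_numscalar("demo",
*     1 + 2)
```

Rejoining the two halves onto one line is the whole repair.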
> As SSC is frozen in Kit's absence (see earlier today) I will send copies
> of files directly to Jessica.
>
> Nick
> [email protected]
>
> Jessica Looze
>
> Thank you Nick and Scott for your suggestions. I tried Nick's
> suggestion first, as an egen command seems the more efficient of the
> two. However, when I entered the command
>
> egen nvals = rownvals(emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98)
>
> (after saving Nick's ado files of course) I received the error message
>
> unexpected end of line
> <istmt> incomplete
> r(3000);
>
> Unsure what this meant, I did a search and found a reference to this
> message in an archived Statalist conversation.
>
> http://www.stata.com/statalist/archive/2006-04/msg00434.html
>
> This discussion seems to indicate that this message has to do with the
> pickiness of Mata when "if" is involved. I am not very advanced at
> writing programs, so looking through your programs, Nick, I am
> uncertain how to tweak them (if tweaking is even the issue). Maybe there
> is something else I need to be doing here?
>
> On Wed, Jan 7, 2009 at 10:58 AM, Nick Cox <[email protected]> wrote:
>> The problem is that of counting duplicate _values_ across a varlist and
>> within each observation. (The terminology of duplicate observations
>> would imply a problem for -duplicates-, but that command does not help
>> here.)
>>
>> Jessica's code borrowed from the -egenmore- package is to do with
>> counting values that are positive and non-missing. That won't help
>> either, as the values would be counted regardless of whether they are
>> distinct, as Jessica realises. There isn't a very easy way to go further
>> down that path, although it would be possible.
>>
>> Note that the -egenmore- package is on SSC. (Please remember to explain
>> where programs you use come from.)
>>
>> The problem is however very close to that discussed in an FAQ
>>
>> FAQ     . . . . . . . . .  Counting distinct strings across a set of variables
>>         . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
>>         7/04    How do I count the number of distinct strings
>>                 across a set of variables?
>>
>> <http://www.stata.com/support/faqs/data/distinctstrings.html>
>>
>> One strategy discussed there starts with a -reshape-. Scott Merryman has
>> followed a similar line in his suggestions.
>>
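A rough sketch of the -reshape- route, for comparison. This is not the FAQ's code or Scott's code, only one way the idea might look on the toy data from the original question; the names slot, first and njobs are made up for illustration.

```stata
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

* one row per person per job slot; the suffixes (1_97, 2_97, ...) are
* not numeric, so j() has to be declared string
reshape long emp, i(id) j(slot) string

* tag() marks one row per distinct (id, emp) pair and skips missing emp
egen byte first = tag(id emp)
egen njobs = total(first), by(id)
drop first

* back to one row per person; njobs is constant within id
reshape wide

* checks on the toy data: four distinct jobs for id 1, one for id 2
assert njobs == 4 if id == 1
assert njobs == 1 if id == 2
```

A person with no jobs recorded at all would get njobs equal to 0 under this sketch, since no row is tagged.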
>> Since that FAQ was written, writing an -egen- function based on a Mata
>> workhorse has come to seem a good way to do this. In fact, the
>> -rowmedian()- function for -egen- in the -egenmore- package has most of
>> the code needed. As the problem arises for numeric variables as well as
>> for string variables, two functions could be useful.
>>
>> * -------------------- put in _grownvals.ado on your adopath
>> * number of distinct non-missing numeric values in each observation
>> * NJC 1.0.0 7 Jan 2009
>> program _grownvals
>>        version 9
>>        gettoken type 0 : 0
>>        gettoken h    0 : 0
>>        gettoken eqs  0 : 0
>>
>>        syntax varlist(numeric) [if] [in] [, BY(string)]
>>        if `"`by'"' != "" {
>>                _egennoby rownvals() `"`by'"'
>>                /* NOTREACHED */
>>        }
>>
>>        marksample touse, novarlist
>>        quietly {
>>                mata : row_nvals("`varlist'", "`touse'", "`h'", "`type'")
>>        }
>> end
>>
>> mata :
>>
>> void row_nvals(string scalar varnames,
>>                string scalar tousename,
>>                string scalar nvalsname,
>>                string scalar type)
>> {
>>        real matrix y
>>        real colvector nvals, row
>>
>>        st_view(y, ., tokens(varnames), tousename)
>>        nvals = J(rows(y), 1, .)
>>
>>        for(i = 1; i <= rows(y); i++) {
>>                row = y[i,]'
>>                nvals[i] = length(uniqrows(select(row, (row :< .))))
>>        }
>>
>>        st_addvar(type, nvalsname)
>>        st_store(., nvalsname, tousename, nvals)
>> }
>>
>> end
>> * end of _grownvals.ado
>>
>> * -------------------- put in _growsvals.ado on your adopath
>> * number of distinct non-missing string values in each observation
>> * NJC 1.0.0 7 Jan 2009
>> program _growsvals
>>        version 9
>>        gettoken type 0 : 0
>>        gettoken h    0 : 0
>>        gettoken eqs  0 : 0
>>
>>        syntax varlist(string) [if] [in] [, BY(string)]
>>        if `"`by'"' != "" {
>>                _egennoby rowsvals() `"`by'"'
>>                /* NOTREACHED */
>>        }
>>
>>        marksample touse, novarlist
>>        quietly {
>>                mata : row_svals("`varlist'", "`touse'", "`h'", "`type'")
>>        }
>> end
>>
>> mata :
>>
>> void row_svals(string scalar varnames,
>>                string scalar tousename,
>>                string scalar svalsname,
>>                string scalar type)
>> {
>>        string matrix y
>>        string colvector row
>>        real colvector svals
>>
>>        st_sview(y, ., tokens(varnames), tousename)
>>        svals = J(rows(y), 1, .)
>>
>>        for(i = 1; i <= rows(y); i++) {
>>                row = y[i,]'
>>                svals[i] = length(uniqrows(select(row, (row :!= ""))))
>>        }
>>
>>        st_addvar(type, svalsname)
>>        st_store(., svalsname, tousename, svals)
>> }
>>
>> end
>> * end of _growsvals.ado
>>
>>
>> You can invoke these functions, once the program files are in place, by
>>
>> egen nvals = rownvals(<numeric varlist>)
>>
>> egen svals = rowsvals(<string varlist>)
>>
>> I'll add those functions to -egenmore- in due course.
>>
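A small worked check on the toy data from the original question, assuming _grownvals.ado has been saved on the adopath with the Mata call on a single line. The variable name njobs is only illustrative.

```stata
clear
input id emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98
1 9701 9702 9703 9801 9701 .
2 9701 .    .    9701 .    .
end

* count distinct non-missing job IDs within each observation
egen njobs = rownvals(emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98)
list id njobs

* id 1 holds 9701, 9702, 9703 and 9801; id 2 holds only 9701
assert njobs == 4 if id == 1
assert njobs == 1 if id == 2
```

If the function behaves as described, the repeated 9701 is counted once in each row.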
>> Nick
>> [email protected]
>>
>> Jessica Looze
>>
>> I am trying to create a variable that indicates the number of jobs an
>> individual has held during a period of years. The dataset I am using,
>> NLSY97, records each respondent's work history in a roster format.
>> This roster assigns each job a unique ID indicating the year the job
>> began. For example, the roster for respondent #1 might look like:
>>
>> ID     Year     Job1     Job2     Job3
>> 1      1997     9701     9702     9703
>> 1      1998     9801     9701     .
>>
>> So, during these two years, this respondent held four different jobs
>> (9701 extending over into 1998).
>>
>> My data looks something like this:
>>
>> ID     EMP1_97     EMP2_97     EMP3_97     EMP1_98     EMP2_98     EMP3_98
>> 1      9701        9702        9703        9801        9701        .
>> 2      9701        .           .           9701        .           .
>>
>> I have been working with the row operations suggested in the egenmore
>> help entry. My current working code looks like that on this manual
>> page:
>>
>> gen any = 0
>> gen all = 1
>> gen count = 0
>> foreach v of varlist emp1_97 emp2_97 emp3_97 emp1_98 emp2_98 emp3_98 {
>>     replace any = max(any, inrange(`v', 0, .))
>>     replace all = min(all, inrange(`v', 0, .))
>>     replace count = count + inrange(`v', 0, .)
>> }
>>
>> From here, I cannot figure out how to modify the variable count, so
>> that it disregards duplicate IDs.
>>
>> Any suggestions would be much appreciated.
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
