See also Roger Newson's -sencode- on SSC, which
is designed for an overlapping problem.
Nick
Nick Cox
> Note for anyone interested:
>
> -levelsof- as implemented in Stata 9 differs
> subtly from -levels- as added to Stata 8
> during its lifetime.
>
> That aside, I am very surprised at Iwan's
> report that -levelsof- reports categories
> according to their order of occurrence in the data.
> That contradicts not just the help file, but
> also the code as I read it (and for that matter
> as I wrote it, originally). StataCorp would like
> to see evidence, I am sure. I suspect Iwan's
> impression is mistaken, but I am not sure why
> it arises.
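
A quick check of the sorting behaviour can be sketched as follows. The toy data and the macro name -levels- are assumptions made up for illustration. The values are entered out of order, and -levelsof- should return them sorted, as the help file says.

```stata
* Toy data: ids deliberately entered out of sorted order
clear
input id
3
1
2
1
3
end

* -levelsof- stores the distinct values in the local macro levels
levelsof id, local(levels)

* The list is sorted, not in order of first occurrence (3 1 2)
display "`levels'"
```

If the list ever came back as 3 1 2, that would be the evidence worth sending to StataCorp.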
>
> The general problem to which -levelsof- is
> one solution is discussed in
>
> http://www.stata.com/support/faqs/data/foreach.html
>
> A fairly general strategy for going through all
> possible levels
>
> * according to their order of first occurrence
> * in the data
>
> is as follows.
> (This circumvents problems arising when -levelsof-
> cannot cope.)
>
> Suppose we have an identifier, say -id-.
>
> First generate an observation number:
>
> gen long obs = _n
>
> Now we sort by -id-, breaking ties by
> -obs-. The first observation in each block
> then carries information on first occurrence.
> We copy the observation number of first
> occurrence to each other occurrence of the same id.
>
> bysort id (obs) : replace obs = obs[1]
>
> Now we tag ids from 1 to whatever, according
> to first occurrence:
>
> bysort obs : gen group = _n == 1
> replace group = sum(group)
>
> Those familiar with -egen, group()- may
> recognise the basic idea here.
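
The steps so far can be tried on a small made-up dataset. This is only a sketch: the id values 30, 10, 20 are hypothetical, chosen so that order of first occurrence differs from sorted order.

```stata
* Toy data: ids first occur in the order 30, 10, 20
clear
input id
30
10
20
10
30
end

* Record the original observation number
gen long obs = _n

* Copy the observation number of first occurrence to every
* observation with the same id
bysort id (obs) : replace obs = obs[1]

* Flag the first observation of each block, then take a running
* sum so that groups are numbered 1, 2, 3, ... by first occurrence
bysort obs : gen group = _n == 1
replace group = sum(group)

list id obs group
```

Here id 30 should end up in group 1, id 10 in group 2 and id 20 in group 3.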
>
> Now the number of groups is identifiable from
>
> su group, meanonly
> local max = r(max)
>
> Typically then you loop over groups:
>
> forval i = 1/`max' {
> ...
> }
>
> Within that loop, a look-up technique to
> get the identifier concerned is, for
> a numeric identifier:
>
> su id if group == `i', meanonly
>
> All identifiers in each group are the same,
> so it matters little whether we pick up
> the minimum, the mean or the maximum:
>
> local which = r(min)
>
> will do, for example.
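
Putting the pieces together for a numeric identifier, a minimal sketch might look like this. The toy data and the macro -order-, which just collects the ids as they are visited, are assumptions added for illustration.

```stata
* Toy data: ids first occur in the order 30, 10, 20
clear
input id
30
10
20
10
30
end

* Number the groups by first occurrence, as described above
gen long obs = _n
bysort id (obs) : replace obs = obs[1]
bysort obs : gen group = _n == 1
replace group = sum(group)

* Find the number of groups
su group, meanonly
local max = r(max)

* Loop over groups, looking up the id that belongs to each
local order
forval i = 1/`max' {
    su id if group == `i', meanonly
    local which = r(min)
    local order `order' `which'
    di "group `i': id `which'"
}
```

No macro ever holds more than one identifier at a time unless you choose to collect them, so the character limit that troubles -levelsof- need not bite.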
>
> If the identifier -id- is a string variable, a little
> more work is needed. Outside the loop, once the groups
> have been defined, reset -obs- to the current
> observation number:
>
> replace obs = _n
>
> Inside the loop,
>
> su obs if group == `i', meanonly
> local which = id[`r(min)']
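
For a string identifier the same idea can be sketched as follows. Again the data are made up: the names north, east and south are hypothetical.

```stata
* Toy data: string ids first occur in the order north, east, south
clear
input str5 id
"north"
"east"
"south"
"east"
"north"
end

* Number the groups by first occurrence
gen long obs = _n
bysort id (obs) : replace obs = obs[1]
bysort obs : gen group = _n == 1
replace group = sum(group)

* Outside the loop: obs now holds the current observation number
replace obs = _n

su group, meanonly
local max = r(max)

* Inside the loop: find one observation in the group and read
* the string identifier from it
forval i = 1/`max' {
    su obs if group == `i', meanonly
    local which = id[`r(min)']
    di "group `i': `which'"
}
```

The look-up works because -summarize- cannot be applied to a string variable, but it can return an observation number at which the string can be read off.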
>
> Nick
>
> Barankay, Iwan
> >
> > I find the command "levelsof" very useful to cut down the
> > time on loops when I run through the categories of a variable
> > (e.g. the location_ids of a large survey).
> >
> > What I also like is that the local macro generated by
> > levelsof is - so it seems to me - still in the order in which
> > the values appear in the data and is not sorted, which is needed
> > at times (even though the help file of levelsof says
> > otherwise). Usually, when a list is entered into a local, it is
> > then sorted.
> >
> > The problem of course is that there are constraints on
> > levelsof when it hits the character limit.
> >
> > My question is:
> >
> > What can I use instead of levelsof for (i) a large number of
> > categories, to avoid the character constraint, but which (ii)
> > also keeps the categories in the order in which they appear in
> > the data and does not sort them?
> >
> > (i) is much more important than (ii), but if someone has an
> > elegant solution for (ii) I would love to hear of it.