Thank you, everyone. I had just finished writing a solution similar to Jeph's, but without the generality that Nick's solution offers. -modes- with its nmodes() option will definitely do the trick.
Thanks again,
- Elan
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Nick Cox
> Sent: Tuesday, November 17, 2009 10:09 AM
> To: [email protected]
> Subject: RE: RE: st: AW: Create a flag variable for 10 most frequent values
>
> I agree with these criteria. In addition, a general solution to this
> should be able to tackle
>
> Missing values
> Weights
> Ties in frequency (e.g. there may not be exactly 10 modes)
>
> As promised earlier, here is an update of -modes- earlier published in
> the STB and the SJ. An update follows in the Stata Journal.
>
> *! NJC 1.4.0 17 November 2009
> * NJC 1.3.0 13 May 2003 (SJ3-2: sg113_1)
> * NJC 1.2.0 15 June 1999
> * NJC 1.1.2 23 December 1998
> * NJC 1.1.1 29 October 1998
> program modes, sort
>     version 8.0
>     syntax varname [if] [in] [fweight aweight/] ///
>     [ , Min(int 0) Nmodes(int 0) GENerate(str) ]
>
>     if "`generate'" != "" {
>         capture confirm new variable `generate'
>         if _rc {
>             di as err "generate() requires new variable name"
>             exit _rc
>         }
>     }
>
>     if `min' & `nmodes' {
>         di as err "may not specify both min() and nmodes()"
>         exit 198
>     }
>
>     quietly {
>         marksample touse, strok
>         count if `touse'
>         if r(N) == 0 error 2000
>
>         tempvar freq
>         if "`exp'" == "" local exp = 1
>         bysort `touse' `varlist' : ///
>             gen double `freq' = sum(`exp') * `touse'
>         by `touse' `varlist' : ///
>             replace `freq' = (_n == _N) * `freq'[_N]
>         label var `freq' "Freq."
>
>         if `min' > 0 {
>             local which "`freq' >= `min'"
>         }
>         else if `nmodes' > 0 {
>             sort `touse' `freq' `varlist'
>             count if `freq'
>             local nmodes = min(`nmodes', r(N))
>             local which "`freq' >= `freq'[_N - `nmodes' + 1]"
>         }
>         else {
>             su `freq', meanonly
>             local max = r(max)
>             local which "`freq' == `max'"
>         }
>
>         count if `which'
>         if r(N) == 0 {
>             di as err "no such modes in data"
>             exit 498
>         }
>     }
>
>     tabdisp `varlist' if `which', c(`freq')
>
>     quietly if "`generate'" != "" {
>         gen byte `generate' = `which' if `touse'
>         bysort `touse' `varlist' (`generate') : ///
>             replace `generate' = `generate'[_N] if `touse'
>     }
>
> end
>
>
> ------------------------------------------------------------------------
> help for modes                          (SJ9-4: sg113_2; SJ3-2: sg113_1)
> ------------------------------------------------------------------------
>
> Tabulation of mode(s)
>
> modes varname [weight] [if exp] [in range] [ , { min(#) | nmodes(#) } generate(newvar) ]
>
>
> Description
>
> modes tabulates the mode(s) of varname, that is, the value(s) of
> varname that occur most frequently. varname may be numeric or
> string. fweights and aweights are allowed. Missing values are
> ignored.
>
> modes is most obviously useful with a discrete or categorical
> variable. Continuous variables may need to be placed in bins or
> classes first.
>
>
> Options
>
> min(#) specifies that all values with a frequency of # or more
> should be shown.
>
> nmodes(#) specifies that # modes should be shown. However, if ties
> in frequency make identification of precisely # modes
> arbitrary, all such tied modes will be shown. Note that fewer
> modes will be shown if fewer than # modes exist.
>
> min() and nmodes() may not be specified together.
>
> generate(newvar) generates an indicator variable that is missing
> if varname is missing or observations are excluded by if or
> in, 1 whenever the value of varname is one of the displayed
> modes, and 0 otherwise.
>
>
> Examples
>
> . modes rep78
> . modes rep78 if foreign
> . modes mpg, min(3)
> . modes mpg, nmodes(3)
> . modes turn, nmodes(10) gen(flag)
>
>
> Author
>
> Nicholas J. Cox, Durham University, U.K.
> [email protected]
>
>
> Acknowledgments
>
> A problem posed by Sylvain Friederich led to the nmodes() option.
> A problem posed by Elan Cohen led to the generate() option.
>
>
> Also see
>
> STB: STB-50 sg113
> Online: help for tabulate, kdensity, egen
>
> Nick
> [email protected]
>
> Martin Weiss
>
> As discussed last night between me and Sergiy: You want the whole
> dataset with all variables intact plus one that denotes membership in
> the "club of most frequent values of mpg"...
>
> [email protected]
>
> Suppose we need to flag the 5 most frequent values. How about the
> following commands?
>
> sysuse auto, clear
> keep mpg
> bys mpg: egen mycount=count(mpg)
> bys mycount: g num=_n
> gsort num -mycount
> g tag=_n<=5
> bys mycount: egen rank5=max(tag)
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>
>