RE: RE: st: AW: Create a flag variable for 10 most frequent values

From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: RE: st: AW: Create a flag variable for 10 most frequent values
Date   Tue, 17 Nov 2009 15:08:39 -0000

I agree with these criteria. In addition, a general solution to this
problem should be able to tackle:

Missing values
Ties in frequency (e.g. there may not be exactly 10 modes) 

As promised earlier, here is an update of -modes-, previously published
in the STB and the SJ. The update will also appear in the Stata Journal. 

*! NJC 1.4.0 17 November 2009 
* NJC 1.3.0 13 May 2003            (SJ3-2: sg113_1)
* NJC 1.2.0 15 June 1999 
* NJC 1.1.2 23 December 1998
* NJC 1.1.1 29 October 1998
program modes, sort 
	version 8.0
	syntax varname [if] [in] [fweight aweight/] ///
	[ , Min(int 0) Nmodes(int 0) GENerate(str) ]

	* generate() must name a variable that does not yet exist
	if "`generate'" != "" { 
		capture confirm new variable `generate' 
		if _rc { 
			di as err "generate() requires new variable"
			exit _rc 
		}
	}

	if `min' & `nmodes' { 
		di as err "may not specify both min() and nmodes()"
		exit 198
	}

	quietly { 
		* strok: string variables allowed; missing values are excluded
		marksample touse, strok
		count if `touse' 
		if r(N) == 0 error 2000 

		* (weighted) frequency of each distinct value, kept only on
		* the last observation of each group and 0 elsewhere
		tempvar freq 
		if "`exp'" == "" local exp = 1 
		bysort `touse' `varlist' : ///
			gen double `freq' = sum(`exp') * `touse'
		by `touse' `varlist' : ///
			replace `freq' = (_n == _N) * `freq'[_N] 
		label var `freq' "Freq."

		if `min' > 0 { 
			local which "`freq' >= `min'" 
		}
		else if `nmodes' > 0 { 
			* threshold is the frequency of the nmodes-th most
			* frequent value, so ties at the threshold are included
			sort `touse' `freq' `varlist' 
			count if `freq' 
			local nmodes = min(`nmodes', r(N)) 
			local which "`freq' >= `freq'[_N - `nmodes' + 1]" 
		}
		else {
			su `freq', meanonly
			local max = r(max)
			local which "`freq' == `max'" 
		}

		count if `which'
		if r(N) == 0 {
			di as err "no such modes in data"
			exit 498
		}
	}

	tabdisp `varlist' if `which', c(`freq')

	* copy the indicator to every observation of each value
	quietly	if "`generate'" != "" { 
		gen byte `generate' = `which' if `touse' 
		bysort `touse' `varlist' (`generate') : ///
		replace `generate' = `generate'[_N] if `touse' 
	}
end

help for modes                          (SJ9-4: sg113_2; SJ3-2: sg113_1)

Tabulation of mode(s)

        modes varname [weight] [if exp] [in range] [ , { min(#) |
                 nmodes(#) } generate(newvar) ]


    modes tabulates the mode(s) of varname, that is, the value(s) of
    varname that occur most frequently. varname may be numeric or
    string.  fweights and aweights are allowed. Missing values are
    ignored.

    modes is most obviously useful with a discrete or categorical
    variable.  Continuous variables may need to be placed in bins or
    classes first.


    min(#) specifies that all values with a frequency of # or more
        should be shown.

    nmodes(#) specifies that # modes should be shown. However, if ties
        in frequency make identification of precisely # modes
        arbitrary, all such tied modes will be shown. Note that fewer
        modes will be shown if fewer than # modes exist.

        min() and nmodes() may not be specified together.
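
The tie rule for nmodes() can be illustrated with a small sketch. This is
not the code of -modes- itself: the toy data and the variable names freq,
first, and flag are made up for illustration. Asking for 2 modes yields 3
values here, because b and c tie for second place.

```stata
* Hypothetical toy data: a occurs 3 times, b and c twice, d once
clear
input str1 x
"a"
"a"
"a"
"b"
"b"
"c"
"c"
"d"
end

* Frequency of each value, and a marker for one observation per value
bysort x : gen long freq = _N
by x : gen byte first = _n == 1

* Put one observation per value on top, most frequent first
gsort -first -freq x

* Threshold for 2 modes: the frequency of the 2nd most frequent value
local threshold = freq[2]

* Every value at or above the threshold qualifies, ties included
gen byte flag = freq >= `threshold'
tabulate x flag
```

Using a frequency threshold rather than a fixed count of values is what
keeps the choice among tied values from being arbitrary.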

    generate(newvar) generates an indicator variable that is missing
        if varname is missing or observations are excluded by if or
        in, 1 whenever the value of varname is one of the displayed
        modes, and 0 otherwise.
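
The three states of such an indicator (1, 0, missing) can be sketched by
hand for the simplest case, a single mode without weights or if/in
restrictions. This is only an illustration under those assumptions, not
the -modes- code; the names freq and ismode are hypothetical.

```stata
sysuse auto, clear

* Frequency of each nonmissing value of rep78
bysort rep78 : gen long freq = _N if !missing(rep78)

* Highest frequency among the nonmissing values
summarize freq, meanonly

* 1 for the mode(s), 0 for other values, missing where rep78 is missing
gen byte ismode = (freq == r(max)) if !missing(rep78)

tabulate rep78 ismode, missing
```

The if qualifier on the last -generate- is what keeps observations with
missing rep78 from being coded 0.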


    . modes rep78
    . modes rep78 if foreign
    . modes mpg, min(3)
    . modes mpg, nmodes(3)
    . modes turn, nmodes(10) gen(flag)


    Nicholas J. Cox, Durham University, U.K.
    [email protected]


    A problem posed by Sylvain Friederich led to the nmodes() option.
    A problem posed by Elan Cohen led to the generate() option.

Also see

    STB:     STB-50 sg113
    Online:  help for tabulate, kdensity, egen

[email protected] 

Martin Weiss

As discussed last night between me and Sergiy: You want the whole
dataset with all variables intact plus one that denotes membership in the
"club of the most frequent values of mpg"...

[email protected]

Suppose we need to flag the 5 most frequent values, how about the
following?

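sysuse auto, clear
keep mpg
* Count how often each value of mpg occurs
bys mpg: egen mycount=count(mpg)
* Number the observations within each distinct frequency
bys mycount: g num=_n
* Put one observation per distinct frequency first, highest frequency on top
gsort num -mycount
* Tag the five highest distinct frequencies
g tag=_n<=5
* Spread the tag to every observation sharing one of those frequencies
bys mycount: egen rank5=max(tag)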

