I agree with these criteria. In addition, a general solution to this
should be able to tackle

1. Missing values
2. Weights
3. Ties in frequency (e.g. there may not be exactly 10 modes)

As promised earlier, here is an update of -modes-, previously published
in the STB and the SJ. A formal update will follow in the Stata Journal.
*! NJC 1.4.0 17 November 2009
* NJC 1.3.0 13 May 2003 (SJ3-2: sg113_1)
* NJC 1.2.0 15 June 1999
* NJC 1.1.2 23 December 1998
* NJC 1.1.1 29 October 1998
program modes, sort
    version 8.0
    syntax varname [if] [in] [fweight aweight/] ///
        [ , Min(int 0) Nmodes(int 0) GENerate(str) ]

    if "`generate'" != "" {
        capture confirm new variable `generate'
        if _rc {
            di as err "generate() requires new variable name"
            exit _rc
        }
    }

    if `min' & `nmodes' {
        di as err "may not specify both min() and nmodes()"
        exit 198
    }

    quietly {
        marksample touse, strok
        count if `touse'
        if r(N) == 0 error 2000

        tempvar freq
        if "`exp'" == "" local exp = 1
        bysort `touse' `varlist' : ///
            gen double `freq' = sum(`exp') * `touse'
        by `touse' `varlist' : ///
            replace `freq' = (_n == _N) * `freq'[_N]
        label var `freq' "Freq."

        if `min' > 0 {
            local which "`freq' >= `min'"
        }
        else if `nmodes' > 0 {
            sort `touse' `freq' `varlist'
            count if `freq'
            local nmodes = min(`nmodes', r(N))
            local which "`freq' >= `freq'[_N - `nmodes' + 1]"
        }
        else {
            su `freq', meanonly
            local max = r(max)
            local which "`freq' == `max'"
        }

        count if `which'
        if r(N) == 0 {
            di as err "no such modes in data"
            exit 498
        }
    }

    tabdisp `varlist' if `which', c(`freq')

    quietly if "`generate'" != "" {
        gen byte `generate' = `which' if `touse'
        bysort `touse' `varlist' (`generate') : ///
            replace `generate' = `generate'[_N] if `touse'
    }
end
------------------------------------------------------------------------
help for modes (SJ9-4: sg113_2; SJ3-2: sg113_1)
------------------------------------------------------------------------
Tabulation of mode(s)
modes varname [weight] [if exp] [in range] [ , { min(#) |
nmodes(#) } generate(newvar) ]
Description
modes tabulates the mode(s) of varname, that is, the value(s) of
varname that occur most frequently. varname may be numeric or
string. fweights and aweights are allowed. Missing values are
ignored.
modes is most obviously useful with a discrete or categorical
variable. Continuous variables may need to be placed in bins or
classes first.
Options
min(#) specifies that all values with a frequency of # or more
should be shown.
nmodes(#) specifies that # modes should be shown. However, if ties
in frequency make identification of precisely # modes
arbitrary, all such tied modes will be shown. Note that fewer
modes will be shown if fewer than # modes exist.
min() and nmodes() may not be specified together.
generate(newvar) generates an indicator variable that is missing
if varname is missing or observations are excluded by if or
in, 1 whenever the value of varname is one of the displayed
modes, and 0 otherwise.
Examples
. modes rep78
. modes rep78 if foreign
. modes mpg, min(3)
. modes mpg, nmodes(3)
. modes turn, nmodes(10) gen(flag)
Author
Nicholas J. Cox, Durham University, U.K.
Acknowledgments
A problem posed by Sylvain Friederich led to the nmodes() option.
A problem posed by Elan Cohen led to the generate() option.
Also see
STB: STB-50 sg113
Online: help for tabulate, kdensity, egen
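
For anyone who wants to see the core idea without the program, here is a
minimal sketch using only built-in commands. It assumes the auto dataset
shipped with Stata; the variable names freq and ismode are made up for
illustration. It counts each distinct nonmissing value, finds the largest
count, and flags every observation whose value reaches it, so tied modes
are all flagged. It ignores weights, which the program above handles.

```stata
sysuse auto, clear

* frequency of each distinct value of rep78; left missing when rep78 is missing
bysort rep78: gen long freq = _N if !missing(rep78)

* largest frequency among the nonmissing values
summarize freq, meanonly

* 1 if the value is a mode, 0 if not, missing if rep78 is missing;
* every value tied for the largest frequency is flagged
gen byte ismode = freq == r(max) if !missing(rep78)

tabulate rep78 if ismode
```
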
Nick
Martin Weiss
As discussed last night between me and Sergiy: You want the whole dataset
with all variables intact plus one that denotes membership in the "club of
most frequent values of mpg"...
[email protected]
Suppose we need to flag the 5 most frequent values. How about the
following commands?
sysuse auto, clear
keep mpg
bys mpg: egen mycount=count(mpg)
bys mycount: g num=_n
gsort num -mycount
g tag=_n<=5
bys mycount: egen rank5=max(tag)
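
One possible variation, offered as a sketch rather than a definitive
answer: it keeps every variable in the dataset and treats ties in the
spirit of the nmodes() option described above. The names freq, first,
cutoff and top5 are made up for illustration, and the code assumes that
mpg has no missing values and at least 5 distinct values, as in the auto
data. Note that it leaves the data in a different sort order.

```stata
sysuse auto, clear

* frequency of each distinct value of mpg, carried on every observation
bysort mpg: gen long freq = _N

* tag exactly one observation per distinct value of mpg
egen byte first = tag(mpg)

* put the tagged observations first, most frequent value on top
gsort -first -freq mpg

* row 5 now holds the 5th largest frequency: use it as the cutoff
* (assumption: at least 5 distinct values of mpg)
local cutoff = freq[5]

* flag every observation whose value is at least that frequent;
* values tied with the 5th are flagged too, so more than 5 may qualify
gen byte top5 = freq >= `cutoff'

tabulate mpg if top5
```

If the -modes- program above is installed, modes mpg, nmodes(5)
generate(top5) should give a comparable indicator, and it also copes
with missing values and weights.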