Daphna Bassok started a thread by asking various
questions on box plots. Here I edit them slightly and
number the questions DB1 ... DB4.
DB1. Is there any way I can get labels on my box plot graphs?
Guilherme Silva answered
>> Supposing the variable of interest is named "xvar",
>> the variable of identification - "case", and that you
>> have seen just 4 outliers in a previous screening ... then to
>> identify outliers (outsides in the box plot) you may type:
>> . graph box xvar, medtype(line) mark(1,mlabel(case)) ...
and he pointed out that the rule is a separate -mark(,)- option
for each y variable.
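A minimal sketch of the same idea with the auto data shipped with Stata. The choice of -mpg-, -price- and -make- is mine, purely for illustration; -make- plays the role of "case" above. Note one -marker()- option per y variable:

```stata
* Illustrative only: auto data, with make as the identifier variable
sysuse auto, clear

* One y variable: label outside values with the identifier
graph box mpg, medtype(line) marker(1, mlabel(make))

* Two y variables: a separate marker() option for each
graph box mpg price, marker(1, mlabel(make)) marker(2, mlabel(make))
```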
DB2. I would like to see the values of the median, 25th percentile,
75th percentile, etc.
I answered
>> Use -summarize, detail- to see the median and quartiles.
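For example, with the auto data (my choice of example, not Daphna's data), the results are also left behind in r() so that they can be used afterwards:

```stata
sysuse auto, clear
summarize mpg, detail

* summarize, detail leaves the percentiles in r()
display "lower quartile = " r(p25)
display "median         = " r(p50)
display "upper quartile = " r(p75)
```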
DB3. I want to see/know the values of the top and bottom cut off lines.
How do I find these values?
I answered
>> The adjacent values are the extreme data points within
>> 1.5 iqr of the nearer quartile. I think you might have
>> to re-create those for yourself, as -graph box- doesn't
>> seem to leave them in memory. Nor should it really,
>> as there could be lots of them.
I also posted the code of a program -adjacent- to
calculate these, and commented
>> I seem to get the same values as do the box
>> plot routines. Note that adjacent values
>> need not be unique. More testing advisable.
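For a single variable, the calculation can be sketched with official commands alone. This is not the code of -adjacent-, just a rough illustration of the definition, assuming the quartiles reported by -summarize, detail- and using the auto data as an example:

```stata
sysuse auto, clear
quietly summarize mpg, detail
local q1 = r(p25)
local q3 = r(p75)
local iqr = `q3' - `q1'

* Upper adjacent value: largest value no more than 1.5 iqr above the upper quartile
quietly summarize mpg if mpg <= `q3' + 1.5 * `iqr', meanonly
display "upper adjacent value = " r(max)

* Lower adjacent value: smallest value no more than 1.5 iqr below the lower quartile
quietly summarize mpg if mpg >= `q1' - 1.5 * `iqr', meanonly
display "lower adjacent value = " r(min)
```

Whether this agrees exactly with -graph box- depends on the quartile definition being the same, so check against the graph.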
Ric Uslaner wrote
>> I copied -adjacent- into the do file editor and
>> tried to run it ... and this is what I got:
>> you must specify the lname() option
>> r(198);
whereas Clive Nicholas reported no problem.
He suggested -update q-.
The message Ric was seeing was coming from
official Stata -egen, group()-, which is
called by -adjacent-. I am not
clear why he's getting it. As far as I can
see it shouldn't happen. If it persists,
do flag that privately.
Daphna also asked privately, and I take
the liberty of echoing the question
here as others may be interested:
>> I am not sure I follow why the lower
>> and upper adjacent values are not
>> unique for a given population
What I meant was that there could
be ties for an adjacent value: two or more
observations may share that value. Naturally,
there could also be ties even for the
most extreme outliers.
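One way to check for such ties, sketched here with the auto data and official commands only (again an illustration, not the -adjacent- code), is to count and list the observations equal to the adjacent value:

```stata
sysuse auto, clear
quietly summarize mpg, detail
local upper = r(p75) + 1.5 * (r(p75) - r(p25))

* Upper adjacent value: largest observed value at or below the upper limit
quietly summarize mpg if mpg <= `upper', meanonly
local uav = r(max)

* More than one observation here means the adjacent value is tied
count if mpg == `uav'
list make mpg if mpg == `uav', noobs
```

Testing equality like this is safe for an integer variable such as -mpg-; with non-integer values, beware of precision problems.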
I have now extended -adjacent- so that
it supports multiple variables in
the varlist and also frequency and
analytic weights. I'll send the files
to Kit Baum for posting on SSC.
DB4. I am interested in analyzing the outliers or outside values,
but I am not able to see what the specific lower and upper cut off
values are.
Another program which may be of interest here
is -extremes- from SSC. With the -iqr- option,
or with -iqr(1.5)- you can see which observations
are more than 1.5 iqr from the nearer quartile:
. extremes mpg, iqr
  +--------------------+
  | obs:    iqr:   mpg |
  |--------------------|
  |  59.   2.286    41 |
  +--------------------+
What's often more useful is to specify
other variables which are included
in the listing as context:
. extremes mpg make, iqr
  +--------------------------------+
  | obs:    iqr:   mpg   make      |
  |--------------------------------|
  |  59.   2.286    41   VW Diesel |
  +--------------------------------+
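If you do not have -extremes- installed, something similar can be sketched with official commands. The variable name -iqrdist- is my own invention here, and I am assuming that the iqr column is the distance from the nearer quartile in units of iqr, which is what the listed 2.286 suggests:

```stata
sysuse auto, clear
quietly summarize mpg, detail
local q1 = r(p25)
local q3 = r(p75)
local iqr = `q3' - `q1'

* Distance beyond the nearer quartile, in units of iqr (hypothetical variable name)
generate double iqrdist = (mpg - `q3') / `iqr' if mpg > `q3' & mpg < .
replace iqrdist = (`q1' - mpg) / `iqr' if mpg < `q1'

* Observations more than 1.5 iqr from the nearer quartile
list make mpg iqrdist if iqrdist > 1.5 & iqrdist < ., noobs
```

The condition -iqrdist < .- matters because missing values count as larger than any number in Stata.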
Just added to -extremes-, but not yet
in the version on SSC, is support for -by:-.
. bysort for : extremes mpg make, iqr
-------------------------------------
-> foreign = Domestic
  +----------------------------------+
  | obs:    iqr:   mpg   make        |
  |----------------------------------|
  |  23.   2.182    34   Plym. Champ |
  +----------------------------------+
-------------------------------------
-> foreign = Foreign
  +--------------------------------+
  | obs:    iqr:   mpg   make      |
  |--------------------------------|
  |  71.   1.857    41   VW Diesel |
  +--------------------------------+
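Until that version is posted, a groupwise calculation can be sketched with -egen-. The names -q1-, -q3- and -iqrdist- are hypothetical, and I am assuming that -egen, pctile()- gives the same quartiles as -extremes- uses, which I have not checked in detail:

```stata
sysuse auto, clear

* Quartiles computed separately within each group
bysort foreign: egen q1 = pctile(mpg), p(25)
bysort foreign: egen q3 = pctile(mpg), p(75)

* Distance beyond the nearer quartile in units of iqr; 0 for values inside the box
generate double iqrdist = max(mpg - q3, q1 - mpg, 0) / (q3 - q1) if mpg < .

list foreign make mpg iqrdist if iqrdist > 1.5 & iqrdist < ., noobs sepby(foreign)
```

If the quartiles coincide within a group, the division yields missing, and such observations are left out of the listing.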
Nick