Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Sheena Sullivan" <sgsullivan@ucla.edu> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | st: calculating the whiskers on a boxplot using -twoway- |
Date | Thu, 21 Mar 2013 10:40:21 +1100 |
I am creating boxplots using -twoway- rather than -graph boxplot- because I want to display the mean values and it was suggested in a previous Statalist post to follow the (great) guidance provided in N.J. Cox "Speaking Stata: Creating and varying box plots." SJ 9(3):478-496. However, I'm a little confused about the guidance for calculating the whiskers. The Cox article suggests that the length of the whiskers can be generated using the following (syntax for upper whisker only is shown): . bysort year: egen upper=max(min(logIC50, upq+1.5*iqr)) ///(eq1) logIC50 is my variable of interest, upq is its upper quartile, loq is its lower quartile and iqr is its inter-quartile range. The upper limit, upper, should be the largest value (of an observation within the dataset) not greater than upq + 1.5*iqr and the lower limit, lower, should be the smallest value not less than loq - 1.5*iqr But they aren't. I noticed the problem because if I draw my plots using the -twoway- commands in the Cox article (see end of email) the whiskers are longer than when drawn using -graph box-. So I did a manual inspection of the data and sometimes the values generated by eq1 represent a value in the dataset and at other times they do not. If I instead use the following syntax: . gen upper1=upq+1.5*iqr ///(eq2) . bysort year: egen upper2=max(logIC50) if logIC50<upper1 ///(eq3) I can generate a graph using the -twoway- commands in the Cox article with whiskers that look the same as when I draw it using -graph box- However, to get my outliers to appear, I need to include a -scatter- and using the upper2 and lower2 variables doesn't work because the if statement in eq4 means no value is generated for lower2 and upper2 for those observations outside the range. A work around is to fill the gaps by creating another variable of the mean (or min or max because they are all the same) of the other observations within that by group: . bysort year: egen upper3=mean(upper2) . bysort year: egen lower3=mean(lower2) But this getting to be an awful lot of syntax just to add a mean to a boxplot! So, is there a better way to generate the whiskers? And can anyone help me understand why eq1 sometimes produces results that do not correspond to a value in the dataset? Thanks, Sheena Syntax for drawing the boxplot using -twoway- . twoway rbar med upq year, pstyle(p1) blc(gs4) bfc(gs9) blw(vthin) barw(0.35) || /// upper half of box > rbar med loq year, pstyle(p1) blc(gs4) bfc(gs9) blw(vthin) barw(0.35) || /// lower half of box > rspike upq upper year, pstyle(p1) || /// upper whisker > rspike loq lower year, pstyle(p1) || /// lower whisker > rcap upper upper year, pstyle(p1) msize(*2) || /// cap the upper whisker > rcap lower lower year, pstyle(p1) msize(*2) || /// cap the lower whisker > scatter mean year, ms(+) msize(1) mlw(vvthin) || /// mean > scatter logIC50 year if !inrange(logIC50, lower, upper), ms(O) mc(gs4) msize(1) mlw(vthin) /// outliers > legend(off) /// > yla(, ang(h)) ytitle(Log IC50) xtitle("") /// > yscale(range(-2 2.5)) ylabel(-2 -1 0 1 2) title(B) b1title(Year) /// > name(p_B, replace) yline(.53 1.53, lcolor(black) lpattern(dash)) /// > scheme(sj) Syntax for -graph box- . graph box logIC50, over(year) /// > name(pB, replace) yline(.53 1.53, lcolor(black) lpattern(dash)) /// > yscale(range(-2 2.5)) ylabel(-2 -1 0 1 2) title(B) b1title(Year) * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/