Re: st: extract values from kdensity graphic
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: extract values from kdensity graphic
Date: Wed, 2 May 2012 18:02:41 +0100
... should be "their being local minima"
On Wed, May 2, 2012 at 5:49 PM, Nick Cox <[email protected]> wrote:
> That problem is several orders of magnitude more difficult than what
> you originally asked.
>
> -kdensity- says nothing directly about the number of groups that
> really or notionally exist. If you are counting modes, that is
> evidence, but the number of modes is dependent on what kernel type and
> what kernel width are chosen and where you estimate the density
> function. Also, if the data are skewed, it may be a good idea to
> estimate the density on a transformed scale.
>
> You should never conclude anything from kernel density estimation
> without a sensitivity analysis on kernel type, width and where
> estimated. Know that the defaults for -kdensity- are pretty arbitrary.
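A minimal sketch of that sensitivity check on the half-width, assuming Mike's example data (below) are in memory; width() is the Stata 8 option name, abbreviated w(), and later releases call it bwidth():

* estimate the density for several half-widths and overlay the curves
local i = 1
foreach w in 0.05 0.1 0.2 0.5 {
    kdensity size, w(`w') n(30) nograph generate(x`i' d`i')
    local i = `i' + 1
}
twoway (line d1 x1) (line d2 x2) (line d3 x3) (line d4 x4)

The number and location of apparent modes and troughs can shift as the half-width changes, which is exactly what such a check is meant to reveal.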
>
> I would have said that for your original problem. I intensify this
> advice on now being told that you are trying to identify hundreds of
> modes in the real problem.
>
> If you persist in this you can look for troughs by there being local
> minima i.e. less than values on either side in a sorted set of values.
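A sketch of that trough search in Stata, again assuming Mike's example data are in memory (untested under Stata 8, but the options used are long-standing -kdensity- syntax):

* store the evaluation points and density estimates as new variables,
* then flag points whose density is lower than at both neighbours
kdensity size, w(0.1) n(30) nograph generate(gx gd)
gen byte trough = gd < gd[_n-1] & gd < gd[_n+1] if _n > 1 & _n < _N
list gx gd if trough == 1

The listed gx values are then candidate cut points between groups, subject to the caveats above about kernel type and width.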
>
> In contrast, cluster analysis methods have scope to address the
> question of how many groups exist. But they aren't likely to be
> practical for identifying hundreds of classes.
>
> My -round()- suggestion was a little flippant. Your example is one
> where five groups appear to exist unequivocally and many methods will
> find them. -round(, 1)- was one but I do agree that it is not a good
> method generally.
>
> Nick
>
> On Wed, May 2, 2012 at 5:27 PM, <[email protected]> wrote:
>> Many thanks Nick,
>>
>> -group1d- doesn't suit my application (versions of Stata aside) as I don't
>> want to have to specify the number of groups. I really like the kdensity
>> plot because it automatically determines the number of groups (which are
>> in the hundreds for my real data sets).
>>
>> Unfortunately -round- often fails to group sizes appropriately in my full
>> data sets too, as the clusters don't always align with the rounding units.
>>
>> The kdensity plot shows exactly what I want, but alas I can't extract its
>> data (trough coordinates).
>>
>> Any more thoughts from the list?
>>
>> Mike.
>>
>>
>>
>>
>> Another way of looking at these data is to apply -group1d- (SSC). In fact
>> Mike cannot do that himself because it requires Stata 9, but he can use the
>> results. With the least-squares criterion explained in its help and the
>> references given there, -group1d- yields the following as the best 5 groups:
>>
>> Group  Size First obs    value Last obs    value     Mean     SD
>>     5     8        23   100.62       30   100.91   100.75   0.09
>>     4     1        22    98.41       22    98.41    98.41   0.00
>>     3     6        16    97.19       21    97.39    97.29   0.06
>>     2     8         8    96.11       15    96.34    96.25   0.07
>>     1     7         1    94.74        7    95.08    94.95   0.11
>>
>> In fact, just about any method of cluster analysis should find the same
>> groups if they are genuine, e.g. -cluster kmeans-. Then use whatever
>> summary you prefer.
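For concreteness, a sketch of the -cluster kmeans- route with Mike's example data, taking 5 groups as given; k-means needs the number of groups specified in advance, and the cluster name and seed here are arbitrary:

* partition size into 5 groups and summarize each
cluster kmeans size, k(5) name(grp5) start(krandom(12345))
tabstat size, by(grp5) statistics(n mean sd min max)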
>>
>> Details follow for -group1d-.
>>
>> . sort size
>>
>> . group1d size, max(7)
>>
>> Partitions of 30 data up to 7 groups
>>
>> 1 group: sum of squares 143.60
>> Group  Size First obs    value Last obs    value     Mean     SD
>>     1    30         1    94.74       30   100.91    97.43   2.19
>>
>> 2 groups: sum of squares 23.00
>> Group  Size First obs    value Last obs    value     Mean     SD
>>     2     9        22    98.41       30   100.91   100.49   0.74
>>     1    21         1    94.74       21    97.39    96.12   0.93
>>
>> 3 groups: sum of squares 6.62
>> Group  Size First obs    value Last obs    value     Mean     SD
>>     3     8        23   100.62       30   100.91   100.75   0.09
>>     2    15         8    96.11       22    98.41    96.81   0.66
>>     1     7         1    94.74        7    95.08    94.95   0.11
>>
>> 4 groups: sum of squares 1.26
>> Group  Size First obs    value Last obs    value     Mean     SD
>>     4     8        23   100.62       30   100.91   100.75   0.09
>>     3     7        16    97.19       22    98.41    97.45   0.40
>>     2     8         8    96.11       15    96.34    96.25   0.07
>>     1     7         1    94.74        7    95.08    94.95   0.11
>>
>> 5 groups: sum of squares 0.20
>> Group  Size First obs    value Last obs    value     Mean     SD
>>     5     8        23   100.62       30   100.91   100.75   0.09
>>     4     1        22    98.41       22    98.41    98.41   0.00
>>     3     6        16    97.19       21    97.39    97.29   0.06
>>     2     8         8    96.11       15    96.34    96.25   0.07
>>     1     7         1    94.74        7    95.08    94.95   0.11
>>
>> 6 groups: sum of squares 0.14
>> Group  Size First obs    value Last obs    value     Mean     SD
>>     6     8        23   100.62       30   100.91   100.75   0.09
>>     5     1        22    98.41       22    98.41    98.41   0.00
>>     4     6        16    97.19       21    97.39    97.29   0.06
>>     3     8         8    96.11       15    96.34    96.25   0.07
>>     2     5         3    94.95        7    95.08    95.01   0.05
>>     1     2         1    94.74        2    94.89    94.81   0.08
>>
>> 7 groups: sum of squares 0.10
>> Group  Size First obs    value Last obs    value     Mean     SD
>>     7     2        29   100.84       30   100.91   100.88   0.04
>>     6     6        23   100.62       28   100.76   100.71   0.05
>>     5     1        22    98.41       22    98.41    98.41   0.00
>>     4     6        16    97.19       21    97.39    97.29   0.06
>>     3     8         8    96.11       15    96.34    96.25   0.07
>>     2     5         3    94.95        7    95.08    95.01   0.05
>>     1     2         1    94.74        2    94.89    94.81   0.08
>>
>> Groups   Sums of squares
>>      1            143.60
>>      2             23.00
>>      3              6.62
>>      4              1.26
>>      5              0.20
>>      6              0.14
>>      7              0.10
>>
>>
>> On Wed, May 2, 2012 at 9:34 AM, Nick Cox <[email protected]> wrote:
>> In practice,
>>
>> gen sizer = round(size)
>>
>> is a simpler way of degrading your data. Check by
>>
>> scatter sizer size
>>
>> Nick
>>
>> On Wed, May 2, 2012 at 9:16 AM, <[email protected]> wrote:
>> * Hi Statalist,
>> * I'm a beginner using version 8.
>> * The following measurements were collected by a machine in my lab...
>> clear
>> input sampling_event size
>> 1 94.74
>> 2 94.89
>> 3 94.95
>> 4 94.97
>> 5 95
>> 6 95.05
>> 7 95.08
>> 8 96.11
>> 9 96.22
>> 10 96.24
>> 11 96.27
>> 12 96.27
>> 13 96.27
>> 14 96.32
>> 15 96.34
>> 16 97.19
>> 17 97.26
>> 18 97.26
>> 19 97.32
>> 20 97.34
>> 21 97.39
>> 22 98.41
>> 23 100.62
>> 24 100.69
>> 25 100.69
>> 26 100.76
>> 27 100.76
>> 28 100.76
>> 29 100.84
>> 30 100.91
>> end
>> list
>> twoway (scatter size sampling_event)
>>
>> * My aim is to class these size values into categories (5 categories in
>> * the example shown).
>> * kdensity will generate the following graphic...
>>
>> kdensity size , w(0.1) n(30)
>>
>> * The troughs of this graphic are a good way to define the bounds of
>> * each category.
>> * Category_4, for example would include all size values larger than 98
>> * and less than 99.
>> * I'd like to extract these trough points as a kdensity post-estimation
>> * and output them as a new variable.
>> * Is this possible?
>> * Look forward to any advice the list has to offer.
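One possible route, sketched rather than tested: -kdensity-'s generate() option stores the plotted coordinates as new variables, and the troughs of the stored density can then be turned into a category variable. The names kx, kd, and category are only for illustration.

kdensity size, w(0.1) n(30) nograph generate(kx kd)
* a size value belongs to category k if exactly k - 1 troughs of the
* estimated density lie below it
gen category = 1
local top = _N - 1
forvalues i = 2/`top' {
    quietly replace category = category + 1 if size > kx[`i'] & kd[`i'] < kd[`i'-1] & kd[`i'] < kd[`i'+1]
}
tabulate category

As the replies above stress, the categories found this way depend on the chosen width, so the result should be checked against other widths before being taken at face value.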