Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Re: st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail

From	László Sándor <[email protected]>
To	[email protected]
Subject	st: Re: st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
Date	Sun, 11 Aug 2013 12:27:17 -0400

Thanks, for all the help on this.

Yes, Robert's right, I can get missing values, though my "punishment"
should downweight sources with the most missing values.

Yes, Nick's right, this works only (?) because I did -set type double-.

Yes, Mike's right, I should have reshaped. I could have even saved the
ordering per asset (as I need it later). Shame I somehow missed this,
my original question on the list a few weeks ago already had a
suspicion that this a task for reshape. However, this operation is
slightly faster, small comfort.

Finally, a note for posterity: The code works as it is, I had another
error in the code that checked the result. I was merging in old
results at some point, and those did not line up with what seemed
closest by the new code. Unfortunate this cost me a few evenings, but
at least I vetted (and corrected) other aspects of the code.

To be clear: The conditions works as they should, as the numbers are
integers. You can sleep at night.

Thanks!

Laszlo

On Fri, Aug 9, 2013 at 3:17 PM, Robert Picard <[email protected]> wrote:
> You get back missing values because your closest variable is
> constant per isin group so your last step simply copies the
> values from the selected source, including its missing
> values.
>
> The following example replicates the results produced by
> your code. I don't really understand what you are trying to
> do but your method of "punishing" an observation with a
> missing value has the net effect of selecting the source
> with the least missing values.
>
> * --------------- begin example ---------------------------
>
> clear
> set obs 10000
> set seed 1234
>
> gen isin = _n
> gen rawprice = runiform() * 100
>
> expand runiform() * 10
> sort isin
> by isin: gen timevar = _n
> sort isin timevar
>
> * make rawprice change through time and add missing values
> foreach v in bb ds ms fs mm skv {
>         gen rawpriceornav_`v' = rawprice + runiform()
>         replace rawpriceornav_`v' = . if runiform() < .4
> }
>
> gen bestprice = .
> gen bestsource = .
> gen mindif = .
>
> local i 0
> foreach v in bb ds ms fs mm {
>         local ++i
>         clonevar xtemp = rawpriceornav_`v'
>         replace xtemp = 2 * rawpriceornav_skv if mi(xtemp)
>         gen distance = abs(xtemp - rawpriceornav_skv)
>         by isin: egen mdist = mean(distance)
>         replace bestprice = rawpriceornav_`v' if mdist < mindif
>         replace bestsource = `i' if mdist < mindif
>         replace mindif = mdist if mdist < mindif
>         drop xtemp mdist distance
> }
>
> tab bestsource, mi
>
> * --------------- end example -----------------------------
>
>
>
> On Fri, Aug 9, 2013 at 10:42 AM, Michael Barker
> <[email protected]> wrote:
>> Hi Laszlo,
>>
>> I don't have any help on the specific problem you are encountering,
>> but you may be able to avoid the problem all together by reshaping
>> your data.
>>
>> If you reshape your data to the asset-source level, you could create a
>> single "distance" variable. Then sort by asset id, distance
>> (descending) , and source (making sure "Bloomberg is on top, in case
>> of a tie). Then you can just keep the first observation for each
>> asset.
>>
>> If your data set is very large, you might work on a subset of the data
>> containing only the price variables, then merge it back into the full
>> data set. If I understand the problem correctly, the code would look
>> something like what I've included below.
>>
>> Mike
>>
>> ****
>>
>> save full_data
>>
>> keep isin bb ds ms fs mm
>>
>> foreach var in varlist bb ds ms fs mm {
>> rename `var' price`var'
>> }
>>
>> reshape long price, i(isin) j(source) string
>>
>> bys isin: egen aveprice = mean(price)
>> gen pricediff = abs(price-aveprice)
>>
>> gsort isin -pricediff source
>>
>> bys isin: gen closest = _n==1
>>
>> keep if closest
>> drop closest
>>
>> merge isin 1:1 using full_data
>>
>>
>>
>>
>> On Thu, Aug 8, 2013 at 9:57 PM, Nick Cox <[email protected]> wrote:
>>> Your original variables may be -double- but the results of these
>>> -egen- and -generate- commands will be -float- unless you have -set
>>> type double-.  For what you want, as soon as one variable is -float-
>>> testing for exact equality (always tricky with non-integers) becomes a
>>> small nightmare.
>>>
>>> Nick
>>> [email protected]
>>>
>>>
>>> On 9 August 2013 01:18, László Sándor <[email protected]> wrote:
>>>
>>>> Thanks, Sergiy, as always.
>>>>
>>>> My trouble is that I don't know which part of my code causes the
>>>> problem. Then I could test it, or show a fail in a mock example.
>>>>
>>>> All my variables are double, by the way unless otherwise specified,
>>>> i.e. for "closest."
>>>>
>>>> But the real problem is why the second nested cond fails. The first
>>>> produces the "closest" integer values that seem to be correct. And
>>>> they do look integer, as I tried to verify. But the second nested
>>>> cond, that assigns values run all the way to missing values, which I
>>>> don't understand.
>>>>
>>>> I will let you know if I find the source of problem.
>>>
>>> On Thu, Aug 8, 2013 at 7:44 PM, Sergiy Radyakin <[email protected]> wrote:
>>>
>>>>> See if this is the reason for problems:
>>>>> ....cond(minmeanskvdiff==meanmmskvdiff & .....
>>>>> Are your diffs doubles? try (a-b)<epsilon for conditions of that type.
>>>>>
>>>>> Or rewrite your example based on auto.dta, and repost.
>>>>>
>>>>> Writing huge gens like that makes it really difficult to read the code.
>>>
>>> On Thu, Aug 8, 2013 at 6:46 PM, László Sándor <[email protected]> wrote:
>>>
>>>>>> Please let me ask for some help because I cannot figure out why my
>>>>>> code works for some observations but not others. Maybe this is a
>>>>>> -cond- or precision issue and we all learnt something.
>>>>>>
>>>>>> I need to use financial prices from various sources, but maddeningly,
>>>>>> the sources don't line up completely (there is some ambiguity about
>>>>>> the assets in question). I do have some reference prices for a few
>>>>>> years for a part of the panel. I want to price at least those assets
>>>>>> right.
>>>>>>
>>>>>> I think have some code that calculates the average distance from each
>>>>>> source, and fills a variable about which came closest. This could be
>>>>>> subject to rounding errors, but this seems to fill in impressively
>>>>>> many values. And manual checks verify that the indicated source is
>>>>>> indeed the closest to the reference.
>>>>>>
>>>>>> The next round of -cond-s should pick the value from the right source
>>>>>> but produces many missing values. The conditions are all the
>>>>>> source-indicator being equal to a single-digit integer value, which
>>>>>> seems to be true for the variable, e.g. the label apply to it, I can
>>>>>> tab it etc.
>>>>>>
>>>>>> What is going on? I do have further details, but would confuse you
>>>>>> more only if necessary.
>>>>>>
>>>>>> * Code to generate the closest source for an asset, with panel id isin
>>>>>>
>>>>>> foreach v in bb ds ms fs mm {
>>>>>> gen fakerawpriceornav_`v'=cond(mi(rawpriceornav_`v'),2*rawpriceornav_skv,rawpriceornav_`v')
>>>>>> // I need this to punish a source missing when "SKV" reference still
>>>>>> has value, otherwise a missing difference would benefit the source by
>>>>>> the logic of -egen mean-
>>>>>> gen `v'skvdiff = abs(fakerawpriceornav_`v'-rawpriceornav_skv)
>>>>>> bys isin: egen mean`v'skvdiff = mean(`v'skvdiff)
>>>>>> }
>>>>>> egen minmeanskvdiff = rowmin(mean*skvdiff)
>>>>>>
>>>>>> g byte closest = cond(minmeanskvdiff==meanbbskvdiff & !mi(minmeanskvdiff),1, ///
>>>>>> cond(minmeanskvdiff==meanmmskvdiff & !mi(minmeanskvdiff),2, ///
>>>>>> cond(minmeanskvdiff==meanmsskvdiff & !mi(minmeanskvdiff),3, ///
>>>>>>   cond(minmeanskvdiff==meanfsskvdiff & !mi(minmeanskvdiff),4, ///
>>>>>>   cond(minmeanskvdiff==meandsskvdiff & !mi(minmeanskvdiff),5, ///     )
>>>>>> .)))))
>>>>>>
>>>>>> * So far not many missing values generated, though vast majority of
>>>>>> "closest" is simply the first value, I think this is still correct
>>>>>> (There are many ties, and I break the tie in favor of Bloomberg.)
>>>>>>
>>>>>> la def closest 1 "Bloomberg" 3 "Morningstar" 5 "Datastream" 2
>>>>>> "MoneyMate" 4 "FactSet"
>>>>>> la val closest closest
>>>>>> * This verifies that labels are picked up by the integer values.
>>>>>>
>>>>>> g rawpriceornav_pick = ///
>>>>>> cond(closest==1,rawpriceornav_bb, ///
>>>>>> cond(closest==2,rawpriceornav_mm, ///
>>>>>> cond(closest==3,rawpriceornav_ms, ///
>>>>>> cond(closest==4,rawpriceornav_fs, ///
>>>>>> cond(closest==5,rawpriceornav_ds, ///
>>>>>> .)))))
>>>>>> * But here I get back a ton of missing values. Yes, it can happen that
>>>>>> in my panel a source is the closest for an ISIN but has missing values
>>>>>> for a few years, so closest will always have less missing than this
>>>>>> variable. But only 20% or so of this variable gets filled here, which
>>>>>> is not reasonable.
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: picking the closest source from multiple — nested cond fail
  - From: László Sándor <[email protected]>
- st: Re: st: picking the closest source from multiple — nested cond fail
  - From: Sergiy Radyakin <[email protected]>
- st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
  - From: László Sándor <[email protected]>
- st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
  - From: Nick Cox <[email protected]>
- st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
  - From: Michael Barker <[email protected]>
- st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
  - From: Robert Picard <[email protected]>

Prev by Date: Re: st: margins with logarithmic offsets
Next by Date: st: STATA SEM with censored data
Previous by thread: st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
Next by thread: st: Arellano and Bond estimation with System of Equations
Index(es):
- Date
- Thread