Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail

From	Robert Picard <[email protected]>
To	[email protected]
Subject	st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
Date	Fri, 9 Aug 2013 15:17:04 -0400

You get back missing values because your closest variable is
constant per isin group so your last step simply copies the
values from the selected source, including its missing
values.

The following example replicates the results produced by
your code. I don't really understand what you are trying to
do but your method of "punishing" an observation with a
missing value has the net effect of selecting the source
with the least missing values.

* --------------- begin example ---------------------------

clear
set obs 10000
set seed 1234

gen isin = _n
gen rawprice = runiform() * 100

expand runiform() * 10
sort isin
by isin: gen timevar = _n
sort isin timevar

* make rawprice change through time and add missing values
foreach v in bb ds ms fs mm skv {
	gen rawpriceornav_`v' = rawprice + runiform()
	replace rawpriceornav_`v' = . if runiform() < .4
}

gen bestprice = .
gen bestsource = .
gen mindif = .

local i 0
foreach v in bb ds ms fs mm {
	local ++i
	clonevar xtemp = rawpriceornav_`v'
	replace xtemp = 2 * rawpriceornav_skv if mi(xtemp)
	gen distance = abs(xtemp - rawpriceornav_skv)
	by isin: egen mdist = mean(distance)
	replace bestprice = rawpriceornav_`v' if mdist < mindif
	replace bestsource = `i' if mdist < mindif
	replace mindif = mdist if mdist < mindif
	drop xtemp mdist distance
}

tab bestsource, mi

* --------------- end example -----------------------------



On Fri, Aug 9, 2013 at 10:42 AM, Michael Barker
<[email protected]> wrote:
> Hi Laszlo,
>
> I don't have any help on the specific problem you are encountering,
> but you may be able to avoid the problem all together by reshaping
> your data.
>
> If you reshape your data to the asset-source level, you could create a
> single "distance" variable. Then sort by asset id, distance
> (descending) , and source (making sure "Bloomberg is on top, in case
> of a tie). Then you can just keep the first observation for each
> asset.
>
> If your data set is very large, you might work on a subset of the data
> containing only the price variables, then merge it back into the full
> data set. If I understand the problem correctly, the code would look
> something like what I've included below.
>
> Mike
>
> ****
>
> save full_data
>
> keep isin bb ds ms fs mm
>
> foreach var in varlist bb ds ms fs mm {
> rename `var' price`var'
> }
>
> reshape long price, i(isin) j(source) string
>
> bys isin: egen aveprice = mean(price)
> gen pricediff = abs(price-aveprice)
>
> gsort isin -pricediff source
>
> bys isin: gen closest = _n==1
>
> keep if closest
> drop closest
>
> merge isin 1:1 using full_data
>
>
>
>
> On Thu, Aug 8, 2013 at 9:57 PM, Nick Cox <[email protected]> wrote:
>> Your original variables may be -double- but the results of these
>> -egen- and -generate- commands will be -float- unless you have -set
>> type double-.  For what you want, as soon as one variable is -float-
>> testing for exact equality (always tricky with non-integers) becomes a
>> small nightmare.
>>
>> Nick
>> [email protected]
>>
>>
>> On 9 August 2013 01:18, László Sándor <[email protected]> wrote:
>>
>>> Thanks, Sergiy, as always.
>>>
>>> My trouble is that I don't know which part of my code causes the
>>> problem. Then I could test it, or show a fail in a mock example.
>>>
>>> All my variables are double, by the way unless otherwise specified,
>>> i.e. for "closest."
>>>
>>> But the real problem is why the second nested cond fails. The first
>>> produces the "closest" integer values that seem to be correct. And
>>> they do look integer, as I tried to verify. But the second nested
>>> cond, that assigns values run all the way to missing values, which I
>>> don't understand.
>>>
>>> I will let you know if I find the source of problem.
>>
>> On Thu, Aug 8, 2013 at 7:44 PM, Sergiy Radyakin <[email protected]> wrote:
>>
>>>> See if this is the reason for problems:
>>>> ....cond(minmeanskvdiff==meanmmskvdiff & .....
>>>> Are your diffs doubles? try (a-b)<epsilon for conditions of that type.
>>>>
>>>> Or rewrite your example based on auto.dta, and repost.
>>>>
>>>> Writing huge gens like that makes it really difficult to read the code.
>>
>> On Thu, Aug 8, 2013 at 6:46 PM, László Sándor <[email protected]> wrote:
>>
>>>>> Please let me ask for some help because I cannot figure out why my
>>>>> code works for some observations but not others. Maybe this is a
>>>>> -cond- or precision issue and we all learnt something.
>>>>>
>>>>> I need to use financial prices from various sources, but maddeningly,
>>>>> the sources don't line up completely (there is some ambiguity about
>>>>> the assets in question). I do have some reference prices for a few
>>>>> years for a part of the panel. I want to price at least those assets
>>>>> right.
>>>>>
>>>>> I think have some code that calculates the average distance from each
>>>>> source, and fills a variable about which came closest. This could be
>>>>> subject to rounding errors, but this seems to fill in impressively
>>>>> many values. And manual checks verify that the indicated source is
>>>>> indeed the closest to the reference.
>>>>>
>>>>> The next round of -cond-s should pick the value from the right source
>>>>> but produces many missing values. The conditions are all the
>>>>> source-indicator being equal to a single-digit integer value, which
>>>>> seems to be true for the variable, e.g. the label apply to it, I can
>>>>> tab it etc.
>>>>>
>>>>> What is going on? I do have further details, but would confuse you
>>>>> more only if necessary.
>>>>>
>>>>> * Code to generate the closest source for an asset, with panel id isin
>>>>>
>>>>> foreach v in bb ds ms fs mm {
>>>>> gen fakerawpriceornav_`v'=cond(mi(rawpriceornav_`v'),2*rawpriceornav_skv,rawpriceornav_`v')
>>>>> // I need this to punish a source missing when "SKV" reference still
>>>>> has value, otherwise a missing difference would benefit the source by
>>>>> the logic of -egen mean-
>>>>> gen `v'skvdiff = abs(fakerawpriceornav_`v'-rawpriceornav_skv)
>>>>> bys isin: egen mean`v'skvdiff = mean(`v'skvdiff)
>>>>> }
>>>>> egen minmeanskvdiff = rowmin(mean*skvdiff)
>>>>>
>>>>> g byte closest = cond(minmeanskvdiff==meanbbskvdiff & !mi(minmeanskvdiff),1, ///
>>>>> cond(minmeanskvdiff==meanmmskvdiff & !mi(minmeanskvdiff),2, ///
>>>>> cond(minmeanskvdiff==meanmsskvdiff & !mi(minmeanskvdiff),3, ///
>>>>>   cond(minmeanskvdiff==meanfsskvdiff & !mi(minmeanskvdiff),4, ///
>>>>>   cond(minmeanskvdiff==meandsskvdiff & !mi(minmeanskvdiff),5, ///     )
>>>>> .)))))
>>>>>
>>>>> * So far not many missing values generated, though vast majority of
>>>>> "closest" is simply the first value, I think this is still correct
>>>>> (There are many ties, and I break the tie in favor of Bloomberg.)
>>>>>
>>>>> la def closest 1 "Bloomberg" 3 "Morningstar" 5 "Datastream" 2
>>>>> "MoneyMate" 4 "FactSet"
>>>>> la val closest closest
>>>>> * This verifies that labels are picked up by the integer values.
>>>>>
>>>>> g rawpriceornav_pick = ///
>>>>> cond(closest==1,rawpriceornav_bb, ///
>>>>> cond(closest==2,rawpriceornav_mm, ///
>>>>> cond(closest==3,rawpriceornav_ms, ///
>>>>> cond(closest==4,rawpriceornav_fs, ///
>>>>> cond(closest==5,rawpriceornav_ds, ///
>>>>> .)))))
>>>>> * But here I get back a ton of missing values. Yes, it can happen that
>>>>> in my panel a source is the closest for an ISIN but has missing values
>>>>> for a few years, so closest will always have less missing than this
>>>>> variable. But only 20% or so of this variable gets filled here, which
>>>>> is not reasonable.
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- st: Re: st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
  - From: László Sándor <[email protected]>

References:
- st: picking the closest source from multiple — nested cond fail
  - From: László Sándor <[email protected]>
- st: Re: st: picking the closest source from multiple — nested cond fail
  - From: Sergiy Radyakin <[email protected]>
- st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
  - From: László Sándor <[email protected]>
- st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
  - From: Nick Cox <[email protected]>
- st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
  - From: Michael Barker <[email protected]>

Prev by Date: st: Compare regression coefficients between two models
Next by Date: Re: st: Compare regression coefficients between two models
Previous by thread: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
Next by thread: st: Re: st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail
Index(es):
- Date
- Thread