Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | László Sándor <sandorl@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | st: Re: st: Re: st: Re: st: Re: st: Re: st: Re: st: picking the closest source from multiple — nested cond fail |
Date | Sun, 11 Aug 2013 12:27:17 -0400 |
Thanks, for all the help on this. Yes, Robert's right, I can get missing values, though my "punishment" should downweight sources with the most missing values. Yes, Nick's right, this works only (?) because I did -set type double-. Yes, Mike's right, I should have reshaped. I could have even saved the ordering per asset (as I need it later). Shame I somehow missed this, my original question on the list a few weeks ago already had a suspicion that this a task for reshape. However, this operation is slightly faster, small comfort. Finally, a note for posterity: The code works as it is, I had another error in the code that checked the result. I was merging in old results at some point, and those did not line up with what seemed closest by the new code. Unfortunate this cost me a few evenings, but at least I vetted (and corrected) other aspects of the code. To be clear: The conditions works as they should, as the numbers are integers. You can sleep at night. Thanks! Laszlo On Fri, Aug 9, 2013 at 3:17 PM, Robert Picard <picard@netbox.com> wrote: > You get back missing values because your closest variable is > constant per isin group so your last step simply copies the > values from the selected source, including its missing > values. > > The following example replicates the results produced by > your code. I don't really understand what you are trying to > do but your method of "punishing" an observation with a > missing value has the net effect of selecting the source > with the least missing values. > > * --------------- begin example --------------------------- > > clear > set obs 10000 > set seed 1234 > > gen isin = _n > gen rawprice = runiform() * 100 > > expand runiform() * 10 > sort isin > by isin: gen timevar = _n > sort isin timevar > > * make rawprice change through time and add missing values > foreach v in bb ds ms fs mm skv { > gen rawpriceornav_`v' = rawprice + runiform() > replace rawpriceornav_`v' = . if runiform() < .4 > } > > gen bestprice = . > gen bestsource = . > gen mindif = . > > local i 0 > foreach v in bb ds ms fs mm { > local ++i > clonevar xtemp = rawpriceornav_`v' > replace xtemp = 2 * rawpriceornav_skv if mi(xtemp) > gen distance = abs(xtemp - rawpriceornav_skv) > by isin: egen mdist = mean(distance) > replace bestprice = rawpriceornav_`v' if mdist < mindif > replace bestsource = `i' if mdist < mindif > replace mindif = mdist if mdist < mindif > drop xtemp mdist distance > } > > tab bestsource, mi > > * --------------- end example ----------------------------- > > > > On Fri, Aug 9, 2013 at 10:42 AM, Michael Barker > <mdb96statalist@gmail.com> wrote: >> Hi Laszlo, >> >> I don't have any help on the specific problem you are encountering, >> but you may be able to avoid the problem all together by reshaping >> your data. >> >> If you reshape your data to the asset-source level, you could create a >> single "distance" variable. Then sort by asset id, distance >> (descending) , and source (making sure "Bloomberg is on top, in case >> of a tie). Then you can just keep the first observation for each >> asset. >> >> If your data set is very large, you might work on a subset of the data >> containing only the price variables, then merge it back into the full >> data set. If I understand the problem correctly, the code would look >> something like what I've included below. >> >> Mike >> >> **** >> >> save full_data >> >> keep isin bb ds ms fs mm >> >> foreach var in varlist bb ds ms fs mm { >> rename `var' price`var' >> } >> >> reshape long price, i(isin) j(source) string >> >> bys isin: egen aveprice = mean(price) >> gen pricediff = abs(price-aveprice) >> >> gsort isin -pricediff source >> >> bys isin: gen closest = _n==1 >> >> keep if closest >> drop closest >> >> merge isin 1:1 using full_data >> >> >> >> >> On Thu, Aug 8, 2013 at 9:57 PM, Nick Cox <njcoxstata@gmail.com> wrote: >>> Your original variables may be -double- but the results of these >>> -egen- and -generate- commands will be -float- unless you have -set >>> type double-. For what you want, as soon as one variable is -float- >>> testing for exact equality (always tricky with non-integers) becomes a >>> small nightmare. >>> >>> Nick >>> njcoxstata@gmail.com >>> >>> >>> On 9 August 2013 01:18, László Sándor <sandorl@gmail.com> wrote: >>> >>>> Thanks, Sergiy, as always. >>>> >>>> My trouble is that I don't know which part of my code causes the >>>> problem. Then I could test it, or show a fail in a mock example. >>>> >>>> All my variables are double, by the way unless otherwise specified, >>>> i.e. for "closest." >>>> >>>> But the real problem is why the second nested cond fails. The first >>>> produces the "closest" integer values that seem to be correct. And >>>> they do look integer, as I tried to verify. But the second nested >>>> cond, that assigns values run all the way to missing values, which I >>>> don't understand. >>>> >>>> I will let you know if I find the source of problem. >>> >>> On Thu, Aug 8, 2013 at 7:44 PM, Sergiy Radyakin <serjradyakin@gmail.com> wrote: >>> >>>>> See if this is the reason for problems: >>>>> ....cond(minmeanskvdiff==meanmmskvdiff & ..... >>>>> Are your diffs doubles? try (a-b)<epsilon for conditions of that type. >>>>> >>>>> Or rewrite your example based on auto.dta, and repost. >>>>> >>>>> Writing huge gens like that makes it really difficult to read the code. >>> >>> On Thu, Aug 8, 2013 at 6:46 PM, László Sándor <sandorl@gmail.com> wrote: >>> >>>>>> Please let me ask for some help because I cannot figure out why my >>>>>> code works for some observations but not others. Maybe this is a >>>>>> -cond- or precision issue and we all learnt something. >>>>>> >>>>>> I need to use financial prices from various sources, but maddeningly, >>>>>> the sources don't line up completely (there is some ambiguity about >>>>>> the assets in question). I do have some reference prices for a few >>>>>> years for a part of the panel. I want to price at least those assets >>>>>> right. >>>>>> >>>>>> I think have some code that calculates the average distance from each >>>>>> source, and fills a variable about which came closest. This could be >>>>>> subject to rounding errors, but this seems to fill in impressively >>>>>> many values. And manual checks verify that the indicated source is >>>>>> indeed the closest to the reference. >>>>>> >>>>>> The next round of -cond-s should pick the value from the right source >>>>>> but produces many missing values. The conditions are all the >>>>>> source-indicator being equal to a single-digit integer value, which >>>>>> seems to be true for the variable, e.g. the label apply to it, I can >>>>>> tab it etc. >>>>>> >>>>>> What is going on? I do have further details, but would confuse you >>>>>> more only if necessary. >>>>>> >>>>>> * Code to generate the closest source for an asset, with panel id isin >>>>>> >>>>>> foreach v in bb ds ms fs mm { >>>>>> gen fakerawpriceornav_`v'=cond(mi(rawpriceornav_`v'),2*rawpriceornav_skv,rawpriceornav_`v') >>>>>> // I need this to punish a source missing when "SKV" reference still >>>>>> has value, otherwise a missing difference would benefit the source by >>>>>> the logic of -egen mean- >>>>>> gen `v'skvdiff = abs(fakerawpriceornav_`v'-rawpriceornav_skv) >>>>>> bys isin: egen mean`v'skvdiff = mean(`v'skvdiff) >>>>>> } >>>>>> egen minmeanskvdiff = rowmin(mean*skvdiff) >>>>>> >>>>>> g byte closest = cond(minmeanskvdiff==meanbbskvdiff & !mi(minmeanskvdiff),1, /// >>>>>> cond(minmeanskvdiff==meanmmskvdiff & !mi(minmeanskvdiff),2, /// >>>>>> cond(minmeanskvdiff==meanmsskvdiff & !mi(minmeanskvdiff),3, /// >>>>>> cond(minmeanskvdiff==meanfsskvdiff & !mi(minmeanskvdiff),4, /// >>>>>> cond(minmeanskvdiff==meandsskvdiff & !mi(minmeanskvdiff),5, /// ) >>>>>> .))))) >>>>>> >>>>>> * So far not many missing values generated, though vast majority of >>>>>> "closest" is simply the first value, I think this is still correct >>>>>> (There are many ties, and I break the tie in favor of Bloomberg.) >>>>>> >>>>>> la def closest 1 "Bloomberg" 3 "Morningstar" 5 "Datastream" 2 >>>>>> "MoneyMate" 4 "FactSet" >>>>>> la val closest closest >>>>>> * This verifies that labels are picked up by the integer values. >>>>>> >>>>>> g rawpriceornav_pick = /// >>>>>> cond(closest==1,rawpriceornav_bb, /// >>>>>> cond(closest==2,rawpriceornav_mm, /// >>>>>> cond(closest==3,rawpriceornav_ms, /// >>>>>> cond(closest==4,rawpriceornav_fs, /// >>>>>> cond(closest==5,rawpriceornav_ds, /// >>>>>> .))))) >>>>>> * But here I get back a ton of missing values. Yes, it can happen that >>>>>> in my panel a source is the closest for an ISIN but has missing values >>>>>> for a few years, so closest will always have less missing than this >>>>>> variable. But only 20% or so of this variable gets filled here, which >>>>>> is not reasonable. >>> >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> * http://www.ats.ucla.edu/stat/stata/ >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/