| |
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
Re: st: Merging of two datasets
Tobi Brütsch <[email protected]> writes,
> I have to match two datasets from two diffrend sources but about the same
> issue:
>
> Dataset1 (+/- 5000 observations):
> date Firm recommendation brokerID
> 15.03.2002 ABB 2 LEHM
> 11.01.2005 ABB 5 HESLB
> 01.07.2005 ABB 2 JBCOB
> 06.08.2004 ABB 3 MORGAN
>
> Dataset2 (+/- 2500 observations):
> Date firm brokerID recommendation
> 04aug2003 ABB MORGAN 4
> 13nov2002 ABB WAREURO 4
> 22mar2005 ABB JYSKE 4
> 25jan2005 ABB SOGENED 4
> 27jan2004 ABB PARIBEU 3
>
> [...]
> [...] i want to analyse if the recommendations of Dataset 2 are:
>
> 1. the same like these in set 1 (should be for some but I think not for
> all)
>
> 2. The recommendations in set 2 are published with a certain lag,
> perhabs 5-10 days after the same recommendation in set1.
> Now i don't know to merge the sets. [...]
I think the way to proceed is not to -merge- the two datasets, but to
-append- them.
. use Dataset1
. gen byte source = 1
. append using Dataset2
. replace source = 2 if source==.
. save combined
It appears that the date variable in Dataset2 is a %td value, but I'm
unsure about the value in Dataset1. Make sure both are %td dates BEFORE
appending, and make sure that the variables are named the same in each.
Now with the datasets combined, we can start to work on a fuzzy merge.
This is more a process than a procedure.
For instance, let's first get rid of broker/firm combinations that are
unique to one dataset or the other, or, said in our new terms, that are
to unique to source==1 or source==2:
. use combined
. sort firm brokerID date
. by firm brokerID: gen unique = (source==source[1])
. by firm brokerID: replace unique = sum(unique)
. by firm brokerID: gen todrop = (unique==_N) if _n==_N
. by firm brokerID: replace todrop = todrop[_N]
. drop if todrop
. drop unique
Excuse me for being so verbose in my code; if I worked at it I could
reduce it to just one or two lines. I wrote the code verbosely, however,
because that was how I thought about it as I was writing it and, really,
it is easier to understand. Everything above hinges on
. by firm brokerID: gen unique = (source==source[1])
That variable takes on value 0 or 1; it is 1 if the source is the same as
the source was in observation 1, and 0 otherwise. Thus, if the sum(source)
== _N, the source never changed. I want to get rid of all the observions for
the firm/brokerID combination in that case.
The dataset is probably worth saving at this point.
After this first step, by advice is less specific. Tobi could do the
following:
. sort firm brokerID recommendation date source
Then if he listed the data, it would be easy to see how long ago each
firm/brokerID made the same recommendation. (By the way, I included source
at the end of the -sort- just in case the same recommendation was made twice
on the same day. I wanted to put the records in a deterministic order, even
in that case.)
Rather than listing, Tobi coiuld generate a variable recording how long
ago the same recommendation was made:
. by firm brokerID recommendation: gen lastrec = date - date[_n-1]
lastrec would require days from last recommendation, but it would not
necessarily be from different sources; the firm/brokerID might have made the
same recommendation in the same source. Tobi will have to think about what
that means, if it occurs.
We could make sure it was from different sources:
. by firm brokerID rec: gen n=_n if source==1
. by firm brokerID rec: replace n = n[_n-1] if n==.
. by firm brokerID rec: gen lobs1 = n[_n-1]
. drop n
If I coded the above right, and if I'm thinking correctly, I just put in
lobs1 the observation number (within firm brokerID rec) of the last
observation made by the firm/brokerid at source==1.
We could now do the same for source==2:
. by firm brokerID rec: gen n= _n if source==1
. by firm brokerID rec: replace n = n[_n-1] if n==.
. by firm brokerID rec: gen lobs2 = n[_n-1]
. drop n
With lobs1 and lobs2, Tobi could now get the last recommendation made
by the other source:
. by firm brokerID rec: gen lastrec = date-date[lobs2] if source==1
. by firm brokerID rec: replace lastrec = date-date[lobs1] if source==2
Continuing like this, Tobi should be able come up with operation definitions
of what it means to make the same recommendations, and then count them.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/