st: Two datasets: Look for similar observations in the second dataset
From: Torsten Häberle <[email protected]>
To: [email protected]
Subject: st: Two datasets: Look for similar observations in the second dataset
Date: Sat, 25 Jan 2014 18:26:55 +0100
Hey guys,
I have quite a difficult "matching" problem to solve and I am not sure
how to approach it. This is the situation:
I have two datasets:
1) The first one is my sample dataset
2) The second one is basically the entire population, but excluding my
sample dataset
Both datasets include data about firms. In general, what I want to do:
Find for each firm in dataset (1) another "matching" firm in dataset
(2) that is as similar as possible to the sample firm in dataset (1)
(based on two characteristics).
Dataset 1 looks like:
Company  Year  CompanySize  A ratio
A        2012  140          0.2
B        2011  200          0.4
C        2010  300          0.2
It includes many firms over a period of 20 years including their
characteristics. There are two matching characteristics: the company
size and a (company) ratio that I calculated.
For example, company A has a size of 140 and a ratio of 0.2 in 2012.
Now, I want to find a firm in dataset (2), which is similar to firm A
in dataset (1) in the same year 2012.
Dataset 2 looks very similar:
Company  Year  CompanySize  A ratio
X        2012  150          0.19
Y        2012  280          0.9
Z        2012  50           0.01
...
Dataset (2) includes many other firms. As mentioned, I want to
find a matching firm for each sample firm. I think this has to be
done with some kind of loop or macro, but I am not sure.
The match should be conducted in the following way. Let's assume in
our example that we want to find a matching firm for sample firm A in
dataset (1).
1) Characteristic: CompanySize >> First matching characteristic
Stata shall pick all firms from dataset (2) that have a company size
between 80% and 120% of firm A's size. All other firms in dataset (2)
shall be immediately dismissed. This is basically the first step in
the matching procedure.
In our case: Firm A's size is 140, so the accepted range is 112 to 168.
All firms in dataset (2) that have a CompanySize above 168 or below 112
shall be dismissed --> Company Y and Z.
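A minimal sketch of this first step, written in Python rather than Stata purely for illustration. The function name `size_candidates`, the field names, and the in-memory list layout are assumptions and not part of any existing command; the three population firms are the ones from the example above.

```python
# Hypothetical in-memory copy of dataset (2); field names are assumptions.
population = [
    {"company": "X", "year": 2012, "size": 150, "ratio": 0.19},
    {"company": "Y", "year": 2012, "size": 280, "ratio": 0.90},
    {"company": "Z", "year": 2012, "size": 50, "ratio": 0.01},
]


def size_candidates(sample_firm, population, lower=0.8, upper=1.2):
    """Keep population firms from the same year whose size lies within
    [lower, upper] times the sample firm's size (bounds inclusive)."""
    low = lower * sample_firm["size"]
    high = upper * sample_firm["size"]
    return [
        firm
        for firm in population
        if firm["year"] == sample_firm["year"] and low <= firm["size"] <= high
    ]


firm_a = {"company": "A", "year": 2012, "size": 140, "ratio": 0.2}
candidates = size_candidates(firm_a, population)
print([firm["company"] for firm in candidates])  # ['X']
```

Whether the 80% and 120% bounds themselves count as inside the range is not stated above; the sketch treats them as inclusive.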
2) Characteristic: Ratio >> Second matching characteristic
Now, Stata shall pick from the remaining firms in dataset (2) the
single firm whose ratio is closest to the ratio of firm A in
dataset (1). In our example, this would be Company X. The comparison
should work roughly like this:
Ratio of firm A (dataset 1) - ratio of firm X (dataset 2) = 0.2 - 0.19 = 0.01
Ratio of firm A (dataset 1) - ratio of firm Y (dataset 2) = 0.2 - 0.9 = -0.7
Ratio of firm A (dataset 1) - ratio of firm Z (dataset 2) = 0.2 - 0.01 = 0.19
>>> Pick firm X since its absolute difference is the smallest. Be careful
here: Y and Z are actually already excluded due to their CompanySize
(first matching characteristic). They appear here only to illustrate
the calculation.
Finally, to make it even more complicated: I am not only looking for
the "best" (closest) match, but also for the second and third closest
match.
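The ranking step can be sketched as follows, again in Python for illustration only. The function name `closest_by_ratio` is an assumption, and firms P, Q and R are invented candidates (imagined to have already passed the size filter) so that there is something to rank; only X comes from the example above.

```python
def closest_by_ratio(sample_ratio, candidates, k=3):
    """Return the names of up to k candidates, ordered by the absolute
    difference between their ratio and the sample firm's ratio."""
    ranked = sorted(candidates, key=lambda pair: abs(sample_ratio - pair[1]))
    return [name for name, _ in ranked[:k]]


# (name, ratio) pairs; P, Q and R are hypothetical firms.
size_matched = [("X", 0.19), ("P", 0.26), ("Q", 0.10), ("R", 0.23)]

top_three = closest_by_ratio(0.2, size_matched)
print(top_three)  # ['X', 'R', 'P']
```

Using the absolute difference matters here: a candidate with a ratio far above the sample firm's would otherwise produce a large negative difference and wrongly look like the "smallest". Ties keep their input order, because Python's `sorted` is stable.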
In the end, I want to get one dataset that looks like this:
Company  Matching Firm 1  Matching Firm 2  Matching Firm 3
A        X                2nd rank         3rd rank
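Putting both steps together for every sample firm could look like the sketch below (Python, illustration only). The function name `match_all`, the field names, and the use of `None` for missing matches are assumptions. The sketch also assumes that the same population firm may be matched to several sample firms, which the description above does not settle.

```python
def match_all(sample, population, k=3, lower=0.8, upper=1.2):
    """For each sample firm, return [company, match 1, ..., match k],
    padding with None when fewer than k candidates survive the size filter."""
    rows = []
    for s in sample:
        # Step 1: same year and size within the band around the sample firm.
        pool = [
            p
            for p in population
            if p["year"] == s["year"]
            and lower * s["size"] <= p["size"] <= upper * s["size"]
        ]
        # Step 2: rank the survivors by absolute ratio difference.
        pool.sort(key=lambda p: abs(s["ratio"] - p["ratio"]))
        names = [p["company"] for p in pool[:k]]
        names += [None] * (k - len(names))
        rows.append([s["company"]] + names)
    return rows


# The example rows from dataset (1) and dataset (2) above.
sample = [
    {"company": "A", "year": 2012, "size": 140, "ratio": 0.2},
    {"company": "B", "year": 2011, "size": 200, "ratio": 0.4},
    {"company": "C", "year": 2010, "size": 300, "ratio": 0.2},
]
population = [
    {"company": "X", "year": 2012, "size": 150, "ratio": 0.19},
    {"company": "Y", "year": 2012, "size": 280, "ratio": 0.90},
    {"company": "Z", "year": 2012, "size": 50, "ratio": 0.01},
]

rows = match_all(sample, population)
for row in rows:
    print(row)
```

With only the three example population firms, A is matched to X and nothing else, and B and C get no match at all because dataset (2) has no example rows for 2011 or 2010.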
Hopefully, I made my problem clear. Would appreciate some help. Since
this matching has to be done for every sample firm, this has to be some
kind of loop/macro that does this matching over and over again for
every sample firm.
Thanks!