Dear all,
Apologies for the long-winded explanation that follows, but I am trying
to be as precise as possible about the problem that I am having.
I am constructing a panel from 44 files; 22 of production statistics and
22 of investment. The time dimension runs from 1978 to 1999. As of 1993
there was a change in the unique identifier for the firm, from a
two-part system (r1 and r2) to a single identifier variable (b1). I have
a 45th file which contains information on the change in identifier for
all firms operating in 1993 - "link.dta" conatins three variables; r1 r2
b1.
For all time periods, in both types of dataset, the relevant identifier
is removed from a long string and destrung (or converted to a number if
you want to avoid excessive butchery of the language). The code for
doing this is exactly the same for both the investment and the
production statistics, there are no duplicates, the format is the same
for both types; r1 is a 9-digit code, r2 has 3 digits and b1 8. For
periods prior to 1993 I issue a merge command
-merge r1 r2 using "link.dta"
The expected result is that for firms still operating in 1993 an extra
variable is created in the dataset, containing the value of b1.
As you would expect attrition means that not all firm operating in 1978
or so are still operating in 1993, so don't get linked to a value for b1
(_merge==1). Equally the link datafile contains a large amount of
unnecessary information so a large proportion come up as _merge==2. What
I don't understand is some of the _merge==3 observations.
For the investment data, everything is fine. The majority of r1/r2
combinations are linked to a value of b1, which is added to the dataset.
However, for the production statistics something weird is going on.
Again the majority of combinations of r1/r2 are linked to b1
(_merge==3), and a new variable b1 is added. Yet this variable is always
coded as missing (.).
So what seems to be occurring is that the r1/r2 combination is correctly
linked to a value of b1, but this value is not added correctly.
I have checked the issue of format; the _merge==2 observations are added
neatly at the end of the file, with r1 in the correct column, r2 and b1
also. So there seems to be no issue of formatting (just checking).
The values of r1 and r2 are present in link.dta, and have an associated
value of b1 (as suggested by _merge==3). It is possible to make the
r1/r2 connection for link-investment and for production-investment, but
not link-production.
Could anyone give me a pointer as to what I am doing wrong? Thanks for
reading this far, and also in advance for any help.
Julian
P.S. An illustration of what I mean (with fewer digits). Please note
that the number of observations is not the same in the investment and
production datasets.
Link.dta
Obs. R1 r2 b1
1 100 001 123
2 101 001 124
3 102 001 125
4 102 002 126
5 103 001 127
6 104 001 128
7 105 001 129
8 106 001 130
Investment.dta
Obs. R1 r2 other vars
1 100 001
2 101 001
3 102 001
4 102 002
5 104 001
6 107 001
Production.dta
Obs. R1 r2 other vars
1 100 001
2 101 001
3 102 001
4 102 002
5 103 001
6 104 001
7 107 001
8 108 001
-------*After a merge with link*------------
Investment.dta
Obs. R1 r2 b1 _merge
other vars
1 100 001 123 3
2 101 001 124 3
3 102 001 125 3
4 102 002 126 3
5 104 001 128 3
6 105 001 129 2
7 106 001 130 2
8 107 001 . 1
Production.dta
Obs. R1 r2 b1 _merge
other vars
1 100 001 . 3
2 101 001 . 3
3 102 001 . 3
4 102 002 . 3
5 104 001 . 3
6 105 001 129 2
7 106 001 130 2
8 107 001 . 1
9 108 001 . 1
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/