I kind of hacked away at one way to do this. I assumed what you wanted was a larger data set -- one that had, for example,
id1 id2
112 5100
112 5101
112 5102
. .
. .
112 5199
If this is correct, one approach would call for use of the expandcl statement.
Here you have a cluster whenever you have a dash in id2. So
gen cluster = 1 if strpos(id2,"-")
Then you'll need to know how many observations to add/expand. Here
gen expand = 10^(4 - (strpos(id2,"-") - 1)) if cluster
I also save the observation number as a new variable n
gen n = _n
And now
expandcl expand if cluster == 1, cluster(n) generate(newcl)
But you're still not there since you need to create new id2 variables. Here you'll want something like this:
gen id3 = substr(id2,1,strpos(id2,"-")-1) * expand if cluster
and then
sort id1 id2 cluster
by id1 id2 cluster: gen ct = sum(cluster) - 1
replace id3 = id3 + ct if cluster == 1
replace id3 = real(id2) if cluster ~= 1
And now your id3 should be what you wanted for id2.
+-------------------+
| id1 id2 id3 |
|-------------------|
1. | 111 4569 4569 |
2. | 111 4574 4574 |
3. | 112 51-- 5100 |
4. | 112 51-- 5101 |
5. | 112 51-- 5102 |
|-------------------|
6. | 112 51-- 5103 |
7. | 112 51-- 5104 |
8. | 112 51-- 5105 |
9. | 112 51-- 5106 |
10. | 112 51-- 5107 |
|-------------------|
11. | 112 51-- 5108 |
12. | 112 51-- 5109 |
13. | 112 51-- 5110 |
14. | 112 51-- 5111 |
15. | 112 51-- 5112 |
+-------------------+
Now, if you have an embedded dash, such as 3-75, then the task is somewhat more complicated.
Let me know how you fare.
Eric
>Would someone kindly help me with this problem?
>
>I'm trying to join two datasets. The matching is one to many. Let's call
>identifiers in the two datasets id1 and id2. I have a concordance of the
>form (all variables are string):
>
>id1 id2
>111 4569
>111 4574
>112 51--
>112 5689
>113 411-
>113 4145
>113 4310
>
>The problem is where I have an entry for id2 like 51--. This means all
>four digits that start with 51 should be matched to the id1 value of 112.
>Likewise, 411- means all four digits starting with 411 correspond to 113.
>
>If I had just one entry like that, then I could go into dataset2 and
>create another id variable for which all id2s starting with 51 or 411 have
>the same value, and come back and make the appropriate change in the
>concordance. But I have many cases like this where many id2s match with
>id1. Would there be an easy way to take care of this? Perhaps some way to
>automate the creation of an alternative id2 in the second dataset?
>
>Thank you very much for your help.
>
>Jason
--
===================================================
Eric G. Wruck
Econalytics
2535 Sherwood Road
Columbus, OH 43209
ph: 614.231.5034
cell: 614.330.8846
eFax: 614.573.6639
eMail: [email protected]
website: http://www.econalytics.com
====================================================
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/