At 04:07 PM 3/30/2005 +0000, Louis Boakye-Yiadom wrote:
Dear All,
Is it true that "for a match merge to work, the identifier or identifiers
must uniquely identify each observation"? I found this statement in sample
lecture NC 101 (one of StataCorp's NetCourses), but I thought that this
requirement (of the id uniquely identifying each observation) is often
desirable, but not necessary in all cases. Any insights will be
appreciated. Thank you.
As I understand it, it's not required.
But it's a good idea for the identifier(s) to uniquely identify
observations in at least one of the files. In nearly all instances, it is
essential to adhere to this rule in order to get meaningful results.
Sometimes the identifier(s) uniquely identify observations in both files.
That is easy to understand. (And you can use the -unique- option.)
Often, you may have unique identification in one file but not the
other. Then each observation in the uniquely identified file gets
spread out over multiple observations in the other. Typically, this is to
bring in information about non-key attributes. For example, in a file of
persons, you may merge in information about their families. (And you can
use the -uniqusing- or -uniqmaster- option, depending on which direction you
are going.)
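To make the "spreading" concrete, here is a small Python sketch of the person/family example. It is only an illustration of the idea, not Stata code, and the data and field names (family_id, income) are made up for the purpose.

```python
# Hypothetical data: family_id is unique in `families` but repeats in `persons`.
families = {
    10: {"income": 52000},
    20: {"income": 31000},
}
persons = [
    {"person": "A", "family_id": 10},
    {"person": "B", "family_id": 10},
    {"person": "C", "family_id": 20},
]

# Each family record is spread over every person that shares its family_id.
merged = [
    {**p, **families[p["family_id"]]}
    for p in persons
    if p["family_id"] in families
]

for row in merged:
    print(row)
```

Persons A and B both receive the income of family 10, which is the one-to-many spreading described above.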
However, -merge- will still work even if neither file is uniquely identified
by the identifier(s). It is rare that you would want to do that; it usually
leads to meaningless pairings. So you need to be careful about what you are
doing and why. In ten years of merging, I have done it once (to
produce something for clerical inspection).
When this situation occurs, the matchings proceed one-to-one in the order
that observations appear, until one side or the other runs out of
observations. Then "spreading" occurs on the remainder. (This is a
generalization of the one-to-many or many-to-one matching described
above.) Suppose that the in-memory file has 4 observations with a
particular value in the matching identifier, and that the using file has 6
observations with that same value in the matching identifier. Then, for
these observations, the first four in each file will be paired in the order
received, and the fourth (final) observation in the in-memory file will also
be paired with the remaining two in the using file.
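The pairing rule just described can be sketched in a few lines of Python. This is a simulation of the behavior as I have described it, for a single value of the identifier, and not the actual -merge- implementation; the function name and the labels m1..m4 and u1..u6 are hypothetical.

```python
def pair_within_key(master_rows, using_rows):
    """Pair rows that share one identifier value.

    Rows are paired one-to-one in order; when one side runs out,
    its last row is reused for the remaining rows of the other side.
    Unmatched groups (one side empty) are not handled in this sketch.
    """
    if not master_rows or not using_rows:
        return []
    n = max(len(master_rows), len(using_rows))
    pairs = []
    for i in range(n):
        m = master_rows[min(i, len(master_rows) - 1)]
        u = using_rows[min(i, len(using_rows) - 1)]
        pairs.append((m, u))
    return pairs


master = ["m1", "m2", "m3", "m4"]              # 4 in-memory observations
using = ["u1", "u2", "u3", "u4", "u5", "u6"]   # 6 using observations

pairs = pair_within_key(master, using)
for pair in pairs:
    print(pair)
```

With 4 and 6 observations this yields six pairs: m1 to m4 matched with u1 to u4 in order, then m4 again with u5 and with u6. When one side has a single row, the same function reduces to the ordinary one-to-many case.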
Understand that in this situation, the pairings are probably meaningless;
they share a value in the matching identifier, but there is no particular
reason that observation a got paired with observation b. Furthermore,
unless you impose stable sorting, the resulting pairings are not guaranteed
to be reproducible.
(A weaker condition for getting "meaningful" pairings is that the
identifier be unique in one or the other file for any particular value(s)
in the matching identifier(s) -- but not necessarily the same file in every
case. While this leads to possibly meaningful pairings, which are also
reproducible, it is a contrived situation that wouldn't naturally occur --
as far as I can see.)
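If you want to check whether a pair of files satisfies this weaker condition, one approach is to count how often each identifier value occurs in each file and list the values that repeat in both. The Python sketch below assumes the identifier values have already been pulled out into two lists; the function name is hypothetical.

```python
from collections import Counter


def doubly_repeated_keys(master_keys, using_keys):
    """Return identifier values that occur more than once in BOTH files.

    An empty result means every shared value is unique in at least one
    of the two files, which is the weaker condition described above.
    """
    master_counts = Counter(master_keys)
    using_counts = Counter(using_keys)
    shared = master_counts.keys() & using_counts.keys()
    return sorted(
        k for k in shared
        if master_counts[k] > 1 and using_counts[k] > 1
    )


# Value 1 repeats only in the master, value 2 only in the using file:
# neither file is uniquely identified, yet the weaker condition holds.
ok = doubly_repeated_keys([1, 1, 2, 3], [1, 2, 2, 3])
print(ok)

# Value 1 repeats in both files, so its pairings would be arbitrary.
bad = doubly_repeated_keys([1, 1, 2], [1, 1, 1, 2])
print(bad)
```

The first call prints an empty list and the second prints [1], flagging the value whose pairings would depend on observation order.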
I hope this has been useful.
-- David
David Kantor
Institute for Policy Studies
Johns Hopkins University
[email protected]
410-516-5404