Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: merge m:1 by string
From
"Ben Ammar" <[email protected]>
To
[email protected]
Subject
Re: st: merge m:1 by string
Date
Sat, 19 Mar 2011 15:18:38 +0100
Hi Rebecca,
Thanks for your answer and your hint. You are right: there were still trailing blanks in the strings, which I hadn't considered.
For others who might encounter the same problem: apply the trim() function to your string variables before you merge the datasets.
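
A minimal sketch of that fix, using toy data with names from this thread. The trailing blanks are added on purpose to mimic the problem, and the tempfile name budgets is only an illustration, not something from the original data.

```stata
* Sketch only: toy data; blanks are appended deliberately.
clear
input str32 name budget
"Alex T. Smith" 130
"Steve R. Jackson" 245
end
replace name = name + "  " in 1
* Remove leading and trailing blanks from the key in the using data
replace name = trim(name)
* Trimming must not create duplicate keys on the "1" side of m:1
isid name
tempfile budgets
save `budgets'

clear
input str32 name household1 date
"Alex T. Smith" 45 1988
"Alex T. Smith" 33 1977
"Steve R. Jackson" 23 1979
end
replace name = name + " " in 2
* Same cleaning on the master side, then merge and inspect the result
replace name = trim(name)
merge m:1 name using `budgets'
tabulate _merge
```

Trimming both files matters, because a blank left on either side is enough to make two otherwise identical names compare as different.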
Cheers
Ben
-------- Original Message --------
> Date: Fri, 18 Mar 2011 18:45:18 -0500
> From: Rebecca Pope <[email protected]>
> To: [email protected]
> Subject: Re: st: merge m:1 by string
> Ben,
> If this is real data from your sample, I'm not sure what is causing
> your problem. I wasn't able to duplicate the issue you describe.
>
> /***** begin code *****/
> clear
> input str32 name budget
> "Alex T. Smith" 130
> "Andrew J. Williams" 345
> "Steve R. Jackson" 245
> end
> save using, replace
>
> clear
> input str32 name household1 date
> "Alex T. Smith" 45 1988
> "Alex T. Smith" 33 1977
> "Andrew J. williams" 12 1999
> "Andrew J. Williams" 12 2004
> "Steve R. Jackson" 23 1979
> end
>
> merge m:1 name using using
>
> list
> /***** end code *****/
>
> /**** output - apologies if this doesn't line up on your end... ****/
>
>      name                househ~1   date   budget   _merge
>      --------------------------------------------------------------
>   1. Alex T. Smith             45   1988      130   matched (3)
>   2. Alex T. Smith             33   1977      130   matched (3)
>   3. Andrew J. Williams        12   2004      345   matched (3)
>   4. Andrew J. williams        12   1999        .   master only (1)
>   5. Steve R. Jackson          23   1979      245   matched (3)
>
> /********/
> As you can see, Stata matches everything except obs. #4 above, but
> that's to be expected because "williams" is not equivalent to "Williams";
> Stata is case-sensitive.
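
If differences in capitalization are known to be data-entry noise rather than distinct people, one possible workaround is to merge on a lowercased copy of the name. This is a sketch under that assumption; the variable name key and the tempfile name budgets are hypothetical.

```stata
* Sketch only: merge on a case-normalized key instead of the raw name.
clear
input str32 name budget
"Alex T. Smith" 130
"Andrew J. Williams" 345
"Steve R. Jackson" 245
end
generate str32 key = lower(name)
* The lowercased key must still identify observations uniquely
isid key
keep key budget
tempfile budgets
save `budgets'

clear
input str32 name household1 date
"Alex T. Smith" 45 1988
"Andrew J. williams" 12 1999
"Andrew J. Williams" 12 2004
"Steve R. Jackson" 23 1979
end
generate str32 key = lower(name)
merge m:1 key using `budgets'
list name household1 date budget _merge
```

The original name is kept in the master, so the spelling variants remain visible after the merge.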
>
> Also, please verify that either (1) this produces the same results on
> your computer or (2) that the same problem emerges even when you run
> this code. Since you didn't specify, I'm assuming you are running
> Stata 11.
>
> If this code works for you, my guess is that there are differences in
> your actual data you can't see by just "eyeballing" it. You say you
> checked for leading spaces. Did you check for trailing ones?
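
One way to check for blanks that cannot be seen in a listing is to compare each string with its trimmed version. The sketch below uses toy data with a trailing blank added on purpose; the variable names len and shown are illustrative only.

```stata
* Sketch only: detect leading or trailing blanks in a string key.
clear
input str32 name
"Alex T. Smith"
"Steve R. Jackson"
end
replace name = name + "  " in 2
* Number of observations whose value changes when trimmed
count if name != trim(name)
* Trailing blanks only
count if name != rtrim(name)
* Make the blanks visible: string length plus a bracketed copy
generate len = length(name)
generate str34 shown = "[" + name + "]"
list shown len if name != trim(name)
```

Running the same checks on both the master and the using shows which side carries the stray blanks.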
>
> As regards -encode-, I think you are using it incorrectly or at least
> expecting it to be something it isn't. It just generates a numeric
> variable that takes a new value for each distinct value of the string.
> The codes are assigned separately within each dataset (in sorted order
> of the distinct strings present there), so the same name can receive
> different codes in the master and in the using. Observe the results
> below (code not shown): "ename" is the encoded name in the master set
> and "ename_u" is the encoded name in the using. As you can see, the
> encoded names differ for obs. 4 & 5.
>
>      name                househ~1   date   ename   budget   ename_u   _merge
>      ------------------------------------------------------------------------
>   1. Alex T. Smith             33   1977       1      130         1        3
>   2. Alex T. Smith             45   1988       1      130         1        3
>   3. Andrew J. Williams        12   2004       2      345         2        3
>   4. Andrew J. williams        12   1999       3        .         .        1
>   5. Steve R. Jackson          23   1979       4      245         3        3
>
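
The code behind that listing was not shown. A rough sketch of how such a comparison could be produced is to encode the name separately in each dataset and list the numeric codes without their labels. The variable names ename and ename_u follow the listing; everything else is assumed.

```stata
* Sketch only: -encode- numbers the distinct strings within each dataset,
* so the codes are not comparable across files.
clear
input str32 name
"Alex T. Smith"
"Andrew J. Williams"
"Steve R. Jackson"
end
encode name, generate(ename_u)
list name ename_u, nolabel

clear
input str32 name
"Alex T. Smith"
"Alex T. Smith"
"Andrew J. williams"
"Andrew J. Williams"
"Steve R. Jackson"
end
encode name, generate(ename)
list name ename, nolabel
```

Because the master contains one more distinct spelling than the using, the two sets of codes drift apart, which is why merging on the encoded variable matches so few observations.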
> Hope this helps,
> Rebecca
>
>
>
> __o __o
> _`\ <,_ _`\ <,_
> (_)/ (_) (_)/ (_)
> =========================
>
>
> On Fri, Mar 18, 2011 at 5:21 PM, Ben Ammar <[email protected]> wrote:
> >
> > Hi everybody,
> >
> > I've got a problem concerning the merge command, or rather its result.
> > I'd be very grateful for any help. There are more than 2 million names
> > (str32) in my master and 4000 names (str32) in my using for the
> > variable (name) I want to merge on. Since there are multiple
> > observations with the same name in my master but only one observation
> > per name in the using, m:1 should be the correct merge type.
> >
> > master:
> > name household1 date
> >
> > Alex T. Smith 45 1988
> > Alex T. Smith 33 1977
> > Andrew J. williams 12 1999
> > Andrew J. Williams 12 2004
> > Steve R. Jackson 23 1979
> >
> >
> > using:
> > name budget
> >
> > Alex T. Smith 130
> > Andrew J. Williams 345
> > Steve R. Jackson 245
> >
> >
> > But what happens is that the using data are appended at the end of the
> > master after the merge. I think the problem is the string variable,
> > even though I don't understand why. When I encoded the string variable
> > (name), about 8000 observations (out of 2 million) in the master were
> > matched as they should be, but that is far from enough. The format of
> > the variable is the same in both datasets, and I even sorted them. I
> > also checked whether there is a space at the beginning of the name or
> > anything within the string that differs from the name in the using,
> > but the two string variables look exactly the same. The last
> > (unlikely) possibility I checked was memory: I dropped all other
> > variables, which could have taken too much RAM and thereby explained
> > why only a very small part was matched when I tried encoding the
> > string. That didn't work either. Does anyone have an idea about this,
> > or has anyone had the same experience? Thanks for any comments!
> >
> > Regards
> > Ben
> >
> >
> > *
> > * For searches and help try:
> > * http://www.stata.com/help.cgi?search
> > * http://www.stata.com/support/statalist/faq
> > * http://www.ats.ucla.edu/stat/stata/