Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: RE: Duplicate observations
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: RE: Duplicate observations
Date: Mon, 10 Mar 2014 19:02:12 +0000
Joe is right.
Away from -gsort-, a minus sign in a varlist acts as a hyphen.
Good catch!
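
Since the minus sign means "descending" only inside -gsort-, one way to stay within -by:- is to negate the date explicitly and sort on that. What follows is a minimal sketch rather than a definitive recipe, and not necessarily the solution given earlier in the thread. The toy data, the string variable sdate, and the helper variable negdate are assumptions made up for illustration, and date is assumed to be a numeric Stata daily date.

```stata
* Hypothetical toy data mirroring the example quoted below
clear
input str10 reporter str10 partner int year str9 sdate str3 x_1
"Albania" "Germany" 1967 "08apr1967" "n_1"
"Albania" "Germany" 1967 "17dec1967" "n_1"
end

* Assumption: date is held as a numeric daily date
gen date = daily(sdate, "DMY")
format date %td

* negdate is a hypothetical helper: sorting it ascending
* is the same as sorting date descending
gen negdate = -date
bysort reporter partner year (x_1 negdate): gen duplicates = _n

* duplicates == 1 now marks the most recent observation in each group
drop if duplicates > 1

* Checks for the toy data only
assert _N == 1
assert date == td(17dec1967)
list reporter partner year date x_1 duplicates
```

If the counter itself is not needed, sorting ascending with (x_1 date) and then using "keep if _n == _N" under the same -bysort- prefix should give the same result without the helper variable.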
Nick
[email protected]
On 10 March 2014 18:58, Joe Canner <[email protected]> wrote:
> Emanuele,
>
> Nick provided a good solution to your problem, but it's probably worth noting why you had a problem to begin with.
>
> The statement:
>
> by reporter partner year (x_1 -date), sort: gen duplicates=_n
>
> is probably not doing what you want it to do. It looks like you want to sort by x_1 (ascending) and date (descending). However, as far as I am aware, the minus sign to indicate a descending sort can only be used in a -gsort- command. In this context the minus sign is interpreted as a hyphen and thus "x_1 -date" is a variable list (variables x_1 through date). Accordingly, it is not sorting in descending date order, which results in the problem you noted.
>
> If you need to do something like this in the future and Nick's solution doesn't apply, try the following:
>
> gsort reporter partner year x_1 -date
> bysort reporter partner year: gen duplicates=_n
>
> Regards,
> Joe Canner
> Johns Hopkins University School of Medicine
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of emanuele mazzini
> Sent: Monday, March 10, 2014 2:31 PM
> To: [email protected]
> Subject: st: Duplicate observations
>
> Hello to everybody,
>
> I have an issue about duplicate observations that I find puzzling to solve.
> I have data on country-pairs by year and I am interested in two
> specific variables, a date and, say a variable which I call x_1.
>
> Specifically, my data look like this :
>
> reporter partner year date x_1
>
> Albania Austria 1980 6dec1980 n_1
> Albania Austria 1980 15nov1980 n_1
> . . .
> . . .
> . . .
>
> As you may have noticed observations differ amongst them only by date
> and I need to drop them so as to keep the most recent one (hence, in
> this case the second one).
>
> I ran the following commands:
>
> duplicates tag reporter partner year, generate(dup)
>
> by reporter partner year (x_1 -date), sort: gen duplicates=_n
>
> so as to be able to identify duplicates and then - among those with
> dup >0 - drop those for which duplicates > 1.
> This method was suggested in this thread (I take this opportunity to
> thank again), but it seems not to work for some observations.
> Take, for instance the following example:
>
> reporter partner year date x_1 dup duplicates
> Albania Germany 1967 08apr1967 n_1 1 1
> Albania Germany 1967 17dec1967 n_1 1 2
>
> As you may notice, Stata flags the observation dated 17dec1967 as the
> one with duplicates > 1 (which would then be dropped), while I would
> have expected Stata to do the opposite.
>
> Can anyone explain to me why and, possibly, tell me how to deal with this issue?
>
> Thank you very much in advance,
>
> Emanuele
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/