Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Sara Neto Machado <mail@saramachado.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: find the corresponding values between two variables |
Date | Fri, 18 Apr 2014 00:04:26 +0100 |
Joe, Yes, I run your code exactly like you suggested me. I had no -id- variable before the gen id=_n generated by your method. I think I understand why there is an error with reshaping and is due to my initial inquiry which is the reason why I am having so much trouble to solve this issue. In order to better explain my theory about the problem let me send again the example that I provide in the beginning with more detail information that may help us with the reasoning issue. thus, for the sake of illustration, let's consider the following: c_est----w_est---c_trab----w_trab 10----------10-----------10-------11 11-----------3------------11-------3 13-----------4-------------17-------5 17------------5-------------18-------7 18---------10----------------23-----3 23-----------5----------------25-----6 25----------9-----------------37-----3 35---------5------------------"."-----"." 37----------6----------------"."------"." 40----------4-----------------"."-----"." c_"something" represents a code of a company and w_"something" represents the number of workers for that exactly company in a given year. Thus, the values of c_Est and c_trab when doing the comparison should be equal and the value of w_est and w_trab should be equal as well but that is not happening in reality.. So I need to understand where those differences appear as I notice they come from the worker's side as well as from the companies side since there are some companies missing in c_Trab. Stata is defining missing value for those cases at the end of the database because the data is mismatch along the database. By taking into account this example, we have 3 missing values that are due to c_est == 13 & c_est == 35 & c_est == 40 observations that do not have a match in c_trab like their were supposed too. If we apply your reshape syntax code in our example it will give a result of 3 missing values for -id-. The problem is always the mismatch situation with companies that I do not know how to indentify.... In my real database I have 28 cases of mismatch and thus 28 missing values for -id-. However, and since missing values always appear in the end I do not know to which companies they belong. If you have any more suggestions for this problem I will be much appreciated. Thank you for your valuable input. regards, Sara 2014-04-17 15:22 GMT+01:00 Joe Canner <jcanner1@jhmi.edu>: > Sara, > > Based on the results of -reshape error- it looks like you have a bunch of observations for which -id- is missing. This, in turn, causes problems with duplicate -j- values. Did you generate -id- using my suggestion (gen id=_n) or did you already have a variable called -id- in your dataset? My method should not have resulted in missing values for -id-. If you are using your own -id- variable you will need to figure out why there are missing values and either fix them or get rid of them. > > Regards, > Joe > > -----Original Message----- > From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Sara Neto Machado > Sent: Wednesday, April 16, 2014 8:28 PM > To: statalist@hsphsun2.harvard.edu > Subject: Re: st: find the corresponding values between two variables > > Dear Nick and Joe, > > For Nick: > initially I did something similar of what you have proposed > gen diff4 = w_est - w_trab if c_est != c_trab > to understand how much was the difference between the cases that > didn't match. However I found the result to be a lot bigger that was > supposed to, therefore I notice a new problem: the mismatch that Joe > correclty understood from my previous example sample. I know that I > have 28 cases that do not match, however they are all over the > database and I need to identify them. > > Taking into account Joe's solution: > I read about the reshape syntax before but I though it woulnd't apply > for my case because it doesn´t allow me to isolate the mismatch cases > instead I need to eliminate those which I do not want to do. However, > I test all your code and an error occur (see result below, please) due > to the exactly 28 mismatch that I state previously and thus I continue > to not know how to isolate them since the 28 missing values appear in > the end which is not correct since they appear all over the > database.... > > there isn't any other syntax that just do some kind of sorting with > -if- statements to allow to indentify this mismatch? I know that we > have - sort by - but also does no good for my problem. > > STATA RESULT: > reshape wide c_ w_, i(id) j(j) string > (note: j = emp est trab) > j not unique within id; > there are multiple observations at the same j within id. > Type "reshape error" for a listing of the problem observations. > . reshape error > (note: j = emp est trab) > > i (id) indicates the top-level grouping such as subject id. > j (j) indicates the subgrouping such as time. > The data are in the long form; j should be unique within i. > > There are multiple observations on the same j within id. > > The following 28 of 1049448 observations have repeated j values: > > +-----------+ > | id j | > |-----------| > 1049421. | . trab | > 1049422. | . trab | > 1049423. | . trab | > 1049424. | . trab | > 1049425. | . trab | > |-----------| > 1049426. | . trab | > 1049427. | . trab | > 1049428. | . trab | > 1049429. | . trab | > 1049430. | . trab | > |-----------| > 1049431. | . trab | > 1049432. | . trab | > 1049433. | . trab | > 1049434. | . trab | > 1049435. | . trab | > |-----------| > 1049436. | . trab | > 1049437. | . trab | > 1049438. | . trab | > 1049439. | . trab | > 1049440. | . trab | > |-----------| > 1049441. | . trab | > 1049442. | . trab | > 1049443. | . trab | > 1049444. | . trab | > 1049445. | . trab | > |-----------| > 1049446. | . trab | > 1049447. | . trab | > 1049448. | . trab | > +-----------+ > > (data now sorted by id j) > > thanks in advance! > Kind regards, > Sara > > 2014-04-16 20:00 GMT+01:00 Joe Canner <jcanner1@jhmi.edu>: >> The problem as I understand it (which may not be correct) is that the entry with c_est==13 is not just a problem for that observation but its presence makes all of the rest of the observations mismatched as well. >> >> What might help here is to -reshape- the dataset so that all (c_est,c_trab) pairs can be identified and all unmatched cases eliminated. Or, if you'd rather, just reshape again based on the c_ variables and make the mismatches have a missing partner. Something like: >> >> gen id=_n >> reshape long w_ c_, i(id) j(j) string >> replace id=c_ >> reshape wide c_ w_, i(id) j(j) string >> >> This assumes that your c_ variables are unique. If not, you will have to modify this or do something else entirely. >> >> Regards, >> Joe Canner >> Johns Hopkins University School of Medicine >> >> -----Original Message----- >> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox >> Sent: Wednesday, April 16, 2014 2:38 PM >> To: statalist@hsphsun2.harvard.edu >> Subject: Re: st: find the corresponding values between two variables >> >> Not clear what you want (or what you tried: you show no code) but does >> >> gen diff = w_est - w_trab if c_est == c_trab >> >> or >> >> gen OK = c_est == c_trab >> >> edit if OK >> >> help? >> Nick >> njcoxstata@gmail.com >> >> >> On 16 April 2014 19:31, Sara Neto Machado <mail@saramachado.com> wrote: >>> Dear all, >>> >>> c_est----w_est---c_trab----w_trab >>> 10----------10-----------10-------11 >>> 11-----------3------------11-------3 >>> 13-----------4-------------17-------5 >>> 17------------5-------------18-------7 >>> 18---------10----------------23-----3 >>> 23-----------5----------------25-----6 >>> >>> my aim is to perform the differences between w_est and w_trab for the >>> same values of c_est and c_trab. However, I have along the dataset as >>> well as values that do not coincide in c_trab and c_est (Eg from the >>> sample: 13) that "ruins" the sorting between those columms. I want the >>> 13 to appear like missing values on the same line of c_trab. I am >>> trying to search any syntax that suits my purpose and nothing relevant >>> came up. Maybe there are other alternatives that I am not seeing now.. >>> >>> Can anyone help me? much appreciated! >>> >>> regards, >>> Sara >>> * >>> * For searches and help try: >>> * http://www.stata.com/help.cgi?search >>> * http://www.stata.com/support/faqs/resources/statalist-faq/ >>> * http://www.ats.ucla.edu/stat/stata/ >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ >> >> * >> * For searches and help try: >> * http://www.stata.com/help.cgi?search >> * http://www.stata.com/support/faqs/resources/statalist-faq/ >> * http://www.ats.ucla.edu/stat/stata/ > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/faqs/resources/statalist-faq/ > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/