Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Contract/Collapse Combination
Date: Tue, 22 May 2012 21:55:35 +0100

Thanks for these extra comments. I will focus on those I think I understand.

I did misread you on 10^15 or so.

The best way to get a unique identifier for cross-combinations of variables is something like

    bysort <varlist> : gen id = _n == 1
    replace id = sum(id)

which is canned as -egen-'s -group()- function. The limits on size of variables' values can't bite here, as the number of cross-combinations in the data can't exceed the number of observations.

Nick

On Tue, May 22, 2012 at 9:15 PM, Lucas <lucaselastic@gmail.com> wrote:
> I guess I too did not clarify everything because I hoped attention
> would focus on the problem I identified, not my reasons for why
> workarounds don't work. But, to clarify:
>
> The 15 variable ID is not 10^15. The confusion stems from reading my
> note as concerning a 15 DIGIT identifier. However, I did not
> reference a 15-digit identifier, I referenced an identifier made out
> of the *collective* digits of 15 variables. Some of the fifteen
> variables are continuous, which means they have lots of categories,
> which means they account for lots of digits. Thus, expanding to make
> an id out of them *will* exceed the size of the largest number allowed
> in stata. I do not maintain this is a general reality, it is a
> reality in my data.
>
> Second, it is not a bug if a user asks their machine to do something
> with stata but the user has insufficient memory to do the task. I've
> seen that issue discussed here, and I explicitly asked that stata add
> an ability to use the disk as RAM, and when I asked they declined.
> Their thinking was it would be too slow. My thinking was slow beats
> impossible. But, I am not a decision-maker at stata, so I respect
> their decision, their allocation of programming effort, and live with
> the consequences. Why this seems troubling to someone, I am not sure.
>
> I saw your post about joint frequencies, and will try the solution.
> I have not had a chance because I am running something else on stata at
> the moment.
>
> Thanks a bunch.
> Sam
>
> On Tue, May 22, 2012 at 10:07 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
>> I am finding it very difficult to work out what you are seeking in this thread.
>>
>> First, it really wasn't clear to me from your post that you fully understood the precision problem. Your explanation for why the 15-digit identifier didn't work is below. Here it is again: "it will not work for 15 variables of various types, because the id# will exceed the largest value allowed in stata". But that is wrong, as 10^15 is certainly allowed in Stata. I didn't correct that explicitly, but I pointed to the deeper question of precision, which I guessed was at the root of what you were trying.
>>
>> Second, I understood you earlier as implying that -contract- can not produce reproducible results. Now you seem to imply that this can't be a bug. I'm lost here.
>>
>> BTW, I made a suggestion in an earlier post that you don't need StataCorp or anybody else to hit -contract-. You just need to apply -contract- to get joint frequencies, and then everything you want is implicit in that reduced dataset.
>>
>> Nick
>> n.j.cox@durham.ac.uk
>>
>> Lucas
>>
>> Nick,
>>
>> A composite 6-digit identifier is not a problem. I indicated I did
>> not think it possible to make such an identifier for each cell of
>> 15-way crosstab. So, we are not disagreeing.
>>
>> I don't think contract is buggy. I think a simple (conceptually,
>> perhaps not computer "programmingly") extension of contract to allow
>> multiple (or at least 2) frequency counts seems a good idea if
>> possible, and consistent with the stata-proposed solution of
>> addressing slow estimation on big data with collapsing data and using
>> frequency counts.
>>
>> I won't alert stata--they are listening anyway, and they can easily
>> come back at me and say I should get more memory. And, of course, I'd
>> agree.
>> But, still, we'd be left with a command seemingly within
>> whispering distance of providing a general solution to a common task,
>> but not going that final distance.
>>
>> Thanks, though.
>> Sam
>>
>> On Tue, May 22, 2012 at 9:37 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
>>> The solution here of producing a composite identifier looks likely to fail. You are putting a very big number into a -float- variable and expect to retain every last bit of precision. See
>>>
>>> http://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/
>>>
>>> for why that is a bad idea.
>>>
>>> As for the rest, you seem to be claiming that -contract- is buggy. That is important if true, and you should send in a report containing incontrovertible evidence to Stata tech-support.
>>>
>>> Nick
>>> n.j.cox@durham.ac.uk
>>>
>>> Lucas
>>>
>>> Brendan,
>>>
>>> My original note indicated exactly the solution you propose, of doing
>>> it twice and merging. But this is incredibly risky, because there is
>>> no way to assure every combination appears in both files. Even the
>>> "zero" option apparently cannot assure this. Believe me, I tried this
>>> with about 6 variables, and the file sizes do not equate across
>>> runs--not to mention that one has to be pretty certain everything is
>>> sorted exactly right. I do not know *why* the problem occurred, it
>>> occurred, and perhaps it is that the file is so big, that problems
>>> emerge that do not exist for smaller datasets (e.g., sorted cases fall
>>> out of sorts, as it were).
>>>
>>> At any rate, my response was to make an id based on the 6 variables:
>>>
>>> gen id=(x1*10000)+(x2*1000)+. . .+(x6) ;
>>>
>>> This works for 6 dichotomous variables; it will not work for 15
>>> variables of various types, because the id# will exceed the largest
>>> value allowed in stata.
>>>
>>> THUS, it seems a more general solution is needed, that does not
>>> require a later merge.
>>>
>>> As for your collapse example, it is unclear, as you start with data
>>> that is already collapsed. The problem is the data is not collapsed,
>>> and the aim is to get it into the collapsed form.
>>>
>>> On Tue, May 22, 2012 at 7:50 AM, Brendan Halpin <brendan.halpin@ul.ie> wrote:
>>>> On Tue, May 22 2012, Lucas wrote:
>>>>
>>>>> Is there a way to use the contract command and obtain frequencies for
>>>>> TWO variables rather than just ONE? A corollary question would be, Is
>>>>> there a way to use the contract command and obtain the count of 1's on
>>>>> TWO separate dichotomous variables?
>>>>
>>>> That is what my example achieves, though using -collapse- instead of
>>>> -contract-.
>>>>
>>>> Another way of doing it would be to separate the data by entercol, and
>>>> -contract- or -collapse- it twice, once for entercol==1 and once for
>>>> entercol==0, and then merge the resulting files by the 15 crosstab
>>>> variables.
>>>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
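
The two suggestions in this thread (an identifier built by counting observed cross-combinations, and -contract- for joint frequencies followed by a reduction of the contracted data) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code anyone in the thread actually ran: the simulated data, the three variables x1-x3 standing in for the 15 crosstab variables, and the names id, id2, cellfreq, n1 and n0 are all hypothetical. Only entercol, -contract-, -collapse- and -egen-'s -group()- come from the posts above. Declaring the id as -long- is an added precaution, since the default -float- type holds integers exactly only up to 16,777,216.

```stata
* Sketch only: simulated data; x1-x3 are hypothetical stand-ins for the
* 15 crosstab variables, entercol is the 0/1 variable named in the thread.
clear
set seed 2012
set obs 1000
generate byte x1 = runiform() < 0.5
generate byte x2 = runiform() < 0.3
generate int  x3 = floor(10 * runiform())
generate byte entercol = runiform() < 0.4

* One id per observed cross-combination: flag the first observation of
* each combination, then take the running sum of the flags.
bysort x1 x2 x3 : generate long id = _n == 1
replace id = sum(id)

* The canned equivalent; the two ids should agree observation by observation.
egen long id2 = group(x1 x2 x3)
assert id == id2

* Joint frequencies: one row per observed combination of x1 x2 x3 entercol.
contract x1 x2 x3 entercol, freq(cellfreq)

* Both counts on one row per cell, with no merge step.
generate long n1 = cellfreq * (entercol == 1)
generate long n0 = cellfreq * (entercol == 0)
collapse (sum) n1 n0, by(x1 x2 x3)
```

Because each cell's two counts come out of a single pass over the contracted data, a combination that occurs only with entercol==1 (or only with entercol==0) simply gets a zero in the other column, so the concern about combinations missing from one of two merged files should not arise in this version.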