Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Contract/Collapse Combination

From	Lucas <[email protected]>
To	[email protected]
Subject	Re: st: Contract/Collapse Combination
Date	Tue, 22 May 2012 13:15:58 -0700

I guess I, too, did not spell everything out, because I hoped attention
would focus on the problem I identified rather than on my reasons why the
workarounds don't work.  But, to clarify:

The 15-variable ID is not 10^15.  The confusion stems from reading my
note as concerning a 15-DIGIT identifier.  However, I did not
reference a 15-digit identifier; I referenced an identifier made out
of the *collective* digits of 15 variables.  Some of the fifteen
variables are continuous, which means they have lots of distinct values,
which means they account for lots of digits.  Thus, expanding to make
an id out of them *will* exceed the size of the largest number allowed
in Stata.  I do not maintain this is a general reality; it is a
reality in my data.
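
For readers following the identifier discussion, here is a minimal sketch of
one way to label each cell without digit arithmetic. It is not something
proposed in this thread. The names x1-x15 and cellid are hypothetical
stand-ins for the 15 crosstab variables and the new identifier, and the
variables are assumed to be adjacent in the dataset.

```stata
* Sketch only: x1-x15 are hypothetical names for the 15 crosstab variables,
* assumed to sit next to each other in the dataset so the range x1-x15 works.

* A float holds integers exactly only up to 16,777,216 (2^24), so a long
* composite number can be silently rounded when stored in a float:
display %12.0f float(123456789)

* egen's group() numbers the distinct combinations 1, 2, 3, ... instead of
* concatenating digits, so the id is bounded by the number of observations.
* Observations with a missing value in any variable get a missing id unless
* the missing option is added.
egen long cellid = group(x1-x15)
```

Because the identifier counts combinations rather than encoding their
digits, its size depends on how many distinct cells occur, not on how many
digits the underlying variables have.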

Second, it is not a bug if a user asks their machine to do something
with Stata but the machine has insufficient memory to do the task.  I've
seen that issue discussed here, and I explicitly asked that Stata add
an ability to use the disk as RAM; they declined.
Their thinking was it would be too slow.  My thinking was slow beats
impossible.  But I am not a decision-maker at Stata, so I respect
their decision and their allocation of programming effort, and live with
the consequences.  Why this seems troubling to someone, I am not sure.

I saw your post about joint frequencies and will try the solution.  I
have not had a chance yet because I am running something else in Stata at
the moment.
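
A minimal sketch of what the joint-frequencies approach might look like,
under stated assumptions: entercol is the dichotomous variable named earlier
in the thread, while y2 (a second dichotomous variable), x1-x15 (the 15
crosstab variables), and the new variable names are hypothetical.

```stata
* Sketch only: y2 and x1-x15 are hypothetical names; entercol is from the thread.

* One pass of -contract- gives the joint frequency n of every observed
* combination of the 15 crosstab variables and the two dichotomous variables.
contract x1-x15 entercol y2, freq(n)

* Within each row, the count of 1s on a dichotomous variable is n when that
* variable equals 1 and 0 otherwise.
generate long n_entercol = n * (entercol == 1)
generate long n_y2       = n * (y2 == 1)

* Summing over the dichotomous variables leaves one row per crosstab cell,
* holding the cell total and both counts of 1s, with no merge required.
collapse (sum) n n_entercol n_y2, by(x1-x15)
```

Since both counts are derived from the same contracted dataset, there is no
second file whose combinations or sort order need to match.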

Thanks a bunch.
Sam

On Tue, May 22, 2012 at 10:07 AM, Nick Cox <[email protected]> wrote:
> I am finding it very difficult to work out what you are seeking in this thread.
>
> First, it really wasn't clear to me from your post that you fully understood the precision problem. Your explanation for why the 15-digit identifier didn't work is below. Here it is again: "it will not work for 15 variables of various types, because the id# will exceed the largest value allowed in stata". But that is wrong, as 10^15 is certainly allowed in Stata. I didn't correct that explicitly, but I pointed to the deeper question of precision, which I guessed was at the root of what you were trying.
>
> Second, I understood you earlier as implying that -contract- cannot produce reproducible results. Now you seem to imply that this can't be a bug. I'm lost here.
>
> BTW, I made a suggestion in an earlier post that you don't need  StataCorp or anybody else to hit -contract-. You just need to apply -contract- to get joint frequencies, and then everything you want is implicit in that reduced dataset.
>
> Nick
> [email protected]
>
> Lucas
>
> Nick,
>
> A composite 6-digit identifier is not a problem.  I indicated I did
> not think it possible to make such an identifier for each cell of
> a 15-way crosstab.  So, we are not disagreeing.
>
> I don't think contract is buggy.  I think a simple (conceptually,
> perhaps not computer "programmingly") extension of contract to allow
> multiple (or at least 2) frequency counts seems a good idea if
> possible, and consistent with the Stata-proposed solution of
> addressing slow estimation on big data by collapsing the data and using
> frequency counts.
>
> I won't alert Stata--they are listening anyway, and they can easily
> come back at me and say I should get more memory.  And, of course, I'd
> agree.  But, still, we'd be left with a command seemingly within
> whispering distance of providing a general solution to a common task,
> but not going that final distance.
>
> Thanks, though.
> Sam
>
> On Tue, May 22, 2012 at 9:37 AM, Nick Cox <[email protected]> wrote:
>> The solution here of producing a composite identifier looks likely to fail. You are putting a very big number into a -float- variable and expect to retain every last bit of precision. See
>>
>> http://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/
>>
>> for why that is a bad idea.
>>
>> As for the rest, you seem to be claiming that -contract- is buggy. That is important if true, and you should send in a report containing incontrovertible evidence to Stata tech-support.
>>
>> Nick
>> [email protected]
>>
>> Lucas
>>
>> Brendan,
>>
>> My original note indicated exactly the solution you propose, of doing
>> it twice and merging.  But this is incredibly risky, because there is
>> no way to assure every combination appears in both files.  Even the
>> "zero" option apparently cannot assure this.  Believe me, I tried this
>> with about 6 variables, and the file sizes do not match across
>> runs--not to mention that one has to be pretty certain everything is
>> sorted exactly right.  I do not know *why* the problem occurred, but it
>> did; perhaps the file is so big that problems
>> emerge that do not exist for smaller datasets (e.g., sorted cases fall
>> out of sorts, as it were).
>>
>> At any rate, my response was to make an id based on the 6 variables:
>>
>> gen id=(x1*100000)+(x2*10000)+. . .+(x6) ;
>>
>> This works for 6 dichotomous variables; it will not work for 15
>> variables of various types, because the id# will exceed the largest
>> value allowed in Stata.
>>
>> THUS, it seems a more general solution is needed, that does not
>> require a later merge.
>>
>> As for your collapse example, it is unclear, as you start with data
>> that is already collapsed.  The problem is the data is not collapsed,
>> and the aim is to get it into the collapsed form.
>>
>> On Tue, May 22, 2012 at 7:50 AM, Brendan Halpin <[email protected]> wrote:
>>> On Tue, May 22 2012, Lucas wrote:
>>>
>>>> Is there a way to use the contract command and obtain frequencies for
>>>> TWO variables rather than just ONE?  A corollary question would be, Is
>>>> there a way to use the contract command and obtain the count of 1's on
>>>> TWO separate dichotomous variables?
>>>
>>> That is what my example achieves, though using -collapse- instead of
>>> -contract-.
>>>
>>> Another way of doing it would be to separate the data by entercol, and
>>> -contract- or -collapse- it twice, once for entercol==1 and once for
>>> entercol==0, and then merge the resulting files by the 15 crosstab
>>> variables.
>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>

Follow-Ups:
- Re: st: Contract/Collapse Combination
  - From: Nick Cox <[email protected]>

References:
- st: Contract/Collapse Combination
  - From: Lucas <[email protected]>
- Re: st: Contract/Collapse Combination
  - From: Nick Cox <[email protected]>
- Re: st: Contract/Collapse Combination
  - From: Lucas <[email protected]>
- Re: st: Contract/Collapse Combination
  - From: [email protected] (Brendan Halpin)
- Re: st: Contract/Collapse Combination
  - From: Lucas <[email protected]>
- RE: st: Contract/Collapse Combination
  - From: Nick Cox <[email protected]>
- Re: st: Contract/Collapse Combination
  - From: Lucas <[email protected]>
- RE: st: Contract/Collapse Combination
  - From: Nick Cox <[email protected]>
