The Stata listserver

Re: st: wrong number of observations after append


From   [email protected] (William Gould)
To   [email protected]
Subject   Re: st: wrong number of observations after append
Date   Tue, 03 Sep 2002 11:11:47 -0500

Karyen Chu <[email protected]> reports, 

> I have 50 individual-specific datasets ([...] one subject ID per dataset)
> [..]
>
> I then appended all 50 datasets into 1 very large dataset and discovered
> that some of the subjects now have the wrong number of observations!  Some
> have too many observations and some have too few observations although the
> total number of observations for all 50 subjects is correct.

Karyen provided -table- output along with annotated notes about what the
numbers ought to be.  One is tempted just to dismiss this, except that it
was obvious from Karyen's posting that she had already done a lot of work,
and she provided enough information that I, too, must admit that I am at a
loss.

I took Karyen's table, deleted all the rows where the numbers were as Karyen
expected, and then added my own annotation:

    ----------------------                                    My annotation
        subjid |      Freq.      Karyen's annotation            freq-exp
     ----------+-----------       ------------------------       -------
         28722 |     19,296      (should have 19,299 obs)         -3
         50910 |     23,971      (should have 23,972 obs)         -1
        119669 |     59,245      (should have 59,244 obs)         +1
        209171 |      7,991      (should have 7,989 obs)          +2
        323652 |      2,267      (should have 2,269 obs)          -2
        459852 |     14,024      (should have 14,023 obs)         +1
        666481 |     16,679      (should have 16,678 obs)         +1
        836796 |     11,495      (should have 11,494 obs)         +1
    ----------------------                                    ----------
                                                                   0

I also checked all the subjid numbers to see if there could be a float rounding
issue.  The answer is no, all numbers fit into a float just fine and, even if
they did not, all the commands Karyen used are advertised as being robust to
such problems.
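
For anyone who wants to repeat that check, here is a minimal sketch (the
numlist lists only the eight problem ids from the table above; extend it to
all 50 if you like):

    foreach id of numlist 28722 50910 119669 209171 323652 459852 666481 836796 {
        display "`id': " (`id' == float(`id'))   // 1 means the id is held exactly in a float
    }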

So I, like Karyen, am at a loss.  I do, however, have advice on how to figure 
out what is going on.


1.  Trust nothing and no one
----------------------------

In tracing this down, do not trust Stata, do not trust what you have been 
told, do not even trust yourself.  Verify everything.

My suspicion is that Karyen's assumptions are wrong and that the datasets with
which Karyen started have the number of observations Stata reported in the
final -table-.  I would not even have bothered to respond to Karyen's posting
had she not written down the number she expected next to each group.  Where, I
asked myself, did she get those numbers if not from already examining her
assumptions?

Even so, I ask that Karyen once again examine her assumptions.  -use- the
datasets user_usage.X1.28722.dta, user_usage.X1.50910.dta, ..., -describe-
each, and verify that the number of observations is as you expect.
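
A sketch of that check, assuming the filenames follow the pattern above
(again, the numlist shows only the eight problem ids; extend it to all 50):

    foreach subj of numlist 28722 50910 119669 209171 323652 459852 666481 836796 {
        use user_usage.X1.`subj'.dta, clear
        describe, short       // compare the reported number of obs. with what you expect
    }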

If you find that they are, ask yourself whether you have more than one 
copy of the datasets, and whether you perhaps used another copy when you 
ran your program.


2.  Mechanize
-------------

Karyen claims to have run the code 

    ---------------------------------------------------------------------------
    use user_usage.X1.28722.dta, clear
    describe

    foreach subj of numlist 50910 54476 87734 119669 123614 /*
    */ 127871 130008 130722 162245 194574 209171 226711 228761 284310 323652 /*
    */ 326175 328958 360402 370576 371133 407487 413293 415301 417756 459852 /*
    */ 462509 475134 476368 484595 487508 507428 564155 577895 580566 598037 /*
    */ 666481 677056 717037 /*
    */ 751384 763586 788300 828191 836796 876142 /*
    */ 917942 929316 943493 955867 968002  {

         append using user_usage.X1.`subj'.dta

         capture noisily save user_usage.X1.merge.dta, replace
     }
     compress

     sort subjid startedate start_hr start_min start_sec /*
          */ endedate end_hr end_min end_sec

     save user_usage.X1.merge.dta, replace

     describe

     table subjid
    ---------------------------------------------------------------------------

Verify this is true.  Put the code in a do-file (if it is not already), turn
on a log, and run the do-file.  Close the log and keep it.  Do you still
have the problem?

If so, rename this do-file problem1.do.  Anytime you want to recreate the 
problem, all you have to do is type "do problem1".
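
For instance, a run that keeps a log might look like this (the log filename
is just a suggestion):

    log using problem1.log, replace
    do problem1
    log close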


3.  Simplify
------------

We are now going to make problem2.do, problem3.do, ..., at each step
simplifying the problem a little until (1) it becomes obvious to us what the
problem is or (2) the problem vanishes.  If (2), we will have to explore what
is happening between problem{K}.do and problem{K+1}.do, perhaps by breaking it
into more steps.

The first thing I am going to suggest is silly, but it will set the standard
by which I want Karyen to operate.  In each step, we make a small change.

Why, Karyen, the -capture noisily- in front of -save user_usage.X1.merge.dta,
replace-?  Well, I don't care about the answer, and I cannot imagine that 
-capture noisily- is causing any problem, but -capture- has the ability to 
hide problems, so remove the -capture noisily-.  Call that new do-file 
problem2.do.  Run it.  Still have the problem?
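
That is, the save line inside the loop of problem2.do becomes simply

    save user_usage.X1.merge.dta, replace

with everything else left as it was.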

In problem3.do, let's consider the idea -table- is wrong.  Add 

    tabulate subjid, missing

before or after the -table-.  Run it.  Output should match.  Does it?
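
For instance, placed after the -table-, the tail of problem3.do would read

    table subjid
    tabulate subjid, missing      // should show the same frequencies as -table-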

In problem4.do, let's focus on just the ids that have the problem.  Karyen's
code currently begins

    use user_usage.X1.28722.dta, clear
    describe

    foreach subj of numlist 50910 54476 87734 119669 123614 /*
    */ 127871 130008 130722 162245 194574 209171 226711 228761 284310 323652 /*
    */ 326175 328958 360402 370576 371133 407487 413293 415301 417756 459852 /*
    */ 462509 475134 476368 484595 487508 507428 564155 577895 580566 598037 /*
    */ 666481 677056 717037 /*
    */ 751384 763586 788300 828191 836796 876142 /*
    */ 917942 929316 943493 955867 968002  {

change the -foreach- to include just 28722, 50910, 119669, 209171, 323652,
459852, 666481, and 836796.  Run the do-file.  Still have the problem?
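
A sketch of the top of problem4.do after that change, keeping 28722 as the
initially -use-d dataset and putting the other seven ids in the numlist:

    use user_usage.X1.28722.dta, clear
    describe

    foreach subj of numlist 50910 119669 209171 323652 459852 666481 836796 {

The rest of the do-file stays the same.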

In problem5.do, let's try just the pair 50910 and 119669:

    use user_usage.X1.50910.dta, clear
    describe

    foreach subj of numlist 119669 { 

Run it.  Still have the problem?

Keep going like this.  At some point, let's introduce into the code the 
proof that the input datasets are as you claim:

        use user_usage.X1...., clear 
        describe
        tabulate subjid, missing           /* <- new */
        foreach subj of numlist ... {
                preserve
                display "dataset for `subj'":
                use user_usage.X1.`subj'.dta, clear
                tabulate subjid, missing
                restore
                append using user_usage.X1.`subj'.dta
                display "result:"
                tabulate subjid, missing
                save user_usage.X1.merge.dta, replace
        }
        ...

Follow these steps and eventually, Karyen, you will find the problem.

-- Bill
[email protected]
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


