Re: st: Merge Panel Datasets
From: Phil Schumm <[email protected]>
To: [email protected]
Subject: Re: st: Merge Panel Datasets
Date: Mon, 20 Jun 2011 10:39:11 -0500

On Jun 19, 2011, at 8:41 PM, Diana Beketova wrote:
> It is quite true that I first had to create 'total foreign
> ownership' and 'total domestic ownership' in order to make one
> observation line out of many. But I first wanted just to try to
> merge both data files, so I can see whether this merge can succeed
> at all and where my weak points are.

Seems reasonable, though note that you could also do this with

    merge 1:m ID_NUMBER YEAR using file2, keepusing(ID_NUMBER YEAR)

(i.e., ignore for now the rest of the variables in the second file)
which would cut down on your memory usage.
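
A minimal sketch of such a trial merge follows. It assumes the first
(master) file is called file1 and has one row per firm-year; that file
name is hypothetical, and only file2, ID_NUMBER and YEAR come from the
discussion above.

```stata
* file1 is a hypothetical name for the master file (one row per firm-year)
use file1, clear

* Stops with an error if ID_NUMBER and YEAR do not uniquely identify rows
isid ID_NUMBER YEAR

* Trial merge that brings in only the key variables from file2
merge 1:m ID_NUMBER YEAR using file2, keepusing(ID_NUMBER YEAR)

* _merge: 1 = in master only, 2 = in using only, 3 = matched
tabulate _merge
```

The tabulation of _merge shows how many firm-years fail to match, which
is where to start looking for problems in the keys.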

> I had an idea about building year clusters, because my data cover
> the years 2002-2010. So I can build three 3-year clusters: 2002-2004,
> 2005-2007, 2008-2010. Within each of these clusters I can generate
> new variables for Total Assets and Oper. Revenue that are the
> averages of Total Assets and Oper. Revenue within the cluster.
> Because ownership is so oddly distributed, there is a high
> probability that there will be only one observation per year
> cluster. At the end I would use the Heckman correction method to
> correct for selection bias, or alternatively a Tobit model for
> censored variables. Do you think this methodology could be
> reasonable to use? Otherwise, I don't know how to match these two
> files. I have to say that the data come from an emerging market and
> are very biased and incomplete. Do you perhaps know other ways to
> deal with the bias problem?

I don't see how your "cluster" strategy is related to the use of a
selection model (e.g., Heckman) or censored regression model (e.g.,
Tobit). Moreover, I know absolutely nothing about this substantive
area, so I cannot comment intelligently on your strategy. Grouping
three years together may affect your results (e.g., it will smooth out
year-to-year changes), so at a minimum, you would need to do a
sensitivity analysis to see how your choice of endpoints (including
size of "cluster") affects things. Of perhaps less importance, you
might also want to take account of the fact that a mean of three years
has different properties than a mean of only one year (if the data for
the other two years are missing).
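
As a rough sketch of how both points could be checked, the code below
builds the proposed three-year groups, records how many non-missing
years each mean is based on, and defines one alternative set of
endpoints for comparison. It assumes one row per firm-year and a
numeric YEAR running from 2002 to 2010. TOTAL_ASSETS, period,
period_alt, mean_assets and n_years are hypothetical names, not taken
from the actual files.

```stata
* TOTAL_ASSETS is a hypothetical variable name; YEAR is assumed numeric
* Proposed grouping: 2002-2004 (0), 2005-2007 (1), 2008-2010 (2)
generate period = floor((YEAR - 2002) / 3)

* Mean within firm and period, and the number of years behind each mean
bysort ID_NUMBER period: egen mean_assets = mean(TOTAL_ASSETS)
bysort ID_NUMBER period: egen n_years = count(TOTAL_ASSETS)

* How many means rest on 1, 2 or 3 years of data?
tabulate n_years

* Alternative endpoints for a sensitivity check:
* 2002 alone (-1), 2003-2005 (0), 2006-2008 (1), 2009-2010 (2)
generate period_alt = floor((YEAR - 2003) / 3)
```

Repeating the analysis with period_alt in place of period, and
comparing the results, gives a first impression of how much the choice
of endpoints matters.
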
Of critical importance before proceeding with any strategy is to have
a good understanding of why the missing data are missing, and to think
about what effects this might have on your results (even if you don't
explicitly take account of this in your analysis).
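
One possible starting point for describing the missingness, again only
as a sketch: TOTAL_ASSETS and OPER_REVENUE are hypothetical variable
names, and misstable requires Stata 11 or later.

```stata
* TOTAL_ASSETS and OPER_REVENUE are hypothetical variable names
* Counts of missing values per variable
misstable summarize TOTAL_ASSETS OPER_REVENUE

* Which combinations of variables tend to be missing together
misstable patterns TOTAL_ASSETS OPER_REVENUE

* Does the share of missing values change from year to year?
generate byte miss_assets = missing(TOTAL_ASSETS)
tabulate YEAR miss_assets, row
```

These commands only describe where values are missing; why they are
missing still has to come from knowledge of how the data were
collected.
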
-- Phil