st: How to Correctly Structure a CSV before Loading it into STATA
From: "Stephen R. Clark" <[email protected]>
To: <[email protected]>
Subject: st: How to Correctly Structure a CSV before Loading it into STATA
Date: Wed, 26 May 2010 23:42:45 -0500
Dear Statalisters:
Hello. I am a long-time member, but a first-time writer.
I am using STATA/IC 10.1.
I have primarily used STATA for cross-sectional analysis, but I now need to
use it to engage in panel data analysis. Thankfully, from my reading of
posts to this forum, I have learned that STATA has very powerful panel data
analysis features.
Now, let me get to my question. I have an unbalanced panel of data that
consists of 20 cross-sectional units (markets). Each of these markets
contains a different number of time-series (daily) observations. These range
from 31 days for the shortest market to 48 days for the longest market.
I currently have the data in stacked (long) form in a CSV file. I am
dealing with "relative dates," so I am just using integer values (not actual
dates) for the date variable. The data are, somewhat arbitrarily, organized
in this stacked format according to alphabetical order of the cross-section
name. To be as clear as possible, let me specify in more detail how the data
are arranged in the CSV file. The columns are:
Relative-Day | Market (# of observations) | Dependent Variable | Independent Variables
Under the relevant headings, I have 43 observations for "Market A." I then
have 41 observations for "Market B," and so on until "Market T" (the 20th
and final market), which has 40 observations.
The missing data values can arguably be considered as randomly missing, so I
am not concerned about any potential inferential problems associated with
having an unbalanced panel. What I am concerned with is how to structure the
data in the CSV file before importing it into STATA.
Since the longest market has 48 observations, do I need to have 48 rows for
each cross-section, with blank cells where the data are missing? In other
words, do I need to "artificially balance" the data before importing it into
STATA? If not, then will I be fine leaving the data in stacked (long)
format, given an unequal number of days for each of the cross-sections?
In considering my question, please be advised that my analysis will involve
the use of lagged values of the dependent variable. In other words, I will
be conducting dynamic panel data analysis. As such, I need STATA to
recognize the panel structure of the data and not "lag into" the values for
the preceding cross-section.
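To make the requirement concrete, here is a rough sketch of declaring a stacked
(long) file as a panel and taking a lag. It is an illustration under
assumptions, not a tested solution for this data set: the file name
markets.csv and the variable names market, relday, and y are hypothetical
placeholders, and the commands are those available in STATA 10.

```stata
* Sketch only: markets.csv, market, relday, and y are hypothetical names.
* Read the stacked CSV; the first row is taken as variable names.
insheet using "markets.csv", comma clear

* xtset needs a numeric panel identifier, so encode the market name.
encode market, generate(market_id)

* Declare the panel: market_id is the cross-section,
* relday is the integer (relative-day) time index.
xtset market_id relday

* Describe the participation pattern of the panel.
xtdescribe

* The L. operator is evaluated within each market, so the first day of a
* market gets a missing lag, not the last value of the preceding market.
generate y_lag1 = L.y
```

Once the data are declared with xtset, time-series operators such as L. work
within each panel, and the row order in the CSV (alphabetical by market here)
does not affect them.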
Finally, if I need to "artificially balance" the data prior to importing it
into STATA, then should I enter the NA values at the beginning or at the end
of the respective markets? For instance, say that I am dealing with Market
A, which has 43 observations. With the maximum number of observations at 48,
I would need to enter 5 NA values. Should I do this as:
NA
NA
NA
NA
NA
43 values
or as
43 values
NA
NA
NA
NA
NA
Thanks in advance for your help.
Stephen Clark