Dear all,
Thanks so much again to Dev for recommending
�reshape-, to Toyoto for venturing an illustrative
example, and especially to Professor Nick Cox for
kindly writing the immensely helpful code to �reshape�
the WDI data. Sorry that I haven�t been able to follow
up on this until now; it took my computer many days to
figure out that Stata might not be able to easily
�manipulate� dataset of this large size.
Indeed, although with 2 or 3 countries� worth of
data from WDI each time, Stata 8 was able to perform
Nick�s powerful �reshape� code quite smoothly and
speedily, it seemed very slow when the entire dataset
of around 200 countries (with 200-500 variables each,
for 43 years) is included. Nick�s first-line �reshape-
command took about 24 hours to execute, the �fillin�
command took much longer and could never finish
completely after about 60 hours before Stata finally
crashed.
The original WDI data is 62.5 MB in Stata, and the
size of the transformed data after the first command
�reshape long v , i(country var) j(year)� grows to 540
MB. My computer has a memory of about 512 MB DDR SDRAM
at 333MHz. The hard drive has 80 GB. Maybe it is
because I am just using the intercooled version of
Stata 8, not the SE, but I also set the Stata memory
to 1000 MB.
Is it true that Stata 8 is no longer suitable for
data manipulation of this kind? If not, how should I
proceed next so that Nick�s code could be run to the
finishing line? If yes, could anyone kindly suggest
alternative statistical software that might do the
job?
Terribly sorry to trouble you all again with this,
but your wise suggestion will be tremendously
appreciated, as always!
Best wishes,
Yumin
--- Nick Cox <[email protected]> wrote:
> Yumin Sheng asked [slight editing here, combining
> two postings]
>
> ---------------------
> I have been trying to save data from World
> Development
> Indicators (WDI, online) into a cross-section
> time-series format. The original data look like the
> following
>
> Country Variable Name 1960 1961 ... 2002
> Afghanistan Aid X X ... X
> Afghanistan Population X X ...X
> ... ... ... ... ... ...
> ... ... ... ... ... ...
> Afghanistan Wage X X ... X
> Albania Aid X X ... X
> Albania Population X X ... X
> ... ... ... ... ... ...
> ... ... ... ... ... ...
> Albania Wage X X ... X
>
> The ideal format I would like to have would be:
>
> Country Aid Population ... ... Wage
> Afghanistan 1960 X X ... X
> Afghanistan 1961 X X ... X
> ... ... ... ... ... ...
> ... ... ... ... ... ...
> Afghanistan 2002 X X ... X
> Albania 1960 X X ... X
> Albania 1961 X X ... X
> ... ... ... ... ... ...
> ... ... ... ... ... ...
> Albania 2002 X X ... X
> ... ... ... ... ... ...
> China 1960 X X ... X
> China 1961 X X ... X
> ... ... ... ... ... ...
> ... ... ... ... ... ...
> China 2002 X X ... X
>
> World Development Indicators have about 200
> variables
> for each country.
>
> But (1) sometimes many words are contained in
> one variable name such as "Export share in GDP, %"
> and (2) some countries have data on more
> variables than do other countries.
> --------------------
>
> Dev Vencappa recommended -reshape- and Toyoto Iwata
> posted code for a example dataset. Toyoto's code
> doesn't extend to coping with the first problem
> above and I am not sure that it would cope with the
> second problem above.
>
> Here is a sketch of slightly more general code:
> the dataset is just an example. (I guess that
> there is a shorter and still general solution
> but I can't find it.)
>
> . l
>
> +------------------------------------------+
> | country variable v1960 v1961 |
> |------------------------------------------|
> 1. | Afghanistan Aid 1 6 |
> 2. | Afghanistan Population 2 7 |
> 3. | Albania Aid 3 8 |
> 4. | Albania Fishing 4 9 |
> 5. | Albania Population 5 10 |
> +------------------------------------------+
>
> -reshape- to long and then rectangularise.
> -fillin- takes care of problem (2).
>
> . reshape long v , i(country var) j(year)
> . fillin country var year
> . drop _fillin
>
> We know from problem (1) that the values in
> -variable- aren't all legal variable names,
> so we can't use them directly. One trick is
>
> . egen group = group(variable)
>
> That gives ways of counting variables and
> years, so our code is more general:
>
> . su group , meanonly
> . local nvars = r(max)
> . qui count if country == country[1] & group ==
> group[1]
> . local nyears = r(N)
>
> Now we split the composite -v- into variables
> and assign variable labels, picking them up
> from -variable-:
>
> . gen order = _n
>
> . qui forval i = 1/`nvars' {
> . gen v`i' = v if group == `i'
> . su order if group == `i', meanonly
> . label var v`i' `"`=variable[`r(min)']'"'
> . }
>
> . drop v variable order
>
> . l
>
> +-------------------------------------------+
> | country year group v1 v2 v3 |
> |-------------------------------------------|
> 1. | Afghanistan 1960 1 1 . . |
> 2. | Afghanistan 1961 1 6 . . |
> 3. | Afghanistan 1960 2 . . . |
> 4. | Afghanistan 1961 2 . . . |
> 5. | Afghanistan 1960 3 . . 2 |
> |-------------------------------------------|
> 6. | Afghanistan 1961 3 . . 7 |
> 7. | Albania 1960 1 3 . . |
> 8. | Albania 1961 1 8 . . |
> 9. | Albania 1960 2 . 4 . |
> 10. | Albania 1961 2 . 9 . |
> |-------------------------------------------|
> 11. | Albania 1960 3 . . 5 |
> 12. | Albania 1961 3 . . 10 |
> +-------------------------------------------+
>
> We need to pull values for most variables from
> later in order:
>
> . qui forval j = 2/`nvars' {
> . replace v`j' = v`j'[_n+`nyears' * (`j' - 1)]
> . }
>
> . keep if group == 1
> . drop group
>
> . l
>
> +-----------------------------------+
> | country year v1 v2 v3 |
> |-----------------------------------|
> 1. | Afghanistan 1960 1 . 2 |
> 2. | Afghanistan 1961 6 . 7 |
> 3. | Albania 1960 3 4 5 |
> 4. | Albania 1961 8 9 10 |
> +-----------------------------------+
>
> I also looked at a solution centred on -xpose-,
> but the preparation and clean-up were not
> easier than this.
>
> Nick
> [email protected]
>
> *
> * For searches and help try:
> *
> http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>
_______________________________
Do you Yahoo!?
Win 1 of 4,000 free domain names from Yahoo! Enter now.
http://promotions.yahoo.com/goldrush
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/