Well,
interesting thoughts. Maybe I am overzealous in wanting the whole dataset in
memory, but I have a hunch that this should be possible in the latest
version of an otherwise perfect statistical program. I have never touched
the outer limits of the capabilities of hard- and software, so this is a new
situation for me. Having limited my research to the right tail of the income
distribution (which begins at two times average income), the size of the
file has dropped to 1.9 GB, which fits comfortably into my memory without
touching virtual memory.
For those bitten by the same problem: I find -describe using- particularly
helpful, as it lets you peek into the contents of a file without actually
opening it. Also bear in mind that -db use_option- lets you select cases
and/or variables before you actually open the file.
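The two steps above can also be typed directly as commands. What follows is only a minimal sketch: the file name bigfile_demo, the variables id, income and comment, and the cutoff of 50000 are all made up for illustration, and a small stand-in file is built first so that the lines run as they are.

```stata
* Build a small stand-in file (hypothetical names and values)
clear
set obs 1000
gen id = _n
gen income = 100 * _n
gen str30 comment = ""
save bigfile_demo, replace

* Peek at the variables and storage types without loading the data
clear
describe using bigfile_demo

* Load only two variables, and only the cases above the cutoff
use id income if income > 50000 using bigfile_demo, clear
count
```

With a real file, only the last two commands are needed; the cutoff has to be known beforehand, since the -if- condition is applied while the file is being read.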
Martin Weiss
_________________________________________________________________
Diplom-Kaufmann Martin Weiss
Mohlstrasse 36
Room 415
72074 Tuebingen
Germany
Fon: 0049-7071-2978184
Home: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1130
Publications: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1131
SSRN: http://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=669945
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Paul Seed
Sent: Friday, May 02, 2008 4:04 PM
To: [email protected]
Subject: RE: st: File sizes in Stata & SPSS (was Weights )
Dear Statalist,
Martin Weiss <[email protected]> has been asking for help in
handling an extremely long file that seems to gain size when converted from
CSV to Stata, but not to SPSS. For reasons of confidentiality, he cannot
tell us what is in it; but some comments suggest that the problem might
relate to variable length strings. For instance, there might be a comment
field that is generally blank, but in a few cases contains a long &
extremely detailed response. (A hymn of praise or a bitter complaint,
perhaps).
As Stata allocates each string variable a fixed length, there will be a lot
of unused space. SPSS, which can store strings of variable length, avoids
this waste.
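The fixed allocation can be seen from the storage type, which -compress- cannot shrink below the longest value present. A minimal sketch, assuming a made-up variable s that is blank in all but one record:

```stata
* Illustrative only: the variable name s is made up
clear
set obs 30000
gen str30 s = ""
replace s = "123456789012345678901234567890" in 1
compress
* The storage type is still str30, although 29,999 records are blank
display "`: type s'"
* Bytes reserved for s across all records: 30 per record
display _N * 30
```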
To check this out, I wrote a script that produces 3 files: example1 contains
a string of 30 characters that is always full; example2 contains a similar
string that is blank except in the first record (similar to Martin's file as
imagined); example3 encodes the string in example2. After saving the files,
I copied them to SPSS using Stat/Transfer, and then checked the file sizes.
In examples 1 & 3, Stata gives smaller files. Only in example 2 does SPSS
"win".
In this case, there is no loss of information due to encoding, as the
maximum length of the string is less than 244 characters. If Martin Weiss
has strings longer than this, and cares about the details contained beyond
character 244, he is perhaps involved in qualitative analysis for which
neither SPSS nor Stata is very useful.
The code is below.
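* Example 1: a 30-character string that is filled in every record
clear
set obs 30000
gen n = _n
gen string = "123456789012345678901234567890"
compress
memory
save example1, replace
* Example 2: the same string, blank in all but the first record
replace string = "" if _n > 1
compress
memory
save example2, replace
* Example 3: the string from example 2, encoded as a labelled numeric variable
encode string, gen(string_)
drop string
compress
memory
save example3, replace
* Copy files to SPSS before continuing
pause on
pause
* Compare the sizes of the Stata and SPSS files
dir example*.*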
Paul
Paul Seed, Senior Lecturer in Medical Statistics
KCL School of Medicine, Division of Reproduction and Endocrinology
tel: (+44) (0) 20 7188 3642
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/