Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: Improved commands, sample implementations. Any interest?
From
James Sams <[email protected]>
To
[email protected]
Subject
st: Improved commands, sample implementations. Any interest?
Date
Fri, 07 Dec 2012 10:34:50 -0600
I keep changing some user-written commands to suit my purposes or fix things
that have broken over the years and thought I'd contribute these back.
However, some peer review may be a good idea before tracking down the
individual authors and trying to get the changes committed.
Here is a summary of what I have right now:
* collapse_preserve_label.do: preserve variable and value labels of
same-named variables when using collapse. I believe StataCorp has an FAQ
that outlines this program.
* gzfile.ado: provide ability to interact with gzipped dta files using
modern syntax of Stata's various file commands (save, use, append, merge).
Derived from gzsave.
* indexesof.ado: a variant of levelsof to skirt around macro length issues
and provide the index within the dataset of each unique value.
* insheet2.ado: a more reliable insheet, uses replace_dquotes.py.
* labmask.ado: an update to the original labmask to be faster.
Depends on indexesof.
* replace_dquotes.py: Replaces double quotes in csv files with another
character, e.g. a pipe ('|'), so that Stata's insheet does not corrupt the
input. Assumes there are no |'s in the original data. After importing, replace
all |'s in all string variables with double quotes to restore the original
data. The character used is printed to stdout.
* unique.ado: the unique command from SSC, edited to accept a compound if
condition.
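
To make the replace_dquotes.py idea concrete, here is a minimal Python sketch of the quote-swapping round trip. It is not the script from the repository: the function names are hypothetical, it works on an in-memory string rather than a file, and it simply refuses to run if the replacement character already occurs in the data.

```python
def replace_dquotes(text, replacement="|"):
    """Swap every double quote for `replacement` (hypothetical helper).

    Raises ValueError if `replacement` already occurs in the text, because
    the swap could not be undone unambiguously in that case.
    """
    if replacement in text:
        raise ValueError("replacement character already present in data")
    return text.replace('"', replacement)


def restore_dquotes(value, replacement="|"):
    """Undo the swap on a string value after it has been imported."""
    return value.replace(replacement, '"')


sample = 'id,comment\n1,"He said ""hi"""\n'
converted = replace_dquotes(sample)

# Report the character that was used, as the post describes.
print("|")
print(converted, end="")
```

The restore step corresponds to what would be done in Stata on every string variable after insheet has read the converted file.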
You can check out the files and future updates/additions at my bitbucket
repository: https://bitbucket.org/james.sams/statafiles/
There are no help files, but the commands are well documented within each
source file.
A couple of examples of what I've changed:
An example of a performance improvement is labmask.ado, which is derived from
Nick Cox's labmask. On somewhat larger datasets (a couple of million
observations with thousands of unique value/label pairs), this version runs in
a few seconds rather than multiple hours. It also does not require the creation
of any new variables, just a couple of mata vectors; so, it does not increase
memory usage much at all.
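
As a rough illustration of why this can be fast, the unique value/label pairs can be collected in a single pass over the data. The Python sketch below is only an analogy under that assumption; it is not the Mata code in labmask.ado, and the function name is hypothetical.

```python
def value_label_pairs(values, labels):
    """Collect each distinct value with its label in one pass.

    Raises ValueError if one value is paired with two different labels,
    since a value label must map each value to exactly one string.
    """
    pairs = {}
    for value, label in zip(values, labels):
        # setdefault stores the label the first time a value is seen and
        # returns the stored label on every later occurrence.
        if pairs.setdefault(value, label) != label:
            raise ValueError(f"value {value!r} has conflicting labels")
    return pairs


pairs = value_label_pairs([1, 2, 1, 3], ["low", "mid", "low", "high"])
print(pairs)
```

Only the dictionary of pairs is kept, so the extra memory scales with the number of unique pairs rather than with the number of observations.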
insheet breaks constantly for me and for others I support. Between
truncating data, misinterpreting column breaks, and not storing numbers as
double by default, I think insheet should be used more conservatively than most
may expect given the apparent simplicity of the command, especially since many
of these errors are silent and hard to catch.
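
To show why storage type matters, the Python sketch below round-trips a nine-digit identifier through a 4-byte IEEE single, which is what Stata's float type is. The identifier and the helper name are made up for illustration.

```python
import struct


def as_float32(x):
    """Round-trip a number through 4-byte IEEE single precision."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


ident = 123456789          # hypothetical nine-digit ID
stored = as_float32(ident)

print(int(stored))         # 123456792: the last digits have changed
print(int(stored) == ident)
```

A single-precision float holds only about seven significant decimal digits, so long numeric IDs change silently when stored that way; a double holds integers of this size exactly.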
I wrote insheet2/replace_dquotes.py to try to be a catch-all place to put all
the necessary guards for insheet, to be used without second thought. I'm not
100% sure that I've caught everything, but it has worked for me on all the
datasets that have failed with insheet, with the exception of one-observation
files that have no header row, which Stata still reads as having 0
observations unless the 'nonames' option is given.
--
James Sams
[email protected]