Re: st: how does insheet determine datatypes?
On Jan 6, 2007, at 11:51 AM, Jens Lauritsen wrote:
I edit the raw file manually by adding a record at the top after
variable names or (better) use stata to add that line:
cpr v1 v2 v3
0xx1201956 1 2 2    // this record will force the first variable to string
0101201956 1 2 2
1101201954 1 2 1
etc .... rest of records
and then read as :
insheet using myfile
drop in 1
... and I have the cpr variable as a string without the "fake" record
If you want to avoid having to edit your data file(s), and assuming
the first row contains variable names each of which contains at least
one non-numeric character, you can also simply use the -nonames-
option and then use something like the following:
foreach var of varlist _all {
    ren `var' `=`var'[1]'
}
drop in 1
Of course this will result in *all* your data being read as strings.
However, this is sometimes desirable, as for example when you need to
read and append several files and want to make certain that no data
are lost due to appending a numeric variable onto a string variable
(or vice versa). In cases where this is not desirable, a single call
to -destring- is all that is necessary to restore one or more
variables to numeric.
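As a rough illustration of that last step, here is a minimal sketch that fakes the all-string result with -input- rather than reading a real file. The two data rows are taken from the example quoted above. Leaving cpr out of the -destring- call is just one reasonable choice, made so that its leading zeros survive.

```stata
* Stand-in for the all-string data you get after reading with -nonames-
* and renaming; typed in by hand here so the sketch runs on its own.
clear
input str10 cpr str1 v1 str1 v2 str1 v3
"0101201956" "1" "2" "2"
"1101201954" "1" "2" "1"
end

* One call restores the variables that should be numeric.
* cpr is left out on purpose so it stays a string.
destring v1 v2 v3, replace
describe
```

If every variable should go back to numeric wherever possible, -destring, replace- with no varlist is the usual shortcut. It leaves alone any variable containing non-numeric characters.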
I mention this only because it raises two interesting questions (at
least to me). First, note that the rename above will fail if the
original variable name (as it appears in the first row of the data file)
is not a valid Stata name. -insheet- takes care of this for you, by
automatically deleting spaces and/or special characters, dropping
leading digits, etc. I wonder: Is the function -insheet- uses to do
this exposed? It would be very handy to have such a function
available, such as for use in the snippet above. And although such a
function is easily written in Mata, it would be nice to be able to
use the same one -insheet- uses in cases where you might need
consistency between the two.
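For what it's worth, later Stata releases document a string function, -strtoname()-, that turns an arbitrary string into a legal name. Whether it applies the same rule as -insheet- is an open assumption here, and it probably does not: it replaces illegal characters with underscores and prefixes a leading digit with an underscore, rather than deleting them. A minimal sketch of the rename loop using it follows. The header values "cpr no" and "1st visit" are made up for illustration.

```stata
* Stand-in for a file read with -insheet using myfile, nonames-:
* row 1 holds the (hypothetical) raw column headers.
clear
input str12 v1 str12 v2 str12 v3
"cpr no"     "1st visit" "sex"
"0101201956" "1"         "2"
end

foreach var of varlist _all {
    * strtoname() is assumed available (not in older releases)
    local newname = strtoname(`var'[1])
    rename `var' `newname'
}
drop in 1
describe
```

Two caveats: the loop will still fail if two headers sanitize to the same name, and the resulting names need not match what -insheet- would have produced for the same file.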
Let's suppose it's not exposed. If I were to write such a function
myself, I might use a regular expression to match those elements that
are not valid and remove them. Problem is, since -regexr(s1,re,s2)-
only replaces the first match of re in s1, you can't do this with a
single function call. I wonder: Why does -regexr()- not take a
fourth argument like -subinstr()-, indicating the maximum number of
replacements to make (with . indicating replace all)?
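Absent such an argument, one workaround is simply to loop until nothing matches any more. The sketch below works on a local macro. The sample string and the character class are assumptions chosen for illustration, not the rule -insheet- uses.

```stata
* Strip every character that is not a letter, digit, or underscore
* by calling regexr() repeatedly, since each call replaces only
* the first match.
local s   "2nd var-name!"
local bad "[^A-Za-z0-9_]"

while regexm("`s'", "`bad'") {
    local s = regexr("`s'", "`bad'", "")
}
display "`s'"
```

Note that this only removes the offending characters. A leading digit would still have to be handled separately before the result is a valid name. Newer releases also document -ustrregexra()-, which replaces all matches in one call, so the loop is only needed where that function is unavailable.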
Just some random musings on a Saturday night...
-- Phil