Thank you for the replies. I have been unable to resolve the problem, so I am giving more details below, as requested.
The data in the original text dataset look as follows:
1010100100050101112101 var3 var 4...
1010100100050101112102 var3 var 4...
1010100100050101112104 var3 var 4...
1010100100050101112303 var3 var 4...
1010100100050101113101 var3 var 4...
The number in the first column is actually the first 2 variables run together: var1 is 14 digits and var2 is 8 digits, and in the text dataset there is no space between them. Neither var1 nor var2 is supposed to be unique on its own, but the combination of them is (and it is unique in the original data). They do need to be analysed separately, though: var1 is the person identifier and var2 is the activity.
I am now using Stat/Transfer to convert the file (specifying the option ASCII - Delimited). When I look at the data with the "view" option in Stat/Transfer, it looks fine. One relevant point might be that in the 'variables' window of Stat/Transfer, the first variable (which is actually var1 and var2 treated as one) is listed as a string, while the others are floats.
The good news is that I can now make the transfer, and the col1 variable that comes up in Stata (22 digits, combining var1 and var2) is unique. One problem, however, is that when I try to -encode- col1, it does not work: I get error message 134 (that I have tried to encode too many values). There are just under 1.5 million observations.
I then tried specifying col1 in Stat/Transfer as either a float or a long variable, but neither of these works: with long, all the variables come up in Stata as 0, and with float they are no longer unique (no matter how many digits I allow for when formatting the variable).
I guess one option would be to convert the data using Stat/Transfer with col1 in its original string format, then find a way of encoding it (despite the problem of too many observations), and then somehow split col1 into the 2 variables var1 (first 14 digits) and var2 (next 8 digits).
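A minimal sketch of the split described above, assuming the transfer leaves col1 in Stata as a 22-character string variable (var1 and var2 are the names used in this message; this is an illustration under that assumption, not a tested solution):

```stata
* Assumes col1 is a string holding var1 (14 digits) followed by var2 (8 digits)
generate str14 var1 = substr(col1, 1, 14)
generate str8  var2 = substr(col1, 15, 8)

* Exits with an error if var1 and var2 together do not uniquely identify observations
isid var1 var2
```

If string identifiers are acceptable for the analysis, keeping var1 and var2 as strings would avoid the need for -encode- on col1.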
When I try using infix, my command is:
. infix var1 1-14 var2 15-22 using "filename"
I then format the variables so that all the digits are displayed (format %16.0g var1 var2). When I sort by var1 var2, my first 3 observations are as follows; clearly the combination of var1 and var2 is not unique:
var1 var2
10101000765440 1111101
10101000765440 1111101
10101000765440 1111101
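One thing that may be relevant here: -infix- stores a numeric variable as float unless another storage type is given, and a float holds only about 7 digits accurately, so a 14-digit var1 could be rounded on input. A sketch of the same command with explicit storage types, on the assumption that this is the cause ("filename" is the same placeholder as above):

```stata
* Read var1 as double (exact for 14-digit integers) and var2 as long (8 digits fit)
infix double var1 1-14 long var2 15-22 using "filename"
format %14.0f var1
format %8.0f var2

* Alternative: read both as strings, which also keeps any leading zeros
* infix str var1 1-14 str var2 15-22 using "filename"

* Report how many observations share the same (var1, var2) pair
duplicates report var1 var2
```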
Any suggestions would be highly appreciated.
regards,
Gisella
--- On Tue, 7/1/08, Steven Samuels <[email protected]> wrote:
> From: Steven Samuels <[email protected]>
> Subject: Re: st: problem in uploading data into Stata - data "changes"
> To: [email protected]
> Date: Tuesday, July 1, 2008, 3:18 PM
> Gisella,
>
> Show us an example of a data line and your -infix- statements. Also, what are the item separators in your text file (commas, tabs, ...)? If Excel can figure out the variable columns, then StatTransfer can also (see ASCII input options); there is no need to go through Excel.
>
> -Steve
> On Jul 1, 2008, at 11:05 AM, Gisella Young wrote:
>
> > Dear all,
> >
> > I am trying to load a datafile in text format into Stata. I am using the infix command. The problem is that 1 column of data (the firm column, which is the unique identification number for each observation) is different when I open it in Stata from what I can see in the original text file. In fact I have several such text files for various years, and in every case the problem is the same: all variables upload correctly except for the first one. Not only is that number different but it is no longer unique to each observation. It is however the same number of digits as the original. I have checked that the infix command is specified correctly (eg correct number of digits).
> >
> > I have also tried saving the text file into Excel (and applying text-to-columns) and then converting it into a Stata file using Stat-transfer. When I do this all the variables upload correctly into Stata. The problem is that I cannot do this for the entire files because of their size (the limits of Excel mean that only a small fraction of each file can be accommodated), so this is not a solution.
> >
> > I realise that it may be difficult for someone to suggest an explanation/solution without seeing the actual data, but I wonder whether there are any suggestions as to what the problem might potentially be, and how to get around it?
> >
> > Many thanks,
> > Gisella
> >
> >
> >
> >
> > *
> > * For searches and help try:
> > * http://www.stata.com/support/faqs/res/findit.html
> > * http://www.stata.com/support/statalist/faq
> > * http://www.ats.ucla.edu/stat/stata/