
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Importing subset of a pipe delimited textfile


From   Rob Shaw <[email protected]>
To   [email protected]
Subject   Re: st: Importing subset of a pipe delimited textfile
Date   Wed, 17 Oct 2012 13:10:51 +0100

Maarten

The problem is not the pipes as such (otherwise I could just use the
delimiter options in -insheet-); it's that the file is too large to
read with -insheet-.

So I need to use -infile- to import my file in separate parts, but
infile will only accept fixed format files (as far as I understand).
Therefore, if I import my file using:

infile str2 var1 _skip(1) str4 var2 _skip(1) str3 var3 _skip(1) str4
var4  using myfile in 1/1000000

I get nonsense because the first record then gets filled with [1|,
BCD|, 3|X, YZ]

Rob
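
For readers on Stata 13 or later (released after this thread), one
possible route is -import delimited-, which takes a delimiter and a row
range in the same call. This is a minimal sketch, not something proposed
in the thread. It assumes the file is called myfile.txt and has no header
row; the block size of one million lines and the dataset names part1 and
part2 are illustrative.

```stata
* Read lines 1 to 1,000,000 of a pipe-delimited file (needs Stata 13+)
import delimited using "myfile.txt", delimiters("|") varnames(nonames) ///
    rowrange(1:1000000) clear
save part1, replace

* Read the next block of one million lines
import delimited using "myfile.txt", delimiters("|") varnames(nonames) ///
    rowrange(1000001:2000000) clear
save part2, replace
```

Each block is saved as its own dataset, so only one block needs to be in
memory at a time.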

Maarten wrote:

To give a concrete example: I stored Rob's example dataset in foo.raw

I then typed in Stata:

filefilter foo.raw foo2.raw, from("|") to(\t) replace

insheet using foo2.raw

The first line replaced all pipes in the file foo.raw with a tab and
stored the resulting tab-delimited file in foo2.raw, and the second
line read this tab-delimited file foo2.raw into Stata.

Hope this helps,
Maarten
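
If the converted file is still too large for -insheet-, the same
-filefilter- step could be combined with free-format -infile- and the
-in- qualifier, along the lines Nick suggested. What follows is a sketch
under assumptions, not a tested solution for the 4Gb file: it assumes the
file is myfile.txt, that the four fields have the types shown in Rob's
sample (numeric, str4, numeric, str3), and that no string field contains
spaces or commas, since free-format -infile- treats both as separators.
The output names myfile_comma.txt, part1 and part2 are illustrative.

```stata
* Turn every pipe into a comma so that free-format -infile- can split fields
filefilter myfile.txt myfile_comma.txt, from("|") to(",") replace

* Read the first million records and save them
infile var1 str4 var2 var3 str3 var4 using myfile_comma.txt in 1/1000000, clear
save part1, replace

* Read the second million records and save them
infile var1 str4 var2 var3 str3 var4 using myfile_comma.txt in 1000001/2000000, clear
save part2, replace
```

Because free-format -infile- reads values in order rather than by column
position, the varying line lengths should not matter here.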

On Wed, Oct 17, 2012 at 1:37 PM, Nick Cox <[email protected]> wrote:
> Why is varying length of line a problem? So long as the same variables
> are represented on each line, I can see no problem.
>
> Also, -filefilter- has a tacit loop; you don't need to set it up for yourself.
>
> Nick
>
> On Wed, Oct 17, 2012 at 12:33 PM, Rob Shaw <[email protected]> wrote:
>> Nick
>>
>> Thanks. Yes that would work but the problem is the varying length of
>> each line. So I need to get filefilter or another command to do one
>> of:
>>
>> x=0
>> counter=1
>> with "myfile.txt" {
>>  y = position of 10000th EOL in `i'
>>  save `i' from position x to y in "myfilepos"+counter+".txt"
>>  x =y
>> }
>>
>> This would create files called myfilepos1, myfilepos2 etc each with
>> 10000 lines that I could then -insheet- with a delimiter(|) option.
>> But I don't know how to correctly specify the bit in the loop.
>>
>> OR
>>
>> for each line in "myfile.txt" {
>>  find | and replace with a number of spaces depending on position in row
>> }
>>
>> This would make each line the same length so I could use -infile-
>>
>> Is there a way to use -filefilter- to achieve this?
>>
>> File sample:
>>
>> 1|ABCD|23|XYZ
>> 10|BCED|1|YZX
>> 30|DCHS|234|YBH
>> ....
>>
>> Thanks
>> Rob
>>
>>
>>>I'd use -filefilter- to change the pipes to something that -infile- can handle.
>>
>>>(Strictly, -in- is a qualifier, not an option.)
>>
>>>Nick
>>
>>>On Wed, Oct 17, 2012 at 9:13 AM, Rob Shaw <[email protected]> wrote:
>>
>>> I have a very large (around 4Gb) text file that has been pipe
>>> delimited. It won't all fit in memory so I want to process it in
>>> parts.
>>>
>>> For fixed datasets I would use infile with the in 1/10000000 option
>>> then 10000001/20000000 etc. However, this dataset has been pipe
>>> delimited so I would need to use insheet, but insheet doesn't seem to
>>> permit the "in" option.
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/



-- 
---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany

