Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: Import csv file


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: st: Import csv file
Date   Tue, 29 Aug 2006 17:26:50 +0100

If you give examples, people may take you at your word!

If the basic idea is sound, that should not be fatal. 

I tried different "data"

Name,Foo and Bar LLP,Meganational Monsters,Stata is Smart
Code,0000(MV),0001(MV),0002(MV)
Currency,GBP,GBP,USD
31/12/1986,1,2,3
30/01/1987,4,5,6
30/12/2005,7,8,9

and tweaked the code 

insheet using soares.csv, clear
gen long id = _n
rename v1 foo 
reshape long v , i(id)
gen Company = v if foo == "Name" 
gen Code = v if foo == "Code"
gen Currency = v if foo == "Currency"
bysort _j (Company) : replace Company = Company[_N]
bysort _j (Code) : replace Code = Code[_N]
bysort _j (Currency) : replace Currency = Currency[_N]
drop if inlist(foo, "Name", "Code", "Currency")
rename v Value 
gen Date = date(foo, "dmy") 
format Date %dd/n/CY 
drop foo id _j 
sort Company Date
destring Value, replace 

    +------------------------------------------------------------------+
     | Value                 Company       Code   Currency         Date |
     |------------------------------------------------------------------|
  1. |     1         Foo and Bar LLP   0000(MV)        GBP   31/12/1986 |
  2. |     4         Foo and Bar LLP   0000(MV)        GBP    30/1/1987 |
  3. |     7         Foo and Bar LLP   0000(MV)        GBP   30/12/2005 |
  4. |     2   Meganational Monsters   0001(MV)        GBP   31/12/1986 |
  5. |     5   Meganational Monsters   0001(MV)        GBP    30/1/1987 |
     |------------------------------------------------------------------|
  6. |     8   Meganational Monsters   0001(MV)        GBP   30/12/2005 |
  7. |     3          Stata is Smart   0002(MV)        USD   31/12/1986 |
  8. |     6          Stata is Smart   0002(MV)        USD    30/1/1987 |
  9. |     9          Stata is Smart   0002(MV)        USD   30/12/2005 |
     +------------------------------------------------------------------+

This corrects a small bug in my previous code. -destring Value- needed
the -, replace-. 

On the face of it, there might be a shorter solution involving 
two -reshape-s. 

Nick 
[email protected] 

Nuno Soares
 
> Thanks for your insight. However, I've ran into a problem 
> when running your code. The code runs fine for your example, 
> but when we replace the company names by actual names, we 
> have problems with spaces in the companies' names or names 
> that have characters that aren't allowed as var names by 
> Stata (ex. I have companies called "@UK",  "Marks & Spencer", 
> "Azure Holdings", etc.). These names can't be used your first 
> rename procedure... That's a pitty because your code is simpler...
 
Nick Cox

> This solution seems a bit more complicated than necessary. 
> 
> I started with this soares.csv
> 
> Name,Company1,Company2,Company3
> Code,0000(MV),0001(MV),0002(MV)
> Currency,GBP,GBP,USD
> 31/12/1986,1,2,3
> 30/01/1987,4,5,6
> 30/12/2005,7,8,9
> 
> and then ran this 
> 
> insheet using soares.csv, clear
> foreach v of var * {
> 	rename `v' `=`v'[1]'
> }
> drop in 1
> gen long id = _n
> reshape long Company , i(id)
> gen Code = Company if  Name == "Code"
> gen Currency = Company if Name == "Currency"
> bysort _j (Code) : replace Code = Code[_N] 
> bysort _j (Currency) : replace Currency = Currency[_N] 
> drop if inlist(Name, "Code", "Currency") 
> drop id rename Company Value 
> rename _j Company 
> gen Date = date(Name, "dmy") 
> format Date %dd/n/CY 
> drop Name 
> destring Value 
> sort Company Date
> 
> The results are
> 
> . l
> 
>      +----------------------------------------------------+
>      | Company   Value       Code   Currency         Date |
>      |----------------------------------------------------|
>   1. |       1       1   0000(MV)        GBP   31/12/1986 |
>   2. |       1       4   0000(MV)        GBP    30/1/1987 |
>   3. |       1       7   0000(MV)        GBP   30/12/2005 |
>   4. |       2       2   0001(MV)        GBP   31/12/1986 |
>   5. |       2       5   0001(MV)        GBP    30/1/1987 |
>      |----------------------------------------------------|
>   6. |       2       8   0001(MV)        GBP   30/12/2005 |
>   7. |       3       3   0002(MV)        USD   31/12/1986 |
>   8. |       3       6   0002(MV)        USD    30/1/1987 |
>   9. |       3       9   0002(MV)        USD   30/12/2005 |
>      +----------------------------------------------------+
> 
> What are the key differences? 
> 
> 1. No need to resort to a user-written program -sxpose-, no 
> matter how much you trust it. 
> 
> 2. Date manipulations are much simpler with -date()-. 
> You _don't_ have to -destring- the date. Also dropping the 
> "/" is dangerous! 
 
Nuno Soares
  
> > Thanks a lot Radu! Just had to destring the date_ variables. 
> > Here goes the code. Might be useful to other statalisters:
> > 
> > ***************** BEGIN CODE ***********************
> > 
> > clear
> > *reading in the data
> > insheet using "C:\mv1.csv", clear
> > 
> > *transposing
> > sxpose, clear
> > 
> > *renaming the first three variables
> > foreach v of varlist _var1 _var2 _var3 { local i = `v'[1] 
> rename `v' 
> > `i'
> > }
> > 
> > *getting rid of "/" in the dates
> > *note that after renaming, only the date variables start with "_var"
> > destring _var*, replace ignore("/") force
> > 
> > *now renaming the remaining variables
> > foreach v of varlist _var* {
> > local i = `v'[1]
> > rename `v' data`i'
> > }
> > 
> > *dropping the first observation (because this is just the 
> name of the 
> > variable) drop in 1
> > 
> > *reshaping
> > reshape long data, i(Name) j(date) string
> > 
> > 
> > *now you're left with the date in a string format, which you can 
> > convert to a proper date, in a couple of steps
> > 
> > gen str date_year = substr(date, -4,.) gen str date_month = 
> > substr(date, 3, 2) gen str date_day = substr(date, 1, 2) destring  
> > date_year date_month date_day, replace
> > 
> > gen truedate = mdy(date_month, date_day, date_year) format 
> truedate %d
> > 
> > ***************** END CODE ***********************
> 
> Radu Ban
>  
> > Here's a try. For this you would need to install the 
> -sxpose- command 
> > (do a -findit sxpose- and then follow the link to install). This is 
> > needed because the built-in -xpose- command doesn't work 
> with strings.
> > 
> > ******************************
> > 
> > *reading in the data
> > insheet using "C:\somedir\myfile.csv", clear
> > 
> > *transposing
> > sxpose, clear
> > 
> > *renaming the first three variables
> > foreach v of varlist _var1 _var2 _var3 { local i = `v'[1] 
> rename `v' 
> > `i'
> > }
> > 
> > *getting rid of "/" in the dates
> > *note that after renaming, only the date variables start with "_var"
> > destring _var*, replace ignore("/") force
> > 
> > *now renaming the remaining variables
> > foreach v of varlist _var* {
> > local i = `v'[1]
> > rename `v' data`i'
> > }
> > 
> > *dropping the first observation (because this is just the 
> name of the 
> > variable) drop in 1
> > 
> > *reshaping
> > reshape long data, i(Name) j(date) string
> > 
> > 
> > *now you're left with the date in a string format, which you can 
> > convert to a proper *date, in a couple of steps
> > 
> > gen str date_year = substr(date, -4,.) gen str date_month = 
> > substr(date, 3, 2) gen str date_day = substr(date, 1, 2)
> > 
> > gen truedate = mdy(date_month, date_day, date_year)
> > 
> > *now you can format the date using the preferred date format, for 
> > example format truedate %d
> > ********************
> > 
> > I hope this helps. Also if you want to run this for several .csv 
> > files, check out the archives for last week, there's a 
> thread on how 
> > to loop over all files in a directory.
>  
> 
> Nuno Soares 
> 
> > > I need to import some csv files that have an awkward
> > format. The format is always the same:
> > >
> > > Name Company1 Company2 Company3 ...
> > > Code 0000(MV) 0001(MV) 0002(MV) ...
> > > Currency £ £ € ...
> > > 31/12/1986 data1 data2 data3 ...
> > > 30/01/1987 dataX dataY dataZ ...
> > > ...
> > > 30/12/2005 dataW dataJ dataK ...
> > >
> > > The number of companies is variable (ie, columns), and its
> > width is also variable. Data1, data2, etc. are numeric.
> > >
> > > Can you please tell me if it is possible to import to Stata
> > the data described, in the following format:
> > >
> > > Company  Code   Currency  Date  Data
> > > Company1 0000(MV) £   31/12/1986 data1
> > > Company1 0000(MV) £   30/01/1987 dataX
> > > ...
> > > Company1 0000(MV) £   30/12/2005 dataW
> > > Company2 0001(MV) £   31/12/1986 data2
> > > Company2 0001(MV) £   30/01/1987 dataY
> > > ...
> > > Company2 0001(MV) £   30/12/2005 dataJ
> > > Company3 0002(MV) £   31/12/1986 data3
> > > Company3 0002(MV) £   30/01/1987 dataZ
> > > ...
> > > Company3 0002(MV) £   30/12/2005 dataK
> > > and so on.
> > >
> > > Some treatment on the original data can be done in Excel,
> > but I wanted to avoid it (I have almost one hundred files 
> that I need 
> > to import...). If any one has an idea of how to do it, I would be 
> > greatly appreciated!

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index