Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: repeat same commands over hundreds of files
From: tbrunell <[email protected]>
To: [email protected]
Subject: Re: st: repeat same commands over hundreds of files
Date: Tue, 2 Nov 2010 15:42:20 -0500
These are very helpful. To be more specific, my file structure has
/Users/tbrunell/MPG/ as the root,
and there are currently 50 subfolders, one for each state: AL, AR, ..., WY.
The file names look like
mpg_09_CTC1972_1972_EDCD11_10_JH22.csv
mpg_09_CTC1972_1974_EDCD11_10_JH22.csv
mpg_09_CTC1972_1976_EDCD11_10_JH22.csv
mpg_09_CTC1972_1978_EDCD11_10_JH22.csv
mpg_09_CTC1972_1980_EDCD11_10_JH22.csv
mpg_09_CTC1982_1982_EDCD11_10_JH22.csv
mpg_09_CTC1982_1984_EDCD11_10_JH22.csv
mpg_09_CTC1982_1986_EDCD11_10_JH22.csv
mpg_09_CTC1982_1988_EDCD11_10_JH22.csv
mpg_09_CTC1982_1990_EDCD11_10_JH22.csv
The things that change across states are:
the number after "mpg", which is a state number;
the 3 letters before the first year (CTC is CT Congress, TXS would be Texas Senate, etc.);
the first year, which is the redistricting regime, usually a year ending in 2;
and the second year, which is the election year.
I could do several things, like
1) not have 50 separate folders and just keep everything in one folder, or
2) rename all the input files.
Though I must admit I would prefer not to do either of those things.
Thanks for your help
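
A minimal sketch of how the layout described above might be processed without merging folders or renaming files, assuming the root path and naming scheme given in this message. The helper name parse_mpg_name, the local macro names, and the output file name are hypothetical, and the per-file analysis is left out. It relies on the -dir- and -subinstr- macro extended functions (see -help extended_fcn-).

```stata
* Hypothetical helper: split a name such as
* mpg_09_CTC1972_1974_EDCD11_10_JH22.csv into its parts.
capture program drop parse_mpg_name
program define parse_mpg_name, rclass
    args fname
    * turn underscores into spaces so -tokenize- can split the name
    local parts : subinstr local fname "_" " ", all
    tokenize `parts'
    return local statenum "`2'"                // state number, e.g. 09
    return local body = substr("`3'", 1, 3)    // 3-letter code, e.g. CTC
    return local regime = substr("`3'", 4, 4)  // redistricting regime year
    return local election "`4'"                // election year
end

* Assumed root folder, taken from the message above
local root "/Users/tbrunell/MPG"

* all state subfolders (AL, AR, ..., WY) under the root
local states : dir "`root'" dirs "*", respectcase

foreach st of local states {
    * all .csv files in this state's folder; each name includes ".csv"
    local files : dir "`root'/`st'" files "*.csv", respectcase

    foreach f of local files {
        * copy r() results into locals before another command replaces them
        parse_mpg_name "`f'"
        local body     "`r(body)'"
        local regime   "`r(regime)'"
        local election "`r(election)'"

        insheet using "`root'/`st'/`f'", clear
        * ... same gen/replace/collapse commands as in the single-file .do ...

        * hypothetical output name: one .dta per input file
        save "`root'/`st'/`body'`regime'_`election'", replace
    }
}
```

Saving one .dta per input file avoids later files overwriting earlier ones; the results could be combined afterwards with -append- if one file per decade is wanted.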
On Nov 2, 2010, at 3:31 PM, Eric Booth wrote:
>
> One other note: if your files are sequentially numbered but there are gaps (as there are in my example filenames), you might want to add a -confirm- statement that checks whether each file exists and skips it if it doesn't. So, modifying my previous example, you'd want something like this:
>
> *********!
> forval n = 1972/1981 {
>
> cap confirm file "/Users/tbrunell/MPG/CT/mpg_09_CTC`n'_`n'_EDCD11_10_JH22.csv"
> if !_rc {
>
> clear
> insheet using "/Users/tbrunell/MPG/CT/mpg_09_CTC`n'_`n'_EDCD11_10_JH22.csv"
> drop in L /*this drops file notation at the bottom*/
> compress
> gen demper=dem/(dem+rep)
> gen demwin=.
> replace demwin=1 if demper>.5 & demper~=.
> replace demwin=0 if demper<.5
> sort rkey
> gen overalldemper=overalldem/(overalldem+overallrep)
> collapse (count) numberofseats=demper (sum) demwin (mean) year demper overalldemper (p50) median=demper,by(rkey)
> gen percentdemdist=demwin/numberofseats
>
>
> **create a macro for the decade**
> local save
> if inrange(`n', 1970, 1979) local save 1970
> if inrange(`n', 1980, 1989) local save 1980
>
>
> save "/Users/tbrunell//MPG/CT/CTC`save's", replace
>
> }
>
> else {
> di "file for `n' doesnt exist!"
> }
> }
> ************!
>
> - Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> [email protected]
>
> On Nov 2, 2010, at 3:22 PM, Eric Booth wrote:
>
>>
>> Hi Tom:
>>
>> The best approach probably depends on how your file names are sequenced and how your folders/files are organized, but programs like -fs- (from SSC) and others are useful for this type of work. Here are two approaches:
>>
>>
>> assuming you've got files named sequentially like this:
>>
>> mpg_09_CTC1972_1972_EDCD11_10_JH22
>> mpg_09_CTC1973_1973_EDCD11_10_JH22
>> mpg_09_CTC1974_1974_EDCD11_10_JH22
>> mpg_09_CTC1975_1975_EDCD11_10_JH22
>> mpg_09_CTC1981_1981_EDCD11_10_JH22
>> mpg_09_CTC1982_1982_EDCD11_10_JH22
>>
>>
>>
>> You could use a -forvalues- loop like:
>>
>> *********!
>> forval n = 1972/1981 {
>>
>> cap confirm file "/Users/tbrunell/MPG/CT/mpg_09_CTC`n'_`n'_EDCD11_10_JH22.csv"
>> if !_rc {
>> clear
>> insheet using "/Users/tbrunell/MPG/CT/mpg_09_CTC`n'_`n'_EDCD11_10_JH22.csv"
>> drop in L /*this drops file notation at the bottom*/
>> compress
>> gen demper=dem/(dem+rep)
>> gen demwin=.
>> replace demwin=1 if demper>.5 & demper~=.
>> replace demwin=0 if demper<.5
>> sort rkey
>> gen overalldemper=overalldem/(overalldem+overallrep)
>> collapse (count) numberofseats=demper (sum) demwin (mean) year demper overalldemper (p50) median=demper,by(rkey)
>> gen percentdemdist=demwin/numberofseats
>>
>>
>> **create a macro for the decade**
>> local save
>> if inrange(`n', 1970, 1979) local save 1970
>> if inrange(`n', 1980, 1989) local save 1980
>>
>>
>> save "/Users/tbrunell//MPG/CT/CTC`save's", replace
>>
>> }
>>
>> else {
>> di "file for `n' doesnt exist!"
>> }
>> }
>> ************!
>>
>> Note the use of the local macros to create the decade for the -save- filename.
>>
>>
>>
>> Another approach is to find all the .csv files in your folder using the macro extended functions (see -help extended_fcn-) and run the code on each of them. The same idea can be extended to find all the folders of interest and all the .csv files within them. For example:
>>
>> *************!
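>> global files : dir "<folder path>" files "*.csv", respectcase
>> tokenize `"$files"'
>> di in yellow `"$files"'
>>
>> while "`1'" != "" {
>> clear
>> * `1' already ends in .csv, so the extension is not appended again
>> insheet using "/Users/tbrunell/MPG/CT/`1'"
>> <snip>
>> * strip the .csv extension before building the .dta name
>> local stem = subinstr("`1'", ".csv", "", .)
>> save "/Users/tbrunell/MPG/CT/`stem'.dta", replace
>>
>> macro shift
>> }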
>> ***************!
>>
>>
>>
>> - Eric
>> __
>> Eric A. Booth
>> Public Policy Research Institute
>> Texas A&M University
>> [email protected]
>>
>>
>> P.S. Say "Hi" to Dave Smith for me if he's still around there.
>>
>>
>>
>>
>> On Nov 2, 2010, at 2:57 PM, tbrunell wrote:
>>
>>> I am doing some simple analysis on election data that spans all the states and several decades.
>>> So I have hundreds of files that I want to do the same relatively simple analysis on (I have an example below).
>>> At first I started writing .do files for each state/year and the only things I changed were the
>>> 1) file name for the insheet command
>>> 2) the name and location of the collapsed file at the end.
>>>
>>> However, when I wanted to add an additional command, this meant opening hundreds of separate .do files, making a change, and resaving each file. It is not the end of the world, but I would prefer to set up the commands once and then, somehow, tell Stata to run them separately for each specified file and save the resulting file under some new name.
>>>
>>> The techs at Stata recommended using macros for file names and the foreach command. But that doesn't solve my filename and output file problem.
>>>
>>> Any recommendations would be much appreciated.
>>>
>>> Tom Brunell
>>> Professor of Political Science
>>> University of Texas at Dallas
>>>
>>> _____________________________
>>> clear
>>> insheet using "/Users/tbrunell/MPG/CT/mpg_09_CTC1972_1972_EDCD11_10_JH22.csv"
>>> drop in L /*this drops file notation at the bottom*/
>>> compress
>>>
>>> gen demper=dem/(dem+rep)
>>> gen demwin=.
>>> replace demwin=1 if demper>.5 & demper~=.
>>> replace demwin=0 if demper<.5
>>> sort rkey
>>> gen overalldemper=overalldem/(overalldem+overallrep)
>>>
>>> *here overalldemper will be total votes percentage, demper is "normalized" vote - averaged across districts
>>> collapse (count) numberofseats=demper (sum) demwin (mean) year demper overalldemper (p50) median=demper,by(rkey)
>>> gen percentdemdist=demwin/numberofseats
>>>
>>> save "/Users/tbrunell//MPG/CT/CTC1970s", replace
>
>
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/