Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Use of matrix values in generate statements |
Date | Sun, 27 Mar 2011 00:15:46 +0000 |
Kit has given the most important answer: Mata is a much richer
language for handling non-standard problems.

I want to add a footnote. Here is a basic technique for using a lookup
matrix to populate a variable z. I mix algebra with Stata (I and J
stand for the numbers of rows and columns of the matrix). The idea is
that variables x and y tell you which row and column of the matrix to
use.

matrix lookup = ...
gen z = .
forval i = 1/I {
    forval j = 1/J {
        quietly replace z = lookup[`i', `j'] if x == `i' & y == `j'
    }
}

I am just using -forval- to turn Daniel's statements into a double
loop over possibilities.

Nick

On Sat, Mar 26, 2011 at 11:55 PM, Christopher Baum <kit.baum@bc.edu> wrote:
> <>
> Dan says
>
> I continue to work on a tax calculator for Stata.
>
> I am at the point of calculating the standard deduction for each taxpayer.
> There are 6 possible filing statuses and 24 years of tax law, so there are
> 144 possible values for the deduction. In SAS, fortran, PL/1, C or any
> other language I know of, the calculation would be some form of:
>
>     stded = stdvalues(year,filestat)
>
> and the processor would index into the 24x6 array of stdvalues to obtain
> the value for each taxpayer. As I understand it, Stata matrices can't be
> used in -generate- statements, though, so I can't do something like:
>
>     matrix input stdvalues (3700 6200...\3800 6350...\...
>     generate stded = stdvalues[year-1992,filestat]
>
> (Here and below, ... is meant to conceal a lot of typing on my part but
> 3700 is the deduction in 1993 for a single taxpayer, 6350 is the deduction
> in 1994 for a joint return, etc). The most straightforward way I can see
> to calculate the deduction in Stata would be:
>
>     generate stded = 3700 if year == 1993 & filestat == 1
>     replace stded = 6200 if year == 1993 & filestat == 2
>     ...
>
> and so forth, for 144 lines. I have millions of observations, and will
> make thousands of runs, so I am looking for a more efficient solution.
> My next thought is:
>
>     generate stded = (year==1993&filestat==1)*3700+(year==1993&filestat==2)*6200...
>
> which would be one very long line of code once all 144 terms were written
> out, and still quite a bit of wasted arithmetic. Still a third
> possibility would be -recode-:
>
>     gen filestatyear = year*10+filestat
>     recode filestatyear (19931 = 3700)(19932 = 6200)...
>
> but looking at the -recode- .ado file suggests that this is not an
> efficiency gain.
>
> I take it I am supposed to -sort- the data by year and filestat, and then
> -merge- onto a file of parameter values by year and filestat:
>
>     sort year filestat
>     merge m:1 year filestat using params
>
> where params is a dataset with the deduction amount for each year and
> filestat. This is a reasonable amount of code, (even including the code
> necessary to create params) but it is not space efficient and it strikes
> me as odd that a large dataset needs to be sorted, just to make some
> simple recodes. Is that right? Am I missing something?
>
> I note that the -egen- command -mtr- must address this same question, but
> it is not very fast - about 1,000 observations/minute on our hardware.
>
> Oddly enough, although one cannot index into a Stata matrix, it is
> possible to index into a series observation:
>
>     generate stded = stdvalues[filestatyear-199200]
>
> is very fast, but doesn't address the problem of filling stdvalues in a
> not too hackish manner (especially if there are fewer than 144 taxpayers
> in the dataset).
>
>
>
> The following code will do 1 million table lookups in 8 or 9 seconds on my laptop:
>
> ---------------------------------
> clear all
> // fake data for lookup table
> mata: sdlookup = 100*runiform(24,6) :+ 3200
>
> set obs 10
> input year fs
> 1994 1
> 1998 2
> 1999 1
> 2000 6
> 2000 5
> 2005 3
> 2004 4
> 1996 2
> 2008 5
> 2007 3
> expand 100000
> g byte yrind = year - 1992
> g stded = .
> set rmsg on
> mata
> st_view(yrfs=., ., ("yrind","fs"))
> st_view(stded=., ., "stded")
> for(i=1; i<=rows(stded); i++) {
>     stded[i] = sdlookup[yrfs[i,1], yrfs[i,2]]
> }
> end
> su stded
> ---------------------------------
>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
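
The -forval- double loop near the top of this message uses algebraic bounds I and J, so it will not run as typed. A minimal sketch of one way to make it concrete follows. The 2 x 3 matrix and the four observations are made-up illustration values, not tax parameters, and taking the loop bounds from rowsof() and colsof() is an assumption about what I and J are meant to be.

```stata
* Sketch only: toy lookup matrix and toy data, both hypothetical
clear
matrix lookup = (10, 20, 30 \ 40, 50, 60)

input x y
1 1
1 3
2 2
2 3
end

gen z = .

* Take the loop bounds from the matrix dimensions (assumed meaning of I and J)
local I = rowsof(lookup)
local J = colsof(lookup)

forval i = 1/`I' {
    forval j = 1/`J' {
        * Fill z for the observations that point at cell (i, j)
        quietly replace z = lookup[`i', `j'] if x == `i' & y == `j'
    }
}

list x y z
```

With these toy values z should come out as 10, 30, 50 and 60. The loop issues one -replace- per cell, so a 24 x 6 table means 144 passes over the data. That may or may not be acceptable with millions of observations, which is part of why the Mata route is suggested as the main answer.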