Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: Re: Increase speed of -replace-


From   "Michael Blasnik" <[email protected]>
To   <[email protected]>
Subject   st: Re: Increase speed of -replace-
Date   Thu, 08 May 2008 16:52:21 -0400

...

Although I am perhaps the biggest proponent of using -in- rather than -if- when performing many operations on subsets of large datasets, I would actually recommend reducing the dataset and using -merge- for this problem (if I understand it correctly).

Continuing with your example, if there are perhaps 200 different values of make in the 150,000 observation dataset, then I would

keep make
bysort make: keep if _n==1
gen newmake=make
replace newmake = "AMC" if strpos(make,"AMC")>0
replace newmake = "Buick" if strpos(make,"Buick")>0
replace newmake = "Cadillac" if strpos(make,"Cad.")>0
replace newmake = "Chevrolet" if strpos(make,"Chev.")>0
.....etc..
sort make
save make2newmake

use master
sort make
merge make using make2newmake


Now you have the new names in a dataset. This approach will be much faster because you are only replacing one copy of each value of make -- just 200 values rather than 150,000. The speed improvement from this approach should more than make up for the need to merge it back in.

If you do this repeatedly, you can just keep the make2newmake dataset and do the merge first. Then you can check for new categories of make (count if _m==1) and classify them and add them to that lookup file.

M Blasnik

----- Original Message ----- From: "Friedrich Huebler" <[email protected]>
To: <[email protected]>
Sent: Thursday, May 08, 2008 3:10 PM
Subject: st: Increase speed of -replace-



I am looking for a way to increase the speed of -replace-. I have a
long string variable consisting of several words that should be
reduced to a shorter string, depending on the text in each
observation. The problem can be reproduced with the auto data. Assume
that we want to replace the text in the variable "make" by a single
word. Assume further that the text we are looking for (e.g. "Chev.")
is not necessarily at the beginning of the string but that it can be
anywhere in the variable. My solution is shown below but it is slow
with more than 200 -replace- commands and about 150,000 observations.
Is there a faster solution?

sysuse auto
replace make = "AMC" if strpos(make,"AMC")>0
replace make = "Buick" if strpos(make,"Buick")>0
replace make = "Cadillac" if strpos(make,"Cad.")>0
replace make = "Chevrolet" if strpos(make,"Chev.")>0
replace make = "Dodge" if strpos(make,"Dodge")>0
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2025 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index