Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: extracting substrings from string, with irregular patterns
From: Nick Cox <[email protected]>
To: [email protected]
Date: Thu, 16 Aug 2012 19:40:23 +0100
Here is a sketch of an approach (look, no regex). No code has been
tested by computer or anybody reading it.
The -city- comes after the last comma so reverse the string to make it easier
gen city = reverse(station)
replace city = substr(city, 1, strpos(city, ","))
replace city = reverse(city)
Now blank out -city-
replace station = subinstr(station, city, "", .)
Now zap the initial comma in -city-
replace city = substr(city, 2, .)
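The city steps above can be tried on the five example lines from the question. This is only a sketch: the -str60- width, the -assert- guard for observations without a comma, and the -trim()- are my additions, not part of the original suggestion.

```stata
clear
input str60 station
"COPEC AV. 11 DE SEPTIEMBRE 000,Tocopilla"
"PETROBRAS Av. Antonio Rendic 6850,Antofagasta"
"TERPEL Basilio Urrutia esq. Janequeo 312,Lautaro"
"Sin Bandera carrera 348,Lautaro"
"Sin Bandera Isabel Riquielme 403,Villarrica"
end

* assumption: every observation contains at least one comma
assert strpos(station, ",") > 0

* reverse, keep everything up to and including the first comma, reverse back
gen city = reverse(station)
replace city = substr(city, 1, strpos(city, ","))
replace city = reverse(city)

* remove ",city" from the main variable, then drop the leading comma
replace station = subinstr(station, city, "", .)
replace city = trim(substr(city, 2, .))

list station city, noobs
```

If the -assert- fails, some observations have no comma and need to be looked at separately before going on.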
Now let's try the name.
gen name = "Petrobras" if substr(lower(station), 1, 9) == "petrobras"
replace name = "Copec" if substr(lower(station), 1, 5) == "copec"
You are going to need to add similar statements for the other names (TERPEL, Sin Bandera and so on).
Once you have non-empty -name- on all observations, you can remove it
from your main variable to leave the address as the residue.
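One way to avoid writing a separate statement for each name is to loop over a list of names. This is a sketch under assumptions: the list holds only the four names given in the question, the names are stored as spelled in the data, and -station- has already had the city part removed. The variable -address- and the local macros are names I made up for illustration.

```stata
clear
input str60 station
"COPEC AV. 11 DE SEPTIEMBRE 000"
"PETROBRAS Av. Antonio Rendic 6850"
"TERPEL Basilio Urrutia esq. Janequeo 312"
"Sin Bandera carrera 348"
"Sin Bandera Isabel Riquielme 403"
end

gen name = ""

* assumption: extend this list until it covers every name in the data
foreach n in "COPEC" "PETROBRAS" "TERPEL" "Sin Bandera" {
    local len = length("`n'")
    * case-insensitive match on the first `len' characters
    replace name = "`n'" if name == "" & lower(substr(station, 1, `len')) == lower("`n'")
}

* stop here if any observation is still unmatched
assert name != ""

* whatever follows the name is the address
gen address = trim(substr(station, length(name) + 1, .))

list name address, noobs
```

The first match wins, so if one name is the start of another, put the longer one first in the list.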
Nick
On Thu, Aug 16, 2012 at 7:27 PM, Fernando Luco
<[email protected]> wrote:
> I have a dataset with one variable that contains the name of a gas
> station, the address and the city in which the station is located. I
> would like to separate all these in three different variables, name,
> address and city. I have tried to use the regexs machinery but I
> haven't been successful. The data looks as follows:
>
> COPEC AV. 11 DE SEPTIEMBRE 000,Tocopilla
> PETROBRAS Av. Antonio Rendic 6850,Antofagasta
> TERPEL Basilio Urrutia esq. Janequeo 312,Lautaro
> Sin Bandera carrera 348,Lautaro
> Sin Bandera Isabel Riquielme 403,Villarrica
>
> In the example the names are COPEC, PETROBRAS, TERPEL and Sin Bandera,
> so there is a mixture of only uppercase and lowercase letters. The
> addresses are written as: AV. 11 DE SEPTIEMBRE 000, Av. Antonio Rendic
> 6850, Basilio Urrutia esq Janequeo 312, carrera 348 and Isabel
> Riquielme 403. Finally, the city is what follows the comma, so
> Tocopilla, Antofagasta, Lautaro and Villarrica.
>
> What I would like to do, even if it requires several steps, is to have
> the name, address and city each as a different variable. I have tried
> to separate everything by sub strings by spaces but it didn't work. I
> also tried first recovering names in uppercase letters but it also
> didn't work.
>
> Finally, I have 1,600 stations so I would like to avoid doing this one
> by one. Any suggestions?