If bw, the original poster, acquires one of the title lists that
Sergiy referred to, a -regexr- statement with no more than one title,
plus its ending variants, on each line may be useful. There are two
kinds of titles: those written out ("Mister", "Doctor", "Colonel") and
their abbreviations ("Mr.", "Dr."), which may, in error, omit the
period. I wrote the following code, which puts each written-out title
on one line and its abbreviation on another. Note the alternation
operator "|" at the beginning and end of successive lines. These
have proved necessary on lines containing multiple abbreviations
(e.g. "^mr(\.| )") and are harmless on other lines, so I include
them on all. The code also attempts to cope with some scenarios
that bw may encounter: no title, successive spaces, spaces before
the title, and no space after the title. Note that in
American English, "Missy" is a first name. Be sure to zap gremlins
before using.
A small point: Many titles are gender-neutral, so an effort to
determine gender from title will produce many missing values.
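
To illustrate, here is a minimal sketch of title-based gender
assignment, assuming a string variable -name- like the one in the
example do-file. The variable name -gender- and the short title lists
are illustrative choices only, not a complete solution. Names with a
gender-neutral title, no title, or a look-alike first name such as
"Missy" are left missing.

```stata
* Sketch only: assign gender from a leading title; everything else stays missing
clear
input str40 name
"Mr John Smith"
"Mrs. Felicia Mroz"
"Dr. Tom Lester"
"Miss Sadie Thompson"
"Missy Columbine"
"John Amro"
end
* lower-case and trim so the patterns can anchor on "^"
gen namex = trim(lower(name))
* -gender- is a hypothetical variable name; "" means undetermined
gen str1 gender = ""
* the title must be followed by a period or a blank, so "missy" is not "miss"
replace gender = "M" if regexm(namex, "^(mr|mister)(\.| )")
replace gender = "F" if regexm(namex, "^(mrs|ms|miss|mistress)(\.| )")
list name gender
```

In this toy data only three of the six names receive a value.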
-Steve
**************************code begins**************************
** Do file to remove titles: Version 4
clear
input str40 name
"Mr John Smith"
"Mr. John Jones"
" Mr Donald Trump"
"Mrs. Felicia Mroz"
"Mrumph Caliph"
" Dr. Tom Lester "
"drummond katz"
"John Amro"
"Mr.Tim Donner"
"Mister D.D. Smith"
"Doctor Nicholas J. Cox"
"Ms. Virginia Wolfe"
"Ms Jane Austen"
"Missy Columbine"
"Miss Sadie Thompson"
end
* lower-case and trim so the patterns can anchor on the start of the string
gen namex = trim(lower(name))
#delim ;
gen str30 name_only = trim(proper(regexr(namex,
"^mister|
|^mr(\.| )|
|^mistress|
|^mrs(\.| )|
|^doctor|
|^dr(\.| )|
|^miss(\.| )|
|^ms(\.| )"
," ")));
#delim cr
list name name_only
***************************code ends***************************
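
Howard's suggestion, quoted further down, of stripping a title only
when the name has at least three words could be sketched as follows.
The helper variable -has3- and the shortened title list are
assumptions made for the example; wordcount() counts blank-separated
words.

```stata
* Sketch only: strip a leading title only from names of three or more words
clear
input str40 name
"Mr John Smith"
"Mr Ko"
"Dr. Tom Lester"
"Jack Kroll"
end
gen namex = trim(lower(name))
* -has3- (hypothetical helper) flags names with at least three words
gen byte has3 = wordcount(namex) >= 3
* default: keep the name as it is, apart from capitalization
gen str40 name_only = proper(namex)
* replace the title with a blank, then trim, only where the guard holds
replace name_only = trim(proper(regexr(namex, "^(mr|mrs|ms|dr)(\.| )", " "))) if has3
list name name_only
```

The price of the guard is that a titled two-word name keeps its
title, and "Mr.Tim Donner" counts as only two words because there is
no blank after the period.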
On Dec 24, 2008, at 2:53 PM, Steven Samuels wrote:
>
> I agree with everything that Sergiy wrote. A technical point: in
> Howie's code, the "^" must be inside the quotes. Here's some code
> I tried for fun.
>
> -Steve
>
> **************************CODE BEGINS**************************
> clear
> drop _all
> input str40 name
> "Mr.Tim Donner"
> "Mister D.D. Smith"
> "Doctor Nicholas J. Cox"
> "Ms. Virginia Wolfe"
> end
>
> gen namex = trim(lower(name))
> #delim ;
> gen str30 name_only = proper(regexr(namex,
>     "(^mr(\.| |s |s\.))|(^dr(\.| ))|(^mister)|(^doctor) |(^ms(\.| ))",""));
> #delim cr
> ***************************CODE ENDS***************************
>
>
>
> On Dec 24, 2008, at 11:48 AM, Howard Lempel wrote:
>>
>> Following from Sergiy's advice, I'd like to suggest that bw use
>> regular expressions to only delete occurrences of Mr, Dr, etc.
>> that occur at the beginning of a name. This should save Dr. Mroz
>> (or someone with last name Mr) from being deleted. Someone with a
>> first name of MR will still be in trouble (you may want to
>> experiment with finding a way to delete titles only from names
>> where var1 is at least three words, saving someone with first name
>> MR and no title in the data).
>>
>>
>> BW, the caret (^) tells Stata you are searching for characters at the
>> beginning of a string only, so you probably want something to the
>> effect of:
>>
>>
>>> -----Original Message-----
>>> From: the list server, on behalf of Sergiy Radyakin
>>> Sent: Wednesday, December 24, 2008 11:30 AM
>>> Subject: Re: st: data management - string function
>>>
>>> Hi,
>>>
>>> I just hope that this program will not manage banking accounts,
>>> otherwise someone like Dr Mroz
>>> (http://www.unc.edu/~mroz/index_files/vita_mroz_2007_August%5B2%5D.pdf)
>>> will lose all his savings. The program should be very careful about
>>> replacing the combinations of letters. When there is no guarantee
>>> that "Mr." is always spelled with a dot (as in the original data
>>> sample in the first email in this thread), spaces should be
>>> incorporated, but even then there is no way you can be sure that
>>> Mr is not a last name. E.g. the common Asian last name "Ng" (e.g.
>>> http://www.drdavidng.com/contact_us.html) would not pass many
>>> naive validators (very short, no vowels). Perhaps in some languages
>>> "Mr" is also a name, last name or a middle name.
>>>
>>> Also the choice of titles should probably be wider, to allow e.g.
>>> for Dr., Prof., Col., or any combination of these (which can occur
>>> in multiple combinations like "The life and activities of Col. Prof.
>>> Dr. Jezdimir STUDIC" here:
>>> http://www.ncbi.nlm.nih.gov/pubmed/14447887?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DiscoveryPanel.Pubmed_Discovery_RA&linkpos=1&log$=relatedarticles&logdbfrom=pubmed)
>>>
>>> Some of the titles are listed here:
>>> http://ecs.victoria.ac.nz/Groups/AI/TitleGeneratorTitles but more
>>> extensive lists can be found on the internet.
>>> Careless replacing of "Master", "Marquis" or "Baron" might leave
>>> some of the people in your list without a last name.
>>>
>>> The only way to be sure about the title is to ask for it separately
>>> while collecting the data.
>>>
>>> Best regards, Sergiy Radyakin
>>>
>>
>> On Wed, Dec 24, 2008 at 3:37 AM, Ashim Kapoor wrote:
>>
>>> About your 2nd query.
>>>
>>>>
>>>> Step 1 : gen gender = word(var1,1)
>>>>
>>>> Then do
>>>>
>>>> replace gender="F" if gender=="Mrs"
>>>> replace gender="F" if gender=="Mrs."
>>>>
>>>> On Wed, Dec 24, 2008 at 1:22 PM, b. water wrote:
>>>>> dear all,
>>>>>
>>>>> stata 8.2, windows xp,
>>>>>
>>>>> i have a data management problem: have a variable (strings) of
>>>>> names like these:
>>>>>
>>>>> var1
>>>>> Mrs A Jones
>>>>> Mrs Anne Jones
>>>>> Ms Abra Ham
>>>>> Mr Ko Jack
>>>>> Jack Kroll
>>>>> . <- denotes missing
>>>>> .
>>>>> .
>>>>> Miss. Wonder Full
>>>>> Mrs Bond Trader
>>>>>
>>>>> i want to generate new variable which removed the person's
>>>>> title, so it appear like these:
>>>>>
>>>>> var2
>>>>> A Jones
>>>>> Anne Jones
>>>>> Abra Ham
>>>>> Ko Jack
>>>>> Jack Kroll
>>>>> No Probs
>>>>> Abra Ham
>>>>> Ko Jack
>>>>> . <- denotes missing
>>>>> .
>>>>> .
>>>>> Wonder Full
>>>>> Bond Trader
>>>>>
>>>>> i want to also generate another variable that will assign
>>>>> gender based on the title of the name
>>>>>
>>>>>
Steven Samuels
845-246-0774
18 Cantine's Island
Saugerties, NY 12477
EFax: 208-498-7441