Re: st: data management - string function
I agree with everything that Sergiy wrote. A technical point: in
Howie's code, the "^" must be inside the quotes. Here's some code I
tried for fun.
-Steve
**************************CODE BEGINS**************************
clear
drop _all
input str40 name
"Mr John Smith"
"Mr. John Jones"
" Mr Donald Trump"
"Mrs. Felicia Mroz"
"Mrumph Caliph"
" Dr. Tom Lester "
"drummond katz"
"John Amro"
"Mr.Tim Donner"
"Mister D.D. Smith"
"Doctor Nicholas J. Cox"
"Ms. Virginia Wolfe"
end
gen namex = trim(lower(name))
#delim ;
gen str30 name_only = proper(regexr(namex,
    "(^mr(\.| |s |s\.))|(^dr(\.| ))|(^mister)|(^doctor) |(^ms(\.| ))",""));
#delim cr
list name name_only
***************************CODE ENDS***************************
On Dec 24, 2008, at 11:48 AM, Howard Lempel wrote:
Hi all,
Following from Sergiy's advice, I'd like to suggest that bw use
regular expressions to delete only those occurrences of Mr, Dr, etc.
that occur at the beginning of a name. This should save Dr. Mroz (or
someone with last name Mr) from being deleted. Someone with a
first name of MR will still be in trouble (you may want to
experiment with deleting titles only from names where var1 is at
least three words, which would save someone with first name
MR and no title in the data).
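The three-word guard suggested above might look something like the sketch below. It is only an illustration: the two sample names are made up, only the title "Mr" is handled, and it assumes a Stata version that provides regexr() and wordcount().

```stata
* Sketch only: var1/var2 follow the thread; the sample names are made up
clear
input str30 var1
"Mr Ko Jack"
"Mr Smith"
end
gen str30 var2 = var1
* Strip a leading "Mr" or "Mr." only when the name has at least three words
replace var2 = trim(regexr(var1, "^Mr(\.| )", "")) if wordcount(var1) >= 3
list var1 var2
```

The trade-off is visible in the second observation: a two-word name such as "Mr Smith" keeps its title, because the guard cannot tell a title from a first name there.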
I don't have time to write out the full code, but see the regular
expression FAQ here: http://www.stata.com/support/faqs/data/regex.html
Also look up -help regexm-
BW, the caret (^) tells Stata you are searching for characters at the
beginning of a string only, so you probably want something to the
effect of:
Gen var2 = regexr(var1,^("MR" | "MR." | "Mr" | . . .),)
Note: That code is untested, unfinished, and written by someone w/o
expertise on regular expressions (e.g. I'd need to look up exactly
how the "OR" operator and parentheses work).
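For what it is worth, a runnable version of that idea could be sketched as follows, with the caret inside the quoted pattern. The list of titles is an assumption chosen to cover the sample data in the original question, not a complete list, and the sketch assumes a Stata version that provides regexr().

```stata
* Sketch only: var1/var2 follow the thread; the title list is illustrative
clear
input str30 var1
"Mr Ko Jack"
"Mr. Ko Jack"
"Mrs Anne Jones"
"Ms. Abra Ham"
"Miss. Wonder Full"
"Jack Kroll"
"Felicia Mroz"
end
* The caret sits inside the quotes, so only a title at the start can match.
* Requiring a dot or a blank after the title protects names such as "Mroz".
gen str30 var2 = trim(regexr(var1, "^(Mrs|Miss|Mr|Ms)(\.| )", ""))
list var1 var2
```

Inside the parentheses "|" separates the alternatives, and "\." matches a literal dot rather than any character.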
Hope this helps.
Howie
-----Original Message-----
From: [email protected] [mailto:owner-
[email protected]] On Behalf Of Sergiy Radyakin
Sent: Wednesday, December 24, 2008 11:30 AM
To: [email protected]
Subject: Re: st: data management - string function
Hi,
I just hope that this program will not manage bank accounts,
otherwise someone like Dr Mroz
(http://www.unc.edu/~mroz/index_files/vita_mroz_2007_August%5B2%5D.pdf)
will lose all his savings. The program should be very careful about
replacing combinations of letters. When there is no guarantee
that "Mr." is always spelled with a dot (as in the original data
sample in the first email in this thread), spaces should be
incorporated, but even then there is no way you can be sure that Mr is
not a last name. E.g. the common Asian last name "Ng" (e.g.
http://www.drdavidng.com/contact_us.html) would fail many naive
validators (very short, no vowels). Perhaps in some languages "Mr" is
also a first name, last name or middle name.
Also the choice of titles should probably be wider, to allow e.g. for
Dr., Prof., Col., or any combination of these (they can occur
together, as in "The life and activities of Col. Prof. Dr.
Jezdimir STUDIC" here:
http://www.ncbi.nlm.nih.gov/pubmed/14447887?ordinalpos=1&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DiscoveryPanel.Pubmed_Discovery_RA&linkpos=1&log$=relatedarticles&logdbfrom=pubmed)
Some of the titles are listed here:
http://ecs.victoria.ac.nz/Groups/AI/TitleGeneratorTitles but more
extensive lists can be found on the internet.
Careless replacing of "Master", "Marquis" or "Baron" might leave some
of the people in your list without a last name.
The only way to be sure about the title is to ask for it separately
while collecting the data.
Best regards, Sergiy Radyakin
On Wed, Dec 24, 2008 at 3:37 AM, Ashim Kapoor
<[email protected]> wrote:
About your 2nd query.
Step 1 : gen gender = word(var1,1)
Then do
replace gender="F" if gender=="Mrs"
replace gender="F" if gender=="Ms"
replace gender="M" if gender=="Mr"
The trouble: what if you have Mr. (notice the dot) in place of Mr?
So we do
replace gender="F" if gender=="Mrs."
replace gender="F" if gender=="Ms."
replace gender="M" if gender=="Mr."
I think this should do it.
Merry Xmas to you.
Ashim.
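A compact variant of the same idea is sketched below. It is only an illustration: the variable name title and the sample rows are assumptions, and "Miss" is included because it appears in the original question. Removing the dot first means each title is tested once, and starting from an empty gender keeps untitled names from being left with their first word.

```stata
* Sketch only: title and the sample rows are illustrative
clear
input str30 var1
"Mrs A Jones"
"Mr. Ko Jack"
"Miss. Wonder Full"
"Jack Kroll"
""
end
* First word with any dot removed, so "Mr" and "Mr." compare equal
gen str10 title = subinstr(word(var1, 1), ".", "", .)
* Start from empty so untitled or missing names stay missing
gen str1 gender = ""
replace gender = "M" if title == "Mr"
replace gender = "F" if inlist(title, "Mrs", "Ms", "Miss")
list var1 title gender
```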
On Wed, Dec 24, 2008 at 2:03 PM, Ashim Kapoor
<[email protected]> wrote:
Hello!
I think you want to do this :--
gen j=var1
gen j2=subinstr(j,"Mrs","",1)
gen j3=subinstr(j2,"Mr","",1)
gen j4=subinstr(j3,"Ms","",1)
Note the order of j2 and j3: it matters because Mr is a
substring of Mrs. The result would be ruined if you did it the other way.
I hope you liked it.
Thank you,
Ashim.
On Wed, Dec 24, 2008 at 1:22 PM, b. water
<[email protected]> wrote:
dear all,
stata 8.2, windows xp,
i have a data management problem: have a variable (strings) of
names like these:
var1
Mrs A Jones
Mrs Anne Jones
Ms Abra Ham
Mr Ko Jack
Jack Kroll
No Probs
Ms. Abra Ham
Mr. Ko Jack
. <- denotes missing
.
.
Miss. Wonder Full
Mrs Bond Trader
i want to generate a new variable which removes the person's
title, so it appears like this:
var2
A Jones
Anne Jones
Abra Ham
Ko Jack
Jack Kroll
No Probs
Abra Ham
Ko Jack
. <- denotes missing
.
.
Wonder Full
Bond Trader
i tried (thinking that i would slowly truncate Mr, Mrs, Ms title
by title):
gen var2=var1
replace var2=subinstr("Mr","Mr","",.) <- just as well i
generate var2 as this command wiped out all the names!
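A note on why this wiped out the names: the first argument of subinstr() is the string being searched, so subinstr("Mr","Mr","",.) works on the literal "Mr" and returns an empty string for every observation. A minimal sketch of the probable intent, passing the variable instead (the two sample names are taken from the data above):

```stata
* Sketch only: shows the argument order of subinstr(s, old, new, n)
clear
input str30 var1
"Mr Ko Jack"
"Jack Kroll"
end
gen var2 = var1
* Search var2 itself, not the literal "Mr"; replace the first "Mr " only
replace var2 = trim(subinstr(var2, "Mr ", "", 1))
list var1 var2
```

Note that subinstr() matches anywhere in the string, not only at the start, so a name that contains "Mr " elsewhere would also be changed.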
i also want to generate another variable that will assign gender
based on the title of the name in var1, i.e. if Mr or Mr. then
M(ale) and if Mrs, Mrs., Ms, Ms., Miss, Miss. then F(emale). i
thought generate/replace or replace/if using string functions
would help but i think this requires a loop of some sort to achieve.
F
F
F
M
.
.
F
M
.
.
.
F
F
thank for advice/help.
season's greetings,
bw
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/