Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Extracting Data
Date: Sun, 24 Nov 2013 10:35:19 +0000

Sure, but I note that neither Steve nor I proposed searching for "Ba"
or "Ma".

In practice, if this were my problem I would start by -split-ting on
commas, then find empirically which spurious short elements were
identified, and then modify a script to catch those first.

Nick
njcoxstata@gmail.com

On 24 November 2013 10:25, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
> Steve, Nick, almost perfect!
> I hate to spoil the fun, but here is a test case which it doesn't handle:
>
> "[Testisland] Test (Peter Ma, PhD, John Ba, PhD) El School"
> results in:
> 5. | Peter Ma PhD |
>
> (note that Ma and Ba are common last names, besides being
> MA=Master/Magister, and BA=Bachelor)
>
> Interestingly, changing John's degree to MS causes correct parsing:
> 5. | Peter Ma, PhD John Ba, MS |
>
> After some investigation, I can be relatively confident that the
> problem occurs every time the second person has the same suffix
> (title) as the first one, which could be quite common (in case of
> titles).
>
> Here is another case with a different error:
> "[Testisland] Test (Peter Ba, Sr. BA, John Ba, Jr. BA) El School"
> which results in one person:
> 5. | Peter Ba, Sr. BA, John Ba, Jr. BA |
>
> The above cases seem to be fixable, but imho the most difficult part
> of the assignment (the one without which I don't think one can search
> for a solution) is that I don't see a rule for deciding whether:
> "John Smith, Ba Ma"
> is one person with two degrees, or two persons with no (or
> unspecified) degrees. If you think you know the answer, meet Ba Ma:
> http://in.linkedin.com/pub/ba-ma/38/4b2/838
>
> Perhaps you could enforce case-sensitivity and check for caps in
> degrees? Or rely on dots (as recommended here
> http://www.slc.edu/style-guide/), or at least know how many people are
> in the list?
>
> Don't forget to extend the list of abbreviations with Prof., Hon.,
> Rev., Rep., Sen., Gen., Capt., Sgt., Pvt., etc., etc., etc.
> And don't forget that some of them can be (part of) perfectly real
> names too! e.g. John Capt:
> http://www.linkedin.com/pub/john-capt/5/529/68b
> or Peter Sen:
> http://www.linkedin.com/in/petesen
> or for that matter Amartya Sen:
> en.wikipedia.org/wiki/Amartya_Sen
> I've found numerous people with last names Hon Rev Rep Sen Gen Capt,
> including multiple combinations in one person such as, e.g.:
> http://cn.linkedin.com/pub/ma-sen/16/516/255
> http://ca.linkedin.com/pub/ed-ma/1/386/784
>
> Compiling a comprehensive list of only degrees is a serious task on
> its own. Start with AA AS AAS BA BS MA MS PhD EdD MD DDS DSc LL.D.
> BFA, BA/MA, BMus, DPT, MFA, MPH, MPT, MS, MSEd, MSW,
> And wait till you get to Dr.sc.math, Dr.sc.agr and various foreign degrees...
>
> School names can be fun too! You've guessed it: Yes! they can include
> parentheses in the official school name:
> http://profiles.dcps.dc.gov/Fillmore+Arts+Center+%28East%29
> But that's a whole other story...
>
> Is there anything in the data generating software that could eliminate
> the need for parsing, or is it raw user-entered data? Avoid parsing if
> possible; this is usually the safest way to go. Otherwise, check the
> results very, very carefully after the program completes.
>
> Best,
> Sergiy Radyakin
>
> On Sat, Nov 23, 2013 at 3:19 PM, Steve Samuels <sjsamuels@gmail.com> wrote:
>> I hadn't read Nick's post when I wrote mine. Both ideas follow the same
>> logic. I omitted the comma, but Nick's suggestion of creating a
>> temporary placeholder is superior. Here's a revised version which
>> incorporates Nick's idea. It also adds abbreviations with periods (".")
>> and retains these where present.
>>
>> Steve
>>
>> *********************Code Begins**************************
>> clear
>> input str244 s
>> "[Meadowfield] Park Sq (Susan Sims) Middle School"
>> "[Somerset] Upton & Pride School (Judith Taper, MA PhD) El School"
>> "[Temperly] Lakewood (Jason Stevenson Jr., B.A., Jill Harris, BA ) K-12"
>> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
>> end
>>
>> local suffix "Jr. Sr. BA B.A. B.S. BS MS M.Sc. Ph.D. PhD"
>> gen p = regexs(1) if regexm(s,"\((.+)\)") /*Robert's code */
>>
>> /* Now remove a single comma preceding the suffixes */
>> foreach x of local suffix {
>>     replace p =regexs(1)+"_"+regexs(3)+regexs(4)+regexs(5) ///
>>         if regexm(p,"(.*)(,)(.*)(`x')(.*)")
>> }
>>
>> split p, p(",")
>> foreach x of local suffix {
>>     replace p1 = subinstr(p1,"_",",",.)
>>     replace p2 = subinstr(p2,"_",",",.)
>> }
>> list p1 p2
>> *******************Code Ends******************************
>>
>> On Nov 22, 2013, at 4:24 AM, Nick Cox wrote:
>>
>> This sounds like a two-stage process. For example, you might use
>> -split- to split a variable containing the one or two names. ", Jr."
>> needs special treatment. I'd edit ", Jr" to "_Jr." and then edit back.
>>
>> For "Principle" read "Principal" throughout.
>> Nick
>> njcoxstata@gmail.com
>>
>> On 22 November 2013 04:58, Becker Stein <becker.stein@aol.com> wrote:
>>> Hi,
>>>
>>> I asked this question yesterday. I needed help creating a regex to
>>> extract data from a single string variable. Robert's solution was
>>> really helpful. I was able to generate the School District,
>>> School Name, and School Type variables. However, I run into problems
>>> trying to create the Principle and Assist. Principles variables.
>>> The gen principal =regexs(1) if regexm(s,"\((.+)\)") returns all of the
>>> contents in the parentheses, but I need the contents before the comma
>>> to generate the principle name variable and the contents after the
>>> comma to generate the assist. principle name (if any).
>>> It gets a little complicated
>>> because sometimes the names themselves have commas in them (as in the
>>> case of Robert Williams, Jr.). I've pasted some sample data below.
>>>
>>> [School District] School Name (Principle, Asst. Principal) School Type
>>>
>>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>>> [Somerset] Upton & Pride Day School (Judith Taper) Elementary School
>>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>>>
>>> Thanks,
>>> Becker
>>>
>>> -----Original Message-----
>>> From: Robert Picard <picard@netbox.com>
>>> To: statalist <statalist@hsphsun2.harvard.edu>
>>> Sent: Wed, Nov 20, 2013 10:51 pm
>>> Subject: Re: st: Extracting Data
>>>
>>> Becker
>>>
>>> Here's one way to parse each variable using regex functions.
>>>
>>> Robert
>>>
>>> * ---------------- begin example -------------------------
>>> clear
>>> input str244 s
>>> "[Meadowfield] Park Sq (Susan Sims) Middle School"
>>> "[Somerset] Upton & Pride School (Judith Taper) El School"
>>> "[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
>>> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
>>> end
>>>
>>> gen district = regexs(1) if regexm(s,"\[(.+)\]")
>>> gen sname = regexs(1) if regexm(s,"\](.+)\(")
>>> gen principal = regexs(1) if regexm(s,"\((.+)\)")
>>> gen stype = regexs(1) if regexm(s,"\)(.+)")
>>>
>>> list district sname principal stype
>>> * ---------------- end example ---------------------------
>>>
>>> On Wed, Nov 20, 2013 at 4:38 PM, Becker Stein <becker.stein@aol.com>
>>> wrote:
>>>>
>>>> -----Original Message-----
>>>> From: Becker Stein <becker.stein@aol.com>
>>>> To: statalist <statalist@hsphsun2.harvard.edu.>
>>>> Sent: Wed, Nov 20, 2013 9:23 pm
>>>> Subject: Help Extracting Data
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to extract data from a single string variable, and I was
>>>> wondering how to create a regular expression that I can
>>>> use to do so. I've tried to create one just to extract the school
>>>> name, but to no avail. My data is set up as: [school district] name of
>>>> school (name of principle, name of assistant principle (*if any))
>>>> school type. Below are some examples.
>>>>
>>>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>>>> [Somerset] Upton & Pride Day School (Judith Taper) Elementary School
>>>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>>>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>>>>
>>>> I would like to extract the school name, principle name and asst.
>>>> principle name as separate variables. Sometimes the names have special
>>>> characters such as an "&" (as in the case of Upton & Pride) or a ".",
>>>> and the administrators section may have only 1 name or 2 names
>>>> (separated by a comma). Also, some of the data in the brackets and
>>>> parentheses have extra spaces.
>>>> I initially used the itrim function on
>>>> the variable, and it removed the extra spaces for the content outside
>>>> of the brackets and parentheses (i.e., school name and school type),
>>>> but it didn't work for content inside of them (school district and
>>>> principal names).
>>>> Thanks in advance for any/all help.
>>>>
>>>> Best,
>>>> Becker
>>>>
>>>> *
>>>> * For searches and help try:
>>>> * http://www.stata.com/help.cgi?search
>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> * http://www.ats.ucla.edu/stat/stata/
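
Nick's suggestion at the top of the thread (-split- on commas first, then look empirically at the short pieces that come out) might be sketched as follows. This is only an illustration, not code posted by anyone in the thread: the variable names p, part, id, k and len, the four sample strings, and the cutoff of 8 characters for "short" are all assumptions made here.

```stata
* A sketch only: sample strings, variable names, and the length cutoff
* are assumptions chosen for illustration.
clear
input str80 p
"Susan Sims"
"Judith Taper, MA PhD"
"Jason Stevenson Jr., B.A., Jill Harris, BA"
"Bob Williams, Jr."
end

* Stage 1: split on commas into part1, part2, ...
split p, parse(",") generate(part)

* Stack all pieces into one variable so they can be tabulated together
keep part*
gen long id = _n
reshape long part, i(id) j(k)
replace part = trim(part)
drop if part == ""

* Stage 2: inspect the short pieces; suffixes and degrees should show up here
gen len = length(part)
tabulate part if len <= 8, sort
```

The tabulation is meant to be read by eye. As Sergiy's examples show, a short piece can just as well be a real name (Ed Ma, say), so whatever list of suffixes comes out of it still needs checking against the data before it goes into a script.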