Re: st: Extracting Data

From	Nick Cox <njcoxstata@gmail.com>
To	"statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject	Re: st: Extracting Data
Date	Sun, 24 Nov 2013 10:35:19 +0000
Sure, but I note that neither Steve nor I proposed searching for "Ba" or "Ma".

In practice if this were my problem I would start by -split-ting on
commas, then finding empirically which spurious short elements were
identified and then modify a script to catch those first.
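
A minimal sketch of that workflow, assuming the text inside the parentheses is already held in a string variable p (as in Steve's code further down). The names piece, id and k and the cutoff of 6 characters are illustrative choices only, not anything proposed in the thread.

```stata
clear
input str60 p
"Susan Sims"
"Judith Taper, MA PhD"
"Jason Stevenson Jr.,  B.A., Jill Harris, BA"
"Bob Williams, Jr."
end

* split on commas into piece1, piece2, ...
split p, parse(",") generate(piece)

* one observation per piece, so that all pieces can be tabulated together
gen long id = _n
reshape long piece, i(id) j(k)
replace piece = trim(piece)
drop if piece == ""

* short pieces are candidates for suffixes or degrees rather than names
* (the cutoff of 6 characters is an arbitrary starting point)
tabulate piece if length(piece) <= 6, sort
```

The tabulation is only a screening device: whatever short pieces turn up would then be added to the list of suffixes to protect before the real -split-.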

Nick
njcoxstata@gmail.com


On 24 November 2013 10:25, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
> Steve, Nick, almost perfect!
> I hate to spoil the fun, but here is a test case which it doesn't handle:
>
> "[Testisland] Test (Peter Ma, PhD, John Ba, PhD) El School"
> results in:
>  5. |                   Peter Ma                PhD |
>
> (note that Ma and Ba are common last names, besides being
> MA=Master/Magister, and BA=Bachelor)
>
> Interestingly, changing John's degree to MS causes correct parsing:
>   5. |              Peter Ma, PhD        John Ba, MS |
>
> After some investigation, I can be relatively confident that the
> problem occurs every time the second person has the same suffix
> (title) as the first one, which could be quite common (in case of
> titles).
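
One possible reading of this failure: each regexm() pass in the loop in Steve's code rewrites at most one comma per suffix, so a suffix that occurs twice leaves one comma unprotected. A hedged sketch of a workaround replaces every ", suffix" literally with subinstr() instead. It assumes that runs of blanks are first collapsed so that exactly one blank follows each comma, and that "_" never occurs in the data; the test strings and the suffix list are simply those already used in this thread.

```stata
clear
input str80 p
"Peter Ma, PhD, John Ba, PhD"
"Jason Stevenson Jr.,  B.A., Jill Harris, BA"
end

* collapse runs of blanks so that ", suffix" has exactly one blank
replace p = itrim(trim(p))

local suffix "Jr. Sr. BA B.A. B.S. BS MS M.Sc. Ph.D. PhD"
foreach x of local suffix {
    * subinstr() matches literally; "." as last argument means all occurrences
    * caveat: no word-boundary check, so ", BA" would also match ", BAxyz"
    replace p = subinstr(p, ", `x'", "_ `x'", .)
}

split p, parse(",")
foreach v of varlist p1 p2 {
    * restore the protected commas and drop leading or trailing blanks
    replace `v' = trim(subinstr(`v', "_", ",", .))
}
list p1 p2
```

Because the match is literal, the "." in "Jr." or "B.A." is an actual period here, not a regex wildcard.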
>
>
>
> Here is another case with a different error:
> "[Testisland] Test (Peter Ba, Sr. BA, John Ba, Jr. BA) El School"
> which results in one person:
>   5. | Peter Ba, Sr. BA, John Ba, Jr. BA                    |
>
>
> The above cases seem to be fixable, but IMHO the most difficult part
> of the assignment (the one without which I don't think one can search
> for a solution) is that I don't see a rule by which one can decide whether
> "John Smith, Ba Ma"
> is one person with two degrees, or two persons with no (or
> unspecified) degrees. If you think you know the answer, meet Ba Ma:
> http://in.linkedin.com/pub/ba-ma/38/4b2/838
>
> Perhaps you could enforce case-sensitivity and check for caps in
> degrees? Or rely on dots (as recommended here
> http://www.slc.edu/style-guide/), or at least know how many people are
> in the list?
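
A minimal sketch of the case-sensitive idea, assuming each comma-separated piece is held in a variable called piece (a hypothetical name) and that degrees always appear in exactly the capitalisation listed. As the "Ba Ma" example shows, a rule like this cannot settle entries that are ambiguous in themselves.

```stata
clear
input str20 piece
"PhD"
"BA"
"Ba Ma"
"MA PhD"
"B.A."
"John Ba"
end

* regexm() is case sensitive, so the pattern BA does not match "Ba";
* periods are escaped so that they match only a literal "."
local t "(Jr\.|Sr\.|BA|B\.A\.|BS|B\.S\.|MA|MS|M\.Sc\.|PhD|Ph\.D\.)"

* flag a piece only if every blank-separated word in it is on the list
gen byte titles_only = regexm(piece, "^`t'( +`t')*$")
list piece titles_only
```

Pieces flagged here would be glued back onto the preceding name; unflagged pieces would be treated as a separate person.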
>
> Don't forget to extend the list of abbreviations with Prof., Hon.,
> Rev., Rep., Sen., Gen., Capt., Sgt., Pvt., etc. And don't
> forget that some of them can be (part of) perfectly real names too,
> e.g. John Capt:
> http://www.linkedin.com/pub/john-capt/5/529/68b
> or Peter Sen:
> http://www.linkedin.com/in/petesen
> or for that matter Amartya Sen:
> en.wikipedia.org/wiki/Amartya_Sen
> I've found numerous people with the last names Hon, Rev, Rep, Sen, Gen
> and Capt, including multiple combinations in one person, e.g.:
> http://cn.linkedin.com/pub/ma-sen/16/516/255
> http://ca.linkedin.com/pub/ed-ma/1/386/784
>
>
> Compiling a comprehensive list of only degrees is a serious task on
> its own. Start with AA, AS, AAS, BA, BS, MA, MS, PhD, EdD, MD, DDS, DSc,
> LL.D., BFA, BA/MA, BMus, DPT, MFA, MPH, MPT, MSEd, MSW.
> And wait till you get to Dr.sc.math, Dr.sc.agr and various foreign degrees...
>
> School names can be fun too! You've guessed it: Yes! they can include
> parentheses in the official school name:
> http://profiles.dcps.dc.gov/Fillmore+Arts+Center+%28East%29
> But that's a whole other story...
>
>
> Is there anything in the data-generating software that could eliminate
> the need for parsing, or is it raw user-entered data? Avoid parsing if
> possible; this is usually the safest way to go. Otherwise, check the
> results very carefully after the program completes.
>
> Best,
>   Sergiy Radyakin
>
> On Sat, Nov 23, 2013 at 3:19 PM, Steve Samuels <sjsamuels@gmail.com> wrote:
>> I hadn't read Nick's post when I wrote mine. Both ideas follow the same
>> logic. I omitted the comma, but Nick's suggestion of creating a
>> temporary placeholder is superior. Here's a revised version which
>> incorporates Nick's idea. It also adds abbreviations with periods (".")
>> and retains these where present.
>>
>> Steve
>>
>> *********************Code Begins**************************
>> clear
>> input str244 s
>> "[Meadowfield] Park Sq (Susan Sims) Middle School"
>> "[Somerset] Upton & Pride School (Judith Taper, MA PhD) El School"
>> "[Temperly] Lakewood (Jason Stevenson Jr.,  B.A., Jill Harris, BA ) K-12"
>> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
>> end
>>
>> local suffix "Jr. Sr. BA  B.A. B.S. BS MS M.Sc. Ph.D. PhD"
>> gen p = regexs(1) if regexm(s,"\((.+)\)") /*Robert's code */
>>
>> /* Now remove a single comma preceding the suffixes */
>> foreach x of local suffix {
>> replace p  =regexs(1)+"_"+regexs(3)+regexs(4)+regexs(5) ///
>>  if regexm(p,"(.*)(,)(.*)(`x')(.*)")
>> }
>>
>> split p, p(",")
>> /* Restore the protected commas in each piece */
>> foreach v of varlist p1 p2 {
>> replace `v' = subinstr(`v',"_",",",.)
>> }
>> list p1 p2
>> *******************Code Ends******************************
>>
>>
>>
>> On Nov 22, 2013, at 4:24 AM, Nick Cox wrote:
>>
>> This sounds like a two-stage process. For example, you might use
>> -split- to split a variable containing the one or two names. ", Jr."
>> needs special treatment. I'd edit ", Jr." to "_Jr." and then edit back.
>>
>> For "Principle" read "Principal" throughout.
>> Nick
>> njcoxstata@gmail.com
>>
>>
>> On 22 November 2013 04:58, Becker Stein <becker.stein@aol.com> wrote:
>>> Hi,
>>>
>>> I asked this question yesterday. I needed help creating a regex to
>>> extract data from a single string variable. Robert's solution was
>>> really helpful. I was able to generate the School District,
>>> School Name, and School Type variables. However, I run into problems
>>> trying to create the Principle and Assist. Principles variables.
>>> The command gen principal = regexs(1) if regexm(s,"\((.+)\)") returns
>>> all of the contents in the parentheses, but I need the contents before
>>> the comma to generate the principle name variable and the contents
>>> after the comma to generate the assist. principle name (if any). It
>>> gets a little complicated
>>> because sometimes the names themselves have commas in them (as in the
>>> case of Robert Williams, Jr.) I've pasted some sample data below.
>>>
>>>
>>> [School District] School Name (Principle, Asst. Principal) School Type
>>>
>>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>>> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
>>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>>>
>>>
>>> Thanks,
>>> Becker
>>>
>>> -----Original Message-----
>>> From: Robert Picard <picard@netbox.com>
>>> To: statalist <statalist@hsphsun2.harvard.edu>
>>> Sent: Wed, Nov 20, 2013 10:51 pm
>>> Subject: Re: st: Extracting Data
>>>
>>> Becker
>>>
>>> Here's one way to parse each variable using regex functions.
>>>
>>> Robert
>>>
>>> * ---------------- begin example -------------------------
>>> clear
>>> input str244 s
>>> "[Meadowfield] Park Sq (Susan Sims) Middle School"
>>> "[Somerset] Upton & Pride School (Judith Taper) El School"
>>> "[Temperly] Lakewood (Jason Stevenson, Jill Harris ) K-12"
>>> "[Packard] W.E.B.Bos ( Bob Williams, Jr.) Middle School"
>>> end
>>>
>>> gen district = regexs(1) if regexm(s,"\[(.+)\]")
>>> gen sname = regexs(1) if regexm(s,"\](.+)\(")
>>> gen principal = regexs(1) if regexm(s,"\((.+)\)")
>>> gen stype = regexs(1) if regexm(s,"\)(.+)")
>>>
>>> list district sname principal stype
>>> * ---------------- end example ---------------------------
>>>
>>> On Wed, Nov 20, 2013 at 4:38 PM, Becker Stein <becker.stein@aol.com> wrote:
>>>>
>>>> -----Original Message-----
>>>> From: Becker Stein <becker.stein@aol.com>
>>>> To: statalist <statalist@hsphsun2.harvard.edu.>
>>>> Sent: Wed, Nov 20, 2013 9:23 pm
>>>> Subject: Help Extracting Data
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to extract data from a single string variable, and I was
>>>> wondering how to create a regular expression that I can
>>>> use to do so. I've tried to create one just to extract the school
>>>> name, but to no avail. My data is set up as: [school district] name of
>>>> school (name of principle, name of assistant principle (*if any))
>>>> school type. Below are some examples.
>>>>
>>>> [Meadowfield] Park Square (Susan Sims, John Riley) Middle School
>>>> [Somerset] Upton & Pride Day School (Judith  Taper) Elementary School
>>>> [Temperly] Lakewood School (Jason Stevenson, Jill Harris ) K-12
>>>> [Packard] W.E.B. Du Bois ( Robert Williams, Jr.) Middle School
>>>>
>>>> I would like to extract the school name, principle name and asst.
>>>> principle name as separate variables. Sometimes the names have special
>>>> characters such as an "&" (as in the case of Upton & Pride) or a ".",
>>>> and the administrators section may have only 1 name or 2 names
>>>> (separated by a comma). Also, some of the data in the brackets and
>>>> parentheses have extra spaces. I initially used the itrim function on
>>>> the variable, and it removed the extra spaces for the content outside
>>>> of the brackets and parentheses (i.e., school name and school type),
>>>> but it didn't work for content inside of them (school district and
>>>> principal names).
>>>> Thanks in advance for any/all help.
>>>>
>>>> Best,
>>>> Becker
>>>>
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> *   http://www.ats.ucla.edu/stat/stata/