Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | "Pavlos C. Symeou" <p.symeou@lmu.de> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | st: Extracting parts of string variable |
Date | Thu, 08 Apr 2010 18:35:45 +0200 |
Dear Statalisters,I am experiencing some problems with a command I use to extract a part of a string variable which I use to create another string variable. The existing string variable is cit_1 and may contain (one or multiple instances of any of) a patent number (e.g. "US6449348-B1"), a company name (e.g. "3COM CORP"), a company abbreviation enclosed by "_" (e.g. "_THRE-Non-standard_"), other text after the "_" (e.g. see id 8). My aim is to extract the company name, which appears always before its abbreviation and use it to create a new string variable company_1. I used the following command, which however fails to account for different forms of the cit_1 values and produces incorrect company names.
gen company_1 = regexs(2) if (regexm(cit_1, "([A-Z0-9]*[\-][A-Z0-9]*[ \-]*) *([A-Z0-9 ]*)( *)([\_])(.*)([\_])"))
I provide below the various forms that cit_1 takes and how company_1 should look.
id cit_1 company_1 1 US6449348-B1 3COM CORP _THRE-Non-standard_ 3COM CORP2 US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_ CETACEA NETWORKS CORP
3 US5566180-A HEWLETT-PACKARD CO _HEWP_ HEWLETT-PACKARD CO 4 US6215865-B1 E-TALK CORP _ETAL-Non-standard_ E-TALK CORPUS4528422-A -- US452232-A1 INTELEPLEX CORP _INTE-Non-standard_ INTELEPLEX CORP
6 US5600312-A MOTOROLA INC _MOTI_ MOTOROLA INC7 CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S CONRED ELECTRONICS LTD 8 TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B TEMIC TELEFUNKEN MICROELECTRONIC GMBH
9 US3476883-A 10 US5136671-A AT & T BELL LAB _AMTT_ AT & T BELL LAB11 US5195132-A AMERICAN TELEPHONE & TELEGRAPH CO _AMTT_ AMERICAN TELEPHONE & TELEGRAPH CO 12 US5605491-A CHURCH & DWIGHT CO INC _CHUR-Non-standard_ CHURCH & DWIGHT CO INC 13 US6028656-A CAMBRIDGE RES & INSTR INC _CAMB-Non-standard_ CAMBRIDGE RES & INSTR INC 14 US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B DAEWOO ELECTRONICS CO LTD 15 US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F INT BUSINESS MACHINES CORP
16 US6947529-B2 -- US761995 Best regards, Pavlos * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/