Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: RE: Extracting parts of string variable
From: "Pavlos C. Symeou" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: RE: Extracting parts of string variable
Date: Thu, 08 Apr 2010 22:50:25 +0200
Dear Robert, Ulrich and Nick,
Thank you for your responses. So far I have run Robert's suggested code
on a larger sample, only to find two problems: the code does not capture
patent codes that start with text other than "US", and it does not allow
for a second patent code in the string. I give examples below, showing
cit_1 and the company_1 value that the code produces.
cit_1                                                           company_1
US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_  -- US2004249974-A1 ALKHATIB H S
US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_  -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD
WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_       WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO
Regards,
Pavlos
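One way the pattern might be widened to cover both cases is sketched below. It rests on assumptions that are not confirmed anywhere in this thread: that every patent code is two capital letters followed by digits and an optional suffix with no blanks, and that consecutive codes are always joined by " -- ". The names code and co3 are made up for illustration, and the sketch has not been checked against the full data.

```stata
clear
input str80 cit_1
"US7599365-B1 -- US2004249974-A1 ALKHATIB H S _ALKH-Individual_"
"US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_"
"WO2008027657-A2 -- US5773966-A GENERAL ELECTRIC CO _GENE_"
"US6449348-B1 3COM CORP _THRE-Non-standard_"
"CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S"
"US3476883-A"
end

* Assumed shape of a patent code: two capital letters, digits, and an
* optional suffix without blanks (e.g. US6449348-B1, EP1379030-A2).
local code "[A-Z][A-Z][0-9]+[^ ]*"

* Group 1 swallows any chain of codes joined by " -- " (group 2 is the
* joiner); group 3 is everything up to the first underscore.
gen str80 co3 = trim(regexs(3)) if regexm(cit_1, "^(`code'( -- )?)*([^_]+)_")

list, noobs
```

Strings without an underscore, such as the last one, do not match and are left empty. A company name that itself begins with two capital letters followed by a digit, in a string with no leading patent code, would be mistaken for a code, so the result is worth checking by eye.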
On 08/04/2010 19:07, Robert Picard wrote:
Pavlos,
Here's my attempt:
*-------------------- example -------------------------
version 11
clear
input id str244( cit_1 company_1)
1 "US6449348-B1 3COM CORP _THRE-Non-standard_" "3COM CORP"
2 "US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_" "CETACEA NETWORKS CORP"
3 "US5566180-A HEWLETT-PACKARD CO _HEWP_" "HEWLETT-PACKARD CO"
4 "US6215865-B1 E-TALK CORP _ETAL-Non-standard_" "E-TALK CORP"
6 "US5600312-A MOTOROLA INC _MOTI_" "MOTOROLA INC"
7 "CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S" "CONRED ELECTRONICS LTD"
8 "TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B" "TEMIC TELEFUNKEN MICROELECTRONIC GMBH"
9 "US3476883-A" ""
10 "US5136671-A AT& T BELL LAB _AMTT_" "AT& T BELL LAB"
11 "US5195132-A AMERICAN TELEPHONE& TELEGRAPH CO _AMTT_" "AMERICAN TELEPHONE& TELEGRAPH CO"
12 "US5605491-A CHURCH& DWIGHT CO INC _CHUR-Non-standard_" "CHURCH& DWIGHT CO INC"
13 "US6028656-A CAMBRIDGE RES& INSTR INC _CAMB-Non-standard_" "CAMBRIDGE RES& INSTR INC"
14 "US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B" "DAEWOO ELECTRONICS CO LTD"
15 "US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F" "INT BUSINESS MACHINES CORP"
16 "US6947529-B2 -- US761995" ""
end
compress
gen co2 = trim(regexs(2)) if regexm(cit_1,"^(US[0-9]+[^ ]*)*([^_]+)_")
assert company_1 == co2
list, noobs
*-------------------- example -------------------------
Robert
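Since -assert- stops at the first disagreement, on a larger sample it may be handier to flag and list the rows where the extracted name differs from the expected one. The sketch below reuses the pattern above on two rows. The expected company_1 in the second row is an assumption about the desired result, and the variable mismatch is a name made up for illustration.

```stata
clear
input id str80(cit_1 company_1)
1 "US6449348-B1 3COM CORP _THRE-Non-standard_" "3COM CORP"
2 "US7535880-B1 -- EP1379030-A2 SAMSUNG ELECTRONICS CO LTD _SMSU_" "SAMSUNG ELECTRONICS CO LTD"
end

* Same pattern as in the example above: optional leading US code,
* then everything up to the first underscore.
gen str80 co2 = trim(regexs(2)) if regexm(cit_1, "^(US[0-9]+[^ ]*)*([^_]+)_")

* Flag disagreements instead of stopping at the first one.
gen byte mismatch = company_1 != co2
list id cit_1 company_1 co2 if mismatch, noobs
count if mismatch
```

Only the second row should be flagged, because the pattern leaves the second patent code in front of the company name.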
On Thu, Apr 8, 2010 at 12:56 PM, Nick Cox<[email protected]> wrote:
I'd take a look at -split-. The recipe doesn't look simple even then given that your company names may contain blanks.
Nick
[email protected]
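A rough sketch of how -split- could be used here: cut each string at the underscores, keep the first piece, then strip a leading patent code with regexr(). The stub part, the variable co_split, and the assumed shape of a patent code (two capital letters, digits, optional suffix without blanks) are illustrative assumptions. Only one leading code is removed, so strings with two codes would need a further step.

```stata
clear
input str80 cit_1
"US6449348-B1 3COM CORP _THRE-Non-standard_"
"CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S"
"US3476883-A"
end

* Cut each string at the underscores; part1 holds whatever precedes
* the first underscore (or the whole string if there is none).
split cit_1, parse("_") generate(part)

* Remove one leading patent code, if present, then trim blanks.
gen str80 co_split = trim(regexr(part1, "^[A-Z][A-Z][0-9]+[^ ]*", ""))

list cit_1 co_split, noobs
```

As the company names contain blanks, splitting on blanks would not help. Splitting on "_" sidesteps that but still leaves the patent codes to be dealt with separately.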
Pavlos C. Symeou
I am experiencing some problems with a command I use to extract part of
a string variable in order to create another string variable. The
existing string variable is cit_1, which may contain one or more
instances of any of the following: a patent number (e.g.
"US6449348-B1"), a company name (e.g. "3COM CORP"), a company
abbreviation enclosed by "_" (e.g. "_THRE-Non-standard_"), and other
text after the closing "_" (e.g. see id 8). My aim is to extract the
company name, which always appears before its abbreviation, and use it
to create a new string variable company_1. I used the following command,
which, however, fails to account for the different forms of the cit_1
values and produces incorrect company names.
gen company_1 = regexs(2) if (regexm(cit_1, "([A-Z0-9]*[\-][A-Z0-9]*[ \-]*) *([A-Z0-9 ]*)( *)([\_])(.*)([\_])"))
I provide below the various forms that cit_1 takes and how company_1
should look.
id  cit_1                                                            company_1
1   US6449348-B1 3COM CORP _THRE-Non-standard_                       3COM CORP
2   US2004257999-A1 CETACEA NETWORKS CORP _CETA-Non-standard_        CETACEA NETWORKS CORP
3   US5566180-A HEWLETT-PACKARD CO _HEWP_                            HEWLETT-PACKARD CO
4   US6215865-B1 E-TALK CORP _ETAL-Non-standard_                     E-TALK CORP
    US4528422-A -- US452232-A1 INTELEPLEX CORP _INTE-Non-standard_   INTELEPLEX CORP
6   US5600312-A MOTOROLA INC _MOTI_                                  MOTOROLA INC
7   CONRED ELECTRONICS LTD _CONR-Non-standard_ MURAKOSHI S           CONRED ELECTRONICS LTD
8   TEMIC TELEFUNKEN MICROELECTRONIC GMBH _TELE_ LEICHT G, SCHUCH B  TEMIC TELEFUNKEN MICROELECTRONIC GMBH
9   US3476883-A
10  US5136671-A AT& T BELL LAB _AMTT_                                AT& T BELL LAB
11  US5195132-A AMERICAN TELEPHONE& TELEGRAPH CO _AMTT_              AMERICAN TELEPHONE& TELEGRAPH CO
12  US5605491-A CHURCH& DWIGHT CO INC _CHUR-Non-standard_            CHURCH& DWIGHT CO INC
13  US6028656-A CAMBRIDGE RES& INSTR INC _CAMB-Non-standard_         CAMBRIDGE RES& INSTR INC
14  US6201832 DAEWOO ELECTRONICS CO LTD _DAEW-Non-standard_ CHOI B   DAEWOO ELECTRONICS CO LTD
15  US6238946 INT BUSINESS MACHINES CORP _IBMC_ ZIEGLER J F          INT BUSINESS MACHINES CORP
16  US6947529-B2 -- US761995
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/