Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Encoding and matching string values
From
"Pavlos C. Symeou" <[email protected]>
To
[email protected]
Subject
Re: st: Encoding and matching string values
Date
Mon, 04 Oct 2010 12:47:31 +0200
thank you for your input.
@ Martin, we thought about using the -encode- command and then
replace the original variable with the encoded one to reduce the
data size. We managed to reduce some files from nearly 1Gb down to
10 Mb. If you consider that we have 250 such files with some of a
size of more than 10Gb each (due to string variables and thousands
of observations) you can understand the degree of difficulty we are
facing. Of course we used - compress - as well as clearing the
string values as much as we could.
@ Eric. I ran the code below after having done some minor
adjustments and works just fine. This was an excellent exercise to
see how our data behave. The final task we want to carry out though
is even more complex in terms of programming and as we are lacking
that competence we would like to ask for your input once more.
The problem has as follows. We have about 250 company files with
their patents (see below a sample). For example, company "Acer"
which operates in industrial sector 3456 (mother_SIC) has 100,000
patents published between 1960-2009 (year). Certain years may have
multiple patents. Every patent is assigned multiple patent numbers
(patent_number) which uniquely identify it. Each of that patent can
be used in at least one industrial sector (patent_Sic). Last, every
patent may cite multiple patents (citation).
The data below tell that, ACER in year 1994 published two patents
which were assigned the names TW231391-A, TW231391-B,.....,
TW231391-C and DR231342-A, DRE231342-B,......,TA231342-C,
respectively. Each patent can be applied in a respective number of
industries (patent_Sic). In each of the two patents, ACER is citing
a number of other patents, which may belong to ACER or other
companies, which themselves are described by similar attributes
(patent_number, patent_Sic etc).
Company mother_sic Year patent_Sic_1
patent_Sic_2 patent_Sic_3 patent_number_1
patent_number_2 patent_number_40 citation_1 citation_2
citation_500
ACER 3456 1994 3661
TW231391-A TW231391-B TW231391-C
US231391-A GB231391-A CY231391-A
ACER 3456 1994
3417 5472 5571
DR231342-A DRE231342-B
TA231342-C FR231342-A GG231342-D CY2634542-B
ACER 3456 1995 3577 3572
3571 BR231342-B LTE231342-A
PAT231342-A GR231342-A TW231342-A
SE231342-A
..........
..........
.........
We want to estimate a new variable ("convergence") for ACER which
will measure how much its patents' SIC sectors and these patents'
citations' SIC sectors deviate from ACER's industrial sector
(mother_SIC) based on the following formula. Take for example the
two patents in 1994 above.
The value of "convergence" for the year 1994 should be: {[*0.90 *
(a1 +b1 +c1) + 0.10 * (d1 + e1 + f1)*] ** + [*0.90 * (a2 +b2 +c2) +
0.10 * (d2 + e2 + f2)*] *} / n* *
*
where 1,2,...,n is the number of patents that ACER published in
1994 and a,b,c,d,e,f are:
For every ACER patent published in 1994:
a) take the proportion of all patent_SICs whose 1st digit is
different than the 1st digit of mother_SIC and multiply it by 3;
b) take the proportion of all patent_SICs whose 1st digit is the
same as the 1st digit of mother_SIC but the 2nd digit is different
than the 2nd digit of mother_SIC and multiply it by 2;
c) take the proportion of all patent_SICs whose first 2 digits are
the same as the first 2 digits of mother_SIC but they have a
different 3rd digit and multiply it by 1;
d) take the proportion of all cited patents' patent_SICs whose 1st
digit is different than the 1st digit of ACER's mother_SIC and
multiply it by 3;
e) take the proportion of all cited patents' patent_SICs whose 1st
digit is the same as the 1st digit of ACER's mother_SIC but the 2nd
digit is different than the 2nd digit of mother_SIC and multiply it
by 2;
f) take the proportion of all cited patents' patent_SICs whose
first 2 digits are the same as the first 2 digits of ACER's
mother_SIC but they have a different 3rd digit and multiply it by 1;
For a, b, c the search will be done inside ACER's file. For d, e, f
the search will be done inside all available companies' files,
including ACER and only for the years 1994 and earlier. This is
because cited patents are published before the focal citing patent
(of course).
The output should look like this:
Company Year convergence
ACER 1994 2.3
ACER 1995 2.1
ACER 1996 2.5
.........................................
........................................
In retrospect, the use of -encode- aimed to reduce the sizes of
consisting company datasets so that the files can be joined
together (if this is going to help anyway) and the "search and
match" time is reduced.
I know that this is not a trivial task, but unfortunately my
programming knowledge is minimal. Any help will be very appreciated.
Best,
Pavlos
**solution**
//1. mk lookup table of values//
clear
set more off
cd "MY DIRECTORY WITH ALL MY DTA FILE"
save "lookuptable.dta", emptyok replace
foreach file in att ameritech aol {
append using "`file'.dta", keep(citation)
}
duplicates drop
g citation_number = _n
l
save "lookuptable.dta", emptyok replace
cap which labmask
if _rc ssc install labutil, replace
labmask citation_number, value(citation) lblname(cit)
la save cit using "citationlabels.do" , replace
//2. mk final tbl w. citation_number, not citation//
clear
save "final.dta", emptyok replace
foreach file in att ameritech aol {
u "`file'.dta", clear
merge m:1 citation using "lookuptable.dta" //Here I changed
the 1:1 to m:1 because I have multiple patents which cite the same
patent
drop if _m!=3
drop _m
drop citation
sa "`file'_encoded.dta", replace
append using "final.dta"
sa "final.dta", replace
}
//3. apply labels to citation_number//
l
fre citation
do "citationlabels.do"
lab val citation_number cit
fre citation
************************!
*Von: *Eric Booth <[email protected] <mailto:[email protected]>>
*Datum: *24. September 2010 16:05:13 GMT-07:00
*An: *"<[email protected]
<mailto:[email protected]>>"
<[email protected]
<mailto:[email protected]>>
*Betreff: **Re: st: RE: Encoding and matching string values*
*Antwort an: *[email protected]
<mailto:[email protected]>
<>
In an attempt to reduce the size of the final/appended dataset,
Florian wants to encode "citation" in each of the un-appended
datasets first, remove the long string variable and have an
encoded numeric variable in it's place, and then append the
files to create the large, final dataset.
The problems are (1) you will need to remove the original string
version of "citation" before appending or the -encode- didn't
save you any space (as Martin mentions) and (2) if the appended
datasets have the same "citations", then -encode- may have
assigned it one value in one dataset and a different value in a
different dataset (I think this is what Martin was asking about
in his response).
It's easier to -encode- "citation" in the final, appended
dataset so that the encoding is consistent, but in Florian's
case this is undesirable because of space limitations.
One solution is to create a look up table containing the string
variable "citation" and an assigned code/number for each value
in citation (citation_number). Then you can merge this
citation_number to each individual, un-appended file & drop the
string "citation" (in the un-appended files before appending
them ) to save space.
After appending all these files , you can apply the "citations"
as value labels to the "citation_number" in the large/appended
dataset.
You'll need -labmask- (from findit labutil on SSC) and -fre-
(from SSC) to use the example below:
************************!
//fake "using" dataset//
clear
inp id patent_number str5(citation)
1 12 "one"
2 13 "two"
3 99 "three"
4 98 "four"
end
sa using.dta, replace
encode citation, g(citation2)
cap which fre
if _rc ssc install fre, replace
fre citation2
sa using_encoded.dta, replace
//fake "master" dataset//
clear
inp id patent_number str5(citation)
5 19 "four"
6 17 "five"
7 89 "six"
8 88 "seven"
end
sa master.dta, replace
encode citation, g(citation2)
fre citation2
/*
this is what Florian is running into, it doesnt work because
encode assigned
different values to the same labels across datasets
*/
append using "using_encoded.dta"
**solution**
//1. mk lookup table of values//
clear
save "lookuptable.dta", emptyok replace
foreach file in using master /* put all your files here */ {
append using "`file'.dta", keep(citation)
}
duplicates drop
g citation_number = _n
l
save "lookuptable.dta", emptyok replace
cap which labmask
if _rc ssc install labutil, replace
labmask citation_number, value(citation) lblname(cit)
la save cit using "citationlabels.do" , replace
//2. mk final tbl w. citation_number, not citation//
clear
save "final.dta", emptyok replace
foreach file in using master /* put all your files here */ {
u "`file'.dta", clear
merge 1:1 citation using "lookuptable.dta"
drop if _m!=3
drop _m
drop citation
sa "`file'_encoded.dta", replace
append using "final.dta"
sa "final.dta", replace
}
//3. apply labels to citation_number//
l
fre citation
do "citationlabels.do"
lab val citation_number cit
fre citation
************************!
- Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected] <mailto:[email protected]>
Office: +979.845.6754
On Sep 24, 2010, at 4:53 PM, Martin Weiss wrote:
<>
I am not sure the description here is clear enough: -encode-
forces you to -generate()- the new numeric variable, so that
both the string and its -encode-d counterpart coexist
afterwards. So it is hard to see how a) your dataset is
supposed to decrease in size via -encode- b) how the "original
string values" are no longer there...
How does Stata (_not STATA_) "...mess up the the numerical
values after appending the dataset"?
HTH
Martin
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/