Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Encoding and matching string values

From	"Pavlos C. Symeou" <[email protected]>
To	[email protected]
Subject	Re: st: Encoding and matching string values
Date	Mon, 04 Oct 2010 12:47:31 +0200

thank you for your input.

@ Martin, we thought about using the -encode- command and thenreplace the original variable with the encoded one to reduce thedata size. We managed to reduce some files from nearly 1Gb down to10 Mb. If you consider that we have 250 such files with some of asize of more than 10Gb each (due to string variables and thousandsof observations) you can understand the degree of difficulty we arefacing. Of course we used - compress - as well as clearing thestring values as much as we could.

@ Eric. I ran the code below after having done some minoradjustments and works just fine. This was an excellent exercise tosee how our data behave. The final task we want to carry out thoughis even more complex in terms of programming and as we are lackingthat competence we would like to ask for your input once more.

The problem has as follows. We have about 250 company files withtheir patents (see below a sample). For example, company "Acer"which operates in industrial sector 3456 (mother_SIC) has 100,000patents published between 1960-2009 (year). Certain years may havemultiple patents. Every patent is assigned multiple patent numbers(patent_number) which uniquely identify it. Each of that patent canbe used in at least one industrial sector (patent_Sic). Last, everypatent may cite multiple patents (citation).

The data below tell that, ACER in year 1994 published two patentswhich were assigned the names TW231391-A, TW231391-B,.....,TW231391-C and DR231342-A, DRE231342-B,......,TA231342-C,respectively. Each patent can be applied in a respective number ofindustries (patent_Sic). In each of the two patents, ACER is citinga number of other patents, which may belong to ACER or othercompanies, which themselves are described by similar attributes(patent_number, patent_Sic etc).

Company mother_sic Year patent_Sic_1patent_Sic_2 patent_Sic_3 patent_number_1patent_number_2 patent_number_40 citation_1 citation_2citation_500ACER 3456 1994 3661TW231391-A TW231391-B TW231391-CUS231391-A GB231391-A CY231391-AACER 3456 19943417 5472 5571DR231342-A DRE231342-BTA231342-C FR231342-A GG231342-D CY2634542-BACER 3456 1995 3577 35723571 BR231342-B LTE231342-APAT231342-A GR231342-A TW231342-ASE231342-A

..........
..........
.........

We want to estimate a new variable ("convergence") for ACER whichwill measure how much its patents' SIC sectors and these patents'citations' SIC sectors deviate from ACER's industrial sector(mother_SIC) based on the following formula. Take for example thetwo patents in 1994 above.

The value of "convergence" for the year 1994 should be: {[*0.90 *(a1 +b1 +c1) + 0.10 * (d1 + e1 + f1)*] ** + [*0.90 * (a2 +b2 +c2) +0.10 * (d2 + e2 + f2)*] *} / n* *

where 1,2,...,n is the number of patents that ACER published in1994 and a,b,c,d,e,f are:


For every ACER patent published in 1994:

a) take the proportion of all patent_SICs whose 1st digit isdifferent than the 1st digit of mother_SIC and multiply it by 3;b) take the proportion of all patent_SICs whose 1st digit is thesame as the 1st digit of mother_SIC but the 2nd digit is differentthan the 2nd digit of mother_SIC and multiply it by 2;c) take the proportion of all patent_SICs whose first 2 digits arethe same as the first 2 digits of mother_SIC but they have adifferent 3rd digit and multiply it by 1;d) take the proportion of all cited patents' patent_SICs whose 1stdigit is different than the 1st digit of ACER's mother_SIC andmultiply it by 3;e) take the proportion of all cited patents' patent_SICs whose 1stdigit is the same as the 1st digit of ACER's mother_SIC but the 2nddigit is different than the 2nd digit of mother_SIC and multiply itby 2;f) take the proportion of all cited patents' patent_SICs whosefirst 2 digits are the same as the first 2 digits of ACER'smother_SIC but they have a different 3rd digit and multiply it by 1;

For a, b, c the search will be done inside ACER's file. For d, e, fthe search will be done inside all available companies' files,including ACER and only for the years 1994 and earlier. This isbecause cited patents are published before the focal citing patent(of course).


The output should look like this:

Company    Year        convergence
ACER            1994      2.3
ACER            1995      2.1
ACER            1996      2.5
.........................................
........................................

In retrospect, the use of -encode- aimed to reduce the sizes ofconsisting company datasets so that the files can be joinedtogether (if this is going to help anyway) and the "search andmatch" time is reduced.

I know that this is not a trivial task, but unfortunately myprogramming knowledge is minimal. Any help will be very appreciated.


Best,

Pavlos




**solution**

//1.  mk lookup table of values//
clear
set more off
cd "MY DIRECTORY WITH ALL MY DTA FILE"
save "lookuptable.dta", emptyok replace
foreach file in att ameritech aol {
    append using "`file'.dta", keep(citation)
    }
    duplicates drop
    g citation_number = _n
    l
    save "lookuptable.dta", emptyok replace
    cap which labmask
    if _rc ssc install labutil, replace
    labmask citation_number, value(citation) lblname(cit)
    la save cit  using "citationlabels.do" , replace


//2. mk final tbl w. citation_number, not citation//
clear
save "final.dta", emptyok replace

foreach file in att ameritech aol   {
    u "`file'.dta", clear

merge m:1 citation using "lookuptable.dta" //Here I changedthe 1:1 to m:1 because I have multiple patents which cite the samepatent

    drop if _m!=3
    drop _m
    drop citation
    sa "`file'_encoded.dta", replace
    append using "final.dta"
    sa "final.dta", replace
    }

//3.  apply labels to citation_number//
l
fre citation
do "citationlabels.do"
lab val citation_number cit
fre citation

************************!



*Von: *Eric Booth <[email protected] <mailto:[email protected]>>

*Datum: *24. September 2010 16:05:13 GMT-07:00
*An: *"<[email protected]<mailto:[email protected]>>"<[email protected]<mailto:[email protected]>>
*Betreff: **Re: st: RE: Encoding and matching string values*
*Antwort an: *[email protected]<mailto:[email protected]>
<>
In an attempt to reduce the size of the final/appended dataset,Florian wants to encode "citation" in each of the un-appendeddatasets first, remove the long string variable and have anencoded numeric variable in it's place, and then append thefiles to create the large, final dataset.
The problems are (1) you will need to remove the original stringversion of "citation" before appending or the -encode- didn'tsave you any space (as Martin mentions) and (2) if the appendeddatasets have the same "citations", then -encode- may haveassigned it one value in one dataset and a different value in adifferent dataset (I think this is what Martin was asking aboutin his response).It's easier to -encode- "citation" in the final, appendeddataset so that the encoding is consistent, but in Florian'scase this is undesirable because of space limitations.
One solution is to create a look up table containing the stringvariable "citation" and an assigned code/number for each valuein citation (citation_number). Then you can merge thiscitation_number to each individual, un-appended file & drop thestring "citation" (in the un-appended files before appendingthem ) to save space.After appending all these files , you can apply the "citations"as value labels to the "citation_number" in the large/appendeddataset.
You'll need -labmask- (from findit labutil on SSC) and -fre-(from SSC) to use the example below:
************************!

//fake "using" dataset//
clear
inp id patent_number str5(citation)
1 12 "one"
2 13 "two"
3 99 "three"
4 98 "four"
end
sa using.dta, replace

encode citation, g(citation2)
cap which fre
if _rc ssc install fre, replace
fre citation2
sa using_encoded.dta, replace


//fake "master" dataset//
clear
inp id patent_number str5(citation)
5 19 "four"
6 17 "five"
7 89 "six"
8 88 "seven"
end
sa master.dta, replace

encode citation, g(citation2)
fre citation2

/*
this is what Florian is running into, it doesnt work becauseencode assigned
different values to the same labels across datasets
*/
append using "using_encoded.dta"


**solution**

//1.  mk lookup table of values//
clear
save "lookuptable.dta", emptyok replace
foreach file in using master   /* put all your files here */  {
append using "`file'.dta", keep(citation)
}
duplicates drop
g citation_number = _n
l
save "lookuptable.dta", emptyok replace
cap which labmask
if _rc ssc install labutil, replace
labmask citation_number, value(citation) lblname(cit)
la save cit  using "citationlabels.do" , replace



//2. mk final tbl w. citation_number, not citation//
clear
save "final.dta", emptyok replace

foreach file in using master  /* put all your files here */  {
u "`file'.dta", clear
merge 1:1 citation using "lookuptable.dta"
drop if _m!=3
drop _m
drop citation
sa "`file'_encoded.dta", replace
append using "final.dta"
sa "final.dta", replace
}

//3.  apply labels to citation_number//
l
fre citation
do "citationlabels.do"
lab val citation_number cit
fre citation

************************!



- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected] <mailto:[email protected]>
Office: +979.845.6754



On Sep 24, 2010, at 4:53 PM, Martin Weiss wrote:
<>
I am not sure the description here is clear enough: -encode-forces you to -generate()- the new numeric variable, so thatboth the string and its -encode-d counterpart coexistafterwards. So it is hard to see how a) your dataset issupposed to decrease in size via -encode- b) how the "originalstring values" are no longer there...
How does Stata (_not STATA_) "...mess up the the numericalvalues after appending the dataset"?
HTH
Martin

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Prev by Date: st: RE: Finding Stata commands
Next by Date: Re: st: RE: Finding Stata commands
Previous by thread: st: Fwd: FW: output tables after xtlogit using outreg2
Next by thread: st: Re: RE: re: Re: forvalues, panel data
Index(es):
- Date
- Thread