Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: Estimating new variable from multiple datasets
From
"Pavlos C. Symeou" <[email protected]>
To
[email protected]
Subject
st: Estimating new variable from multiple datasets
Date
Thu, 28 Oct 2010 12:07:00 +0200
Dear Statalisters,
Some time ago I wrote to Statalist about the problem below but I
have been unsuccessful in receiving any suggestions. I am trying
herein to simplify my problem in the hope that you can help. I have
250 company files with their patents of total size of 12Gbytes (see
below a sample). I want to use information from each company's
patents and their citations and create a new dataset which will
consist of all companies in a panel format with a new variable
whose estimation I describe below.
Let me give you an example. Company "Acer" which operates in
industrial sector 3456 (mother_SIC) has 100,000 patents published
between 1960-2009 (year). Certain years may have multiple patents.
Every patent is assigned multiple patent numbers (patent_number)
which uniquely identify it. Each of that patent can be used in at
least one industrial sector (patent_Sic). Every patent may cite
multiple patents (citation).
The data below tell that, ACER in year 1994 published two patents
which were assigned 20 numbers. Each patent is used in 20
industries (patent_Sic). In each of the two patents, ACER is citing
20 other patents, which may belong to ACER or other companies,
which themselves appear in a similar fashion as observations (in
ACER's or) another company's file.
name mother_sic Year patent_Sic_1 patent_Sic_20 patent_number_1
patent_number_20 citation_1 citation_20
ACER 3456 1994
3661 TW231391-A
TW231391-B US231391-A CY231391-A
ACER 3456 1994 3417 5472
DR231342-A TA231342-C FR231342-A CY2634542-B
ACER 3456 1995 3577 3572
BR231342-B PAT231342-A TW231342-A SE231342-A
I want to estimate a new numerical variable ("convergence") for
ACER which will measure how much its patents' SIC sectors and the
related cited patents' SIC sectors deviate from ACER's industrial
sector (mother_SIC) based on the following formula. Take for
example the two patents in 1994 above. The value of "convergence"
for the year 1994 should be:
{[0.90 * (a1 +b1 +c1) + 0.10 * (d1 + e1 + f1)] + [0.90 * (a2 +b2
+c2) + 0.10 * (d2 + e2 + f2)] } / n
where 1,2,...,n is the number of patents that ACER published in
1994 and a, b, c, d, e, f are:
For every ACER patent published in 1994:
a) the proportion of all patent_SICs whose 1st digit is different
than the 1st digit of mother_SIC, multiplied by 3;
b) the proportion of all patent_SICs whose 1st digit is the same as
the 1st digit of mother_SIC but the 2nd digit is different than the
2nd digit of mother_SIC, multiplied by 2;
c) the proportion of all patent_SICs whose first 2 digits are the
same as the first two digits of mother_SIC but they have a
different 3rd digit, multiplied it 1;
d) the proportion of all cited patents' patent_SICs whose 1st digit
is different than the 1st digit of ACER's mother_SIC, multiplied by 3;
e) the proportion of all cited patents' patent_SICs whose 1st digit
is the same as the 1st digit of ACER's mother_SIC but the 2nd digit
is different than the 2nd digit of mother_SIC, multiplied by 2;
f) the proportion of all cited patents' patent_SICs whose first 2
digits are the same as the first 2 digits of ACER's mother_SIC but
they have a different 3rd digit, multiplied by 1;
Notes:
A) For a, b, c the search will be done inside ACER's file. For d,
e, f the search will be done inside all available companies' files,
including ACER and only for the years 1994 and earlier. This is
because cited patents are already published.
B) A citation number in ACER's file will appear as a patent number
in another company's file (or in ACER's file if the company is
citing another patent it owns).
C) Since a patent can be assigned multiple numbers, the search
intended to match the citation with the patent number must go
through all columns that have patent numbers.
D) It is possible that a cited patent does not belong to our sample
companies. This should not terminate the loop but go on with the
next citation.
The output should look like this:
Company Year convergence
ACER 1994 2.3
ACER 1995 2.1
ACER 1996 2.5
......
......
Any help will be very appreciated.
Best,
Pavlos
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/