Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: recognizing patterns within two columns of data

From	Robert Picard <[email protected]>
To	[email protected]
Subject	Re: st: recognizing patterns within two columns of data
Date	Mon, 11 Jul 2011 14:51:40 -0400
Since memory might be an issue, here's a more efficient version of my
code. If memory is still a problem (depends on the maximum number of
shareholders per company), then using numeric identifiers would be
better than using company and shareholder names.

* --------------------- begin example ---------------------
clear all
input str5 company str7 holder
compA holderA
compB holderA
compC holderL
compD holderH
compA holderB
compB holderB
compC holderA
compC holderB
compD holderD
compE holderA
end

sort company holder
by company: gen N = _N
sum N, meanonly
forvalues i = `=r(max)'(-1)2 {
  by company: gen holder`i' = holder[`i']
  by company: drop if _n == `i'
}
rename holder holder1
order holder*, sequential after(N)
list, noobs

sort h* company
by h*: gen one = _n == 1
gen cogroup = sum(one)
list company N holder* cogroup, sepby(cogroup) noobs
* --------------------- end example -----------------------


On Mon, Jul 11, 2011 at 12:30 PM, Robert Picard <[email protected]> wrote:
> I'm with Nick: the rules for grouping companies are still unclear. As
> far as I can tell, Subrata's code simply makes a list of each
> shareholders per company. If all you want is to identify companies
> that have exactly the same shareholders, then Subrata's approach can
> be simplified to
>
> * --------------------- begin example ---------------------
> clear all
> input str5 company str7 holder
> compA holderA
> compB holderA
> compC holderL
> compD holderH
> compA holderB
> compB holderB
> compC holderA
> compC holderB
> compD holderD
> compE holderA
> end
> tempfile f
> qui save "`f'"
>
> use "`f'", clear
> sort company holder
> by company: gen N = _N
> sum N, meanonly
> forvalues i = 1/`=r(max)' {
>  by company: gen h`i' = holder[`i']
> }
> by company: keep if _n == 1
> drop holder
> list, noobs
>
> sort h* company
> by h*: gen one = _n == 1
> gen cogroup = sum(one)
> list, sepby(cogroup) noobs
> * --------------------- end example -----------------------
>
> On Mon, Jul 11, 2011 at 9:37 AM, Nick Cox <[email protected]> wrote:
>> Sorry, but I don't still understand exactly what your problem is. Your example is still incomplete and/or unexplained, and saying that Subrata understands it unfortunately does not help me.
>>
>> Specifically, I do grasp that you want to group companies, but I don't understand the exact criteria. What's to stop there being one big group, that they are all connected by overlapping holdings?
>>
>> Also, your problems with Subrata's code are as I predicted, namely that it would fall over on large problems.
>>
>> It's possible that your problem is that tackled by Robert Picard's -group_id- (SSC).
>>
>> I shall be travelling myself very soon and not easily able to access email or send (long) emails. Very likely someone else can help more.
>>
>> Nick
>> [email protected]
>>
>> Dalhia
>>
>> Thanks for the response. I apologize for the delay in getting back. I was in a small village last week, and internet connectivity was hard to come by.
>>
>> Nick and Austin, The problem is exactly the way Subrata has described it. I have shareholders who invest in the same companies, and I need to identify the companies that have these common shareholders. For example, below holderA owns shares in both compA and compB, and holderB also holds shares in the same companies - compA and compB. I want to identify these companies that are connected indirectly through common shareholders.
>>
>> example:
>> compA holderA
>> compB holderA
>> compA holderB
>> compB holderB
>> compC holderB
>>
>> Nick, I left out a few rows in my second email since the problem is confusing, and so I decided to stick only to the most pertinent cases.
>>
>> Austin, the problem is different from the household_id example since there is no one column by which companies are connected. What is connecting compA and compB is not that both are owned by holderA, but that there are other owners (holderB) who also own shares in the same companies.
>>
>> Finally Subrata, the code you've developed works really well with the set of companies listed above. But as soon as I use it on the original data (with 105,000) rows. I get an error message after I run the "foreach x of local temp1{" code. The error message says  "too many literals."
>>
>> Here is how it looks:
>>
>> . foreach x of local temp1{
>>  2.
>> .     levelsof holder if company=="`x'", local(temp2)
>>  3.
>> .     foreach y of local temp2{
>>  4.
>> .         local temp3 "`temp3'"+" "+"`y'"
>>  5.
>> .     }
>>  6.
>> .     replace hold_list="`temp3'" if company=="`x'"
>>  7.
>> .     local temp3=""
>>  8.
>> . }
>> `"citigroupglobalmarketsmauritius"' `"emiratesbankinternationalpjsc"' `"emiratesbankintnlpjsc"' `"icicibank"' `"idbiship
>>> services"' `"licofindiamoneyplus"' `"lifeinsurancecorporationsofindia"' `"norgesbank"' `"reliancecapital"' `"reliancec
>>> apitalreliancelongtermequityfund"' `"sbimf"' `"sundarambnpparibasmutualfund"' `"swissfinancescorporationsmauritius"' `
>>> "witeco"'
>> ..........
>>
>> too many literals
>> r(130);
>>
>> Thanks. I will appreciate any suggestions on how to fix this.
>>
>> Best
>> Dalhia Mani
>>
>> --- On Thu, 7/7/11, Nick Cox <[email protected]> wrote:
>>
>>> From: Nick Cox <[email protected]>
>>> Subject: RE: st: recognizing patterns within two columns of data
>>> To: "'[email protected]'" <[email protected]>
>>> Date: Thursday, July 7, 2011, 8:59 PM
>>> Thanks for your thanks. I would
>>> rather hold on further comment until such time as Dalhia
>>> explains the problem in a way that I can understand.
>>>
>>> Nick
>>> [email protected]
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]]
>>> On Behalf Of SUBRATA BHATTACHARYYA
>>> Sent: 07 July 2011 18:53
>>> To: [email protected]
>>> Subject: Re: st: recognizing patterns within two columns of
>>> data
>>>
>>> Hi Nick,
>>>
>>> Thanks for pointing out the much bigger problem! It was my
>>> bad that I
>>> missed -levelsof-. I made few changes to the code which I
>>> sent earlier
>>> and I think all the problems are taken care of. The way I
>>> understood
>>> Dalhia's problem was that Dalhia wanted to get multiple
>>> company's name
>>> together which are linked by multiple common owners. I
>>> think, Dahlia
>>> will be the best person to judge if we have understood his
>>> problem
>>> correctly. Nonetheless, here is the new code:
>>>
>>> split comp_hold
>>> ren comp_hold1 company
>>> ren comp_hold2 holder
>>> sort company holder
>>> gen hold_list=""
>>> gen comp_list=""
>>> levelsof company, local(temp1)
>>> foreach x of local temp1{
>>>     levelsof holder if company=="`x'",
>>> local(temp2)
>>>     foreach y of local temp2{
>>>         local temp3
>>> "`temp3'"+" "+"`y'"
>>>     }
>>>     replace hold_list="`temp3'" if
>>> company=="`x'"
>>>     local temp3=""
>>> }
>>>
>>> egen group_hold=group(hold_list)
>>> su group_hold, meanonly
>>> forval i = 1/`r(max)'{
>>>     levelsof company if group_hold==`i',
>>> local(temp4)
>>>     foreach j of local temp4{
>>>         local temp5
>>> "`temp5'"+" "+"`j'"
>>>     }
>>>     replace comp_list="`temp5'" if
>>> group_hold==`i'
>>>     local temp5=""
>>> }
>>>
>>> duplicates drop hold_list, force
>>> list hold_list comp_list
>>>
>>>
>>> I would be very glad in case you can further suggest on how
>>> to
>>> optimize it. My motivations are purely educational and I am
>>> sure many
>>> members in this list would immensely benefit of your
>>> valuable
>>> suggestions!
>>>
>>> Regards,
>>> Subrata Bhattacharyya
>>>
>>>
>>> On Thu, Jul 7, 2011 at 3:41 PM, Nick Cox <[email protected]>
>>> wrote:
>>> > The advice here sounds an appropriate caution, but
>>> much bigger problems with this solution are not mentioned.
>>> >
>>> > Note that -vallist- (SSC) doesn't do here anything
>>> that -levelsof- (official Stata) does not do. In fact, there
>>> is much more engineering behind -levelsof-, which is just
>>> -vallist- made official, and much more tested for larger
>>> sets of values. (The main reasons for -vallist- to continue
>>> to be visible are nothing to do with anything used here.)
>>> >
>>> > Further, commands like
>>> >
>>> > local temp1=r(list)
>>> >
>>> > will just truncate their arguments at 244 characters,
>>> so this code won't work for any serious dataset. Fixing this
>>> by something like
>>> >
>>> > local temp1 `r(list)'
>>> >
>>> > would remove that problem. The sticking-point for this
>>> solution then becomes the same kind of problem in another
>>> guise, namely an assumption that a list of holders can be
>>> held within a string variable, which cannot be more than 244
>>> characters long.
>>> >
>>> > Without knowing anything about Dalhia's real data, my
>>> guess is that such an assumption may bite, so watch out.
>>> >
>>> > Nick
>>> > [email protected]
>>> >
>>> > P.S. On a matter of style, note that Subrata's code
>>> >
>>> > egen group_hold=group(hold_list)
>>> > tostring group_hold, replace
>>> > vallist group_hold
>>> > local temp3=r(list)
>>> > foreach x of local temp3{
>>> > vallist company if group_hold=="`x'"
>>> > local temp4=r(list)
>>> > replace comp_list="`temp4'" if group_hold=="`x'"
>>> > }
>>> >
>>> > incorporates some needless to-and-fro, turning a
>>> well-behaved integer variable into a string and then calling
>>> up -vallist- when the answer is predictable in advance:
>>> >
>>> > egen group_hold=group(hold_list)
>>> > su group_hold, meanonly
>>> > forval x = 1/`r(max)' {
>>> >        vallist company if group_hold==`x'
>>> >        replace comp_list="`r(list)''" if
>>> group_hold==`x'
>>> > }
>>> >
>>> > should have the same effect. However, this is just
>>> tinkering, as the larger problems mentioned above still
>>> remain.
>>> >
>>> > SUBRATA BHATTACHARYYA
>>> >
>>> > You might want to try this: (though you would need a
>>> package vallist
>>> > for this, please use -findit- to locate and install)
>>> > I stored the data (you provided) in a variable named
>>> as comp_hold and
>>> > then split them into company and holder. Then I used
>>> vallist to
>>> > identify distinct observation and used that in a macro
>>> to get this
>>> > output:
>>> >
>>>  +-------------------------------------------------+
>>> >      |       hold_list
>>> comp_list |
>>> >
>>>  |---------------------------------------------------|
>>> >   1. | holderA holderB   compA compB |
>>> >   2. |         holderB
>>>  compC |
>>> >
>>>  +------------------------------------------------+
>>> >
>>> > I hope this works. This is what I wrote:
>>> > split comp_hold
>>> > ren comp_hold1 company
>>> > ren comp_hold2 holder
>>> > sort company holder
>>> > gen hold_list=""
>>> > gen comp_list=""
>>> > vallist company
>>> > local temp1=r(list)
>>> > foreach x of local temp1{
>>> > vallist holder if company=="`x'"
>>> > local temp2=r(list)
>>> > replace hold_list="`temp2'" if company=="`x'"
>>> > }
>>> > egen group_hold=group(hold_list)
>>> > tostring group_hold, replace
>>> > vallist group_hold
>>> > local temp3=r(list)
>>> > foreach x of local temp3{
>>> > vallist company if group_hold=="`x'"
>>> > local temp4=r(list)
>>> > replace comp_list="`temp4'" if group_hold=="`x'"
>>> > }
>>> > duplicates drop hold_list, force
>>> > list hold_list comp_list
>>> > I hope this works for you. FYI, I used Stata 11.2.
>>> Just one small
>>> > advice, please be sure that vallist can capture all
>>> the company names
>>> > or holder names at one go, I am not sure whether it
>>> can return a full
>>> > list of the names if your data set is too large. In
>>> that case, you
>>> > might want to split your file
>>> into manageable pieces.
>>> >
>>> > On Thu, Jul 7, 2011 at 11:37 AM, Dalhia <[email protected]>
>>> wrote:
>>> >
>>> >> Hello, Thanks. But egen group won't work since the
>>> holders are not the same. CompA and B (which I want grouped
>>> together) are owned by holderA and by holderB. The link is
>>> that these two companies are owned by people who also own
>>> shares in the other company - holderA owns shares in compA
>>> and also compB; similarly holderB owns shares in compA and
>>> also in compB. I want to identify those companies that are
>>> linked by multiple common owners.
>>> >>
>>> >> Example:
>>> >> compA holderA
>>> >> compB holderA
>>> >> compA holderB
>>> >> compB holderB
>>> >> compC holderB
>>> >>
>>> >> What I want:
>>> >> compA group1
>>> >> compB group1
>>> >>
>>> >> Thanks for your help. I appreciate it.
>>> >>
>>> >> Dalhia
>>> >>
>>> >> --- On Wed, 7/6/11, Nick Cox <[email protected]>
>>> wrote:
>>> >>
>>> >> > From: Nick Cox <[email protected]>
>>> >> > Subject: RE: st: recognizing patterns within
>>> two columns of data
>>> >> > To: "'[email protected]'"
>>> <[email protected]>
>>> >> > Date: Wednesday, July 6, 2011, 7:50 PM
>>> >> > -egen, group()- ?
>>> >> >
>>> >> > Nick
>>> >> > [email protected]
>>> >> >
>>> >> >
>>> >> > Austin Nichols
>>> >> >
>>> >> > Do you want to make an identifier as in
>>> >> > http://www.stata.com/statalist/archive/2011-07/msg00170.html
>>> >> > ?
>>> >> >
>>> >> > On Wed, Jul 6, 2011 at 10:12 AM, Dalhia
>>> <[email protected]>
>>> >> > wrote:
>>> >> > >
>>> >> > > I would like some advice on how to do
>>> the following.
>>> >> > Here is how the data looks:
>>> >> > >
>>> >> > > compA holderA
>>> >> > > compB holderA
>>> >> > > compC holderL
>>> >> > > compD holderH
>>> >> > > compA holderB
>>> >> > > compB holderB
>>> >> > > compC holderB
>>> >> > >
>>> >> > > Above, there was more than one instance
>>> where compA
>>> >> > and compB had the same holder. In a large
>>> database, how do I
>>> >> > identify instances where a set of comps
>>> appear repeatedly
>>> >> > with the same holders?
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
References:
- RE: st: recognizing patterns within two columns of data
  - From: Nick Cox <[email protected]>
- RE: st: recognizing patterns within two columns of data
  - From: Dalhia <[email protected]>
- RE: st: recognizing patterns within two columns of data
  - From: Nick Cox <[email protected]>
- Re: st: recognizing patterns within two columns of data
  - From: Robert Picard <[email protected]>
Prev by Date: Re: st: Second derivative for GMM
Next by Date: Re: st: providing raw weights for multivariate meta-analysis
Previous by thread: Re: st: recognizing patterns within two columns of data
Next by thread: RE: st: recognizing patterns within two columns of data
Index(es):
- Date
- Thread