Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: recognizing patterns within two columns of data
From: SUBRATA BHATTACHARYYA <[email protected]>
To: [email protected]
Subject: Re: st: recognizing patterns within two columns of data
Date: Thu, 7 Jul 2011 23:23:04 +0530
Hi Nick,
Thanks for pointing out the much bigger problem! It was my mistake to
overlook -levelsof-. I have made a few changes to the code I sent earlier,
and I think all the problems are now taken care of. As I understood it,
Dalhia wanted to group together the names of companies that are linked
by multiple common owners. Dalhia is the best person to judge whether
we have understood the problem correctly. Nonetheless, here is the new
code:
* split the combined variable into company and holder
split comp_hold
ren comp_hold1 company
ren comp_hold2 holder
sort company holder
gen hold_list=""
gen comp_list=""
* for each company, build a space-separated list of its holders
levelsof company, local(temp1)
foreach x of local temp1 {
    levelsof holder if company=="`x'", local(temp2)
    local temp3
    foreach y of local temp2 {
        local temp3 `temp3' `y'
    }
    replace hold_list="`temp3'" if company=="`x'"
}
* companies with identical holder lists get the same group number
egen group_hold=group(hold_list)
su group_hold, meanonly
forval i = 1/`r(max)' {
    levelsof company if group_hold==`i', local(temp4)
    local temp5
    foreach j of local temp4 {
        local temp5 `temp5' `j'
    }
    replace comp_list="`temp5'" if group_hold==`i'
}
* keep one row per distinct holder list
duplicates drop hold_list, force
list hold_list comp_list
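For comparison, here is a minimal sketch of a different route to the same question, one that never stores a list of holders in a string variable. It is not the code above: it pairs up companies with the official -joinby- command and counts the holders each pair shares. It assumes the data already hold one row per holding in two string variables named company and holder (as created above); the names company2, n_common and the tempfile holdings are made up for illustration.

```stata
* Sketch only: count the holders shared by each pair of companies.
* Assumes string variables company and holder, one row per holding.
preserve
keep company holder
duplicates drop company holder, force
tempfile holdings
save `holdings'
* pair every company with every other company that has the same holder
rename company company2
joinby holder using `holdings'
* keep each unordered pair once and drop self-pairs
keep if company < company2
* n_common = number of holders the two companies have in common
bysort company company2: gen n_common = _N
by company company2: keep if _n == 1
* pairs linked by multiple common owners
keep if n_common >= 2
list company company2 n_common
restore
```

On Dalhia's five-row example this should leave only the pair compA and compB, which share holderA and holderB. Note that -joinby- forms every pair within each holder, so the intermediate dataset can become large when one holder owns shares in many companies.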
I would be very glad if you could suggest further ways to optimize it.
My motivations are purely educational, and I am sure many members of
this list would benefit immensely from your valuable suggestions!
Regards,
Subrata Bhattacharyya
On Thu, Jul 7, 2011 at 3:41 PM, Nick Cox <[email protected]> wrote:
> The advice here sounds an appropriate caution, but much bigger problems with this solution are not mentioned.
>
> Note that -vallist- (SSC) doesn't do here anything that -levelsof- (official Stata) does not do. In fact, there is much more engineering behind -levelsof-, which is just -vallist- made official, and much more tested for larger sets of values. (The main reasons for -vallist- to continue to be visible are nothing to do with anything used here.)
>
> Further, commands like
>
> local temp1=r(list)
>
> will just truncate their arguments at 244 characters, so this code won't work for any serious dataset. Fixing this by something like
>
> local temp1 `r(list)'
>
> would remove that problem. The sticking-point for this solution then becomes the same kind of problem in another guise, namely an assumption that a list of holders can be held within a string variable, which cannot be more than 244 characters long.
>
> Without knowing anything about Dalhia's real data, my guess is that such an assumption may bite, so watch out.
>
> Nick
> [email protected]
>
> P.S. On a matter of style, note that Subrata's code
>
> egen group_hold=group(hold_list)
> tostring group_hold, replace
> vallist group_hold
> local temp3=r(list)
> foreach x of local temp3{
> vallist company if group_hold=="`x'"
> local temp4=r(list)
> replace comp_list="`temp4'" if group_hold=="`x'"
> }
>
> incorporates some needless to-and-fro, turning a well-behaved integer variable into a string and then calling up -vallist- when the answer is predictable in advance:
>
> egen group_hold=group(hold_list)
> su group_hold, meanonly
> forval x = 1/`r(max)' {
> vallist company if group_hold==`x'
> replace comp_list="`r(list)'" if group_hold==`x'
> }
>
> should have the same effect. However, this is just tinkering, as the larger problems mentioned above still remain.
>
> SUBRATA BHATTACHARYYA
>
> You might want to try this (you will need the package vallist
> for it; please use -findit- to locate and install it).
> I stored the data you provided in a variable named comp_hold and
> then split it into company and holder. Then I used vallist to
> identify the distinct values and used them in a macro to get this
> output:
>      +---------------------------------+
>      | hold_list         comp_list     |
>      |---------------------------------|
>   1. | holderA holderB   compA compB   |
>   2. | holderB           compC         |
>      +---------------------------------+
>
> I hope this works. This is what I wrote:
> split comp_hold
> ren comp_hold1 company
> ren comp_hold2 holder
> sort company holder
> gen hold_list=""
> gen comp_list=""
> vallist company
> local temp1=r(list)
> foreach x of local temp1{
> vallist holder if company=="`x'"
> local temp2=r(list)
> replace hold_list="`temp2'" if company=="`x'"
> }
> egen group_hold=group(hold_list)
> tostring group_hold, replace
> vallist group_hold
> local temp3=r(list)
> foreach x of local temp3{
> vallist company if group_hold=="`x'"
> local temp4=r(list)
> replace comp_list="`temp4'" if group_hold=="`x'"
> }
> duplicates drop hold_list, force
> list hold_list comp_list
> I hope this works for you. FYI, I used Stata 11.2. One small piece
> of advice: please check that vallist can capture all the company
> names or holder names in one go. I am not sure whether it can return
> a full list of the names if your data set is very large. In that
> case, you might want to split your file into manageable pieces.
>
> On Thu, Jul 7, 2011 at 11:37 AM, Dalhia <[email protected]> wrote:
>
>> Hello, Thanks. But egen group won't work since the holders are not the same. CompA and B (which I want grouped together) are owned by holderA and by holderB. The link is that these two companies are owned by people who also own shares in the other company - holderA owns shares in compA and also compB; similarly holderB owns shares in compA and also in compB. I want to identify those companies that are linked by multiple common owners.
>>
>> Example:
>> compA holderA
>> compB holderA
>> compA holderB
>> compB holderB
>> compC holderB
>>
>> What I want:
>> compA group1
>> compB group1
>>
>> Thanks for your help. I appreciate it.
>>
>> Dalhia
>>
>> --- On Wed, 7/6/11, Nick Cox <[email protected]> wrote:
>>
>> > From: Nick Cox <[email protected]>
>> > Subject: RE: st: recognizing patterns within two columns of data
>> > To: "'[email protected]'" <[email protected]>
>> > Date: Wednesday, July 6, 2011, 7:50 PM
>> > -egen, group()- ?
>> >
>> > Nick
>> > [email protected]
>> >
>> >
>> > Austin Nichols
>> >
>> > Do you want to make an identifier as in
>> > http://www.stata.com/statalist/archive/2011-07/msg00170.html
>> > ?
>> >
> >> > On Wed, Jul 6, 2011 at 10:12 AM, Dalhia <[email protected]> wrote:
> >> > >
> >> > > I would like some advice on how to do the following. Here is how the data looks:
>> > >
>> > > compA holderA
>> > > compB holderA
>> > > compC holderL
>> > > compD holderH
>> > > compA holderB
>> > > compB holderB
>> > > compC holderB
>> > >
> >> > > Above, there was more than one instance where compA and compB had the same holder. In a large database, how do I identify instances where a set of comps appear repeatedly with the same holders?
>> >
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/