Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: recognizing patterns within two columns of data

From	SUBRATA BHATTACHARYYA <[email protected]>
To	[email protected]
Subject	Re: st: recognizing patterns within two columns of data
Date	Thu, 7 Jul 2011 23:23:04 +0530

Hi Nick,

Thanks for pointing out the much bigger problem! It was my mistake to
overlook -levelsof-. I have made a few changes to the code I sent earlier,
and I think the problems are now taken care of. As I understood it,
Dalhia wanted to group together the names of companies that are linked
by multiple common owners. Dalhia will be the best person to judge
whether we have understood the problem correctly. Nonetheless, here is
the new code:

* Split the combined string into company and holder.
split comp_hold
ren comp_hold1 company
ren comp_hold2 holder
sort company holder
gen hold_list=""
gen comp_list=""

* For each company, store the list of its distinct holders.
* The -clean- option returns the names without compound quotes.
levelsof company, local(temp1)
foreach x of local temp1 {
	levelsof holder if company=="`x'", local(temp2) clean
	replace hold_list="`temp2'" if company=="`x'"
}

* Companies with identical holder lists get the same group number.
egen group_hold=group(hold_list)
su group_hold, meanonly
forval i = 1/`r(max)' {
	levelsof company if group_hold==`i', local(temp4) clean
	replace comp_list="`temp4'" if group_hold==`i'
}

duplicates drop hold_list, force
list hold_list comp_list
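
For comparison, the two lists can probably also be built without any loop over -levelsof- results, using running concatenation under -by-. What follows is only a sketch on Dalhia's five-row example, not a tested replacement for the code above. The variable names company, holder, hold_list and comp_list follow the thread; the -input- block is an assumption standing in for the real data, and each company-holder pair is assumed to appear only once. The lists still live in string variables, so the string-length limit raised in the quoted message below applies here too.

```stata
* Sketch on Dalhia's five-row example (assumed stand-in for the real data).
clear
input str5 company str7 holder
"compA" "holderA"
"compB" "holderA"
"compA" "holderB"
"compB" "holderB"
"compC" "holderB"
end

* Build each company's holder list by running concatenation within company.
bysort company (holder): gen hold_list = holder if _n == 1
by company: replace hold_list = hold_list[_n-1] + " " + holder if _n > 1
by company: replace hold_list = hold_list[_N]

* Keep one row per company, then repeat the trick within each holder list.
by company: keep if _n == 1
bysort hold_list (company): gen comp_list = company if _n == 1
by hold_list: replace comp_list = comp_list[_n-1] + " " + company if _n > 1
by hold_list: keep if _n == _N
list hold_list comp_list, noobs
```

On the example this leaves two rows, "holderA holderB" with "compA compB" and "holderB" with "compC", which matches the output shown in the earlier message quoted below.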


I would be very glad if you could suggest further ways to optimize it.
My motivation is purely educational, and I am sure many members of
this list would benefit immensely from your valuable suggestions!
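
One possible direction, given the string-length concern in the quoted message below, is to avoid building lists altogether and instead count shared holders for each pair of companies with -joinby-. This is a different technique from the code above, sketched here on Dalhia's example under the assumption that "linked by multiple common owners" means two or more shared holders. The names company2 and n_common and the tempfile pairs are hypothetical, introduced only for illustration. Because of the tempfile, the lines need to be run together from a do-file.

```stata
* Sketch: pairs of companies sharing two or more holders (Dalhia's example).
clear
input str5 company str7 holder
"compA" "holderA"
"compB" "holderA"
"compA" "holderB"
"compB" "holderB"
"compC" "holderB"
end

* Guard against repeated company-holder rows.
duplicates drop company holder, force

* Save a copy with company renamed, then pair up companies within holder.
preserve
rename company company2
tempfile pairs
save `pairs'
restore
joinby holder using `pairs'

* Keep each unordered pair once and count the holders it shares.
keep if company < company2
bysort company company2: gen n_common = _N
by company company2: keep if _n == 1
keep if n_common >= 2
list company company2 n_common, noobs
```

On the example only the pair compA and compB survives, with n_common equal to 2. Note that -joinby- forms every pair of companies within each holder, so the intermediate dataset can become very large when some holders own shares in many companies.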

Regards,
Subrata Bhattacharyya


On Thu, Jul 7, 2011 at 3:41 PM, Nick Cox <[email protected]> wrote:
> The advice here sounds an appropriate caution, but much bigger problems with this solution are not mentioned.
>
> Note that -vallist- (SSC) doesn't do here anything that -levelsof- (official Stata) does not do. In fact, there is much more engineering behind -levelsof-, which is just -vallist- made official, and much more tested for larger sets of values. (The main reasons for -vallist- to continue to be visible are nothing to do with anything used here.)
>
> Further, commands like
>
> local temp1=r(list)
>
> will just truncate their arguments at 244 characters, so this code won't work for any serious dataset. Fixing this by something like
>
> local temp1 `r(list)'
>
> would remove that problem. The sticking-point for this solution then becomes the same kind of problem in another guise, namely an assumption that a list of holders can be held within a string variable, which cannot be more than 244 characters long.
>
> Without knowing anything about Dalhia's real data, my guess is that such an assumption may bite, so watch out.
>
> Nick
> [email protected]
>
> P.S. On a matter of style, note that Subrata's code
>
> egen group_hold=group(hold_list)
> tostring group_hold, replace
> vallist group_hold
> local temp3=r(list)
> foreach x of local temp3{
> vallist company if group_hold=="`x'"
> local temp4=r(list)
> replace comp_list="`temp4'" if group_hold=="`x'"
> }
>
> incorporates some needless to-and-fro, turning a well-behaved integer variable into a string and then calling up -vallist- when the answer is predictable in advance:
>
> egen group_hold=group(hold_list)
> su group_hold, meanonly
> forval x = 1/`r(max)' {
>        vallist company if group_hold==`x'
>        replace comp_list="`r(list)'" if group_hold==`x'
> }
>
> should have the same effect. However, this is just tinkering, as the larger problems mentioned above still remain.
>
> SUBRATA BHATTACHARYYA
>
> You might want to try this (you will need the -vallist- package;
> please use -findit- to locate and install it).
> I stored the data you provided in a variable named comp_hold and
> then split it into company and holder. Then I used -vallist- to
> identify the distinct values and used them in a macro to get this
> output:
>      +-------------------------------+
>      |       hold_list     comp_list |
>      |-------------------------------|
>   1. | holderA holderB   compA compB |
>   2. |         holderB         compC |
>      +-------------------------------+
>
> I hope this works. This is what I wrote:
> split comp_hold
> ren comp_hold1 company
> ren comp_hold2 holder
> sort company holder
> gen hold_list=""
> gen comp_list=""
> vallist company
> local temp1=r(list)
> foreach x of local temp1{
> vallist holder if company=="`x'"
> local temp2=r(list)
> replace hold_list="`temp2'" if company=="`x'"
> }
> egen group_hold=group(hold_list)
> tostring group_hold, replace
> vallist group_hold
> local temp3=r(list)
> foreach x of local temp3{
> vallist company if group_hold=="`x'"
> local temp4=r(list)
> replace comp_list="`temp4'" if group_hold=="`x'"
> }
> duplicates drop hold_list, force
> list hold_list comp_list
> I hope this works for you. FYI, I used Stata 11.2. One small piece of
> advice: please make sure that -vallist- can capture all the company or
> holder names in one go. I am not sure whether it can return the full
> list of names if your data set is very large. In that case, you
> might want to split your file into manageable pieces.
>
> On Thu, Jul 7, 2011 at 11:37 AM, Dalhia <[email protected]> wrote:
>
>> Hello, Thanks. But egen group won't work since the holders are not the same. CompA and B (which I want grouped together) are owned by holderA and by holderB. The link is that these two companies are owned by people who also own shares in the other company - holderA owns shares in compA and also compB; similarly holderB owns shares in compA and also in compB. I want to identify those companies that are linked by multiple common owners.
>>
>> Example:
>> compA holderA
>> compB holderA
>> compA holderB
>> compB holderB
>> compC holderB
>>
>> What I want:
>> compA group1
>> compB group1
>>
>> Thanks for your help. I appreciate it.
>>
>> Dalhia
>>
>> --- On Wed, 7/6/11, Nick Cox <[email protected]> wrote:
>>
>> > From: Nick Cox <[email protected]>
>> > Subject: RE: st: recognizing patterns within two columns of data
>> > To: "'[email protected]'" <[email protected]>
>> > Date: Wednesday, July 6, 2011, 7:50 PM
>> > -egen, group()- ?
>> >
>> > Nick
>> > [email protected]
>> >
>> >
>> > Austin Nichols
>> >
>> > Do you want to make an identifier as in
>> > http://www.stata.com/statalist/archive/2011-07/msg00170.html
>> > ?
>> >
>> > On Wed, Jul 6, 2011 at 10:12 AM, Dalhia <[email protected]>
>> > wrote:
>> > >
>> > > I would like some advice on how to do the following.
>> > Here is how the data looks:
>> > >
>> > > compA holderA
>> > > compB holderA
>> > > compC holderL
>> > > compD holderH
>> > > compA holderB
>> > > compB holderB
>> > > compC holderB
>> > >
>> > > Above, there was more than one instance where compA
>> > and compB had the same holder. In a large database, how do I
>> > identify instances where a set of comps appear repeatedly
>> > with the same holders?
>> >
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>