Re: st: Social Network Analysis shortest path centrality
From: Michael Goodwin <[email protected]>
To: [email protected]
Subject: Re: st: Social Network Analysis shortest path centrality
Date: Wed, 4 Sep 2013 12:54:43 -0400
I'm trying an approach that loops over the observations within a given
co and blanks out any values that match any previous connection's
values. For some reason, the block of code beginning at
-forv totalCompanies = 1/$maxSource {- doesn't seem to work, despite
the fact that when I
clear
input source target
1 2
1 5
1 6
1 9
2 3
2 5
2 7
3 8
3 6
4 9
4 1
end
* Create macros with total number of companies
preserve
duplicates drop source, force
global maxSource = _N
restore
tempfile main
save "`main'"
* Check for duplicates
sort source target
isid source target
* Create dataset containing all connections
rename source co
gen level0 = co
local i 0
local more 1
local isdone co
while `more' {
local ++i
rename target source
joinby source using "`main'", unmatched(master)
drop _merge
rename source level`i'
replace target = . if inlist(target,`isdone')
local isdone `isdone', level`i'
forv totalCompanies = 1/$maxSource {
preserve
keep if co==`totalCompanies'
local maxConnections = _N
forv num = 1/`i' {
local count = `i'-`num'
forv totalConnections = 1/`maxConnections' {
set trace on
replace level`i' = . if level`i'==level`count'[`totalConnections'] & level`i'!=.
set trace off
}
}
restore
}
sort co level`i'
by co level`i': gen one = _n == 1 & !mi(level`i')
by co: egen connect`i' = total(one)
drop one
count if !mi(target)
local more = r(N)
qui compress
}
drop target
sort co level*
order co level* connect*
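
An alternative I've been sketching (untested) is to skip the
-forv totalCompanies- block entirely: reshape the level* variables to
long, keep each co/node pair only at the shortest depth at which it is
reached, and then count and score by depth. The names obs_id, depth,
shortest, nconnect, and score below are just placeholders.

* ------------------- begin sketch (untested) --------------------
capture drop connect*                   // not needed for this route
gen long obs_id = _n                    // tag each path/observation
reshape long level, i(obs_id) j(depth)  // one row per path and depth
drop if depth == 0                      // level0 is just the co itself
drop if missing(level)
* keep each co/node pair only at the shortest depth at which it occurs
bysort co level (depth): gen byte shortest = _n == 1
keep if shortest
drop shortest
* count the nodes first reached at each depth ...
contract co depth, freq(nconnect)
* ... and score them: each node contributes 1/depth, matching the
* (1/1)+(2/2) example from my original post
gen double score = nconnect/depth
collapse (sum) centrality = score, by(co)
list, noobs
* -------------------- end sketch --------------------------------
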
On Wed, Sep 4, 2013 at 11:10 AM, Michael Goodwin
<[email protected]> wrote:
> This is very useful and more efficient than what I had put together.
> The final piece of the puzzle is identifying which observations
> contain the shortest path between the "co" and a given node. In the
> example above, in the third observation, co 1 is connected to 9 in
> level4. However, in the tenth observation, co 1 is connected to 9 in
> level1. Is there an efficient way to replace a level value with a
> missing value when that co already reaches the same node at a shorter
> path length? I'm working on this final piece now, but would love to
> hear any thoughts. Thanks in advance for your help.
>
> On Wed, Sep 4, 2013 at 8:54 AM, Robert Picard <[email protected]> wrote:
>> You can stop a list of connections from looping back onto
>> itself by replacing target with a missing value if it
>> identifies a company that is already part of the list for
>> that observation. This can easily be done using -inlist()-.
>> Since -inlist()- does not handle a regular varlist, you can
>> use a local macro to build up the comma-separated variable
>> list as you go.
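>>
>> To make that concrete: with the variable names used in the example
>> below, the macro grows like this, so that by the third pass the
>> check expands to -inlist(target, co, level1, level2)-.
>>
>> local isdone co                  // before the first pass
>> local isdone `isdone', level1    // after pass 1
>> local isdone `isdone', level2    // after pass 2
>> * inside pass 3 the check therefore expands to:
>> * replace target = . if inlist(target, co, level1, level2)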
>>
>> Note that it really helps to create a toy dataset to sort
>> out these problems. Your proposed code does not stop looping
>> when used with the test data I'm using.
>>
>> * --------------------- begin example ---------------------
>> clear
>> input source target
>> 1 2
>> 1 5
>> 1 6
>> 1 9
>> 2 3
>> 2 5
>> 2 7
>> 5 8
>> 5 6
>> 8 9
>> 8 1
>> end
>> tempfile main
>> save "`main'"
>> * make sure the data has no duplicates
>> isid source target
>>
>> rename source co
>> local i 0
>> local more 1
>> local isdone co
>> while `more' {
>> local ++i
>> rename target source
>> joinby source using "`main'", unmatched(master)
>> drop _merge
>> rename source level`i'
>> replace target = . if inlist(target,`isdone')
>> local isdone `isdone', level`i'
>> sort co level`i'
>> by co level`i': gen one = _n == 1 & !mi(level`i')
>> by co: egen connect`i' = total(one)
>> drop one
>> count if !mi(target)
>> local more = r(N)
>> }
>> drop target
>> sort co level*
>> order co level* connect*
>> list, sepby(co) noobs
>> * --------------------- end example -----------------------
>>
>> On Tue, Sep 3, 2013 at 6:15 PM, Michael Goodwin
>> <[email protected]> wrote:
>>> I think I've figured out the issue. With this approach, when dropping
>>> values that are equal, you need to ensure that you aren't dropping
>>> observations just because both values are missing. Below is the code
>>> that drops only matching non-missing values. Following the loop are
>>> the final few lines that calculate centrality using the approach I
>>> mentioned in my initial post. Thanks for your help!
>>>
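>>> (Note: the trailing semicolons in the code below assume -#delimit ;-
>>> is in effect.)
>>>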
>>> * Create dataset containing all connections;
>>> rename Source company;
>>> local i 0;
>>> local more 1;
>>> while `more' {;
>>> local ++i;
>>> rename Target Source;
>>> joinby Source using `main', unmatched(master);
>>> drop _merge;
>>> rename Source level`i';
>>> sort company level`i';
>>> forv num = 1/`i' {;
>>> local count = `i'-`num';
>>> cap drop if level`i'==level`count' & level`i'!="";
>>> cap drop if company==level`count';
>>> };
>>> by company level`i': gen one = _n == 1 & !mi(level`i');
>>> by company: egen connect`i' = total(one);
>>> drop one;
>>> count if !mi(Target);
>>> local more = r(N);
>>> qui compress;
>>> };
>>>
>>> * Centrality;
>>> egen totalPath = rownonmiss(connect*), strok;
>>> local maxPath = totalPath;
>>> forv num = 1/`maxPath' {;
>>> gen connectCentrality`num' = connect`num'/`num';
>>> };
>>> egen centrality = rowtotal(connectCentrality*);
>>>
>>> On Tue, Sep 3, 2013 at 5:10 PM, Michael Goodwin
>>> <[email protected]> wrote:
>>>> Robert, this is very helpful and actually not so far from what I had
>>>> originally coded out.
>>>>
>>>> The only issue I am encountering now is that some of the networks
>>>> actually double back on themselves, so that the loop you've written
>>>> out continues on infinitely (or would if Stata didn't have observation
>>>> limits).
>>>>
>>>> I think the solution is to write code that drops any observations in
>>>> which the level`i' variable is equal to any of the preceding
>>>> level[`i'-1], level[`i'-2], etc. variables. Do you have any thoughts
>>>> on how to best accomplish that? My thinking was to create a loop that
>>>> compares the current level with each previous level and drops the
>>>> observation if any two values match. I have to use capture because
>>>> there is no level0. This doesn't seem to be working using my dataset.
>>>>
>>>>
>>>> * --------------------- begin example ---------------------
>>>> clear
>>>> input Source Target
>>>> 1 2
>>>> 1 5
>>>> 1 6
>>>> 1 9
>>>> 2 3
>>>> 2 5
>>>> 2 7
>>>> 5 8
>>>> 5 6
>>>> 8 9
>>>> end
>>>> tempfile main
>>>> save "`main'"
>>>> * make sure the data has no duplicates
>>>> isid Source Target
>>>>
>>>> rename Source co;
>>>> local i 0;
>>>> local more 1;
>>>> while `more' {;
>>>> local ++i;
>>>> rename Target Source;
>>>> joinby Source using `main', unmatched(master);
>>>> drop _merge;
>>>> rename Source level`i';
>>>> sort co level`i';
>>>> forv num = 1/`i' {;
>>>> local count = `i'-`num';
>>>> cap drop if level`i'==level`count';
>>>> cap drop if co==level`count';
>>>> };
>>>> by co level`i': gen one = _n == 1 & !mi(level`i');
>>>> by co: egen connect`i' = total(one);
>>>> drop one;
>>>> count if !mi(Target);
>>>> local more = r(N);
>>>> qui compress;
>>>> };
>>>> drop Target;
>>>> sort co level*;
>>>> order co level* connect*;
>>>> list, sepby(co) noobs;
>>>> * --------------------- end example -----------------------
>>>>
>>>> On Sun, Sep 1, 2013 at 11:44 AM, Robert Picard <[email protected]> wrote:
>>>>> There is indeed a problem in merging the list with itself,
>>>>> as it leads to many-to-many merges, and I have yet to see
>>>>> a case where an m:m merge is useful. You can, however, use
>>>>> -joinby- to perform what you would intuitively want -merge-
>>>>> to do in this case.
>>>>>
>>>>> * --------------------- begin example ---------------------
>>>>> clear
>>>>> input source target
>>>>> 1 2
>>>>> 1 5
>>>>> 1 6
>>>>> 1 9
>>>>> 2 3
>>>>> 2 5
>>>>> 2 7
>>>>> 5 8
>>>>> 5 6
>>>>> 8 9
>>>>> end
>>>>> tempfile main
>>>>> save "`main'"
>>>>> * make sure the data has no duplicates
>>>>> isid source target
>>>>>
>>>>> rename source co
>>>>> local i 0
>>>>> local more 1
>>>>> while `more' {
>>>>> local ++i
>>>>> rename target source
>>>>> joinby source using "`main'", unmatched(master)
>>>>> drop _merge
>>>>> rename source level`i'
>>>>> sort co level`i'
>>>>> by co level`i': gen one = _n == 1 & !mi(level`i')
>>>>> by co: egen connect`i' = total(one)
>>>>> drop one
>>>>> count if !mi(target)
>>>>> local more = r(N)
>>>>> }
>>>>> drop target
>>>>> sort co level*
>>>>> order co level* connect*
>>>>> list, sepby(co) noobs
>>>>> * --------------------- end example -----------------------
>>>>>
>>>>> Original message follows:
>>>>>
>>>>> st: Social Network Analysis shortest path centrality
>>>>>
>>>>> From: Michael Goodwin <[email protected]>
>>>>> To: [email protected]
>>>>> Subject: st: Social Network Analysis shortest path centrality
>>>>> Date: Fri, 30 Aug 2013 13:53:16 -0400
>>>>> Hi,
>>>>>
>>>>> I am trying to do some light social network analysis on a dataset
>>>>> containing a list of edges. I have the dataset organized such that
>>>>> there are two variables, Source and Target. Both Source and Target
>>>>> are companies, and the connection between them indicates that an
>>>>> employee from Source went on to found Target. The relationship between
>>>>> these two variables is indeterminate (i.e. m:m), and although the
>>>>> variables start as strings, I've converted them to numeric values
>>>>> using -encode- (and ensured that a given company receives the same
>>>>> numeric value whether it appears in Source or Target).
>>>>>
>>>>> I am attempting to determine the number of first-, second-, third-,
>>>>> ..., nth-degree connections that each Source has. For example, if an
>>>>> employee from Company A went on to found Company B and then employees
>>>>> from Company B went on to found Companies C and D, Company A would
>>>>> have 1 first-degree connection and 2 second-degree connections.
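>>>>>
>>>>> For concreteness, the edge list for that A/B/C/D example would be
>>>>> just three rows (shown here with string names, purely as an
>>>>> illustration, before -encode-):
>>>>>
>>>>> clear
>>>>> input str1 Source str1 Target
>>>>> "A" "B"
>>>>> "B" "C"
>>>>> "B" "D"
>>>>> end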
>>>>>
>>>>> My goal is to create something similar to a shortest-path measurement
>>>>> whereby a first-degree connection is equal to 1, a second-degree
>>>>> connection 1/2, a third-degree connection 1/3, and so forth. In the
>>>>> above example, Company A's score would be (1/1)+(2/2) or 2. I believe
>>>>> this is a closeness/shortest-path centrality approach, but I may be
>>>>> mistaken (and would love to be corrected!).
>>>>>
>>>>> After making the connections symmetric (i.e. all pairs are present as
>>>>> both inbound and outbound connections), I've attempted three
>>>>> approaches, all without success:
>>>>>
>>>>> 1. Use netsis and netsummarize. Neither the adjacency nor the closeness
>>>>> calculation seems to get me to the right answer. I don't have
>>>>> experience using Mata, but it appears that the matrix generated by
>>>>> netsis doesn't reflect the appropriate connections (i.e. a connection
>>>>> in the original edge list is not represented by a 1 in the matrix).
>>>>>
>>>>> netsis Source Target, measure(adjacency) name(A, replace);
>>>>> netsummarize A/(rows(A)-1), generate(degree) statistic(rowsum);
>>>>>
>>>>> netsis Source Target, measure(distance) name(D, replace)
>>>>> netsummarize (rows(D)-1):/rowsum(D), generate(closeness) statistic(rowsum)
>>>>>
>>>>> 2. Create a matrix data structure in Stata and use centpow. I keep
>>>>> receiving an error noting that the matrix is not symmetrical. I've
>>>>> checked and made sure that the dataset is a perfect square (it has 707
>>>>> observations and 707 variables) and that a connection between Company
>>>>> A and Company B is also represented by a connection between Company B
>>>>> and Company A. Does centpow require the data to actually be in a mata
>>>>> matrix?
>>>>>
>>>>> use ".\dta\\${connection}_connectionIDSymmetric${typeInt}.dta";
>>>>> contract targetID sourceID;
>>>>> reshape wide _freq , i(targetID) j(sourceID);
>>>>> qui foreach v of var _freq* {;
>>>>> replace `v' = 0 if mi(`v');
>>>>> };
>>>>> drop targetID;
>>>>> save ".\dta\\${connection}_adjacencyMatrix${typeInt}.dta", replace;
>>>>> centpow ".\dta\\${connection}_adjacencyMatrix${typeInt}.dta";
>>>>>
>>>>> 3. Start with the edgelist, and merge it with itself, changing the
>>>>> Target and Source variable names such that Target becomes Source for
>>>>> the second degree connection and so forth (I think this is
>>>>> demonstrably not the solution, so I won't elaborate further).
>>>>>
>>>>> I think this either has a simple solution involving the edge list that
>>>>> I can't think of, or will require a more intensive solution using
>>>>> Mata. If anyone has experience or could point me toward relevant
>>>>> resources (Statalist has limited SNA material), that would be a huge
>>>>> help.
>>>>>
>>>>> Here are some of the resources I've already reviewed:
>>>>> http://www.rensecorten.org/index.php/research/social-network-analysis-with-stata/
>>>>> https://sites.google.com/site/statagraphlibrary/netgen111
>>>>> http://www.ats.ucla.edu/stat/sna/sna_stata.htm
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Best,
>>>>>
>>>>> Mike
--
Mike Goodwin
Senior Associate, Endeavor Insight
900 Broadway | Suite 301 | New York, NY 10003
+1 646 368 6354 | Skype: michael.p.goodwin
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/