Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: RE: Data Manipulation Question
From
Robert Picard <[email protected]>
To
"[email protected]" <[email protected]>
Subject
Re: st: RE: Data Manipulation Question
Date
Fri, 20 Sep 2013 15:37:01 -0400
This is a tricky problem if individuals can have more than
one job per quarter. As Joe Canner mentionned, -joinby- can
be used to form all pairwise combinations of employers
within groups of individual/quarter. By aligning each
quarter with the previous quarter, you can track each
transition from one employer to another.
Once all employer -> employer transitions that match the
desired threshold are identified, then -group_id- (SSC)
can be used to handle the tricky part of how to group
employer identifiers.
* --------------- begin example ---------------------------
clear
input indiv quarter employer1
1 1 1
1 2 1
1 3 1
1 3 9
1 4 2
1 4 9
1 5 2
2 2 2
2 3 2
2 4 2
3 1 1
3 2 1
3 3 1
3 4 2
3 5 2
4 2 3
4 3 3
4 4 3
5 3 1
5 4 2
5 5 2
6 3 1
6 4 1
6 5 2
6 6 2
7 3 1
7 4 2
8 2 5
8 2 7
8 3 20
9 2 5
9 3 20
end
* we assume this is true!
isid indiv quarter employer1
* save with the number of employees per quarter
bysort employer1 quarter: gen N1 = _N
list, sepby(employer1 quarter)
tempfile f
save "`f'"
* adjust time to match previous quarter and form all
* pairwise combination of employers
replace quarter = quarter - 1
rename (employer1 N1) (employer2 N2)
joinby indiv quarter using "`f'"
* count the number of employees that move from
* one employer to another per matched quarter
bysort employer1 quarter employer2: gen matchN = _N
gen ratio1 = matchN / N1
gen ratio2 = matchN / N2
* ignore cases where there is no change in employers
drop if employer1 == employer2
order employer1 N1 employer2 N2 quarter
list , sepby(employer1 employer2 quarter)
* reduce to >80% to >80% moves and id each move
keep if ratio1 >= .8 & ratio2 >= .8
bysort employer1 employer2: keep if _n == 1
gen idmove = _n
keep employer1 employer2 idmove
list
* reshape each move to match the orginal data
expand 2
bysort idmove: replace employer1 = employer2 if _n == 2
keep employer1 idmove
list
* use -group_id- (from SSC) to handle the merging of ids
append using "`f'"
egen newid = group(employer1)
group_id newid, matchby(idmove)
drop if !mi(idmove)
drop idmove
isid indiv quarter employer1, sort
list, sepby(indiv)
tab employer1 newid
* --------------- end example -----------------------------
On Thu, Sep 19, 2013 at 2:31 PM, Joe Canner <[email protected]> wrote:
> Alex,
>
> Very interesting problem...
>
> First, I assume when you say that the individual ID is "not unique" you mean that there are multiple records for the same people at different time points, not that there are ID numbers that are shared my more than one person? With that assumption, I have some thoughts:
>
> 1. Do you have exact dates or just quarter? If you have dates, or even a sequence within quarter, for each employee you could save the last EIN in Q1 and the first EIN in Q2, thus making the employee IDs unique and allowing a 1:1 merge.
>
> 2. One way to deal with the non-uniqueness is to use -joinby-, which produces an observation for all possible combinations of EINS observed between the two quarters. Unless a lot of people are changing jobs a lot of times, this shouldn't generate a lot of extra data and the result will allow you to look for certain high-frequency EIN pairs, just as you were hoping to do with -merge-.
>
> 3. Alternatively, depending on the magnitude of non-unique IDs, you might even be able to arbitrarily delete duplicates and still have enough data to be able to see the patterns you are looking for.
>
> 4. Depending on how much job movement there is (i.e., if there aren't huge layoffs or huge hiring binges in a single quarter), you may even be able to get away with looking at frequencies by EIN and looking to see how much those frequencies change from quarter to quarter. If an employer goes from X employees in Q1 to X*0.2 in Q2, you could flag that as suspicious, then look for cases where an employer goes from Y employees in Q1 to Y*5 in Q2, and then try to match them up one-by-one by matching the frequency changes or, if necessary, by individual IDs. In other words, perhaps you don't necessarily need to track individuals in order to see patterns.
>
> Sorry, no great insights, just a few thoughts....
>
> Regards,
> Joe Canner
> Johns Hopkins University School of Medicine
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Alex Warofka
> Sent: Thursday, September 19, 2013 11:25 AM
> To: statalist
> Subject: st: Data Manipulation Question
>
> Hi,
>
> I'm trying to perform what seems like a relatively simple data
> manipulation task on a very large dataset (~20GB with 20 million
> observations), but having some difficulties wrapping my head around
> the best way to do so using Stata.
>
> I have four variables—individual ID (not unique because one individual
> can work for multiple employers in the same period), employer EIN, and
> quarter—and am attempting to flag events where >=80% of the employees
> working at a given EIN in quarter 1 move to the same different EIN in
> quarter 2 AND >=80% of the employees working at such an EIN in quarter
> 2 came from the same different EIN in quarter 1. In essence, the goal
> is to flag spurious transition events where an employer appears to
> change but in fact only their EIN has changed. This is the same
> procedure used in building the successor-predecessor file for the QWI
> and described in Census technical paper TP-2006-01.
>
> My initial thought was to use levelsof and loop over EINs, pulling a
> local macro containing the IDs of employees for each EIN, then looping
> through these employees to see where they are working in Q2 and so on.
> This doesn't work as I run into the 67,784 character macro length
> limit. Splitting the dataset by quarter, merging, and then using
> _merge to track individual movements between firms doesn't work as my
> IDs are not unique.
>
> Does anyone have any recommendations for handling this in Stata? At
> this point, I'm becoming tempted to just write a Ruby script to do
> this, but would be thrilled to discover it was possible in Stata.
>
> Thanks,
>
> --
> Alex Warofka
> Research Associate | California Center for Population Research, UCLA
> [email protected] | nomad.cm | @AlexWarofka
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/