I must admit to being a bit nervous about discussing this on
Statalist, because I have not yet collected any data for it - although
I know exactly where all of it is, and I have collected some of it
already for another side-project - and someone else could easily pick
up the baton and say "Hmmm, I fancy a bit of that" and steal my
thunder. Oh well.
Anyway, I have come up with the idea of modelling the performance of
international football teams within Europe, from 1945 to the present.
It sounds like a fun idea, but then I am supposed to have more
important things to do. The regressand is a team's monthly World
Football Elo Rating (or WFER, if you can imagine such a thing). This
is a super regressand to have, not least because it has no upper
limit, and although it _does_ have a lower limit (zero), it is never
reached in practice. Key regressors of interest include:
(1) a veritable host of lagged variables (I hypothesise a positive [+]
effect on a team's WFER here);
(2) population size (but this is not a very suitable measure of 'size'
for football teams and will therefore be subject to measurement error)
[+];
(3) changes of manager or coach (but _not_ the men themselves) [-];
(4) May variables indicating if any of the ith country's clubs won a
European club competition that month [+];
(5) percentage changes in gross domestic product over 1 year to the
jth month [+];
(6) whether or not the ith country was under totalitarian rule in the
jth month (although using the slightly more nuanced Freedom House
ratings from 1972 onwards would also be a contender here) [-];
and
(7) fixed effects by country or time-point. These are not of
substantive interest, but they have to be included in order to adjust
the intercept to suit individual countries (years), so that
differences in mean WFER by country (year) do not bias the parameter
estimates for (1)-(6) (Franklin, 2004: 132) [n/a].
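For concreteness, the panel and a couple of the regressors above might be set up along the following lines. This is only a sketch: every variable name (country, mdate, wfer, gdp) is hypothetical, since no data have been collected yet, and mdate is assumed to be a Stata monthly date. The lag lengths are arbitrary examples too.

```stata
* Hypothetical variables: country (string), mdate (monthly date, %tm),
* wfer (Elo rating), gdp (level of GDP). None comes from an actual dataset.
encode country, generate(cid)      // numeric panel identifier
xtset cid mdate, monthly           // declare the country-month panel

* (1) lagged ratings via time-series operators (1, 6 and 12 months)
generate wfer_l1  = L1.wfer
generate wfer_l6  = L6.wfer
generate wfer_l12 = L12.wfer

* (5) percentage change in GDP over the 12 months to the current month
generate gdp_pc12 = 100 * (gdp - L12.gdp) / L12.gdp
```

Once the data are -xtset-, the L. operators respect panel boundaries, so a lag never reaches back into another country's series.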
This is the easy bit. The difficult bit is deciding on the most
suitable modelling strategy. As far as I can see, there are two
alternatives:
(a) pooled TSCS models in which lagged variables are 'legally'
admissible, using either -xtivreg-, -xtivreg2-, -xtpcse-, -xtlsdvc- or
-xtabond2- (routines 2, 4 and 5 are available from SSC). This would
allow me to estimate the importance of most of my key regressors of
interest across countries and time in one general model, but there are
two main hurdles. The first is the use of country fixed effects
(temporal FEs would not tell us anything about the countries). There
are now 53 affiliated associations to the Union of European Football
Associations (UEFA) - with at least two more joining soon - and
that is simply too many regressors in one model, denying it any
parsimony (Maddala, 1971). Almost every time I have used FEs, they have
destroyed much of the explanatory power of the more interesting
variables. Still, others have said much the same about lagged
variables (Achen, 2000), but here they capture the impact, or none
thereof, of a team's form over various time periods, so I must include
these. Second, using lagged variables would almost certainly mean
having to do some instrumenting, which I have found to be an equally
frustrating experience. 'Balancing' the panel of countries
down to those whose performances (re-)started straight after 1945
would introduce selection biases, so this is a non-starter;
or
(b) single-country Box-Jenkins ARIMA models, using -arima-. This has
the big advantage of removing the need to find and use any
of those pesky instruments for lagged dependent variables as well as
the need for country FEs. Also, ARIMA provides the nice advantage of
generating dynamic forecasts in Stata (Baum, 2004: 5-6). However,
there are two main hurdles here, too. One, I would still have to use
FEs, but temporal ones. Here, though, the issue is more complex: (y)
which temporal FEs should I use, months or years? (z) should I use
_both_? Either way, but especially with (z), we return to Maddala's
problem, only a hundred times worse. Two, this approach would mean
estimating anything from 32 to 53 different ARIMA models, and that
represents an intimidating consumption of time and effort.
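To make the two alternatives concrete, here is a rough sketch of what each might look like in a do-file. All variable names are hypothetical (cid, mdate, wfer, newcoach, euro_win, gdp_pc12, total), and the lag limits and the AR(1) specification are placeholders rather than identified models.

```stata
* Assumes a country-month panel with hypothetical variables: cid (numeric
* country id), mdate (monthly date, %tm), wfer, newcoach, euro_win,
* gdp_pc12 and total. None of these names comes from an actual dataset.
xtset cid mdate, monthly

* Calendar year and month, should temporal dummies be wanted
generate year  = yofd(dofm(mdate))
generate month = month(dofm(mdate))

* (a) Pooled model: -xtreg, fe- absorbs the country effects through the
* within transformation, so no country dummies appear as regressors.
xtreg wfer L.wfer newcoach euro_win gdp_pc12 total i.year, fe vce(cluster cid)

* (a) GMM alternative with -xtabond2- (ssc install xtabond2); the lag
* limits are arbitrary, and -collapse- keeps the instrument count down.
xtabond2 wfer L.wfer newcoach euro_win gdp_pc12 total, ///
    gmm(L.wfer, laglimits(1 6) collapse)               ///
    iv(newcoach euro_win gdp_pc12 total) robust

* (b) One ARIMA model per country, estimated in a loop. The AR(1)
* specification is a placeholder, not an identified model.
levelsof cid, local(countries)
foreach c of local countries {
    capture noisily arima wfer newcoach euro_win gdp_pc12 total ///
        if cid == `c', arima(1,0,0)
    if _rc == 0 {
        estimates store arima_`c'
    }
}
```

The loop at least takes the manual labour out of fitting 32 to 53 models, although each series would still need its own Box-Jenkins identification. After -arima-, -predict- with the dynamic() option produces the dynamic forecasts.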
So, just what is a (very) part-time football analyst to do in this
situation? Stimson (1985: 945) concluded that both of these modelling
approaches were 'winners' for different reasons. Answers on an
e-postcard, please.
--
Clive Nicholas
[Please DO NOT mail me personally here, but at
<[email protected]>. Thanks!]
Achen C (2000) "Why Lagged Dependent Variables Can Suppress the
Explanatory Power of Other Independent Variables", paper presented at
the Annual Meeting of the Political Methodology Section of the
American Political Science Association, University of California, Los
Angeles, July 20-22.
Baum CF (2004) "SUGUK 2004 Invited Lecture: Topics in Time Series
Modelling With Stata", paper presented at the 10th London Stata Users'
Group Meeting, City University, June 28.
Franklin MN (2004) Voter Turnout and the Dynamics of Electoral
Competition in Established Democracies Since 1945, Cambridge:
Cambridge University Press.
Maddala GS (1971) "The Use of Variance Components Models in Pooling
Cross Section and Time Series Data", Econometrica 39 (2): 341-58.
Stimson JA (1985) "Regression in Space and Time: A Statistical Essay",
American Journal of Political Science 29 (4): 914-47.