This is a tricky area. I don't have a definitive answer.
There is a substantial literature on these problems, under
headings such as inbuilt, spurious, or ratio correlation, going back
at least to Karl Pearson. I would
do a literature search using those headings. The references
I am aware of are in fields like hydrology or geology, so the journals
or texts may not be easily accessible to you and the examples may be
difficult to map on to your territory.
Some papers are indeed scary, making warnings like your questioner's,
although usually in more detail. But the main examples are
often toy problems, not like yours. Here is one salutary game:
. set obs 100
obs was 0, now 100
. gen x = uniform()
. gen y = uniform()
. gen z = uniform()
. gen yx = y/x
. gen zx = z/x
. scatter yx zx
. corr yx zx
(obs=100)
             |       yx       zx
-------------+------------------
          yx |   1.0000
          zx |   0.6744   1.0000
Many people in the social sciences would be very happy
to get an R-sq of 45% (0.6744 squared). Well, you don't need data at all;
you can do it by taking ratios of random numbers with a shared denominator.
More to the point:
. corr yx x
(obs=100)
             |       yx        x
-------------+------------------
          yx |   1.0000
           x |  -0.6172   1.0000
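A rough sketch of why this happens, worked on the log scale because the
algebra is simple there (and because 1/x does not even have a finite mean
when x is uniform on (0,1), so the raw-ratio correlations above vary a lot
from sample to sample). The assumptions are mine, not part of the game
above: x, y and z are independent and positive, and \sigma_x^2, \sigma_y^2,
\sigma_z^2 denote the variances of ln x, ln y and ln z.

```latex
\begin{align*}
\operatorname{Cov}\bigl(\ln(y/x),\,\ln(z/x)\bigr)
  &= \operatorname{Cov}(\ln y - \ln x,\ \ln z - \ln x) = \sigma_x^2, \\
\operatorname{Corr}\bigl(\ln(y/x),\,\ln(z/x)\bigr)
  &= \frac{\sigma_x^2}{\sqrt{(\sigma_y^2 + \sigma_x^2)(\sigma_z^2 + \sigma_x^2)}}, \\
\operatorname{Corr}\bigl(\ln(y/x),\,\ln x\bigr)
  &= \frac{-\sigma_x^2}{\sigma_x\sqrt{\sigma_y^2 + \sigma_x^2}}
   = \frac{-\sigma_x}{\sqrt{\sigma_y^2 + \sigma_x^2}}.
\end{align*}
```

With equal variances these come to 1/2 and -1/sqrt(2), about -0.71, even
though x, y and z have nothing to do with each other.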
So you should be worried -- but you can get a handle
on these problems by simulation, or perhaps bootstrapping.
Naturally sampling from distributions more relevant to
your data than the uniform is advisable.
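One way such a simulation might be set up in current Stata is sketched
below. The program name ratiosim, the lognormal draws, the seed and the
1,000 replications are all illustrative choices of mine; substitute
distributions and sample sizes closer to your own data.

```stata
* Sketch only: sampling distribution of the two "spurious" correlations
* when x, y and z are independent lognormal draws (an assumption).
capture program drop ratiosim
program define ratiosim, rclass
    drop _all
    set obs 100
    generate x  = exp(rnormal())
    generate y  = exp(rnormal())
    generate z  = exp(rnormal())
    generate yx = y/x
    generate zx = z/x
    * correlation between two ratios sharing a denominator
    correlate yx zx
    return scalar r_yx_zx = r(rho)
    * correlation between a ratio and its own denominator
    correlate yx x
    return scalar r_yx_x = r(rho)
end

set seed 12345
simulate r_yx_zx = r(r_yx_zx) r_yx_x = r(r_yx_x), reps(1000) nodots: ratiosim
summarize r_yx_zx r_yx_x, detail
```

The spread of the two saved correlations gives a benchmark for how large
a correlation can arise from the ratio construction alone, against which
the correlation in the real data can be judged.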
I think it's pretty clear that you need to gear
your analysis to the research question and to keep the
statistical issues secondary. What's most evident from your
details is that your two models are quite different, so
I would expect that to be echoed in the results.
One strategy might be to explain y/x as far as you can
in terms of variables other than x; and then to look
at the residuals from that model against x to see whether
structure is being missed. That would be a defence against
the charge that the same variable appears on both sides.
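A minimal sketch of that two-step strategy, on fake data so that it runs
as it stands. The variable names (gdp, inflow, pen, w1) and the
data-generating choices are hypothetical, and plain regress is only a
stand-in for whatever panel estimator you are actually using.

```stata
clear
set seed 2718
set obs 200

* Fake data for illustration only. inflow is generated independently
* of gdp, so any pattern in the final plot comes from the ratio itself.
generate gdp    = exp(rnormal(10, 1))
generate w1     = rnormal()                  // hypothetical other predictor
generate inflow = exp(rnormal(6, 1) + 0.5*w1)
generate pen    = inflow/gdp                 // "penetration ratio"

* Step 1: explain the ratio using variables other than GDP
regress pen w1

* Step 2: plot the residuals from that model against GDP
predict double res, residuals
scatter res gdp, xscale(log)
```

If the residuals show clear structure against GDP, that is the structure
the first model is missing; whether it is substantive or an artefact of
the ratio is what the simulations are meant to help judge.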
I think the bottom line is to acknowledge that ratioing
can induce artefacts, but to assert that that does not
rule out genuine relationships also existing.
Incidentally, my guess is that penetration ratio
would be better considered on a log scale, for
all sorts of reasons. I would expect some quirky small
economies to have very high penetration ratios, but
it's some years since I studied economics.
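One of those reasons can be made explicit. This is a sketch under
assumptions of my own: both models are fitted in logs, with the same other
covariates (w below is just a placeholder for them) and the same error
term e.

```latex
\begin{align*}
\ln(y/x) &= a + b\,\ln x + c\,w + e \\
\iff\quad \ln y &= a + (b + 1)\,\ln x + c\,w + e .
\end{align*}
```

So on the log scale the ratio model and the absolute model are the same
model with the same residuals. A coefficient of b = 0 on log GDP in the
ratio form corresponds to a coefficient of 1 in the absolute form, and a
negative b says that inflows rise less than proportionally with GDP.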
Nick
Jason Yackee wrote:
> I just received this question at a presentation of a paper
> and I wasn’t sure how to answer it.
>
> I have a panel data set, and a model that is of the general
> form: (y/x) = a + b + c+…+ x. My dependent variable (y/x) is
> the total dollar amount of foreign capital inflows
> that a host country receives in a given year, as a ratio of
> the host country’s GDP in that same year (annual capital
> inflows = y, gdp = x in the model above).
>
> This ratio is called the “penetration ratio” in the
> literature. I also included GDP on the right-hand side of
> the equation as a control for each country’s overall economic
> size. The GDP variable was a significant, negative predictor
> of the penetration ratio. Larger GDP → Less Penetration.
>
> The questioner said that it was improper to have “GDP” on
> “both sides of the equation”, and that it was sufficient to
> have a model of the form y = a + b + c +…+ x, where “x” is
> GDP and “y” is simply the dollar value of foreign capital
> inflows in absolute, not ratio, form. He couldn't explain
> why. I couldn't explain why not.
>
> I re-ran the model in the form the questioner suggested, and
> the results are overall quite different for the theoretically
> interesting independent variables. But my own sense is still
> that the questioner is wrong, and that my original model was
> not necessarily improperly specified. But I don’t have the
> mastery of statistics to justify my “sense”.
>
> Would some kind soul be able to weigh in before my next presentation?