Dear All,
I hope this email finds you well.
About a week ago I asked a question on the Gini coefficient which didn't receive any replies, probably because it was about macroeconomic statistics and not Stata as such.
Does anybody know of a list specifically aimed at people who want to ask questions about data sources and the computation of statistics? To my mind, many of the international datasets, such as the World Development Indicators, are poorly referenced: it is unclear what the original source of the data is and whether all countries used consistent definitions. For example, how did the World Bank calculate per capita income for Chad? Did they undertake a household survey, did they just take the figure from another database, or did Chad's government statistics office calculate it? I am always reluctant to just plug the data in, and it would be useful if there were some list where I could speak to people who have looked into all this in detail.
Any reply would be greatly appreciated.
Kind Regards,
Daniel Wilde
________________________________________
From: [email protected] [[email protected]] On Behalf Of Nick Cox [[email protected]]
Sent: Monday, March 02, 2009 10:53 AM
To: [email protected]
Subject: st: RE: AW: Data management: looking up content in observations
You could write an -egen- function for this. I am not aware of one. But
that is not the only way to attack the problem, nor the most natural
way.
I don't think -by:- is natural here. There's more than instinct behind
that statement, as it follows from the logic of the problem. The problem
entails comparing observations in different blocks of observations,
however "blocks" are defined. That is, the day will be different, the
home team and guest team may both be different, etc. -by:-, conversely,
is for problems in which you need only work _within_ blocks.
As Martin pointed out, it helps to be thoroughly familiar with
subscripting for this kind of problem. He didn't spell out the mundane
details of any solution, so here is one way to do it.
I fall back on the often-deprecated "loop over observations". It is not
especially elegant or fast, but it is a direct attack on the problem and
does work. There are probably more cunning solutions entailing -merge-s
of the data with itself and so forth, but I'll still do it this way.
    gen winlast = .
    gen obsno = _n
    qui forval i = 1/`=_N' {
        su obsno if day == day[`i'] - 1 & ///
            (hometeam == hometeam[`i'] | guestteam == hometeam[`i']), meanonly
        if r(min) < . {
            replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
        }
    }
Notes:
1. I am assuming here that each team plays at most once per day. That is
not explicit, but is suggested by Florian's data segment.
2. I am assuming that the total number of games in the dataset is modest
enough to use a -float- for -obsno-. In a bigger dataset than that,
specify that -obsno- is to be a -long-.
3. There are no games before the first, so the loop need not start at 1,
but I'd rather leave it at 1 and let Stata do a little unnecessary work,
rather than wire in 5 and then create a source of bugs if the data get
out of -sort- order, or the code is ported to a different dataset for
which 5 is no longer the correct number.
4. Florian hit the nail on the head in labelling this a "look up"
problem. So, we can think of it in two stages:
* Which observation contains the details for the previous game with this
home team?
* Did this home team win in that game?
The first is the observation that, relative to observation `i', falls on
the previous day and involves the present home team, either at home or
away. It is the observation for which this condition is satisfied:
day == day[`i'] - 1 & ///
(hometeam == hometeam[`i'] | guestteam == hometeam[`i'])
What we do is exploit what -summarize- leaves in memory. At most one
game should satisfy that condition, so that observation number will be
recorded in multiple places, as r(min), r(max), r(mean) and r(sum). It
is arbitrary which we use.
(winner[r(min)] == hometeam[`i'])
will be 1 if the home team for this game was the winner in that game,
and 0 otherwise.
5. However, suppose that a team didn't play on the previous day. Then
the -summarize- will return missing in r(min) and the comparison will be
(winner[.] == hometeam[`i'])
which will return 0, as -winner[.]- is evaluated as an empty string,
which will not equal any team name. That's wrong, as the answer should
be ., not 0. A similar issue arises with the first day's games.
Thus,
    if r(min) < . {
        replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
    }
is the more careful code needed to trap such difficulties.
6. I assume that Florian meant
    count if day == 1 & winner == "F"
but my solution does not depend on -winner- being string or numeric,
just that -winner-, -hometeam-, -guestteam- are either all numeric or all
string.
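For anyone who wants to try this, here is a minimal sketch that enters Florian's data segment and runs the loop on it. The -str1- types are my assumption about how the data are stored, and I have made -obsno- a -long- as in note 2. The result should reproduce the (winlast) column in Florian's listing: missing for day 1, then 0, 1, 0, 1 for day 2 and 1 for the day 3 game.

```stata
clear
* Florian's data segment; storing the team variables as str1 is an assumption.
* Draws are entered as "." exactly as in his listing; however Stata stores
* that value, it cannot equal a team name, so the result is unaffected.
input day str1 hometeam str1 guestteam str1 winner
1 A H A
1 C F .
1 E B B
1 G D D
2 F E .
2 B G G
2 H C C
2 D A D
3 G E E
end

gen winlast = .
gen long obsno = _n

* For each game, look up the previous day's game involving the home team
quietly forvalues i = 1/`=_N' {
    summarize obsno if day == day[`i'] - 1 & ///
        (hometeam == hometeam[`i'] | guestteam == hometeam[`i']), meanonly
    * r(min) is missing if no such game exists; then winlast stays missing
    if r(min) < . {
        replace winlast = (winner[r(min)] == hometeam[`i']) in `i'
    }
}

list day hometeam guestteam winner winlast, sepby(day)
```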
There is a discussion of related technique in

SJ-6-4  dm0025  . . . . . . .  Stata tip 36: Which observations? Erratum
        . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q4/06   SJ 6(4):596                              (no commands)
        correction of example code for Stata tip 36

SJ-6-3  dm0025  . . . . . . . . . . .  Stata tip 36: Which observations?
        . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q3/06   SJ 6(3):430--432                         (no commands)
        tip for identifying which observations satisfy some
        specified condition
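As a small illustration of the same -summarize, meanonly- device outside the football data, here is a sketch using the auto dataset shipped with Stata. The choice of dataset and of the make "Volvo 260" is mine, purely for illustration; it is not taken from the tip itself.

```stata
sysuse auto, clear
gen long obsno = _n

* Which observation satisfies the condition?
summarize obsno if make == "Volvo 260", meanonly
return list

* With exactly one match, r(min) and r(max) hold the same observation number
assert r(min) == r(max)
display "observation " r(min) ": " make[r(min)]
```

If no observation satisfied the condition, r(min) would be missing, which is why the loop above tests -r(min) < .- before using it as a subscript.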
Nick
[email protected]
Martin Weiss
well, why do you want an -egen- function? Note there is an -egen, count-
command already, which, in combination with -by-, might just do what you
want. -help subscripting- may also be useful.
Florian Kuhn
I am trying to find out if in a league winning the previous game has an
effect on the current game. Specifically, I have 8 teams, named A to H. I
would like to construct the variable "winlast", being 1 if the current home
team won the last game and 0 otherwise.
The data is organized as follows:

Day  hometeam  guestteam  winner  (winlast)
1    A         H          A       (.)
1    C         F          .       (.)
1    E         B          B       (.)
1    G         D          D       (.)
2    F         E          .       (0)
2    B         G          G       (1)
2    H         C          C       (0)
2    D         A          D       (1)
3    G         E          E       (1)
...

That is, for each observation I would like to check whether the home team is
listed as "winner" for the previous day. I get the right digit by (for
example)
    count if day == 1 & winner == F
but I have no idea of how to incorporate this into an egen command (that is,
I had a lot of ideas none of which worked).
Does someone know how to get this right?
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/