There was a key typo in my previous post. I invented names,
then edited back to Mingfeng's names, but with an error.
Here is a second edition.
Nick
[email protected]
There is an interesting underlying issue here: what
exactly is "programming" in Stata? A precise
answer is that a program is whatever is defined
by the code that follows a -program- statement. (There
is no circularity here, as "program" the English
word belongs to the metalanguage and -program- the
Stata command name belongs to the language.)
OK, enough of that.
The good news is that this can be done without
ever writing down the Stata command name -program-,
so the answer is yes.
The other news looks bad, but isn't so bad really.
In fact, it is really good news.
You can do this, but it requires a little more
Stata than you may want at this moment. However, the features
to be used are among the most Stataish of all
Stata features and are very, very useful.
Using your second list of values (which differs
slightly from your first) we have
. l
+------+
| x |
|------|
1. | cd1 |
2. | cd2 |
3. | cd2 |
4. | cd3 |
5. | cd1 |
|------|
6. | cd3 |
7. | cd4 |
8. | cd1 |
9. | cd5 |
10. | cd3 |
+------+
We need to tag the first time any value
occurs. That will need a -sort-, and because
of that we should keep a record of the current
sort order, not least because we will want
to return to it. That means
. gen order = _n
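If your dataset is really big (beyond roughly 16.7 million
observations, where the default float type can no longer hold
every integer exactly), that should be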
. gen long order = _n
We sort into groups of -x- and ensure that
within groups of -x- the original sort order
is followed. Then we tag the very first occurrence
of each value of -x-. This can all be telescoped into one
statement.
. bysort x (order) : gen y = _n == 1
There is a FAQ on constructs like those on the right-hand
side of the assignment:
FAQ     . . . . . . . . . . . . . . . . . . . True and false in Stata
        2/03    What is true and false in Stata?
                http://www.stata.com/support/faqs/data/trueorfalse.html
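As a small illustration of that right-hand side, here is a sketch
of what -_n == 1- evaluates to, followed by a longhand version of
the one-line -bysort- statement. It assumes that -x- and -order-
already exist as created earlier; the variable name -y2- is made up
for this illustration only.

```stata
* A true relational expression evaluates to 1, a false one to 0.
display 2 == 2
display 2 == 3

* Longhand equivalent of: bysort x (order) : gen y = _n == 1
* Assumes -x- and -order- exist; -y2- is a hypothetical name.
sort x order
gen byte y2 = 0
by x : replace y2 = 1 if _n == 1
```

Within each group of -x-, -_n- is 1 only for the first observation,
so only first occurrences get the value 1.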
Now -sort- back to the original order. Then we just need a running
sum of -y-, as the number of distinct values
seen so far is equal to (or even defined as)
the number of first occurrences seen so far.
. sort order
. replace y = sum(y)
(9 real changes made)
-order- has served its purpose. Bye-bye!
. drop order
What have we got?
. l
+----------+
| x y |
|----------|
1. | cd1 1 |
2. | cd2 2 |
3. | cd2 2 |
4. | cd3 3 |
5. | cd1 3 |
|----------|
6. | cd3 3 |
7. | cd4 4 |
8. | cd1 4 |
9. | cd5 5 |
10. | cd3 5 |
+----------+
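For anyone who wants to replay the steps, the following is a
sketch of a do-file that types in the ten values listed earlier
and runs the same commands. The final -assert- is an addition
for checking, not part of the original solution.

```stata
* Enter the ten example values.
clear
input str3 x
"cd1"
"cd2"
"cd2"
"cd3"
"cd1"
"cd3"
"cd4"
"cd1"
"cd5"
"cd3"
end

* Record the current order, tag first occurrences,
* go back to the original order, then take a running sum.
gen long order = _n
bysort x (order) : gen y = _n == 1
sort order
replace y = sum(y)
drop order
list

* Added check: five distinct values have been seen by the last row.
assert y[_N] == 5
```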
Now with a little more knowledge we could wrap that
up into a command, or better an -egen- function. But
in many ways it is better to use the code here and
understand its logic, which will help
for that next problem with a similar flavour.
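For the curious, here is one way such a wrapper might look. This
is only a sketch under assumptions: the command name
-runningdistinct- and its -generate()- option are invented here,
and it does nothing about -if-, -in- or missing values.

```stata
* Hypothetical command: running count of distinct values seen so far.
program runningdistinct
    version 8
    syntax varname, Generate(string)
    confirm new variable `generate'

    * Temporary variable holding the current sort order.
    tempvar order
    gen long `order' = _n

    * Tag first occurrences, restore the order, take the running sum.
    bysort `varlist' (`order') : gen long `generate' = _n == 1
    sort `order'
    quietly replace `generate' = sum(`generate')
end
```

With that defined, the example above would be
-runningdistinct x, generate(y)-. The temporary variable is
dropped automatically when the program ends.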
The key construct here is -by:-. The documentation
for -by:- is scattered around the manuals. A Mickey Mouse
tutorial bringing together the main ideas was given in
SJ-2-1  pr0004  . . . . . . . Speaking Stata: How to move step by: step
        Q1/02   SJ 2(1):86-102
                explains the use of the by varlist : construct to tackle
                a variety of problems with group structure, ranging from
                simple calculations for each of several groups to more
                advanced manipulations that use the built-in _n and _N
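
As a quick taste of -_n- and -_N- under -by:-, here is a minimal
sketch. It assumes the auto dataset shipped with Stata (loaded
with -sysuse-); the variable names -position- and -groupsize-
are made up for the illustration.

```stata
sysuse auto, clear

* Under by:, _n counts observations within each group, here in
* order of -price- within each category of -foreign-.
bysort foreign (price) : gen position = _n

* Under by:, _N is the number of observations in each group.
by foreign : gen groupsize = _N

* Show the cheapest car in each group.
list make price foreign groupsize if position == 1
```
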
Nick