Thank you, Nick, for complementing Martin's advice. I do find a slight
difference in outcomes between the two procedures (maybe about 0.1% in a
sample of close to a million). I can't immediately tell why this is so;
perhaps it is due to the complications you allude to. Good to know also
about -egen, mode()-.
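
One way to see where the two procedures part ways might be to flag the
observations on which they disagree. What follows is only a sketch on toy
data, not my actual code: the variable names Uni_fill, Uni_mode and differ
are made up for illustration, and I am assuming the second procedure is
-egen, mode()- with the maxmode option so that tied groups do not come
back missing.

```stata
clear
input year str10(Uni Prof)
1990 "Harvard" "S Smith"
1990 "Harvard" "S Smith"
1990 "Yale"    "S Smith"
1990 ""        "S Smith"
1991 ""        "K Evert"
1991 "Oxford"  "K Evert"
end

* Procedure 1: empty strings sort first, so fill from the last value in the group
gen Uni_fill = Uni
bysort year Prof (Uni_fill): replace Uni_fill = Uni_fill[_N] if missing(Uni_fill)

* Procedure 2: fill from the most frequent non-missing value in the group
* (maxmode is an assumed tie-breaking choice)
bysort year Prof: egen Uni_mode = mode(Uni), maxmode
replace Uni_mode = Uni if !missing(Uni)

* Flag observations where the two imputations disagree
gen byte differ = Uni_fill != Uni_mode
list if differ, noobs sepby(year Prof)
```

In this toy example the two routes disagree only where a professor has more
than one non-missing university within a year: the direct route picks the
alphabetically last name, while the mode picks the most frequent one.
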
JJ
On Thu, Oct 8, 2009 at 6:23 PM, Nick Cox <[email protected]> wrote:
> Let's underline that this can all be done with strings. There is no need to resort to -encode- or otherwise to convert to numeric.
>
> Missing, i.e. empty, strings sort first. Thus after -input- and -trim()-, Martin's code can be slimmed to
>
> bys year Prof (Uni) : replace Uni = Uni[_N] if missing(Uni)
>
> -- without any need for an extra variable.
>
> However, there is no check here for different non-missing values within groups of -year Prof-.
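
A check of that kind could be sketched as below, counting the distinct
non-missing values of Uni within each group of year and Prof. This is an
illustration on made-up data, and the names tag and ndistinct are
hypothetical, not part of the code posted in this thread.

```stata
clear
input year str10(Uni Prof)
1990 "Harvard" "S Smith"
1990 ""        "S Smith"
1990 "Yale"    "S Smith"
1991 ""        "K Evert"
1991 "Oxford"  "K Evert"
end

* Tag the first occurrence of each non-missing Uni within year and Prof
bysort year Prof Uni: gen byte tag = (_n == 1) & !missing(Uni)

* Count the distinct non-missing values in each year-Prof group
bysort year Prof: egen ndistinct = total(tag)

* Groups with ndistinct > 1 carry conflicting university names
list if ndistinct > 1, noobs sepby(year Prof)
```

Groups listed here are the ones where filling from Uni[_N] silently picks
one of several candidate names.
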
>
> In the same territory, note that -egen, mode()- takes string arguments as well as numeric, so can be used for imputation. However, the direct route that Martin exemplifies has many advantages.
>
> Nick
> [email protected]
>
> Martin Weiss
>
> *************
> clear*
>
> inp year str10(Uni Prof)
> 1990 Harvard " S Smith"
> 1990 "" "S Smith"
> 1990 UCLA "P Williams"
> 1990 Yale " K John"
> 1991 "" "K Evert"
> 1991 Oxford "K Evert"
> 1991 "" "K Evert"
> end
>
> replace Uni=trim(Uni)
> replace Prof=trim(Prof)
> compress
>
> gen byte nonmiss=!mi(Uni)
>
> //replace with last obs
> bys year Prof (nonmiss): /*
> */ replace Uni=Uni[_N] /*
> */ if nonmiss==0
>
> l, noo sepby(year Prof)
> *************
>
> joe j
>
> Thanks. (Your suggestion helped me create a variable that takes a
> numeric value, instead of the university name; this is definitely an
> improvement.)
>
> This is what the data look like:
>
> Year University Professor
>
> 1990 Harvard S Smith
> 1990 --------- S Smith
> 1990 UCLA P Williams
> 1990 Yale K John
>
> 1991 --------- K Evert
> 1991 Oxford K Evert
>
> What I want is to replace the missing names above, in 1990 with
> Harvard and in 1991 with Oxford.
>
> On Thu, Oct 8, 2009 at 11:59 AM, Martin Weiss <[email protected]>
>
>> You should turn the string into a numeric variable via -encode-. Then
>> -egen- can go to work. Also provide an excerpt of your data and show
>> what you want to happen to them...
>
> joe j
>
>> In my data I have a string variable "University", which lists
>> university names. In some years the names are missing. Two other
>> variables I have are "Professor" and "Year". The same "Professor" and
>> "University" can occur multiple times in a year.
>>
>> The problem I have is that there are quite a few University names that
>> are missing. What I want to do is to replace as many missing
>> University names as possible, by assuming that: when a professor is
>> linked to a university at least once in a year, she is linked to the
>> same university during that year - so the missing university name when
>> her name occurs again in the same year can be replaced (why there are
>> missing university names is a complicated story:)).
>
>> I tried the following in Stata (it's foolish, I know):
>>
>> bysort year professor: egen University_all=mean(University)
>>
>> But I get the warning "type mismatch".
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>