Statalist


AW: st: AW: egen(mean or suchlike) for a string variable


From   "Martin Weiss" <[email protected]>
To   <[email protected]>
Subject   AW: st: AW: egen(mean or suchlike) for a string variable
Date   Fri, 9 Oct 2009 16:49:54 +0200



Jeph's and Eva's Stata Journal article at
http://www.stata-journal.com/article.html?article=dm0039
may also be useful for Joe...



HTH
Martin


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On behalf of Nick Cox
Sent: Friday, 9 October 2009 16:49
To: [email protected]
Subject: RE: st: AW: egen(mean or suchlike) for a string variable

You should check for different spellings of the same name. Differences
could arise from inconsistent use of upper and lower case, or from extra
leading, internal, or trailing spaces.

See http://www.stata.com/support/faqs/data/diff.html for some techniques for
identifying inconsistencies.

Nick 
[email protected] 
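
A minimal sketch of such a check, assuming the variable is called Uni as in
the example quoted further down; Uni_clean, tag and nvariants are
hypothetical names. It builds a normalised copy and lists every name that
occurs in more than one spelling. -itrim()- is the older name of the
function; Stata 14 and later document it as -stritrim()-.

```stata
* Normalised copy: lower case, no leading/trailing blanks,
* internal runs of blanks collapsed to a single blank.
gen Uni_clean = lower(itrim(trim(Uni)))

* Tag one observation per distinct original spelling.
bysort Uni_clean Uni: gen byte tag = (_n == 1)

* Count the spellings that map to the same normalised name.
bysort Uni_clean: egen nvariants = total(tag)

* Show only names that appear in more than one spelling.
list Uni_clean Uni if tag & nvariants > 1, sepby(Uni_clean)
```

This catches only case and spacing differences; genuine misspellings or
abbreviations would still need to be inspected by hand.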

joe j

Thank you, Nick, for complementing Martin's advice. I do find a slight
difference in outcomes between the two procedures (maybe about 0.1% in a
sample of close to a million), and I can't immediately tell why this is
so; perhaps it is due to the complications you allude to. Good to know also
about -egen, mode()-.
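
One possible source of such a difference is groups that hold more than one
distinct non-missing university name, since a fill from the last sorted
value and a fill from the mode need not agree there. A minimal sketch of a
check, assuming the variable names year, Prof and Uni from the example
quoted below, and that it is run before any blanks are filled in; first,
ndistinct and Uni_mode are hypothetical names.

```stata
* Count distinct non-missing university names within each year-Prof group.
bysort year Prof Uni: gen byte first = (_n == 1) & !missing(Uni)
bysort year Prof: egen ndistinct = total(first)

* Groups with ndistinct > 1 hold conflicting names.
list year Prof Uni if ndistinct > 1, sepby(year Prof)

* Mode of the non-missing names, for comparison with the direct fill.
* With tied modes -egen, mode()- returns missing unless minmode,
* maxmode or nummode() is specified.
bysort year Prof: egen Uni_mode = mode(Uni)
```

Whether this accounts for the whole 0.1% cannot be told from the thread; it
only shows where the two routes can diverge.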

On Thu, Oct 8, 2009 at 6:23 PM, Nick Cox <[email protected]> wrote:
> Let's underline that this can all be done with strings. There is no need
> to resort to -encode- or otherwise to convert to numeric.
>
> Missing, i.e. empty, strings sort first. Thus after -input- and -trim()-,
> Martin's code can be slimmed to
>
> bys year Prof (Uni) : replace Uni = Uni[_N] if missing(Uni)
>
> -- without any need for an extra variable.
>
> However, there is no check here for different non-missing values within
> groups of -year Prof-.
>
> In the same territory, note that -egen, mode()- takes string arguments as
> well as numeric, so can be used for imputation. However, the direct route
> that Martin exemplifies has many advantages.
>
> Nick
> [email protected]
>
> Martin Weiss
>
> *************
> clear*
>
> inp year str10(Uni Prof)
> 1990  Harvard   " S Smith"
> 1990   ""      "S Smith"
> 1990  UCLA      "P Williams"
> 1990  Yale       " K John"
> 1991   ""        "K Evert"
> 1991  Oxford     "K Evert"
> 1991  ""        "K Evert"
> end
>
> replace Uni=trim(Uni)
> replace Prof=trim(Prof)
> compress
>
> gen byte nonmiss=!mi(Uni)
>
> //replace with last obs
> bys year Prof (nonmiss): /*
> */ replace Uni=Uni[_N]  /*
> */ if nonmiss==0
>
> l, noo sepby(year Prof)
> *************
>
> joe j
>
> Thanks. (Your suggestion helped me create a variable that takes a
> numeric value, instead of the university name; this is definitely an
> improvement.)
>
> This is what the data look like:
>
> Year  University Professor
>
> 1990  Harvard    S Smith
> 1990   ---------     S Smith
> 1990  UCLA      P Williams
> 1990  Yale        K John
>
> 1991   ---------    K Evert
> 1991  Oxford     K Evert
>
> What I want is to replace the missing names above, in 1990 with
> Harvard and in 1991 with Oxford.
>
> On Thu, Oct 8, 2009 at 11:59 AM, Martin Weiss <[email protected]> wrote:
>
>> You should turn the string into a numeric variable via -encode-. Then -egen-
>> can go to work. Also provide an excerpt of your data and show what you want
>> to happen to them...
>
> joe j
>
>> In my data I have a string variable "University", which lists
>> university names. In some years the names are missing. The two other
>> variables I have are "Professor" and "Year". The same "Professor" and
>> "University" can occur multiple times in a year.
>>
>> The problem I have is that there are quite a few University names that
>> are missing. What I want to do is to replace as many missing
>> University names as possible, by assuming that: when a professor is
>> linked to a university at least once in a year, she is linked to the
>> same university during that year - so the missing university name when
>> her name occurs again in the same year can be replaced (why there are
>> missing university names is a complicated story:)).
>
>> I tried the following in Stata (it's foolish, I know):
>>
>>  bysort year professor: egen University_all=mean(University)
>>
>> But I get the warning "type mismatch".
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


