Statalist


Re: st: AW: egen(mean or suchlike) for a string variable


From   joe j <[email protected]>
To   [email protected]
Subject   Re: st: AW: egen(mean or suchlike) for a string variable
Date   Fri, 9 Oct 2009 17:34:40 +0200

Thank you, Martin and Nick, for the suggestions. Much appreciated.
I'll look at your references.
JJ

On Fri, Oct 9, 2009 at 4:49 PM, Nick Cox <[email protected]> wrote:
> You should check for different spellings etc. Differences could be inconsistent use of upper and lower case, or extra leading, internal, or trailing spaces.
>
> See http://www.stata.com/support/faqs/data/diff.html for some techniques for identifying inconsistencies.
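A minimal sketch of one such check, assuming the variable Uni from Martin's example further down the thread; Uni_std is a hypothetical name. It only catches differences in case and in leading or trailing spaces. Internal spaces would need something like -itrim()- (named -stritrim()- in recent releases, as far as I know).

```stata
* Uni_std is a hypothetical name for a standardised copy of Uni
gen Uni_std = lower(trim(Uni))

* How many observations differ from the standardised form?
* (mixed-case names such as "Harvard" count here too, by construction)
count if Uni_std != Uni

* Mark one observation per distinct standardised name, then list them
bysort Uni_std : gen byte firstname = (_n == 1)
list Uni_std if firstname & !missing(Uni_std), noobs
```

Scanning the listed names by eye is then a quick way to spot near-duplicates that no automatic rule would catch.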
>
> Nick
> [email protected]
>
> joe j
>
> Thank you, Nick, for complementing Martin's advice. I do find a slight
> difference in outcomes from the two procedures (maybe about 0.1% in a
> sample of close to a million, so I can't immediately tell why this is
> so; perhaps due to the complications you allude to). Good to know also
> about -egen, mode()-.
>
> On Thu, Oct 8, 2009 at 6:23 PM, Nick Cox <[email protected]> wrote:
>> Let's underline that this can all be done with strings. There is no need to resort to -encode- or otherwise to convert to numeric.
>>
>> Missing, i.e. empty, strings sort first. Thus after -input- and -trim()-, Martin's code can be slimmed to
>>
>> bys year Prof (Uni) : replace Uni = Uni[_N] if missing(Uni)
>>
>> -- without any need for an extra variable.
>>
>> However, there is no check here for different non-missing values within groups of -year Prof-.
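One way such a check might be sketched, assuming Martin's variables year, Prof and Uni with Uni already trimmed; first and ndistinct are hypothetical helper names. It counts the distinct non-missing values of Uni in each year-Prof group and fills in missing names only where that count is exactly one.

```stata
* Mark the first observation of each distinct non-missing Uni within year-Prof
bysort year Prof Uni : gen byte first = (_n == 1) & !missing(Uni)

* Count distinct non-missing Uni values in each year-Prof group
bysort year Prof : egen ndistinct = total(first)

* Inspect groups where one professor is linked to several universities
list year Prof Uni if ndistinct > 1, sepby(year Prof)

* Fill in missing names only where the group is unambiguous
bysort year Prof (Uni) : replace Uni = Uni[_N] if missing(Uni) & ndistinct == 1
```

Groups with ndistinct > 1 are left alone, so they can be resolved by hand or by some other rule.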
>>
>> In the same territory, note that -egen, mode()- takes string arguments as well as numeric, so can be used for imputation. However, the direct route that Martin exemplifies has many advantages.
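The -egen, mode()- route might look like the following minimal sketch, again assuming the variables year, Prof and Uni; Uni_mode is a hypothetical name. By default -mode()- ignores missing values, and where a group has tied modes it returns missing unless -minmode- or -maxmode- is specified, so groups with evenly split names stay unfilled.

```stata
* Most frequent non-missing Uni within each year-Prof group
egen Uni_mode = mode(Uni), by(year Prof)

* Use the mode only to fill in the gaps
replace Uni = Uni_mode if missing(Uni)
```

Unlike the sort-based replacement, this picks the majority name when a group contains several different non-missing values, which may or may not be what is wanted.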
>>
>> Nick
>> [email protected]
>>
>> Martin Weiss
>>
>> *************
>> clear*
>>
>> inp year str10(Uni Prof)
>> 1990  Harvard   " S Smith"
>> 1990   ""      "S Smith"
>> 1990  UCLA      "P Williams"
>> 1990  Yale       " K John"
>> 1991   ""        "K Evert"
>> 1991  Oxford     "K Evert"
>> 1991  ""        "K Evert"
>> end
>>
>> replace Uni=trim(Uni)
>> replace Prof=trim(Prof)
>> compress
>>
>> gen byte nonmiss=!mi(Uni)
>>
>> //replace with last obs
>> bys year Prof (nonmiss): /*
>> */ replace Uni=Uni[_N]  /*
>> */ if nonmiss==0
>>
>> l, noo sepby(year Prof)
>> *************
>>
>> joe j
>>
>> Thanks. (Your suggestion helped me create a variable that takes a
>> numeric value, instead of the university name; this is definitely an
>> improvement.)
>>
>> This is what the data look like:
>>
>> Year  University  Professor
>>
>> 1990  Harvard     S Smith
>> 1990  ---------   S Smith
>> 1990  UCLA        P Williams
>> 1990  Yale        K John
>>
>> 1991  ---------   K Evert
>> 1991  Oxford      K Evert
>>
>> What I want is to replace the missing names above, in 1990 with
>> Harvard and in 1991 with Oxford.
>>
>> On Thu, Oct 8, 2009 at 11:59 AM, Martin Weiss <[email protected]>
>>
>>> You should turn the string into a numeric variable via -encode-. Then
>>> -egen- can go to work. Also provide an excerpt of your data and show
>>> what you want to happen to them...
>>
>> joe j
>>
>>> In my data I have a string variable "University", which lists
>>> university names. In some years the names are missing. Two other
>>> variables I have are "Professor" and "Year". The same "Professor" and
>>> "University" can occur multiple times in a year.
>>>
>>> The problem I have is that there are quite a few University names that
>>> are missing. What I want to do is to replace as many missing
>>> University names as possible, by assuming that: when a professor is
>>> linked to a university at least once in a year, she is linked to the
>>> same university during that year - so the missing university name when
>>> her name occurs again in the same year can be replaced (why there are
>>> missing university names is a complicated story:)).
>>
>>> I tried the following in Stata (it's foolish, I know):
>>>
>>>  bysort year professor: egen University_all=mean(University)
>>>
>>> But I get the error "type mismatch".
>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



