Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: Merging files based on name and year
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Merging files based on name and year
Date
Sun, 4 Sep 2011 10:45:43 +0100
Another is
egen meanscore = mean(score), by(name)
bysort name (score) : replace meanscore = . if missing(score[_N])
On Sun, Sep 4, 2011 at 9:29 AM, Nick Cox <[email protected]> wrote:
> Many ways to this. One is
>
> egen nmissing = total(missing(score)), by(name)
> egen meanscore = mean(score) if nmissing == 0, by(name)
>
> Another is
>
> bysort name (score) : gen meanscore = sum(score)
> by name : replace meanscore = cond(missing(score[_N}), ., meanscore[_N]/_N)
>
> Nick
>
> On Sun, Sep 4, 2011 at 9:11 AM, Joseph Monte <[email protected]> wrote:
>> Nick,
>>
>> Thanks for the help. It worked well. Unfortunately, I'm stuck at the
>> final stage where I need to take the average score for each name
>> conditional on a score being available for each name. Unfortunately, I
>> do have missing observations.
>>
>> . input str1 name score
>>
>> name score
>> 1. "A" 7
>> 2. "A" 8
>> 3. "A" .
>> 4. "B" 6
>> 5. "B" 7
>> 6. "C" 5
>> 7. end
>>
>> . egen meanscore = mean(score), by(name)
>>
>> . list
>>
>> +--------------------+
>> | name score meanscore |
>> |--------------------|
>> 1. | A 8 7.5 |
>> 2. | A 7 7.5 |
>> 3. | A . 7.5 |
>> 4. | B 7 6.5 |
>> 5. | B 6 6.5 |
>> |--------------------|
>> 6. | C 5 5 |
>> +--------------------+
>>
>> I want meanscore not to be calculated for A since there is a missing
>> observation (i.e. there should be 3 blanks instead of 7.5). How do I
>> get around this issue? I tried adapting the code from the link below
>> but was not successful.
>>
>> http://www.stata.com/support/faqs/data/anyall.html
>>
>> Thanks,
>>
>> Joe
>>
>>
>>
>>
>>
>>
>> On Thu, Aug 25, 2011 at 9:52 AM, Nick Cox <[email protected]> wrote:
>>> I don't think regex is the only approach here. You could consider
>>> using -split-.
>>>
>>> A strategy here is to insert new parsing characters yourself. For
>>> example suppose that ; is not used, which you can check by
>>>
>>> assert strpos(name, ";") == 0
>>>
>>> Then put ; after each terminal element such as "Inc" (there's probably
>>> jargon I don't know)
>>>
>>> clonevar work = name
>>> replace work = subinstr(work, "Inc", "Inc;", .)
>>> replace work = subinstr(work, "LLC", "LLC;", .)
>>> replace work = subinstr(work, "Corp", "Corp;", .)
>>>
>>> and so on.
>>>
>>> You can get all the terminal elements from your file with just
>>> individual names. It is the last word (word(,-1)) of the company name.
>>> You can put that into a new variable and -tab- the results.
>>>
>>> You may need to fix exceptions, which will be shown by the tabulation above.
>>>
>>> Then -split- on ; and then -reshape-.
>>>
>>> Sometimes a very primitive approach like this is much quicker than
>>> spending hours trying to do it a cleverer way. (If someone were
>>> exceptionally fluent with regular expressions that wouldn't be true.)
>>>
>>> When regex works it can be a spectacular solution but with many messy
>>> problems it is often a very long way round.
>>>
>>> Nick
>>>
>>> On Thu, Aug 25, 2011 at 9:24 AM, Joseph Monte <[email protected]> wrote:
>>>
>>>> The file below is the master file containing names of companies and
>>>> years. ABC Inc is one company, XYZ Corp is another company, PNG LLC is
>>>> a third company. I have a total of 1100 different companies. As shown
>>>> below, sometimes two or more companies are listed in the same field
>>>> (there are a maximum of 5 companies listed in the same field). The
>>>> year column has only one year for each observation. There are a total
>>>> of 800 observations in this file.
>>>>
>>>> name year
>>>> ABC Inc 1986
>>>> XYZ Corp 1994
>>>> ABC Inc XYZ Corp 2001
>>>> PNG LLC 2005
>>>> XYZ Corp PNG LLC 2007
>>>>
>>>>
>>>> I have a second file with data in the following format. The 1100
>>>> companies are listed as shown below. YR8084 means the years 1980-1984,
>>>> YR8591 means the years 1985-1991, and so on. The numbers below each
>>>> year are scores assigned to each company during a certain period. For
>>>> example, ABC Inc is assigned a score of 6 during 1980-1984, 7 from
>>>> 1985-1991, and 9 from 2001-2004. ABC Inc is not assigned a score
>>>> during other periods. Scores range from 1 to 9 and may be up to 3
>>>> decimal places.
>>>>
>>>> name YR8084 YR8591 YR9200 YR0104 YR0507 YR0809
>>>> ABC Inc 6 7 9
>>>> XYZ Corp 2 5 6 6
>>>> PNG LLC 7 7 7
>>>> 7 7
>>>>
>>>>
>>>> I want the master file to include a column with scores as shown below.
>>>> For example, ABC Inc gets a score of 7 in 1986, XYZ Corp gets a score
>>>> of 5 in 1994. For observations with two or more names, I want a simple
>>>> average of scores. For example, for ABC Inc XYZ Corp, the score will
>>>> be (9+6)/2=7.5. If a company has a year in the master file, then it
>>>> definitely has a score for that year (i.e. time period) in the second
>>>> file.
>>>>
>>>> name year score
>>>> ABC Inc 1986 7
>>>> XYZ Corp 1994 5
>>>> ABC Inc XYZ Corp 2001 7.5
>>>> PNG LLC 2005 7
>>>> XYZ Corp PNG LLC 2007 6.5
>>>>
>>>> I am using Stata 12. I expect I would need the -regexm()- command to
>>>> split the company names, then -reshape- to get all company names one
>>>> below the other and then -merge-. Since there are 1100 companies, I
>>>> would need some kind of a loop to use the -regexm() command. I am
>>>> having trouble writing the code.
>>>>
>>>
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/statalist/faq
>>> * http://www.ats.ucla.edu/stat/stata/
>>>
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/statalist/faq
>> * http://www.ats.ucla.edu/stat/stata/
>>
>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/