Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: comparing xtdes-like patterns for variables
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: comparing xtdes-like patterns for variables
Date: Thu, 1 Nov 2012 13:30:31 +0000
I've done a quick hack of a program to show where the missings lie.
Its effectiveness in showing structure seems likely to diminish with
dataset size.
Example:
sysuse nlsw88
missingplot
*! 1.0.0 NJC 1 November 2012
program missingplot
    version 8.2
    syntax [varlist] [if] [in] [ , all varnames * ]

    quietly {
        marksample touse, novarlist
        count if `touse'
        if r(N) == 0 error 2000

        local y = 0
        tempvar obsno
        gen long `obsno' = _n if `touse'
        label variable `obsno' "observations"
        local toomany = 0

        foreach v of local varlist {
            local include = 1
            if "`all'" == "" {
                count if `touse' & missing(`v')
                if r(N) == 0 local include = 0
            }

            if `include' {
                local ++y
                if `y' > 20 {
                    local toomany = 1
                    continue, break
                }

                tempvar ynew
                gen `ynew' = `y' if missing(`v')
                if "`varnames'" != "" {
                    local which "`v'"
                }
                else {
                    local which : var label `v'
                    if `"`which'"' == "" local which "`v'"
                }
                local call `call' `y' `"`which'"'
                local Y `Y' `ynew'
            }
        }
    }

    if "`Y'" == "" {
        di as txt "nothing to plot!"
        exit 0
    }

    if `toomany' {
        di as txt "note: only first 20 variables plotted"
    }

    scatter `Y' `obsno' if `touse', ///
        yla(`call', ang(h) noticks) ytitle("") ///
        legend(off) mcolor(blue ..) ms(dh ..) `options'
end
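
As a tabular complement to the plot, a loop along the following lines counts the missing values in each variable. This is a minimal sketch and not part of missingplot; the local macro name nmissvars is made up for illustration.

```stata
sysuse nlsw88, clear

* nmissvars is a hypothetical counter, not used by missingplot
local nmissvars = 0
foreach v of varlist _all {
    quietly count if missing(`v')
    if r(N) > 0 {
        local ++nmissvars
        * variable name, then its number of missing values
        display as text %-16s "`v'" as result %8.0f r(N)
    }
}
display as text "variables with missings: " as result `nmissvars'
```

Variables with no missing values are skipped, which mirrors the default behaviour of the program above when -all- is not specified.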
On Thu, Nov 1, 2012 at 12:46 PM, Nick Cox <[email protected]> wrote:
> Sorry for previous premature send.
>
> If you had several variables you could try something like this
>
> local y = 0
> gen long obsno = _n
>
> qui foreach v of var <whatever> {
>         local ++y
>         gen y`y' = `y' if missing(`v')
>         local which : var label `v'
>         if "`which'" == "" local which "`v'"
>         local call `call' `y' "`which'"
>         local Y `Y' y`y'
> }
>
> scatter `Y' obsno, ms(dh ..) yla(`call', ang(h) noticks) legend(off)
>
>
>>
>> On Thu, Nov 1, 2012 at 1:10 AM, Nick Cox <[email protected]> wrote:
>>> You could create variables like
>>>
>>> gen yxmiss = missing(y) - missing(x)
>>> gen long obs = _n
>>>
>>> scatter yxmiss obs if missing(y, x)
>>>
>>> On Wed, Oct 31, 2012 at 7:39 PM, László Sándor <[email protected]> wrote:
>>>> Thanks, Nick.
>>>>
>>>> The values definitely don't line up that neatly, but that's a worry
>>>> for another day.
>>>>
>>>> Basically my problem is, if I know I can expect differences between
>>>> the variables, is there a neat way to compare their missing patterns
>>>> (one always starting early, or one mistakenly having the years in
>>>> reverse order)?
>>>>
>>>> On Wed, Oct 31, 2012 at 3:15 PM, Nick Cox <[email protected]> wrote:
>>>>> If # different versions of the same data should be the same, there
>>>>> will be # duplicates of everything in a combined dataset.
>>>>>
>>>>> This applies to missings too.
>>>>>
>>>>> -duplicates- is therefore something that springs to mind. Panels are
>>>>> no problem, as panel identifiers are just other variables.
>>>>>
>>>>> Naturally, if the combined dataset is extremely large, this won't be
>>>>> very practical.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Wed, Oct 31, 2012 at 7:02 PM, László Sándor <[email protected]> wrote:
>>>>>
>>>>>> I have a panel-data cleaning problem that probably has some neat
>>>>>> solution, probably already out there. I am happy to try any solutions
>>>>>> for Stata 12.1 MP.
>>>>>>
>>>>>> Background: I had to try to look up supposedly the same data from
>>>>>> multiple sources. (Financial data for the same securities, but
>>>>>> different data sources were expected to cover different subsets of my
>>>>>> universe, or for different time periods.)
>>>>>>
>>>>>> But now I have a panel where I would like to cross-check different
>>>>>> versions of the same data, and most crucially, I would like to verify
>>>>>> that I got the years correctly for each version. (FYI: financial data
>>>>>> sources can be opaque about how they handle missing data if you ask
>>>>>> for "end-of-year prices for the last 15 calendar years", and whether
>>>>>> they give years in ascending or descending order). For this, I would
>>>>>> like to compare the periods for which I have non-missing values across
>>>>>> a family of variables, say, bloomberg_price and reuters_price.
>>>>>>
>>>>>> Presumably, if I got the start and the end years right, I could hope
>>>>>> to -compare- those (e.g. -compare *_price_first-), and hope that the
>>>>>> patterns will be clear.
>>>>>>
>>>>>> That said, I'm afraid some more nuanced analysis of missing value
>>>>>> patterns might be justified. What are good tools for that? (How can I
>>>>>> "xtdes by variable"? Or "misstable pattern in a panel"?)
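
One possible way to approach the "xtdes by variable" question is to record, for each panel and each variable, the first and last year with a non-missing value. What follows is only a sketch on made-up data: the identifier id, the time variable year, and the toy values are assumptions for illustration and do not come from the thread.

```stata
* Toy panel; id, year and all values are hypothetical
clear
input id year bloomberg_price reuters_price
1 2001 10 .
1 2002 11 11
1 2003 12 12
2 2001 .  20
2 2002 21 21
2 2003 22 .
end

foreach v of varlist bloomberg_price reuters_price {
    * first and last year with a non-missing value, within each panel
    bysort id: egen first_`v' = min(cond(!missing(`v'), year, .))
    bysort id: egen last_`v'  = max(cond(!missing(`v'), year, .))
}

* show one row per panel
egen byte tag = tag(id)
list id first_* last_* if tag, noobs
```

A source that starts later than the others, or stops earlier, then shows up as a difference between the first_* or last_* variables. Those could in turn be passed to -compare-.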