Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
RE: st: converting high frequency data to low frequency
From
Nick Cox <[email protected]>
To
"'[email protected]'" <[email protected]>
Subject
RE: st: converting high frequency data to low frequency
Date
Fri, 5 Nov 2010 11:59:13 +0000
David's suggestion strikes me as right in principle, but I think he's still thinking in terms of the bad old days before Stata 10 when people had to work out their own awkward ways of handling times of day. That's a misunderstanding here.
As always, the _format_ of these data is a matter of how they are to be displayed, and not a matter of how they are stored. (An article on the most common misunderstandings of Stata would surely include this one.)
Dimitry's data look exactly like standard Stata date-times, allowed in Stata 10 up, meaning that underneath the cosmetic format they are times in milliseconds (ms). Therefore, he wants to round in units of 1000 * 60 * 5 = 300000.
Here is a concrete example which covers everything needed to understand this problem.
Using a %tc format for a -clock()- conversion of 11:31:00 today gives us back, not surprisingly, the same information:
. di %tc clock("5 Nov 2010 11:31:00", "DMYhms")
05nov2010 11:31:00
But underneath all that, the precise date-time _really_ is just an integer with units ms.
. di %20.0f clock("5 Nov 2010 11:31:00", "DMYhms")
1604575860000
(The "20" in the format is much more than I need but causes no problem here.)
You can round down or round up; which way you go is a matter of taste or convention. I almost never round using -int()-. I almost always round using -floor()- or -ceil()- because then I know immediately that I am rounding down (-floor()-) or up (-ceil()-; think ceiling) and I don't get bit around 0 because the way -int()- works with negative numbers is not what I usually want, except that I might forget that or not foresee it might happen with my data.
Now rounding down, for example, in units of 5 minutes is rounding down in units of 300000 ms. There are three steps, except that they can be combined in one line:
1. Divide by 300000.
2. Round down to the next integer below.
3. Multiply by 300000.
So, the result is another large integer,
. di %20.0f 300000 * floor(clock("5 Nov 2010 11:31:00", "DMYhms")/300000)
1604575800000
But we should check that we did it right:
. di %tc 300000 * floor(clock("5 Nov 2010 11:31:00", "DMYhms")/300000)
05nov2010 11:30:00
With a variable it's going to be
gen double binnedtime = 300000 * floor(ordertime/300000)
format binnedtime %tc
Never forget the -double-. Then you can -collapse- (or better -contract-) in terms of the new variable. (If it's really just time of day you care about, you must get there first by subtraction.)
(I suggested generalising -floor()- and -ceil()- some years ago to StataCorp so that with two arguments -floor(ordertime, 300000), say, would do what is above, but the suggestion is still lurking in their files. A good argument against would be that the long-winded way to do it, as above, is easy enough.)
See also if desired
SJ-3-4 dm0002 . . . . . . . . Stata tip 2: Building with floors and ceilings
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q4/03 SJ 3(4):446--447 (no commands)
tips for using floor() and ceil()
Nick
[email protected]
David Kantor
============
I believe you need to do just what you stated. In particular, if h, m
, and s represent the hours, minutes and seconds, then you want
gen long s2 = 3600 * h + 60 * m + s
then take int(s2/300)
(You may want to s2 to include date*24*3600 -- if you don't use the
date as a separate component of the identifier.)
There may be other methods available that I'm not aware of.
Dimitry Masterov
================
>I have data on orders that looks like this:
>
>store_id ordertime
>1901 10oct2010 15:48:18
>1901 10oct2010 15:50:36
>1901 10oct2010 16:41:35
>1901 10oct2010 16:53:37
>1901 10oct2010 18:38:24
>1901 10oct2010 19:09:22
>1901 10oct2010 21:14:22
>.....
>1906 10oct2010 17:54:59
>1906 12oct2010 19:21:27
>1906 17oct2010 16:10:01
>1906 17oct2010 19:34:39
>1906 17oct2010 21:11:00
>1906 17oct2010 22:18:55
>1906 17oct2010 22:55:13
>1906 19oct2010 22:23:10
>1906 20oct2010 01:19:24
>
>Ordertime is in %tc format. I would like to count the number of orders
>in 5 minute buckets by store_id. I would like to use
>
>collapse (count) ordertime, by(store_id bucket)
>
>but I have no idea how to group the ordertime into 5 minute intervals
>that are nicely labeled. I imagine I could calculate the elapsed time
>since midnight for each day, divide that into 5 minute intervals (or
>6000 unit intervals), and find some way to label, but I have not
>gotten this to work yet. Is there's an easier way? I tried googling
>and findit-ing, but I am not sure I know what this type of
>transformation is called.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/