Re: st: Stata, data processing, databases, and consultants
On Sep 6, 2007, at 9:15 PM, Buzz Burhans wrote:
> Is anyone using Stata for data processing and report generating
> like this, where the data repositories are small local databases?
> Is it foolish not to convert such a system entirely to a database
> program (and move it all away from Stata?)?
<snip>
> Has anyone used *.dta files as a database, or is this foolish and
> would it be much better to use real databases for the data repository?

We do this sort of thing all the time (i.e., use Stata to manage
large amounts of data from different sources and to generate
"reports"). Sometimes we do this in conjunction with an actual
database, sometimes not. It depends entirely on the application.
With a good database and adequate programming skills, you can do
anything as far as data management goes. Thus, those who are most
comfortable working this way will always advocate a database-centered
solution. However, this introduces a certain amount of overhead
which may not always be necessary; moreover, in my experience,
relatively few people are proficient enough with SQL and/or a
programming language that can interface with their database to
use this strategy effectively. I've certainly seen plenty
of examples where someone thought they needed a database, and, once
they had one, couldn't manage to do what they needed to do with the
data. A good object-relational mapper (e.g., SQLAlchemy) can help
with this, but only if you are already comfortable working in another
programming language (e.g., SQLAlchemy is a toolkit for Python).
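
To give a feel for the overhead involved, here is a minimal sketch of a
database-centered "report", assuming Python's built-in sqlite3 module
rather than SQLAlchemy. The table name (visits), its columns, and the
data are all hypothetical, made up purely for illustration.

```python
import sqlite3

# Hypothetical schema and data, invented for illustration only.
conn = sqlite3.connect(":memory:")  # in-memory; pass a file path to persist
conn.execute("CREATE TABLE visits (site TEXT, patient_id INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO visits (site, patient_id, score) VALUES (?, ?, ?)",
    [("A", 1, 2.0), ("A", 2, 4.0), ("B", 3, 5.0)],
)
conn.commit()

# A simple per-site "report": row count and mean score for each site.
report = conn.execute(
    "SELECT site, COUNT(*), AVG(score) FROM visits GROUP BY site ORDER BY site"
).fetchall()

for site, n, mean_score in report:
    print(f"{site}: n={n}, mean={mean_score:.2f}")

conn.close()
```

For comparison, with the same data already in memory in Stata, roughly
the same summary could be produced with something like
"collapse (count) n=score (mean) mean_score=score, by(site)". The
point is only that the database route adds a schema, a connection, and
a second language to maintain.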
As you know, a .dta file is not a database, nor, for that matter, is
an Excel file (actually, I cringe whenever I hear of someone using
Excel for data because of its penchant for auto-formatting and the
inability to version or diff files). Whether or not you need a
database depends on things like the following:
1) do you need to provide distributed, real-time access (e.g., over
the web) to the data?
2) do you need to integrate the data into a larger application or
workflow?
3) do you need to provide concurrent access (especially write access)?
4) do you have a complicated data model which would benefit from a
relational or object-oriented design?
5) do you need to store things that would be difficult (or
impossible) to store in Stata (e.g., Unicode strings, graphics files,
etc.)?
6) are you working with *very* large amounts of data?
If your answer to any of these questions is yes, then it's likely
that you should be using a database. However, there are lots of data
management applications (especially in the scientific community where
I work) that don't meet these criteria, and for these a strictly
Stata-based system is often very effective.
Unfortunately I am swamped right now with other projects, but if you
want, contact me off list and I might be able to provide a bit more
help.
-- Phil