This page announced updates in Stata 11. See a complete overview of all of Stata's data management features.

What’s new in data management

  • Existing command merge has an all-new syntax. It is easier to use and easier to read, and it makes mistakes less likely. Merges are classified as 1:1, 1:m, m:1, and m:m. Typing merge 1:1 says that you expect the observations to match one-to-one; merge 1:m specifies a one-to-many merge; merge m:1, a many-to-one merge; and merge m:m, a many-to-many merge. New options assert() and keep() let you specify what you expect the outcome to be and what you want to keep from it. For instance,
        . merge 1:1 subjid using filename, assert(match)
    
    means that you expect all the observations in both datasets to match each other, whereas
        . merge 1:1 subjid using filename, assert(match using) keep(match)
    
    specifies that you expect each observation to either match or be solely from the using data and, assuming that is true, you want to keep only the matches.

    Sorting of both the master and the using datasets is now automatic.

    The new merge does not support merging multiple files in one step. Merge the first two datasets, then merge that result with the next dataset, and so on.

    merge now aborts with an error if a variable is string in one dataset and numeric in the other, unless new option force is specified.

    The old merge syntax continues to work.
  • Existing command append has several new features: 1) it works even if there are no data in memory; 2) multiple files can be appended in one step; and 3) new option generate(newvar) creates a variable indicating the source of the observations, numbered 0, 1, and so on. append now aborts with an error if a variable is string in one dataset and numeric in the other, unless new option force is specified. Old behavior is preserved under version control.
  • Stata’s default memory allocations have changed:

    • Stata/SE and Stata/MP now default to allocating 50M of memory rather than 10M. Stata/IC now defaults to 10M rather than 1M. Stata’s required footprint has not grown; we reset these defaults because users were resetting to larger numbers anyway.
    • Stata/IC now defaults matsize to 400 rather than 200; the default for Stata/SE and Stata/MP remains 400. The default for Small Stata is now 100 rather than 40.
  • Existing command order now does what order, move, and aorder did. Old commands aorder and move continue to work but are no longer documented.
  • New commands zipfile and unzipfile compress and uncompress files and directories in zip archive format.
  • New command changeeol converts text from one end-of-line format to another. Stata does not care about end-of-line format, but some editors and other programs do.
  • New command snapshot saves copies of the data in memory to disk and restores them from disk. snapshot is used by the new Data Editor. An important feature of the Data Editor is that it can log all the changes you make interactively, and snapshot will show up in those logs. Because snapshot is a genuine Stata command, you can replay logs to duplicate past efforts. For your own use, however, it is better to continue using preserve and restore.
  • You can now copy-and-paste commands from logs and execute them without editing out the period (the dot prompt) in front! Stata 11 ignores leading periods.
  • Existing command notes has new options search, replace, and renumber.
  • Concerning value labels:

    • Existing command label define has new option replace so that you do not have to drop the value label before redefining it.
    • New command label copy copies value labels.
    • Existing command label values now allows a varlist, so you can label (or unlabel) a group of variables at the same time.
  • Existing command expand has new option generate(newvar) that makes it easier to distinguish original from duplicated observations.
  • Concerning egen:

    • New function rowmedian(varlist) returns, observation by observation, the median of the values in varlist.
    • New function rowpctile(varlist), p(#) returns, observation by observation, the #th row percentile of the values within varlist.
    • Existing function mode(varname) with option missing treats missing values as a category. When version is set to 10 or less, missing does not treat missing as a category.
    • Existing functions total(exp) and rowtotal(varlist) have new option missing. If all values of exp or varlist for an observation are missing, then that observation in newvar will be set to missing.
  • Existing command copy now allows copying a file to a directory without having to type the filename twice.
  • Existing command clear now allows clear matrix to clear all Stata matrices (as distinguished from Mata matrices) from memory.
  • Existing command outfile now exports date variables as strings rather than their underlying numeric values. Under version control, old behavior is restored.
  • Existing command reshape now preserves variable and value labels when converting from long to wide and restores variable and value labels when converting from wide to long. Thus the value and variable labels for the i variable, which exists in long form and not in wide form, are restored when converting back from wide to long. The value labels of the xij variables are similarly restored. Prior behavior is preserved when version is 10 or earlier.
  • Existing command collapse now allows new statistics semean, sebinomial, and sepoisson for obtaining the standard error of the mean.
  • Existing command destring has new option dpcomma, which converts string representations of numbers that use a comma as the decimal point to numeric form.
  • Concerning existing command odbc:

    • odbc insert now uses parameterized inserts, which are faster.
    • The dialogs for odbc load and odbc insert can now store a data source user ID and password for a Stata session.
    • odbc query has new options verbose and schema. verbose lists any data source alias, nickname, typed table, typed view, and view along with tables so that data from these table types can be loaded. schema lists schema names with the table names if the data source returns schema information.
    • odbc insert has a new dialog.
    • Existing option dsn() now allows the data source to be up to 499 characters.
    • odbc now reports driver errors directly. Previously, odbc would issue the error “ODBC error; type set debug on and rerun command to see extended error information” when an ODBC driver issued an error.
    • For security reasons, odbc with set debug on no longer displays the data source name, user ID, and password used for connecting to your data source.
  • New function strtoname() converts a general string to a string meeting Stata’s naming conventions. Also, existing functions lower(), ltrim(), proper(), reverse(), rtrim(), and upper() now have synonyms strlower(), strltrim(), ..., and strupper(). Both sets of names work equally well.
  • New function soundex() returns the soundex code for a name, consisting of a letter followed by three digits. New function soundex_nara() returns the U.S. Census soundex for a name, also consisting of a letter followed by three digits, but produced by a different algorithm.
  • New functions sinh(), cosh(), asinh(), and acosh() join existing functions tanh() and atanh() to provide the hyperbolic functions.
  • New functions binomialp(); hypergeometric() and hypergeometricp(); nbinomial(), nbinomialp(), and nbinomialtail(); and poisson(), poissonp(), and poissontail() provide distribution and probability mass for the binomial, hypergeometric, negative binomial, and Poisson distributions.
  • New functions invnbinomial() and invnbinomialtail(), and invpoisson() and invpoissontail() provide inverses for the negative binomial and Poisson distributions.
  • Algorithms for the existing functions normal() and lnnormal() have been improved: they now run in 60% and 75% of the time previously required, respectively, while giving equivalent double-precision results.
  • New functions rbeta(), rbinomial(), rchi2(), rgamma(), rhypergeometric(), rnbinomial(), rnormal(), rpoisson(), and rt() produce random variates for the β, binomial, χ², γ, hypergeometric, negative binomial, normal, Poisson, and Student’s t distributions, respectively.

    Old function uniform() has been renamed to runiform(), but uniform() continues to work.

    All random-variate functions start with r.

  • Existing command drawnorm now uses new function rnormal() to generate random variates. When version is set to 10 or earlier, drawnorm reverts to using invnormal(uniform()).
  • Existing command describe now respects the width of the Results window when formatting output.
  • Existing command renpfix now returns the list of variables changed in r(varlist).
  • Previously existing command impute still works but is now undocumented. It is replaced by the new multiple-imputation command mi.
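
As a rough illustration of the new merge and append syntax described in the list above, the following do-file sketch builds three small datasets and combines them. The data values, the tempfile names (demog, weights, heights), and every variable name other than subjid are invented for the example.

```stata
* Hypothetical example data; only the merge and append syntax comes from the text
clear
input subjid age
1 34
2 51
3 47
end
tempfile demog
save "`demog'"

clear
input subjid weight
1 70
2 82
3 65
4 90
end
tempfile weights
save "`weights'"

clear
input subjid height
1 172
2 180
3 165
end
tempfile heights
save "`heights'"

* 1:1 merge: each observation must either match or come only from the
* using data; only the matches are kept. No sorting is needed beforehand.
use "`demog'", clear
merge 1:1 subjid using "`weights'", assert(match using) keep(match)

* Several files are merged one step at a time.
* _merge from the previous step is dropped before merging again.
drop _merge
merge 1:1 subjid using "`heights'", assert(match)
drop _merge

* append works with no data in memory and accepts several files;
* generate() records which file each observation came from.
clear
append using "`weights'" "`heights'", generate(source)
tabulate source
```

If an assert() condition is not met, merge stops with an error rather than continuing silently, which is the point of stating your expectations up front.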
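
The value-label and egen changes listed above can be sketched on a small made-up dataset. The variable names (q1, q2, q3), the label name agree, and the label text are assumptions chosen only for illustration.

```stata
* Hypothetical survey items with some missing values
clear
input id q1 q2 q3
1 1 2 3
2 2 . 3
3 . . .
end

* replace redefines an existing value label without dropping it first
label define agree 1 "Disagree" 2 "Neutral" 3 "Agree"
label define agree 1 "Disagree" 2 "Neither" 3 "Agree", replace

* label copy duplicates a value label; label values accepts a varlist
label copy agree agree_backup
label values q1 q2 q3 agree

* Row-wise median and 75th percentile across q1-q3
egen med = rowmedian(q1 q2 q3)
egen p75 = rowpctile(q1 q2 q3), p(75)

* Without missing, the all-missing row (id 3) totals to 0;
* with missing, it is set to missing instead
egen tot   = rowtotal(q1 q2 q3)
egen tot_m = rowtotal(q1 q2 q3), missing

list
```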
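
A short sketch of the new destring, expand, and collapse options mentioned above, assuming a hypothetical string variable price_str that stores prices with a comma as the decimal point:

```stata
* Hypothetical data: prices stored as strings with decimal commas
clear
input str8 price_str group
"1,5"  1
"2,25" 1
"3,0"  2
"4,75" 2
end

* dpcomma reads "1,5" as the number 1.5
destring price_str, generate(price) dpcomma

* Duplicate every observation once; dup is 0 for originals, 1 for copies
expand 2, generate(dup)
tabulate dup
drop if dup == 1
drop dup

* Group means together with the standard error of the mean
collapse (mean) mean_price=price (semean) se_price=price, by(group)
list
```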
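
The new random-variate, string, and distribution functions listed above can be tried as follows. This is a minimal sketch: the seed, sample size, distribution parameters, and example strings are arbitrary choices, not values from the announcement.

```stata
* Arbitrary seed and sample size, chosen only for reproducibility
clear
set obs 1000
set seed 12345

* Random-variate functions all start with r
generate u = runiform()          // uniform on the unit interval
generate z = rnormal()           // standard normal
generate x = rnormal(10, 2)      // normal with mean 10, std. dev. 2
generate k = rpoisson(3)         // Poisson with mean 3
generate b = rbinomial(10, 0.3)  // binomial, 10 trials, p = 0.3
summarize u z x k b

* String functions: soundex codes, name conversion, new synonyms
display soundex("Robert")
display soundex_nara("Ashcraft")
display strtoname("1st var")
display strupper("stata")

* Hyperbolic and probability mass functions
display asinh(1)
display poissonp(3, 2)           // Pr(K = 2) when the mean is 3
display binomialp(10, 3, 0.3)    // Pr(K = 3) in 10 trials with p = 0.3
```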