Stata | FAQ: Approximating the size of a dataset


How big will my dataset be?

Title		Approximating the size of a dataset
Author		William Gould, StataCorp

A back-of-the-envelope calculation for the size of a dataset is

                                       N*V*W + 4*N
        number of megabytes  =  M  =  --------------
                                          1024²

where

        N  =  number of observations
        V  =  number of variables
        W  =  average width in bytes of a variable

In approximating W, remember

        +-------------------------------------------------------------+
        | Type of variable                               Width        |
        |-------------------------------------------------------------|
        | Integers,     -127 <= x <=           100          1         |
        |            -32,767 <= x <=        32,740          2         |
        |     -2,147,483,647 <= x <= 2,147,483,620          4         |
        | Floats,                                                     |
        |       single precision (default)                  4         |
        |       double precision                            8         |
        | Strings                                      maximum length |
        +-------------------------------------------------------------+
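
As an illustration of how the integer rows of the table might be applied, here is a minimal Python sketch that picks the narrowest integer width able to hold a given range of values. The function name integer_width and the list INTEGER_TYPES are hypothetical names chosen for this example; only the ranges and widths come from the table.

```python
# Integer ranges and widths in bytes, copied from the table above.
INTEGER_TYPES = [
    (-127, 100, 1),
    (-32_767, 32_740, 2),
    (-2_147_483_647, 2_147_483_620, 4),
]

def integer_width(lo, hi):
    """Return the narrowest width in bytes that holds every integer in [lo, hi].

    Returns None when the range does not fit any integer row of the table.
    """
    for type_min, type_max, width in INTEGER_TYPES:
        if type_min <= lo and hi <= type_max:
            return width
    return None

print(integer_width(0, 100))            # fits the 1-byte row
print(integer_width(0, 101))            # just past the 1-byte upper limit
print(integer_width(-40_000, 40_000))   # needs the 4-byte row
```

Note that the upper limits are slightly below the usual powers of two (100 rather than 127, for example), so a value such as 101 already requires 2 bytes.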

Say that you have a 20,000-observation dataset. That dataset contains

         1  string identifier of length 20               20
        10  small integers (1 byte each)                 10
         4  standard integers (2 bytes each)              8
         5  floating-point numbers (4 bytes each)        20
        -----------------------------------------------------
        20  variables total                              58

Thus the average width of a variable is W = 58/20 = 2.9 bytes.

The size of your dataset is

                                       N*V*W + 4*N
        number of megabytes  =  M  =  --------------
                                          1024²
        
                                      20000*20*2.9 + 4*20000
                                   =  ----------------------
                                              1024²
        
                                   =  1.18 megabytes
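
The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula above, not a Stata feature; the function name dataset_megabytes and the variable names are assumptions made for the example.

```python
BYTES_PER_MB = 1024 ** 2

def dataset_megabytes(n_obs, widths):
    """Approximate dataset size in megabytes: (N*V*W + 4*N) / 1024^2.

    widths holds one width in bytes per variable, so sum(widths) equals V*W.
    """
    bytes_per_obs = sum(widths)      # V * W
    overhead = 4 * n_obs             # the 4*N term of the formula
    return (n_obs * bytes_per_obs + overhead) / BYTES_PER_MB

# The example dataset: 1 string of length 20, 10 one-byte integers,
# 4 two-byte integers, and 5 four-byte floats.
widths = [20] + [1] * 10 + [2] * 4 + [4] * 5

size = dataset_megabytes(20_000, widths)
print(f"V = {len(widths)}, W = {sum(widths) / len(widths):.1f} bytes")
print(f"approximately {size:.2f} megabytes")   # prints 1.18
```

Passing the individual widths rather than the average W avoids a rounding step, since only the product V*W enters the formula.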

This result slightly understates the size of the dataset because it does not include any variable labels, value labels, or notes that you might add to the data. Those additions do not amount to much. For instance, imagine that you added variable labels to all 20 variables and that the average length of the label text was 22 characters. That would amount to a total of 20*22 = 440 bytes, or 440/1024² = 0.00042 megabytes.
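
To see how small the label overhead is, the calculation can be repeated as a short Python sketch. It assumes one byte per character of label text, as the paragraph above does; the variable names are hypothetical.

```python
BYTES_PER_MB = 1024 ** 2

n_vars = 20            # variables that receive a label
avg_label_chars = 22   # average label length; one byte per character assumed

label_bytes = n_vars * avg_label_chars
label_mb = label_bytes / BYTES_PER_MB

print(label_bytes)          # 440
print(f"{label_mb:.5f}")    # 0.00042
```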


Explanation of formula

                                       N*V*W + 4*N
        number of megabytes  =  M  =  --------------
                                          1024²

N*V*W is, of course, the total size of the data in bytes. To that, we add 4*N because Stata secretly stores a 4-byte pointer with each observation.

The 1,024² in the denominator rescales the result from bytes to megabytes. Yes, the result is divided by 1,024² even though 1,000² is a million.

Computer memory comes in binary increments. Although we think of k as standing for kilo, in the computer business, k is really a “binary” thousand, 2¹⁰ = 1,024.

A megabyte is a binary million—a binary k squared:

        1 MB = 1024 KB = 1024*1024 = 1,048,576 bytes

With cheap memory, we sometimes talk about a gigabyte. Here is how a binary gig works:

        1 GB = 1024 MB = 1024³ = 1,073,741,824 bytes
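
The binary units can be checked with a short Python sketch, which also shows how much the choice of divisor matters for a byte count of 1,240,000 (the numerator N*V*W + 4*N for a 20,000-observation dataset with 58 bytes per observation). The constant names are assumptions made for the example.

```python
KB = 2 ** 10     # binary thousand
MB = KB ** 2     # binary million
GB = KB ** 3     # binary billion

print(KB, MB, GB)   # 1024 1048576 1073741824

size_bytes = 20_000 * 58 + 4 * 20_000   # 1,240,000 bytes

binary_mb = size_bytes / MB             # divisor used in the formula
decimal_mb = size_bytes / 1000 ** 2     # what a decimal million would give

print(f"{binary_mb:.2f} binary MB vs. {decimal_mb:.2f} decimal MB")
```

The two results differ by about 5 percent, which is why the formula divides by 1024² rather than by a million.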
