Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Roberto Ferrer <refp16@gmail.com>
To: Stata Help <statalist@hsphsun2.harvard.edu>
Subject: Re: st: -project- and big data files slowing down -build-
Date: Mon, 29 Jul 2013 20:48:24 +0100

Thank you Robert. It is much clearer now.

I will keep working with -project- and post any comments/doubts.

Regards,
Roberto

On Wed, Jul 24, 2013 at 4:53 PM, Robert Picard <picard@netbox.com> wrote:
> Glad to hear that commenting out the dependency works for you. There
> is indeed a cost in doing so in that the file will not appear in the
> various listings and will be swept away if you use the -cleanup-
> option to remove the files that are not part of the project from the
> project directory.
>
> With respect to the -checksum- calls, the first thing to understand is
> that -checksum- is called only once per build per file. So if your big
> file is used in 20 different do-files, the -checksum- is only computed
> once. In terms of order of checking for changes, as I explained
> earlier, it is done from the smallest to the largest file. This
> process happens in the order do-files are executed (or skipped). The
> master do-file inherits the dependencies of all do-files in the
> project. Therefore any change to any file in the project will
> immediately trigger its execution. When -project- encounters the next
> -project, do()- statement, it checks the do-file's dependencies to
> determine if it has to be run again. As with the master do-file, any
> do-file inherits the dependencies of do-files nested within it and, as
> before, the check is done from smallest to largest files. Any checksum
> already computed since the start of the build is reused. And so on...
>
> So there is certainly overhead in using -project- that you would not
> have otherwise. When working with large files, this overhead remains
> low compared to all the other tasks (analysis and estimation) you are
> likely to do. Some projects I work with take more than 24 hours to
> perform a full replication build. They do include gigabyte-size files
> and yet performing a build on them can take me to any do-file I'm
> working on (just edited) in a few seconds at most.
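
The checking order described above (one checksum per file per build, smallest files first, stop at the first change) can be sketched roughly as follows. This is only an illustration in Python, not the actual -project- source: the function names are hypothetical, SHA-256 stands in for Stata's -checksum-, and the "last build" record is assumed to be a plain dictionary of stored checksums.

```python
import hashlib
import os
import tempfile


def file_checksum(path, cache):
    """Return a checksum for path, computing it at most once per build.

    `cache` is a dict that lives for the duration of one build
    (hypothetical stand-in for whatever -project- keeps internally).
    """
    if path not in cache:
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        cache[path] = digest.hexdigest()
    return cache[path]


def needs_rerun(dependencies, last_build, cache):
    """Check dependencies from smallest to largest file.

    Returns True as soon as one checksum differs from the value stored
    at the last build, so larger files are not read unnecessarily.
    """
    for path in sorted(dependencies, key=os.path.getsize):
        if file_checksum(path, cache) != last_build.get(path):
            return True
    return False


def demo():
    """Small do-file plus a larger data file; edit only the do-file."""
    with tempfile.TemporaryDirectory() as root:
        small = os.path.join(root, "analysis.do")
        big = os.path.join(root, "big.dta")
        with open(small, "w") as fh:
            fh.write("* v1\n")
        with open(big, "wb") as fh:
            fh.write(b"\0" * 1_000_000)
        last_build = {p: file_checksum(p, {}) for p in (small, big)}

        # Nothing changed: every file, including the big one, gets hashed.
        cache = {}
        unchanged = needs_rerun([big, small], last_build, cache)
        hashed_when_unchanged = len(cache)

        # Edit the do-file: the change is found before the big file is read.
        with open(small, "w") as fh:
            fh.write("* v2\n")
        cache = {}
        changed = needs_rerun([big, small], last_build, cache)
        big_was_hashed = big in cache
        return unchanged, hashed_when_unchanged, changed, big_was_hashed
```

In this sketch, the slow case is the build with no changes, because every dependency has to be hashed to confirm that nothing changed. Passing the same `cache` to the check for every do-file in a build is what keeps a file shared by many do-files from being hashed more than once.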
>
>
> On Wed, Jul 24, 2013 at 10:40 AM, Roberto Ferrer <refp16@gmail.com> wrote:
>> Thank you for your reply, Robert.
>>
>> Let me see if I understand correctly:
>> -project- checks if any do-file was modified. If it finds one, then it
>> runs that do-file and that of any dependency. In this case, there is
>> no need to -checksum- data files associated with the do-files. But if
>> no do-file was modified, then it goes on with the -checksum- of every
>> file in the project to verify this.
>>
>> I'm still confused about the order of the file checking. Does it check
>> all do-files and then all data files? Does it check only considering
>> size and independently of type? Or does it do something else?
>>
>> Probably the best way to figure all this stuff out is looking at
>> the source file for -project- but it will probably take me some time
>> to understand. I plan on doing so in the future.
>>
>> As for the -ignore_chsum- option I mentioned in my previous email, I
>> still think it would be useful. It can be a bit dangerous but it gives
>> the user more power. It gives the opportunity to include a file in the
>> build process although the file is not expected to change (e.g.
>> because it is in a write-protected directory). It would be good for
>> documentation purposes since -project, list(build)- no longer mentions
>> the file if we comment out the -project, original()- and it would keep
>> everything in a single project. I would just flag it somehow (a *
>> maybe) in the listing output.
>>
>> As to my solution for now, I'll follow your advice and do exactly
>> that: comment out -project, original()- for the big file. I'd rather
>> do that than create a new project for a simple do-file.
>>
>> Thanks,
>> Roberto
>>
>> On Wed, Jul 24, 2013 at 3:23 AM, Robert Picard <picard@netbox.com> wrote:
>>> There are a lot of things to think about when dealing with big datasets.
>>> -project- indeed uses -checksum- to check for changes in dependencies.
>>> It tries to be smart about it by checking files in increasing order of
>>> file size. You are correct however that if there are no changes in any
>>> file, -project- will have to run a -checksum- on your large file to
>>> confirm that. But I can assure you that if any of the do-files have
>>> changed, your master do-file will start running before you can blink.
>>>
>>> Personally, I would not tolerate such a slowdown either. The simplest
>>> solution is to not declare a dependency for this large dataset. You
>>> don't need an -ignore_chsum- option; just comment out the -project,
>>> original()- statement. The downside is that -project- won't be able to
>>> notice changes in the large dataset and react accordingly. It kind of
>>> defeats the point of -project- but could perhaps be worth it in your
>>> specific case.
>>>
>>> If all you are doing is extracting some data from this large dataset
>>> and working on a small subset, you could also split that preliminary
>>> step into a separate project. You bite the bullet on a project that
>>> manages the large file(s) and create smaller files that are processed
>>> in separate projects. My biggest project handles 10GB of files, mostly
>>> large raw text files that are input and converted to a smaller Stata
>>> dataset (2GB).
>>>
>>> If it is at all possible, you should consider investing in a faster
>>> system. Your description matches the performance of a computer with a
>>> regular hard disk. No matter how smartly you organize your work,
>>> loading such a large dataset will take 20 plus seconds. A fast SSD
>>> will greatly improve that load time. More RAM can also be helpful as
>>> modern operating systems will cache I/O in RAM. It won't help for the
>>> first load from disk but the second time will be much faster. On my
>>> system, a dataset your size takes about 2 seconds to load, half a
>>> second the second time around.
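
To get a rough feel for how much of the delay is disk I/O versus caching, one could time two consecutive full reads of the same file. The sketch below is an assumption-laden illustration in Python rather than Stata: the function names are hypothetical, CRC-32 is used only as a cheap stand-in for -checksum-, and the second read is faster only if the operating system actually kept the file in its cache (the first read may itself already be served from cache).

```python
import time
import zlib


def timed_checksum(path, chunk_size=1 << 20):
    """Read path in chunks; return (crc32, elapsed_seconds)."""
    start = time.perf_counter()
    crc = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            # Fold each chunk into the running CRC-32 value.
            crc = zlib.crc32(chunk, crc)
    return crc, time.perf_counter() - start


def compare_reads(path):
    """Time two back-to-back reads of the same file."""
    first_crc, first_s = timed_checksum(path)
    second_crc, second_s = timed_checksum(path)
    return {
        "same_checksum": first_crc == second_crc,
        "first_s": first_s,
        "second_s": second_s,
    }
```

Calling `compare_reads` on the large dataset right after a reboot, and again later, gives a crude comparison of cold and warm reads on a given machine. The numbers depend entirely on the disk, the amount of free RAM, and what else is running.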
>>>
>>>
>>> On Tue, Jul 23, 2013 at 8:57 PM, Roberto Ferrer <refp16@gmail.com> wrote:
>>>> User-written package -project- by Robert Picard and installed using:
>>>> net from http://robertpicard.com/stata
>>>>
>>>> What is the recommended course of action if I have a big data file
>>>> that seems to slow down the project (re)build?
>>>>
>>>> According to the help file:
>>>>
>>>> "The do(do_filename) build directive will not run do_filename if the
>>>> do-file has not changed and all files linked to it have not changed
>>>> since the last build."
>>>>
>>>> So I imagine there's a -checksum- slowing down the build even if no
>>>> files change. I'm thinking of some option that would tell the build
>>>> process to ignore this specific file. This file is the first input in
>>>> the whole sequence (an -original-) and I'm sure it cannot change since
>>>> it is in a write-protected directory.
>>>>
>>>> I suppose I can take this step out of the build and modify the
>>>> corresponding files. At the end of the project, I could stick it back
>>>> in. But a build directive like
>>>>
>>>> project, original(dta_filename) ignore_chsum
>>>>
>>>> would be nice.
>>>>
>>>> The data file is 1.4GB in size and a build with no changes is taking
>>>> around 30 seconds. I did an isolated -checksum- on the file and it
>>>> took over 24 seconds. Other than that one, I have only a few (38
>>>> linked) small (<2 MB) files.
>>>>
>>>> Thanks,
>>>> Roberto
>>>> *
>>>> * For searches and help try:
>>>> * http://www.stata.com/help.cgi?search
>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> * http://www.ats.ucla.edu/stat/stata/