During the Fall 2004 semester I was involved with a project for my Digital Forensics class that entailed performing static and running analyses on both production and synthetic file systems. The goal of the project was to assemble some conclusions on file-overwrite characteristics in common usage scenarios and develop a model for the likelihood of finding some portion of a long-deleted file after varying amounts of disk access.
In order to simulate file deletions, though, we needed a quick model of how files were distributed across systems. I did an analysis of the hard drives on the eight systems I had lying around, which mostly consisted of doing a full system directory dump and looking at the results in Matlab.
It quickly became obvious that the distribution of file sizes across these systems was well described by a log-normal curve; that is, the logarithm of the file size was normally distributed.
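For anyone who wants to try this on their own machine, here is a minimal sketch of the fit in Python rather than the Matlab I actually used. The function names `collect_file_sizes` and `fit_lognormal` are made up for illustration, and I'm assuming natural logs, that zero-byte files are skipped (their logarithm is undefined), and that the population standard deviation is good enough.

```python
import math
import os
import statistics


def collect_file_sizes(root):
    """Walk `root` and return the size in bytes of every regular file found."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks so the same file is not counted twice.
                if os.path.isfile(path) and not os.path.islink(path):
                    sizes.append(os.path.getsize(path))
            except OSError:
                # Unreadable or vanished files are ignored in this sketch.
                continue
    return sizes


def fit_lognormal(sizes):
    """Return (mu, sigma): the mean and std. dev. of ln(size).

    Sizes that are zero or negative are dropped before taking logs.
    """
    logs = [math.log(s) for s in sizes if s > 0]
    if not logs:
        raise ValueError("need at least one positive file size")
    return statistics.mean(logs), statistics.pstdev(logs)
```

Something like `mu, sigma = fit_lognormal(collect_file_sizes("."))` gives the two parameters for the current directory tree. A histogram of the logged sizes is a quick sanity check on whether the bell shape actually shows up.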
This model works extraordinarily well for nearly all cases and doesn't have ridiculous implications for either large or small files. Its only shortcoming is that it underrepresents the importance of exceedingly large files, which generally dominate space usage on a machine. Aside from fixed-cost, high-overhead files (Windows swap files, etc.), there's no obviously consistent way to model these files.
Those of you who are familiar with this story, however, know that my teammate and good friend Rob ran across a paper a week or two ago which detailed the results of a nearly identical analysis performed sometime in 1999.
The results are nearly identical, as well. I ended up calculating a distribution with mean of 8.53 and std.dev of 2.54, and they report a mean of 8.45 and std.dev of 2.35 (although theirs are reported in base-2, whereas these numbers are base-e). The difference in the means is around 1%, for those of you who are counting. Their curves were significantly narrower, and I attribute this to the fact that my statistics were gathered from my file server, which has an unusual weighting of ~3MB and Very Large files.
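For those of you who really are counting, the arithmetic is quick to check. This is just a sketch using the four numbers quoted above, taken at face value; `relative_difference` is a helper name I made up.

```python
def relative_difference(mine, theirs):
    """Absolute difference between two values as a fraction of `theirs`."""
    return abs(mine - theirs) / theirs


# Parameters quoted above: mine first, then the paper's.
mean_diff = relative_difference(8.53, 8.45)
std_diff = relative_difference(2.54, 2.35)

print(f"mean differs by {mean_diff:.1%}, std. dev by {std_diff:.1%}")
# prints "mean differs by 0.9%, std. dev by 8.1%"
```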
What cracks me up about this is that their analysis involved some 10,568 systems and mine involved 8.
I find it pretty funny, too, that the paper was written by a couple of guys at Microsoft Research.
- John R. Douceur, William J. Bolosky, A Large-Scale Study of File-System Contents, ACM SIGMETRICS Performance Evaluation Review 27.1, 1999
- L. Carothers, D. Driscoll, R. Erbes, J. Kearney, The Analysis and Classification of Deleted-File Overwrite Characteristics in Common Usage Scenarios, Proceedings of the CS491/589 Projects Presented on an Overcast December Afternoon, 2004
P.S.
The title of this post comes from some much older work that was referenced in the paper by Douceur and Bolosky. Apparently, early models fit to this type of data were all hyperexponential. The fact that a hyperexponential distribution doesn't accurately account for small files did somewhat motivate my earlier comment that log-normal does. In fact, on large filesystems, a hyperexponential distribution would imply that a ridiculously large number of small files exist, likely clogging whatever mechanism happens to be installed to keep files where they should be.
