[SLL] limit on number of files in a directory and hashed dir vs. flat dir file access time
Chuck Wolber
chuckw at quantumlinux.com
Thu Nov 1 20:19:33 PDT 2007
On Thu, 1 Nov 2007, Ana wrote:
> On Wed, Oct 31, 2007 at 10:13:55PM -0800, Chuck Wolber wrote:
>
> > It's a self consistency issue. Stuff like drive caching, etc cancels
> > out. All we care about is the slope of the graph, not the actual
> > values on the graph. My hypothesis is that the slope is near zero as
> > size increases.
>
> forgive me for being a little slow. I want to make sure I understand.
> By "slope is near zero as size increases", I think you mean the file
> "seek-and-open" time (or whatever operating time) will not increase or
> will increase very little as the number of files in a directory
> increases. Is that right? ... "O(1)", or nearly so.
Yes, except for the O(1) part. We don't care about the internal algorithm
(which is what the O(1) part implies). It could be O(n^exp(n^n)) for all I
care. Each directory entry is discrete so a slope of 0 implies that access
time for a given directory entry is the same regardless of the entry and
regardless of how many sibling entries exist with it.
> I think that the slope would have to be non-zero, no matter what
> indexing method is used, because, in my experience/understanding, none
> are perfect. hehe. actually, wouldn't perfect efficiency, even when
> talking about computer science, violate thermal dynamics? ;)
Perfection has nothing to do with efficiency. Get a good physics book and
read all about entropy if you want to understand why.
The slope would most definitely be zero unless there is an algorithmic
reason why the number of entities affects the speed of open/close
operations on individual files. Perhaps there would be bumps here and
there in the graph, but a statistically significant distribution would
cancel all of that out and give you a true idea of the algorithmic
relationship between open/close rate and directory contents.
Not to get too existential here, but 100% efficiency would not violate the
laws of thermodynamics. The universe is 100% efficient. If it wasn't,
energy would "leak" out into some parallel universe. There are some
religious implications there, but I won't get into them. If anyone's
interested, contact me offlist and we can start a discussion list and see
where it goes.
> I seem to remember reading, a long time, that ext2 places the data on
> the actual disc at random positions, in order to avoid fragmentation.
> I'm not sure, but it seems to me this randomness would indeed cancel out
> drive caching, etc.
I'd love to see a reference to that. It sounds a bit dubious to me.
> > echo "Hello World" > $file
>
> Software would be a red herrring, when it comes to testing ext3
> efficiency. What I'm talking about though is real-world applications
> where actual programs are not perfect and programmer time is not cheap.
You're missing the point. Even if you had the worst programmer in the
world, if the test showed a zero slope then the file access time on the
first file would be the same as the file access time on the nth file.
> As you know, in "real world" settings, you usually have to make do with
> the software at hand. Even if you have perfectly efficient file system,
> if what you're forced to work with is something like "rm" that, for
> whatever reason, insists on looking at every single entry in the target
> directory before unlink()ing the named file, it might be best to keep
> your directories as small as possible.
That's just arguing for the sake of speculation. Show me a real life
example of such a situation and I'll show you a trivial code patch that
will remedy the situation.
> Testing and theory is fun for me, but I don't make a living in R&D (I
> wish).
I do. I also deliver software into the exceedingly pragmatic aviation
world.
> I make a living creating "we need it now" products that leverage
> existing software. I cannot forget my pragmatism. :)
If you're using tools that do absurdly stupid things like parse an entire
directory just to open or delete one file, then you're using the wrong
tools. Or like a friend once said, "if it hurts, you're doing it wrong" :)
..Chuck..
--
http://www.quantumlinux.com
Quantum Linux Laboratories, LLC.
ACCELERATING Business with Open Technology
"Stay Hungry. Stay Foolish."
-The Whole Earth Catalog
More information about the linux-list
mailing list