[SLL] limit on number of files in a directory and hashed dir vs. flat dir file access time

Chuck Wolber chuckw at quantumlinux.com
Thu Nov 1 20:19:33 PDT 2007


On Thu, 1 Nov 2007, Ana wrote:

> On Wed, Oct 31, 2007 at 10:13:55PM -0800, Chuck Wolber wrote:
>
> > It's a self consistency issue. Stuff like drive caching, etc cancels 
> > out. All we care about is the slope of the graph, not the actual 
> > values on the graph. My hypothesis is that the slope is near zero as 
> > size increases.
> 
> forgive me for being a little slow.  I want to make sure I understand. 
> By "slope is near zero as size increases", I think you mean the file 
> "seek-and-open" time (or whatever operating time) will not increase or 
> will increase very little as the number of files in a directory 
> increases.  Is that right?  ... "O(1)", or nearly so.

Yes, except for the O(1) part. We don't care about the internal algorithm 
(which is what the O(1) part implies). It could be O(n^exp(n^n)) for all I 
care. Each directory entry is discrete so a slope of 0 implies that access 
time for a given directory entry is the same regardless of the entry and 
regardless of how many sibling entries exist with it.


> I think that the slope would have to be non-zero, no matter what 
> indexing method is used, because, in my experience/understanding, none 
> are perfect.  hehe.  actually, wouldn't perfect efficiency, even when 
> talking about computer science, violate thermal dynamics?  ;)

Perfection has nothing to do with efficiency. Get a good physics book and 
read all about entropy if you want to understand why.

The slope would most definitely be zero unless there is an algorithmic 
reason why the number of entities affects the speed of open/close 
operations on individual files. Perhaps there would be bumps here and 
there in the graph, but a statistically significant distribution would 
cancel all of that out and give you a true idea of the algorithmic 
relationship between open/close rate and directory contents.

Not to get too existential here, but 100% efficiency would not violate the 
laws of thermodynamics. The universe is 100% efficient. If it wasn't, 
energy would "leak" out into some parallel universe. There are some 
religious implications there, but I won't get into them. If anyone's 
interested, contact me offlist and we can start a discussion list and see 
where it goes.


> I seem to remember reading, a long time, that ext2 places the data on 
> the actual disc at random positions, in order to avoid fragmentation. 
> I'm not sure, but it seems to me this randomness would indeed cancel out 
> drive caching, etc.

I'd love to see a reference to that. It sounds a bit dubious to me.


> > echo "Hello World" > $file
> 
> Software would be a red herrring, when it comes to testing ext3 
> efficiency.  What I'm talking about though is real-world applications 
> where actual programs are not perfect and programmer time is not cheap. 

You're missing the point. Even if you had the worst programmer in the 
world, if the test showed a zero slope then the file access time on the 
first file would be the same as the file access time on the nth file.


> As you know, in "real world" settings, you usually have to make do with 
> the software at hand.  Even if you have perfectly efficient file system, 
> if what you're forced to work with is something like "rm" that, for 
> whatever reason, insists on looking at every single entry in the target 
> directory before unlink()ing the named file, it might be best to keep 
> your directories as small as possible.

That's just arguing for the sake of speculation. Show me a real life 
example of such a situation and I'll show you a trivial code patch that 
will remedy the situation.


> Testing and theory is fun for me, but I don't make a living in R&D (I 
> wish).

I do. I also deliver software into the exceedingly pragmatic aviation 
world.


> I make a living creating "we need it now" products that leverage 
> existing software.  I cannot forget my pragmatism.  :)

If you're using tools that do absurdly stupid things like parse an entire 
directory just to open or delete one file, then you're using the wrong 
tools. Or like a friend once said, "if it hurts, you're doing it wrong" :)

..Chuck..

-- 
http://www.quantumlinux.com
 Quantum Linux Laboratories, LLC.
 ACCELERATING Business with Open Technology

"Stay Hungry. Stay Foolish."
	-The Whole Earth Catalog


More information about the linux-list mailing list