OddThinking

A blog for odd things and odd thoughts.

Collection Correction

Is this just me?


Okay, Julian, what’s the next programming task. Ah, yes, deleting the old objects. Should be easy:

def delete(oldobjects):
    for oldobject in oldobjects:
        oldobject.delete()

Hmm… You know the object deletion is going to be slow – it may need to talk to the remote servers holding the objects, and it will get blocked on I/O. Maybe it would have better performance if I created a pool of threads, to delete multiple objects concurrently? Oh well, make a note to look at that if there is a performance issue. If it is fast enough, no need to optimise. YAGNI.

Okay, Julian, what’s the next programming task. Ah, yes, collecting the old objects. Should be easy; I just need to find the right point to insert a line of code:

if object.is_old():
    oldobjects.add???(object)

Hmm…. what is the correct verb here? Do I add? insert? append? To answer that, I guess I have to decide what the data structure of oldobjects is. It is a collection of some sort, that supports iteration, but I really don’t care what sort; it isn’t that important.

I guess I will make it a list then. But, wait – the list data structure implies an ordering. My collection is unordered. I would rather not even suggest there is an ordering, because that will make it seem trickier to add concurrency later.

I guess I will make it a set then. But, wait – the set data structure implies no duplication. My collection is guaranteed to be free of duplications (and even if it wasn’t, object.delete() is idempotent, making it harmless). I don’t want the computer wasting its time enforcing a constraint that I don’t even have.

A more minor point is that set implementations generally involve using hash functions – I see no reason to enforce the constraint that my objects should support hash functions; they probably do, but if I shouldn’t need to think about it. I also shouldn’t need to thinking about whether hash function is efficient.

If I want efficient, I guess I could go with an array, but there is the implied ordering again, plus the concept of an index, which is another piece of semantics I don’t care about.

Bags, a.k.a. multi-sets? I wonder what the constant time of the insert operation is. I wonder what the insert operation is called. I wonder if there even is a built-in library for this, or do I have to go and search the net for an implementation?

Hmmm… If I was a multi-set implementation in the standard library, what would I be called?

WHOA! Stop! You started by saying the collection type was unimportant, and now you are searching for new libraries, and their performance characteristics? What is wrong with you? Make it a damned list and move on!


So, is it just me, or do other developers go through this internal debate just about every time they use a collection?


Comments

  1. It’s just you. When the collection type doesn’t matter, we all use a List.

  2. If you honestly cared about raw performance you couldn’t justify writing it in Python to begin with. It’s just your old reflexes responding.

  3. One of the things that either Stroustrup or some other C++ guy liked about the way they made the STL is the documented time bounds on the types of collections. The thing with the collection you choose is, you shouldn’t care until the very end (or start) what kind of collection you have.

    “add” basically means “shove it in wherever is most convenient”. All collections should support an “add” method. “insert” implies an ordering. Only lists should have an “insert” method. “Append” sounds like a special cased “insert”. When talking about your collection, you should say things in a way that denotes the kind of operation you want to do.

    Afterwards (or initially) you can choose the type of collection that offers best performance. Often, this will be a Vector (but not a java Vector, as it turns out) or an ArrayList (in Java), because collections are usually small enough that array semantics makes them quite fast (C++ Adage: When in doubt, choose Vector. Java: When in doubt, ArrayList). If it’s large, then performance characteristics do come into play here. You’re probably looking for what Java calls an IdentitySet. If all you’re going to do is “add” and “iterate over the whole thing”, then LinkedList might also be the best option.

    In any case, no matter what you choose, you’re supposed to use “add” in that code above, because you don’t care where it’s added to!

  4. Aristotle,

    The example code here is indeed in Python, and Python is indeed slower than many other languages, but this same problem strikes in any language.

    I don’t want to specify the exact implementation, and risk sending the wrong message to someone reading the code. I want to specify my desired semantics (unordered, behaviour on duplicates undefined)* and have the most efficient implementation chosen for me.

    * Perhaps after Sunny’s comment, I should add “not very large” to the semantics.

  5. Sunny,

    Ah, this is where the problem *is* different in different languages.

    Unfortunately, Python’s standard types don’t have a common “add” method. You add to sets. You insert in or append to lists and arrays. You put into Queue, but append into deque.

    Looks like another entry for Python style guide hate.

  6. I don’t want to specify the exact implementation, and risk sending the wrong message to someone reading the code. I want to specify my desired semantics […] and have the most efficient implementation chosen for me.

    I suspect there are languages that will do that for you. I don’t personally know of any, and I would be surprised if you didn’t quickly find other things to complain about in those languages.

    But if you can deign to choose the implementation yourself, you can achieve your other stated goal (specifying the desired semantics to the reader) fairly simply. Just create your own collection type. In Python, I would recommend subclassing list and creating an add method that does append. It’s not a lot of work (three very short lines, even with full adherence to PEP 8!), and it’s fairly self-documenting. (I would think completely self-documenting to any experienced Python programmer, though if you are truly paranoid, you can specify an explanatory docstring for the new class.)

    Of course, if by “sending the wrong message” you mean “revealing that I [Julian] agonize too much over matters that are ultimately of little consequence” then better to go with configurator’s answer. 😉

  7. I am torn between feeling chastised by John pointing out I am wasting time, and rushing off to create my own generic collection.

  8. I actually believe the discussion is fairly important. To some extent, collections are computing. I’m struggling to think of a case where computing isn’t iterating over a data structure somehow (with that data structure almost always being a collection or a map). In side-effect free languages, the data structures are even more strictly defined and limited in their scope — sometimes iterating over a collection is all you can do.

    I see python as being designed by people who keep asking “Why does this have to be so hard?” or “why do I really need to think so hard about this?” Sometimes they succeed in making things simpler, and other times you look at the outcome and think “yeah… that’s why it has to be hard”. Microsoft does similar things in Windows, and sometimes it works, but mostly it just eventually comes crawling back to the unix model.

    In a previous project we created a factory for creating collections. Eventually I realised that I actually did care what kind of implementation I wanted to use, and the factory started to seem fairly silly. You could create a factory that took flags and gave the most reasonable collection back, but the universe would probably die before that factory needed changing (unless there were bugs). Ultimately you can’t get away from the fact that you have to care about what kind of computation your data structure is going to undergo.

  9. You could create a factory that took flags and gave the most reasonable collection back

    Erm… I guess you could use a factory to solve the problem. Sure, it could let you tell it the semantics, and give you the most appropriate collection back, but isn’t that a development decision rather than a run-time one? (He says hypocritically, as Python leaves a lot, I think perhaps too much, decision-making until run-time.)

    John’s solution keeps the decision back at development time.

    the universe would probably die before that factory needed changing

    Perhaps, but that misses the point of the exercise – it isn’t to allow someone to substitute a linked-list for an array at a later point. It is to say “Dear code-maintainer, this is what I consider important when choosing a data-structure. Any new code that uses this data-structure needs to maintain these semantics, or change the existing code.”

    Ultimately you can’t get away from the fact that you have to care about what kind of computation your data structure is going to undergo.

    Perhaps, but that is part of what got me to this point in the first place. I want set-like semantics, without the extra cost of set-like constraints. I want array/vector/linked-list efficiency, without needing half of the functionality. A list object is overkill.

  10. Aristotle: I don’t see any reason why using python gives you a license to not think about performance. I’m happy that the developers of Bazaar, of which I am a daily user, have spent a lot of time optimising the performance of their python-based tool.

    So anyway if I were faced with this problem, the first thing I’d ask is: what exactly does is_old do? Does it, as the name implies, return the current state, where the state might have changed based on some event (become_old perhaps)? If so I’d be looking to insert the object into the list (yes a list) in the course of handling that event (ie in handle_become_old). This would remove the is_old check, and the assumed loop around it entirely.

    Of course the full definition of the handle_become_old function is left as an exercise for the reader…

  11. Alastair,

    It’s the right instinct; I had the same one as I wrote the example code, but I was trying to keep it as simple and uncontroversial as possible.

    If it helps, imagine object is defined in a third-party library, offering a generic publish/subscribe interface, and that this if statement is in the observer’s callback, so it knows that something has just changed, but not exactly what. Now the implied for-loop has disappeared.

  12. John’s solution keeps the decision back at development time.

    Yeah, I was expecting that you’d do both. My algorithms lecturer said “Don’t think about real-time performance, think about scalability”. Ultimately, the list is going to cause you scalability problems (depending on the max size and how you use it), but the factory is going to be roughly O(1) cost. If you care about it being done at development time, maybe a typedef if python supports one?!?

    OK so I’m going to mention your previous article:

    I am influenced by the original Booch Components (Not the new-fangled Ada-95 version) in which the package name documents excruciating detail about the components memory-management model, time and space semantics and the like. The first act of importing a new component is to alias it – so you can reference the collection called “Queue_Unbounded_Managed_Balking_White_With_Two_Sugars” as “Queue” in your code.

    I think that the complete definition of a data structure is practically the code itself. You usually give these data structures and algorithms names like “quicksort” (and everyone knows what that means) or “hashset” (and everyone knows what that means). There’s no reason to call “hashset” by what it offers, because only a hashset can offer the exact same semantics as a hashset. If you want something like a set but without caring about the set semantics, then you can create a data structure which works better than a hashset and use that, because often the ability of a set to do fast lookups depends upon the set semantics. If you don’t really care, use a comment to mention that you don’t really care about the set semantics?

    I think ultimately what you might be after is that in the future if there’s a fancy new collection with those semantics, you want to be able to tell whoever is maintaining that code that it’s safe to pull out the data structure and add in the new one (or is there something I’m missing about what you want to do?). I guess a simple comment does that best…

  13. Sunny,

    I see python as being designed by people who keep asking “Why does this have to be so hard?” or “why do I really need to think so hard about this?” Sometimes they succeed in making things simpler, and other times you look at the outcome and think “yeah… that’s why it has to be hard”.

    I think that is a reasonable assessment of Python, and well expressed.

    Ultimately, the list is going to cause you scalability problems (depending on the max size and how you use it)

    Julian specifically mentioned “not very large” in one of his comments. That is why I suggested a list as the underlying implementation. A set would actually be better, and closer to what he is asking for, but only provided the elements are hashable.

    If you care about it being done at development time, maybe a typedef if python supports one?!?

    Python is such that defining a class is the typedef. It also fits terminology quite well, as Python classes are types.

  14. Julian,

    I want set-like semantics, without the extra cost of set-like constraints. I want array/vector/linked-list efficiency, without needing half of the functionality. A list object is overkill.

    I’m a little surprised by your fear of set performance. If your objects are not hashable, then you can’t even put them into sets, so that’s right out. If your objects are hashable, then the set is pretty much the best-performing Python data structure there is.

    Do you have some reason to believe that your objects hash but somehow don’t hash “well”?

  15. Alastair:

    I don’t see any reason why using python gives you a license to not think about performance.

    Oh, I completely agree.

    That is why Julian shouldn’t be thinking about the type of his collection.

    I once wrote a heap in Perl to avoid a bottleneck in a routine where I’d be picking the smallest element from a set that I knew would be very large in typical operation – several 100,000 elements. With a heap this operation is O(1), whereas sorting a list and picking the first element is O(n log n). I was certain in my choice. To my dismay, someone else demonstrated to me that using Perl’s built-in sort was faster up to something like 15,000,000 array elements.

    Big-O notation may omit constants and smaller terms, but in practice they matter. A lot.

    No collection implemented in Python is ever going to be faster than one of the built-in ones that are ultimately implemented in C, except maybe at far extremes of scalability.

    That is why I say that Julian’s reflexes are misfiring.

    I’m happy that the developers of Bazaar, of which I am a daily user, have spent a lot of time optimising the performance of their python-based tool.

    An SCM operating at the limits of performance is I/O-bound by definition, where Julian’s conundrum concerned a scenario that at limit is CPU-bound, so I’m not sure what point you want to make with this. Even so, Bazaar is… not exactly notorious for being a speed demon. I’ll grant that for Mercurial, though. But then again, that is not written entirely in Python – though they do get most of its performance out of it design (again, cf. CPU- vs I/O-bound performance).

Leave a comment

You must be logged in to post a comment.