I wanted to listen to a range of podcast feeds over my car CD player and/or on my iPod.
I didn’t want iTunes to get involved, because its user-interface frustrates me to tears.
So, I wrote some straightforward Python code to download my favourite podcasts.
I don’t want to listen to multiple programs from the same feed in a row, because I like variety.
I don’t want to listen to them in a randomly shuffled order, because many of the feeds have an internal sequence to their podcasts, and because shuffling still leads to occasional runs of the same feed.
I don’t want to listen to them in the order that they were downloaded, because some podcast feeds are bursty in nature, leading to several shows from the same feed in a row. This is especially true after I add a new feed; I get a lot of its back-issues all at once.
Instead, I want my podcasts carefully sorted so as to maximise the space between podcasts of the same feed.
My first solution was a simplistic one. Twice a week, a script downloads the latest podcasts. It downloads them in a round-robin fashion from all of the feeds that have new articles.
If all the feeds have a similar number of articles, this works well. However, if one feed has many more shows than the others, the list ends with a run of shows from that feed.
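In outline, the round-robin step looks something like this sketch (the feed contents here are made up, and the real script of course also had to download the files):

    from itertools import chain, zip_longest

    def round_robin(feeds):
        """Interleave shows one feed at a time: A1, B1, C1, A2, B2, ...

        `feeds` is a list of lists, each holding one feed's new shows in
        order.  Short feeds drop out of the rotation early, which is
        exactly the weakness described above: the longest feed is left
        with a run of its own shows at the tail of the list.
        """
        interleaved = chain.from_iterable(zip_longest(*feeds))
        return [show for show in interleaved if show is not None]

    print(round_robin([["O1", "O2", "O3", "O4"], ["R1", "R2"], ["Y1"]]))
    # ['O1', 'R1', 'Y1', 'O2', 'R2', 'O3', 'O4'] -- note the run of Os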
It wasn’t good enough, and it was bugging me. I often found myself hand-sorting the files before burning to CD.
I posed the problem to my blog readers, lightly disguised as a Wine-Gum Selection Puzzle, mainly so I could abstract away some of the complexity.
The result was some very interesting suggestions that led me to do even more reading and thinking. (Thank you to all the people who contributed! I am very appreciative.) I tried to apply each of the pieces of advice, but I still couldn’t get my head around any neat solution.
I remained suspicious that the problem was NP-complete. So, how do you address NP-complete problems?
One way is to give up! I did that for a while, but it continued to bug me whenever I listened to two similar podcasts in a row, or found myself re-sorting by hand.
Another way is brute force, for small values of n. This is just a search problem, right? Remembering the classic A* search algorithm, I found an open-source implementation and tried to fit this problem into it.
It didn’t go!
A* is a branch-and-bound search, but here it was all branch and no bound! Unlike in a travelling-salesman example, you never reach the same node via two different routes: the search space is a tree, not a graph, so A*’s machinery for recognising revisited nodes was wasted effort.
I dumped the A* implementation, and wrote a basic best-first search from scratch instead.
The implementation consisted of a collection of partial solution objects.
Each partial solution object held the list of items (wine-gum colours or feed names) placed so far. From that, it could derive a metric describing the “interestingness” of the ordering so far, and an estimate of what the final metric would be once all the remaining items were placed.
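Here is a sketch of the shape of those objects. The interestingness function below is only a stand-in for the real one from the wine-gum post: it scores each item by the square root of the gap back to the previous item from the same feed, so evenly spread repeats beat clustered ones.

    import math

    class PartialSolution:
        """One node in the search: the items placed so far, plus a tally
        of what remains to be placed."""

        def __init__(self, placed, remaining):
            self.placed = placed        # e.g. ["O", "R", "Y", "R"]
            self.remaining = remaining  # feed name -> number of items left

        def interestingness(self):
            """Stand-in metric: each item earns sqrt(gap) back to the
            previous item from the same feed, so bigger spacing between
            repeats means a higher score."""
            score, last_seen = 0.0, {}
            for position, feed in enumerate(self.placed):
                if feed in last_seen:
                    score += math.sqrt(position - last_seen[feed])
                last_seen[feed] = position
            return score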
Interestingness and Estimation
The interestingness metric was simply the function that I had previously defined.
The estimate of the final metric was the tricky part. I wanted the estimate to be as accurate as possible, but it may never be an under-estimate if the search algorithm is to work. It also needs to be calculated relatively quickly.
After a little thinking, my over-estimate algorithm was to notionally place each of the remaining pieces in the last possible position and compute the interestingness it added to the overall solution. The last possible position maximises the interestingness, so if I optimistically assume every remaining piece gets that position, I can never under-estimate the final metric.
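With that stand-in metric, the first heuristic comes out something like this: every remaining item is notionally dropped into the final slot, the position that maximises its gap and hence its score, so the result can never be an under-estimate.

    import math

    def optimistic_extra(sol):
        """First-cut over-estimate of the interestingness still to come:
        pretend every remaining item lands in the very last slot of the
        finished list, measuring its gap back to that feed's most recent
        placed item (or an imaginary slot before the start)."""
        last_slot = len(sol.placed) + sum(sol.remaining.values()) - 1
        extra = 0.0
        for feed, count in sol.remaining.items():
            last_seen = max((i for i, f in enumerate(sol.placed) if f == feed),
                            default=-1)
            extra += count * math.sqrt(last_slot - last_seen)
        return extra

    def estimated_final(sol):
        """Current score plus the optimistic remainder -- never too low."""
        return sol.interestingness() + optimistic_extra(sol)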
The main algorithm plucked the partial solution with the best estimated final metric out of the collection, added each of the possible next items to make a new set of partial solutions, and pushed them all back into the collection. This was repeated until one of the partial solutions used up all of the items – and hence was a final solution. By the magic of best-first searching, this is guaranteed to be an optimal solution.
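Putting the pieces together, the main loop is roughly this, continuing the sketches above:

    def best_first(remaining):
        """Best-first search over partial orderings.  `remaining` maps
        each feed name to its item count.  Because the estimate never
        under-estimates, the first complete solution plucked out is
        optimal."""
        frontier = [PartialSolution([], dict(remaining))]
        while frontier:
            frontier.sort(key=estimated_final)  # re-sort, as described below
            sol = frontier.pop()                # the most promising node
            if not any(sol.remaining.values()):
                return sol.placed               # all items placed: done
            for feed, count in sol.remaining.items():
                if count:                       # branch on each possible next item
                    left = dict(sol.remaining)
                    left[feed] -= 1
                    frontier.append(PartialSolution(sol.placed + [feed], left))

    print(best_first({"O": 3, "R": 2, "Y": 2}))  # a 7-item toy problem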
The right choice of collection data-structure is some sort of priority queue or B-tree, so I stored it in a plain list instead! I had a quick look at what support Python offered and decided I couldn’t be bothered. I put everything in a list, and just sorted the list over and over again.
Later, I decided that it was worth a teensy bit of optimisation. I moved from a list to a set (to allow easier deletions). Rather than sorting the whole collection after each iteration, I just zipped through it looking for the maximum value, which was all I actually needed.
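The change is small; continuing the sketch, the frontier becomes a set and the repeated sort disappears:

    def pluck_best(frontier):
        """Single O(n) scan for the most promising node -- all the sort
        was ever used for -- followed by a cheap set deletion."""
        best = max(frontier, key=estimated_final)
        frontier.remove(best)
        return best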
The result wasn’t a huge surprise for an NP-complete problem. It was slow. I left it puzzling overnight on a 3-feed, 15-item problem, and it was still looking for the optimal solution the next morning.
No wonder iTunes doesn’t offer this feature!
I was expecting the result to be slow, and I already had a plan.
There is a third way of dealing with NP-complete problems that I hadn’t explored yet – accepting approximate answers.
Realistically, I don’t need a perfect ordering of podcasts – just something that doesn’t have too many repeats too often. I will accept an ordering that isn’t optimal, but is near enough.
I put in an escape clause so that the program would stop running after a few minutes, and return the best partial solution found so far. I could accept that this was close enough to the best ordering of the first few items.
I could then take those items out of the equation and run the program again on the remaining items.
This is sub-optimal, but I figure it is a good enough approximation.
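As a sketch, the escape clause just adds a deadline to the same loop (the three-minute figure below is illustrative):

    import time

    def best_first_with_escape(remaining, time_limit=180.0):
        """Best-first search that gives up after `time_limit` seconds and
        returns the most promising partial solution found so far, whose
        items can then be fixed in place before re-running on the rest."""
        deadline = time.monotonic() + time_limit
        frontier = [PartialSolution([], dict(remaining))]
        sol = frontier[0]
        while frontier:
            frontier.sort(key=estimated_final)
            sol = frontier.pop()                 # current best candidate
            if not any(sol.remaining.values()):
                break                            # genuinely optimal: done early
            if time.monotonic() > deadline:
                break                            # time's up: settle for sol
            for feed, count in sol.remaining.items():
                if count:
                    left = dict(sol.remaining)
                    left[feed] -= 1
                    frontier.append(PartialSolution(sol.placed + [feed], left))
        return sol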
I ran the program for a short period and looked at the best partial solution it had found. I was shocked to discover it was an appallingly bad partial solution.
It was trying to optimise the following items: O-O-O-O-O-O-R-R-R-R-Y-Y-Y-Y, and it had got as far as O-R-Y-R-Y-R-Y-R. That was going to leave a long consecutive run of Os at the end. How could that be a contender for the best solution?
I realised that my estimation heuristic actually encouraged postponing the placement of the biggest group of items. It assumed that all five remaining Os would be in the best possible position (the last one), so there was no pressure to start placing them any sooner.
New Estimation Function
I replaced the estimation function with one that assumed the remaining Os would be evenly distributed amongst the remaining positions. I took care to ensure it remained an over-estimate, but it was much more accurate when one colour/feed contributed a large group of items.
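With the stand-in metric, the replacement looks something like this. Because the square root is concave, equal gaps genuinely maximise a feed's total score, so spreading the remaining items evenly is still never an under-estimate, just a far less naive one.

    import math

    def optimistic_extra(sol):
        """Improved over-estimate, replacing the version above: assume
        each feed's remaining items are spread evenly over its remaining
        span, instead of all enjoying the very last slot."""
        last_slot = len(sol.placed) + sum(sol.remaining.values()) - 1
        extra = 0.0
        for feed, count in sol.remaining.items():
            if count == 0:
                continue
            last_seen = max((i for i, f in enumerate(sol.placed) if f == feed),
                            default=-1)
            span = last_slot - last_seen              # room left for this feed
            extra += count * math.sqrt(span / count)  # even spacing: gap = span/count
        return extra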
I ran the tests again, and discovered it ran much, much faster. The task that previously ran overnight now finished in 3 or 4 minutes. The first estimation function was leading the poor search algorithm into a number of dead-ends that the new function avoided.
Nonetheless, it was still too slow to find the optimal solution for significant values of n. The worst case I needed to handle was also the first data set it would face – my current backlog of 140 podcasts from about 12 feeds.
I combined the escape clause (run for a while before returning a partial solution, removing those items from the pool, and running again) with the new estimation function, and tackled the real data.
It was still too slow – the partial solutions I was reaching in a reasonable time were only a few items long.
I further constrained the problem to process a maximum of 9 feeds, 5 items per feed and 25 items overall in each pass. Again, this will lead to sub-optimal solutions, but I figure those numbers are high enough to introduce enough variety to keep the interestingness high.
After each pass gives up and returns the best partial solution found so far, the list of articles and feeds still available is updated, and a new set of feeds and items is selected. So each pass only considers a limited pool of variety, but the pool changes every 10 or so podcasts, so it is less noticeable.
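An outline of the outer driver, as a sketch only: the pool limits follow the figures above, and it reuses the time-limited search from earlier.

    MAX_FEEDS, MAX_PER_FEED, MAX_ITEMS = 9, 5, 25

    def order_backlog(backlog):
        """Carve the backlog into limited pools, search each pool with
        the escape clause, and fix each pass's ordering into the
        playlist.  `backlog` maps feed name -> number of shows waiting."""
        playlist = []
        while any(backlog.values()):
            pool, pool_size = {}, 0
            for feed, count in backlog.items():
                if count == 0 or len(pool) == MAX_FEEDS:
                    continue
                take = min(count, MAX_PER_FEED, MAX_ITEMS - pool_size)
                if take:
                    pool[feed] = take
                    pool_size += take
            ordered = best_first_with_escape(pool).placed
            if not ordered:             # safety net: nothing placed this pass
                break
            playlist.extend(ordered)
            for feed in ordered:
                backlog[feed] -= 1      # those shows are now spoken for
        return playlist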
I have integrated this implementation with the downloading component I already had.
The downloading component runs automatically twice a week, and stores the podcasts in neatly sorted piles by feed.
When I need another CD for my car, I run a program to discover what’s been downloaded, sort the shows into a single list using the algorithm I have described, and prepare them for burning to CD.
So, the solution appears to be working well enough for now. Only 5 or 10 seconds of visual inspection of the list turns up some sub-optimal choices, but the mistakes don’t seem too egregious (e.g. there are no runs of three shows from the same feed, but there are places where swapping the order of two items would improve the metric).
So I am happy for now, but I haven’t listened to my first CD yet. Ask me again once I have been listening for a while.