OddThinking

A blog for odd things and odd thoughts.

WordPress and Text Encoding

Dear Mythical WordPress Architect,

I am one of your biggest fans; of all the mythical creatures I believe in, you are my favourite. However, this is an issue that continues to bother me, and I thought I should bring it to your attention.

I have been looking through the code trying to understand why sometimes tags appear in inappropriate places, and sometimes they are stripped out in inappropriate places.

My conclusion is that there simply isn’t sufficient modelling of the different versions of my content.

The pipelining model has its advantages in terms of making it easy for many simple plugins. It has some disadvantages too, which I have covered before.

One of the disadvantages that I am uncovering is that the string that represents the blog content is going through repeated transformations from one encoding to another, but it retains the same type – indeed the same name – throughout the transformation.

This makes it hard to see that there isn’t just one type, but many. As a result, it seems to be commonplace for the wrong encoding to be used at the wrong levels.
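The shape of the problem can be sketched with a toy filter pipeline. This is not WordPress code — the filter bodies are invented for illustration — but it mirrors the add_filter/apply_filters pattern: every stage takes a plain string and returns a plain string, so nothing records which encoding the string is currently in, and a stage with the wrong assumption runs anyway.

```python
# A minimal sketch of a WordPress-style filter pipeline (hypothetical
# filters). Every stage takes a str and returns a str, so the type
# system cannot tell author markup, HTML, and plain text apart.
filters = []

def add_filter(func):
    filters.append(func)

def apply_filters(content: str) -> str:
    for f in filters:
        content = f(content)
    return content

# Stage 1: converts author markup (#1) into HTML (#2).
add_filter(lambda s: s.replace("*bold*", "<b>bold</b>"))

# Stage 2: a plugin that assumes it is reading plain text and escapes
# it for display -- wrong if it runs after stage 1, but nothing in the
# pipeline's types stops it from being registered here.
add_filter(lambda s: s.replace("<", "&lt;").replace(">", "&gt;"))

print(apply_filters("some *bold* text"))
# prints: some &lt;b&gt;bold&lt;/b&gt; text -- the HTML from stage 1
# has been mangled by a stage that expected a different encoding.
```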

Let me give a quick example.

My blog content can appear in at least three encodings.

  1. My original content, in a human-writeable markup language.
  2. The same content, converted to standard HTML.
  3. The same content, converted to plain-text.
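The three encodings could be made visible to tooling by giving each its own type. Here is a minimal sketch in Python using typing.NewType — the type names and converter bodies are mine, not WordPress's — showing how a type checker would then refuse to pass encoding #1 where #2 is expected.

```python
import re
from typing import NewType

# Three distinct string types for the three encodings above
# (hypothetical names; WordPress keeps them all as one string type).
AuthorMarkup = NewType("AuthorMarkup", str)  # 1. human-writeable markup
Html = NewType("Html", str)                  # 2. standard HTML
PlainText = NewType("PlainText", str)        # 3. plain text

def markup_to_html(src: AuthorMarkup) -> Html:
    # Stand-in for a real markup converter such as Markdown.
    return Html(src.replace("*bold*", "<b>bold</b>"))

def html_to_plain_text(src: Html) -> PlainText:
    # Crude tag stripper; enough for the sketch.
    return PlainText(re.sub(r"<[^>]+>", "", src))

post = AuthorMarkup("some *bold* text")
html = markup_to_html(post)      # encoding #1 -> #2
text = html_to_plain_text(html)  # encoding #2 -> #3

# A static type checker now rejects html_to_plain_text(post):
# the pipeline can no longer feed encoding #1 where #2 is expected.
```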

In practice, all three of these roles seem to be served by the same the_content() function.

This results in my old issue of Markdown conflicting with Live Comment Preview: Markdown writes #1 into the same string that Live Comment Preview reads as #2.

It also results in email-based subscribers sometimes seeing various inappropriate markups. (Should be using #3, but is seeing #1 or #2.)

The problem gets worse when we talk about the_excerpt() (which is used to summarise posts; the summary is sometimes written by the author and otherwise generated automatically). It should have the same three encodings as above, but WordPress doesn’t recognise this.

Sometimes, the excerpt is displayed in the user’s browser with the HTML tags stripped off. Sometimes it isn’t. (This cost me over an hour this week, as I went off on a tangent trying to work out where my markup was being stripped out. I eventually discovered it was stripped only from automatic excerpts, not from manual ones.)
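That manual/automatic difference is consistent with the automatic path running a tag stripper (something like WordPress’s wp_strip_all_tags) that the manual path skips. The following is a guess at the shape of that logic, not the actual WordPress implementation:

```python
import re

def strip_tags(html: str) -> str:
    # Crude stand-in for a tag stripper such as wp_strip_all_tags().
    return re.sub(r"<[^>]+>", "", html)

def get_excerpt(manual_excerpt: str, content: str) -> str:
    # Guessed logic: a manually written summary is returned as-is,
    # while an automatic one is derived from the content *and* has
    # its tags stripped -- so markup survives only in the manual case.
    if manual_excerpt:
        return manual_excerpt
    return strip_tags(content)

print(get_excerpt("", "a <b>bold</b> post"))        # tags stripped
print(get_excerpt("a <b>bold</b> summary", "..."))  # tags kept
```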

The very same issue applies to article titles. Article titles are often treated as plain text in the code, but they don’t have to be.

I hope this observation is useful to you as you work on the next major version of WordPress. I am afraid I am not coming up with positive suggestions on how to fix this without breaking many plugins, but I hope merely becoming aware of the problem will help guide you toward a solution.

Regards,

Julian


Comments

  1. Julian, I came across this description of how text filters work in the Typo blogging engine. While they almost certainly have other issues, the Typo guys have clearly put a lot of thought into this problem, hopefully driven by the WordPress experience. I’m pretty certain emailshroud would be very easy to implement, for instance. Have a look anyway; some Rails knowledge is assumed but not required.

  2. I’m pretty certain emailshroud would be very easy to implement

    Like this, maybe.

  3. Interesting post. A year later is the problem the same? Any new insights into possible solutions?

  4. Lloyd, the short answer to your query is “No change”.
