OddThinking

A blog for odd things and odd thoughts.

How Python Logging Bit Me, Part 1.

This is offered to the world in the spirit of “If other programmers can learn from my mistakes, maybe my threads won’t have died in vain.” The trouble is, I don’t think anyone will ever find this by Googling. Maybe I am wrong.

Here is a heavily-sanitised snippet of Python code, sitting inside a loop, being run by a pool of threads.

    # (Snippet assumes: from datetime import datetime; import logging.)
    start_time = datetime.now()
    if start_time >= next_run_due:
        logging.debug("Starting the performance-critical section")
        # Ensure that the results we send are based on the freshest data.
        fetch_data()  # Have noticed this runs slow sometimes. Keep an eye on it.
        fetch_completed_time = datetime.now()  # Don't examine now. Wait until outside performance-critical section!
        process_data()
        send_results()

        # Relax. Now outside the performance-critical section.

        # This performance check has been placed outside the performance-critical
        # section, because logging.critical() may send a (slow) email.
        if fetch_completed_time - start_time > threshold:
            logging.critical("Fetch time is running slow again.")  # Implicitly sends an email.
            # If you see this message, check the network and CPU aren't congested.
        next_run_due = next_due_date()

So this code checks if some processing is due, and if so, fetches the data, processes it and sends out the results. It tries to make sure that when it sends out results, they are based on the freshest possible data, so the key section does as little extraneous work as possible.

Sometimes, the data-fetch takes longer than expected – historically, that has been because something else on the server is chewing up more resources than it should. I want to know when that happens, so I can go and fix it, so I log a critical message. The critical message log is linked to send an SMTP message with Python’s logging library’s SMTPHandler. I do this check outside the key section, so processing the email doesn’t (further) slow the sending of the results.
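
For reference, the hookup is roughly this (the host and addresses are placeholders; in my real system it is done via the logging configuration file, not in code):

    import logging
    import logging.handlers

    smtp_handler = logging.handlers.SMTPHandler(
        mailhost="localhost",
        fromaddr="app@example.com",          # placeholder
        toaddrs=["me@example.com"],          # placeholder
        subject="Critical log message")
    smtp_handler.setLevel(logging.CRITICAL)  # only CRITICAL records trigger an email
    logging.getLogger().addHandler(smtp_handler)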

Can you see the problem with this code? I couldn’t. (I love asking this question, because people tend to find dozens of other problems I never thought about!)

Sure, there is no throttling on the critical messages, but that is because this doesn’t occur frequently enough to cause email flooding.

Here’s a hint. Multi-threading and locking.

Some code, in some thread, somewhere, called logging.critical(), which tried to send an email. I expected logging.critical() to write the email to a local buffer, and for sendmail (or whatever) to pick it up and deliver it later. I knew it would be slow, but I was expecting sub-second response times. I haven’t worked out why that isn’t always true, but it isn’t always true. It can take many seconds to send this email.

But that’s okay, because there are no logging.critical() calls during any performance-sensitive areas, so it won’t affect the “freshness” of the data, which is really the critical performance characteristic.

Can you see the call to logging.debug()? It does not send an email; it writes to a log file. Its execution time is measured in hundreds of microseconds, so it isn’t a concern.

Or so I thought.

It turns out that logging.debug() blocks, waiting for the logging.critical() call to complete. Now the debug message is taking 10,000 times longer than it normally does! It still isn’t inside the critical part, but note that it is inside the part that is timed. So when fetching plus logging takes more time than the threshold, out goes another logging.critical() call, which is another email message, which is another blocked logging system, which holds up the next thread, and so it goes.
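
(The mechanics, as I understand them now: each handler serialises its emit() behind a per-handler lock, so while one thread sits inside a slow emit(), every other thread whose record reaches that handler queues up behind it. A simplified sketch of what the library’s Handler.handle() does – not the exact source:)

    def handle(self, record):
        if self.filter(record):
            self.acquire()          # per-handler lock
            try:
                self.emit(record)   # for SMTPHandler, this sends the email synchronously
            finally:
                self.release()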

Meanwhile, the time taken to do a logging.critical() call is increasing as they all wait in a queue to send emails. The first one went out, as expected, in less than a second. The second took twice as long. The third, three times as long. It got to the stage that my threads were freezing up for 90 seconds. I said before that there weren’t enough of these calls to cause email flooding. I stand by that – there was a burst of a few hundred over a couple of hours, which should be easily handled by a server’s email system. But that assumes that the threads in the system are not all waiting for every email to actually reach its destination before they continue.

Short Term Solution

I’ve turned off emailing of critical logs for the time being. In any case, I have moved the debug log outside of the timed section, to stop it from propagating the problem.

Outstanding Questions

Before I can implement a long-term solution, I have to understand some issues.

Is my MTA running too slowly, even though it is on the localhost? Could it be making me wait while it sends the email, rather than buffering and sending it later? Could it be protecting itself from elephant-interferers err.. spammers by deliberately running slowly? Or is it inappropriate for me (and SMTPHandler) to have ever thought it should be fast enough?

Does SMTPHandler buffer emails before sending them out in a different thread? Apparently not. Could it be configured or rewritten to do so? Do I really need to write such a beast myself?

Could BufferingSMTPHandler help? It seems to try to minimise emails sent by chunking them and only sending when the buffer is full. Presumably, it still blocks other logging during this occasional email, so it remains a problem.
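
(Condensed, the idea behind it looks something like this – a sketch of the pattern, not the linked code verbatim. Note the send inside flush() is still synchronous:)

    import logging.handlers
    import smtplib

    class BufferingSMTPHandler(logging.handlers.BufferingHandler):
        def __init__(self, mailhost, fromaddr, toaddrs, subject, capacity):
            logging.handlers.BufferingHandler.__init__(self, capacity)
            self.mailhost = mailhost
            self.fromaddr = fromaddr
            self.toaddrs = toaddrs
            self.subject = subject

        def flush(self):
            # Called once the buffer reaches capacity: one email for the whole chunk.
            if not self.buffer:
                return
            body = "\r\n".join(self.format(record) for record in self.buffer)
            msg = "From: %s\r\nTo: %s\r\nSubject: %s\r\n\r\n%s" % (
                self.fromaddr, ",".join(self.toaddrs), self.subject, body)
            smtp = smtplib.SMTP(self.mailhost)            # still blocks, just less often
            smtp.sendmail(self.fromaddr, self.toaddrs, msg)
            smtp.quit()
            self.buffer = []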

Could SysLogHandler help by taking it out of the domain of Python? Or given my limited experience with syslog, will I just have a new set of problems?

In the meantime, if you use SMTPHandler, and you get random periods of sustained bad performance, maybe this will help you?

28 Comments

Categories: Heroic Failures, Insufficiently Advanced Technology, S/W Dev
Tags: logging, gotcha, Python, SMTPHandler, software

Comments

  1. If it were me, I would not rely on the application to send email itself. You should be able to post the critical message somewhere external as soon as you can and then just resume doing what you should be doing. Let something else pick up the pieces and handle delivering the notification.

    I think SysLogHandler is actually a good fit for this role. You can post messages to syslog pretty much instantaneously. Just select a facility for your application (one of the LOCAL facilities) and write your messages there (see the sketch at the end of this comment). You can then configure your syslogd to send copies of any critical messages from that facility to an email address.

    There are other notification options you could explore from here as well (logcheck, SNMP, Nagios, something else?) that could poll or be called from the syslog facility. It’s up to you. The point is you can tune the delivery method independent from your application.
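
    On the Python side, that could be as little as this (a sketch; the facility and socket address here are just illustrative choices):

        import logging
        from logging.handlers import SysLogHandler

        # Post to the local syslogd via its Unix socket; LOCAL0 is an arbitrary choice.
        handler = SysLogHandler(address="/dev/log",
                                facility=SysLogHandler.LOG_LOCAL0)
        handler.setLevel(logging.CRITICAL)
        logging.getLogger().addHandler(handler)
        # syslogd then owns delivery, e.g. a local0.crit rule that mails messages on.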

  2. Perhaps you’ve got reasoning against this already, but the debug log is waiting for the critical log to finish. It doesn’t matter how slow the mail handling is, multithreaded code will bottleneck on this problem (assuming the logger is not per-thread). You need your own logging middle end which will queue up messages and send them in another thread.

  3. A quick glance at the code for SMTPHandler shows it uses the std smtplib module to connect to the mail host and send emails right away, which is probably not something you’d want in general. This blocks whichever thread triggered the handler (if I remember correctly, all logging stuff is protected by locks).

    Basically I agree with Kamil; just let something else handle it outside your application. Or write a handler to pass records to email via a Queue.Queue to a separate thread you create that’ll buffer/send them out (or something similar) if you wanna keep the emailing in your app. That way it won’t block your other threads (except for the time spent waiting on Queue.put).
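
    Something like this, say (an untested sketch only – the class name and details are made up):

        import Queue
        import threading
        import logging.handlers

        class QueuedSMTPHandler(logging.handlers.SMTPHandler):
            """Queue records and email them from a worker thread."""

            def __init__(self, *args, **kwargs):
                logging.handlers.SMTPHandler.__init__(self, *args, **kwargs)
                self.queue = Queue.Queue()
                worker = threading.Thread(target=self._send_loop)
                worker.setDaemon(True)  # won't hold up shutdown; may drop queued emails
                worker.start()

            def emit(self, record):
                self.queue.put(record)  # cheap; blocks only for Queue.put itself

            def _send_loop(self):
                while True:
                    record = self.queue.get()
                    try:
                        # The slow SMTP send happens here, off the logging threads.
                        logging.handlers.SMTPHandler.emit(self, record)
                    except Exception:
                        pass  # a log handler should never raise into the app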

  4. Kamil,

    I understand the concept of separating the log-message handling from the application. In fact, this is the exact reason I originally adopted the Python logging system.

    For what it is worth, I configure the logging system with a configuration file separate to my application, and, to a large extent, the application can ignore the logging subsystem.

    There are several places where that has turned out to be NOT true, however, and this post discusses one of them. (There may be a couple more posts on the subject coming.) Not least, that even before I found that the module sends the email immediately, I had assumed in the code that a Critical message would take much longer than a Debug message.

    Perhaps my real mistake was simply assuming that, with the Python Seal-Of-Approval, the built-in logging system would be sophisticated enough to take care of these sorts of details for me.

  5. Sunny,

    Yes, you are right. The solution which I had (wrongly) assumed the logging subsystem did for me was to either (a) use a separate thread to handle the sending, or (b) write quickly to an outbox file in a place that would be picked up by sendmail in the fullness of time.

    (These are very similar solutions, except one uses a Python thread and Queue.Queue for message-passing, while the other uses a system process and the file-system for message-passing. The latter has the advantage that it will still send the message even if the Python application is killed or horribly disfigured by whatever event it is trying to report. The former has the advantage that it is probably doable by subclassing from the current Python logging framework.)

  6. sagara,

    It is interesting that you say that [sending an email immediately] “is probably not something you’d want in general”. I agree, which is why I was surprised to learn it worked this way. Who would want that in any real-time system?

    You mention it blocks the thread that triggered the handler, but, as you say, all the logging is protected by locks, so it also ends up blocking every thread that calls the logging system – even for debug messages – until the email clears.

    I was pondering a solution exactly as you described, with Queue.Queue, inheriting from SMTPHandler (or maybe BufferingHandler or similar?)

    I suspected that the biggest problem is the age-old issue with logging systems: First, do no harm.

    When do you shut down the thread? Should you ever quit the thread before the email is sent? If so, what about the lost notifications of the reason for the sudden quit? If not, is your application going to fail to shut down whenever the machine is ever disconnected from the network?

    Not to mention the importance of never having the logging system raise an exception or block the application. Logging systems are not trivial, and often re-invented, and right now I am feeling far too old to consider doing it all again. 🙁

    I think you may be right. Time to read up on syslog again.

  7. You can still use logging, but instead of using an SMTPHandler directly from your application, use a SocketHandler instead to send events to another process (if necessary, on a different machine). Have a socket listener in that process pick up the events and email from there (e.g. using an SMTPHandler configured in that process). If there is a slowdown in email sending, the other process will take up the slack.

    There’s a working example of such a socket listener process in the Python docs.
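
    On the application side it is only a handler swap (the port shown is the stdlib’s conventional DEFAULT_TCP_LOGGING_PORT; any agreed port works):

        import logging
        import logging.handlers

        socket_handler = logging.handlers.SocketHandler(
            "localhost", logging.handlers.DEFAULT_TCP_LOGGING_PORT)
        socket_handler.setLevel(logging.CRITICAL)
        logging.getLogger().addHandler(socket_handler)
        # The listener process unpickles each record and hands it to its own
        # SMTPHandler, so email delays never block this process.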

  8. There’s an enhanced version of the program in the Python docs available here.

  9. Well, I was also wrong about people not finding this. Someone posted it to reddit, which is why I am seeing intelligent comments from some new people. Thank you all involved. I will have to follow up with some more issues with Python logging.

  10. Hi Julian,

    I’m the maintainer of Python logging. If you have issues with Python logging, as well as posting them here, it would be better to post them on comp.lang.python and (if appropriate) raise issues on bugs.python.org. I generally monitor both of these actively and am reasonably responsive to issues raised there.

    Regarding your comment

    Perhaps my real mistake was simply assuming that, with the Python Seal-Of-Approval, the built-in logging system would be sophisticated enough to take care of these sorts of details for me

    I would say that Python’s logging is as sophisticated as most (if not all) other logging systems in common use out there, or at least it’s intended to be 😉 [If you don’t agree, please point out the other systems and I’ll see if it’s feasible and sensible to add their features.] However, it’s not telepathic, and cannot automatically divine when an I/O call will block (even socket calls can block, after all). That said, it’s made very easy for people to design their own handlers that can deal with special cases.

    I notice from your snippet that you’re using the root logger (via the module-level convenience functions), which is not normal for production use except for short, simple utility scripts. There’s no harm in using it, but typically all sorts of handlers (e.g. file, console) might be attached to the root logger, and if any of them blocks then the others won’t get a chance to handle until the blocking handler resumes. In fact, this might be what’s happening in your very case, because I don’t see where else the lock contention would be occurring.

    If you haven’t already done so, please have a look at my Python Logging 101 to get more about how loggers and handlers work together.

    All that being said, your problem is at heart not really about logging per se, and more about thread interactions in general. For example, you could have equally well pointed the finger at smtplib for not being smart enough to include an automagical email queue and separate sender thread. After all, the problem you described would also surface if you were sending emails from an application thread using smtplib directly. You just happened to go through logging in this case 🙂

  11. Vinay,

    Don’t worry. logging is great. Julian is just pointing out a use case to beware. It’s good to see evidence that the module is indeed thread-safe, actually.

  12. @Christopher,

    What, me worry? 😉

  13. Vinay,

    Thank you very much for your comments. I had been reading some of your logging explanations earlier today. I am impressed you stopped by to comment at all, and I am happy you spent the time to make some very good points.

    The first (in your main comment), about directing my comments somewhere that might actually be useful, is a fair one. I am not a regular reader of comp.lang.python, which makes me a bit loath to bluster in and commit some local faux pas by complaining about something that may well have been well-covered before. I would also say that the logging module is working exactly as designed, so it isn’t a bug, per se; it just isn’t meeting my assumptions, so a bug report seems inappropriate. Nonetheless, having managed customer-facing products before, I realise how much more productive a customer comment sent to the right channel can be compared to pointless griping to my friends.

    You say that Python logging is as sophisticated as other logging systems. You are certainly more of an expert on this than me, and I imagine you are absolutely right, but I think I can still expect more!

    I agree that you can’t tell when I/O is going to block. I agree that the socket-based solutions (whether SocketHandler or SysLogHandler) are going to have similar problems (although, if I can constrain them, in my case, to only speak to other sockets on the same machine, I can probably satisfy myself that they won’t block for too long, in the same way that I don’t really worry much about the time spent blocking on logging to a local file).

    However, this general problem of unpredictable blocking is an argument for a general mechanism in the logging system for decoupling the client reporting the issue from the handler actually dealing with the I/O – whether that I/O is SMTP, syslog, special-purpose socket traffic, an old-fashioned line-printer, etc. That said, I can accept there may be some technical challenges in implementing such a feature – see my comment above, about when to give up when you are shutting down.

    I also have to heartily agree that your infrastructure is easily customisable. When looking at the BufferingSMTPHandler, linked in the article, I was startled at how little code was required – at first I thought some of it must be missing, and I was only seeing a snippet!

    Despite the fact my “heavily-sanitised” example uses it, I am not really using the root logger in my production code; instead, I am using eight loggers and seven handlers (or more, depending on how you count the one-FileHandler-per-thread for a large set of similar threads). I simplified the snippet of code to use the convenience functions to focus on the key features.

    But your comment about lock contention has got me thinking. All of my loggers propagate up to a top-level logger, which is where, in one place, I link in the critical-level SMTP handler. Perhaps I could denormalise this so there was a separate SMTPHandler per thread. This wouldn’t stop the initial delay in waiting for the email to go out, but it might stop the problem from propagating from thread to thread and getting worse as the threads block on each other.

    I hadn’t read your Logging 101, but I had read a number of tutorials (and the Help file, repeatedly!) when I “operationalised” my code from being a prototype to a production system. That’s when I set up the logging hierarchy. I don’t recall any of the documents warning me that (a) SMTPHandler calls would block until the email was sent, and (b) non-SMTP calls could be blocked on the SMTPHandler calls. Again, I am not claiming this is necessarily a bug; merely that it isn’t what I expected, which is why I offered this warning article to others.

    Finally, I also agree that the problem could have been equally solved at the smtplib level. If the delay is in waiting for the MTA to accept an email (is that it? Or is it actually waiting for the destination server to accept the email? I haven’t checked), then it could be sped up by decoupling the log call from the call to SMTPHandler, from SMTPHandler to smtplib, or from smtplib to the call to port 25.

    My mistake was in my assumption that someone in this chain was taking care of it; it still seems to me to be a generally useful feature.

    My plan is now to continue my search for someone who has customised the SMTPHandler to do this, or else see if the SysLog handler is susceptible to the same problem, or else customise the SMTPHandler myself.

    Once again, thanks for your comment. Thanks for the pointer to where the cross-thread lock contention could be. I agree with your comments, and I hope you can see there’s a feature missing somewhere in the log->email chain.

    I have to second Christopher Dunn’s comments. Your logging is great; this is just one use-case. I will be continuing to use it, even if I have to go down the syslog path. I have just been bitten by a few surprises, which I hope to prevent some others from repeating.

  14. Julian,

    I should have phrased that part better (it was 2am at the time); what I meant to say was that you don’t want to send the email via the current thread (since that would hold up its execution to do the sending), which is why I gave the simple thread sender example. This would only cause a minor block (which is unavoidable since the whole framework requires locking) during the emit process. I agree the builtin SMTPHandler could be somewhat better in this regard, and should probably be documented as having this behavior. Although I don’t mind diving into code, things like this should be documented.

    Again, apologies for the misunderstanding.

  15. Hi Julian,

    The blocking can happen for any number of reasons, which are typically beyond an application developer’s control. Case in point: a DNS query (which happens very much under the hood) can block for longer than you would like. If your SMTP server were a public or even a local-network remote server accessed via a DNS lookup, this could bite you in the future, quite independently of your MTA or any aspect of the SMTP protocol. Network latency effects are completely unpredictable in the general case, after all.

    The answer is in general to design using a combination of multiple threads, multiple processes and multiple machines which is appropriate to the problem domain and the specifics of your situation. What about my suggestion regarding using a SocketHandler? Gabriel Genellina’s LoggingWebMonitor (linked to in my second post on this thread) gives you a ready-to-run program which you can modify to add an SMTPHandler for, and this will receive logging events from your application and send them via SMTP without holding up your main application, as long as there is no latency in the network connection between the LoggingWebMonitor machine and the machine running your application.

  16. Vinay,

    You explain that networks can create unpredictable delays. Again, I absolutely agree. You appear to be using that fact to justify (the basic) logger as being subject to unpredictable delays. I am using that same fact to justify a separate thread in the logging subsystem to protect the client from that problem. You argue that the problem can be solved by the application, which is true. I argue that such a problem is general enough (for SMTPHandler, at the least, but sometimes for the network-based handlers as well) that I wrongly assumed the logging system would handle it.

    When I picked up the logging subsystem, I did 10 seconds of due diligence. “Is there anything about my logging needs that is unusual or differentiates me from my competitors? No, this is a stock-standard server application from that perspective. Well then, don’t design your own system; go and use the standard system and normal practices.” I am still a little surprised to find that my needs aren’t stock-standard for a Python server. I am yet to really begin my search to find someone else who suffered the same problem and solved it.

    I did look at your SocketHandler suggestion, when you first offered it. It struck me as being equivalent to the SysLogHandler solution – in both cases, I would run a socket-listening application on the same machine as the monitored application, which would convert log messages into emails.

    In one, it would be a hand-written customisation of the Python code you linked to. The other would be a hand-configured version of a standard Unix utility. Much of a muchness.

    In both, I would need to stress-test it to confirm that the random blocking delays that you discuss aren’t too bad when talking to localhost and don’t still require another thread in the calling application – in which case I can add that in and drop the custom socket or the syslog socket and return to using the SMTP socket.

    Thanks again.

  17. I am using that same fact to justify a separate thread in the logging subsystem to protect the client from that problem.

    As I mentioned earlier (and I think you concurred), this could equally well have been a separate thread in smtplib – which would benefit all users of smtplib, not just logging and indirectly, in this case, your application.

    However, neither logging nor smtplib (nor the other networking modules such as imaplib, poplib etc.) implement this additional background thread to protect you from network latency surprises, because it’s only needed in scenarios which are arguably the domain of application design rather than library design.

  18. Having looked at SysLog, I can understand why Vinay Sajip was emphasizing customising a SocketHandler. It seems a lot of overhead to learn – particularly the mappings (e.g. modules to integers).

    I wouldn’t dismiss it if I were managing hundreds of disparate applications on disparate servers, but it doesn’t look like a good fit for me.

  19. Vinay,

    Again, I agree with you!

    The next step would be for me to show dozens of examples of other users of Python logging who suffered the same problem as me. That would hopefully be a convincing proof that this problem was general enough to be solved (somewhere) in the library stack rather than limited to the application.

    I exercised my Google Fu for a good while searching for such examples (not to point-score in an online discussion, but because I was hoping to find some code I could use in practice.)

    I found a couple of people who were having problems when their remote SMTP server was offline but none complaining (as I am) that a functioning SMTP server was too slow and interfering with the operation of their applications.

    I could well have missed some examples, but this doesn’t bode well for me having a portfolio of suffering users to convince you of the need for change.

    Looks like I will have to implement my own (SocketHandler and/or threaded LogHandler). More important tasks have priority for now.

  20. I would be loath to count on anything in the Python stdlib to handle a network operation in a timely manner. A lot of the stdlib (like Vinay describes) assumes a non-“mission critical” situation: lists & dicts aren’t threadsafe, SocketServer is slow, and so on. Serious stuff I’ve worked on ends up using a system of Queues and Locks for practically everything (I/O and data structures), which is probably how it should be anyway. A lot of “real-world” Python looks a lot different than the examples you’d see in the stdlib :), but if those examples were littered with Lock acquisition and Queue abstractions it would look waaaaaay too “enterprise-y”.

  21. Charles, I agree with your general sentiment.

    However, I am not sure about your examples. You want the basic types of lists and dicts to be thread-safe? I am not convinced that I do!

    I don’t want the overhead of locks being dragged into every single-threaded script I write. Even on my multi-threaded applications, most of the data-structures are local to that thread.

    There is also the question of what thread-safety means in an example like:

        for x in range(len(thread_safe_list)):
            thread_safe_list[x] = f(thread_safe_list[x])

    There is an application-level decision to be made about what locking is appropriate on this code, when other threads can be mutating the list in the meantime (see the sketch at the end of this comment). The Queue.Queue is a special case where the semantics can be specified easily.

    Do any other languages have thread-safe mutable collection types by default? Perhaps there are solutions that I can’t see.
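
    For instance, if the whole traversal needs to be atomic, only the application can know to write (names illustrative):

        import threading

        list_lock = threading.Lock()  # application-chosen lock guarding the list

        with list_lock:               # the application decides the critical section
            for x in range(len(thread_safe_list)):
                thread_safe_list[x] = f(thread_safe_list[x])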

  22. Yeah, those are good points. It was just weird to me that everything I was used to using as a Python programmer was suddenly inadequate for doing serious work. I suppose this might actually speak to Python’s awesomeness as a language – like, no one assumes C arrays are threadsafe. Maybe, like you point out, it’s a usage issue. Maybe the stdlib is for scripts, small programs, and “native” datatypes, and for anything more serious you have to look elsewhere or roll your own. It’s confusing though; logging is a good example, because I can’t think of any other Python logging packages, probably because everyone thinks stdlib logging is great. I mean, it is, but it’s not as turnkey as it first appears.

    Maybe I want some kind of enterprise framework with threadsafe data structures (that can be transparent interfaces for persistence layers like an RDBMS or key-value store) and I/O interfaces that actually utilize a task queue/event loop. Heh, maybe I’m talking about Twisted… kind of a lot to think about :).

  23. Charlie,

    It’s not a good idea for the basic Python data structures to be thread-safe; it would introduce too much overhead. Look at Java, which started out with Vector and StringBuffer (thread-safe, because synchronised on all or most operations) but later added ArrayList and StringBuilder because the overhead of thread safety was too much for universal usage. Nowadays you hardly ever see Vector or StringBuffer even in “enterprise” Java code – and that’s because they just aren’t that useful for most scenarios. Threading is moderately hard, and even experienced developers get bitten from time to time. It’s always best for the application developer to take control of how threading is organised, rather than expecting library modules to do it – they can’t anticipate all scenarios and still remain maximally useful.

    The stdlib contains the full panoply of threading helpers – locks, conditions, events, semaphores etc. – as well as the thread-safe Queue data structure, which is handy for passing info between threads. I’ve not encountered any real problems implementing concurrent solutions for “enterprise” scenarios.

    I’m not sure what you’re expecting from the logging package in terms of being “more turnkey” – feel free to post questions on comp.lang.python if there’s something particular you think is inordinately hard to do with the package.

  24. Hi Julian,

    I know this thread was a while ago, but I have been pretty busy all summer. Anyway, I see (having come here to post this comment) that you’ve developed a non-blocking handler and put it up on PyPI – great!

    You may be interested in what I came up with to solve the same problem. It grew out of a question about using logging together with multiprocessing and so my solution has evolved a little differently than if I had looked at solving your problem directly. However, I think it would work for you (even though you’ve implemented your own solution already), and as you have experienced this problem yourself, I’d be grateful for any input you might have about:

    http://plumberjack.blogspot.com/2010/09/improved-queuehandler-queuelistener.html

    Thanks!
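
    (The classes described there later landed in the standard library, from Python 3.2, as logging.handlers.QueueHandler and QueueListener; with those, the wiring is roughly:)

        import logging
        import logging.handlers
        import queue

        q = queue.Queue(-1)  # unbounded
        smtp_handler = logging.handlers.SMTPHandler(
            "localhost", "app@example.com", ["me@example.com"], "Critical log message")

        logging.getLogger().addHandler(logging.handlers.QueueHandler(q))
        listener = logging.handlers.QueueListener(q, smtp_handler)
        listener.start()  # a background thread drains the queue and sends the email
        # ... at shutdown:
        listener.stop()   # processes remaining records, then exits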

  25. Marrow Mailer (formerly TurboMail) offers a logging handler which can deliver e-mail in a background thread, amongst numerous other features. I realize this post is old, but this may help others running into problems like this.


Web Mentions

  1. Tweets that mention OddThinking » How Python Logging Bit Me, Part 1. -- Topsy.com

  2. OddThinking » Python Message-Buffering Logging Handler Design Notes

  3. OddThinking » Introducing Nonblocking Log Handler