Parsing Apache Log Files with Python

By , on [atom feed]

The common log format of the Apache web server contains information about page views on a web page. Today I figured out how this format can be parsed in a Python program.

Typical log entries look like this (I added line breaks for better readability):

66.249.71.138 - - [20/Sep/2008:10:09:09 +0100]
  "GET / HTTP/1.1"
  200 4235 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
67.15.143.7 - - [20/Sep/2008:13:35:06 +0100]
  "GET //errors.php?error=http://www.rooksgambit.org/thebigsystem//sistem.txt? HTTP/1.1"
  404 2272 "-" "libwww-perl/5.79"

The first step of parsing these lines is easy, the regular expression library allows to break a line into the individual fields:

import re

parts = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # indent %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.+)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referer>.*)"',               # referer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')

line = ... a line from the log file ...
m = pattern.match(line)
res = m.groupdict()

This results in a Python dictionary, containing the individual fields as strings.

The second step is to convert the field values from strings to Python data types and to remove the oddities of the log file format. Most fields are easy:

if res["user"] == "-":
    res["user"] = None

res["status"] = int(res["status"])

if res["size"] == "-":
    res["size"] = 0
else:
    res["size"] = int(res["size"])

if res["referer"] == "-":
    res["referer"] = None

Converting the time value into a Python datetime object is surprisingly messy, mainly because of the timezone information. The best way I found until now is the following:

import time, datetime

class Timezone(datetime.tzinfo):

    def __init__(self, name="+0000"):
        self.name = name
        seconds = int(name[:-2])*3600+int(name[-2:])*60
        self.offset = datetime.timedelta(seconds=seconds)

    def utcoffset(self, dt):
        return self.offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return self.name

tt = time.strptime(res["time"][:-6], "%d/%b/%Y:%H:%M:%S")
tt = list(tt[:6]) + [ 0, Timezone(res["time"][-5:]) ]
res["time"] = datetime.datetime(*tt)

If you know of any better way of doing this, please let me know. Hints are very welcome!

This is an excerpt from Jochen's blog.
Newer entry: Durham, NC
Older entry: Large Deviations for One Dimensional Diffusions with a Strong Drift

Copyright © 2008 Jochen Voss. All content on this website (including text, pictures, and any other original works), unless otherwise noted, is licensed under the CC BY-SA 4.0 license.