By Jochen Voss, on
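The common log format of the Apache web server records one line for every request made to a web site. Today I figured out how this format can be parsed in a Python program. (Strictly speaking, the entries shown below are in the "combined" variant, which appends the referer and the user agent to the common fields.)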
Typical log entries look like this (I added line breaks for better readability):
66.249.71.138 - - [20/Sep/2008:10:09:09 +0100] "GET / HTTP/1.1" 200 4235 "-"
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
67.15.143.7 - - [20/Sep/2008:13:35:06 +0100]
    "GET //errors.php?error=http://www.rooksgambit.org/thebigsystem//sistem.txt? HTTP/1.1"
    404 2272 "-" "libwww-perl/5.79"
The first step of parsing these lines is easy. Python's regular expression module, re, can break a line into its individual fields:
import re
parts = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # ident %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.+)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referer>.*)"',               # referer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]
pattern = re.compile(r'\s+'.join(parts)+r'\s*\Z')
line = ... a line from the log file ...
m = pattern.match(line)
res = m.groupdict()
This results in a Python dictionary containing the individual fields as strings.
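As a quick check, the pattern can be tried on the first sample entry from above. The following is only a sketch: the helper name `parse_line` is made up for this illustration, and returning `None` for lines that do not match is just one possible choice (calling `m.groupdict()` directly fails with an `AttributeError` when the line does not match).

```python
import re

parts = [
    r'(?P<host>\S+)',          # host %h
    r'\S+',                    # ident %l (unused)
    r'(?P<user>\S+)',          # user %u
    r'\[(?P<time>.+)\]',       # time %t
    r'"(?P<request>.+)"',      # request "%r"
    r'(?P<status>[0-9]+)',     # status %>s
    r'(?P<size>\S+)',          # size %b (can be '-')
    r'"(?P<referer>.*)"',      # referer
    r'"(?P<agent>.*)"',        # user agent
]
pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')

def parse_line(line):
    """Hypothetical helper: return the fields as a dict, or None if no match."""
    m = pattern.match(line)
    if m is None:
        return None
    return m.groupdict()

# First sample entry from the log excerpt, joined back into a single line.
sample = ('66.249.71.138 - - [20/Sep/2008:10:09:09 +0100] "GET / HTTP/1.1" '
          '200 4235 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"')

res = parse_line(sample)
print(res["host"], res["time"], res["request"], res["status"], res["size"])
```

All values are still strings at this point, including the status and the size.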
The second step is to convert the field values from strings to Python data types and to remove the oddities of the log file format. Most fields are easy:
if res["user"] == "-":
    res["user"] = None

res["status"] = int(res["status"])

if res["size"] == "-":
    res["size"] = 0
else:
    res["size"] = int(res["size"])

if res["referer"] == "-":
    res["referer"] = None
Converting the time value into a Python datetime object is surprisingly messy, mainly because of the timezone information. The best way I have found so far is the following:
import time, datetime

class Timezone(datetime.tzinfo):
    def __init__(self, name="+0000"):
        self.name = name
        # The sign applies to hours and minutes alike (think of "-0530").
        sign = -1 if name[0] == "-" else 1
        minutes = int(name[-4:-2])*60 + int(name[-2:])
        self.offset = datetime.timedelta(minutes=sign*minutes)

    def utcoffset(self, dt):
        return self.offset

    def dst(self, dt):
        return datetime.timedelta(0)

    def tzname(self, dt):
        return self.name

# Parse everything except the trailing " +0100", then attach the timezone.
tt = time.strptime(res["time"][:-6], "%d/%b/%Y:%H:%M:%S")
tt = list(tt[:6]) + [ 0, Timezone(res["time"][-5:]) ]
res["time"] = datetime.datetime(*tt)
If you know of any better way of doing this, please let me know. Hints are very welcome!
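One possible shortcut, on Python versions where `datetime.datetime.strptime` understands the `%z` directive (Python 3.2 and later, as far as I know), is to let `strptime` handle the offset itself. This is only a sketch under that assumption: the function name `parse_time` is made up for the example, and `%b` relies on the month abbreviations in the log matching the current locale (English names, as in the sample entries).

```python
import datetime

def parse_time(value):
    """Hypothetical helper: parse an Apache %t value such as
    '20/Sep/2008:10:09:09 +0100' into a timezone-aware datetime."""
    # %z consumes the "+0100" part and attaches a fixed-offset tzinfo.
    return datetime.datetime.strptime(value, "%d/%b/%Y:%H:%M:%S %z")

t = parse_time("20/Sep/2008:10:09:09 +0100")
print(t.isoformat())   # 2008-09-20T10:09:09+01:00
```

With this approach no custom tzinfo subclass is needed, at the cost of requiring a newer Python.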
This is an excerpt from Jochen's blog.
Copyright © 2008 Jochen Voss. All content on this website (including text, pictures, and any other original works), unless otherwise noted, is licensed under the CC BY-SA 4.0 license.