Today I’d like to present a little Python script that extrapolates Webalizer reports. Webalizer is a “fast, free web server log file analysis program” that my webhost uses to produce HTML traffic reports. These reports include month-by-month totals for the last 12 months. Unfortunately, the numbers for the current (partial) month aren’t directly comparable to those for previous months. This script grabs the summary report and does a linear extrapolation on the most recent month. It’s a bit silly, but I like the apples-to-apples comparison.
Fetch and Parse
The first thing to do is fetch the summary report and extract the relevant data from it. The parsing code you see here is ridiculously specific to the version and configuration of Webalizer that my host runs. It’d be no big deal to change it if necessary, and this is much quicker than trying to be general. Here’s the code:
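def get_report(url):
    # Download the summary report and return its HTML as a string.
    fp = urllib2.urlopen(url)
    s = fp.read()
    fp.close()
    return s

def get_gen_date(s):
    # Grab the timestamp between "Generated" and the following <BR>.
    t = re.search(r'Generated\s+(.+)<BR>', s).group(1)
    return time.strptime(t, '%d-%B-%Y %H:%M %Z')

def get_table_row(s, i):
    # Collect the text of every <TD ...><FONT ...>text</FONT></TD> cell.
    cells = re.findall(r'<TD.+<FONT.+>(.+)</FONT>.*</TD>', s)
    # Keep columns 0, 5 and 8-12 of the i-th 13-column row.
    return cells[i*13+0:i*13+1] + cells[i*13+5:i*13+6] + cells[i*13+8:i*13+13]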
I use regexes to search for certain “magic” HTML patterns. I assume that the results table is 13 columns wide, and that the data I want are in certain specific columns:
- Column 0: Month
- Column 5: Uniques
- Column 8: KB Out
- Column 9: Visits
- Column 10: Pages
- Column 11: Files
- Column 12: Hits
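To make the slicing easier to follow, here is a minimal sketch of the same column selection written with explicit indices. It is written for Python 3 (unlike the Python 2 script in this post), and the names `COLUMNS_PER_ROW`, `WANTED`, and `pick_columns` are mine, as is the made-up cell data; it is an illustration, not a replacement for `get_table_row()`.

```python
COLUMNS_PER_ROW = 13                 # assumed table width, as in the post
WANTED = (0, 5, 8, 9, 10, 11, 12)    # Month, Uniques, KB Out, Visits, Pages, Files, Hits

def pick_columns(cells, row):
    """Return the wanted columns of the given row from a flat list of cells."""
    base = row * COLUMNS_PER_ROW
    return [cells[base + c] for c in WANTED]

# Two made-up rows; each cell is labelled "r<row>c<column>".
cells = ['r%dc%d' % (r, c) for r in range(2) for c in range(COLUMNS_PER_ROW)]

print(pick_columns(cells, 0))
# ['r0c0', 'r0c5', 'r0c8', 'r0c9', 'r0c10', 'r0c11', 'r0c12']
```

One difference worth knowing: explicit indexing raises an `IndexError` when the row is missing, whereas the slices in `get_table_row()` quietly return a shorter list.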
Extrapolation
To do my linear extrapolation, I need to compare the time between report generation and the start of the month against the length of the entire month. This is a little fiddly, but not too bad:
def get_factor(t):
    # som = start of this month, eom = start of next month (seconds since epoch).
    if (t[1] == 12):
        som = time.mktime((t[0], t[1], 1, 0, 0, 0, 0, 0, -1))
        eom = time.mktime((t[0]+1, 1, 1, 0, 0, 0, 0, 0, -1))
    else:
        som = time.mktime((t[0], t[1], 1, 0, 0, 0, 0, 0, -1))
        eom = time.mktime((t[0], t[1]+1, 1, 0, 0, 0, 0, 0, -1))
    # Length of the whole month divided by the time elapsed so far.
    return (eom-som)/(time.mktime(t)-som)
Here, t is a Python time.struct_time sequence. Its first three elements are year, month (1 to 12), and day. Its last element is a DST indicator; a value of -1 means “do the right thing”. DST might not be handled 100% correctly in some cases. Boo hoo.
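For a feel of the numbers, here is a small sketch of the same ratio computed with `datetime` instead of `time.mktime()`. It is written for Python 3, the name `month_factor` is mine, and it uses naive datetimes, so it ignores DST entirely rather than trying to handle it. Treat it as a worked example of the arithmetic, not as the code the script runs.

```python
from datetime import datetime

def month_factor(now):
    """Return (length of the month) / (time elapsed since the month began)."""
    start = datetime(now.year, now.month, 1)
    if now.month == 12:
        end = datetime(now.year + 1, 1, 1)   # December rolls over to next January
    else:
        end = datetime(now.year, now.month + 1, 1)
    elapsed = (now - start).total_seconds()
    if elapsed <= 0:
        # Exactly at the start of the month there is nothing to extrapolate from.
        raise ValueError('no data yet for this month')
    return (end - start).total_seconds() / elapsed

# Midnight on 16 June: 15 of June's 30 days have elapsed.
print(month_factor(datetime(2010, 6, 16)))  # 2.0
```

A factor of 2.0 means every total for the month so far gets doubled. The explicit check for zero elapsed time is an addition of mine; `get_factor()` would hit a division by zero in that case.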
Output
Okay, now we can dump some output to the console.
def output(t, r, f):
    print 'Built on', time.strftime('%d-%B-%Y %H:%M %Z', t), 'for', r[0]
    print
    # d = [uniques, KB out, visits, pages, files, hits], each scaled by f.
    d = [int(f*int(i)) for i in r[1:]]
    print 'Projected Uniques/Visits: %d / %d' % (d[0],d[2])
    print 'Projected Pages/Files/Hits: %d / %d / %d' % tuple(d[-3:])
    # KB -> GB, taking 1 GB as 1,000,000 KB.
    print 'Projected GB Served: %.3f' % (d[1]/1000000.0)
Note that:
- t is a time.struct_time returned by get_gen_date()
- r is a list returned by get_table_row()
- f is a float returned by get_factor() (or 1 when no extrapolation is wanted)
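The index juggling in `output()` is easier to read with names attached. Below is a hedged Python 3 sketch of the same scaling step; the function name `project`, the dictionary keys, and the sample row are all made up for illustration, and the GB figure assumes 1 GB = 1,000,000 KB, matching the division in `output()`.

```python
def project(row, factor):
    """Scale one table row by the extrapolation factor and name the fields."""
    # row[0] is the month label; the rest are counts stored as strings.
    uniques, kb_out, visits, pages, files, hits = [
        int(factor * int(value)) for value in row[1:]
    ]
    return {
        'month': row[0],
        'uniques': uniques,
        'visits': visits,
        'pages': pages,
        'files': files,
        'hits': hits,
        'gb_served': kb_out / 1000000.0,  # KB -> GB (decimal units)
    }

# Made-up row: month, uniques, KB out, visits, pages, files, hits.
row = ['Jun 2010', '100', '2000000', '150', '300', '400', '500']
result = project(row, 2.0)
print(result['uniques'], result['visits'], result['gb_served'])  # 200 300 4.0
```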
Make It Go
Here’s the main function that co-ordinates all the parts:
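def report():
    s = get_report(PUT_AN_APPROPRIATE_URL_HERE)
    t = get_gen_date(s)
    f = get_factor(t)
    if (f < 60):
        # Enough data: extrapolate the current month (row 0).
        output(t, get_table_row(s, 0), f)
    else:
        # Too little data: show the previous month (row 1) unscaled.
        output(t, get_table_row(s, 1), 1)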
Please note that when I don’t have enough data for the current month, I report non-extrapolated data from the previous month instead. I set the threshold at a factor of 60, which works out to about 12 hours of data in a 30-day month. (This code assumes that there is a previous month of data.)
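The 12-hour figure comes from dividing the length of the month by the factor. A quick Python 3 check, using a helper name (`hours_of_data`) that is purely illustrative:

```python
def hours_of_data(factor, days_in_month=30):
    """Hours elapsed in the month when the extrapolation factor has this value."""
    return days_in_month * 24.0 / factor

print(hours_of_data(60))      # 720 hours / 60 = 12.0
print(hours_of_data(60, 31))  # 744 hours / 60, a little over 12
```

So the cutoff drifts by under an hour between the shortest and longest months, which is close enough for this purpose.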
Code
Here’s the complete script:
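#!/usr/bin/python
import urllib2
import time
import re

def get_report(url):
    # Download the summary report and return its HTML as a string.
    fp = urllib2.urlopen(url)
    s = fp.read()
    fp.close()
    return s

def get_gen_date(s):
    # Grab the timestamp between "Generated" and the following <BR>.
    t = re.search(r'Generated\s+(.+)<BR>', s).group(1)
    return time.strptime(t, '%d-%B-%Y %H:%M %Z')

def get_table_row(s, i):
    # Collect the text of every <TD ...><FONT ...>text</FONT></TD> cell.
    cells = re.findall(r'<TD.+<FONT.+>(.+)</FONT>.*</TD>', s)
    # Keep columns 0, 5 and 8-12 of the i-th 13-column row.
    return cells[i*13+0:i*13+1] + cells[i*13+5:i*13+6] + cells[i*13+8:i*13+13]

def get_factor(t):
    # som = start of this month, eom = start of next month (seconds since epoch).
    if (t[1] == 12):
        som = time.mktime((t[0], t[1], 1, 0, 0, 0, 0, 0, -1))
        eom = time.mktime((t[0]+1, 1, 1, 0, 0, 0, 0, 0, -1))
    else:
        som = time.mktime((t[0], t[1], 1, 0, 0, 0, 0, 0, -1))
        eom = time.mktime((t[0], t[1]+1, 1, 0, 0, 0, 0, 0, -1))
    # Length of the whole month divided by the time elapsed so far.
    return (eom-som)/(time.mktime(t)-som)

def output(t, r, f):
    print 'Built on', time.strftime('%d-%B-%Y %H:%M %Z', t), 'for', r[0]
    print
    # d = [uniques, KB out, visits, pages, files, hits], each scaled by f.
    d = [int(f*int(i)) for i in r[1:]]
    print 'Projected Uniques/Visits: %d / %d' % (d[0],d[2])
    print 'Projected Pages/Files/Hits: %d / %d / %d' % tuple(d[-3:])
    # KB -> GB, taking 1 GB as 1,000,000 KB.
    print 'Projected GB Served: %.3f' % (d[1]/1000000.0)

def report():
    s = get_report(PUT_AN_APPROPRIATE_URL_HERE)
    t = get_gen_date(s)
    f = get_factor(t)
    if (f < 60):
        # Enough data: extrapolate the current month (row 0).
        output(t, get_table_row(s, 0), f)
    else:
        # Too little data: show the previous month (row 1) unscaled.
        output(t, get_table_row(s, 1), 1)

if __name__ == '__main__':
    report()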