Today, just a quick little project: Let’s use Python to extract Congressional Districts from web pages. This is mostly a regex demo.
Motivation
This project grew out of something I built for a friend: the URL scraper on the stoptheleft.org site. If you see a post or article that lists a bunch of competitive House races (by Congressional District), then you can run that post or article through the scraper to get a list of donation links. This lets you direct your money to where it will likely have the most impact.
There’s a certain amount of tedious HTML and JS mucking about to make the scraper work, and the server-side stuff is written in PHP (blah), so this demo is just going to concentrate on the core filtering logic, demonstrated in Python.
Code
Here’s the interesting part of the code:
# The regex and URL-fetching bits need these
import re
import urllib

all_cds = set([ 'AKAL', 'AL01', 'AL02', 'AL03', 'AL04', 'AL05', 'AL06', 'AL07', 'AR01', 'AR02', 'AR03', 'AR04',
'AZ01', 'AZ02', 'AZ03', 'AZ04', 'AZ05', 'AZ06', 'AZ07', 'AZ08', 'CA01', 'CA02', 'CA03', 'CA04',
# … lots more districts …
'WA07', 'WA08', 'WA09', 'WI01', 'WI02', 'WI03', 'WI04', 'WI05', 'WI06', 'WI07', 'WI08', 'WV01',
'WV02', 'WV03', 'WYAL'])
# Transform a CD to a standard representation
def normalize(od):
    # Standardize on upper case
    od = od.upper()
    # Find state code
    st = od[:2]
    # Find district number
    if (od[2] == '-'):
        cd = od[3:]
    else:
        cd = od[2:]
    # Return result
    if (cd[:2] == 'AL'):
        return st+'AL'
    else:
        return '%s%02d'%(st,int(cd))

# Pull a page
def fetch(url):
    return urllib.urlopen(url).read()

# Pull all CD-looking things from a string
def filter(s):
    return re.findall('[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))', s, re.I)

# Remove, from a list of possible (normalized) CDs, all duplicates and invalid codes
# Preserve original order where possible
def cull(l):
    d_list = []; d_set = set()
    for cd in l:
        if (cd in d_set) or (cd not in all_cds): continue
        d_list.append(cd); d_set.add(cd)
    return d_list
(You can also download the whole thing here.)
Regex
The heart of this little program is pretty clearly the regular expression inside filter. Here it is again:
[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))
It’s a bit complicated, so let’s take it piece-by-piece. It begins with two letters:
[a-z]{2}
… followed by an optional hyphen:
-?
… followed by a non-capturing group that matches one of three things:
(?:…|…|…)
The first possible match:
(?:[0-9]{2}(?=[^0-9]|$))
… is a non-capturing group that begins with two digits:
[0-9]{2}
… followed by a lookahead to a non-digit or the end-of-string.
(?=[^0-9]|$)
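The lookahead is what keeps the pattern from grabbing the first two digits of a longer run of numbers. A quick, made-up check in the interpreter (with the pattern stored in a variable I’ll call cd_re) shows the effect:
import re
cd_re = '[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))'
re.findall(cd_re, 'CA12', re.I)   # ['CA12']
re.findall(cd_re, 'CA123', re.I)  # [] -- the '12' is followed by another digit, so the lookahead fails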
The second possible match:
(?:[0-9](?=[^0-9]|$))
… is a non-capturing group that begins with a digit:
[0-9]
… followed by a lookahead to a non-digit or the end-of-string.
(?=[^0-9]|$)
The third possible match is just a non-capturing group composed of the letters “al”:
(?:al)
So, to sum up: a Congressional District code is made up of two letters, possibly followed by a hyphen, followed by:
- two digits when followed by a non-digit or end-of-string, or
- one digit when followed by a non-digit or end-of-string, or
- the letters “al”
This is all specified in lower-case, but the re.I argument passed to findall() triggers a case-insensitive search.
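Putting the pieces together, here’s a small, made-up example of what findall() pulls out of a chunk of text; the trailing “tf-8” hit previews the junk problem discussed next:
import re
cd_re = '[a-z]{2}-?(?:(?:[0-9]{2}(?=[^0-9]|$))|(?:[0-9](?=[^0-9]|$))|(?:al))'
re.findall(cd_re, 'Donate in VA-5, ca11, and AK-AL. charset=utf-8', re.I)
# ['VA-5', 'ca11', 'AK-AL', 'tf-8']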
Culling
The regex gets us far, but not quite all the way home. There’s a lot of junk in modern web pages that matches our pattern; consider the “utf-8” charset declaration found almost everywhere. (Our regex will match this as “tf-8”.) There are a number of approaches one might take to filtering out this junk:
- (Even) more complex regexes
- Semantic analysis (ignore SCRIPT tags, for instance)
- Whitelisting
We’ll be using whitelisting, as it’s the simplest approach. There are only 435 valid congressional districts, which means that they’re (relatively) easily enumerated, and that it’s pretty unlikely that a random spurious match will pass the whitelist test.
The whitelist is represented by a set and stored in the all_cds global variable. The CDs in the set were retrieved from this handy XML file taken from the NRCC’s website.
CDs are normalized into a standard form (two capital letters, followed by two digits) before whitelist testing; this also facilitates duplicate detection.
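Here’s a rough sketch of the two steps working together, assuming VA05, CA11, and AKAL all appear in the full all_cds set (they do in the real whitelist):
raw = ['VA-5', 'ca11', 'AK-AL', 'tf-8', 'va-05']
map(normalize, raw)        # ['VA05', 'CA11', 'AKAL', 'TF08', 'VA05']
cull(map(normalize, raw))  # ['VA05', 'CA11', 'AKAL'] -- TF08 fails the whitelist; the second VA05 is a duplicate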
Testing
You can test the scraper with code like this:
print '\n'.join(cull(map(normalize, filter(fetch('http://www.nationalreview.com/corner/247706/first-reads-64-jonah-goldberg')))))
Running against the URL in the example yields pretty good results: Of the 64 CDs that should be returned, only one is missing (IN08), and that’s due to a typo (a missing “I”) in the original post. There are five spurious CDs thrown off (IA01, TX06, NE01, NE02, NE03) by the behind-the-scenes junk (complex link URLs, hidden input fields, JavaScript) found on most modern web pages, but I don’t think that’s too bad for so little code.
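If you’d rather not depend on a live URL, you can feed the same pipeline a literal chunk of HTML instead of calling fetch(); the fragment below is made up for illustration:
# A made-up fragment standing in for a fetched page
page = '<meta charset="utf-8"><p>Give to VA-5, ca11, and AK-AL today!</p>'
print '\n'.join(cull(map(normalize, filter(page))))
# Prints:
# VA05
# CA11
# AKAL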