Today we’re back to implementing Full-Text Search (FTS) on the iPhone. On Monday I presented an FTS prototype written in Python, and since then I’ve been working on porting it to Objective-C, targeting Cocoa and iPhone OS. I’ve implemented a working prototype on that platform – albeit one that runs very slowly on the device – and today I’d like to present some of the more interesting parts of this work-in-progress.
Tokenization
We need some code to cut a string into indexable tokens. I put this code into a (singleton) class called Tokenizer, with the following declaration:
@interface Tokenizer : NSObject
{
NSSet* stopWords;
NSCharacterSet* splitChars;
}
+ (Tokenizer*)sharedTokenizer;
- (NSSet*)stopWords;
- (NSCharacterSet*)splitChars;
- (NSMutableSet*)tokenize:(NSString*)string;
@end
The tokenize: method is obviously the workhorse here. Its implementation is straightforward, and relies upon the very sophisticated string and internationalization (i18n) support built into the iPhone OS:
- (NSMutableSet*)tokenize:(NSString*)string
{
// Remove diacritics and convert to lower case
string = [string stringByFoldingWithOptions:NSCaseInsensitiveSearch|NSDiacriticInsensitiveSearch locale:[NSLocale systemLocale]];
// Split on [^a-z0-9']
NSArray* a = [string componentsSeparatedByCharactersInSet:[self splitChars]];
// Convert to a set, remove stopwords (including the empty string), and return
NSMutableSet* s = [NSMutableSet setWithArray:a];
[s minusSet:[self stopWords]];
return s;
}
This code relies upon the other two Tokenizer instance methods, stopWords and splitChars. These are simple lazy accessors that return frequently-referenced values:
- (NSSet*)stopWords
{
if (stopWords) return stopWords;
NSString* path = [[NSBundle mainBundle] pathForResource:@"stopwords" ofType:@"plist"];
return (stopWords = [[NSSet alloc] initWithArray:[NSArray arrayWithContentsOfFile:path]]);
}
- (NSCharacterSet*)splitChars
{
if (splitChars) return splitChars;
NSCharacterSet* cs = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyz0123456789'"];
return (splitChars = [[cs invertedSet] retain]);
}
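For comparison, here is roughly the same fold/split/subtract pipeline sketched in Python (the language of Monday's prototype). The stop word list below is a stand-in for the real one, and unicodedata only approximates the OS's diacritic folding:

```python
import re
import unicodedata

STOP_WORDS = {"", "the", "a", "and"}  # stand-in for the real MySQL-derived list

def tokenize(text):
    # Remove diacritics and lower-case (approximates stringByFoldingWithOptions:)
    folded = unicodedata.normalize("NFKD", text)
    folded = "".join(c for c in folded if not unicodedata.combining(c)).lower()
    # Split on [^a-z0-9']
    tokens = set(re.split(r"[^a-z0-9']", folded))
    # Subtracting the stop words also removes the empty strings from splitting
    return tokens - STOP_WORDS

print(tokenize("The Café's Menu"))  # → {"cafe's", "menu"}
```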
The stopWords method references a stopwords.plist file, which I built in Python (with plistlib) from the MySQL stop word list. The only non-obvious thing here is that I added the empty string to stopwords.plist, since it can be generated during string splits, and should be ignored for the purposes of indexing.
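Generating such a file takes only a few lines of Python; a sketch (the words below are placeholders, not the actual MySQL list):

```python
import plistlib

# Placeholder words; the real file uses the MySQL stop word list
stop_words = ["a", "about", "an", "are", "as", "at"]
stop_words.append("")  # the empty string, produced by string splits

with open("stopwords.plist", "wb") as f:
    plistlib.dump(stop_words, f)
```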
The performance of this code is actually pretty good; on the device it takes about 1.5s to tokenize 79 blocks of text totaling 175,553 bytes. Since I’d only be tokenizing a small amount of text (several hundred bytes?) at a time in the normal course of operations, this is more than fast enough for me.
Schema
Here’s the Core Data schema I’m using for my prototype. (Please note that this is likely to change, as I work to resolve the performance problems I’ll be discussing shortly.)
What you see is pretty much what you get:
- The text attribute is a mandatory string
- The keyword attribute is a mandatory, indexed string
- The words and hits relationships are optional, to-many, and mutual inverses
- The words and hits relationships have Nullify delete rules
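Conceptually this is a plain inverted index: each Keyword knows which Texts contain it, and each Text knows which Keywords it contains. A minimal in-memory Python sketch of the same shape (class and method names are mine, not Core Data's):

```python
class Keyword:
    def __init__(self, keyword):
        self.keyword = keyword
        self.hits = set()    # Texts containing this keyword

class Text:
    def __init__(self, text):
        self.text = text
        self.words = set()   # Keywords occurring in this text

def link(text, keyword):
    # Maintain the mutual inverse by hand, as Core Data does automatically
    text.words.add(keyword)
    keyword.hits.add(text)
```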
Indexing
I elected to put my indexing code inside the willSave method of my Text managed objects. Not only does it seem reasonable to search only the saved versions of objects, but my app's workflow won't permit the user to access FTS until all edits have been either committed or abandoned. Here's the current version of the indexing code:
- (void)willSave
{
// Superclass
[super willSave];
// Omit keyword processing for deleted objects
// (This was missing from the first version of this post, and is vitally important)
if (self.isDeleted) return;
// Process this object for keywords
NSMutableSet* tokens = [[Tokenizer sharedTokenizer] tokenize:self.text];
// Create and perform a request for existing keyword records for these tokens
NSError* err = nil;
NSFetchRequest* request = [[NSFetchRequest alloc] init];
request.entity = [NSEntityDescription entityForName:@"Keyword" inManagedObjectContext:self.managedObjectContext];
request.predicate = [NSPredicate predicateWithFormat:@"keyword IN %@",tokens];
NSMutableSet* keywords = [NSMutableSet setWithArray:[self.managedObjectContext executeFetchRequest:request error:&err]];
[request release];
// Find nonexistent keywords (by removing existing keywords from tokens) ...
[tokens minusSet:[keywords valueForKey:@"keyword"]];
// ... then create them
for (id k in tokens)
{
NSManagedObject* newManagedObject = [NSEntityDescription insertNewObjectForEntityForName:@"Keyword" inManagedObjectContext:self.managedObjectContext];
[newManagedObject setValue:k forKey:@"keyword"];
[keywords addObject:newManagedObject];
}
// Make changes
NSMutableSet* keywordsToAdd = [[keywords mutableCopy] autorelease];
NSMutableSet* keywordsToRemove = [[self.words mutableCopy] autorelease];
[keywordsToAdd minusSet:self.words];
[keywordsToRemove minusSet:keywords];
if ([keywordsToAdd count]) [self addWords:keywordsToAdd];
if ([keywordsToRemove count]) [self removeWords:keywordsToRemove];
}
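The bookkeeping at the end is just set algebra: compute the additions and removals, then touch the relationship only if something actually changed. In Python terms:

```python
def diff_keywords(current, desired):
    # Keywords to add: desired but not yet linked to this text
    to_add = desired - current
    # Keywords to remove: linked but no longer present in the text
    to_remove = current - desired
    return to_add, to_remove

add, remove = diff_keywords({"cat", "dog"}, {"dog", "fish"})
# add == {"fish"}, remove == {"cat"}
```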
This runs like a pigdog. On my device (iPhone 3G) this code takes perhaps 90s above and beyond the time normally required to tokenize and commit the 175KB of test text to the DB. (I.e. it takes 90s to build the index.) This might be just barely acceptable, if I were sure that I'd never be indexing more than about 200 bytes at a time. Frankly, I'm not sure of that, and want to improve this code's performance.
Search
Now that we've gone to all the trouble to build an index, it's time to search it. For now, I'm attaching the following predicate to the NSFetchRequest I'm using to retrieve Text objects:
[NSPredicate predicateWithFormat:@"ANY words.keyword IN %@",keywords]
(This assumes that I've got a keywords collection containing the NSString* keywords for which I want to search.)
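The predicate's semantics (return every Text linked to at least one of the given keywords) can be expressed directly against an inverted index; a self-contained Python sketch, with the index represented as a plain dict:

```python
def search(index, query_keywords):
    # index maps keyword -> set of text ids; this mirrors "ANY words.keyword IN"
    results = set()
    for kw in query_keywords:
        results |= index.get(kw, set())
    return results

index = {"cat": {1, 3}, "dog": {2, 3}}
print(search(index, ["cat", "dog"]))  # → {1, 2, 3}
```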
This doesn’t work so well – it takes about 7s to process one lousy keyword. That’s not even sort of acceptable, so we’ll have to work on it. See you tomorrow.