Today we’re back to implementing Full-Text Search (FTS) on the iPhone. On Monday I presented an FTS prototype written in Python, and since then I’ve been working on porting it to Objective-C, targeting Cocoa and iPhone OS. I’ve implemented a working prototype on that platform – albeit one that runs very slowly on the device – and today I’d like to present some of the more interesting parts of this work-in-progress.
Tokenization
We need some code to cut a string into indexable tokens. I put this code into a (singleton) class called Tokenizer, with the following declaration:
@interface Tokenizer : NSObject
{
NSSet* stopWords;
NSCharacterSet* splitChars;
}
+ (Tokenizer*)sharedTokenizer;
- (NSSet*)stopWords;
- (NSCharacterSet*)splitChars;
- (NSMutableSet*)tokenize:(NSString*)string;
@end
The tokenize: method is obviously the workhorse here. Its implementation is straightforward, and relies upon the very sophisticated string and internationalization (i18n) support built into the iPhone OS:
- (NSMutableSet*)tokenize:(NSString*)string
{
// Remove diacritics and convert to lower case
string = [string stringByFoldingWithOptions:NSCaseInsensitiveSearch|NSDiacriticInsensitiveSearch locale:[NSLocale systemLocale]];
// Split on [^a-z0-9']
NSArray* a = [string componentsSeparatedByCharactersInSet:[self splitChars]];
// Convert to a set, remove stopwords (including the empty string), and return
NSMutableSet* s = [NSMutableSet setWithArray:a];
[s minusSet:[self stopWords]];
return s;
}
This code relies upon the other two Tokenizer instance methods, stopWords and splitChars. These are simple lazy accessors that return frequently-referenced values:
- (NSSet*)stopWords
{
if (stopWords) return stopWords;
NSString* path = [[NSBundle mainBundle] pathForResource:@"stopwords" ofType:@"plist"];
return (stopWords = [[NSSet alloc] initWithArray:[NSArray arrayWithContentsOfFile:path]]);
}
- (NSCharacterSet*)splitChars
{
if (splitChars) return splitChars;
NSCharacterSet* cs = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyz0123456789'"];
return (splitChars = [[cs invertedSet] retain]);
}
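For comparison, here is roughly the same fold/split/subtract pipeline sketched in Python (the language of Monday's prototype). The stop word list below is a stand-in for the real one, and unicodedata only approximates the OS's diacritic folding:

```python
import re
import unicodedata

STOP_WORDS = {"", "the", "a", "and"}  # stand-in for the real MySQL-derived list

def tokenize(text):
    # Remove diacritics and lower-case (approximates stringByFoldingWithOptions:)
    folded = unicodedata.normalize("NFKD", text)
    folded = "".join(c for c in folded if not unicodedata.combining(c)).lower()
    # Split on [^a-z0-9']
    tokens = set(re.split(r"[^a-z0-9']", folded))
    # Subtracting the stop words also removes the empty strings from splitting
    return tokens - STOP_WORDS

print(tokenize("The Café's Menu"))  # → {"cafe's", "menu"}
```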
The stopWords method references a stopwords.plist file, which I built in Python (with plistlib) from the MySQL stop word list. The only non-obvious thing here is that I added the empty string to stopwords.plist, since it can be generated during string splits, and should be ignored for the purposes of indexing.
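Generating such a file takes only a few lines of Python; a sketch (the words below are placeholders, not the actual MySQL list):

```python
import plistlib

# Placeholder words; the real file uses the MySQL stop word list
stop_words = ["a", "about", "an", "are", "as", "at"]
stop_words.append("")  # the empty string, produced by string splits

with open("stopwords.plist", "wb") as f:
    plistlib.dump(stop_words, f)
```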
The performance of this code is actually pretty good; on the device it takes about 1.5s to tokenize 79 blocks of text totaling 175,553 bytes. Since I’d only be tokenizing a small amount of text (several hundred bytes?) at a time in the normal course of operations, this is more than fast enough for me.
Schema
Here’s the Core Data schema I’m using for my prototype. (Please note that this is likely to change, as I work to resolve the performance problems I’ll be discussing shortly.)
What you see is pretty much what you get:
- The text attribute is a mandatory string
- The keyword attribute is a mandatory, indexed string
- The words and hits relationships are optional, to-many, and mutual inverses
- The words and hits relationships have Nullify delete rules
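Conceptually this is a plain inverted index: each Keyword knows which Texts contain it, and each Text knows which Keywords it contains. A minimal in-memory Python sketch of the same shape (class and method names are mine, not Core Data's):

```python
class Keyword:
    def __init__(self, keyword):
        self.keyword = keyword
        self.hits = set()    # Texts containing this keyword

class Text:
    def __init__(self, text):
        self.text = text
        self.words = set()   # Keywords occurring in this text

def link(text, keyword):
    # Maintain the mutual inverse by hand, as Core Data does automatically
    text.words.add(keyword)
    keyword.hits.add(text)
```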
Indexing
I elected to put my indexing code inside the willSave method of my Text managed objects. Not only does it seem reasonable to search only the saved versions of objects, but my app's workflow won't permit the user to access FTS until all edits have been either committed or abandoned. Here's the current version of the indexing code:
- (void)willSave
{
// Superclass
[super willSave];
// Omit keyword processing for deleted objects
// (This was missing from the first version of this post, and is vitally important)
if (self.isDeleted) return;
// Process this object for keywords
NSMutableSet* tokens = [[Tokenizer sharedTokenizer] tokenize:self.text];
// Create and perform a request for existing keyword records for these tokens
NSError* err = nil;
NSFetchRequest* request = [[NSFetchRequest alloc] init];
request.entity = [NSEntityDescription entityForName:@"Keyword" inManagedObjectContext:self.managedObjectContext];
request.predicate = [NSPredicate predicateWithFormat:@"keyword IN %@",tokens];
NSMutableSet* keywords = [NSMutableSet setWithArray:[self.managedObjectContext executeFetchRequest:request error:&err]];
[request release];
// Find nonexistent keywords (by removing existing keywords from tokens) ...
[tokens minusSet:[keywords valueForKey:@"keyword"]];
// ... then create them
for (id k in tokens)
{
NSManagedObject* newManagedObject = [NSEntityDescription insertNewObjectForEntityForName:@"Keyword" inManagedObjectContext:self.managedObjectContext];
[newManagedObject setValue:k forKey:@"keyword"];
[keywords addObject:newManagedObject];
}
// Make changes
NSMutableSet* keywordsToAdd = [[keywords mutableCopy] autorelease];
NSMutableSet* keywordsToRemove = [[self.words mutableCopy] autorelease];
[keywordsToAdd minusSet:self.words];
[keywordsToRemove minusSet:keywords];
if ([keywordsToAdd count]) [self addWords:keywordsToAdd];
if ([keywordsToRemove count]) [self removeWords:keywordsToRemove];
}
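The bookkeeping at the end is just set algebra: compute the additions and removals, then touch the relationship only if something actually changed. In Python terms:

```python
def diff_keywords(current, desired):
    # Keywords to add: desired but not yet linked to this text
    to_add = desired - current
    # Keywords to remove: linked but no longer present in the text
    to_remove = current - desired
    return to_add, to_remove

add, remove = diff_keywords({"cat", "dog"}, {"dog", "fish"})
# add == {"fish"}, remove == {"cat"}
```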
This runs like a pigdog. On my device (iPhone 3G) this code takes perhaps 90s above and beyond the time normally required to tokenize and commit the 175KB of test text to the DB. (I.e. it takes 90s to build the index.) This might be just barely acceptable, if I were sure that I'd never be indexing more than about 200 bytes at a time. Frankly, I'm not sure of that, and want to improve this code's performance.
Search
Now that we've gone to all the trouble to build an index, it's time to search it. For now, I'm attaching the following predicate to the NSFetchRequest I'm using to retrieve Text objects:
[NSPredicate predicateWithFormat:@"ANY words.keyword IN %@",keywords]
(This assumes that I've got a keywords collection containing the NSString* keywords for which I want to search.)
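The predicate's semantics (return every Text linked to at least one of the given keywords) can be expressed directly against an inverted index; a self-contained Python sketch, with the index represented as a plain dict:

```python
def search(index, query_keywords):
    # index maps keyword -> set of text ids; this mirrors "ANY words.keyword IN"
    results = set()
    for kw in query_keywords:
        results |= index.get(kw, set())
    return results

index = {"cat": {1, 3}, "dog": {2, 3}}
print(search(index, ["cat", "dog"]))  # → {1, 2, 3}
```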
This doesn’t work so well – it takes about 7s to process one lousy keyword. That’s not even sort of acceptable, so we’ll have to work on it. See you tomorrow.