Monday, September 7, 2009

Bandwagon jumping

The rise of Twitter and the 140-character limit imposed on tweets has led to an explosion in the number and usage of URL shortening websites., Tinyurl, and many other such sites all work on the principle that you can encode a vast amount of permutations in a small number of characters., for instance, generates a URL that resolves to the desired URL using just a five-or-six character code. However, these characters can be numbers, lower-case or upper-case letters. This means that each character position has 62 different possibilities (26 upper, 26 lower and 10 numbers), so a 5-character code has 62^5 possible combinations, or 916,132,832, while a six-character code has a mind-boggling 56,800,235,584 possible permutations; that's a lot of website links.

The world probably doesn't need another such service, but I did start thinking about how one might go about designing the back-end for a URL shortening app. What challenges arise?

The basic premise is pretty simple, of course; you need to store the mapping between any given code and the "real" unshortened URL that it needs to point toward. Let's say then that you've got a very simplified table tbMappedURLs with two columns:

ShortCode char(6) not null,
URL nvarchar(1024) not null

I can imagine that you'll want to record creation dates, hit counts, link expiry and all sorts of extra information if you're doing this for real, but let's keep this simple for the purposes of this exercise. We'll avoid the eternal discussion over the merits of functional or technical primary keys for now, too.

The tricky bit is, when a new user requests a short URL, how do I know which codes are available and which are already in use? I'll need some kind of function that spits out a new unused code each time, but what we need to avoid is having to search all 56,800,235,584 combinations for a match (or no match) in real time when the user clicks on the "Request URL" button.

There are several different ways of attacking this problem, but my first idea is to create a sort of pool with a relatively limited number of available codes. You populate this in the background, rather than when the user makes a request.

Let's say we get 50 requests a minute at the moment. We want to allow for ten times that amount to allow for medium-term growth, so we're going to have a pool with 30,000 numbers. We have a background procedure that generates new codes for the pool once an hour or so. This runs on a replica of the main database, so that there's no resource conflict between users accessing previously-generated links and the new generator check.

INSERT INTO tbAvailablePool (ShortCode, AvailableBit) VALUES('gT7uKf', True)

Indexing these 30,000 records should result in lightning-fast queries, so when a new user requests a code you quickly get a fresh code from the pool. I'd imagine that you'd call a procedure that works a bit like this (this is pseudo-code for illustration purposes, not true or working TSQL):


' Get PK for the first available code record

SELECT @var_CodeID = MIN(ID)
FROM tbAvailablePool
WHERE AvailableBit = TRUE

' Set the record as unavailable to preempt locking problems

UPDATE tbAvailablePool
SET AvailableBit = FALSE
WHERE ID = @var_CodeID

' Get the code from the pool

SELECT @var_ShortCode = ShortCode
FROM tbAvailablePool
WHERE ID = @var_CodeID

' Update the main table with the new data

INSERT INTO tbMappedURLs (ShortCode, URL)
VALUES(@var_ShortCode, N'')

' Pass the code out to the calling application or web UI

SELECT @var_ShortCode


Once an hour, then, you might clean out all the tbAvailablePool records with a FALSE AvailableBit, and run a procedure to repopulate the table with fresh available codes.

I have no idea if this approach is actually in use by any of these sites, but it seems like a reasonable means of assuring performance when creating URL mappings.

This is obviously an oversimplification. I can see, for instance, that you might want to aggregate codes so that each new request for a given (popular) URL returns the same code, and you'll probably need to partition your data fairly carefully, but the beauty of imaginary projects is the lack of real-world problems...