360Data: September 2009 - Organize, protect, connect and present your information

The rise of Twitter and the 140-character limit imposed on tweets has led to an explosion in the number and usage of URL shortening websites. Bit.ly, Tinyurl, short.to...these and many other such sites all work on the principle that you can encode a vast amount of permutations in a small number of characters.

Bit.ly, for instance, generates a URL that resolves to the desired URL using just a five-or-six character code. However, these characters can be numbers, lower-case or upper-case letters. This means that each character position has 62 different possibilities (26 upper, 26 lower and 10 numbers), so a 5-character code has 62^5 possible combinations, or 916,132,832, while a six-character code has a mind-boggling 56,800,235,584 possible permutations; that's a lot of website links.

The world probably doesn't need another such service, but I did start thinking about how one might go about designing the back-end for a URL shortening app. What challenges arise?

The basic premise is pretty simple, of course; you need to store the mapping between any given code and the "real" unshortened URL that it needs to point toward. Let's say then that you've got a very simplified table tbMappedURLs with two columns:

ShortCode char(6) not null,
URL nvarchar(1024) not null

I can imagine that you'll want to record creation dates, hit counts, link expiry and all sorts of extra information if you're doing this for real, but let's keep this simple for the purposes of this exercise. We'll avoid the eternal discussion over the merits of functional or technical primary keys for now, too.

The tricky bit is, when a new user requests a short URL, how do I know which codes are available and which are already in use? I'll need some kind of function that spits out a new unused code each time, but what we need to avoid is having to search all 56,800,235,584 combinations for a match (or no match) in real time when the user clicks on the "Request URL" button.

There are several different ways of attacking this problem, but my first idea is to create a sort of pool with a relatively limited number of available codes. You populate this in the background, rather than when the user makes a request.

Let's say we get 50 requests a minute at the moment. We want to allow for ten times that amount to allow for medium-term growth, so we're going to have a pool with 30,000 numbers. We have a background procedure that generates new codes for the pool once an hour or so. This runs on a replica of the main database, so that there's no resource conflict between users accessing previously-generated links and the new generator check.

INSERT INTO tbAvailablePool (ShortCode, AvailableBit) VALUES('gT7uKf', True)

Indexing these 30,000 records should result in lightning-fast queries, so when a new user requests a code you quickly get a fresh code from the pool. I'd imagine that you'd call a procedure that works a bit like this (this is pseudo-code for illustration purposes, not true or working TSQL):

BEGIN TRANSACTION GetCode;

' Get PK for the first available code record

SELECT @var_CodeID = MIN(ID)
FROM tbAvailablePool
WHERE AvailableBit = TRUE

' Set the record as unavailable to preempt locking problems

UPDATE tbAvailablePool
SET AvailableBit = FALSE
WHERE ID = @var_CodeID

' Get the code from the pool

SELECT @var_ShortCode = ShortCode
FROM tbAvailablePool
WHERE ID = @var_CodeID

' Update the main table with the new data

INSERT INTO tbMappedURLs (ShortCode, URL)
VALUES(@var_ShortCode, N'http://www.thisistheverylongURLsubmittedbytheuser.com')

' Pass the code out to the calling application or web UI

SELECT @var_ShortCode

COMMIT TRANSACTION / ROLLBACK TRANSACTION

Once an hour, then, you might clean out all the tbAvailablePool records with a FALSE AvailableBit, and run a procedure to repopulate the table with fresh available codes.

I have no idea if this approach is actually in use by any of these sites, but it seems like a reasonable means of assuring performance when creating URL mappings.

This is obviously an oversimplification. I can see, for instance, that you might want to aggregate codes so that each new request for a given (popular) URL returns the same code, and you'll probably need to partition your data fairly carefully, but the beauty of imaginary projects is the lack of real-world problems...

Labels: General

You could be forgiven for thinking that a book entitled "The Visual Display of Quantitative Information" promises little in the way of entertainment if you're not familiar with Professor Edward Tufte's groundbreaking work in information design. Nothing could be further from the truth, though: this and all of Professor Tufte's books are classic studies in the history and principles of data presentation, the sort of books that reward the patience and attention of the layman just as much as the professional.

One of the most interesting concepts presented by Professor Tufte is that of sparklines; small, data-rich graphs displayed inline and in the context of documents and information graphics, providing a quick but detailed overview of trends and comparisons in data. A more complete explanation and some examples are presented on his website.

The concept has been around for some years by now, and you see sparklines popping up in all sorts of information displays recently, but support in software applications has been patchy, at best, meaning that it's relatively troublesome to create sparklines for your own data.

The best solution up to now, for my money, has been SparkMaker from the German firm Bissantz. This is an Excel add-in and it works beautifully, but is relatively expensive and suffers from the important limitation that, as a third-party add-in, it's not standard, so sharing sparklined files or working on multiple computers causes problems.

Microsoft have seen the light, however: Excel 2010 is going to offer native support for sparklines, which has to be the best improvement in some years to the unfairly neglected charting functionality. As far as I'm concerned, if the Microsoft implementation works well then it's reason enough on its own to upgrade. I hope that it's supported by the SQL 2008 R2 Reporting Services charting controls, too - imagine the level of detail you can add to performance reports if that's the case...

Labels: Data Presentation, Excel, Gemini, SQL 2008 R2

Monday, September 7, 2009

Bandwagon jumping

Tuesday, September 1, 2009

Sparklines in Excel

About the Author

Links

Previous Posts

Archives