Pseudo-Numeric Identifiers

Let’s say you’re a programmer, and your application uses Library of Congress Control Numbers for books, e.g., 2001012345, or ZIP codes, like 90210. What data types would you use to represent them? Or maybe something like the Dewey Decimal System, which uses 320 to classify a book as Political Science, 320.5 for Political Theory, and 320.973 for “Political institutions and public administration (United States)”?

If you said “integer”, “floating point”, or any kind of numeric type, then clearly you weren’t paying attention during the title.

The correct answer was “string” (or some kind of array of tokens), because although these entities consist of digits, they’re not numbers: they’re identifiers, same as “root” or “Jane Smith”. You can assign them, sort them, group them by common features, but you can’t meaningfully add or multiply them together. If you’re old enough, you may remember the TV series The Prisoner or Get Smart, where characters, most of them secret agents, refer to each other by their code numbers all the time; when agents 86 and 99 team up, they don’t become agent 185 all of a sudden.
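To make that concrete, here is a minimal Python sketch. The sample data is my own assumption: the first three Dewey numbers are the ones from the opening paragraph, while 512.5 and the ZIP code 02134 are made-up illustrative values.

```python
from collections import defaultdict

# Assumed sample data (illustrative only).
dewey_numbers = ["320.973", "512.5", "320", "320.5"]
zip_code = "02134"

# Treating the ZIP code as a number silently drops its leading zero.
as_int = int(zip_code)        # 2134
round_trip = str(as_int)      # "2134": no longer a five-digit ZIP code

# Sorting works fine on strings: these all have a three-digit main class,
# so plain lexicographic order matches shelf order.
ordered = sorted(dewey_numbers)

# Grouping by a common feature: the main class is the first three characters.
by_class = defaultdict(list)
for number in ordered:
    by_class[number[:3]].append(number)

# Arithmetic is possible on the numeric form, but the result means nothing.
agent_sum = 86 + 99           # 185, which is not a new agent
```

The string form supports exactly the operations you want (assigning, sorting, grouping) and loses nothing along the way.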

If you keep in mind this distinction between numbers, which represent quantities, and strings that merely look like numbers because they happen to consist entirely of digits, you can save yourself a lot of grief: for instance, when your manager decides to store the phone number 18003569377 as “1-800-FLOWERS”, dashes and all, or when you need to store a foreign phone number and have to put a plus sign in front of the country code.
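As a rough sketch of how this plays out, you can store the phone number as a string exactly as given and derive the dialable form only when you need it. The function name `dialable` is hypothetical, the letter groups are the usual telephone keypad layout, and the foreign number below is a made-up example.

```python
# Usual telephone keypad letter groups.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def dialable(phone: str) -> str:
    """Return what you would actually dial, keeping a leading '+' if present."""
    out = []
    for ch in phone.strip():
        if ch in "0123456789":
            out.append(ch)
        elif ch.upper() in LETTER_TO_DIGIT:
            out.append(LETTER_TO_DIGIT[ch.upper()])
        elif ch == "+" and not out:
            out.append(ch)
        # Anything else (dashes, spaces, parentheses) is punctuation: skip it.
    return "".join(out)


vanity = dialable("1-800-FLOWERS")        # "18003569377"
foreign = dialable("+44 20 7946 0958")    # "+442079460958" (made-up number)
```

The stored value stays whatever your manager wanted it to be; neither the letters nor the plus sign would have survived a trip through an integer column.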


2 Responses to Pseudo-Numeric Identifiers

  1. Jimmy sm says:

    Doesn’t an integer use less storage space/memory than an equivalent string?

    • arensb says:

      Oh, absolutely. A string like “2001012345” takes up 11 bytes in a language like C (ten digits plus the terminating NUL), whereas the integer 2001012345 fits in 32 bits, or 4 bytes. So using a long integer allows you to save 7 bytes. If you have a billion of these, you can save about 7 GB.
      Likewise, comparing two 11-byte strings should take roughly three times as long as comparing two four-byte long integers. Call it ten times, to allow for various factors like having to bring more data to the processor. On my desktop machine, long integer comparison takes about 22 nanoseconds, so a string comparison would take about 220 nanoseconds. If you compare a billion such identifiers, you can save 198 seconds, or a little over three minutes, best case.
      Now, when (not if) you find that an integer can’t easily represent your data, you’re going to spend far more than three minutes refactoring your code. For instance, I just pulled up a book whose copyright page says, “Library of Congress Catalog Card No. 67-26020”. This pre-Y2K LCCN has a dash in it. Oh, and the same book’s ISBN ends in “X”.
      These days, processor time, memory, and storage space are in almost all cases far cheaper than programmer time.
      Now, if you like, what you can do is hide these implementation details in a class: define an LCCN class, and only define assignment and comparison operations on it, not addition or anything, so you won’t be tempted to treat it as numeric. Under the covers, you can implement it as an integer until you find out that integers don’t work, and then you can quickly switch to strings or something.
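
      Here is a minimal sketch of such a class in Python. The method set and the whitespace-only cleanup are my assumptions, not a real LCCN validator, and this version keeps a string under the covers from the start; callers never see the representation, so it could be swapped out later.

```python
from functools import total_ordering


@total_ordering
class LCCN:
    """Opaque identifier: comparable and hashable, deliberately not numeric."""

    __slots__ = ("_value",)

    def __init__(self, text: str) -> None:
        text = text.strip()
        if not text:
            raise ValueError("empty LCCN")
        # Hidden representation; only this class knows it is a string.
        self._value = text

    def __eq__(self, other):
        if not isinstance(other, LCCN):
            return NotImplemented
        return self._value == other._value

    def __lt__(self, other):
        if not isinstance(other, LCCN):
            return NotImplemented
        return self._value < other._value

    def __hash__(self):
        return hash(self._value)

    def __str__(self):
        return self._value

    def __repr__(self):
        return f"LCCN({self._value!r})"


modern = LCCN("2001012345")
older = LCCN("67-26020")   # the dash is no problem for this representation
```

      Because no arithmetic methods are defined, something like `modern + older` raises a TypeError instead of quietly producing a meaningless number.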
