Wanted: universal element separator symbol; successful applicant must fix CSV too

Mathematical coordinate notation is a mess, but it could be fixed now that Unicode is close enough to universal. While we’re at it, we can fix the CSV data format too.

The root of the problem is the comma, that unassuming but highly useful perky little squiggle of a punctuation mark, which unfortunately has been overloaded with too many globally conflicting meanings over the centuries. Were the comma only used in text, we could eliminate one of the most persistent headaches in software internationalization.

How do you write a list of values that represent coordinates? The traditional way is to put parentheses around the list and separate each value by a comma. For example, here’s a 3D coordinate:

(4, 3, 8)

But what if the values are decimal numbers, like this?

(12.7, 33.912, 81.07)

Here we encounter the major issue with the comma: in many countries it’s used as the decimal separator. These include not just Estonia and Insignificantistan, but major countries like Germany, France and Brazil. In handwriting, you might get away with using commas for both decimals and coordinate separators, but on a computer that kind of symbolic overlap won’t fly… So the usual recommendation is to use the semicolon:

(12,7; 33,912; 81,07)

This starts to look quite unpleasant, with all those visually heavy punctuation marks surrounding the actual data. Worse than that, the inconsistency of these notational systems makes it difficult to write software that works with coordinates and other numeric element lists represented as plain text.
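To make the software problem concrete, here is a minimal Python sketch. The function name parse_coordinates and its decimal_comma flag are my own invention for illustration; the point is only that the parser has to be told which convention is in force, because the text alone doesn't say.

```python
def parse_coordinates(text, decimal_comma=False):
    """Parse '(a, b, c)' or '(a; b; c)' into a list of floats.

    Hypothetical helper: the caller must state which convention applies,
    because the text itself does not.
    """
    body = text.strip().strip("()")
    if decimal_comma:
        # Semicolon separates values, comma marks decimals.
        parts = body.split(";")
        return [float(p.strip().replace(",", ".")) for p in parts]
    # Comma separates values, point marks decimals.
    return [float(p.strip()) for p in body.split(",")]


us = parse_coordinates("(12.7, 33.912, 81.07)")
de = parse_coordinates("(12,7; 33,912; 81,07)", decimal_comma=True)

# Read with the wrong convention, "(12,7)" silently becomes two values.
wrong = parse_coordinates("(12,7)")
```

Both correct calls yield the same three numbers, while the last one returns two values without raising any error, which is exactly the kind of quiet failure that makes the overlap painful.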

If the coordinates could be thought of as a matrix row, you could write them within square brackets:

[12.7 33.912 81.07]

The problem with this notation is that coordinates can just as well be matrix columns rather than rows, so the matrix row representation carries a meaning that goes beyond a plain list of elements and is thus unwanted. Let’s forget about it.

If we were to use a unique separator rather than a comma or a space, then we could do away with the surrounding brackets completely. The presence of the unique “list element separator symbol” would be enough to indicate that this is a list.

Let’s try that idea with the backtick, which is one of the more rarely used ASCII characters:

12.7`33.912`81.07

This could work! But the backtick has problems of its own. It’s probably too similar to a number of other symbols like the apostrophe and the prime symbol, which is heavily used in math. The backtick is also already heavily overloaded in various programming languages, so it would have to be encoded in some representations and contexts. This is not good because we’d ideally like to establish an element separator symbol that could be truly universal.

There’s nothing left in 7-bit ASCII that hasn’t been overloaded a thousand times in computing history, so we’ll have to dig into Unicode to find such a symbol.

One candidate I like is the “dot-above”, U+02D9, which looks like this: ˙

It’s a non-combining version of the “combining dot above” character (U+0307) used for writing letters with diacritics, such as ż.

When writing coordinates the dot-above has a visual association with the regular point as decimal separator, which I find quite pleasing:

12.7˙33.912˙81.07

But the downsides of the dot-above are many. It’s probably too small to be practical: in typical programming fonts, it looks too much like a space. It also might be confused with the interpunct a.k.a. middle dot, which is commonly used as a multiplication sign (often in the same countries that use a comma as decimal separator).
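However similar they look on screen, these are all distinct characters to software. A small Python sketch using the standard unicodedata module lists the look-alikes mentioned so far with their official names; the selection of characters is mine, picked from the candidates discussed above.

```python
import unicodedata

# Characters mentioned so far that are easy to mistake for one another:
# backtick, apostrophe, prime, dot above, middle dot, combining dot above.
look_alikes = ["`", "'", "\u2032", "\u02d9", "\u00b7", "\u0307"]

names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in look_alikes}
for code_point, name in names.items():
    print(code_point, name)

# Only the combining form attaches itself to the preceding letter.
is_combining = {ch: unicodedata.combining(ch) != 0 for ch in look_alikes}
```

So the confusion is purely a human, visual one: a parser would never mix them up, but a person typing or proofreading the data easily could.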

If we expand the dot-above into a horizontal line, we get a non-combining macron, like this: ¯

The macron is wide enough to work as a programming symbol. Visually it’s a counterpart to the widely used _ underscore character. It’s not easily confounded with any other mathematical symbols:

12.7¯33.912¯81.07
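Here is a rough Python sketch of how little code such a list would need. The names parse_list and format_list are hypothetical, and I’m assuming that no thousands separators appear in the numbers, so that a comma inside a value can only be a decimal mark.

```python
SEPARATOR = "\u00af"  # MACRON


def parse_list(text):
    """Split a macron-separated list into floats, accepting either decimal mark."""
    return [float(item.strip().replace(",", ".")) for item in text.split(SEPARATOR)]


def format_list(values, decimal_mark="."):
    """Join numbers with the macron, using the requested decimal mark."""
    return SEPARATOR.join(repr(float(v)).replace(".", decimal_mark) for v in values)


point = parse_list("12.7¯33.912¯81.07")
same_point = parse_list("12,7¯33,912¯81,07")
german_style = format_list(point, decimal_mark=",")
```

Because the separator can never be part of a number, the same parser reads both decimal conventions without being told which one it is looking at.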

As an additional bonus, an HTML entity shortcut exists for this character: &macr;. For people who produce HTML by hand, that’s definitely nicer to write than the raw numeric reference &#175;.
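If you want to double-check the entity, Python’s standard html module can decode it. This little sketch shows that the named reference &macr; and the numeric references &#175; and &#xAF; all resolve to the same character, U+00AF.

```python
import html
from html.entities import name2codepoint

MACRON = "\u00af"

named = html.unescape("&macr;")        # named character reference
numeric = html.unescape("&#175;")      # decimal numeric reference
hexadecimal = html.unescape("&#xAF;")  # hexadecimal numeric reference

code_point = name2codepoint["macr"]    # code point behind the entity name
utf8_bytes = MACRON.encode("utf-8")    # two bytes when stored as UTF-8
```

Note that the macron is outside 7-bit ASCII, so it takes two bytes in a UTF-8 file.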

Fixing CSV

The macron is a suitable candidate for the universal element separator. In addition to coordinate lists, we can also apply it to fix the CSV format… Which, it must be said, is not so much an actual data format as an ancient bugbear that hides endless hideously incompatible plain-text encodings of all sorts of data. (CSV is like Oogie Boogie from Tim Burton’s Nightmare Before Christmas: a colony of bugs wrapped in a potato sack, pretending to be a single monster.)

The name “CSV” was originally short for “comma-separated values”, although it was later generalized to “character-separated values”. This is indicative of the utter lack of consistency surrounding this format — people can’t even agree about its name.

CSV is important because so many systems accept and produce it. In particular, it’s one of the most common interchange formats for spreadsheet data. Spreadsheets work mostly with numbers, and thus the unfortunate global overloading of the comma becomes eminently painful in any kind of international setting.

The German language edition of Excel will by default produce CSV files that use commas as decimal separators and semicolons as value separators. An American Excel user will receive the file, take a look at it, and complain loudly that the crazy German has produced a file of “comma-separated values” where the commas separate anything but… In a world of growing international tensions, this is no way to promote global peace and harmony.
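The mismatch is easy to reproduce with Python’s standard csv module. The sample data below is made up for illustration; it just follows the semicolon-and-decimal-comma convention described above.

```python
import csv
import io

# Made-up sample in the German convention: ';' between values, ',' in decimals.
german_export = "Artikel;Preis;Menge\r\nSchraube;0,35;1200\r\nMutter;0,12;800\r\n"

# Read the way comma-expecting software would: the default delimiter is ','.
naive = list(csv.reader(io.StringIO(german_export, newline="")))

# Read with the German conventions spelled out by hand.
rows = list(csv.reader(io.StringIO(german_export, newline=""), delimiter=";"))
price = float(rows[1][1].replace(",", "."))
```

The naive read doesn’t fail. It just cuts every row at the decimal commas, so the price 0,35 ends up spread over two unrelated cells.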

By using the macron as a separator for CSV data, we can turn CSV into an actual file format — one with a simple and commonly agreed definition and syntax, just like JSON. Wouldn’t that be amazing?

Such an MSV (macron-separated values) file format would fix major issues with CSV by standardizing the following:

  • Unicode support. By using a Unicode character as the value separator, we ensure that non-Unicode-aware systems can’t mess with MSV.
  • Line breaks. The ASCII newline mess of LF vs. CR+LF characters is never going to die, unfortunately. There is a CSV semi-standard (RFC 4180) that specifies CRLF, so it probably would make sense for MSV to adopt the same.
  • Supported value types. Unicode strings and IEEE floating point numbers are a given. Anything else can be application-specific and treated as a string by other parsers.
  • Varying-length lines. Sure, why not? It should be allowed for the number of values on a line to change from line to line. This makes it easy to concatenate MSV files from different sources.
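To get a feel for how small an MSV implementation could be, here is a rough Python sketch of the four points above. Everything in it is an assumption on my part rather than a spec: the function names loads and dumps, the choice of the point as the only decimal mark, and the absence of any quoting or escaping for values that themselves contain a macron or a line break.

```python
SEP = "\u00af"  # MACRON, the proposed value separator
EOL = "\r\n"    # CRLF, following the CSV semi-standard


def parse_value(text):
    """Return a float when Python can read the text as one, else the string."""
    try:
        return float(text)
    except ValueError:
        return text


def loads(data):
    """Parse MSV text into a list of rows; rows may differ in length."""
    lines = data.split(EOL)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final CRLF
    return [[parse_value(v) for v in line.split(SEP)] for line in lines]


def dumps(rows):
    """Serialize rows of strings and numbers to MSV text."""
    return "".join(SEP.join(str(v) for v in row) + EOL for row in rows)


doc = "name¯x¯y\r\norigin¯0¯0\r\npeak¯12.7¯33.912¯81.07\r\n"
rows = loads(doc)
round_trip = loads(dumps(rows))
```

One rough edge of this sketch: Python’s float() also accepts text like “nan” and “inf”, so a real definition would have to say whether those count as numbers or strings.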

Besides, “macron” is just a cool word. It’s fun to pronounce and has a vague technological aura, so it can make you look smart in the right company. That reminds me, I must rush off to reserve the domain “macron.io” before the inevitable gold rush of new macron-related startups begins…

This entry was posted in Names, Programming.
