Please, don't use timestamps

Published on Saturday, August 15th 2020 at 14:22
Last updated on Thursday, March 21st 2024 at 15:58

A timestamp is the number of seconds elapsed since January 1st 1970. Easy enough. So what’s the argument against using timestamps? Let me explain. But first, for those who don’t want to read ahead, here’s the TL;DR:

all libraries use variations of timestamps with different resolution (milli, micro, nano);
it’s a nightmare to reconcile them across the whole application stack;
for data transfer objects: use unambiguous date/time formats like ISO-8601 instead;
for persistence: let a framework handle the persistence details and only use resolution-less abstractions instead.

The problem

In my experience as a software engineer, I witnessed several systems being built, maintained and thrive. All of them, of course, made use of some sort of date and/or time indication, e.g. to record creation and update times at the very least. Some of them were a bit more involved and also needed to work with time-series, display time-dependant charts or alternative visualizations, transmit and store long arrays of data points, or perform machine learning data manipulation or inference. In particular, the latest epic I worked on dealt with trading algorithms and whatnot. The stack included both python (with the usual numpy and pandas environment) and React/TypeScript. Within the span of two weeks we found several bugs in different part of the application, and they all had to do with timestamps.

First, the library that performed the necessary algorithmic operations used pd.Timestamp, which stores timestamps in nanoseconds. By the way, I invite you to read the documentation of pandas Timestamp. There is no method to retrieve the actual timestamp in a given resolution, thus you need to blindly use its value attribute - which isn’t documented - and deduce by the number of zeros at the end that it is indeed stored as nanoseconds. Sure, there are a ton of other methods to convert it to other useful objects, like numpy.datetime64, native datetime, other timestamps with different time zones, etc… And there are several mentions to other objects being returned having nanoseconds precision. However, that does not imply that value actually is a number of nanoseconds. For all we know, those methods could just multiply by powers of ten on the spot. End of rant about the terrible documentation python libraries have (yes, python flexibility fosters poor documentation, read here for more python rants).

However, the library that the frontend used to plot data required timestamps in milliseconds. Naturally, since they are both integers, there is no apparent incompatibility between types so everything went on quietly since both parts were being developed separately. And yes, if you argue that they should have been developed full-stack since they deal with the same concern, I would approve of the criticism. Vertical slices are often better at avoiding these kinds of problems. So naturally, when they came together, charts were broken because all dates were invalid.

Once the latter was fixed, another bug arose because timestamps produced by the algorithmic library were forwarded into another data structure that was only used later on. Then, even within the same layer (in the backend), another bit of code that performed some additional operations on the results we obtained assumed those timestamps to be seconds instead of nanoseconds and again it produced a bug.

This is just one of many examples that showcase how awful it is to work with timestamps. Last year, I was working on a different product. We stored large data sets in a compact format through Apache Parquet. In particular, times were stored as microsecond timestamps, but once again the frontend required millisecond timestamps to produce valid charts and display valid dates. That caused several problems as well. And again, a few months ago I was working with Apache Kafka for event streaming. Kafka uses millisecond timestamps. We sent some of the events over to a serialized Tensorflow model, which interpreted times as nanosecond timestamps. I’ll let you imagine the consequences. Oh, and if you are using Postgres, date/time columns are stored in microseconds!

In all of this, I’ll let you picture dozens and dozens of helpers functions scattered across the codebase to keep converting between different resolutions, which only makes the code more dispersive and messy.

To summarize, all problems caused by timestamps can be condensed in few main points.

Timestamps are inherently ambiguous because no one follows a single format. Yes, they all abide by the definition of instant zero, which is the same across the board, but every library arbitrarily decides the resolution they want to apply, be it seconds, millis, micros or nanos.
It’s hard to enforce cohesion across a codebase. Communication is key, but even then developers will forget or make assumption that seem reasonable in the moment yet turn out to be incorrect at a later date.
Problems are only visible end-to-end since single layers are likely to be able to maintain a certain consistency thanks to common code and shared practices, but services or layers that are distant from each other only rely on contracts that are easily broken without proper checks. The latter are especially troublesome to get right since they require additional infrastructure and governance dedicated to maintaining them: end-to-end tests are complex and tedious to write, api documentation is pretty much the same unless you have dedicated technical writers, and not everyone uses versionable schemas like protocol buffers or gRPC.

The solution

In my opinion, the solution is quite simple: don’t use timestamps. In particular, I would express two different concerns:

regarding DTOs or data transfer objects. These are the payload various application components exchange: json, protobufs, xml, csv, whatever your heart desires. When you use DTOs, you need to make sure that both parties participating in the exchange understand the meaning of data being exchanged. It is crucial that you use unambiguous formats here! Even better if you rely on widely accepted standards. Personally, I prefer ISO-8601 over the alternatives.
regarding persistence. If the database (or any other persistence medium) is a sink or a leaf in your application graph, then there is no need to worry. Let the framework you are using deal with persistence however it sees fit, but only rely on abstractions when you fetch data from it. For example, in Java, I advise you only use Instant and LocalDateTime. In JS-like systems, Date or even more appropriately Moment. By using abstractions, you need not be concerned about the nitty gritty implementation and manipulation of time data.

Remarking on a related topic, I think data should be transferred in its purest form, and then let the frontend decide how it should be presented to the user. Too often I had backend APIs make arbitrary decisions on how to return pre-serialized dates according to what would be displayed in the UI. This reduces flexibility as it binds APIs to visualization, rendering them less usable in isolation. By the way, don’t be mislead by my usage of terms like frontend: if you are using server-side rendering, you still have frontend code, yet in a different application layer (i.e. the server itself). Also remember that time-zones are hard! By transmitting absolute times - or instants - you can postpone the decision on how to properly render time-zones appropriate for the users. However, let us defer this discussion to another time, as there are only three hard things in computer science:

naming things;
cache invalidation;
off-by-one errors;
date/time handling and time-zones.