How big is your Big Data data set? If it is not measured in petabytes then perhaps it doesn’t qualify. In any case, as the amount of data grows the challenges grow disproportionately and in areas that perhaps you don’t expect. In this post, I am raising a number of challenges and concerns that you should be thinking about, especially if you are interested in business analytics on large data sets.
I recommend that if you are serious about large data sets and numerical precision, you have no choice but to adopt 128-bit arithmetic in most scenarios, especially financial.
How big is Big?
A petabyte is defined as two to the power of fifty, or one (base-2) million gigabytes. This is equivalent to 1 MB of data for each of approximately one billion records, or 16 billion 64 KB Oracle database blocks. If it is raw data, then you could imagine that a traditional row of data would be around two to four kilobytes of data, which means you have between 256 and 512 billion records. In mathematical terms, you require twelve digits of precision to simply count the number of rows.
If this data was accumulated over a single year, there would be 752 million records a day, or almost nine thousand records per second. Over a ten year period, this would still be close to one thousand records per second. So that really is a lot of data.
Perhaps your data doesn’t quite make the petabyte-scale, but you may want to apply the same mathematics to see the breadth of your data.
Numerics and Analytics
So, you need something better than a simple spreadsheet to handle this data. Most importantly, you need to understand the numerical behaviour of any implementation and how the assumptions of its developers affect the quality of the result. What level of precision do you need?
- Very broad range, not so concerned about significant digits
- Use IEEE double-precision floating point
- Lots of range but at most 17 digits of decimal precision
- Best for scientific and engineering problems
- Integer values with an absolute value of up to 2 billion
- Use 32-bit integers
- Small, fast and supported by hardware
- Up to 10 digits of decimal precision (can be used to represent non-integer values with a fixed scaling)
- Bets for general purpose calculations not requiring a greater range
- Integer values with an absolute value of up to one thousand billion billion
- Use 64-bit integers
- Larger, but still fast and supported by hardware
- Up to 20 digits of decimal precision (can be used to represent non-integer values with a fixed scaling)
- Best for general purpose calculations
- Larger integer values
- Use 128-bit integers (or larger if you need)
- Big, expensive to move around, and not supported by hardware directly
- Up to 39 digits of decimal precision (can be used to represent non-integer values with a fixed scaling)
- Best for financial calculations exceeding the range of 64-bit integers
- Algebraic computation and arbitrary precision numerics
- Implementations for these exist but they are extremely slow and large
- Best for specialist computations
Typical database and programming platforms provide other data types that are intended to represent fixed-point real numbers. Internally, these are almost always based on 32-bit, 64-bit and 128-bit integers, as this is an efficient way of implementing them.
Precision meets real-world
Imagine that you want to produce the arithmetic mean of a value that is provided for each of your rows of source data. In the worst case, the maximum value of the nominator (summing all the values) is the multiplication of the maximum stored value by the total number of rows. That requires the same number of decimal digits as the sum of those that are required to represent each part. Twelve digits for the count leaves:
- IEEE double-precision floating point
- With at most five digits of precision remaining, this choice limits the individual values to tens of thousands (excluding scaling)
- Outcome: Unusable for most financial calculations
- 32-bit integer
- Even the count is out of range
- Outcome: Unusable
- 64-bit integer
- With at most eight digits of precision remain, this limits the individual values to tens of millions which is still okay
- Outcome: Usable, range and scaling must be taken into account
- 128-bit integer
- With at most twenty-seven digits of precision remain, the individual values are not subject to any significant limitation
- Outcome: Usable
Adding exchange rate calculations
Well, tens of millions sounds okay, right? but, since we’ve collected this data on such a massive scale, shouldn’t we also use pretty accurate exchange rates. So, now we’ll apply a unique exchange rate to each value, where each rate has eight digits of decimal precision. This precision needs to be included in the calculation:
- 64-bit integer
- The exchange rate has pushed the value out of range in the worst case, but most likely errors have been introduced without moving out of range
- Outcome: Unusable, reliability cannot be guaranteed
- 128-bit integer
- With at most nineteen digits of precision remaining, this is still not really limiting at all
- Outcome: The only realistic option
Unfortunately, 128-bit integers are not supported in most hardware and do not benefit from wide adoption. Instead, various mathematical models exist to support them in code, either based on arbitrary precision numerics or fixed-size algorithms. In all cases, these require at least twice the memory bandwidth versus 64-bit calculations and typically four times as many clock cycles. This gets quite slow!
Side note: Significance and floating-point numbers
The problems are exacerbated by floating-point numeric types when combined with aggregation. There are two side effects:
- The result becomes order-dependent due to the timing of when when the accumulator cannot continue to represent the full precision of the remaining values, resulting in partial or complete discards of these values
- The result loses precision for the same reason, but can be orders of magnitude out
Product support (feel free to add your products to the comments!)
Listed below are a variety of products and there support for 128-bit numerics. Individual product implementations may differ significantly. Note that the cost of each product is a very significant aspect as it varies widely:
- Microsoft Excel: no support for 128-bit numerics
- Microsoft PowerPivot: no support for 128-bit numerics
- Microsoft SQL Server 2012 Database Engine: supported (ColumnStore does not support 128-bit numerics on disk but calculations are unaffected)
- Microsoft SQL Server 2012 Analysis Services (all model types): no support for 128-bit numerics
- Oracle Database Engine: supported
- Sybase IQ: supported
- PostgreSQL: supported
- MySQL: supported
- DB2: partial support (maximum precision: 31 digits, DB2 v10)
- Greenplum: supported
Hardware acceleration for 128-bit arithmetic is definitely desired, but right now there is no road map because the need is not clearly perceived. GPUs are of no help either because they focus on small integer and single-precision floating point numerics.
Most concerning is that many developers fail to recognise the need for this precision.
So long as 128-bit arithmetic is not implemented in hardware, developers will be forced to work around these issues, either by using third-party libraries or using software products that have their own workarounds embedded in them. However, the main challenge is to ensure that the decision-making around choice of data types involves a coherent discussion about the requirements. For me, 128-bit arithmetic is for today, not the future.