ROOT CAUSE
This incident was caused by a software bug in our data ingest layer that occasionally wrote improperly formatted data records to disk, which in turn caused our query system to return no data for certain time slices. This bug was present from 10/04 22:18 UTC to 10/05 13:15 UTC. No data was lost.
MITIGATION
Kentik Engineering rolled back the software in question to prevent further ingest issues, and immediately began working to restore our full resolution/forensic dataset to its proper format for all affected records. This was completed by 10/05 23:00 UTC.
The team moved on to restoring the “trending” or fast resolution dataset, but discovered this would take significantly more time. In the interim, to make all data accessible, we forced any query covering any portion of the affected period to use the full resolution dataset. This was completed by 10/06 01:30 UTC.
This fast data mitigation strategy adversely affected the performance of our query system more than anticipated, so we pushed another patch around 10/07 00:30 UTC to only fill in the affected period with full data, while still allowing the usage of the fast dataset for surrounding times.
RESOLUTION
All full data records have been restored, query times and metrics appear nominal, and the original root cause bug has been resolved and is in testing in our integration environment.