Query Engine Issues for Subset of Customers

Incident Report for Kentik SaaS EMEA Cluster

Postmortem

ROOT CAUSE

This incident was caused by a software bug in our data ingest layer that occasionally wrote improperly formatted data records to disk, which in turn caused our query system to return no data for certain time slices. This bug was present from 10/04 22:18 UTC to 10/05 13:15 UTC. No data was lost.

MITIGATION

Kentik Engineering rolled back the software in question to prevent further ingest issues, and immediately began working to restore our full resolution/forensic dataset to its proper format for all affected records. This was completed by 10/05 23:00 UTC.

The team moved on to restoring the “trending” or fast resolution dataset, but discovered this would take significantly more time. In the interim, to make all data accessible, we forced any query covering any portion of the affected period to use the full resolution dataset. This was completed by 10/06 01:30 UTC.

This mitigation degraded the performance of our query system more than anticipated, so around 10/07 00:30 UTC we pushed another patch that uses the full resolution dataset only for the affected period, while still allowing the fast dataset to serve the surrounding times.
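
The difference between the two query-routing mitigations described above can be illustrated with a short sketch. This is not Kentik's implementation: the function names, the "full" and "fast" labels, and the use of half-open time ranges are assumptions made for illustration. Only the affected window is taken from this report.

```python
from datetime import datetime, timezone

# Affected window as stated in the ROOT CAUSE section (UTC).
AFFECTED_START = datetime(2022, 10, 4, 22, 18, tzinfo=timezone.utc)
AFFECTED_END = datetime(2022, 10, 5, 13, 15, tzinfo=timezone.utc)


def plan_whole_query(start, end):
    """First mitigation (hypothetical sketch): if the query range touches
    the affected window at all, the whole query uses the full dataset."""
    overlaps = start < AFFECTED_END and end > AFFECTED_START
    return [(start, end, "full" if overlaps else "fast")]


def plan_split_query(start, end):
    """Second mitigation (hypothetical sketch): only the slice that overlaps
    the affected window uses the full dataset; the rest stays on fast."""
    lo = max(start, AFFECTED_START)
    hi = min(end, AFFECTED_END)
    if lo >= hi:  # no overlap with the affected window
        return [(start, end, "fast")]
    segments = []
    if start < lo:
        segments.append((start, lo, "fast"))
    segments.append((lo, hi, "full"))
    if hi < end:
        segments.append((hi, end, "fast"))
    return segments
```

Under these assumptions, a two-day query that spans the affected window is served entirely from the full dataset by the first strategy, but is split into fast, full, and fast segments by the second, which is why the second patch reduced the load on the query system.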

RESOLUTION

All full data records have been restored, and query times and metrics appear nominal. A fix for the root cause bug has been developed and is in testing in our integration environment.

Posted Oct 07, 2022 - 16:19 UTC

Resolved

This incident has been resolved.

Posted Oct 07, 2022 - 13:47 UTC

Update

We are continuing to monitor the fix we implemented and will close out the incident when we have full confidence query times are restored for all customers.

Posted Oct 06, 2022 - 16:15 UTC

Monitoring

A fix has been implemented and we are continuing to monitor. All data from the affected time period should now be available to query.

Posted Oct 06, 2022 - 01:56 UTC

Identified

The root cause has been identified, and we are working to resolve it.

Posted Oct 05, 2022 - 22:56 UTC

Update

The issue is resolved for data moving forward from ~13:05 UTC, and the team is investigating restoring data in the affected period to a queryable state.

Posted Oct 05, 2022 - 16:54 UTC

Investigating

We discovered a problem storing ingested data. A subset of customers has reported an issue running queries from 23:10-12:50 UTC today, in which data was returned with storage artifacts. We are continuing to investigate and monitor.

Posted Oct 04, 2022 - 23:10 UTC

This incident affected: Flow Ingest.