MattSaforrian
7 years ago, Quickbase Staff
Regarding the recent Automations Service Incident (8/2 - 8/7)
Dear Quick Base Community,
Over the past week some of you may have noticed that one of our newest features, Quick Base Automations, has been experiencing some issues. As of Tuesday morning, Automations have been running promptly, and we believe that the service is stable.
As we continue to invest in the development of this key piece of technology, we believe that you deserve to know exactly what has been going on. Since the start of the incident, I've been working with our development team to ensure that we resolve the most immediate issues that might affect your experience. The team is also working on a longer-term plan that will prevent future interruptions of service. It's my hope that our transparency will provide you with increased confidence in the Automations feature as well as the entire Quick Base platform, and that this will serve as a starting point for a deeper conversation with you, our customers.
With that said, see below for details:
What exactly were the issues and what was the impact?
Customers who had built automations may have experienced any of the following issues:
- A. The Automations feature not loading
- B. Triggered automations taking longer than normal to run
- C. Receiving an automations error email that said "Internal Server Error. Please contact Support."
- D. Previously run automations displaying a status of "Running" but never finishing
- E. The automations run page not loading
So, what exactly happened with the automations service?
Over the past few days we have done extensive work to triage the issues and understand exactly what went wrong. While we are still working on our full root cause analysis, I did want to provide you with some of our preliminary findings. We plan to publish the full root cause analysis in a follow-up post once we have more details.
While many aspects of Quick Base need to be flexible in nature, few other features reach the complexity of automations. It can be easy to forget that at its core, automations provide a visual way to build custom programs that are run on our platform. That leads to a unique situation where there can be bugs, not only in the code for the feature itself, but also in the automations being created by Quick Base customers when they are building their apps.
According to our logs, we had a spike in the number of automations running on Thursday, August 2, which exposed a bug in automations that our manual and automated tests did not catch. This bug resulted in the automation servers running out of resources, which caused some of them to fail. If you had automations running around then, you would have experienced issue C. Additionally, because the servers went down, if you were trying to access the automations page, you would have experienced issue A.
We were quickly alerted to these issues and started working on stabilizing the service. As we investigated the issues, we tried changing how many automations would be processed simultaneously, which caused issue B.
Over the weekend, while still working on the issue, we discovered that the service was not processing queued automations fast enough and that we needed to speed this up. Unfortunately, any automations that were in the queue at that point may have been dropped, causing issue D. This was also compounded by several automations that were running in a loop (an automation that triggers itself, or two or more automations that trigger each other). Once we identified the looping automations and shut them down, it became much easier to address the issues and speed up the queue.
Additionally, there was an unrelated bug in the runs page that made it take too long to load if you had a failed run (issue E). We have fixed this bug, but it also proved to be a red herring during the investigation.
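To make the looping case concrete, here is a minimal sketch in Python of how looping automations could be spotted. It is an illustration only, not how the Quick Base service works: the `triggers` mapping (automation name to the automations it can trigger) and both function names are assumptions made up for this example.

```python
from typing import Dict, List, Set


def can_reach_itself(start: str, triggers: Dict[str, List[str]]) -> bool:
    """Return True if following trigger links from `start` leads back to `start`."""
    seen: Set[str] = set()
    frontier = list(triggers.get(start, []))
    while frontier:
        node = frontier.pop()
        if node == start:
            return True  # found a path start -> ... -> start
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(triggers.get(node, []))
    return False


def find_looping_automations(triggers: Dict[str, List[str]]) -> Set[str]:
    """Return every automation that triggers itself, directly or through others."""
    return {name for name in triggers if can_reach_itself(name, triggers)}


# Hypothetical trigger map: names are invented for illustration.
example = {
    "notify": ["notify"],    # triggers itself
    "sync_a": ["sync_b"],    # sync_a and sync_b trigger each other
    "sync_b": ["sync_a"],
    "archive": ["notify"],   # feeds a loop but is not part of one
}
looping = find_looping_automations(example)
```

In this example `looping` contains `notify`, `sync_a`, and `sync_b`, while `archive` is left out because nothing triggers it again. The check runs one search per automation, which is simple rather than fast; a larger system would more likely compute strongly connected components once.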
What have we been doing about it?
First and foremost, we have been focused on making sure automations run in a timely manner and that we don't lose any automations that should run. As previously mentioned, we feel that the service is now stable and will perform as expected.
Second, we have been trying to address the issue with the runs page loading and have deployed a fix.
Third, we have been collecting data about how this new feature can fail, building monitoring tools that allow us to jump into action, and brainstorming ways that we can make the system more resilient so that no manual action is required to keep it up and running.
What are your plans for making it more resilient?
While we currently do not have a timeline, we do know that there are three problems we want to address:
- If a loop or significant load hits the system, we will need to mitigate the issue by isolating, throttling, or disabling the affected automation(s).
- If an automation or step of an automation fails, we should have a way to retry it, both automatically and manually.
- If a server goes down for whatever reason, we should have a way to know which automations were being processed at the time and replay them.
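As a rough illustration of the second and third points, the sketch below shows one way retry and replay could look in Python. Everything in it is an assumption for the sake of example: `RunJournal`, `run_with_retry`, the run IDs, and the in-memory storage are hypothetical, and a real service would need a durable store so the journal survives a server failure.

```python
import time
from typing import Callable, Dict, List


class RunJournal:
    """Hypothetical record of runs that have started but not yet finished."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, str] = {}  # run_id -> automation name

    def started(self, run_id: str, automation: str) -> None:
        self._in_flight[run_id] = automation

    def finished(self, run_id: str) -> None:
        self._in_flight.pop(run_id, None)

    def to_replay(self) -> List[str]:
        """Run IDs that were still in flight, i.e. candidates for replay."""
        return sorted(self._in_flight)


def run_with_retry(step: Callable[[], None], attempts: int = 3,
                   base_delay: float = 0.5) -> bool:
    """Run `step`, retrying on any exception with exponential backoff.

    Returns True on success and False once all attempts have failed.
    """
    for attempt in range(attempts):
        try:
            step()
            return True
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return False
```

A caller would mark a run as started before processing it and as finished afterwards; after a crash, whatever `to_replay()` returns is the set of runs that were being processed at the time. A step that still fails after the automatic attempts could then be surfaced for a manual retry.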
What's next?
While we are confident that the system is stable, we are working on some immediate improvements that are designed to make the system more resilient. At the same time, we are analyzing the incident and debriefing to capture as many details as we can. After that, we will put a more detailed plan in place to make sure that the feature continues to run smoothly moving forward.
If you are interested in learning more about those details, we will be sharing them here as we move forward, so please subscribe to this community thread for updates.
If you have specific concerns that you want to discuss with us, please feel free to reach out to me at msaforrian@quickbase.com.
Thank you for your patience as we continue to work on this issue.
-Matt Saforrian
Product Manager, Automations