MattSaforrian
7 years ago · Quickbase Staff
Automations Performance Challenges and Improvements
Hi Quick Base Community,
I've been having a lot of conversations about the uptime and performance of Automations, so I thought it would be helpful if I pulled together my thoughts on the topic to share with everyone.
At the root of these challenges, there has been something that I like to call "Runaway Automations". These are incidents where someone creates and triggers an Automation that consumes a significant portion of our server resources.
After our first downtime incident, we learned a tremendous amount about how things can go wrong. Since then, we've been working on an improved architecture that will allow us to continue to scale to the needs of our customers. In fact, we've already implemented a number of fixes that have delivered immediate results. On top of that, we have also learned how to better identify and respond to these incidents in order to minimize the impact that they have across the platform.
If you just want to read about these improvements, you should skip to the bottom. That said, in order to truly understand why we are changing the infrastructure, it's helpful to understand what types of Automations cause problems. If you're interested in finding out what causes excessive load, then you can read on for more detail.
Generally, I have found that there are three patterns which create enough load to impact the entire platform:
1. Automations that continually run in a loop.
Occasionally, an app builder will create a series of Automations, Webhooks, and Actions where one item in the series will eventually circle back around and re-trigger the first item. Sometimes the loop eventually comes to an end, meeting some pre-defined condition, but sometimes it will continue to loop indefinitely. While there are a few rare but legitimate use cases for a limited loop, there is not a legitimate use case for an infinite loop. In either scenario, these loops consume a significant portion of our server resources as they try to consume any available power until they're finished.
When we originally launched Automations, we thought that the rate limiting we had introduced for Webhooks would limit these loops. Our early test cases showed that it did its job and would limit automations. What we didn't know is that when the system is under significant load, processing can slow down just enough that the loop stays under the rate limit. The loop is then allowed to continue running, which meant we needed a different approach.
To address this pattern, we examined a handful of solutions and landed on a rather simple fix. We've added a counter that is passed from one Automation to the next and incremented each time an Automation in the chain runs. If the counter goes over our new limit of 100, the Automation is shut off and its owner is notified.
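The counter idea can be sketched in a few lines of Python. This is only an illustration of the approach described above, not Quick Base's actual implementation: the `Automation` class, `run_chain` function, and notification list are hypothetical names, and only the limit of 100 comes from this post.

```python
from dataclasses import dataclass, field

LOOP_LIMIT = 100  # the limit stated in this post


@dataclass
class Automation:
    """Hypothetical stand-in for an Automation (not the real data model)."""
    name: str
    enabled: bool = True
    notifications: list = field(default_factory=list)


def run_chain(automations, hop_count=0):
    """Simulate a cycle in which each automation triggers the next one.

    The hop counter travels with the chain and is incremented on every run.
    When it exceeds LOOP_LIMIT, the automation that is running is disabled
    and its owner is notified. Returns the final counter value.
    """
    index = 0
    while True:
        current = automations[index % len(automations)]
        if not current.enabled:
            return hop_count
        hop_count += 1
        if hop_count > LOOP_LIMIT:
            current.enabled = False
            current.notifications.append(
                f"{current.name} disabled after more than {LOOP_LIMIT} hops"
            )
            return hop_count
        index += 1
```

Because the counter measures how many times the chain has come back around rather than how fast it is running, a loop cannot slip past it by slowing down, which is the gap the rate limit left open.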
We added this logic in September and have been monitoring the results in order to set a baseline. What we found is that most automations only loop or call other automations a small handful of times. It's only a very small percentage of Automations that have a loop which repeats 75 times or more, so we consider anything over that to be abnormal. Based on that finding, we decided to set the limit to 100 and will be continuing to monitor our logs to learn if that limit eventually needs to be raised.
In the time since we implemented this new loop-limiting function, we have already observed a handful of cases where the new logic has disabled runaway automations. Given this immediate success, we believe that we have found the right way to address this pattern, ensuring that loops no longer impede your experience with Automations. That said, we will continue to monitor our alerts for any automations that exhibit this behavior.
2. "Fan" Automations with multiplicative effects
In the interest of full transparency, this is the type of Automation that causes the most headaches. Unlike looping automations, these don't come back around to repeat a cycle but instead look something like this:
- User edits a single record, which triggers Automation A
- Automation A: modifies 20 parent records
- Automation B: each of the 20 parent records modified by Automation A triggers an edit for each of its 30 child records (600 edits in total)
Unfortunately, the solution we implemented for the "looping" logic scenario doesn't catch this problem, as the number of "iterations" counted only goes up to 2 (not even close to our limit of 100). The problem at the root here is that we perform edits one at a time instead of all at once.
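To make the arithmetic concrete, here is a small Python sketch that computes both the chain depth (what the loop counter sees) and the total number of single-record edits for the example above. The `fan_out` function is purely illustrative.

```python
def fan_out(multipliers):
    """Return (chain_depth, total_edits) for a fan-shaped chain.

    `multipliers` lists how many records each stage edits for every record
    edited by the previous stage. The chain starts with one user edit.
    """
    edits_at_stage = 1          # the single edit made by the user
    total_edits = edits_at_stage
    for multiplier in multipliers:
        edits_at_stage *= multiplier
        total_edits += edits_at_stage
    return len(multipliers), total_edits


# Automation A edits 20 parents; Automation B edits 30 children per parent.
depth, total = fan_out([20, 30])
print(depth, total)  # 2 621
```

The depth stays at 2 no matter how large the multipliers are, while the number of one-at-a-time edits grows with their product.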
When we first built automations, we looked at our ImportFromCSV API and found that it didn't have the full support that we needed for updating records. We opted to use the EditRecord API instead, not realizing that people would chain together Automations in this manner. Since that time, we have learned that we really need to optimize the way that we update data, and as a result we are implementing a change to our ImportFromCSV API so that it can also be used under the hood to manage bulk data changes. Our goal is to have the API updated in November and to then make the switch in Automations shortly thereafter.
Once implemented, we expect the number of API calls and webhook firings to drop dramatically. We are currently testing this change, and I believe that it will make a huge difference.
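As a rough sketch of why bulk updates help, the Python snippet below packs many pending edits into a single CSV payload, so that one request could stand in for hundreds of single-record calls. The field names and the `to_csv_payload` helper are assumptions made for illustration; this is not the actual ImportFromCSV request format.

```python
import csv
import io


def to_csv_payload(edits, fieldnames):
    """Serialize a list of per-record edits (dicts) into one CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(edits)
    return buffer.getvalue()


# 600 child-record edits (hypothetical fields) become one payload.
edits = [{"record_id": i, "status": "Closed"} for i in range(1, 601)]
payload = to_csv_payload(edits, ["record_id", "status"])
print(len(edits), "single-record calls vs. 1 bulk call")
```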
3. "Table to Table" Imports
Table imports are a very handy tool in the Quick Base ecosystem. Not everyone knows this, but importing data was actually one of the driving reasons to build the Automations feature. Our customers have been asking for a "scheduled table to table import" for years now, and the flexible nature of Automations seemed like the perfect platform to provide that new functionality.
That said, we could not have predicted that some customers would build Automations to run dozens of table imports for each and every record that is changed in their apps. The challenge with this scenario is that table imports can sometimes be a very resource-intensive operation for our servers to process. This can cause the rest of an app to slow to a crawl if it is being asked to do too many imports, which in turn causes Automations to become backed up and results in poor system performance across the board.
To address this issue, we are adding "cool down" logic to our API calls in Quick Base so that your apps can have a chance to catch up. Additionally, we are adding new logic to limit the number of Automations from an app that can run at any one time so that these types of Automations can't consume all of the resources.
Fixes and Improvements
So now that you know about the 3 main scenarios where Automations can cause problems, let's take a look at the fixes that we have already implemented, and a few more that are in the works.
Fixes already implemented
Automatic Looping Shutoff
We now disable an Automation if it has seen any one job or run come back around too many times (100). You shouldn't expect this to ever affect you unless you have a loop built.
Raised the Rate Threshold Limit to 20/second
Over the summer, many customers struggled with Automations performing too many edits at once. Raising the limit, coupled with our throttling logic (next item), has resolved this issue.
Throttling API Calls
We now throttle our API calls so that we stay under the 20/second threshold for triggering downstream automations. This slows down edits and adds so that Webhooks can be sent out without hitting the rate threshold issue. That said, if you have multiple automations in an app all firing at once, it's possible for your webhooks to hit the rate threshold.
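One simple way to pace calls so they never exceed a per-second threshold is to enforce a minimum gap between them. The Python sketch below shows that idea; the `Throttle` class is hypothetical and is not a description of how Quick Base throttles internally. Only the figure of 20 per second comes from this post.

```python
import time


class Throttle:
    """Space calls out so that at most `max_per_second` happen each second."""

    def __init__(self, max_per_second=20, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock      # injectable so the behavior can be tested
        self.sleep = sleep
        self.next_allowed = None

    def wait(self):
        """Block until the next call is allowed, then reserve the next slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Calling `throttle.wait()` before each edit keeps a single caller under the threshold. Several automations that each pace themselves separately can still add up to more than the threshold, which is the caveat noted above.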
Implemented Better Database Pooling
While multiple issues contributed to our outage in early August, the way we managed our database connections caused the servers to become very unresponsive. We've since switched to a new library and have seen significant improvements.
Added Retry Logic for Failed Requests
This is unrelated to the performance issues but sometimes gets lumped into this bucket. Some customers were receiving sporadic "network errors" and "internal server errors" that were caused by network traffic. We've added logic that has eliminated all of these errors.
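Retry logic of this kind is commonly written as a loop with a growing delay between attempts. The Python sketch below is a generic example under assumed values (three attempts, delays that double from half a second); the `TransientError` type and `call_with_retry` function are hypothetical, and the post does not say which retry policy was actually used.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a network blip)."""


def call_with_retry(request, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `request()`, retrying on TransientError with exponential backoff.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```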
Improved Monitoring and Alerting
As mentioned, we've learned how to respond to issues and are able to take action very quickly. We've added logging in various places and are alerted as soon as automation jobs start to back up. Additionally, we've improved how our teams collaborate so that we can respond better.
Automatic Disabling of Automations with too many Errors
This has been in place for a long time but is worth mentioning. Early on, we found that a runaway automation would sometimes encounter a lot of errors but continue to run. This ends up eating resources for something that is destined to fail. We added logic to disable these automations until they are fixed.
3 Minute Runtime Limit
Like the above item, we implemented this a few months ago as a way to prevent runaway Automations from running for too long.
Limits on # of Edits/Deletes
One other thing we found through various incidents is that editing too many records is a fairly sure sign that an Automation is incorrectly configured. We put this limit in place primarily to prevent users from trashing their apps but also to put a cap on how much work an Automation might try to perform.
In monitoring support cases, we've found that this limit has helped save people from bulk editing records that they did not intend to edit.
That said, if you really do need to edit more than 1,000 records at a time, I'd love to hear about your app and use case!
Upcoming Changes & Fixes
Modify Records in Bulk
Of all the changes we have considered, we believe that this is going to have the most significant impact on performance and stability of Automations. When we implement this change, Automations that are "chained" together will result in significantly fewer API calls and jobs running at the same time.
Cooldown logic for API Calls
We're implementing additional throttling of API calls for when Quick Base takes a long time to respond (>50ms). This will give Quick Base some time to process other API calls and get back up to speed.
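A cooldown like this can be sketched as "time the call, and pause if it was slow". In the Python example below, only the 50 ms threshold comes from this post; the length of the pause and the `call_with_cooldown` helper are assumptions for illustration.

```python
import time

SLOW_RESPONSE_SECONDS = 0.050  # the >50ms threshold mentioned above


def call_with_cooldown(api_call, cooldown_seconds=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Run `api_call()` and pause afterwards if the response was slow.

    `cooldown_seconds` is an assumed value; the post does not state how
    long the pause lasts.
    """
    started = clock()
    result = api_call()
    elapsed = clock() - started
    if elapsed > SLOW_RESPONSE_SECONDS:
        sleep(cooldown_seconds)  # give the app time to catch up
    return result
```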
"Back of line" for Busy Automations
We're also implementing additional throttling that will limit the number of concurrent jobs that a single Automation can have running at one time. When this limit is exceeded, new jobs will be put at the "end of the line" so that other customers' Automations can run. As in-flight jobs complete, more jobs from that Automation can start up.
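The "back of line" behavior can be pictured with a small queue simulation in Python. The `schedule` function, the per-Automation limit of 2, and the number of worker slots are all hypothetical; the sketch only shows how jobs from a busy Automation get moved behind everyone else.

```python
from collections import Counter, deque


def schedule(jobs, max_concurrent_per_automation, slots):
    """Decide which queued jobs start now.

    `jobs` is a list of automation names in arrival order. A job whose
    automation already has `max_concurrent_per_automation` jobs running is
    moved to the back of the queue. Returns (started, still_waiting).
    """
    queue = deque(jobs)
    running = Counter()
    started = []
    skipped_in_a_row = 0
    # Stop when slots run out or every remaining job has been skipped once.
    while queue and len(started) < slots and skipped_in_a_row < len(queue):
        job = queue.popleft()
        if running[job] < max_concurrent_per_automation:
            running[job] += 1
            started.append(job)
            skipped_in_a_row = 0
        else:
            queue.append(job)  # back of the line
            skipped_in_a_row += 1
    return started, list(queue)


started, waiting = schedule(["A", "A", "A", "B", "C"], 2, slots=4)
print(started, waiting)  # ['A', 'A', 'B', 'C'] ['A']
```

In this run the third job from Automation A waits while B and C go ahead. It would become eligible again once one of A's in-flight jobs completes.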
Longer Term Architecture
While I can't fully disclose our back-end architecture, the relevant points I can share are that we are moving towards a more modern tech stack for queuing of events and a more scalable tech stack for storing all information about when an automation runs.
Copy Records Action to Replace Table Imports
We're working on implementing a Copy Records Action that works very much like Table Import but with some added tweaks that make it unique to Automations. You'll be able to dynamically select a set of records and pass in data from the trigger, which means you'll need fewer steps in an automation or fewer automations to do the same task as before. Moreover, we will have more control over how the Copy Records action works and can tune the performance as needed.
Conclusion
If you have made it all the way through this post, then I want to thank you for sticking with me! I hope that this has given you a better idea of how we are thinking about the Automations feature, and that you can now feel more confident in the future of Automations. If you want to talk to someone more deeply about any one of these topics, please don't hesitate to reach out to me or to our Customer Success team.
-Matt