Automations Performance Challenges and Improvements

Hi Quick Base Community,

I've been having a lot of conversations about the uptime and performance of Automations, so I thought it would be helpful if I pulled together my thoughts on the topic to share with everyone.

At the root of these challenges, there has been something that I like to call "Runaway Automations". These are incidents where someone creates and triggers an Automation that consumes a significant portion of our server resources. After our first downtime incident we learned a tremendous amount about how things can go wrong. Since then, we've been working on an improved architecture that will allow us to continue to scale to the needs of our customers. In fact, we've already implemented a number of fixes that have delivered immediate results. On top of that, we have also learned how to better identify and respond to these incidents in order to minimize the impact that they have across the platform.

If you just want to read about these improvements, you should skip to the bottom. That said, in order to truly understand why we are changing the infrastructure, it's helpful to understand what types of Automations cause problems. If you're interested in finding out what causes excessive load, then you can read on for more detail.

Generally, I have found that there are three patterns which create enough load to impact the entire platform:

1. Automations that continually run in a loop

Occasionally, an app builder will create a series of Automations, Webhooks, and Actions where one item in the series will eventually circle back around and re-trigger the first item. Sometimes the loop eventually comes to an end, meeting some pre-defined condition, but sometimes it will continue to loop indefinitely. While there are a few rare but legitimate use cases for a limited loop, there is not a legitimate use case for an infinite loop.
In either scenario, these loops consume a significant portion of our server resources, because they use whatever capacity is available until they finish.

When we originally launched Automations, we thought that the rate limiting we had introduced for Webhooks would limit these loops. Our early test cases showed that it did its job and would limit automations. What we didn't know is that when the system is under significant load, it can slow down just enough for a loop to stay under the rate limit. The loop is then allowed to keep running, and that means we need a different approach.

To address this pattern, we examined a handful of solutions and landed on a rather simple fix. We've added a counter that is passed between Automations and incremented each time one of them runs. If the counter goes over our new limit of 100, the Automation is shut off and its owner is notified.

We added this logic in September and have been monitoring the results in order to set a baseline. What we found is that most automations only loop or call other automations a small handful of times. Only a very small percentage of Automations have a loop that repeats 75 times or more, so we consider anything over that to be abnormal. Based on that finding, we decided to set the limit to 100, and we will continue to monitor our logs to learn whether that limit eventually needs to be raised.

In the time since we implemented this new loop-limiting function, we have already observed a handful of cases where the new logic has disabled runaway automations. Given this immediate success, we believe that we have found the right way to address this pattern, ensuring that loops no longer impede your experience with Automations. That said, we will continue to monitor our alerts for any automations that exhibit this behavior.

2. "Fan" Automations with multiplicative effects

In the interest of full transparency: this is the type of Automation that causes the most headaches.
Unlike looping automations, these don't come back around to repeat a cycle but instead look something like this:

User edits a single record, which triggers Automation A
Automation A - Modifies 20 parent records
Automation B - The 20 parent records modified by Automation A trigger an edit for each of their 30 child records

Due to the multiplicative nature of this scenario, we wind up with 21 automations running (Automation A once, plus Automation B once for each of the 20 parent records) and 600 (20 x 30) child-record edits in a matter of a few seconds. This gets even more drastic when a user performs a grid edit of dozens to hundreds of records OR if there are additional automations running in parallel or in sequence. Unfortunately, the solution we implemented for the "looping" scenario doesn't catch this problem, as the number of "iterations" counted only goes up to 2 (not even close to our limit of 100).

The problem at the root here is that we perform edits one at a time instead of all at once. When we first built automations, we looked at our ImportFromCSV API and found that it didn't have the full support that we needed for updating records. We opted to use the EditRecord API instead, not realizing that people would chain together Automations in this manner. Since that time, we have learned that we really need to optimize the way that we update data, and as a result we are implementing a change to our ImportFromCSV API so that it can also be used under the hood to manage bulk data changes.

Our goal is to have the API updated in November and to then make the switch in Automations shortly thereafter. Once implemented, we expect the number of API calls and webhooks firing to drop dramatically. We are currently testing this change, and I believe that it will make a huge difference.

3. "Table to Table" Imports

Table imports are a very handy tool in the Quick Base ecosystem. Not everyone knows this, but importing data was actually one of the driving reasons to build the Automations feature.
Our customers have been asking for a "scheduled table to table import" for years now, and the flexible nature of Automations seemed like the perfect platform to provide that new functionality. That said, we could not have predicted that some customers would build Automations to run dozens of table imports for each and every record that is changed in their apps.

The challenge with this scenario is that table imports can sometimes be a very resource-intensive operation for our servers to process. This can cause the rest of an app to slow to a crawl if it is being asked to do too many imports, which in turn causes Automations to become backed up and results in poor system performance across the board.

To address this issue, we are adding "cool down" logic to our API calls in Quick Base so that your apps can have a chance to catch up. Additionally, we are adding new logic to limit the number of Automations from an app that can run at any one time so that these types of Automations can't consume all of the resources.

Fixes and Improvements

So now that you know about the 3 main scenarios where Automations can cause problems, let's take a look at the fixes that we have already implemented, and a few more that are in the works.

Fixes already implemented

Automatic Looping Shutoff
We now disable an Automation if it has seen any one job or run come back around too many times (100). You shouldn't expect this to ever affect you unless you have a loop built.

Raised the Rate Threshold Limit to 20/second
Over the summer, many customers struggled with Automations performing too many edits at once. This change, coupled with our throttling logic (next item), has resolved this issue.

Throttling API Calls
We now throttle our API calls so that we stay under the 20/second threshold for triggering downstream automations. This slows down edits and adds so that Webhooks can be sent out without hitting the rate threshold issue.
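For readers who want a concrete picture of the throttling idea, here is a minimal sketch in Python. It is only an illustration of pacing calls under a per-second threshold, not the code that runs inside Quick Base: the class name `PacedThrottle`, its `wait` method, and the injectable `clock` and `sleep` parameters are all hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional


class PacedThrottle:
    """Spaces calls out so that at most `max_per_second` start per second.

    Hypothetical illustration only; not Quick Base's implementation.
    """

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        # Minimum spacing between the start of two consecutive calls.
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        # Earliest time at which the next call is allowed to start.
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call may start; return the seconds waited."""
        now = self._clock()
        if self._next_allowed is None or now >= self._next_allowed:
            delay = 0.0
            start = now
        else:
            delay = self._next_allowed - now
            self._sleep(delay)
            start = self._next_allowed
        self._next_allowed = start + self.min_interval
        return delay
```

In this sketch, a worker would call `throttle.wait()` before each edit or add, with `throttle = PacedThrottle(20)` matching the 20/second threshold mentioned above. Each call is then held back just long enough to keep that one worker at or below the threshold.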
That said, if you have multiple automations in an app all firing at once, it's possible for your webhooks to hit the rate threshold.

Implemented Better Database Pooling
While multiple issues caused our outage in early August, the way we managed our database connections resulted in the servers becoming very unresponsive. We've since switched to a new library and have seen significant improvements.

Added Retry Logic for Failed Requests
This is unrelated to the performance issues but sometimes gets lumped into this bucket. Some customers were receiving sporadic "network errors" and "internal server errors" that were caused by network traffic. We've added logic that has eliminated all of these errors.

Improved Monitoring and Alerting
As mentioned, we've learned how to respond to issues and are able to take action very quickly. We've added logging in various places and are alerted as soon as automation jobs start to back up. Additionally, we've improved how our teams collaborate so that we can respond better.

Automatic Disabling of Automations with too many Errors
This has been in for a long time but is worth mentioning. Early on, we found that sometimes a runaway automation encounters a lot of errors but continues to run. This ends up eating resources for something that is destined to fail. We added logic to disable these automations until they are fixed.

3 Minute Runtime Limit
Like the above item, we implemented this a few months ago as a way to prevent runaway Automations from running for too long.

Limits on # of Edits/Deletes
One other thing we found through various incidents is that editing too many records is a fairly sure sign that an Automation is incorrectly configured. We put this limit in place primarily to prevent users from trashing their apps but also to put a cap on how much work an Automation might try to perform.
In monitoring support cases, we've found that this limit has helped save people from bulk editing records that they did not intend to edit. That said, if you really do need to edit more than 1,000 records at a time, I'd love to hear about your app and use case!

Upcoming Changes & Fixes

Modify Records in Bulk
Of all the changes we have considered, we believe that this is going to have the most significant impact on the performance and stability of Automations. When we implement this change, Automations that are "chained" together will result in significantly fewer API calls and jobs running at the same time.

Cooldown logic for API Calls
We're implementing additional throttling of API calls for when Quick Base takes a long time to respond (>50ms). This will give Quick Base some time to process other API calls and get back up to speed.

"Back of line" for Busy Automations
We're also implementing additional throttling that will limit the number of concurrent jobs that a single Automation can have running at one time. When this limit is exceeded, new jobs will be put at the "end of the line" so that other customers' Automations can run. As in-flight jobs complete, more jobs from that Automation can start up.

Longer Term Architecture
While I can't fully disclose our back-end architecture, the few relevant points I can share are that we are moving towards a more modern tech stack for queuing events and a more scalable tech stack for storing all information about when an automation runs.

Copy Records Action to Replace Table Imports
We're working on implementing a Copy Records Action that works very much like Table Import but with some added tweaks that make it unique to Automations. You'll be able to dynamically select a set of records and pass in data from the trigger, which means you'll need fewer steps in an automation or fewer automations to do the same task as before.
Moreover, we will have more control over how the Copy Records action works and can tune the performance as needed.

Conclusion

If you have made it all the way through this post, then I want to thank you for sticking with me! I hope that this has given you a better idea of how we are thinking about the Automations feature, and that you can now feel more confident in the future of Automations. If you want to talk to someone more deeply about any one of these topics, please don't hesitate to reach out to me or to our Customer Success team.

-Matt

Re: Automation to add records to one table from a table whose records are created via API_AddRecord is being limited

Hi Tyler,

Automations don't care how a record is added. Is it possible that your add-on is creating records too quickly and so hits our webhook rate limit? If possible, I would recommend that you create multiple records using API_ImportFromCSV.

Best,
-Matt

Re: Webhook Responses

Hi Scott,

Webhooks were designed to be a "one-way" call into outside systems. If you want the ability to make API calls and process the response, you'll need to work with an integration platform (Zapier or Workato) or a QSP to make that happen. This is something we might implement in Automations but isn't in our short-term plan. If it's something that would be valuable to you, I recommend that you vote on it over at UserVoice.

Thanks!

Re: How do I get an Automation Action to apply a timestamp to a Date/Time field for the moment the Action occurs?

Hi Clayton,

We don't yet have formulas available in automations, but one way to get a timestamp is to use the "Date Modified" value from the record that triggered the automation to run.

Hope that helps!
-Matt Saforrian
Product Manager, Automations

Re: Quick Base automation not working

Hi Venkateswarlu,

It looks like you are trying to modify or delete more than 1,000 records in a single action. This is a limit that we have in place to prevent automations from going wild.
If that is your intention, I'm curious to hear about your use case and why you need to change so many records at once. You can read more about our limits in this help doc.

Best,
-Matt Saforrian
Product Manager, Automations

Regarding the recent Automations Service Incident (8/2 - 8/7)

Dear Quick Base Community,

Over the past week some of you may have noticed that one of our newest features, Quick Base Automations, has been experiencing some issues. As of Tuesday morning, Automations have been running promptly, and we believe that the service is stable. As we continue to invest in the development of this key piece of technology, we believe that you deserve to know exactly what has been going on.

Since the start of the incident, I've been working with our development team to ensure that we resolve the most immediate issues that might affect your experience. The team is also working on a longer-term plan that will prevent future interruptions of service.

It's my hope that our transparency will provide you with increased confidence in the Automations feature as well as the entire Quick Base platform, and that this will serve as a starting point for a deeper conversation with you, our customers. With that said, see below for details:

What exactly were the issues and what was the impact?

Customers who had built automations may have experienced any of the following issues:

A. The Automations feature not loading
B. Triggered automations taking longer than normal to run
C. Receiving an automations error email that said "Internal Server Error. Please contact Support."
D. Previously run automations displaying a status of "Running" but never finishing
E. The automations run page not loading

So, what exactly happened with the automations service?

Over the past few days we have done extensive work to triage the issues and understand exactly what went wrong.
While we are still working on our full root cause analysis, I did want to provide you with some of our preliminary findings. We plan to publish our full root cause analysis in a follow-up post with more details, once we have them.

While many aspects of Quick Base need to be flexible in nature, few other features reach the complexity of automations. It can be easy to forget that at its core, automations provide a visual way to build custom programs that are run on our platform. That leads to a unique situation where there can be bugs, not only in the code for the feature itself, but also in the automations being created by Quick Base customers when they are building their apps.

According to our logs, we had a spike in the number of automations running on Thursday August 2, which exposed a bug in automations that our manual and automated tests did not catch. This bug resulted in the automation servers running out of resources, which caused some of them to fail. If you had automations running around then, you would have experienced issue C. Additionally, because the servers went down, if you were trying to access the automations page, you would have experienced issue A.

We were quickly alerted to these issues and started working on stabilizing the service. As we investigated the issues we tried changing how many automations would be processed simultaneously, which caused issue B.

Over the weekend, while still working on the issue, we discovered that the service was not processing automations fast enough and we needed to speed this up. Unfortunately, any automations that were in that queue may have been dropped, causing issue D. This was also compounded by several automations that were running in a loop (an automation that triggers itself, or two or more automations that trigger each other). Once we identified the looping automations and shut them down, it became much easier to address the issues and speed up the queue.
Additionally, there has been an unrelated bug in the runs page where it takes too long to load if you have a failed run (issue E). We have fixed this bug, but it was also a red herring during the investigation process.

What have we been doing about it?

First and foremost, we have been focused on making sure automations run in a timely manner and that we don't lose any automations that should run. As previously mentioned, we feel that the service is now stable and will perform as expected.

Second, we have been trying to address the issue with the runs page loading and have deployed a fix.

Third, we have been collecting data about how this new feature can fail, building monitoring tools that allow us to jump into action, and brainstorming ways that we can make the system more resilient so that no manual action is required to keep it up and running.

What are your plans for making it more resilient?

While we currently do not have a timeline, we do know that there are three problems we want to address:

If a loop or significant load is put on the system, we will need to mitigate the issue by isolating, throttling, or disabling these automation(s).
If an automation or step of an automation fails, we should have a way to retry it, both automatically and manually.
If a server goes down for whatever reason, we should have a way to know which automations were being processed at the time and replay them.

What's next?

While we are confident that the system is stable, we are working on some immediate improvements that are designed to make the system more resilient. At the same time, we are analyzing the incident and debriefing to capture as many details as we can. After that, we will put a more detailed plan in place to make sure that the feature continues to run smoothly moving forward.

If you are interested in learning more about those details, we will be sharing them here as we move forward, so please subscribe to this community thread for updates.
If you have specific concerns that you want to discuss with us, please feel free to reach out to me at msaforrian@quickbase.com.

Thank you for your patience as we continue to work on this issue.

-Matt Saforrian
Product Manager, Automations

Re: Do Automations count towards API limits?

Hi Ryan,

They shouldn't count against your API limits, but we will double check. In the meantime, could you please open a support case for this issue so that we can track it through the proper channels?

Thanks,
-Matt Saforrian
Product Manager, Automations

Re: Automation - Add Record with formula

Hi QuickBasePros_IDS,

The issue you encountered with automations not populating data was a bug and has been resolved. You shouldn't have to rebuild automations anymore. If you encounter this issue in the future, just open your automation and then save it. That should resolve the error.

Best,
-Matt

Re: Limit the # of times an Automation runs for Multiple updated records in the same table.

Great to hear that you figured it out. We're always looking to improve automations. If you have any specific ideas or suggestions, PLEASE post them over at UserVoice. I check it at least once a week and use it to judge what the most important topics are.

Best,
-Matt Saforrian
Product Manager, Automations

Re: Limit the # of times an Automation runs for Multiple updated records in the same table.

Hi there,

Without reviewing your app, there are two approaches I can think of to resolve this.

1. Build a scheduled automation that runs your table import once a day.
2. Use an add/modify action that mirrors the record into the destination table instead of a table import.

Let me know if that helps.

Thanks!
-Matt