Mixed Encoding Scanner now available to help identify issues with UTF-8 data

Qrew Captain

3 years ago

Mixed Encoding Scanner now available to help identify issues with UTF-8 data

When the Quickbase platform first emerged on the market, it was targeted primarily at US customers. However, there has always been innovative app builders in many countries who have used our platform with non-English data. We are making product improvements to better serve those builders and to expand Quickbase internationally. One of these improvements is a new UTF-8 setting to ensure non-English character sets can be stored and viewed properly.

In some cases, a user may have entered non-English characters into the system while the UTF-8 switch was turned off, or prior to the existence of the UTF-8 switch. This would result in characters being saved in the system without proper character encoding. We refer to this as "mixed character encoding," which can lead to inconsistent results (e.g., displaying, sorting, filters, etc) when the UTF-8 switch is enabled later.

We’re happy to announce our new Mixed Encoding Scanner app that will be published on the Quickbase Exchange. This app can be used to help identify mixed character encoding issues. Read on to learn more about character encoding, and how you can use our new Mixed Encoding Scanner app to determine which tables – and the fields in those tables – contain UTF8-encoded characters. Once you have identified the affected tables, you can proceed to the next step of resolving the UTF-8 encoding issues.

To try and help simplify some of these technical concepts, I worked with Lynn Hedegard to write this post. Lynn is Senior Solutions Architect on the Quickbase Best Practices team.

What is character encoding and what is UTF-8?

Character encodings are a bit like languages. Computers use character encodings to ensure data is stored and displayed consistently. When two pieces of software communicate, they need to agree on what “language” they are speaking (this is referred to as character-encoding). One of those character encodings is called “UTF-8”. “UTF-8” is a Unicode standard that defines how computers store and communicate characters (like C, or ó, or й, or 好, or even an emoji) using 1s and 0s. The UTF-8 standard maps every printable character from any language to a given set of byte values – and includes a mechanism to create new characters going forward.

To see how this works in practice, imagine you’re signing up for a newsletter on a website:

You type an ó into a web form
The browser has a setting that says "Treat this data as UTF-8”
The browser submits that text to a database and includes the character encoding (i.e., UTF-8) used for this data as part of the submission.
The database says "I've been told this incoming information is encoded as UTF-8, so I know how to store that data (i.e. the character string)."

Now, imagine that someone wants to retrieve that information and display it in their browser. In this scenario, the opposite process occurs. The database says "Hey web browser, I'm sending you some information and the data is encoded as UTF-8." In this case the text will be displayed on the browser as expected. But, if the database does not specify the character encoding, the browser will not know how to display character data. It’s possible that the ó will be displayed as a question mark (?) or possibly #00FB, which is the byte value for the ó.

Basically, when one computer sends data to another computer, the sender needs to specify the character-encoding used to encode the data being sent.

UTF-8 in Quickbase

Quickbase has a setting for character encoding that can be set at either the realm level or the app level. That setting has two options:

Non-UTF-8 mode: This is designed for apps whose record data includes only English-language text, numbers, and basic symbols such as the comma, period, at-sign, etc. For a complete list of supported characters, see the ASCII list of printable characters. (ASCII’s extended characters are not supported by Quickbase’s non-UTF-8 mode.)

UTF-8 mode: This supports apps whose record data includes text in virtually any language, numbers, special symbols, etc. UTF-8 has emerged as the international standard for character encoding.

To determine if UTF-8 mode has been enabled on either the app or realm level, do the following:

Go to the App Settings by clicking on the gear icon near the top left part of the Home page on your app.
Select “App properties” located at the bottom of the first column.
Near the middle of that page, you’ll find the Advanced Settings section, which contains the “options” sub-section. One of the check boxes is the UTF-8 switch.
There are 3 possible states for that selection item.

The check box is grayed out. This indicates that the REALM-Level UTF-8 switch is ON. In this scenario you can NOT change the setting at the App Level, so the line is grayed out.
The check box is empty. This indicates that both the REALM-Level UTF-8 switch AND the APP-Level UTF-8 are OFF.
The check box has a check mark. This indicates that the REALM-Level UTF-8 switch is OFF, and the APP-Level UTF-8 is ON.

NOTE: For customers who created their Quickbase account before December 2021, your realm defaults to non-UTF-8 mode. For customers who created their account in December 2021 or later, your realm defaults to UTF-8 mode.

To enable UTF-8 mode for your entire Quickbase account, please request the change by opening a support case. (You can open a support case by clicking the ? icon at the top of any page in Quickbase, then clicking Manage Support Cases.)

Mixed character encoding

For the most part, the mixed character encoding scenario originates when non-English characters are entered into the system while the UTF-8 switch is OFF (or prior to the existence of the UTF-8 switch). A user can enter any text string into a text field – including text that contains UTF-8 characters – and not receive any errors. Users may even see these characters displayed correctly on the Quickbase UI – until the UTF-8 switch is enabled.

Setting the UTF-8 switch after that data was entered into the system will NOT re-encode the data. It simply tells the system to specify UTF-8 encoding when sending data to the browser. But since the data is currently stored in the system as ASCII, the browser will not know how to handle the UTF-8 characters that are improperly encoded in the first place. When this happens, you will see the UTF-8 characters displayed as a byte-code (e.g. #00FB), “?”, etc.

There are a few ways in which mixed encoded characters can get into the system. As we discussed previously, this can happen when text containing extended language characters are typed into a text field for a record. Another scenario occurs when a block of text is copied from another window (e.g., webpage, a PDF, Microsoft Outlook/Word/Excel, etc.) and pasted into the text field. One more scenario occurs with integrations. In this case, non-English characters enter the App by way of an integrations such as an API, XML, Pipelines, etc.

Here’s a closer look at how mixed character encoded data can be introduced to a Quickbase app:

A user updates a text field with a character string that includes “好”, and saves the record while the UTF-8 switch is not enabled.
Quickbase responds by saying "Okay. I'll store this data, but I don't really know what it is" because it’s not an ASCII character.
The next time someone retrieves that information (e.g., via a report or a form), the browser will not know how to render those character(s) because the web browser and Quickbase never agreed on the character encoding. These characters may display as a byte-code (e.g. #00FB) or “?” or depending on the browser.

The Mixed Encoding Scanner is a Quickbase application designed to help app builders address this issue.

The Mixed Encoding Scanner

The first step in resolving the mixed character encoding problem is to scan your app and determine how much UTF-8 data you have in your App, if any. To simplify this scanning process, we asked for help from one of our premier Quickbase Solution Providers - Data Collaborative. They have developed a tool that can automatically scan apps and determine how much UTF-8 data exists in your apps. The Mixed Encoding Scanner is now available on the Quickbase app exchange. To get a copy of the app, click Explore Sample Apps on the My Apps page and search for "Mixed Encoding Scanner".
After creating your copy of the app from Exchange, you can perform an analysis on any app in your realm. Simply enter the app’s DBID and an app token to get started.

Next, the Scanner will estimate how long the analysis will take to run. This depends on the speed of your internet connection and the volume of data in the app.

After the analysis is complete, the results are displayed. You can browse the results to determine which tables, fields, records, and app variables need to be fixed.

Resolving the Mixed Character Encoding Problem

If you determine that only a few occurrences of problematic characters exist in your data, you can turn on UTF-8 mode for the app, and then simply edit those records, fields, etc. If your keyboard mapping supports all the affected characters, manually re-typing the characters that are not displaying properly will correct the issue.

Some of our more technical app builders may leverage the encoding tricks available in VSCode or a similar tool to solve the problem.

On the other hand, you may discover that your app has many occurrences of problematic characters. In that case, you’ll want to use a more automated approach to solve the problem. We recommend most builders work with Data Collaborative for a more hands-on approach to cleaning up the data. After cleaning up the data, enable the UTF-8 setting to ensure data entered going forward is saved properly.