In the context of our collaboration with Belgium’s Federal Public Service Policy and Support, we have been exploring how to leverage Machine Learning to identify accessibility issues on Belgium’s public websites. One such issue is the presence of low-contrast text, which we will focus on in this piece. Low-contrast text is an important accessibility issue because it can significantly hamper readability for people with impaired eyesight or users of text-to-speech assistive technology.
Specifically, the problem statement is the following: given an arbitrary web page, how can we automatically list all the locations where low-contrast text (as defined in the Web Content Accessibility Guidelines) is present? This should work for several common situations:
Normal text on top of a uniform background
Normal text on top of a background that is an image
Text that is part of an image
An example of the latter is the following banner from Belgium’s federal government’s website:
https://www.federale-regering.be/nl
In this example, the image contains text that we want to detect to assess if its contrast is sufficient.
The following sections will walk you through the different steps involved in solving this problem in Python, and we will discuss the challenges encountered along the way. The main steps are the following:
Capturing the web page
Finding all the text on the page
Computing the contrast
Capturing the web page
The first step consists of deciding what representation of a web page to work on. Two possibilities exist:
Working with the source code of the page
Capturing a visual representation of the entire web page
Option 1 offers a lot of flexibility in navigating the hierarchy of the elements present on the page. Its main drawback is that it isn’t of much help in identifying text present within images.
Given the constraint of handling images and detecting the text inside them, we decided to go with the second option and use a single image, a screenshot of the web page, as the input for our text detection step.
A screenshot of a web page can easily be obtained by using the Selenium library (and its Python integration). Selenium lets you simulate a browser’s behavior on a specific web page and render it as your browser would, including HTML, stylesheets and dynamic content. The code below shows how Selenium can be used to generate a screenshot of a web page.
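A minimal sketch of what this could look like, assuming headless Chrome (the capture_full_page helper and the scale parameter are illustrative choices, not part of Selenium itself):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def capture_full_page(url, output_path="screenshot.png", scale=1):
    """Render a web page in headless Chrome and save a full-page screenshot."""
    options = Options()
    options.add_argument("--headless")
    # Hypothetical scale factor: Chrome's device scale factor controls the
    # rendering resolution of the capture (more on this later).
    options.add_argument(f"--force-device-scale-factor={scale}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Resize the window to the full height of the document so the
        # screenshot covers the entire page, not just the visible viewport.
        height = driver.execute_script("return document.body.scrollHeight")
        driver.set_window_size(1920, height)
        driver.save_screenshot(output_path)
    finally:
        driver.quit()

capture_full_page("https://www.federale-regering.be/nl", "screenshot.png")
```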
At this point, we are now in possession of one big image representing the entire web page. We now need to detect all the text on that image.
Finding all the text on the page
In the course of the development, we explored two main approaches for this step:
Text detection: detect where text is located on the page, without attempting to read it. In this case, the output is only a set of coordinates for each detected text element, e.g. {x: 12, y: 25, width: 30, height: 12}
Text recognition: detect where text is located on the page and identify what is written. In addition to the text coordinates, the output now includes a string representation of the text, e.g. {x: 12, y: 25, width: 30, height: 12, text: "Lorem ipsum"}
In this section, we will discuss both approaches, as they both presented interesting challenges.
Ultimately, your specific use case should determine which one you follow, based on whether you need string representations of the text or not. Note, however, that in our experience, the model used for text detection ended up detecting more text than the text recognition approach, which motivated us to stick to the former.
Text detection
We achieved text detection by using a pre-trained text detection model called EAST. The model artefact can be loaded in OpenCV and used to generate a list of bounding boxes indicating the detected words’ coordinates. Our implementation comes from the following guide, which provides a very good example of using it in practice.
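As an illustration, a condensed sketch along the lines of that guide could look as follows. It loads the publicly available frozen_east_text_detection.pb weights with OpenCV’s dnn module; for brevity it ignores the rotation angle predicted by EAST and skips non-maximum suppression, both of which the full guide handles:

```python
import cv2

CONF_THRESHOLD = 0.5  # minimum EAST confidence to keep a detection

def detect_text_boxes(image, model_path="frozen_east_text_detection.pb"):
    """Return (x, y, w, h) boxes for text detected by the EAST model."""
    orig_h, orig_w = image.shape[:2]
    # EAST expects input dimensions that are multiples of 32.
    new_w, new_h = 1280, 1280
    ratio_w, ratio_h = orig_w / float(new_w), orig_h / float(new_h)

    net = cv2.dnn.readNet(model_path)
    blob = cv2.dnn.blobFromImage(cv2.resize(image, (new_w, new_h)), 1.0,
                                 (new_w, new_h), (123.68, 116.78, 103.94),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",  # confidence map
                                    "feature_fusion/concat_3"])       # box geometry

    boxes = []
    num_rows, num_cols = scores.shape[2:4]
    for y in range(num_rows):
        for x in range(num_cols):
            if scores[0, 0, y, x] < CONF_THRESHOLD:
                continue
            # Each output cell maps to a 4x4 pixel region of the resized image.
            offset_x, offset_y = x * 4.0, y * 4.0
            # Distances from the cell to the top/right/bottom/left box edges.
            top, right, bottom, left = geometry[0, 0:4, y, x]
            w, h = left + right, top + bottom
            end_x, end_y = offset_x + right, offset_y + bottom
            # Scale the box back to the original image size.
            boxes.append((int((end_x - w) * ratio_w), int((end_y - h) * ratio_h),
                          int(w * ratio_w), int(h * ratio_h)))
    return boxes
```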
As you will observe, the detection performance is very high. In many cases, all the text of a page is successfully detected (no false negatives). On the other hand, some non-text elements are sometimes detected (false positives). The trade-off between the two can be controlled by adjusting the model’s confidence level depending on your requirements.
Text recognition
Text recognition on an image is a prevalent task called Optical Character Recognition (OCR). We explored several Python implementations of this technology and decided to go for Tesseract and its Python wrapper Pytesseract. One line of code suffices to run it, and it returns the coordinates of the bounding box of each text element, as well as a string representation of the text and a confidence level.
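That one line is the image_to_data call; a minimal sketch, assuming Tesseract itself is installed and the screenshot has been loaded with OpenCV:

```python
import cv2
import pytesseract

img = cv2.imread("screenshot.png")

# For every detected element, this returns its bounding box coordinates
# ('left', 'top', 'width', 'height'), its text and a confidence level ('conf').
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

for i, text in enumerate(data["text"]):
    # Keep only entries that contain actual text with a positive confidence.
    if text.strip() and float(data["conf"][i]) > 0:
        print(data["left"][i], data["top"][i],
              data["width"][i], data["height"][i], repr(text))
```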
The main downside we experienced is that the detection performance tends to be significantly lower than with the previous approach, especially for colored text on unusual backgrounds. The models running under the hood of Tesseract seem to have been mostly trained on clean, black-on-white text (with the purpose of automating the digitization of printed documents), and therefore fail to detect wilder instances of text. This is particularly problematic in our case, since unusual combinations of colored text on colored backgrounds (which are likely to exhibit low contrast) are precisely what we want to find.
The performance can be increased in several ways. One way is to increase the resolution of the screenshot, which takes us back to the code used to capture it.
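In the hypothetical Selenium sketch shown earlier, this corresponds to the scale factor passed to headless Chrome:

```python
# Hypothetical scale variable from the screenshot sketch above; Chrome's
# --force-device-scale-factor flag controls the rendering resolution.
scale = 2  # 1 = standard resolution, 2 = retina-like resolution
options.add_argument(f"--force-device-scale-factor={scale}")
```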
The scale variable can be used to control the resolution and can be, for example, set to 2 to obtain retina-like resolution. Note, however, that the larger the image, the slower all approaches presented here will run.
Another way to help the OCR process is to generate several variants of the initial screenshot and submit them to the OCR before merging the results. In the code example below, we perform text recognition three times:
Once on the original image ( img )
Once on an altered version of the image aimed at enhancing contrast ( cv2.convertScaleAbs(img, alpha=3, beta=0) )
Once on a color-inverted version of the image ( (255 - img) )
Other variants, such as grayscale, are of course possible. A simplified version of the code would look like the following:
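Here is one possible sketch, where run_ocr is a hypothetical helper wrapping the Pytesseract call shown earlier; the de-duplication step is discussed next:

```python
import cv2
import pytesseract

def run_ocr(image):
    """Run Tesseract on an image and return a list of detected text boxes."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return [
        {"x": data["left"][i], "y": data["top"][i],
         "width": data["width"][i], "height": data["height"][i],
         "text": text}
        for i, text in enumerate(data["text"])
        if text.strip() and float(data["conf"][i]) > 0
    ]

img = cv2.imread("screenshot.png")

variants = [
    img,                                        # original screenshot
    cv2.convertScaleAbs(img, alpha=3, beta=0),  # contrast-enhanced version
    255 - img,                                  # color-inverted version
]

# Run OCR on every variant and merge the raw results (duplicates included).
results = [box for variant in variants for box in run_ocr(variant)]
```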
Note that much of the text will be detected in several of the variants, leading to many duplicates. Unfortunately, from one image variant to another, the same text might be recognized at slightly different coordinates, complicating the process of weeding out duplicates.
We came up with a relatively simple solution for this problem:
For each text element, compute a simple hash of the bounding box by mapping it to the sum of its coordinates. This efficiently represents each box by a single number (with a relatively low likelihood of collisions)
For any given pair of text elements, compare their hashes and consider them duplicates if the difference between the hashes is lower than a certain threshold. This accounts for the fact that duplicates will have very similar, but not identical, bounding box coordinates.
The full code for handling image variants, including duplicates filtering, is visible below.
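A sketch of that de-duplication step, building on the box dictionaries produced by the hypothetical run_ocr helper above (the threshold value is illustrative and would need tuning for your page sizes):

```python
HASH_THRESHOLD = 10  # hypothetical tolerance, in pixels, for near-identical boxes

def box_hash(box):
    """Map a bounding box to a single number: the sum of its coordinates."""
    return box["x"] + box["y"] + box["width"] + box["height"]

def deduplicate(boxes):
    """Keep only one box among those whose hashes are closer than the threshold."""
    unique = []
    for box in boxes:
        if all(abs(box_hash(box) - box_hash(kept)) >= HASH_THRESHOLD
               for kept in unique):
            unique.append(box)
    return unique

deduplicated_results = deduplicate(results)
```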
While this approach increases the performance, having to process multiple image variants incurs a cost in running time. Moreover, the detection performance never got remotely close to the EAST model’s detection performance described above.
Computing contrast
Regardless of the approach you adopt, the previous step should produce a list of bounding boxes indicating the locations of the text elements on a page. The last step is to compute the contrast between text and background for each bounding box and list low-contrast text instances. The image below shows an example of a detected low-contrast text element:
Specifically, given one such bounding box, the output should be a value indicating the contrast ratio between the foreground and background colors.
Computing the contrast ratio is well documented. The code below shows our Python implementation:
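A sketch of that computation, following the WCAG definition (compute the relative luminance of each color, then take the ratio of the lighter to the darker, each offset by 0.05); the helper names are illustrative:

```python
def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color with channels in 0-255, per WCAG."""
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(color1, color2):
    """WCAG contrast ratio between two colors; the order of the colors is irrelevant."""
    l1, l2 = relative_luminance(color1), relative_luminance(color2)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

# Example: black text on a white background yields the maximum ratio of 21:1.
print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # 21.0
```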
This algorithm requires two colors as input: the foreground and background colors. However, in the example above, multiple colors are present due to anti-aliasing: the black of the text, the gray of the background, and several shades of black to gray around the text. How do we know which two colors to use?
The heuristic we applied consists of detecting the two most frequent colors. In practice, this heuristic works very well and will always yield the foreground and background colors. It doesn’t always allow us to identify which of the two colors is the background and which is the foreground (depending on the size of the font and the size of the cropped background). This, however, is not an issue, as the code above handles both colors interchangeably. The code below shows how the two most common colors are efficiently identified.
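One possible way to do this with NumPy, assuming the cropped bounding box is an image array as returned by OpenCV (note that OpenCV stores pixels in BGR order, so the colors should be reversed to RGB before computing luminance):

```python
import numpy as np

def two_most_common_colors(crop):
    """Return the two most frequent colors in an image crop (H x W x 3 array)."""
    pixels = crop.reshape(-1, 3)
    colors, counts = np.unique(pixels, axis=0, return_counts=True)
    # Sort colors by decreasing frequency and keep the first two.
    order = np.argsort(counts)[::-1]
    return colors[order[0]], colors[order[1]]

# Example usage on a detected bounding box (x, y, w, h) of the screenshot:
# crop = img[y:y + h, x:x + w]
# c1, c2 = two_most_common_colors(crop)
# ratio = contrast_ratio(c1[::-1], c2[::-1])  # reverse BGR -> RGB
```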
With that code in hand, all ingredients are at your disposal to compute the identified text boxes’ contrast.
Wrapping it up
In this post, we explained how we tackled the problem of automatically listing, in Python, all the locations where low-contrast text occurs on an arbitrary web page. Our approach combines several well-known technologies, such as headless browser rendering with Selenium, OCR with Pytesseract and text detection with a dedicated neural network architecture called EAST, together with our own heuristics to tie everything into an end-to-end solution. Depending on your specific requirements, many variations of this solution are possible. We hope that by describing one such approach, you will be better able to make the right decisions for your own project on low-contrast text.