Automatic alternate text is something I have wanted to research since Facebook announced that its iOS app can provide automatically generated alternate text for photos uploaded to the site. Automatic alternative text, or automatic alt text, is a new feature that generates a description of a photo through object recognition technology for someone who cannot see the photo.
It is definitely a great step toward making social media accessible. Visually impaired Facebook users can now hear descriptions of photos in the Facebook iOS app using VoiceOver. The alternate text may not provide complete information, but it is definitely better than having no alternate text at all. The feature later appears to have rolled out to www.facebook.com as well; a random check of the website using NVDA announces automatic alternate text for photos.
Some sample automatic alternate text for images reads "Image may contain: 2 people, sunglasses and outdoor", "Image may contain: 1 person", "Image may contain: meme, text and indoor", and so on. When the software is unable to provide any alternate text, the screen reader announces the photo as "No automatic alt text available." These descriptions might not convey a complete visualization of the photo, but they help screen reader users understand at least the context. In my honest opinion, not even manually provided alternate text can convey the same level of meaning that a sighted person gets from a photo.
How is automatic alternate text built?
The biggest challenge for Facebook in designing the automatic alternate text engine is balancing users' desire for more information about images with the quality and social intelligence of that information. Interpreting what matters to users in visual content is a tough task. Though many people like to identify the persons in a photo, in some instances the background, location, and other aspects of the picture make a significant impact. Facebook's engineers have to balance this equation while ensuring that incorrect information is not conveyed in the process.
To accomplish the task of providing automatic alternate text, Facebook has broken the process down into the following components.
Content understanding at scale
The computer vision platform, part of Facebook's applied machine learning stack, has a visual recognition engine whose role is to "see" inside images and videos and understand what they depict. The engine can tell whether an image contains a cat, was taken outdoors, shows a beach, and so on.
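As a rough illustration of this kind of concept tagging, the sketch below uses torchvision's pretrained ResNet-50 classifier. This is only a stand-in: Facebook's production engine is a proprietary network trained on its own concept vocabulary, and the image file name here is hypothetical.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from PIL import Image

# Load a pretrained classifier and its matching preprocessing pipeline.
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()
preprocess = weights.transforms()

def tag_image(path, top_k=5):
    """Return the top-k (label, probability) pairs for an image."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(img), dim=1)[0]
    top = torch.topk(probs, top_k)
    return [(weights.meta["categories"][int(i)], p.item())
            for p, i in zip(top.values, top.indices)]

# "beach_photo.jpg" is a hypothetical file name.
print(tag_image("beach_photo.jpg"))  # e.g. [('seashore', 0.72), ...]
```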
Selection of concepts
Though the visual recognition technology can identify a wide variety of objects and scenes, Facebook carefully selected around 100 of them for the first launch. These 100 objects and scenes were chosen based on their prominence in photos as well as the accuracy of the visual recognition engine. Among the objects and scenes identified at launch are people's appearances (baby, eyeglasses, beard, smiling), nature (outdoor, mountain, snow), transportation (car, boat, airplane), and food (ice cream, pizza, dessert).
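A plausible way to picture this selection step is a curated whitelist plus a confidence cutoff: the engine may detect many concepts, but only vetted ones that clear a confidence bar are surfaced. The concept list and threshold below are illustrative, not Facebook's actual values.

```python
# Partial, illustrative whitelist of launch concepts (the article names
# these examples; the real list has around 100 entries).
LAUNCH_CONCEPTS = {
    "baby", "eyeglasses", "beard", "smiling", "sunglasses",  # appearances
    "outdoor", "mountain", "snow",                           # nature
    "car", "boat", "airplane",                               # transportation
    "ice cream", "pizza", "dessert",                         # food
}

CONFIDENCE_THRESHOLD = 0.8  # illustrative; likely tuned per concept in practice

def select_concepts(detections):
    """Keep only whitelisted concepts the engine is confident about.

    `detections` maps concept name -> confidence score in [0, 1].
    """
    return sorted(
        concept for concept, score in detections.items()
        if concept in LAUNCH_CONCEPTS and score >= CONFIDENCE_THRESHOLD
    )

# Example: 'snow' is dropped for low confidence, 'cat' is not whitelisted.
raw = {"smiling": 0.95, "sunglasses": 0.91, "outdoor": 0.88,
       "snow": 0.42, "cat": 0.90}
print(select_concepts(raw))  # ['outdoor', 'smiling', 'sunglasses']
```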
Construction of the sentence
After the objects and scenes in a photo are detected, a sentence needs to be framed to serve as the automatic alternate text. The description begins with the words "Image may contain", and the detected items then follow in a persons, objects, scenes sequence. For example, if the photo contains two faces, the alternate text starts with "2 people", followed by any properties such as "smiling". Next in the sequence come objects such as sunglasses or ice cream, and after the objects come scenes such as outdoor, sky, or indoor. A minimal sketch of this assembly step follows.
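The sketch below assembles the alt text in the order just described: people first (a count plus any appearance attributes), then objects, then scenes. The category sets are my own illustrative grouping, not Facebook's actual vocabulary.

```python
# Illustrative category assignments; anything not an appearance attribute
# or a scene is treated as an object.
PEOPLE_ATTRS = {"smiling", "beard", "eyeglasses", "baby"}
SCENES = {"outdoor", "indoor", "sky", "mountain", "snow", "beach"}

def build_alt_text(face_count, concepts):
    """Assemble alt text from a detected face count and selected concepts."""
    parts = []
    if face_count:
        parts.append(f"{face_count} {'person' if face_count == 1 else 'people'}")
    parts += sorted(c for c in concepts if c in PEOPLE_ATTRS)  # appearances
    parts += sorted(c for c in concepts
                    if c not in PEOPLE_ATTRS and c not in SCENES)  # objects
    parts += sorted(c for c in concepts if c in SCENES)  # scenes
    if not parts:
        return "No automatic alt text available."
    if len(parts) == 1:
        return "Image may contain: " + parts[0]
    return "Image may contain: " + ", ".join(parts[:-1]) + " and " + parts[-1]

print(build_alt_text(2, {"sunglasses", "outdoor"}))
# Image may contain: 2 people, sunglasses and outdoor
print(build_alt_text(1, set()))
# Image may contain: 1 person
```

This reproduces the sample announcements quoted earlier, including the "No automatic alt text available." fallback when nothing is detected.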