
Speech recognition

Speech recognition is a machine or program’s ability to recognise spoken words and phrases and convert them into a machine-readable format. The software is now a common feature of many devices, including smartphones, computers and virtual assistants.

Speech recognition is an intricate area of computer science, using a mixture of complex linguistics, mathematics and computing. It has been revolutionised in the last decade or so by the application of artificial intelligence (AI) and is now one of the most widespread applications of AI.

In simple speech recognition software, such as automated telephone systems used in call centres, the computer is trained to recognise a very small number of words, such as yes, no and numbers. It matches the sounds to preloaded patterns and can recognise these words across a range of accents.
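Here is a minimal sketch of that kind of template matching, using dynamic time warping (DTW) so that a word spoken faster or slower than the stored pattern still matches. The one-dimensional ‘energy contours’ are invented stand-ins for the acoustic features a real system would use.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping: distance between two sequences that may
    differ in speed, by allowing steps that stretch either sequence."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch the input
                                 cost[i, j - 1],      # stretch the template
                                 cost[i - 1, j - 1])  # advance both together
    return cost[n, m]

# Preloaded patterns for a two-word vocabulary (values are illustrative).
templates = {
    "yes": np.array([0.1, 0.8, 0.9, 0.4, 0.2]),
    "no":  np.array([0.2, 0.9, 0.5, 0.1]),
}

incoming = np.array([0.1, 0.7, 0.9, 0.9, 0.5, 0.2])  # a slowly spoken "yes"
best = min(templates, key=lambda w: dtw_distance(incoming, templates[w]))
print(best)  # the closest preloaded pattern wins: "yes"
```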

Nowadays, computers carry out several steps to recognise human speech: they digitise the sound, process it, and match it to phonemes, the smallest units of sound in speech (around 44 in English). Sequences of phonemes can then be analysed and recognised as meaningful language.

Speaking creates vibrations in the air that a microphone changes into a continuous electrical signal. An analogue-to-digital converter then turns this into a digital signal: it digitises the sound by measuring the soundwave at frequent, regular intervals and storing each measurement as a number.
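As a toy illustration of that conversion, the sketch below samples and quantises a waveform; a pure 440 Hz tone stands in for speech, and the sample rate is a typical (but illustrative) choice.

```python
import numpy as np

SAMPLE_RATE = 16_000   # measurements per second, a typical rate for speech
DURATION = 0.01        # seconds of sound to capture

# The 'continuous' waveform, measured only at the sampling instants.
t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE
analogue = np.sin(2 * np.pi * 440 * t)        # values between -1.0 and 1.0

# Quantisation: store each measurement as a 16-bit whole number.
digital = np.round(analogue * 32767).astype(np.int16)

print(digital[:8])  # the first few digitised measurements
```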


The computer processes this digitised signal to find the speech within all the captured sound, breaks it down into ‘phones’ (small units of the actual sound), and processes these ‘phones’ to make them easier to compare with phonemes.

Any sound, and speech is no different, is made up of many frequencies, just as a chord in music is made up of several different notes. The first two steps use signal processing techniques to identify these frequencies and their relative intensities at each point in time. Complex statistical models, and more recently AI, are then used to identify which of these patterns are speech and which ‘phones’ they are made up of.
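To make the frequency analysis concrete, here is a minimal sketch of one common approach: a windowed Fourier transform over a short frame of the digitised signal. The two-tone test ‘chord’ and the frame length are illustrative choices, not those of any particular product.

```python
import numpy as np

SAMPLE_RATE = 16_000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE      # one second of samples
# A test 'chord' of two frequencies standing in for speech.
signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

FRAME = 800                                   # 50 ms of audio per frame
frame = signal[:FRAME] * np.hanning(FRAME)    # window to reduce edge effects

spectrum = np.abs(np.fft.rfft(frame))         # intensity at each frequency
freqs = np.fft.rfftfreq(FRAME, d=1 / SAMPLE_RATE)

print(freqs[np.argmax(spectrum)])  # strongest frequency in this frame: 300.0
```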

The third step is to make those ‘phones’ consistent. When we speak, we speed up and slow down, and the volume of our voice varies. To match ‘phones’ to standard phonemes, they are normalised: adjusted to a consistent rate and volume.
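A toy version of that normalisation might look like the sketch below, which rescales an invented feature sequence to a standard peak volume and stretches it to a standard number of steps.

```python
import numpy as np

def normalise(phone, target_len=20):
    """Rescale to a standard peak volume and stretch to a standard rate."""
    phone = np.asarray(phone, dtype=float)
    phone = phone / np.max(np.abs(phone))        # volume: peak of 1.0
    old = np.linspace(0, 1, len(phone))
    new = np.linspace(0, 1, target_len)
    return np.interp(new, old, phone)            # rate: fixed number of steps

fast_quiet = [0.01, 0.05, 0.04, 0.02]                   # 4 steps, quiet
slow_loud = [0.2, 0.6, 1.0, 0.9, 0.7, 0.5, 0.3, 0.2]    # 8 steps, loud

a, b = normalise(fast_quiet), normalise(slow_loud)
print(a.shape, b.shape)  # both now 20 steps with peak 1.0, ready to compare
```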


The program then needs to put each phoneme into the context of the phonemes around it, allowing the computer to work out what the user was most likely saying. This is where AI comes in: training and statistical models help the speech recognition program distinguish words that sound the same, such as ‘see’ and ‘sea’. The context generally allows the program to work out which one is being used.
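One simple way to model such context, shown in the sketch below, is to count how often each candidate word follows the word before it. The counts here are invented; a real system learns far richer statistics (or a neural model) from enormous text corpora.

```python
# Invented counts of how often each word pair appears together.
bigram_counts = {
    ("the", "sea"): 90, ("the", "see"): 2,
    ("i", "see"): 80,   ("i", "sea"): 1,
}

def pick_word(previous_word, homophones):
    """Choose the homophone that most often follows the previous word."""
    return max(homophones,
               key=lambda w: bigram_counts.get((previous_word, w), 0))

print(pick_word("the", ["see", "sea"]))  # -> sea ('the sea')
print(pick_word("i", ["see", "sea"]))    # -> see ('I see')
```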

The AI task of recognising words correctly in the presence of background noise, different accents and individual speech patterns is considerable, and cannot easily be carried out by laptops or smartphones alone. Popular speech recognition systems, such as those from Apple and Google, therefore pass the recognition task to very powerful computers in the ‘cloud’.
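For a sense of how that hand-off works, the sketch below uses Google’s Cloud Speech-to-Text Python client (the google-cloud-speech package) to upload a digitised recording and receive the transcript back. The file name and configuration are illustrative, and valid Google Cloud credentials would need to be set up separately.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Upload the digitised audio; the heavy recognition work happens remotely.
with open("recording.wav", "rb") as f:   # illustrative file name
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16_000,
    language_code="en-GB",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)  # the cloud's best guess
```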

Research on analogue and digital electronics that might enable speech recognition in portable devices is ongoing but is still at an early stage.


***
This article has been adapted from "How does that work? Speech recognition", which originally appeared in the print edition of Ingenia 77 (December 2018).
