A person's voice can identify them much like a fingerprint can. Speaker recognition technology is now widely used, and it has also caught the eye of con artists.
Associate Professor Tomi Kinnunen from the School of Computing has been studying voice biometrics for more than 15 years.
“Speech processing is a never-ending topic of research. Speech has been studied since the 1950s, and researchers have focused not only on what is being said, but also on who says it, in what language, where, in what kind of environment and with what emotion,” Kinnunen explains.
The speaker’s native language, accent, dialect, gender, speaking style, melody, rhythm and word choice also help make each person’s speech individual. The shape of the vocal cords and the dimensions of the lips and tongue affect the tone colour, or timbre, of the voice, which is the richest source of speaker information.
In speaker recognition, a recorded acoustic signal is divided into short segments. For example, a one-second sound sample can be divided into 100 segments, one every 10 milliseconds, with each segment described by 50 numerical values.
“The sample is like a puzzle with 100 pieces. You can’t tell the individual pieces apart by ear; instead, a computer will calculate parameters to identify the speaker.”
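To make the arithmetic concrete, here is a minimal Python sketch of the segmentation step. It rests on illustrative assumptions that are not the method used by Kinnunen's group: a 16 kHz sampling rate, non-overlapping 10 ms frames, and the log-magnitudes of the first 50 DFT bins serving as the 50 numbers per frame. Real systems typically use overlapping windows and features such as MFCCs. The constants and function names are hypothetical.

```python
import math

SAMPLE_RATE = 16_000            # assumed sampling rate (samples per second)
FRAME_LEN = SAMPLE_RATE // 100  # 160 samples = 10 ms, so 100 frames per second
NUM_FEATURES = 50               # numbers per frame, matching the example in the text


def split_into_frames(signal, frame_len=FRAME_LEN):
    """Cut the signal into consecutive, non-overlapping frames (drops any leftover tail)."""
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_features(frame, num_features=NUM_FEATURES):
    """Crude stand-in for real features: log-magnitude of the first DFT bins."""
    n = len(frame)
    feats = []
    for k in range(num_features):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        feats.append(math.log(math.hypot(re, im) + 1e-10))  # small offset avoids log(0)
    return feats


def extract_features(signal):
    """Return one feature vector per frame: a (number of frames) x NUM_FEATURES matrix."""
    return [frame_features(f) for f in split_into_frames(signal)]


if __name__ == "__main__":
    # One second of a synthetic 200 Hz tone standing in for recorded speech.
    tone = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
    features = extract_features(tone)
    print(len(features), len(features[0]))  # 100 50
```

Running the script prints `100 50`: one second of audio becomes 100 vectors of 50 numbers each, the "100 puzzle pieces" of the quote. A recognition system would then compare such vectors against a model of the claimed speaker rather than inspect them by ear.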
Kinnunen, like many other researchers, aims to raise the state of the art in speaker recognition. International datasets for developing speaker recognition are freely available, and research groups around the world analyse them with their own methods.
“All findings will eventually be combined and that’s when we’ll see how many different approaches can be taken to the same data. This is open science at its best.”
Kinnunen’s research group participated in the recently concluded H2020-funded OCTAVE project, which studied biometric attacks and spoofing attacks. The latter are familiar from Hollywood films, where pre-recorded speech is used to trick speaker recognition systems.
“The OCTAVE project made significant advances in the field. Recognising a speaker across different devices is a classic problem.”
The microphone, voice compression on phone lines, data transmission and room acoustics can all make the same speaker sound like a different person.
“However, speaker recognition technology is becoming accurate enough to be introduced in new applications, such as electronic signatures that require strong authentication, or speaker verification in teleconferences.”
In some fields, such as criminal investigation, the technology can never fully replace human experts; instead, it serves as a tool to support decision-making.
“Automated speaker recognition is never 100% accurate; it only gives the probability that the speaker is who he or she claims to be.”