Voice Recognition: How does it work?
Like many people in recent years, you have probably witnessed the massive democratization of personal assistants. Seeing a friend or co-worker give strange orders or ask funny questions to their phone has never been more normal than it is today. As you may have noticed, we are in the era of cognitive technologies, and voice recognition is one of them.
Many people talk about it, but who really knows how it works? A clue: you, just after reading this article.
To better understand this process, we must understand what it is made of. In total, five technological building blocks make up the speech recognition process.
The first step, which initiates the whole process, is commonly called the Wake-up Word (or Hot Word). It is not necessarily a voice command: it can also be a button press or any other interaction between the user and the machine. The main purpose of this step is to activate speech recognition (STT, explained later), in other words to “wake up” the system so that it starts recording.
This is all the more important in today’s context, where people are wary of technology for fear of having their privacy violated. Until the user performs the required action or says the required words, voice recognition stays on standby and records nothing.
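To make the standby behaviour concrete, here is a minimal sketch in Python. It is only an illustration: text strings stand in for audio, and the class name `VoiceFrontEnd` and the wake-up phrase are assumptions, not part of any real assistant.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceFrontEnd:
    """Toy front end: everything is ignored until a wake-up event occurs."""

    wake_word: str = "hello assistant"  # hypothetical wake-up phrase
    awake: bool = False
    recorded: list = field(default_factory=list)

    def press_button(self) -> None:
        # A button press is an alternative wake-up interaction.
        self.awake = True

    def hear(self, utterance: str) -> None:
        if not self.awake:
            # On standby: only check for the wake word, keep nothing.
            if utterance.strip().lower() == self.wake_word:
                self.awake = True
            return
        # Awake: the utterance is kept for the STT step.
        self.recorded.append(utterance)


front_end = VoiceFrontEnd()
front_end.hear("private conversation")  # ignored, system is on standby
front_end.hear("Hello assistant")       # wakes the system up
front_end.hear("turn on the light")     # recorded
print(front_end.recorded)  # ['turn on the light']
```

Nothing said before the wake-up event ends up in `recorded`, which is the privacy guarantee described above.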
Once the system is active, it needs to capture speech. To do this, the speech is first recorded and digitized by the STT (Speech To Text): put simply, it is recognized! At the end of this step, the voice has been translated into sound frequencies (like music, for example) that the system can interpret. To make these frequencies easier to interpret, several treatments are applied:
- Normalization, which smooths out peaks and troughs across the different frequencies in order to harmonize them.
- Removal of background noise to improve the audio quality.
- Segmentation into phonemes (distinctive units, measured in thousandths of a second, that distinguish words from each other).
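The three treatments above can be sketched on a short list of audio samples. This is a simplified illustration under stated assumptions: the noise removal is a crude threshold, and fixed-length frames stand in for real phoneme segmentation, which is far more involved. The function names and sample values are made up for the example.

```python
def normalize(samples, peak=1.0):
    """Scale samples so the loudest one has amplitude `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0:
        return list(samples)
    return [s * peak / loudest for s in samples]


def noise_gate(samples, threshold=0.05):
    """Crude noise removal: zero out samples quieter than `threshold`."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def split_frames(samples, sample_rate, frame_ms=25):
    """Cut the signal into fixed-length frames of `frame_ms` milliseconds."""
    size = max(1, sample_rate * frame_ms // 1000)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


raw = [0.01, 0.2, -0.4, 0.04, 0.1, -0.01, 0.3, 0.0]  # made-up samples
clean = noise_gate(normalize(raw))
# 160 samples per second and 25 ms frames give 4 samples per frame.
frames = split_frames(clean, sample_rate=160, frame_ms=25)
print(len(frames))  # 2
```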
The frequencies can then be analysed by a previously trained neural network (Deep Learning): an algorithm capable of analysing a large amount of information and building a “database” of associations between frequencies and words.
Through statistical analysis in particular, this makes it possible to match a frequency to the most probable word, which is therefore, in theory, the most accurate one.
For example, let’s take two phrases that sound alike in the original French: “a glass of water” and “a water worm”. The first one will be retained, because “glass” is used far more often than “worm” alongside “water”.
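The statistical choice between two similar-sounding candidates can be sketched as a simple lookup of how often each word appears alongside its neighbour. The counts below are invented for the illustration. A real system would learn them from large amounts of data.

```python
from collections import Counter

# Hypothetical co-occurrence counts, standing in for what a trained model learns.
PAIR_COUNTS = Counter({
    ("glass", "water"): 950,
    ("worm", "water"): 3,
})


def pick_candidate(candidates, context_word, counts=PAIR_COUNTS):
    """Return the candidate seen most often alongside `context_word`."""
    return max(candidates, key=lambda word: counts[(word, context_word)])


best = pick_candidate(["worm", "glass"], "water")
print(best)  # glass
```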
Once voice recognition and the different treatments are done, the data is sent directly to the NLP (Natural Language Processing) system. The main mission of this technology is to analyse the sentence and extract its meaning. To do this, it starts by splitting the sentence into words (tokenization) and associating a tag with each of them. These tags are “labels” attached to each word to characterize it. For example, in a sentence such as “I turn on the light”, “I” is tagged as a first-person singular pronoun, “turn on” as the verb defining an action, and “the” as the determiner referring to “light”, which is a common noun and also the direct object, and so on for each element of the sentence. Then come syntactic and semantic analysis, which model the structure of the sentence and capture the relationships between the different words.
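As a rough illustration of tokenizing and labelling, the sketch below splits a sentence into words and attaches a tag to each one from a tiny hand-written lexicon. The lexicon, the tag names and the example sentence are assumptions made for the example. Real NLP systems learn their tags from annotated data.

```python
import re

# Tiny hand-written lexicon; a real system would learn tags from data.
LEXICON = {
    "i": "PRONOUN",
    "turn": "VERB",
    "on": "PARTICLE",
    "the": "DETERMINER",
    "light": "NOUN",
}


def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def tag(tokens, lexicon=LEXICON):
    """Attach a label to each token; unknown words get 'UNKNOWN'."""
    return [(token, lexicon.get(token, "UNKNOWN")) for token in tokens]


tagged = tag(tokenize("I turn on the light."))
print(tagged[0])   # ('i', 'PRONOUN')
print(tagged[-1])  # ('light', 'NOUN')
```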
The importance of NLP lies in its ability to translate textual elements (i.e. words and sentences) into standardized orders (always in the same format) that the artificial intelligence can then interpret.
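To give an idea of what a “standardized order” might look like, here is a minimal sketch that maps differently worded sentences to the same fixed format. The dictionary layout, the action names and the device names are all hypothetical, and simple keyword matching stands in for real language understanding.

```python
def to_order(sentence):
    """Map a free-form sentence to a fixed-format order, or None if unclear."""
    words = sentence.lower().replace(".", "").split()
    actions = {"on": "turn_on", "off": "turn_off"}          # hypothetical action names
    devices = {"light", "lamp", "shutters", "television"}   # hypothetical device names
    action = next((actions[w] for w in words if w in actions), None)
    device = next((w for w in words if w in devices), None)
    if action is None or device is None:
        return None
    return {"action": action, "device": device}


print(to_order("Turn on the lamp"))     # {'action': 'turn_on', 'device': 'lamp'}
print(to_order("Switch the lamp on."))  # {'action': 'turn_on', 'device': 'lamp'}
```

Both phrasings produce the same order, which is what lets the next stage work from a single, predictable format.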
To actually carry out the stated order, the AI is the centrepiece. Artificial intelligences work in different ways, some more basic than others. In the case of Vivoka, the AI, which has been in development for 5 years, works by aggregating different elements.
The idea is to group these different elements and link them together to obtain relevant and effective results. Here is a (very basic) illustration of AI in the context of home automation:
Context: Home, Controlling Connected Objects, for Users
Information: Lamp, Refrigerator, Shutters, Television (on), Heating (26°)
External services: Weather, Wikipedia, SNCF
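The aggregation of context, information and external services can be sketched as a small data structure. This is an assumption-laden illustration and not how Vivoka's AI is actually built: the class name, the order format, the lamp and shutter states, and the stubbed weather service are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HomeAssistant:
    context: str = "home"
    # Device states; television and heating come from the illustration,
    # the lamp and shutter states are assumed.
    devices: dict = field(default_factory=lambda: {
        "lamp": "off", "shutters": "closed", "television": "on", "heating": 26,
    })
    # External services are plain callables here; real ones would be API calls.
    services: dict = field(default_factory=dict)

    def handle(self, order):
        """Resolve an order against device states or an external service."""
        if order["action"] == "set" and order["target"] in self.devices:
            self.devices[order["target"]] = order["value"]
            return f"{order['target']} set to {order['value']}"
        if order["action"] == "ask" and order["target"] in self.services:
            return self.services[order["target"]]()
        return "Sorry, I cannot do that."


assistant = HomeAssistant(services={"weather": lambda: "sunny"})  # stubbed service
print(assistant.handle({"action": "set", "target": "heating", "value": 21}))  # heating set to 21
print(assistant.handle({"action": "ask", "target": "weather"}))               # sunny
```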
The TTS (Text To Speech) concludes the process. It corresponds to the feedback from the AI, which can take the form of a sound, a voice or displayed text, for example. This feedback communicates information to the user and is the mark of a complete human-machine interface.
Once the cycle is complete, an individual can converse with the machine and give it orders. To summarize, the sentence is captured, then interpreted, and then executed as an action that produces feedback from the system (spoken or not).
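Putting the whole cycle together, the sketch below chains the stages in order: wake-up, STT, NLP, AI and TTS. Every stage is a stub, text stands in for audio, and all function names are hypothetical. It only shows how the pieces hand over to one another.

```python
def speech_to_text(audio):
    # Stub: the "audio" is already text in this sketch.
    return audio.strip()


def understand(text):
    # Stub NLP: the last word is the device, "on" decides the action.
    words = text.lower().split()
    action = "turn_on" if "on" in words else "turn_off"
    return {"action": action, "device": words[-1]}


def execute(order, devices):
    # Stub AI: apply the order and describe the result.
    devices[order["device"]] = "on" if order["action"] == "turn_on" else "off"
    return f"The {order['device']} is now {devices[order['device']]}."


def text_to_speech(message):
    # Stub TTS: return the text instead of synthesizing a voice.
    return message


def run_cycle(utterances, devices, wake_word="hello assistant"):
    """Process utterances in order; act only on those after the wake word."""
    awake, feedback = False, []
    for heard in utterances:
        if not awake:
            awake = heard.lower() == wake_word
            continue
        order = understand(speech_to_text(heard))
        feedback.append(text_to_speech(execute(order, devices)))
    return feedback


home = {"lamp": "off"}
print(run_cycle(["turn off the lamp", "hello assistant", "turn on the lamp"], home))
# ['The lamp is now on.']
```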
The most experienced among you will have understood that this article explains a complex technology in a very simple way. The idea here is not to make you an expert in this area but to give you a sense of how it works and how its parts fit together.
E-commerce:
At a time when the human-machine interface (HMI) is at the centre of attention, voice technologies are strategic for constantly rethinking the customer experience.
Developed for more than 50 years, voice technology has now reached one of its most advanced stages. The capabilities it brings are numerous and highly diverse. Discover in this article how voice recognition is shaping tomorrow’s e-commerce.
A phenomenon at the beginning of its growth.
Voice commerce (also called v-commerce) is today one of the favourite fields of big technology players like Google or Amazon. However, speech recognition still has many challenges to overcome before it truly enters the e-commerce market.
First of all, there is the question of accuracy. Currently, the error rate is 6.3%, and this figure must come down to give users a dependable experience. In addition, as with any new technology, adoption is still uneven, because it remains unknown to some people. Nevertheless, the democratization of voice assistants is trending upwards, and this is reflected in the multiplication of voice solutions in our daily lives, at every moment of the day.
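The article does not say how this error figure is measured. Assuming it refers to the word error rate commonly used for speech recognition (substituted, deleted and inserted words divided by the number of words actually spoken), a minimal sketch of the calculation could look like this. The example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the words read so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("turn on the light", "turn of the light"))  # 0.25
```

One wrong word out of four gives 25%. By the same measure, 6.3% means roughly one error for every sixteen words spoken.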
A simpler buying experience.
Let’s get straight to the point and start from the most obvious observation: speaking is the most natural way humans have to communicate. Formerly reserved for our peers (and very often for pets), speech is now increasingly used to interact with our technologies. With a powerful NLP system, users can browse offers on different shopping platforms in a very intuitive way.
Through artificial intelligence, the possibilities are endless. From remembering past purchases, to personalized recommendations and synergies with the many external services (cooking recipes, for example), to complex requests such as “Suggest an outfit for my dinner this weekend”, it is easy to believe that human imagination is the only limit.
In addition, questions of UX (User Experience) and UI (User Interface) are commonplace nowadays. The customer 2.0 looks for easy navigation combined with a pleasant interface. Attracting new customers is even harder today, given the multiplication of offers and the banality that often surrounds the way they are made available. Thanks to voice recognition, e-commerce customers benefit from a new experience that brings unmatched comfort to their journey, which in turn encourages their loyalty to the brand.
To adapt to changing uses.
Who would have thought, 30 years ago, that we could buy almost everything via the internet and have it delivered to our homes with disconcerting ease? As you will have understood, technology and people’s daily lives evolve in step.
Today, the simplicity of the interaction between consumers and retailers makes it possible to build a rapport. Through this relationship, loyalty becomes a growth opportunity for brands. In a hyper-connected world where time is one of the most precious resources, the intuitiveness of voice is a major asset for establishing a link with the customer, whatever their context.
It is now possible for the customer to concentrate on the essentials.
For example, with a single request, the whole grocery order can be scheduled for pickup that same evening, when the user is done with the day’s obligations and heads home. This example seems obvious today, but it was not a few years ago, and it is far from the only case of these new habits.
In an effort to continually adapt to their customers’ lifestyles, today’s businesses must keep coming up with innovative ways to consume.
Author Bio: Sundar works at mippin.com, where he mainly writes about zero-turn mowers as a manager. He has been a writer for more than a year and also works as a freelance SEO analyst. He helps his clients grow their businesses by advising them on how to advertise and market.