Artificial intelligence assistants such as Google Assistant and Siri are built from the following technologies:
Speech to Text (STT) Engine,
Text to Speech (TTS) Engine,
Noise Reduction Engine,
Speech Compression Engine,
UI for Call Outs.
STT: The Speech-to-Text engine takes the user's voice, either as a file or as a stream, and converts it to text.
TTS: The Text-to-Speech engine converts text to voice. This matters when the user cannot read the screen, for example while driving.
Tagging: The text produced by STT is not always simple. The tagging engine should label the text with the user's intent. For example, if the user asks "what should I wear tomorrow?", the tagging engine can tag the request with a weather or calendar info tag.
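As a minimal sketch of the idea, a tagger can map keywords in the recognized text to intent tags. Real tagging engines use statistical models; the tag names and keyword lists below are hypothetical examples, not part of any real product.

```python
# Hypothetical intent tags mapped to keyword lists (illustration only).
TAG_KEYWORDS = {
    "weather_info": ["wear", "weather", "rain", "umbrella"],
    "calendar_info": ["meeting", "schedule", "appointment"],
}

def tag_utterance(text):
    """Return the first tag whose keywords appear in the text,
    or "unknown" if nothing matches."""
    words = text.lower().split()
    for tag, keywords in TAG_KEYWORDS.items():
        if any(kw in words for kw in keywords):
            return tag
    return "unknown"
```

With this sketch, "what should I wear tomorrow" matches the weather keywords, while "do I have a meeting tomorrow" falls through to the calendar tag.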
Noise Reduction Engine: User speech is rarely clean; there is often background noise (for example, from an air conditioner) around. The noise reduction engine should remove this background noise from the recording.
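The simplest form of this idea is a noise gate: samples whose amplitude stays below a threshold are treated as background noise and silenced. This is only a sketch; production engines use far more sophisticated techniques such as spectral subtraction. The threshold value here is an arbitrary assumption.

```python
def noise_gate(samples, threshold=500):
    """Silence 16-bit PCM samples whose amplitude falls below the
    threshold, suppressing low-level background noise between words."""
    return [s if abs(s) >= threshold else 0 for s in samples]
```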
Voice Biometrics: Mobile assistants can provide account-based information such as a monthly credit card report, so authentication is important. Voice biometrics is one authentication method: through it, the assistant can verify the speaker's identity from their voice before granting access to the system.
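At a very high level, voice biometric systems extract a feature vector (a "voiceprint") from the audio and compare it to the enrolled one. As a hedged sketch, assuming feature vectors have already been extracted somehow, the comparison step can be a simple cosine similarity check; the threshold is an arbitrary assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def authenticate(enrolled, sample, threshold=0.95):
    """Accept the speaker if their voiceprint is close enough
    to the enrolled voiceprint."""
    return cosine_similarity(enrolled, sample) >= threshold
```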
Speech Compression Engine: If the assistant is slow, users quickly give up on the application and fall back to typing a web search. Network communication speed matters, and so does the packet size of each transaction: small packets transfer faster, so the result comes back faster. That is why a good mobile assistant application should have a speech compression engine, so the client can send compressed voice to the server quickly. Speech compression differs from general-purpose compression, because voice data contains little repeating content. G.711 is a reasonable choice of codec; one reason is that its logarithmic companding halves the data size while keeping speech quality high enough for recognition.
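The core of G.711-style compression is mu-law companding: each 16-bit PCM sample is mapped through a logarithmic curve onto 8 bits, halving the payload. The sketch below shows the continuous mu-law formula for one sample; the real G.711 standard uses a segmented table approximation, so this is an illustration of the principle, not a standard-conformant encoder.

```python
import math

def ulaw_encode(sample, mu=255):
    """Compress one 16-bit PCM sample (-32768..32767) to one byte
    using mu-law companding, halving the data size."""
    x = max(-1.0, min(1.0, sample / 32768.0))          # normalize to [-1, 1]
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((y + 1) / 2 * 255)                      # map [-1, 1] -> [0, 255]
```

Silence (sample 0) lands in the middle of the byte range, and the full-scale negative and positive samples land at the extremes, which is what lets 8 bits cover the whole 16-bit dynamic range.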
UI for Call Outs: After the server sends the result, the client should play the response audio and also show the information on the device screen inside call outs.
Architecture of Mobile Assistants
The mobile device and the main server should communicate over a stream, because users do not like waiting for voice data to download over a slow connection. Speed is critical for this kind of application: the faster the round trip, the more natural the interaction feels, as if the user were speaking with a real agent or assistant.
When the user asks a question by tapping a button, the client streams the audio byte by byte to the main server. The flow then proceeds as follows:
1. The main server forwards the audio to the STT server, which converts the speech to text and returns it.
2. The main server sends the text to the tagging server to find out what the user wants. The tagging server creates a tag for the request, such as "weather_info", and returns it.
3. The main server sends the tag to the information server. If the tag requires authentication, the security server checks the user's credentials before the information server is called.
4. The response returns to the main server, which creates the response text, response graphic, and response speech (in communication with the TTS server) and sends the response object to the mobile device.
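The orchestration above can be sketched as a chain of calls on the main server. Every server function below is a hypothetical stub standing in for a network call; the function names, tag, and response fields are assumptions for illustration, not a real API.

```python
# Hypothetical stubs standing in for the remote servers.
def stt_server(audio_bytes):
    return "what should I wear tomorrow"   # speech -> text

def tagging_server(text):
    return "weather_info"                  # text -> intent tag

def information_server(tag):
    return {"advice": "take an umbrella"}  # tag -> answer data

def tts_server(text):
    return b"<synthesized audio>"          # text -> speech

def handle_request(audio_bytes):
    """Main-server orchestration: STT -> tagging -> information -> TTS."""
    text = stt_server(audio_bytes)
    tag = tagging_server(text)
    info = information_server(tag)
    response_text = info["advice"]
    return {
        "text": response_text,             # shown in the call out
        "audio": tts_server(response_text),
        "tag": tag,
    }
```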
The information server can communicate with third-party servers for information that is not stored locally. The security server can combine more than one authentication technology, such as voice biometrics, IMSI-IP RADIUS lookup, and account/password authentication.
Audio compression reduces the size of audio data, so the compressed audio can be transferred more quickly over the GSM network. Compression can be lossy or lossless.
Lossy: This method discards some data during encoding, but what it retains is still acceptable for recognition. Its advantage is that the output can be much smaller.
Lossless: This method compresses the audio without losing any of its original quality. It matters when the recognition or recording tools do not have their own noise reduction process.
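The defining property of a lossless codec is a perfect round trip: decompressing yields bit-identical audio. As a sketch, Python's stdlib zlib (a general-purpose DEFLATE compressor, not an audio codec) demonstrates the property; dedicated lossless audio codecs such as MPEG-4 ALS achieve better ratios by exploiting correlation between samples.

```python
import zlib

def lossless_roundtrip(pcm_bytes):
    """Compress with DEFLATE, decompress, and verify the decoded
    audio is bit-identical to the original."""
    compressed = zlib.compress(pcm_bytes)
    return zlib.decompress(compressed) == pcm_bytes
```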
Some data reduction does not directly affect the quality of speech data. Simply put, if the recorded audio will only be used for speech recognition, data that is useless for recognition can be dropped. Human hearing sensitivity spans the audible range of roughly 20 Hz to 20 kHz, so frequencies outside that range can be removed.
G.711: You can use the G.711 standard for audio compression. Strictly speaking its companding is lossy (each 16-bit sample is mapped to 8 bits), but the loss is nearly inaudible for speech, and it compresses the data by about 50 percent.
Other methods that can be used include MPEG-1 Layer III (MP3), MPEG-1 Layer II Multichannel, MPEG-1 Layer I, AAC, HE-AAC, MPEG Surround, MPEG-4 ALS, MPEG-4 SLS, MPEG-4 DST, MPEG-4 HVXC, MPEG-4 CELP, USAC, G.718, G.719, G.722, G.722.1, G.722.2, G.723, G.723.1, G.726, G.728, G.729, G.729.1, Speex, Vorbis, WMA, and Codec2.
Platforms and tools aren't the concern here; it's algorithms and training data. Basically, what many people and companies do is preprocess a large set of speech examples in various ways to make them easier to learn from, then feed them into a (for example, recurrent) neural network, which is essentially a program that learns mainly from labeled data, i.e. speech that someone has already transcribed. Well, that is one approach.
The keyword here is machine learning. People study this subject for years to understand its intricate details, and research is ongoing. If you are just looking for a high-level explanation, Google will provide you with a much better one than I ever could.
If you really want to get into it, you need basic knowledge of statistics, probability, and linear algebra; then you could check out an online course on machine learning, e.g. from MIT's OpenCourseWare.