Use speech recognition in gambas

gambafeliz · Post by **gambafeliz** » Saturday 8th April 2023 7:34pm

Hello everyone

I am trying to start a project. It involves voice recognition, but only of very short phrases, probably two words. My questions are:

1. Does this possibility exist?
2. If it exists, is it possible to get the recognized text into an application written in Gambas?

Any guidance with this challenge is welcome.

Thank you.

Note:
======
I am using Debian as the operating system.

More:

I may not have explained myself well.

What I want is this:
1. A user says two words into a microphone.
2. Those two words are received by a free-software voice recognizer; I don't know yet which one it will be.
3. This library will convert speech to text.
4. This is exactly what I want to do: retrieve the text and compare it with the commands that I am going to give to the system from Gambas.

So I need:
1. To know which voice recognizer that converts microphone sound to text can be used from Gambas, or at least how to run a voice recognizer and use its text result in Gambas.

I hope now my idea is clear.

vuott · Post by **vuott** » Sunday 9th April 2023 11:42am

To state the obvious: you either have to run a program that converts speech to text through the "Shell" command, or call the external functions of a library that performs that conversion.
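The "Shell" route could look roughly like the sketch below. It is untested and rests on assumptions: `arecord` (from alsa-utils) is installed, 16 kHz mono is a suitable input format for the recognizer, and `my-stt-command` is a hypothetical placeholder for whatever speech-to-text program is finally chosen, assumed to print the recognized text on its standard output.

```gambas
' Sketch only: record three seconds from the default microphone with arecord,
' then pass the file to a speech-to-text command and capture what it prints.
' "my-stt-command" is a hypothetical placeholder, not a real program.
Public Sub Main()

  Dim sWav As String = Temp$("speech") & ".wav"
  Dim sText As String

  ' 16 kHz, 16-bit, mono (assumed to suit the recognizer)
  Shell "arecord -q -f S16_LE -r 16000 -c 1 -d 3 " & Shell$(sWav) Wait

  ' Capture the standard output of the recognizer into a string
  Shell "my-stt-command " & Shell$(sWav) To sText

  Print Trim(sText)

End
```

`Shell$()` quotes the file name for the shell, and `Shell ... To` waits for the command to finish before filling the string.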

gambafeliz · Post by **gambafeliz** » Sunday 9th April 2023 1:51pm

Yes, of course, you're not wrong. But has anyone had experience with, for example, a library, and could show me the process so that I can later write the code in Gambas?

I have seen this:
Vosk Speech Recognition Toolkit

But to be honest I have no idea how I can talk to Vosk and then use it on Gambas. In the end, all I want is this:

A person says something to a computer and Vosk, for example, converts what the person says into text. I then take the text and, if it matches what I expect in a comparison done in Gambas, I execute a command that sends another command over the network to fulfill the wish of someone elsewhere.

vuott · Post by **vuott** » Sunday 9th April 2023 9:59pm

gambafeliz wrote: ↑Sunday 9th April 2023 1:51pm But to be honest I have no idea how I can talk to Vosk and then use it on Gambas.

Well, I found this code in C language:
https://github.com/alphacep/vosk-api/bl ... _speaker.c
I don't know if it's suitable; it would seem so.
It should be noted that this code does not translate speech directly to text via microphone, but uses a "wav" format audio file, in which the speech has been previously recorded.
I didn't install the Vosk library; however, I tried to translate it into Gambas using the external functions of the Vosk API:
https://github.com/alphacep/vosk-api/bl ... vosk_api.h
I should also point out that, since I haven't installed Vosk, I obviously couldn't test my code.

Library "libvosk..."

' VoskModel *vosk_model_new(const char *model_path)
' Loads model data from the file and returns the model object.
Private Extern vosk_model_new(model_path As String) As Pointer

' VoskSpkModel *vosk_spk_model_new(const char *model_path)
' Loads speaker model data from the file and returns the model object.
Private Extern vosk_spk_model_new(model_path As String) As Pointer

' VoskRecognizer *vosk_recognizer_new_spk(VoskModel *model, float sample_rate, VoskSpkModel *spk_model)
' Creates the recognizer object with speaker recognition.
Private Extern vosk_recognizer_new_spk(model As Pointer, sample_rate As Single, spk_model As Pointer) As Pointer

' int vosk_recognizer_accept_waveform(VoskRecognizer *recognizer, const char *data, int length)
' Accept voice data
Private Extern vosk_recognizer_accept_waveform(recognizer As Pointer, data As Byte[], length As Integer) As Integer

' const char *vosk_recognizer_result(VoskRecognizer *recognizer)
' Returns speech recognition result.
Private Extern vosk_recognizer_result(recognizer As Pointer) As String

' const char *vosk_recognizer_partial_result(VoskRecognizer *recognizer)
' Returns partial speech recognition.
Private Extern vosk_recognizer_partial_result(recognizer As Pointer) As String

' const char *vosk_recognizer_final_result(VoskRecognizer *recognizer)
' Returns speech recognition result. It doesn't wait for silence.
Private Extern vosk_recognizer_final_result(recognizer As Pointer) As String

' void vosk_recognizer_free(VoskRecognizer *recognizer)
' Releases recognizer object.
Private Extern vosk_recognizer_free(recognizer As Pointer)

' void vosk_spk_model_free(VoskSpkModel *model)
' Releases the model memory.
Private Extern vosk_spk_model_free(model As Pointer)

' void vosk_model_free(VoskModel *model)
' Releases the model memory.
Private Extern vosk_model_free(model As Pointer)


Library "libc:6"

Private Enum SEEK_SET = 0, SEEK_CUR, SEEK_END

' FILE *fopen (const char *__restrict __filename, const char *__restrict __modes)
' Open a file and create a new stream for it.
Private Extern fopen(__filename As String, __modes As String) As Pointer

' int fseek(FILE *__stream, long int __off, int __whence)
' Seek to a certain position on STREAM.
Private Extern fseek(__stream As Pointer, __off As Long, __whence As Integer) As Integer

' int feof (FILE *__stream)
' Return the EOF indicator for STREAM.
Private Extern feof(__stream As Pointer) As Integer

' size_t fread(void *__restrict __ptr, size_t __size, size_t __n, FILE *__restrict __stream)
' Read chunks of generic data from STREAM.
Private Extern fread(__ptr As Pointer, __size As Long, __n As Long, __stream As Pointer) As Long

' int fclose (FILE *__stream)
' Close STREAM.
Private Extern fclose(__stream As Pointer) As Integer


Public Sub Main()

  Dim wavin, model, spk_model, recognizer As Pointer
  Dim buf As New Byte[3200]
  Dim nread, final As Integer
  
  model = vosk_model_new("model")
  spk_model = vosk_spk_model_new("spk-model")
  recognizer = vosk_recognizer_new_spk(model, 16000.0, spk_model)
  
  wavin = fopen("/path/of/file.wav", "rb")
  fseek(wavin, 44, SEEK_SET)
  
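  ' feof() returns an Integer and "Not" on an Integer is bitwise, so compare with 0
  While feof(wavin) = 0
    ' fread() is declared with a Pointer, so pass the address of the array data
    nread = fread(buf.Data, 1, buf.Count, wavin)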
    final = vosk_recognizer_accept_waveform(recognizer, buf, nread)
    If final
      Print vosk_recognizer_result(recognizer)
    Else 
      Print vosk_recognizer_partial_result(recognizer)
    Endif 
  Wend 
  Print vosk_recognizer_final_result(recognizer)

  fclose(wavin)
  vosk_recognizer_free(recognizer)
  vosk_spk_model_free(spk_model)
  vosk_model_free(model)

End
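One note for anyone trying the code above: as far as I know, the Vosk result functions return JSON text such as `{"text": "..."}` rather than a bare string, so the text still has to be extracted before it can be compared. A minimal sketch follows, assuming the gb.web component is enabled (newer Gambas versions also provide the JSON class in gb.util.web). `ExtractText` is a hypothetical helper and the sample string is made up, not real recognizer output.

```gambas
' Sketch only: pull the "text" field out of a Vosk-style JSON result.
' Assumes gb.web (or gb.util.web) is enabled in the project.
Public Sub Main()

  ' Sample string in the shape Vosk returns; not real recognizer output
  Print ExtractText("{\"text\": \"four eggs\"}")

End

' Returns the "text" field of the JSON result, or "" if it is missing
Private Function ExtractText(sJson As String) As String

  Dim cResult As Collection = JSON.Decode(sJson)

  If cResult.Exist("text") Then Return cResult["text"]
  Return ""

End
```

Partial results use a different key ("partial" instead of "text"), so the helper would return an empty string for them.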

cogier · Post by **cogier** » Monday 10th April 2023 11:08am

This might be of interest, https://unix.stackexchange.com/question ... -for-linux which shows some examples of the Vosk software that vuott talks about.

gambafeliz · Post by **gambafeliz** » Monday 10th April 2023 1:44pm

Thank you very much, gentlemen.

As always, both of you to the rescue. I will try what you suggest and see whether I can get the idea started.

thatbruce · Post by **thatbruce** » Friday 14th April 2023 8:50am

Interesting! In fact I have spent the entire afternoon looking at the state of Linux speech-to-text software options.

To be frank, in general they are still mainly useless. The accuracy is generally very poor.

I assume that you want to use a utility that doesn't require training to be done by the speaker, in other words you want to use a default model provided by the utility. Now, most of these are "english as she is spoke by Amer-kans" which is to be expected. (I am a "Strine" which has much nicer phonemes by the way!)
After repeating the following input into the microphone a dozen times I gave up attempting to say the phrase the same way without mistakes and finally installed an audio recorder. I used the gnome-audio-recorder by Osmo Antero just for convenience's sake. It's pretty rudimentary but does the job. This is the input phrase:

"There was movement at the station for the word had passed around,
that the colt from 'Old Regret' had got away."

I'll ignore the other dozen or so that I tried that just delivered garbage like "ten wars moon men" and just report the two that stood out.

pocketsphinx
PRO: fast
CON: medium accuracy for untrained models
RESULT: there was movement at the station the word that caused the rare that the call from all regret had gone already

vosk
PRO: much better untrained accuracy than any other I tried
CON: very slow at first as it has to generate its default model, but speeds up as long as you don't reboot.
RESULT: there was movement at the station for the word had passed around that the cult from all regret had got away

Both do have APIs that can be used as vuott says. I haven't looked at them yet, apart from noticing that the pocketsphinx API looks a lot simpler than the vosk one, which is v e r y complex (but possibly worth the effort).
Looking forward to your results!

p.s. Tried to attach the input mp3 file I used but it appears that phpBB has never heard of audio files

gambafeliz · Post by **gambafeliz** » Tuesday 2nd May 2023 7:24pm

Thank you very much, I appreciate your interest.
I also note that you found something useful.

To be clear, I don't want a conversation recognizer.

I want the user to say something that they see on the screen, for example 4A, and that is enough for me to interpret it as a command.

I'm telling you this so you know exactly what I'm looking for.

Let's say it's this:

1. I present some codes on the screen at will.
2. The user chooses one with their voice.
3. I get this converted to text in Gambas.
4. Finally, I execute the command programmed for the resulting string.

thatbruce · Post by **thatbruce** » Thursday 4th May 2023 5:24am

Here's just a couple of thoughts.
Regardless of the speech-to-text library you employ, you will need to "convert" the string it "heard" to something your program will know. For example, suppose the user picks "4X": with English models (and an English speaker) you are likely to end up with something like "four eggs" or "for eggs". So you need to train your program, not the S2T library, that "four eggs" and "for eggs" are the possible texts for the "4X" command.
Now given your location I raise the question, what language(s) are your users going to use? Are dialects going to be a problem? etc.
So to get you moving I think you will need some sort of a lookup table to convert the delivered text as spoken by user X in language Y with dialect Z into the required command.
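Such a lookup table could be sketched in Gambas with a Collection, as below. The phrases and codes are hypothetical examples only; the real entries would come from testing what the recognizer actually returns for your users. `LookupCommand` is an illustrative name, not an existing function.

```gambas
' Sketch only: the table entries are hypothetical examples
Private $cCommands As New Collection(gb.IgnoreCase)

Public Sub Main()

  ' Several possible transcriptions map to the same on-screen code
  $cCommands["four eggs"] = "4X"
  $cCommands["for eggs"] = "4X"
  $cCommands["four a"] = "4A"
  $cCommands["for a"] = "4A"

  Print LookupCommand("  For Eggs ")

End

' Returns the command code for a recognized phrase, or "" when it is unknown
Private Function LookupCommand(sHeard As String) As String

  Dim sKey As String = Trim(sHeard)

  If $cCommands.Exist(sKey) Then Return $cCommands[sKey]
  Return ""

End
```

Creating the Collection with gb.IgnoreCase makes the keys case-insensitive, and an unknown phrase returns an empty string, so the program can ask the user to repeat.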
b
