AI-powered Personal VoiceBot for Language Learning | by Gamze Zorlubas | Aug, 2023

Before we dive into the pipeline, you might want to take a look at the entire code on my Github page, as I will be referring to some sections of it.

The figure below explains the workflow of the AI-powered virtual language tutor, which is designed to set up a real-time, voice-based conversational learning experience:

Chart of the pipeline — Image by the author
  • The user begins the conversation by initiating a recording of their speech, which is temporarily saved as a .wav file. Recording starts when the spacebar is pressed and held, and stops when the spacebar is released. The sections of the Python code that enable this press-and-talk functionality are explained below.

The following global variables are used to manage the state of the recording process:

recording = False       # Indicates whether the system is currently recording audio
done_recording = False  # Indicates that the user has completed recording a voice command
stop_recording = False  # Indicates that the user wants to exit the conversation
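The snippets below also pass audio blocks from the recording callback to the file-writing loop through a queue named q, which they assume is already defined. A minimal setup, using only the standard library, could look like this:

```python
import queue

# Thread-safe buffer: the sounddevice callback (producer) puts audio
# blocks in, and the file-writing loop (consumer) takes them out
q = queue.Queue()

# The producer and consumer never touch the same block at the same time
q.put(b'audio-block')
block = q.get()
```

queue.Queue handles the locking internally, which is why the callback thread and the writer loop can share it without extra synchronisation.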

The listen_for_keys function checks for key presses and releases, setting the global variables based on the state of the spacebar and the Esc key.

def listen_for_keys():
    # Function to listen for key presses to control recording
    global recording, done_recording, stop_recording
    while True:
        if keyboard.is_pressed('space'):  # Start recording on spacebar press
            stop_recording = False
            recording = True
            done_recording = False
        elif keyboard.is_pressed('esc'):  # Stop the conversation on Esc press
            stop_recording = True
        elif recording:  # Stop recording on spacebar release
            recording = False
            done_recording = True
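The flag transitions above can be exercised without a real keyboard by modelling them as a small state machine. The helper below is hypothetical (it is not part of the original code) and exists only to make the press-and-talk logic easy to reason about and test:

```python
# Hypothetical pure-Python model of the press-and-talk flag logic
def update_state(state, space_pressed, esc_pressed):
    """Return a new (recording, done_recording, stop_recording) tuple."""
    recording, done_recording, stop_recording = state
    if space_pressed:       # Spacebar held: (re)start recording
        return (True, False, False)
    elif esc_pressed:       # Esc: request exit from the conversation
        return (recording, done_recording, True)
    elif recording:         # Spacebar released while recording: finish
        return (False, True, stop_recording)
    return state

state = (False, False, False)
state = update_state(state, True, False)   # press and hold space
state = update_state(state, False, False)  # release space
# state is now (False, True, False): recording finished
```

Walking through a press-then-release sequence like this confirms that done_recording only becomes True after recording was True, which is exactly what the writer loop relies on.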

The callback function handles the audio data during recording. It checks the recording flag to determine whether to buffer the incoming audio data.

def callback(indata, frames, time, status):
    # Function called for each audio block during recording
    if recording:
        if status:
            print(status, file=sys.stderr)
        q.put(indata.copy())  # Buffer the block for the file-writing loop

The press2record function is the main function responsible for handling voice recording while the user presses and holds the spacebar.

It initialises the global variables that manage the recording state, determines the sample rate, and creates a temporary file to store the recorded audio.

The function then opens a SoundFile object to write the audio data and an InputStream object to capture the audio from the microphone, using the previously mentioned callback function. A thread is started to listen for key presses, specifically the spacebar for recording and the ‘esc’ key to stop. Inside a loop, the function checks the recording flag and writes the audio data to the file if recording is active. If the recording is stopped, the function returns -1; otherwise, it returns the filename of the recorded audio.

def press2record(filename, subtype, channels, samplerate):
    # Function to handle recording when a key is pressed
    global recording, done_recording, stop_recording
    stop_recording = False
    recording = False
    done_recording = False
    try:
        # Determine the samplerate if not provided
        if samplerate is None:
            device_info = sd.query_devices(None, 'input')
            samplerate = int(device_info['default_samplerate'])
        # Create a temporary filename if not provided
        if filename is None:
            filename = tempfile.mktemp(prefix='captured_audio',
                                       suffix='.wav', dir='')
        # Open the sound file for writing
        with sf.SoundFile(filename, mode='x', samplerate=samplerate,
                          channels=channels, subtype=subtype) as file:
            with sd.InputStream(samplerate=samplerate, device=None,
                                channels=channels, callback=callback,
                                blocksize=4096) as stream:
                print('press Spacebar to start recording, release to stop, or press Esc to exit')
                # Start the key listener on a separate thread
                listener_thread = threading.Thread(target=listen_for_keys)
                listener_thread.start()
                # Write the recorded audio to the file
                while not done_recording and not stop_recording:
                    while recording and not q.empty():
                        file.write(q.get())
        # Return -1 if recording was stopped with Esc
        if stop_recording:
            return -1
    except KeyboardInterrupt:
        print('Interrupted by user')

    return filename

Finally, the get_voice_command function calls press2record to record the user’s voice command.

def get_voice_command():
    # ...
    saved_file = press2record(filename="input_to_gpt.wav", subtype=args.subtype,
                              channels=args.channels, samplerate=args.samplerate)
    # ...
  • Having captured and saved the voice command in a temporary .wav file, we now enter the transcription phase. In this stage, the recorded audio is converted into text using Whisper. The corresponding script for running the transcription task on a .wav file is given below:
def get_voice_command():
    # ...
    result = audio_model.transcribe(saved_file, fp16=torch.cuda.is_available())
    # ...

This method takes two parameters: the path to the recorded audio file, saved_file, and an optional flag that enables FP16 precision when CUDA is available, enhancing performance on compatible hardware. It simply returns the transcribed text.

  • Then, the transcribed text is sent to ChatGPT to generate an appropriate response in the interact_with_tutor() function. The corresponding code segment is as follows:
def interact_with_tutor():
    # Define the system role to set the behaviour of the chat assistant
    messages = [
        {"role": "system", "content": "Du bist Anna, meine deutsche Lernpartnerin. "
            "Du wirst mit mir chatten. Deine Antworten werden kurz sein. "
            "Mein Niveau ist B1, stell deine Satzkomplexität auf mein Niveau ein. "
            "Versuche immer, mich zum Reden zu bringen, indem du Fragen stellst, "
            "und vertiefe den Chat immer."}
    ]
    while True:
        # Get the user's voice command
        command = get_voice_command()
        if command == -1:
            # Save the chat logs and exit if recording is stopped
            return "Chat has been stopped."

        # Add the user's command to the message history
        messages.append({"role": "user", "content": command})

        # Generate a response from the chat assistant
        completion = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",  # model name assumed; adjust to your setup
            messages=messages
        )

        # Extract the response from the completion
        chat_response = completion.choices[0].message.content
        print(f'ChatGPT: {chat_response}\n')  # Print the assistant's response
        messages.append({"role": "assistant", "content": chat_response})  # Add the assistant's response to the message history
        # ...

The interact_with_tutor function starts by defining the system role of ChatGPT to shape its behaviour throughout the conversation. Since my goal is to practise German, I set the system role accordingly. I named my virtual tutor “Anna” and specified my language proficiency level so that she adjusts her responses. Additionally, I instructed her to keep the conversation engaging by asking questions.

Next, the user’s transcribed voice command is appended to the message list with the role of “user.” This message is then sent to ChatGPT. As the conversation continues within a while loop, the entire history of user commands and GPT responses is logged in the messages list.
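The growing message history is just a list of role/content dictionaries. After one full exchange it might look like this (the user and assistant contents are illustrative, not from a real session):

```python
# Illustrative message history after one user turn and one assistant reply
messages = [
    {"role": "system", "content": "Du bist Anna, meine deutsche Lernpartnerin."},
    {"role": "user", "content": "Hallo Anna, wie geht es dir?"},
    {"role": "assistant", "content": "Mir geht es gut! Und dir?"},
]

# Each new turn simply appends to the same list, so the full context
# is re-sent to the API on every call
messages.append({"role": "user", "content": "Auch gut, danke."})
```

Because the whole list is sent on every request, the tutor keeps the conversational context, at the cost of the request growing with each turn.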

  • After each response from ChatGPT, we convert the text message into speech using gTTS.
def interact_with_tutor():
    # ...
    # Convert the text response to speech
    speech_object = gTTS(text=messages[-1]['content'], tld="de", lang=language, slow=False)
    speech_object.save("GPT_response.wav")
    audio_file = "GPT_response.wav"
    # Play the audio response
    play_wav_once(audio_file, args.samplerate, 1.0)
    os.remove(audio_file)  # Remove the temporary audio file

The gTTS() function takes four parameters: text, tld, lang, and slow. The text parameter is assigned the content of the last message in the messages list (indicated by [-1]), which is the text to be converted into speech. The tld parameter specifies the top-level domain for the Google Translate service; setting it to "de" means the German domain is used, which helps ensure that pronunciation and intonation are appropriate for German. The lang parameter specifies the language in which the text should be spoken; here, the language variable is set to 'de', so the text is spoken in German. Finally, the slow parameter controls the speed of the speech: False means the speech is spoken at normal speed, while True would make it slower.

  • The converted speech of ChatGPT response is then saved as a temporary .wav file, played back to the user, and then removed.
  • The interact_with_tutor function runs repeatedly as the user continues the conversation by pressing the spacebar again.
  • If the user presses Esc, the conversation ends and the entire conversation is saved to a pickle file, chat_log.pkl. You can use it later for some analysis.
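Persisting the history with pickle takes only a few lines. The sketch below uses the chat_log.pkl filename from the article; the sample messages are illustrative:

```python
import pickle

# Illustrative chat history (same structure as the messages list above)
messages = [
    {"role": "system", "content": "Du bist Anna, meine deutsche Lernpartnerin."},
    {"role": "user", "content": "Hallo!"},
]

# Persist the whole conversation for later analysis
with open("chat_log.pkl", "wb") as f:
    pickle.dump(messages, f)

# Reload it later, e.g. to review vocabulary or count turns
with open("chat_log.pkl", "rb") as f:
    restored = pickle.load(f)
```

Since the log is a plain list of dictionaries, it round-trips through pickle unchanged and can be fed straight back into a new session if you want to resume a conversation.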
