SeaVoice Python SDK

This is the tutorial about how to use SeaVoice Python SDK to try Seasalt.ai Speech-To-Text (STT) and Text-To-Speech (TTS) services.

Please contact info@seasalt.ai if you have any questions.

Speech-to-Text Example:

Prerequisites

You will need a SeaVoice speech service account to run this example. Please contact info@seasalt.ai and apply for it.

Install and import

To install SeaVoice SDK:

pip install seavoice-sdk

To import SeaVoice SDK:

import seavoice_sdk.speech as speechsdk

Recognition

In the example below, we show how to recognize speech from an audio file. You can also apply recognition to an audio stream.

Speech Configuration

Use the following code to create SpeechConfig (contact info@seasalt.ai for the speech service APIKEY):

speech_config = speechsdk.SpeechConfig(
    apikey=SEASALT_APIKEY,
    speech_recognition_language="zh-TW",
    speech_recognition_punctuation=True
)

speech_recognition_language could be zh-TW or en-US.

speech_recognition_punctuation is used to set whether recognition results will have punctuations.

Audio Configuration

Use the following code to create AudioConfig.

audio_format = speechsdk.audio.AudioStreamFormat(samples_per_second=16000, bits_per_sample=16, channels=1)
audio_stream = speechsdk.audio.PushAudioInputStream(audio_format)
audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)

Recognizer initialization

SpeechRecognizer can be initialzed as follows:

::
speech_recognizer = speechsdk.SpeechRecognizer(

speech_config=speech_config, audio_config=audio_config

)

Note

The SpeechRecognizer can initialzed with context biasing to boost the probability of certain key words or phrases.

  • "contexts": a json dict to boost certain hotwords and/or phrases for recognition, and optionally rewrite certain spoken forms to a specific written form. Each key is a word/phrase for context biasing; each corresponding value is an optional dict containing a key ‘rewrite’ which maps to a list of possible spoken forms that will be rewritten to the written form (the key). In the above example, the word “seasalt” will be boosted and all occurences of “sea salt” and “c salt” will be rewritten to the capitalised “Seasalt”. Also, if a certain sentence is expected, you can also boost the whole sentence, e.g. “Seasalt is an AI company”

::
contexts = {
“Seasalt”: {

“rewrite”: [“sea salt”, “c salt”]

}, “SeaVoice”: {

“rewrite”: [“c voice”]

}

}

speech_recognizer = speechsdk.SpeechRecognizer(

speech_config=speech_config, audio_config=audio_config contexts=contexts

)

Callbacks connection

SpeechRecognizer has 5 kinds of callbacks:

  • Recognizing - called when recognition is in progress.

  • Recognized - called when a single utterance is recognized.

  • Canceled - called when a continuous recognition is interrupted.

  • Session_started - called when a recognition session is started.

  • Session_stopped - called when a recognition session is stopped.

To connect the callbacks:

speech_recognizer.recognizing.connect(
    lambda evt: print(f"Recognizing: {evt.result.text}"))
speech_recognizer.recognized.connect(
    lambda evt: print(f'Recognized: {evt.result.text}'))
speech_recognizer.canceled.connect(
    lambda evt: print(f'Canceled: {evt}'))
speech_recognizer.session_started.connect(
    lambda evt: print(f'Session_started: {evt}'))
speech_recognizer.session_stopped.connect(
    lambda evt: print(f'Session_stopped: {evt}'))

Recognizing speech

Now it is ready to run SpeechRecognizer. SpeechRecognizer has two ways for speech recognition:

  • Single-shot recognition - Performs recognition once. This is to recognize a single audio file. It stops recognition after a single utterance is recognized.

  • Continuous recognition (async) - Asynchronously initiates continuous recognition on an audio stream. Recognition results are available through callback functions. To stop the continuous recognition, call stop_continuous_recognition_async().

speech_recognizer.start_continuous_recognition_async()
# Code commented out is for Single-shot recognition.
# speech_recognizer.recognize_once()

Putting everything together

Now, put everything together and run the example:

import seavoice_sdk.speech as speechsdk
import time

SEASALT_APIKEY = "xxxxxxxxx"
speech_config = speechsdk.SpeechConfig(
    apikey=SEASALT_APIKEY,
    speech_recognition_language="zh-TW",
    speech_recognition_punctuation=True
)
audio_format = speechsdk.audio.AudioStreamFormat(samples_per_second=16000, bits_per_sample=16, channels=1)
audio_stream = speechsdk.audio.PushAudioInputStream(audio_format)
audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)

speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

done = False
def stopped_handler(evt):
    global done
    print(f'Session_stopped: {evt}')
    done = True

speech_recognizer.recognizing.connect(
    lambda evt: print(f"Recognizing: {evt.result.text}"))
speech_recognizer.recognized.connect(
    lambda evt: print(f'Recognized: {evt.result.text}'))
speech_recognizer.canceled.connect(
    lambda evt: print(f'Canceled: {evt}'))
speech_recognizer.session_started.connect(
    lambda evt: print(f'Session_started: {evt}'))
speech_recognizer.session_stopped.connect(stopped_handler)

speech_recognizer.start_continuous_recognition_async()
with open("test.wav", "rb") as audio_bytes:
    audio_stream.write(audio_bytes.read())
    audio_stream.write("EOS".encode('utf-8'))
while not done:
    time.sleep(3)
speech_recognizer.stop_continuous_recognition()
print("Finished recognizing")

Text-to-Speech Example:

Prerequisites

You will need a SeaVoice speech service account to run this example. Please contact info@seasalt.ai and apply for it.

Install and import

To install SeaVoice SDK:

pip install seavoice-sdk

To import SeaVoice SDK:

import seavoice_sdk.speech as speechsdk

Synthesis

In the example below, we show how to synthesize text to generate an audio file. You can also receive synthesis results from an audio stream.

Speech Configuration

Use the following code to create SpeechConfig (contact info@seasalt.ai for the speech service account):

speech_config = speechsdk.SpeechConfig(
    account_id=SEASALT_ACCOUNT,
    password=PASSWORD,
    speech_synthesis_language="en-US",
    speech_synthesis_voice_name="TomHanks",
    speech_synthesis_output_format_id="riff-22khz-16bit-mono-pcm",
    speech_synthesis_output_pitch=0.0,
    speech_synthesis_output_speed=1.0
)

Options for speech_synthesis_language could be zh-TW, en-US or en-GB.

For zh-TW, speech_synthesis_voice_name could be Tongtong or Vivian.

For en-US, speech_synthesis_voice_name could be TomHanks, ReeseWitherspoon or AnneHathaway. For en-GB, speech_synthesis_voice_name could be DavidAttenborough.

Options for speech_synthesis_output_format_id could be riff-22khz-16bit-mono-pcm, riff-16khz-16bit-mono-pcm or riff-8khz-16bit-mono-pcm.

speech_synthesis_output_pitch could be a value between -12.0 and 12.0, where 0.0 is the default/normal value.

speech_synthesis_output_speed could be a value between 0.5 and 2.0, where 1.0 is the default/normal value.

Audio Configuration

Use the following code to create AudioOutputConfig.

import seavoice_sdk.audio as audio
# Code commented out is an example for receiving synthesis results from an audio stream.
# audio_stream = audio.AudioOutputStream()
# audio_config = audio.AudioOutputConfig(stream=audio_stream)
audio_config = audio.AudioOutputConfig(filename="output.wav")

Synthesizer initialization

Synthesizer can be initialzed as follows:

speech_synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=audio_config
)

Callbacks connection

SpeechSynthesizer has 4 kinds of callbacks:

  • Synthesis_started - called when synthesis is started.

  • Synthesizing - called when each time part of synthesis result is given.

  • Synthesis_completed - called when all text was synthesized.

  • Synthesis_canceled - called when synthesis is interrupted.

To connect the callbacks:

speech_synthesizer.synthesis_started.connect(
    lambda : print("synthesis started"))
speech_synthesizer.synthesizing.connect(
    lambda audio_data: print("synthesizing"))
speech_synthesizer.synthesis_completed.connect(
    lambda audio_data: print("synthesis completed"))
speech_synthesizer.synthesis_canceled.connect(
    lambda : print("synthesis canceled"))

Synthesizing text

Now it is ready to run SpeechSynthesizer. There are two ways to run SpeechSynthesizer:

  • Synchronized - Perform synthesis until got all result.

  • Asynchronized - Start synthesis and return a speechsdk.ResultFuture, which you could call its get() function to wait and get synthesis result.

    # Code commented out is for synchronized synthesis
    # result = speech_synthesizer.speak_text("Input your text to synthesize here.")
    result = speech_synthesizer.speak_text_async("Input your text to synthesize here.").get()
    # Code commented out is an example for reading synthesis result from an audio stream.
    # audio_data = audio_stream.read()
    

Judge result reason –> Check result

Both the synchronized and asynchronized methods return a speechsdk.SpeechSynthesisResult object, which indicates if synthesis was completed successfully:

if result.reason == speechsdk.ResultReason.ResultReason_SynthesizingAudioCompleted:
    print("finished speech synthesizing")

Putting everything together

Now, put everything together and run the example:

from seavoice_sdk import speech as speechsdk
from seavoice_sdk import audio as audio

if __name__ == "__main__":
    SEASALT_ACCOUNT = "xxxxxxxxx"
    PASSWORD = "xxxxxxxx"
    speech_config = speechsdk.SpeechConfig(
        account_id=SEASALT_ACCOUNT,
        password=PASSWORD,
        speech_synthesis_language="en-US",
        speech_synthesis_voice_name="TomHanks",
        speech_synthesis_output_format_id="riff-22khz-16bit-mono-pcm",
        speech_synthesis_output_pitch=0.0,
        speech_synthesis_output_speed=1.0
    )
    audio_config = audio.AudioOutputConfig(filename="output.wav")
    speech_synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config,
        audio_config=audio_config
    )
    speech_synthesizer.synthesis_started.connect(
        lambda : print("synthesis started"))
    speech_synthesizer.synthesizing.connect(
        lambda audio_data: print("synthesizing"))
    speech_synthesizer.synthesis_completed.connect(
        lambda audio_data: print("synthesis completed"))
    speech_synthesizer.synthesis_canceled.connect(
        lambda : print("synthesis canceled"))

    # result = speech_synthesizer.speak_text("Seasalt.ai is a service company focusing on multi-modal AI solutions.")
    result = speech_synthesizer.speak_text_async("Seasalt.ai is a service company focusing on multi-modal AI solutions.").get()

    if result.reason == speechsdk.ResultReason.ResultReason_SynthesizingAudioCompleted:
        print("finished speech synthesizing")

Change Log

[0.2.2] - 2021-8-16

Bugfixes

  • Some callbacks were never called

[0.2.1] - 2021-7-25

changed sdk name to seavoice

[0.1.14] - 2021-4-9

Improments

  • Added output of post-processing result

[0.1.13] - 2021-4-1

Improments

  • Added output of segment and word alignment information

[0.1.12] - 2020-12-10

Bugfixes

  • Remove unused variable

Improvements

  • Added websocket packages in requirements.txt file