SeaVoice Python SDK

This tutorial shows how to use the SeaVoice Python SDK to try the Seasalt.ai Speech-To-Text (STT) and Text-To-Speech (TTS) services.

Please contact info@seasalt.ai if you have any questions.

Prerequisites 

You will need a SeaVoice speech service account to run the following examples. Please contact info@seasalt.ai to apply for the SEAVOICE_TOKEN.

Speech-to-Text Example:

Install and import 

To install SeaVoice SDK:

pip install seavoice-sdk-test

To import SeaVoice SDK:

from seavoice_sdk_beta import LanguageCode, SpeechRecognizer

Recognition 

In the example below, we show how to recognize speech from an audio file. You can also apply recognition to an audio stream.

Speech Configuration 

Use the following code to create a SpeechRecognizer:

recognizer = SpeechRecognizer(
    token=SEAVOICE_TOKEN,
    language=LanguageCode.EN_US,
    sample_rate=16000,
    sample_width=2,
    enable_itn=True,
    contexts={},
    context_score=0
)

Note

language: Input audio language. Choose from LanguageCode.ZH_TW and LanguageCode.EN_US.
enable_itn: Whether to run Inverse Text Normalisation (ITN), which adds punctuation and outputs the written form instead of the spoken form, e.g. “Mr.” instead of “mister”.
contexts: A JSON-style dict that boosts certain hotwords and/or phrases during recognition and optionally rewrites certain spoken forms to a specific written form. Each key is a word or phrase for context biasing. Each corresponding value is an optional dict containing a key ‘rewrite’, which maps to a list of possible spoken forms that will be rewritten to the written form (the key). In the example below, the word “Seasalt” is boosted, and all occurrences of “sea salt” and “c salt” are rewritten to the capitalised “Seasalt”. If a whole sentence is expected, you can boost it as well, e.g. “Seasalt is an AI company”.
contexts = { "Seasalt": { "rewrite": ["sea salt", "c salt"] }, "SeaVoice": { "rewrite": ["c voice"] } }
context_score: The strength of the contexts provided above. We recommend starting with a score of 2.0 and adjusting from there.
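
To make the rewrite semantics concrete, here is a minimal local sketch of the substitution that the contexts description implies for the transcript text. It is not SDK code and does not reproduce the server-side hotword boosting. apply_rewrites is a hypothetical helper written only for illustration, and the whole-word, case-insensitive matching is an assumption.

```python
import re

# Same shape as the `contexts` argument described in the note.
contexts = {
    "Seasalt": {"rewrite": ["sea salt", "c salt"]},
    "SeaVoice": {"rewrite": ["c voice"]},
}


def apply_rewrites(text: str, contexts: dict) -> str:
    """Hypothetical helper: replace each spoken form with its written form (the key)."""
    for written, options in contexts.items():
        # The value is optional, so tolerate None or an empty dict.
        for spoken in (options or {}).get("rewrite", []):
            # Assumption: match whole words only, ignoring case.
            pattern = r"\b" + re.escape(spoken) + r"\b"
            text = re.sub(pattern, lambda _m, w=written: w, text, flags=re.IGNORECASE)
    return text


print(apply_rewrites("sea salt built c voice", contexts))
```

Running this prints "Seasalt built SeaVoice". The real service applies the rewrite during recognition, so you never need to call anything like this yourself.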

Recognizing speech 

Now we use the recognizer to send audio at audio_path for recognition.

async with recognizer:
    async def _send():
        with wave.open(audio_path, mode="rb") as audio_file:
            frames = audio_file.readframes(frames_sent_per_command)
            while frames:
                await recognizer.send(frames)
                frames = audio_file.readframes(frames_sent_per_command)
        await recognizer.finish()

    asyncio.create_task(_send())
    async for event in recognizer.stream():
        print(event)

Note

frames_sent_per_command: the number of frames sent in each chunk. When testing with a local audio file, you can add asyncio.sleep() after each send, scaled to the number of frames per chunk, to mimic a streaming setting.
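
One way to pace a local file like a live stream is to sleep for the amount of audio each chunk contains, i.e. the number of frames divided by the sample rate. The sketch below uses only the standard library. chunk_duration and paced_chunks are hypothetical helper names, not part of the SDK.

```python
import asyncio
import wave


def chunk_duration(frames_per_chunk: int, sample_rate: int) -> float:
    """Seconds of audio contained in one chunk of `frames_per_chunk` frames."""
    return frames_per_chunk / sample_rate


async def paced_chunks(audio_path: str, frames_per_chunk: int = 150):
    """Yield raw frames from a WAV file, sleeping one chunk's duration between reads."""
    with wave.open(audio_path, mode="rb") as audio_file:
        delay = chunk_duration(frames_per_chunk, audio_file.getframerate())
        frames = audio_file.readframes(frames_per_chunk)
        while frames:
            yield frames
            # Wait roughly as long as the chunk would take to be spoken.
            await asyncio.sleep(delay)
            frames = audio_file.readframes(frames_per_chunk)
```

Inside _send you could then write "async for frames in paced_chunks(audio_path): await recognizer.send(frames)" in place of the manual read loop. With 150 frames at 16000 Hz, each chunk covers about 9.4 ms of audio.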

There are three types of events from the recognizer:

InfoEvent : contains the recognition status, which is one of SpeechStatus.BEGIN, SpeechStatus.END, or SpeechStatus.ERROR
RecognizingEvent : contains the following information
- text: this is the partial transcription that might change in the RecognizedEvent
- segment_id: 0-based index of this recognizing segment.
- voice_start_time: timestamp in seconds of the start of this segment relative to the start of the audio.
- word_alignments: a list of WordAlignment objects containing the start timestamp of each word relative to the start of the audio.
RecognizedEvent : similar to the RecognizingEvent with an additional duration in seconds for this recognized segment.

Here are some examples of the events:

InfoEvent(payload=InfoEventPayload(status='begin'))
RecognizingEvent(payload=RecognizingEventPayload(segment_id=4, text=' how much', voice_start_time=29.17, word_alignments=[WordAlignment(word='how', start=29.169998919963838, length=1), WordAlignment(word='much', start=29.32999891638756, length=1)]))
RecognizedEvent(payload=RecognizedEventPayload(segment_id=4, text=' How much it was? ', voice_start_time=29.17, word_alignments=[WordAlignment(word='How', start=29.169998919963838, length=1), WordAlignment(word='much', start=29.32999891638756, length=1), WordAlignment(word='it', start=29.609998917579652, length=1), WordAlignment(word='was?', start=29.68999890089035, length=1)], duration=0.67))
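
Since partial results can change, a common pattern is to keep only the RecognizedEvent text for each segment. The sketch below illustrates that filtering with local stand-in classes that mirror only the fields shown in the sample events. They are not the SDK classes, SegmentPayload and final_segments are hypothetical names, and computing the segment end as voice_start_time + duration is an assumption based on the field descriptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class SegmentPayload:
    """Stand-in payload with a subset of the fields shown in the sample events."""
    segment_id: int
    text: str
    voice_start_time: float
    duration: float = 0.0


@dataclass
class RecognizingEvent:
    """Stand-in for a partial result."""
    payload: SegmentPayload


@dataclass
class RecognizedEvent:
    """Stand-in for a final result."""
    payload: SegmentPayload


def final_segments(events: Iterable[object]) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) for each final segment, skipping partial results."""
    segments = []
    for event in events:
        if isinstance(event, RecognizedEvent):
            p = event.payload
            # Assumption: the segment ends at start time plus duration.
            segments.append((p.voice_start_time, p.voice_start_time + p.duration, p.text.strip()))
    return segments


events = [
    RecognizingEvent(SegmentPayload(4, " how much", 29.17)),
    RecognizedEvent(SegmentPayload(4, " How much it was? ", 29.17, duration=0.67)),
]
print(final_segments(events))
```

With the real SDK, the same isinstance check would be applied to the events yielded by recognizer.stream().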

Putting everything together 

Now, put everything together and run the example:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import asyncio
import wave
import argparse

from seavoice_sdk_beta import LanguageCode, SpeechRecognizer
from seavoice_sdk_beta.events import InfoEvent

async def recognize(
    audio_path: str,
    language: LanguageCode,
    sample_rate: int,
    sample_width: int = 2,
    frames_sent_per_command: int = 150,
):
    seavoice_token = os.getenv("SEAVOICE_TOKEN", None)
    assert seavoice_token, "SEAVOICE_TOKEN is not set."
    recognizer = SpeechRecognizer(
        token=seavoice_token,
        language=language,
        sample_rate=sample_rate,
        sample_width=sample_width,
        enable_itn=True,
        contexts={},
        context_score=0
    )

    async with recognizer:
        async def _send():
            with wave.open(audio_path, mode="rb") as audio_file:
                frames = audio_file.readframes(frames_sent_per_command)
                while frames:
                    await recognizer.send(frames)
                    frames = audio_file.readframes(frames_sent_per_command)

            await recognizer.finish()

        asyncio.create_task(_send())
        async for event in recognizer.stream():
            if isinstance(event, InfoEvent):
                print(f"{type(event).__name__}: status {event.payload.status}")
            else:
                print(f"{type(event).__name__}: {event.payload.text}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--audio", type=str, required=True,
                        help="path to the audio file to be recognized")
    parser.add_argument("--sample-rate", type=int, required=True,
                        help="sample rate of the audio")
    parser.add_argument("--language", type=str, required=True,
                        help="language of the provided audio, choose from 'en' and 'zh'")
    args = parser.parse_args()
    if args.language == "en":
        lang = LanguageCode.EN_US
    elif args.language == "zh":
        lang = LanguageCode.ZH_TW
    else:
        raise Exception("for 'language', choose from 'en', 'zh'")

    asyncio.run(recognize(args.audio, language=lang, sample_rate=args.sample_rate))

Text-to-Speech Example:

Install and import 

To install SeaVoice SDK:

pip install seavoice-sdk-test

To import SeaVoice SDK:

import seavoice_sdk_beta as speechsdk

Synthesis 

In the example below, we show how to synthesize text to generate an audio file. You can also receive the synthesis results as an audio stream.

Voice Configuration 

Use the following code to create SpeechSynthesizer and SynthesisSettings (contact info@seasalt.ai for the SEAVOICE_TOKEN):

from seavoice_sdk_beta import SpeechSynthesizer, LanguageCode, Voice
from seavoice_sdk_beta.commands import SynthesisSettings

synthesizer = SpeechSynthesizer(
    token=SEAVOICE_TOKEN,
    language=LanguageCode.EN_US,
    sample_rate=22050,
    voice=Voice.TOM,
)

settings = SynthesisSettings(
    pitch=0.0,
    speed=1.0,
    volume=50.0,
    rules="Elon | eelon\nX Æ A12 | x ash ay twelve",
    sample_rate=22050,
)

Note

language: choose from LanguageCode.ZH_TW, LanguageCode.EN_US, LanguageCode.EN_GB
voice: voice options of the synthesized audio
- ZH_TW : choose from Voice.TONGTONG, Voice.VIVIAN
- EN_US : choose from Voice.ROBERT, Voice.TOM, Voice.MIKE, Voice.ANNE, Voice.LISSA, Voice.MOXIE, Voice.REESE
- EN_GB : choose from Voice.DAVID
pitch : adjusts the pitch of the synthesized voice. Choose a value between -12.0 and 12.0, where 0.0 is the default/normal value. Positive values raise the pitch and negative values lower it.
speed : adjusts the speed of the synthesized voice. Choose a value between 0.5 and 2.0, where 1.0 is the default/normal value. Values > 1.0 speed up the speech and values < 1.0 slow it down.
volume : adjusts the volume of the synthesized voice. Choose a value between 0.0 and 100.0, where 50.0 is the default/normal value. Values > 50.0 increase the volume and values < 50.0 decrease it.
rules : specifies pronunciation rules for special word representations. Pass a string in the format <WORD1> | <PRONUNCIATION1>\n<WORD2> | <PRONUNCIATION2>, where \n is the delimiter between rules.
- ZH_TW : pronunciation can be specified in zhuyin, pinyin, or Chinese characters, e.g. “TSMC | 台積電\n你好 | ㄋㄧˇ ㄏㄠˇ\n為了 | wei4 le5”
- EN_US and EN_GB : pronunciation can be specified with English words, e.g. “XÆA12 | ex ash ay twelve\nSideræl | psydeereal”
sample_rate: make sure the sample rate matches the sample rate setting for the output audio file.
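
The documented ranges and the rules format are easy to get wrong by hand, so a small local check can help before the values are passed to SynthesisSettings. This is a sketch using only plain Python. check_range and build_rules are hypothetical helpers, not SDK functions, and treating the range endpoints as inclusive is an assumption.

```python
from typing import Dict

# Ranges as documented in the note for pitch, speed and volume.
RANGES = {"pitch": (-12.0, 12.0), "speed": (0.5, 2.0), "volume": (0.0, 100.0)}


def check_range(name: str, value: float) -> float:
    """Return `value` unchanged if it lies in the documented range, else raise."""
    low, high = RANGES[name]
    # Assumption: both endpoints are allowed.
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value


def build_rules(pronunciations: Dict[str, str]) -> str:
    """Join word/pronunciation pairs as '<WORD> | <PRONUNCIATION>' separated by newlines."""
    return "\n".join(f"{word} | {spoken}" for word, spoken in pronunciations.items())


rules = build_rules({"Elon": "eelon", "X Æ A12": "x ash ay twelve"})
print(repr(rules))
```

The resulting string equals the rules value used in the settings example, so it could be passed as rules=rules, with pitch=check_range("pitch", 0.0) and so on for the numeric fields.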

Text Configuration 

Use the following code to create SynthesisData :

from seavoice_sdk_beta.commands import SynthesisData

data = SynthesisData(
    text="Good morning, today's date is<say-as interpret-as='date' format='m/d/Y'>10/11/2022</say-as>",
    ssml=True,
)

Note

text is the text to be synthesized. Set ssml to True if text is an SSML string, i.e. it uses SSML tags. See Supported SSML Tags for more info.
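
When ssml is True, an unclosed or mistyped tag is an easy mistake to make. As a rough local sanity check, you can test whether the string parses as XML once it is wrapped in a single root element. This sketch uses only the standard library. is_well_formed is a hypothetical helper, the root wrapper element is only there for parsing, and the check says nothing about which tags the service actually supports.

```python
import xml.etree.ElementTree as ET


def is_well_formed(ssml_text: str) -> bool:
    """Return True if `ssml_text` parses as XML when wrapped in one root element."""
    try:
        # The wrapper is needed because the text mixes plain words and tags.
        ET.fromstring(f"<root>{ssml_text}</root>")
        return True
    except ET.ParseError:
        return False


text = "Good morning, today's date is<say-as interpret-as='date' format='m/d/Y'>10/11/2022</say-as>"
print(is_well_formed(text))
```

A string with a missing closing tag, such as "date is<say-as interpret-as='date'>10/11/2022", fails the check.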

Output File Configuration 

Use the following code to initialise the synthesized output audio:

import wave

f = wave.open("output.wav", "w")
f.setnchannels(1)
f.setsampwidth(2)
f.setframerate(22050)
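
After synthesis has finished and the file is closed, it can be useful to confirm that the WAV header matches the settings used above. The sketch below reads the header back with the standard wave module. wav_info is a hypothetical helper name.

```python
import wave
from typing import Tuple


def wav_info(path: str) -> Tuple[int, int, int, float]:
    """Return (channels, sample width in bytes, sample rate, duration in seconds)."""
    with wave.open(path, "rb") as audio_file:
        rate = audio_file.getframerate()
        duration = audio_file.getnframes() / rate
        return audio_file.getnchannels(), audio_file.getsampwidth(), rate, duration
```

For the configuration shown above, wav_info("output.wav") should start with (1, 2, 22050, ...). A different sample rate would suggest that the file settings and the synthesis settings disagree.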

Synthesis Command 

Use the following code to create SynthesisCommand using the SynthesisData and SynthesisSettings from previous steps:

from seavoice_sdk_beta.commands import SynthesisCommand, SynthesisPayload

command = SynthesisCommand(
    payload=SynthesisPayload(
        data=data,
        settings=settings,
    )
)

Putting everything together 

Now, put everything together and run the example:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import asyncio
import wave

from seavoice_sdk_beta import LanguageCode, SpeechSynthesizer, Voice
from seavoice_sdk_beta.commands import SynthesisCommand, SynthesisData, SynthesisPayload, SynthesisSettings
from seavoice_sdk_beta.events import AudioDataEvent

SAMPLE_RATE: int = 8000

async def synthesize():
    seavoice_token = os.getenv("SEAVOICE_TOKEN", None)
    assert seavoice_token, "SEAVOICE_TOKEN is not set."
    synthesizer = SpeechSynthesizer(
        token=seavoice_token,
        language=LanguageCode.ZH_TW,
        sample_rate=SAMPLE_RATE,
        voice=Voice.TONGTONG,
    )

    data = SynthesisData(
        text="Good morning, today's date is<say-as interpret-as='date' format='m/d/Y'>10/11/2022</say-as>",
        ssml=True,
    )

    settings = SynthesisSettings(
        pitch=0.0,
        speed=0.9,
        volume=100.0,
        rules="SeaX | sea x",
        sample_rate=SAMPLE_RATE,
    )

    command = SynthesisCommand(
        payload=SynthesisPayload(
            data=data,
            settings=settings,
        )
    )

    f = wave.open("output.wav", "w")
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SAMPLE_RATE)

    async with synthesizer:
        async def _send():
            await synthesizer.send(command)

        asyncio.create_task(_send())
        async for message in synthesizer.stream():
            print(message)
            if isinstance(message, AudioDataEvent):
                f.writeframes(message.payload.audio)

    # Close the file so the WAV data is flushed to disk.
    f.close()


if __name__ == "__main__":
    asyncio.run(synthesize())

Change Log 

[0.2.3] - 2022-9-23

Improvements

Add reconnection mechanism

[0.2.2] - 2021-8-16

Bugfixes

Some callbacks were never called

[0.2.1] - 2021-7-25

Changed SDK name to seavoice

[0.1.14] - 2021-4-9

Improvements

Added output of post-processing result

[0.1.13] - 2021-4-1

Improvements

Added output of segment and word alignment information

[0.1.12] - 2020-12-10

Bugfixes

Remove unused variable

Improvements

Added websocket packages in requirements.txt file