SeaVoice Python SDK
This tutorial shows how to use the SeaVoice Python SDK to try Seasalt.ai's Speech-To-Text (STT) and Text-To-Speech (TTS) services.
Please contact info@seasalt.ai if you have any questions.
Prerequisites
You will need a SeaVoice speech service account to run the following examples. Please contact info@seasalt.ai to apply for the SEAVOICE_TOKEN.
Speech-to-Text Example:
Install and import
To install SeaVoice SDK:
pip install seavoice-sdk-test
To import SeaVoice SDK:
import seavoice_sdk_beta.speech as speechsdk
Recognition
In the example below, we show how to recognize speech from an audio file. You can also apply recognition to an audio stream.
Speech Configuration
Use the following code to create a SpeechRecognizer:
recognizer = SpeechRecognizer(
    token=SEAVOICE_TOKEN,
    language=LanguageCode.EN_US,
    sample_rate=16000,
    sample_width=2,
    enable_itn=True,
    contexts={},
    context_score=0
)
Note
language: Input audio language. Choose from LanguageCode.ZH_TW, LanguageCode.EN_US.
enable_itn: Whether to run Inverse Text Normalisation (ITN) to add punctuation and output the written form instead of the spoken form, e.g. output Mr. instead of mister.
contexts: A JSON dict to boost certain hotwords and/or phrases for recognition, and optionally rewrite certain spoken forms to a specific written form. Each key is a word/phrase for context biasing; each corresponding value is an optional dict containing a key "rewrite" which maps to a list of possible spoken forms that will be rewritten to the written form (the key). In the example below, the word “seasalt” will be boosted and all occurrences of “sea salt” and “c salt” will be rewritten to the capitalised “Seasalt”. If a certain sentence is expected, you can also boost the whole sentence, e.g. “Seasalt is an AI company”.
contexts = {
    "Seasalt": {"rewrite": ["sea salt", "c salt"]},
    "SeaVoice": {"rewrite": ["c voice"]}
}
context_score: The strength of the contexts provided above. We recommend starting with a score of 2.0 and adjusting from there.
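To make the rewrite semantics of contexts concrete, here is a small local sketch. It does not call the SeaVoice service and is not how the service implements biasing; the helper apply_rewrites is a hypothetical name used only for illustration, and it simply replaces each listed spoken form with its written-form key.

```python
import re

# Same shape as the `contexts` argument described above.
contexts = {
    "Seasalt": {"rewrite": ["sea salt", "c salt"]},
    "SeaVoice": {"rewrite": ["c voice"]},
}


def apply_rewrites(text: str, contexts: dict) -> str:
    """Local illustration only: swap each spoken form for its written form."""
    for written, options in contexts.items():
        # The value is optional, so tolerate None or a dict without "rewrite".
        for spoken in (options or {}).get("rewrite", []):
            pattern = r"\b" + re.escape(spoken) + r"\b"
            text = re.sub(pattern, written, text, flags=re.IGNORECASE)
    return text


print(apply_rewrites("sea salt built c voice", contexts))
```

Keys with no rewrite list (for example a whole expected sentence) are left untouched by this helper, which mirrors the idea that such entries only boost recognition.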
Recognizing speech
Now we use the recognizer to send the audio at audio_path for recognition.
async with recognizer:
    async def _send():
        with wave.open(audio_path, mode="rb") as audio_file:
            frames = audio_file.readframes(frames_sent_per_command)
            while frames:
                await recognizer.send(frames)
                frames = audio_file.readframes(frames_sent_per_command)
        await recognizer.finish()

    asyncio.create_task(_send())
    async for event in recognizer.stream():
        print(event)
Note
frames_sent_per_command: the number of audio frames read and sent in each command. When testing with a local audio file, you can add asyncio.sleep() between sends, scaled to the number of frames in each chunk, to mimic a streaming setting.
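One way to pace a local file like a live stream is to sleep for the playback time of each chunk. The sketch below uses only the standard library; the helper names chunk_seconds and paced_chunks are hypothetical and not part of the SDK, and real-time pacing is an assumption about what "streaming setting" should look like.

```python
import asyncio
import wave


def chunk_seconds(frames_per_chunk: int, sample_rate: int) -> float:
    """Playback time covered by one chunk of audio frames."""
    return frames_per_chunk / sample_rate


async def paced_chunks(audio_path: str, frames_per_chunk: int):
    """Yield raw PCM chunks from a WAV file at roughly real-time speed."""
    with wave.open(audio_path, mode="rb") as audio_file:
        delay = chunk_seconds(frames_per_chunk, audio_file.getframerate())
        frames = audio_file.readframes(frames_per_chunk)
        while frames:
            yield frames
            # Wait as long as the chunk would take to play back.
            await asyncio.sleep(delay)
            frames = audio_file.readframes(frames_per_chunk)
```

Inside _send, the read loop could then become `async for frames in paced_chunks(audio_path, frames_sent_per_command): await recognizer.send(frames)`.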
There are three types of events from the recognizer:
InfoEvent: contains the recognition status, one of SpeechStatus.BEGIN, SpeechStatus.END, SpeechStatus.ERROR.
RecognizingEvent: contains the following information:
    text: the partial transcription, which might change in the RecognizedEvent.
    segment_id: 0-based index of this recognizing segment.
    voice_start_time: timestamp in seconds of the start of this segment relative to the start of the audio.
    word_alignments: a list of WordAlignment objects containing the start timestamp of each word relative to the start of the audio.
RecognizedEvent: similar to the RecognizingEvent, with an additional duration in seconds for this recognized segment.
Here are some examples of the events:
InfoEvent(payload=InfoEventPayload(status='begin'))
RecognizingEvent(payload=RecognizingEventPayload(segment_id=4, text=' how much', voice_start_time=29.17, word_alignments=[WordAlignment(word='how', start=29.169998919963838, length=1), WordAlignment(word='much', start=29.32999891638756, length=1)]))
RecognizedEvent(payload=RecognizedEventPayload(segment_id=4, text=' How much it was? ', voice_start_time=29.17, word_alignments=[WordAlignment(word='How', start=29.169998919963838, length=1), WordAlignment(word='much', start=29.32999891638756, length=1), WordAlignment(word='it', start=29.609998917579652, length=1), WordAlignment(word='was?', start=29.68999890089035, length=1)], duration=0.67))
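Because RecognizingEvent text is partial and may change, a common pattern is to keep only the text of RecognizedEvent per segment. The sketch below shows that idea with stand-in dataclasses so it runs without the SDK; the classes here are simplified assumptions that carry only segment_id and text, and final_transcript is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Payload:
    """Stand-in for the SDK payloads; only the fields used here."""
    segment_id: int
    text: str


@dataclass
class RecognizingEvent:  # stand-in for the SDK class of the same name
    payload: Payload


@dataclass
class RecognizedEvent:  # stand-in for the SDK class of the same name
    payload: Payload


def final_transcript(events) -> str:
    """Join the text of finalized segments, ordered by segment_id."""
    finals = {}
    for event in events:
        # Partial (recognizing) results are skipped; they may still change.
        if isinstance(event, RecognizedEvent):
            finals[event.payload.segment_id] = event.payload.text.strip()
    return " ".join(finals[i] for i in sorted(finals))


events = [
    RecognizingEvent(Payload(segment_id=4, text=" how much")),
    RecognizedEvent(Payload(segment_id=4, text=" How much it was? ")),
]
print(final_transcript(events))
```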
Putting everything together
Now, put everything together and run the example:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import asyncio
import wave
import argparse

from seavoice_sdk_beta import LanguageCode, SpeechRecognizer
from seavoice_sdk_beta.events import InfoEvent


async def recognize(
    audio_path: str,
    language: LanguageCode,
    sample_rate: int,
    sample_width: int = 2,
    frames_sent_per_command: int = 150,
):
    seavoice_token = os.getenv("SEAVOICE_TOKEN", None)
    assert seavoice_token, "SEAVOICE_TOKEN is not set."
    recognizer = SpeechRecognizer(
        token=seavoice_token,
        language=language,
        sample_rate=sample_rate,
        sample_width=sample_width,
        enable_itn=True,
        contexts={},
        context_score=0
    )
    async with recognizer:
        async def _send():
            # Read the file chunk by chunk and send each chunk to the recognizer.
            with wave.open(audio_path, mode="rb") as audio_file:
                frames = audio_file.readframes(frames_sent_per_command)
                while frames:
                    await recognizer.send(frames)
                    frames = audio_file.readframes(frames_sent_per_command)
            await recognizer.finish()

        asyncio.create_task(_send())
        async for event in recognizer.stream():
            if isinstance(event, InfoEvent):
                print(f"{type(event).__name__}: status {event.payload.status}")
            else:
                print(f"{type(event).__name__}: {event.payload.text}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--audio", type=str, required=True,
                        help="path to the audio file to be recognized")
    parser.add_argument("--sample-rate", type=int, required=True,
                        help="sample rate of the audio")
    parser.add_argument("--language", type=str, required=True,
                        help="language of the provided audio, choose from 'en' and 'zh'")
    args = parser.parse_args()
    if args.language == "en":
        lang = LanguageCode.EN_US
    elif args.language == "zh":
        lang = LanguageCode.ZH_TW
    else:
        raise Exception("for 'language', choose from 'en', 'zh'")
    asyncio.run(recognize(args.audio, language=lang, sample_rate=args.sample_rate))
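The script above takes the sample rate from the command line, so a mismatch with the actual file is possible. As a possible safeguard, the values can be read from the WAV header with the standard wave module. This is a minimal sketch; wav_params is a hypothetical helper, not an SDK function.

```python
import wave


def wav_params(audio_path: str) -> tuple:
    """Return (sample_rate, sample_width, channels) read from a WAV header."""
    with wave.open(audio_path, mode="rb") as audio_file:
        return (
            audio_file.getframerate(),
            audio_file.getsampwidth(),
            audio_file.getnchannels(),
        )
```

For example, `sample_rate, sample_width, _ = wav_params(args.audio)` could replace or cross-check the --sample-rate argument before calling recognize.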
Text-to-Speech Example:
Install and import
To install SeaVoice SDK:
pip install seavoice-sdk-test
To import SeaVoice SDK:
import seavoice_sdk_beta as speechsdk
Synthesis
In the example below, we show how to synthesize text to generate an audio file. You can also receive the synthesis results as an audio stream.
Voice Configuration
Use the following code to create the SpeechSynthesizer and SynthesisSettings (contact info@seasalt.ai for the SEAVOICE_TOKEN):
from seavoice_sdk_beta import SpeechSynthesizer, LanguageCode, Voice
from seavoice_sdk_beta.commands import SynthesisSettings

synthesizer = SpeechSynthesizer(
    token=SEAVOICE_TOKEN,
    language=LanguageCode.EN_US,
    sample_rate=22050,
    voice=Voice.TOMHANKS,
)
settings = SynthesisSettings(
    pitch=0.0,
    speed=1.0,
    volume=50.0,
    rules="Elon | eelon\nX Æ A12 | x ash ay twelve",
    sample_rate=22050,
)
Note
language: choose from LanguageCode.ZH_TW, LanguageCode.EN_US, LanguageCode.EN_GB
voice: voice options for the synthesized audio
    ZH_TW: choose from Voice.TONGTONG, Voice.VIVIAN
    EN_US: choose from Voice.ROBERT, Voice.TOM, Voice.MIKE, Voice.ANNE, Voice.LISSA, Voice.MOXIE, Voice.REESE
    EN_GB: choose from Voice.DAVID
pitch: adjusts the pitch of the synthesized voice. Choose a value between -12.0 and 12.0, where 0.0 is the default/normal value; positive values raise the pitch and negative values lower it.
speed: adjusts the speed of the synthesized voice. Choose a value between 0.5 and 2.0, where 1.0 is the default/normal value; values > 1.0 speed up the speech and values < 1.0 slow it down.
volume: adjusts the volume of the synthesized voice. Choose a value between 0.0 and 100.0, where 50.0 is the default/normal value; values > 50.0 increase the volume and values < 50.0 decrease it.
rules: specifies pronunciation rules for special word representations. Pass a string in the format <WORD1> | <PRONUNCIATION1>\n<WORD2> | <PRONUNCIATION2>, where \n is the delimiter.
    ZH_TW: pronunciation can be specified in zhuyin, pinyin, or Chinese characters, e.g. “TSMC | 台積電\n你好 | ㄋㄧˇ ㄏㄠˇ\n為了 | wei4 le5”
    EN_US and EN_GB: pronunciation can be specified with English words, e.g. “XÆA12 | ex ash ay twelve\nSideræl | psydeereal”
sample_rate: make sure the sample rate matches the sample rate setting for the output audio file.
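The rules string and the numeric ranges above are easy to get wrong by hand, so a small local helper can assemble and check them before building SynthesisSettings. This is a sketch using plain Python only; build_rules, check_settings and LIMITS are hypothetical names, and whether the service itself rejects out-of-range values is not stated here.

```python
# Ranges taken from the notes above: (minimum, maximum).
LIMITS = {"pitch": (-12.0, 12.0), "speed": (0.5, 2.0), "volume": (0.0, 100.0)}


def build_rules(pronunciations: dict) -> str:
    """Join word/pronunciation pairs into the '<WORD> | <PRONUNCIATION>' format."""
    return "\n".join(f"{word} | {spoken}" for word, spoken in pronunciations.items())


def check_settings(pitch: float = 0.0, speed: float = 1.0, volume: float = 50.0) -> dict:
    """Raise ValueError if a value falls outside its documented range."""
    values = {"pitch": pitch, "speed": speed, "volume": volume}
    for name, value in values.items():
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return values


rules = build_rules({"Elon": "eelon", "X Æ A12": "x ash ay twelve"})
print(repr(rules))
```

The returned dict and rules string can then be passed on, e.g. `SynthesisSettings(**check_settings(speed=0.9), rules=rules, sample_rate=22050)`.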
Text Configuration
Use the following code to create the SynthesisData:
from seavoice_sdk_beta.commands import SynthesisData

data = SynthesisData(
    text="Good morning, today's date is<say-as interpret-as='date' format='m/d/Y'>10/11/2022</say-as>",
    ssml=True,
)
Note
text: the text to be synthesized.
ssml: should be True if text is an SSML string, i.e. it uses SSML tags. See Supported SSML Tags for more info.
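When the SSML text is assembled from variables, it helps to build the say-as tag in one place and escape any user-supplied text. The sketch below uses the standard library only and reproduces the date tag from the example above; say_as_date is a hypothetical helper, and whether the service requires XML escaping of plain text is an assumption.

```python
from xml.sax.saxutils import escape


def say_as_date(date_text: str, date_format: str = "m/d/Y") -> str:
    """Wrap a date string in the say-as tag used in the example above."""
    return (
        f"<say-as interpret-as='date' format='{date_format}'>"
        f"{escape(date_text)}</say-as>"
    )


greeting = escape("Good morning, today's date is ")
text = greeting + say_as_date("10/11/2022")
print(text)
```

The resulting string would be passed as `SynthesisData(text=text, ssml=True)`.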
Output File Configuration
Use the following code to initialise the synthesized output audio:
import wave
f = wave.open("output.wav", "w")
f.setnchannels(1)
f.setsampwidth(2)
f.setframerate(22050)
Synthesis Command
Use the following code to create the SynthesisCommand using the SynthesisData and SynthesisSettings from the previous steps:
from seavoice_sdk_beta.commands import SynthesisCommand, SynthesisPayload

command = SynthesisCommand(
    payload=SynthesisPayload(
        data=data,
        settings=settings,
    )
)
Putting everything together
Now, put everything together and run the example:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import asyncio
import wave

from seavoice_sdk_beta import LanguageCode, SpeechSynthesizer, Voice
from seavoice_sdk_beta.commands import SynthesisCommand, SynthesisData, SynthesisPayload, SynthesisSettings
from seavoice_sdk_beta.events import AudioDataEvent

SAMPLE_RATE: int = 8000


async def synthesize():
    seavoice_token = os.getenv("SEAVOICE_TOKEN", None)
    assert seavoice_token, "SEAVOICE_TOKEN is not set."
    synthesizer = SpeechSynthesizer(
        token=seavoice_token,
        language=LanguageCode.ZH_TW,
        sample_rate=SAMPLE_RATE,
        voice=Voice.TONGTONG,
    )
    data = SynthesisData(
        text="Good morning, today's date is<say-as interpret-as='date' format='m/d/Y'>10/11/2022</say-as>",
        ssml=True,
    )
    settings = SynthesisSettings(
        pitch=0.0,
        speed=0.9,
        volume=100.0,
        rules="SeaX | sea x",
        sample_rate=SAMPLE_RATE,
    )
    command = SynthesisCommand(
        payload=SynthesisPayload(
            data=data,
            settings=settings,
        )
    )
    # Mono, 16-bit output file at the same sample rate as the synthesis settings.
    f = wave.open("output.wav", "w")
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(SAMPLE_RATE)
    try:
        async with synthesizer:
            async def _send():
                await synthesizer.send(command)

            asyncio.create_task(_send())
            async for message in synthesizer.stream():
                print(message)
                if isinstance(message, AudioDataEvent):
                    f.writeframes(message.payload.audio)
    finally:
        # Close the file so the WAV header is finalized and data is flushed.
        f.close()


if __name__ == "__main__":
    asyncio.run(synthesize())
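After the script finishes, a quick sanity check is to read output.wav back and confirm that its sample rate and duration look plausible. The sketch below uses only the standard wave module; wav_summary is a hypothetical helper and not part of the SDK.

```python
import wave


def wav_summary(path: str) -> dict:
    """Return the sample rate and duration (in seconds) of a WAV file."""
    with wave.open(path, mode="rb") as audio_file:
        sample_rate = audio_file.getframerate()
        duration = audio_file.getnframes() / sample_rate
    return {"sample_rate": sample_rate, "duration": duration}
```

For example, `wav_summary("output.wav")` should report the same sample rate as SAMPLE_RATE; a different value suggests the settings and the file header are out of sync.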
Change Log
[0.2.3] - 2022-9-23
Improvements
Add reconnection mechanism
[0.2.2] - 2021-8-16
Bugfixes
Some callbacks were never called
[0.2.1] - 2021-7-25
Changed SDK name to seavoice
[0.1.14] - 2021-4-9
Improvements
Added output of post-processing result
[0.1.13] - 2021-4-1
Improvements
Added output of segment and word alignment information
[0.1.12] - 2020-12-10
Bugfixes
Remove unused variable
Improvements
Added websocket packages in requirements.txt file