vsdk 8.1.0
 
Getting Started with VSDK (C++ Edition)

The Vivoka SDK greatly facilitates integrating Voice Recognition, Speech Synthesis and Voice Biometrics into your application, regardless of who provides the underlying engine! This short guide will show you how to get started quickly.

Installation

This guide is not an installation guide and assumes everything is already in place so you can start developing.

Configuration

All VSDK engines are initialized with a JSON configuration file. We strongly recommend putting this file under a config directory, because some engines will generate additional configuration files.

The content of this file is discussed in separate documents as each engine will have its own configuration block.

Error handling model

The SDK throws exceptions to report errors, so you don't have to check every single function call. The following base program is recommended:

#include <vsdk/Exception.hpp>
#include <fmt/format.h>
#include <cstdlib> // for EXIT_SUCCESS / EXIT_FAILURE
int main() try
{
    // use VSDK here
    return EXIT_SUCCESS;
}
catch (std::exception const & e)
{
    fmt::print(stderr, "A fatal error occurred:\n");
    Vsdk::printExceptionStack(e); // prints the full exception stack
    return EXIT_FAILURE;
}

Please note that some parts of the SDK might run on other threads of execution, and exceptions can't cross thread boundaries. You have to protect your threads from exceptions by either catching them inside the thread or forwarding them to the main thread via callbacks.
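One minimal pattern for this (plain standard C++, not VSDK-specific) is to capture the exception inside the worker thread and rethrow it on the main thread, where the try/catch from the base program above can handle it:

#include <exception>
#include <thread>
std::exception_ptr workerError;
std::thread worker([&workerError]
{
    try
    {
        // code that may throw, e.g. driving a pipeline
    }
    catch (...)
    {
        workerError = std::current_exception(); // capture instead of terminating the process
    }
});
worker.join();
if (workerError)
    std::rethrow_exception(workerError); // rethrow where the main try/catch can handle it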

Version

You can access the versions of both VSDK and the underlying engines like so:

#include <vsdk/global.hpp>
#include <fmt/format.h>
// To initialize an engine, see further below.
// Vsdk::version() returns a "major.minor.patch" formatted string.
fmt::print("VSDK v{} ; Engine v{}\n", Vsdk::version(), engine->version());

Audio Pipeline

Starting from VSDK 6, you have access to the Audio Pipeline feature. Put simply, you have an audio producer that will push its audio buffers to consumers. Audio modifiers can be installed in the middle to transform the audio as it goes through the pipeline.

using namespace Vsdk::Audio;
Pipeline p;
auto recognizer = asrEngine->recognizer("rec"); // asrEngine: see the ASR section below
// ... set up the recognizer callbacks, etc ...
p.setProducer(Producer::File::make("path/to/input/file")); // defaults to 16kHz, 1 channel, 2048B read size
p.pushBackConsumer(recognizer);
p.run(); // reads the file synchronously and sends the bytes to the recognizer

ASR

Includes

#include <vsdk/asr.hpp> // main ASR engine class
#include <vsdk/asr/csdk.hpp> // underlying ASR engine, here we chose CSDK

Starting the engine

Vsdk::Asr::EnginePtr engine = Vsdk::Asr::Engine::make<Vsdk::Asr::Csdk::Engine>("config/vsdk.json");
// engine is a std::shared_ptr, copy it around as needed but don't let it go out of scope while you need it!

You can't create two separate instances of the same engine! Attempting to create a second one will simply give you another pointer to the existing engine. Terminate the first engine (e.g. let it go out of scope) before making a new instance.
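A quick illustration of that behaviour (a sketch based on the description above):

auto first = Vsdk::Asr::Engine::make<Vsdk::Asr::Csdk::Engine>("config/vsdk.json");
auto second = Vsdk::Asr::Engine::make<Vsdk::Asr::Csdk::Engine>("config/vsdk.json");
// first.get() == second.get(): both point to the same engine instance
first.reset();
second.reset(); // the engine is only terminated once every pointer is gone
// a brand new instance can be created from here on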

That's it! If no exception was thrown, your engine is ready to be used.

Native engine

Sometimes you need to access a feature specific to the engine you chose. You can access the underlying engine like so:

using CsdkAsrEngine = Vsdk::Asr::Csdk::Engine;
auto engine = Vsdk::Asr::Engine::make<CsdkAsrEngine>("config/vsdk.json");
auto native = engine->asNativeEngine<CsdkAsrEngine>(); // also a std::shared_ptr

Although this is possible, you should avoid using the native engine unless there is no other way.

TTS

Includes

#include <vsdk/tts.hpp>
#include <vsdk/tts/baratinoo.hpp> // Here we chose the baratinoo engine

Starting the engine

auto engine = Vsdk::Tts::Engine::make<Vsdk::Tts::Baratinoo::Engine>("config/vsdk.json");

Listing the available channels and voices

// With C++17 or higher
for (auto const & [channel, voices] : engine->availableVoices())
fmt::print("Available voices for '{}': [{}]\n", channel, fmt::join(voices, " - "));
// With C++11 or higher
for (auto const & it : engine->availableVoices())
fmt::print("Available voices for '{}': [{}]\n", it.first, fmt::join(it.second, " - "));

Creating a channel

Remember, the channel must be configured beforehand!

Vsdk::Tts::ChannelPtr channelFr = engine->channel("channel_fr");
channelFr->setCurrentVoice("Arnaud_neutre"); // mandatory before any synthesis can take place

You can also activate a voice right away:

auto channelEn = engine->channel("channel_en", "laura"); // fetch and setCurrentVoice combined
// ChannelPtr is also a std::shared_ptr

Destruction order is important! Channels must be destroyed before the engine dies or you're in for a ride.
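A minimal sketch of a safe scoping pattern:

{
    auto engine = Vsdk::Tts::Engine::make<Vsdk::Tts::Baratinoo::Engine>("config/vsdk.json");
    {
        auto channel = engine->channel("channel_fr", "Arnaud_neutre");
        // ... synthesize here ...
    } // the channel dies here, before the engine
} // the engine dies last: correct order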

Speech Synthesis

Speech Synthesis is synchronous! That means the call will block the thread until the synthesis is done or an error occurs. If you need to keep going, run the synthesis in another thread, as sketched below.

Vsdk::Tts::SynthesisResult const resultFr = channelFr->synthesizeFromText("Bonjour ! Je suis une voix synthétique.");
// Also works with SSML input
auto const ssml = R"(<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    Here is an <say-as interpret-as="characters">SSML</say-as> sample.
</speak>)";
auto const resultEn = channelEn->synthesizeFromText(ssml);
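
As noted above, the call blocks. To keep the current thread free, you can offload the synthesis, for instance with std::async (a sketch using plain standard C++):

#include <future>
auto task = std::async(std::launch::async, [&channelFr]
{
    return channelFr->synthesizeFromText("Bonjour ! Je suis une voix synthétique.");
});
// ... do other work while the synthesis runs ...
auto const resultAsync = task.get(); // blocks only if the synthesis isn't done yet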

You can load your text or SSML input from a text file too:

auto const result = channel->synthesizeFromFile("path/to/file");

SynthesisResult is NOT a pointer type! Avoid copying it around, prefer move operations.
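
For example, keep the result in a non-const variable so it can be moved from (a sketch; std::vector is just one possible container):

#include <utility>
#include <vector>
std::vector<Vsdk::Tts::SynthesisResult> results;
auto movable = channelFr->synthesizeFromText("Encore une phrase.");
results.push_back(std::move(movable)); // transfers the audio buffer, no copy
// movable must not be used after the move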

Playing the result

VSDK does not provide an audio player of any sort; it is up to you to choose the one that suits your needs. Once you've chosen one, using the result is very easy:

PortAudioPlayer player; // example player class: VSDK does not ship one
player.play(result.data(), result.channelCount(), result.sampleRate());

The audio data is a 16-bit signed little-endian PCM buffer. The channel count is always 1 and the sample rate varies depending on the engine:

Engine      Sample Rate (Hz)
csdk        22050
baratinoo   24000
vtapi       16000

Storing the result on disk

result.saveToFile("path/to/file.pcm");

Only the PCM extension is available, which means the file has no audio header of any sort.

You can play it by supplying the right parameters, e.g.:

$ aplay -f S16_LE -r <sample_rate> -c 1 file.pcm

Or add a WAV header:

$ ffmpeg -f s16le -ar <sample_rate> -ac 1 -i file.pcm file.wav
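
If you'd rather add the header in code, here is a minimal sketch that wraps mono 16-bit little-endian PCM into a canonical 44-byte WAV file. pcmToWav is just an illustrative helper, not part of VSDK, and it assumes a little-endian host:

#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>
void pcmToWav(std::string const & pcmPath, std::string const & wavPath, std::uint32_t sampleRate)
{
    std::ifstream in(pcmPath, std::ios::binary);
    std::vector<char> pcm((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
    auto const dataSize = static_cast<std::uint32_t>(pcm.size());
    std::ofstream out(wavPath, std::ios::binary);
    auto const write = [&out] (auto value) { out.write(reinterpret_cast<char const *>(&value), sizeof value); };
    out.write("RIFF", 4); write(std::uint32_t{36 + dataSize}); // RIFF chunk size
    out.write("WAVE", 4);
    out.write("fmt ", 4); write(std::uint32_t{16});            // fmt chunk size (PCM)
    write(std::uint16_t{1});                                   // format: PCM
    write(std::uint16_t{1});                                   // channels: 1 (mono)
    write(sampleRate);                                         // sample rate in Hz
    write(std::uint32_t{sampleRate * 2});                      // byte rate: rate * channels * 2 bytes
    write(std::uint16_t{2});                                   // block align
    write(std::uint16_t{16});                                  // bits per sample
    out.write("data", 4); write(dataSize);                     // data chunk
    out.write(pcm.data(), dataSize);
}

For instance, pcmToWav("file.pcm", "file.wav", 24000) would wrap the output of the baratinoo engine into a file any standard player can open.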

Voice biometry

Two providers are supported: tssv and idvoice.

Includes

#include <vsdk/biometrics.hpp> // main biometrics engine class
#include <vsdk/biometrics/[name of the provider].hpp> // underlying engine

Creating an engine

First of all, you will need to create the engine for the provider you chose.

auto engine = Vsdk::Biometrics::Engine::make<Vsdk::Biometrics::[name of the provider]::Engine>("vsdk.json");

Creating a model

To use voice biometrics you will need to create models. You can do so via the engine by providing a name for the model and its type (text-dependent or text-independent):

auto model = engine->makeModel("test", Vsdk::Biometrics::ModelType::TEXT_DEPENDANT);

After that, you must add records for users via the addRecord method:

model->addRecord("myuser", "data/myuserEnrollment.wav");

Creating an Authenticator/Identificator

Now that you have a user in your model, you can create an authenticator or an identificator to test audio against it.

The difference is that an authenticator tests only one user, which you must provide as a parameter, whereas an identificator tests the provided audio against all users in the model.

auto recognizer = engine->makeIdentificator("ident", model);
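
Creating an authenticator looks similar; the exact factory signature isn't shown here, so treat this as a hypothetical sketch where the user to test is passed explicitly:

auto authenticator = engine->makeAuthenticator("auth", model, "myuser"); // hypothetical signature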

Then you need to subscribe a callback to the identificator's result event:

recognizer->subscribe([] (Vsdk::Biometrics::Identificator::Result const & result)
{
    auto const id = result.json["id"].get<std::string>();
    auto const score = result.json["score"].get<float>();
    fmt::print("Identified '{}' with score {}\n", id, score);
});

Results are only sent if the engine recognized a user; if someone speaks but isn't recognized, no error will be sent.

These classes are audio consumers, so you can add them to the pipeline:

p.pushBackConsumer(recognizer);