Last time, I wrote about how to use the UWP and integrate Cortana, so you can use voice commands to start your app on a Windows Phone device.

This time, I’m going to write about how to control a Raspberry Pi with voice commands, and program your UWP app in C# to respond to those instructions. This has the potential to really transform the accessibility of your UWP apps, by letting users drive events with their voice.

Creating the grammar specification file

The .NET framework provides some pretty advanced speech recognition capabilities out of the box – these APIs make integrating grammar specifications into your app very simple. The more complex part is creating the grammar file itself.

Microsoft have an excellent introduction to creating these files on MSDN here. Reading the MSDN documentation, and augmenting it with the example on Wikipedia here, really helped me get started.

I’ve started creating my Speech Recognition Grammar Specification (SRGS), which describes “automationCommands” below:

<?xml version="1.0" encoding="utf-8" ?>
<grammar
  version="1.0"
  xml:lang="en-US"
  root="automationCommands"
  xmlns="http://www.w3.org/2001/06/grammar"
  tag-format="semantics/1.0">
  
  <!-- SRGS instructions here -->
 
</grammar>

For the purposes of this article, I want my Raspberry Pi to recognise verbal instructions to control a vehicle. I’m likely to command the vehicle to move forward or backward, and I want to use a few different verbs to describe the action of movement. For example, I want the commands below to work:

  • Move forward
  • Go forwards
  • Turn back

It’s quite easy to see the structure of the sentence: there’s a verb which describes the move action (move, go, turn), followed by an adverb for the direction (forward, forwards, backward, backwards, back). Therefore, our grammar specification starts to look like this:

<rule id="automationCommands">
  <item>
    <item>
      <ruleref uri="#moveAction" />
      <tag> out.command=rules.latest(); </tag>
    </item>
    <item>
      <ruleref uri="#direction" />
      <tag> out.direction=rules.latest(); </tag>
    </item>
  </item>
</rule>

When the .NET speech recognition engine interprets the voice commands, it will store the instruction it hears within a dictionary object, with keys of “command” and “direction” – you can see these in the <tag> nodes above.

So I now need to describe the rules for the automation commands “moveAction” and “direction”. Let’s look at “moveAction” first.

When the recognition engine hears me say the words “move”, “go” or “turn”, I want it to recognise this as an instruction to move. I would like to translate all of these verbal instructions into just one verb – move. This is much better than having to program my application to handle many different words (move, turn, go) which describe the same action (move). I can do this by defining a single <tag> within a rule which applies to any one of a number of different words, as shown below.

<rule id="moveAction">
  <one-of>
    <item>
      <tag> out="MOVE"; </tag>
      <one-of>
        <item>move</item>
        <item>turn</item>
        <item>go</item>
      </one-of>
    </item>
  </one-of>
</rule>

The rule relating to “direction” follows a similar pattern, but this rule has two possible output tags – one for forward and one for backward.

<rule id="direction">
  <item>
    <one-of>
      <item>
        <tag> out="FORWARD"; </tag>
        <one-of>
          <item>forward</item>
          <item>forwards</item>
        </one-of>
      </item>
      <item>
        <tag> out="BACKWARD"; </tag>
        <one-of>
          <item>backward</item>
          <item>back</item>
          <item>backwards</item>
        </one-of>
      </item>
    </one-of>
  </item>
</rule>

So the whole SRGS file, defining the grammar required, is shown below – with this grammar, a phrase like “go backwards” is interpreted as the command “MOVE” with the direction “BACKWARD”. This is also available on GitHub here.

<?xml version="1.0" encoding="utf-8" ?>
<grammar
  version="1.0"
  xml:lang="en-US"
  root="automationCommands"
  xmlns="http://www.w3.org/2001/06/grammar"
  tag-format="semantics/1.0">
 
  <rule id="automationCommands">
    <item>
      <item>
        <ruleref uri="#moveAction" />
        <tag> out.command=rules.latest(); </tag>
      </item>
      <item>
        <ruleref uri="#direction" />
        <tag> out.direction=rules.latest(); </tag>
      </item>
    </item>
  </rule>
 
  <rule id="moveAction">
    <one-of>
      <item>
        <tag> out="MOVE"; </tag>
        <one-of>
          <item>move</item>
          <item>turn</item>
          <item>go</item>
        </one-of>
      </item>
    </one-of>
  </rule>
 
  <rule id="direction">
    <item>
      <one-of>
        <item>
          <tag> out="FORWARD"; </tag>
          <one-of>
            <item>forward</item>
            <item>forwards</item>
          </one-of>
        </item>
        <item>
          <tag> out="BACKWARD"; </tag>
          <one-of>
            <item>backward</item>
            <item>back</item>
            <item>backwards</item>
          </one-of>
        </item>
      </one-of>
    </item>
  </rule>
</grammar>

Implementing the UWP app in C#

I created a new Windows 10 UWP app in Visual Studio, and added a project reference to the Windows IoT Extensions for the UWP (shown below).

[Screenshot: adding the Windows IoT Extensions for the UWP project reference in Visual Studio]

I also added a NuGet reference to a package I created to simplify coding for speech recognition – Magellanic.Speech.Recognition. I added it using the command below from the package manager console.

Install-Package Magellanic.Speech.Recognition -Pre

Next, I added handlers for the Loaded and Unloaded events in the app’s MainPage.xaml.cs file.

public MainPage()
{
    this.InitializeComponent();
 
    Loaded += MainPage_Loaded;
 
    Unloaded += MainPage_Unloaded;
}
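
Since the start-up code shown below awaits the grammar compilation, the MainPage_Loaded handler needs to be declared as async. A minimal sketch of the two handler stubs is shown below – the Unloaded handler is just a placeholder for any clean-up you might want to do.

private async void MainPage_Loaded(object sender, RoutedEventArgs e)
{
    // the speech recognition set-up code (shown in the next snippets) goes here
}
 
private void MainPage_Unloaded(object sender, RoutedEventArgs e)
{
    // stop recognition and release any resources here if you need to
}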

I added the SRGS XML file to the root of the project with the name grammar.xml, and added private members for this file name and for the speech recognition manager to MainPage.xaml.cs.

private const string grammarFile = "grammar.xml";
        
private SpeechRecognitionManager recognitionManager;

Inside the event handler “MainPage_Loaded”, I added the code below. This compiles the SRGS grammar file, and adds an event handler specifying what to do when the speech recognition engine successfully detects and parses a voice command.

// initialise the speech recognition manager
recognitionManager = new SpeechRecognitionManager(grammarFile);
 
// register the event for when speech is detected
recognitionManager
    .SpeechRecognizer
    .ContinuousRecognitionSession
    .ResultGenerated += RecognizerResultGenerated;
 
// compile the grammar file
await recognitionManager.CompileGrammar();

The code below shows the implementation of the event handler declared above. I’ve chosen to ignore any results which aren’t recognised with a high level of confidence. You can also see how the two keys of “command” and “direction” – which are defined in the “automationCommands” rule in the SRGS – can be interpreted and used in C# for further processing and action.

private void RecognizerResultGenerated(
    SpeechContinuousRecognitionSession session,
    SpeechContinuousRecognitionResultGeneratedEventArgs args)
{
    // only act if the speech is recognised with high confidence
    if (!args.Result.IsRecognisedWithHighConfidence())
    {
        return;
    }
 
    // interpret key individual parts of the grammar specification
    string command = args.Result.SemanticInterpretation.GetInterpretation("command");
    string direction = args.Result.SemanticInterpretation.GetInterpretation("direction");
 
    // write to debug
    Debug.WriteLine($"Command: {command}, Direction: {direction}");
}
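
From here, acting on the instruction is just a case of branching on the two interpreted values. The sketch below shows one hypothetical way to dispatch the commands – DriveForward and DriveBackward are placeholder methods that don’t exist in the sample code, and stand in for whatever your device actually needs to do.

private void ProcessCommand(string command, string direction)
{
    // the grammar above only ever emits the "MOVE" command
    if (command != "MOVE")
    {
        return;
    }
 
    switch (direction)
    {
        case "FORWARD":
            DriveForward();   // hypothetical placeholder
            break;
        case "BACKWARD":
            DriveBackward();  // hypothetical placeholder
            break;
    }
}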

The code for MainPage.xaml.cs is available here.
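
As an aside, the Magellanic.Speech.Recognition package just simplifies the plumbing around the UWP’s built-in Windows.Media.SpeechRecognition APIs. If you’d rather not take the dependency, a rough sketch of the equivalent set-up – assuming the same grammar.xml file and RecognizerResultGenerated handler as above – would look something like this:

// requires: using System; using System.Threading.Tasks;
// using Windows.Media.SpeechRecognition; using Windows.Storage;
private SpeechRecognizer recognizer;
 
private async Task InitialiseSpeechRecognition()
{
    // load the SRGS grammar file from the root of the app package
    var grammarFile = await StorageFile.GetFileFromApplicationUriAsync(
        new Uri("ms-appx:///grammar.xml"));
 
    // create a recogniser and attach the grammar file as a constraint
    recognizer = new SpeechRecognizer();
    recognizer.Constraints.Add(new SpeechRecognitionGrammarFileConstraint(grammarFile));
 
    // compile the constraints, then start listening continuously
    var compilationResult = await recognizer.CompileConstraintsAsync();
 
    if (compilationResult.Status == SpeechRecognitionResultStatus.Success)
    {
        recognizer.ContinuousRecognitionSession.ResultGenerated += RecognizerResultGenerated;
        await recognizer.ContinuousRecognitionSession.StartAsync();
    }
}

Note that the IsRecognisedWithHighConfidence extension method used in the handler comes from the NuGet package rather than the raw API, so with this approach you’d check the args.Result.Confidence property yourself.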

Hardware used by the Raspberry Pi

The Pi doesn’t have any hardware on board which can convert voice commands into an electrical signal, so I purchased a small USB microphone.

The image below shows how the Raspberry Pi recognises this device as a USB PnP sound device.

[Screenshot: the Raspberry Pi listing the microphone as a USB PnP sound device]

Finally, in order to use this device, I had to modify the app’s Package.appxmanifest file to add the Microphone capability.

[Screenshot: the Microphone capability enabled in Package.appxmanifest]
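
If you prefer to edit the manifest XML directly rather than use the designer, the capability is declared inside the <Capabilities> node:

<Capabilities>
  <DeviceCapability Name="microphone" />
</Capabilities>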

I’ve added all of this code to GitHub here.

Testing it out with some voice commands

I added a small LCD device to my Raspberry Pi to show the output of my voice commands. When I say “Move forward”, the device interprets it in the way below – the LCD screen shows how the command is “MOVE” and the direction is “FORWARD”.

[Photo: the LCD screen showing Command: MOVE, Direction: FORWARD]

When I say “Turn back”, the device interprets it in the way below. The image shows how the command is “MOVE” and the direction is “BACKWARD”. Notice how the device doesn’t care whether you say “move” or “turn” – either way, it interprets it as the command “MOVE”.

[Photo: the LCD screen showing Command: MOVE, Direction: BACKWARD]

This has been a simple introduction to speech recognition in C#, and how to use it with the Raspberry Pi. You can obviously go into much greater complexity with the SRGS file to make your UWP applications more accessible.