Week 7 : Final Project - IBM Watson Speech to Text recognition in Unity

Concept:  Solve the problem of an indoor AR audio tour guide app that cannot access GPS by integrating a Voice UI which allows users to control the experience's progression.

Integrating Google Dialogflow into Unity proved problematic.  I turned instead to IBM Watson and quickly had better results integrating their SDK.  The possible user responses have to be hardcoded, but five or six options seem to cover what I expect users to say in response to the narrator's prerecorded prompts, at least initially.
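
As a rough illustration of the hardcoded-response idea, here is a minimal sketch in Python (not the C# that the Unity project would actually use).  The phrase lists and the function name classify_reply are assumptions for illustration; the post does not list the real options.

```python
import re

# Hypothetical phrase lists: the actual 5-6 options used in the app are not listed.
AFFIRMATIVE = ("yes", "yeah", "sure", "okay", "continue", "next")
NEGATIVE = ("no", "nope", "not yet", "wait", "stop")


def classify_reply(transcript):
    """Return 'affirmative', 'negative', or None for a speech-to-text transcript."""
    # Lowercase and replace punctuation with spaces so "Yes!" matches "yes".
    cleaned = re.sub(r"[^a-z' ]", " ", transcript.lower())
    # Pad with spaces so only whole words or phrases match ("know" is not "no").
    padded = " " + " ".join(cleaned.split()) + " "
    # Check negatives first so a mixed reply like "no, wait, okay" stays negative.
    for phrase in NEGATIVE:
        if " " + phrase + " " in padded:
            return "negative"
    for phrase in AFFIRMATIVE:
        if " " + phrase + " " in padded:
            return "affirmative"
    return None
```

Matching on whole words keeps a transcript like "I know nothing" from being read as "no", and anything outside the lists returns None so the narrator could re-prompt.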

Here's the conversation flow chart.



And here's the video showing the implementation within Unity and the Voice recognition at work.

Unfortunately you cannot hear my user responses, since I'm recording out of Unity with Soundflower, but you can see in the Console the Speech to Text transcription of what I'm saying into my headphones' microphone.  You may also notice the sound quality is inferior at the end of the narration when I respond to the user (I used the laptop mic for these short responses as a mockup).  Also, recognition fires twice at the end of the segment, with both the affirmative and negative reply, so I'll need to code it so that only one can fire.  Next up is getting these responses to trigger the next scene or, more likely, integrating a manager script that keeps all the experience's chapters in one Unity scene and just swaps out JPEGs, WAVs, etc.
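
One possible shape for both fixes, sketched in Python rather than Unity C#: a gate that accepts only the first reply per prompt, plus a manager that holds every chapter's assets in one place.  The class names, the chapter dictionaries, and the file names in the usage note are all hypothetical.

```python
class PromptGate:
    """Accept at most one reply per narrator prompt."""

    def __init__(self):
        self.answer = None

    def open(self):
        # Call this when a new prompt has finished playing.
        self.answer = None

    def submit(self, reply):
        """Record the reply if none is recorded yet; return True if accepted."""
        if self.answer is not None or reply not in ("affirmative", "negative"):
            return False
        self.answer = reply
        return True


class ChapterManager:
    """Keep all chapters in one place and swap assets instead of loading scenes."""

    def __init__(self, chapters):
        self.chapters = chapters  # list of dicts, e.g. {"image": ..., "audio": ...}
        self.index = 0
        self.gate = PromptGate()

    @property
    def current(self):
        return self.chapters[self.index]

    def handle_reply(self, reply):
        # A second reply for the same prompt is ignored, so only one can fire.
        if not self.gate.submit(reply):
            return self.current
        if reply == "affirmative" and self.index < len(self.chapters) - 1:
            self.index += 1
        return self.current
```

With chapters such as {"image": "ch1.jpg", "audio": "ch1.wav"}, an affirmative reply advances to the next entry and any further reply is ignored until gate.open() is called after the next narration.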

Wrap-up:  Voice Interfaces was a great class, and I got a lot out of the level of design thinking and discussions within the class as well as from the lectures.  It's a very interesting area to learn about, and as digital assistant voice synthesis gets more and more believable, it will be an increasingly important area to research.  Happy to have been introduced to the history and landscape of Voice UI.

Week 4 : Final Project proposal

For my final project I'd like to return to the uncanny valley aesthetics of the Lyrebird voice synthesizer and combine them with a Max project I made previously in Live Image Processing and Performance.

I also see this as an investigation within my thesis project's area of research.

Here's the general scope and UX of the project.

A user plays my xylophone.

The different notes are heard and discerned by a microphone running into Max.

The output is two-fold: 1) a projection mapped to the bars of the xylophone which are allowed to pass video when their corresponding notes ring out above a certain volume threshold, and 2) additional sound assets are triggered from Max with each detected hit.
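
The threshold gating would live in the Max patch, but the logic can be sketched in Python.  The note names, the level values, and the 0.2 default threshold are assumptions for illustration.

```python
def active_bars(levels, threshold=0.2):
    """Return the notes whose detected level is above the threshold.

    `levels` maps a note name to its current amplitude (0.0 to 1.0).
    Bars in the returned list would pass video; all others stay dark.
    """
    return [note for note, level in levels.items() if level > threshold]
```

Each returned note would both unmask its bar in the projection and trigger the matching sound asset.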

The audio assets are Lyrebird samples of my voice, trained to different notes that are in harmony with their corresponding xylophone trigger notes.  Ideally, these are Markov-chained, so that each struck note of the xylophone produces one of three or four possible notes in harmony.
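
A simplified stand-in for that idea, in Python: each struck note maps to a few candidate harmony notes and one is picked at random.  This is a uniform random choice per struck note, not a full Markov chain (which would also weight the choice by the previously sung note), and the HARMONIES table is a hypothetical example.

```python
import random

# Hypothetical table: struck xylophone note -> Lyrebird notes that harmonize with it.
HARMONIES = {
    "C": ["C", "E", "G"],
    "D": ["D", "F", "A"],
    "E": ["E", "G", "B"],
}


def pick_harmony(struck, rng=random):
    """Choose one harmony note for the struck note, uniformly at random."""
    return rng.choice(HARMONIES[struck])
```

Passing a seeded random.Random instance as rng makes the sequence repeatable for rehearsals.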

The video imagery is shots of me saying each word of the script while looking into the camera (and thus the user's eyes).

Here are examples of the work I've made that serve as foundations for this piece.

Xylophone + video projection:

Lyrebird Voice synthesizer with musical constraints:

Week 3 - Google AIY kit experiment

Assembling and configuring the Google AIY cardboard kit was mostly a breeze!

After trying to jump into the deep end with the Cloud version of Google AIY development and realizing it would offer more capabilities but mean starting from scratch, I decided to alter course and modify the example Assistant API code to add a simple browser-opening function: when asked to "play me the piano," it takes you to a YouTube video that autoplays a performance of 4'33" by John Cage.
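
The browser-opening part can be sketched on its own, separate from the AIY example code.  The sketch assumes the recognized text arrives as a plain string; VIDEO_URL is a placeholder rather than the actual video, and handle_command is a hypothetical name.

```python
import webbrowser

# Placeholder: the actual video link is not given in the post.
VIDEO_URL = "https://www.youtube.com/watch?v=VIDEO_ID"


def handle_command(text, open_url=webbrowser.open):
    """Open the video when the trigger phrase is heard; return True if it fired."""
    if "play me the piano" in text.lower():
        open_url(VIDEO_URL)
        return True
    return False
```

The open_url parameter defaults to the standard library's webbrowser.open and can be swapped for a stub when testing without launching a browser.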

Cage's iconic piece was partly about the sounds of the audience in the hall during those four minutes and thirty-three seconds: shuffling in their chairs, coughing, wondering, existing...  This browser gimmick tries to comment on that, as the documentation video shows: asking the AIY to "play me the piano" becomes about listening to the sounds of the room in which your browser is open.

Week 2 - Google AIY kit, "Who Is Lonely Today?"

This week's assignment was to flip things around and play with Speech-to-Text.  I wanted to create a moment using the AIY kit that prompted the user to feel compassion.  Despite Voice UIs sometimes steering you towards speaking in a manner that can feel a bit alienating, perhaps a simple design can counter this via the act of giving voice to an empathetic concern.

I wrote a Python script to scrape Twitter for tweets from today that contain the word "lonely" and have 0 replies and 0 likes, and to put them in a JSON file.  Then, when you ask the AIY Voice Kit "Who is lonely today?", your web browser opens one of these valid tweets, chosen at random.  Ideally you heart the tweet and ask a variety of follow-up questions: "Who else is lonely today?"  "Who else?"  "And who else?"
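
The filtering and random pick look roughly like this.  It is a sketch that assumes the tweets are already scraped into dictionaries; the field names (text, replies, likes, date, url) are hypothetical, and the scraping step itself is left out.

```python
import json
import random


def lonely_candidates(tweets, today):
    """Keep today's tweets that mention 'lonely' and have no replies or likes."""
    return [
        t for t in tweets
        if "lonely" in t["text"].lower()
        and t["replies"] == 0
        and t["likes"] == 0
        and t["date"] == today.isoformat()
    ]


def pick_tweet_url(candidates_json, rng=random):
    """Pick a random tweet URL from the saved JSON text, or None if it is empty."""
    candidates = json.loads(candidates_json)
    return rng.choice(candidates)["url"] if candidates else None
```

Serializing the candidates with json.dumps and writing the result to a file gives the JSON file described above; each "Who else?" then reads it back and calls pick_tweet_url again.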

By making these tweets a little less lonely, perhaps we have a positive effect on the loneliness of the person behind the tweet too.  Being in literal dialogue with this important social issue is a nice way to use a simple Voice Interface to feel expressive, in that it allows the user to express something out loud and creates an opportunity for them to act.  


I Would Die 4 U (Lyrebird Karaoke)

After some preliminary training tests on the Lyrebird AI voice synthesizer, I set up a nice mic and trained it properly with 50 performed sentences (20 more than their minimum).  It was an interesting experiment in trying to make a voice synthesizer both expressive and musically useful, using what appears to be the best available voice synthesizer that you can train yourself.  While the track feels ironically comedic and certainly a bit sci-fi, I do hear some amount of effective, emotive longing in the lead vocal when put to this song.

I looped a D piano note in my headphones while reading the training sentences into the microphone, monotonously performing all the words in unison with it, without natural English pitch inflections.  Ironically, I had to read as robotically as possible so that the output synthesis would be musically useful.  In the future I'd like to perform more Lyrebird trainings to build out an arsenal of different notes with this workaround technique so that I can reimagine new songs with more dynamic lead vocal melodies.

I think the classic Prince song takes on an interesting recontextualization here.  In the original, Prince offers to make the ultimate sacrifice for his lover, invoking religious and cryptic poetics throughout.  Here, those lines blend with the uncanny valley of AI-generated vocals, and my synthesized TTS lead vocal offers perhaps a more nuanced sacrifice: relinquished immortality.

What first came up in Lyrebird conversations with friends Patrick Presto and Alejandro Matamala was the idea of "cryogenically freezing" a loved one's voice before they passed away.  We imagined capturing an ill grandparent's voice via Lyrebird so that their vocal essence would remain (theoretically immortal), perhaps to read future grandkids bedtime stories or recount family stories at holiday gatherings.  The TTS implications of this are potentially wonderful, a nice thought to offset all the negative possibilities of identity forgery and bad actors in this space.

Training set was performed using a Rode K2 through a Universal Audio Apollo interface, edited in Ableton for arrangement of stanzas and rhythmic expressiveness.  Downloaded a Prince karaoke track.  Otherwise all audio essentially straight from Lyrebird's voice synthesizer.