Why no Hardware Accelerated Speech Engine | NotebookReview

HTWingNut Potato

Reputations:: 21,580

Messages:: 35,370

Likes Received:: 9,878

Trophy Points:: 931

It has been over 13 years since the 3D graphics accelerator has hit the consumer market with the Monster 3D. At first I thought 3D video was such a gimmick. Well after witnessing the Monster 3D at work, I bought one on the spot. I had so much fun mainly with Descent 2 for well over a year.

We now have Physics acceleration, which IMHO is not really all that much worth it, but considering nVidia is including it accelerated through their video cards, not a bad idea.

Now it makes me think about a speech engine. So many games these days have narrated lines, which is nice, but it makes for quite a limited number of lines and vocal possibilities.

Imagine the developer being able to just type in a thousand different sayings and choose the type of voice they'd want along with inflections, speed, emphasis, etc. In games you wouldn't have to hear the same lines OVER and OVER and OVER again. This would open up a world of possibilities and most likely cut budget costs. Most likely easily handled by any existing CPU or some technology in an existing chip like GPU, CPU, or sound card.

Plus it would open up a world of possibilities for third party mods and add-ons to tell a better story since you wouldn't need to pay for great voice acting.

Thoughts, comments?

HTWingNut, Nov 14, 2009

#1

@nthony Notebook Evangelist

Reputations:: 558

Messages:: 585

Likes Received:: 0

Trophy Points:: 30

Seems a fairly expensive process - you'd need to pay for very high quality voice acting for someone to pronounce all consonant and vowel sounds such that they can be blended well together, times that by number of characters, could get quite pricey if you're just going to use it for a few thousand possibilities. Here's some more info: http://en.wikipedia.org/wiki/Speech_synthesis

@nthony, Nov 14, 2009

#2

HTWingNut Potato

Reputations:: 21,580

Messages:: 35,370

Likes Received:: 9,878

Trophy Points:: 931

The whole point is that you don't need voice acting. It is all synthesized without that need.

I mean we have super high powered video cards to emulate the most complex sense that humans possess. Why not speech? Sure initially you might need a dozen or so voices but you could process that to an infinite number. It only needs to be done ONCE, and perhaps not at all with all the knowledge we posess already.

HTWingNut, Nov 14, 2009

#3

Harleyquin07 エミヤ

Reputations:: 603

Messages:: 3,376

Likes Received:: 78

Trophy Points:: 116

Interesting idea, but a very narrow niche since most games nowadays are more on graphical eyecandy and twitch-based gameplay then on visual-novel games.

Harleyquin07, Nov 14, 2009

#4

key001 Notebook Evangelist

Reputations:: 776

Messages:: 657

Likes Received:: 7

Trophy Points:: 31

You don't need hardware accelerated speech because there isn't much to accelerate, it takes a few minutes to convert ~500 pages of text to mp3 by using textaloud.

key001, Nov 14, 2009

#5

BrandonSi Notebook Savant

Reputations:: 571

Messages:: 1,444

Likes Received:: 0

Trophy Points:: 55

htwingnut said: ↑

It has been over 13 years since the 3D graphics accelerator has hit the consumer market with the Monster 3D. At first I thought 3D video was such a gimmick. Well after witnessing the Monster 3D at work, I bought one on the spot. I had so much fun mainly with Descent 2 for well over a year.

Click to expand...

I love historical tech.. The first time I heard about this I was like you.. "What a scam, never gonna work", etc.. Then a friend bought one (Vodoo2 I believe, so that was probably around 1998) and I watched him play Unreal (yes, before Unreal Tournament existed) and my mouth dropped to the floor. Amazing stuff, you very rarely get to witness huge performance jumps like that in the tech world.

As far as the Speech Engine, I think it's a cool idea! With compression as such, you could easily have 100+ voices record all the syllables you need and store them on the card itself. Unfortunately I don't think the demand is there.

Now what I would like to see is a serious piece of hardware focused on speech recognition. It's been years since the first speech-enabled apps and OS, and it still sucks. If we could get a dedicated speech analysis card that could understand my Chicago/Southern US accent blend, that would be great.

BrandonSi, Nov 15, 2009

#6

tianxia kitty!!!

Reputations:: 1,212

Messages:: 2,612

Likes Received:: 0

Trophy Points:: 55

such tech would take about 5-10 years to treacle down to consumer level when it's conceived. i don't think i'll even be interested in video games by that time.

tianxia, Nov 15, 2009

#7

HTWingNut Potato

Reputations:: 21,580

Messages:: 35,370

Likes Received:: 9,878

Trophy Points:: 931

tianxia said: ↑

such tech would take about 5-10 years to treacle down to consumer level when it's conceived. i don't think i'll even be interested in video games by that time.

Click to expand...

ah hem... You have to start somewhere, and like I said, look at 3D gaming. That was nothing but a pipe dream and the original Monster 3D card was scoffed at initially.

I guess people are missing my point though. The whole idea is to circumvent the need for voice acting. Create a whole on-the-fly digital re-creation of voices based on algorithms. We know enough about speech and wave patterns that it seems it would just take some dedicated engineers only a couple years time to make it reasonable. Sure you may need a few dozen voices up front reading several hundred specific words or phrases to establish your parameters. But after that, all done.

Then you'd just need a slick interface for identifying the type of voice you want. I'm being inventive here. Just trying to have some fun.

Think outside the box and positively people!

HTWingNut, Nov 15, 2009

#8

tianxia kitty!!!

Reputations:: 1,212

Messages:: 2,612

Likes Received:: 0

Trophy Points:: 55

htwingnut said: ↑

ah hem... You have to start somewhere, and like I said, look at 3D gaming. That was nothing but a pipe dream and the original Monster 3D card was scoffed at initially.

I guess people are missing my point though. The whole idea is to circumvent the need for voice acting. Create a whole on-the-fly digital re-creation of voices based on algorithms. We know enough about speech and wave patterns that it seems it would just take some dedicated engineers only a couple years time to make it reasonable. Sure you may need a few dozen voices up front reading several hundred specific words or phrases to establish your parameters. But after that, all done.

Then you'd just need a slick interface for identifying the type of voice you want. I'm being inventive here. Just trying to have some fun.

Think outside the box and positively people!

Click to expand...

the huge r&d cost would REQUIRE the investment made by someone other than video game developers. you're talking about consolidating linguistics too. and not just english, but many languages. it would be used in public services and cooperates long before pcs and consoles.
i don't know when the technology for real time 3d rendering came about. but i think it should be a good few years before the likes of quake.

edit: also, language is highly illogical and dynamic. many words and phrases are born each day, i suppose this could be fixed by updating the system every few hours, where as the first problem could only be fixed by either document every instance of non regular grammar or enforce lojban on the world population. tbh i like the idea of lojban enforcement.

tianxia, Nov 15, 2009

#9

Melody How's It Made Addict

Reputations:: 3,635

Messages:: 4,174

Likes Received:: 419

Trophy Points:: 151

Ok I'm sorry if I didn't get the point and if I didn't ignore the rest of my post lol

I'm no expert on speech technology, but computerized voices tend to lack in terms of tone and dialect when compared to actual voices(mind you, the last time I fiddled with such software was over 5 years ago).

The thing with going from a written to speech(is that what you're suggesting?) is that a same text can be read 50 different ways. Emphasis on certain words will have to branched into different trees as well as intonation and dialect accents and such. It seems like a hefty amount of work and would require a lot of investment to create a database large enough to compare to a voice actor's possibilities of speech.

Once again, sorry if I missed the point >.<

Melody, Nov 15, 2009

#10

HTWingNut Potato

Reputations:: 21,580

Messages:: 35,370

Likes Received:: 9,878

Trophy Points:: 931

Forever_Melody said: ↑

Ok I'm sorry if I didn't get the point and if I didn't ignore the rest of my post lol

I'm no expert on speech technology, but computerized voices tend to lack in terms of tone and dialect when compared to actual voices(mind you, the last time I fiddled with such software was over 5 years ago).

The thing with going from a written to speech(is that what you're suggesting?) is that a same text can be read 50 different ways. Emphasis on certain words will have to branched into different trees as well as intonation and dialect accents and such. It seems like a hefty amount of work and would require a lot of investment to create a database large enough to compare to a voice actor's possibilities of speech.

Once again, sorry if I missed the point >.<

Click to expand...

Yes, the text can be read 50 different ways. But with voice acting, you get one static sound sample. With this approach, the programmer could put in dozens of different responses and manipulate how the text is to be pronounced. If you just look at the cost and time for a single voice actor, and did the same with the emulated voice, you time and cost would be greatly reduced. Basically the cost of a programmer to modify the voice, plus at the same time could add a number of other various responses.

I dunno. I think it's an interesting concept. It just seems like if we're able to reproduce a visual environment in a 3D world, which is ultimatelly very complex, that speech should be relatively simple in comparison.

HTWingNut, Nov 15, 2009

#11

emike09 Overclocking Champion

Reputations:: 652

Messages:: 1,840

Likes Received:: 0

Trophy Points:: 55

I think this is an awesome idea! With CPU's soon be released in 6-core standards, and 512 shaders on the GPU, there will be room for more acceleration of things like voice. The potential for such a technology is enormous, especially if an speech AI engine is developed to speak non pre-programmed verbatim based on situation.

I'm surprised so many users can't see the potential for such a technology. Everybody mentions how long it would take to develop and how text to speech doesn't sound realistic, and that is completely ignorant. Given the right amount of time and funding, our talent of modern day programmers could happen.

Already, the highest end synthetic text to speech generators and impossible to discern from a real voice.

emike09, Nov 15, 2009

#12

masterchef341 The guy from The Notebook

Reputations:: 3,047

Messages:: 8,636

Likes Received:: 4

Trophy Points:: 206

OK - a few points:

Whether the speech engine is done in hardware or software is totally an implementation detail. That doesn't really matter. A speech engine could exist in a game *today* if the developer wanted to put it there, and current speech synthesis engines can run in real time on modern processors, so we wouldn't necessarily need dedicated hardware for it. Even if we did, it's an implementation detail. This is in contrast to something like dynamic physics, which gets very expensive quickly as the number of interactions between objects grows. In conversational speech, generally just one person is talking at a time.

Even the highest end synthetic text to speech generators all sound very robotic, and they are all easily discernible from human speech. eMike09, I would be very interested if you could show me an example otherwise.

This is the kicker - discernibility is not a direct measure of quality. Synthesized speech doesn't have to map 1:1 with reality for it to be received well. However, synthesized speech engines tend to produce *annoying* results because of how particular people are about sound. If it is just a slight bit off, people will notice and it will bother them.

Of note, this is also true about animation. If the animation looks awkward, it will stick with you and you won't be able to let it go nearly as easily as, say, a 3d model. You could animate a blob character well, and it would be better received than a lifelike human character animated poorly.

The point here being that human (organic) animation is largely migrating to being pre-recorded by actors (motion capture), because that technology is becoming more available to developers and produces high quality results, where a programmatic animation cannot (yet) compete, and manipulating the animation by hand just takes way too much time. Programmatic animation is used for certain things though, like particle effects (fire, smoke, etc) and other inorganic animations (robot characters, space ships, cars, wheels, stuff like that).

Voice is organic sound. Until someone comes up with the algorithms to make life-like, animated, emotion bearing, non-annoying voice, we will have to use voice actors to maintain a high level of quality. You could use text to speech, and get more content, but the quality will be a lot lower, if you were to do that today. Think of synthesized instruments versus the real thing. The is just so much fine detail in a violin, a piano, or a human speaking that it takes a lot of effort to quantify what it is. Until we can do that, we won't have a high quality solution.

In a few years, that all may change. Who knows.

masterchef341, Nov 15, 2009

#13

@nthony Notebook Evangelist

Reputations:: 558

Messages:: 585

Likes Received:: 0

Trophy Points:: 30

htwingnut said: ↑

The whole point is that you don't need voice acting. It is all synthesized without that need.

I mean we have super high powered video cards to emulate the most complex sense that humans possess. Why not speech? Sure initially you might need a dozen or so voices but you could process that to an infinite number. It only needs to be done ONCE, and perhaps not at all with all the knowledge we posess already.

Click to expand...

I think you're confusing synthesized speech with synthesized voice. The former can already be done, albeit expensively, and still requires a live breathing person to produce the parts of speech to synthesize. As for the latter I've yet to hear a synthesized voice that doesn't sound robotic - the average human voice just has too many harmonics. So you're looking at either 'too' expensive or 'Joint DoD-NASA project' expensive...

@nthony, Nov 15, 2009

#14

HTWingNut Potato

Reputations:: 21,580

Messages:: 35,370

Likes Received:: 9,878

Trophy Points:: 931

Before 3D gaming graphics hit the consumer market, it was thought to be only something that would be set up in a server farm and rendering one frame every hour or something silly like that, and for CGI or animation only. Sure the initial graphics had a bit to be desired compared with today's, but it sure advanced gaming quickly and dramatically. I was a nay-sayer to 3D gaming graphics at the time too. But I became a believer quickly. Speech synthesis may not be as dramatic, probably be more like PhysX, but in time will become a commonality.

I'm not asking the computer to have an AI at all. Just on the programming end, they set the paramaters for the voice how they want it, and insert the text. Sure there may need to be some manual manipulation of how you want it, but it could be done so much more quickly and less expensive. Use of a programmer part time versus dozens of voice actors at who knows how much cost. Each sentence would have to be massaged so it sounded right, but an expert could probably do it very quickly. Plus you could add dialog later without having to bring in voice actors again and again.

Visit here. This is pretty primitive, but select Audrey UK English. Incredible IMHO. No, it's not completely natural, but it sounds pretty darn good:

http://www2.research.att.com/~ttsweb/tts/demo.php

Make sure to type in your own text.

HTWingNut, Nov 15, 2009

#15

masterchef341 The guy from The Notebook

Reputations:: 3,047

Messages:: 8,636

Likes Received:: 4

Trophy Points:: 206

The idea is great- don't get me wrong. In fact, it has been thought of before. There is a ton of research being done in language synthesis from an algorithms standpoint.

All I am saying is that this problem is not a well-solved problem. And by well-solved, I mean that there is no solution that produces voice that is as accurate to life or as appealing as human voice, nor is there a solution that exceptionally close. Perhaps equally important, there aren't current solutions in place to handle the nuances of language- minor differences can make drastic changes in meaning and interpretation.

That will all need to be worked out before it can be implemented in games.

Again, when there is a solution for this, it will probably run just fine on the CPU.

masterchef341, Nov 15, 2009

#16

Melody How's It Made Addict

Reputations:: 3,635

Messages:: 4,174

Likes Received:: 419

Trophy Points:: 151

Yeah the issue is producing an organic sounding voice. Currently, it's very feasible to turn text into speech, the issue is turning text into speech with all the possible variations that the human voice can produce in terms of intonation, dialect accents, emphasis and things would need to be coded into trees which would take time and more complex algorithms to "digitize" the whole concept of the human voice.

Melody, Nov 15, 2009

#17

mobius1aic Notebook Deity NBR Reviewer

Reputations:: 240

Messages:: 957

Likes Received:: 0

Trophy Points:: 30

Even if you could develop something like a speech engine, such a massively procedural system would need very complex AI and or scripting in order to bring about the most proper responses in an NPC character. Then you have the issue of making the synthesized voice "actor" properly enunciate words with the proper feeling and what not at the proper moment. It's very complex to think about, and in the end all the work could be done much more quickly and cost effectively by a human being that despite many takes possibly needed to get right, can range their voice and personality into an infinite number of characters. Voice modulation via a sound board gives even more freedom. Voice acting is an art (when done right of course). What you want is like previously said, purely procedural and way to complex to be practical. However I can see it being an important building block of an AI made to purely emulate human emotion.

We can quite easily put physics and graphics into relatively easy to understand mathematical terms/equations/systems, however the infinite breadth of human emotion and our voices in that respect is absolutely unfathomable to put into an easy general theory. Psychology tries to do this, but in the end we can only evaluate and not prove. I hope that puts some perspective on what you want to do.

mobius1aic, Nov 16, 2009

#18

emike09 Overclocking Champion

Reputations:: 652

Messages:: 1,840

Likes Received:: 0

Trophy Points:: 55

Lol the ATT speech generator is like 15 years old.

Here is a newer TTS generator that renders the text like it was calculating 1+1, and it sounds very decent. It understands different dialects from 25 languages, pronounces punctuation, and has a lot of the tonality of a human voice. It doesn't have emotion, that would have to be something programmed separately. The online version's capabilities are extremely limited and poor quality compared to their paid product. Imagine dedicating hardware specifically to rendering voice.

http://www.acapela-group.com/text-to-speech-interactive-demo.html

There are generators much better than the one I just linked to, but I can't seem to find it any more. It's a professional pay app. To say that accomplishing what htwingnut has envisioned is impossible or too far fetched to become a reality is about on key with Bill Gates saying we never would have a need for more than 640KB conventional memory (which he never really said anyways).

It makes sense to start looking into the development of a TTS generator with high enough quality to at least be passable to expectations. Massive open world games are going to become far more massive. There will soon be a point that voice acting hundreds of thousands of phrases will just not be feasible.

emike09, Nov 16, 2009

#19

masterchef341 The guy from The Notebook

Reputations:: 3,047

Messages:: 8,636

Likes Received:: 4

Trophy Points:: 206

emike09 said: ↑

Lol the ATT speech generator is like 15 years old.

Here is a newer TTS generator that renders the text like it was calculating 1+1, and it sounds very decent. It understands different dialects from 25 languages, pronounces punctuation, and has a lot of the tonality of a human voice. It doesn't have emotion, that would have to be something programmed separately. The online version's capabilities are extremely limited and poor quality compared to their paid product. Imagine dedicating hardware specifically to rendering voice.

http://www.acapela-group.com/text-to-speech-interactive-demo.html

There are generators much better than the one I just linked to, but I can't seem to find it any more. It's a professional pay app. To say that accomplishing what htwingnut has envisioned is impossible or too far fetched to become a reality is about on key with Bill Gates saying we never would have a need for more than 640KB conventional memory (which he never really said anyways).

It makes sense to start looking into the development of a TTS generator with high enough quality to at least be passable to expectations. Massive open world games are going to become far more massive. There will soon be a point that voice acting hundreds of thousands of phrases will just not be feasible.

Click to expand...

I don't think anyone thinks that it is too far fetched or impossible. We just recognize that this type of solution doesn't yet exist. And sound rendering will probably be able to be done in realtime in software when the algorithms are out.

And that is why there isn't hardware accelerated speech.

masterchef341, Nov 16, 2009

#20

Melody How's It Made Addict

Reputations:: 3,635

Messages:: 4,174

Likes Received:: 419

Trophy Points:: 151

It's definitely possible beacuse like I said, 5 years ago, there were relatively ok "text to speech" softwares available.

The difficult part that they're still working on is producing an organic sounding voice with all the possible variations that are associated with it. Lots of these things aren't necessarily mathematically "rendered" yet and would need to be done in order to computerize a full on voice and make it sound natural. Sarcasm, accents, the ability to recognize/pronounce slangs, word emphasis and such: these things would need to be first turned into a logical algorithm of some form before they can be read & rendered by a computer.

I personally like good voice acting and I don't think that voice actors will out of a job so soon, but the technology to produce a natural voice is certainly coming near.

Melody, Nov 16, 2009

#21

Why no Hardware Accelerated Speech Engine?

HTWingNut Potato

@nthony Notebook Evangelist

HTWingNut Potato

Harleyquin07 エミヤ

key001 Notebook Evangelist

BrandonSi Notebook Savant

tianxia kitty!!!

HTWingNut Potato

tianxia kitty!!!

Melody How's It Made Addict

HTWingNut Potato

emike09 Overclocking Champion

masterchef341 The guy from The Notebook

@nthony Notebook Evangelist

HTWingNut Potato

masterchef341 The guy from The Notebook

Melody How's It Made Addict

mobius1aic Notebook Deity NBR Reviewer

emike09 Overclocking Champion

masterchef341 The guy from The Notebook

Melody How's It Made Addict