It has been over 13 years since the 3D graphics accelerator hit the consumer market with the Monster 3D. At first I thought 3D video was such a gimmick. Well, after witnessing the Monster 3D at work, I bought one on the spot. I had so much fun with it, mainly playing Descent 2, for well over a year.
We now have physics acceleration, which IMHO isn't really worth all that much, but considering nVidia is accelerating it through their video cards, it's not a bad idea.
Now it makes me think about a speech engine. So many games these days have narrated lines, which is nice, but it makes for quite a limited number of lines and vocal possibilities.
Imagine the developer being able to just type in a thousand different sayings and choose the type of voice they'd want, along with inflections, speed, emphasis, etc. In games you wouldn't have to hear the same lines OVER and OVER and OVER again. This would open up a world of possibilities and most likely cut costs. It could probably be handled easily by any existing CPU, or by some technology in an existing chip like the GPU or sound card.
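Just to make that concrete, here's a rough sketch of what a developer-facing call might look like. Everything here is invented for illustration; no such engine or API exists:

```python
# Purely hypothetical sketch: no such engine exists. This is just the
# kind of interface a developer might get for synthesized dialogue.
from dataclasses import dataclass

@dataclass
class VoiceSpec:
    gender: str = "male"    # broad voice archetype
    age: int = 40           # approximate speaker age
    pitch: float = 1.0      # 1.0 = neutral
    speed: float = 1.0      # speaking-rate multiplier
    accent: str = "en-US"   # dialect tag

def speak(line, voice, emphasis=()):
    """Imaginary engine call: synthesize `line` with the given voice,
    stressing any words listed in `emphasis`."""
    print(f"[{voice.accent}, pitch {voice.pitch}] {line}")

# A guard NPC with a thousand possible lines, no recording booth needed:
guard = VoiceSpec(gender="male", age=50, pitch=0.9, speed=0.95)
speak("Halt! Who goes there?", guard, emphasis=("Halt",))
```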
Plus it would open up a world of possibilities for third party mods and add-ons to tell a better story since you wouldn't need to pay for great voice acting.
Thoughts, comments?
-
Seems a fairly expensive process - you'd need to pay for very high quality voice acting for someone to pronounce all the consonant and vowel sounds such that they can be blended together well, then multiply that by the number of characters. It could get quite pricey if you're just going to use it for a few thousand possibilities. Here's some more info: http://en.wikipedia.org/wiki/Speech_synthesis
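Per that article, the dominant approach is concatenative synthesis: record small sound units once, then splice them at runtime. A toy sketch of just the splicing step, assuming per-unit WAV clips have already been recorded (the filenames and unit inventory are invented):

```python
# Toy concatenative synthesis: stitch prerecorded unit clips together
# with a short crossfade so the joins don't click. Assumes 16-bit mono
# WAV files named after each unit, e.g. "h-e.wav" (invented inventory).
import wave
import numpy as np

def load_unit(name):
    with wave.open(f"{name}.wav", "rb") as w:
        data = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        return data.astype(np.float32), w.getframerate()

def concatenate(units, fade_ms=10):
    out, rate = load_unit(units[0])
    for name in units[1:]:
        nxt, rate = load_unit(name)
        n = int(rate * fade_ms / 1000)        # crossfade length in samples
        ramp = np.linspace(0.0, 1.0, n)
        out[-n:] = out[-n:] * (1 - ramp) + nxt[:n] * ramp  # blend the seam
        out = np.concatenate([out, nxt[n:]])
    return out.astype(np.int16), rate

# "hello" as a sequence of diphone-style units:
samples, rate = concatenate(["h-e", "e-l", "l-o"])
```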
-
The whole point is that you don't need voice acting. It's all synthesized, without any recording.
I mean, we have super high powered video cards to emulate the most complex sense that humans possess. Why not speech? Sure, initially you might need a dozen or so voices, but you could process that into an infinite number. It only needs to be done ONCE, and perhaps not at all given all the knowledge we possess already. -
Interesting idea, but a very narrow niche, since most games nowadays are focused more on graphical eye candy and twitch-based gameplay than on visual-novel-style games.
-
You don't need hardware-accelerated speech because there isn't much to accelerate; it takes a few minutes to convert ~500 pages of text to mp3 using TextAloud.
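For example, a sketch of the same batch job with the free pyttsx3 library, which just drives whatever speech engine the OS provides (input and output file names are assumed):

```python
# Batch text-to-speech with pyttsx3 (wraps SAPI5 / NSSpeechSynthesizer /
# espeak depending on the OS). Install with: pip install pyttsx3
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 170)      # speaking rate in words per minute

with open("book.txt") as f:          # assumed input file
    chapters = f.read().split("\n\n")

for i, chapter in enumerate(chapters):
    engine.save_to_file(chapter, f"chapter_{i:03d}.wav")
engine.runAndWait()                  # renders everything queued above
```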
-
The first time I heard about this I was like you... "What a scam, never gonna work", etc. Then a friend bought one (a Voodoo2, I believe, so that was probably around 1998) and I watched him play Unreal (yes, before Unreal Tournament existed) and my mouth dropped to the floor. Amazing stuff; you very rarely get to witness huge performance jumps like that in the tech world.
As far as the speech engine goes, I think it's a cool idea! With compression being what it is, you could easily have 100+ voices record all the syllables you need and store them on the card itself. Unfortunately, I don't think the demand is there.
Now what I would like to see is a serious piece of hardware focused on speech recognition. It's been years since the first speech-enabled apps and OSes, and it still sucks. If we could get a dedicated speech analysis card that could understand my Chicago/Southern US accent blend, that would be great. -
such tech would take about 5-10 years to trickle down to the consumer level from when it's conceived. i don't think i'll even be interested in video games by then.
-
I guess people are missing my point though. The whole idea is to circumvent the need for voice acting: create a whole on-the-fly digital re-creation of voices based on algorithms. We know enough about speech and wave patterns that it seems it would just take some dedicated engineers a couple of years to make it reasonable. Sure, you may need a few dozen voices up front reading several hundred specific words or phrases to establish your parameters. But after that, all done.
Then you'd just need a slick interface for identifying the type of voice you want. I'm being inventive here. Just trying to have some fun.
Think outside the box, and positively, people! -
i don't know when the technology for real-time 3d rendering came about, but i think it was a good few years before the likes of quake.
edit: also, language is highly illogical and dynamic. many words and phrases are born each day; i suppose this could be fixed by updating the system every few hours, whereas the first problem could only be fixed by either documenting every instance of irregular grammar or enforcing lojban on the world population. tbh i like the idea of lojban enforcement. -
Ok, I'm sorry if I didn't get the point; if so, just ignore the rest of my post lol
I'm no expert on speech technology, but computerized voices tend to be lacking in tone and dialect when compared to actual voices (mind you, the last time I fiddled with such software was over 5 years ago).
The thing with going from written text to speech (is that what you're suggesting?) is that the same text can be read 50 different ways. Emphasis on certain words would have to be branched into different trees, as would intonation, dialect, accents, and such. It seems like a hefty amount of work and would require a lot of investment to create a database large enough to compare to a voice actor's range of speech.
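For what it's worth, the usual way engines expose those branches is markup rather than hand-built trees: SSML lets you tag emphasis, rate, and pitch per phrase. Here's a sketch of generating a few readings of one line (how much SSML a given engine honors varies, so treat this as illustrative):

```python
# Generating prosody variants of one line as SSML (the W3C Speech
# Synthesis Markup Language). Engine support for these tags varies.

def ssml_variant(text, stressed_word=None, rate="medium", pitch="medium"):
    if stressed_word:
        text = text.replace(
            stressed_word,
            f"<emphasis level='strong'>{stressed_word}</emphasis>")
    return (f"<speak><prosody rate='{rate}' pitch='{pitch}'>"
            f"{text}</prosody></speak>")

line = "You were supposed to guard the gate"
print(ssml_variant(line, stressed_word="You"))       # accusing the listener
print(ssml_variant(line, stressed_word="gate"))      # correcting the location
print(ssml_variant(line, rate="slow", pitch="low"))  # weary delivery
```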
Once again, sorry if I missed the point >.< -
I dunno. I think it's an interesting concept. It just seems like if we're able to reproduce a visual environment in a 3D world, which is ultimately very complex, then speech should be relatively simple in comparison. -
I think this is an awesome idea! With 6-core CPUs soon to be released as standard, and 512 shaders on the GPU, there will be room for more acceleration of things like voice. The potential for such a technology is enormous, especially if a speech AI engine is developed to speak non-pre-programmed lines based on the situation.
I'm surprised so many users can't see the potential for such a technology. Everybody mentions how long it would take to develop and how text-to-speech doesn't sound realistic, and that is completely ignorant. Given the right amount of time and funding, our modern-day programming talent could make it happen.
Already, the highest-end synthetic text-to-speech generators are impossible to discern from a real voice. -
masterchef341 The guy from The Notebook
OK - a few points:
Whether the speech engine is done in hardware or software is totally an implementation detail. That doesn't really matter. A speech engine could exist in a game *today* if the developer wanted to put it there, and current speech synthesis engines can run in real time on modern processors, so we wouldn't necessarily need dedicated hardware for it. Even if we did, it's an implementation detail. This is in contrast to something like dynamic physics, which gets very expensive quickly as the number of interactions between objects grows. In conversational speech, generally just one person is talking at a time.
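To put rough numbers on that contrast, here's a back-of-the-envelope sketch (the object count is arbitrary):

```python
# Back-of-the-envelope: why physics begs for dedicated hardware and
# conversational speech doesn't. Naive collision detection tests every
# pair of objects every frame; dialogue is one audio stream.
objects = 2000
pair_checks = objects * (objects - 1) // 2
print(f"{pair_checks:,} pair checks per frame")        # 1,999,000

samples_per_frame = 44100 // 60                        # one 44.1 kHz voice
print(f"~{samples_per_frame} audio samples per frame") # ~735
```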
Even the highest end synthetic text to speech generators all sound very robotic, and they are all easily discernible from human speech. eMike09, I would be very interested if you could show me an example otherwise.
This is the kicker - discernibility is not a direct measure of quality. Synthesized speech doesn't have to map 1:1 with reality for it to be received well. However, synthesized speech engines tend to produce *annoying* results because of how particular people are about sound. If it is just a slight bit off, people will notice and it will bother them.
Of note, this is also true about animation. If the animation looks awkward, it will stick with you and you won't be able to let it go nearly as easily as, say, a 3d model. You could animate a blob character well, and it would be better received than a lifelike human character animated poorly.
The point here being that human (organic) animation is largely migrating to being pre-recorded by actors (motion capture), because that technology is becoming more available to developers and produces high quality results, where a programmatic animation cannot (yet) compete, and manipulating the animation by hand just takes way too much time. Programmatic animation is used for certain things though, like particle effects (fire, smoke, etc) and other inorganic animations (robot characters, space ships, cars, wheels, stuff like that).
Voice is organic sound. Until someone comes up with the algorithms to make life-like, animated, emotion-bearing, non-annoying voice, we will have to use voice actors to maintain a high level of quality. You could use text to speech and get more content, but the quality would be a lot lower if you were to do that today. Think of synthesized instruments versus the real thing. There is just so much fine detail in a violin, a piano, or a human speaking that it takes a lot of effort to quantify what it is. Until we can do that, we won't have a high quality solution.
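To make the instrument comparison concrete, here's a toy additive-synthesis sketch (the harmonic weights are made up): a few sines give you a recognizably pitched tone, but none of the bow noise, vibrato drift, or body resonance that make the real thing:

```python
# Toy additive synthesis: sum a handful of sine harmonics with a crude
# decay envelope. Recognizably a pitched tone, but all the fine detail
# of a real instrument (or a real voice) is absent.
import numpy as np

rate = 44100
t = np.arange(rate) / rate                     # 1 second of samples
f0 = 440.0                                     # A4
amps = [1.0, 0.5, 0.35, 0.2, 0.1]              # made-up harmonic weights
tone = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
           for k, a in enumerate(amps))
tone *= np.exp(-1.5 * t)                       # crude decay envelope
tone /= np.abs(tone).max()                     # normalize to [-1, 1]
```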
In a few years, that all may change. Who knows. -
Before 3D gaming graphics hit the consumer market, it was thought to be something that would only be set up in a server farm, rendering one frame every hour or something silly like that, and for CGI or animation only. Sure, the initial graphics left a bit to be desired compared with today's, but it sure advanced gaming quickly and dramatically. I was a nay-sayer to 3D gaming graphics at the time too. But I became a believer quickly. Speech synthesis may not be as dramatic, probably more like PhysX, but in time it will become a commonality.
I'm not asking the computer to have an AI at all. Just on the programming end, they set the parameters for the voice how they want it and insert the text. Sure, there may need to be some manual manipulation to get what you want, but it could be done so much more quickly and less expensively: a part-time programmer versus dozens of voice actors at who knows how much cost. Each sentence would have to be massaged so it sounded right, but an expert could probably do it very quickly. Plus you could add dialog later without having to bring in voice actors again and again.
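You can already get a taste of that workflow with free libraries. Here's a sketch using pyttsx3 (which voices you get depends on what the OS has installed):

```python
# Set the voice parameters once, then insert the script text.
# Available voices depend on the OS speech engines installed.
import pyttsx3

engine = pyttsx3.init()
voices = engine.getProperty("voices")
engine.setProperty("voice", voices[0].id)  # pick one from the installed list
engine.setProperty("rate", 150)            # slow it down a touch
engine.setProperty("volume", 0.9)

script = [
    "Halt! Who goes there?",
    "Oh. It's you again.",
]
for line in script:
    engine.say(line)
engine.runAndWait()
```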
Visit here. This is pretty primitive, but select Audrey UK English. Incredible IMHO. No, it's not completely natural, but it sounds pretty darn good:
http://www2.research.att.com/~ttsweb/tts/demo.php
Make sure to type in your own text. -
masterchef341 The guy from The Notebook
The idea is great - don't get me wrong. In fact, it has been thought of before. There is a ton of research being done in speech synthesis from an algorithms standpoint.
All I am saying is that this is not a well-solved problem. And by well-solved, I mean that there is no solution that produces voice as true to life or as appealing as a human voice, nor is there a solution that is exceptionally close. Perhaps equally important, there aren't current solutions in place to handle the nuances of language - minor differences can make drastic changes in meaning and interpretation.
That will all need to be worked out before it can be implemented in games.
Again, when there is a solution for this, it will probably run just fine on the CPU. -
Yeah, the issue is producing an organic-sounding voice. Currently it's very feasible to turn text into speech; the hard part is turning text into speech with all the possible variations the human voice can produce. Intonation, dialect, accents, and emphasis would all need to be coded into trees, which would take time and more complex algorithms to "digitize" the whole concept of the human voice.
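A crude illustration of what "coded into trees" might mean in practice; all the categories and values here are invented:

```python
# Invented example: branch on dialogue context to pick a delivery.
# A real engine would need a tree like this for every nuance of speech.

def pick_delivery(context):
    if context.get("sarcastic"):
        return {"pitch": "high", "rate": "slow", "emphasis": "exaggerated"}
    if context.get("urgent"):
        return {"pitch": "high", "rate": "fast", "emphasis": "strong"}
    if context.get("accent") == "scottish":
        return {"pitch": "medium", "rate": "medium", "lexicon": "sco"}
    return {"pitch": "medium", "rate": "medium", "emphasis": "none"}

print(pick_delivery({"urgent": True}))
# {'pitch': 'high', 'rate': 'fast', 'emphasis': 'strong'}
```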
-
mobius1aic Notebook Deity NBR Reviewer
Even if you could develop something like a speech engine, such a massively procedural system would need very complex AI and/or scripting in order to bring about the most appropriate responses in an NPC character. Then you have the issue of making the synthesized voice "actor" properly enunciate words with the proper feeling at the proper moment. It's very complex to think about, and in the end all the work could be done much more quickly and cost-effectively by a human being who, despite the many takes possibly needed to get it right, can range their voice and personality across an infinite number of characters. Voice modulation via a sound board gives even more freedom. Voice acting is an art (when done right, of course). What you want is, like previously said, purely procedural and way too complex to be practical. However, I can see it being an important building block of an AI made to truly emulate human emotion.
We can quite easily put physics and graphics into relatively easy-to-understand mathematical terms/equations/systems; however, the infinite breadth of human emotion, and of our voices in that respect, is absolutely unfathomable to put into an easy general theory. Psychology tries to do this, but in the end we can only evaluate, not prove. I hope that puts some perspective on what you want to do. -
Lol, the AT&T speech generator is like 15 years old.
Here is a newer TTS generator that renders the text like it was calculating 1+1, and it sounds very decent. It understands dialects from 25 different languages, pronounces punctuation, and has a lot of the tonality of a human voice. It doesn't have emotion; that would have to be programmed separately. The online version's capabilities are extremely limited and of poor quality compared to their paid product. Imagine dedicating hardware specifically to rendering voice.
http://www.acapela-group.com/text-to-speech-interactive-demo.html
There are generators much better than the one I just linked to, but I can't seem to find it any more; it's a professional pay app. To say that accomplishing what htwingnut has envisioned is impossible or too far-fetched to become a reality is about on par with Bill Gates saying we would never need more than 640KB of conventional memory (which he never really said anyway).
It makes sense to start looking into the development of a TTS generator with high enough quality to at least be passable. Massive open-world games are going to become far more massive. There will soon be a point where voice acting hundreds of thousands of phrases will just not be feasible. -
masterchef341 The guy from The Notebook
And that is why there isn't hardware-accelerated speech. -
It's definitely possible because, like I said, 5 years ago there was already relatively OK text-to-speech software available.
The difficult part that they're still working on is producing an organic-sounding voice with all the possible variations associated with it. Lots of these things aren't necessarily mathematically "rendered" yet, and they would need to be in order to computerize a full-on voice and make it sound natural. Sarcasm, accents, the ability to recognize and pronounce slang, word emphasis, and such: these things would all need to be turned into a logical algorithm of some form before they can be read and rendered by a computer.
I personally like good voice acting, and I don't think voice actors will be out of a job any time soon, but the technology to produce a natural voice is certainly getting near.