Nvidia GPU Failures Caused By Material Problem, Sources Claim
5:40 AM - August 26, 2008 by Wolfgang Gruener
Source: Tom's Hardware US Category : Miscellaneous
Chicago (IL) - When Nvidia announced in early July that it had noticed a higher than normal failure rate in some of its notebook chips, investors reacted with concern, sending the company stock down 22%. The stock recovered after Nvidia apparently demonstrated good control of the issue and took a one-time charge of almost $200 million. But what seems to be a closed chapter and a black eye for the company could be a much more serious problem that is just taking off. Several industry sources confirmed to TG Daily what has been reported by some publications for some time: In contrast to Nvidia's claims that only a limited number of GPUs are affected, sources indicated that "most" recent Nvidia GPUs carry the problem and a chance of failure, pushing the potential damage into stratospheric regions.
We have been chasing the Nvidia GPU problem for quite some time, trying to shed more light on an issue Nvidia refuses to release any meaningful information on other than the statement that a limited number of notebook GPUs is affected. Charlie Demerjian from The Inquirer has been reporting for some time that Nvidia's problem may be much larger than the company admits. Demerjian wrote that, in addition to currently repaired notebooks, G84 and G86 GPUs may show failures and even G92 and G94 chips could be affected. After several weeks of digging, it seems that Demerjian's claims may not be as far from the truth as some have claimed. There is a lot of speculation in the market, fueled by Nvidia's decision not to reveal any details about the source of the problem. But the general consensus across industry sources we talked to is that a material problem may be the reason for the trouble and, depending on whom you believe, between 15 and 75 million GPUs could be affected.
According to our sources, the failures are caused by a solder bump that connects the I/O termination of the silicon chip to the pad on the substrate. In Nvidia's GPUs, this solder bump is created using high-lead solder. A thermal mismatch between the chip and the substrate has substantially grown in recent chip generations, apparently leading to fatigue cracking. Add into the equation a growing chip size (double the chip dimension, quadruple the stress on the bump) as well as generally hotter chips and you may have the perfect storm to take high-lead solder beyond its limits. Apparently, problems arise at what Nvidia claims to be "extreme temperatures" and what we hear may be temperatures not too much above 70 degrees Celsius.
Supporting the theory that a high-lead solder bump is in fact at fault, Nvidia ordered an immediate switch to eutectic solders instead of high-lead versions in the last week of July. Eutectic solders are believed to solve the problem of fatigue cracking. This material is often chosen in such cases as chip designers already have experience with it. Further out in the future, chip designers will have to consider RoHS exclusions and a transition to lead-free bumps using materials such as tin-silver. We are speculating here, but a sudden switch of the material could bring additional problems for Nvidia, as such a material switch involving electro-migration requires substantial design work and testing. At a minimum, Nvidia would have to review its power delivery to the chip to avoid high-current bumps. We were not able to obtain any information on whether this has been done or not.
As far as we are told, ATI has been using eutectic solders for some time and appears not to be experiencing a similar problem. However, Nvidia's sudden switch to eutectic solders may have limited the availability of the material, impacting AMD production and putting actual chip fabs in the middle. There are questions why Nvidia may have missed potential high-lead issues - and may have missed them for quite some time. There is no doubt that all Nvidia chips were tested according to JEDEC rules. Only Nvidia knows why this issue, if high-lead is actually the problem, slipped through.
If we assume for a moment that high-lead is the cause, then there is this question: Which chips are affected, and are only notebook GPUs affected? According to our sources, both desktop chips and notebook chips are affected, but the issue is most likely to pop up in notebook chips due to the increased material constraints amplified by the turning on-and-off procedures. We heard that G84, G86 and G92 GPUs could show failures, but we were not able to confirm G94s. Technically, Nvidia would have to replace all those GPUs and the total number is somewhere north of 70 million. But since the issue tends to show up only in notebooks, it is unlikely that there will be any desktop replacements and therefore we are looking at a number closer to 15 million (notebook) GPUs. Take into account that the repair of such a notebook will cost Nvidia at least $150-$250 and you have damage that could easily be in the billions of dollars.
At this time we only know that Nvidia has made a switch from high-lead to eutectic; everything else is speculation as long as it is not confirmed by Nvidia. However, the level of detail in the information relating to the material switch is surprising and lends a certain credibility to these sources.
The other question, of course, is how often and in which cases those GPUs actually fail. If Nvidia is right and there are in fact low failure rates, then the $200 million that was allocated to repair affected notebooks should be appropriate. If we assume that Nvidia pays about $200 per repair and that 100% of the potential damage is in the neighborhood of $3 billion, then Nvidia's $200 million allocation suggests that substantially less than 10% of (notebook) GPUs are showing failures.
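The arithmetic behind these figures can be checked with a rough sketch. The inputs are the article's own round numbers (15 million notebook GPUs, about $200 per repair, a charge of almost $200 million rounded up to $200 million); the function and variable names are made up for illustration, and none of this reflects confirmed Nvidia data.

```python
# Back-of-the-envelope check of the article's estimates.
# All inputs are the article's round figures, not confirmed numbers.
NOTEBOOK_GPUS = 15_000_000       # estimated affected notebook GPUs
COST_PER_REPAIR_USD = 200        # assumed cost per notebook repair
CHARGE_USD = 200_000_000         # one-time charge, rounded to $200 million


def total_exposure(units: int, cost_per_unit: int) -> int:
    """Cost if every affected unit had to be repaired."""
    return units * cost_per_unit


def implied_failure_rate(charge: int, cost_per_unit: int, units: int) -> float:
    """Share of units the charge would cover at the given repair cost."""
    repairs_covered = charge / cost_per_unit
    return repairs_covered / units


exposure = total_exposure(NOTEBOOK_GPUS, COST_PER_REPAIR_USD)
repairs_covered = CHARGE_USD // COST_PER_REPAIR_USD
rate = implied_failure_rate(CHARGE_USD, COST_PER_REPAIR_USD, NOTEBOOK_GPUS)

print(f"Worst-case exposure: ${exposure:,}")
print(f"Repairs covered by the charge: {repairs_covered:,}")
print(f"Implied failure rate: {rate:.1%}")
```

Under these assumptions the charge covers about one million repairs, or roughly one in fifteen of the estimated notebook GPUs, which is consistent with the "substantially less than 10%" reading. Changing the repair cost within the $150-$250 range quoted earlier shifts the implied rate but keeps it under 10%.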
A big problem would arise if failure rates are in fact higher than expected and Nvidia is trying to contain the problem by playing it down to avoid a massive recall that could inflict a lot of damage on the company's finances: $3 billion is almost twice what Nvidia currently has in the bank.
So, what does this mean to you? Obviously, only Nvidia knows how serious the problem really is and there is virtually no way of telling whether your Nvidia-based notebook with an affected GPU will show failures or not, as this will depend on the temperatures the GPU will reach. If it shows failures, however, you should contact your vendor and ask for a replacement, provided you are still covered by a warranty.
-
This is an excellent update to a problem that I think many of us have been worrying about. While it is not yet confirmed for us what the exact problem and its parameters are, a couple things stand out:
One, that the failure can be provoked by temperatures as low as 70c, but that Nvidia evidently changed the manufacturing process in July this year.
Two, that some 15 million notebooks may be affected, and that somehow, desktops are basically dismissed as deserving any help in this issue (how interesting, ignore the larger share of cards and focus on the smaller).
Then of course there is the advice given by the article:
"you should contact your vendor and ask for a replacement, provided you are still covered by a warranty."
As many of us are aware, it is just not that simple. My notebook is currently idling daily in the 50-60c range and I have gone well into the 90c range playing the Crysis demo (good thing it doesn't run that great, I would never get anything done). The only hope I have at the moment is that my notebook, built in August, used a GPU that benefits from the new soldering process. I would tend to doubt that since I am sure it is possible that the parts bin had not yet been updated. I suppose the only way to know for sure would be to take this thing apart and look for a manufacturing date.
While it is good to receive additional information, I still have to stand by and wait for some better resolution to become available to consumers affected by this issue. I don't know about you guys, but I still want a lifetime warranty on the GPU or a new part sent my way ASAP. This thing goes down, and my school life is going to start to suck! -
Has Dell officially announced what they plan to do? If this issue is indeed going to affect most GPU's and knowing that 70 degrees is a possible trigger, I see now why Nvidia is keeping their mouth shut! Say nothing, push this out past warranty option, and they save their business. Dell had better not be so foolish, class action suits will and HAVE developed (for the 1330) and if not today, soon. This idea of a one year GPU warranty is bullshlt..if its going to fail (as 70 degrees is nothing!), it needs to be fixed..end of story.
-
I think it's a bit meaningless to mention a specific temperature - especially one as low as 70°C. That's gonna have peeps going nuts for no reason. On a normal non-defective GPU that's not much at all. This is the first thing to mention any kind of overheating as being a cause - the main cause has always been reported as thermal cycling - cooling and warming (and subsequent contracting and expanding) that weakens metal like bending a fork backwards and forwards until it snaps easily.
And Nvidia have already said in their latest statement that the number of chips could be huge so that aint news and up for speculation.
I think for the most part it's correct - this is a major manufacturing defect that may affect all chips manufactured until the change in bump material. -
Jonesy, I think that you slightly misread what the temperature reading implies. What it was saying is that the soldering point is most likely unaffected below that temperature, meaning that users like me probably will not provoke failure until I start getting my temps beyond that. Say, with an hour or so of Crysis a day.
It sounds to me like the case is basically closed. This is a huge problem for Nvidia and so it is also a huge problem for anyone who has used their parts in recent months / years probably. What this amounts to is defective parts making their way into end products for too long, and now its either cover up or fess up. All Dell has offered so far is an extended GPU warranty of 1 year. That is bullpoo for people that only have a 1 year, and bullpoo for anyone with a 3 year. In my opinion, that is basically asking me to keep a broken computer that I paid full price for. Lame. -
This probably means NVIDIA isn't going to recall its chips, as they don't really pose a safety danger. More than likely, it will replace chips as they fail, but won't replace them just because.
Since several millions of those chips are probably in the hands of normal consumers who won't do anything more graphically intensive than watching an HD movie, it would make no sense to replace those chips which have little to no risk of getting temps high enough for the fatigue cracking to be an issue. -
Lol - Nvidia won't recall all the chips because it would cost twice as much as the company's worth
-
I don't care if typical users fall out of the risk envelope. I still would like to hear the words 'lifetime warranty' come out of Nvidia or Dell's mouths. This means that defective GPU's will be replaced, as well as the components that are destroyed when and if it fails.
Nvidia Update: Problems Not Over, failure around 70 degrees
Discussion in 'Dell' started by mark500, Sep 3, 2008.