Hello everyone.
I made this thread with the hope that I would learn more about what caused my GTX 980M to fail after ~2 years of moderate/heavy gaming and academic use. Pictures can be found here.
As far as I can tell, the front and back of the card looked normal. R47 and R22 had some small deformations on the surface that were a bit hard to see. I believe R22 is some sort of VRM/inductor, but I have no clue what R47 is. I noticed that the power supply's tip was discolored after the failure, but I'm not certain why it happened or how it was related to the failure.
TL;DR: My GTX 980M failed despite my best effort to keep my P750ZM running cool. I'm not sure if the thing shorted, overheated, had an on-board temperature sensor failure, failed as a result of something else failing, or simply just reached the end of its life. No component was OCed.
EDIT 8, 04/14/2020: I started looking into GPU repair recently, and a lot of the things I didn't understand back when I wrote this started to clear up after I spent some time reading about how the MOSFETs used in GPUs work. Look up N-channel enhancement-type MOSFETs on Wikipedia. If you suspect one of your 980M's MOSFETs bit the dust without it or something nearby exploding, get a multimeter and check the resistance between T_G and V_SW. Replace the one(s) with only a couple of Ohms of resistance between these 2 pins, or maybe replace all of the MOSFETs to reduce the likelihood of failure in the near future. Some MOSFETs are like $3 shipped per chip, so do what's best for you.
EDIT 7: Added new MOSFETs to the card. It's working y'all. Pictures here.
EDIT 6: Rework station arrived. Shorted MOSFET identified. Pictures here.
EDIT 5: Row of 6 capacitors shorted to the power input pad (right one). Requested this from Texas Instruments (free with .edu email extensions) along with hot air rework station + flux + solder.
EDIT 4: Measured the resistances of the black capacitors between the R22 coils (core) and the 2 adjacent to the one on the right (memory). Got 9.5 Ohms for the row of 6 and 24.0 Ohms for the row of 2. Remeasured resistance across the power input pins and got 9.5 Ohms. Possible correlation between resistances of core capacitors and the short.
EDIT 3: @Khenglish suggested measuring resistance across power input pins. If >2kOhm then something other than the power FET broke. Measured 7 Ohms. Picture here.
EDIT 2: Added picture of backplate with thermal pad to the album.
EDIT 1: R47 and R22 in the pictures had a resistance of about 1.5 Ohms as measured with the red Centech digital multimeter - same as the adjacent clean-looking R22.
Background: the 8 GB GTX 980M card with copper backplate (removed to take pictures) came with a used Clevo P750ZM I bought around October 2015. The laptop was originally purchased from RJTech, and the previous owner used it mainly for software development. Other than coil whine at high FPS, the machine was fine for the most part with VSync enabled. I checked the GPU temperature with HWiNFO64 and MSI Afterburner every so often, and I never saw it go above 80 °C. The laptop itself was always cooled by a home-made cooling pad with four 120 mm ~1800 RPM case fans, and I normally had the internal fans on max speed whenever I played something demanding. Vents and fans were cleaned every 3-5 months, and the GPU vent wasn't blocked when the card failed. In short, I think I took decent care of the laptop given how much it cost me when my salary was $0. I got a few IRQL_not_less_or_equal BSODs related to the touchpad driver here and there, but that was about it for unusual behavior. I didn't know about ThrottleStop before the failure, so the CPU was running at stock voltage, in case anyone thinks a power supply issue was involved. I bought the laptop hoping it would last 5 years or more, so I avoided OCing any component.
On average, I gamed 1-2 hours each day for the first ~1.5 years and 4+ hours for a couple of months leading up to the failure. The laptop itself was kept on for ~6-12 hours daily. For games like DOOM, BF1, GTA V, and Witcher 3, I lowered the texture and lighting settings so that the GPU temperature stayed decent at 60 FPS while the graphics still looked better than classic RuneScape. Extraneous settings like bloom, blur, and AA were turned off entirely. Even with those precautions taken to control the thermal behavior, the failure occurred while I was walking around looting things in Witcher 3, ~2-3 hours into the session.
The Failure: the screen turned black. The MSI Afterburner overlay was active at the time, but I didn't check it before the crash. The laptop turned off mid-game without any warning sign. Gameplay had been smooth for the most part, and there wasn't any cue (e.g. freezing, audio distortion, micro-stuttering, MSI overlay readout, etc.) to suspect that the CPU or the GPU was running too hot. I attempted to turn on the laptop, and the power supply (Chicony 230W) started clicking while its indicator light flickered. Both the battery and the power indicator LEDs remained amber while the laptop was plugged in. Pressing the power button made the power indicator briefly turn green and then amber again (along with the clicking in the power supply). The fans did not turn on. Holding down the power button long enough turned the power supply's indicator light off completely, and the clicking stopped. Re-plugging the power supply turned the indicator light on again, and the same scenario repeated when the power button was pressed while the laptop was plugged in. I noticed that the area around the right speaker (which is directly above the exhaust for the GPU fan), the power supply tip, and the power supply itself were all really hot to the touch. I measured the temperature with an infrared thermometer and got readings of around 45-50 °C for those regions ~5-10 minutes after the failure. That was when I noticed the discoloration on the tip.
I then removed the bottom panel and checked for anything unusual. Everything visible looked fine (i.e. no exploded components or charred regions). The power adapter on the motherboard showed no sign of shorting or melting. There wasn't any "burnt-plastic" or solder odor. RAM sticks all looked fine. Battery wasn't hot. I tried holding down the power button with the battery removed for over half a minute before plugging the power supply in, both with and without the battery, but the clicking persisted. NVRAM reset didn't work. The failure occurred at midnight on a Saturday, so I decided to let the laptop sit unplugged without the battery and check it again early on Sunday morning. The problem persisted, so I sent RJTech an RMA request, which they promptly granted on Monday.
The Aftermath: I had suspicions that the graphics card might be the cause of the failure to POST, but I decided that it was best to send the laptop to RJTech for them to evaluate the extent of the damage. I figured that I wouldn't be able to do much even if I removed the heatsink to check for damage, and I was too busy with work to buy an MXM card and do the diagnostics myself. Upon receiving the laptop, technical support noticed some unidentified liquid on the VRAM chips (which I believed to be thermal pad oil) and sent me 2 pictures with the affected chips circled. I asked them to check if the motherboard was still functional with a new GTX 980M installed, and after some stress testing they confirmed that the other components had survived. I confirmed with later testing that the power supply was still functional (enough to sustain the CPU under heavy load at least), although I never tried to push it into the 200W+ regime. While I'm grateful that RJTech accommodated my request for additional testing with a functional card, I decided to get the broken laptop back. I was uncertain about the reliability of my P750ZM at the time, so both getting a new card to restore the P750ZM to pre-failure performance and getting the broken card refurbished by Clevo were out of the question for me.
Now: I found a surprisingly cheap Sager NP9870-S originally from Xotic-PC up for sale on Craigslist of all places. The thing has 980M in SLI, so at least I am comforted by the fact that if 1 card fails, there's always another one inside. It is also nice to know that in the event of both cards failing at once, I can always build a PE4C eGPU setup like what @bloodhawk did with his P870DM here. Call me paranoid, but I already got a PE4C v4.1 and power supply just in case. I also upgraded my cooling pad with 4 of these. As for the P750ZM, I grabbed a GTX 765M and brought it back to life with that. I installed ThrottleStop on both machines and spent a while lowering the voltages as far as possible.
I still want to know what exactly went wrong with the failed GTX 980M so that I can hopefully prevent future failures. I've been looking around to see if anyone else has posted something similar about an MXM graphics card causing the power supply to short itself while still leaving the other components unharmed. There's no schematic for the card floating around, so I hope people with intimate knowledge of the board will be able to help. I'll gladly provide close-up pictures to the best of my ability.
-
Is that the original power supply that came with the laptop?
-
I believe the pads are from Fujipoly, given how they looked. -
I think you already have a very good idea of what went wrong; also, the card might still be alive. As you already noticed, the inductors are fine, with only some surface damage that really doesn't mean anything. The thing that "killed" it was the VRAM shorting. Your power supply acted like a classic power supply that refuses to power on a shorted system. If you clean the card with isopropyl alcohol and replace some VRAM chips (you can buy them on eBay for around 3 USD each), then your card is back on track. If you're really, really lucky, an isopropyl bath alone might even "fix" the card.
I think you realize yourself that this was very likely caused by the cheap thermal pads you were using. So you might want to consider buying high-quality ones in the future (Thermal Grizzly Minus Pad, for instance). -
I'll look around in my local area (SoCal) to see if there's any computer repair shop offering ultrasonic cleaning service. Given how the card looked, I think you may be correct that the majority of the card itself is still intact. -
Every substance is more or less conductive. For instance, distilled water has very poor conductivity while saltwater is far more conductive. Since the oil soaked up dirt and other substances, it's not that unlikely that it caused some issues, so it doesn't really matter whether or not the oil itself is conductive. -
woodzstack Alezka Computers , Official Clevo reseller.
Seems fishy to me, honestly.
I don't know how your coils took physical damage either, because that's just not really possible. If you never touched that card, then either whoever put it in when upgrading did it, or it came stock like that, which would mean it's a defect. Because I doubt it came stock like that, or that you put it in and somehow damaged it long ago, and given that the serial is ripped off, I am suggesting someone replaced your "alive" card with a dead one. If someone were to touch my serial numbers, that's the first thought I'd have. Why would it get removed, and even show signs of being ripped off? The heatsink doesn't even touch there, the laptop doesn't make contact with it there, there's no reason, honestly. All the serial can do is help you RMA the card, get warranty service, and identify it. Since ALL of those apply to this card currently, there is even less chance you'd touch it.
Unless I'm missing something here, or just call me paranoid. Those are my thoughts. I think the RAM was resoldered on, or it's some sort of oil from a broken component, or foul play because the serial is missing. I do not know what caused your card's death. It's not plausible that the oil or damage was there on a new card installed by your seller, if it was new, so no ideas.
Last edited by a moderator: Nov 12, 2017 -
idk about the serial number tbh. -
Given the amount of oil covering the R22 coils + VRAM chips, it has got to be from the thermal pads. Even Fujipoly themselves admit here in the Warranty Statement that silicone oil can leach from their products. I spent some time looking at the components up close, and I couldn't find anything big enough to hold or leak that much liquid.
How R22 and R47 came to look damaged is beyond me as well. Not sure if this has any connection with the coil whine issue I mentioned in the original post. There wasn't anything sharp on the old pads that would have caused such damage. The temperatures looked good, so I never bothered to do a repaste. I just checked the original purchase invoice that the seller sent me, and I noticed that the P750ZM was purchased as a barebone with the 980M installed. Hey @win32asmguy, do you remember seeing anything strange on the 980M when you installed the 4790K?
I considered the Clevo RMA option, but given RJTech's estimate of 4-6 weeks + ~$400 fee + 90(?) days warranty and my "what else would fail next" mentality at the time, I decided to proceed with the NP9870-S purchase and shelve the P750ZM. One can only tolerate so much eye strain and frustration on an 11" Chromebook. I originally planned to use the slave 980M from the NP9870-S to resurrect the P750ZM and sell the Sager to recoup losses. After looking at the bottom panel of the NP9870 long and hard, I conceded that it had a superior cooling solution to the P750ZM's and kept it for good. I thought about selling the P750ZM for parts, but I gave replacing the GPU with something cheaper a try, which fortunately happened to work. I figured I'd revisit the 980M when I had more time and a better understanding of what happened to the card. I've been looking at causes since August without much progress, probably because I focused too much on the shorting + power supply rather than the contaminated oil on the pads. This is why this thread was made.
Last edited: Nov 12, 2017 -
woodzstack Alezka Computers , Official Clevo reseller.
Yes I do think so.
Have some rep too for being on the forum, and welcome to NBR! -
Honestly I don't see anything really wrong with the card. Thermal pads can leave a bunch of liquid goo behind. I don't think anything you're looking at has anything to do with the failure.
It looks like at some point someone may have scraped up the two inductors (the R22 is the inductor for the 3rd core phase, the R47 is the inductor for either the PCIe or 1.8V voltages). That will have zero effect on their performance though. They are not metal coils internally, but a solid block of bonded iron powder.
You probably had a power FET blow. Usually that's the only failure that can cause a lot of heat to be generated. Power FETs can blow and not look like it. The voltage on anything else like memory is too low to make much heat. Also, the laptop would still power up in that case.
Check the resistance between the two giant pins on one side of the MXM slot. This is the card's supply voltage. The resistance should be over 2k ohm. If a FET is dead you'll read 0. A blown cap can also cause a similar failure, but it is less likely. -
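For anyone following along, the supply-pin check described above can be sketched in a few lines of Python. This is only an illustration: the 2k ohm figure is the rule of thumb from the post, while the function name and the 1 ohm cutoff for a "dead short" are assumptions made up for this sketch, not values from any datasheet.

```python
def classify_supply_reading(ohms, healthy_min_ohms=2000.0, dead_short_max_ohms=1.0):
    """Bucket a multimeter reading taken across the two MXM supply pins.

    healthy_min_ohms is the "over 2k ohm" rule of thumb quoted in the thread.
    dead_short_max_ohms is an ASSUMED cutoff for "reads 0"; probe leads add a
    little resistance of their own, so adjust it for your own meter.
    """
    if ohms < 0:
        raise ValueError("resistance cannot be negative")
    if ohms > healthy_min_ohms:
        return "healthy"
    if ohms <= dead_short_max_ohms:
        return "dead short"
    return "suspiciously low"


# Example readings (the first two are made up for illustration).
for reading in (2500.0, 0.2, 7.0):
    print(f"{reading:>7.1f} ohm -> {classify_supply_reading(reading)}")
```

A reading in the middle bucket still suggests that something on the supply rail is conducting when it shouldn't; it just doesn't say which part.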
Checked. 7 Ohms, effectively shorted. How do I check which power FET was blown? All of the identical-looking ones between/near the R22 inductors had the same resistance across them (~7.7 Ohms).
Nothing unusual was noticed on the package either.
EDIT: Caps mistaken for MOSFETs. Ignore the ~7.7 Ohms measurements. See later posts. Last edited: Nov 13, 2017 -
woodzstack Alezka Computers , Official Clevo reseller.
Well, the MOSFETs are easy enough to replace; any electrical engineer should be able to do that for you.
-
7 ohms is definitely bad. The only way to find a short is to pull FETs one at a time until the short disappears. There are only 3 FETs for the core so there are not many to try.
7 is odd though. I would expect a blown FET to be 0. Measure the resistance across the big rows of caps. There is a row of 6 and a row of 2. They are black. The row of 6 is the core and any non-zero resistance is fine, even like .5 ohm. The row of 2 is the memory. Memory is usually between 10 and 50 ohms. Last edited: Nov 13, 2017 -
I measured the resistances of the caps that you mentioned again, and I got another set of values this time. I had mistaken them for MOSFETs. The row of 6 all measured ~9.5 Ohms, and I noticed the resistance measured across the power input pins is now the same (temperature effect? late evening vs. 6:00 AM?). The row of 2 measured around 24.0 Ohms.
I edited my previous post to indicate a mistake with the memory cap resistances. -
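As a quick sanity check, the readings above can be compared against the ranges quoted earlier in the thread (any non-zero value for the core row, roughly 10 to 50 ohms for the memory row). The sketch below is only an illustration; the function name and the "core"/"memory" labels are made up for this example.

```python
def check_cap_row(row, ohms):
    """Return True if a reading across a capacitor row fits the quoted range.

    row is "core" (the row of 6) or "memory" (the row of 2); both labels are
    hypothetical names chosen for this sketch.
    """
    if row == "core":
        # Any non-zero resistance is considered fine for the core rail.
        return ohms > 0
    if row == "memory":
        # Memory is usually between 10 and 50 ohms.
        return 10.0 <= ohms <= 50.0
    raise ValueError(f"unknown row: {row!r}")


# Readings reported in the thread.
readings = {"core": 9.5, "memory": 24.0}
for row, ohms in readings.items():
    verdict = "within" if check_cap_row(row, ohms) else "outside"
    print(f"{row}: {ohms} ohm is {verdict} the quoted range")
```

Both rows pass, which fits the later advice to look at the power FETs rather than the capacitors themselves.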
Other than the GPU core's power FETs there are only 2 components with connections between the GPU core voltage and the card's supply voltage. They are the VRM, and the phase driver for the GPU core's 3rd power phase. Both of these chips are on the backside near the top of the card, and the VRM is the bigger of the two. I've never seen these chips fail and short the GPU voltage and supply voltage together, but it's possible. If your VRM died I expect the GPU core to be fried. A working VRM can protect the core from overvoltage if a FET of the phase driver died, but if the short is in the VRM there's nothing to detect it.
I'd still first pull and check each power FET for the GPU core. They're the 3 big chips at the very top of the card. Last edited: Nov 13, 2017 -
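The "pull one FET at a time until the short disappears" approach can be illustrated with a toy model in Python. Everything numeric here is an assumption for the sake of the sketch: each FET and the rest of the board are treated as plain resistors in parallel across the supply rail, and the resistance values and FET labels are invented. A real board also has capacitors and other parts on the rail, so treat this as a way to see the logic, not as a simulation of the card.

```python
def parallel(resistances):
    """Equivalent resistance in ohms of resistors in parallel (all must be > 0)."""
    return 1.0 / sum(1.0 / r for r in resistances)


def find_shorted_fet(fet_ohms, rest_of_board_ohms=5000.0, healthy_min_ohms=2000.0):
    """Remove FETs one at a time until the rail reading looks healthy.

    fet_ohms maps a hypothetical FET label to a made-up supply-to-ground
    resistance. Returns (label, reading) for the FET whose removal cleared
    the short, or (None, reading) if no FET is implicated.
    """
    remaining = dict(fet_ohms)
    reading = parallel([rest_of_board_ohms, *remaining.values()])
    if reading > healthy_min_ohms:
        return None, reading  # nothing looks shorted to begin with
    for label in list(remaining):
        del remaining[label]  # "desolder" this FET and measure again
        reading = parallel([rest_of_board_ohms, *remaining.values()])
        if reading > healthy_min_ohms:
            return label, reading
    return None, reading  # short is elsewhere (e.g. VRM or phase driver)


culprit, reading = find_shorted_fet({"FET1": 1e6, "FET2": 1e6, "FET3": 7.0})
print(f"culprit: {culprit}, rail now reads about {reading:.0f} ohm")
```

The second return case mirrors the advice above: if all three core FETs are off the board and the rail still reads low, the remaining suspects are the VRM and the 3rd phase's driver.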
Here are images showing what's what assuming that you do read 0.
The core power FETs are boxed in red. One of them is probably dead.
If it's not a dead power FET, then it's either the VRM or the 3rd phase's driver (phases 1 and 2 are integrated with the VRM). -
I'll start looking for the replacement power FET later this evening.
I'm curious. Did I just so happen to have a bad 980M, or are the more recent Clevo cards bound to fail like mine eventually? -
The power FET has 87350D written on it. Found the product page from TI ( link). I still have my university email, so I requested 5 free samples from them. I think they'll arrive in a couple of days.
In the meantime I guess I'll start ordering equipment to desolder those FETs. -
You just need a hot air gun, solder flux, and a heat gun for it.
Remember that 2 of the FETs are still good, so just pull one at a time and check the card if it is ok. I recommend filling all 6 FET pads. For just getting the card working though the unused FET pad already has all the required solder and is easier to solder a FET onto than reusing the original pad. Just remember to follow the pin 1 arrows so you don't put the FET on backwards. -
Not sure how it go with other components, but do you think if I get away with just using the hot air gun to remove the FET directly with sufficient heating of the surrounding area?
With regard to the part selection, I think I'm going for this by virtue of the reviews + EEVBlog video of a similar device. Hopefully it'll work well enough that I won't have to return it. -
I'm not sure what you mean by "but do you think if I get away with just using the hot air gun to remove the FET directly with sufficient heating of the surrounding area". You only want to use a heat gun. You should not be using an iron at all. You blow hot air directly on the component and board to remove and place a new component. There is nothing in the area that is significantly temperature sensitive that can be damaged by the heat. -
I'll let you know the results. -
Hi dude, have you tried it? I'm trying to add 3 more MOSFETs to my GTX 980M (I have the same version) for better OC and more power. I am waiting for your feedback.
http://www.overclock.net/t/1622452/gtx980m-mxm-sli-ver-w-added-vrm-mosfet-w-paypal-cash Last edited: Nov 23, 2017 -
Happy Thanksgiving everyone. I'm back to provide an update on the progress of the repair.
TL;DR: 1 MOSFET is indeed shorted. This one had silicone oil on it where the V_SW and V_IN pins are supposed to be. Still need flux and wick to clean the pads before soldering the MOSFETs back on. Might take another week or so.
The W.E.P. 858D hot air rework station mentioned in one of my posts arrived on Wednesday. Popped it open and found that the thing wasn't put together haphazardly like some other 858D clones. Fuse's present and was hooked up correctly for the most part. Both the chassis and the metal casing on the heat gun were properly and securely grounded. There was a loose piece of broken plastic inside the heat gun case, and I'm not really sure where that came from. I guess it's a good thing I opened everything up to check. I was a bit worried about some strange magic smoke coming out of the heating element, but it turned out that I had a screw stuck in between the add-on tip and the heat gun's mouth.
Flux is due to arrive on Friday or Saturday, so I decided to start removing the MOSFETs and checking which one was shorted. The 858D didn't explode, which was nice. This was my first time working with surface mount components, so needless to say it took a lot of trial and error to remove all 3 MOSFETs, with the last one being the culprit of the short. Pictures are here.
Only 1 MOSFET has its V_IN and V_SW pins shorted to ground. Removing that one got rid of the short between the power pads altogether. The other 2 MOSFETs and the brand new ones did not have shorted pins, which is great I suppose. @Khenglish was right about 1 MOSFET being the issue. I also noticed that this shorted MOSFET had a noticeable amount of silicone oil on the package where the V_IN and V_SW pins are supposed to be. Could the factory default thermal pads be the culprit?
I think I'll clean up the pads and apply leaded solder to them before soldering the MOSFETs back. Not sure when the wick I ordered nearly 2 weeks ago from China is going to arrive.
-
Too bad Radioshack no longer exists for flux. You don't need very good flux for soldering FETs, so you'd just spend a couple bucks and not have to wait.
Btw, usually it's best to just get flux from a USA Ebay source. Amtech 4300 is usually the go to solder. Lacks nasty chemicals which sometimes show up in the China solders.
Don't try soldering without flux. Heat transfer from the FET to the PCB will be terrible, so you could easily overheat and kill the FET. -
I sure hope that I didn't damage the 2 functional FETs pulled from the PCB. I was still getting used to the hot air station while removing those.
EDIT: Apparently I can still request more of the CSD87350Q5D MOSFETs. I guess I don't have to worry about reusing the original FETs. Knowing how fast TI ships things, I think I'll resume the project on Monday or so. Last edited: Nov 24, 2017 -
Hello everyone.
Flux arrived on Friday as expected, but the wick didn't. Regardless, I decided to proceed anyway with makeshift wick from a spare composite video cable. The wick wasn't perfect, but it helped get rid of the extra solder on the center pads after I added leaded solder. I went through this to make soldering the MOSFETs easier, since leaded solder melts at a lower temperature than the lead-free solder on the board.
To the point: I added the MOSFETs, checked to see if the pins made contact, repositioned a few MOSFETs, removed excess solder with the soldering iron, wiped nearly all of the leftover flux off with IPA, dried the card with hot air, replaced the crummy thermal pads with Thermal Grizzly Minus Pad 8, installed the card, and booted the laptop.
Laptop booted.
There were a lot of things that could have killed the card for good during the past 5 months, ranging from physical damage to ESD. It still amazes me that I, with my lack of expertise in electronics and my janky setup, somehow managed to get the card repaired. Overheating MOSFETs, stripping pads off of the PCB, burning surrounding components, blowing small capacitors into oblivion, jamming the soldering iron tip where it shouldn't be, not drying the card well enough, not pressing down the MOSFETs to squeeze out excess solder, giving myself a 2nd degree burn, etc. were concerns that troubled me right up to the point of booting the laptop. I was prepared to be disappointed, but I guess setting the expectations low made seeing the laptop boot after 5 months all the more satisfying.
After using DDU, I installed the driver. That got GPU-Z to recognize the card, and the specs looked about the same as other CLEVO GTX 980M cards.
The next step was to check if the card is stable. I ran the Heaven benchmark for about 5 minutes, and the card drew ~95-104W the entire time. I didn't notice any artifacts or anything unusual on the screen. At this point I decided to stop and have some food, since I had been working non-stop for about 6 hours.
I'll do more stability testing on the card in the near future, probably tomorrow. I'll occasionally post follow-up test results here after that. I have yet to decide whether or not I want to sell this laptop to recoup the cost of the Sager NP9870-S. As far as I'm concerned, the GTX 980M won't be accepted at the CLEVO repair center with its torn serial number and tampered PCB.
Anyways, I believe thank-yous are in order. This repair wouldn't have been possible without @Khenglish's expertise with MXM GPU modification. His diagnosis was spot-on, and through that I saved quite a lot of $ by repairing the board myself (858D ~$50, soldering station $35 off of Craigslist, flux ~$25, FETs were free samples).
I would also like to thank @Danishblunt and @woodzstack for suggesting replacement thermal pads. The Thermal Grizzly Minus Pad 8 is much more robust than the stock pads. Couldn't find anything to replace the pad on the row of MOSFETs though.
As for everyone else, thank you for staying with me for the ride. It's been one hell of an adventure, going from knowing nothing about what caused my GTX 980M to fail to burning the card with the Heaven benchmark. Last edited: Nov 25, 2017 -
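For a rough sense of the savings mentioned in the post above, the listed costs can be added up and set against the ~$400 Clevo RMA fee quoted earlier in the thread. This is just arithmetic on the approximate figures given; it leaves out solder, wick, thermal pads, and shipping, since no prices were stated for those.

```python
# Approximate costs quoted in the thread, in USD.
repair_costs_usd = {
    "858D hot air station": 50,
    "soldering station (Craigslist)": 35,
    "flux": 25,
    "FETs (free TI samples)": 0,
}
rma_fee_usd = 400  # RJTech's estimate for the Clevo RMA, as quoted earlier

total_usd = sum(repair_costs_usd.values())
saved_usd = rma_fee_usd - total_usd

print(f"DIY repair: about ${total_usd}")
print(f"Saved versus RMA: about ${saved_usd}")
```

The tools also stay on the bench afterwards, so any later repair would cost even less.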
Hello everyone.
I ran the Heaven benchmark 15 times and had HWiNFO64 log GPU stats the entire time. The card consistently drew ~100W for about 1 hour total, with about 10-20s of downtime between each test. Given that 5/6 added MOSFETs were brand new, I think the card should last for quite some time now. Results + log file are attached.
It struck me as odd that having AA disabled during the first run caused the card to cap out on power consumption. Runs 2 to 14 all have AA enabled.
I guess that's it for now. I think I'll Dremel the bottom plate to improve air flow somewhat and then build a cooling pad for the P750ZM. I'll compile all of the information I've learned thus far and add it to the 1st post some time next week.
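For anyone who wants to pull numbers like the ~100W figure out of a sensor log, here is a minimal Python sketch that summarizes one column of a CSV file. The column name "GPU Power [W]" and the sample rows are assumptions for illustration only (they are not taken from the attached log), so check the header of your own log and pass the exact column name it uses.

```python
import csv
import io
import statistics


def summarize_column(csv_text, column):
    """Return (min, mean, max) of one numeric column in a CSV log."""
    values = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cell = (row.get(column) or "").strip()
        try:
            values.append(float(cell))
        except ValueError:
            continue  # skip blank or non-numeric cells
    if not values:
        raise ValueError(f"no numeric data in column {column!r}")
    return min(values), statistics.mean(values), max(values)


# Made-up sample rows, NOT taken from the attached log.
SAMPLE_LOG = """Time,GPU Power [W],GPU Temperature [C]
00:00:01,96.0,70
00:00:02,100.0,72
00:00:03,104.0,73
"""

low, mean, high = summarize_column(SAMPLE_LOG, "GPU Power [W]")
print(f"power: min {low} W, mean {mean} W, max {high} W")
```

For a real log, read the file contents into a string first and pass that in place of SAMPLE_LOG.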
-
-
Thanks again, and have fun with your modded GTX 980M. -
Hello. I read all your comments, and I have a problem with my 980M. I bought it faulty, with a short circuit. I replaced the shorted uP1642P and 1 shorted CSD87350 MOSFET out of the 3, but sometimes the laptop shuts down after an hour of gaming. Is there any solution for me?
-
Thank you for your answer. I'm changing the thermal pads now. When I say it shuts down, I mean the whole system powers off and I have to push the power button again. The GPU temperature is 70-80 degrees Celsius at most. The laptop is a Eurocom P150EM. Is it better to use it without the battery?
-
I use Xilence thermal paste for now, and I'm waiting for the MX-4. What is the brand of your laptop?
-
I did an hour-long stress test with Unigine Heaven. Saved the results of the benchmarks here. -
Your Clevo is a newer generation than my laptop. I use the Prema mod BIOS to get the 980M GPU working. In Far Cry 5 I get 76 degrees Celsius.
Broken GTX 980M
Discussion in 'Hardware Components and Aftermarket Upgrades' started by Darker01, Nov 11, 2017.