The Notebook Review forums were hosted by TechTarget, who shut them down on January 31, 2022. This static read-only archive was pulled by NBR forum users between January 20 and January 31, 2022, in an effort to make sure that the valuable technical information that had been posted on the forums is preserved. For current discussions, many NBR forum users moved over to NotebookTalk.net after the shutdown.

    New computer, but experiencing nvlddmkm.sys VIDEO_DXGKRNL_FATAL_ERROR (code 141)?

    Discussion in 'Sager and Clevo' started by Amnvex, Sep 3, 2019.

  1. Amnvex

    Amnvex Notebook Enthusiast

    Reputations:
    0
    Messages:
    42
    Likes Received:
    2
    Trophy Points:
    16
    Does this always point to a faulty card? Or can it be a driver issue or something? A windows 1903 issue, perhaps?

    I've tried just about everything to troubleshoot it. It's not temps and it's not overclocking. SFC and DISM bring back nothing. All memory checks, stress tests, etc. come back fine, too. I think it may be a driver issue somewhere, but I can't imagine which driver it is, because I've tried a fresh install of Windows and that doesn't change the situation. Driver Verifier on 3rd-party (non-Microsoft) drivers has not caused any crashes, but verifying Microsoft-only drivers has caused BSOD loops on startup (when loading into an account) that forced me to reset/turn off the verifier in safe mode (this was especially bad after a particular Windows update). I can't fathom that a laptop with an RTX 2060 that's only ~2 months old is already failing... and the nature of this issue has changed from previous times.
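For anyone retracing these checks, here is a minimal Python sketch that assembles the usual commands (`sfc`, `DISM`, and `verifier` are standard Windows tools). The helper names `build_integrity_checks` and `run_checks` are made up for illustration, and this is not necessarily how the checks were run in this thread. Nothing runs until `run_checks` is called, which needs an elevated prompt on Windows.

```python
import subprocess


def build_integrity_checks():
    """Return the Windows integrity-check commands as argument lists."""
    return [
        ["sfc", "/scannow"],                                    # system file checker
        ["DISM", "/Online", "/Cleanup-Image", "/RestoreHealth"],  # component store repair
        ["verifier", "/querysettings"],                         # show current Driver Verifier settings
    ]


def run_checks(commands, runner=subprocess.run):
    """Run each command and map its command line to the exit code.

    `runner` is injectable so the function can be exercised without
    actually launching the Windows tools.
    """
    results = {}
    for cmd in commands:
        completed = runner(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = completed.returncode
    return results
```

If Driver Verifier causes a boot loop like the one described, `verifier /reset` from safe mode clears its settings.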

    I noticed something peculiar: the longer the PC is on, the more likely it is to fail. For example, I can play a game for, say, 2-3 hours, and then it'll crash. It'll never happen before that 2-3 hour mark if the PC is freshly rebooted. Basically no early crashes. But that's not the case for another game. Victor Vran has a map where you choose a destination. When you open and close it, it'll crash. At some point it was so bad (I don't know how it got worse) that the moment you opened the map, a crash was guaranteed (video TDR, wait ~30-40 seconds, then it'll say Windows is shutting down with fans suddenly blasting, and then the whole thing unexpectedly shuts down). If I try to shut down the PC manually during that calm period before the system auto-shuts off, it goes to a BSOD that names ntoskrnl as the responsible driver instead of nvlddmkm.sys. Sometimes it even does initiate the shutdown procedure and goes to a blue screen to shut down, but it never finishes--it just turns off completely.

    Very confused... anyone have any idea? :\
     
    Last edited: Sep 3, 2019
  2. joluke

    joluke Notebook Deity

    Reputations:
    1,040
    Messages:
    1,797
    Likes Received:
    1,215
    Trophy Points:
    181
    What temperatures are you getting in full load and idle?
     
  3. DaMafiaGamer

    DaMafiaGamer Switching laptops forever!

    Reputations:
    1,286
    Messages:
    1,239
    Likes Received:
    1,638
    Trophy Points:
    181
    The GPU is experiencing unreliable voltage to the die in its full load p state. The vram seems fine, but the power phases seem unreliable: when you do sudden things in a game, the GPU may require more voltage, voltage that the vrms can't deliver stably. This causes the gpu core to crash unexpectedly, leading to the blue screen. Long story short, there is a hardware failure but it's not really bad yet; it seems that the gpu can still run properly if you adjust the core frequency correctly. If things were 'bad' bad then the laptop would freeze or black screen and restart or shut down. The fact that the OS knows something is wrong shows the situation isn't that severe.

    Please try downloading and using nvidiainspector, offset the core in the negative by around 100 to 150mhz and try running your games. Let me know how that goes :)

    Offsetting the core in the negative means that it needs less vcore to power the gpu which leads to less wattage which in turn stresses the vrms less.
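The reasoning above leans on the usual rule of thumb that dynamic power scales roughly with frequency times voltage squared. A rough Python sketch of that arithmetic follows. The operating points are hypothetical, not measurements from this laptop, and whether a negative offset really lowers the voltage depends on how the driver applies it (see bennyg's reply further down).

```python
def relative_dynamic_power(freq_mhz, volts, ref_freq_mhz, ref_volts):
    """Ratio of dynamic power to a reference point, using P ~ f * V^2."""
    return (freq_mhz / ref_freq_mhz) * (volts / ref_volts) ** 2


# Hypothetical operating points (MHz, volts), chosen only for illustration.
stock = (1580, 0.80)
lower_clock_and_volts = (1430, 0.75)  # assumes the voltage drops along with the clock
lower_clock_same_volts = (1430, 0.80)  # assumes only the clock drops

print("clock and voltage reduced:", relative_dynamic_power(*lower_clock_and_volts, *stock))
print("clock reduced only:       ", relative_dynamic_power(*lower_clock_same_volts, *stock))
```

Because voltage enters squared, the first case saves noticeably more power than the second, which is why the voltage behaviour matters more than the clock change itself.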

    The fact that there is no artifacting of any sort shows that this is INDEED A VOLTAGE ISSUE!
    Clevo fix your vrm schematics!
     
  4. Amnvex

    Amnvex Notebook Enthusiast

    Err, sorry, I should have said that I did not OC the graphics card. I underclocked it, and undervolted it (with turbo reduced to 3.6 GHz from its 4.1). I ran Furmark for a while, too, and it's stable. Are you saying I should reduce its clock speed and try it then? Maybe it is a voltage issue as you're saying. I guess I can try to reduce the clock. The base is already reduced since it is a 2060 mobile and not a desktop card (I see in Furmark that it hits around ~1580 MHz on the boost). I don't know how that's possible if it is supposed to go no higher than 1200 MHz according to the inspector. The other thing that's weird is this: nvidia inspector flashes these numbers in and out (compare the two images):

    upload_2019-9-3_18-33-0.png
    upload_2019-9-3_18-33-8.png
    upload_2019-9-3_18-38-10.png
    upload_2019-9-3_18-39-11.png

    Note when it refreshes, if I have not yet saved the settings in adjustments for offsets, it will refresh those back to 0 the next time it gets info from the card (e.g. sensor).

    Here's what happens when I reduce the clocks. When it flashes, I happened to capture it. Check the sensor data now (I had to capture it fast since it goes away also equally fast):
    upload_2019-9-3_18-50-36.png

    Idle is around ~40c and full load is around 70. I have not yet seen it go above 70 in Furmark after running it for 10 mins or so.
     
    Last edited: Sep 3, 2019
  5. DaMafiaGamer

    DaMafiaGamer Switching laptops forever!

    Wait the laptop has Optimus? This could be a whole different issue entirely, still related to the gpu but the mux switch could also be bad...
     
  6. Amnvex

    Amnvex Notebook Enthusiast

    I don't really know... it's a 630 HD intel and an RTX 2060. I assumed that the RTX 2060 and the Intel GPU switch between each other as necessary and that's what Optimus is for? Maybe that's not right and I don't have it. I'm probably making a mistake by claiming that I do have it.

    But here's something I found out: if I close nvidia inspector, the overclocking resets. Maybe something to do with the bios's speed scaling being enabled? I don't really know. But yeah, everything goes back to stock after applying clocks and voltages to the card if I close the inspector. Seems like it doesn't stick.

    I opened Furmark just now. The sensors are stably reading the card. No flashing, no flickering.

    upload_2019-9-3_19-1-54.png

    Here's with Furmark running:
    upload_2019-9-3_19-3-16.png

    Here's with Furmark with -150MHz on the core clock:
    upload_2019-9-3_19-4-54.png

    And ~3 minutes in:
    upload_2019-9-3_19-5-23.png

    Furmark closed, BUT the application itself is still open. Still with -150MHz on the core clock.
    upload_2019-9-3_19-7-23.png

    What could possibly be the problem...? Is there a way to permanently underclock the card in another way? And again I don't understand how the card is reaching such high clock values when I thought it was supposed to be downclocked for laptops.
     
    Last edited: Sep 3, 2019
  7. DaMafiaGamer

    DaMafiaGamer Switching laptops forever!

    Seems that the card is working fine on the underclock, you need to put the values in two or three times for it to stick. Run furmark and put in the - offset values. This will stick as the dedicated gpu is active. Remember to literally click apply clocks and voltage a good two three times. It shouldn’t revert then if you don’t refresh the program or close it...
     
  8. Amnvex

    Amnvex Notebook Enthusiast

    Ok, I did. I had Furmark opened when I applied the clocks and voltages. I spammed the button like 10 times with an OC of -100. When I did that, it kept dropping the value (was at ~1900, then ~1800, then 1455 or something and now 1355 estimated). It doesn't go below anymore if I spam the button. If I turn it off, the values should stick, right? They don't show being "stuck" after closing and reopening Nvidia Inspector (all goes back to 0 for adjustments panel). Idk if that's normal, but it doesn't show that estimated max is as high as it used to be.

    This is with Furmark @ 5 mins runtime and with nvidia inspector restarted. Seems that it is generally not going above ~1500 MHz, which is still more than what the card is rated for (especially with boost).
    upload_2019-9-3_19-17-36.png

    This is a really weird problem... really weird.
     
    Last edited: Sep 3, 2019
  9. DaMafiaGamer

    DaMafiaGamer Switching laptops forever!

    That gpu vbios is really not playing nice with nv inspector lol. There is something up with your laptop. Even I’m struggling to find out!
     
  10. Amnvex

    Amnvex Notebook Enthusiast

    *shrug*
    I wish I had an answer. Something driver-related is my guess, but trying to single that out is impossible. It could simply be Microsoft fked up or something. I get Intel HD driver errors saying that The description for Event ID 0 from source igfxCUIService2.0.0.0 cannot be found. I've reinstalled the GPU driver 100 times. Doesn't help. Tried all versions. This is a Clevo P970ED, one of the newest versions of Clevo computers, and so I can only guess why the multitude of issues.

    I don't know what's going on X_X...I've had driver issues from the beginning. I've gotten rid of most of them at this point. I also had Windows 10 upgrade 1903 fail on me and BSOD on me with updates with ntoskrnl driver being the culprit. And yet all tests, memory, SSD, HDD, etc, pass with no problems.

    There is another funky thing: I have a custom fan profile currently. After I updated my drivers (this was not a problem before the updates of windows and other drivers), my fans would shoot to the sky only on the GPU at start-up and stay that way if on performance profile with the CCC. If on entertainment profile, this doesn't happen (CCC 3.0). But yeah, idk what's going on anymore. Computer doesn't BSOD or crash or anything anymore unless dealing with Win10 updates. Then it may. But that's been rare and only happened a couple of times in the last month of using it. The other ~30 crashes were all related to the GPU.

    Next time I boot, I am going to undo the overclocking speed stepping technology that Clevo has enabled in the BIOS. But that's after I try the Nvidia Inspector underclock. I used to have MSI Afterburner, but that was pretty useless for this card other than to change the clock speeds (voltages are locked for RTX laptop cards).
     
    Last edited: Sep 3, 2019
  11. bennyg

    bennyg Notebook Virtuoso

    Reputations:
    1,567
    Messages:
    2,370
    Likes Received:
    2,375
    Trophy Points:
    181
    A negative core offset is more like an overvolt than an underclock, but what it actually does to the card depends on the situation with the load and the power limit. It can either force a lower clock at the same voltage (under power limit, which is under furmark) or the same clock at a higher voltage. If the card is experiencing instability due to transient voltage drops this *may* provide extra stability, but the gpu vrm could also be faulty and it'll have no effect either way.

    Conversely, a +ve core offset mostly acts like an undervolt - allowing the card to boost to a higher Mhz or at a lower voltage or a bit of both - should induce more crashing more often by eating into the stability tolerance zone.
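One way to picture the cases described above is a toy voltage/frequency curve. The Python sketch below is only an illustration: the curve points, the power formula, its scale factor, and the 90 W limit are all invented, and the real boost algorithm is more involved. It shows an offset shifting every point's clock, a power limit picking the highest point that still fits, and the voltage needed for a given clock moving up or down with the sign of the offset.

```python
# Hypothetical (voltage, MHz) points; not read from any real card.
BASE_CURVE = [(0.65, 1200), (0.70, 1350), (0.75, 1500), (0.80, 1650), (0.85, 1800)]


def apply_offset(curve, offset_mhz):
    """Shift the clock at every voltage point by offset_mhz."""
    return [(v, f + offset_mhz) for v, f in curve]


def modelled_power(volts, freq_mhz, k=0.1):
    """Toy power model P = k * f * V^2; k is an arbitrary scale factor."""
    return k * freq_mhz * volts * volts


def point_under_power_limit(curve, limit_w):
    """Highest-voltage point whose modelled power fits under the limit."""
    allowed = [(v, f) for v, f in curve if modelled_power(v, f) <= limit_w]
    return max(allowed) if allowed else None


def voltage_for_clock(curve, target_mhz):
    """Lowest voltage whose curve point reaches at least target_mhz."""
    for v, f in sorted(curve):
        if f >= target_mhz:
            return v
    return None


negative = apply_offset(BASE_CURVE, -150)
positive = apply_offset(BASE_CURVE, +150)

# Power-limited (Furmark-like): same voltage, lower clock with the negative offset.
print(point_under_power_limit(BASE_CURVE, 90), point_under_power_limit(negative, 90))

# Fixed 1500 MHz target: more voltage with a negative offset, less with a positive one.
print(voltage_for_clock(negative, 1500), voltage_for_clock(BASE_CURVE, 1500),
      voltage_for_clock(positive, 1500))
```

In this toy model the negative offset buys voltage headroom at a given clock, which is the sense in which it behaves like an overvolt rather than a plain underclock.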

    Locking the card to a specific voltage/frequency (ctrl+L in the afterburner curve editor window) may be helpful to test stability. It doesn't override power limits, which is what the mobile cards spend almost all their time under, the core will still drop clocks, left along the boost curve.

    But any problem that comes on only after multiple hours, and can't be specifically induced, is a giant pain in the backside to troubleshoot. Hopefully you have warranty and a service line to call to help you with it as it does sound like a hardware issue.

    As for clocks under furmark, I'm not seeing anything other than normal behaviour, the core runs the fastest clock possible under the power limit. Furmark is a "heavier" and less variable load than anything else the core will ever run, so it is operating constantly under power limit condition, and stabilises at a lower overall clock (and lower voltage tied to that clock) than it would during a game load.
     
    Last edited: Sep 3, 2019
  12. Amnvex

    Amnvex Notebook Enthusiast

    Interesting. So is a negative core offset a good thing or a bad thing? Because it sounds like a bad thing and that I shouldn't mess with it. If this is the case, I think I better just turn off the scaling that is offered in the BIOS. I guess I'm also not the only one. I read on other forums that RTX owners have the same issue and they've RMA'd their cards twice and the issue persists... I get this issue, too:
    https://us.forums.blizzard.com/en/overwatch/t/render-device-lost-fix-for-rtx/263106/472

    Maybe it really is a BIOS problem. I will need to explore it. The game I played, btw, is Vampyr. I play it on medium settings and at 1440x900 resolution. I even tried playing Victor Vran as I said, and that is played in windowed mode (I think 1280x720 resolution out of the capable 1920x1080 desktop resolution). There's no way a game like that, which is also on med-high settings and came out like 4-5 years ago, should be causing the GPU to overwork itself and be "lost"...and it used to happen immediately when I opened the map in the game as I mentioned. There was a time that I thought the Nvidia Experience app was the problem. After uninstalling it, the map crash stopped. I reinstalled drivers only. Now the problem only happens after playing the game for a longer time (2+ hours, generally). Map no longer crashes on load. Baffling. I'd have said that Nvidia Experience was responsible for it, but I've since reinstalled the Experience software and there are still no issues with the map crashing. How can it be this capricious?

    Edit 1:
    I've disabled performance scaling in BIOS. Let's see what happens... I'm willing to try anything at this point, lol, but will test later (going to work now).

    Lastly:
    After rebooting, I get this when trying to access the NVidia Control Center:
    upload_2019-9-3_20-22-59.png

    But then when I right clicked again, to test if I'd get the same issue, the Nvidia Control Center started with no errors! So confusing.

    Edit 2:
    With Furmark, these are the results at ~3 mins (with GPU Scaling OFF in BIOS). Seems more "stable" with the clocks (no OC offset is used here).
    upload_2019-9-3_20-29-51.png

    Edit 3:
    Apparently this also happened, but I had not noticed it (I think it happened when I tried to close it, so it decided to crash instead... just guessing):
    ntdll as I understand is a kernel-level driver? The mystery deepens...
     
    Last edited: Sep 3, 2019
  13. joluke

    joluke Notebook Deity

    What version of BIOS and EC do you have?

    Reboot your laptop and intermittently press F2 to enter the BIOS, and you will see that info on the first screen that pops up.
     
  14. Amnvex

    Amnvex Notebook Enthusiast

    Thanks for the reply.

    I hope the numbers are the same since I short-cutted this by going into MSINFO32 to get this info: BIOS is INSYDE CORP. 1.07.03P dated 1/15/2019 and EC version is 7.04.
     
    Last edited: Sep 4, 2019
  15. joluke

    joluke Notebook Deity

    Well your BIOS is a bit old!

    the latest one for your model is: BIOS Version 1.07.09

    And the latest EC: 1.07.08
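To check whether an installed version string is behind a newer one, a small comparison helper is enough. This Python sketch makes two assumptions that are mine, not Clevo's: the dotted numbers compare left to right, and a trailing letter such as the "P" in 1.07.03P can be ignored. The function names are hypothetical.

```python
import re


def parse_version(text):
    """Turn '1.07.03P' into (1, 7, 3); letters and punctuation are ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", text))


def is_older(installed, latest):
    """True if the installed version sorts strictly before the latest one."""
    return parse_version(installed) < parse_version(latest)


print(is_older("1.07.03P", "1.07.09"))  # BIOS strings quoted in this thread
```

The EC strings quoted in the thread ("7.04" from MSINFO32 versus "1.07.08" from the e-channel) do not appear to share a format, so a numeric comparison like this is not meaningful for them without knowing how the two map onto each other.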

    can you ask the store that sold you the laptop to send ya an update of both EC and BIOS? Worth a shot :)

    (I got the info from clevo e-channel directly)

    you got a mirror here for Clevo BIOS and EC:

    https://repo.palkeo.com/clevo-mirror/P9xxEx/

    But it isn't updated with the latest updated BIOS/EC from Clevo and clevo's ftp for downloading BIOS/EC has been down today (like always lol)

    Edit:

    was able to grab the latest EC 1.07.08:

    https://mega.nz/#F!TBZhGYJA!tuzyRICl5OSs1oPoulkywg <- This is a link for a folder. If i can grab the latest BIOS from Clevo's ftp i will post it there too. For now only has the EC for your model
     
    Last edited: Sep 4, 2019
  16. Amnvex

    Amnvex Notebook Enthusiast

    I did ask them, actually. The manufacturer, a combination of Pro-Star and Sager, has gone through a restructuring and has not sent them anything despite having waited 3 weeks for an answer. The reason, they said, is that they have to email the company in China, then the China company has to email the U.S. office, and then they have to email my seller. But that hasn't happened. And I doubt it will at this point. I have given up on trying, but I have requested once again for them to re-request the BIOS and EC with instructions.

    And I have no idea how to flash the BIOS properly on a laptop like this. Especially an EC, something I don't recall ever having to mess with on a desktop computer from ~2005. I've done it before, but it was on a DELL desktop many years ago and I'm afraid something may break. Windows 10 1903 already doesn't like this computer with the BSODs that it has given me post-update. 1807 or w/e the version was before this didn't give me this many issues.

    ALSO, I think I should say that no BIOS updates are shown for my laptop model: https://www.clevo.com.tw/en/e-services/download/ftpOut.asp?Lmodel=P9xxEx&ltype=1&submit=+GO+

    Idk why. I guess they think it doesn't need an update.
     
    Last edited: Sep 4, 2019
  17. Meaker@Sager

    Meaker@Sager Company Representative

    Reputations:
    9,436
    Messages:
    58,194
    Likes Received:
    17,909
    Trophy Points:
    931
    Might be worth taking an image backup of your drive and re-installing with the base driver set and seeing if the machine is doing the same thing.
     
  18. Amnvex

    Amnvex Notebook Enthusiast

    I can tell you that the answer is certainly NO (i.e., it wouldn't be doing the same thing as it is doing now). If I install the OS without updates to either Windows or drivers, everything works fine. But who wants to run an unupdated version 1809? I don't know what the problem is. When I ran nvidia inspector the first time, I had no issues. Once Windows started updating, OCing the GPU started giving errors that pointed to something to the effect of "illegal address modification" or something. In other words, the shortcuts that I could make from nvidia inspector ended up giving memory errors after Windows updated. Not sure when (after which update, that is) it started happening. I can also say I started getting ACPI errors after the updates. These ACPI errors made the Clevo Command Center completely useless and the Event Viewer would show the EC returning values when none were requested. None of the CCC profiles for the fans would work. The hotkeys for the keyboard would also not work. Nothing to do with the CCC worked in general--the system was fighting it. That was the most annoying because the fan control (automatic fan control) wouldn't work at all so the laptop would just end up hanging up in shutdown stage (with the screen blank) and stay that way for the whole night, overheating itself, essentially. It was HOT. Real hot. I'm guessing TDP limitations kicked in and prevented the CPU from doing anything when it was stuck on the shutdown sequence.

    It wasn't until I uninstalled everything, reinstalled Windows, and didn't do driver updates from Clevo's FTP site that all errors resolved themselves (generally, except now there are GPU problems that seem to somehow be a result of Windows itself). That was also the same time that I requested a BIOS update because why else would there be ACPI errors? I'd get hkmoufltr BSODs, ntoskrnl.exe BSODs, driver verifier BSOD loop (twice, and all on Microsoft drivers), storage data corruption BSODs, etc. It was a nightmare... I was ready to throw the laptop out the window because I thought it was all hardware related originally. How can so many things go wrong in the first month of a brand new laptop's life? Impossible. Right?

    Anyway, I thank you all for your help thus far. I've been interested in testing out many things and you've given me ideas on what I can try (helps with the brainstorming). You're much more knowledgeable on this stuff than I am. I really do appreciate it! At some point, I think, a solution will be found. Just a matter of when, I guess.

    Edit 1: I've emailed support and they're reluctant to help me get the BIOS. They've so far said that they want me to reinstall windows again, without keeping files or anything else. They said this may fix it if it's windows-caused. -__- and they said don't update anything except Windows.
    Best advice ever... /sarcasm

    Edit 2: Erased everything as suggested by you and the support people. Time to re-setup the laptop. Will need a couple days to test out.
     
    Last edited: Sep 4, 2019
  19. Amnvex

    Amnvex Notebook Enthusiast

    Ok, update:

    It seems I fixed it. I used to get intermittent sensor reports and blanking out of values as seen previously in my posts, but now everything seems to be stable and reporting as it is supposed to. I can say that this is definitely a Windows problem! What worked for me was this: go into safe mode, uninstall nvidia drivers, reinstall them, then install intel management engine components. This is the order it *must* be done in or it won't work. I know this because there was a sequence in a game that I tried to get through but it'd give me TDR errors and crash. That sequence is no longer a problem and there is no crash anymore! All because of this...

    I also followed these steps (removed Windows, reinstalled but didn't allow auto-updates to anything--more importantly the GPU because it appears that Windows is dropping in corrupted nvidia files), changed TDR (just in case) values in registry, and installed GPU drivers straight from Nvidia's site (latest--even newer than what GeForce Experience offers) without installing any extra software (like GFE or audio drivers). Then I let Windows do updates, but not any that involve drivers of hardware (e.g., realtek audio). I had to rollback the realtek driver and not let it update through device manager (my mistake) because it was a messed up driver from Windows itself! (Just proves that Windows has issues with updating hardware stuff and should NOT be used!). BTW, this is where I got the info on how to fix it: https://www.nvidia.com/en-us/geforc...isplay-driver-nvlddmkm-stopped/?commentPage=2
    upload_2019-9-7_18-32-42.png
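The post does not say which TDR values were changed. The ones usually adjusted are `TdrDelay` and `TdrDdiDelay` under the `GraphicsDrivers` registry key, with documented defaults of 2 and 5 seconds. Assuming those are the values meant, a read-only Python sketch for checking what is currently set might look like this (the function names are hypothetical, and it returns an empty result when not run on Windows):

```python
import sys

TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
# Documented defaults in seconds, used when a value is absent from the registry.
TDR_DEFAULTS = {"TdrDelay": 2, "TdrDdiDelay": 5}


def read_tdr_overrides():
    """Return the TDR timeouts explicitly set in the registry (empty off Windows)."""
    if sys.platform != "win32":
        return {}
    import winreg  # Windows-only standard library module

    found = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
        for name in TDR_DEFAULTS:
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
                found[name] = value
            except FileNotFoundError:
                pass  # value not present, so the default applies
    return found


def effective_tdr(overrides):
    """Merge registry overrides over the defaults."""
    return {**TDR_DEFAULTS, **overrides}


print(effective_tdr(read_tdr_overrides()))
```

The sketch only reads. Changing these values needs administrator rights and a reboot, and a longer timeout hides a hang rather than fixing its cause.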

    Nothing intermittent: solid reporting. No patchy info anymore and no flickering of temp reporting or any of the other boxes!
    upload_2019-9-7_18-22-16.png

    Typical values during gaming (this is Vampyr):
    So, I guess that's it.
     
    Last edited: Sep 14, 2019