Hi, this is more of an info-gathering exercise at this stage.
My CLEVOs: P870-DM3 (the alien killer), P170 series (EON-17X), and P150 (the portable one) for trips away.
I bought my P870 without a GPU, which was fine as the laptop was almost brand new. Threw in a CLEVO 1060M (aux connector version), no sweat, runs perfectly.
The P150 had an 8970M in it, so I replaced that with a CLEVO 980M; it runs about twice as fast now.
As an experiment I put the 8970M in the slave slot of the P870. Small vapor chamber mods were required but it worked OK; I've got photos if anyone is interested.
I also mine with the P870 when it's not in use, and the 8970M went ballistic on temperature as soon as mining started. For info, mining Ethereum the 1060M hashes at 20 MH/s and the 8970M at 15 MH/s, not too bad really considering the 1060M cost me about $300 USD.
So I bought a cheap RX480M (ex-DELL, under $150 USD) and of course the P870 locks up after a minute, a known EC issue as I learned later from these forums.
Then I scored an NVIDIA M6, which actually fitted very well with the vapor chamber, but I got the same lockup after a bit longer, maybe 2 minutes.
So the EC firmware either doesn't detect the MXM temp sensing, or sees the temp sensing as out of limits, and auto shuts down.
I cannot find any CLEVO EC firmware updates that fix this (if there are any, please let me know).
It seems that if this issue had a workaround, MXM replacement options in CLEVOs would be far greater, especially for owners on a budget.
I was going to buy an MSI 1080M for my P870, but at $500+ USD I can't take the chance it won't work.
So I decided to do some research into how MXMs deal with temp sensing.
-
Found a 1060M technical manual:
https://www.manualslib.com/download/1489056/Aetina-Geforce-Gtx10-Series.html
Now we're talking, this sort of technical detail is what I am after.
It would also be useful for repairing MXM GPUs that have VRM issues.
-
Looks like there are 2 temp alert signals for our example 1060M:
TH_OVERT# is a required open drain output from the MXM module which alerts the system that a critical temperature threshold has been crossed and the system must be shut down within 500 ms to prevent physical damage. The temperature threshold is defined as the minimum value of the module and the system limits. This feature is a fail-safe and should not occur during normal operation.
TH_ALERT# is an optional open drain input/output of MXM module. On the MXM module side, the module will assert this signal to notify the system that its ALERT temperature has been crossed and it is taking steps to reduce the temperature and power of the module. On the system side, if the system determines the MXM module is operating in a temperature and power range it should not be, the system can assert the TH_ALERT# input to invoke the same temperature reduction mechanism to lower the temperature and power of the module.
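To get my head around what "open drain" means for these two pins, here is a minimal sketch of how a shared open-drain line behaves, as far as I understand the usual convention: a device can only pull the line to GND or let go of it, a pull-up resistor sets the level when nobody is pulling, and the "#" suffix marks the signal as active-low (asserted = LOW). The function name and the pull-up flag are my own illustration, not something taken from the manual.

```python
def line_level(drivers_pulling_low, has_pullup=True):
    """Return the logic level seen on a shared open-drain line.

    drivers_pulling_low: list of booleans, one per device on the line;
    True means that device has switched on its FET and sinks the line to GND.
    has_pullup: whether a pull-up resistor to the logic rail is fitted
    (an assumption about the board, to be confirmed by measurement).
    """
    if any(drivers_pulling_low):
        return "LOW"        # any single driver wins: the signal is asserted
    if has_pullup:
        return "HIGH"       # nobody pulling: the pull-up sets the level
    return "FLOATING"       # no pull-up and nobody pulling: undefined level


# Entries are [module, system] for a line like TH_ALERT#.
print(line_level([False, False]))              # nobody asserting
print(line_level([True, False]))               # module asserts the alert
print(line_level([False, True]))               # system asserts the same line
print(line_level([False], has_pullup=False))   # undriven pin, no pull-up
```

If this model is right, an asserted alert should show up on the scope as the pin going LOW, and a pin that a card never drives would only be readable if the laptop side provides the pull-up.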
Initially I thought maybe a solution like a resistor across pins, but this seems to be a drain signal from an op amp.
With some more investigation it would be nice to identify the hardware and mechanism for the TH_OVERT# signal. I don't want to probe my 1060M, so I will probably connect another MXM to my PC via a PCIe x1 adapter and do some scope testing.
If anyone has a detailed manual for the RX 480 or M6 MXMs I would appreciate a copy, and for the 8970M also, as this card obviously works in CLEVOs and exploring why would be beneficial.
A few pics of the 1060M pinouts and aux J5 connector:
Below is the pinout for MXM temp sensing on the 1060M:
The three control signals can be described as system thermal and power protection (TH_OVERT#) and thermal and power system optimization (TH_ALERT# and TH_PWM).
TH_OVERT# is a required open drain output from the MXM module which alerts the system that a critical temperature threshold has been crossed and the system must be shut down within 500 ms to prevent physical damage. The temperature threshold is defined as the minimum value of the module and the system limits. This feature is a fail-safe and should not occur during normal operation.
TH_ALERT# is an optional open drain input/output of MXM module. On the MXM module side, the module will assert this signal to notify the system that its ALERT temperature has been crossed and it is taking steps to reduce the temperature and power of the module. On the system side, if the system determines the MXM module is operating in a temperature and power range it should not be, the system can assert the TH_ALERT# input to invoke the same temperature reduction mechanism to lower the temperature and power of the module.
TH_PWM is an optional output of the MXM module which can be used to control a fan to optimize the MXM module performance and acoustic characteristics. The PWM frequency must be programmable between 10 and 30000 Hz with duty cycle steps of no more than 1%.
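As a quick sanity check on the TH_PWM numbers in that excerpt, here is a small sketch that tests a fan PWM setting against the two quoted limits (frequency between 10 and 30000 Hz, duty cycle steps of no more than 1%). The function and the idea of counting equal duty steps are my own illustration; the manual does not say how any particular card implements its PWM.

```python
def pwm_setting_ok(freq_hz, duty_steps):
    """Check a PWM setting against the two limits quoted from the manual.

    freq_hz: PWM frequency in Hz.
    duty_steps: number of equal duty-cycle steps between 0% and 100%
    (hypothetical parameter, e.g. 256 for an 8-bit PWM counter).
    """
    if duty_steps <= 0:
        raise ValueError("duty_steps must be positive")
    freq_ok = 10 <= freq_hz <= 30000        # quoted frequency range
    step_percent = 100.0 / duty_steps       # size of one duty step in percent
    step_ok = step_percent <= 1.0           # quoted maximum step size
    return freq_ok and step_ok


print(pwm_setting_ok(25000, 256))   # 8-bit counter at 25 kHz
print(pwm_setting_ok(25000, 64))    # 6-bit counter: steps coarser than 1%
print(pwm_setting_ok(50000, 256))   # frequency above the quoted range
```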
So PIN 20 is our TH_OVERT# signal, used to shut down the GPU and then the laptop. I would assume the MXM first tries to temp throttle via PIN 22 TH_ALERT#, reducing power, and then finally the EC firmware shuts down the laptop to protect the MXM card.
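The escalation I am assuming here can be sketched as a simple two-threshold decision. The 94 C and 99 C figures are only the illustrative numbers I use in this thread, not values confirmed for any particular card, and the function name is made up for the example.

```python
ALERT_C = 94.0   # assumed TH_ALERT# threshold (illustrative only)
OVERT_C = 99.0   # assumed TH_OVERT# threshold (illustrative only)


def thermal_action(temp_c):
    """Return what the module/EC pair is assumed to do at a given GPU temp."""
    if temp_c >= OVERT_C:
        return "shutdown"    # TH_OVERT# asserted: EC must power off quickly
    if temp_c >= ALERT_C:
        return "throttle"    # TH_ALERT# asserted: module cuts clocks/power
    return "normal"          # neither signal asserted


# Walk through a made-up temperature ramp during a mining run.
for temp in (70, 90, 94, 97, 99):
    print(temp, thermal_action(temp))
```

The point of the sketch is only the ordering: throttling should always be tried before the fail-safe shutdown, so a card that locks the laptop within a minute or two at modest temperatures is probably not going through this path normally.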
If TH_OVERT# is a drain output (I have to measure this first), it is not a hard ground but more of a floating earth with a small current drain under normal operation.
To me, the op amp controlling this would continually drain to GND at a certain level; when OVERT occurs (in this case at 99 C) the temp sensors on the card cut off the op amp drain and the shutdown sequence is initiated,
i.e. the drain signal is no longer present: it is OPEN as described above.
It is interesting that PIN 22 TH_ALERT# is described as "an optional open drain input/output of the MXM module". This is most likely the difference between the DELL version and the CLEVO version of certain MXM GPUs, e.g. DELL uses only OVERTEMP, while CLEVO uses TEMP ALERT with OVERTEMP as the backup option.
Let's assume CLEVO actually has the better EC firmware and is looking for PIN 22 TH_ALERT# on start-up: it waits a while, the EC maybe checks that the PO state is low, and then it finally gives up and shuts down the laptop.
This functionality may not be present on DELL and other non-CLEVO MXMs, or may be in a different format. To find out I need to examine the tech manuals and measure PIN 20/22 with a scope.
More excerpts about temp sensors from the 1060M tech manual below
If anyone has MXM GPU tech manuals I would like a copy. I also have a dead CLEVO 1080M that I would like to troubleshoot further. In fact I can't find anywhere on the web where MXM and PCIe GPUs are repaired. Yes, a BGA rework machine would be required for any ICs, but if the problem is a VRM or other discrete component, repair may be possible in some cases. There are probably boxes of dead MXMs out there; my motto is to always try to repair anything, especially electronics.
MXM 3.1 spec manual attached, and a few links:
https://www.module-store.de/media/pdf/d9/a4/43/MXM_Specification_v31_r10.pdf
http://forum.notebookreview.com/thr...ctor-an-external-pci-e-x16-box.407071/page-20
https://www.nvidia.com/en-us/geforc...enable-temperature-monitoring-on-nvidia-gef/1
regards
OZ
OK, so say we have a comparator with a range of 94 to 99 C and a temp sensor input. When we exceed the threshold, the open-drain FET to PIN 22 triggers and PIN 22 changes from LOW to HIGH.
The EC BIOS IC senses this and begins the thermal protection shutdown sequence. Without the LOW from the card, the EC may see the pin as open circuit, i.e. temp high, and initiate shutdown on, say, DELL MXMs if they don't have this optional function. Well, that's my theory. The SMBus thermal sensing input would also be relevant, but maybe that's only used by software apps and not the EC IC.
Any ideas, or am I off track here?
-
If this is correct, then the EC IC is looking for a "sink" to GND at PIN 22. Provide that with a 100 ohm resistor to the nearest GND, which is PIN 17, and maybe the problem is solved.
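Before soldering anything, it is worth estimating what a 100 ohm resistor to GND would actually do to the pin. The sketch below treats it as a plain voltage divider against a pull-up on the laptop side. The 3.3 V rail and the 10 kohm pull-up are pure guesses on my part and need to be measured; only the 100 ohm value comes from my idea above.

```python
def pin_with_pulldown(v_rail, r_pullup_ohm, r_pulldown_ohm):
    """Return (pin voltage in V, divider current in A) for a pull-up
    to v_rail in series with a pull-down resistor to GND."""
    total = r_pullup_ohm + r_pulldown_ohm
    v_pin = v_rail * r_pulldown_ohm / total   # voltage divider
    i_amps = v_rail / total                   # current through both resistors
    return v_pin, i_amps


V_RAIL = 3.3          # assumed logic rail, not measured
R_PULLUP = 10_000.0   # assumed system-side pull-up, not measured
R_PULLDOWN = 100.0    # the resistor proposed above

v_pin, i_amps = pin_with_pulldown(V_RAIL, R_PULLUP, R_PULLDOWN)
print(f"pin voltage: {v_pin * 1000:.1f} mV")
print(f"divider current: {i_amps * 1000:.3f} mA")
```

With these assumed values the pin sits at roughly 33 mV, which any logic input would read as a solid LOW, and the current is well under 1 mA. One caveat from the manual text itself: the system asserting TH_ALERT# is how it asks the module to reduce temperature and power, so a permanent LOW on PIN 22 might simply leave the card throttled. That needs testing.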
SOLVING CLEVO ISSUE WITH MXM EC TEMP SENSING
Discussion in 'Sager and Clevo' started by oztrax, Nov 6, 2021.