Is there a program out there that can monitor the bandwidth being used on RAM? I've gotten zillions of Internet bandwidth monitors and RAM usage (in KB/MB) monitors from Google, but not a RAM bandwidth monitor.
Reason being that I've got a (dual-threaded) program getting only about a 30% boost from dual-core, regardless of whether it's doing 100 calculations or 100 million. And since almost all its operations are dual-threaded, I'm thinking memory bandwidth might be limiting it.
-
-
Yes there is: download SiSoftware Sandra. It can be downloaded from Guru3D's website.
Here is the link:
http://downloads.guru3d.com/Sandra-XL-download-177.html
Simply download and install it, then go to the Benchmarks tab and select Memory Bandwidth and Memory Latency. These are two benchmarks that will tell you the bandwidth and latency your memory actually achieves.
K-TRON -
No, I don't know of an actual monitor. Lots of benchmarks, yes. Do you monitor your CPU cores to see if both are being fully used? How are you calculating 30%? Of what? Share some numbers.
-
Thanks for Sandra Lite. Unfortunately, it doesn't seem to like my computer. It installs, but will not run. Not even in Win2K compatibility mode. I tried two different version numbers, too. I also made sure I have all the requirements (.NET, Java, etc.). Not sure what's up there. Running WinXPHomeSP2.
I know from Task Manager that both cores are being used at 100%. Java.exe uses 99% of CPU when I run my program. The time savings is 30% - if it takes 100 seconds to run the program in the single-threaded version, it takes 70 seconds to run the multithreaded version. Nearly all of the program runs in multithreaded mode, so I was hoping that would fall to 60 seconds (60% of single-threaded time) at least. I made sure the two threads are using separate variables, so they aren't spending time waiting for the other to quit using a variable. So I'm not sure why, if the program were CPU-bound, it would not run at least somewhat faster on dual-core. Hence why I think it may be a memory bandwidth limitation - the program changing variables so quickly that the memory bus becomes the bottleneck.
Task Manager reports java.exe as using 7.2 MB of space, steady, which indicates it doesn't fit in my L2 cache.
edit: Seems my IDE was part of the problem; the dual-core version only takes 59% as long as the single-core in the command line. Still curious about finding out if there's a memory bandwidth bottleneck though. -
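The timings in the post above can be sanity-checked with Amdahl's law. This is only a rough sketch: it assumes run time scales as T(n) = T(1) * ((1 - p) + p / n), where p is the fraction of work that runs in parallel, and it ignores threading overhead and memory effects. The class and method names are made up for illustration.

```java
// Rough Amdahl's-law check for the timings quoted in the thread.
// Assumption: T(n) = T(1) * ((1 - p) + p / n); real programs also pay for
// thread startup, synchronization and cache/memory contention.
final class AmdahlEstimate {

    /** Parallel fraction p implied by a measured ratio T(n)/T(1) on n cores. */
    static double parallelFraction(double timeRatio, int cores) {
        // Solve timeRatio = (1 - p) + p / cores for p.
        return (1.0 - timeRatio) / (1.0 - 1.0 / cores);
    }

    /** Predicted ratio T(n)/T(1) for a given parallel fraction p. */
    static double predictedRatio(double p, int cores) {
        return (1.0 - p) + p / cores;
    }

    public static void main(String[] args) {
        // 70 s vs 100 s when run from the IDE
        System.out.printf("IDE run:          p = %.2f%n", parallelFraction(0.70, 2));
        // 59% of single-threaded time from the command line
        System.out.printf("Command-line run: p = %.2f%n", parallelFraction(0.59, 2));
        // Best case for two cores if everything ran in parallel
        System.out.printf("Ideal 2-core ratio = %.2f%n", predictedRatio(1.0, 2));
    }
}
```

Under this simple model, the 59% figure corresponds to roughly 82% of the work running in parallel, and even perfectly parallel code could not go below 50% on two cores.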
If you get stuck and absolutely want to test this, remove one of your RAM sticks. If they are a matched pair, that will virtually guarantee a 10% reduction in bandwidth. If your performance remains the same, you can rule out bandwidth. If it drops, the result is inconclusive, as it could also be because of having less RAM. RightMark Multi Threaded Memory Test might work since Sandra won't.
As far as expecting 40% goes, that sounds like what I would want too; 50% is too much to expect. But when I run single-threaded apps I still see activity on both cores. So watch the cores in Task Manager: are you getting better than 50%? If so, is it enough to explain why you only see 30%? -
Hmm, I'll probably skip the memory removal test because I don't have any way to remove the screws on the memory section right now. Looking into RightMark Multi Threaded Memory Test.
Single threaded applications will give you activity on both cores (but not 100% on both - the average will be 50%) unless you restrict their affinity to one core (right click the process in the Process tab, choose Set Affinity, uncheck all but one CPU). But when you look at their CPU use on the Process tab, it will not go above 50. Dual-threaded programs will set both CPU graphs all the way to 100%, and will give 99 under CPU use on the process tab. I'm certain it's running dual-threaded and using 99% CPU.
I'm thinking the CPU might be burning up cycles waiting for the RAM. Or it could simply be inherent inefficiencies of dual-threaded Java programs. I don't think it's Dynamic Acceleration, at least not entirely.
edit: Ran RMMA Quick Tests both with and without my program running. With it running, the max "Real RAM Bandwidth" was 3557 MB/s. Without it running, it was only 1724 MB/s. The synthetic tests also increased significantly with my program running, except latency, which decreased. If I interpret it correctly, this means my program was using about 1.8 GB/s of memory bandwidth - but that would still be below the bottleneck, as I have PC-5300 RAM with a bottleneck of 5.3 GB/s.
There are a couple of asterisks. RMMA used about 33% of my CPU power, so my program was running at 2/3 its normal CPU power. Thus I would assume it would use 2.7 GB/s at full power. It still seems like it shouldn't be hitting a bottleneck - that leaves 2.6 GB/s for other programs and losses to latency. Unfortunately I don't have a three-core Phenom on which to test the difference in memory bandwidth with my program running without interference from RMMA.
Image below.
Only concern is that 5.36 GB/s number for maximal real read bandwidth on the left screen. If that's the number I should be looking at in determining a bottleneck then I might have one. -
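For reference, the arithmetic in the post above can be written out as a small sketch. It takes the poster's interpretation at face value: the difference between the two RMMA readings is attributed to the program, and bandwidth use is assumed to scale linearly with the share of CPU time the program gets. Both are assumptions rather than measured facts, and the class and method names are hypothetical.

```java
// Reworks the bandwidth estimate from the RMMA readings quoted in the thread.
final class BandwidthEstimate {

    /**
     * Estimated bandwidth (MB/s) the program would use with the whole CPU.
     * Assumption: usage scales linearly with the program's CPU share.
     */
    static double fullSpeedMBps(double withProgram, double withoutProgram, double cpuShare) {
        double attributed = withProgram - withoutProgram; // MB/s attributed to the program
        return attributed / cpuShare;
    }

    public static void main(String[] args) {
        double withProgram = 3557.0;    // MB/s, RMMA reading with the program running
        double withoutProgram = 1724.0; // MB/s, RMMA reading without it
        double cpuShare = 2.0 / 3.0;    // RMMA took about a third of the CPU
        double peak = 5300.0;           // MB/s, the 5.3 GB/s figure from the post

        double estimate = fullSpeedMBps(withProgram, withoutProgram, cpuShare);
        System.out.printf("Attributed to program: %.0f MB/s%n", withProgram - withoutProgram);
        System.out.printf("Scaled to full CPU:    %.0f MB/s%n", estimate);
        System.out.printf("Headroom below peak:   %.0f MB/s%n", peak - estimate);
    }
}
```

With the unrounded readings this comes out slightly different from the rounded 2.7 and 2.6 GB/s figures in the post, but the conclusion is the same: the estimate stays well under the quoted peak.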
I agree with everything you are testing and seeing. But go back and download the multi-threaded version. It splits your two channels of RAM; I am only thinking it will simply give you per-channel results. Run it with one CPU thread like you said and just see. I know exactly what you are trying to determine, I'm just having trouble coming up with the experiment. And as even you acknowledge, it could be the inherent overhead cost of multithreading. But it would be nice to rule out bandwidth. I will ponder it, and if I come up with anything I will post back. But it does sound like you have as good a handle on it as I do, or better. Good luck!
-
Yes, RAM I/O (or the FSB) often becomes the bottleneck.
There are plenty of other possibilities though. Do the two threads use variables *next to* each other?
If you have an array a, and thread 0 accesses a[0], thread 1 accesses a[1] and so on, that will hurt performance, because the CPU caches don't operate on single bytes but on cache lines (typically 64 bytes per line on current x86 CPUs, which corresponds to 16 ints or floats, or 8 doubles).
So if this is the case, and the two threads access data a few bytes away from each other at the same time, they'll have to move that cache line from one core to the other, and back, and forth again, and back (since a line that is being written can't sit in both cores' caches at the same time).
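The effect described above is usually called false sharing. A minimal sketch of how one might test for it in Java (8 or later) follows. It assumes that slots 0 and 1 of an AtomicLongArray usually land on the same cache line, and that slots 0 and 16 (128 bytes apart) never do on CPUs with 32- or 64-byte lines. The JVM does not guarantee array alignment, so treat the timings as indicative only. The class and method names are invented for the example.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Two threads each hammer their own counter; only the distance between the
// counters changes between the two runs.
final class FalseSharingDemo {

    /** Runs two threads, one per slot, and returns the elapsed time in nanoseconds. */
    static long timeRun(AtomicLongArray counters, int slotA, int slotB, long iterations) {
        Thread a = new Thread(() -> bump(counters, slotA, iterations));
        Thread b = new Thread(() -> bump(counters, slotB, iterations));
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.nanoTime() - start;
    }

    /** Increments one slot repeatedly; atomic updates keep the JIT from hoisting the store. */
    static void bump(AtomicLongArray counters, int slot, long iterations) {
        for (long i = 0; i < iterations; i++) {
            counters.incrementAndGet(slot);
        }
    }

    public static void main(String[] args) {
        long n = 20_000_000L;
        // Slots 0 and 1 are 8 bytes apart; slots 0 and 16 are 128 bytes apart.
        long adjacent = timeRun(new AtomicLongArray(32), 0, 1, n);
        long padded = timeRun(new AtomicLongArray(32), 0, 16, n);
        System.out.printf("adjacent slots: %d ms%n", adjacent / 1_000_000);
        System.out.printf("padded slots:   %d ms%n", padded / 1_000_000);
    }
}
```

If the adjacent run is clearly slower on a multi-core machine, the same spacing trick (giving each thread data that is at least a cache line away from the other thread's) may be worth trying in the real program.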
Finally, you may want to use 3 or 4 threads, in order to ensure that there's always a thread ready to run, even if one gets blocked. You generally need slightly more threads than you have cores for best performance.
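A minimal sketch of that suggestion in Java, assuming the work can be cut into independent chunks: the pool gets one thread per core plus a few extra, and the work is split into more chunks than threads so a free thread always has something to pick up. The summing task is only a stand-in for the real computation, and the class name, method name and chunk count are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OversubscribedPool {

    /** Sums 1..n on a pool of (available cores + extraThreads) threads. */
    static long parallelSum(long n, int extraThreads) {
        int threads = Runtime.getRuntime().availableProcessors() + extraThreads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunks = threads * 4; // more chunks than threads (arbitrary choice)
            long chunkSize = (n + chunks - 1) / chunks;
            List<Future<Long>> parts = new ArrayList<>();
            for (int c = 0; c < chunks; c++) {
                final long from = c * chunkSize + 1;
                final long to = Math.min(n, from + chunkSize - 1);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (long i = from; i <= to; i++) {
                        sum += i;
                    }
                    return sum;
                };
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk is done
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel sum failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // One extra thread beyond the core count, as the post suggests trying.
        System.out.println(parallelSum(100_000_000L, 1));
    }
}
```

Whether the extra thread helps depends on how often threads actually block; for purely CPU-bound work it may make no difference, so it is worth timing both ways.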
Without knowing more about how your program works, it's impossible to say what's holding you back.
Another related, but simpler explanation might be that the single-threaded version just gets better cache locality. It doesn't get as many cache misses as the multithreaded version, for whatever reason. Again, impossible to say without knowing more about your program.
A third option might be that the greater bandwidth usage means your program is seeing relatively higher latencies (because there are more pending requests that have to be served before *your* request returns data), which causes the CPUs to stall and have nothing to do for some of the time. That might be possible to fix by rearranging your code a bit to reduce dependencies between instructions.
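One common way to reduce dependencies between instructions is to split a single running total into several independent ones, so the CPU can overlap the loads and additions instead of waiting on each result in turn. The sketch below shows the idea on a simple array sum. It is an illustration, not a guaranteed speedup: the JIT may already do something similar, and the class and method names are made up.

```java
// Same result, different dependency structure.
final class AccumulatorDemo {

    /** One accumulator: every addition has to wait for the previous one. */
    static long sumSingle(int[] data) {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;
    }

    /** Four accumulators: four independent chains of additions. */
    static long sumFourWay(int[] data) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int limit = data.length - (data.length % 4);
        int i = 0;
        for (; i < limit; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        for (; i < data.length; i++) { // leftover elements
            s0 += data[i];
        }
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        int[] data = new int[10_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 100;
        }
        long start = System.nanoTime();
        long a = sumSingle(data);
        long mid = System.nanoTime();
        long b = sumFourWay(data);
        long end = System.nanoTime();
        System.out.printf("single:   %d in %d ms%n", a, (mid - start) / 1_000_000);
        System.out.printf("four-way: %d in %d ms%n", b, (end - mid) / 1_000_000);
    }
}
```

A single timed pass like this is a crude measurement (no JIT warm-up), so any difference should be confirmed with repeated runs before drawing conclusions.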
Btw, don't run Sandra with your program running. It's meant to profile your system *alone*. Anything you get while the CPU is busy with other processes is going to be highly skewed and inaccurate.
There's no way to determine how much RAM bandwidth is being used at any instant in time. The reason being, to do that, you have to keep track of everything that happens for a few hundred nanoseconds, which would take so much CPU time, it'd skew the results badly.
RAM I/O (Bandwidth) Monitor?
Discussion in 'Hardware Components and Aftermarket Upgrades' started by Apollo13, Apr 26, 2008.