Primecoin GPU miner for nVidia - Now free for 750ti

NoodleDoodle · April 10, 2014, 7:49pm

I got 6 running, then after a while gpu_4 and gpu_5 went into constant segmentation faults. While they were running it got up to about 0.0015/block, or about 1.5 XPM/day according to the ypool stats. I’m not sure whether it could have gone higher, but only 3 of the 6 GPUs are working now. The rest are just in “constant” segmentation fault.

jonpry · April 10, 2014, 7:56pm

New version 1.01 uploaded. Should work with more than 4 cards.

NoodleDoodle · April 11, 2014, 10:31pm

Testing this now, seems to have solved the segmentation fault. Will leave 6 GPUs mining to benchmark.

SuperComputer · April 11, 2014, 10:33pm

@jonpry

Would you please compile a version for sm_35 (GTX 780 Ti)? I am curious as to why your application performs so poorly on Kepler. I had the same experience with OpenCL (Montgomery reduction) on Fermi about three years ago, but that was also my only attempt at using OpenCL.

Also, how much do you have remaining to collect on your bounty? I need to learn OpenCL programming on Nvidia’s platform.

I am Supercomputing on bitcointalk.

jonpry · April 11, 2014, 11:13pm

[quote=“SuperComputer, post:24, topic:2147”]Would you please compile a version for sm_35 (GTX 780 Ti)? I am curious as to why your application performs so poorly on Kepler. I had the same experience with OpenCL (Montgomery reduction) on Fermi about three years ago, but that was also my only attempt at using OpenCL.

Also, how much do you have remaining to collect on your bounty? I need to learn OpenCL programming on Nvidia’s platform.[/quote]

The posted version will probably “run” on Kepler. The big issue is that it is compiled to use only 5 SMs, so you can multiply the abysmal performance by 3 to get an idea of what is going on. I believe the problem is the lack of a 32-bit multiply on Kepler, which is needed in many places, but I use it everywhere. Some optimization could be done by switching to mul24 where possible. As a counterargument, the Fermat test, which makes heavy use of mul32, actually performs reasonably on Kepler. The sieve is where it falls apart. My sieve kernel is large and encompasses a lot of code, so it is not immediately clear what is causing the problem there. Most likely it is atomics into local memory.

As far as I know, nobody has made any pledges to the bounty. The miner fee has accumulated about 1 XPM so far, so we are not getting very close. I hope to speed up the 750ti port enough to make it the de facto Primecoin card. The bad news is that I have had to optimize the algorithmic complexity so much to achieve even this speed that if the AMD folks ever got the code, they would blow us out of the water.

NoodleDoodle, can you also please post the SPS numbers from your GPUs? I do not have any cards on x1 risers and would like to know whether you are seeing a performance difference between x1 and x16 links. I have the code transferring a fair bit of memory. Most of this was for debugging, and it takes almost no time on an x16 link, but it could be a problem on x1.

NoodleDoodle · April 11, 2014, 11:50pm

SPS seems to have a median value of 447

Val/h: 5.17303181 - PPS: 80 - SPS: 447.29550171 - ACC: 0 - Primorial: 41 Chain/Hr: 6: 245.04 7: 21.78 8: 2.72

I’ve attached part of the logs where the SPS is shown for all of the cards here:
https://www.dropbox.com/s/0lq5fatq9l1cm2u/gpu0-5.zip
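
For anyone wanting to reproduce a median like this from their own logs, here is a minimal sketch. It assumes status lines shaped like the one quoted above, with an `SPS: <number>` field; the helper names `sps_values` and `median_sps` are made up for illustration and are not part of the miner.

```python
import re
from statistics import median

# Matches the "SPS: <number>" field of a miner status line.
# Assumption: the field always looks like the sample line in this post.
SPS_FIELD = re.compile(r"SPS:\s*([0-9]+(?:\.[0-9]+)?)")


def sps_values(lines):
    """Return every SPS reading found in an iterable of log lines."""
    values = []
    for line in lines:
        match = SPS_FIELD.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def median_sps(lines):
    """Median SPS over all status lines, or None if none were found."""
    values = sps_values(lines)
    return median(values) if values else None


sample = [
    "Val/h: 5.17303181 - PPS: 80 - SPS: 447.29550171 - ACC: 0 - "
    "Primorial: 41 Chain/Hr: 6: 245.04 7: 21.78 8: 2.72",
]
print(median_sps(sample))
```

To use it on a real log, pass an open file object instead of `sample`; lines without an `SPS:` field (such as the legend line, which uses `SPS =`) are skipped.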

jonpry · April 12, 2014, 12:13am

So I take it you’re not noticing any difference between x16 and x1? Which of your GPUs are on x16? What model of 750ti do you have? I get almost 500 SPS on a PNY PE/OC.

SuperComputer · April 12, 2014, 1:05am

@jonpry

I will give it a try on a GTX 780 Ti Classified and see how it performs.

You’ve got a PM

NoodleDoodle · April 12, 2014, 1:46am

That’s the thing. All of the GPUs are on x1 risers, so I can’t really observe how x16 affects SPS.

jonpry · April 12, 2014, 2:49am

Which card are you using? 447 is about what I am seeing on both an EVGA with stock clocks and a Gigabyte Windforce OC that supposedly has a 1200 MHz core. The PNY is much faster for some reason.

NoodleDoodle · April 12, 2014, 3:59am

I’m using the ASUS OC version on stock clocks.

primer10 · April 12, 2014, 10:06am

Damn cool! What’s the rate like?

jonpry · April 12, 2014, 4:13pm

750ti is about 25 7ch/h depending on the card. For comparison, an R9 280X on Claymore’s miner is about 110 7ch/h as far as I know. I don’t have a 280X to actually verify this. So the 750ti is quite a bit slower, but at only 38.5 watts max. Also, the R9 miners use substantial CPU power, whereas my miner uses about 2% of a Core 2 Duo.
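
The efficiency comparison can be worked through as a rough sketch using only the figures in this post: 25 7ch/h at 38.5 W for the 750ti and 110 7ch/h for the 280X. No power draw for the 280X is given in this thread, so instead of guessing one, the sketch computes the draw at which the 280X would merely match the 750ti per watt. The function names are hypothetical.

```python
def chains_per_hour_per_watt(chains_per_hour, watts):
    """7-chains found per hour for each watt of power draw."""
    return chains_per_hour / watts


def break_even_watts(rate, reference_rate, reference_watts):
    """Power draw at which `rate` equals the reference card's efficiency."""
    return rate / reference_rate * reference_watts


# Figures quoted in the post above; the 110 7ch/h value is unverified there.
gtx_750ti = chains_per_hour_per_watt(25, 38.5)
r9_280x_limit = break_even_watts(110, 25, 38.5)

print(f"750ti: {gtx_750ti:.3f} 7ch/h per watt")
print(f"280X break-even draw: {r9_280x_limit:.1f} W")
```

On these numbers the 750ti lands at roughly 0.65 7ch/h per watt, and a 280X would have to draw less than about 169 W to beat it on efficiency.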

primer10 · April 13, 2014, 1:28am

+1

SuperComputer · April 13, 2014, 8:54am

What do you mean by lack of 32-bit multiply on Kepler?

jonpry · April 13, 2014, 3:31pm

It’s my understanding that most GPUs do not support a native 32-bit multiply, Fermi being kind of unique here. Since we don’t have access to the native instruction set, it’s hard to verify what machine code is actually being run on NV hardware. There are benchmarks available that compare the performance of multiplications of different widths. As far as I know, there is no reason why a 32-bit multiply would take 4x as long as a 24-bit one unless there is no 32-bit multiplier.
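
To illustrate why a missing wide multiplier costs several narrow multiplies, here is a generic sketch that builds a 32-bit multiply out of 16-bit partial products. This is textbook arithmetic only: it is an assumption for illustration, not a claim about the instruction sequence any NVIDIA GPU actually emits, and the function names are made up.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul32_full_from_16bit(a, b):
    """64-bit product of two unsigned 32-bit values via four 16x16 multiplies."""
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16
    lo = a_lo * b_lo                   # multiply 1
    mid = a_lo * b_hi + a_hi * b_lo    # multiplies 2 and 3
    hi = a_hi * b_hi                   # multiply 4
    return lo + (mid << 16) + (hi << 32)


def mul32_low_from_16bit(a, b):
    """Low 32 bits of the product; the a_hi * b_hi term shifts out entirely."""
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16
    return (a_lo * b_lo + ((a_lo * b_hi + a_hi * b_lo) << 16)) & MASK32


a, b = 0xDEADBEEF, 0x12345678
print(mul32_full_from_16bit(a, b) == a * b)
print(mul32_low_from_16bit(a, b) == (a * b) & MASK32)
```

The full product needs four narrow multiplies plus shifts and adds, and the truncated one needs three, which is at least consistent with the kind of slowdown the benchmarks mentioned above report.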

SuperComputer · April 13, 2014, 8:05pm

I hope the GTX 750 Ti will soon be my card of choice for Primecoin mining. The increased shared memory and L2 cache size, in addition to native shared memory atomics, should give a significant boost during the sieving phase. The modular exponentiation throughput will probably remain unchanged because of its heavy dependence on register usage. But for about $150 USD and with a runtime power consumption of less than 60 watts, this card is awesome.
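
For readers unfamiliar with the two phases mentioned here, the following is a minimal CPU-side sketch of the idea: a sieve cheaply rejects candidates with small prime factors, and a base-2 Fermat test (one modular exponentiation per candidate) checks the survivors. It walks a Cunningham chain of the first kind (n, 2n+1, 4n+3, ...). This is not jonpry’s kernel code; all names and the tiny prime list are assumptions for illustration.

```python
# Assumption: a toy sieve prime list; a real miner sieves with far more primes.
SMALL_PRIMES = (2, 3, 5, 7, 11, 13)


def chain_members(start, length):
    """First `length` terms of the sequence n, 2n+1, 4n+3, ... from `start`."""
    members = []
    n = start
    for _ in range(length):
        members.append(n)
        n = 2 * n + 1
    return members


def survives_sieve(start, length, primes=SMALL_PRIMES):
    """False if any chain member is divisible by a small prime other than itself."""
    for n in chain_members(start, length):
        for p in primes:
            if n != p and n % p == 0:
                return False
    return True


def fermat_probable_prime(n):
    """Base-2 Fermat test: true when 2^(n-1) is congruent to 1 modulo n."""
    return n > 2 and pow(2, n - 1, n) == 1


def chain_length(start):
    """Count consecutive terms n, 2n+1, ... that pass the Fermat test."""
    length = 0
    n = start
    while fermat_probable_prime(n):
        length += 1
        n = 2 * n + 1
    return length


print(survives_sieve(89, 6), survives_sieve(89, 7), chain_length(89))
```

Starting from 89, the sieve lets a length-6 chain through but rejects length 7 because the seventh term, 5759, is divisible by 13, and the Fermat test confirms a chain of six. Note that a Fermat test only reports probable primes: composites such as 341 pass it for base 2.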

ffwong · April 14, 2014, 3:23am

I have 4 R9 290X cards running Claymore’s miner, with speeds ranging from 78 to 100 7ch/h.

@jonpry, will you continue to debug your miner? I tried it on a Linux machine with a Tesla K20m. It starts up and connects to the server, but then crashes immediately without mining anything.

jonpry · April 14, 2014, 11:02am

The miner is only for 750ti, although I am not sure why it wouldn’t at least run, very slowly, on a K20m. I need more information than “it crashed”. Maybe some output from the program?

ffwong · April 14, 2014, 11:28am

Thanks for your response. Here is the screen capture. For your information, I use proxychains4 to proxy your program so that it can connect through a SOCKS server. The machine runs Linux and has a single Tesla K20m.

[proxychains] config file found: ./proxychains.conf
[proxychains] preloading ./libproxychains4.so
[proxychains] DLL init
Found 1 GPUs, selecting 0
(nil)

ptxas info : Compiling entry function ‘sieve_part’ for ‘sm_35’
ptxas info : Function properties for sieve_part
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 55 registers, 33924+0 bytes smem, 344 bytes cmem[0], 16 bytes cmem[2], 576 bytes cmem[3]
ptxas info : Compiling entry function ‘sha’ for ‘sm_35’
ptxas info : Function properties for sha
120 bytes stack frame, 384 bytes spill stores, 276 bytes spill loads
ptxas info : Function properties for sha256_finish
128 bytes stack frame, 368 bytes spill stores, 300 bytes spill loads
ptxas info : Function properties for sha256_process
56 bytes stack frame, 56 bytes spill stores, 56 bytes spill loads
ptxas info : Used 64 registers, 340 bytes cmem[0], 12 bytes cmem[2], 576 bytes cmem[3]
ptxas info : Compiling entry function ‘fermat’ for ‘sm_35’
ptxas info : Function properties for fermat
40 bytes stack frame, 40 bytes spill stores, 60 bytes spill loads
ptxas info : Used 64 registers, 136+0 bytes smem, 348 bytes cmem[0], 16 bytes cmem[2], 576 bytes cmem[3]
ptxas info : Compiling entry function ‘sieve_complete’ for ‘sm_35’
ptxas info : Function properties for sieve_complete
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 13 registers, 16+0 bytes smem, 344 bytes cmem[0], 576 bytes cmem[3]
ptxas info : Compiling entry function ‘sieve’ for ‘sm_35’
ptxas info : Function properties for sieve
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 32 registers, 33792+0 bytes smem, 344 bytes cmem[0], 20 bytes cmem[2], 576 bytes cmem[3]
ptxas info : Compiling entry function ‘fermat_finish’ for ‘sm_35’
ptxas info : Function properties for fermat_finish
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 4+0 bytes smem, 348 bytes cmem[0], 576 bytes cmem[3]
1193957
GeneratePrimeTable2() : prime table [1, 160000] generated with 14683 primes
Last Prime: 159979
Data bytes : 153792776

| jhPrimeMiner - mod by rdebourbon -v3.3beta |
| optimised from hg5fm (mumus) v7.1 build + HP10 updates |
| jsonrpc stats and remote config added by tandyuk |
| author: JH (http://ypool.net) |
| contributors: x3maniac |
| Credits: Sunny King for the original Primecoin client&miner |
| Credits: mikaelh for the performance optimizations |
| Credits: erkmos for the original linux port |
| Credits: tandyuk for the linux build of rdebourbons mod |
| |
| Donations (XPM): |
| JH: AQjz9cAUZfjFgHXd8aTiWaKKbb3LoCVm2J |
| rdebourbon: AUwKMCYCacE6Jq1rsLcSEHSNiohHVVSiWv |
| tandyuk: AYwmNUt6tjZJ1nPPUxNiLCgy1D591RoFn4 |

Launching miner…
GeneratePrimeTable() : prime table [1, 1000000] generated with 78498 primes
Sieve Percentage: 10 %
Connecting to ‘ypool.net’
Using 1 threads
Username:
Password:
Using x.pushthrough protocol
[proxychains] Strict chain … 123.123.12.23:3211 … ypool.net:10034 … OK
xpt: Logged in

Val/h = ‘Share Value per Hour’, PPS = ‘Primes per Second’,
SPS = ‘Sieves per Second’, ACC = ‘Avg. Candidate Count / Sieve’

Keyboard shortcuts:
, - Quit
- Increment Primorial Multiplier
- Decrement Primorial Multiplier
- Increment Sieve size
- Decrement Sive size
- Increment Round Sieve Percentage
- Decrement Round Sieve Percentage
- Print current settings
- Write current settings to config file
New block data - height: 491350 tx count: 3

New Block: 491350 - Diff: 10.68048239 / 7.00000000
Valid/Total shares: [ 0 / 0 ] - Max diff: 0.00000000
6ch/h: 0.00000000 - 0 [ 0 / 0 / 0 ]
Share Value submitted - Last Block/Total: 0.00000000 / 0.00000000
Current Primorial Value: 41

runit.sh: line 10: 20560 Segmentation fault (core dumped)
