WU's too small for GPU
EG
Joined: 22 Jun 17
Posts: 15
Credit: 12,727,586
RAC: 0
Message 222 - Posted: 24 Jun 2017, 8:51:18 UTC

I have a problem: my machines finish their assigned tasks (the whole queue) before the server can deliver more work, so my machines idle out.

Part of the reason is the tiny, tiny WUs, some taking only 9 seconds to complete.

The other part is the limit on the number of WUs delivered to fill the queue. 60 WUs just doesn't cut it. I run out of work and the server can't keep the queues full...

Heck, it can't even feed enough work for them to run consistently...

You need to increase the size of the WU queue, ESPECIALLY if you're going to have 9-second WUs...

Probably along the lines of 200 rather than 60...

Also, I would suggest that you lengthen the WUs. A 9-second WU on a GPU is a HUGE waste of resources; we are spending more time uploading and downloading WUs than actually doing any real work!

It would also reduce the load on your servers.....

Without these improvements, the project wastes more resources on the crunchers' side than it puts toward real work.

Runs beautifully otherwise.

I would strongly recommend making these changes.

mmonnin
Joined: 28 Nov 16
Posts: 19
Credit: 5,313,048
RAC: 0
Message 224 - Posted: 24 Jun 2017, 12:10:17 UTC - in response to Message 222.

MilkyWay used to have short tasks, about 7 s each on a 280X, and the servers were being hammered with requests. Their work didn't really allow for longer/larger tasks, so they took five and bundled them together. WU availability to users was much better afterwards.

If the project continues to grow it might be a good idea to lengthen the tasks somehow.

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 280
Credit: 103,382
RAC: 0
Message 225 - Posted: 24 Jun 2017, 12:15:36 UTC - in response to Message 222.

Thank you for your report!

I made the following changes.

You need to increase the size of the WU queue, ESPECIALLY if you're going to have 9-second WUs...

Now the size regulator keeps at most 100 WUs of each size in cache instead of 30. This should increase the queue of unsent WUs.
Note that the server sends a maximum of 15 WUs per GPU at once; this is optimized for a mid-range GPU. I think it's possible to configure the client to overcome this limit or to report more often.
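For example, something like this on the client side might help with turnaround (an untested sketch; these are the standard BOINC client files, and the 15-per-GPU limit itself is enforced server-side, so this only helps tasks cycle faster). In cc_config.xml:

  <cc_config>
    <options>
      <!-- sketch: return finished tasks immediately instead of batching reports -->
      <report_results_immediately>1</report_results_immediately>
    </options>
  </cc_config>

and a larger work buffer in global_prefs_override.xml:

  <global_preferences>
    <!-- sketch: ask for 0.5 days of work plus a 0.25-day extra buffer -->
    <work_buf_min_days>0.5</work_buf_min_days>
    <work_buf_additional_days>0.25</work_buf_additional_days>
  </global_preferences>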

Also, I would suggest that you lengthen the WUs. A 9-second WU on a GPU is a HUGE waste of resources; we are spending more time uploading and downloading WUs than actually doing any real work!

The smallest type of WU for the GPU is now sent to the CPU. If this is not enough, I'll redirect more WUs to the CPU.

I also need to mention that we are reaching the server's capacity limit, so there will probably be some shortage of WUs as the number of volunteers increases. However, the current performance is good enough for a project with a limited number of tasks: we have already processed 0.5 million of the 12 million WUs in the first 6 days.

JugNut
Joined: 21 Jun 17
Posts: 20
Credit: 63,773,425
RAC: 11,950
Message 226 - Posted: 24 Jun 2017, 13:30:08 UTC - in response to Message 225.
Last modified: 24 Jun 2017, 13:38:46 UTC

Work seems to be drying up.

Server says...

Application             Unsent   In progress
XaNSoNS BOINC for CPU   136      682           0.02 (0.01 - 1.28)
XaNSoNS BOINC for GPU   0        2296          0.04 (0.01 - 12.54)

So 136 for CPU & 0 for GPU. I'm getting nothing except the occasional spatter.

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 280
Credit: 103,382
RAC: 0
Message 227 - Posted: 24 Jun 2017, 13:43:53 UTC - in response to Message 226.

Work seems to be drying up.

Server says...

Application             Unsent   In progress
XaNSoNS BOINC for CPU   136      682           0.02 (0.01 - 1.28)
XaNSoNS BOINC for GPU   0        2296          0.04 (0.01 - 12.54)

So 136 for CPU & 0 for GPU. I'm getting nothing except the occasional spatter.

Yes, redirecting some small GPU tasks to the CPU was a bad idea. I need to make the work generator more flexible, because right now the large number of unsent and inactive CPU tasks slows down the generation of new WUs.

EG
Joined: 22 Jun 17
Posts: 15
Credit: 12,727,586
RAC: 0
Message 229 - Posted: 24 Jun 2017, 19:42:35 UTC - in response to Message 227.

.....
Yes, redirecting some small GPU tasks to the CPU was a bad idea. I need to make the work generator more flexible, because right now the large number of unsent and inactive CPU tasks slows down the generation of new WUs.


It may be a bad idea from the current server/project configuration perspective.

But from a user perspective it's the opposite.

We understand this is a beta in its first few stages, and as project scalability is a major issue, this definitely affects scalability.

No project is ever completely efficient; at least I haven't seen one yet. All we can hope for is a good compromise.

Maybe it is time to separate CPU work generation from GPU work generation. Just about every project using both devices has separate work generation.

CPU and GPU capabilities are so far apart that you cannot run one WU on both... A WU sized for a CPU finishes way too fast on a GPU, and the opposite: a WU sized for a GPU takes forever and three weeks to run on a CPU.

But it is a beta; this is the stage where these questions get answered...

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 280
Credit: 103,382
RAC: 0
Message 230 - Posted: 24 Jun 2017, 21:14:43 UTC - in response to Message 229.

Maybe it is time to separate CPU work generation from GPU work generation. Just about every project using both devices has separate work generation.

This project has a limited number of WUs because we cannot process more structures than are contained in the COD database. For each radiation source (x-ray, neutron), for each COD structure, we have 18 WUs of different sizes. The sizes of the WUs differ by a factor of a thousand.

We could separate the GPU and CPU tasks strictly by size, but in that case, when all the GPU tasks for all the structures are completed, we'd be waiting for the CPU tasks to finish with no GPU tasks left.

The other approach is to distribute the tasks between the CPU and the GPU more flexibly. If the CPU queue is full and the GPU one is almost exhausted, the heaviest 'CPU' tasks are redirected to the GPU (for the GPU, of course, they are small). In the reverse situation, when the CPU queue is exhausted, the most lightweight 'GPU' tasks are redirected to the CPU. Note that the work generator keeps some WUs CPU-only (the 4 most lightweight) and GPU-only (the 7 heaviest) for each COD structure; the remaining middle sizes are the ones that can be redirected.

In addition to the distribution between the CPU and the GPU, the WUs are distributed among several size classes. In theory, this should prevent a low-end GPU/CPU from getting heavy WUs and vice versa, but for some reason this does not work well in practice.

Another reason why it's hard to completely separate the CPU and GPU tasks in this project is that once all the WUs for a given COD structure and a given radiation source are completed, the calculated diffraction patterns are combined into a single archive file, which is then sent to a separate server. We would have to keep a lot of files on the BOINC server if we waited too long for the CPU tasks to complete.

EG
Joined: 22 Jun 17
Posts: 15
Credit: 12,727,586
RAC: 0
Message 231 - Posted: 24 Jun 2017, 21:26:50 UTC - in response to Message 230.
Last modified: 24 Jun 2017, 21:27:11 UTC

Interesting problem Vlad, I wonder if there is a solution.

The science demands that the work be structured for assembly afterwards, yet it generates problems of many different sizes...

Almost like the traveling salesman problem except in a 3D sense....

i.e., a problem with no really efficient solution...

Very interesting indeed...

Jim1348
Joined: 17 Nov 16
Posts: 6
Credit: 241,545
RAC: 0
Message 232 - Posted: 24 Jun 2017, 23:11:12 UTC - in response to Message 231.

Almost like the traveling salesman problem except in a 3D sense....

i.e., a problem with no really efficient solution...

Karmarkar found it.
https://en.wikipedia.org/wiki/Karmarkar%27s_algorithm

EG
Joined: 22 Jun 17
Posts: 15
Credit: 12,727,586
RAC: 0
Message 233 - Posted: 24 Jun 2017, 23:25:50 UTC - in response to Message 232.

Found what?

A solution? Or simply a way to make the current inefficient methods more efficient...

He discovered a different method, much more efficient than the simplex method.

Still no solution.

mmonnin
Joined: 28 Nov 16
Posts: 19
Credit: 5,313,048
RAC: 0
Message 234 - Posted: 25 Jun 2017, 2:13:13 UTC - in response to Message 230.

This project has a limited number of WUs because we cannot process more structures than are contained in the COD database. For each radiation source (x-ray, neutron), for each COD structure, we have 18 WUs of different sizes. The sizes of the WUs differ by a factor of a thousand.


How long on a GPU for all 18 if they only take 10 seconds now? Based on what we run now, it wouldn't be too long. I wouldn't mind if it ran for several hours. Even the mt tasks I force onto single threads. Bundle it up. :)

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 280
Credit: 103,382
RAC: 0
Message 236 - Posted: 25 Jun 2017, 11:11:36 UTC - in response to Message 234.
Last modified: 25 Jun 2017, 11:12:35 UTC

How long on a GPU for all 18 if they only take 10 seconds now? Based on what we run now, it wouldn't be too long. I wouldn't mind if it ran for several hours. Even the mt tasks I force onto single threads. Bundle it up. :)

First of all, they are not all 10 seconds long. If you look at your results, you will find a lot of WUs that run for more than a minute. Also, even for the same crystallite size, the time varies from structure to structure due to variations in atomic density. I'm afraid that those who crunch on Intel will not be happy to receive a bundle of WUs as a single one.

However, if the small size of the WUs bothers you, I suggest running two or more WUs in parallel on the same GPU. It should be safe to do that on high-end GPUs. This can be done via app_config.xml:

  <app_config>
    <app>
      <name>xansons_gpu</name>
      <max_concurrent>8</max_concurrent>
      <gpu_versions>
        <gpu_usage>0.5</gpu_usage>
      </gpu_versions>
    </app>
  </app_config>

You can change 0.5 to 0.25 if you want, but I do not recommend running more than 4 WUs in parallel on the same GPU.
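For example, to run 4 WUs in parallel on the same GPU, only the gpu_usage value changes:

  <app_config>
    <app>
      <name>xansons_gpu</name>
      <max_concurrent>8</max_concurrent>
      <gpu_versions>
        <!-- 0.25 of a GPU per task = 4 tasks per GPU -->
        <gpu_usage>0.25</gpu_usage>
      </gpu_versions>
    </app>
  </app_config>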

mmonnin
Joined: 28 Nov 16
Posts: 19
Credit: 5,313,048
RAC: 0
Message 237 - Posted: 25 Jun 2017, 12:01:59 UTC - in response to Message 236.

How long on a GPU for all 18 if they only take 10 seconds now? Based on what we run now, it wouldn't be too long. I wouldn't mind if it ran for several hours. Even the mt tasks I force onto single threads. Bundle it up. :)

First of all, they are not all 10 seconds long. If you look at your results, you will find a lot of WUs that run for more than a minute. Also, even for the same crystallite size, the time varies from structure to structure due to variations in atomic density. I'm afraid that those who crunch on Intel will not be happy to receive a bundle of WUs as a single one.

However, if the small size of the WUs bothers you, I suggest running two or more WUs in parallel on the same GPU. It should be safe to do that on high-end GPUs. This can be done via app_config.xml:

  <app_config>
    <app>
      <name>xansons_gpu</name>
      <max_concurrent>8</max_concurrent>
      <gpu_versions>
        <gpu_usage>0.5</gpu_usage>
      </gpu_versions>
    </app>
  </app_config>

You can change 0.5 to 0.25 if you want, but I do not recommend running more than 4 WUs in parallel on the same GPU.


I went back through quite a few pages of work completed on my 1070. The max I saw was 270 seconds, and that was already running at 3x, so a max of 90 seconds each. Nowhere close to an hour even if all 18 were done.

Constant changes in load on a processor are even worse than max load 24/7, just like stopping and starting a car in traffic. That's why I run 3x even if there is no performance benefit per WU. That takes up several CPU cores, but that's what this project requires to keep a single GPU busy.

I have to run a backup project with Xansons, as 15 tasks is not enough to keep the GPU going if there is a short delay in getting work.

I already posted an app_config here before someone else posted the one you copied.

EG
Joined: 22 Jun 17
Posts: 15
Credit: 12,727,586
RAC: 0
Message 238 - Posted: 25 Jun 2017, 13:22:47 UTC - in response to Message 236.
Last modified: 25 Jun 2017, 13:30:31 UTC


First of all, they are not all 10 seconds long. If you look at your results, you will find a lot of WUs that run for more than a minute. Also, even for the same crystallite size, the time varies from structure to structure due to variations in atomic density. I'm afraid that those who crunch on Intel will not be happy to receive a bundle of WUs as a single one.

However, if the small size of the WUs bothers you, I suggest running two or more WUs in parallel on the same GPU. It should be safe to do that on high-end GPUs. This can be done via app_config.xml:

  <app_config>
    <app>
      <name>xansons_gpu</name>
      <max_concurrent>8</max_concurrent>
      <gpu_versions>
        <gpu_usage>0.5</gpu_usage>
      </gpu_versions>
    </app>
  </app_config>

You can change 0.5 to 0.25 if you want, but I do not recommend running more than 4 WUs in parallel on the same GPU.


I'm already running 5x on two machines and they are still barely able to stay ahead of an empty queue.

5x, that's 20 WUs per machine per work cycle, while the cache is 15 WUs per GPU. And the machine is begging for more. The other two are running 4x, so we have the same issue on those also.

72 WUs per work cycle. If the WUs average 4 minutes per cycle, that's 15 cycles per hour, or 1080 WUs per hour...

Now multiply that by 100 users.

The cache should be 4 times what it is now for the most efficient operation on the user side.

Not criticizing, just illustrating how the project scales out...

The project will grow, and, as you say, you're reaching max project capacity now on the server side...

Just something to consider....

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 280
Credit: 103,382
RAC: 0
Message 239 - Posted: 25 Jun 2017, 14:47:21 UTC - in response to Message 237.

That takes up several CPU cores, but that's what this project requires to keep a single GPU busy.

So, this one is a real problem! If the CPU load were zero while the GPU is computing, it would be OK to run multiple GPU tasks in parallel. But this is a solvable problem!
Right now the CPU waits for each kernel to complete before the next one is queued. This allows the GPU to refresh the display, but it results in a high CPU load. If you are not using the system while crunching, you do not need to refresh the display, so I can introduce a command-line option which tells the program not to wait for each kernel to complete. The CPU load will then drop to zero once the CPU finishes all the preliminary computation.

You will be able to specify this option via the <cmdline> tag in app_config.xml.

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 280
Credit: 103,382
RAC: 0
Message 240 - Posted: 25 Jun 2017, 14:54:36 UTC - in response to Message 238.

5x, that's 20 WUs per machine per work cycle, while the cache is 15 WUs per GPU. And the machine is begging for more. The other two are running 4x, so we have the same issue on those also.

OK, it's now allowed to get as many as 50 WUs per GPU at once, but maybe I should also update the size regulator parameters. That requires restarting the BOINC server.
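For reference, this limit is set in the project's server-side config.xml. Assuming a standard BOINC scheduler setup, the knob looks roughly like this (a sketch, not our exact config):

  <boinc>
    <config>
      <!-- sketch: standard BOINC scheduler cap on in-progress jobs per GPU -->
      <max_wus_in_progress_gpu>50</max_wus_in_progress_gpu>
    </config>
  </boinc>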

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 280
Credit: 103,382
RAC: 0
Message 241 - Posted: 25 Jun 2017, 17:23:16 UTC - in response to Message 239.


So, this one is a real problem! If the CPU load were zero while the GPU is computing, it would be OK to run multiple GPU tasks in parallel. But this is a solvable problem!
Right now the CPU waits for each kernel to complete before the next one is queued. This allows the GPU to refresh the display, but it results in a high CPU load. If you are not using the system while crunching, you do not need to refresh the display, so I can introduce a command-line option which tells the program not to wait for each kernel to complete. The CPU load will then drop to zero once the CPU finishes all the preliminary computation.

You will be able to specify this option via the <cmdline> tag in app_config.xml.

The new CUDA app 1.03 no longer uses a CPU core, except for computing the atomic ensemble at the beginning, so there is no need to specify any command-line options.
The OpenCL app 1.03 will not use a CPU core either if you add this to app_config.xml:

  <app_version>
    <app_name>xansons_gpu</app_name>
    <plan_class>opencl_ati_102_windows</plan_class>
    <cmdline>--nowait</cmdline>
  </app_version>

However, with --nowait the system will be unusable due to display lag.
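For hosts dedicated to crunching, both settings can live in the same app_config.xml. A sketch (it reuses the plan_class from above; adjust it to match the app version your host actually receives):

  <app_config>
    <app>
      <name>xansons_gpu</name>
      <max_concurrent>8</max_concurrent>
      <gpu_versions>
        <!-- two tasks per GPU, as in the earlier example -->
        <gpu_usage>0.5</gpu_usage>
      </gpu_versions>
    </app>
    <app_version>
      <app_name>xansons_gpu</app_name>
      <!-- assumption: your host uses this OpenCL/Windows plan class -->
      <plan_class>opencl_ati_102_windows</plan_class>
      <cmdline>--nowait</cmdline>
    </app_version>
  </app_config>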

JugNut
Joined: 21 Jun 17
Posts: 20
Credit: 63,773,425
RAC: 11,950
Message 242 - Posted: 25 Jun 2017, 17:33:06 UTC - in response to Message 240.
Last modified: 25 Jun 2017, 18:29:45 UTC

Hey Vlad,
Maybe make it an optional choice in the project preferences, like some other projects do.

Say, from a drop-down box in the project preferences...

For the CPU you could choose:
Max # of cached jobs per CPU core, from 2 - 8 WUs

For the GPU you could choose:
Max # of cached jobs per GPU, from 2 - 28 WUs

Perhaps leave the default settings as they are now and then implement the above option; afterwards, only those having problems will need to change anything. Those unaffected will leave things as they are, so as to lessen the impact on server performance.

Perhaps even make the kernel size an option for those looking for better GPU utilisation?

The options below are part of the Amicable Numbers project preferences and are selectable from a drop-down box:

Kernel size for AMD/ATI GPUs [?]  16 - 21
Kernel size for NVIDIA GPUs [?]  16 - 21

The [?] is a hot box that brings up info & suggested usage tips.

Just a thought..


PS: The ATI/AMD command works well. Testing is a mixed bag: on some WUs it's massively faster, on others only a little, but overall it's certainly quicker. It flies through the WU but then takes 30 secs or more at the end to do the CPU calculations. It's early days yet, but so far it seems like quite an improvement. Testing continues, just to make sure I didn't get a bunch of smallies, but so far so good. No screen lag, but that's just because in this box the GPU attached to the monitor is an Nvidia; the AMD is a secondary card.

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 280
Credit: 103,382
RAC: 0
Message 243 - Posted: 25 Jun 2017, 18:59:06 UTC

This WU is a good example of why I don't want to bundle the different small WUs into a single one:

GeForce GTX 1080 Ti, 11339 GFLOPs - 16.7 s
Intel HD Graphics 530, 441 GFLOPs - 414.3 s

For one volunteer it's 17 seconds; for another it's 7 minutes.

mmonnin
Joined: 28 Nov 16
Posts: 19
Credit: 5,313,048
RAC: 0
Message 244 - Posted: 25 Jun 2017, 18:59:34 UTC - in response to Message 239.

That takes up several CPU cores, but that's what this project requires to keep a single GPU busy.

So, this one is a real problem! If the CPU load were zero while the GPU is computing, it would be OK to run multiple GPU tasks in parallel. But this is a solvable problem!
Right now the CPU waits for each kernel to complete before the next one is queued. This allows the GPU to refresh the display, but it results in a high CPU load. If you are not using the system while crunching, you do not need to refresh the display, so I can introduce a command-line option which tells the program not to wait for each kernel to complete. The CPU load will then drop to zero once the CPU finishes all the preliminary computation.

You will be able to specify this option via the <cmdline> tag in app_config.xml.


No it's not! Small WUs are the problem! We've said it over and over and over again. I've run my 8-thread CPU with just 4 open CPU threads for many months to feed GPUs. Now I'm using 3. This is not the problem.

I see there's a new version and you've made it worse. We want MAX GPU utilization. GPUs use so much more power than CPUs that they'd better be running at full throttle. THAT'S why I run more than one task: because they are so short, I don't get full GPU utilization with them constantly stopping and starting. Whatever you did, now I can only get 95% GPU utilization with 3 tasks, when before it was 99-100%. You've read what you want out of our posts and gone the wrong way.

I ran 4x with MW as well, since their tasks are so short and there is some CPU crunch time at the end.

© 2018 Vladislav Neverov (NRC 'Kurchatov institute'), Nikolay Khrapov (Institute for Information Transmission Problems of RAS)