GPU app for ARM Mali T-628

Profile [B@P] Daniel
Joined: 18 Jun 17
Posts: 25
Credit: 47,963,162
RAC: 0
Message 500 - Posted: 17 Sep 2017, 20:14:42 UTC
Last modified: 17 Sep 2017, 20:20:29 UTC

I am trying to run the Xansons apps on an Odroid C2. The CPU app works fine (http://xansons4cod.com/xansons4cod/results.php?hostid=3878&appid=2). However, I have a problem with the GPU app for the integrated Mali T-628 GPU. I was able to build and install it, but it does not work as expected: progress quickly jumped to 1% and has been stuck there for an hour. CPU usage for the app is about 97%, so it looks like the app is doing something. I briefly checked the code and found that the GetGFLOPS function returns 0 for an unrecognized GPU, which affects the calculations that depend on this value. Could you take a look at this?

Part of the scheduler request with details of the Mali T-628 GPU is here: https://github.com/BOINC/boinc/issues/1686.
____________

Vlad
Project administrator
Project developer
Project tester
Project scientist
Help desk expert
Joined: 26 Oct 16
Posts: 321
Credit: 103,382
RAC: 0
Message 501 - Posted: 17 Sep 2017, 21:31:51 UTC - in response to Message 500.

I updated GetGFLOPS to support Mali GPUs (https://gitlab.com/vsnever/xansons_boinc/commit/1edef4a9ae5d97beb2c38ea5338150d83cad46b9) but I have no device to test on, so it will be trial and error... Can you rebuild the app with the updated source and try again?
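
The failure mode discussed here, a GFLOPS estimate of 0 for an unrecognized GPU feeding into later calculations, can be illustrated with a small sketch. This is not the project's GetGFLOPS code: the function name, the FLOPs-per-cycle parameter and the fallback value are all assumptions made for illustration.

```python
def estimate_gflops(compute_units, clock_mhz, flops_per_cycle_per_cu=None,
                    fallback_gflops=1.0):
    """Rough peak-GFLOPS estimate that never returns 0 (hypothetical helper).

    flops_per_cycle_per_cu is None for an unrecognized GPU. In that case a
    small positive fallback is returned, so that values derived from the
    estimate (for example work sizes or progress fractions) stay finite.
    """
    if flops_per_cycle_per_cu is None or compute_units <= 0 or clock_mhz <= 0:
        return fallback_gflops
    # units * (FLOPs per cycle per unit) * (cycles per second) / 1e9
    return compute_units * flops_per_cycle_per_cu * clock_mhz * 1e6 / 1e9
```

The point of the design is only that an unknown device degrades to a conservative non-zero estimate instead of a value that stalls everything computed from it.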

Profile [B@P] Daniel
Message 502 - Posted: 17 Sep 2017, 22:44:50 UTC

Thanks. I tried it, and now it performs the calculation. However, the first result is inconclusive and will most probably be invalid: some kernels returned the error CL_OUT_OF_RESOURCES:
http://xansons4cod.com/xansons4cod/result.php?resultid=21913280

I also spotted one minor issue: at the beginning, progress jumps to 5% and then drops back to 1%. This is only cosmetic.

I am now running a second WU and will see whether it finishes successfully.

One more thing: my device is an Odroid XU4, not a C2 as I wrote above. The C2 does not have an OpenCL-capable GPU.
____________

Vlad
Message 503 - Posted: 17 Sep 2017, 23:46:22 UTC - in response to Message 502.
Last modified: 17 Sep 2017, 23:56:26 UTC

Probably not enough registers per work-item. I'm afraid I can't fix this without the device.
By the way, did you change the makefile to build the app for ARM?

Update: can you change the BlockSize1Dsmall value from 256 to 128 in kernelsPDF.cl (line 25) and typedefs.h (line 40), then rebuild and try again?
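
The suggestion here works around CL_OUT_OF_RESOURCES by hard-coding a smaller work-group size. A more general approach is to clamp the preferred size to the limit the driver reports for each kernel (in OpenCL this is queried with clGetKernelWorkGroupInfo and CL_KERNEL_WORK_GROUP_SIZE). The sketch below shows only the arithmetic; both function names are hypothetical, and the driver limit is passed in as a plain integer rather than queried.

```python
def pick_block_size(preferred, kernel_max):
    """Largest power of two that is <= both preferred and kernel_max."""
    if preferred < 1 or kernel_max < 1:
        raise ValueError("sizes must be positive")
    limit = min(preferred, kernel_max)
    size = 1
    while size * 2 <= limit:
        size *= 2
    return size


def padded_global_size(n_items, block_size):
    """Round n_items up to a multiple of block_size.

    OpenCL 1.x requires the global size to be divisible by the local size.
    """
    return ((n_items + block_size - 1) // block_size) * block_size
```

For example, a preferred size of 256 with a reported limit of 128 gives 128. Because the block size appears in both the kernel source and the host header, any value chosen this way would still have to reach both places, for instance as a build option.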

Profile [B@P] Daniel
Message 504 - Posted: 18 Sep 2017, 12:15:21 UTC - in response to Message 503.

OK, I changed this and rebuilt the app. It looks like it slowed down; I will see whether it can now complete the calculations successfully.

The Makefile needed only a minor update: I changed the path to the BOINC source. The OpenCL build needed one more change: I had to link with libmali as well, because after the last system update libOpenCL stopped exporting the cl* functions. Previously, linking with libOpenCL was enough.
____________

Vlad
Message 505 - Posted: 18 Sep 2017, 16:18:21 UTC - in response to Message 504.

Ok. Thank you very much!

Vlad
Message 507 - Posted: 18 Sep 2017, 19:12:48 UTC - in response to Message 505.

http://xansons4cod.com/xansons4cod/result.php?resultid=21913410 Success! This WU was a big one: it took 559 seconds to finish on a GTX 1080. So I don't think the app was slowed down by reducing the work-group size.

Profile [B@P] Daniel
Message 508 - Posted: 18 Sep 2017, 19:19:26 UTC
Last modified: 18 Sep 2017, 19:21:22 UTC

The WU mentioned earlier finally completed, after 6 hours 42 minutes. The smaller block size helped: the WU was validated successfully. It looks like it was a very big one; I got over 3k credits for it. Here is the link for reference:
http://xansons4cod.com/xansons4cod/result.php?resultid=21913410

The next WU was a lot shorter; it completed in 31 minutes:
http://xansons4cod.com/xansons4cod/result.php?resultid=21913461

This time could be shorter if the kernels were optimized for the Mali GPU: global, local and host memory are shared (no need to copy data between them), it supports SIMD instructions (the vector size is 128 bits), and there are no thread blocks (warps/wavefronts). But with only 3 weeks until the end of the main part of the calculations, it probably does not make sense to optimize the code for this GPU.

Edit: you were first, I was waiting for the second WU to finish :). Thanks for your help with this app!
____________

Profile [B@P] Daniel
Message 509 - Posted: 18 Sep 2017, 19:58:16 UTC

I have uploaded binaries to Bitbucket; you can try them if you want:
https://bitbucket.org/sirzooro/boinc-stuff/downloads/

To use them, attach your device to the project, unpack the contents of the xansons_odroid_xu4.tgz archive into the /var/lib/boinc-client/projects/xansons4cod.com_xansons4cod/ directory, and restart the BOINC client (a config reload will not work). After doing this, you should see an entry like this in the event log:

XANSONS for COD | Found app_info.xml; using anonymous platform

The CPU app is configured for 8 cores. If your device has a different number of CPU cores, you must edit app_info.xml and change the values in the avg_ncpus, max_ncpus and cmdline tags to match your device. Changes to app_info.xml most probably also require a BOINC restart to take effect.
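
The core-count edit can be scripted. The following is a minimal sketch, assuming app_info.xml uses the usual BOINC anonymous-platform layout with one <app_version> element per app. The sample XML is made up for illustration and is not the file from the archive, and the app name xansons_cpu is an assumption taken from the project configuration quoted later in this thread. The cmdline value is deliberately left untouched because its format is not shown here, so it still has to be edited by hand.

```python
import xml.etree.ElementTree as ET


def set_cpu_count(app_info_xml, app_name, ncpus):
    """Return app_info.xml text with avg_ncpus/max_ncpus set to ncpus
    for every <app_version> of app_name. <cmdline> is not modified."""
    root = ET.fromstring(app_info_xml)
    for version in root.iter("app_version"):
        if version.findtext("app_name") != app_name:
            continue
        for tag in ("avg_ncpus", "max_ncpus"):
            node = version.find(tag)
            if node is None:
                # Add the tag if this app_version does not have it yet.
                node = ET.SubElement(version, tag)
            node.text = str(ncpus)
    return ET.tostring(root, encoding="unicode")


# Illustrative input only; a real app_info.xml contains more elements.
sample = """<app_info>
  <app_version>
    <app_name>xansons_cpu</app_name>
    <avg_ncpus>8</avg_ncpus>
    <max_ncpus>8</max_ncpus>
  </app_version>
</app_info>"""

updated = set_cpu_count(sample, "xansons_cpu", 4)
```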
____________

Vlad
Message 512 - Posted: 18 Sep 2017, 20:57:26 UTC - in response to Message 509.

Thank you very much, that's great!

The app already has zero-copy implemented for Intel GPU, so I enabled it for ARM too (not tested). Also, building the app for Mali should be easier now (if I updated the Makefile correctly), just:
make OpenCL=1 Mali=1
https://gitlab.com/vsnever/xansons_boinc/commit/e4739db337917c6d8273421a16c24c44b353c479

Profile [B@P] Daniel
Message 513 - Posted: 18 Sep 2017, 21:50:39 UTC - in response to Message 512.

Thanks, I pulled the changes. For some reason linking failed; it started working after I moved -lmali after -lpthread. The new app is running now; I will see in the morning whether it still works.
____________

Profile [B@P] Daniel
Message 521 - Posted: 22 Sep 2017, 18:23:18 UTC
Last modified: 22 Sep 2017, 18:26:07 UTC

The recompiled app works fine. However, I have now noticed that the BOINC client downloaded too many GPU tasks: I have 369 of them waiting, and the estimated time to complete is 100 days. My work buffer is set to 0.5 + 0.01 days. In the past I noticed that for some projects the real buffer is twice the configured one, but that is acceptable. In this case, however, it looks like there was no limit at all.

The CPU task queue looks fine; I now have 5 tasks waiting.
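
To put these numbers in perspective, here is a short worked calculation using only the figures from this post. It assumes, for illustration, that all queued tasks share the same runtime estimate.

```python
queued_tasks = 369
estimated_days = 100.0
buffer_days = 0.5 + 0.01  # work buffer as stated in the post

# Average runtime estimate per queued task, in hours.
hours_per_task = estimated_days * 24 / queued_tasks
# Number of such tasks that would fill the configured buffer.
tasks_fitting_buffer = buffer_days * 24 / hours_per_task

print(f"estimated runtime per task: {hours_per_task:.1f} h")
print(f"tasks that fit the buffer:  {tasks_fitting_buffer:.1f}")
```

With these estimates the buffer corresponds to roughly two tasks of about 6.5 hours each, so 369 queued tasks is about two orders of magnitude more than the buffer setting asks for.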
____________

Vlad
Message 525 - Posted: 22 Sep 2017, 21:36:25 UTC - in response to Message 521.

Here is my config_aux.xml:
<?xml version="1.0" ?>
<config>
  <max_jobs_in_progress>
    <app>
      <app_name>xansons_gpu</app_name>
      <gpu_limit>
        <jobs>50</jobs>
        <per_proc/>
      </gpu_limit>
    </app>
    <app>
      <app_name>xansons_cpu</app_name>
      <cpu_limit>
        <jobs>25</jobs>
      </cpu_limit>
    </app>
  </max_jobs_in_progress>
</config>

The limit is 50 WUs per GPU. Is this ignored if the anonymous platform is used?
Also, have you set peak_flops for your GPU in cc_config.xml? The BOINC server shows 0. It should be ~7.5e10.
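
The intended cap can be read off the configuration with a short sketch. It assumes that per_proc scales the job limit by the number of GPU instances on the host, which is one reading of the scheduler's behavior and not something confirmed in this thread. The function name gpu_job_limit is hypothetical, and the XML is a trimmed copy of the GPU part of the configuration above.

```python
import xml.etree.ElementTree as ET

CONFIG = """<config>
  <max_jobs_in_progress>
    <app>
      <app_name>xansons_gpu</app_name>
      <gpu_limit>
        <jobs>50</jobs>
        <per_proc/>
      </gpu_limit>
    </app>
  </max_jobs_in_progress>
</config>"""


def gpu_job_limit(config_xml, app_name, n_gpus):
    """Return the GPU job cap for app_name, or None if none is configured.

    Assumption: with <per_proc/> present, the cap is jobs * n_gpus.
    """
    root = ET.fromstring(config_xml)
    for app in root.iter("app"):
        if app.findtext("app_name") != app_name:
            continue
        limit = app.find("gpu_limit")
        if limit is None:
            return None
        jobs = int(limit.findtext("jobs"))
        per_proc = limit.find("per_proc") is not None
        return jobs * n_gpus if per_proc else jobs
    return None
```

Under this reading, one or two detected GPUs would give a cap of 50 or 100 jobs, both well below the 369 tasks reported in the previous message.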

Profile [B@P] Daniel
Message 529 - Posted: 23 Sep 2017, 13:08:45 UTC
Last modified: 23 Sep 2017, 13:15:33 UTC

The limit for the anonymous platform should work fine; my CPU job queue did not grow too much. I suspect a different issue here: OpenCL reports that the Odroid XU4 has two Mali T-628 GPUs, one with 4 compute units and one with 2. BOINC sees both but uses only the first one. I tried adding <use_all_gpus>1</use_all_gpus> to cc_config.xml, but it did not help. I suspect that BOINC does not work properly when two GPUs have the same vendor and name but different parameters.

BOINC was able to detect this Mali GPU by itself; I did not have to add anything special to cc_config.xml. It looks like BOINC did not try to benchmark it and reports 0 instead of the actual GFLOPS. Or it does not report this value to the server; I do not know. I have added this information to the BOINC issue on GitHub (the link is in the first post).
____________

© 2020 Vladislav Neverov (NRC 'Kurchatov institute'), Nikolay Khrapov (Institute for Information Transmission Problems of RAS)