Cyrus Script is now open source

19 comments, last by assainator 13 years, 3 months ago
Well, here you run into multiple problems.
One: You need an ATI HD 5xxx GPU or better to be able to run OpenCL. And as there are a lot of users that have a 4xxx or 3xxx card, this might be a problem.

Two: Calling OpenCL adds overhead. At some point you call a function to start an OpenCL function. The OpenCL driver needs to do some setup, then it needs to send the parameters to the GPU where the OpenCL program runs, then the result is returned, some more work is done, and only THEN do you have your result. I think the overhead is too large to be of use; you might only gain some performance if you have a vector (3/4 components, single/double precision) type.

I might be wrong though, so I suggest you run some benchmarks: execute 5,000,000 calculations on single and double precision floating point numbers, measure the time it takes on the CPU and on the GPU, and then compare. Make these single (scalar) calculations, as that is probably what is used most. You get something like:

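float flt_num = 324.234f;

// CPU version
for (unsigned int i = 0; i < 5000000; i++)
{
    float temp = i * flt_num;
}

// OpenCL version (call_opencl_multiply_2f stands for your own wrapper function)
for (unsigned int i = 0; i < 5000000; i++)
{
    float temp = call_opencl_multiply_2f((float)i, flt_num);
}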


Three: This means diverting resources from the rendering engine to the scripting engine. The effect might be so small that you won't even notice a difference, but it could also slow down the rendering by a large percentage.

The two main problems are support and performance.
If you don't care about supporting older or low-end cards, you only need to run some benchmarks and then decide whether you want to use OpenCL or not.

A side note: if math proves to have such a large performance impact, try to find the bottleneck and fix that first. It is better to remove or shrink the bottleneck than to add another huge part to your scripting engine.

assainator
"What? It disintegrated. By definition, it cannot be fixed." - Gru - Despicable Me

"Dude, the world is only limited by your imagination" - Me

Quote:Original post by assainator
One: You need an ATI HD 5xxx GPU or better to be able to run OpenCL. And as there are a lot of users that have a 4xxx or 3xxx card, this might be a problem.
Please check your facts next time. OpenCL runs just fine on all 4xxx series ATI cards, all NVidia cards from the 8xxx series onwards, *and* on any x86/x64 CPU.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Info was taken from:
http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-4000/hd-4350/Pages/ati-radeon-hd-4300-specifications.aspx

assainator



Thanks for your feedback.

One: OpenCL can run on any x86/x64 CPU that supports SSE3, and CPUs have supported SSE3 since 2005.
See this link for more info and a benchmark:
http://www.streamcomputing.eu/blog/2010-12-08/opencl-on-the-cpu-avx-and-sse

Two and Three: I have only researched OpenCL for one day, but I think that when you want to use the CPU for OpenCL you can use the CL_MEM_USE_HOST_PTR flag to tell OpenCL to use your own array buffer when running your code, so there is no overhead for sending data from RAM to GPU RAM.

Math is a problem for any scripting language.

If I run the code below in both C++ and Cyrus Script, the script is 40 times slower.

float flt_num = 324.234f;
for (unsigned int i = 0; i < 5000000; i++)
{
    float temp = i * flt_num;
}


But when I use this code (note the s.Print(); call), the script is only 15% slower than C++, so I think calling a function in the loop does something to the CPU that reduces performance. Maybe it disables the CPU cache, or something else.

float flt_num = 324.234f;
string s = "Hello";
for (unsigned int i = 0; i < 5000000; i++)
{
    float temp = i * flt_num;
    s.Print();
}


So I think I can improve the performance of math calculations with OpenCL.
Quote:Original post by assainator
Info was taken from:
http://www.amd.com/us/products/desktop/graphics/ati-radeon-hd-4000/hd-4350/Pages/ati-radeon-hd-4300-specifications.aspx
Those pages don't make any mention of OpenCL, because they were written *before* AMD/ATI supported OpenCL.


@swiftcoder: I thought I could trust the pages of AMD as they produce the cards. I'm sorry that I posted wrong information.

@swiftcoder & Kochol: Sorry, the articles I read only mentioned the GPU and not the CPU with the SSE3 and/or AVX instruction set(s). I'm going to read up on this tonight.

@Kochol: It's kinda strange that when you ADD code (s.Print()) the code runs faster...

You could add specific type handling for ints and floats: when the interpreter finds an instruction that operates on ints or floats, it doesn't call a function as you are doing now, but performs the calculation right there. This might improve the execution speed, as there are fewer calls. If it is indeed the cache that is causing the problem, this (possible) solution might reduce it.
You can also try setting the optimization mode in Visual Studio to minimize size (Properties -> Configuration Properties -> C/C++ -> Optimization -> Optimization, choose 'Minimize Size').
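
To illustrate the type-specific handling suggested above, here is a minimal sketch of an interpreter loop that does float arithmetic directly inside its dispatch switch instead of calling a per-operation function. The opcode names, the Instr struct, and run() are hypothetical and are not taken from Cyrus Script; the sketch also assumes a well-formed program and does no stack checking.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical opcodes; a real script VM would have many more.
enum class Op { PushFloat, MulFloat, AddFloat, Halt };

struct Instr {
    Op op;
    float operand; // only used by PushFloat
};

// Runs the program and returns the value on top of the stack.
// Assumes every MulFloat/AddFloat has two operands available.
float run(const std::vector<Instr>& program)
{
    std::vector<float> stack;
    for (std::size_t pc = 0; pc < program.size(); ++pc) {
        const Instr& in = program[pc];
        switch (in.op) {
        case Op::PushFloat:
            stack.push_back(in.operand);
            break;
        case Op::MulFloat: {
            // The arithmetic happens right here in the dispatch loop,
            // with no call into a generic operator function.
            float rhs = stack.back();
            stack.pop_back();
            stack.back() *= rhs;
            break;
        }
        case Op::AddFloat: {
            float rhs = stack.back();
            stack.pop_back();
            stack.back() += rhs;
            break;
        }
        case Op::Halt:
            return stack.empty() ? 0.0f : stack.back();
        }
    }
    return stack.empty() ? 0.0f : stack.back();
}
```

Whether this actually helps depends on how the real interpreter dispatches its instructions, so it is worth measuring before and after the change.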

One small side question: Have you also tested the speed in release mode?

I hope this post is more helpful than my previous one.


assainator


EDIT:
One other question that popped up after I posted:
Why do you want to use OpenCL for calculating on the CPU? Isn't it faster to code this yourself, since going through OpenCL means more function calls before the actual calculation starts?
If you want to use SSE (2/3/SSSE3/4/4.1), just Google 'C++ sse tutorial' and you'll find a lot of tutorials and references to use.

assainator

I tested the code below as a benchmark:

for (int j = 0; j < 1000; j++)
{
    for (int i = 0; i < 1000; i++)
    {
        c = a * b + a * b + a * b + a * b + a * b;
        c = a * b + a * b + a * b + a * b + a * b;
        // ... the same statement repeated, roughly 30 times in total ...
    }
}


Here are the results:

time with OpenCL = 119 ms
time with C++ = 329 ms
time with Cyrus Script = 84748 ms

When I ran simpler code, C++ was 23 times faster than OpenCL, but with more complicated code C++ becomes much slower while OpenCL hardly changes.
For example, the C++ time went from 5 ms to 329 ms, but the OpenCL time only went from 115 ms to 119 ms.
Well, it could be that the OpenCL compiler does more optimizations than the C++ one.
Essentially you are making the same calculation 20~25 times (I didn't count them).
And within each of these statements you repeat the same calculation, so the compiler could boil this down to:

for (int j = 0; j < 1000; j++)
{
    for (int i = 0; i < 1000; i++)
    {
        c = (a * b) * 5;
    }
}


It can even optimize it further to the following, but I don't think that will happen:

for (int i = 0; i < 1000; i++)
{
    c = (a * b) * 5;
}


You should try to find a way in which you never do the same calculation twice. You could try:

for (int j = 0; j < 1000; j++)
{
    for (int i = 1; i <= 1000; i++) // start at 1 so that a / i never divides by zero
    {
        // signed ints, so that (j - i) cannot wrap around
        c = i - (j + 500) / ((a / i) * b) + (j - i) * 2 - ((i * 3) / 4);
    }
}

This would also give you a more complete benchmark, as divisions are more difficult for a CPU/GPU than additions.
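
One way to write such a benchmark body, as a sketch only: accumulate the result of every iteration and return it, so the compiler has to keep the work instead of discarding the loop. The function name run_varying_calc and the choice to start i at 1 (to avoid dividing by zero in a / i) are assumptions made for this example.

```cpp
// Hypothetical benchmark body: every iteration depends on i and j, and the
// accumulated sum is returned so the loop cannot be removed as dead code.
float run_varying_calc(float a, float b, int n)
{
    float sum = 0.0f;
    for (int j = 0; j < n; ++j) {
        for (int i = 1; i <= n; ++i) { // i starts at 1 to avoid a / 0
            sum += i - (j + 500) / ((a / i) * b) + (j - i) * 2 - ((i * 3) / 4);
        }
    }
    return sum;
}
```

The caller should also use the returned value (print it or compare it), otherwise an optimizing compiler may still drop the call.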

I'm not saying that OpenCL can't be faster, I'm just trying to point out that I see some problems in your way of doing a benchmark.
The thing is, OpenCL on the CPU uses SSE3, which allows it to do multiple calculations at once: up to 4 single-precision calculations at a time. Therefore, OpenCL is faster at large calculations in which you basically do the same thing over and over. But in complex calculations where you can barely do the same arithmetic on several values at once, SSE3 can't be used that much anymore.
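
As a rough illustration of doing four single-precision multiplications per instruction, the sketch below uses the SSE intrinsics from <xmmintrin.h> when the compiler targets x86 with SSE, and falls back to a plain loop otherwise. The function name mul_arrays and the HAVE_SSE macro are made up for this example; it shows hand-written SSE, not what any OpenCL implementation does internally.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Multiplies a[i] * b[i] into out[i] for i in [0, n).
void mul_arrays(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
#if HAVE_SSE
    // Process four floats per iteration with unaligned loads and stores.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
#endif
    // Scalar loop for the remaining elements (or for everything without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

This only pays off when the same operation is applied to many values in a row, which matches the point above about calculations that basically do the same thing.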

And (yet again) another question: do you plan on running whole scripts in OpenCL, or only the calculations? If you only want to do calculations in OpenCL, you should try something like this:

unsigned int openclstart = GetTime(); // fill in your method of getting the time here
for (unsigned int i = 0; i < 1000; i++)
{
    // do addition
    c = call_opencl_add_func(a, b);
    // do multiply
    c = call_opencl_mul_func(a, b);
    // do divide
    c = call_opencl_div_func(a, b);
    // do subtract
    c = call_opencl_sub_func(a, b);
}
unsigned int openclend, cppstart;
openclend = cppstart = GetTime(); // again
for (unsigned int i = 0; i < 1000; i++)
{
    c = a + b;
    c = a * b;
    c = a / b;
    c = a - b;
}
unsigned int cppend = GetTime();
// Same for Cyrus Script here
unsigned int opencl_time = openclend - openclstart;
unsigned int cpp_time = cppend - cppstart;
// same for Cyrus Script

I suggest this because chances are small that you will ever do calculations on whole buffers in Cyrus Script.
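
For the GetTime() placeholder above, one possible implementation, assuming a C++11 compiler, is std::chrono::steady_clock. The TimeMs helper is a hypothetical convenience wrapper and is not part of the original snippet.

```cpp
#include <chrono>

// Milliseconds since an arbitrary fixed starting point, read from a monotonic
// clock. Only differences between two calls are meaningful.
unsigned int GetTime()
{
    using namespace std::chrono;
    return static_cast<unsigned int>(
        duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count());
}

// Hypothetical helper: runs any callable once and returns the elapsed time in
// milliseconds with sub-millisecond resolution.
template <typename F>
double TimeMs(F&& work)
{
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

With only 1000 iterations the loops above may finish in well under a millisecond, so a larger iteration count or the finer-grained TimeMs is probably needed to see a difference.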

I hope this helped.

assainator

[Edited by - assainator on January 4, 2011 1:03:56 AM]

Quote:Original post by assainator
@Kochol: It's kinda strange that when you ADD code (s.Print()) the code runs faster...

It does not run faster. Actually, C++ becomes much slower, and its advantage over the script drops from 20 times faster to only 0.15x (15%) faster.

Quote:
You could add specific type handling for ints and floats: when the interpreter finds an instruction that operates on ints or floats, it doesn't call a function as you are doing now, but performs the calculation right there. This might improve the execution speed, as there are fewer calls.

I don't get you. Can you please explain more how I can perform the calculation there?

Quote:If it is indeed the cache that is causing the problem, this (possible) solution might reduce it.
You can also try setting the optimization mode in Visual Studio to minimize size (Properties -> Configuration Properties -> C/C++ -> Optimization -> Optimization, choose 'Minimize Size').

I'm not sure if caching is the problem. The only thing I know is that every scripting language is slow at calculations. I tested Cyrus Script against another scripting language and C++.
Cyrus Script was 40 times slower than C++ in calculations and 6 times faster than the other scripting language. So this is not only a Cyrus Script problem, but I want to find a way to solve it or at least improve the speed.

Quote:
One small side question: Have you also tested the speed in release mode?

No not yet :D

Quote:
I hope this post is more helpful than my previous one.


assainator

Thank you very much. Your posts are very helpful and have helped me a lot.

Quote:EDIT:
One other question that popped up after I posted:
Why do you want to use OpenCL for calculating on the CPU? Isn't it faster to code this yourself, since going through OpenCL means more function calls before the actual calculation starts?
If you want to use SSE (2/3/SSSE3/4/4.1), just Google 'C++ sse tutorial' and you'll find a lot of tutorials and references to use.

assainator

I want to make an interface that lets Cyrus Script users use OpenCL in their scripts, so if they want to do some calculations in a script they have a faster way to do it.

I want to thank you for your posts one more time.

[Edited by - Kochol on January 4, 2011 3:40:13 PM]

Yes, you are right, my benchmark was wrong. I benchmarked OpenCL with your code 100,000,000 times: OpenCL takes 10 seconds and C++ takes 2 seconds to calculate it.
Maybe there is a way to speed it up in OpenCL that I haven't discovered yet.

But it is still much faster than the script :D

Thanks for your help

