Blog Archives

Using OpenCL to find the number of "odd and even" pixels in an image

11/10/2018

As described before, a pixel is considered "odd" if the sum of its R,G,B components is an odd number.

Anyway, this week I have been focusing on trying to learn OpenCL.

It's a bit like only using compute shaders, though it's pretty different from the other types of shader. There are devices (hey, I know what those are), contexts (not too difficult a concept), and programs, buffers and kernels (the shader programs). Some of these I am familiar with from Vulkan and OpenGL.

The hardest parts for me to get my head around were the work items and work groups. It's easier to think of the game 7 Billion Humans at this point. Each worker has a work item, and they are in a work group (like a room). A worker is essentially a thread. The weirdest part of this is that you - the programmer - have to say how many work items you want and how many work groups you want. Each work group has the same number of work items (I think). So you have to divide up what you want to do.

For my GPU device I have 4100 workers available (just over 64x64 = 4096). So to optimise, I chopped a bit off my 20 million pixel image - only doing 20,246,528 pixels. The width and height properties I set arbitrarily at 4096 and 4943.
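As a rough illustration of dividing the work up (a sketch under assumptions, not the code behind the timing in this post): when a work-group size is given, OpenCL 1.x wants the total number of work items to be a whole multiple of it, so a common approach is to round the item count up rather than chop pixels off. The helper names and the group size of 64 are hypothetical.

```cpp
#include <cstddef>

// Number of work groups needed to cover `items` work items.
// Hypothetical helper: not part of the OpenCL API.
constexpr std::size_t groupCount(std::size_t items, std::size_t groupSize)
{
    return (items + groupSize - 1) / groupSize;
}

// Round `items` up to the next whole multiple of `groupSize`.
// Hypothetical helper: not part of the OpenCL API.
constexpr std::size_t roundUpToMultiple(std::size_t items, std::size_t groupSize)
{
    return groupCount(items, groupSize) * groupSize;
}
```

With the full 20,250,000 pixels and 64 items per group this gives 316,407 groups and a padded total of 20,250,048 work items. The kernel would then need a bounds check so that the 48 padding items do nothing.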

In order to get it to work, some numbers had to be fudged. Workers may not have been working on what they were supposed to in the end. I WILL COME BACK TO THIS AND DO IT RIGHT. But for now, I wanted to share my timing value. It took 0.62 seconds. That is the slowest yet. I will share the code later once I am doing what I am supposed to be doing.

In my Lambda attempt I used the following code:

oddNumber = static_cast<int>(std::count_if(std::begin(data), std::end(data), [](const int val) { return val & 1; }));
evenNumber = COUNT - oddNumber;

This ran at an average time of 0.0234584 seconds. That is slightly slower than the 4th iteration of the findQty() method, but about the same. It is noticeably slower than the fastest method that used local variables, but a lot faster than the other 3 iterations of the findQty() method. So even though it does both the check and the add, it's pretty fast.
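For anyone wanting to try the lambda version in isolation, here is a self-contained sketch. The Tally struct and the countWithLambda name are placeholders rather than anything from the original program, and a std::vector stands in for the global data array.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical result type; the original program used two global counters.
struct Tally
{
    int odd;
    int even;
};

// Count the odd values with std::count_if, then derive the even count from the total.
Tally countWithLambda(const std::vector<int>& data)
{
    const int odd = static_cast<int>(
        std::count_if(data.begin(), data.end(),
                      [](const int val) { return (val & 1) != 0; }));
    return Tally{odd, static_cast<int>(data.size()) - odd};
}
```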

In conclusion lambda expressions are pretty cool. OpenCL is a little confusing for my graphics orientated brain, but I will get to know it better, and again - parallelism is not always the solution.

Stay tuned for a boring update on my game on Monday - progress has been slow.

Further Planning

10/10/2018

Game Design Document

8/10/2018

06_game_desing_document___other_planning.pdf
File Size:	68 kb
File Type:	pdf

Refinement

5/10/2018

Yesterday I did some more work on the efficiency problem. I decided to use lambda expressions to see if that was any faster; there will be a blog post about it tomorrow, along with my results from an OpenCL test (if I ever get it working). For now though, back to game dev. Here's my refinement of the Pogo Fello idea:

05_picking_a_final_idea___refinement_of_idea.pdf
File Size:	56 kb
File Type:	pdf

A little break from Game Dev stuff...

3/10/2018

So, I'm looking for a job. As part of that process I'm going to interviews and experiencing interview questions. Now, I didn't sign an NDA, but I'm still going to keep the company name secret for the example I'm going to give.

One of the problems set was to write a piece of code that would return the quantity of odd and even numbers in an array of integers.

In the example these quantities were - I think - members of a struct which was a member of a class. But for my experiment I changed them to be global variables.

What experiment? Well, after the test we went through the code to try and optimise it as best we could. My experiment was to find out what the specific performance impacts were of the changes we made.

Anyway, here's the first piece of code I wrote for it:

void findQty(int* data, int count)
{
        for(int i = 0; i < count; i++)
        {
                if(data[i] % 2 == 0)
                        evenNumber++;
                else
                        oddNumber++;
        }
}

The most obvious improvement I saw was to take out the extra increment, since we can work out the number of odd numbers from knowing the number of even numbers and the total number of numbers. I changed it to this:

void findQty(int* data, int count)
{
        for(int i = 0; i < count; i++)
        {
                if(data[i] % 2 == 0)
                        evenNumber++;
        }
        oddNumber = count - evenNumber;
}

This was better but it still could be improved. I initially suggested that the if statement would be the quickest part, but upon further inspection determined that out of all operations, the modulo would take the longest to do. Since we know that the last bit of an odd number is one, we can do a bitwise AND and check to see if it has that.

void findQty(int* data, int count)
{
        for(int i = 0; i < count; i++)
        {
                if((data[i] & 1) == 0) // brackets needed: == binds tighter than &
                        evenNumber++;
        }
        oddNumber = count - evenNumber;
}
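One thing to watch with this kind of test is operator precedence: in C and C++, == binds more tightly than &, so an unbracketed x & 1 == 0 is read as x & (1 == 0), which is always 0. The two hypothetical helpers below spell out both readings.

```cpp
// What `x & 1 == 0` actually means: (1 == 0) is false, i.e. 0,
// so the whole expression is 0 for every x.
constexpr bool isEvenUnbracketed(int x)
{
    return (x & (1 == 0)) != 0;
}

// The intended test: mask off the lowest bit first, then compare with 0.
constexpr bool isEvenBracketed(int x)
{
    return (x & 1) == 0;
}
```

With the unbracketed form the even counter would never be incremented, so every number would end up counted as odd.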

I will admit that it took me a while to figure out how to improve this further, but looking back it seems so obvious. We can take advantage of the fact that if statements in C/C++ treat 0 as false and any non-zero value (here, 1) as true, so we don't have to do the comparison with the equals. It also means we have to swap the odd and even counts.

void findQty(int* data, int count)
{
        for(int i = 0; i < count; i++)
        {
                if(data[i] & 1)
                        oddNumber++;
        }
        evenNumber = count - oddNumber;
}

Okay, looking good, but there is one more improvement we can make at the moment. The expression inside the if statement evaluates to 1 if the number is odd, and in that case we're adding 1 to the odd count. It evaluates to 0 when the number is even, and then we're not adding anything. See where this is going?

void findQty(int* data, int count)
{
        for(int i = 0; i < count; i++)
        {
                oddNumber += data[i] & 1;
        }
        evenNumber = count - oddNumber;
}
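A side note on why the addition uses & 1 and not % 2: the two only agree for non-negative values. The pixel sums used later are never negative, so it makes no difference in this experiment, but for a general array of integers it would. The helper names below are hypothetical.

```cpp
// 1 for every odd value, including negative ones, because the lowest bit of
// an odd number is set in two's complement.
constexpr int oddBit(int x)
{
    return x & 1;
}

// Since C++11 the remainder takes the sign of the dividend,
// so negative odd values give -1 instead of 1.
constexpr int oddRemainder(int x)
{
    return x % 2;
}
```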

There is another improvement to be made... but let's leave that for now. It's not relevant yet. Next we're going to look at the performance.

The first thing I needed was some random integers. It was suggested to me that I could use an image. So I did. I loaded in an image using the stb_image library, added the components of each pixel to get a number between 0 and 765 for each one. The image was 6000 pixels wide and 3375 pixels tall resulting in about 20 million (20,250,000) pseudo-random numbers.
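A minimal sketch of the summing step, assuming the image has already been decoded into a tightly packed buffer of 3 bytes per pixel (stb_image's stbi_load can produce one when asked for 3 channels). The function name is hypothetical, and a std::vector stands in for the raw pointer so the example has no dependencies.

```cpp
#include <cstddef>
#include <vector>

// Sum the R, G and B bytes of each pixel in a tightly packed RGB buffer.
// Assumes 3 bytes per pixel; any trailing partial pixel is ignored.
std::vector<int> sumPixelComponents(const std::vector<unsigned char>& rgb)
{
    std::vector<int> sums;
    sums.reserve(rgb.size() / 3);
    for (std::size_t i = 0; i + 2 < rgb.size(); i += 3)
        sums.push_back(rgb[i] + rgb[i + 1] + rgb[i + 2]);
    return sums;
}
```

A pure white pixel gives the maximum of 765 and a pure black pixel gives 0.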

NB: I used the 64-bit release mode configuration as this was determined to be the fastest in a prior experiment. This may have been a poor decision, since there was so little difference in the times each method took that it was hard to see.

I timed the methods using a high resolution clock, a part of the std::chrono library. I timed each method 5 times and took the average since there was a fair amount of variation in each method's own timings, i.e. there was some overlap.
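The timing approach can be sketched roughly like this. It is not the original harness: the averageSeconds name is hypothetical, and std::chrono::steady_clock is swapped in for the high resolution clock because it is guaranteed never to jump backwards.

```cpp
#include <chrono>

// Run `work` `runs` times and return the mean wall-clock time in seconds.
// Hypothetical helper, not taken from the original program.
template <typename Work>
double averageSeconds(Work work, int runs = 5)
{
    using Clock = std::chrono::steady_clock;
    double total = 0.0;
    for (int r = 0; r < runs; ++r)
    {
        const auto start = Clock::now();
        work();
        const auto stop = Clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / runs;
}
```

In a release build the result of the timed work should actually be used (printed, for example), otherwise the optimiser may remove the loop being measured.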

Here are the results (time is measured in seconds):

As you can see, there was a performance improvement at every stage, but the biggest impact was removing the if statement altogether.

Next I attempted to multithread the fastest routine to see if it would go any faster, and it did, a little bit. However, I noticed an issue: the number of odd and even numbers it was reporting was incorrect. The number of odd numbers was too low. I determined this to be caused by two threads reading that number at the same time, adding to it at the same time, then writing it back at the same time, so that when it was next read it had only increased by 1 and not by 2 as it should have.

To combat this problem I introduced a local variable, since each thread would have its own copy of it. This solved the problem and created a speed-up. To check that the speed-up was not caused by this change and was the fault of the threading, I tested the new method again without multithreading. Alas, I found that it was in fact this change that created the performance increase. The graph and code for this are below:

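void findQty4thread(const int start, const int end)
{
        // local accumulator: each thread gets its own copy on its own stack
        int odd = 0;
        for (int i = start; i < end; i++)
                odd += data[i] & 1;

        // NB: this is still a write to a shared global, just one per thread
        // instead of one per element; with several threads it needs
        // protecting (e.g. by making oddNumber a std::atomic<int>)
        oddNumber += odd;
        evenNumber = COUNT - oddNumber;
}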

The process of creating the threads themselves took a similar amount of time to the time they were saving by doing the work in parallel. Perhaps this problem could be solved by doing the timings differently and waiting on the work with something like std::future::wait_until() (std::thread itself only offers join() for waiting) - something for the nine readers I have to look into methinks.
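For reference, here is one way the split could look with std::thread, where every thread writes its count to its own slot and the slots are only added up after all threads have been joined. This is a sketch under assumptions rather than the code that was timed: the function name and the even chunking are hypothetical, and it does nothing about the thread start-up cost.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Count odd values by giving each thread its own slice and its own result slot.
// No two threads ever write to the same variable.
int countOddThreaded(const std::vector<int>& data, unsigned threadCount)
{
    if (threadCount == 0)
        threadCount = 1;

    std::vector<int> partial(threadCount, 0);
    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    const std::size_t chunk = (data.size() + threadCount - 1) / threadCount;

    for (unsigned t = 0; t < threadCount; ++t)
    {
        const std::size_t start = std::min(data.size(), static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(data.size(), start + chunk);
        threads.emplace_back([&data, &partial, t, start, end] {
            int odd = 0; // local accumulator, private to this thread
            for (std::size_t i = start; i < end; ++i)
                odd += data[i] & 1;
            partial[t] = odd; // one write per thread, each to a different element
        });
    }

    for (std::thread& th : threads)
        th.join();

    int total = 0;
    for (const int p : partial)
        total += p;
    return total;
}
```

The even count then follows from the size of the data minus the returned value, as in the single-threaded versions.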

Next I wanted to see if there was a performance increase if I used a GPU instead of a CPU. For this I modified the program I had used to do Monte-Carlo Pi on the GPU. I did this using atomic counters. Here are the important parts of the render loop:

GLuint* userCounters;
                
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, atomicsBuffer);
// map the buffer, userCounters will point to the buffers data
userCounters = static_cast<GLuint*>(glMapBufferRange(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(GLuint) * 3,
 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT | GL_MAP_UNSYNCHRONIZED_BIT));
// set the memory to zeros, resetting the values in the buffer
memset(userCounters, 0, sizeof(GLuint) * 3);
// unmap the buffer
glUnmapBuffer(GL_ATOMIC_COUNTER_BUFFER);

glBindBufferBase(GL_ATOMIC_COUNTER_BUFFER, 0, atomicsBuffer);

// ...

// Read buffer into a local array (userCounters is no longer valid after glUnmapBuffer)
GLuint counterValues[3] = {};
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, atomicsBuffer);
glGetBufferSubData(GL_ATOMIC_COUNTER_BUFFER, 0, sizeof(counterValues), counterValues);
glBindBuffer(GL_ATOMIC_COUNTER_BUFFER, 0);
const int numOdd = static_cast<int>(counterValues[0]);
const int numEven = maxImageWidth * maxImageHeight - numOdd;

Here's the important shader code:

#version 420
layout(binding = 0, offset = 0) uniform atomic_uint numOdd;
in vec2 TexCoord;
out vec4 color;
uniform sampler2D ourTexture1;

vec4 test()
{
        vec4 white = vec4(1.0, 1.0, 1.0, 1.0);
        vec4 black = vec4(0.0, 0.0, 0.0, 1.0);

        vec3 colour = texture(ourTexture1, TexCoord).rgb;
        int sum = int((colour.r + colour.g + colour.b) * 255);

        if ((sum & 1) != 0) // GLSL does not implicitly convert int to bool
        {
                atomicCounterIncrement(numOdd);
                return black;
        }
        return white;
}

void main()
{
        color = test();
}

Note that I couldn't use the exact method I had previously, since you can only increment or decrement atomic counters - you cannot do a +=.

Anyway, after I did this and printed my result to the screen, I found that it was not processing the full set of numbers. The image size was correct, but the window size was not: GLFW had limited me to a maximum window size of 1924x1061, which is 2,041,364 pixels - about 10% of the original image's 20,250,000. Here are the images side-by-side:

It speaks to JPEG's compression that there are clusters of white and black - they are probably the exact same colour.

It took 0.00078 seconds on average to render this on the GPU. For the same number of numbers on the CPU it took approximately 0.001 seconds. So the GPU was faster, but not by much. If we scale it up to the number of pixels we were doing before (and assume a linear increase in workload), then the resulting time of roughly 0.0077 seconds is still about twice as fast as the fastest configuration on the CPU.
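The scaling estimate can be reproduced with a couple of lines, under the same assumption that time grows linearly with the number of pixels. The function name is hypothetical.

```cpp
// Scale a measured time linearly from `measuredPixels` up to `targetPixels`.
constexpr double scaleLinearly(double seconds, double measuredPixels, double targetPixels)
{
    return seconds * (targetPixels / measuredPixels);
}
```

Plugging in 0.00078 seconds, 2,041,364 window pixels and 20,250,000 image pixels gives a little over 0.0077 seconds.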

Consider this: I was only using one core at a maximum of 4GHz for the CPU implementation, whereas I was potentially using 2560 cores at 1GHz on the GPU.

Perhaps I could have achieved a proper implementation using OpenCL. But the point I'm trying to make here is, certain tasks are more appropriate for certain pieces of hardware, and multithreading isn't always the best solution for increasing performance.

Feedback & Analysis

2/10/2018

04_feedback___analysis_of_ideas.pdf
File Size:	31 kb
File Type:	pdf

Author

Hi there, the name's Matthew Jenkinson and I'm currently working at Firesprite. In my spare time I work on programming projects like you see here.

If you intend to use the filters, PLEASE READ THE POSTS IN THE IMPORTANT CATEGORY FIRST

Here is a link to the full source code for ALL of the filters: basic_post.frag

Note to me: for code formatting use <pre><code class="language-cpp">...</code></pre>
