corrupt struct members on _beginthread



i've never used threads much before, and i'm having some weird issues where, after a few iterations of calling _beginthread, the member data of the struct argument i'm passing in is corrupted. have a look at the example code and the output below:
typedef struct {
    int a,b,c;
}abc;
void process_abc(abc* _abc){
    FILE* f=fopen("C:\\output.txt", "at");
    fprintf(f,"%i\n",_abc->a);
    fprintf(f,"%i\n",_abc->b);
    fprintf(f,"%i\n",_abc->c);
    fclose(f);
    _endthread();
}

// in a function:
for(int i=0;i<32;i++){
    abc _abc;
    _abc.a=1;
    _abc.b=2;
    _abc.c=3;
    _beginthread((void(*)(void*))process_abc,0,(void*)&_abc);
}

the output:
1
2
3
1
2
...
1
2
3
1
64
138
1
64
208

any idea what's causing this?

Consider what happens to _abc when you exit the loop, and presumably, the function. It falls out of scope, and hence it (and the &_abc pointer you gave to each thread) can no longer be said to refer to valid memory.

Here, it is being allocated in the stack frame of the function, and when the function exits (before the threads have had a chance to observe and process the abc!), the memory it was using is overwritten by the stack frames of later function calls.

ok, that makes sense now that i think about it, but how do i pass this data in then? just passing a copy instead of a pointer like this won't compile:

_beginthread((void(*)(void*))process_abc,0,(void*)_abc);

what's the best practice in this situation? thanks

Possibly, allocate a new abc for each thread, and pass that to it. The thread can then delete it when it's done with it. You're transferring the burden of deletion, the ownership of the memory allocated, to the thread.


void process_abc(void *p){
    abc* _abc = (abc*)p;
    FILE* f = fopen("C:\\output.txt", "at");
    fprintf(f, "%i\n", _abc->a);
    fprintf(f, "%i\n", _abc->b);
    fprintf(f, "%i\n", _abc->c);
    fclose(f);
    free(_abc);   /* the thread owns the struct now, so it frees it */
    _endthread();
}

for(int i = 0; i < 32; i++){
    abc *_abc = malloc(sizeof(abc));
    _abc->a = 1;
    _abc->b = 2;
    _abc->c = 3;
    _beginthread(process_abc, 0, _abc);
}

If you're using C++ (doesn't look like it?), you'll probably want to catch the pointer in the thread into a std::auto_ptr or a shared pointer of some sort, instead of explicitly deleting it.

However, if you're using C++, the REAL solution is boost::thread :)

EDIT: Clarified C vs. C++

Ugh, C++ and threads....
Quote:
what's a best practice in this situation?

Best practice would be not to use threads, and just write directly.

If, however, you want an asynchronous file writer that minimizes congestion due to file IO through user-mode concurrency (rather than overlapped IO on Windows), then the following is one way.
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

struct abc {
    abc(int a, int b, int c) : a(a), b(b), c(c) {}
    int a, b, c;
};

std::ostream & operator<<(std::ostream & os, const abc & value) {
    return os << value.a << "\n" << value.b << "\n" << value.c << "\n";
}

template < class T >
class FileQueue
{
    typedef std::vector<T> Queue;
public:
    FileQueue(const std::string & filename)
        : filename(filename)
        , running(true)
    {
        t = boost::thread(boost::bind(&FileQueue::run, this));
    }

    ~FileQueue()
    {
        {
            boost::lock_guard<boost::mutex> lock(mutex);
            running = false;
        }
        cond.notify_all();
        t.join();
        // flush whatever is still pending
        write_to_file(pending);
    }

    void push(const T & value)
    {
        {
            boost::lock_guard<boost::mutex> lock(mutex);
            if (running) pending.push_back(value);
        }
        cond.notify_one();
    }

private:
    void run()
    {
        Queue copy;

        while (running)
        {
            if (wait_for_data(copy)) write_to_file(copy);
        }
    }

    void write_to_file(Queue & q) {
        // append, so successive flushes don't truncate earlier output
        std::ofstream f(filename.c_str(), std::ios::app);

        std::ostream_iterator<T> it(f, "");
        std::copy(q.begin(), q.end(), it);
    }

    bool wait_for_data(Queue & q)
    {
        boost::unique_lock<boost::mutex> lock(mutex);

        if (pending.empty()) cond.wait(lock);

        q.assign(pending.begin(), pending.end());
        pending.clear();

        return !q.empty();
    }

    std::string filename;

    Queue pending;
    bool running;

    boost::mutex mutex;
    boost::condition_variable cond;

    boost::thread t;
};

int main()
{
    FileQueue<abc> queue("c:\\output.txt");

    for (int i = 0; i < 32; i++) queue.push(abc(i, 2*i, 3*i));

    return 0;
}

And that doesn't even contain file IO error handling.

Note that most of the mess above has to do with keeping contention due to file IO out of the caller's thread. Without copying the data into 'copy', a timeout (if the file is on a network share or similar) would block all callers in push's lock.

I also try to keep allocations within the same thread. For example, 'copy' is created by the file IO thread and destroyed when that thread exits, while 'pending' is created by the main thread and destroyed by the same (in the destructor, after join). This may not matter in the general case, but 'copy' could use a thread-local allocator.

There's also the question of what happens if two file queues want to write to the same file.

And I have no desire to write something similar (or at least as generic) in pure C.

Quote:
Original post by Antheus
Ugh, C++ and threads....
Quote:
what's a best practice in this situation?

Best practice would be not to use threads, and just write directly.


well, this is to speed up an image processing routine that uses some heavy convolution kernels. without any kind of threading, this takes up to a minute on even a small 1-megapixel image, so that's not an option.

Quote:
Original post by fluke

well, this is to speed up an image processing routine that uses some heavy convolution kernels. without any kind of threading, this takes up to a minute on even a small 1-megapixel image, so that's not an option.


Did you google for existing libraries?

How big is the kernel? The problem is basically O(w*h*m*n). For large kernels it starts approaching O(n^4).

How do you lay out your data and loops? Do you always process rows first? Are pixels laid out in same order?

How do you access pixels? Hopefully it's not using some getPixel(), setPixel().

How do you check for borders? Does each pixel access have an if (x < 0 || x >= width) test? These are redundant - instead, expand the image by the size of the kernel to avoid them.


Finally, some math. RAM bandwidth is on the order of 1-5 gigabytes per second. A 1 megapixel image is perhaps 4 megabytes. So in one second you can check some 250-1000 kernel values against each pixel of the image (1 gigabyte / 4 megabytes = 250, i.e. that many memory accesses per pixel). This means a kernel size of 15 should take on the order of one second.

well, they aren't typical kernels. i just said that to keep the focus on my bigger problem. they are actually radial lines that extend from the source pixel. so it's more like O(w*h*n*l) where n is the number of radial lines (which can range from 10 to 64+) and l is the length of each line, which, at its respective angle, goes through the source pixel from one side of the image to the other. so as you can imagine, processing a set of those lines per pixel is pretty slow.

the order is an external radial lines loop, then width/height inside. this allows me to thread each radial line loop, and the results are stored in a floating point image buffer which is later averaged with the sum of all processes to achieve the final result.

i'm using a while loop which checks two x/y indexes for exceeding the width/height limits. it has to be this way as each radial line is stored as a rise/run value which can't be predicted, so preemptive measures are difficult.

Quote:
Original post by fluke
well, they aren't typical kernels. i just said that to keep the focus on my bigger problem. they are actually radial lines that extend from the source pixel. so it's more like O(w*h*n*l) where n is the number of radial lines (which can range from 10 to 64+) and l is the length of each line which at it's respective angle, goes through the source pixel from one side of the image to the other. so as you can imagine, this is pretty slow processing a set of those lines per pixel.


Since this has come from file IO across convolution kernels into something new altogether, is it about Hough transform?

And how do you iterate over the lines? Bresenham?

it's called differential hysteresis processing. it uses the radial lines to perform pixel comparisons.

the lines themselves are calculated as y=tan((line/numlines)*pi), x=1.0f for the first quarter, then x=1.0f/tan((line/numlines)*pi), y=1.0f for the second, then inverting both for the other half of the circle. incrementing x/y by these during convolution produces the desired radial lines.

[Edited by - fluke on November 2, 2009 12:43:32 PM]

I would organize traversal differently to maximize cache coherency.

for every angle:
    let dhp be an array of the same size as a row
    for every row r:
        for every row q above/below r:
            compute the x or y offset using Bresenham's algorithm
            offset q by x or y (depending on slope); if out of bounds, fill with a neutral value or adjust bounds
            advance the DHP cursor across all of q and update the results in dhp


This approach requires constant memory overhead to store one row, but is very cache friendly, and requires only one line step calculation per row.

It may make sense to process angles -45..45 in one step, then rotate the image by 90 degrees, and update the remaining values. This way, you can always optimally process the rows.

This reorganization alone should drastically reduce the computation times. And just perhaps there is a way to improve the cursor window updates as well, perhaps using SSE. This could potentially eliminate two ifs per pixel.


After that, you could obviously multi-thread the whole thing, but my hunch is that there are considerable constant factor gains to be had with above approach. Multiple threads alone, depending on CPU, could result in worse performance, due to even more diverse memory accesses.
