Yann L

Tracking invalid memory accesses


These memory errors can be very, very annoying - I know the feeling.
I also suggest Paul Nettle's memory manager - it helped me out on several occasions.

You could try to slowly 'take apart' your app - removing calls to subsystems
of code (especially code that was added recently) to see if you can 'magically'
get the problem to disappear.

E.g., don't render models anymore, or disable the sound system, and so on.
It's not really efficient, but if you really have no clue, it is
at least something to try, and it might give you a hint as to which code
is causing the memory overwrite.

Regards

With something that's as much of a corner case as that, I'd suggest a small custom-written tool, a poor man's data breakpoint debugger which simply keeps on checking the memory, very very often. Have it record the stack each time and keep around the stack trace from the last check as well. If it runs at high enough resolution, you should be able to narrow down the offending code pretty precisely between the two stack traces. Of course, this will make your program horribly slow... but it'll probably still be usable enough to debug.
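
A minimal sketch of that kind of checker, assuming the watched range is readable from the process. `MemoryWatch` and the checkpoint labels are hypothetical names, not from the thread, and a text label stands in for the recorded stack trace:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: watches one memory range by checksumming it at
// named checkpoints scattered through the code.
class MemoryWatch {
public:
    MemoryWatch(const void* base, std::size_t size)
        : base_(static_cast<const std::uint8_t*>(base)), size_(size),
          expected_(0), last_good_("<watch created>") {
        assert(base != nullptr);
        expected_ = checksum();  // snapshot of the known-good contents
    }

    // Call as often as you can afford. Returns true while the range still
    // matches the snapshot. On a mismatch, last_good() and first_bad()
    // name the two checkpoints that bracket the corruption.
    bool check(const std::string& where) {
        if (checksum() == expected_) {
            last_good_ = where;
            return true;
        }
        if (first_bad_.empty()) first_bad_ = where;
        return false;
    }

    const std::string& last_good() const { return last_good_; }
    const std::string& first_bad() const { return first_bad_; }

private:
    // FNV-1a over the watched bytes; any cheap hash would do here.
    std::uint32_t checksum() const {
        std::uint32_t h = 2166136261u;
        for (std::size_t i = 0; i < size_; ++i) {
            h ^= base_[i];
            h *= 16777619u;
        }
        return h;
    }

    const std::uint8_t* base_;
    std::size_t size_;
    std::uint32_t expected_;
    std::string last_good_;
    std::string first_bad_;
};
```

The denser the `check()` calls, the narrower the window between the last good checkpoint and the first bad one. A real tool would capture an actual stack trace at each check instead of a hand-written label.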

When you say the VRAM is mapped, do you mean the video card's hardware mapping of the VRAM to the PC's physical address space, or the mapping of that physical memory space into your user (game) process space?

I'm inclined to believe the bug is in a driver (perhaps brought to light by abuse from your game) if it's corrupting VRAM.

WinDbg is the NT kernel debugger. The Intel CPUs can break on memory reads and writes; if you know the location it's writing to, you ought to be able to set a break-on-write breakpoint on it. It's only accurate to an aligned 4 bytes IIRC.

You need a second computer and a null-modem cable to use WinDbg, and you have to boot the kernel with some extra parameters to enable debugging... You don't have to use the checked build though. That's another thought: you could try running on a checked build of XP.


Does this happen on more than one PC?

There was a lot of text there, so I might have missed this if someone already said it.

Anyway, if you have a copy of SoftICE, you can load up your application and set some small number of memory access breakpoints (like 4 or 8 or something?). Whenever that memory is accessed, it will trigger the break and SoftICE will tell you your EIP. It takes a bit of know-how to use SoftICE (and it's not free), but it should find your bug.

Quote:
Original post by Shannon Barber
[...]You need a second computer and a null-modem cable to use WinDbg[...]
Not Quite
Mark Russinovich is awesome =-)

Yann L: If nothing else, perhaps you could try contacting the aforementioned guru to see if he has any ideas..? It seems likely that even if there isn't an existing tool to do what you want, Dr. Russinovich could either create such a tool, point you to somebody else that would be willing to do so (though I would expect any such custom software to cost quite a bit), or point you to the relevant chapter(s) in his book so that you could construct such a tool yourself.

One thing that comes to mind: are you *sure* the data is getting corrupted while it's in VRAM and not before it's transferred there? This could explain the apparent consistency in the relative memory location that's getting trashed: system memory gets trashed and is then transferred into the same resource in VRAM.

Wow, this sounds hairy ;)
OK, so you know the offset within the VBO. Why not determine the address of the video memory mappings via the PCI config registers, then scan for known VBO data to determine its exact address? You could then place hardware breakpoints.
Unfortunately, I'm not sure if WinXP will still let you at the PCI BIOS; when I was doing this, it was back in Win95 days ;) It may be easier to use some configuration tool or the Windows hardware information to hard-code the mappings for your test machine.

Failing that, Sneftel's poor man's data checker sounds good :) We have each type of resource validate itself on every access in paranoia builds, which has saved my sanity at least once.
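
A rough sketch of what such self-validation could look like, assuming the resource lives in system memory. `GuardedBuffer` and the `PARANOIA_BUILD` macro are hypothetical names, not the actual code referred to above, and guard bytes are just one possible check:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace {
constexpr std::size_t kGuardSize = 16;     // guard bytes on each side
constexpr std::uint8_t kGuardByte = 0xFD;  // arbitrary fill pattern
}

// Hypothetical resource wrapper: the payload sits between two guard
// regions, and every access re-checks them in paranoia builds.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t size)
        : bytes_(size + 2 * kGuardSize, kGuardByte), size_(size) {
        // Zero the payload; the guards keep their pattern.
        std::fill_n(bytes_.data() + kGuardSize, size_, std::uint8_t{0});
    }

    // True while both guard regions still hold the pattern.
    bool intact() const {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            if (bytes_[i] != kGuardByte) return false;
            if (bytes_[kGuardSize + size_ + i] != kGuardByte) return false;
        }
        return true;
    }

    // Validates on every access when PARANOIA_BUILD is defined.
    std::uint8_t* data() {
#ifdef PARANOIA_BUILD
        assert(intact() && "guard bytes overwritten");
#endif
        return bytes_.data() + kGuardSize;
    }

    std::size_t size() const { return size_; }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t size_;
};
```

This only catches overruns and underruns that touch the guards; a stray write into the middle of the payload would need a checksum instead.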

You say the bug starts in a VBO. You could replace the vertex information with a bitmap that identifies each vertex buffer you

This way you know at least which resources are affected and can limit the problem to some point in your code.

Yann L, if I understand you right, your wrong memory access is still at a valid memory address. If you were outside the valid blocks, the system would throw an exception. Because you don't get an exception and don't know the address, there is no automatic way to find the place in your program where this wrong access happened.

I am not very deep into OpenGL driver details (I am primarily a DirectX guy), but it is quite plausible that the driver tries to keep all the data at the same place in memory all the time. The first step to finding the bug is to know which of your video resources it destroys. If you know this, you can get the memory address for this resource from the point in your program where you fill it with data (I hope that it is a static and not a dynamic resource). Then you can set a breakpoint inside VS that monitors this memory block for changes. This worked very well for me in the past with a similar problem.

Quote:
Original post by zedzeek
have u tried paul nettles memory checker?

As I said, we already have our own memory manager, which does pretty much the same thing as Paul's. Unfortunately, what we need is runtime range checking on every memory access in the code, and this can't be done with a simple memory manager.
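
To illustrate the difference: range checking has to sit at the point of access, not at allocation. A minimal sketch, with `CheckedSpan` as a hypothetical name, of what a checked access looks like when done by hand:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical wrapper: every indexed access is validated against the
// known element count before the memory is touched.
template <typename T>
class CheckedSpan {
public:
    CheckedSpan(T* data, std::size_t count) : data_(data), count_(count) {
        assert(data != nullptr || count == 0);
    }

    // Throws instead of silently reading or writing out of range.
    T& operator[](std::size_t i) const {
        if (i >= count_) {
            throw std::out_of_range("CheckedSpan index out of range");
        }
        return data_[i];
    }

    std::size_t size() const { return count_; }

private:
    T* data_;
    std::size_t count_;
};
```

This only covers accesses that go through the wrapper. Raw pointer arithmetic elsewhere in the code stays unchecked, which is why a tool that instruments every access is needed for this kind of bug.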

Quote:
Original post by Kitt3n
You could try to slowly 'take apart' your app - removing calls to subsystems
of code (especially code that was added recently) to see if you can 'magically'
get the problem to disappear.

That's not so easy, unfortunately. Those bugs that rely on dereferencing a memory location with an arbitrary undefined value (and that's probably what it is, considering its random behaviour) tend to be highly volatile. You remove one part of the code, and you can be pretty sure that the bug goes away - just to reappear somewhere else later. The 'taking apart' approach is not impossible, but very, very time intensive, and extremely frustrating ;)

Quote:
Original post by Sneftel
With something that's as much of a corner case as that, I'd suggest a small custom-written tool, a poor man's data breakpoint debugger which simply keeps on checking the memory, very very often.

That's a good idea, but I don't know where the offending memory is. I did some research, and it seems that the physical VRAM isn't actually mapped in one piece to the process address space. The driver continuously maps single pages of the physical address space in and out, according to its current needs. So even if I somehow manage to identify the location of the corrupted resource, it might be mapped out again a few milliseconds later. GAHH...

Quote:
Original post by Shannon Barber
When you say the VRAM is mapped, do you mean the video card's hardware mapping of the VRAM to the PC's physical address space, or the mapping of that physical memory space into your user (game) process space?

The latter.

Quote:

I'm inclined to believe the bug is in a driver (perhaps brought to light by abuse from your game) if it's corrupting VRAM.

That was also my thought. But I'm not so sure anymore. Our customer first reported this bug while beta testing our application on a Mobility Radeon 9-something. We all know the quality of ATI's OpenGL drivers, so we immediately yelled: "driver bug, update your drivers". They did, and it went away.

A couple of months later, they got the very same error on a GF6800. And that's where we started to get seriously worried. We have since found one single machine in all of our company (with an old GF3 !) that was prone to the bug.

Quote:
Original post by Puzzler183
Anyway, if you have a copy of SoftICE, you can load up your application and set some small number of memory access breakpoints (like 4 or 8 or something?). Whenever that memory is accessed, it will trigger the break and SoftICE will tell you your EIP. It takes a bit of know-how to use SoftICE (and it's not free), but it should find your bug.

As I said above, I don't know where that memory exactly is, since it is dynamically mapped.

Quote:
Original post by Extrarius
Not Quite
Mark Russinovich is awesome =-)

Interesting. I'll have a look at that.

Quote:
Original post by joanusdmentia
One thing that comes to mind: are you *sure* the data is getting corrupted while it's in VRAM and not before it's transferred there? This could explain the apparent consistency in the relative memory location that's getting trashed: system memory gets trashed and is then transferred into the same resource in VRAM.

That was also my first idea. Unfortunately, the copy of the data in system RAM is completely clean. In fact, the bug happened on two static resources: a VBO that is only allocated and filled once at program startup, and a texture that is only loaded once at scene load time. Both resources render fine, as long as the "critical operation chain" isn't performed. This operation chain doesn't involve any uploading or modification of any video resource at all. Once it has been performed, there is a chance (it doesn't happen every time) that the static VRAM resource gets corrupted. A simple re-upload of the resource from system RAM fixes it again, until the next corruption.

Quote:
Original post by Jan Wassenberg
Wow, this sounds hairy ;)
OK, so you know the offset within the VBO. Why not determine the address of the video memory mappings via the PCI config registers, then scan for known VBO data to determine its exact address? You could then place hardware breakpoints.

Sounds good. But will that work on mapped AGP memory? And how do I do that on Windows XP? I'm not much of a kernel hacking guru ;)

Quote:
Original post by Basiror
You say the bug starts in a VBO. You could replace the vertex information with a bitmap that identifies each vertex buffer you
This way you know at least which resources are affected and can limit the problem to some point in your code.

I know exactly the resource that gets corrupted, including the precise offset. The problem is, that it is not getting corrupted in system RAM, but in video memory.

Quote:
Original post by Demirug
Yann L, if I understand you right, your wrong memory access is still at a valid memory address. If you were outside the valid blocks, the system would throw an exception. Because you don't get an exception and don't know the address, there is no automatic way to find the place in your program where this wrong access happened.

That's right. My thread title is misleading, sorry. I actually tried to modify and shuffle around code, initialize the heap with zeros, etc., in an attempt to get the bug to access memory not mapped to my process, maybe even to dereference a NULL pointer. If it finally raised an exception, then I would have it. But so far, no success. The corruption is always in video memory. Really weird.

OK, current battleplan:

* Supposedly, VC 2005 is very picky about things like uninitialized variables & co. So I'll download VC2005 Express and see if it finds something to nitpick on.

* I'll try to run Purify on the code.

* If these don't find anything, I'll try to reproduce the bug with Mesa instead of hardware OGL. If the bug reappears, I can place a breakpoint within the Mesa source and also watch the resource memory, as everything is in system RAM.

* I'll try to continuously read back the corrupted resource using glGet* and compare it to the original copy. However, I'm afraid that the driver will just give me back its own memory copy, instead of the corrupted VRAM copy.

* WinDbg & friends. Oh dear, we're supposed to ship in 3 weeks... (runs away screaming into the next wall)...

* Or we'll just ship the whole thing, and blame it onto the drivers ;)
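
For the readback item in the list above, the comparison step could be sketched as follows. `find_corruption` and `DiffRange` are hypothetical names. The `readback` bytes are assumed to have been fetched beforehand, for example with `glGetBufferSubData` for the VBO or `glGetTexImage` for the texture, and, as noted in that item, the driver may hand back its own system memory copy rather than the actual VRAM contents:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// First and last byte offsets that differ; 'any' is false if none do.
struct DiffRange {
    std::size_t first;
    std::size_t last;
    bool any;
};

// Compares the clean system RAM copy against the bytes read back from
// the GL resource and reports the extent of the damage.
DiffRange find_corruption(const std::vector<std::uint8_t>& reference,
                          const std::vector<std::uint8_t>& readback) {
    assert(reference.size() == readback.size());
    DiffRange r{0, 0, false};
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (reference[i] != readback[i]) {
            if (!r.any) {
                r.first = i;
                r.any = true;
            }
            r.last = i;
        }
    }
    return r;
}
```

Running this after each step of the critical operation chain would show which step first produces a difference, provided the readback really reflects what is in video memory.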

If everything fails, I can still move to Lapland and become a reindeer farmer, trying to hide from the hitman our customer is going to send after us when he loses a $100 million contract because of our bug... [grin]

Seriously though, thanks for your help guys, much appreciated. I'll keep you updated.
