After I did more work on the metrics framework today I started seeing new segfaults but they appear to derive from memory corruption. I'm seeing them all over the place now when I load the frameworks/metrics/ssl-example.bro
Would someone mind taking a look through base/frameworks/metrics/main.bro and base/frameworks/metrics/non-cluster.bro in addition to playing with the ssl-example.bro script to see if I'm doing something crazy that could lead to memory corruption?
After I did more work on the metrics framework today I started seeing new segfaults but they appear to derive from memory corruption. I'm seeing them all over the place now when I load the frameworks/metrics/ssl-example.bro
I was playing some last night with that and think I was seeing the same thing as you running with ssl-example.bro on the 2009-M57-day11-21.trace
Trying to track it down, there was one line in base/protocols/ssl/main.bro that is either directly causing the problem or possibly related. In the ssl_extension handler:
c$ssl$server_name = sub_bytes(val, 6, |val|);
Changing that to assign a static dummy string instead made the crashes stop. What the sub_bytes BIF was doing looked ok -- seems to be confirmed by removing that and directly assigning 'val' to $server_name still causes crashes.
Don't think anything looked wrong about how that 'val' StringVal was being created in the SSL analyzer either, so it seems like it may be clobbered somewhere in between the point that it's created and when the metrics framework wants to use it. I'll see if I can find out where that is today, but if you see the same thing as above, do you have any ideas?
I was seeing crashes happen all over the place. It seems like memory is getting corrupted by the call to the Metrics::reset function. If you comment out that one function (in base/frameworks/metrics/non-cluster.bro) the crashes should also stop but the metrics framework also doesn't work correctly anymore! We do probably need to find an open tracefile soon that exhibits these corruption crashes.
Have you guys tried valgrind? That can often pinpoint memory
corruoption right at the source. It can be slow, but prefiltering the
trace for SSL might help (if that's what's triggering it).
I don't really think that SSL is causing it, I was seeing crashes in the mime analyzer, ssl analyzer, http analyzer, and when destructing connection records.
Have you guys tried valgrind? That can often pinpoint memory
corruoption right at the source. It can be slow, but prefiltering the
trace for SSL might help (if that's what's triggering it).
Yeah, valgrind is pointing out invalid accesses in CompHash.cc, reproducible for me by running the metrics/ssl-example.bro on the 2009-M57-day11-21.trace from bro-testing filtered for "tcp port 443".
Looked like it tried to stuff a string value someplace inside the CompositeHash::key buffer when it hasn't been sized to hold that much:
==27445== Invalid write of size 4
==27445== at 0x4027C3D: memcpy (mc_replace_strmem.c:635)
==27445== by 0x82229C5: CompositeHash::SingleValHash(int, char*, BroType*, Val*, bool) const (CompHash.cc:211)
...
I didn't think it could be sized right in the CompositeHash ctor since it's only got Types and not Vals to work with at that point, so I changed the computation to allocate and use a buffer on the fly if the pre-allocated one from the ctor isn't large enough to hold the key computed from the index vals.
$ git diff
diff --git a/src/CompHash.cc b/src/CompHash.cc
index 605949b..c3c65e3 100644
--- a/src/CompHash.cc
+++ b/src/CompHash.cc
@@ -241,16 +241,13 @@ HashKey* CompositeHash::ComputeHash(const Val* v, int type
}
char* k = key;
+ int sz = ComputeKeySize(v, type_check);
+ if ( sz == 0 )
+ return 0;
+ type_check = 0; // no need to type-check again.
- if ( ! k )
- {
- int sz = ComputeKeySize(v, type_check);
- if ( sz == 0 )
- return 0;
Arg! I'm creating indexes too large again! (not the first time I've done this). Robin, Vern, any guidance? I generally really like how the metrics framework is implemented now and I'd like to build on top of the current model but I obviously don't want to cause overly extensive performance problems.
Arg! I'm creating indexes too large again! (not the first time I've done this). Robin, Vern, any guidance? I generally really like how the metrics framework is implemented now and I'd like to build on top of the current model but I obviously don't want to cause overly extensive performance problems.
From my reading of the CompHash code, there's a general problem with it assuming it can use a fixed hash key size in situations where it can't and you just innocently found one of those situations. Performance consideration of the metrics indexing scheme is a second thing (depends somewhat on how the first thing is fixed).
I looked at this a bit more now. You're right that it can't know the
size, but in general the code already accounts for that: it
preallocates the buffer only if the size is static independent of the
value, and otherwise allocates on the fly already.
The problem is however that this check-if-static test is not done
right: for strings that are part of records and optional, it may say
it's static when it in fact it's not.
Your patch fixes that by double-checking that the supposedly static
size is indeed correct, but I think I have found a better fix: the
patch below gets rid of the crash for me. It also passes the
test-suite so I'm going to commit. Please put some stress on the
metrics framework everybody to see if we're fine now.
The patch may look a bit odd but the problem is that the semantics of
"v" are overloaded for SingleTypeKeySize(): v==0 can either mean we
are computing the size of a potential static buffer, or we have an
unset optional record field.
Oh, believe me, I will. Hopefully tonight I can finish intermediate metric updates for the metrics framework so that detection time can decrease for the metrics framework on clusters. You won't have to wait for the break interval to hit in order to detect threshold's being crossed globally.
Well, turns out it didn't work. While everything passed on the Linux
box I had tried it on, it crashes on my laptop. I'm going to push a
new version of the fix that now passes the Mac (but can't try Linux
right now). I hope this does the trick, but if it this still makes
problem, feel free to revert.