We have a few deployments that use an APCON for traffic aggregation. We’ve noticed in these environments that Bro has trouble reassembling the traffic correctly, and there is a significant amount of capture loss (according to the capture-loss script). We’ve tried different hashing algorithms on the APCON to no effect.
Has anyone else seen anything similar to this or have any insight?
We bypassed the APCON in one of the environments and it helped a little with capture loss (about a 10% drop) and with errors in the weird.log. Unfortunately, this was over a weekend, so it’s tough to say how much of an impact it made. Another network we’re in fixed some load-balancing issues upstream, and that helped significantly with loss (though the weird.log entries remain about where they were). I think the APCON may have been a red herring in this instance, but I’d be curious to see how your network looks before and after implementation if you’d like to keep in touch.
The main things I’ve been looking at are capture loss and weird.log errors (specifically HTTP_version_mismatch, SYN_seq_jump, and TCP_seq_underflow_or_misorder), since these may indicate traffic being mangled. This presentation is pretty helpful in showing you what to look for: https://speakerdeck.com/vladg/bro-deployment-verification-and-troubleshooting
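For anyone else triaging, it helps to rank weird.log noise by event name so you can see which entries dominate. A minimal sketch, assuming a standard tab-separated weird.log with a #fields header line (the inline sample is synthetic; point it at your real log file):

```python
from collections import Counter
import io

def tally_weird(log_lines):
    """Count weird.log events by name from Zeek/Bro TSV log lines."""
    name_idx = None
    counts = Counter()
    for line in log_lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Header lists column names after the "#fields" token
            fields = line.split("\t")[1:]
            name_idx = fields.index("name")
            continue
        if not line or line.startswith("#") or name_idx is None:
            continue
        cols = line.split("\t")
        if len(cols) > name_idx:
            counts[cols[name_idx]] += 1
    return counts

# Tiny synthetic sample; in practice open your deployment's weird.log
sample = io.StringIO(
    "#fields\tts\tuid\tname\n"
    "1.0\tC1\tpossible_split_routing\n"
    "2.0\tC2\tdata_before_established\n"
    "3.0\tC3\tpossible_split_routing\n"
)
for name, n in tally_weird(sample).most_common():
    print(name, n)
```

Sorting by count makes it obvious whether you’re looking at a handful of mangled flows or a systemic reassembly problem.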
We’re also trying to determine if the Apcon is a red herring since (unfortunately) two changes were made at the same time. While we swapped our Anues for Apcons the network team was also upgrading to Nexus switches.
Our weird.log started filling with “data_before_established” and “possible_split_routing” events right after the changes.
Yep, in one of the environments, we’re getting a ton of “possible_split_routing” and “data_before_established” both with and without the APCON in the mix.
Thinking it has to do with how Bro is handling the load balancing to workers.
Just providing a quick update that we think we solved our issue. Our SPANs began receiving FabricPath packets after the network team upgraded to Nexus switches (see my separate thread about workers at 100% CPU).
Long story short, the Apcon has a “Protocol Stripping” configuration screen where you can enable stripping of the FabricPath encapsulation layer. After enabling that option, the “data_before_established” and “inappropriate_FIN” messages stopped.
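For reference, FabricPath frames carry Cisco’s 0x8903 EtherType in the outer header, so you can sanity-check whether stripping is actually working by looking at the outer EtherType of frames on your tap feed. A rough sketch of that check, operating on raw Ethernet frame bytes (e.g. pulled out of a pcap; the sample frame below is fabricated):

```python
import struct

FABRICPATH_ETHERTYPE = 0x8903  # Cisco FabricPath encapsulation

def outer_ethertype(frame: bytes) -> int:
    """Return the outer EtherType of a raw Ethernet frame.

    Bytes 0-11 are dst/src MACs; bytes 12-13 hold the EtherType
    (big-endian) for an untagged frame.
    """
    (etype,) = struct.unpack_from("!H", frame, 12)
    return etype

def is_fabricpath(frame: bytes) -> bool:
    return outer_ethertype(frame) == FABRICPATH_ETHERTYPE

# Fabricated minimal frame: 12 bytes of MACs, FabricPath EtherType, padding
frame = bytes(12) + struct.pack("!H", 0x8903) + bytes(48)
print(hex(outer_ethertype(frame)))
```

If frames on the Bro-facing port still show 0x8903 after you enable stripping, the Apcon config didn’t take effect.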
That’s awesome that you got it worked out! I’ll take a look and see what the configs are for the APCONs we’re dealing with and maybe get to the bottom of our issue.
Still doing some preliminary analysis, but we had one of our clients strip the VLAN protocol on our egress ports from the APCON, and it allowed Bro/PF_RING to sessionize the traffic properly (we had previously tried different -tuple hash settings in our setup). We have a test we run that checks each level of our framework to make sure we have the proper visibility. It had always been a scattershot of what was logged by Bro, but after we stripped the VLAN protocol going to our box, it cleaned up and looked good.
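For what it’s worth, the -tuple hashing we were experimenting with is the PF_RING kernel-cluster hash, which BroControl exposes as an option in broctl.cfg. Something like the following, as I recall it from the BroControl docs (verify the option name and accepted values against your version):

```
# broctl.cfg (excerpt): how PF_RING distributes flows across workers.
# 4-tuple hashes on src/dst IP + src/dst port, so both directions of a
# connection land on the same worker. Other values I've seen documented:
# 2-tuple, 5-tuple, tcp-5-tuple, 6-tuple, round-robin.
PFRINGClusterType = 4-tuple
```

Note that round-robin in particular will split a connection’s packets across workers, which produces exactly the kind of reassembly weirds discussed in this thread.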
I’m going to have them strip the FabricPath protocol and see how that affects the traffic as well (can only strip one protocol at a time).
Odd thing is, the weird.log entries were still roughly the same with the VLANs stripped, so it was something with how Bro or PF_Ring was handling the incoming packets from the APCON.
Thanks for the update, Josh! Another hard lesson learned today was that the Apcon requires a power cycle if any physical changes are made; for us it was removing/replacing an SFP during testing. We were seeing an extremely high number of “ethertype unknown” errors with arbitrary values (in a 1000-packet sample, tcpdump reported 340 unique unknown ethertype values); the issue cleared after power-cycling the Apcon.
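In case it helps anyone quantify the same symptom: here’s the sort of quick tally you can run over `tcpdump -e` output to count distinct unknown EtherTypes. The regex matches the “ethertype Unknown (0x....)” text tcpdump prints for types it can’t decode; the sample lines below are made up for illustration:

```python
import re

# tcpdump -e prints "ethertype Unknown (0x1234)" for undecodable types
UNKNOWN_RE = re.compile(r"ethertype Unknown \((0x[0-9a-fA-F]+)\)")

def unique_unknown_ethertypes(lines):
    """Return the set of distinct unknown EtherType values seen."""
    found = set()
    for line in lines:
        m = UNKNOWN_RE.search(line)
        if m:
            found.add(m.group(1).lower())
    return found

# Fabricated sample lines in tcpdump -e style:
sample = [
    "00:11:22:33:44:55 > 66:77:88:99:aa:bb, ethertype Unknown (0x88b5), length 60:",
    "00:11:22:33:44:55 > 66:77:88:99:aa:bb, ethertype Unknown (0x1AB7), length 60:",
    "00:11:22:33:44:55 > 66:77:88:99:aa:bb, ethertype Unknown (0x88b5), length 60:",
]
print(len(unique_unknown_ethertypes(sample)))
```

A handful of distinct values might just be exotic protocols on the wire; hundreds of essentially random values, like we saw, points at corrupted frames from the aggregator itself.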
That’s interesting; I’ll be sure to add that to my debugging cheatsheet!