Flare removal

For the 2.5 release, we were hoping to understand why the topic/seth/remove-flare branch fixes some issues that people have been seeing with the communication code. Even more to the point, we're aiming to understand why that branch fixes the problem while Robin's branch topic/robin/no-flares-2.4.1 doesn't.

The problem we've seen shows up on Linux (for some reason FreeBSD doesn't seem to be affected) as high memory use in the child of your manager process. People tend to notice it in two ways:
1. Memory exhaustion
2. Logs being written that are seconds to minutes old

This isn't exactly a request for anyone to do anything; it's more a call for anyone who would like to dig around in the core to figure out what is going on here so we can get a fix merged into master.


I had looked into it a while ago… I don’t think the differences in your branches have anything to do with flares…

$ git diff origin/topic/robin/no-flares-2.4.1 origin/topic/seth/remove-flare src/iosource/Manager.cc
diff --git a/src/iosource/Manager.cc b/src/iosource/Manager.cc
index 80fa5fe..5ad8cca 100644
--- a/src/iosource/Manager.cc
+++ b/src/iosource/Manager.cc
@@ -96,8 +96,8 @@ IOSource* Manager::FindSoonest(double* ts)
 	// return it.
 	int maxx = 0;

-	if ( soonest_src && (call_count % SELECT_FREQUENCY) != 0 )
-		goto finished;
+//	if ( soonest_src && (call_count % SELECT_FREQUENCY) != 0 )
+//		goto finished;

 	// Select on the join of all file descriptors.
 	fd_set fd_read, fd_write, fd_except;

$ git diff origin/topic/robin/no-flares-2.4.1 origin/topic/seth/remove-flare src/RemoteSerializer.cc

-	// FIXME: Fine-tune this (timeouts, flush, etc.)
-	struct timeval small_timeout;
-	small_timeout.tv_sec = 0;
-	small_timeout.tv_usec =
-		io->CanWrite() || io->CanRead() ? 1 : 10;

flare_fix.patch (839 Bytes)

That's an interesting idea and at least sounds as plausible as anything else I've heard.

Have you tried making those small changes you outlined by themselves on the NCSA dev cluster yet? I'd be curious to hear if that fixes the problem. Or do you even see this issue on that cluster?