I would keep the individual data services scalable but let the user specify their distribution across the data nodes. As Jon already wrote, it could look like this (I added Spam and Scan pools):
[data-1]
type = data
pools = Intel::pool
[data-2]
type = data
pools = Intel::pool, Scan::pool
[data-3]
type = data
pools = Scan::pool, Spam::pool
[data-4]
type = data
pools = Spam::pool
However, this approach likely results in confusing config files and, as Jon wrote, makes it hard to define a default configuration. In the end, this is an optimization problem: how do we assign data services (pools) to data nodes to get the best performance in terms of speed, memory usage, and reliability?
I guess there are two possible approaches:
1) Let the user do the optimization, i.e. provide a way to assign data services to data nodes as described above.
2) Let the developer specify constraints for the data service distribution across data nodes and automate the optimization. A minimal example: for each data service, specify a minimum and maximum (or default) number of data nodes (e.g. Intel on 1-2 nodes and Scan detection on all available nodes). More complex specifications could require that a data service isn't scheduled on data nodes together with (particular) other services.
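To make approach 2 a bit more concrete, here is a minimal Python sketch of how such an automated assignment could work. The constraint format (min/max node counts per pool) and the greedy least-loaded placement are purely illustrative assumptions, not an existing mechanism:

```python
def assign_pools(nodes, constraints):
    """Greedily spread each data service (pool) across data nodes.

    nodes: list of data node names, e.g. ["data-1", ..., "data-4"]
    constraints: pool name -> (min_nodes, max_nodes), where max_nodes of
    None means "all available nodes".
    """
    assignment = {node: [] for node in nodes}
    for pool, (min_nodes, max_nodes) in constraints.items():
        wanted = len(nodes) if max_nodes is None else min(max_nodes, len(nodes))
        wanted = max(wanted, min_nodes)
        # Place the pool on the currently least-loaded nodes to balance load.
        targets = sorted(nodes, key=lambda n: len(assignment[n]))[:wanted]
        for node in targets:
            assignment[node].append(pool)
    return assignment

# Example constraints: Intel on at most 2 nodes, Scan on all nodes,
# Spam on at most 2 nodes.
nodes = ["data-1", "data-2", "data-3", "data-4"]
constraints = {
    "Intel::pool": (1, 2),
    "Scan::pool": (1, None),
    "Spam::pool": (1, 2),
}
print(assign_pools(nodes, constraints))
```

A real implementation would additionally have to handle the "not together with (particular) other services" exclusions, at which point this becomes a proper constraint-satisfaction problem rather than a greedy pass.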
Another thing that might need to be considered is deep clusters. If I remember correctly, there has been some work on that in the context of Broker. For a deep cluster there might even be hierarchies of data nodes (e.g. root-intel-nodes managing the whole database and 2nd-level data nodes serving as caches for worker nodes at the per-site level).
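Purely as illustration, such a hierarchy might be expressed in the config along these lines (the parent key and the node names are hypothetical, just to show the idea):

[intel-root]
type = data
pools = Intel::pool

[intel-cache-site1]
type = data
pools = Intel::pool
parent = intel-root

Worker nodes at site 1 would then query intel-cache-site1 first, which in turn falls back to intel-root for misses.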
Jan