Wednesday, June 3, 2009

If only we could set an agent's "server to run on" with a failover option

We all know (or I hope we do) that in LotusScript you can open a database with failover.
So if you are clustering and one node goes down, your code will still execute. A very important option, IMHO.
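
For anyone who has not used it, the idea behind opening with failover can be sketched roughly like this. It is a Python illustration of the concept only, not Domino code: `open_with_failover`, `is_reachable`, and the server and file names are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional, Tuple


def open_with_failover(
    preferred: str,
    cluster_mates: Iterable[str],
    db_path: str,
    is_reachable: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Return (server, db_path) for the first reachable server, or None.

    The preferred server is tried first, then the cluster mates in order.
    `is_reachable` is a hypothetical availability check supplied by the caller.
    """
    for server in [preferred, *cluster_mates]:
        if is_reachable(server):
            return (server, db_path)
    return None


if __name__ == "__main__":
    down = {"ServerA/Acme"}  # hypothetical: the primary node is down
    result = open_with_failover(
        "ServerA/Acme",
        ["ServerB/Acme"],
        "apps/orders.nsf",
        is_reachable=lambda s: s not in down,
    )
    print(result)
```

The calling code never needs to know which node answered, which is exactly what is missing for scheduled agents.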

Agents are a key part of applications running on servers. Same scenario: one node goes down and, oops, my agents do not run. Bummer. I know someone wrote LotusScript code to have agents work in this scenario, but what about formula agents? Or what if I have 500 agents on the server and want this functionality? EWWWWW.

I think we should be able to set an agent's "server to run on" with failover. This would be a simple and elegant solution and would aid in disaster recovery planning.

I am going to post this in the partner forum as a wish list item, but let me hear your thoughts.

9 comments:

  1. Put it on IdeaJam too.

    One workaround would be to have a single agent (it could even be only one agent in one database on a server) that polls a specific server for availability and, if that server goes down, flips the runonserver setting for all the affected agents on the failover server. The agent would run every 5 minutes, and might perhaps be part of a database that has already cataloged ALL of a server's agents in itself (there are several open source/nearly free tools that do this).
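
The polling workaround described in this comment could look roughly like the sketch below. It is plain Python over an in-memory catalog, not Domino code: `AgentRecord`, `reassign_agents`, `failover_for`, and `is_reachable` are hypothetical names, and flipping agents back once the home server returns is an assumption that the comment does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentRecord:
    """One cataloged agent (hypothetical structure, not a Domino class)."""
    database: str
    name: str
    home_server: str    # server the agent is normally set to run on
    run_on_server: str  # server it is currently set to run on


def reassign_agents(
    catalog: List[AgentRecord],
    failover_for: Dict[str, str],
    is_reachable: Callable[[str], bool],
) -> List[AgentRecord]:
    """One polling pass: move agents off unreachable home servers.

    Agents are flipped back once their home server answers again
    (an assumption). Returns the records changed in this pass.
    """
    changed = []
    for agent in catalog:
        if is_reachable(agent.home_server):
            target = agent.home_server
        else:
            # No failover mapping means the agent is left where it is.
            target = failover_for.get(agent.home_server, agent.run_on_server)
        if target != agent.run_on_server:
            agent.run_on_server = target
            changed.append(agent)
    return changed
```

A scheduler would call `reassign_agents` every 5 minutes; the list it returns is the set of agents whose setting actually changed in that pass.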

  2. True, but you open up a host of issues. When you change the server name, Agent Manager will cause all the agents to run and update data, which in turn could cause many replication conflicts.

    This would need to be addressed in any solution to the agent settings (history would need to carry over, etc.).

  3. Seems to me the biggest issue here is replication (including clustered). For example:

    - Server A is primary server for a group of agents
    - Agents run on A
    - Prior to complete replication to server B, server A goes down
    - Server B detects that A is down and allows agents to fail over
    - Agents on B process some of the same documents or actions

    IMO the critical agents that must run (i.e. fail over) are not likely good candidates for risking this type of data integrity problem. For example, time-critical actions or external data synchronizations could be disasters in this scenario.

    Unless such a facility could robustly provide services for ensuring data integrity when the crashed server comes back online, its risks may outweigh its benefits. If you need to fail over, that's the easy part to code; the hard part is making sure your data is valid afterward.

  4. @Matt Good point. I never assumed it would be easy, just incredibly helpful. You are right, there is lots to consider.

  5. It would be cool. Every time it comes up it starts a conversation. Craig made a great point though: what happens if it's the network that's down, not the servers? Do both A & B run?

  6. @Matt - agreed, coding the failover would be easy. However, I have a solution that could provide the other part, data validity, and John can attest to it, as he has seen it in action. It is a record locking application I wrote back in 4.6x that is sensitive to replication and clustering. Implemented, it would protect you from these issues. You could set said agents to run from any server and have them test whether the primary was available; if not, start on this one (the secondary). I know it would solve the issue for an outage. Do you remember this app, John?
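
The "test the primary, then claim documents before touching them" idea in this comment might be sketched as follows. This is not a reconstruction of the record locking application mentioned above; `should_run_here`, `process_with_locks`, and the dict-based lock store are hypothetical, and a real lock store would have to be visible to both servers despite replication lag.

```python
from typing import Callable, Dict, Iterable, List


def should_run_here(
    this_server: str,
    primary: str,
    is_reachable: Callable[[str], bool],
) -> bool:
    """The primary always runs; a secondary runs only if it cannot reach
    the primary. Note that during a network partition both servers would
    return True, which is why the claim step below still matters."""
    return this_server == primary or not is_reachable(primary)


def process_with_locks(
    this_server: str,
    doc_ids: Iterable[str],
    locks: Dict[str, str],
    work: Callable[[str], None],
) -> List[str]:
    """Process each document at most once by claiming it first.

    `locks` maps document id -> server that claimed it. Here it is a plain
    dict (an assumption made purely for illustration).
    """
    done = []
    for doc_id in doc_ids:
        if doc_id in locks:
            continue  # already claimed, by this server or the other one
        locks[doc_id] = this_server
        work(doc_id)
        done.append(doc_id)
    return done
```

With a shared `locks` store, a secondary that takes over after an outage skips every document the primary had already claimed and picks up only the rest.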

  7. @Mike Yeah, I remember. It was pretty cool. It could easily be adapted for this.

  8. Nice idea, though I agree with what's been said here before: it's gonna be a challenge. I'm curious to hear if there's been any progress since; it seems as though the topic of disaster recovery is at the center of attention these days.
