Design and evaluation of a fault-tolerant mobile-agent system

Authors: Lyu, M. R., Xinyu Chen, and Tsz Yeung Wong
From: Intelligent Systems, IEEE [see also IEEE Intelligent Systems and Their Applications], vol. 19, issue 5, Sep.-Oct. 2004
Date: 2007/5/24
Teacher: Jong-Shin Chen
Student number: 9530618
Name: 施宏達

Outline
Introduction
System architecture and protocol design
Agent failure detection and recovery
Failures of witness agents and the recovery strategy
Simplifying the witnessing dependency
An example model
Experimental results
Conclusion

Introduction Mobile agents create a new paradigm for data exchange and resource sharing in rapidly growing and continually changing computer networks. Therefore, survivability and fault tolerance are vital issues for deploying mobile-agent systems.

System architecture and protocol design (1/5) The agent server should provide three types of stable storage: for logs, checkpoints, and messages. Each agent carries internal data, which is lost if the agent fails. If a failed agent restarts its computation from the starting point of its itinerary, it violates the exactly-once property, because work already done at earlier servers is repeated.

System architecture and protocol design (2/5) An actual agent is a common mobile agent that performs specific computations for its owner. Witness agents monitor the actual agent and detect whether it’s lost.

System architecture and protocol design (3/5) Steps of the actual agent at server S_i:
(1) Log entry log^i_arrive.
(2) Send message msg^i_arrive to server S_{i-1}.
(3) After computation, checkpoint the data.
(4) Log entry log^i_leave.
(5) Send message msg^i_leave to server S_{i-1}.
(6) Spawn a witness agent.
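The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Server` class, its field names, and the `visit` function are hypothetical and do not come from the paper; messages are modeled as tuples appended to the previous server's mailbox.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical agent server with stable storage for logs, checkpoints, and messages."""
    index: int
    log: list = field(default_factory=list)
    checkpoints: dict = field(default_factory=dict)
    mailbox: list = field(default_factory=list)
    witnesses: list = field(default_factory=list)


def visit(servers, i, agent_data, compute):
    """Run the six protocol steps for the actual agent at server S_i."""
    here = servers[i]
    prev = servers[i - 1] if i > 0 else None
    here.log.append(("arrive", i))            # (1) log entry log^i_arrive
    if prev is not None:
        prev.mailbox.append(("arrive", i))    # (2) msg^i_arrive to S_{i-1}
    agent_data = compute(agent_data)
    here.checkpoints[i] = agent_data          # (3) checkpoint after computation
    here.log.append(("leave", i))             # (4) log entry log^i_leave
    if prev is not None:
        prev.mailbox.append(("leave", i))     # (5) msg^i_leave to S_{i-1}
    here.witnesses.append(f"w_{i}")           # (6) spawn a witness agent
    return agent_data


servers = [Server(0), Server(1)]
data = visit(servers, 0, 1, lambda x: x + 1)
data = visit(servers, 1, data, lambda x: x * 10)
```

After the two visits, the mailbox of `servers[0]` holds the arrive and leave messages for server 1, which is what the witness agent left behind at S_0 would be waiting for.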

System architecture and protocol design (4/5) Witness agent w_{i-1} is more passive than the actual agent in this protocol. It expects two messages: msg^i_arrive and msg^i_leave. After receiving these two messages, w_{i-1} waits for the direct heartbeat message, msg^i_alive, which the witness agent at server S_i sends.
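The waiting behavior of w_{i-1} can be viewed as a small state machine. The sketch below is a simplified illustration, not the paper's implementation: the `Witness` class, its method names, and the use of a plain numeric clock are all assumptions. The example bounds (1, 100, and 20) are the arrival, leave, and heartbeat bound times listed on the experimental-results slide.

```python
class Witness:
    """Hypothetical witness agent w_{i-1} tracking which message it expects next."""
    ORDER = ("arrive", "leave", "alive")

    def __init__(self, bounds):
        # bounds maps each message kind to its timeout bound
        self.bounds = bounds
        self.stage = 0             # index into ORDER
        self.waiting_since = 0.0   # time at which the current wait started

    def expected(self):
        return self.ORDER[self.stage]

    def on_message(self, kind, now):
        """Accept the expected message and advance; ignore anything else."""
        if kind != self.expected():
            return False
        # Heartbeats repeat, so the witness stays in the 'alive' stage.
        if self.stage < len(self.ORDER) - 1:
            self.stage += 1
        self.waiting_since = now
        return True

    def timed_out(self, now):
        """True when the expected message has not arrived within its bound."""
        return now - self.waiting_since > self.bounds[self.expected()]


w = Witness({"arrive": 1, "leave": 100, "alive": 20})
w.on_message("arrive", 0.8)   # now waiting for msg^i_leave
w.on_message("leave", 60)     # now waiting for heartbeats
w.on_message("alive", 65)     # heartbeat resets the wait
```

A timeout in any stage is the trigger for the failure detection described on the following slides.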

System architecture and protocol design (5/5)

Agent failure detection and recovery (1/3) w_{i-1} fails to receive msg^i_arrive:
1. The message is lost due to an unreliable network.
2. The message arrives after the timeout period of w_{i-1}.
3. Actual agent α gets lost when it’s ready to leave S_{i-1} and is heading for S_i.
4. Actual agent α gets lost when it arrives at S_i without logging.
5. Actual agent α gets lost when it arrives at S_i with logging.

Agent failure detection and recovery (2/3)
(1) The witness agent spawns a probe, which travels to S_i.
(2) The probe carries the checkpointed data.
(3) The probe inspects the log in S_i.
(4) If log^i_arrive is found, the probe retransmits msg^i_arrive to S_{i-1}.
(5) If not, it recovers the agent from the checkpointed data.
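The probe's decision in steps (3) to (5) reduces to a single log lookup. The following minimal sketch assumes the log at S_i is a list of `(kind, index)` tuples; the function name `probe` and the returned action labels are hypothetical.

```python
def probe(server_log, i, checkpoint):
    """Decide what the probe does after inspecting the log at S_i.

    server_log: list of (kind, index) entries found at S_i (assumed layout).
    checkpoint: the checkpointed data the probe carries.
    Returns a pair (action, payload).
    """
    if ("arrive", i) in server_log:
        # log^i_arrive exists, so only the message was lost or late:
        # retransmit msg^i_arrive to S_{i-1}.
        return ("retransmit", ("arrive", i))
    # No log entry: the agent was lost, so recover it from the checkpoint.
    return ("recover", checkpoint)


found = probe([("arrive", 3)], 3, {"x": 1})
missing = probe([], 3, {"x": 1})
```

Here `found` is a retransmission request and `missing` is a recovery from the carried checkpoint, which keeps the computation from restarting at the beginning of the itinerary.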

Agent failure detection and recovery (3/3) w_{i-1} fails to receive msg^i_leave:
1. The message is lost due to an unreliable network.
2. The message arrives after the timeout period of w_{i-1}.
3. Actual agent α gets lost just after sending message msg^i_arrive.
4. Actual agent α gets lost just after logging entry log^i_leave.
5. Actual agent α gets lost after spawning witness agent w_i.

Failures of witness agents and the recovery strategy (1/2) Witnessing dependency: w_0→w_1→w_2→…→w_{i-1}→w_i→α
Failures of witness agents:
1. The network is congested or unreliable.
2. The system load of S_i is too high.
3. Witness agent w_i was not created or is lost.

Failures of witness agents and the recovery strategy (2/2)
(1) A failure strikes S_{i-1}, and the witnessing dependency is broken.
(2) A failure strikes S_i, and the actual agent is terminated.
(3) The witness agent at S_{i-2} recovers the witness agent at S_{i-1}.
(4) The witness agent at S_{i-1} recovers the actual agent at S_i.
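The recovery order in this scenario follows the witnessing dependency from the earliest surviving witness forward. The sketch below illustrates that ordering only; the function `recover_chain` and the boolean-list representation of the chain are assumptions made for the example.

```python
def recover_chain(alive):
    """Return the recovery steps for a witnessing dependency chain.

    alive[j] is True if the agent at S_j survived; the last entry stands
    for the actual agent. Each step is a pair (recoverer, recovered) of
    server indices, listed in the order the recoveries happen.
    """
    if not alive[0]:
        raise RuntimeError("no surviving predecessor to start recovery")
    alive = list(alive)  # work on a copy
    steps = []
    for j in range(1, len(alive)):
        if not alive[j]:
            # The predecessor at S_{j-1} is alive (or was just recovered).
            steps.append((j - 1, j))
            alive[j] = True
    return steps


# Scenario from the slide with i = 3: failures strike S_2 and S_3.
steps = recover_chain([True, True, False, False])
```

With i = 3, `steps` first has the witness at S_1 recover the witness at S_2, and then the recovered witness at S_2 recover the actual agent at S_3.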

Simplifying the witnessing dependency (1/2) The actual agent creates witness agents along its itinerary, and the witness agents exchange heartbeat messages. These procedures consume considerable resources. If we assume that no more than k servers can fail at the same time, we can simplify the mechanism by shortening the witnessing dependency.

Simplifying the witnessing dependency (2/2)
If i ≤ k: w_0→w_1→…→w_{i-1}→α
Else: w_{i-k}→w_{i-k+1}→…→w_{i-1}→α
Finally, when α successfully logs entry log^{i+1}_arrive, the system can terminate w_{i-k} by sending message msg^{i+1}_kill from S_{i+1} to S_{i-k}.
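The shortened dependency is easy to compute for a given position i and failure bound k. The two helper functions below are a hypothetical sketch of the rule stated on this slide; their names are not from the paper.

```python
def witness_chain(i, k):
    """Indices of the witness agents kept while the actual agent is at S_i."""
    if i <= k:
        return list(range(0, i))      # w_0 ... w_{i-1}
    return list(range(i - k, i))      # w_{i-k} ... w_{i-1}


def witness_to_kill(i, k):
    """Index of the witness terminated once the agent logs its arrival at S_{i+1}.

    Returns None while i < k, because w_{i-k} does not exist yet.
    """
    return i - k if i >= k else None


chain_before = witness_chain(5, 3)    # agent at S_5 with k = 3
killed = witness_to_kill(5, 3)        # sent as msg^6_kill from S_6 to S_2
chain_after = witness_chain(6, 3)     # agent at S_6
```

In this example the chain moves from w_2, w_3, w_4 to w_3, w_4, w_5, so at most k witnesses are alive behind the actual agent at any time.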

An example model

Experimental results
Network transmission rate: 100 for agents, 200 for messages
Server repair rate, t_s_r: 0.1
All message log rates: 100
Arrival, leave, and heartbeat message bound times: 1, 100, and 20, respectively
Heartbeat interval: 5

Experimental results (agent survivability, a): server failure rate is 0.001; job completion rate is 0.01

Experimental results (agent survivability, b): server failure rate is 0.005; job completion rate is 0.01

Experimental results (agent survivability, c): server failure rate is 0.005; job completion rate is 0.05

Experimental results (witness agents, a): server failure rate is 0.001; job completion rate is 0.01

Experimental results (witness agents, b): server failure rate is 0.005; job completion rate is 0.01

Experimental results (witness agents, c): server failure rate is 0.005; job completion rate is 0.05

Experimental results (probes, a): server failure rate is 0.001; job completion rate is 0.01

Experimental results (probes, b): server failure rate is 0.005; job completion rate is 0.01

Experimental results (probes, c): server failure rate is 0.005; job completion rate is 0.05

Conclusion This agent fault-tolerant recovery approach improves agent survivability in failure-prone mobile agent systems. Thus, it can help create a more reliable agent deployment environment.