Scalable and Secure Architectures for Online Multiplayer Games
Thesis Proposal
Ashwin R. Bharambe
May 15, 2006
Online Games are Huge!
[Figure: number of subscribers (1 million to 8 million) by year, 1997 to 2005, for World of Warcraft, Final Fantasy XI, Everquest, and Ultima Online. Source: http://www.mmogchart.com/]
Some more facts
1. These MMORPGs have client-server architectures
2. They accommodate ~0.5 million players at a time
Why MMORPGs Scale
- Role-playing games have been slow-paced
  - Players interact with the server relatively infrequently
- Maintain multiple independent game-worlds
  - Each hosted on a different server
- Not true of other game genres
  - First-person shooters (FPS), e.g., Quake
  - Demand high interactivity
  - Need a single game-world
FPS Games Don't Scale
[Figure: bandwidth (kbps) of a Quake II server]
- Bandwidth and computation both become bottlenecks
Goal: Cooperative Server Architecture
- Focus on fast-paced FPS games
Distributing Games: Challenges
- Tight latency constraints
  - As players or missiles move, updates must be disseminated very quickly
  - < 150 ms for FPS games
- High write-sharing in the workload
- Cheating
  - Execution and state maintenance spread over untrustworthy nodes
Talk Outline
- Problem
- Background
  - Game Model
  - Related Work
- Colyseus Architecture
- Expected Contributions
Game Model
[Figure: screenshot of Serious Sam]
- Immutable state: interactive 3-D environment (maps, models, textures)
- Mutable state: player, monsters, ammo, game status
Game Execution in Client-Server Model

    void RunGameFrame()  // every 50-100 ms
    {
        // every object in the world
        // thinks once every game frame
        foreach (obj in mutable_objs) {
            if (obj->think)
                obj->think();
        }
        send_world_update_to_clients();
    }
Object Partitioning
[Figure labels: Player, Monster]
Distributed Game Execution
[Figure labels: Monster, Missile, Item]

    class CruzMissile {
        // every object in the world
        // thinks once every game frame
        void think() {
            update_pos();
            if (dist_to_ground() < EPSILON)
                explode();
        }

        void explode() {
            // get_nearby_objects() requires Object Discovery
            foreach (p in get_nearby_objects()) {
                if (p.type == "player")
                    p.health -= 50;  // remote write requires Replica Synchronization
            }
        }
    };
Related Work
- Distributed designs
  - Distributed Interactive Simulation (DIS)
    - e.g., HLA, DIVE, MASSIVE, etc.
    - Use region-based partitioning, IP multicast
  - Butterfly, Second Life, SimMUD [INFOCOM 04]
    - Use region-based partitioning, DHT multicast
- Cheat-proofing
  - Lock-step synchronization with commitment
Related Work: Techniques
- Region-based Partitioning
  - Divide the game-world into a fixed number of regions
  - Assign objects in one region to one server
  - Pro: simple to place and discover objects
  - Con: high migration rates, especially for FPS games
  - Con: regions exhibit very high skews in popularity, which can result in severe load imbalance
- Parallel Simulation
  - Peer-to-peer: each peer maintains full state
  - Writes to objects are sent to all peers
  - Pro: updates travel over direct point-to-point links, the fastest path
  - Con: needs lock-step + bucket synchronization
  - Con: no conflict resolution, so inconsistency never heals
- Area-of-Interest Management with Multicast
  - Players only need updates from nearby regions
  - 1 region == 1 multicast group; use one shared multicast tree per group
  - Con: bandwidth load imbalance due to skews in region popularity
  - Con: updates need multiple hops, bad for FPS games
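The two drawbacks of region-based partitioning can be made concrete with a small sketch. It assumes a square 2-D world cut into a fixed grid and a modulo mapping from region to server; `region_of`, `server_of`, `server_load`, and all parameters are hypothetical names chosen for illustration, not taken from any of the systems cited above.

```python
from collections import Counter

def region_of(x, y, world_size=1000.0, grid=4):
    """Map a position to a fixed grid cell (region id in [0, grid*grid))."""
    cell = world_size / grid
    col = min(int(x // cell), grid - 1)
    row = min(int(y // cell), grid - 1)
    return row * grid + col

def server_of(region, num_servers):
    """Static assignment: every object in a region lives on one server."""
    return region % num_servers

def server_load(positions, num_servers, world_size=1000.0, grid=4):
    """Count objects per server under region-based placement."""
    load = Counter()
    for x, y in positions:
        load[server_of(region_of(x, y, world_size, grid), num_servers)] += 1
    return load
```

With three of four players crowded into one popular region, `server_load([(10, 10), (20, 30), (40, 5), (900, 900)], 4)` puts three objects on one server and leaves two servers idle. A player stepping from x = 249 to x = 251 crosses a region boundary, so its object has to migrate to a different server.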
Talk Outline
- Problem
- Background
- Colyseus Architecture
  - Scalability [NSDI 2006]
    - Evaluation
  - Security
- Expected Contributions
Colyseus Components
[Figure: servers S1, S2, and S3. S1 holds primaries P1, P2 and replicas R3, R4; S2 holds primaries P3, P4. Components shown: Object Discovery (get_nearby_objects()), Replica Management, Object Placement.]
Object Placement
- Flexible and dynamic object placement
  - Permits use of clustering algorithms
  - Not tied to "regions"
- Previous systems use region-based placement
  - Frequent, disruptive migration for fast games
  - Regions in a game have very skewed popularity
[Figure: region popularity vs. region rank]
Replication Model
- Primary-backup replication
  - Single primary, read-only replicas
- Writes are serialized at the primary
  - Primary is responsible for executing think code
- Replica trails the primary by one hop
  - Weakly consistent
  - Low latency is critical
[Figure label: 1-hop]
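A minimal sketch of the single-primary model described on this slide, assuming in-process objects stand in for servers and a version counter stands in for the serialized write order. The class and method names are illustrative assumptions, not the Colyseus API.

```python
class Replica:
    """Read-only copy; applies updates in the order the primary sent them."""
    def __init__(self):
        self.state = {}
        self.version = 0

    def apply(self, version, state):
        # Ignore stale or duplicate updates so a late message cannot roll state back.
        if version > self.version:
            self.version = version
            self.state = dict(state)

class Primary:
    """Single writer: all writes are serialized here, then pushed one hop out."""
    def __init__(self, state):
        self.state = dict(state)
        self.version = 0
        self.replicas = []

    def write(self, key, value):
        self.state[key] = value
        self.version += 1

    def sync(self):
        # One message per replica, sent directly (no multi-hop forwarding).
        for r in self.replicas:
            r.apply(self.version, self.state)
```

Between `write` and `sync` the replica still shows the old value. That window is the weak consistency the slide accepts in exchange for a one-hop update path.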
Object Discovery
- Most objects only need other "nearby" objects for executing think functions
  - get_nearby_objects()
Distributed Object Discovery
- Publication: "My position is x = x1, y = y1, z = z1. Located on 128.2.255"
- Subscription: "Find all objects with obj.x ∈ [x1, x2], obj.y ∈ [y1, y2], obj.z ∈ [z1, z2]"
- Use a structured overlay to achieve this
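The matching rule between publications and subscriptions can be sketched as follows. This assumes all publications are visible in one list, whereas the slide routes them through a structured overlay; `Publication`, `Subscription`, `discover`, and the host names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Publication:
    obj_id: str
    pos: tuple          # (x, y, z)
    host: str           # server holding the object's primary copy

@dataclass(frozen=True)
class Subscription:
    lo: tuple           # (x1, y1, z1)
    hi: tuple           # (x2, y2, z2)

    def matches(self, pub):
        # A publication matches when every coordinate lies inside the range.
        return all(l <= p <= h for l, p, h in zip(self.lo, pub.pos, self.hi))

def discover(sub, pubs):
    """Return the host to contact for every publication inside the range."""
    return {p.obj_id: p.host for p in pubs if sub.matches(p)}
```

The result maps each discovered object to the server that hosts it, which is what a subscriber needs in order to request a replica.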
Mercury: Range-Queriable DHT [SIGCOMM 2004]
- Supports range queries vs. exact matches
  - No need for partitioning into "regions"
- Places data contiguously
  - Can utilize spatial locality in games
- Dynamically balances load
  - Control traffic does not cause hotspots
- Provides O(log n)-hop lookup
  - About 200 ms for 225 nodes in our setup
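Why contiguous placement helps range queries can be shown with a toy model. This is a sketch under strong assumptions: a single attribute, static node boundaries, and no routing or load balancing. It is not Mercury's implementation, and `RangeRing` is a made-up name.

```python
import bisect

class RangeRing:
    """Nodes own contiguous, sorted slices of one attribute's value space."""
    def __init__(self, boundaries):
        # boundaries[i] is the inclusive lower bound of node i's slice; must be sorted.
        self.boundaries = list(boundaries)
        self.store = [[] for _ in self.boundaries]

    def owner(self, value):
        # Rightmost node whose lower bound is <= value.
        return max(bisect.bisect_right(self.boundaries, value) - 1, 0)

    def publish(self, value, item):
        self.store[self.owner(value)].append((value, item))

    def range_query(self, lo, hi):
        """Visit only the contiguous run of nodes overlapping [lo, hi]."""
        first, last = self.owner(lo), self.owner(hi)
        visited = list(range(first, last + 1))
        hits = [item for n in visited
                for v, item in self.store[n] if lo <= v <= hi]
        return hits, visited
```

Because neighboring values live on neighboring nodes, a query for a small range touches only a few adjacent nodes. A hash-based DHT would scatter the same values across the whole ring.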
Object Discovery Optimizations
- Pre-fetch soon-to-be required objects
  - Use game physics for prediction
- Pro-active replication
  - Piggyback object creation on update messages
- Soft-state subscriptions and publications
  - Add object-specific TTLs to pubs and subs
Colyseus Design: Recap
[Figure: two servers, 128.2.9.100 and 128.2.9.200. Query to Mercury: "Find me nearby objects". Response: "Monster on 128.2.9.200". The replica is then updated over a direct point-to-point connection.]
Putting It All Together
Evaluation Goals
- Bandwidth scalability
  - Per-node bandwidth usage should scale well with the number of nodes
- View inconsistency due to object discovery latency should be small
- Discovery latency and pre-fetching overhead are covered in [NSDI 2006]
Experimental Setup
- Emulab-based evaluation
- Synthetic game
  - Workload based on Quake III traces
- P2P scenario
  - 1 player per server
  - Unlimited bandwidth
  - Modeled end-to-end latencies
- More results, including a Quake II evaluation, in [NSDI 2006]
Per-node Bandwidth Scaling
[Figure: mean outgoing bandwidth (kbps) vs. number of nodes]
View Inconsistency
[Figure: average fraction of mobile objects missing vs. number of nodes, for no delay, 100 ms delay, and 400 ms delay]
Planned Work
- Consistency models
  - Game operations demand differing levels of consistency and latency response
    - Causal ordering of events
    - Atomicity
- Deployment
  - Performance metrics depend crucially on the workload
  - A real game workload would be useful for future research
Talk Outline
- Problem
- Background
- Colyseus Architecture
  - Scalability
    - Evaluation
  - Security [Planned Work]
- Expected Contributions
Cheating in Online Games
- Why do cheats arise?
  - Distributed system (client-server or P2P)
  - Bugs in the game implementation
- Possible cheats in Colyseus
  - Object discovery: map-hack, subscription-hijack
  - Replication: god-mode, event-ordering, etc.
  - Object placement: god-mode
Object Discovery Cheats
- Map-hack cheat [information overexposure]
  - Subscribe to arbitrary areas in the game
  - Discover all objects, which may be against game rules
- Subscription-hijack cheat
  - Incorrectly route subscriptions of your enemy
  - Enemy cannot discover (see) players
    - Other players can see her and can shoot her
Replication Cheats
- God-mode cheat
  - Primary node has arbitrary control over writes to the object
- Timestamp cheat
  - Primary node decides the serialized write order
[Figure: Node A and Node B dispute the outcome: "You die!" / "No, I don't!"]
Replication Cheats
- Suppress-update cheat
  - Primary does not send updates to the replicas
- Inconsistency cheat
  - Primary sends incorrect or conflicting updates to the replicas
[Figure: Players A, B, C, and D. Captions: "Hide from this guy", "I moved to another room", "I am dead"]
Related Work
- NEO protocol [GauthierDickey 04]
  - Lock-step synchronization with commitment
    - Send encrypted update in round 1
    - Send decryption key in round 2, only after you receive updates from everybody
  - Pro: addresses the suppress-update cheat and the timestamp cheat
  - Con: lock-step synchronization increases "lag"
  - Con: does not address the god-mode cheat, among others
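The two-round idea can be illustrated with a commit-then-reveal sketch. Note the substitution: NEO sends an encrypted update and later its key, while this sketch uses a salted SHA-256 hash commitment instead, which gives the same "nobody can change a move after seeing the others" property in fewer lines. Function names are hypothetical and all players run in one process.

```python
import hashlib
import secrets

def commit(update: bytes):
    """Round 1: publish a digest that binds the sender to `update`."""
    nonce = secrets.token_bytes(16)   # hides low-entropy updates from guessing
    digest = hashlib.sha256(nonce + update).hexdigest()
    return digest, nonce

def verify(digest: str, nonce: bytes, update: bytes) -> bool:
    """Round 2: check the revealed update against the round-1 digest."""
    return hashlib.sha256(nonce + update).hexdigest() == digest

def run_round(updates):
    """updates: {player: bytes}. Returns the updates accepted this round."""
    commitments = {p: commit(u) for p, u in updates.items()}   # round 1
    # Round 2 starts only after every player's commitment has arrived,
    # so nobody can pick a move after seeing someone else's.
    return {p: u for p, u in updates.items()
            if verify(commitments[p][0], commitments[p][1], u)}
```

Waiting for every commitment before revealing is also the source of the added "lag" listed as a drawback: each game step costs two message rounds instead of one.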
Solution Approach
- Philosophy: detection rather than prevention
  - Preventing cheating ≈ Byzantine fault tolerance
  - Known protocols emphasize strict consistency and assume weak synchrony
    - Multiple rounds unsuitable for game-play
- High-level decisions
  1. Make players leave an audit trail
  2. Make peers police each other
  3. Always keep detection out of the critical path
Distributed Audit
[Figure labels: randomly chosen witness, log, centralized auditor, log]
Logging Using Witnesses
[Figure labels: Player Node, Think code, Optimistic Update, Player Log, Serialized Updates, Witness Log, Witness Node]
Using Witnesses: Good and Bad
- Pro: player and witness logs can be used for audits
  - Potentially address timestamp, god-mode and inconsistency cheats
- Pro: witness can generate pubs + subs
  - Addresses map-hack cheat
- Con: bandwidth overhead
- Con: does not handle the suppress-update cheat or the subscription-hijack cheat
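An offline audit over the two logs might look like the sketch below. It is only an illustration of the idea: the log format `(seq, field, value)`, the `audit` function, and the health cap used as a sample game rule are all assumptions. The proposal leaves the actual protocol and rule extraction as planned work.

```python
def audit(player_log, witness_log, max_health=100):
    """Compare two logs of (seq, field, value) updates and report suspicious entries."""
    findings = []
    witnessed = {seq: (field, value) for seq, field, value in witness_log}
    for seq, field, value in player_log:
        if seq not in witnessed:
            findings.append((seq, "missing from witness log"))
        elif witnessed[seq] != (field, value):
            findings.append((seq, "logs disagree"))
        elif field == "health" and value > max_health:
            # Example game rule; a real auditor would need rules extracted from game code.
            findings.append((seq, "rule violation"))
    return findings
```

The audit runs after the fact on stored logs, so it stays off the game's critical path, in line with the detection-rather-than-prevention approach.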
Using Witnesses: Alternate Design
- Move the primary directly to the witness node
  - Witness node has the primary copy of the player
  - Code execution and writes are applied directly at the witness
- Con: primary-to-replica updates go through the witness
- Con: witness gets arbitrary power
  - Player cannot complain to anybody
Challenges
- Balance power between player and witness
  - Use cryptographic techniques
- How do players detect somebody is cheating?
  - Extraction of rules from the game code
- Securing the object discovery layer
  - Leverage DHT security research
- Keep bandwidth overhead minimal
Expected Contributions
- Mercury range-queriable DHT
  - First structured overlay to support range queries and dynamic load balancing
  - Implementation used in other systems
- Design and evaluation of Colyseus
  - First distributed design to be successfully applied for scaling FPS games
  - Demonstrated that low-latency game-play is feasible
  - Flexible architecture for adapting to various types of games
- Real-world measurement of game workloads
  - Deployment of Quake III
- Anti-cheating protocols
  - Encourage real-world deployments
  - Lead towards lighter-weight fault-tolerance protocols
Summary of Thesis Statement
Design of scalable, secure architectures for games utilizing key properties:
- Game workload is predictable
- Players tolerate loose, eventual consistency
Differences from Related Work
- Avoid region-based object placement
  - Frequent migration when objects move
  - Load imbalance due to skewed region popularity
- 1-hop unicast update path between primaries and replicas
  - Previous systems used overlay multicast
- Replication model with eventual consistency
  - Avoid parallel execution
Timeline
- May 06 to Jul 06: Development of newer consistency and anti-cheat protocols
- May 06 to Jul 06: Integration of Colyseus with Quake III
- Jul 06 to Sep 06: Implementation of consistency and anti-cheat protocols
- Jul 06 to Dec 06: Deployment and evaluation
- Dec 06 to Mar 07: Thesis writing
Thanks
Object Discovery Latency
[Figure: mean object discovery latency (ms) vs. number of nodes]
Object Discovery Latency: Observations
1. Routing delay scales similarly for both types of DHTs: both exploit caching effectively. Median hop-count = 3.
2. DHT gains a small advantage because it does not have to "spread" subscriptions
Bandwidth Breakdown
[Figure: mean outgoing bandwidth (kbps) vs. number of nodes]
Bandwidth Breakdown: Observations
1. Object discovery forms a significant part of the total bandwidth consumed
2. A range-queriable DHT scales better than a normal DHT (with linearized maps)
Goals and Challenges
1. Relieve the computational bottleneck
   - Challenge: partition code execution effectively
2. Relieve the bandwidth bottleneck
   - Challenge: minimize bandwidth overhead due to object replication
3. Enable low-latency game-play
   - Challenge: replicas should be updated as quickly as possible
Key Design Elements
- Primary-backup replication model
  - Read-only replicas
- Flexible object placement
  - Allow objects to be placed on any node
- Scalable object lookup
  - Use structured overlays for discovering objects
View Consistency
- Object discovery should succeed as quickly as possible
  - Missing objects lead to an incorrectly rendered view
- Challenges
  - O(log n) hops for the structured overlay
    - Not fast enough for fast-paced games
  - Objects like missiles travel fast and are short-lived
Distributed Architectures: Motivation
- Server farms? $$$
  - Significant barrier to entry
- Motivating factors
  - Most game publishers are small
  - Games grow old very quickly
- What if you are ~1000 university students wanting to host and play a large game?
Colyseus Components
[Figure: server s1 holds primaries P1, P2 and replicas R3, R4 in its Object Store; server s2 holds primaries P3, P4. Components: Object Location, Replica Management, Object Placement, and Mercury.]
1. Specify predicted interests: (5 < x < 60 & 10 < y < 200), TTL 30 sec
2. Locate remote objects: P3 on s2, P4 on s2
3. Register replicas: R3 (to s2), R4 (to s2)
4. Synch replicas: R3, R4
5. Optimize placement: migrate P1 to server s2
Object Pre-fetching
- On-demand object discovery can cause stalls or render an incorrect view
- Use game physics for prediction
  - Predict which areas objects will move to
  - Subscribe to object publications in those areas
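One simple way to turn "use game physics for prediction" into a subscription range is dead reckoning. The sketch assumes constant velocity over a short horizon and a fixed area-of-interest radius; `predicted_interest` and its parameters are hypothetical, and a real game would use its own physics model.

```python
def predicted_interest(pos, vel, horizon, radius):
    """Bounding box covering where the object may be within `horizon` seconds.

    Assumes constant velocity (dead reckoning); `radius` is the area of
    interest kept around the object at every point along the way.
    """
    future = tuple(p + v * horizon for p, v in zip(pos, vel))
    lo = tuple(min(p, f) - radius for p, f in zip(pos, future))
    hi = tuple(max(p, f) + radius for p, f in zip(pos, future))
    return lo, hi
```

The returned `(lo, hi)` pair has the same shape as a range subscription, so it can be submitted ahead of time and the matching objects are already replicated when the player arrives.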
Pro-active Replication
- Normal object discovery and replica instantiation are too slow for short-lived objects
- Piggyback object-creation messages on updates of other objects
- Replicate a missile pro-actively wherever its creator is replicated
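Piggybacking can be sketched as an extra field on the creator's regular update message. The message layout and the `make_update` / `apply_update` helpers are assumptions made for illustration. The slide does not specify a wire format.

```python
def make_update(creator_id, state, new_objects=()):
    """Update for the creator, carrying creation records for objects it just spawned."""
    return {"id": creator_id, "state": state, "created": list(new_objects)}

def apply_update(replicas, msg):
    """Apply an update on a server that already hosts a replica of the creator."""
    replicas[msg["id"]] = msg["state"]
    for obj_id, obj_state in msg["created"]:
        # Instantiate immediately instead of waiting for a discovery round trip.
        replicas.setdefault(obj_id, obj_state)
    return replicas
```

A server that already replicates the player learns about the missile in the same message that reports the shot, so the missile appears without a separate lookup.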
Soft-state Storage
- Objects need to tailor publication rate to speed
  - Ammo or health-packs don't move much
- Add TTLs to subscriptions and publications
  - Stored pubs act like triggers to incoming subs
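The trigger behaviour of stored publications can be sketched with a store whose entries expire. This is a single-node toy with one-dimensional positions and caller-supplied timestamps; `SoftStateStore` is a hypothetical name and the real system spreads this state over the overlay.

```python
class SoftStateStore:
    """Rendezvous-node storage where entries expire unless refreshed."""
    def __init__(self):
        self.pubs = []   # (expiry, position, obj_id)
        self.subs = []   # (expiry, lo, hi, subscriber)

    def _expire(self, now):
        self.pubs = [p for p in self.pubs if p[0] > now]
        self.subs = [s for s in self.subs if s[0] > now]

    def publish(self, now, ttl, pos, obj_id):
        """Store a publication; return subscribers whose stored range contains it."""
        self._expire(now)
        self.pubs.append((now + ttl, pos, obj_id))
        return [who for _, lo, hi, who in self.subs if lo <= pos <= hi]

    def subscribe(self, now, ttl, lo, hi, who):
        """Store a subscription; stored publications in range match immediately."""
        self._expire(now)
        self.subs.append((now + ttl, lo, hi, who))
        return [obj for _, pos, obj in self.pubs if lo <= pos <= hi]
```

A slow object such as a health-pack can publish once with a long TTL and still be found by later subscriptions, while a fast missile uses a short TTL so stale positions disappear on their own.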
Per-node Bandwidth Scaling: Observations
1. Colyseus bandwidth costs scale well with the number of nodes
2. Feasible for P2P deployment (compare single-server or broadcast)
3. In aggregate, Colyseus bandwidth costs are 4-5 times higher, so there is overhead
View Inconsistency: Observations
[Figure legend: no delay, 100 ms delay, 400 ms delay]
1. View inconsistency is small and gets repaired quickly
2. Missing objects are on the periphery
Cheating in Games
- Examples of some cheats
  - Information overexposure (map-hack)
  - Get arbitrary health, weapons (god-mode)
  - Precise and automatic weapons (aimbot)
  - Event ordering
    - Did I shoot you first or did you move first?
  - Exploiting bugs inside the game (duping)
Distributed Design Components
[Figure labels: Object Discovery, Instantiate Replicas, Object, Replica]