Scaling Up PVSS Showstopper Tests
P. C. Burkimsher, IT-CO-BE
July 2004
Aim of the Scaling Up Project (WYSIWYAF)
- Investigate functionality and performance of large PVSS systems
- Reassure ourselves that PVSS scales to support large systems
- Provide detail rather than bland reassurances
What has been achieved?
- Over 18 months, PVSS has gone through many pre-release versions:
  – "2.13"
  – 3.0 Alpha
  – 3.0 Pre-Beta
  – 3.0 RC 1.5
- Lots of feedback to ETM
- ETM have incorporated design fixes and bug fixes
Progress of the project
- Has closely followed the different versions, some going over the same ground, repeating tests as bugs were fixed
- Good news: the V3.0 Official Release is now here (even 3.0.1)
- Aim of this talk:
  – Summarise where we've got to today
  – Show that the list of potential "showstoppers" has been addressed
What were the potential showstoppers?
- Basic functionality
  – Synchronised types in V2!
- Sheer number of systems
  – Can the implementation cope?
- Sheer number of displays
- Alert Avalanches
  – How does PVSS degrade?
- Is the load of many Alerts reasonable?
- Is the load of many Trends reasonable?
What were the potential showstoppers?
- Basic functionality
  – Synchronised types in V2! } Skip
- Sheer number of systems
  – Can the implementation cope?
- Alert Avalanches
  – How does PVSS degrade?
- Is the load of many Alerts reasonable?
- Is the load of many Trends reasonable?
Sheer number of systems
- 130 systems simulated on 5 machines
- 40,000 DPEs per system
- ~5 million DPEs in total
- Interconnected successfully
What were the potential showstoppers?
- Basic functionality
  – Synchronised types in V2! } Skip
- Sheer number of systems
  – Can the implementation cope?
- Alert Avalanches
  – How does PVSS degrade?
- Is the load of many Alerts reasonable?
- Is the load of many Trends reasonable?
Alert Avalanche Configuration
- 2 Windows XP machines (PC 91 and PC 94), each running a UI
- Each machine = 1 system
- Each system has 5 crates declared × 256 channels × 2 alerts in each channel ("voltage" and "current")
- 40,000 DPEs total in each system
- Each system showed alerts from both systems
Traffic & Alert Generation
- Simple UI script:
  – Repeat: delay D ms, then change N DPEs
- Traffic rate is therefore specified as a (D, N) pair
  – Bursts, not changes/sec
- Option to provoke alerts
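The burst pattern above can be sketched in Python. This is a hypothetical stand-in for the actual PVSS CTRL script: `change_dpe` is a placeholder for the real datapoint-set call, not a PVSS API.

```python
import time

def average_rate(delay_ms, n_changes):
    """Average DPE changes per second for a (D, N) setting.

    The script sleeps D ms, then changes N DPEs in one burst, so the
    average is N / (D / 1000) -- the instantaneous rate during the
    burst is much higher, which is why (D, N) is not "changes/sec".
    """
    return n_changes / (delay_ms / 1000.0)

def run_generator(delay_ms, n_changes, change_dpe, bursts):
    """Hypothetical sketch of the UI traffic script:
    repeat { delay D ms; change N DPEs }."""
    for _ in range(bursts):
        time.sleep(delay_ms / 1000.0)
        for i in range(n_changes):
            change_dpe(i)  # stand-in for setting the i-th DPE
```

For example, a (400 ms, 1,000 changes) setting averages 2,500 changes/sec, delivered in bursts.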
Alert Avalanche Test Results - I
- You can select which system's alerts you wish to view
- The UI caches ALL alerts from ALL selected systems
  – Needs sufficient RAM! (5,000 CAME + 5,000 WENT alerts needed 80 MB)
- Screen update is CPU hungry and an avalanche takes time(!)
  – 30 sec for 10,000 lines
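A back-of-the-envelope check of that cache cost (my own arithmetic, not an ETM figure): 80 MB for 10,000 cached alerts is roughly 8 KB per alert line.

```python
def kb_per_alert(total_mb, n_alerts):
    """Average RAM cost per cached alert, in KB."""
    return total_mb * 1024.0 / n_alerts

# 5,000 CAME + 5,000 WENT alerts needed 80 MB of UI cache
cost = kb_per_alert(80, 10000)  # ~8 KB per cached alert
```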
Alert Avalanche Test Results - II
- Too many alerts → progressive degradation:
  1) Screen update suspended; a message is shown
  2) Evasive action: the Event Manager eventually cuts the connection to the UI, and the UI kills itself
- The EM correctly processed ALL alerts and LOST NO DATA
Alert Avalanche Test Results - III
- Alert screen update is CPU intensive
- Scattered alert screens behave the same as local ones (TCP)
- "Went" alerts that are acknowledged on one alert screen disappear from the other alert screens, as expected
  – Bugs we reported have now been fixed
What were the potential showstoppers?
- Basic functionality
  – Synchronised types in V2!
- Sheer number of systems
  – Can the implementation cope?
- Alert Avalanches
  – How does PVSS degrade?
- Is the load of many Alerts reasonable?
- Is the load of many Trends reasonable?
Agreed Realistic Configuration
(diagram: 3-level tree of Windows (W) and Linux (L) machines)
- 3-level hierarchy of machines
- Only ancestral connections, no peer links; only direct connections allowed
- 40,000 DPEs in each system; 1 system per machine
- Mixed platform (W = Windows, L = Linux)
Viewing Alerts coming from leaf systems
(diagram: PC 91 at the top; PCs 92, 93, 94, 95 below it; PCs 03–13 as leaves)
- 1,000 "came" alerts generated on PC 94 took 15 sec to be absorbed by PC 91; all 4 (2 physical) CPUs in PC 91 shouldered the load
- Additional alerts then fed from PC 93 to the top node
  – Same graceful degradation and evasive action seen as before: PC 91's EM killed PC 91's Alert Screen
- The display is again the bottleneck
Rate supportable from 2 systems
- Set up a high but supportable rate of traffic (10,000 ms delay, 1,000 changes) on each of PC 93 and PC 94, feeding PC 91
- PC 93 itself was almost saturated, but PC 91 coped (~200 alerts/sec average, dual CPU)
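The quoted average is consistent with the (D, N) settings (my arithmetic, assuming every change raises an alert): two feeders each changing 1,000 DPEs every 10,000 ms give 2 × 100 = 200 alerts/sec at the top node.

```python
def combined_rate(systems):
    """Total average changes/sec arriving at the top node from several
    feeder systems, each described by its (delay_ms, n_changes) setting."""
    return sum(n / (d / 1000.0) for d, n in systems)

# PC 93 and PC 94, each at (10,000 ms, 1,000 changes)
rate = combined_rate([(10000, 1000), (10000, 1000)])  # 200 changes/sec
```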
Surprise Overload (manual)
- Manually stop PC 93 → PC 91 pops up a message
- Manually restart PC 93
- The rush of traffic to PC 91 caused PC 93 to overload
- PC 93's EM killed PC 93's DistM
- PC 91 pops up a message
PVSS Self-healing property
- PVSS self-healing algorithm:
  – Pmon on PC 93 restarts PC 93's DistM
Remarks
- The evasive action taken by the EM, cutting the connection, is very good: it localises problems, keeping the overall system intact
- The self-healing action is very good: automatic restart of dead managers
- BUT…
Evasive action and Self-healing
- Manually stop PC 93 → PC 91 pops up a message
- Manually restart PC 93
- Rush of traffic to PC 91 causes PC 93 to overload
- PC 93's EM kills PC 93's DistM
- PC 91 pops up a message
- Pmon restarts PC 93's DistM… and the cycle repeats
Self-healing Improvement
- To avoid the infinite loop, ETM's Pmon eventually gives up
- How soon is configurable
  – Still not ideal!
- ETM are currently considering my suggestion for improvement:
  – Pmon should issue the restart, but not immediately
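The suggestion amounts to spacing out the restarts rather than restarting instantly or giving up. One common way to do that is an exponential back-off with a cap; this is a minimal illustration of the idea, not ETM's Pmon code.

```python
def restart_delays(base_s=1.0, factor=2.0, cap_s=60.0, attempts=6):
    """Delays a supervisor could wait before each successive restart of
    a manager that keeps being killed: back off exponentially up to a
    cap, so a restart storm cannot re-trigger the overload at once."""
    delays = []
    d = base_s
    for _ in range(attempts):
        delays.append(min(d, cap_s))
        d *= factor
    return delays
```

With the defaults, each restart waits twice as long as the previous one, so the overloaded peer gets progressively more time to drain its queues.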
(Old) Alert Screen
- We fed back many problems with the Alert Screen during the pre-release trials
  – E.g. it leaves stale information on-screen when systems leave and come back
New Alert/Event Screen in V3.0
- The 3.0 Official release has a completely new Alert/Event Screen which fixes most of the problems
- It's new and still has some bugs, but the ones we have seen are neither design problems nor showstoppers
More work for ETM
- When a DistM is killed by the EM taking evasive action, the only indication is in the log
- But the Log Viewer, like the Alert Viewer, is heavy on CPU and shouldn't be left running when it's not needed
Reconnection Behaviour
- No gaps in the alert archive of the machine that isolated itself by taking evasive action; no data was lost
- It takes about 20 sec for two newly restarted Distribution Managers to get back in contact
- Existing (new-style!) alert screens are updated with the alerts of new systems that join (or re-join) the cluster
Is the load of many Alerts reasonable?
- PVSS coped with ~200 alerts/sec average, a rate that would itself be rather worrying in a production system, so I believe the answer is "Yes"
- The response to an overload is very good, though it can still be tweaked
- Data integrity is preserved throughout
What were the potential showstoppers?
- Basic functionality
  – Synchronised types in V2!
- Sheer number of systems
  – Can the implementation cope?
- Alert Avalanches
  – How does PVSS degrade?
- Is the load of many Alerts reasonable?
- Is the load of many Trends reasonable?
Can you see the baby?
What were the potential showstoppers?
- Basic functionality
  – Synchronised types in V2!
- Sheer number of systems
  – Can the implementation cope?
- Alert Avalanches
  – How does PVSS degrade?
- Is the load of many Alerts reasonable?
- Is the load of many Trends reasonable?
Is the load of many Trends reasonable?
- Same configuration (tree: PC 91 at the top; PCs 92–95; PCs 03–13 as leaves)
- Trend windows were opened on PC 91 displaying data from more and more systems; mixed platform
Is Memory Usage Reasonable? Yes.
  Step                                                    RAM (MB)
  Steady state, no trends open on PC 91                   593
  Open plot control panel on PC 91                        658
  On PC 91, open a 1-channel trend window from PC 03      658
  On PC 91, open a 1-channel trend window from PC 04      657
  On PC 91, open a 1-channel trend window from PC 05      657
  On PC 91, open a 1-channel trend window from PC 06      658
  On PC 91, open a 1-channel trend window from PC 07      658
Is Memory Usage Reasonable? Yes.
  Step                                                                        RAM (MB)
  Steady state, no trends open on PC 91                                       602
  On PC 91, open 16 single-channel trend windows from PC 95 Crate 1 Board 1   604
  On PC 91, open 16 single-channel trend windows from PC 03 Crate 1 Board 1   607
  On PC 91, open 16 single-channel trend windows from PC 04 Crate 1 Board 1   610
Test 34: top node plotting data from the leaf machines' archives
- Performed excellently
- The test ceased when we ran out of screen real estate to show even the iconised trends (48 of them)
Bland result? No!
- Did the tests go smoothly? No!
  – But there was good news at the end
Observed gaps in the trend!!
- Investigation showed the gap was correct:
  – Remote Desktop start-up caused CPU load
  – Data changes were not generated at this time
Proof with a Scattered Generator
(setup: trend UI on PC 94; scattered traffic-generator UI on PC 93; one EM)
- Steady traffic generation
- No gaps in the recorded archive
  – Even when we deliberately soaked up CPU
- Gaps were seen in the display
  – Need a "Trend Refresh" button (ETM)
Would sustained overload give trend problems?
- High traffic (400 ms delay, 1,000 changes) on PC 93, as a scattered member of PC 94's system
- PC 94's own trend plot could not keep up
- PC 91's trend plot could not keep up
- "Not keep up" means…
"Display can't keep up" means…
- A growing lag: at "time now", the trend screen's values have only been updated to a point some way in the past
Evasive action
- The EM took evasive action (disconnected the traffic generator)
- The trend screen's values were finally updated up to the point of disconnection
- The last 65 sec of traffic were still queued in the traffic generator and were lost when it killed itself
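The behaviour above can be modelled as a producer outrunning a bounded consumer: the backlog grows until the EM disconnects, and whatever is still queued in the generator is lost. A toy model of that, with made-up capacities (my own illustration, not PVSS internals):

```python
def backlog_at_disconnect(produce_rate, consume_rate, queue_limit):
    """Model of the overload: if production exceeds consumption, the
    backlog grows at (produce_rate - consume_rate) items/sec and the
    EM disconnects once queue_limit items are pending.

    Returns (seconds_until_disconnect, seconds_of_data_lost), where
    the lost data is the queued backlog expressed in producer-seconds.
    """
    surplus = produce_rate - consume_rate
    if surplus <= 0:
        return (float('inf'), 0.0)  # consumer keeps up; never disconnects
    t_disconnect = queue_limit / surplus
    seconds_lost = queue_limit / produce_rate
    return (t_disconnect, seconds_lost)
```

The model shows why the loss is bounded: the generator can only lose what fits in its queue, while the archive behind the EM stays complete.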
Wakey! Summary of Multiple Trending
- PVSS can cope
- PVSS is very resilient to overload
- Successful tests
Test 31: DP change rates
- Measured saturation rates on different platform configurations
- No surprises: faster machines with more memory are better; Linux is better than Windows
- Numbers are on the Web
Test 32: DP changes with alerts
- Measured saturation rates; no surprises again
- A dual CPU can help with processing when there are a lot of alert screen (user interface) updates
What were the potential showstoppers?
- Basic functionality
  – Synchronised types in V2!
- Sheer number of systems
  – Can the implementation cope?
- Alert Avalanches
  – How does PVSS degrade?
- Is the load of many Alerts reasonable?
- Is the load of many Trends reasonable?
- Conclusions
Conclusions
- No showstoppers
- We have seen nothing to suggest that PVSS cannot be used to build a very big system
Further work - I
- Further "informational" tests will be conducted to assist in making configuration recommendations, e.g. understanding the configurability of the message queuing and evasive-action mechanism
- Follow up issues such as "AES needed more CPU when scattered"
- Traffic overload from a SIM driver rather than a UI
- Collaborate with Peter C. to perform network overload tests
Further work - II
- Request a Use Case from the experiments for a non-stressed configuration:
  – Realistic sustained alert rates
  – Realistic peak alert rate + realistic duration
    • i.e. not a sustained avalanche
  – How many users connected to the control room machine?
  – % viewing alerts; % viewing trends; % viewing numbers (e.g. CAEN voltages)
  – Terminal Server UI connections
  – How many UIs can the control room cope with?
- What recommendations do you want?
In greater detail…
- The numbers behind these slides will soon be available on the Web at http://itcobe.web.cern.ch/itcobe/Projects/ScalingUpPVSS/welcome.html
- Any questions?
Can you see the baby?
Example Numbers
Table showing the traffic rates (delay in ms / DPE changes per burst) on different machine configurations that gave rise to ~70% CPU usage on those machines. See the Web links for the original table and details on how to interpret the figures.

  Name   O/S     GHz       RAM (GB)   Rate @ ~70% CPU (ms / changes)
  PC 92  Linux   2.2 × 2   2          1000 / 1000
  PC 93  W2000   1.8       0.5        1000 / 500
  PC 94  WXP     2.4       1          2000 / 1000
  PC 95  Linux   2.4       1          1000 / 1000
  PC 03  Linux   0.7       0.25       2000 / 1000