Challenges with collecting, anonymizing, sharing and using high-speed network-traffic data


Challenges with collecting, anonymizing, sharing and using high-speed network-traffic data Alastair Nottingham, Jeff W. Collyer, Brian E. Root, Molly Buchanan, Yizhe Zhang, Kolia Sadeghi (CCRi), Yixin Sun, Don E. Brown, Jack W. Davidson, Malathi Veeraraghavan


Data Collection

“If you want people to do Chemistry Research, you need a Chemistry Lab. The same goes for Cyber Research.” – Jacob Baxter, PUNCH Cyber, DARPA CHASE collaborator

Needed for Global Analysis. Technical Challenges:
• Apply machine learning to data collected from multiple enterprise-scale networks to detect novel threats.
• Privacy – how to retain sufficient fidelity without compromising user privacy.
• Network ecosystems are varied; “normal” is highly subjective.
• Data retention – how to select, process and store network data over protracted periods.
• Multiple vantage points may help detect certain global threats (e.g. APT/ransomware).
• Raw data is extremely sensitive, impractically large, structurally diverse and constantly changing.
• To be useful, data must be categorized, shared between collaborating institutions, and retained in a consistent format for months without leaking sensitive information.
• Sensor placement – where to place sensors to detect specific threat types.
• Labelling – how to label training data (supervised) and/or verify outputs (unsupervised) in continuous network-traffic data.
• Sharing results – how to facilitate reuse of intermediate results to reduce redundancies in ML computation.
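The tension between privacy and fidelity above can be made concrete with a small sketch. The snippet below shows keyed pseudonymization of IPv4 addresses in which hosts sharing a /24 stay in the same anonymized /24, so subnet-level patterns survive while real addresses do not. This is a simplification in the spirit of prefix-preserving schemes such as Crypto-PAn, not the project's actual method; the key value is a placeholder.

```python
import hmac
import hashlib
import ipaddress

KEY = b"site-secret-rotate-me"  # hypothetical per-deployment secret


def _keyed_bytes(data: bytes, n: int) -> bytes:
    """Deterministic keyed mapping of `data` to n pseudorandom bytes."""
    return hmac.new(KEY, data, hashlib.sha256).digest()[:n]


def anonymize_ipv4(addr: str) -> str:
    ip = ipaddress.IPv4Address(addr)
    prefix, host = ip.packed[:3], ip.packed[3:]
    # Map the /24 prefix and the host byte independently: which hosts
    # share a subnet remains visible, but not which subnet it was.
    new_prefix = _keyed_bytes(prefix, 3)
    new_host = _keyed_bytes(prefix + host, 1)
    return str(ipaddress.IPv4Address(new_prefix + new_host))
```

Because the mapping is deterministic under a fixed key, the same source address always anonymizes to the same value, so flow- and host-level patterns remain usable for machine learning across a retention window.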


Approach
1. Collect aggregate logs only (e.g. Zeek, FireEye, DHCP/NAT logs).
   – Much easier/faster to process/store than raw data.
   – Facilitated/supported by SOC.
2. Sensors on network border (external) and within CS department (internal).
   – Zeek, FireEye, Honeypot, DHCP, NAT, Authentication.
3. All logs anonymized.
   – Consistent record format.
   – Custom anonymization per data type.
   – Obfuscate information while retaining patterns.
   – Fast custom plugin-based C++ framework.
4. Collection of ground truth.
   – Stingar, intelligence feeds, collaborator feeds.
   – Feedback loop with SOC – mutual benefit.
5. All data restricted to a secure, purpose-built HPC.
   – Data usage governed/enforced by HPC policy.
   – Data not freely shared, mitigating anonymization failures.
   – Control over stored data – easy to update anonymized data.
   – Custom ML framework to manage data reuse.
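The "custom anonymization per data type" idea can be sketched as a registry of per-field anonymizer plugins applied to each log record. The project's framework is a C++ plugin system; the Python below is only a shape sketch, and the field names (`user`, `host`) are hypothetical.

```python
import hashlib
from typing import Callable, Dict

# Registry mapping a record field name to its anonymizer plugin.
ANONYMIZERS: Dict[str, Callable[[str], str]] = {}


def anonymizer(field: str):
    """Decorator: register a plugin for one record field."""
    def register(fn):
        ANONYMIZERS[field] = fn
        return fn
    return register


@anonymizer("user")
def hash_user(value: str) -> str:
    # Obfuscate the identity while keeping it consistent across records.
    return hashlib.sha256(value.encode()).hexdigest()[:12]


@anonymizer("host")
def mask_host(value: str) -> str:
    # Keep the domain, drop the host label: patterns survive, identity doesn't.
    parts = value.split(".", 1)
    return "x." + parts[1] if len(parts) == 2 else "x"


def anonymize_record(record: dict) -> dict:
    # Fields without a registered plugin pass through unchanged.
    return {k: ANONYMIZERS.get(k, lambda v: v)(v) for k, v in record.items()}
```

Keeping every plugin behind one dispatch point is what makes a consistent record format and later re-anonymization ("easy to update anonymized data") tractable: a changed policy means swapping a plugin and re-running stored records.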


Administrative Challenges
● Legal/Ethical:
  ○ Institutional Review Board (IRB) approval may be required.
  ○ Policies vary by institution, and anonymization may not be an option for some data-oriented research.
● Sustentation:
  ○ Current effort is supported by a 4-year DARPA project.
  ○ Developing methods for sustaining this effort (NSF).
  ○ Adding new data collection organizations (currently UVA, VT – working with IU and GWU on NSF extension).
● Support:
  ○ Supporting external researchers on a dedicated HPC with a wide variety of big-data analytics packages incurs significant HR costs.
● Documentation:
  ○ Significant time can be lost in users deciphering the data dictionary/metadata for an expanding variety of log types.


Example: Zeek Data
• 1 hour of peak raw traffic uses more storage than 1 month of aggregated Zeek logs.
• <1 day to anonymize 1 month of aggregate Zeek traffic with 2 GB RAM and 48 cores.
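A back-of-envelope calculation shows why aggregation makes month-scale retention tractable. Every number below is an assumption for illustration (a 10 Gbps peak link and a 3000:1 pcap-to-Zeek-log ratio), not a figure from the talk.

```python
# Illustrative only: assumed rates, not measurements from the deployment.
peak_gbps = 10                           # assumed peak link rate
raw_per_hour_gb = peak_gbps / 8 * 3600   # GB of raw capture in one peak hour
zeek_reduction = 1 / 3000                # assumed raw -> Zeek-log size ratio
zeek_per_month_gb = raw_per_hour_gb * 24 * 30 * zeek_reduction

print(f"raw capture, 1 peak hour: {raw_per_hour_gb:.0f} GB")
print(f"Zeek logs, 1 month:       {zeek_per_month_gb:.0f} GB")
```

Under these assumptions one peak hour of raw capture is 4500 GB while a month of Zeek logs is roughly 1000 GB, consistent with the slide's claim that a month of aggregate logs is smaller than an hour of raw traffic.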