Service Level Agreements (SLAs)

This Service Level Agreement (SLA) outlines the relationship between the CIRRUS team, which provides the on-premise cloud infrastructure, and its recognized users, including UCAR employees, visitors, and external collaborators authorized to use the on-premise cloud resources.

NSF NCAR | CISL operates compute, storage, and network hardware in robust data centers at multiple organizational facilities. The on-premise cloud lets users run approved use cases on these highly available, organizationally supported compute resources. This includes access to routable network space and the UCAR Domain Name System (DNS). These resources supplement computing needs that are not met by the HPC offering, public cloud, or locally available resources.

Primary Services | Service Dependencies
Kubernetes Cluster | Server nodes
Argo CD | Networking
Harbor | GLADE mount
OpenBao |
JupyterHub / Binder |

Audience: Service Technical Staff, System Administrators, On & Off Site Personnel, and Authorized Affiliates

Recognized Customers: On & Off Site Personnel, and Authorized Affiliates

Important

Availability: The service is designed to operate 24/7. However, support is currently limited to business hours.


Response Level and Service Definitions

Definitions

Severity | Description
Critical | Complete loss of a core service or major functionality due to failure or incident (e.g., site-wide outage). No workaround is available that will restore service reliably within one (1) hour. This may include a site-wide security incident.
Urgent | Significant degradation of a critical service or full failure of a non-critical service impacting productivity. A workaround may exist but may not fully restore service.
Regular | Minor or extended functionality issues; basic functionality is present. A workaround is available. Includes feature requests, non-urgent upgrades, or inquiries.

Response Times

Response Level | Business Hours (M-F 08:00 - 17:00 MST) | After Hours
Critical | Response within 2 hours | Addressed at start of next business day
Urgent | Response within 4 hours | Addressed at start of next business day
Regular | Reviewed during business hours | Reviewed during business hours

Important

There is currently no after-hours support. All issues occurring after business hours will be triaged at the start of the next workday.
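
The response-time rules can be read as a small rule set. The following is a minimal Python sketch of them, not a CIRRUS tool. The function names are hypothetical, timestamps are assumed to be naive datetimes in the site's local time, and holidays are ignored. A ticket opened late in the business day simply gets the wall-clock target added, because the SLA text does not say how such cases carry over.

```python
from datetime import datetime, time, timedelta

# Business hours from the SLA: Monday-Friday, 08:00-17:00 local time.
OPEN, CLOSE = time(8, 0), time(17, 0)

# Business-hours response targets; Regular has no fixed target.
RESPONSE_TARGET = {
    "Critical": timedelta(hours=2),
    "Urgent": timedelta(hours=4),
    "Regular": None,
}


def in_business_hours(ts: datetime) -> bool:
    """True if ts falls on a weekday between 08:00 and 17:00."""
    return ts.weekday() < 5 and OPEN <= ts.time() < CLOSE


def next_business_open(ts: datetime) -> datetime:
    """Return the next weekday 08:00 at or after an after-hours timestamp."""
    day = ts.date()
    # Evenings and weekends roll over to the following day first.
    if ts.weekday() >= 5 or ts.time() >= CLOSE:
        day += timedelta(days=1)
    # Skip Saturday and Sunday.
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, OPEN)


def respond_by(severity: str, reported: datetime):
    """Latest expected first response, or None for Regular tickets."""
    if severity not in RESPONSE_TARGET:
        raise ValueError(f"unknown severity: {severity}")
    target = RESPONSE_TARGET[severity]
    if target is None:
        return None  # Regular: reviewed during business hours
    if in_business_hours(reported):
        return reported + target
    # After hours: triaged at the start of the next business day.
    return next_business_open(reported)


# Monday 09:00, Critical: two hours later.
print(respond_by("Critical", datetime(2024, 1, 8, 9, 0)))   # 2024-01-08 11:00:00
# Friday 18:00, Urgent: the following Monday at 08:00.
print(respond_by("Urgent", datetime(2024, 1, 12, 18, 0)))   # 2024-01-15 08:00:00
```

Keeping the targets in one dictionary means a change to the SLA table touches a single place in the sketch.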


Backup & Disaster Recovery Policy

CIRRUS follows Infrastructure as Code (IaC) practices. All applications deployed on the on-prem cloud are defined via code repositories and can be redeployed as needed.

  • Application Backups: Applications themselves are not backed up individually; they are re-deployed via Argo CD and source-controlled templates.
  • Argo CD: Argo projects are backed up after changes, enabling project restoration in case of data loss.
  • Container Images (Harbor): Images stored in Harbor are backed up to object storage and can be restored from there.
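
To illustrate what "defined via code repositories" can look like, here is a minimal Python sketch that builds an Argo CD Application manifest and prints it as JSON. It is an illustration under assumptions, not the CIRRUS configuration: the application name, repository URL, path, branch, target namespace, the `argocd` namespace, the `default` project, and the automated sync policy are all hypothetical choices.

```python
import json


def build_application(name: str, repo_url: str, path: str, namespace: str) -> dict:
    """Build an Argo CD Application manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumption: Argo CD runs in its conventional "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            # Where the source-controlled templates live.
            "source": {
                "repoURL": repo_url,
                "targetRevision": "main",
                "path": path,
            },
            # Deploy into the same cluster that runs Argo CD.
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            # Assumption: automated sync, so the app is recreated from Git.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


manifest = build_application(
    name="demo-app",  # hypothetical application
    repo_url="https://example.org/demo/demo-app.git",  # hypothetical repository
    path="deploy",
    namespace="demo",
)
print(json.dumps(manifest, indent=2))
```

Because the manifest only points at a repository, re-creating it is enough for Argo CD to redeploy the application from source, which is why the application itself does not need a separate backup.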

Persistent Volume Backups

Persistent Volumes (PVs) in CIRRUS can be replicated across sites to improve resiliency.

To request PV replication for your application, please create a ticket.


Change Management

All changes must be submitted via a Jira ticket. For more information on this process, please see the documentation on creating tickets.

Tickets are reviewed and prioritized by the CIRRUS Product Owner.

  • Critical and Urgent tickets will be addressed based on SLA response times.
  • Regular requests are reviewed during the team's bi-weekly planning sessions.

Contact Information

Business Hours: 08:00 - 17:00 MST, Monday - Friday

Primary Contact: Nick Cote

Secondary Contact: Submit a Jira Request

Off Hours Contact: Nick Cote and/or Jira Request


Monitoring & Reporting

For observability, the CIRRUS infrastructure leverages:

  • Prometheus for metrics collection
  • Grafana for visualization and dashboards
  • Loki for centralized log aggregation

These tools work together to detect, surface, and alert the CIRRUS team to any operational issues within the platform.
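
As a rough illustration of how metrics can surface an operational issue, the Python sketch below builds a Prometheus instant-query URL for the standard `up` metric and extracts the failing targets from a response. The base URL, instance names, and sample payload are hypothetical, and the sketch makes no network call; the alert rules CIRRUS actually uses are not described here.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, expr: str) -> str:
    """Return the Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def down_targets(payload: dict) -> list:
    """List the instance label of every series in an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return [s["metric"].get("instance", "unknown") for s in payload["data"]["result"]]


# `up == 0` selects scrape targets that Prometheus could not reach.
url = build_query_url("http://prometheus.example:9090/", "up == 0")

# Hypothetical response body in the shape the query API returns.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "instance": "node-a:9100", "job": "node"},
                "value": [1700000000, "0"],
            }
        ],
    },
}

print(url)
print(down_targets(sample))
```

In practice the same expression would more likely live in an alerting rule or a Grafana panel than in a script; the sketch only shows the shape of the data involved.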