XCI-073_AWS_Information_Services_Design-v0.2.pdf identifies XCI as the "team that implements, operates, and supports XSEDE Information Services."
Why is XSEDE's integration team operating a production service rather than the XSEDE operations team?
Delivery Effort Stage:
This seems like a good next step once we’ve established a baseline, which I think this doc helps with.
Long-term, it would be logical for the XSEDE ops team to be the operators. Have you (JP, Eric) asked the ops team if they can operate it? Another approach might be to make whoever in the integration team is doing the operating become an official member of the ops team, managed by that area. There may be policies and practices that ops uses that ought to be applied consistently, and the best way to do that is to have all XSEDE operators under common management. Plus, the transition from integration to ops would likely be a healthy thing.
A possible counter-argument is that separating dev from ops for this specific service would inhibit agile development. However, even Agile shops typically have separate ops & dev teams, and I’m not sure mixing them is better for the product in the long term. It may be better to establish how often updates need to be made and then see whether the ops team can handle that cadence.
Because the SysOps team is not staffed with the expertise to operate the RabbitMQ, PostgreSQL, and Apache+Django application services, we have instead elected to work with them to make sure XCI follows their security guidelines and best practices. In the past, this was a paper exercise: we read their documents and manually followed their practices. On AWS we're taking this one step further by directly applying SysOps Ansible configurations to XCI-operated services. This review is itself part of making sure that XCI is following all SysOps processes.
Making a single team responsible for the services is more efficient and effective than having an arbitrary fence with multiple teams managing pieces and having to coordinate across teams.
I hope that through this review we can establish that the right processes are being used, so that it doesn't matter that a single team manages the full stack.
Most of the operations work is administering the RabbitMQ, PostgreSQL, and Apache/Django services, which operations isn't staffed to support (except perhaps for Apache). OS updates have to be coordinated with the service administrators to avoid outages and incompatibilities. Since the service admins are also experienced systems administrators, it's been easier to have a single team administer everything. XCI coordinates with SysOps to follow all XSEDE operational and security practices.
Moving forward, since SysOps has created an Ansible repository with XSEDE-recommended configurations, the XCI team will use the applicable SysOps configurations plus our XCI service configurations on all new XCI AWS instances.
A philosophy of "don't let the org chart hinder efficient work" seems reasonable to me.
As an XCI staff member, I'm a little concerned that applying XCI staff time to operations will result in less time for integration work, but I'm content to leave such concerns to XSEDE management.
However, I suggest that the technical design should not tie the service operation to a specific WBS. For example, the bastion host should be called something like "info-serv-admin" rather than "xci-awsadmin" so if at some future date XSEDE Operations is in a position to help with operating this service, they can do so without confusion about the XCI labeling.
In other words, put details about who operates the service in the deployment plan, and leave some flexibility in the design regarding how the operations team is constituted at any particular time.
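As a rough illustration of that split, here is a minimal sketch in Python. Only the host name "info-serv-admin" comes from the suggestion above; the names DESIGN_HOSTS, DeploymentPlan, and describe are hypothetical and exist only for this example. The point is that the design names a host by its role, while the operating team is a separate value that can change without touching the design.

```python
from dataclasses import dataclass

# Design: hosts are named for their role, not for the team that runs them.
# "info-serv-admin" is the name suggested above; the mapping itself is
# a hypothetical structure used only for illustration.
DESIGN_HOSTS = {"bastion": "info-serv-admin"}


@dataclass(frozen=True)
class DeploymentPlan:
    """Records who operates the service at a given time (kept outside the design)."""
    operator_team: str


def describe(role: str, plan: DeploymentPlan) -> str:
    """Combine a role-based host name from the design with the current operator."""
    return f"{DESIGN_HOSTS[role]} (operated by {plan.operator_team})"


# The same design works under either operating arrangement.
today = DeploymentPlan(operator_team="XCI")
later = DeploymentPlan(operator_team="XSEDE Operations")

print(describe("bastion", today))
print(describe("bastion", later))
```

Changing the operator only changes the deployment plan value; the host name in the design stays the same.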