Intro
This is the final post in a series of four, in which we set out to revisit various BeyondCorp topics and share lessons that were learnt along the internal implementation path at Google.
The first post in this series focused on providing necessary context for how Google adopted BeyondCorp, Google’s implementation of the zero trust security model. The second post focused on managing devices - how we decide whether or not a device should be trusted and why that distinction is necessary. The third post focused on tiered access - how to define access tiers and rules and how to simplify troubleshooting when things go wrong.
This post introduces the concept of gated services, explains how to identify and then migrate them, and shares the lessons we learned along the way.
[Figure: High level architecture for BeyondCorp]
Identifying and gating services
How do you identify and categorize all the services that should be gated?
Google began as a web-based company, and as it matured in the modern era, most internal business applications were developed with a web-first approach. These applications were hosted on internal architecture similar to that of our external services, with the exception that they could only be accessed on corporate office networks. Identifying services to be gated by BeyondCorp was therefore easier for us, because most internal services were already properly inventoried and hosted via standard, central solutions. Migration, in many cases, was as simple as a DNS change. Solid IT asset inventory systems and maintenance are critical to migrating to a zero trust security model.
Enforcement of zero trust access policies began with services which we determined would not be meaningfully impacted by the change in access requirements. For most services, requirements could be gathered via typical access log analysis or consulting with service owners. Services which could not be readily gated by default ACL requirements required service owners to develop strict access groups and/or eliminate risky workflows before they could be migrated.
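To illustrate the kind of access log analysis mentioned above, here is a minimal sketch that counts distinct users per group for each service. The log format, the service names and the `groups_by_service` helper are all hypothetical and are not taken from Google's tooling; a real analysis would read from whatever logging pipeline is already in place.

```python
from collections import defaultdict

# Hypothetical access log records: (service, user, group).
ACCESS_LOG = [
    ("payroll", "alice", "finance"),
    ("payroll", "bob", "finance"),
    ("payroll", "carol", "engineering"),
    ("wiki", "alice", "finance"),
    ("wiki", "dave", "engineering"),
]


def groups_by_service(log):
    """Return {service: {group: number of distinct users seen}}."""
    users = defaultdict(lambda: defaultdict(set))
    for service, user, group in log:
        users[service][group].add(user)
    return {
        service: {group: len(members) for group, members in groups.items()}
        for service, groups in users.items()
    }
```

A summary like this can be a starting point for a conversation with the service owner: a group that appears only once or twice may point to a workflow that needs an explicit access group, or to one that can be retired before migration.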
As discussed in our previous blog post, Google makes internal services available based on device trust tiers. Today, those services are accessible by the highest trust tier by default.
Service owners are free to propose access changes that add or remove restrictions on their service, for example to restrict access to a specific group or team. Access changes which are deemed to be sufficiently low risk can be automatically approved. In all other cases, such as where the owning team wants to expose a service to a risky device tier, they must work with security engineers to follow the principle of least privilege and devise solutions.
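The split between automatically approved changes and changes that need a security review could be sketched as follows. This is an illustration under stated assumptions, not Google's actual policy engine: the tier names, the `AccessPolicy` shape and the rule that "low risk" means "does not lower the required tier and does not widen the audience" are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class TrustTier(IntEnum):
    # Hypothetical tier names; a higher value means a more trusted device.
    UNTRUSTED = 0
    BASIC = 1
    HIGHEST = 2


@dataclass(frozen=True)
class AccessPolicy:
    min_tier: TrustTier
    allowed_groups: frozenset  # an empty set means "all employees"


def widens_audience(current: AccessPolicy, proposed: AccessPolicy) -> bool:
    """True if the proposal lets in users the current policy excludes."""
    if not current.allowed_groups:   # already open to everyone
        return False
    if not proposed.allowed_groups:  # restricted -> everyone
        return True
    return not proposed.allowed_groups <= current.allowed_groups


def review_path(current: AccessPolicy, proposed: AccessPolicy) -> str:
    """Assumed rule: only changes that keep or tighten access are low risk."""
    exposes_riskier_tier = proposed.min_tier < current.min_tier
    if exposes_riskier_tier or widens_audience(current, proposed):
        return "security-review"
    return "auto-approve"
```

Under these assumptions, restricting a default-tier service to a single team would be approved automatically, while lowering `min_tier` would be routed to security engineers.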
What do you do with services that are incompatible with BeyondCorp ideals?
It may not always be possible to gate an application by the preferred zero trust solution. Services that cannot be easily gated typically fall into these categories:
- Type 1: "Non-proxyable protocols", e.g. non-HTTP/HTTPS traffic.
- Type 2: Low latency requirements or localized high throughput traffic.
- Type 3: Administrative and emergency access networks.
When gating a service with the preferred solution was not an option, we found that no single solution would work for all critical requirements:
- Solutions for the "Type 1" traffic have generally involved maintaining a specialized client tunnel which strongly enforces authentication and authorization decisions on both the client and the server end of the connection. This is usually client/server type traffic which is similar to HTTP traffic in that connectivity is typically multipoint-to-point.
- Solutions to the "Type 2" problems generally rely on moving BeyondCorp-compatible compute resources locally or developing a solution tightly integrated with network access equipment to selectively forward "local" traffic without permanently opening network holes.
- As for "Type 3", it would be ideal to completely eliminate all privileged internal networks. However, the reality is that some privileged networking will likely always be required to maintain the network itself and also to provide emergency access during outages.
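The three categories above can be recorded in a service inventory so that exceptions are tracked consistently. The sketch below is only an illustration: the `Service` fields, the set of proxyable protocols and the order in which the checks run are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only HTTP and HTTPS traffic can go through the access proxy.
PROXYABLE_PROTOCOLS = {"http", "https"}


@dataclass(frozen=True)
class Service:
    name: str
    protocol: str
    latency_sensitive: bool = False  # also covers localized high throughput
    administrative: bool = False     # admin or emergency access network


def exception_type(service: Service) -> Optional[str]:
    """Return the exception category, or None if the service can be gated."""
    if service.administrative:
        return "Type 3"
    if service.protocol.lower() not in PROXYABLE_PROTOCOLS:
        return "Type 1"
    if service.latency_sensitive:
        return "Type 2"
    return None
```

A service that returns `None` here is a candidate for standard gating; anything else needs one of the category-specific approaches described above.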
How do you prioritize gating?
Prioritization starts by identifying all the services that are currently accessible via internal IP-access alone and migrating the most critical services to BeyondCorp, while working to slowly ratchet down permissions via exception management processes. Criticality of the service may also depend on the number and type of users, the sensitivity of the data handled, and the security and privacy risks enabled by the service.
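One way to turn these criteria into a migration queue is a simple scoring function. The sketch below is hypothetical: the 0 to 3 scales, the weights and the `ServiceRecord` fields are assumptions chosen for illustration, and any real scoring would need to be agreed with security and service owners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    name: str
    ip_gated_only: bool    # reachable via internal IP access alone
    user_count: int
    data_sensitivity: int  # assumed scale: 0 (public) to 3 (highly sensitive)
    risk: int              # assumed scale: 0 (low) to 3 (high)


def criticality(service: ServiceRecord) -> int:
    """Combine the criteria into one score; the weights are illustrative."""
    user_score = min(service.user_count // 1000, 3)
    return 2 * service.data_sensitivity + 2 * service.risk + user_score


def migration_queue(services):
    """Services still relying on IP-based access, most critical first."""
    candidates = [s for s in services if s.ip_gated_only]
    return sorted(candidates, key=lambda s: (-criticality(s), s.name))
```

Services that are already gated drop out of the queue, and ties are broken by name so that the ordering is stable between runs.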
Migration logistics
Most services required integration testing with the BeyondCorp proxy. Service teams were encouraged to stand up "test" services which were used to test functionality behind the BeyondCorp proxy. Most services that performed their own access control enforcement were reconfigured to instead rely on BeyondCorp for all user/group authentication and authorization. Service teams have been encouraged to develop their own "fine-grained" discretionary access controls in the services by leveraging session data provided by the BeyondCorp proxy.
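As a sketch of what fine-grained, service-level access control built on proxy-provided session data could look like, consider the example below. The `Session` fields, the tier names and the `ACTION_RULES` table are hypothetical; the actual data that the BeyondCorp proxy passes to a service is not described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    # Hypothetical shape of the session data forwarded by the access proxy.
    user: str
    groups: frozenset
    device_tier: str


# Hypothetical per-action rules owned by the service team.
ACTION_RULES = {
    "view_report": {"groups": {"finance", "auditors"}, "tiers": {"highest", "basic"}},
    "approve_payment": {"groups": {"finance"}, "tiers": {"highest"}},
}


def is_allowed(session: Session, action: str) -> bool:
    """Check one action against the rules; unknown actions are denied."""
    rule = ACTION_RULES.get(action)
    if rule is None:
        return False
    in_group = bool(session.groups & rule["groups"])
    return in_group and session.device_tier in rule["tiers"]
```

The proxy still performs the coarse gating in this model. The service only adds per-action decisions on top, and it denies by default when no rule matches.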
Allow coarse gating and exceptions
Inventory: It's easy to overlook the importance of keeping a good inventory of services, devices, owners and security exceptions. The journey to a BeyondCorp world should start by solving organizational challenges when managing and maintaining data quality in inventory systems. In short, knowing how a service works, who should access it, and what makes that acceptable are the central tenets of managing BeyondCorp. Fine-grained access control is severely complicated when this insight is missing.
Legacy protocols: Most large enterprises will inevitably need to support workflows and protocols which cannot be migrated to a BeyondCorp world (in any reasonable amount of time). Exception management and service inventory become crucial at this stage while stakeholders develop solutions.
Run highly reliable systems
The BeyondCorp initiative would not be sustainable at Google’s scale without the involvement of various Site Reliability Engineering (SRE) teams across the inventory systems, BeyondCorp infrastructure and client-side solutions. The ability to successfully achieve widespread adoption of changes this large can be hampered by perceived (or in some cases, actual) reliability issues. Understanding the user workflows that might be impacted, working with key stakeholders and ensuring the transition is smooth and trouble-free for all users helps protect against backlash and avoids users finding undesirable workarounds. By applying our reliability engineering practices, those teams helped to ensure that the components of our implementation all have availability and latency targets, operational robustness and similar properties that are compatible with our business needs and intended user experiences.
Put employees in control as much as possible
Employees cover a broad range of job functions with varying requirements of technology and tools. In addition to communicating changes to our employees early, we provide them with self-service solutions for handling exceptions or addressing issues affecting their devices. By putting our employees in control, we help to ensure that security mechanisms do not get in their way, helping with the acceptance and scaling processes.
Throughout this series of blog posts, we set out to revisit and demystify BeyondCorp, Google’s internal implementation of a zero trust security model. The four posts had different focus areas - setting context, devices, tiered access and, finally, services (this post).
If you want to learn more, you can check out the BeyondCorp research papers. In addition, getting started with BeyondCorp is now easier using zero trust solutions from Google Cloud (context-aware access) and other enterprise providers. Lastly, stay tuned for an upcoming BeyondCorp webinar on Cloud OnAir in a few months where you will be able to learn more and ask us questions. We hope that these blog posts, research papers, and webinars will help you on your journey to enable zero trust access.
Thank you to the editors of the BeyondCorp blog post series, Puneet Goel (Product Manager), Lior Tishbi (Program Manager), and Justin McWilliams (Engineering Manager).