Tag Archives: validation

Kubernetes CRD Validation Using CEL

Motivation

CRDs was used to support two major categories of built-in validation:

  • CRD structural schemas: Provide type checking of custom resources against schemas.
  • OpenAPIv3 validation rules: Provide regex ('pattern' property), range limits ('minimum' and 'maximum' properties) on individual fields and size limits on maps and lists ('minItems', 'maxItems').

For use cases that cannot be covered by build-in validation support:

  • Admission Webhooks: have validating admission webhook for further validation
  • Custom validators: write custom checks in several languages such as Rego

While admission webhooks do support CRDs validation, they significantly complicate the development and operability of CRDs.

To provide an self-contained, in-process validation, an inline expression language - Common Expression Language (CEL), is introduced into CRDs such that a much larger portion of validation use cases can be solved without the use of webhooks.

It is sufficiently lightweight and safe to be run directly in the kube-apiserver, has a straight-forward and unsurprising grammar, and supports pre-parsing and typechecking of expressions, allowing syntax and type errors to be caught at CRD registration time.


CRD validation rule

CRD validation rules are promoted to GA in Kubernetes 1.29 to validate custom resources based on validation rules.

Validation rules use the Common Expression Language (CEL) to validate custom resource values. Validation rules are included in CustomResourceDefinition schemas using the x-kubernetes-validations extension.

The Rule is scoped to the location of the x-kubernetes-validations extension in the schema. And self variable in the CEL expression is bound to the scoped value.

All validation rules are scoped to the current object: no cross-object or stateful validation rules are supported.

For example:

...
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        x-kubernetes-validations:
          - rule: "self.minReplicas <= self.replicas"
            message: "replicas should be greater than or equal to minReplicas."
          - rule: "self.replicas <= self.maxReplicas"
            message: "replicas should be smaller than or equal to maxReplicas."
        properties:
          ...
          minReplicas:
            type: integer
          replicas:
            type: integer
          maxReplicas:
            type: integer
        required:
          - minReplicas
          - replicas
          - maxReplicas

will reject a request to create this custom resource:

apiVersion: "stable.example.com/v1"


kind: CronTab


metadata:


 name: my-new-cron-object


spec:


 minReplicas: 0


 replicas: 20


 maxReplicas: 10

with the response:

The CronTab "my-new-cron-object" is invalid:
* spec: Invalid value: map[string]interface {}{"maxReplicas":10, "minReplicas":0, "replicas":20}: replicas should be smaller than or equal to maxReplicas.

x-kubernetes-validations could have multiple rules. The rule under x-kubernetes-validations represents the expression which will be evaluated by CEL. The message represents the message displayed when validation fails.

Note: You can quickly test CEL expressions in CEL Playground.

Validation rules are compiled when CRDs are created/updated. The request of CRDs create/update will fail if compilation of validation rules fail. Compilation process includes type checking as well.

Validation rules support a wide range of use cases. To get a sense of some of the capabilities, let's look at a few examples:

Validation Rule

Purpose

self.minReplicas <= self.replicas

Validate an integer field is less than or equal to another integer field

'Available' in self.stateCounts

Validate an entry with the 'Available' key exists in a map

self.set1.all(e, !(e in self.set2))

Validate that the elements of two sets are disjoint

self == oldSelf

Validate that a required field is immutable once it is set

self.created + self.ttl < self.expired

Validate that 'expired' date is after a 'create' date plus a 'ttl' duration

Validation rules are expressive and flexible. See the Validation Rules documentation to learn more about what validation rules are capable of.


CRD transition rules

Transition Rules make it possible to compare the new state against the old state of a resource in validation rules. You use transition rules to make sure that the cluster's API server does not accept invalid state transitions. A transition rule is a validation rule that references 'oldSelf'. The API server only evaluates transition rules when both an old value and new value exist.

Transition rule examples:

Transition Rule

Purpose

self == oldSelf

For a required field, make that field immutable once it is set. For an optional field, only allow transitioning from unset to set, or from set to unset.

(on parent of field) has(self.field) == has(oldSelf.field)

on field: self == oldSelf

Make a field immutable: validate that a field, even if optional, never changes after the resource is created (for a required field, the previous rule is simpler).

self.all(x, x in oldSelf)

Only allow adding items to a field that represents a set (prevent removals).

self >= oldSelf

Validate that a number is monotonically increasing.

Using the Functions Libraries

Validation rules have access to a couple different function libraries:

Examples of function libraries in use:

Validation Rule

Purpose

!(self.getDayOfWeek() in [0, 6]

Validate that a date is not a Sunday or Saturday.

isUrl(self) && url(self).getHostname() in [a.example.com', 'b.example.com']

Validate that a URL has an allowed hostname.

self.map(x, x.weight).sum() == 1

Validate that the weights of a list of objects sum to 1.

int(self.find('^[0-9]*')) < 100

Validate that a string starts with a number less than 100

self.isSorted()

Validate that a list is sorted

Resource use and limits

To prevent CEL evaluation from consuming excessive compute resources, validation rules impose some limits. These limits are based on CEL cost units, a platform and machine independent measure of execution cost. As a result, the limits are the same regardless of where they are enforced.


Estimated cost limit

CEL is, by design, non-Turing-complete in such a way that the halting problem isn’t a concern. CEL takes advantage of this design choice to include an "estimated cost" subsystem that can statically compute the worst case run time cost of any CEL expression. Validation rules are integrated with the estimated cost system and disallow CEL expressions from being included in CRDs if they have a sufficiently poor (high) estimated cost. The estimated cost limit is set quite high and typically requires an O(n2) or worse operation, across something of unbounded size, to be exceeded. Fortunately the fix is usually quite simple: because the cost system is aware of size limits declared in the CRD's schema, CRD authors can add size limits to the CRD's schema (maxItems for arrays, maxProperties for maps, maxLength for strings) to reduce the estimated cost.

Good practice:

Set maxItems, maxProperties and maxLength on all array, map (object with additionalProperties) and string types in CRD schemas! This results in lower and more accurate estimated costs and generally makes a CRD safer to use.


Runtime cost limits for CRD validation rules

In addition to the estimated cost limit, CEL keeps track of actual cost while evaluating a CEL expression and will halt execution of the expression if a limit is exceeded.

With the estimated cost limit already in place, the runtime cost limit is rarely encountered. But it is possible. For example, it might be encountered for a large resource composed entirely of a single large list and a validation rule that is either evaluated on each element in the list, or traverses the entire list.

CRD authors can ensure the runtime cost limit will not be exceeded in much the same way the estimated cost limit is avoided: by setting maxItems, maxProperties and maxLength on array, map and string types.


Adoption and Related work

This feature has been turned on by default since Kubernetes 1.25 and finally graduated to GA in Kubernetes 1.29. It raised a lot of interest and is now widely used in the Kubernetes ecosystem. We are excited to share that the Gateway API was able to replace all the validating webhook previously used with this feature.

After CEL was introduced into Kubernetes, we are excited to expand the power to multiple areas including the Admission Chain and authorization config. We will have a separate blog to introduce further.

We look forward to working with the community on the adoption of CRD Validation Rules, and hope to see this feature promoted to general availability in upcoming Kubernetes releases.


Acknowledgements

Special thanks to Joe Betz, Kermit Alexander, Ben Luddy, Jordan Liggitt, David Eads, Daniel Smith, Dr. Stefan Schimanski, Leila Jalali and everyone who contributed to CRD Validation Rules!

By Cici Huang – Software Engineer

Kubernetes CRD Validation Using CEL

Motivation

CRDs was used to support two major categories of built-in validation:

  • CRD structural schemas: Provide type checking of custom resources against schemas.
  • OpenAPIv3 validation rules: Provide regex ('pattern' property), range limits ('minimum' and 'maximum' properties) on individual fields and size limits on maps and lists ('minItems', 'maxItems').

For use cases that cannot be covered by build-in validation support:

  • Admission Webhooks: have validating admission webhook for further validation
  • Custom validators: write custom checks in several languages such as Rego

While admission webhooks do support CRDs validation, they significantly complicate the development and operability of CRDs.

To provide an self-contained, in-process validation, an inline expression language - Common Expression Language (CEL), is introduced into CRDs such that a much larger portion of validation use cases can be solved without the use of webhooks.

It is sufficiently lightweight and safe to be run directly in the kube-apiserver, has a straight-forward and unsurprising grammar, and supports pre-parsing and typechecking of expressions, allowing syntax and type errors to be caught at CRD registration time.


CRD validation rule

CRD validation rules are promoted to GA in Kubernetes 1.29 to validate custom resources based on validation rules.

Validation rules use the Common Expression Language (CEL) to validate custom resource values. Validation rules are included in CustomResourceDefinition schemas using the x-kubernetes-validations extension.

The Rule is scoped to the location of the x-kubernetes-validations extension in the schema. And self variable in the CEL expression is bound to the scoped value.

All validation rules are scoped to the current object: no cross-object or stateful validation rules are supported.

For example:

...
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        x-kubernetes-validations:
          - rule: "self.minReplicas <= self.replicas"
            message: "replicas should be greater than or equal to minReplicas."
          - rule: "self.replicas <= self.maxReplicas"
            message: "replicas should be smaller than or equal to maxReplicas."
        properties:
          ...
          minReplicas:
            type: integer
          replicas:
            type: integer
          maxReplicas:
            type: integer
        required:
          - minReplicas
          - replicas
          - maxReplicas

will reject a request to create this custom resource:

apiVersion: "stable.example.com/v1"


kind: CronTab


metadata:


 name: my-new-cron-object


spec:


 minReplicas: 0


 replicas: 20


 maxReplicas: 10

with the response:

The CronTab "my-new-cron-object" is invalid:
* spec: Invalid value: map[string]interface {}{"maxReplicas":10, "minReplicas":0, "replicas":20}: replicas should be smaller than or equal to maxReplicas.

x-kubernetes-validations could have multiple rules. The rule under x-kubernetes-validations represents the expression which will be evaluated by CEL. The message represents the message displayed when validation fails.

Note: You can quickly test CEL expressions in CEL Playground.

Validation rules are compiled when CRDs are created/updated. The request of CRDs create/update will fail if compilation of validation rules fail. Compilation process includes type checking as well.

Validation rules support a wide range of use cases. To get a sense of some of the capabilities, let's look at a few examples:

Validation Rule

Purpose

self.minReplicas <= self.replicas

Validate an integer field is less than or equal to another integer field

'Available' in self.stateCounts

Validate an entry with the 'Available' key exists in a map

self.set1.all(e, !(e in self.set2))

Validate that the elements of two sets are disjoint

self == oldSelf

Validate that a required field is immutable once it is set

self.created + self.ttl < self.expired

Validate that 'expired' date is after a 'create' date plus a 'ttl' duration

Validation rules are expressive and flexible. See the Validation Rules documentation to learn more about what validation rules are capable of.


CRD transition rules

Transition Rules make it possible to compare the new state against the old state of a resource in validation rules. You use transition rules to make sure that the cluster's API server does not accept invalid state transitions. A transition rule is a validation rule that references 'oldSelf'. The API server only evaluates transition rules when both an old value and new value exist.

Transition rule examples:

Transition Rule

Purpose

self == oldSelf

For a required field, make that field immutable once it is set. For an optional field, only allow transitioning from unset to set, or from set to unset.

(on parent of field) has(self.field) == has(oldSelf.field)

on field: self == oldSelf

Make a field immutable: validate that a field, even if optional, never changes after the resource is created (for a required field, the previous rule is simpler).

self.all(x, x in oldSelf)

Only allow adding items to a field that represents a set (prevent removals).

self >= oldSelf

Validate that a number is monotonically increasing.

Using the Functions Libraries

Validation rules have access to a couple different function libraries:

Examples of function libraries in use:

Validation Rule

Purpose

!(self.getDayOfWeek() in [0, 6]

Validate that a date is not a Sunday or Saturday.

isUrl(self) && url(self).getHostname() in [a.example.com', 'b.example.com']

Validate that a URL has an allowed hostname.

self.map(x, x.weight).sum() == 1

Validate that the weights of a list of objects sum to 1.

int(self.find('^[0-9]*')) < 100

Validate that a string starts with a number less than 100

self.isSorted()

Validate that a list is sorted

Resource use and limits

To prevent CEL evaluation from consuming excessive compute resources, validation rules impose some limits. These limits are based on CEL cost units, a platform and machine independent measure of execution cost. As a result, the limits are the same regardless of where they are enforced.


Estimated cost limit

CEL is, by design, non-Turing-complete in such a way that the halting problem isn’t a concern. CEL takes advantage of this design choice to include an "estimated cost" subsystem that can statically compute the worst case run time cost of any CEL expression. Validation rules are integrated with the estimated cost system and disallow CEL expressions from being included in CRDs if they have a sufficiently poor (high) estimated cost. The estimated cost limit is set quite high and typically requires an O(n2) or worse operation, across something of unbounded size, to be exceeded. Fortunately the fix is usually quite simple: because the cost system is aware of size limits declared in the CRD's schema, CRD authors can add size limits to the CRD's schema (maxItems for arrays, maxProperties for maps, maxLength for strings) to reduce the estimated cost.

Good practice:

Set maxItems, maxProperties and maxLength on all array, map (object with additionalProperties) and string types in CRD schemas! This results in lower and more accurate estimated costs and generally makes a CRD safer to use.


Runtime cost limits for CRD validation rules

In addition to the estimated cost limit, CEL keeps track of actual cost while evaluating a CEL expression and will halt execution of the expression if a limit is exceeded.

With the estimated cost limit already in place, the runtime cost limit is rarely encountered. But it is possible. For example, it might be encountered for a large resource composed entirely of a single large list and a validation rule that is either evaluated on each element in the list, or traverses the entire list.

CRD authors can ensure the runtime cost limit will not be exceeded in much the same way the estimated cost limit is avoided: by setting maxItems, maxProperties and maxLength on array, map and string types.


Adoption and Related work

This feature has been turned on by default since Kubernetes 1.25 and finally graduated to GA in Kubernetes 1.29. It raised a lot of interest and is now widely used in the Kubernetes ecosystem. We are excited to share that the Gateway API was able to replace all the validating webhook previously used with this feature.

After CEL was introduced into Kubernetes, we are excited to expand the power to multiple areas including the Admission Chain and authorization config. We will have a separate blog to introduce further.

We look forward to working with the community on the adoption of CRD Validation Rules, and hope to see this feature promoted to general availability in upcoming Kubernetes releases.


Acknowledgements

Special thanks to Joe Betz, Kermit Alexander, Ben Luddy, Jordan Liggitt, David Eads, Daniel Smith, Dr. Stefan Schimanski, Leila Jalali and everyone who contributed to CRD Validation Rules!

By Cici Huang – Software Engineer

Introducing the Data Validation Tool

Data validation is a crucial step in data warehouse, database, or data lake migration projects. It involves comparing structured or semi-structured data from the source and target tables and verifying that they match after each migration step (e.g data and schema migration, SQL script translation, ETL migration, etc.)
 
Today, we are excited to announce the Data Validation Tool (DVT), an open sourced Python CLI tool that provides an automated and repeatable solution for validation across different environments. The tool uses the Ibis framework to connect to a large number of data sources including BigQuery, Cloud Spanner, Cloud SQL, Teradata, and more.

DVT connections

Why DVT?

Cross-platform data validation is a non-trivial and time-consuming effort, and many customers have to build and maintain a custom solution to perform such tasks. The DVT provides a standardized solution to validate customer's newly migrated data in Google Cloud against the existing data from their on-prem systems. DVT can be integrated with existing enterprise infrastructure and ETL pipelines to provide a seamless and automated validation.

Solution

The DVT provides connectivity to BigQuery, Cloud SQL, and Spanner as well as third-party database products and file systems. In addition, it can be easily integrated into other Google Cloud services such as Cloud Composer, Cloud Functions, and Cloud Run. DVT supports the following connection types:
  • BigQuery
  • Cloud SQL
  • FileSystem (GCS, S3, or local files)
  • Hive
  • Impala
  • MySQL
  • Oracle
  • Postgres
  • Redshift
  • Snowflake
  • Spanner
  • SQL Server
  • Teradata
The DVT performs multi-leveled data validation functions, from the table level all the way to the row level. Below is a list of the validation features:

Table level
  • Table row count
  • Group by row count
  • Column aggregation
  • Filters and limits
Column level
  • Schema/Column data type
Row level
  • Hash comparison (BigQuery only)
Raw SQL exploration
  • Run custom queries on different data sources

How to Use the DVT

The first step to validating your data is creating a connection. You can create a connection to any of the data sources listed previously. Here’s an example of connecting to BigQuery:

data-validation connections add --connection-name $MY_BQ_CONN BigQuery --project-id $MY_GCP_PROJECT

Now you can run a validation. This is how a validation between a BigQuery and a MySQL table would look:

data-validation run \

--type Column \

--source-conn $MY_BQ_CONN --target-conn $MY_SQL_CONN \

--tables-list project.dataset.source_table=database.target_table

The default validation if no aggregation is provided is a COUNT *. The tool will count the number of columns in the source table and verify it matches the count on the target table.

The DVT supports a lot of customization while validating your data. For example, you can validate multiple tables, run validations on specific columns, and add labels to your validations.

data-validation run \

--type Column \

--source-conn $MY_BQ_CONN --target-conn $MY_BQ_CONN \

--tables-list project.dataset.source_a=project.dataset.target_a,\ project.dataset.source_b=project.dataset.target_b \

--sum total_cost,revenue \

--labels owner=Mandy,description=‘finance tables’

You can also save your validation to a YAML configuration file. This way you can store previous validations and modify your validation configuration. By providing the `config-file` flag, you can generate the YAML file. Note that the validation will not execute when this flag is provided - only the file is created.

data-validation run \

--type Column \

--source-conn $MY_BQ_CONN --target-conn $MY_BQ_CONN \

-tables-list bigquery-public-data.new_york_citibike.citibike_trips \

-c citibike.yaml

Here is an example of a YAML configuration file for a GroupedColumn validation:

​​result_handler:
project_id: my-project-id
table_id: pso_data_validator.results
type: BigQuery
source: my_bq_conn
target: my_bq_conn
validations:
- aggregates:
- field_alias: count
  source_column: null
  target_column: null
  type: count
- field_alias: sum__num_bikes_available
  source_column: num_bikes_available
  target_column: num_bikes_available
  type: sum
- field_alias: sum__num_docks_available
  source_column: num_docks_available
  target_column: num_docks_available
  type: sum
filters:
- source: region_id=71
  target: region_id=71
  type: custom
grouped_columns:
- cast: null
  field_alias: region_id
  source_column: region_id
  target_column: region_id
labels:
- !!python/tuple
  - description
  - test
schema_name: bigquery-public-data.new_york_citibike
table_name: citibike_stations
target_schema_name: bigquery-public-data.new_york_citibike
target_table_name: citibike_stations
threshold: 0.0
type: GroupedColumn

Once you have a YAML configuration file, it is very easy to run the validation.

data-validation run-config -c citibike.yaml

Validation reports can be output to stdout (default) or to a result handler. The tool currently supports BigQuery as the result handler.

In order to output to BigQuery, simply add the`--bq-result-handler` or `-bqrh` flag.

data-validation run \

--type Column \

--source-conn my_bq_conn --target-conn my_bq_conn \

--tables-list bigquery-public-data.new_york_citibike.citibike_trips \

--count tripduration,start_station_name \

--bq-result-handler $YOUR_PROJECT_ID.pso_data_validator.results

Below is an example of the validation results in BigQuery. View the complete schema for validation reports in BigQuery here.

BigQuery result set

Getting Started

Ready to start integrating the DVT into your data movement processes? Check out the tool on PyPi here and contribute to the tool via GitHub. We’re actively working on new features to make the tool as useful to our customers. Happy validating!

By Neha Nene and James Fu – PSO Data Team