|
129 | 129 | "source": [ |
130 | 130 | "## Start Spark\n", |
131 | 131 | "\n", |
132 | | - "This notebook only sets notebook-local values that are safe to make visible.\n", |
| 132 | + "Configuration choices used here:\n", |
133 | 133 | "\n", |
134 | | - "Catalog wiring, Iceberg extensions, REST catalog endpoints, and object-storage access are assumed to be injected outside the notebook.\n" |
| 134 | + "- `org.apache.iceberg.spark.SparkCatalog` is used for Iceberg catalog integration in Spark.\n", |
| 135 | + "- Iceberg REST catalog settings are already preconfigured to point to Apache Polaris.\n", |
| 136 | + "- `header.X-Iceberg-Access-Delegation=vended-credentials` is enabled so Polaris can delegate storage access for Iceberg-managed tables.\n", |
| 137 | + "- Global `s3a` settings are already configured for Bronze raw Parquet reads.\n", |
| 138 | + "- Iceberg Spark SQL extensions are enabled explicitly.\n", |
| 139 | + "\n", |
| 140 | + "Spark catalogs are configured under `spark.sql.catalog.<catalog_name>`.\n", |
| 141 | + "\n", |
| 142 | + "For authentication, this example uses the `access_token()` notebook helper by default.\n", |
| 143 | + "This helper retrieves a fresh access token for the current user from the JupyterHub auth state and passes it to Polaris as a bearer token.\n", |
| 144 | + "\n", |
| 145 | + "The lines below are commented out on purpose:\n", |
| 146 | + "\n", |
| 147 | + "`# .config(f\"spark.sql.catalog.{SILVER_CATALOG}.credential\", polaris_oauth2_credential)` <br/>\n", |
| 148 | + "`# .config(f\"spark.sql.catalog.{GOLD_CATALOG}.credential\", polaris_oauth2_credential)`\n", |
| 149 | + "\n", |
| 150 | + "If needed, you can uncomment them to switch from end-user token authentication to a static OAuth2 client credential (`client_id:client_secret`) for service-to-service or ETL-style execution.\n", |
| 151 | + "\n", |
| 152 | + "The following setting is enabled by default for each catalog:\n", |
| 153 | + "\n", |
| 154 | + "` .config(f\"spark.sql.catalog.{CATALOG_NAME}.token\", access_token())`\n", |
| 155 | + "\n", |
| 156 | + "This means Spark connects to Polaris with the token of the currently authenticated notebook user instead of the credentials of a fixed technical client.\n" |
135 | 157 | ] |
136 | 158 | }, |
137 | 159 | { |
|
141 | 163 | "metadata": {}, |
142 | 164 | "outputs": [], |
143 | 165 | "source": [ |
144 | | - "polaris_credential = \"my-polaris-spark-etl-app:mySparkAppSecret\"\n", |
| 166 | + "polaris_oauth2_credential = \"my-polaris-spark-etl-app:mySparkAppSecret\"\n", |
145 | 167 | "\n", |
146 | 168 | "spark = (\n", |
147 | 169 | " SparkSession.builder\n", |
148 | 170 | " .appName(\"NYC Tripdata - Gold - Monthly Stats - PySpark\")\n", |
149 | | - " .config(f\"spark.sql.catalog.{SILVER_CATALOG}.credential\", polaris_credential)\n", |
150 | | - " .config(f\"spark.sql.catalog.{GOLD_CATALOG}.credential\", polaris_credential)\n", |
| 171 | + " #.config(f\"spark.sql.catalog.{SILVER_CATALOG}.credential\", polaris_oauth2_credential)\n", |
| 172 | + " #.config(f\"spark.sql.catalog.{GOLD_CATALOG}.credential\", polaris_oauth2_credential)\n", |
| 173 | + " .config(f\"spark.sql.catalog.{SILVER_CATALOG}.token\", access_token())\n", |
| 174 | + " .config(f\"spark.sql.catalog.{GOLD_CATALOG}.token\", access_token())\n", |
| 175 | + " .config(f\"spark.sql.catalog.{SILVER_CATALOG}.token-refresh-enabled\", \"false\")\n", |
| 176 | + " .config(f\"spark.sql.catalog.{GOLD_CATALOG}.token-refresh-enabled\", \"false\")\n", |
151 | 177 | " .config(\"spark.executor.memory\", \"2g\")\n", |
152 | 178 | " .config(\"spark.executor.memoryOverhead\", \"640m\")\n", |
153 | 179 | " .config(\"spark.executor.cores\", 1)\n", |
|
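The token-vs-credential toggle in the diff above can be sketched as a small helper that emits the per-catalog Spark settings for either mode. This is an illustrative sketch, not code from the notebook: the helper name `catalog_auth_config` is invented here, and the `access_token` argument stands in for the value returned by the notebook's JupyterHub helper.

```python
def catalog_auth_config(catalog_name, mode, access_token=None, oauth2_credential=None):
    """Return the spark.sql.catalog.<catalog_name> auth settings for one catalog.

    mode="token"      -> per-user bearer token (interactive notebook use)
    mode="credential" -> static OAuth2 client_id:client_secret (service/ETL use)
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    if mode == "token":
        return {
            f"{prefix}.token": access_token,
            # The token is managed by JupyterHub, so the Iceberg REST client
            # should not attempt to refresh it itself.
            f"{prefix}.token-refresh-enabled": "false",
        }
    if mode == "credential":
        return {f"{prefix}.credential": oauth2_credential}
    raise ValueError(f"unknown auth mode: {mode!r}")
```

The resulting dict can be fed to `SparkSession.builder.config(...)` one pair at a time, matching the builder chain shown in the diff; switching an environment from interactive to ETL execution then only changes the `mode` argument rather than the builder code.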