|
129 | 129 | "source": [ |
130 | 130 | "## Start Spark\n", |
131 | 131 | "\n", |
132 | | - "This notebook only sets notebook-local values that are safe to make visible.\n", |
| 132 | + "Configuration choices used here:\n", |
133 | 133 | "\n", |
134 | | - "Catalog wiring, Iceberg extensions, REST catalog endpoints, and object-storage access are assumed to be injected outside the notebook.\n" |
| 134 | + "- `org.apache.iceberg.spark.SparkCatalog` is used for Iceberg catalog integration in Spark.\n", |
| 135 | + "- Iceberg REST catalog settings are already preconfigured to point to Apache Polaris.\n", |
| 136 | + "- `header.X-Iceberg-Access-Delegation=vended-credentials` is enabled so Polaris can delegate storage access for Iceberg-managed tables.\n", |
| 137 | + "- Global `s3a` settings are already configured for Bronze raw Parquet reads.\n", |
| 138 | + "- Iceberg Spark SQL extensions are enabled explicitly.\n", |
| 139 | + "\n", |
| 140 | + "Spark catalogs are configured under `spark.sql.catalog.<catalog_name>`.\n", |
| 141 | + "\n", |
| 142 | + "For authentication, this example uses the `access_token()` notebook helper by default.\n", |
| 143 | + "This helper retrieves a fresh access token for the current user from the JupyterHub auth state and passes it to Polaris as a bearer token.\n", |
| 144 | + "\n", |
| 145 | + "The lines below are commented out on purpose:\n", |
| 146 | + "\n", |
| 147 | + "`# .config(f\"spark.sql.catalog.{SILVER_CATALOG}.credential\", polaris_oauth2_credential)` <br/>\n", |
| 148 | + "`# .config(f\"spark.sql.catalog.{GOLD_CATALOG}.credential\", polaris_oauth2_credential)`\n", |
| 149 | + "\n", |
| 150 | + "If needed, you can uncomment them to switch from end-user token authentication to a static OAuth2 client credential (`client_id:client_secret`) for service-to-service or ETL-style execution.\n", |
| 151 | + "\n", |
| 152 | + "The following setting is enabled by default for each catalog:\n", |
| 153 | + "\n", |
| 154 | + "` .config(f\"spark.sql.catalog.{CATALOG_NAME}.token\", access_token())`\n", |
| 155 | + "\n", |
| 156 | + "This means Spark connects to Polaris with the token of the currently authenticated notebook user instead of the credentials of a fixed technical client.\n" |
135 | 157 | ] |
136 | 158 | }, |
137 | 159 | { |
|
141 | 163 | "metadata": {}, |
142 | 164 | "outputs": [], |
143 | 165 | "source": [ |
144 | | - "polaris_credential = \"my-polaris-spark-etl-app:mySparkAppSecret\"\n", |
| 166 | + "polaris_oauth2_credential = \"my-polaris-spark-etl-app:mySparkAppSecret\"\n", |
145 | 167 | "\n", |
146 | 168 | "spark = (\n", |
147 | 169 | " SparkSession.builder\n", |
148 | 170 | " .appName(\"NYC Tripdata - Gold - Monthly Stats - PySpark\")\n", |
149 | | - " .config(f\"spark.sql.catalog.{SILVER_CATALOG}.credential\", polaris_credential)\n", |
150 | | - " .config(f\"spark.sql.catalog.{GOLD_CATALOG}.credential\", polaris_credential)\n", |
| 171 | + " #.config(f\"spark.sql.catalog.{SILVER_CATALOG}.credential\", polaris_oauth2_credential)\n", |
| 172 | + " #.config(f\"spark.sql.catalog.{GOLD_CATALOG}.credential\", polaris_oauth2_credential)\n", |
| 173 | + " .config(f\"spark.sql.catalog.{SILVER_CATALOG}.token\", access_token())\n", |
| 174 | + " .config(f\"spark.sql.catalog.{GOLD_CATALOG}.token\", access_token())\n", |
| 175 | + " .config(f\"spark.sql.catalog.{SILVER_CATALOG}.token-refresh-enabled\", \"false\")\n", |
| 176 | + " .config(f\"spark.sql.catalog.{GOLD_CATALOG}.token-refresh-enabled\", \"false\")\n", |
151 | 177 | " .config(\"spark.executor.memory\", \"2g\")\n", |
152 | 178 | " .config(\"spark.executor.memoryOverhead\", \"640m\")\n", |
153 | 179 | " .config(\"spark.executor.cores\", 1)\n", |
|
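The token-vs-credential toggle in the diff above can be sketched as a small helper that emits the per-catalog Spark settings for either mode. This is an illustrative sketch, not code from the notebook: the helper name `catalog_auth_config` is invented here, and the `access_token` argument stands in for the value returned by the notebook's JupyterHub helper.

```python
def catalog_auth_config(catalog_name, mode, access_token=None, oauth2_credential=None):
    """Return the spark.sql.catalog.<catalog_name> auth settings for one catalog.

    mode="token"      -> per-user bearer token (interactive notebook use)
    mode="credential" -> static OAuth2 client_id:client_secret (service/ETL use)
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    if mode == "token":
        return {
            f"{prefix}.token": access_token,
            # The token is managed by JupyterHub, so the Iceberg REST client
            # should not attempt to refresh it itself.
            f"{prefix}.token-refresh-enabled": "false",
        }
    if mode == "credential":
        return {f"{prefix}.credential": oauth2_credential}
    raise ValueError(f"unknown auth mode: {mode!r}")
```

The resulting dict can be fed to `SparkSession.builder.config(...)` one pair at a time, matching the builder chain shown in the diff; switching an environment from interactive to ETL execution then only changes the `mode` argument rather than the builder code.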