
Commit 5a39f0f

Merge pull request #741 from tedvilutis/main
Fabric Lakehouse Skill
2 parents 2017acd + 3b907f7 commit 5a39f0f

4 files changed

Lines changed: 332 additions & 0 deletions

File tree

docs/README.skills.md

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ Skills differ from other primitives by supporting bundled assets (scripts, code
| [copilot-sdk](../skills/copilot-sdk/SKILL.md) | Build agentic applications with GitHub Copilot SDK. Use when embedding AI agents in apps, creating custom tools, implementing streaming responses, managing sessions, connecting to MCP servers, or creating custom agents. Triggers on Copilot SDK, GitHub SDK, agentic app, embed Copilot, programmable agent, MCP server, custom agent. | None |
| [create-web-form](../skills/create-web-form/SKILL.md) | Create robust, accessible web forms with best practices for HTML structure, CSS styling, JavaScript interactivity, form validation, and server-side processing. Use when asked to "create a form", "build a web form", "add a contact form", "make a signup form", or when building any HTML form with data handling. Covers PHP and Python backends, MySQL database integration, REST APIs, XML data exchange, accessibility (ARIA), and progressive web apps. | `references/accessibility.md`<br />`references/aria-form-role.md`<br />`references/css-styling.md`<br />`references/form-basics.md`<br />`references/form-controls.md`<br />`references/form-data-handling.md`<br />`references/html-form-elements.md`<br />`references/html-form-example.md`<br />`references/hypertext-transfer-protocol.md`<br />`references/javascript.md`<br />`references/php-cookies.md`<br />`references/php-forms.md`<br />`references/php-json.md`<br />`references/php-mysql-database.md`<br />`references/progressive-web-app.md`<br />`references/python-as-web-framework.md`<br />`references/python-contact-form.md`<br />`references/python-flask-app.md`<br />`references/python-flask.md`<br />`references/security.md`<br />`references/styling-web-forms.md`<br />`references/web-api.md`<br />`references/web-performance.md`<br />`references/xml.md` |
| [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates/business-flow-swimlane-template.excalidraw`<br />`templates/class-diagram-template.excalidraw`<br />`templates/data-flow-diagram-template.excalidraw`<br />`templates/er-diagram-template.excalidraw`<br />`templates/flowchart-template.excalidraw`<br />`templates/mindmap-template.excalidraw`<br />`templates/relationship-template.excalidraw`<br />`templates/sequence-diagram-template.excalidraw` |
| [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
| [finnish-humanizer](../skills/finnish-humanizer/SKILL.md) | Detect and remove AI-generated markers from Finnish text, making it sound like a native Finnish speaker wrote it. Use when asked to "humanize", "naturalize", or "remove AI feel" from Finnish text, or when editing .md/.txt files containing Finnish content. Identifies 26 patterns (12 Finnish-specific + 14 universal) and 4 style markers. | `references/patterns.md` |
| [fluentui-blazor](../skills/fluentui-blazor/SKILL.md) | Guide for using the Microsoft Fluent UI Blazor component library (Microsoft.FluentUI.AspNetCore.Components NuGet package) in Blazor applications. Use this when the user is building a Blazor app with Fluent UI components, setting up the library, using FluentUI components like FluentButton, FluentDataGrid, FluentDialog, FluentToast, FluentNavMenu, FluentTextField, FluentSelect, FluentAutocomplete, FluentDesignTheme, or any component prefixed with "Fluent". Also use when troubleshooting missing providers, JS interop issues, or theming. | `references/DATAGRID.md`<br />`references/LAYOUT-AND-NAVIGATION.md`<br />`references/SETUP.md`<br />`references/THEMING.md` |
| [gh-cli](../skills/gh-cli/SKILL.md) | GitHub CLI (gh) comprehensive reference for repositories, issues, pull requests, Actions, projects, releases, gists, codespaces, organizations, extensions, and all GitHub operations from the command line. | None |

skills/fabric-lakehouse/SKILL.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
---
name: fabric-lakehouse
description: 'Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices.'
metadata:
  author: tedvilutis
  version: "1.0"
---

# When to Use This Skill

Use this skill when you need to:
- Generate a document or explanation that includes definitions and context about Fabric Lakehouse and its capabilities.
- Design, build, and optimize Lakehouse solutions using best practices.
- Understand the core concepts and components of a Lakehouse in Microsoft Fabric.
- Learn how to manage tabular and non-tabular data within a Lakehouse.

# Fabric Lakehouse

## Core Concepts

### What is a Lakehouse?

A Lakehouse in Microsoft Fabric is an item that gives users a single place to store their tabular data (tables) and non-tabular data (files). It combines the flexibility of a data lake with the management capabilities of a data warehouse. It provides:

- **Unified storage** in OneLake for structured and unstructured data
- **Delta Lake format** for ACID transactions, versioning, and time travel
- **SQL analytics endpoint** for T-SQL queries
- **Semantic model** for Power BI integration
- Support for other table formats such as CSV and Parquet
- Support for any file format
- Tools for table optimization and data management

### Key Components

- **Delta Tables**: Managed tables with ACID compliance and schema enforcement
- **Files**: Unstructured/semi-structured data in the Files section
- **SQL Endpoint**: Auto-generated read-only SQL interface for querying
- **Shortcuts**: Virtual links to external/internal data without copying
- **Fabric Materialized Views**: Pre-computed tables for fast query performance

### Tabular data in a Lakehouse

Tabular data is stored as tables under the "Tables" folder. The main table format in a Lakehouse is Delta. A Lakehouse can also store tabular data in other formats, such as CSV or Parquet, but those formats can only be queried with Spark.
Tables can be internal, where the data is stored under the "Tables" folder, or external, where only a reference to the table is stored under the "Tables" folder and the data itself lives in the referenced location. Tables are referenced through shortcuts, which can be internal (pointing to another location in Fabric) or external (pointing to data stored outside of Fabric).

### Schemas for tables in a Lakehouse

When creating a lakehouse, users can choose to enable schemas. Schemas organize Lakehouse tables: each schema is implemented as a folder under the "Tables" folder, and tables are stored inside that folder. The default schema is "dbo"; it cannot be deleted or renamed. All other schemas are optional and can be created, renamed, or deleted. Users can reference a schema located in another lakehouse using a schema shortcut, thereby referencing all tables in the destination schema with a single shortcut.

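The layout above can be exercised from Spark SQL. A minimal sketch, assuming a schema-enabled lakehouse (the `sales` schema, table, and column names are illustrative):

```sql
%%sql
-- Create an optional schema alongside the default "dbo"
CREATE SCHEMA IF NOT EXISTS sales;

-- Create a Delta table inside that schema
CREATE TABLE IF NOT EXISTS sales.orders (
    order_id INT,
    customer_id INT,
    amount DECIMAL(18, 2)
);

-- Query with a schema-qualified name
SELECT * FROM sales.orders LIMIT 10;
```
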
### Files in a Lakehouse

Files are stored under the "Files" folder. Users can create folders and subfolders to organize their files. Any file format can be stored in a Lakehouse.

### Fabric Materialized Views

Fabric materialized views are pre-computed tables that are automatically refreshed on a schedule. They provide fast query performance for complex aggregations and joins. Materialized views are defined using PySpark or Spark SQL and stored in an associated notebook.

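As an illustrative sketch only (the view and source table names are hypothetical, and the exact DDL for Fabric materialized views should be checked against current Fabric documentation), a definition might look like:

```sql
%%sql
-- Hypothetical example: a pre-computed daily aggregate
CREATE MATERIALIZED LAKE VIEW IF NOT EXISTS gold_daily_sales
AS
SELECT
    order_date,
    COUNT(*) as order_count,
    SUM(amount) as total_amount
FROM silver_transactions
GROUP BY order_date;
```
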
### Spark Views

Spark views are logical tables defined by a SQL query. They do not store data but provide a virtual layer for querying. Views are defined using Spark SQL and stored in the Lakehouse next to tables.

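A minimal Spark SQL sketch (table and column names are illustrative); only the defining query is stored, and it runs at read time:

```sql
%%sql
-- The view holds no data of its own
CREATE OR REPLACE VIEW active_customers AS
SELECT customer_id, name, email
FROM silver_customers
WHERE status = 'active';
```
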
## Security

### Item access or control plane security

Users can have workspace roles (Admin, Member, Contributor, Viewer) that provide different levels of access to a Lakehouse and its contents. Users can also be granted access through the Lakehouse sharing capabilities.

### Data access or OneLake security

For data access, use the OneLake security model, which is based on Microsoft Entra ID (formerly Azure Active Directory) and role-based access control (RBAC). Lakehouse data is stored in OneLake, so access to data is controlled through OneLake permissions. In addition to object-level permissions, a Lakehouse also supports column-level and row-level security for tables, allowing fine-grained control over who can see specific columns or rows in a table.

## Lakehouse Shortcuts

Shortcuts create virtual links to data without copying it:

### Types of Shortcuts

- **Internal**: Link to other Fabric Lakehouses/tables; cross-workspace data sharing
- **ADLS Gen2**: Link to ADLS Gen2 containers in Azure
- **Amazon S3**: AWS S3 buckets; cross-cloud data access
- **Dataverse**: Microsoft Dataverse; business application data
- **Google Cloud Storage**: GCS buckets; cross-cloud data access

## Performance Optimization

### V-Order Optimization

For faster reads with the semantic model, enable V-Order optimization on Delta tables. V-Order presorts data in a way that improves query performance for common access patterns.

### Table Optimization

Tables can also be optimized using the OPTIMIZE command, which compacts small files into larger ones and can also apply Z-ordering to improve query performance on specific columns. Regular optimization helps maintain performance as data is ingested and updated over time. The VACUUM command can be used to clean up old files and free up storage space, especially after updates and deletes.

## Lineage

The Lakehouse item supports lineage, which allows users to track the origin and transformations of data. Lineage information is automatically captured for tables and files in the Lakehouse, showing how data flows from source to destination. This helps with debugging, auditing, and understanding data dependencies.

## PySpark Code Examples

See [PySpark code](references/pyspark.md) for details.

## Getting data into Lakehouse

See [Get data](references/getdata.md) for details.

skills/fabric-lakehouse/references/getdata.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
### Data Factory Integration

Microsoft Fabric includes Data Factory for ETL/ELT orchestration:

- **180+ connectors** for data sources
- **Copy activity** for data movement
- **Dataflow Gen2** for transformations
- **Notebook activity** for Spark processing
- **Scheduling** and triggers

### Pipeline Activities

| Activity | Description |
|----------|-------------|
| Copy Data | Move data between sources and Lakehouse |
| Notebook | Execute Spark notebooks |
| Dataflow | Run Dataflow Gen2 transformations |
| Stored Procedure | Execute SQL procedures |
| ForEach | Loop over items |
| If Condition | Conditional branching |
| Get Metadata | Retrieve file/folder metadata |
| Lakehouse Maintenance | Optimize and vacuum Delta tables |

### Orchestration Patterns

```
Pipeline: Daily_ETL_Pipeline
├── Get Metadata (check for new files)
├── ForEach (process each file)
│   ├── Copy Data (bronze layer)
│   └── Notebook (silver transformation)
├── Notebook (gold aggregation)
└── Lakehouse Maintenance (optimize tables)
```

---
skills/fabric-lakehouse/references/pyspark.md

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
### Spark Configuration (Best Practices)

```python
# Enable Fabric optimizations
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
```

### Reading Data

```python
# Read CSV file
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/bronze/data.csv")

# Read JSON file
df = spark.read.format("json").load("Files/bronze/data.json")

# Read Parquet file
df = spark.read.format("parquet").load("Files/bronze/data.parquet")

# Read Delta table
df = spark.read.table("my_delta_table")

# Query a Lakehouse table with Spark SQL
df = spark.sql("SELECT * FROM lakehouse.my_table")
```

### Writing Delta Tables

```python
# Write DataFrame as managed Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("silver_customers")

# Write with partitioning
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .saveAsTable("silver_transactions")

# Append to existing table
df.write.format("delta") \
    .mode("append") \
    .saveAsTable("silver_events")
```

### Delta Table Operations (CRUD)

```python
# UPDATE
spark.sql("""
    UPDATE silver_customers
    SET status = 'active'
    WHERE last_login > '2024-01-01'  -- Example date, adjust as needed
""")

# DELETE
spark.sql("""
    DELETE FROM silver_customers
    WHERE is_deleted = true
""")

# MERGE (Upsert)
spark.sql("""
    MERGE INTO silver_customers AS target
    USING staging_customers AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

### Schema Definition

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DecimalType

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("amount", DecimalType(18, 2), True),
    StructField("created_at", TimestampType(), True)
])

df = spark.read.format("csv") \
    .schema(schema) \
    .option("header", "true") \
    .load("Files/bronze/customers.csv")
```

### SQL Magic in Notebooks

```sql
%%sql
-- Query Delta table directly
SELECT
    customer_id,
    COUNT(*) as order_count,
    SUM(amount) as total_amount
FROM gold_orders
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10
```

### V-Order Optimization

```python
# Enable V-Order for read optimization
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
```

### Table Optimization

```sql
%%sql
-- Optimize table (compact small files)
OPTIMIZE silver_transactions;

-- Optimize with Z-ordering on query columns
OPTIMIZE silver_transactions ZORDER BY (customer_id, transaction_date);

-- Vacuum old files (default 7 days retention)
VACUUM silver_transactions;

-- Vacuum with custom retention
VACUUM silver_transactions RETAIN 168 HOURS;
```

### Incremental Load Pattern

```python
from pyspark.sql.functions import col

# Get last processed watermark
last_watermark = spark.sql("""
    SELECT MAX(processed_timestamp) as watermark
    FROM silver_orders
""").collect()[0]["watermark"]

# Load only new records
new_records = spark.read.format("delta") \
    .table("bronze_orders") \
    .filter(col("created_at") > last_watermark)

# Merge new records
new_records.createOrReplaceTempView("staging_orders")
spark.sql("""
    MERGE INTO silver_orders AS target
    USING staging_orders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

### SCD Type 2 Pattern

```python
# Close existing records
spark.sql("""
    UPDATE dim_customer
    SET is_current = false, end_date = current_timestamp()
    WHERE customer_id IN (SELECT customer_id FROM staging_customer)
      AND is_current = true
""")

# Insert new versions
spark.sql("""
    INSERT INTO dim_customer
    SELECT
        customer_id,
        name,
        email,
        address,
        current_timestamp() as start_date,
        null as end_date,
        true as is_current
    FROM staging_customer
""")
```
