|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "Inferring JSON Structure Schemas from Your Data with Structurize" |
| 4 | +date: 2026-02-04 |
| 5 | +author: Clemens Vasters |
| 6 | +--- |
| 7 | + |
| 8 | +# Inferring JSON Structure Schemas from Your Data with Structurize |
| 9 | + |
| 10 | +If you have JSON data and need a schema for it, the new `json2s` command in |
| 11 | +**Structurize** can help. It analyzes your JSON files, figures out what types |
| 12 | +you're working with, and produces a valid JSON Structure schema document — one |
| 13 | +that you can use for validation, code generation, or documentation. |
| 14 | + |
| 15 | +## What is Structurize? |
| 16 | + |
| 17 | +[Structurize](https://github.com/clemensv/avrotize) is a schema conversion |
| 18 | +toolkit that transforms between various schema formats: JSON Schema, JSON |
| 19 | +Structure, Avro Schema, Protocol Buffers, XSD, and more. It also generates code |
| 20 | +in multiple languages (C#, Python, TypeScript, Java, Go, Rust, C++) and exports |
| 21 | +schemas to SQL databases and data formats like Parquet and Iceberg. |
| 22 | + |
| 23 | +The tool ships under two package names — `structurize` and `avrotize` — both |
| 24 | +sharing the same codebase. Choose whichever aligns with your primary use case. |
| 25 | +JSON Structure users will likely prefer `structurize`. |
| 26 | + |
| 27 | +Install it with: |
| 28 | + |
| 29 | +```bash |
| 30 | +pip install structurize |
| 31 | +``` |
| 32 | + |
| 33 | +## The json2s Command: Schema Inference from JSON |
| 34 | + |
| 35 | +The `json2s` command reads one or more JSON files and infers a JSON Structure |
| 36 | +schema from them. It handles single JSON objects, JSON arrays, and JSONL |
| 37 | +(newline-delimited JSON) files. |
| 38 | + |
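The three input shapes can be told apart with a few lines of Python. This is a minimal sketch, not Structurize's actual loader: `load_records` is a hypothetical helper, and falling back to line-by-line parsing when the text is not a single JSON value is an assumption made for illustration.

```python
import json
from typing import Any, Iterator


def load_records(text: str) -> Iterator[Any]:
    """Yield the individual documents from JSON, JSON-array, or JSONL text.

    Hypothetical helper for illustration only.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON value: treat it as JSONL, one document per line.
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
        return
    if isinstance(parsed, list):
        # A top-level array contributes one record per element.
        yield from parsed
    else:
        # A single JSON value (typically an object) is one record.
        yield parsed
```

Every shape ends up as a flat stream of records, so the later inference steps never need to know which file format the data came from.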
| 39 | +### Basic Usage |
| 40 | + |
| 41 | +```bash |
| 42 | +structurize json2s data.json --out schema.jstruct.json --type-name MyType |
| 43 | +``` |
| 44 | + |
| 45 | +Parameters: |
| 46 | + |
| 47 | +- `<json_files...>` — One or more JSON files to analyze |
| 48 | +- `--out` — Output path for the JSON Structure schema (stdout if omitted) |
| 49 | +- `--type-name` — Name for the root type (default: "Document") |
| 50 | +- `--base-id` — Base URI for `$id` generation (default: "https://example.com/") |
| 51 | +- `--sample-size` — Maximum records to sample (0 = all, default: 0) |
| 52 | +- `--infer-choices` — Detect discriminated unions (more on this below) |
| 53 | + |
| 54 | +### Multiple Files and JSONL |
| 55 | + |
| 56 | +The command accepts multiple input files, merging their structures into a |
| 57 | +unified schema. This is useful when your data is split across files or when |
| 58 | +you want to analyze several examples together. |
| 59 | + |
| 60 | +JSONL files (one JSON object per line) are first-class citizens. The inferrer |
| 61 | +reads each line as a separate document and consolidates their structures. |
| 62 | + |
| 63 | +```bash |
| 64 | +# Multiple JSON files |
| 65 | +structurize json2s orders.json users.json events.json --out unified.jstruct.json |
| 66 | + |
| 67 | +# A JSONL file with many records |
| 68 | +structurize json2s events.jsonl --out events.jstruct.json --type-name DomainEvent |
| 69 | +``` |
| 70 | + |
| 71 | +## Detecting Discriminated Unions with `--infer-choices` |
| 72 | + |
| 73 | +Here's where things get interesting. Many event-driven systems, APIs, and |
| 74 | +message formats use **discriminated unions**: a single field (often called |
| 75 | +`type`, `kind`, or `event_type`) determines which variant of a structure you're |
| 76 | +dealing with. |
| 77 | + |
| 78 | +Consider this JSONL file with three event types: |
| 79 | + |
| 80 | +```jsonl |
| 81 | +{"event_type": "user_created", "user_id": "u123", "email": "alice@example.com", "created_at": "2026-02-04T10:00:00Z"} |
| 82 | +{"event_type": "user_created", "user_id": "u456", "email": "bob@example.com", "created_at": "2026-02-04T11:00:00Z"} |
| 83 | +{"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123", "total": 99.50, "items": [{"sku": "A1", "qty": 2}]} |
| 84 | +{"event_type": "order_placed", "order_id": "ord-002", "user_id": "u456", "total": 150.00, "items": [{"sku": "B2", "qty": 1}]} |
| 85 | +{"event_type": "payment_received", "payment_id": "pay-001", "order_id": "ord-001", "amount": 99.50, "method": "card"} |
| 86 | +{"event_type": "payment_received", "payment_id": "pay-002", "order_id": "ord-002", "amount": 150.00, "method": "paypal"} |
| 87 | +``` |
| 88 | + |
| 89 | +### Without `--infer-choices`: A Flat Object |
| 90 | + |
| 91 | +Running the basic inference: |
| 92 | + |
| 93 | +```bash |
| 94 | +structurize json2s events.jsonl --out events.jstruct.json --type-name DomainEvent |
| 95 | +``` |
| 96 | + |
| 97 | +Produces a single object type with all fields merged: |
| 98 | + |
| 99 | +```json |
| 100 | +{ |
| 101 | + "$schema": "https://json-structure.org/meta/core/v0/#", |
| 102 | + "$id": "https://example.com/DomainEvent", |
| 103 | + "type": "object", |
| 104 | + "name": "DomainEvent", |
| 105 | + "properties": { |
| 106 | + "event_type": { "type": "string" }, |
| 107 | + "user_id": { "type": "string" }, |
| 108 | + "email": { "type": "string" }, |
| 109 | + "created_at": { "type": "string" }, |
| 110 | + "order_id": { "type": "string" }, |
| 111 | + "total": { "type": "double" }, |
| 112 | + "items": { "type": "array", "items": { ... } }, |
| 113 | + "payment_id": { "type": "string" }, |
| 114 | + "amount": { "type": "double" }, |
| 115 | + "method": { "type": "string" } |
| 116 | + }, |
| 117 | + "required": ["event_type"] |
| 118 | +} |
| 119 | +``` |
| 120 | + |
| 121 | +This works, but it loses the structure: `email` only makes sense for |
| 122 | +`user_created` events, `items` only for `order_placed`, and so on. All fields |
| 123 | +become optional except `event_type`, which is the only one present in every |
| 124 | +record. |
| 125 | + |
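The `required` list in the flat schema follows from a simple rule: a property is required only if every record contains it. The snippet below illustrates that rule on three of the sample events. It is a sketch under that assumption, not the tool's code, and `merge_fields` is a hypothetical name.

```python
def merge_fields(records: list) -> tuple:
    """Return (all property names, required property names), both sorted."""
    all_keys = set()
    common_keys = None
    for record in records:
        keys = set(record)
        all_keys |= keys  # union: every field seen anywhere
        # intersection: fields seen in every record so far
        common_keys = keys if common_keys is None else common_keys & keys
    return sorted(all_keys), sorted(common_keys or set())


# One record per event type, taken from the sample JSONL file.
events = [
    {"event_type": "user_created", "user_id": "u123",
     "email": "alice@example.com", "created_at": "2026-02-04T10:00:00Z"},
    {"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123",
     "total": 99.50, "items": [{"sku": "A1", "qty": 2}]},
    {"event_type": "payment_received", "payment_id": "pay-001",
     "order_id": "ord-001", "amount": 99.50, "method": "card"},
]

all_fields, required = merge_fields(events)
print(len(all_fields), required)
```

Ten properties come out of the union, but only `event_type` survives the intersection, which is why everything else becomes optional.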
| 126 | +### With `--infer-choices`: An Inline Union |
| 127 | + |
| 128 | +Add the `--infer-choices` flag: |
| 129 | + |
| 130 | +```bash |
| 131 | +structurize json2s events.jsonl --infer-choices --out events.jstruct.json --type-name DomainEvent |
| 132 | +``` |
| 133 | + |
| 134 | +Now the inferrer detects that `event_type` is a **discriminator** whose values |
| 135 | +correlate with distinct field signatures. It produces a JSON Structure |
| 136 | +`choice` type — an inline union: |
| 137 | + |
| 138 | +```json |
| 139 | +{ |
| 140 | + "$schema": "https://json-structure.org/meta/core/v0/#", |
| 141 | + "$id": "https://example.com/schemas/DomainEvent", |
| 142 | + "type": "choice", |
| 143 | + "name": "DomainEvent", |
| 144 | + "$extends": "#/definitions/DomainEventBase", |
| 145 | + "selector": "event_type", |
| 146 | + "choices": { |
| 147 | + "order_placed": { "type": { "$ref": "#/definitions/OrderPlaced" } }, |
| 148 | + "payment_received": { "type": { "$ref": "#/definitions/PaymentReceived" } }, |
| 149 | + "user_created": { "type": { "$ref": "#/definitions/UserCreated" } } |
| 150 | + }, |
| 151 | + "definitions": { |
| 152 | + "DomainEventBase": { |
| 153 | + "abstract": true, |
| 154 | + "type": "object", |
| 155 | + "name": "DomainEventBase", |
| 156 | + "properties": { |
| 157 | + "event_type": { "type": "string" } |
| 158 | + } |
| 159 | + }, |
| 160 | + "OrderPlaced": { |
| 161 | + "type": "object", |
| 162 | + "name": "OrderPlaced", |
| 163 | + "$extends": "#/definitions/DomainEventBase", |
| 164 | + "properties": { |
| 165 | + "items": { "type": "array", "items": { ... } }, |
| 166 | + "order_id": { "type": "string" }, |
| 167 | + "total": { "type": "double" }, |
| 168 | + "user_id": { "type": "string" } |
| 169 | + }, |
| 170 | + "required": ["items", "order_id", "total", "user_id"] |
| 171 | + }, |
| 172 | + "PaymentReceived": { |
| 173 | + "type": "object", |
| 174 | + "name": "PaymentReceived", |
| 175 | + "$extends": "#/definitions/DomainEventBase", |
| 176 | + "properties": { |
| 177 | + "amount": { "type": "double" }, |
| 178 | + "method": { "type": "string" }, |
| 179 | + "order_id": { "type": "string" }, |
| 180 | + "payment_id": { "type": "string" } |
| 181 | + }, |
| 182 | + "required": ["amount", "method", "order_id", "payment_id"] |
| 183 | + }, |
| 184 | + "UserCreated": { |
| 185 | + "type": "object", |
| 186 | + "name": "UserCreated", |
| 187 | + "$extends": "#/definitions/DomainEventBase", |
| 188 | + "properties": { |
| 189 | + "created_at": { "type": "string" }, |
| 190 | + "email": { "type": "string" }, |
| 191 | + "user_id": { "type": "string" } |
| 192 | + }, |
| 193 | + "required": ["created_at", "email", "user_id"] |
| 194 | + } |
| 195 | + } |
| 196 | +} |
| 197 | +``` |
| 198 | + |
| 199 | +This is a proper inline union: |
| 200 | + |
| 201 | +- **`selector`** points to the discriminator field (`event_type`) |
| 202 | +- **`choices`** maps each discriminator value to a variant type |
| 203 | +- **`$extends`** references an abstract base type with common fields |
| 204 | +- Each variant extends the base and adds its specific fields |
| 205 | + |
| 206 | +The choice keys (`order_placed`, `payment_received`, `user_created`) match the |
| 207 | +actual values in the data, so instances validate correctly. |
| 208 | + |
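To make the role of `selector` and `choices` concrete, here is a rough sketch of how a consumer could pick the variant definition for one instance. It only walks the dictionary shape shown above. `resolve_variant` is a hypothetical helper, the schema is trimmed to the parts needed for dispatch, and a real validator such as the json-structure SDK does considerably more (for example, applying the base type referenced by `$extends`).

```python
def resolve_variant(schema: dict, instance: dict) -> dict:
    """Return the definition that the selector value of `instance` points to."""
    selector = schema["selector"]                   # "event_type"
    choice = schema["choices"][instance[selector]]  # KeyError for unknown values
    ref = choice["type"]["$ref"]                    # "#/definitions/OrderPlaced"
    node = schema
    for part in ref.split("/")[1:]:                 # skip the leading "#"
        node = node[part]
    return node


# Trimmed copy of the inferred schema (illustration only).
schema = {
    "type": "choice",
    "selector": "event_type",
    "choices": {
        "order_placed": {"type": {"$ref": "#/definitions/OrderPlaced"}},
        "user_created": {"type": {"$ref": "#/definitions/UserCreated"}},
    },
    "definitions": {
        "OrderPlaced": {
            "name": "OrderPlaced",
            "required": ["items", "order_id", "total", "user_id"],
        },
        "UserCreated": {
            "name": "UserCreated",
            "required": ["created_at", "email", "user_id"],
        },
    },
}

event = {"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123",
         "total": 99.50, "items": [{"sku": "A1", "qty": 2}]}
variant = resolve_variant(schema, event)
missing = [name for name in variant["required"] if name not in event]
print(variant["name"], missing)
```

Because the choice keys are the literal discriminator values, the lookup needs no extra mapping table between data and schema.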
| 209 | +### Validating the Result |
| 210 | + |
| 211 | +Using the [json-structure Python SDK](https://pypi.org/project/json-structure/), |
| 212 | +we can verify that both the schema and the original instances are valid: |
| 213 | + |
| 214 | +```python |
| 215 | +import json |
| 216 | +from json_structure import SchemaValidator, InstanceValidator |
| 217 | + |
| 218 | +with open('events.jstruct.json') as f: |
| 219 | +    schema = json.load(f) |
| 220 | + |
| 221 | +# Validate the schema itself |
| 222 | +sv = SchemaValidator(extended=True) |
| 223 | +errors = sv.validate(schema) |
| 224 | +print('Schema valid:', len(errors) == 0) |
| 225 | + |
| 226 | +# Validate each instance |
| 227 | +iv = InstanceValidator(schema, extended=True) |
| 228 | +with open('events.jsonl') as f: |
| 229 | +    for line in f: |
| 230 | +        if line.strip(): |
| 231 | +            instance = json.loads(line) |
| 232 | +            errors = iv.validate(instance) |
| 233 | +            print(f"{instance['event_type']}: {'valid' if not errors else errors}") |
| 234 | +``` |
| 235 | + |
| 236 | +Output: |
| 237 | + |
| 238 | +``` |
| 239 | +Schema valid: True |
| 240 | +user_created: valid |
| 241 | +user_created: valid |
| 242 | +order_placed: valid |
| 243 | +order_placed: valid |
| 244 | +payment_received: valid |
| 245 | +payment_received: valid |
| 246 | +``` |
| 247 | + |
| 248 | +All six instances validate against the inferred schema. |
| 249 | + |
| 250 | +## How the Algorithm Works |
| 251 | + |
| 252 | +The `--infer-choices` option uses a clustering algorithm: |
| 253 | + |
| 254 | +1. **Document Fingerprinting**: Each JSON object is characterized by its field |
| 255 | + signature — the set of top-level keys it contains. |
| 256 | + |
| 257 | +2. **Jaccard Similarity Clustering**: Documents with similar field signatures |
| 258 | + are grouped together. A two-pass refinement handles edge cases. |
| 259 | + |
| 260 | +3. **Discriminator Detection**: The algorithm looks for fields whose values |
| 261 | + correlate strongly with cluster membership. A field like `event_type` that |
| 262 | + has distinct values for each cluster is a strong discriminator candidate. |
| 263 | + |
| 264 | +4. **Sparse Data Filtering**: If documents have high overlap (same basic |
| 265 | + structure with some optional fields), they're treated as a single type with |
| 266 | + optional properties rather than distinct variants. |
| 267 | + |
| 268 | +5. **Nested Discriminators**: The algorithm can detect discriminators inside |
| 269 | + nested objects (up to 2 levels deep), handling envelope patterns like |
| 270 | + CloudEvents with typed payloads. |
| 271 | + |
| 272 | +The result is a schema that captures the polymorphic structure of your data |
| 273 | +rather than flattening everything into a single bag of optional fields. |
| 274 | + |
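Steps 1 to 3 above can be sketched in a few dozen lines of Python. This is an illustration of the general technique, not Structurize's implementation: the function names, the single-pass greedy clustering, and the 0.5 similarity threshold are all assumptions, and the two-pass refinement and nested-discriminator handling are left out.

```python
def fingerprint(doc: dict) -> frozenset:
    """Step 1: the field signature is the set of top-level keys."""
    return frozenset(doc)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |A & B| / |A | B| of two key sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs: list, threshold: float = 0.5) -> list:
    """Step 2: greedily group documents whose signatures are similar.

    The threshold is an assumed value; documents with high overlap land
    in the same cluster, which mirrors the idea behind step 4.
    """
    clusters = []  # list of (representative signature, member documents)
    for doc in docs:
        sig = fingerprint(doc)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(doc)
                break
        else:
            clusters.append((sig, [doc]))
    return [members for _, members in clusters]


def find_discriminator(clusters: list):
    """Step 3: find a shared string field with one distinct value per cluster."""
    if len(clusters) < 2:
        return None
    shared = set.intersection(*(set(d) for c in clusters for d in c))
    for field in sorted(shared):
        if not all(isinstance(d[field], str) for c in clusters for d in c):
            continue
        values = [{d[field] for d in c} for c in clusters]
        one_value_each = all(len(v) == 1 for v in values)
        all_distinct = len(set().union(*values)) == len(clusters)
        if one_value_each and all_distinct:
            return field
    return None


# The six sample events from the JSONL file above.
events = [
    {"event_type": "user_created", "user_id": "u123",
     "email": "alice@example.com", "created_at": "2026-02-04T10:00:00Z"},
    {"event_type": "user_created", "user_id": "u456",
     "email": "bob@example.com", "created_at": "2026-02-04T11:00:00Z"},
    {"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123",
     "total": 99.50, "items": [{"sku": "A1", "qty": 2}]},
    {"event_type": "order_placed", "order_id": "ord-002", "user_id": "u456",
     "total": 150.00, "items": [{"sku": "B2", "qty": 1}]},
    {"event_type": "payment_received", "payment_id": "pay-001",
     "order_id": "ord-001", "amount": 99.50, "method": "card"},
    {"event_type": "payment_received", "payment_id": "pay-002",
     "order_id": "ord-002", "amount": 150.00, "method": "paypal"},
]

groups = cluster(events)
print([len(g) for g in groups], find_discriminator(groups))
```

On the sample data the pairwise similarities between event types are low (for example 2/7 between `user_created` and `order_placed`, which share only `event_type` and `user_id`), so each type forms its own cluster and `event_type` is the only shared field with exactly one value per cluster.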
| 275 | +## Use Cases |
| 276 | + |
| 277 | +- **Event Sourcing**: Infer schemas from event logs with multiple event types |
| 278 | +- **API Documentation**: Generate schemas from sample API responses |
| 279 | +- **Message Queues**: Document Kafka/RabbitMQ message formats |
| 280 | +- **Data Lake Schemas**: Create schemas for semi-structured data in Parquet or Iceberg |
| 281 | +- **Code Generation**: Feed the schema into structurize's code generators to produce typed classes |
| 282 | + |
| 283 | +## Getting Started |
| 284 | + |
| 285 | +Install structurize: |
| 286 | + |
| 287 | +```bash |
| 288 | +pip install structurize |
| 289 | +``` |
| 290 | + |
| 291 | +Point it at your data: |
| 292 | + |
| 293 | +```bash |
| 294 | +structurize json2s your-data.jsonl --infer-choices --out schema.jstruct.json --type-name YourType |
| 295 | +``` |
| 296 | + |
| 297 | +Validate the result with the json-structure SDK, or use structurize to convert |
| 298 | +the schema to code, documentation, or other formats. |
| 299 | + |
| 300 | +--- |
| 301 | + |
| 302 | +The `json2s` command with `--infer-choices` bridges the gap between the JSON |
| 303 | +data you have and the structured schema you need. It understands that your |
| 304 | +data isn't just a blob of fields — it's a collection of distinct types |
| 305 | +with a common discriminator. And it produces schemas that reflect that structure. |
| 306 | + |