Commit b550d32

Author: Clemens Vasters
Commit message: Add blog post: Structurize json2s command
1 parent 9a27215, commit b550d32

1 file changed

Lines changed: 306 additions & 0 deletions

@@ -0,0 +1,306 @@
---
layout: post
title: "Inferring JSON Structure Schemas from Your Data with Structurize"
date: 2026-02-04
author: Clemens Vasters
---

# Inferring JSON Structure Schemas from Your Data with Structurize

If you have JSON data and need a schema for it, the new `json2s` command in
**Structurize** can help. It analyzes your JSON files, infers the types
you're working with, and produces a valid JSON Structure schema document that
you can use for validation, code generation, or documentation.

## What is Structurize?

[Structurize](https://github.com/clemensv/avrotize) is a schema conversion
toolkit that transforms between various schema formats: JSON Schema, JSON
Structure, Avro Schema, Protocol Buffers, XSD, and more. It also generates code
in multiple languages (C#, Python, TypeScript, Java, Go, Rust, C++) and exports
schemas to SQL databases and data formats like Parquet and Iceberg.

The tool ships under two package names, `structurize` and `avrotize`, which
share the same codebase. Choose whichever aligns with your primary use case.
JSON Structure users will likely prefer `structurize`.

Install it with:

```bash
pip install structurize
```

## The json2s Command: Schema Inference from JSON

The `json2s` command reads one or more JSON files and infers a JSON Structure
schema from them. It handles single JSON objects, JSON arrays, and JSONL
(newline-delimited JSON) files.

### Basic Usage

```bash
structurize json2s data.json --out schema.jstruct.json --type-name MyType
```

Parameters:

- `<json_files...>`: One or more JSON files to analyze
- `--out`: Output path for the JSON Structure schema (stdout if omitted)
- `--type-name`: Name for the root type (default: "Document")
- `--base-id`: Base URI for `$id` generation (default: "https://example.com/")
- `--sample-size`: Maximum records to sample (0 = all, default: 0)
- `--infer-choices`: Detect discriminated unions (more on this below)

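If you drive `json2s` from a script, a small wrapper that assembles the argument list keeps the flags in one place. The sketch below uses only the flags listed above. The helper names `build_json2s_command` and `run_json2s` are made up for illustration, and it assumes the `structurize` executable is on your `PATH` and accepts the options in any order.

```python
import subprocess
from typing import List, Optional, Sequence


def build_json2s_command(
    json_files: Sequence[str],
    out: Optional[str] = None,
    type_name: Optional[str] = None,
    base_id: Optional[str] = None,
    sample_size: Optional[int] = None,
    infer_choices: bool = False,
) -> List[str]:
    """Assemble the argv list for `structurize json2s` (hypothetical helper)."""
    if not json_files:
        raise ValueError("json2s needs at least one input file")
    cmd = ["structurize", "json2s", *json_files]
    # Options left as None are omitted so the tool applies its own defaults.
    if out is not None:
        cmd += ["--out", out]
    if type_name is not None:
        cmd += ["--type-name", type_name]
    if base_id is not None:
        cmd += ["--base-id", base_id]
    if sample_size is not None:
        cmd += ["--sample-size", str(sample_size)]
    if infer_choices:
        cmd.append("--infer-choices")
    return cmd


def run_json2s(json_files: Sequence[str], **options) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_json2s_command(json_files, **options), check=True)


# Only builds and prints the argument list; nothing is executed here.
print(build_json2s_command(["data.json"], out="schema.jstruct.json", type_name="MyType"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.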
### Multiple Files and JSONL

The command accepts multiple input files, merging their structures into a
unified schema. This is useful when your data is split across files or when
you want to analyze several examples together.

JSONL files (one JSON object per line) are first-class citizens. The inferrer
reads each line as a separate document and consolidates their structures.

```bash
# Multiple JSON files
structurize json2s orders.json users.json events.json --out unified.jstruct.json

# A JSONL file with many records
structurize json2s events.jsonl --out events.jstruct.json --type-name DomainEvent
```

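To make the three input shapes concrete, here is a minimal sketch of how a loader could normalize a single object, an array, or a JSONL file into one list of records. This is not Structurize's actual loader; the function names `load_records` and `load_all` are hypothetical, and the fallback-to-JSONL strategy is an assumption.

```python
import json
from pathlib import Path
from typing import Any, Iterable, List


def load_records(path: str) -> List[Any]:
    """Return the documents in one file as a list (illustrative only)."""
    text = Path(path).read_text(encoding="utf-8")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON value: treat the file as JSONL and parse each
        # non-empty line as its own document.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A top-level array contributes one record per element;
    # any other value is a single record.
    return data if isinstance(data, list) else [data]


def load_all(paths: Iterable[str]) -> List[Any]:
    """Concatenate the records of several files, in the order given."""
    records: List[Any] = []
    for path in paths:
        records.extend(load_records(path))
    return records
```

With a loader like this, `load_all(["orders.json", "events.jsonl"])` would give the inference step one flat list to work on, regardless of how the data was split across files.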
## Detecting Discriminated Unions with `--infer-choices`

Here's where things get interesting. Many event-driven systems, APIs, and
message formats use **discriminated unions**: a single field (often called
`type`, `kind`, or `event_type`) determines which variant of a structure you're
dealing with.

Consider this JSONL file with three event types:

```jsonl
{"event_type": "user_created", "user_id": "u123", "email": "alice@example.com", "created_at": "2026-02-04T10:00:00Z"}
{"event_type": "user_created", "user_id": "u456", "email": "bob@example.com", "created_at": "2026-02-04T11:00:00Z"}
{"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123", "total": 99.50, "items": [{"sku": "A1", "qty": 2}]}
{"event_type": "order_placed", "order_id": "ord-002", "user_id": "u456", "total": 150.00, "items": [{"sku": "B2", "qty": 1}]}
{"event_type": "payment_received", "payment_id": "pay-001", "order_id": "ord-001", "amount": 99.50, "method": "card"}
{"event_type": "payment_received", "payment_id": "pay-002", "order_id": "ord-002", "amount": 150.00, "method": "paypal"}
```

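You can see the discriminated-union shape in the data itself by grouping the records on `event_type` and comparing the remaining field names. The snippet below is only an illustration of that observation, not part of Structurize; `signatures_by_value` is a hypothetical helper.

```python
import json
from collections import defaultdict

# The six records from the example above.
EVENTS_JSONL = """\
{"event_type": "user_created", "user_id": "u123", "email": "alice@example.com", "created_at": "2026-02-04T10:00:00Z"}
{"event_type": "user_created", "user_id": "u456", "email": "bob@example.com", "created_at": "2026-02-04T11:00:00Z"}
{"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123", "total": 99.50, "items": [{"sku": "A1", "qty": 2}]}
{"event_type": "order_placed", "order_id": "ord-002", "user_id": "u456", "total": 150.00, "items": [{"sku": "B2", "qty": 1}]}
{"event_type": "payment_received", "payment_id": "pay-001", "order_id": "ord-001", "amount": 99.50, "method": "card"}
{"event_type": "payment_received", "payment_id": "pay-002", "order_id": "ord-002", "amount": 150.00, "method": "paypal"}
"""


def signatures_by_value(records, field):
    """Map each value of `field` to the distinct sets of other keys seen with it."""
    groups = defaultdict(set)
    for record in records:
        groups[record[field]].add(frozenset(k for k in record if k != field))
    return dict(groups)


records = [json.loads(line) for line in EVENTS_JSONL.splitlines() if line.strip()]
for value, signatures in signatures_by_value(records, "event_type").items():
    for signature in signatures:
        print(value, "->", sorted(signature))
```

Each `event_type` value maps to exactly one set of fields here, which is the pattern a discriminator is expected to show.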
### Without `--infer-choices`: A Flat Object

Running the basic inference:

```bash
structurize json2s events.jsonl --out events.jstruct.json --type-name DomainEvent
```

produces a single object type with all fields merged:

```json
{
  "$schema": "https://json-structure.org/meta/core/v0/#",
  "$id": "https://example.com/DomainEvent",
  "type": "object",
  "name": "DomainEvent",
  "properties": {
    "event_type": { "type": "string" },
    "user_id": { "type": "string" },
    "email": { "type": "string" },
    "created_at": { "type": "string" },
    "order_id": { "type": "string" },
    "total": { "type": "double" },
    "items": { "type": "array", "items": { ... } },
    "payment_id": { "type": "string" },
    "amount": { "type": "double" },
    "method": { "type": "string" }
  },
  "required": ["event_type"]
}
```

This works, but it loses the structure: `email` only makes sense for
`user_created` events, `items` only for `order_placed`, and so on. Every field
becomes optional except `event_type`, which is the only one present in every
record.

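The reason only `event_type` ends up in `required` is simple set arithmetic: the merged property list is the union of all field names, while the required list is their intersection. A minimal sketch of that idea, assuming this is roughly how the flat merge behaves (the helper `merge_fields` is hypothetical and ignores value types):

```python
def merge_fields(records):
    """Return (all field names, field names present in every record)."""
    all_fields = set()
    required = None
    for record in records:
        keys = set(record)
        all_fields |= keys  # union: every field seen anywhere
        required = keys if required is None else required & keys  # intersection
    return sorted(all_fields), sorted(required or ())


# One record per event type, taken from the example data.
records = [
    {"event_type": "user_created", "user_id": "u123",
     "email": "alice@example.com", "created_at": "2026-02-04T10:00:00Z"},
    {"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123",
     "total": 99.50, "items": [{"sku": "A1", "qty": 2}]},
    {"event_type": "payment_received", "payment_id": "pay-001",
     "order_id": "ord-001", "amount": 99.50, "method": "card"},
]

properties, required = merge_fields(records)
print("properties:", properties)
print("required:", required)
```

`user_id` and `order_id` each appear in two of the three variants, but neither appears in all of them, so they drop out of the intersection.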
### With `--infer-choices`: An Inline Union

Add the `--infer-choices` flag:

```bash
structurize json2s events.jsonl --infer-choices --out events.jstruct.json --type-name DomainEvent
```

Now the inferrer detects that `event_type` is a **discriminator** whose values
correlate with distinct field signatures. It produces a JSON Structure
`choice` type, an inline union:

```json
{
  "$schema": "https://json-structure.org/meta/core/v0/#",
  "$id": "https://example.com/schemas/DomainEvent",
  "type": "choice",
  "name": "DomainEvent",
  "$extends": "#/definitions/DomainEventBase",
  "selector": "event_type",
  "choices": {
    "order_placed": { "type": { "$ref": "#/definitions/OrderPlaced" } },
    "payment_received": { "type": { "$ref": "#/definitions/PaymentReceived" } },
    "user_created": { "type": { "$ref": "#/definitions/UserCreated" } }
  },
  "definitions": {
    "DomainEventBase": {
      "abstract": true,
      "type": "object",
      "name": "DomainEventBase",
      "properties": {
        "event_type": { "type": "string" }
      }
    },
    "OrderPlaced": {
      "type": "object",
      "name": "OrderPlaced",
      "$extends": "#/definitions/DomainEventBase",
      "properties": {
        "items": { "type": "array", "items": { ... } },
        "order_id": { "type": "string" },
        "total": { "type": "double" },
        "user_id": { "type": "string" }
      },
      "required": ["items", "order_id", "total", "user_id"]
    },
    "PaymentReceived": {
      "type": "object",
      "name": "PaymentReceived",
      "$extends": "#/definitions/DomainEventBase",
      "properties": {
        "amount": { "type": "double" },
        "method": { "type": "string" },
        "order_id": { "type": "string" },
        "payment_id": { "type": "string" }
      },
      "required": ["amount", "method", "order_id", "payment_id"]
    },
    "UserCreated": {
      "type": "object",
      "name": "UserCreated",
      "$extends": "#/definitions/DomainEventBase",
      "properties": {
        "created_at": { "type": "string" },
        "email": { "type": "string" },
        "user_id": { "type": "string" }
      },
      "required": ["created_at", "email", "user_id"]
    }
  }
}
```

This is a proper inline union:

- **`selector`** points to the discriminator field (`event_type`)
- **`choices`** maps each discriminator value to a variant type
- **`$extends`** references an abstract base type with common fields
- Each variant extends the base and adds its specific fields

The choice keys (`order_placed`, `payment_received`, `user_created`) match the
actual values in the data, so instances validate correctly.

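To see how `selector`, `choices`, and `definitions` fit together, the following sketch walks from an instance to its variant definition by hand. It is a simplified illustration against an abbreviated copy of the schema above, not how the json-structure SDK validates; `resolve_variant` and `missing_required` are hypothetical helpers, and only local `#/definitions/...` references are handled.

```python
# Abbreviated copy of the inferred schema: just the parts used for dispatch.
SCHEMA = {
    "type": "choice",
    "name": "DomainEvent",
    "selector": "event_type",
    "choices": {
        "order_placed": {"type": {"$ref": "#/definitions/OrderPlaced"}},
        "payment_received": {"type": {"$ref": "#/definitions/PaymentReceived"}},
        "user_created": {"type": {"$ref": "#/definitions/UserCreated"}},
    },
    "definitions": {
        "OrderPlaced": {"name": "OrderPlaced",
                        "required": ["items", "order_id", "total", "user_id"]},
        "PaymentReceived": {"name": "PaymentReceived",
                            "required": ["amount", "method", "order_id", "payment_id"]},
        "UserCreated": {"name": "UserCreated",
                        "required": ["created_at", "email", "user_id"]},
    },
}

REF_PREFIX = "#/definitions/"


def resolve_variant(schema, instance):
    """Follow selector -> choices -> $ref to the variant's definition."""
    value = instance[schema["selector"]]
    ref = schema["choices"][value]["type"]["$ref"]
    if not ref.startswith(REF_PREFIX):
        raise ValueError(f"unsupported reference: {ref}")
    return schema["definitions"][ref[len(REF_PREFIX):]]


def missing_required(schema, instance):
    """List the variant's required fields that the instance lacks."""
    variant = resolve_variant(schema, instance)
    return [name for name in variant.get("required", []) if name not in instance]


event = {"event_type": "payment_received", "payment_id": "pay-001",
         "order_id": "ord-001", "amount": 99.50, "method": "card"}
print(resolve_variant(SCHEMA, event)["name"], missing_required(SCHEMA, event))
```

An instance whose `event_type` is not one of the choice keys raises `KeyError` in this sketch, which mirrors why the keys need to match the values found in the data.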
### Validating the Result

Using the [json-structure Python SDK](https://pypi.org/project/json-structure/),
we can verify that both the schema and the original instances are valid:

```python
import json
from json_structure import SchemaValidator, InstanceValidator

# Load the schema written by the json2s command above
with open('events.jstruct.json') as f:
    schema = json.load(f)

# Validate the schema itself
sv = SchemaValidator(extended=True)
errors = sv.validate(schema)
print('Schema valid:', len(errors) == 0)

# Validate each instance
iv = InstanceValidator(schema, extended=True)
with open('events.jsonl') as f:
    for line in f:
        if line.strip():  # skip blank lines
            instance = json.loads(line)
            errors = iv.validate(instance)
            print(f"{instance['event_type']}: {'valid' if not errors else errors}")
```

Output:

```
Schema valid: True
user_created: valid
user_created: valid
order_placed: valid
order_placed: valid
payment_received: valid
payment_received: valid
```

All six instances validate against the inferred schema.

## How the Algorithm Works

The `--infer-choices` option uses a clustering algorithm:

1. **Document Fingerprinting**: Each JSON object is characterized by its field
   signature: the set of top-level keys it contains.

2. **Jaccard Similarity Clustering**: Documents with similar field signatures
   are grouped together. A two-pass refinement handles edge cases.

3. **Discriminator Detection**: The algorithm looks for fields whose values
   correlate strongly with cluster membership. A field like `event_type` that
   has distinct values for each cluster is a strong discriminator candidate.

4. **Sparse Data Filtering**: If documents have high overlap (same basic
   structure with some optional fields), they're treated as a single type with
   optional properties rather than distinct variants.

5. **Nested Discriminators**: The algorithm can detect discriminators inside
   nested objects (up to 2 levels deep), handling envelope patterns like
   CloudEvents with typed payloads.

The result is a schema that captures the polymorphic structure of your data
rather than flattening everything into a single bag of optional fields.

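Steps 1 to 3 can be sketched in a few lines of Python. This is a toy version under stated assumptions, not Structurize's implementation: it clusters greedily in a single pass (no two-pass refinement), the `0.8` similarity threshold is an arbitrary choice, and a discriminator is taken to be a string field present in every document that has exactly one value per cluster and a different value in each cluster. All function names are hypothetical.

```python
def signature(doc):
    """Step 1: the field signature is the set of top-level keys."""
    return frozenset(doc)


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two key sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(docs, threshold=0.8):
    """Step 2 (simplified): join the first cluster whose representative
    signature is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (representative signature, member documents)
    for doc in docs:
        sig = signature(doc)
        for representative, members in clusters:
            if jaccard(sig, representative) >= threshold:
                members.append(doc)
                break
        else:
            clusters.append((sig, [doc]))
    return [members for _, members in clusters]


def find_discriminators(clusters):
    """Step 3 (simplified): fields whose value identifies the cluster."""
    if len(clusters) < 2:
        return []
    docs = [doc for members in clusters for doc in members]
    common = set(docs[0]).intersection(*(set(doc) for doc in docs[1:]))
    found = []
    for field in sorted(common):
        if not all(isinstance(doc[field], str) for doc in docs):
            continue  # only string-valued fields are considered here
        per_cluster = [{doc[field] for doc in members} for members in clusters]
        one_value_each = all(len(values) == 1 for values in per_cluster)
        all_distinct = len(set().union(*per_cluster)) == len(clusters)
        if one_value_each and all_distinct:
            found.append(field)
    return found


# The six example events, with values shortened where they do not matter.
EVENTS = [
    {"event_type": "user_created", "user_id": "u123", "email": "alice@example.com", "created_at": "2026-02-04T10:00:00Z"},
    {"event_type": "user_created", "user_id": "u456", "email": "bob@example.com", "created_at": "2026-02-04T11:00:00Z"},
    {"event_type": "order_placed", "order_id": "ord-001", "user_id": "u123", "total": 99.50, "items": []},
    {"event_type": "order_placed", "order_id": "ord-002", "user_id": "u456", "total": 150.00, "items": []},
    {"event_type": "payment_received", "payment_id": "pay-001", "order_id": "ord-001", "amount": 99.50, "method": "card"},
    {"event_type": "payment_received", "payment_id": "pay-002", "order_id": "ord-002", "amount": 150.00, "method": "paypal"},
]

clusters = cluster(EVENTS)
print(len(clusters), "clusters; discriminators:", find_discriminators(clusters))
```

For the example data the signatures of different event types overlap in at most two of seven or eight keys, so their Jaccard similarity stays well below the threshold and each event type lands in its own cluster.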
## Use Cases

- **Event Sourcing**: Infer schemas from event logs with multiple event types
- **API Documentation**: Generate schemas from sample API responses
- **Message Queues**: Document Kafka/RabbitMQ message formats
- **Data Lake Schemas**: Create schemas for semi-structured data in Parquet or Iceberg
- **Code Generation**: Feed the schema into structurize's code generators to produce typed classes

## Getting Started

Install structurize:

```bash
pip install structurize
```

Point it at your data:

```bash
structurize json2s your-data.jsonl --infer-choices --out schema.jstruct.json --type-name YourType
```

Validate the result with the json-structure SDK, or use structurize to convert
the schema to code, documentation, or other formats.

---

The `json2s` command with `--infer-choices` bridges the gap between the JSON
data you have and the structured schema you need. It treats your data not as
a blob of fields but as a collection of distinct types with a common
discriminator, and it produces schemas that reflect that structure.
