Mesa-examples revival: vision & open questions #417

EwoutH · 2026-03-19T08:42:03Z

EwoutH
Mar 19, 2026
Maintainer

Explored with and final post written up by Claude 4.6 Opus.

What we're trying to do

Mesa-examples has been neglected since the core examples moved to the main repo. Examples break silently, documentation quality varies wildly, and new users struggle to find what they need. We want to turn mesa-examples into a well-maintained, discoverable, and contributor-friendly collection that stays healthy as Mesa evolves.

Goals

Minimize maintainer load. Maintainers should make editorial decisions, not chase down every broken example or review every line of code themselves.
Maximize quality and robustness. Examples should work, be well-documented, and demonstrate something genuinely useful about Mesa or agent-based modeling.
Keep contributing accessible for modellers. The contribution process should meet modellers where they are. Contributors write models and explain them — they shouldn't need to build infrastructure.
Make it a sustainable system. Nothing should silently rot. When things break, it's visible. When people move on, the system degrades gracefully rather than collapsing.
Discoverability. A new user should be able to find relevant examples by domain, complexity, or Mesa feature — without needing prior knowledge of the repository structure.
Learning path. Mesa-examples should support progression from beginner to advanced usage, not just be a flat collection.

Key directions

Example lifecycle and status

It might be useful to every having an explicit status that reflect its maturity: something like incubator (works but not yet polished), verified (reviewed, documented, actively maintained), and showcase (editorially selected as exemplary). Plus deprecated for examples that are no longer maintained. Status makes quality visible to users and sets clear expectations for contributors.

Open questions:

Are these the right status levels, or do we need more/fewer?
What are the concrete criteria for promotion between statuses?
How do we handle the transition when an example degrades — who decides, and when?

Ownership

Every verified example could have an explicit owner: the person responsible for keeping it healthy. Not doing all the work, but responding when something needs attention. Incubator examples wouldn't require an owner, keeping the barrier to entry low. When an owner steps away, there should be a clear process: flag it, seek adoption, demote if nobody picks it up.

Open questions:

What's a realistic ownership commitment? What do we actually expect from an owner?
How do we make ownership attractive rather than burdensome?
How do we handle the transition when an owner goes silent?

A structured contribution process

A structured contribution process could help a more gently way towards high-quality examples. This could be like three review stages for PRs: author self-review (working through a checklist, demonstrating understanding), peer/collaborator review (running the model, exploring behavior, asking substantive questions), and maintainer approval (confirming the process was followed, making the editorial call). This distributes load and builds community skills — especially important in the era of AI-generated code, where self-review is how contributors demonstrate they understand what they're submitting. See the review guidelines for an initial draft policy.

Open questions:

Do we need a proposal/issue stage before a PR, or does that add friction without enough value?
How do we prevent peer review from becoming perfunctory?
How should the review depth scale with the target status (incubator vs. verified)?

README as mini-paper

Example READMEs could follow a structure inspired by academic papers: abstract, background, model description (agents, rules, space, parameters), how to run, results and discussion, and references. This serves multiple audiences: users browsing the gallery, learners working through an example, contributors using it as a template, and academics wanting something citable. The completeness of these sections can scale with example status.

Open questions:

What's the minimum viable README for incubator status?
How do we keep this lightweight enough that it doesn't scare off contributors?
Should parts of the README be generated from metadata to avoid duplication?

Metadata

Each example could carry a small metadata file enabling machine-readable discoverability and automated validation. We've been converging on fields like title, abstract, authors, domain, complexity, keywords, and Mesa version compatibility. The guiding principle: require only what you must, automate what you can, and never let information go silently stale.

Open questions:

What's the right file format (YAML, TOML, something else)?
Which fields are truly required vs. optional?
Can some metadata (like Mesa version compatibility) be CI-derived rather than author-declared?

Automation and CI

Automated validation could be a way how we keep the collection healthy without constant maintainer attention. This means CI on PRs (does it run, is metadata valid), scheduled CI against current Mesa (catch breakage early), and pre-release testing against Mesa release candidates. When something breaks, it should become a visible, tracked issue — not something hiding in a log.

Open questions:

What should CI actually test — just "does it run for N steps," or deeper checks?
How do we handle Solara/visualization testing in CI?
What's the right cadence for scheduled runs?

What we're looking for in proposals

A strong proposal doesn't need to address all of the above. It should pick a coherent subset, demonstrate understanding of the tradeoffs involved, and show concrete thinking about implementation. We value proposals that are honest about what's hard and what they don't know, over ones that present everything as solved.

We're particularly interested in:

How you think about balancing ease of contribution against quality standards
Practical experience with (or ideas for) CI/automation for example validation
Thoughtful engagement with the metadata and discoverability problem
Understanding of what makes an example genuinely useful for learning, not just runnable

quaquel · 2026-03-19T08:52:39Z

quaquel
Mar 19, 2026
Maintainer

I'll try to respond later in more detail. But, one critical question for me is what we are trying to achieve with this repo? Is it merely a rich collection of how people have used Mesa, or should each example demonstrate some aspect of the Mesa library in greater detail than can be done by the curated examples in the core repo? Clarifying what we try to achieve with mesa-examples first will probably help answer some of the more detailed questions raised by @EwoutH.

If, as @EwoutH also suggests, each example should demonstrate something genuinely useful about Mesa, then it is essential that the claimed contribution be clear. Moreover, it can be grounds to remove examples that merely duplicate rather than add.

0 replies

abhinavk0220 · 2026-03-19T09:21:25Z

abhinavk0220
Mar 19, 2026

Great to see this formalized a lot of these pain points are exactly what I ran into building the four LLM examples (#360, #363, #372, #378).
A few thoughts from that experience:

On example lifecycle

The incubator → verified → showcase progression makes sense. One thing I'd add: for LLM examples specifically, "verified" is tricky because they require API keys and are non-deterministic. Maybe LLM examples need a slightly different verification standard e.g., structural checks (does it initialize, does the step loop complete with a mock) rather than output reproducibility.

On README as mini-paper

I started doing this naturally for the LLM examples each has a background section, model description, and key findings (e.g., LLM agents producing weaker segregation in Schelling, or flatter epidemic curves in SIR due to behavioral heterogeneity). Having a standardized template would have saved time and made them more consistent across examples.

On CI for LLM examples

The biggest gap I see is that LLM examples can't run in standard CI without API keys. One approach: a lightweight "smoke test" mode where models run with mock LLM responses, testing that the Mesa machinery works even if the LLM outputs are canned. This keeps CI useful without requiring secrets in the pipeline.

On metadata

YAML feels right it's already familiar in the Python ecosystem and readable without tooling. For required fields, I'd keep it minimal: title, authors, domain, mesa_version_min, complexity. Everything else optional.
Happy to help prototype any of this my existing PRs could serve as guinea pigs for the lifecycle/metadata system.

0 replies

Tushar1733 · 2026-03-19T10:14:51Z

Tushar1733
Mar 19, 2026

Hey @EwoutH , My approach almost solves all the problems you've mentioned so far except for the Discoverability problem:

I have already developed a prototype Validator which executes all the examples in interactive mode to evaluate whether it's working or not . This can be integrated using CI to automate the validating system. you can the validator in my mesa-examples fork : https://github.com/Tushar1733/mesa-examples

My approach includes:

Per-Example Metadata File.
CI Integration Using Metadata.
Weekly Scheduled CI
GitHub Issue / PR Labels (Automatic)
Status Lifecycle System
Archival Policy(Not the priority though)
Standardised Contribution Template
Example Health Dashboard

Key Benefits of This Approach

→ automated example validation
→ structured example metadata
→ early detection of compatibility issues
→ transparent maintenance lifecycle
→ lower burden on maintainers
→ improved contributor experience
→ repository-wide health visibility via dashboard

here's my plan's workflow chart:-

5 replies

EwoutH Mar 19, 2026
Maintainer Author

I think a lot of this makes sense.

Where will the status labels live? PRs get completed, so that’s not a persistent place. I think in the metadata makes the most sense.

quaquel Mar 19, 2026
Maintainer

Can't CI be used to update something like that, either through commit or through posting to some pinned discussion? Perhaps tied only to stable releases of mesa itself.

Tushar1733 Mar 19, 2026

I think a lot of this makes sense.
Where will the status labels live? PRs get completed, so that’s not a persistent place. I think in the metadata makes the most sense.

@EwoutH Yeah the Labels will stay in metadata of each example. and i am planning to create a health dashboard of all the examples in the README of the repo so we can label it there as well.

Tushar1733 Mar 19, 2026

Can't CI be used to update something like that, either through commit or through posting to some pinned discussion? Perhaps tied only to stable releases of mesa itself.

@quaquel Your thinking is right CI can be used to update that automatically.

Tushar1733 Mar 21, 2026

Hey @EwoutH here's my final workflow chart, i think this will answer you many questions

------Trigger ----------------Pipeline -------------Outputs----

abhinavk0220 · 2026-03-19T17:26:44Z

abhinavk0220
Mar 19, 2026

Following up on my earlier comment after reading quaquel's question and the responses so far, I want to engage more concretely with the open questions.

On quaquel's question: what is mesa-examples for?

I think the answer is both and the lifecycle system is precisely how to hold both without contradiction. The incubator tier is a rich collection: open to anyone, low bar, shows how people use Mesa in practice. Verified and above are the curated layer: editorially selected, demonstrating something genuinely useful about Mesa APIs or ABM methodology. The status makes the distinction visible without excluding anyone.

The criterion for promotion from incubator → verified could be: does this example demonstrate a Mesa feature or modeling pattern not already covered by another verified example? That gives a concrete, reviewable bar and creates grounds to decline promotion (not rejection of the PR, just of the status upgrade) when there is genuine overlap.

On where status labels live (responding to @EwoutH)

Metadata file is the right answer it's the only persistent, version-controlled, machine-readable place. I'd propose a minimal meta.yaml in each example directory:

title: LLM Schelling Segregation
authors:
  - abhinavk0220
domain:
  - social-dynamics
  - segregation
complexity: intermediate   # beginner / intermediate / advanced
mesa_version_min: "3.0"
status: incubator          # incubator / verified / showcase / deprecated
owner: null                # required for verified and above
llm_required: false        # flag for examples needing API keys

The fields I'd make truly required: title, authors, domain, status. Everything else optional for incubator, required for verified. This keeps the barrier low at entry.

GitHub labels can mirror the metadata status as a view helpful for filtering PRs but the meta.yaml is the source of truth. CI updates the label from the file, not the other way around.

On CI-derived vs author-declared compatibility (responding to @quaquel)

Author declares mesa_version_min (the minimum they have tested against). CI does the rest: on every Mesa release, run all verified examples and write results back to a pinned discussion or a ci-status.json in the repo. When an example breaks, open a GitHub issue automatically tagged status: broken, notify the owner. If unresolved after N weeks, demote to deprecated and open a call-for-adoption issue.

The key insight is that mesa_version_min never goes stale because it is only a floor the CI result tells you the ceiling. No human needs to update it unless they backport a fix.

On CI for LLM examples

Standard CI cannot run LLM examples without API keys, but the Mesa machinery can still be tested. A mock LLM responder that returns a fixed string lets you verify: does the model initialize correctly, does step() complete without error, does the DataCollector collect at each step, does the visualization component render. These four checks catch 90% of real breakage without any API calls. The llm_required: true flag in metadata tells CI to use mock mode automatically.

On preventing perfunctory peer reviews

The review template should require the reviewer to complete a sentence: "I ran the model for N steps and observed [specific emergent behavior]." Without evidence of actually running it, the review is flagged as incomplete by the PR template checklist. This is low-friction (one sentence) but creates accountability you cannot fake having run a model without knowing what it does.

For LLM examples, the bar adjusts: "I ran the model with [model name] for N steps and the agents produced responses consistent with [expected reasoning pattern]."

On minimum viable README for incubator

Two sections required: (1) what does this model do in one paragraph, (2) how to run it with exact commands. Everything else background, results, references required for verified, encouraged for incubator. This keeps the entry bar low while ensuring even the simplest submission is usable by a newcomer.

On ownership commitment

A realistic ownership commitment for verified examples: respond within two weeks when a CI-opened issue tags you, and either fix it or explicitly hand off. That is it no ongoing maintenance required beyond being reachable. If an owner goes silent for 30 days on a tagged issue, the example moves to a needs-owner state and a call-for-adoption goes out. This makes ownership attractive because the commitment is bounded and explicit.

Happy to prototype the meta.yaml schema and the CI validation workflow as part of the GSoC work the four LLM examples I submitted (#360, #363, #372, #378) plus the existing examples in the repo are a ready-made test bed for iterating on this.

0 replies

himanksingh2024-svg · 2026-03-19T18:14:43Z

himanksingh2024-svg
Mar 19, 2026

From fixing PR #383 (aco_tsp path bug) and reviewing PR #382 (hex_snowflake visualization failure), I can confirm the silent breakage problem is real. Both examples failed for different reasons — one a path issue, one a removed API — and neither was visible without actually running them locally.
On CI: the minimum viable test should be "runs for N steps without error." That single check would have caught both issues I found. For the visualization layer (Solara), even just testing that the app.py imports without error catches a large class of breakage.
On metadata: YAML makes sense. For incubator, I'd require only: title, domain, complexity, mesa_version_min. Status field should default to incubator automatically on first PR — contributors shouldn't need to know the lifecycle exists to contribute.
Happy to help prototype this at the call tomorrow.

0 replies

aniketgit-hub101 · 2026-03-20T06:13:42Z

aniketgit-hub101
Mar 20, 2026

Hi @EwoutH, thank you for writing this up so clearly — it maps almost exactly to the problems I have been thinking about while preparing my GSoC proposal for this project.

0 replies

Harshini2411 · 2026-03-21T01:37:55Z

Harshini2411
Mar 21, 2026

Coming at this from the peer review side, I've reviewed six PRs this week and a pattern kept coming up: reviewers reading the code but not running it. Two PRs had runtime-breaking bugs that were invisible from the diff alone but immediately obvious on execution. Agreeing with abhinavk0220 that requiring one sentence about observed behavior when running the model is the right fix.

On the README-as-mini-paper idea: one concrete CI addition nobody mentioned, testing that README code snippets actually execute. PR #389 had a quick start snippet referencing three model attributes that don't exist in model.py. That's a whole class of breakage CI could catch automatically.

One open question I'd love a steer on: for LLM examples, is mesa-llm abstractions or direct API calls the preferred pattern going forward? Happy to align my work to whatever direction makes most sense for the repo.

0 replies

EwoutH · 2026-03-21T12:42:24Z

EwoutH
Mar 21, 2026
Maintainer Author

I was thinking a bit on the documentation, mini-paper and metadata. Ideally it's one thing, that's both easy to read and write for humans as for machines.

Rather than maintaining a separate meta.yaml and a README.md that overlap heavily (title, authors, domain, abstract), what about using a single README with YAML frontmatter?

---
title: LLM Schelling Segregation
authors:
  - abhinavk0220
domain:
  - social-dynamics
  - segregation
complexity: intermediate
mesa_version_min: "3.0"
status: incubator
owner: null
keywords: [LLM, segregation, behavioral heterogeneity]
---

## Abstract
One-paragraph summary of what this model does and why it's interesting.

## Model Description
Agents, rules, space, parameters...

## How to Run
Exact commands to get it working.

## Results & Discussion
...

This is a well-established pattern (Hugo, Jekyll, Quarto, Pandoc all use it), so contributors will likely recognize it and tooling already exists. One file means no drift between metadata and documentation.

4 replies

ShreyasN707 Mar 21, 2026

I was thinking about concept of registry system . While the README (mini-paper + frontmatter) is useful for detailed understanding, it doesn’t fully solve discoverability of examples.

A registry could act as a central place where all examples are tracked and summarized, combining both metadata and CI results
This would make it easier to explore examples and also ensure that their health is always visible.
Automatically track the health of each example using CI
(whether it runs, fails, or is outdated)
Store the status/lifecycle label
(like incubator, learning, showcase) to help filter examples
Provide quick information about each like :

title
domain
complexity

Show compatibility details
(like which Mesa version it works with)

What it would do:

It becomes a single place to:

see which examples are working
find examples easily
filter based on needs

It acts like a dashboard for maintainers
It also acts like a catalog for users
(to explore and choose useful examples)

@EwoutH how is this concept ?will this be useful for the discoverability and the maintenance?

EwoutH Mar 21, 2026
Maintainer Author

Could be useful. As long as we have one source of truth.

Vanya-kapoor Mar 21, 2026

@EwoutH the frontmatter approach makes a lot of sense — one file, no drift, familiar pattern. I've been working on exactly this in #423 (metadata.toml for forest_fire as a reference implementation) and #424 (CI validation workflow), so happy to pivot those to frontmatter format if that's the direction.
One thing nobody has mentioned yet: human discoverability is solved by the catalog/registry, but AI discoverability is a separate layer. When a user asks Claude or Copilot "how do I build a Mesa model?", the AI either uses stale training data (pre Mesa 3.x) or scrapes HTML and gets noise. I've been working on an llms.txt (#425 ) at the repo root — plain Markdown that gives AI tools accurate Mesa 3.x context, the full working example list, and the migration error table in one fetch. It's complementary to the frontmatter system: frontmatter feeds the human-readable catalog, llms.txt feeds AI tools. The broken-example health table in llms.txt means AI tools stop recommending broken examples to new users, which is one of the most common friction points for people discovering Mesa.

ShreyasN707 Mar 22, 2026

I was thinking of using the status of an example to place it into different tiers like showcase, learning, archived, deprecated. The status (set by the maintainer) can be combined with the health of the example to decide where it belongs. This can help improve discoverability while also clearly separating well-maintained examples from broken or outdated ones.something like:

| Example Name       |  Status          | Health      | Final Tier   | Meaning                                      |
|--------------------|------------------|-------------|--------------|----------------------------------------------|
| Predator Prey      | learning         | healthy     | learning     | Good for beginners, works properly           |
| Schelling Model    | showcase         | healthy     | showcase     | High-quality, recommended example            |
| Traffic Simulation | learning         | broken      | deprecated   | Was useful but currently broken              |
| Old Market Model   | reference        | stale       | archived     | Outdated but kept for reference              |
| New Experiment     | incubator        | unstable    | incubator    | Early-stage, still being tested              |

aniketgit-hub101 · 2026-03-21T12:47:35Z

aniketgit-hub101
Mar 21, 2026

This is a much cleaner design — one file, no drift, familiar pattern. The frontmatter approach also makes the CI validation script simpler: parse the frontmatter block, validate required fields for the declared status level, then treat everything below the --- as the human-readable documentation. No separate file to keep in sync.
Two practical questions this raises for me:

For CI purposes, should entry_point be derivable from convention (e.g., always run.py or app.py) rather than declared in frontmatter? That would reduce required fields further.
For the snippet testing you mentioned earlier — would the ## How to Run section be the canonical source for executable snippets, or should there be a separate fenced code block marker (e.g., ```python run) to distinguish runnable from illustrative code?

0 replies

Tushar1733 · 2026-03-22T05:37:32Z

Tushar1733
Mar 22, 2026

Are these the right status levels, or do we need more/fewer?

Hey @EwoutH I have a suggestion about the labels,

Right now you are thinking of using this pattern :

incubator | verified | showcase | deprecated

This doesn't seem very user-friendly. It's kind of complex if we see it from a perspective of someone who is not involved with CS background.

I think we should keep the things as simple as possible for the users, the pattern i suggest is this :

Active\varified | Needs-upgrade | Broken | Archive

let me know if you think the same way.

2 replies

EwoutH Mar 22, 2026
Maintainer Author

Some discussion about which states we need and why is definitely useful.

Maybe we need to start with all the problems we have with current examples and then see which status levels could help with that?

Tushar1733 Mar 22, 2026

@EwoutH I've already audited all the examples in this repo and if you i want i can give you the detailed reports too.
The categories problems i explored are these:

My intuition is very simple straight forward and i am dropping my idea of labeling.

himanksingh2024-svg · 2026-03-22T07:55:20Z

himanksingh2024-svg
Mar 22, 2026

The frontmatter approach makes sense from a contributor perspective. When I submitted PR #383, there was no standard structure — a single README with frontmatter would have made the contribution process clearer and reduced the chance of drift between metadata and docs.
On entry points for CI: convention over configuration makes sense here. If CI defaults to looking for app.py or run.py, contributors don't need to declare it in frontmatter at all — that's one less required field for incubator submissions, keeping the barrier low.
On status labels: starting from the problems with current examples is the right framing. From running examples locally, the most common issues are "doesn't run at all" and "runs but visualization broken" — any status system should make those two states immediately visible.

0 replies

Tushar1733 · 2026-03-25T08:00:41Z

Tushar1733
Mar 25, 2026

Hey @EwoutH @quaquel while experimenting with my ideas i noticed that managing the dependencies of different model in CI can be messy.

So i got a idea.

Idea : One collective requirements.txt file in the root of repo.
- In short: A python script will extract all the dependencies across all over the Examples and dump it into a collective requirements.txt which can be used by the CI to solve the messy Dependency issues.

I have documented the experiment and workflow that I tested, and I’ll provide the link here so it’s easier to review the approach and results.
Link: https://github.com/Tushar1733/Experiments

6 replies

Tushar1733 Mar 25, 2026

That's a valid point. Handling different versions of the same dependency across examples would add quite a bit of complexity.
since it would require resolving dependency constraints or isolating environments per example.

For this experiment, my goal is simpler: to test the examples against the latest available versions of the dependencies.

Nandha-kumar-S Mar 25, 2026

to test the examples against the latest available versions of the dependencies.

Do you mean One collective req.txt file with the latest available versions of the dependencies? if yes it will still break examples which uses older versions resulting in deprecated function usage error.

Tushar1733 Mar 25, 2026

Yeah I know, after the example breaks it will fail the test and be labelled: needs-upgrade

Tushar1733 Mar 25, 2026

Do you mean One collective req.txt file with the latest available versions of the dependencies? if yes it will still break examples which uses older versions resulting in deprecated function usage error.

now i get your point @Nandha-kumar-S so you mean what will happen if an example explicitly needs an older version of dependencies right? if yes then you are right to question my approach.

Thanks for clarification i'll soon update my solution.

Nandha-kumar-S Mar 25, 2026

Yes exactly If we label the broken model as needs-upgrade and the burden of fixing that code to get the CI green exists while the model was working just fine with its isolated dependencies. Each and every model doesn't have to work only with the latest version of its dependencies.

Nandha-kumar-S · 2026-03-25T09:57:36Z

Nandha-kumar-S
Mar 25, 2026

Hi @EwoutH Thanks for notifying us about your vision on this, Im working on these pillars Automation and CI, Metadata & Discoverability, and Ownership & Graceful Degradation.
Regarding the open question states about Ownership

How do we handle the transition when an owner goes silent?

We can tie the CI, the CODEOWNERS file, and the metadata together to automate this.
If an example fails the scheduled Matrix CI, a bot pings the CODEOWNER. If there is no fix or response within a set timeframe (e.g., 30 days), an automated script changes the README's frontmatter from status: verified to status: deprecated and opens an "Adopt Me" issue.
This ensures the repository degrades gracefully and users aren't misled by broken showcase models.

Lmk what you think about this,

0 replies

himanksingh2024-svg · 2026-03-25T16:57:42Z

himanksingh2024-svg
Mar 25, 2026

@Nandha-kumar-S raises the core issue here — the goal of CI shouldn't be to enforce a shared environment, but to verify that each example works correctly within its own declared dependencies.

The natural solution is a per-example requirements.txt (already implied by the frontmatter approach @EwoutH proposed) + a GitHub Actions matrix job where each example gets its own isolated install and test run. A model that works fine on Mesa 3.x with pinned deps shouldn't fail CI just because another example has moved to 4.0.

This also makes the needs-upgrade label meaningful — it flags examples where the author's intended environment no longer installs cleanly, not examples that simply conflict with a shared lockfile. Failures become precise and traceable.

From fixing #383 and reviewing other examples locally, I think "runs for N steps without error" is the minimum viable CI check — that single gate would have caught most of the silent breakage I've seen. Convention over configuration for entry points (defaulting to app.py or run.py) keeps the per-example setup lightweight without adding required frontmatter fields.

Happy to prototype the matrix workflow — this is something I've been thinking through for my GSOC proposal on this project.

1 reply

Nandha-kumar-S Mar 25, 2026

A model that works fine on Mesa 3.x with pinned deps shouldn't fail CI just because another example has moved to 4.0

Exactly man !

aniketgit-hub101 · 2026-03-25T17:10:06Z

aniketgit-hub101
Mar 25, 2026

The per-example isolation point is exactly right — a matrix job where each example installs its own declared dependencies is the clean solution. It also makes the frontmatter mesa_version_min field meaningful: CI uses it as the floor for the test matrix, so you know the example passes on Mesa 3.x and you can track exactly where it breaks.
One addition worth considering: a per-example requirements.txt alongside the README frontmatter, auto-generated from the example's imports if not provided. That keeps the contribution bar low — modellers don't need to manually maintain a lockfile — while still giving CI an isolated environment to work with.

0 replies

himanksingh2024-svg · 2026-03-26T04:06:35Z

himanksingh2024-svg
Mar 26, 2026

@aniketgit-hub101 the auto-generation idea is a good addition — it removes the last manual step for contributors. One thing worth thinking through: auto-generating from imports gives you direct dependencies, but not pinned versions. So CI would install latest by default, which brings back the version conflict problem unless we also pin at generation time.

A practical approach: auto-generate the requirements.txt on first contribution with pinned versions from the contributor's environment (something like pip freeze scoped to the example's imports). The contributor can then edit it if needed, but most won't have to. That way the barrier stays low AND the environment is reproducible.

The frontmatter mesa_version_min then acts as a sanity check — if the auto-generated requirements.txt pins Mesa 3.x but mesa_version_min says 4.0, CI can flag the inconsistency automatically.

0 replies

aniketgit-hub101 · 2026-03-26T04:24:01Z

aniketgit-hub101
Mar 26, 2026

That's the right refinement — pinned versions from the contributor's environment solves the reproducibility problem cleanly. The pip freeze scoped to example imports approach is practical: contributors get a working lockfile automatically, and the rare case where they need to adjust it is explicit rather than hidden.
The cross-check between requirements.txt pins and mesa_version_min frontmatter is a nice consistency guarantee — CI catches the case where someone updates the lockfile but forgets to update the frontmatter floor, or vice versa. That kind of automatic consistency enforcement is exactly what makes the system trustworthy without requiring contributors to understand all the moving parts.
Would it make sense for the auto-generation tool to also write mesa_version_min automatically from the pinned Mesa version in requirements.txt? That would make the frontmatter nearly zero-effort for new contributors — they run one command, get both files, and only need to fill in title, authors, domain, complexity, and status.

0 replies

abhinavk0220 · 2026-03-26T05:14:47Z

abhinavk0220
Mar 26, 2026

The single-file frontmatter approach is the right call less surface area for drift, familiar to anyone who has used Hugo/Jekyll, and the CI validation logic becomes trivially simple: parse frontmatter, validate required fields against status level, done.

To make this concrete, here is what the frontmatter would look like for one of my existing LLM examples (#363):

---
title: LLM Schelling Segregation
authors:
  - abhinavk0220
domain: [social-dynamics, segregation]
complexity: intermediate
mesa_version_min: "3.0"
status: incubator
owner: null
llm_required: true
entry_point: app.py
---

On entry_point: convention vs. declared (responding to @aniketgit-hub101's question)

Convention-first: CI looks for app.py, then run.py, then falls back to a declared field. This means 90% of contributors never need to think about it, and the escape hatch exists for the exceptions. One less required field for incubator status, which keeps the barrier to entry low.

On review depth scaling with status level this is the open question nobody has answered concretely yet, and I have direct experience from the four LLM PRs I submitted (#360, #363, #372, #378).

Incubator: one reviewer, must confirm they ran the model "ran for N steps, observed X behavior"_. No fake-able shortcut.
Verified: two reviewers minimum. One runs against current Mesa main. One reviews the README for factual accuracy against the actual code.
Showcase: maintainer sign-off required in addition to the above.

The key insight is that the checklist scales with status level, not the process — same three-stage flow (self-review → peer → maintainer) at every tier, just with more boxes ticked as the bar rises. This keeps the contribution path predictable for everyone.

On the ## How to Run section as the canonical executable source yes, that is the right anchor for snippet testing. A fenced block marked ```bash run signals "this must execute". Everything else is illustrative. The contract is explicit without requiring contributors to understand a separate test framework.

On LLM examples specifically the llm_required: true flag in frontmatter lets CI branch automatically into mock mode: a lightweight responder that returns a fixed string lets you verify that the model initializes correctly, step() completes without error, the DataCollector collects at each step, and the visualization component renders. These four checks catch the vast majority of real breakage without any API calls. The four LLM PRs I submitted (#360, #363, #372, #378) are a ready test bed for iterating on this happy to prototype the mock CI workflow as part of the GSoC work.

0 replies

aniketgit-hub101 · 2026-03-26T05:24:51Z

aniketgit-hub101
Mar 26, 2026

The convention-first entry_point resolution (app.py → run.py → declared) is the right call — it matches how contributors actually structure examples without requiring them to know the system exists.
Building on the review checklist idea: the "must confirm they ran the model" requirement for Incubator is the right minimum bar, but it only works if the reviewer has a frictionless way to run it. This is where the auto-generated requirements.txt (from our earlier thread with @EwoutH) connects — if CI auto-generates a pinned lockfile on first contribution, the reviewer can run pip install -r requirements.txt && python app.py with one command and no environment setup. The checklist item becomes genuinely verifiable rather than a rubber stamp.
On the four LLM checks (initialize, step, DataCollector, visualization) — this is a useful baseline. One addition: checking that DataCollector output is non-empty after N steps catches a silent failure mode where the model runs but produces no data, which looks like success but means the example is broken for its primary purpose.

2 replies

abhinavk0220 Mar 26, 2026

The DataCollector non-empty check is a sharp addition — and for LLM examples it catches a failure mode that is particularly easy to miss. The model can run cleanly (all LLM calls succeed, no exceptions) while the DataCollector silently returns an empty DataFrame because a variable was never registered in __init__ or collect() was never called at the end of step(). From the four LLM PRs, this exact issue surfaced during development of the SIR model — the epidemic curves looked right visually but the underlying DataFrame was one step behind because collect() was placed before step() logic instead of after.

One extension worth adding to that check: verify that agent-level data is non-empty too, not just model-level. For LLM examples the interesting signal is individual agent decisions — opinion shifts, cooperation rates, health state transitions — not just aggregate counts. A model that collects model-level data but drops agent-level data is half-broken for its primary purpose.

On the auto-generated requirements.txt connecting to mock CI: the llm_required: true flag means CI needs two install profiles — the real one for human contributors (pip install -r requirements.txt) and a mock one for CI that swaps the LLM provider for a fixed-response stub. This is worth encoding explicitly in the frontmatter or a separate requirements-ci.txt rather than leaving CI to figure it out, otherwise the auto-generation logic gets complicated. Small detail but worth designing upfront so the system stays simple for both contributors and maintainers.

aniketgit-hub101 Mar 27, 2026

The two-profile approach is worth encoding explicitly — requirements-ci.txt as a separate file makes the contract clear for both contributors and the CI system. Auto-generation handles the common case: if llm_required: false, one lockfile is enough; if llm_required: true, the generator creates both, with the CI profile substituting a fixed-response stub for the real provider. Contributors who never touch LLM examples never need to know the two-profile system exists.
On agent-level DataCollector checks: agreed that's the right extension. A concrete implementation: after N steps, assert len(model.datacollector.get_agent_vars_dataframe()) > 0 alongside the model-level check. For LLM examples this catches the case where agent decisions are never recorded — which, as you noted, is exactly the silent failure mode that matters most.

abhinavk0220 · 2026-03-31T17:43:05Z

abhinavk0220
Mar 31, 2026

Wanted to share a concrete output that connects back to this discussion.

I built the Ratchet Effect model (@EwoutH's issue #249, open since March 2025) â€” PR #458. Remote work as the domain, demonstrating path dependency through asymmetric lock-in dynamics.

The README uses the frontmatter schema you proposed here:

---
title: Ratchet Effect â€” Remote Work
authors:
  - abhinavk0220
domain: [labor-economics, behavioral, social-dynamics]
complexity: intermediate
mesa_version_min: "3.0"
status: incubator
owner: abhinavk0220
llm_required: false
entry_point: app.py
---

A few things I noticed while writing it:

On the frontmatter in practice: the entry_point field turned out to be unnecessary â€” CI convention-first (app.py -> run.py) meant I never needed to declare it. Worth making it truly optional rather than declaring in the schema for incubator status.

On status: incubator criteria: for this PR to graduate to verified, the concrete bar I'd propose is: one reviewer confirms they ran solara run app.py and observed the ratchet gap (no-shock baseline vs. post-shock equilibrium). That one sentence proves the model does what the README claims. No output reproducibility requirement for non-deterministic models, just behavioral confirmation.

On the llm_required flag: I added it because the four LLM examples I submitted (#360, #363, #372, #378) need CI to know whether to spin up a mock responder. For non-LLM examples it defaults to false and CI ignores it, zero overhead for the majority of contributors.

Happy to use PR #458 as a guinea pig for iterating on the contribution process and metadata system if useful.

0 replies

Tushar1733 · 2026-04-21T04:24:12Z

Tushar1733
Apr 21, 2026

Update — Validator 2 (Declared Environment) prototype complete

Since submitting my GSoC proposal on March 31, I've completed the prototype for Validator 2, which was the remaining open piece at the time of submission.

Validator 2 now:

Creates an isolated virtual environment per example
Installs only the pinned versions declared in each example's requirements file
Runs the same model test + server boot check as Validator 1
Generates a structured JSON report for its environment

Both validators are now working end-to-end. You can find the updated prototype here: https://github.com/Tushar1733/mesa-examples/blob/main/scripts/declared_validate_examples.py

You can comment down if any concern arises. @EwoutH @quaquel

0 replies

This comment was marked as off-topic.

Sign in to view

Uh oh!

Mesa-examples revival: vision & open questions #417

Uh oh!

Uh oh!

EwoutH Mar 19, 2026 Maintainer

What we're trying to do

Goals

Key directions

Example lifecycle and status

Ownership

A structured contribution process

README as mini-paper

Metadata

Automation and CI

What we're looking for in proposals

Replies: 22 comments · 20 replies

Uh oh!

quaquel Mar 19, 2026 Maintainer

This comment was marked as off-topic.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

EwoutH Mar 19, 2026 Maintainer Author

Uh oh!

quaquel Mar 19, 2026 Maintainer

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Hey @EwoutH here's my final workflow chart, i think this will answer you many questions

------Trigger ----------------Pipeline -------------Outputs----

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

EwoutH Mar 21, 2026 Maintainer Author

Uh oh!

What it would do:

Uh oh!

EwoutH Mar 21, 2026 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Hey @EwoutH I have a suggestion about the labels,

Uh oh!

EwoutH Mar 22, 2026 Maintainer Author

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

EwoutH
Mar 19, 2026
Maintainer

Replies: 22 comments 20 replies

quaquel
Mar 19, 2026
Maintainer

EwoutH Mar 19, 2026
Maintainer Author

quaquel Mar 19, 2026
Maintainer

EwoutH
Mar 21, 2026
Maintainer Author

EwoutH Mar 21, 2026
Maintainer Author

EwoutH Mar 22, 2026
Maintainer Author