Skip to content

Add support for pointing indexes in ocaml-index#2051

Open
Tim-ats-d wants to merge 23 commits intoocaml:mainfrom
Tim-ats-d:pointing-indexes
Open

Add support for pointing indexes in ocaml-index#2051
Tim-ats-d wants to merge 23 commits intoocaml:mainfrom
Tim-ats-d:pointing-indexes

Conversation

@Tim-ats-d
Copy link
Copy Markdown
Contributor

This PR adds support for pointing to an index file that already exists on disk.

This mechanism results in significantly smaller index files and fewer write operations.

Copy link
Copy Markdown
Contributor

@art-w art-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@voodoos raised another issue: Since the indexes are not self-contained anymore, it's possible that a pointing index targets another index file that has been removed or updated. If we don't check for that and follow pointers to arbitrary locations, some very bad things could happen!

Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
@Tim-ats-d
Copy link
Copy Markdown
Contributor Author

@voodoos raised another issue: Since the indexes are not self-contained anymore, it's possible that a pointing index targets another index file that has been removed or updated. If we don't check for that and follow pointers to arbitrary locations, some very bad things could happen!

I think I could make some kind of defensive programming and simply fail if the pointed-to index file no longer exists. What do you think?

@Tim-ats-d Tim-ats-d force-pushed the pointing-indexes branch 2 times, most recently from 0908c3a to 8032c54 Compare April 1, 2026 13:21
Copy link
Copy Markdown
Contributor

@art-w art-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's looking good! I only have some minor requests to improve the id checking logic :)

Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/index_format.ml Outdated
@Tim-ats-d
Copy link
Copy Markdown
Contributor Author

Tim-ats-d commented Apr 3, 2026

Here are some benchmark results from Hyperfine I got running ocaml-index on a bunch of ocaml-index files generated from Merlin repo.

Without pointing index:

Time (mean ± σ):      2.425 s ±  0.087 s    [User: 1.968 s, System: 0.395 s]
Range (min … max):    2.313 s …  2.553 s    10 runs

With pointing-index enabled:

Time (mean ± σ):     896.2 ms ±  42.5 ms    [User: 640.6 ms, System: 245.2 ms]
Range (min … max):   840.6 ms … 972.3 ms    10 runs

Complete command I used:

ocaml-index aggregate dot_merlin_reader.ocaml-index merlin_analysis.ocaml-index merlin_commands.ocaml-index merlin_config.ocaml-index merlin_dot_protocol.ocaml-index merlin_extend.ocaml-index merlin_index_format.ocaml-index merlin_kernel.ocaml-index merlin_specific.ocaml-index merlin_utils.ocaml-index ocaml-index.ocaml-index ocaml_compression.ocaml-index ocaml_parsing.ocaml-index ocaml_preprocess.ocaml-index ocaml_typing.ocaml-index ocaml_utils.ocaml-index ocamlmerlin_server.ocaml-index os_ipc.ocaml-index query_commands.ocaml-index query_protocol.ocaml-index  --root . --rewrite-root

@Tim-ats-d
Copy link
Copy Markdown
Contributor Author

Tim-ats-d commented Apr 3, 2026

Concerning reading performance, here are some other metrics. I ran ocaml-index dump on a project ocaml-index file aggregating the same files used on the previous benchmark.

With pointing-index enabled:

Time (mean ± σ):      3.990 s ±  0.076 s    [User: 3.372 s, System: 0.581 s]
  Range (min … max):    3.909 s …  4.129 s    10 runs

Without:

Time (mean ± σ):      3.803 s ±  0.141 s    [User: 3.417 s, System: 0.356 s]
  Range (min … max):    3.628 s …  4.092 s    10 runs

So reading performances are mostly similar.

@Tim-ats-d
Copy link
Copy Markdown
Contributor Author

My last commit should address your two last review comments @art-w

Copy link
Copy Markdown
Contributor

@art-w art-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I left some additional comments because I'm still not convinced that the store id validation is correct... In fact it might be easier to add the id field in the store type directly (instead of On_disk), since it's as relevant as the filename to identify the expected index file.
I also did not see the code to report a message to users that their build is out of date? (when we run into an invalid store file)

Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.mli Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
Copy link
Copy Markdown
Contributor

@art-w art-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, this looks great! I have some minor remarks but nothing blocking, in case @voodoos has time for another look :)

Comment thread src/ocaml-index/bin/ocaml_index.ml Outdated
Comment thread src/index-format/granular_marshal.ml Outdated
| Serialized of { loc : int }
| Serialized_reused of { loc : int }
| On_disk of { store : store; loc : int; schema : 'a schema }
| On_disk_ptr of { filename : string; loc : int; id : int }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(cc @voodoos) Do we know if the filenames are absolute or relative? (I wonder if there's a risk that ocaml-index could be called from different folders, such that the pointed filenames are later not found because of the relative paths interpretation)

Copy link
Copy Markdown
Contributor Author

@Tim-ats-d Tim-ats-d Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are relative. We can maybe use Unix.realpath on such filename.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filenames in the locations stored in the indexes are relative unless --root is passed to the indexer.
In dune they should always be relative to the root of the build context (like _build/default).

Comment thread src/index-format/granular_marshal.ml Outdated
Comment thread src/index-format/granular_marshal.mli Outdated
| Serialized of { loc : int }
| Serialized_reused of { loc : int }
| On_disk of { store : store; loc : int; schema : 'a schema }
| On_disk_ptr of { filename : string; loc : int; id : int }
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filenames in the locations stored in the indexes are relative unless --root is passed to the indexer.
In dune they should always be relative to the root of the build context (like _build/default).

Comment thread src/ocaml-index/bin/ocaml_index.ml
Copy link
Copy Markdown
Collaborator

@voodoos voodoos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks fine to me. Just one worry about the new exception handling when reading the indexes for occurrences. Especially since it's a point in time were indexes are much more likely to be out of sync than during aggregation at build time.

Also, it's all or nothing, but for user queries it would be great to still be able to provide partial results if some of the files are out-of-sync. Would that be hard to do ? If so let's just add a comment about it around the place where we read indexes for occurrences.

Comment thread src/index-format/granular_marshal.mli Outdated
Comment thread src/ocaml-index/bin/ocaml_index.ml
Comment thread tests/test-dirs/occurrences/project-wide/union.t Outdated
@voodoos
Copy link
Copy Markdown
Collaborator

voodoos commented Apr 21, 2026

And this definitely deserves a changelog entry.

@voodoos
Copy link
Copy Markdown
Collaborator

voodoos commented Apr 21, 2026

Thanks, can you rebase ? there are some merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants