Lazy fields parsing #2072

Open
poslegm wants to merge 32 commits into scalapb:master from poslegm:fresh-benchmarks

poslegm commented Mar 8, 2026

Hello!

This is a revival of the work to support lazy fields (previous attempt #1376).

Context

In Java protobuf, string fields are handled using LazyField. This mechanism stores the field data as a ByteString and only parses it into a UTF‑8 string when the corresponding getter is called. When a message containing such fields is serialized, the raw ByteString is written directly, without performing UTF‑8 encoding or decoding (source).

Unlike Java protobuf, ScalaPB does not handle string fields lazily. This is an opportunity to reduce overhead when both of the following hold:

  • the protobuf message consists of a large number of string fields;
  • only a small number of fields are actually read through their getters (parse message → read a few fields → serialize).

Such usage patterns are quite common for cloud-native applications.

The essence of the changes

When scalapb.options.lazy_fields is enabled, the generator emits LazyField[String] for string fields. LazyField[T] holds the original ByteString and parses the value lazily, on demand. The PR also introduces implicit conversions so that the generated case classes remain convenient to use.

Example:

message LazyWithRecursion {
  option (scalapb.message) = {
    lazy_fields: true
  };
  string data = 1;
  LazyWithRecursion nested = 2;
}

val original = LazyWithRecursion(data = "a lazy string", nested = Some(LazyWithRecursion(data = "nested string")))

val updated = original.update(_.nested.data := "updated string")

println(updated.nested.get.data == "updated string") // true

How it works with parsing and serialization:

val msg = LazyMessage.parseFrom(bytes)

// No parsing has happened for lazy_field yet.

val serialized = msg.toByteArray // <--- fast serialization without UTF-8 encoding

val upper = msg.lazyField.toUpperCase  // <--- Parsing happens here, on first access.

println(upper)
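The parse → serialize → read flow above can be sketched without any protobuf dependency. This is only an illustration under stated assumptions: LazyStringSketch is a hypothetical stand-in for the PR's LazyField[String], it wraps an Array[Byte] rather than a ByteString, and the decodeCount counter exists only to make the laziness observable.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-in for the PR's LazyField[String]; the real class wraps a
// protobuf ByteString and is generic in T.
final class LazyStringSketch(private val bytes: Array[Byte]) {
  // Counts decodes so that the laziness is observable in this sketch.
  var decodeCount: Int = 0

  // Decoded at most once, on first access.
  lazy val value: String = {
    decodeCount += 1
    new String(bytes, UTF_8)
  }

  // Serialization path: hands back the original bytes, no UTF-8 work.
  def toBytes: Array[Byte] = bytes
}

object LazyStringSketchDemo {
  def main(args: Array[String]): Unit = {
    val field = new LazyStringSketch("a lazy string".getBytes(UTF_8))

    val reserialized = field.toBytes      // nothing has been decoded so far
    assert(field.decodeCount == 0)

    val upper = field.value.toUpperCase   // decoding happens here
    assert(field.decodeCount == 1)

    val again = field.value               // cached, not decoded a second time
    assert(field.decodeCount == 1)

    println(upper)
    println(reserialized.length == again.length)
  }
}
```

The round-trip gain reported below presumably comes from the first two steps: a field that is never read is never decoded or re-encoded.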

Benchmarks

New benchmarks have been added: roundTripScala and roundTripJava. They exercise the full proto lifecycle: parsing followed by serialization. One thing confused me: preparing the data in object ${Message}Test with the toJavaProto method affects the performance results. In my generated code, this method forces ByteString usage while the Java proto is being prepared, so the comparison is not entirely fair. The results for the existing benchmarks have also improved, but I wanted a clearer comparison.

| Round-trip benchmark | Java | Scala |
|---|---|---|
| LargeStringMessage | 9,345 ns/op | 9,088 ns/op |
| LazyFieldsStringMessage (same as LargeStringMessage but lazy_fields: true) | 9,484 ns/op | 2,734 ns/op |

Looks great: more than a 3x speedup 🚀 (9,088 → 2,734 ns/op for Scala, roughly 3.3x). ScalaPB is also slightly faster than Java protobuf even without this change.

Questions

  1. Should I commit the benchmark results into this PR?
  2. Should I add more tests?
  3. Should we keep using toJavaProto in data preparation for benchmarks (object ${Message}Test)?
  4. Anything else?

object ${Message}Test {
val scala = TestCases.make${Message}Scala

val java = protos.${Message}.toJavaProto(scala)
Author

There is a potential problem with data preparation here: the Java protobuf is created by converting from the Scala proto instead of by parsing bytes.

As a result, lazy_fields affects Java serialization performance:

| serializeJava | Java |
|---|---|
| LargeStringMessage | 2,584 ns/op |
| LazyFieldsStringMessage (same as LargeStringMessage but lazy_fields: true) | 1,111 ns/op |

I think this is caused by the forced byte writing in toJavaProto when lazy_fields is enabled (ProtobufGenerator#348L). Either way, it does not look like a clean Java protobuf benchmark.

What about changing this line to val java = Protos.${Message}.parseFrom(bytes)?

java: Boolean = true
): Unit = {
ops.mkdir ! ops.pwd / 'results
val benchmarks0 = if (benchmarks.nonEmpty) benchmarks else testNames
Author

I had trouble running this script with a recent Ammonite version. This odd-looking code made the script run.

Also, the value of the scalapb argument is ignored later on, so I had to hardcode the snapshot version in benchmarks/project/plugins.sbt.

Comment thread AGENTS.md
@@ -0,0 +1,81 @@
# Agents
Author

I can delete it if it is not necessary.


poslegm commented Mar 18, 2026

@thesamet Hello! I would be grateful for a review of these changes.

override def toString: String = value.toString()
// Equality for LazyField[T] is not commutative!
// It is extremely important to use LazyField[T] only with the `-language:strictEquality` enabled for Scala 3 or `-Xfatal-warnings` for Scala 2.
override def equals(other: Any): Boolean = value == other
Contributor

These two functions may force decoding unintentionally, for example if we end up with hash maps of lazy fields. Can we define equals to be true only if the right-hand side is also a LazyField with the same bytes? If the user wants to compare to a string, they should explicitly call .value.
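A minimal sketch of the byte-based equality suggested in this comment. It is not the PR's code: BytesEqLazyString is a hypothetical class over Array[Byte] (the PR uses LazyField[T] over ByteString), and isDecoded is added only so the demo can show that hashing and comparing never touch the decoder.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

// Hypothetical class: equality and hashing are defined on the raw bytes only.
final class BytesEqLazyString(private val bytes: Array[Byte]) {
  private var decoded = false
  def isDecoded: Boolean = decoded

  lazy val value: String = {
    decoded = true
    new String(bytes, UTF_8)
  }

  // True only for another BytesEqLazyString with identical bytes.
  override def equals(other: Any): Boolean = other match {
    case that: BytesEqLazyString => Arrays.equals(bytes, that.bytes)
    case _                       => false
  }

  // Consistent with equals, and computed without decoding.
  override def hashCode: Int = Arrays.hashCode(bytes)
}

object BytesEqDemo {
  def main(args: Array[String]): Unit = {
    val a = new BytesEqLazyString("hello".getBytes(UTF_8))
    val b = new BytesEqLazyString("hello".getBytes(UTF_8))

    println(Set(a, b).size)        // duplicates collapse based on bytes
    println(a.isDecoded)           // still false: nothing was decoded
    println(a.equals("hello"))     // false: a plain String is never equal
    println("hello".equals(a))     // false as well, so equality is symmetric
    println(a.value == "hello")    // explicit decoding for string comparison
  }
}
```

The trade-off is the one discussed in this thread: equality becomes symmetric and cheap, but comparing against a plain String requires an explicit .value.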

poslegm (Author) commented Mar 23, 2026

This is a difficult trade-off between being as close as possible to a drop-in replacement and having the most explicit behavior.

In my opinion, this optimization is heuristic, so it's acceptable to perform decoding under the hood, even if the user doesn't think about it.

I would prefer to aim for being able to enable this flag with minimal codebase changes, and I would document the heuristic nature and the limitations. I'm basing this on Java protobuf, which hides the lazy nature of string fields from the end user.

In any case, your comment was very helpful: I found and fixed a bug in equality and added more tests. Now Set[LazyField[String]] works without unnecessary decoding, while Set[String] cannot be used without decoding in any case.

Comment thread protobuf/scalapb/scalapb.proto Outdated
poslegm requested a review from thesamet April 2, 2026 16:11

thesamet (Contributor) left a comment

Thanks for this PR — the performance gains are real and the approach of routing lazy string fields through customSingleScalaTypeName / TypeMapper is the right foundation. A few issues to address before this is ready to merge.


def toByteString: ByteString = bytes

override def toString: String = value.toString()
Contributor

toString forces lazy decoding. Any logging, string interpolation (s"$msg"), or debug print will silently trigger UTF-8 parsing — defeating the round-trip optimization. Consider s"LazyField(${bytes.size()} bytes)" here and letting callers use .value.toString() explicitly.

override def toString: String = value.toString()
// Equality for LazyField[T] is not commutative!
// It is extremely important to use LazyField[T] only with the `-language:strictEquality` enabled for Scala 3 or `-Xfatal-warnings` for Scala 2.
override def equals(other: Any): Boolean = other match {
Contributor

The non-commutative equals is the most concerning aspect of this public API. The tests explicitly demonstrate that Set[Any](s, lazyS) and Set[Any](lazyS, s) have different sizes depending on insertion order. This is a silent correctness hazard for code using Any-typed collections or comparing against extracted strings. The comment recommends -language:strictEquality, but that's Scala 3 only and opt-in — Scala 2 users have no protection. Please document this limitation prominently in the scaladoc on the class.
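The asymmetry described in this comment can be reproduced in isolation. The sketch below models only the equals shown in the earlier diff context (value == other); LooseEqLazyString and its hashCode are assumptions made for illustration and are not taken from the PR.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical model of a lazy string whose equals compares the decoded value
// against whatever it is given. Not the PR's actual class.
final class LooseEqLazyString(bytes: Array[Byte]) {
  lazy val value: String = new String(bytes, UTF_8)

  // Decodes, then delegates to String equality.
  override def equals(other: Any): Boolean = value == other

  // Assumed here so that hashing agrees with the decoded string.
  override def hashCode: Int = value.hashCode
}

object LooseEqDemo {
  def main(args: Array[String]): Unit = {
    val s: Any     = "hello"
    val lazyS: Any = new LooseEqLazyString("hello".getBytes(UTF_8))

    println(lazyS.equals(s)) // true: the lazy side decodes and compares strings
    println(s.equals(lazyS)) // false: String.equals rejects any non-String
  }
}
```

Because the answer depends on which side the call is made from, any Any-typed collection that mixes the two representations can behave differently depending on insertion order, which is the hazard raised above.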

implicit val stringDecoder: LazyDecoder[String] = _.toStringUtf8()
}

trait LazyEncoder[T] {
Contributor

LazyEncoder is defined but not used anywhere in the implementation (the write path goes through the TypeMapper's toBase which calls .toByteString directly). Either use it or remove it to avoid dead public API surface.

Author

LazyEncoder is already used for the implicit conversions: example

case FieldDescriptor.JavaType.STRING => FunctionApplication(s"$d.PString")
case FieldDescriptor.JavaType.ENUM => FunctionApplication(s"$d.PEnum")
case FieldDescriptor.JavaType.MESSAGE => MethodApplication("toPMessage")
case FieldDescriptor.JavaType.STRING if fd.getContainingType.lazyStringFields =>
Contributor

Mapping lazy string fields to PByteString in singleFieldAsPvalue breaks JSON serialization. Libraries like scalapb-json4s and scalapb-circe call getField (which uses this path) to convert messages to JSON. With this change, lazy string fields would be emitted as base64-encoded bytes instead of JSON strings.

The inverse direction is equally broken: generateMessageReads uses baseSingleScalaTypeName = ByteString for lazy fields, so it looks for PByteString in the field map. But JSON parsers produce PString for string fields — so JSON parsing of messages with lazy string fields would silently return default values.

Fix: keep the PString path for singleFieldAsPvalue (call .value on the lazy field to get the string before wrapping), and handle the PString -> LazyField[String] conversion in generateMessageReads.
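The shape of the proposed fix can be sketched with toy types. Everything here is hypothetical: PValueSketch, PStringSketch and PByteStringSketch only imitate the string and bytes cases of ScalaPB's PValue hierarchy, LazyStr stands in for LazyField[String], and the two functions are not the generator's actual output.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object PValueBridgeSketch {
  // Toy stand-ins for the string and bytes cases of a PValue-like hierarchy.
  sealed trait PValueSketch
  final case class PStringSketch(value: String) extends PValueSketch
  final case class PByteStringSketch(value: Vector[Byte]) extends PValueSketch

  // Hypothetical lazy string over raw bytes.
  final class LazyStr(val bytes: Array[Byte]) {
    lazy val value: String = new String(bytes, UTF_8)
  }

  // Write side: decode and expose a string value, so a JSON printer sees text
  // rather than bytes that it would base64-encode.
  def toPValue(field: LazyStr): PValueSketch = PStringSketch(field.value)

  // Read side: accept the string value that a JSON parser produces and wrap it
  // back into the lazy representation.
  def fromPValue(pv: PValueSketch): Option[LazyStr] = pv match {
    case PStringSketch(s) => Some(new LazyStr(s.getBytes(UTF_8)))
    case _                => None
  }
}

object PValueBridgeDemo {
  import PValueBridgeSketch._

  def main(args: Array[String]): Unit = {
    val field = new LazyStr("json-friendly".getBytes(UTF_8))
    val pv    = toPValue(field)
    println(pv)                           // a string-typed value
    println(fromPValue(pv).map(_.value))  // round-trips back to the same text
  }
}
```

This path decodes the field, which seems acceptable for reflection-style access such as JSON conversion, since the binary parse and serialize paths are not involved.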

val bytes = original.toByteArray
val parsed = LazyRepeated.parseFrom(bytes)

def f(str: Seq[String]): Int = str.length
Contributor

parsed.items has type Seq[LazyField[String]] (because lazyStringFields is enabled at file level), but f expects Seq[String]. There is no Conversion[Seq[LazyField[String]], Seq[String]] defined — implicit conversions don't apply element-wise to collections. This should fail to compile. Please verify this test actually compiles and passes.

Author

Yes, it compiles and passes with both Scala 2 and Scala 3. I verified it with sbt e2eJVM2_13/testOnly LazyStringFieldsSpec and sbt e2eJVM3/testOnly LazyStringFieldsSpec. The conversions are defined here and here.
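For readers following this exchange: an element-level conversion alone does not lift to collections, but a conversion declared on the collection type does make such a call compile. A minimal sketch, assuming hypothetical names (LazyStr, LazyConversionsSketch, lazySeqToStrings); the PR's actual conversions are the ones linked above and may be shaped differently.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.language.implicitConversions

// Hypothetical lazy string over raw bytes.
final class LazyStr(bytes: Array[Byte]) {
  lazy val value: String = new String(bytes, UTF_8)
}

object LazyConversionsSketch {
  // Element-level conversion: applies to a single LazyStr only.
  implicit def lazyToString(l: LazyStr): String = l.value

  // Collection-level conversion: this is what lets a Seq[LazyStr] be passed
  // where a Seq[String] is expected. It decodes every element.
  implicit def lazySeqToStrings(xs: Seq[LazyStr]): Seq[String] = xs.map(_.value)
}

object ConversionDemo {
  import LazyConversionsSketch._

  def f(str: Seq[String]): Int = str.length

  def main(args: Array[String]): Unit = {
    val items: Seq[LazyStr] = Seq("a", "b").map(s => new LazyStr(s.getBytes(UTF_8)))
    println(f(items)) // 2
  }
}
```

Note that passing the whole sequence this way forces decoding of all elements, so it trades the lazy-parsing benefit for source compatibility.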

val bytes = original.toByteArray
val parsed = LazyDictionary.parseFrom(bytes)

parsed.stringToInt should contain theSameElementsAs originalStringToInt
Contributor

With file-level lazy_string_fields: true, the map entry messages for LazyDictionary also inherit the setting (via message.getFile.scalaOptions.getLazyStringFields). This means stringToInt would be typed as Map[LazyField[String], Int] instead of Map[String, Int]. Map lookups by String key would silently fail at runtime because HashMap calls String.equals(LazyField[String]) on the stored keys, which returns false.

Consider whether lazyStringFields should be excluded from map entry key fields, and add a test that does .get("hello") on such a map to catch this.

Author

Lookups already work because implicit conversions kick in on .get("hello") calls. But you are right: this behaviour is error-prone because of the untyped-collection problem.

I excluded map keys from lazy parsing. Now parsing with lazy_string_fields: true produces Map[String, T] for map<string, t> proto fields.

@@ -0,0 +1,316 @@
[
Contributor

Please don't commit benchmark result JSON files to the main branch. They bloat the repository (4 files × ~316 lines each) and will create churn every time benchmarks are re-run. Consider a dedicated benchmark-results branch or linking to them from the PR description instead.
