🚀 Describe the new functionality needed
The remote::vertexai inference provider does not implement openai_chat_completions_with_reasoning(). When the Responses API is used with reasoning enabled (e.g., reasoning.effort = "high") and vertexai is the inference backend, the Responses layer catches the NotImplementedError and falls back to regular chat completion, silently discarding reasoning content.
Gemini models already support thinking via ThinkingConfig, and the vertexai provider already extracts thinking parts into delta.reasoning_content on streaming chunks. The missing piece is the wrapper method that the Responses layer expects.
💡 Why is this needed? What if we don't build it?
Without this, vertexai users cannot use the Responses API's reasoning features. The provider silently falls back to non-reasoning chat completion, which is confusing since Gemini models fully support thinking. All other major providers (openai, bedrock, ollama, vllm) already implement this method.
Other thoughts
Gemini's thinking API supports opaque thought_signature tokens for faithful reasoning replay across turns. The llama-stack Responses layer does not capture or propagate these, so multi-turn reasoning replay uses plain text with thought: True as a lossy approximation. This matches the approach other providers take with their own reasoning fields.
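To make the lossy approximation concrete, here is a hypothetical sketch of how replayed reasoning could be turned back into a Gemini-style content part. The function name and dict shape are illustrative; the key point is that thought: True survives the round trip while the opaque thought_signature does not.

```python
def reasoning_to_gemini_part(reasoning_content: str) -> dict:
    """Rebuild a Gemini thought part from replayed reasoning text (sketch).

    Faithful replay would also carry the opaque signature, roughly:
        {"text": ..., "thought": True, "thought_signature": <opaque token>}
    but the Responses layer never captured it, so only the text and the
    thought flag can be reconstructed.
    """
    return {"text": reasoning_content, "thought": True}


part = reasoning_to_gemini_part("First, check whether the input is cached.")
```

This mirrors what other providers do with their own reasoning fields: the text of the reasoning is preserved across turns, but any provider-side integrity token is dropped.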