Draft
24 commits
e02e3a9
Improve use of CharacterEncoding
mmatera Mar 15, 2026
586d3a4
Merge branch 'master' into fix_ToStringEncoding
mmatera Mar 15, 2026
96ea4e8
Merge branch 'master' into fix_ToStringEncoding
mmatera Mar 16, 2026
cd526f5
Merge branch 'master' into fix_ToStringEncoding
rocky Mar 20, 2026
dc9c8ad
Merge remote-tracking branch 'origin/master' into fix_ToStringEncoding
mmatera Mar 24, 2026
1a53c1a
not finished
mmatera Mar 25, 2026
3d4b0a5
hangle encoding in doctests
mmatera Mar 25, 2026
0218bd9
adjust tests
mmatera Mar 25, 2026
1324c41
commenting out the Mathml tests
mmatera Mar 25, 2026
f58574c
Merge remote-tracking branch 'origin/master' into fix_ToStringEncoding
mmatera Mar 25, 2026
79dcf9d
adding missing module
mmatera Mar 25, 2026
e7f88e5
avoid circular import
mmatera Mar 25, 2026
0595532
using Mathics3-scanner tables. Moving encoding.py to mathics.eval
mmatera Mar 25, 2026
1670da6
remove hard coded table
mmatera Mar 25, 2026
d9e7eb5
last tweaks
mmatera Mar 27, 2026
b01006c
Merge branch 'master' into fix_ToStringEncoding
mmatera Mar 29, 2026
8a33df5
Merge branch 'master' into fix_ToStringEncoding
mmatera Mar 30, 2026
2d82c6d
Merge branch 'master' into fix_ToStringEncoding
mmatera Apr 3, 2026
68cb8b9
parent 6239c8346809177f6497fc594e4cc20b0c9d9686
mmatera Mar 29, 2026
cd3cf0d
parent 6239c8346809177f6497fc594e4cc20b0c9d9686
mmatera Mar 29, 2026
b743191
strip result before the comparison
mmatera Apr 4, 2026
ba5d790
fix wrong character
mmatera Apr 4, 2026
17ab410
Merge remote-tracking branch 'origin/master' into fix_ToStringEncoding
mmatera Apr 4, 2026
55ac83b
merge with handle_encoding_in_docpipeline
mmatera Apr 4, 2026
15 changes: 14 additions & 1 deletion mathics/builtin/atomic/strings.py
@@ -899,7 +899,20 @@ def eval_default(self, value, evaluation: Evaluation, options: dict):
     def eval_form(self, expr, form, evaluation: Evaluation, options: dict):
         "ToString[expr_, form_Symbol, OptionsPattern[ToString]]"
         encoding = options["System`CharacterEncoding"]
-        return eval_ToString(expr, form, encoding.value, evaluation)
+        if isinstance(encoding, String):
+            encoding_str = encoding.value
+            if encoding_str not in _encodings:
+                evaluation.message("$CharacterEncoding", "charcode", encoding)
+                encoding_str = evaluation.definitions.get_ownvalue(
+                    "System`$SystemCharacterEncoding"
+                ).value
+        else:
+            evaluation.message("$CharacterEncoding", "charcode", encoding)
+            encoding_str = evaluation.definitions.get_ownvalue(
+                "System`$SystemCharacterEncoding"
+            ).value
+
+        return eval_ToString(expr, form, encoding_str, evaluation)


class Transliterate(Builtin):
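
The two fallback branches in `eval_form` do the same thing (emit the `charcode` message, then use the system encoding), so the check could be collapsed into a single path. A minimal sketch of that idea, assuming a hypothetical `resolve_encoding` helper; `known_encodings`, `system_encoding`, and `report` are stand-ins for `_encodings`, the value of `$SystemCharacterEncoding`, and `evaluation.message`, and are not part of Mathics3:

```python
from typing import Callable, Collection


def resolve_encoding(
    requested: object,
    known_encodings: Collection[str],
    system_encoding: str,
    report: Callable[[object], None],
) -> str:
    """Return `requested` if it is a known encoding name, else the system default.

    Hypothetical helper: `report` stands in for emitting the `charcode` message.
    """
    if isinstance(requested, str) and requested in known_encodings:
        return requested
    # Wrong type and unknown name share one fallback path.
    report(requested)
    return system_encoding


messages = []
known = {"ASCII", "UTF-8", "Unicode"}
print(resolve_encoding("ASCII", known, "UTF-8", messages.append))
print(resolve_encoding("Klingon", known, "UTF-8", messages.append))
print(resolve_encoding(42, known, "UTF-8", messages.append))
print(messages)
```

Passing the reporting function in as an argument keeps the sketch independent of the `Evaluation` object; in the real method the message call would be made directly.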
4 changes: 2 additions & 2 deletions mathics/eval/strings.py
@@ -23,8 +23,8 @@
 def eval_ToString(
     expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation
 ) -> String:
-    boxes = format_element(expr, evaluation, form, encoding=encoding)
-    text = boxes.to_text(evaluation=evaluation)
+    boxes = format_element(expr, evaluation, form)

@rocky (Member), Mar 25, 2026:

If the final idea is that the strings in format_element are going to get converted, then I think this is approaching this the wrong way.

Instead, format_element needs to take the parameters expr, form, and encoding to produce boxes that have the appropriate strings in them initially.

Contributor (Author):

Okay, but this doesn't align with how the experiments I showed you suggest WMA works. It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied. And if you do the conversion earlier, a double conversion spoils the result.
Handling encoding at the level of format_element is like modifying the underlying structure of a Graphics object because you know that, in the end, it is going to be converted into a PNG file.

@rocky (Member), Mar 25, 2026:


> Okay, but this doesn't align with how the experiments I showed you suggest WMA works.

I did not find anywhere in those experiments that there was a string that was encoded one way, and inside ToString, it got reencoded, as opposed to being encoded correctly initially.

> It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied.

That is not at issue here. What is at issue here is taking a string that was wrongly encoded and re-encoding it.

Consider this example where I set a breakpoint at the location we are discussing:

$ mathics3
...
In[1]:= ToString[a >= b, CharacterEncoding -> "ASCII"]
(/tmp/Mathics3/mathics-core/mathics/eval/strings.py:30:5 @46): eval_ToString
-- 30     try:
(trepan3k) list
 25    	    expr: BaseElement, form: Symbol, encoding: String, evaluation: Evaluation
 26    	) -> String:
 27    	
 28    	    boxes = format_element(expr, evaluation, form)
 29    	    breakpoint()
 30  ->	    try:
 31    	        return String(boxes.to_text(evaluation=evaluation, encoding=encoding))
 32    	    except EncodingNameError:
 33    	        # Mimic the WMA behavior. In the future, we can implement the mechanism
 34    	        # with encodings stored in .m files, and give a chance with it.
(trepan3k) boxes.elements
(<Expression: <Symbol: System`PaneBox>[<String: ""a ≥ b"">]>, <Expression: <Symbol: ...

<String: ""a ≥ b""> is wrong. That should be <String: ""a >= b"">.

> And if you do the conversion earlier, a double conversion spoils the result. Handling encoding at the level of format_element is like modifying the underlying structure of a Graphics object because you know that, in the end, it is going to be converted into a PNG file.

This is not relevant here. We started with a Mathics3 Expression, and inside format_element, this expression got turned into an incorrect string, because encoding information indicating that strings are supposed to be ASCII was not respected inside format_element.

Another viable solution might be to have format_element not convert the expression a >= b to a String, and leave it as an Expression for later. But, I am not sure that is possible or correct. I believe only that what is done is incorrect and there's no evidence right now that WMA is reencoding strings instead of encoding them correctly initially.

Contributor (Author):

> <String: ""a ≥ b""> is wrong. That should be <String: ""a >= b"">.

I have been looking at this again, and I think this is the central misunderstanding: as I see it, line 28

    	    boxes = format_element(expr, evaluation, form)

must return a boxed expression that uses the internal representation (Unicode/UTF-8). Then, the result <String: ""a ≥ b""> is correct. The encoding is applied in line 31

    	        return String(boxes.to_text(evaluation=evaluation, encoding=encoding))

which takes the box expression and converts it into a Python string in the requested encoding.

The advantage of this approach is that all the code-page translation machinery is completely localized in one module. The drawback is that we have to scan each character to see whether it needs to be translated. But this is how WMA does it, and I guess their developers had very good reasons to do it this way.
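
The approach described in this comment (boxes keep the internal Unicode text, and translation happens once, when the text is produced) can be sketched as follows. This is only an illustration under stated assumptions: `render_text` and `UNICODE_TO_ASCII` are hypothetical names, the list of fragments stands in for a box expression, and the real table would come from the Mathics3-scanner data rather than a literal dictionary.

```python
# Hypothetical table of Unicode operator characters and their ASCII spellings.
UNICODE_TO_ASCII = {
    "≥": ">=",
    "≤": "<=",
    "→": "->",
    "⇒": "=>",
}


def render_text(fragments, encoding="Unicode"):
    """Join string fragments and translate characters only at the output step."""
    text = "".join(fragments)
    if encoding == "ASCII":
        # Scan every character; untranslated characters pass through unchanged.
        text = "".join(UNICODE_TO_ASCII.get(ch, ch) for ch in text)
    return text


fragments = ["a", " ≥ ", "b"]  # internal (Unicode) representation
print(render_text(fragments))                    # a ≥ b
print(render_text(fragments, encoding="ASCII"))  # a >= b
```

The fragments themselves are never modified, so the same box structure can be rendered in several encodings without a second conversion being applied on top of the first.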

+    text = boxes.to_text(evaluation=evaluation, encoding=encoding)
     return String(text)


41 changes: 40 additions & 1 deletion mathics/format/render/text.py
@@ -2,7 +2,6 @@
"""
Mathics3 box rendering to plain text.
"""

from mathics.builtin.box.graphics import GraphicsBox
from mathics.builtin.box.graphics3d import Graphics3DBox
from mathics.builtin.box.layout import (
@@ -34,6 +33,43 @@
add_render_function(FormBox, convert_inner_box_field)


+# Map WMA encoding names to Python encoding names
+ENCODING_WMA_TO_PYTHON = {
+    "WindowsEastEurope": "cp1250",
+    "WindowsCyrillic": "cp1251",
+    "WindowsANSI": "cp1252",
+    "WindowsGreek": "cp1253",
+    "WindowsTurkish": "cp1254",
+}
+
+
+def encode_string_value(value: str, encoding: str):

Contributor (Author):

This function is just a proof of concept. The final version should look into the MathicsScanner tables.

+    """Convert an Unicode string `value` to the required `encoding`"""
+    if encoding == "ASCII":
+        # TODO: replace from a table from MathicsScanner
+        ascii_map = {
+            "⇒": "=>",
+            "↔": "<->",
+            "→": "->",
+            "⇾": "->",
+            "⇾": "->",
+            "⇴": "->",
+            "∫": r"\[Integral]",
+            "𝑑": r"\[DifferentialD]",
+            "⧦": r"\[Equivalent]",
+            "×": r" x ",
+        }
+        result = ""
+        for ch in value:
+            ch = ascii_map.get(ch, ch)
+            result += ch
+        return result
+
+    encoding = ENCODING_WMA_TO_PYTHON.get(encoding, encoding)
+    result = value.encode("utf-8").decode(encoding)
+    return result
+
+
def fractionbox(box: FractionBox, **options) -> str:
# Note: values set in `options` take precedence over `box_options`
child_options = {**options, **box.box_options}
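
One detail of the proof-of-concept `encode_string_value` is worth illustrating: in Python, `value.encode("utf-8").decode(codec)` reinterprets UTF-8 bytes in another code page, which yields mojibake for non-ASCII characters and can raise `UnicodeDecodeError` for bytes the code page leaves undefined. A minimal sketch of one alternative, assuming a hypothetical `restrict_to_codec` helper that replaces unrepresentable characters with `?`; whether WMA substitutes `?` or a `\[Name]` form in this case is not established in this thread:

```python
def restrict_to_codec(value: str, python_codec: str) -> str:
    """Keep characters the codec can represent; replace the others with '?'.

    Hypothetical helper, not part of Mathics3.
    """
    # Round-trip through the *same* codec, so representable text is unchanged.
    return value.encode(python_codec, errors="replace").decode(python_codec)


# Mixing codecs: the two UTF-8 bytes of "é" are read as two cp1252 characters.
print("é".encode("utf-8").decode("cp1252"))  # Ã©

print(restrict_to_codec("é", "cp1252"))  # é
print(restrict_to_codec("α", "cp1252"))  # ?
print(restrict_to_codec("α", "cp1253"))  # α
```

Windows-1253 is the Greek code page, which is why `α` survives the `cp1253` round trip but not the `cp1252` one.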
@@ -159,6 +195,9 @@ def string(s: String, **options) -> str:
     if value.startswith('"') and value.endswith('"'):  # nopep8
         if not show_string_characters:
             value = value[1:-1]
+
+    if "encoding" in options and options["encoding"] != "Unicode":
+        value = encode_string_value(value, options["encoding"])

@rocky (Member), Mar 16, 2026:

Looking at this more closely, there may be a deeper problem here.

If the Mathics3 string was encoded with Unicode under the user's control, that should remain. If Mathics3 added the Unicode because an operator appeared, that is probably wrong, and the code that added the Unicode should be fixed.

So, what is a specific scenario or situation where line 200 is triggered?

Contributor (Author):

Line 200 is triggered when the required encoding is not the standard Unicode. This happens when $SystemCharacterEncoding is not Unicode (for example, by setting MATHICS_CHARACTER_ENCODING="ASCII") or when it is called from ToString with a specific CharacterEncoding option.

@rocky (Member), Mar 17, 2026:

> Line 200 is triggered when the required encoding is not the standard Unicode. This happens when $SystemCharacterEncoding is not Unicode (for example, by setting MATHICS_CHARACTER_ENCODING="ASCII") or when it is called from ToString with a specific CharacterEncoding option.

This paraphrases the if condition. I meant, what is it that is causing an operator to get converted before ToString was called. This, I think, is the real source of the problem.

return value

