-
-
Notifications
You must be signed in to change notification settings - Fork 66
Improve use of CharacterEncoding #1735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 1 commit
e02e3a9
586d3a4
96ea4e8
cd526f5
dc9c8ad
1a53c1a
3d4b0a5
0218bd9
1324c41
f58574c
79dcf9d
e7f88e5
0595532
1670da6
d9e7eb5
b01006c
8a33df5
2d82c6d
68cb8b9
cd3cf0d
b743191
ba5d790
17ab410
55ac83b
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,7 +2,6 @@ | |
| """ | ||
| Mathics3 box rendering to plain text. | ||
| """ | ||
|
|
||
| from mathics.builtin.box.graphics import GraphicsBox | ||
| from mathics.builtin.box.graphics3d import Graphics3DBox | ||
| from mathics.builtin.box.layout import ( | ||
|
|
@@ -34,6 +33,43 @@ | |
| add_render_function(FormBox, convert_inner_box_field) | ||
|
|
||
|
|
||
| # Map WMA encoding names to Python encoding names | ||
| ENCODING_WMA_TO_PYTHON = { | ||
| "WindowsEastEurope": "cp1250", | ||
| "WindowsCyrillic": "cp1251", | ||
| "WindowsANSI": "cp1252", | ||
| "WindowsGreek": "cp1252", | ||
| "WindowsTurkish": "cp1254", | ||
| } | ||
|
|
||
|
|
||
| def encode_string_value(value: str, encoding: str): | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This function is a just a proof of concept. The final version should look into the MathicsScanner tables |
||
| """Convert an Unicode string `value` to the required `encoding`""" | ||
| if encoding == "ASCII": | ||
| # TODO: replace from a table from MathicsScanner | ||
| ascii_map = { | ||
| "⇒": "=>", | ||
| "↔": "<->", | ||
| "→": "->", | ||
| "⇾": "->", | ||
| "⇾": "->", | ||
| "⇴": "->", | ||
| "∫": r"\[Integral]", | ||
| "𝑑": r"\[DifferentialD]", | ||
| "⧦": r"\[Equivalent]", | ||
| "×": r" x ", | ||
| } | ||
| result = "" | ||
| for ch in value: | ||
| ch = ascii_map.get(ch, ch) | ||
| result += ch | ||
| return result | ||
|
|
||
| encoding = ENCODING_WMA_TO_PYTHON.get(encoding, encoding) | ||
| result = value.encode("utf-8").decode(encoding) | ||
| return result | ||
|
|
||
|
|
||
| def fractionbox(box: FractionBox, **options) -> str: | ||
| # Note: values set in `options` take precedence over `box_options` | ||
| child_options = {**options, **box.box_options} | ||
|
|
@@ -159,6 +195,9 @@ def string(s: String, **options) -> str: | |
| if value.startswith('"') and value.endswith('"'): # nopep8 | ||
| if not show_string_characters: | ||
| value = value[1:-1] | ||
|
|
||
| if "encoding" in options and options["encoding"] != "Unicode": | ||
| value = encode_string_value(value, options["encoding"]) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looking at this more closely, there may be a deeper problem here. If the Mathics3 string was encoded with Unicode under the user's control, that should remain. If Mathics3 added the Unicode because an operator appeared, that is probably wrong, and the code that added the Unicode should be fixed. So, what is a specific scenario or situation where line 200 is triggered?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Line 200 is triggered when the required encoding is not the standard Unicode. It happens when the SystemCharacterEncoding is not Unicode (for example by setting
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
This paraphrases the if condition. I meant, what is it that is causing an operator to get converted before |
||
| return value | ||
|
|
||
|
|
||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the final idea is that the strings in
format_elementare going to get converted, then I think this is approaching this the wrong way.Instead,
format_elementneeds to take the parametersexpr,form, andencodingto produce boxes that have the appropriate strings in them initially.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, but this doesn't align with how the experiments I showed you suggest WMA works. It does not matter how you create a string or a Box expression; in the end, an encoding pass is applied. And if you do the conversion earlier, a double conversion spoils the result.
Handling encoding at the level of
format_elementis like to modify the underlying structure of a Graphics object, because you know in the end it is going to be converted into a PNG file.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not find anywhere in those experiments that there was a string that was encoded one way, and inside
ToString, it got reencoded, as opposed to being encoded correctly initially.That is not at issue here. What is at issue here is taking a string that was wrongly encoded and re-encoding it.
Consider this example where I set a breakpoint at the location we are discussing:
<String: ""a ≥ b"">is wrong. That should be<String: ""a >= b"">.This is not relevant here. We started with a Mathics3 Expression, and inside
format_element, this expression got turned into an incorrect string, because encoding information indicating that strings are supposed to be ASCII was not respected insideformat_element.Another viable solution might be to have
format_elementnot convert the expressiona >= bto aString, and leave it as anExpressionfor later. But, I am not sure that is possible or correct. I believe only that what is done is incorrect and there's no evidence right now that WMA is reencoding strings instead of encoding them correctly initially.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been looking again this, and again, this is a central misunderstanding: as I see this, the line 28
must return a boxed expression that uses the internal representation (Unicode/UTF-8). Then, the result
<String: ""a ≥ b"">is correct. The encoding is applied in line 31which takes the box expression and converts it into a Python string, in the request encoding.
The advantage of this approach is that all the codepage translation machinary is completely localized in one module. The drawback is that we have to scan each character to see if we need to translate it. But this is how WMA does it, and I guess they developers had very good reasons to do in this way.