TextCharacterEncode

(new to Analytica 5.0)

Converts text to or from many common encodings, including URLs, XML, UTF-8, and NFC (Unicode normalization).

TextCharacterEncode( type, text )

Converts «text» into a special encoded or unencoded form according to «type». Possible values for «type» include:

  • For encoding or decoding URLs: 'URL', 'IRI', 'URL%', '-URL'
  • For encoding XML or HTML: 'XML', '-XML'
  • For UTF-8 encodings: 'UTF-8', 'UTF-8+', '-UTF-8'
  • For Unicode normalized forms: 'NFC', 'NFD', 'NFKC', 'NFKD'
  • Hash codes: 'SHA-1', 'SHA-256' (new to Analytica 6.5)
  • Return «text» with no change: 'None'
  • Unicode character names: 'characterName', '-characterName' (new to Analytica 6.3)

Start «type» with a minus sign, '-', to invert the encoding, i.e., to decode the text.
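
For example, an encoding followed by its corresponding minus form round-trips the original text. The two lines below simply repeat a pair of the URL examples shown further down the page:

TextCharacterEncode('URL', '(1+2) = 3') → "%281%2B2%29+%3D+3"
TextCharacterEncode('-URL', '%281%2B2%29+%3D+3') → "(1+2) = 3"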

Encoding text for inclusion in a URL

«Type» options 'URL', 'IRI' and 'URL%' encode data for inclusion in a URL. The option '-URL' decodes URL data.

Data is often passed in the query string portion of a URL, such as "John Doe" in the following URL:

http://acme.com/somePage?name=John+Doe

Notice that the space is converted to a '+' before it is inserted in the URL. The special characters "!*'();:@&=+$,/?#[]%" each have a special meaning in a URL and so must be converted into text that does not involve those characters. The «type» value 'URL' encodes data according to the RFC-3986 standard.
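
For instance, a space becomes '+' and the reserved character '@' becomes its percent code '%40'. The first line below simply applies the rules just described (it is illustrative, not copied from a run); the second repeats an example from the Examples subsection:

TextCharacterEncode('URL', 'John Doe') → "John+Doe"
TextCharacterEncode('URL', 'test@中文.com') → "test%40%E4%B8%AD%E6%96%87.com"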

If you ever need to pass a URL as a data item in another URL, you must encode all its special characters so they aren't interpreted as part of the outer URL.

This same encoding appears in other standards as well, including the standard for submitting form data in HTTP and the JSON standard, among others.

When using TextCharacterEncode('URL', text), you should encode only the value that will be placed after an equal sign in the query, and nothing more. For example, you should write:

'http://acme.com/somePage?name=' & TextCharacterEncode( 'URL', 'John Doe' )

not

TextCharacterEncode( 'URL', 'http://acme.com/somePage?name=John Doe' )

since in the latter case the characters '?', '=', ':', '/', etc. will all be encoded, which you don't want.
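
To see why, here is what encoding the entire URL would produce. The output below is inferred from the RFC-3986 rules described above rather than copied from a run, so treat it as illustrative:

TextCharacterEncode('URL', 'http://acme.com/somePage?name=John Doe') → "http%3A%2F%2Facme.com%2FsomePage%3Fname%3DJohn+Doe"

The ':', '/', '?' and '=' characters that give the URL its structure are all percent encoded, so the result no longer works as a URL.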

A problem with the 'URL' encoding is that all characters except the letters, digits, and -._~ are percent encoded, making URLs very difficult to read, especially for non-English sites. The 'IRI' option (International Resource Identifier) preserves all but the reserved characters ("!*'();:@&=+$,/?#[]%"), which generally still works correctly for URLs.

The standard URL encoding changes space to a plus character. The 'URL%' option uses percent encoding for space (%20) instead.

The «type» option '-URL' converts the URL-encoded text back into the original text. It works for any of the encodings 'URL', 'IRI' and 'URL%'.

Examples

TextCharacterEncode('URL', '(1+2) = 3') → "%281%2B2%29+%3D+3"
TextCharacterEncode('URL%', '(1+2) = 3') → "%281%2B2%29%20%3D%203"
TextCharacterEncode('-URL','%281%2B2%29+%3D+3') → "(1+2) = 3"
TextCharacterEncode('-URL','%281%2B2%29%20%3D%203') → "(1+2) = 3"
TextCharacterEncode('URL', 'test@中文.com') → "test%40%E4%B8%AD%E6%96%87.com"
TextCharacterEncode('IRI', 'test@中文.com') → "test%40中文.com"
Variable email := "John_Doe@yahoo.com"
Variable website := "http://acme.com?name=johnDoe&type=student"
Variable cityToFind := "San Francisco, CA"
Variable UrlToRead := "http://dataSource.com/query?email=" & TextCharacterEncode( 'URL', email ) & "&site=" & TextCharacterEncode('URL', website) & "&city=" & TextCharacterEncode('URL', cityToFind)
UrlToRead → "http://dataSource.com/query?email=John_Doe%40yahoo.com&site=http%3A%2F%2Facme.com%3Fname%3DjohnDoe%26type%3Dstudent&city=San+Francisco%2C+CA"

Encoding text in XML or HTML

The option 'XML' for «type» encodes data for insertion in XML or HTML. Without this encoding, an XML or HTML parser will attempt to interpret special characters such as '<', '>', '&', and quotes. Also, a few characters falling in control ranges (below ASCII 32 or between ASCII 128 and 159) are automatically converted to entities as required by the standards.

The '-XML' option does the inverse decoding.

Character names

New to Analytica 6.3

You can map an extended character to its name, or a character name to the character, for example:

TextCharacterEncode('characterName', 'Ψ') → 'Psi'
TextCharacterEncode('-characterName', 'Psi') → 'Ψ'

You can use these character names when typing extended characters by typing a backslash, \, the character name, then the TAB key.

When a character has no special name assigned, the 'characterName' encoding returns Null for that character. Analytica reads the character names from the file "characterNames.ini" found in the Analytica installation folder.

Examples

TextCharacterEncode('XML', 'One < Two, <b>Three & Four</b> are "Bigger"' ) → "One &lt; Two, &lt;b&gt;Three &amp; Four&lt;/b&gt; are &quot;Bigger&quot;"
TextCharacterEncode('-XML', 'One &lt; Two, &lt;b&gt;Three &amp; Four&lt;/b&gt; are &quot;Bigger&quot;' ) → 'One < Two, <b>Three & Four</b> are "Bigger"'

UTF-8 encoding

Set «type» to 'UTF-8' to obtain the UTF-8 encoding, or to '-UTF-8' to decode a UTF-8 encoding into the Unicode characters.

The option 'UTF-8+' prepends the UTF-8 Byte Order Mark (BOM). The '-UTF-8' decoding option always removes the BOM if it is present.
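
As a rough check on what 'UTF-8+' adds, the BOM is the three bytes 0xEF 0xBB 0xBF, so inspecting the result with the same Asc(SplitText(...)) idiom used in the examples below should show those bytes at the front. The byte list here is assembled from that assumption plus the byte list in the examples, not from a recorded run:

Asc(SplitText(TextCharacterEncode('UTF-8+', '확률 분포'))) → [0xef, 0xbb, 0xbf, 0xed, 0x99, 0x95, 0xeb, 0xa5, 0xa0, 0x20, 0xeb, 0xb6, 0x84, 0xed, 0x8f, 0xac]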

Examples

TextCharacterEncode('UTF-8', '확률 분포') → "í™•ë¥ ë¶„í¬" { the UTF-8 bytes displayed as individual characters }
Asc(SplitText(TextCharacterEncode('UTF-8', '확률 분포'))) → [0xed, 0x99, 0x95, 0xeb, 0xa5, 0xa0, 0x20, 0xeb, 0xb6, 0x84, 0xed, 0x8f, 0xac]
TextCharacterEncode('-UTF-8', 'í™•ë¥ ë¶„í¬') → "확률 분포"
TextCharacterEncode('UTF-8+', '확률 분포') → "ï»¿í™•ë¥ ë¶„í¬" { 'ï»¿' is the BOM }

Unicode normalization

The «type» options 'NFC', 'NFD', 'NFKC', and 'NFKD' convert text into canonical Unicode normalized forms.

The Unicode standard includes special combining characters that allow a composite character to be constructed from one or more combining characters applied to a main character. As a result, there are often multiple ways to encode the same visible glyph, which is often the case with accented characters. For example, the accented 'a' character 'á' can be obtained with either of the following:

Chr(225) → 'á'
'a' & Chr(0x301) → 'á'

Although these display the same, the first has a text length of 1, the second a text length of 2. The character Chr(0x301) is the acute accent combining character, and can be applied to any character. In fact, it is even possible to apply multiple combining characters to the same glyph.
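
A quick way to confirm this is to compare the text lengths of the two constructions; the lengths follow directly from the statement above, and TextLength is used the same way in the Examples at the end of this page:

TextLength(Chr(225)) → 1
TextLength('a' & Chr(0x301)) → 2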

When you want to ensure that combined characters are used (such as the one character 'á'), set «type» to 'NFC', which stands for Normalization Form C (composed). If the two-character sequence 'a' & Chr(0x301) appears in «text», it is replaced with the single pre-composed character Chr(225).

When you want to ensure that composite characters are split into their individual combining-character constituents, set «type» to 'NFD', which stands for Normalization Form D (decomposed). Hence, for example, the single character 'á' is replaced with the two-character sequence 'a' & Chr(0x301).
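
Putting the two together, here is a small sketch of each direction; the resulting lengths follow from the behavior just described and match the 'á' example at the end of the page:

TextLength(TextCharacterEncode('NFC', 'a' & Chr(0x301))) → 1
TextLength(TextCharacterEncode('NFD', Chr(225))) → 2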

Unicode also includes digraph and ligature characters, such as the ligature 'ﬁ', which is a single-character glyph that contains both 'f' and 'i'. The «type» options 'NFKC' and 'NFKD' expand ligatures and digraphs into their individual characters. None of the four canonical Unicode encodings re-combine character sequences into pre-composed ligatures.

Note: The four Unicode normalization encodings require Windows Vista or later. When running on XP, «text» is returned unchanged.

Hash codes

New to Analytica 6.5

The 'SHA-1' hash returns a 40-character hexadecimal string and the 'SHA-256' hash returns a 64-character hexadecimal string. These are standard hash functions used in various cryptographic contexts, and can be used to verify that text from two sources is the same. 'SHA-1' is no longer considered very secure, since it has been cryptographically broken; however, it is still widely used, for example by the two-factor authentication standard in which authenticator apps provide the second form of authentication.

TextCharacterEncode( 'SHA-1', 'Hello world') → '7b502c3a1f48c8609ae212cdfb639dee39673f5e'
TextCharacterEncode( 'SHA-256', 'Hello world') → '64ec88ca00b268e5ba1a35678a1b5316d212f4f366b2477232534a8aeca37f3c'

Examples

TextLength(TextCharacterEncode( ['NFC', 'NFD'], 'á' )) → [1,2]
TextLength(TextCharacterEncode( ['NFC', 'NFD'], 'může' )) → [4,6]
Chr(0xfb01) → 'ﬁ'
TextLength(TextCharacterEncode( ['None', 'NFC','NFD','NFKC','NFKD'], 'ﬁx your résumé' )) → [14,14,16,15,17]

In the last example, note that the first character in «text» is the pre-composed ligature 'ﬁ'. The 'NFKC' and 'NFKD' cases expand this ligature into two characters.
