TextDistance


( new to Analytica 5.0 )

TextDistance( t1, t2, substitution, deletion, insertion, transpose, default, L1, L2 )

Returns the measure of how different text «t1» is from text «t2», which is called the lexical distance or edit distance between the two texts.

With no optional parameters specified, the result of TextDistance(t1,t2) is known as the Levenshtein distance, which is the minimum number of character insertions, deletions or substitutions required to change «t1» into «t2». The result is 0 when «t1» and «t2» are equal, or the number of edit steps otherwise.

The number of edit to change «t1» into «t2» when only character insertions and deletions (i.e., substitutions are not allowed) are allowed is given by

TextDistance(t1, t2, substitution:false )

The Hamming distance between two strings of equal length, equal to the number of positions at which the symbols are different, is given by

TextDistance(t1, t2, insertion:false, deletion:false, substitution:true )

The Damerau-Levenshtein distance, which also allows transpositions of adjacent characters in addition to the other operations of substitution, insertion and deletion, is obtained using

TextDistance(t1, t2, transpose:true )

The length of the Longest common subsequence is obtained using

(TextLength(t1&t2) - TextDistance(t1, t2, substutition:false) ) / 2

The longest common subsequence is the longest ordered sequence of characters, not necessarily adjacent, found in both «t1» and «t2».

Other combinations are possible by specifying True or False for the optional parameters:

  • «substitution»: (Default: 1) Set to 0 to disable character substitutions as an edit operation, or 1 to enable.
  • «deletion»: (Default: 1) Set to 0 to disable character deletions, or 1 to enable.
  • «insertion»: (Default: 1) Set to 0 to disable character insertions, or 1 to enable.
  • «transpose»: (Default: 0) Set to 0 to disable transposition of adjacent characters, 1 to enable. When enabled, minimal edit distance is guaranteed only when «deletion» and «insertion» are also enabled.

With some combinations, it might not be possible to transform «t1» into «t2» -- for example when the lengths are different and only character substitutions are allowed. In such a case, INF is returned.

Examples

TextDistance('cation', 'acting') → 4
{ cation → ation → action → actinn → acting }
TextDistance('cation', 'actino', transpose:true) → 2
{ cation → action → action }

Non-equal costs for edit operations

You can specify the cost of each type of edit operation explicitly by passing a positive number to each corresponding parameter. For example, to allow insertions and deletions, but at a cost of 10 times the cost of substitutions, you can use

TextDistance(t1, t2, substitution:1, insertion:10, deletion:10 )

This would result in 3 when comparing "abc" to "bcd", since 3 substutitions is cheaper than one deletion and one insertion (which would be 20).

When the cost for «insertion» is different from the cost for «deletion», the distance measure is non-symmetric -- I.e., the cost to go from «t1» to «t2» may be different from the cost to go from «t2» to «t1».

Specifying a cost of 0 is the same as specifying a cost of INF -- it disables the operation. But, it is possible to specify a zero cost (e.g., deletions are free) by passing in an array of zeros. This is covered in the next subsection.

Examples

Transpositions allowed, but counted as 1.5 steps:

TextDistance('ca','abc',transpose:1.5) → 2.5

Insertions and deletions are twice as expensive as substitutions:

TextDistance('abcde','bcdef', insertion:2, deletion:2 ) → 4
TextDistance('abc','bcd', insertion:2, deletion:2 ) → 3

First case delete and insert. Second case, substitute three times.

Costs that vary by character

You can specify that the insertion of the letter 'a' costs more than the insertion of the letter 'b', etc., by passing an array of costs to «insertion» that is indexed by «L2», and also passing an index to «L2» that has a list of letters. The letters are case-sensitive, so you will need to include both lower-case and upper-case letters. When «L2» includes the element Null, the corresponding entry in the array passed to «insertion» is used for any letter not explicitly listed. If it does not include «Null», the value passed to the optional parameter «default» is used, or the value 1 if «default» is not specified. Note that the index «L2» is used for the letters that appear in «t2». When zero costs appear in the array, they are treated as zero cost (i.e., that there is no cost to insert the corresponding letter).

To specify costs for «deletion» that vary by the letter being deleted, pass an array to «deleted» that is indexed by «L1», and also pass an index containing letters to the «L1» parameter. The entry corresponding to «L1»=Null is used when a letter in «t1» is deleted that is not listed in «L1».

Substitutions can vary by either or both the letter before and letter after the substitution. To specify costs that vary by both, pass an array that varies by both «L1» and «L2» to the «substitution» parameter, with the cost of each substitution in the corresponding cell. If the costs vary only by the original letter, then the array should be indexed only by «L1». If the costs vary only be the letter after the substitution, then the array should be indexed only by «L2».

Transpositions can also vary by either of both the first and second letter being transposed. For example, you could specify that transposing "th" to "ht" is more expensive than other edits. To specify transposition costs that vary by both letters, the array passed to «transpose» should be indexed by both «L1» and «L2». «L1» is used for the leftmost letter as it appears in «t1» before the transposition.

Each of these four parameters treats its indexing differently from how other Analytica array functions do, in that the parameter passed is not implicitly indexed by «L1» or «L2». So, TextDistance(t1, t2, deletion:2, L1:L1) is not equivalent to TextDistance(t1, t2, deletion:Array(L1,2), L1:L1). The difference occurs when a character in «t1» is not in listed in «L1». When the parameter does not contain the index, the value passed is used even in this case, whereas when it is indexed by «L1», the deletions[L1=Null] entry is used if «L1» contains Null, or else «default» is used.

Examples

In this example, we will only tally differences in vowels. Edits involving consonants are free, any edit involving a vowel costs 1.

Index Letter1 := 'a'..'z'
Index Letter2 := 'a'..'z'
Index Vowel := ['a', 'e', 'i', 'o' ,'u']
TextDistance('differences', 'similarities',
                         substitution: @[Vowel=Letter1] or @[Vowel=Letter2],
                         deletion: @[Vowel=Letter1]>0,
                         insertion: @[Vowel=Letter2]>0,
                         L1:Letter1, L2:Letter2 )
→ 4

Note that the @[I=x] operator returns the position of x in index I, or 0 when not present, so that the expression @[Vowel=c]>0 means "c is a vowel". Notice that the result is the same as using the vowels only:

TextDistance('ieee','iiaiie') → 4

The above example can be encoded using «L1» and «L2» indexes with only the relevant characters as follows

index Vowel1 := [Null, 'a', 'e', 'i', 'o', 'u']
index Vowel2 := [Null, 'a', 'e', 'i', 'o', 'u']
 TextDistance('differences', 'similarities',
                substitution: (Vowel1<>null) or (Vowel2<>null),
                deletion: Vowel1<>Null,
                insertion: Vowel2<>Null, 
                L1:Vowel1, L2:Vowel2) 
→ 4

The Null entry in the each of the two indexes holds the cost for all the characters that aren't explicitly listed. In this case, the null entry in (Vowel1<>Null) is 0, but is one for each of the vowel entries. When creating an index of letters with a null entry, you need to be careful to ensure that the cell containing Null actually has Null and not the text 'Null'. Because a list of letters displays in Analytica without quotes, if you simply enter Null into a cell in a list-of-text, the cell will contain text, not the identifier Null. Hence, you'll need to change the List-of-text to a plain List before inserting the cell with Null, or view it in expression view.

In the next example, spaces cannot be altered. Hence, the result will be INF when the number of words differs. But then, edits occur in each word only. It would actually be more efficient to split the inputs into words, find the TextDistance for each pair, and sum the result, but this is used to illustrate an INF cost for prohibiting alterations to certain characters.

Index Space1 := [Null, ' ']
Index Space2 := [Null, ' ']
Variable Cost1 := If Space1=' ' Then INF Else 1
Variable Cost2 := If Space2=' ' Then INF Else 1
 TextDistance('one two three', 'uno dos tres',
                               substitution: Cost1*Cost2,
                               deletion: Cost1,
                               insertion: Cost2, 
                               L1:Space1, L2:Space2) 
→ 7

Limitations with Transpose operations

When you enable «transpose» operations, there are some situations in which the algorithm is not perfect. The algorithm is only guaranteed to find the minimum number of edit operations when the cost of a transpose operation is greater than the average cost of insertion and deletion operations, where the cost is considered to be infinite if either is disallowed.

The function will not consider most transposition opportunities when «insertion» or «deletion» is completely disabled. Hence, if «substitution» is also disabled, the result is usually INF, even though a sequence of edit operations may exist. These are limitations of the current algorithm.

The function does not signal an error in these cases. It simply returns the best solution it can find given its current algorithm, which is an upper bound to the minimum edit distance. But, to play it safe, you should probably keep «insertion» and «deletion» enabled, and ensure that any customized costs for «transpose» are greater than costs for «insertion» and «deletion».

See Also

Comments


You are not allowed to post comments.