Export-Import data format


under construction -- do not rely on this information yet

This page provides a detailed specification of Analytica's multidimensional data file format, which we refer to as the Export-Import data format. This format is also described in Import and Export data in the Analytica User Guide, largely by way of examples. Here we provide a more complete and detailed specification of the format as it exists as of Analytica 4.5. There have been some revisions to this spec in Analytica 4.5 -- see Differences from 4.4 and earlier.

The format is used by

Format specification

Conventions

In the format specification that follows, we use the following conventions:

  • <name>: Indicates a pattern that is defined separately
  • +, *, ? : When these characters appear at the end of a line, it means the entire line is repeated a variable number of times. These qualifiers have the following meanings:
  • + : Repeated 1 or more times
  • * : Repeated 0 or more times
  • ? : Repeated 0 or 1 times
  • | : Disjunction (or) -- A|B means A or B can appear at that spot.
  • LiteralText : These characters appear in the file verbatim.
  • New line characters in the file are explicitly specified as <newline>. when two formal patterns appear on separate lines, this does not imply newline at that position. There often will be, but it will be given explicitly, either in the pattern, or in the subpattern.

For example, when a pattern is given as:

<item><valueList><newline>+

it means that all three subpatterns are repeated one or more times (it isn't just the last one).

Formal Specification

The export file has the following format:

TextTable <view> <ident> <newline>
<block>+

This says that the file format consists of a first line that starts with the word TextTable. After that, it consists of repeated <block>s. <ident> is the Analytica identifier of the table that was exported.

 <view> := Definition|EditTable|DetermTable|ProbTable|IntraTable|Value|Mid|Mean|ProbBands|Statistics|PDF|CDF|Sample|ProbValue

View identifies the type of table that was exported. This doesn't really have any impact on the import, it is more just for information.

 <newline> : May be CR, LF or CRLF.  In files, the PC convention is usually CRLF.  CR is ascii 13, LF is ascii 10.

 <block> := 
        <slicerList>?
        <colIdent><valueList><newline>?          { the column headers }
        <rowIdent><newline>?
        <item><valueList><newline>+              { the table data, the first <item> is the row header index value }

The format captures the array in a certain pivot, so that one index is a column index and one index is a row index. All other indexes are slicers. The slicer indexes are listed first, then the column index and then the row index. The column headers line lists the index values for the column index. It is possible to have a pivot in which there is no column index -- for example, a 1-D array would have only one index, which would usually be the row index. When there is no column index, the column headers line does not appear (this is why the ? appears at the end). Similarly, a given pivot might not have a row index. The row identifier line appears if and only if a row index is present. The row index values appear in the <item> column of the data rows. Each block will have one data row for each row index item. If there is no row index, then there will be only one data row, and <item> will be empty so that the line will actually start with a <tab> character.

The column headers line and the <rowIdent> row usually appear in every block, and appear identically in every case. This information is therefore redundant, and hence, they are optional after the first block. It is an error for these lines to contain different information in two different <block>s. If the ident is different or the index value are different, then the file is not in a valid format. The row headers, i.e., <item>, are also duplicated identically in every block, so these are also redundant, and again, they must be identical in every block or file is not in a valid format. If the <rowIdent> line does not appear in a block after the first, then the <item> must also not appear or it is an error; conversely, if the <rowItem> does appear, then the <item> must appear and be identical to previous blocks.

<slicerList> :=
<indexIdent><tab><value><newline>+

The slicer list provides the dimensions other than the row and column indexes, and for each block it also specifies the slicer value that pertains to this block. Each <block> contains a 2-D table of data, so a 3-D table consists of several blocks where the third dimension is a slicer index, and each block is one slice along the slicer index. A 4-D table would have two slicer indexes, so the slicer list would specify two "coordinates". In general, each block can actually contain the data of a single scalar value, a 1-D vector, or a 2-D table, depending on whether row and column indexes are present in the pivot.

The <indexIdent> is the identifier of the index. The export format requires unique names for all indexes (in theory, if local indexes are used, it is possible to have multiple indexes with the same identifier, but you can't export those). <value> is an index value -- the slicer positions are recorded by value, not by position. This also means that elements of the indexes you use as slicers must be unique.

In the first block, the list of slicers must include all slicer indexes. In the blocks after that, only the slicers that have different values from the previous block need to be listed. But the slicer items must be incremented in order, and without skipping any. So, for example, if I := ['a', 'b', 'c'], then the first block must have I<tab>'a', the first time I is incremented it must be to I<tab>'b' and the third time it must be to I<tab>'c'. This may then repeat from the beginning, but not in any other order.

 <valueList> := 
        <tab><value>+

 <value> := <number>|"<text>"|<expression>|~<expression>~|«null»|<blank>

 <item> := <number>|"<text>"|<non-ident expression>|~<expression>~|«null»

A <value> encodes one cell in a table. In <item> encodes one item (cell) of an index. A <number> is any of Analytica's numeric formats, including suffix format, or the canonical date-time format.

"<text>"

Text values are delineated with quotation marks. The text inside cannot contain any tabs, newlines internal quotes or backslashes. If the text contains any of these characters, they need to be escaped using \t, \n, \r, \\ or \". As an alternative, a single quote (') can be used for the start and stop quote, in which case any single quote in text must be escaped using \'. These are the only escape sequences recognized. As an example, the two line text that contains a tab character and an internal quote would be encoded as "I said:\t\"Hello\rworld\"".

<expression>
<non-ident expression>

An <expression> is any Analytica expression, as might appear in a definition or edit table cell. An example might be Normal(10, 3). Expressions include as special cases numbers, text, identifiers, function calls, references, Null, etc. We've called out <number> separately even though it would actually fall into this case. We've also called out "<text>" separately even though it to is a special case of expression; however, a pure text string has some slightly different character escape rules, and thus it makes sense to pull it out separately. The <non-ident expression> in <item> means any expression other than a single identifier. So, for example, Price would not qualify as a <non-ident expression>, while Price + Tax would.

When you have an Analytica expression that does not qualify given the restrictions on what can appear in <expression> or <non-ident expression>, it can be surrounded by tildes (~) -- in the same way <text> is surrounded by quotes, and then problematic characters within the expression can be escaped. The surrounding tildes are not part of the expression itself. Tabs, newlines and backslashes must be replaced with \t, \r, \n, or \\. Tildes may optionally be escaped using \~. These are the only escape sequences recognized.

The restrictions and the tilde convention are there to prevent potentially ambiguous situations.

A multi-line expression might appear as:

<tab>~Index x := 1..100;\rConcat(x, x + 1000)~

Examples

There is a small limitation with these examples in that a tab character doesn't display as a tab. Instead it looks like a bunch of spaces. For now, use your imagination. Maybe we can eventually resolve that by using color to show which gaps are tabs.

Scalar value

This format isn't of great use for non-array values, but a scalar can be represented. This is the simplest table, since no slicers, no row index and no column index appear. There is a single block, containing a single line, with a blank row header so that the second line starts with a <tab> character.

 TextTable Value Profit
        1.345M

Column vector

Here is an example of a one-dimensional array, pivoted so that the sole index is the row-index. There is only one <block> since there are no slicers, and there is no column index.

 TextTable EditTable House_cost_inputs
 House_inputs                                 ''<font color="green"> ← row index identifier</font>''
 "Prop tax"          3400
 "Tax rate"          0.44
 "Maintenance"       4000
 "Interest"          0.105
 "Appreciation"      0.08

The underlying model for this is:

Index House_inputs := ["Prop tax", "Tax rate", "Maintenance", "Interest", "Appreciation"]
Variable House_cost_inputs := Table(House_inputs)(3400, 0.44, 4000, 0.105, 0.08)</code<

Row vector

The same 1-D array can be recorded as a row vector. Again there is only one block since there is no slicers, and in this case there is also no row index.

 TextTable EditTable House_cost_inputs
 House_Inputs         "Prop tax"     "Tax rate"     "Maintenance"  "Interest"     "Appreciation"
                      3400           0.44           4000           0.105          0.8

The row containing the data starts with a <tab>.

Single slicer

The same 1-D array can also be recorded as a scalar <block>, where the sole index is represented as a slicer. Here there is no row nor column index. This is not as compact, but it illustrates the basics of slicer indexes. Here we have five <block>s, each having one cell.

 TextTable EditTable House_cost_inputs
 House_Inputs         "Prop tax"
                      3400
 House_Inputs         "Tax rate"
                      0.44
 House_Inputs         "Maintenance"
                      4000
 House_Inputs         "Interest"
                      0.105
 House_Inputs         "Appreciation"
                      0.8

2-D array

A 2-D array can be represented in a single <block>. The column index and the row index are shown.

 TextTable Value Mortgage
 Down_payment  20000           45000           60000
 Buying_price
 200000                180000        155000          140000
 400000                380000        355000          340000
 600000                580000        555000          540000

3-D array

Here there is one slicer index, Buying_price, and three <block>s, one for each possible Buying_price.

 TextTable Mean Net_diff
 Buying_price          200000
 Years_owned           5              10              15
 Down_payment
 20000                 10112          12160           13525
 45000                 10093          12158           13540
 60000                 10073          12157           13555
 Buying_price          400000
 Years_owned           5              10              15
 Down_payment
 20000                 10180          14201           16867
 45000                 10160          14199           16882
 60000                 10141          14198           16897
 Buying_price          600000
 Years_owned           5              10              15
 Down_payment
 20000                 10248          16242           20209
 45000                 10228          16241           20224
 60000                 10208          16239           20239

3-D with two slicers

Using the same example as for 3-D, here we show the same data with two slicers and a column index, no row index. This illustrates how the slicers are incremented in each block. Notice that the outer-most slicer, Buying_price, only appears in the blocks where it changes.

 TextTable Mean Net_diff
 Buying_price          200000
 Down_payment          20000
 Years_owned           5              10              15
                       10112          12160           13525
 Down_payment          45000
 Years_owned           5              10              15
                       10093          12158           13540
 Down_payment          60000
 Years_owned           5              10              15
                       10073          12157           13555
 Buying_price          400000
 Down_payment          20000
 Years_owned           5              10              15
                       10180          14201           16867
 Down_payment          45000
 Years_owned           5              10              15
                       10160          14199           16882
 Down_payment          60000
 Years_owned           5              10              15
                       10141          14198           16897
 Buying_price          60000
 Down_payment          20000
 Years_owned           5              10              15
                       10248          16242           20209
 Down_payment          45000
 Years_owned           5              10              15
                       10228          16241           20224
 Down_payment          60000
 Years_owned           5              10              15
                       10208          16239           20239

Index Correspondence

When you import the data for a multi-D array from an export-import format into an existing table, it is necessary for Analytica to determine the correspondence between the indexes that appear in the file and the indexes of your table. The same correspondence must occur when you paste data copied using Copy Table (which uses this same format).

The algorithm(s) used for making this correspondence have varied over the years through various Analytica releases. I don't think there has been a formal specification prior to this attempt. The following specification applies to Analytica 4.5.

  1. If an identifier matches an existing table index, accept this correspondence (even if the values differ)
    1. If there are a different number of values, and the index cannot be updated as part of the import, an error occurs.
    2. If there are the same number of index values, but the actual values differ, and it is possible to change the index values as part of the import, ask user if index values should also be updated from file. This is optional.
    3. If there are the same number of index values, but the actual values differ, and it is not possible to change the index value as part of the import, then just leave them.
    4. Incoming data retains its positional order. It is not spliced onto a new ordering of index values.
  2. If there is an index that has the same name as the table exported (or a Domain of X, where X is the table exported), this indicates that the exported table had a self index. This can be corresponded with the receiving table's Self (or domain) index.
    1. The self-index will be updated from the incoming file, without question, if the values aren't identical.
    2. A domain will be updated from the incoming file if the current domain is an explicit list. This may be worth asking the user.
  3. If after applying the first two rules, all indexes in the file are corresponded, but additional table indexes remain, then
    1. If all extra indexes are currently slicers, ask user whether he wants to remove other indexes from the table, or copy into the current slice (or cancel).
    2. If the extra indexes are row or column pivots, then ask user whether he wants to remove the other indexes from the table. (Copy into slice isn't an option).
  4. For remaining indexes after above steps, if identifiers don't match but the number of index elements is the same as those in the receiving table, when the indexes are taken in the pivot order, and the number of dimensions is the same, then correspond these by position, for a full correspondence.
  5. If the row & column indexes are the same length (but different idents), correspond them.
  6. If after all the above rules, there is still a remaining uncorresponded index in the data file, but its identifier matches a global index that isn't in the table, make that correspondence. If this completes the correspondence, then ask the user if he wishes to add these indexes to the table.
  7. If the above rules fail to identify a full correspondence, fail with an error -- "the indexes or index sizes and pivot don't match".

Differences from 4.4 and earlier

When all index and table values are numeric, the old spec and this new spec coincide. The spec was revised for Analytica 4.5 and fixes several ambiguities related to non-numeric values.

The most obvious change is that the old spec did not require text values to be quoted. Since identifiers and expressions are also not quoted, this gave rise to numerous ambiguities. When a potential identifier appeared, is that an identifier or text? Is 1.23 the numeric value 1.23 or the textual sequence of characters "1.23"?

The ambiguities from not always quoting text values created other local ambiguities, in which it was not possible in all cases to determine from local information alone how many elements the row index had, and where a <block> ended and the next one began. These were not globally ambiguous -- i.e., it was possible to resolve these by looking through the entire file to the end first -- but it complicated parsing. Some of these differences such as requiring text values to be quoted and identifiers appearing as <item> to be tilde escaped solve these ambiguity problems. The ambiguous cases arose only rarely, but that isn't really an excuse for allowing ambiguity to exist in the format.

The old format did not support multi-line cell values (or index items), nor did it handle tab characters within text values, etc. It also did not handle expressions, which was particularly important if you had an edit table of distributions that you wanted to export and import.

Many of the details pin things down that were never clearly specified previously (and in many cases, not handled when they occurred).

The tilde escapes for expressions is new, as is the support for escaping of characters within text values. The handling of a <blank> cell is different (this is a blank cell, rather than the empty text ("").

The specification of how indexes are corresponded is new and different, hopefully better thought out.

Files exported by Analytica 4.5 conform to this new spec. They are not guaranteed to be readable in 4.4 or earlier, although most will be, especially if they are numbers only, don't contain identifiers or expressions, etc.

Files exported by Analytica 4.4 or earlier and then imported into Analytica 4.5 may not correspond to the spec here, and are not guaranteed to work. Tables with numbers only mostly will. Unquoted text values will be rejected.

See Also

Comments


You are not allowed to post comments.