Difference between revisions of "Relational to Array Dialog"

(Created page with "category:Windows and dialogs ''New to Analytica 6.3'' This dialog is used to convert a relational table or "fact table" to a multi-dimensional array. <center>imag...")
 
 
(8 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
''New to [[Analytica 6.3]]''
 
''New to [[Analytica 6.3]]''
  
This dialog is used to convert a relational table or "fact table" to a multi-dimensional array.
+
This dialog is used to convert a relational table or "fact table" to a multi-dimensional array. The dialog is essentially a "wizard" that helps you create a call to [[MdTable]] to perform the transformation, which you can do entirely from the dialog without writing any expressions yourself.  
  
 
<center>[[image:MdTable Wizard Project Route Length.png]]</center>
 
<center>[[image:MdTable Wizard Project Route Length.png]]</center>
Line 18: Line 18:
 
:[[image:Project Route Length relational table.png]]
 
:[[image:Project Route Length relational table.png]]
 
The project and route columns are "index-like", where as the Length column contains quantitative data.
 
The project and route columns are "index-like", where as the Length column contains quantitative data.
 +
 +
== Fact tables ==
 +
The dialog also handles the conversion of a ''fact table'' to an array. The term ''fact table'' comes from OLAP literature. A fact table is a relational table having more than one value column, and in which the multiple value columns are all instances of the same quantity varying across one additional index (often a time index). A fact table allows one dimension to be embedded in each record, rather than having to have a separate row for every cell.
 +
 +
(TO DO: Insert an image)
  
 
== Accessing the dialog ==
 
== Accessing the dialog ==
Line 32: Line 37:
 
** Column names must be unique (no duplicates)
 
** Column names must be unique (no duplicates)
 
** (A few others, TBD)
 
** (A few others, TBD)
 +
 +
== Column types ==
 +
In the main part of the dialog, you specify for each column of your relational table whether the column should be treated as an index, as a value or ignored.  For each index column, you have the options of using an existing index in your model, creating a new index, or using a new local index. These are each covered in turn below.
 +
 +
One of the nicest things about using this dialog is that it can take care of creating the index nodes for you, eliminating a few steps when writing an expression yourself that calls [[MdTable]].
 +
 +
An index should have only unique values (no duplicate elements). An index column in a relational table will have many duplicates. In general, an index should have all the values that appear in the corresponding column of the relational table.
 +
 +
The dialog looks at your data and makes an educated first guess about how to treat each column.
 +
 +
=== Existing index ===
 +
 +
Use this option when you already have an index in your model for the column. The existing index should contain all the values that appear in the column.
 +
 +
If the column label is a handle to an index, or if there is already an index that matches the name of your column and has the values that appear in your relational column, it will auto-select the existing index.
 +
 +
When you first select this option, you may see «click to select» as seen here
 +
:[[image:Click to select existing index.png]]
 +
The blue underline indicates that the item can be clicked. You'll need to select the existing index in the index finder that pops up.
 +
 +
Once you've selected an index, or it has found one for you, the selected index identifier is blue underlined, for example
 +
:[[image:Existing route index selected.png]]
 +
To change it to a different existing index, click the blue link and again use the index selector to locate the correct index.
 +
 +
When you select an existing index, the index definition is not modified my the dialog.
 +
 +
=== New index ===
 +
This option creates a new global index for you, and defines it automatically to be the sorted unique values that appear in the column.
 +
 +
Your relational column name is used to create the identifier for the new index. You can change this later by renaming the created node if necessary.
 +
 +
The new index is defined by an expression that computes the sorted unique values. Thus, if you want to tweak this in any way, you can simply modify the definition later.
 +
 +
=== Local index ===
 +
This option creates a new [[Local Indexes|local index]], using the sorted unique values that appear in the column. The local index's name is based on the name of your relational column.
 +
 +
The local index for each separate value variable is a ''different'' local index, even though they might have the same name. Thus, use this option with care. Typically, it is much more convenient to use have a global index in Analytica, so consider using ''New index'' instead.
 +
 +
=== Value ===
 +
A value column contains the data for the resulting array. Value columns usually contain numeric data.
 +
 +
You need to have at least one Value column (with one exception -- you don't need any value column when using a <code>Count</code> type).
 +
 +
You can specify multiple Value columns. When you do so, there are two options:
 +
* Create a separate variable for each value column.
 +
* Combine all the specified value columns into one result variable using a ''Value index''.  See [[#Value indexes]].
 +
 +
You should use the first option -- separate variables -- when each value column contains different types of quantities. Use a value index (also known as a fact index) only when your source table is structured as a fact table, where the value columns are used to embed an index (such as a time-like index) in each record.
 +
 +
When you mark more than one column as Value, the value index selector appears. Set this to '"No value index"' when you want to create a separate variable for each column.
 +
 +
=== Ignore ===
 +
Columns marked ''Ignore'' do not participate in the transformation. Use this when the column should not appear as an index in the final result. Ignoring a column may cause some rows to map to the same index coordinates, in which case the values are aggregated (see [[#Aggregation type]).
 +
 +
== Aggregation type ==
 +
In is not uncommon for multiple records to have the same values in every index column. When this happens, you have multiple records mapping to the ''same'' result cell. To deal with this case, the values much be combined. Words such as ''aggregated'' and ''conglomerated'' are sometimes used for this process. 
 +
 +
You have several standard options for the ''''Type''' of aggregation (called simply the ''type''):
 +
* [[Sum]] (the default)
 +
* [[Min]]
 +
* [[Max]]
 +
* [[Average]]
 +
* [[Product]]
 +
* First
 +
* Last
 +
* [[Count]]
 +
 +
Of these, only [[Count]] does not require a Value column.
 +
 +
Other custom types are possible as well (i.e., any array reducing function), but don't appear on the dialog's dropdown. Since the dialog produces a variable with an explicit call to [[MdTable]], if the need arises to use a custom type, you can modify the definition. See [[MdTable]] for details on using a custom «type».
 +
 +
== Multiple values columns ==
 +
 +
When you mark more than one column as Value, the value index selector appears. Set this to '"No value index"' when you want to create a separate variable for each column.
 +
 +
=== Value indexes ===
 +
Sometimes you want a subset of value columns to be in the same result using an index for the value columns. This occurs when transforming a so-called ''fact table''.  We illustrate this with an example.
 +
 +
Each row in the following STAR table contains data for one student.
 +
:[[image:STAR relational.png]]
 +
(This data was [https://vincentarelbundock.github.io/Rdatasets/csv/AER/STAR.csv downloaded] from [https://vincentarelbundock.github.io/Rdatasets/datasets.html this collection of datasets].)
 +
 +
We wish to extract the math test scores, indexed by gender, ethnicity and birth (quarter). There are 4 columns with math scores. The dialog entries are set as follows.
 +
:[[image:MdTable Wizard for Math Scores.png]]
 +
There are several columns that cannot be seen in this image because you have to scroll down. All were set to '''Ignore''' in this example).
 +
 +
Notice that the four math test score columns have been marked as <code>Value</code>, and we have specified that a new Value index be created named Math_test. After pressing '''OK''', and renaming the new variable to Math_scores, the result is
 +
:[[image:Math scores array.png]]
 +
 +
A single Variable was created, along with four indexes: <code>Math_test</code>, <code>ethnicity</code>, <code>birth</code> and <code>gender</code>.
 +
 +
== See Also ==
 +
* [[MdTable]]
 +
* [[New variable from column]]
 +
* [[Import and Export data]]

Latest revision as of 01:58, 9 February 2023


New to Analytica 6.3

This dialog is used to convert a relational table or "fact table" to a multi-dimensional array. The dialog is essentially a "wizard" that helps you create a call to MdTable to perform the transformation, which you can do entirely from the dialog without writing any expressions yourself.

MdTable Wizard Project Route Length.png

Relational tables

Relational tables are commonly used in databases and spreadsheets. A relational table is a two-dimensional heterogeneous tables with one "record" on each row, and different quantities relating to each record in a different column. They are often heterogeneous, with some columns containing index-like labels, and other columns containing numerical quantities with varying units of measurement and varying meanings.

A relational table often depicts quantities that are effectively multi-dimensional quantities. However, the multi-dimensionality is only implied, where some columns effectively specify the "coordinate" of the record in a multi-dimensional space. In such a case, it is usually far more convenient in Analytica to convert these into actual multi-dimensional arrays, which Analytica loves to manipulate directly. This accomplishes two things: (a) Is separates our heterogenous data into separate variables, where each variable represents one type of quantity only, with one consistent unit of measurement, and (b) turns the implicit "coordinates" that appear in the index-like columns of a relational table into actual explicit indexes. Once the quantities are in a homogeneous array form, they become really easy to manipulate using Analytica's various array functions.

It is worth noting, however, that for some extremely high-dimensional cases, a relational table can be a very space-efficient sparse representation.

Here is an example of a simple 3-column relational table.

Project Route Length relational table.png

The project and route columns are "index-like", where as the Length column contains quantitative data.

Fact tables

The dialog also handles the conversion of a fact table to an array. The term fact table comes from OLAP literature. A fact table is a relational table having more than one value column, and in which the multiple value columns are all instances of the same quantity varying across one additional index (often a time index). A fact table allows one dimension to be embedded in each record, rather than having to have a separate row for every cell.

(TO DO: Insert an image)

Accessing the dialog

To bring up the Relational table to array dialog, select Relational table to array... on the right-mouse context menu while viewing a relational table in a result window.

Relational table to array menu option.png

For the option to be present and enabled,

  • You must be in edit mode.
  • The module containing the table's variable must be editable (it cannot be locked, browse-only). This is where it will create new variables.
  • It must be a result table (not an edit table)
  • The table must appear to be a relational table.
    • The table must be pivoted with the columns horizontally and the records down the vertical rows.
    • Column names must be unique (no duplicates)
    • (A few others, TBD)

Column types

In the main part of the dialog, you specify for each column of your relational table whether the column should be treated as an index, as a value or ignored. For each index column, you have the options of using an existing index in your model, creating a new index, or using a new local index. These are each covered in turn below.

One of the nicest things about using this dialog is that it can take care of creating the index nodes for you, eliminating a few steps when writing an expression yourself that calls MdTable.

An index should have only unique values (no duplicate elements). An index column in a relational table will have many duplicates. In general, an index should have all the values that appear in the corresponding column of the relational table.

The dialog looks at your data and makes an educated first guess about how to treat each column.

Existing index

Use this option when you already have an index in your model for the column. The existing index should contain all the values that appear in the column.

If the column label is a handle to an index, or if there is already an index that matches the name of your column and has the values that appear in your relational column, it will auto-select the existing index.

When you first select this option, you may see «click to select» as seen here

Click to select existing index.png

The blue underline indicates that the item can be clicked. You'll need to select the existing index in the index finder that pops up.

Once you've selected an index, or it has found one for you, the selected index identifier is blue underlined, for example

Existing route index selected.png

To change it to a different existing index, click the blue link and again use the index selector to locate the correct index.

When you select an existing index, the index definition is not modified my the dialog.

New index

This option creates a new global index for you, and defines it automatically to be the sorted unique values that appear in the column.

Your relational column name is used to create the identifier for the new index. You can change this later by renaming the created node if necessary.

The new index is defined by an expression that computes the sorted unique values. Thus, if you want to tweak this in any way, you can simply modify the definition later.

Local index

This option creates a new local index, using the sorted unique values that appear in the column. The local index's name is based on the name of your relational column.

The local index for each separate value variable is a different local index, even though they might have the same name. Thus, use this option with care. Typically, it is much more convenient to use have a global index in Analytica, so consider using New index instead.

Value

A value column contains the data for the resulting array. Value columns usually contain numeric data.

You need to have at least one Value column (with one exception -- you don't need any value column when using a Count type).

You can specify multiple Value columns. When you do so, there are two options:

  • Create a separate variable for each value column.
  • Combine all the specified value columns into one result variable using a Value index. See #Value indexes.

You should use the first option -- separate variables -- when each value column contains different types of quantities. Use a value index (also known as a fact index) only when your source table is structured as a fact table, where the value columns are used to embed an index (such as a time-like index) in each record.

When you mark more than one column as Value, the value index selector appears. Set this to '"No value index"' when you want to create a separate variable for each column.

Ignore

Columns marked Ignore do not participate in the transformation. Use this when the column should not appear as an index in the final result. Ignoring a column may cause some rows to map to the same index coordinates, in which case the values are aggregated (see [[#Aggregation type]).

Aggregation type

In is not uncommon for multiple records to have the same values in every index column. When this happens, you have multiple records mapping to the same result cell. To deal with this case, the values much be combined. Words such as aggregated and conglomerated are sometimes used for this process.

You have several standard options for the 'Type of aggregation (called simply the type):

Of these, only Count does not require a Value column.

Other custom types are possible as well (i.e., any array reducing function), but don't appear on the dialog's dropdown. Since the dialog produces a variable with an explicit call to MdTable, if the need arises to use a custom type, you can modify the definition. See MdTable for details on using a custom «type».

Multiple values columns

When you mark more than one column as Value, the value index selector appears. Set this to '"No value index"' when you want to create a separate variable for each column.

Value indexes

Sometimes you want a subset of value columns to be in the same result using an index for the value columns. This occurs when transforming a so-called fact table. We illustrate this with an example.

Each row in the following STAR table contains data for one student.

STAR relational.png

(This data was downloaded from this collection of datasets.)

We wish to extract the math test scores, indexed by gender, ethnicity and birth (quarter). There are 4 columns with math scores. The dialog entries are set as follows.

MdTable Wizard for Math Scores.png

There are several columns that cannot be seen in this image because you have to scroll down. All were set to Ignore in this example).

Notice that the four math test score columns have been marked as Value, and we have specified that a new Value index be created named Math_test. After pressing OK, and renaming the new variable to Math_scores, the result is

Math scores array.png

A single Variable was created, along with four indexes: Math_test, ethnicity, birth and gender.

See Also

Comments


You are not allowed to post comments.