Parsing and formatting data




Release:

4.6  •  5.0  •  5.1  •  5.2  •  5.3  •  5.4  •  6.0  •  6.1  •  6.2  •  6.3  •  6.4  •  6.5



Requires the Analytica Enterprise or Optimizer edition, or ADE.

Analytica has tools for reading and writing data in most common formats, including CSV (comma-separated values in a table), XML, JSON, spreadsheets, and relational databases. This page is an overview of how to read and write files in CSV, XML, and JSON, plus some additional functions for parsing custom data formats. Normally, you start by reading a data file using ReadTextFile, or ReadFromURL to read from a web page, and then use the functions described below to parse the data. To write a file, you use the corresponding function to generate the desired format, and WriteTextFile to write the file. See the sections on spreadsheets and databases for how to access data in those formats. You can also read and write binary files with ReadBinaryFile and WriteBinaryFile.

Data in rows and columns (CSV)

A CSV file organizes data as a table of cells organized into rows and columns. Each "cell" contains one datum, which may be a number, a date, or text. The first line of the file may (or may not) contain column headings rather than actual data. This format is usually called a CSV, which stands for Comma-Separated Values, because the cells are often separated by commas. But, CSV files may also use other separators, such as the tab character. The ParseCSV and MakeCSV functions (introduced in Analytica 5.0) make it easy to parse or produce CSV.

Even though CSV is one of the most widely used data formats, there is no official CSV standard. While all CSV conventions have a lot in common, particularly table structure, many details can vary in addition to the separator. Key variations include when quotes are required around cells, and which escape character allows separators, new lines, and quote characters to appear within a single cell's text value. The ParseCSV and MakeCSV functions use Excel's conventions by default, and offer a lot of flexibility through optional parameters to handle other CSV conventions. They can also convert from text to numbers or dates, and vice-versa.

Here's how to read and parse a CSV file using Excel conventions, including commas as separators:

ParseCSV(ReadTextFile( "MyFile.csv" ) )

The result is a 2-D array, indexed by local indexes named .Row and .Column.

For a CSV file that uses a tab character as a separator, use

ParseCSV(ReadTextFile("MyFile.csv"), separator: Chr(9))

ParseCSV has many other options: you can use an existing index as the column or row index instead of the local indexes .Row and .Column, take the row index labels from a specific column in the data, handle other quoting conventions or different international/regional conventions, or extract only a subset of the columns. See ParseCSV for details.
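For example, if your model already has an index containing the column headings, you can map the data's columns onto it. A sketch, where Cols is a hypothetical index from your model and columnIndex is the optional ParseCSV parameter for this purpose:

ParseCSV(ReadTextFile("MyFile.csv"), columnIndex: Cols)

The result is then indexed by Cols and the local index .Row, rather than by two local indexes.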

To write a 2-D array, x, to a CSV file, use:

WriteTextFile("MyFile.csv", MakeCSV(x, I, J))

where I and J are the indexes of x. To write a tab-separated file, use

WriteTextFile("MyFile.csv", MakeCSV(x, I, J, separator: Chr(9)))

See MakeCSV for details on other options.

XML

The eXtensible Markup Language is a flexible (but verbose) standard for data encoding. The standard nails down all the encoding details, but leaves the specific schema to the application, so the actual structure of the data is virtually unlimited. Hence, it is quite common that data in an XML file has a rich structure that looks quite different from a rectangular array.

A single richly structured XML source may contain many pieces of information that you can extract, which typically fit well into arrays and indexes. A good way to think about XML data is that you extract information from it through a series of "queries".

Typically, you read the XML file into an Analytica text variable, and then use the Microsoft XML DOM parser to generate an XML object, which offers a rich set of queries for accessing the data. For example:

Variable XML_text := ReadTextFile("My File.XML")

Variable XmlDoc := 
Var d := COMCreateObject("Msxml2.DOMDocument.3.0");
d->async := False;
d->loadXML(XML_Text);
If (d->parseError->errorCode <> 0) Then Error(d->parseError->reason);
d

You can use ReadTextFile or ReadFromURL to read XML_Text, or you can replace the d->loadXML method call with a direct read from a file or URL using, e.g.,

d->load("http://website.com/download/data.xml")
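Putting these together, a variant of the XmlDoc definition above that loads directly from a (hypothetical) URL, skipping the intermediate text variable, might look like:

Variable XmlDoc :=
Var d := COMCreateObject("Msxml2.DOMDocument.3.0");
d->async := False;
d->load("http://website.com/download/data.xml");
If (d->parseError->errorCode <> 0) Then Error(d->parseError->reason);
d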

After loading the data into the XML DOM, you extract data via a series of queries using XPath expressions. XPath is an extremely rich and powerful query language, well suited to the wide variations in schema among XML files. The methods and properties of the XML parser are documented at Microsoft XML DOM Parser API. Here is one useful pattern to extract all of the <person> nodes from a document:

Variable PeopleNodes :=

Var nodes := xmlDoc->selectNodes("//person");
Index Person := 1..nodes->length;
nodes->item(Person - 1)

The result, PeopleNodes, is an array of IXMLDOMNode objects. If each of these <person> tags contains a single <age> tag, you can extract the ages using

ParseNumber(PeopleNodes->selectSingleNode("age")->text)
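If each <person> element also carries an attribute, say a hypothetical id attribute, you can read it with the DOM's getAttribute method; like the call above, this array-abstracts over the nodes:

PeopleNodes->getAttribute("id")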

For more, see the example at Extracting data from an XML file and consult the Microsoft XML DOM API reference.

Because XML schemas are so open-ended, there is no generic MakeXML function. But, it is relatively easy to write Analytica code to concatenate your information using the Text Concatenation Operator: & and JoinText. When including text, you should XML-encode it in case it contains XML-reserved characters. For example:

JoinText("<person>" & TextCharacterEncode("XML", personNames) & "</person>", People)

For numbers or dates, use NumberToText to control the number format used.
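For example, here is a sketch that assembles a small <people> document, assuming personNames and personAges are arrays indexed by People (all hypothetical names; NumberToText is shown with its default format):

"<people>" &
JoinText(
    "<person><name>" & TextCharacterEncode("XML", personNames) & "</name>" &
    "<age>" & NumberToText(personAges) & "</age></person>",
    People) &
"</people>"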

A second method for creating XML, which we have found less convenient, is to use the same Microsoft DOM that we use for parsing XML. Methods within the DOM let you add and modify tags and content. After making these changes, you can write the XML to a file using the save method or extract the XML text using the text property.

JSON

JavaScript Object Notation (JSON) is a data format originally designed to encode the contents of JavaScript objects. Its structure matches the data structures of the JavaScript language, so it is very convenient for web developers to directly serialize and deserialize information in client-side JavaScript. It is now widely used even in non-JavaScript contexts, although it is not always ideal for the data used by other programming languages and applications.

In the simplest usage, you can parse a JSON data file without specifying a schema, e.g.,

ParseJSON(ReadFromURL("http://SomeWebSite.com/download/data.json"))

The function infers the class structure from the data itself and, in general, returns tree-structured data using references and local indexes. There are several ways to map JSON collections into Analytica data structures. It is usually convenient to use existing indexes from your model for class instance data within the JSON. For more control over how the JSON data structures are processed, you can use an explicit schema that describes the file structure and how it maps to your own indexes. See Parsing JSON with a schema.

To write data with very simple structure to JSON, use:

WriteTextFile("MyData.json", MakeJSON(x))

Often, though, MakeJSON needs additional information. For example, a 2-dimensional array could be encoded in JSON as a list of lists or as a list of class instances (where one index corresponds to a JavaScript class). If you encode it as a list of lists, which index should be written as the outermost? To make sure you get what you want, list the indexes of your data in outermost-to-innermost order, and indicate which indexes should translate to JavaScript classes. For example, you might write a 5-D array x with indexes I, J, K, L, and M as:

WriteTextFile("MyData.json", MakeJSON(x, indexOrder: I, J, K, objects: L, M))

For details, see MakeJSON.

Custom data formats

When another application uses a non-standard textual data format (i.e., something other than rows and columns of cells (CSV), Excel files, Analytica import-export format, XML, or JSON), you can "program" the parsing of the data yourself. Start by reading the textual data with ReadTextFile or ReadFromURL. You can then parse it using Text Functions, especially FindInText, SplitText, ParseNumber, and ParseDate. The task is often greatly facilitated by using Regular expressions.
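For example, suppose each line of a (hypothetical) file has the form "label: value". A sketch of parsing one such line with SplitText and ParseNumber:

Var line := "pi: 3.14159";   { in practice, one line taken from ReadTextFile }
Var parts := SplitText(line, ": ");   { splits into the label and the value text }
ParseNumber(Slice(parts, 2))   { converts the value text to a number }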

When another application uses a binary data format, you must use ReadBinaryFile and WriteBinaryFile.

See Also
