After reading a data file using [[ReadTextFile]], you'll need to parse it to separate the contents into individual values, and to convert some of the fields to numbers and dates. Data file formats range from common standards like CSV, XML and JSON to custom formats. When you write data to a text file using [[WriteTextFile]], you'll need to put your output into the desired standard or custom format.

== Data in rows and columns (CSV) ==

Data files are often organized into rows and columns. Each "cell" contains one datum, which may be a number, a date, or text. The first line of the file may (or may not) contain column headings rather than actual data. This type of organization is broadly referred to as a CSV format. CSV stands for Comma-Separated Values, since a common convention is for the values on each line to be separated by commas, but the term is broadly applied even when a different separator, such as a tab character, is used.
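
For example, a small CSV file with a heading line might look like this (illustrative data):

 name,age,city
 Alice,34,Denver
 Bob,28,Austin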

Even though CSV is one of the most widely used data formats, there is no official CSV standard. While all CSV conventions have a lot in common, particularly the 2-D structure of the data, many details can and do vary among applications. Foremost among these are conventions regarding when quotes are placed around cells, how separator, new-line, and other special characters are escaped within single-cell text values, and how quoted cells are interpreted. The [[ParseCSV]] and [[MakeCSV]] functions in Analytica 5.0 and later parse and produce CSV using Excel's conventions by default, with optional parameters that provide a great deal of flexibility to adapt to other CSV conventions, making it quite easy to parse or produce CSV. These functions also handle the conversion from text to numbers and dates and vice versa.
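
For instance, under Excel's conventions a cell that contains a separator or a quote is surrounded by quotes, and embedded quotes are doubled:

 name,notes
 "Smith, J","She said ""hello"""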

Reading and parsing a CSV file that uses commas as separators is done as follows:

 ParseCSV(ReadTextFile("MyFile.csv"))

The result is a 2-D array, indexed by local indexes named .Row and .Column. For a CSV file that uses a tab character as a separator, use

 ParseCSV(ReadTextFile("MyFile.csv"), separator: Chr(9))

ParseCSV includes many other options, some of which may be necessary or convenient in a particular case. You may wish to use an existing index for the column index or row index, take the row index labels from a specific column in the data, adopt different quoting conventions, use different international/regional conventions, or extract only a subset of the columns. See [[ParseCSV]] for details.
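
For example, a sketch of reusing an existing index for the columns might look like the following; the «columnIndex» parameter name and the index labels here are assumptions for illustration, so check the [[ParseCSV]] page for the exact parameter names:

 Index Field := ['name', 'age', 'city'];   { a hypothetical pre-existing column index }
 ParseCSV(ReadTextFile("MyFile.csv"), columnIndex: Field)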

Writing a 2-D array, x, to a CSV file is done as follows:

WriteTextFile("MyFile.csv", MakeCSV( x, I, J ) )

where I and J are the indexes of x. To write a tab-separated file, use

WriteTextFile("MyFile.csv", MakeCSV( x, I, J, separator:Chr(9) ) )

MakeCSV supports many additional conventions; see [[MakeCSV]] for details.

== XML ==

The eXtensible Markup Language (XML) is a flexible (but verbose) standard for data encoding. The standard nails down all the encoding details but leaves the specific schema to the application, so the actual structure of the data is virtually unlimited. Hence, it is quite possible (likely, even) that data in an XML file has a rich structure that looks quite different from a rectangular array.

From a single richly structured XML source, there will be many pieces of information that you can extract, and each of these typically fits well into arrays and indexes. So a good way to think about XML data is that you'll extract information from it via a series of "queries".

To parse XML data in your model, use the Microsoft XML DOM parser, which handles the parsing and provides an extremely rich query capability. In the case where you have already read the XML text into a variable XML_Text, instantiate the parser as follows:

 Variable xmlDoc :=
     Var d := COMCreateObject("Msxml2.DOMDocument.3.0");
     d->async := False;
     d->loadXML(XML_Text);
     If (d->parseError->errorCode <> 0) Then Error(d->parseError->reason);
     d

The methods and properties of the XML parser are documented at Microsoft XML DOM Parser API.

You can use [[ReadTextFile]] or [[ReadFromURL]] to read XML_Text, or you can replace the d->loadXML method call with a direct read from a file or URL using, e.g.,

d->load("http://website.com/download/data.xml")

With the data loaded into the XML DOM, you then extract data via a series of queries using XPath expressions. XPath is an extremely rich and powerful query language, making it well suited to the wide variation in schemas among XML files. One useful pattern is illustrated here, where all tags in the XML named <person> are extracted.

 Variable PeopleNodes :=
     Var nodes := xmlDoc->selectNodes("//person");
     Index Person := 1..nodes->length;
     nodes->item(Person - 1)

The result, PeopleNodes, is an array of IXMLDOMNode objects, indexed by Person. If each of these <person> tags contains a single <age> tag, you can extract the ages using

 ParseNumber(PeopleNodes->selectSingleNode("age")->text)
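
Similarly, if each <person> tag carries attributes (an id attribute is assumed here purely for illustration), you can read them with the DOM's getAttribute method:

 PeopleNodes->getAttribute("id")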

For further information, see the example at [[Extracting data from an XML file]] and consult the Microsoft XML DOM API reference.

To create XML, it is usually fairly easy to write Analytica code that concatenates your information using the [[Text Concatenation Operator: &]] and [[JoinText]]. When including text, you should XML-encode it in case it contains XML-reserved characters. For example:

JoinText( "<person>" & TextCharacterEncode("XML", personNames) & "</person>", People )

For numbers or dates, use [[NumberToText]] to control the number format used. Because XML schemas are so open-ended, there is no generic MakeXML function.
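
Putting these together, here is a minimal sketch that writes one <person> tag per element of a People index; the personNames and personAges variables and the output file name are assumptions for illustration:

 Var body := JoinText(
     "<person><name>" & TextCharacterEncode("XML", personNames) & "</name>" &
     "<age>" & NumberToText(personAges) & "</age></person>",
     People, Chr(10));
 WriteTextFile("People.xml", "<people>" & Chr(10) & body & Chr(10) & "</people>")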

A second method for creating XML, which we have found to be less convenient, is to do it through the same Microsoft DOM used for parsing XML. Methods of the DOM allow you to add and modify tags, and once complete, you can write the XML to a file using the save method or obtain the serialized XML text from the xml property.

== JSON ==

The JavaScript Object Notation (JSON) format is a data format originally used to encode the contents of JavaScript objects. The structure of the format matches the data structures of the JavaScript language, and hence it provides an extremely convenient format for web developers, who can use it to directly serialize and deserialize information in client-side JavaScript. Although it is not always well suited to the data used by other programming languages and applications, it is now widely used even in non-JavaScript contexts.

In the simplest usage, you can parse a JSON data file without specifying a schema, e.g.,

 ParseJSON(ReadFromURL("http://SomeWebSite.com/download/data.json"))

The function infers the structure from the data itself and, in general, returns tree-structured data using references and local indexes. There are usually multiple ways that you might map JSON collections into Analytica data structures, and it is often convenient to make use of existing indexes in your model for class-instance data within the JSON. So, for more control over how the JSON data structures are processed, you can supply an explicit schema, which describes the structure of the file and how it maps to your own indexes. See [[Parsing JSON with a schema]].
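
Since [[ParseJSON]] operates on text, you can also experiment with a literal to see how it maps JSON structures to references and local indexes, e.g. (illustrative data):

 ParseJSON('{ "name": "Ada", "age": 36 }')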

To write data with a very simple structure to JSON, use

WriteTextFile( "MyData.json", MakeJSON( x ) )

However, in general, additional information is required. For example, when you have a 2-dimensional array, should it be encoded in JSON as a list of lists or as a list of class instances (where one index corresponds to a JavaScript class)? If you encode it as a list of lists, which index should be written as the outermost? To disambiguate these questions, you list the indexes of your data in outermost-to-innermost order and indicate which indexes should translate to JavaScript classes. For example, a 5-D array x with indexes I, J, K, L, and M might be written using

WriteTextFile("MyData.json", MakeJSON(x, indexOrder:I,J,K, objects:L,M))

See [[MakeJSON]] for details.

== Custom data formats ==

When another application uses a non-standard textual data format (i.e., something other than rows and columns of cells (CSV), Excel files, the Analytica import-export format, XML or JSON), you will need to "program" the parsing of the data yourself. In general, the steps are to read the textual data using [[ReadTextFile]] or [[ReadFromURL]], and then parse it using [[:category:Text Functions|Text Functions]], especially [[FindInText]], [[SplitText]], [[ParseNumber]] and [[ParseDate]]. The task is often greatly facilitated by using [[Regular expressions]].
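
As a minimal sketch, suppose a hypothetical file contains one name=value pair per line; the file name, the Windows-style line endings, and the «resultIndex» parameter of [[SplitText]] are assumptions to verify against the linked pages:

 Var txt := ReadTextFile("MyData.txt");            { hypothetical file, one "name=value" pair per line }
 Var rows := SplitText(txt, Chr(13) & Chr(10));    { assumes Windows-style line endings }
 Index Line := 1..Size(rows);
 Var oneLine := Slice(rows, Line);                 { one text value per element of Line }
 Index Part := 1..2;
 Var parts := SplitText(oneLine, "=", resultIndex: Part);
 Var itemName := parts[Part = 1];                  { text before the "=" }
 ParseNumber(parts[Part = 2])                      { numeric value after the "=", indexed by Line }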

When another application uses a binary data format, you must use [[ReadBinaryFile]] and [[WriteBinaryFile]].

== See Also ==

* [[ReadTextFile]]
