Tutorial on using pre-trained OpenAI language models
Under construction -- not ready to be used yet
In this tutorial you will learn how to use language models (via the OpenAI API) from your Analytica model, and also learn about language models and how they work.
Adding the OpenAI API library to your model
( Downloading the library) ( Create new model. Add library to model ) ( Getting and entering the OpenAI API key )
The OpenAI API Library needs to be integrated into your model for you to access its functionality. Here's a step-by-step guide on how to do it:
- Download the OpenAI API Library: The first step is to download the OpenAI library. Make sure to save it in a location you can easily access.
- Create a New Model: After the download is complete, open Analytica and create a new model. This will serve as your workspace.
- Add the Library to the Model: Locate the "File" menu in the top-left corner of the Analytica interface, click on it, and select the option "Add Module".
- Select the Library File: In the file selection dialog that appears, navigate to where you saved the OpenAI library file and select it.
- Embed the Library: Once you've selected the library file, a new dialog will open up. Choose the "Embed a copy" option. This action embeds a copy of the library directly into your model.
- Save Your API Key: Lastly, open the embedded OpenAI API library module. Locate the section where you can save your API key. Input your OpenAI API key into the designated environment variable to finalize the setup.
What OpenAI models are available to use?
Base Models:
- ada
- babbage
- curie
- davinci
- gpt-3.5-turbo
- whisper-1
Code Search Models:
- ada-code-search-code
- ada-code-search-text
- babbage-code-search-code
- babbage-code-search-text
- code-search-ada-code-001
- code-search-ada-text-001
- code-search-babbage-code-001
- code-search-babbage-text-001
Search Document Models:
- ada-search-document
- babbage-search-document
- curie-search-document
- davinci-search-document
- text-search-ada-doc-001
- text-search-babbage-doc-001
- text-search-curie-doc-001
- text-search-davinci-doc-001
Search Query Models:
- ada-search-query
- babbage-search-query
- curie-search-query
- davinci-search-query
- text-search-ada-query-001
- text-search-babbage-query-001
- text-search-curie-query-001
- text-search-davinci-query-001
Similarity Models:
- ada-similarity
- babbage-similarity
- curie-similarity
- davinci-similarity
- text-similarity-ada-001
- text-similarity-babbage-001
- text-similarity-curie-001
- text-similarity-davinci-001
Instruction Models:
- curie-instruct-beta
- davinci-instruct-beta
Edit Models:
- code-davinci-edit-001
- text-davinci-edit-001
Text Models:
- text-ada-001
- text-babbage-001
- text-curie-001
- text-davinci-001
- text-davinci-002
- text-davinci-003
- text-embedding-ada-002
Turbo Models:
- gpt-3.5-turbo-0301
- gpt-3.5-turbo-0613
- gpt-3.5-turbo-16k
- gpt-3.5-turbo-16k-0613
Generating text completions
In this step you will construct a model that generates text completions, with the characteristics of those completions determined by the parameters you define. Here's how you do it:
- Create an Index Node for Completions: Begin by constructing an Index node and name it "Completion number". This node plays an important role in your model, as it dictates how many different text completions your model will generate. Inside the node's definition you'll enter the following:
1..4 (represents the integers 1 through 4; change it to however many completions you need)
- Set Up a Variable Node for Prompt Generation and Define the Command: Next, establish a variable node and title it "Example of prompt generation". Inside this node's definition, you'll input the following command:
Prompt_completion("In the night sky, ", Completion_index:Completion_number, max_tokens:100)
This command instructs the model to create text that continues from the prompt "In the night sky, ". Passing the "Completion number" index as the Completion_index parameter tells the model to generate one unique completion for each element of that index (four in this case). The max_tokens:100 parameter limits each completion to approximately 100 tokens in length.
This configuration provides a playground for you to experiment with your model, modify the parameters, and observe how changes affect text generation. Through this interactive learning process, you'll gain deeper insights into how to navigate your model effectively to generate the desired results.
Word likelihoods
(tutorial about next token probabilities. logits and perplexity, etc.)
Language models work by receiving an input sequence and then assigning a score, or log probability, to each potential next token (a word or piece of a word) in their vocabulary. The higher the score, the more the model believes that a particular token is a fitting continuation of the input sequence.
In this part of our discussion, we're going to unpack these token probabilities further. We're seeking to understand why the model gives some tokens a higher score, suggesting they are more likely, while others receive a lower score, indicating they are less likely.
To bring this into perspective, let's think about the phrase, "The dog is barking". It's logical to expect the word "barking" to get a high score because it completes the sentence well. But, if we switch "barking" with a misspelled word such as "barkeing", the model's score for this token would drop, reflecting its recognition of the spelling mistake. In the same vein, if we replace "barking" with an unconnected word like "piano", the model would give it a much lower score compared to "barking". This is because "piano" is not a logical ending to the sentence "The dog is".
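The scores returned by the model are log probabilities: a token's probability can be recovered as p = exp(log probability). A common way to summarize how well a model predicts an entire sequence is perplexity, the exponential of the average negative log probability per token:
\text{perplexity} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right)
Lower perplexity means the model found the sequence less surprising, so the well-formed "The dog is barking" would receive a much lower perplexity than "The dog is piano".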
Controlling completion randomness
To control the randomness in text generation, we can utilize parameters such as 'temperature' or 'top_p'. The temperature parameter, which ranges from 0 to 2, influences the randomness of the output. Higher values, such as 1.0, yield more diverse but potentially less coherent text, while lower values, such as 0.7, result in more focused and deterministic outputs.
Conversely, 'top_p', ranging from 0 to 1, restricts sampling to the smallest set of most probable tokens whose cumulative probability exceeds the threshold p. For instance, a lower value, like 0.2, makes the output more focused and repeatable, since only a few of the most probable next tokens are ever considered.
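Whether these parameters are exposed by the library's Prompt_completion function depends on the library version, so check the function's definition in the embedded library; as a sketch (the temperature parameter name here is an assumption, mirroring the OpenAI API parameter):
Prompt_completion("In the night sky, ", Completion_index: Completion_number, max_tokens: 100, temperature: 0.2) { assumes an optional temperature parameter that is passed through to the API }
With a low temperature like 0.2 the four completions should come out nearly identical, while values near 1.0 or above should make them noticeably more varied.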
In-context learning
(What is in-context learning?) (Some examples: Translation, ....)
Getting a pre-trained language model to perform the desired task can be quite challenging. While these models possess vast knowledge, they might struggle to comprehend the specific task's format unless tuned or conditioned.
Therefore, it is common to write prompts that tell the language model the format of the task it is supposed to accomplish. This method is called in-context learning. When the provided prompt contains several examples of the task the model is supposed to accomplish, it is known as “few-shot learning,” since the model learns to perform the task based on a few examples.
A typical prompt often has two essential parts:
- Instructions: This component serves as a set of guidelines, instructing the model on how to approach and accomplish the given task. For certain models like OpenAI's text-davinci-003, which are fine-tuned to follow specific instructions, the instruction string becomes even more crucial.
- Demonstrations: These demonstration strings provide concrete examples of successfully completing the task at hand. By presenting the model with these instances, it gains valuable insights to adapt and generalize its understanding.
For example, suppose we want to design a prompt for the task of translating words from English to Chinese. A prompt for this task could look like the following:
Translate English to Chinese.
dog -> 狗
apple -> 苹果
coffee -> 咖啡
supermarket -> 超市
squirrel ->
Given this prompt, most large language models should respond with the correct answer: "松鼠"
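To try this from Analytica, you can assemble the few-shot prompt as a single string, using Chr(10) for line breaks, and pass it to the same Prompt_completion function used earlier (a sketch; only a handful of tokens are needed for the answer):
Prompt_completion(
    "Translate English to Chinese." & Chr(10) &
    "dog -> 狗" & Chr(10) &
    "apple -> 苹果" & Chr(10) &
    "coffee -> 咖啡" & Chr(10) &
    "supermarket -> 超市" & Chr(10) &
    "squirrel ->",
    max_tokens: 10)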
Creating classifiers using in-context learning
What is a classifier?
A classifier is a system that takes input in the form of text and assigns a corresponding label to it. While language models are best known for their ability to generate text, they are also widely employed to build classification systems of this kind. Here are a few examples of such tasks:
- Classifying Tweets as either TOXIC or NOT_TOXIC.
- Determining whether restaurant reviews exhibit a POSITIVE or NEGATIVE sentiment.
- Identifying whether two statements AGREE with each other or CONTRADICT each other.
In essence, classifiers leverage the power of language models to make informed decisions and categorize text data based on predefined criteria.
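For example, a few-shot sentiment classifier for restaurant reviews can be built by placing a few labeled demonstrations in the prompt and ending it with the review to be classified, then asking for a short completion (a sketch; the reviews and labels are purely illustrative):
Prompt_completion(
    "Classify each restaurant review as POSITIVE or NEGATIVE." & Chr(10) &
    "Review: The pasta was wonderful and the staff were friendly. -> POSITIVE" & Chr(10) &
    "Review: We waited an hour and the food arrived cold. -> NEGATIVE" & Chr(10) &
    "Review: Best brunch spot in town! ->",
    max_tokens: 5)
The completion should be the label for the final review (here, POSITIVE), which you can compare against the true label to evaluate the classifier over a set of test cases.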
(Classifier Task 1)
(Download the training / test set) (Write the n-shot prompt) (Evaluate performance on several test cases)
(Classifier Task 2)
Managing a conversation
Comparing different models
Similarity embeddings
(Uses) (Computing similarities) (Indexing a collection of docs) (Adding similar snippet to LLM prompts)
For this task, you'll create an index node titled "Speech Name" containing the following significant speeches:
- Gettysburg Address by Abraham Lincoln (1863)
- First Inaugural Address by Franklin D. Roosevelt (1933)
- Inaugural Address by John F. Kennedy (1961)
- 'We Shall Overcome' Speech by Lyndon B. Johnson (1965)
- 'Tear Down This Wall' Speech by Ronald Reagan (1987)
Following this, create a variable node titled "Presidential Speech" and define it as a table indexed by the "Speech Name" index, with each cell holding the text of the corresponding speech. Subsequently, create a decision node named "Search Queries" containing a collection of phrases that may be relevant to the speeches. The phrases to be included are:
- Racial injustice
- References to religion
- Berlin, Germany
- Americans uniting
- 87
- Formation of a new nation
- The resilience of our nation
- Upholding truth
- Confronting reality
With these nodes set up, your next task is to embed the "Presidential Speech" and "Search Queries" nodes. Start by creating a variable node named "Speech Embeddings" with the following definition (the identifiers in this and the following formulas refer to the nodes created above; substitute your own node identifiers if they differ): Embedding_for(Ex_OAL_Ref_Material)
Next, create an additional variable node, called "Query Embedding," setting the following as its definition: Embedding_for(EX_OAL_Search_query)
Lastly, create a variable node titled "Similarity of Query to Speech" and set the following as its definition: Embedding_similarity(Ex_OAL_Speech_embedd, Ex_OAL_Query_emb)
Running these calculations will showcase the similarity between each search query and each referenced presidential speech. The output is displayed in a table indicating the degree of similarity, as shown below:
(image)
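If you also want to pull out the single best-matching speech for each query, Analytica's built-in ArgMax function returns the value of an index at which an array reaches its maximum. A sketch, reusing the identifiers from the definitions above (the index identifier Speech_Name is an assumption for your "Speech Name" index; substitute your own identifiers):
ArgMax(Embedding_similarity(Ex_OAL_Speech_embedd, Ex_OAL_Query_emb), Speech_Name)
This returns, for each search query, the name of the speech whose embedding is most similar to that query's embedding.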