# Text Embeddings

Get vector representations of input text that machine learning models and algorithms can use directly.

## What are embedding vectors?

An embedding vector is a list of floating-point numbers that represents a text string. The distance between two vectors measures how related the two strings are: a small distance indicates high relatedness, while a large distance indicates low relatedness.

Embeddings are commonly used for:
- **Search** - Results are ranked by their relevance to the query string
- **Clustering** - Grouping text strings by similarity
- **Recommendation** - Recommending items with relevant text strings
- **Anomaly Detection** - Identifying outliers with lower relevance
- **Classification** - Classifying text strings by their most similar labels
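
To make the idea of distance concrete, here is a minimal sketch that compares toy vectors by cosine distance. The three-dimensional vectors are invented for illustration and are not real model output; `cosine_distance` is a hypothetical helper name.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors, made up for this example (real embeddings have far more dimensions).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related strings should sit closer together than unrelated ones.
print(cosine_distance(cat, kitten) < cosine_distance(cat, invoice))  # True
```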

## Quick Start

### Install the SDK

```bash
pip install openai
```

### Basic Example

> Replace `YOUR_MODELVERSE_API_KEY` (Python) or `$MODELVERSE_API_KEY` (curl) with your own API key. You can obtain one on the [API Key](https://astraflow.ucloud.cn/modelverse/experience/api-keys) page.

<!-- tabs:start -->
#### ** Python **

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELVERSE_API_KEY",
    base_url="https://api-us-ca.umodelverse.ai/v1"
)

response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-large"
)

print(response.data[0].embedding)
```

#### ** curl **

```bash
curl https://api-us-ca.umodelverse.ai/v1/embeddings \
  -H "Authorization: Bearer $MODELVERSE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Your text string goes here",
    "model": "text-embedding-3-large"
  }'
```
<!-- tabs:end -->

The response contains the embedding vector (a list of floating-point numbers) along with some additional metadata. You can extract the embedding vector, store it in a vector database, and use it for many different use cases.

## API Reference

**POST** `https://api-us-ca.umodelverse.ai/v1/embeddings`

### Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| input | string or array | Yes | Input text to embed, encoded as a string or array of tokens. To embed multiple inputs in a single request, pass an array of strings. Each input must not exceed 8192 tokens and cannot be an empty string. The total number of tokens across all inputs in a single request can be up to 300,000. |
| model | string | Yes | The model ID to use, such as `text-embedding-3-large`. |
| dimensions | integer | No | The number of dimensions the output embedding vector should have. Supported only in text-embedding-3 and later model versions. |
| encoding_format | string | No | The format to return the embedding vector. Can be `float` or `base64`. Default: `float` |
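
When `encoding_format` is `base64`, the embedding arrives as an encoded string rather than a JSON array of numbers. The sketch below assumes the payload follows the OpenAI convention of packed little-endian 32-bit floats; that layout is an assumption about this endpoint, so compare a decoded vector against a `float` response before relying on it. `decode_base64_embedding` is a hypothetical helper name.

```python
import base64
import struct

def decode_base64_embedding(b64_string):
    """Decode a base64 string of packed little-endian float32 values."""
    raw = base64.b64decode(b64_string)
    count = len(raw) // 4  # 4 bytes per float32 value
    return list(struct.unpack(f"<{count}f", raw))

# Round trip on made-up values that float32 represents exactly.
sample = [0.5, -0.25, 1.0]
encoded = base64.b64encode(struct.pack("<3f", *sample)).decode("ascii")
print(decode_base64_embedding(encoded))  # [0.5, -0.25, 1.0]
```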

### Response Example

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0023064255, -0.009327292, ..., -0.0028842222],
      "index": 0
    }
  ],
  "model": "text-embedding-3-large",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
```

### Response Field Descriptions

| Field | Type | Description |
|-------|------|-------------|
| embedding | array | The embedding vector, a list of floating-point numbers. The length of the vector depends on the model. |
| index | integer | Index of the embedding in the list of embeddings. |
| object | string | The type of the object, always "embedding". |
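
When `input` is an array, `data` holds one entry per input, and `index` ties each entry back to its position in the request. A minimal sketch, assuming you want the vectors back in input order; `embed_batch` and the stand-in client are illustrative names, not part of the SDK.

```python
from types import SimpleNamespace

def embed_batch(client, texts, model="text-embedding-3-large"):
    """Embed several strings in one request and return vectors in input order."""
    response = client.embeddings.create(input=texts, model=model)
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Stand-in client for a dry run without network access (hypothetical).
class _FakeEmbeddings:
    def create(self, input, model):
        items = [
            SimpleNamespace(index=i, embedding=[float(len(text))])
            for i, text in enumerate(input)
        ]
        # Return the items out of order to exercise the sort by index.
        return SimpleNamespace(data=list(reversed(items)))

fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
print(embed_batch(fake_client, ["a", "bbb"]))  # [[1.0], [3.0]]
```

With a real `OpenAI` client configured as in the Quick Start, the same function can be called as `embed_batch(client, ["first text", "second text"])`.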

## Embedding Models

| Model | Default Dimensions | Max Input (tokens) | MTEB Score |
|-------|--------------------|--------------------|------------|
| text-embedding-3-large | 3072 | 8192 | 64.6% |
| text-embedding-ada-002 | 1536 | 8192 | 61.0% |

## Reducing Embedding Dimensions

Larger embedding vectors generally cost more to use because they consume more compute, memory, and storage. You can shorten an embedding by passing the `dimensions` parameter; the shortened vector keeps its concept-representing properties.

For example, a `text-embedding-3-large` embedding can be shortened to 256 dimensions while still outperforming a 1536-dimensional `text-embedding-ada-002`.
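
To get a rough sense of the savings, the sketch below estimates raw storage for one million vectors. It assumes each value is stored as a 4-byte float32 and ignores any index or metadata overhead a vector database adds; `embedding_storage_bytes` is a hypothetical helper.

```python
def embedding_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 (4 bytes per value)."""
    return num_vectors * dimensions * bytes_per_value

full = embedding_storage_bytes(1_000_000, 3072)
short = embedding_storage_bytes(1_000_000, 256)

# Sizes in gigabytes (10^9 bytes)
print(full / 1e9, short / 1e9)  # 12.288 1.024
```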

<!-- tabs:start -->
#### ** Python **

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELVERSE_API_KEY",
    base_url="https://api-us-ca.umodelverse.ai/v1"
)

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Testing 123",
    dimensions=256  # Specify output dimensions
)

print(response.data[0].embedding)
```

#### ** curl **

```bash
curl https://api-us-ca.umodelverse.ai/v1/embeddings \
  -H "Authorization: Bearer $MODELVERSE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Testing 123",
    "model": "text-embedding-3-large",
    "dimensions": 256
  }'
```
<!-- tabs:end -->

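### Manual Truncation and Normalization

If you truncate an embedding yourself instead of using the `dimensions` parameter, re-normalize the shortened vector to unit length, because truncation changes its norm: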

```python
from openai import OpenAI
import numpy as np

client = OpenAI(
    api_key="YOUR_MODELVERSE_API_KEY",
    base_url="https://api-us-ca.umodelverse.ai/v1"
)

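def normalize_l2(x):
    """Scale a vector (or each row of a matrix) to unit L2 length."""
    x = np.array(x)
    if x.ndim == 1:
        norm = np.linalg.norm(x)
        # Leave an all-zero vector unchanged to avoid dividing by zero
        if norm == 0:
            return x
        return x / norm
    else:
        norm = np.linalg.norm(x, 2, axis=1, keepdims=True)
        # Divide all-zero rows by 1 so they stay zero without a warning
        return x / np.where(norm == 0, 1, norm)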

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Testing 123",
    encoding_format="float"
)

# Keep the first 256 dimensions, then rescale the result to unit length
cut_dim = response.data[0].embedding[:256]
norm_dim = normalize_l2(cut_dim)
print(norm_dim)
```

## Use Cases

### 1. Text Search

Embed the query and each document, score every document by the cosine similarity between its embedding and the query's, and return the top `n` documents.

```python
from openai import OpenAI
import numpy as np

client = OpenAI(
    api_key="YOUR_MODELVERSE_API_KEY",
    base_url="https://api-us-ca.umodelverse.ai/v1"
)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embedding(text, model="text-embedding-3-large"):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

def search_documents(documents, query, n=3):
    query_embedding = get_embedding(query)
    
    results = []
    for doc in documents:
        doc_embedding = get_embedding(doc)
        similarity = cosine_similarity(query_embedding, doc_embedding)
        results.append((doc, similarity))
    
    results.sort(key=lambda x: x[1], reverse=True)
    return results[:n]

# Example
documents = ["Python is a programming language", "Machine learning is fun", "The weather is nice today"]
results = search_documents(documents, "programming")
print(results)
```
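
The example above re-embeds every document on each call to `search_documents`. If you run many queries against the same documents, one option is to embed the documents once and reuse the vectors. The sketch below shows that pattern; `EmbeddingIndex` and `toy_embed` are hypothetical names, and `toy_embed` is a keyword-matching stand-in so the example runs offline. In practice you would pass `get_embedding` instead.

```python
import math

class EmbeddingIndex:
    """Embeds each document once, then answers any number of queries."""

    def __init__(self, embed_fn, documents):
        self.embed_fn = embed_fn
        self.documents = list(documents)
        self.vectors = [embed_fn(doc) for doc in self.documents]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    def search(self, query, n=3):
        """Return the n (document, similarity) pairs with the highest similarity."""
        query_vector = self.embed_fn(query)
        scored = [
            (doc, self._cosine(query_vector, vec))
            for doc, vec in zip(self.documents, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:n]

def toy_embed(text):
    """Stand-in embedding: keyword flags plus a small constant to avoid a zero vector."""
    vocabulary = ["python", "learning", "weather"]
    lowered = text.lower()
    return [1.0 if word in lowered else 0.0 for word in vocabulary] + [0.1]

index = EmbeddingIndex(toy_embed, [
    "Python is a programming language",
    "Machine learning is fun",
    "The weather is nice today",
])
print(index.search("python code", n=1)[0][0])  # Python is a programming language
```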

### 2. Embedding-based Q&A

Place the relevant document in the chat model's context window and ask the model to answer from it.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELVERSE_API_KEY",
    base_url="https://api-us-ca.umodelverse.ai/v1"
)

# Assume relevant articles have been found through embedding search
relevant_article = "The gold medal for curling at the 2022 Winter Olympics was won by..."

query = f"""Answer the question using the following article. If you can't find the answer, write "I don't know."

Article:
\"\"\"
{relevant_article}
\"\"\"

Question: Which athletes won the curling gold medal at the 2022 Winter Olympics?
"""

response = client.chat.completions.create(
    messages=[
        {'role': 'system', 'content': 'You answer questions about the 2022 Winter Olympics.'},
        {'role': 'user', 'content': query},
    ],
    model="gpt-4o",
    temperature=0,
)

print(response.choices[0].message.content)
```

### 3. Clustering Analysis

Cluster the embedding vectors to group similar texts together.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume embeddings is a list of obtained embedding vectors
embeddings = [...]  # Embedding vectors obtained from the API

matrix = np.vstack(embeddings)
n_clusters = 4

kmeans = KMeans(
    n_clusters=n_clusters,
    init='k-means++',
    random_state=42
)
kmeans.fit(matrix)

# Cluster labels for each text
labels = kmeans.labels_
```
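
Once you have a label per text, you will usually want to see which texts landed in each cluster. A minimal sketch, assuming `texts` is in the same order as the embeddings passed to `KMeans`; the sample texts and labels are made up and stand in for your data and `kmeans.labels_`.

```python
from collections import defaultdict

def group_by_cluster(texts, labels):
    """Map each cluster label to the list of texts assigned to it."""
    clusters = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[int(label)].append(text)  # int() also accepts NumPy integer labels
    return dict(clusters)

texts = ["refund request", "late delivery", "money back", "package delayed"]
labels = [0, 1, 0, 1]  # made-up labels standing in for kmeans.labels_
print(group_by_cluster(texts, labels))
# {0: ['refund request', 'money back'], 1: ['late delivery', 'package delayed']}
```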

### 4. Recommendation System

Recommend the items whose embedding vectors are most similar to a source item's.

```python
from openai import OpenAI
import numpy as np

client = OpenAI(
    api_key="YOUR_MODELVERSE_API_KEY",
    base_url="https://api-us-ca.umodelverse.ai/v1"
)

def get_embedding(text, model="text-embedding-3-large"):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

def recommend_similar(items, source_index, n=3):
    """Returns the n most similar items to the source item."""
    embeddings = [get_embedding(item) for item in items]
    source_embedding = embeddings[source_index]
    
    similarities = []
    for i, emb in enumerate(embeddings):
        if i != source_index:
            # For unit-length embeddings, the dot product equals cosine similarity
            sim = np.dot(source_embedding, emb)
            similarities.append((i, items[i], sim))
    
    similarities.sort(key=lambda x: x[2], reverse=True)
    return similarities[:n]
```

### 5. Zero-Shot Classification

Classify text without training data by comparing its embedding with the embedding of each candidate label and picking the most similar label.

```python
from openai import OpenAI
import numpy as np

client = OpenAI(
    api_key="YOUR_MODELVERSE_API_KEY",
    base_url="https://api-us-ca.umodelverse.ai/v1"
)

def get_embedding(text, model="text-embedding-3-large"):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def classify_text(text, labels):
    text_embedding = get_embedding(text)
    label_embeddings = [get_embedding(label) for label in labels]
    
    similarities = [cosine_similarity(text_embedding, le) for le in label_embeddings]
    best_index = np.argmax(similarities)
    return labels[best_index]

# Example
labels = ["positive", "negative", "neutral"]
result = classify_text("This product is amazing!", labels)
print(result)  # Expected output: positive
```

## FAQs

### How do I count the tokens in a string?

Use OpenAI's tokenizer [`tiktoken`](https://github.com/openai/tiktoken):

```python
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string("tiktoken is great!"))  # Output: 6
```

> Use `cl100k_base` encoding for third-generation embedding models like `text-embedding-3-large`.
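
Token counts also let you split a large set of inputs into requests that respect the limits listed under Request Parameters (8192 tokens per input, 300,000 tokens per request). The sketch below is one possible approach; `batch_inputs` is a hypothetical helper that takes any token-counting function, such as `num_tokens_from_string` above. The demo uses a whitespace counter and tiny limits so it runs without `tiktoken`.

```python
def batch_inputs(texts, count_tokens, max_per_input=8192, max_per_request=300_000):
    """Group texts into batches whose total token count fits in one request."""
    batches, current, current_tokens = [], [], 0
    for text in texts:
        if not text:
            raise ValueError("Inputs cannot be empty strings")
        n = count_tokens(text)
        if n > max_per_input:
            raise ValueError(f"Input has {n} tokens; the limit is {max_per_input}")
        # Start a new batch when adding this text would exceed the request limit
        if current and current_tokens + n > max_per_request:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        batches.append(current)
    return batches

# Demo with a whitespace "tokenizer" and small limits (illustrative only)
demo = batch_inputs(
    ["a b c", "d e", "f g h i"],
    count_tokens=lambda s: len(s.split()),
    max_per_input=4,
    max_per_request=5,
)
print(demo)  # [['a b c', 'd e'], ['f g h i']]
```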

### How can I quickly retrieve the K nearest embedding vectors?

To search quickly across many vectors, use a vector database such as:
- AI Database (see documentation: [AI Database](https://docs.ucloud.cn/aidb/README))
- pgvector (see documentation: [PostgreSQL](https://docs.ucloud.cn/upgsql/README))

### Which distance function should be used?

**Cosine similarity** is recommended. OpenAI embeddings are normalized to length 1, meaning:
- Cosine similarity can be computed using only the dot product, making it faster
- Cosine similarity and Euclidean distance will yield the same ranking
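
Both points follow from the identity ‖a − b‖² = 2 − 2(a · b), which holds for unit-length vectors. The sketch below checks the identity and the matching rankings on made-up vectors that are normalized locally; it does not call the API.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors, normalized to length 1 like the embeddings described above
query = normalize([1.0, 2.0, 3.0])
docs = [normalize(v) for v in ([1.0, 2.0, 2.5], [3.0, -1.0, 0.5], [0.0, 1.0, 1.0])]

for d in docs:
    # For unit vectors: squared Euclidean distance = 2 - 2 * dot product
    assert math.isclose(euclidean(query, d) ** 2, 2 - 2 * dot(query, d), abs_tol=1e-9)

# Highest similarity first vs. smallest distance first
by_cosine = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)
by_distance = sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))
print(by_cosine == by_distance)  # True
```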