Embeddings Overview
GrafitoDB integrates with multiple embedding providers, so you can convert text into vector representations for semantic search, clustering, and similarity analysis.
What are Embeddings?
Embeddings are dense vector representations of text (or other data) that capture semantic meaning. Similar texts have vectors that are close together in the embedding space, enabling:
- Semantic search: Find content with similar meaning
- Clustering: Group similar items automatically
- Recommendation: Suggest similar items
- Classification: Categorize content based on similarity
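To make "close together in the embedding space" concrete, here is a minimal sketch that compares hand-made vectors with cosine similarity. The three-dimensional vectors and their values are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative stand-ins for real embeddings (values are made up).
graph_db = [0.9, 0.1, 0.0]   # "graph database"
node_edge = [0.8, 0.3, 0.1]  # "network of nodes and edges"
cake = [0.0, 0.2, 0.9]       # "chocolate cake recipe"

related = cosine_similarity(graph_db, node_edge)
unrelated = cosine_similarity(graph_db, cake)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

The related pair scores far higher than the unrelated pair, which is the property that semantic search, clustering, and recommendation all rely on.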
Supported Providers
GrafitoDB includes built-in support for 12+ embedding providers:
Cloud APIs
| Provider | Description | Best For |
|---|---|---|
| OpenAI | text-embedding-3-small, text-embedding-3-large | High-quality embeddings |
| Hugging Face | Inference API with thousands of models | Flexibility, open-source models |
| Google GenAI | text-embedding-004 | Google Cloud integration |
| AWS Bedrock | Titan embeddings | AWS ecosystem |
| Cohere | Embed models | Enterprise use |
| Jina | Jina embeddings | Specialized models |
| Mistral | Mistral embeddings | European AI |
| VoyageAI | Voyage embeddings | Domain-specific |
| Together AI | Together embeddings | Open-source models |
Local/On-Premise
| Provider | Description | Best For |
|---|---|---|
| Ollama | Local LLM inference | Privacy, offline use |
| Sentence Transformers | Local embedding models | Cost-effective |
| TensorFlow Hub | Google ML models | TensorFlow ecosystem |
Quick Start
1. Create an Embedding Function
from grafito import GrafitoDatabase
from grafito.embedding_functions import OpenAIEmbeddingFunction
# Initialize embedding function
embed_fn = OpenAIEmbeddingFunction(
    model="text-embedding-3-small",
    api_key_env_var="OPENAI_API_KEY"
)
2. Create a Vector Index
db = GrafitoDatabase(":memory:")
# Create vector index with embedding function
db.create_vector_index(
    name="articles_vec",
    dim=1536,  # Must match the model's output size (1536 for text-embedding-3-small)
    embedding_function=embed_fn
)
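Because the index dimension has to match the model's output, it can help to check a sample of vectors before indexing. The sketch below is a standalone helper, not part of GrafitoDB; `dimensions_match` and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for a real embedding function.

```python
def dimensions_match(vectors: list[list[float]], expected_dim: int) -> bool:
    """Return True only if every vector has exactly expected_dim components."""
    return all(len(vector) == expected_dim for vector in vectors)


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for a provider call; always returns 4-dimensional vectors."""
    return [[0.0, 0.0, 0.0, 0.0] for _ in texts]


sample = fake_embed(["Introduction to Graph Databases"])
print(dimensions_match(sample, expected_dim=4))     # matches the stand-in's output size
print(dimensions_match(sample, expected_dim=1536))  # mismatch: wrong index dimension
```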
3. Add Data with Embeddings
# Create nodes
article = db.create_node(
    labels=["Article"],
    properties={"title": "Introduction to Graph Databases"}
)
# Generate embeddings automatically or manually
db.upsert_embedding(
    node_id=article.id,
    text="Introduction to Graph Databases",
    index="articles_vec"
)
4. Perform Semantic Search
# Search for similar content
results = db.vector_search(
    index="articles_vec",
    query="graph database basics",
    k=5
)
for node, score in results:
    print(f"{node.properties['title']}: {score}")
Embedding Function Base Interface
All embedding functions implement a common interface:
class EmbeddingFunction:
    """Abstract embedding function interface."""

    def __call__(self, input: list[str]) -> list[list[float]]:
        """Generate embeddings for the given input texts."""
        ...

    def name(self) -> str:
        """Return the registry name for this embedding function."""
        ...

    def default_space(self) -> str:
        """Return default distance space (cosine, l2, ip)."""
        ...

    def supported_spaces(self) -> list[str]:
        """Return supported distance spaces."""
        ...

    def dimension(self) -> int | None:
        """Return embedding dimension if known."""
        ...
Distance Spaces
Different embedding models work best with different distance metrics:
| Space | Description | Best For |
|---|---|---|
| cosine | Cosine similarity | Most text embeddings |
| l2 | Euclidean distance | Image embeddings |
| ip | Inner product | OpenAI embeddings |
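The three spaces are closely related once vectors are normalized to unit length. The sketch below computes each metric in plain Python and checks two standard identities: for unit vectors the inner product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`. The helper names are illustrative and do not come from GrafitoDB.

```python
import math


def inner_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


a = normalize([3.0, 4.0])  # becomes [0.6, 0.8]
b = normalize([4.0, 3.0])  # becomes [0.8, 0.6]

ip = inner_product(a, b)
cos = cosine_similarity(a, b)
l2 = l2_distance(a, b)
print(f"ip={ip:.4f} cosine={cos:.4f} l2={l2:.4f}")
```

In practice this means that for models returning unit-length vectors, all three spaces rank neighbors in the same order, and the choice mostly affects how scores are reported.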
Configuration Management
Embedding functions can be serialized for persistence:
# Get configuration
config = embed_fn.get_config()
# {'model': 'text-embedding-3-small'}
# Recreate from configuration
from grafito.embedding_functions import create_embedding_function
embed_fn = create_embedding_function("openai", config)
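One way to picture the round trip is a name-to-class registry plus a JSON-serializable config. The sketch below is a self-contained illustration of that pattern, not GrafitoDB's actual implementation; `ToyEmbeddingFunction`, `REGISTRY`, and `build_from_config` are hypothetical names.

```python
import json


class ToyEmbeddingFunction:
    """Hypothetical embedding function with a single setting."""

    def __init__(self, model: str = "toy-small") -> None:
        self.model = model

    def name(self) -> str:
        return "toy"

    def get_config(self) -> dict:
        # Only non-secret settings belong here; API keys stay in the environment.
        return {"model": self.model}


# Maps a registry name to the class that can rebuild the function.
REGISTRY = {"toy": ToyEmbeddingFunction}


def build_from_config(name: str, config: dict) -> ToyEmbeddingFunction:
    """Look up the class by name and construct it from its saved config."""
    if name not in REGISTRY:
        raise ValueError(f"Unknown embedding function: {name!r}")
    return REGISTRY[name](**config)


original = ToyEmbeddingFunction(model="toy-large")
stored = json.dumps({"name": original.name(), "config": original.get_config()})

loaded = json.loads(stored)
restored = build_from_config(loaded["name"], loaded["config"])
print(restored.get_config())
```

Note that the example config shown earlier contains only the model name, so a restored function still needs the API key to be available at load time.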
Choosing a Provider
For Production Use
- OpenAI: Best quality embeddings, reliable API
- Cohere: Enterprise features, good multilingual support
- AWS Bedrock: If already using AWS infrastructure
For Cost-Effective Solutions
- Hugging Face Inference API: Pay-per-use, many free models
- Sentence Transformers: Run locally, no API costs
For Privacy/Offline
- Ollama: Run models locally on your hardware
- Sentence Transformers: Local inference
For Experimentation
- Ollama: Easy to try different models
- Hugging Face: Largest selection of models
Environment Variables
Most providers support environment variables for API keys:
# OpenAI
export OPENAI_API_KEY="sk-..."
# Hugging Face
export HF_TOKEN="hf_..."
# Google GenAI
export GOOGLE_API_KEY="..."
# AWS Bedrock (uses boto3 credential chain)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
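Before constructing a provider, it can be useful to check which keys are actually set. The following sketch uses only the standard library; the variable names come from the examples above, while `require_env`, `configured_providers`, and the provider labels are hypothetical helpers rather than GrafitoDB functions.

```python
import os
from collections.abc import Mapping

# Variable names from the examples above; the provider labels are informal.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "huggingface": "HF_TOKEN",
    "google": "GOOGLE_API_KEY",
}


def require_env(var_name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the variable's value, or raise ValueError if it is unset or blank."""
    value = environ.get(var_name, "").strip()
    if not value:
        raise ValueError(f"Environment variable {var_name} is not set")
    return value


def configured_providers(environ: Mapping[str, str] = os.environ) -> list[str]:
    """List the providers whose API key variable has a non-blank value."""
    return [
        provider
        for provider, var_name in PROVIDER_ENV_VARS.items()
        if environ.get(var_name, "").strip()
    ]


print(configured_providers())  # depends on the current shell environment
```

Passing the environment as a parameter keeps the helpers easy to test with a plain dictionary instead of the real process environment.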
Error Handling
from grafito.embedding_functions import OpenAIEmbeddingFunction
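try:
    # Construction fails with ValueError on a configuration problem
    embed_fn = OpenAIEmbeddingFunction()
except ValueError as e:
    print(f"Configuration error: {e}")
else:
    # Only call the API when the function was created successfully
    try:
        embeddings = embed_fn(["text to embed"])
    except Exception as e:
        print(f"API error: {e}")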
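API calls can fail transiently, so a common pattern is to retry with exponential backoff. The sketch below wraps any callable embedding function; `embed_with_retry` and `FlakyEmbedder` are hypothetical names, and catching `Exception` broadly mirrors the example above because the exception types raised by each provider are not specified here. In real code, narrow the `except` clause to errors that are known to be retryable.

```python
import time
from typing import Callable


def embed_with_retry(
    embed_fn: Callable[[list[str]], list[list[float]]],
    texts: list[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> list[list[float]]:
    """Call embed_fn, retrying on failure with delays of base_delay * 2**n seconds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return embed_fn(texts)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


class FlakyEmbedder:
    """Test double that fails a fixed number of times before succeeding."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self, input: list[str]) -> list[list[float]]:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated transient failure")
        return [[0.0, 1.0] for _ in input]


flaky = FlakyEmbedder(failures=2)
delays: list[float] = []  # record delays instead of actually sleeping
result = embed_with_retry(flaky, ["text to embed"], sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the demonstration instant; production code can rely on the default `time.sleep`.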
Next Steps
- OpenAI Integration: OpenAI's embedding models
- Hugging Face: Open-source models
- Cloud Providers: Google, AWS, Cohere, etc.
- Ollama: Local embedding models