pyspark_hnsw_approximate_nearest_neighbor_search_quickstart.py

This quickstart demonstrates how to create HNSW indexes for approximate nea

YannBrrd/pyspark-hnsw

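from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from pyspark_hnsw.knn import HnswSimilarity

# Reuse the active SparkSession (as in a notebook or the pyspark shell) or create one
spark = SparkSession.builder.appName("hnsw-quickstart").getOrCreate()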

# Prepare training data
data = [
    (0, Vectors.dense([1.0, 1.0])),
    (1, Vectors.dense([1.0, 0.9])),
    (2, Vectors.dense([0.1, 0.1])),
    (3, Vectors.dense([0.1, 0.2]))
]
df = spark.createDataFrame(data, ["id", "features"])

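# Configure the HNSW model
hnsw = HnswSimilarity(
    identifierCol="id",         # column holding each item's unique identifier
    featuresCol="features",     # column holding the vector to index
    distanceFunction="cosine",  # metric used to compare vectors
    m=16,                       # HNSW graph connectivity (links per node)
    efConstruction=200,         # candidate list size while building the index
    k=2                         # number of neighbors to return per query
)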

# Fit the model, which builds the HNSW index over the training vectors
model = hnsw.fit(df)

# Perform k-NN search
query_data = [(4, Vectors.dense([1.0, 1.0]))]
query_df = spark.createDataFrame(query_data, ["id", "features"])

results = model.transform(query_df)
results.show()
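
To sanity-check the approximate neighbors on a dataset this small, one option is to compare them against an exact brute-force search. The sketch below is not part of pyspark-hnsw: `cosine_distance` and `exact_knn` are hypothetical helper names, it uses plain Python lists instead of Spark vectors, and it assumes the library's "cosine" distance means 1 minus cosine similarity.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def exact_knn(query, items, k):
    """Brute-force k-NN: score every item, then keep the k smallest distances."""
    scored = [(item_id, cosine_distance(query, vec)) for item_id, vec in items]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Same four vectors and query as the quickstart, as plain Python lists
items = [(0, [1.0, 1.0]), (1, [1.0, 0.9]), (2, [0.1, 0.1]), (3, [0.1, 0.2])]
neighbors = exact_knn([1.0, 1.0], items, k=2)
print(neighbors)
```

Under that assumption, the exact nearest neighbors of the query are ids 0 and 2, both at a distance of essentially zero. Cosine distance ignores vector magnitude, so `[0.1, 0.1]` points in the same direction as `[1.0, 1.0]` and ranks ahead of `[1.0, 0.9]`.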