Replace explorer notebook with explorer.md docs page (#18459)

Co-authored-by: UltralyticsAssistant <web@ultralytics.com>
Muhammad Rizwan Munawar 2024-12-30 19:07:26 +05:00 committed by GitHub
parent f7b9009c91
commit 674f7898a8
5 changed files with 288 additions and 620 deletions

View File

@ -47,7 +47,7 @@ jobs:
python-version: "3.x"
- uses: astral-sh/setup-uv@v5
- name: Install Dependencies
run: uv pip install --system ruff black tqdm mkdocs-material "mkdocstrings[python]" mkdocs-jupyter mkdocs-redirects mkdocs-ultralytics-plugin mkdocs-macros-plugin
run: uv pip install --system ruff black tqdm mkdocs-material "mkdocstrings[python]" mkdocs-redirects mkdocs-ultralytics-plugin mkdocs-macros-plugin
- name: Ruff fixes
continue-on-error: true
run: ruff check --fix --unsafe-fixes --select D --ignore=D100,D104,D203,D205,D212,D213,D401,D406,D407,D413 .

View File

@ -1,616 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "aa923c26-81c8-4565-9277-1cb686e3702e",
"metadata": {
"id": "aa923c26-81c8-4565-9277-1cb686e3702e"
},
"source": [
"# VOC Exploration Example\n",
"<div align=\"center\">\n",
"\n",
" <a href=\"https://ultralytics.com/yolo\" target=\"_blank\">\n",
" <img width=\"1024\", src=\"https://raw.githubusercontent.com/ultralytics/assets/main/yolov8/banner-yolov8.png\"></a>\n",
"\n",
" [中文](https://docs.ultralytics.com/zh/) | [한국어](https://docs.ultralytics.com/ko/) | [日本語](https://docs.ultralytics.com/ja/) | [Русский](https://docs.ultralytics.com/ru/) | [Deutsch](https://docs.ultralytics.com/de/) | [Français](https://docs.ultralytics.com/fr/) | [Español](https://docs.ultralytics.com/es/) | [Português](https://docs.ultralytics.com/pt/) | [Türkçe](https://docs.ultralytics.com/tr/) | [Tiếng Việt](https://docs.ultralytics.com/vi/) | [العربية](https://docs.ultralytics.com/ar/)\n",
"\n",
" <a href=\"https://console.paperspace.com/github/ultralytics/ultralytics\"><img src=\"https://assets.paperspace.io/img/gradient-badge.svg\" alt=\"Run on Gradient\"/></a>\n",
" <a href=\"https://colab.research.google.com/github/ultralytics/ultralytics/blob/main/examples/tutorial.ipynb\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"></a>\n",
" <a href=\"https://www.kaggle.com/models/ultralytics/yolo11\"><img src=\"https://kaggle.com/static/images/open-in-kaggle.svg\" alt=\"Open In Kaggle\"></a>\n",
"\n",
"Welcome to the Ultralytics Explorer API notebook! This notebook serves as the starting point for exploring the various resources available to help you get started with using Ultralytics to explore your datasets using with the power of semantic search. You can utilities out of the box that allow you to examine specific types of labels using vector search or even SQL queries.\n",
"\n",
"We hope that the resources in this notebook will help you get the most out of Ultralytics. Please browse the Explorer <a href=\"https://docs.ultralytics.com/\">Docs</a> for details, raise an issue on <a href=\"https://github.com/ultralytics/ultralytics\">GitHub</a> for support, and join our <a href=\"https://ultralytics.com/discord\">Discord</a> community for questions and discussions!\n",
"\n",
"Try `yolo explorer` powered by Exlorer API\n",
"\n",
"Simply `pip install ultralytics` and run `yolo explorer` in your terminal to run custom queries and semantic search on your datasets right inside your browser!\n",
"\n",
"</div>"
]
},
{
"cell_type": "markdown",
"source": [
"## Ultralytics Explorer support deprecated ⚠️\n",
"\n",
"As of **`ultralytics>=8.3.10`**, Ultralytics explorer support has been deprecated. But dont worry! You can now access similar and even enhanced functionality through [Ultralytics HUB](https://hub.ultralytics.com/), our intuitive no-code platform designed to streamline your workflow. With Ultralytics HUB, you can continue exploring, visualizing, and managing your data effortlessly, all without writing a single line of code. Make sure to check it out and take advantage of its powerful features!🚀"
],
"metadata": {
"id": "RHe1PX5c7uK2"
},
"id": "RHe1PX5c7uK2"
},
{
"cell_type": "markdown",
"id": "2454d9ba-9db4-4b37-98e8-201ba285c92f",
"metadata": {
"id": "2454d9ba-9db4-4b37-98e8-201ba285c92f"
},
"source": [
"## Setup\n",
"Pip install `ultralytics` and [dependencies](https://github.com/ultralytics/ultralytics/blob/main/pyproject.toml) and check software and hardware."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "433f3a4d-a914-42cb-b0b6-be84a84e5e41",
"metadata": {
"id": "433f3a4d-a914-42cb-b0b6-be84a84e5e41"
},
"outputs": [],
"source": [
"%pip install ultralytics[explorer] openai\n",
"import ultralytics\n",
"\n",
"ultralytics.checks()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ae602549-3419-4909-9f82-35cba515483f",
"metadata": {
"id": "ae602549-3419-4909-9f82-35cba515483f"
},
"outputs": [],
"source": [
"from ultralytics import Explorer"
]
},
{
"cell_type": "markdown",
"id": "d8c06350-be8e-45cf-b3a6-b5017bbd943c",
"metadata": {
"id": "d8c06350-be8e-45cf-b3a6-b5017bbd943c"
},
"source": [
"## Similarity search\n",
"Utilize the power of vector similarity search to find the similar data points in your dataset along with their distance in the embedding space. Simply create an embeddings table for the given dataset-model pair. It is only needed once and it is reused automatically.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "334619da-6deb-4b32-9fe0-74e0a79cee20",
"metadata": {
"id": "334619da-6deb-4b32-9fe0-74e0a79cee20"
},
"outputs": [],
"source": [
"exp = Explorer(\"VOC.yaml\", model=\"yolo11n.pt\")\n",
"exp.create_embeddings_table()"
]
},
{
"cell_type": "markdown",
"id": "b6c5e42d-bc7e-4b4c-bde0-643072a2165d",
"metadata": {
"id": "b6c5e42d-bc7e-4b4c-bde0-643072a2165d"
},
"source": [
"One the embeddings table is built, you can get run semantic search in any of the following ways:\n",
"- On a given index / list of indices in the dataset like - `exp.get_similar(idx=[1,10], limit=10)`\n",
"- On any image/ list of images not in the dataset - `exp.get_similar(img=[\"path/to/img1\", \"path/to/img2\"], limit=10)`\n",
"In case of multiple inputs, the aggregade of their embeddings is used.\n",
"\n",
"You get a pandas dataframe with the `limit` number of most similar data points to the input, along with their distance in the embedding space. You can use this dataset to perform further filtering\n",
"<img width=\"1120\" alt=\"Screenshot 2024-01-06 at 9 45 42PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/7742ac57-e22a-4cea-a0f9-2b2a257483c5\">\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b485f05b-d92d-42bc-8da7-5e361667b341",
"metadata": {
"id": "b485f05b-d92d-42bc-8da7-5e361667b341"
},
"outputs": [],
"source": [
"similar = exp.get_similar(idx=1, limit=10)\n",
"similar.head()"
]
},
{
"cell_type": "markdown",
"id": "acf4b489-2161-4176-a1fe-d1d067d8083d",
"metadata": {
"id": "acf4b489-2161-4176-a1fe-d1d067d8083d"
},
"source": [
"You can use the also plot the similar samples directly using the `plot_similar` util\n",
"<p>\n",
"\n",
" <img src=\"https://github.com/AyushExel/assets/assets/15766192/a3c9247b-9271-47df-aaa5-36d96c5034b1\" />\n",
"</p>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9dbfe7d0-8613-4529-adb6-6e0632d7cce7",
"metadata": {
"id": "9dbfe7d0-8613-4529-adb6-6e0632d7cce7"
},
"outputs": [],
"source": [
"exp.plot_similar(idx=6500, limit=20)\n",
"# exp.plot_similar(idx=[100,101], limit=10) # Can also pass list of idxs or imgs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "260e09bf-4960-4089-a676-cb0e76ff3c0d",
"metadata": {
"id": "260e09bf-4960-4089-a676-cb0e76ff3c0d"
},
"outputs": [],
"source": [
"exp.plot_similar(\n",
" img=\"https://ultralytics.com/images/bus.jpg\", limit=10, labels=False\n",
") # Can also pass any external images"
]
},
{
"cell_type": "markdown",
"id": "faa0b7a7-6318-40e4-b0f4-45a8113bdc3a",
"metadata": {
"id": "faa0b7a7-6318-40e4-b0f4-45a8113bdc3a"
},
"source": [
"<p>\n",
"<img src=\"https://github.com/AyushExel/assets/assets/15766192/8e011195-b0da-43ef-b3cd-5fb6f383037e\">\n",
"\n",
"</p>"
]
},
{
"cell_type": "markdown",
"id": "0cea63f1-71f1-46da-af2b-b1b7d8f73553",
"metadata": {
"id": "0cea63f1-71f1-46da-af2b-b1b7d8f73553"
},
"source": [
"## 2. Ask AI: Search or filter with Natural Language\n",
"You can prompt the Explorer object with the kind of data points you want to see and it'll try to return a dataframe with those. Because it is powered by LLMs, it doesn't always get it right. In that case, it'll return None.\n",
"<p>\n",
"<img width=\"1131\" alt=\"Screenshot 2024-01-07 at 2 34 53PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/c4a69fd9-e54f-4d6a-aba5-dc9cfae1bc04\">\n",
"\n",
"</p>\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "92fb92ac-7f76-465a-a9ba-ea7492498d9c",
"metadata": {
"id": "92fb92ac-7f76-465a-a9ba-ea7492498d9c"
},
"outputs": [],
"source": [
"df = exp.ask_ai(\"show me images containing more than 10 objects with at least 2 persons\")\n",
"df.head(5)"
]
},
{
"cell_type": "markdown",
"id": "f2a7d26e-0ce5-4578-ad1a-b1253805280f",
"metadata": {
"id": "f2a7d26e-0ce5-4578-ad1a-b1253805280f"
},
"source": [
"for plotting these results you can use `plot_query_result` util\n",
"Example:\n",
"```\n",
"plt = plot_query_result(exp.ask_ai(\"show me 10 images containing exactly 2 persons\"))\n",
"Image.fromarray(plt)\n",
"```\n",
"<p>\n",
" <img src=\"https://github.com/AyushExel/assets/assets/15766192/2cb780de-d05b-4412-a526-7f7f0f10e669\">\n",
"\n",
"</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1cfab84-9835-4da0-8e9a-42b30cf84511",
"metadata": {
"id": "b1cfab84-9835-4da0-8e9a-42b30cf84511"
},
"outputs": [],
"source": [
"# plot\n",
"from PIL import Image\n",
"\n",
"from ultralytics.data.explorer import plot_query_result\n",
"\n",
"plt = plot_query_result(exp.ask_ai(\"show me 10 images containing exactly 2 persons\"))\n",
"Image.fromarray(plt)"
]
},
{
"cell_type": "markdown",
"id": "35315ae6-d827-40e4-8813-279f97a83b34",
"metadata": {
"id": "35315ae6-d827-40e4-8813-279f97a83b34"
},
"source": [
"## 3. Run SQL queries on your Dataset!\n",
"Sometimes you might want to investigate a certain type of entries in your dataset. For this Explorer allows you to execute SQL queries.\n",
"It accepts either of the formats:\n",
"- Queries beginning with \"WHERE\" will automatically select all columns. This can be thought of as a short-hand query\n",
"- You can also write full queries where you can specify which columns to select\n",
"\n",
"This can be used to investigate model performance and specific data points. For example:\n",
"- let's say your model struggles on images that have humans and dogs. You can write a query like this to select the points that have at least 2 humans AND at least one dog.\n",
"\n",
"You can combine SQL query and semantic search to filter down to specific type of results\n",
"<img width=\"994\" alt=\"Screenshot 2024-01-06 at 9 47 30PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/92bc3178-c151-4cd5-8007-c76178deb113\">\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8cd1072f-3100-4331-a0e3-4e2f6b1005bf",
"metadata": {
"id": "8cd1072f-3100-4331-a0e3-4e2f6b1005bf"
},
"outputs": [],
"source": [
"table = exp.sql_query(\"WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10\")\n",
"table"
]
},
{
"cell_type": "markdown",
"id": "debf8a00-c9f6-448b-bd3b-454cf62f39ab",
"metadata": {
"id": "debf8a00-c9f6-448b-bd3b-454cf62f39ab"
},
"source": [
"Just like similarity search, you also get a util to directly plot the sql queries using `exp.plot_sql_query`\n",
"<img src=\"https://github.com/AyushExel/assets/assets/15766192/f8b66629-8dd0-419e-8f44-9837969ba678\">\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18b977e7-d048-4b22-b8c4-084a03b04f23",
"metadata": {
"id": "18b977e7-d048-4b22-b8c4-084a03b04f23"
},
"outputs": [],
"source": [
"exp.plot_sql_query(\"WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10\", labels=True)"
]
},
{
"cell_type": "markdown",
"id": "f26804c5-840b-4fd1-987f-e362f29e3e06",
"metadata": {
"id": "f26804c5-840b-4fd1-987f-e362f29e3e06"
},
"source": [
"## 3. Working with embeddings Table (Advanced)\n",
"Explorer works on [LanceDB](https://lancedb.github.io/lancedb/) tables internally. You can access this table directly, using `Explorer.table` object and run raw queries, push down pre and post filters, etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea69260a-3407-40c9-9f42-8b34a6e6af7a",
"metadata": {
"id": "ea69260a-3407-40c9-9f42-8b34a6e6af7a"
},
"outputs": [],
"source": [
"table = exp.table\n",
"table.schema"
]
},
{
"cell_type": "markdown",
"id": "238db292-8610-40b3-9af7-dfd6be174892",
"metadata": {
"id": "238db292-8610-40b3-9af7-dfd6be174892"
},
"source": [
"### Run raw queries\n",
"Vector Search finds the nearest vectors from the database. In a recommendation system or search engine, you can find similar products from the one you searched. In LLM and other AI applications, each data point can be presented by the embeddings generated from some models, it returns the most relevant features.\n",
"\n",
"A search in high-dimensional vector space, is to find K-Nearest-Neighbors (KNN) of the query vector.\n",
"\n",
"Metric\n",
"In LanceDB, a Metric is the way to describe the distance between a pair of vectors. Currently, it supports the following metrics:\n",
"- L2\n",
"- Cosine\n",
"- Dot\n",
"Explorer's similarity search uses L2 by default. You can run queries on tables directly, or use the lance format to build custom utilities to manage datasets. More details on available LanceDB table ops in the [docs](https://lancedb.github.io/lancedb/)\n",
"\n",
"<img width=\"1015\" alt=\"Screenshot 2024-01-06 at 9 48 35PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/a2ccdaf3-8877-4f70-bf47-8a9bd2bb20c0\">\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d74430fe-5aee-45a1-8863-3f2c31338792",
"metadata": {
"id": "d74430fe-5aee-45a1-8863-3f2c31338792"
},
"outputs": [],
"source": [
"dummy_img_embedding = [i for i in range(256)]\n",
"table.search(dummy_img_embedding).limit(5).to_pandas()"
]
},
{
"cell_type": "markdown",
"id": "587486b4-0d19-4214-b994-f032fb2e8eb5",
"metadata": {
"id": "587486b4-0d19-4214-b994-f032fb2e8eb5"
},
"source": [
"### Inter-conversion to popular data formats"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bb2876ea-999b-4eba-96bc-c196ba02c41c",
"metadata": {
"id": "bb2876ea-999b-4eba-96bc-c196ba02c41c"
},
"outputs": [],
"source": [
"df = table.to_pandas()\n",
"pa_table = table.to_arrow()"
]
},
{
"cell_type": "markdown",
"id": "42659d63-ad76-49d6-8dfc-78d77278db72",
"metadata": {
"id": "42659d63-ad76-49d6-8dfc-78d77278db72"
},
"source": [
"### Work with Embeddings\n",
"You can access the raw embedding from lancedb Table and analyse it. The image embeddings are stored in column `vector`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "66d69e9b-046e-41c8-80d7-c0ee40be3bca",
"metadata": {
"id": "66d69e9b-046e-41c8-80d7-c0ee40be3bca"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"embeddings = table.to_pandas()[\"vector\"].tolist()\n",
"embeddings = np.array(embeddings)"
]
},
{
"cell_type": "markdown",
"id": "e8df0a49-9596-4399-954b-b8ae1fd7a602",
"metadata": {
"id": "e8df0a49-9596-4399-954b-b8ae1fd7a602"
},
"source": [
"### Scatterplot\n",
"One of the preliminary steps in analysing embeddings is by plotting them in 2D space via dimensionality reduction. Let's try an example\n",
"\n",
"<img width=\"646\" alt=\"Screenshot 2024-01-06 at 9 48 58PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/9e1da25c-face-4426-abc0-2f64a4e4952c\">\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9a150e8-8092-41b3-82f8-2247f8187fc8",
"metadata": {
"id": "d9a150e8-8092-41b3-82f8-2247f8187fc8"
},
"outputs": [],
"source": [
"!pip install scikit-learn --q"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "196079c3-45a9-4325-81ab-af79a881e37a",
"metadata": {
"id": "196079c3-45a9-4325-81ab-af79a881e37a"
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"\n",
"# Reduce dimensions using PCA to 3 components for visualization in 3D\n",
"pca = PCA(n_components=3)\n",
"reduced_data = pca.fit_transform(embeddings)\n",
"\n",
"# Create a 3D scatter plot using Matplotlib's Axes3D\n",
"fig = plt.figure(figsize=(8, 6))\n",
"ax = fig.add_subplot(111, projection=\"3d\")\n",
"\n",
"# Scatter plot\n",
"ax.scatter(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], alpha=0.5)\n",
"ax.set_title(\"3D Scatter Plot of Reduced 256-Dimensional Data (PCA)\")\n",
"ax.set_xlabel(\"Component 1\")\n",
"ax.set_ylabel(\"Component 2\")\n",
"ax.set_zlabel(\"Component 3\")\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "1c843c23-e3f2-490e-8d6c-212fa038a149",
"metadata": {
"id": "1c843c23-e3f2-490e-8d6c-212fa038a149"
},
"source": [
"## 4. Similarity Index\n",
"Here's a simple example of an operation powered by the embeddings table. Explorer comes with a `similarity_index` operation-\n",
"* It tries to estimate how similar each data point is with the rest of the dataset.\n",
"* It does that by counting how many image embeddings lie closer than `max_dist` to the current image in the generated embedding space, considering `top_k` similar images at a time.\n",
"\n",
"For a given dataset, model, `max_dist` & `top_k` the similarity index once generated will be reused. In case, your dataset has changed, or you simply need to regenerate the similarity index, you can pass `force=True`.\n",
"Similar to vector and SQL search, this also comes with a util to directly plot it. Let's look at the plot first\n",
"<img width=\"633\" alt=\"Screenshot 2024-01-06 at 9 49 36PM\" src=\"https://github.com/AyushExel/assets/assets/15766192/96a9d984-4a72-4784-ace1-428676ee2bdd\">\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "953c2a5f-1b61-4acf-a8e4-ed08547dbafc",
"metadata": {
"id": "953c2a5f-1b61-4acf-a8e4-ed08547dbafc"
},
"outputs": [],
"source": [
"exp.plot_similarity_index(max_dist=0.2, top_k=0.01)"
]
},
{
"cell_type": "markdown",
"id": "28228a9a-b727-45b5-8ca7-8db662c0b937",
"metadata": {
"id": "28228a9a-b727-45b5-8ca7-8db662c0b937"
},
"source": [
"Now let's look at the output of the operation"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4161aaa-20e6-4df0-8e87-d2293ee0530a",
"metadata": {
"id": "f4161aaa-20e6-4df0-8e87-d2293ee0530a"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"sim_idx = exp.similarity_index(max_dist=0.2, top_k=0.01, force=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b01d5b1a-9adb-4c3c-a873-217c71527c8d",
"metadata": {
"id": "b01d5b1a-9adb-4c3c-a873-217c71527c8d"
},
"outputs": [],
"source": [
"sim_idx"
]
},
{
"cell_type": "markdown",
"id": "22b28e54-4fbb-400e-ad8c-7068cbba11c4",
"metadata": {
"id": "22b28e54-4fbb-400e-ad8c-7068cbba11c4"
},
"source": [
"Let's create a query to see what data points have similarity count of more than 30 and plot images similar to them."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "58d2557b-d401-43cf-937d-4f554c7bc808",
"metadata": {
"id": "58d2557b-d401-43cf-937d-4f554c7bc808"
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"sim_count = np.array(sim_idx[\"count\"])\n",
"sim_idx[\"im_file\"][sim_count > 30]"
]
},
{
"cell_type": "markdown",
"id": "a5ec8d76-271a-41ab-ac74-cf8c0084ba5e",
"metadata": {
"id": "a5ec8d76-271a-41ab-ac74-cf8c0084ba5e"
},
"source": [
"You should see something like this\n",
"<img src=\"https://github.com/AyushExel/assets/assets/15766192/649bc366-ca2d-46ea-bfd9-3097cf575584\">\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a7b2ee3-9f35-48a2-9c38-38379516f4d2",
"metadata": {
"id": "3a7b2ee3-9f35-48a2-9c38-38379516f4d2"
},
"outputs": [],
"source": [
"exp.plot_similar(idx=[7146, 14035]) # Using avg embeddings of 2 images"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -0,0 +1,286 @@
---
comments: true
description: Dive into advanced data exploration with Ultralytics Explorer. Perform semantic searches, execute SQL queries, and leverage AI-powered natural language insights for seamless data analysis.
keywords: Ultralytics Explorer, data exploration, semantic search, vector similarity, SQL queries, AI, natural language queries, machine learning, OpenAI, LLMs, Ultralytics HUB
---
# VOC Exploration Example
<div align="center">
<a href="https://www.ultralytics.com/events/yolovision" target="_blank"><img width="1024" src="https://github.com/ultralytics/docs/releases/download/0/ultralytics-yolov8-banner.avif" alt="Ultralytics YOLO banner"></a>
<a href="https://docs.ultralytics.com/zh">中文</a> |
<a href="https://docs.ultralytics.com/ko">한국어</a> |
<a href="https://docs.ultralytics.com/ja">日本語</a> |
<a href="https://docs.ultralytics.com/ru">Русский</a> |
<a href="https://docs.ultralytics.com/de">Deutsch</a> |
<a href="https://docs.ultralytics.com/fr">Français</a> |
<a href="https://docs.ultralytics.com/es/">Español</a> |
<a href="https://docs.ultralytics.com/pt">Português</a> |
<a href="https://docs.ultralytics.com/tr">Türkçe</a> |
<a href="https://docs.ultralytics.com/vi">Tiếng Việt</a> |
<a href="https://docs.ultralytics.com/ar">العربية</a>
<br>
<br>
<a href="https://github.com/ultralytics/ultralytics/actions/workflows/ci.yaml"><img src="https://github.com/ultralytics/ultralytics/actions/workflows/ci.yaml/badge.svg" alt="Ultralytics CI"></a>
<a href="https://pepy.tech/projects/ultralytics"><img src="https://static.pepy.tech/badge/ultralytics" alt="Ultralytics Downloads"></a>
<a href="https://zenodo.org/badge/latestdoi/264818686"><img src="https://zenodo.org/badge/264818686.svg" alt="Ultralytics YOLO Citation"></a>
<a href="https://discord.com/invite/ultralytics"><img alt="Ultralytics Discord" src="https://img.shields.io/discord/1089800235347353640?logo=discord&logoColor=white&label=Discord&color=blue"></a>
<a href="https://community.ultralytics.com/"><img alt="Ultralytics Forums" src="https://img.shields.io/discourse/users?server=https%3A%2F%2Fcommunity.ultralytics.com&logo=discourse&label=Forums&color=blue"></a>
<a href="https://reddit.com/r/ultralytics"><img alt="Ultralytics Reddit" src="https://img.shields.io/reddit/subreddit-subscribers/ultralytics?style=flat&logo=reddit&logoColor=white&label=Reddit&color=blue"></a>
<br>
<a href="https://console.paperspace.com/github/ultralytics/ultralytics"><img src="https://assets.paperspace.io/img/gradient-badge.svg" alt="Run Ultralytics on Gradient"></a>
<a href="https://colab.research.google.com/github/ultralytics/ultralytics/blob/main/examples/tutorial.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open Ultralytics In Colab"></a>
<a href="https://www.kaggle.com/models/ultralytics/yolo11"><img src="https://kaggle.com/static/images/open-in-kaggle.svg" alt="Open Ultralytics In Kaggle"></a>
<a href="https://mybinder.org/v2/gh/ultralytics/ultralytics/HEAD?labpath=examples%2Ftutorial.ipynb"><img src="https://mybinder.org/badge_logo.svg" alt="Open Ultralytics In Binder"></a>
<br>
</div>
Welcome to the Ultralytics Explorer API example! This page is a starting point for exploring your datasets with Ultralytics using the power of semantic search. It provides utilities out of the box that let you examine specific types of labels using vector search or even SQL queries.

Try `yolo explorer`, powered by the Explorer API.
Simply `pip install ultralytics` and run `yolo explorer` in your terminal to run custom queries and semantic search on your datasets right inside your browser!
## Ultralytics Explorer support deprecated ⚠️
As of **`ultralytics>=8.3.10`**, Ultralytics Explorer support has been deprecated. But don't worry! You can now access similar and even enhanced functionality through [Ultralytics HUB](https://hub.ultralytics.com/), our intuitive no-code platform designed to streamline your workflow. With Ultralytics HUB, you can continue exploring, visualizing, and managing your data effortlessly, all without writing a single line of code. Make sure to check it out and take advantage of its powerful features! 🚀
## Setup
Pip install `ultralytics` and [dependencies](https://github.com/ultralytics/ultralytics/blob/main/pyproject.toml) and check software and hardware.
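```python
# %pip is an IPython/Jupyter magic; in a terminal, run `pip install "ultralytics[explorer]" openai` instead
%pip install ultralytics[explorer] openai
import ultralytics
# Print a summary of the installed software and available hardware
ultralytics.checks()
```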
## Similarity Search
Utilize the power of vector similarity search to find the similar data points in your dataset along with their distance in the embedding space. Simply create an embeddings table for the given dataset-model pair. It is only needed once, and it is reused automatically.
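```python
from ultralytics import Explorer
# Build the embeddings table for the VOC dataset / yolo11n model pair (done once, then reused)
exp = Explorer("VOC.yaml", model="yolo11n.pt")
exp.create_embeddings_table()
```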
Once the embeddings table is built, you can run semantic search in any of the following ways:
- On a given index or list of indices in the dataset, e.g. `exp.get_similar(idx=[1,10], limit=10)`
- On any image or list of images not in the dataset, e.g. `exp.get_similar(img=["path/to/img1", "path/to/img2"], limit=10)`

In case of multiple inputs, the aggregate of their embeddings is used.

You get a pandas dataframe with the `limit` number of most similar data points to the input, along with their distance in the embedding space. You can use this dataframe to perform further filtering.
![Similarity search table](https://github.com/ultralytics/docs/releases/download/0/similarity-search-table.avif)
```python
# Search dataset by index
similar = exp.get_similar(idx=1, limit=10)
similar.head()
```
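To make the mechanics concrete, here is a minimal NumPy sketch of what a similarity search over an embeddings table amounts to. It is not Explorer's implementation: the `get_similar_sketch` helper, the toy 4-dimensional embeddings, and the use of a plain mean to aggregate multiple query embeddings are all assumptions made for illustration.

```python
import numpy as np


def get_similar_sketch(embeddings, query_idx, limit=3):
    """Hypothetical helper: return (indices, distances) of the `limit` rows closest to the query.

    `query_idx` may be a single index or a list of indices. For a list, the
    query vector is assumed to be the mean of the selected embeddings.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    query = embeddings[np.atleast_1d(query_idx)].mean(axis=0)
    # L2 (Euclidean) distance from the query vector to every row
    dists = np.linalg.norm(embeddings - query, axis=1)
    order = np.argsort(dists)[:limit]
    return order, dists[order]


# Toy "dataset" of five 4-dimensional embeddings (made-up values)
toy_embeddings = np.array(
    [
        [0.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.9, 0.1, 0.0, 0.0],
        [0.0, 5.0, 0.0, 0.0],
        [0.0, 0.0, 7.0, 0.0],
    ]
)

idx, dist = get_similar_sketch(toy_embeddings, query_idx=1, limit=3)
print(idx, dist)
```

The query row itself comes back first with distance 0, which is why results of this kind of search usually start with the input sample.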
You can also plot the similar samples directly using the `plot_similar` util.
![Similarity search image 1](https://github.com/ultralytics/docs/releases/download/0/similarity-search-image-1.avif)
```python
exp.plot_similar(idx=6500, limit=20)
# exp.plot_similar(idx=[100,101], limit=10) # Can also pass list of idxs or imgs
exp.plot_similar(
img="https://ultralytics.com/images/bus.jpg", limit=10, labels=False
) # Can also pass any external images
```
![Similarity search image 2](https://github.com/ultralytics/docs/releases/download/0/similarity-search-image-2.avif)
## Ask AI: Search or filter with Natural Language
You can prompt the Explorer object with the kind of data points you want to see, and it'll try to return a dataframe with those. Because it is powered by LLMs, it doesn't always get it right. In that case, it'll return None.
![Ask ai table](https://github.com/ultralytics/docs/releases/download/0/ask-ai-nlp-table.avif)
```python
df = exp.ask_ai("show me images containing more than 10 objects with at least 2 persons")
df.head(5)
```
To plot these results, you can use the `plot_query_result` util. Example:
```python
plt = plot_query_result(exp.ask_ai("show me 10 images containing exactly 2 persons"))
Image.fromarray(plt)
```
![Ask ai image 1](https://github.com/ultralytics/docs/releases/download/0/ask-ai-nlp-image-1.avif)
```python
# plot
from PIL import Image
from ultralytics.data.explorer import plot_query_result
plt = plot_query_result(exp.ask_ai("show me 10 images containing exactly 2 persons"))
Image.fromarray(plt)
```
## Run SQL queries on your Dataset
Sometimes you might want to investigate a certain type of entry in your dataset. For this, Explorer allows you to execute SQL queries. It accepts either of these formats:
- Queries beginning with "WHERE" automatically select all columns. This can be thought of as a shorthand query.
- Full queries, where you specify which columns to select.

This can be used to investigate model performance and specific data points. For example, let's say your model struggles on images that have humans and dogs. You can write a query like the one below to select the points that have at least 2 humans AND at least one dog.

You can combine SQL queries and semantic search to filter down to a specific type of result.
```python
table = exp.sql_query("WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10")
exp.plot_sql_query("WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10", labels=True)
```
![SQL queries table](https://github.com/ultralytics/docs/releases/download/0/sql-queries-table.avif)
```python
table = exp.sql_query("WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10")
table
```
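The two accepted query formats can be illustrated with a small standalone sketch. It uses Python's built-in `sqlite3` rather than Explorer's own query engine, and the `run_query_sketch` helper, the `embeddings` table name, and the storage of labels as one comma-separated string are assumptions made for illustration; the real table schema may differ.

```python
import sqlite3


def run_query_sketch(conn, query, table="embeddings"):
    """Hypothetical helper: expand a WHERE-only query into a full SELECT, then run it."""
    if query.strip().upper().startswith("WHERE"):
        # Shorthand form: select all columns from the table
        query = f"SELECT * FROM {table} {query}"
    return conn.execute(query).fetchall()


# In-memory table with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (im_file TEXT, labels TEXT)")
conn.executemany(
    "INSERT INTO embeddings VALUES (?, ?)",
    [
        ("a.jpg", "person, person, dog"),
        ("b.jpg", "person, dog"),
        ("c.jpg", "person, person, car"),
    ],
)

# Shorthand form: all columns are returned
shorthand = run_query_sketch(conn, "WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10")
# Full form: you choose the columns yourself
full = run_query_sketch(conn, "SELECT im_file FROM embeddings WHERE labels LIKE '%dog%'")
print(shorthand)
print(full)
```

Note that a pattern such as `'%person, person%'` is plain string matching, so in this sketch it only matches rows where the two `person` labels are adjacent in the stored text.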
Just like similarity search, you also get a util to directly plot the results of SQL queries using `exp.plot_sql_query`.
![SQL queries image 1](https://github.com/ultralytics/docs/releases/download/0/sql-query-image-1.avif)
```python
exp.plot_sql_query("WHERE labels LIKE '%person, person%' AND labels LIKE '%dog%' LIMIT 10", labels=True)
```
## Working with embeddings Table (Advanced)
Explorer works on [LanceDB](https://lancedb.github.io/lancedb/) tables internally. You can access this table directly using the `Explorer.table` object and run raw queries, push down pre- and post-filters, etc.
```python
table = exp.table
table.schema
```
### Run raw queries
Vector search finds the nearest vectors in the database. In a recommendation system or search engine, you can find products similar to the one you searched for. In LLM and other AI applications, each data point can be represented by embeddings generated by a model, and the search returns the most relevant entries.

A search in high-dimensional vector space finds the K-Nearest-Neighbors (KNN) of the query vector.

Metric: in LanceDB, a metric is the way to describe the distance between a pair of vectors. Currently, it supports the following metrics:
- L2
- Cosine
- Dot

Explorer's similarity search uses L2 by default. You can run queries on tables directly, or use the lance format to build custom utilities to manage datasets. More details on available LanceDB table ops are in the [docs](https://lancedb.github.io/lancedb/).
![Raw-queries-table](https://github.com/ultralytics/docs/releases/download/0/raw-queries-table.avif)
```python
dummy_img_embedding = list(range(256))  # placeholder 256-dimensional query vector
table.search(dummy_img_embedding).limit(5).to_pandas()
```
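For reference, the three metrics named above can be written out in a few lines of NumPy. These are the textbook definitions only; the function names are hypothetical, and LanceDB's exact conventions (for example, whether a distance is squared or how a dot product is turned into a distance) are not shown here and may differ.

```python
import numpy as np


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b):
    """One minus the cosine similarity of two non-zero vectors."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def dot_product(a, b):
    """Plain dot product (larger means more aligned, for vectors of equal norm)."""
    return float(np.dot(a, b))


# Two made-up, orthogonal vectors
a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])

print(l2_distance(a, b), cosine_distance(a, b), dot_product(a, b))
```

L2 is sensitive to vector magnitude, while cosine distance only looks at direction, which is one reason the choice of metric can change which neighbours are returned.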
### Interconversion to popular data formats
```python
df = table.to_pandas()
pa_table = table.to_arrow()
```
### Work with Embeddings
You can access the raw embeddings from the LanceDB table and analyse them. The image embeddings are stored in the `vector` column.
```python
import numpy as np
embeddings = table.to_pandas()["vector"].tolist()
embeddings = np.array(embeddings)
```
### Scatterplot
One of the preliminary steps in analysing embeddings is plotting them in a low-dimensional space via dimensionality reduction. Let's try an example that reduces them to 3 components with PCA:
![Scatterplot Example](https://github.com/ultralytics/docs/releases/download/0/scatterplot-sql-queries.avif)
```python
# Notebook magics (IPython/Jupyter only): install scikit-learn and render plots inline
%pip install scikit-learn
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
# Reduce dimensions using PCA to 3 components for visualization in 3D
pca = PCA(n_components=3)
reduced_data = pca.fit_transform(embeddings)
# Create a 3D scatter plot using Matplotlib's Axes3D
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
# Scatter plot
ax.scatter(reduced_data[:, 0], reduced_data[:, 1], reduced_data[:, 2], alpha=0.5)
ax.set_title("3D Scatter Plot of Reduced 256-Dimensional Data (PCA)")
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.set_zlabel("Component 3")
plt.show()
```
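For a flat 2D view, the same kind of reduction can be sketched with plain NumPy. This is an illustrative alternative to the scikit-learn call above, not part of Explorer: the `pca_2d_sketch` helper is hypothetical, and random data stands in for the real `embeddings` array.

```python
import numpy as np


def pca_2d_sketch(x):
    """Project the rows of `x` onto their first two principal components via SVD."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 256))  # stand-in for real embeddings
points_2d = pca_2d_sketch(fake_embeddings)
print(points_2d.shape)  # (100, 2)
```

The resulting two columns can be passed straight to `plt.scatter(points_2d[:, 0], points_2d[:, 1])`.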
### Similarity Index
Here's a simple example of an operation powered by the embeddings table. Explorer comes with a `similarity_index` operation:
- It tries to estimate how similar each data point is to the rest of the dataset.
- It does that by counting how many image embeddings lie closer than `max_dist` to the current image in the generated embedding space, considering `top_k` similar images at a time.

For a given dataset, model, `max_dist` and `top_k`, the similarity index is generated once and then reused. If your dataset has changed, or you simply need to regenerate the similarity index, you can pass `force=True`. As with vector and SQL search, this also comes with a util to plot it directly. Let's look at the plot first:
```python
exp.plot_similarity_index(max_dist=0.2, top_k=0.01)
```
![Similarity Index](https://github.com/ultralytics/docs/releases/download/0/similarity-index.avif)
Now let's look at the output of the operation
```python
sim_idx = exp.similarity_index(max_dist=0.2, top_k=0.01, force=False)
sim_idx
```
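The counting idea behind the similarity index can be sketched in NumPy as follows. This is a rough illustration, not Explorer's code: the `similarity_index_sketch` helper is hypothetical, and it assumes that `top_k` is a fraction of the dataset size, that distances are L2, and that a point is not counted as its own neighbour.

```python
import numpy as np


def similarity_index_sketch(embeddings, max_dist=0.2, top_k=0.5):
    """Hypothetical helper: for each row, count near neighbours closer than `max_dist`.

    Only the `top_k * n` nearest neighbours of each row are considered, where
    `top_k` is treated as a fraction of the dataset size (an assumption).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    n = len(embeddings)
    k = max(1, int(top_k * n))
    # Pairwise L2 distances, shape (n, n)
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # do not count a point as its own neighbour
    nearest = np.sort(dists, axis=1)[:, :k]
    return (nearest < max_dist).sum(axis=1)


# Three tightly clustered points and one outlier (made-up values)
toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
counts = similarity_index_sketch(toy, max_dist=0.2, top_k=0.5)
print(counts)
```

Points with a high count sit in dense regions of the embedding space (likely near-duplicates), while a count of zero flags an outlier.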
Let's create a query to see which data points have a similarity count of more than 30, and then plot images similar to them.
```python
import numpy as np
sim_count = np.array(sim_idx["count"])
sim_idx["im_file"][sim_count > 30]
```
You should see something like this
![similarity-index-image](https://github.com/ultralytics/docs/releases/download/0/similarity-index-image.avif)
```python
exp.plot_similar(idx=[7146, 14035]) # Using avg embeddings of 2 images
```

View File

@ -365,7 +365,7 @@ nav:
- datasets/explorer/index.md
- Explorer API: datasets/explorer/api.md
- Explorer Dashboard Demo: datasets/explorer/dashboard.md
- VOC Exploration Example: datasets/explorer/explorer.ipynb
- VOC Exploration Example: datasets/explorer/explorer.md
- YOLOv5:
- yolov5/index.md
- Quickstart: yolov5/quickstart_tutorial.md
@ -661,7 +661,6 @@ plugins:
add_share_buttons: True
add_css: False
default_image: https://raw.githubusercontent.com/ultralytics/assets/main/yolov8/banner-yolov8.png
- mkdocs-jupyter
- redirects:
redirect_maps:
hi/index.md: index.md

View File

@ -90,7 +90,6 @@ dev = [
"mkdocs>=1.6.0",
"mkdocs-material>=9.5.9",
"mkdocstrings[python]",
"mkdocs-jupyter", # notebooks
"mkdocs-redirects", # 301 redirects
"mkdocs-ultralytics-plugin>=0.1.8", # for meta descriptions and images, dates and authors
"mkdocs-macros-plugin>=1.0.5" # duplicating content (i.e. export tables) in multiple places