What does this repo signal mean?

Cohere published cohere-ai/sandbox-toy-semantic-search (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo cohere-ai/sandbox-toy-semantic-search · language Python · Toy demo repo with modest traction. onlylabs links this event to 1 captured evidence page and 6 related repo signals. It also maps to Infrastructure in the data-business radar.

Cohere Repo: cohere-ai/sandbox-toy-semantic-search

Captured source

GitHub/github.com/cohere-ai/sandbox-toy-semantic-search

cohere-ai/sandbox-toy-semantic-search repository metadata

Source ↗

published Nov 2, 2022 · seen Jun 5 · captured Jun 11 · http 200 · method plain

cohere-ai/sandbox-toy-semantic-search

Description: A demonstration of how a toy (but usable!) semantic search engine can be quickly built using Cohere's platform.

Language: Python

License: MIT

Stars: 118

Forks: 6

Open issues: 3

Created: 2022-11-02T12:03:13Z

Pushed: 2023-07-25T22:24:53Z

Default branch: main

Fork: no

Archived: yes

README:

(where the placeholder is the key you obtained, without the surrounding brackets).

Alternatively, you can pass COHERE_TOKEN= as an additional argument to any make command below.

Building an index

Follow these steps to first build a semantic index of your document collection. These steps produce a semantic index for the official Python docs, but could be adapted for arbitrary data collections.

Step 1: Get some text

First, download the Python documentation by running one of the following commands.

If you want to get started quickly, run

make download-python-docs-small

to limit the document set to the Python tutorial. We only recommend doing this for a quick test, as the results will be very limited.

If you want to test the search engine over the entire Python documentation, run

make download-python-docs

but be aware that producing the embeddings will take hours (although this only needs to be done once).

Alternatively, if you want to experiment with your own text, simply download it as .txt files to a directory called txt/ in this repository.

Step 2: Process the text into an index of embeddings (representations)

Once you have some text, you need to process it into a search index of embeddings and addresses.

This can be done by using the command

make embeddings

assuming your target text is under the ./txt/ directory.

The command will search the ./txt/ directory recursively for files with a .txt extension, and build a simple database of the embeddings, file name and line number of each paragraph.
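To make the "addresses" part of the index concrete, here is a minimal sketch of how paragraphs and their locations could be collected before embedding. It is not the repository's actual code: the function name collect_paragraphs is hypothetical, and it assumes a paragraph is a run of non-blank lines separated by blank lines. The embedding call itself is left out.

```python
from pathlib import Path


def collect_paragraphs(root):
    """Return (file_path, first_line_number, text) for each paragraph under root.

    Assumption (not confirmed by the README): a paragraph is a run of
    non-blank lines, and its address is the line number where it starts.
    """
    entries = []
    # rglob walks the directory tree recursively, matching only .txt files.
    for path in sorted(Path(root).rglob("*.txt")):
        buffer, start = [], None
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if line.strip():
                if start is None:
                    start = number  # remember where this paragraph begins
                buffer.append(line.strip())
            elif buffer:
                # A blank line closes the current paragraph.
                entries.append((str(path), start, " ".join(buffer)))
                buffer, start = [], None
        if buffer:
            # Flush a paragraph that runs to the end of the file.
            entries.append((str(path), start, " ".join(buffer)))
    return entries
```

Each returned tuple pairs a paragraph's text with the file name and line number needed to find it again; the text would then be sent to an embedding model and the vector stored alongside the address.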

Warning: If you have a lot of text to search, this can take a little while to finish!

Step 3: Build and launch the search engine

Once you have an embeddings.npz file built, you can use the following command to build a Docker image which will serve a simple REST app to allow you to query the database you have made:

make build

You can then start the server using

make run

This is slightly overkill for a simple example, but it's designed to reflect the fact that building an index of a large body of text is relatively slow, and ensures that querying the engine is fast.

If you want to use this project as a building block for a real application, it is likely that you will want to maintain your database of text embeddings in a server architecture and query it with a lightweight client. Packaging the server as a Docker application means that it is very simple to turn this into a 'real' application by deploying it to a cloud service.
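The split between slow indexing and fast querying can be illustrated with a small sketch of the query side. This is an assumption-laden example, not the server's implementation: it assumes cosine similarity as the ranking metric (the README does not say which metric is used), and the names cosine_similarity and top_matches are hypothetical. It uses plain Python lists so that it runs without extra dependencies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_matches(query_vector, index, num_results=3):
    """Rank index entries by similarity to query_vector.

    index is a list of (file_name, line_number, vector) tuples, which is
    an assumed in-memory layout for illustration only.
    """
    scored = [
        (cosine_similarity(query_vector, vector), file_name, line_number)
        for file_name, line_number, vector in index
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:num_results]
```

Because the document vectors are computed once and kept in memory, answering a query only costs one embedding call for the query text plus this comparison loop.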

Step 4: Query your search engine

If you open a new terminal window for any of the options below, remember to run

export COHERE_TOKEN=

Via a viewer script

By far the easiest option is to run our helper script:

scripts/ask.sh "My query here"

to query the database. The script takes an optional second argument specifying the number of desired results.

The script pops up a modified vim interface, with the following commands:

Press q to quit.
Press the UP or LEFT arrow to page up in the list of results (shown in the bottom pane).
Press the DOWN or RIGHT arrow to page down in the list of results

The top pane will show you the position in the document where the result is found.

Via a REST API

Once the server is running, you can query it using a simple REST API. You can explore the API directly by going to /docs#/default/search_search_post. It's a simple JSON REST API; here's how you can ask a query using curl:

curl -X POST -H "Content-Type: application/json" -d '{"query": "How do I append to a list?", "num_results": 3}' http://localhost:8080/search

This will return a JSON list of length num_results, each with the filename and line-number (doc_url and block_url) of the blocks that were the closest semantic match to your query. But you probably want to actually just read the bit of the files that's the best answer.
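The same request can be made from Python using only the standard library. The sketch below mirrors the curl command above (same URL, header and JSON body); the function names build_search_request and search are hypothetical, and the shape of the response beyond "a JSON list" is not assumed.

```python
import json
import urllib.request


def build_search_request(query, num_results=3, base_url="http://localhost:8080"):
    """Build the POST request for the /search endpoint shown in the curl example."""
    payload = json.dumps({"query": query, "num_results": num_results}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, num_results=3, base_url="http://localhost:8080"):
    """Send the request and return the decoded JSON response.

    Requires the server from Step 3 to be running at base_url.
    """
    request = build_search_request(query, num_results, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (needs a running server):
# results = search("How do I append to a list?", num_results=3)
```

Building the request separately from sending it keeps the payload easy to inspect without a server running.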

Via vim

As we are searching through local text files, it's actually a bit easier to parse the output using command line tools; use the provided Python script utils/query_server.py to query it on the command line. query_server.py prints out the results in the standard file_name:line_number: format, so we can page through the actual results in a nice way by leveraging vim's quickfix mode.

Assuming you have vim on your machine, you can simply

vim +cw -M -q <(python utils/query_server.py "my_query" --num_results 3)

to get vim to open the indexed text files at the locations returned by the search algorithm. (Use :qall to close both the window and the quickfix navigator.) You can cycle through the returned results using :cn and :cp. The results aren't perfect; it's semantic search, so you would expect the matching to be a bit fuzzy. Despite this, I often find you can get the answer to your question in the first few results, and using Cohere's API lets you express your question in natural language, and lets you build a surprisingly effective search engine in just a few lines of code.
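For reference, turning search results into the file_name:line_number: format that quickfix reads takes only a few lines. This is a sketch rather than the contents of utils/query_server.py: the function name to_quickfix_lines is hypothetical, and it assumes each result is a dict whose doc_url and block_url keys hold the file name and line number.

```python
def to_quickfix_lines(results):
    """Format results as file_name:line_number: lines for vim's quickfix mode.

    Assumption: each result is a dict with 'doc_url' (file name) and
    'block_url' (line number), following the field names in the REST
    API description.
    """
    lines = []
    for result in results:
        lines.append(f"{result['doc_url']}:{result['block_url']}:")
    return lines
```

Printing these lines, one per result, gives vim -q the locations it needs to jump between.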

Example queries

Some good queries to try against the Python docs, which show the search working well on generic, natural-language questions, are:

How do I put new items in a list? (Note that this question avoids using the keyword 'append', and doesn't exactly match how the docs explain append (they say it's used to _add_ new items to the end of a list). But the semantic search correctly figures out that the relevant paragraph is still the best match.)

How do I put things in a list?
Are dictionary keys in insertion order?
What is the difference between a tuple and a list? (notice for this question, that the first result for me is an FAQ about...

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Toy demo repo with modest traction.