Search
How search works
Search is handled by OpenSearch, an ElasticSearch clone that runs as a service on AWS. Indexing and querying is managed by an extension to Wagtail's in-built search functionality.
This document outlines how that process is customised; the original process is based on the setup used by Github Doc's search.
This approach balances comprehensiveness (resurn all results) with a nuanced ability to tweak values for searches, so that the most relevant results should be first, and if not it's relatively easy to iteratively improve them.
If you are simply looking for the boost calculations, head straight for the end of this document.
Indexing
The first step is to control our search indexing in detail. Each searchable object type is listed below, along with the ORM fields that object contains that merit indexing; that is to say any data on that object that users may at any point use to search, refine or filter in order to find the result they are looking for.
Title, headings, excerpt and content
Each content editable text field on every page and content type (as far as possible) will be combined into only 4 fields (actually 8, see the next section); title
, headings
, excerpt
and content
. This is to minimise storage and retrieval times for OpenSearch, while separating the content into chunks that may be boosted independently.
Matches on document titles should be more relevant than matches on heading, which are more relevant than matches on the excerpt, which in turn are more relevant than matches within the body content. At the same time matches anywhere within the content should be equally ranked, and (for simplicity) we treat matches in any heading as equally valid.
As an example, a page containing headings "How to sign up" and "Quick start" would be indexed as a document with the headings
field contents being "How to sign up Quick start".
Note that fields containing metadata for filtering etc will also be present on the search documents.
Explicit fields
Each text field on each of the models above will be indexed twice; first as a raw, unprocessed field with an _explicit
suffix, and also as a stemmed and procesed field using the standard field name (this "tokenized" approach is the Wagtail default). Non-text fields will be indexed unprocessed with no suffix.
As an example, a Page
model with a title
field would be indexed as a Page
document in OpenSearch, containing both title
and title_explicit
fields. The title
field will support misspellings and stemmings (see below) while the title_explicit
field exists in order to be able to prioritise results containing exact matches.
Stemming
Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
We use the standard Snowball stemmer (English version) as recommended by ElasticSearch, OpenSearch and Wagtail, as well as the Github Docs team. It supports English language synonyms as well as stemming different forms of the same word, and it tries to avoid stemming similar words with different meanings as the same stem (even when they're based on the same root word).
Retrieval
As we ask OpenSearch to return results (documents) relating to the user's query, we will apply a series of "boost" values to various different configurations of the results. OpenSearch uses these boost values to give each result a score, which is how the results get ranked (note that the score itself doen't matter; what matters is the relative scores of different results).
Rather than send a single query to OpenSearch, we send a set of similar queries which allows us to ask for various types of match, all ranked according to their likelihood to be what the user is looking for.
An example is that we may ask for both and exact match and a fuzzy match for the same term in the same query, with the exact match having a higher score. Each search result will only be included once but this approach maximises the chances of: - returning results that match explicit (possibly advanced, complex) queries - returning relevant results to broader, less explicit queries
In reality each query submitted by the user will result in at least 7 individual search operations, often 19, and perhaps more depending on filtering etc. These will be managed within a single query to OpenSearch, and turnaround times should not suffer as a result.
Note that this means we have very large DSL strctures representing our search queries to OpenSearch.
Single or multiple term; AND, OR and Phrase matching
Searching for a single word or term is relatively simple; we want to search against each field and prioritise the fields with more importance, and where we have to use less processing (thereby giving priority to results that are exactly as typed by the user, rather than e.g. word variants).
When searching for terms containing multiple word we want to do all the above, but we also want to combine the words so that by default we search using both AND and OR strategies. Results for the AND search will be boosted higher than results for the OR search.
Effectively a search for "paper aeroplane" should return any document containing either word; at the same time it should rank results containing both "paper" and "aeroplane" higher than results with only the word "paper" or the word "aeroplane".
In addition, we also want to use "phrase" searching, where the set of words must all appear together in the same order; this is boosted above AND and OR matches as most relevant, and of course even more so if it's matched against unprocessed (explicit) fields.
Fuzzy matching
To take account of misspellings, we'll include fuzzy matching of terms so that "similar" words will be returned even when they don't exactly match any word or stem.
Fuzzy matching will only be applied to Page titles and key fields on other records, like Person names.
In order that these results are always below any non-fuzzy matches, the boost values for these results will be fractional; when multiplied they will relatively lower the relevancy score of the results returned by this search.
Filtering
Non-text fields like dates etc will be indexed to support the results being filtered by these values.
Advanced search terms
This multi-dimensional approach to search, with a complex set of queries representing various different ways to retrieve each given search term, avoids the need for "advanced search" queries being exposed to the user. As such, users will not benefit from trying to use keyword like "AND" or boolean symbols like "-" in their search queries.
How boosting is applied
As per the latest approach implemented by OpenSearch, no boosting is applied at index time - all boost values are calculated at retrieval time and applied in depth to each query.
TO calculate a query's boost value, the specific boost components for the field type, the analysis type, and the query type (i.e. phrase vs AND vs OR vs fuzzy) are multiplied together. In some cases extra values may be multiplied in to account for e.g. prioritising an exact match of a Tools page over other types.
This calculated value is then applied to a specific query, and this query is merged with the others in the matrix we're applying to create an overall DSL that's sent to OpenSearch. The bossts are applied to the subqueries, resulting in a set of results for which the scores have all been calculated according to the various vboosts applied.
Document types
OpenSearch treats each potential result as a "document" - it doesn't distinguish between pages, people, tools, news pages, etc. It's therefore important in our combined search that our boost values work together as a whole, with the boosts appropriately scaled.
Boost values
The following are the various base boost values we use in the search setup. When a search request is constructed, these values are multiplied together as appropriate for each specific type of query we send to OpenSearch.
Please note the intention is that these can be modified in a dedicated part of the admin system; these values are therefore only defaults, and based on the GitHub blog post referenced above.
Boost variable | Score | Description |
---|---|---|
BOOST_TITLE | 4.0 | Any match on a title field |
BOOST_HEADINGS | 3.0 | Any match on a non-title heading field |
BOOST_CONTENT | 1.0 | Any match wihin the content of the document |
BOOST_PHRASE | 10.0 | Exact match on an entire search phrase |
BOOST_AND | 2.5 | AND matches represent stronger links between terms than OR matches, they get this boost |
BOOST_FUZZY | 0.025 | Fuzzy handles basic misspellings gracefully but these results should be low in the list |
BOOST_EXPLICIT | 3.5 | No stemming or synonyms used; matches are on unprocessed text |
BOOST_TOKENIZED | 1.0 | Default; stemming and synonyms used for broader matching |
Boost combinations
These are basic combinations for a generic page type, to illustrate the process.
Field | Match Type | Boosts |
---|---|---|
title_explicit |
Exact Phrase | BOOST_EXPLICIT * BOOST_PHRASE * BOOST_TITLE |
title |
Exact Phrase | BOOST_PHRASE * BOOST_TITLE |
title_explicit |
AND | BOOST_EXPLICIT * BOOST_AND * BOOST_TITLE |
title |
AND | BOOST_AND * BOOST_TITLE |
title_explicit |
OR | BOOST_EXPLICIT * BOOST_TITLE |
title |
OR | BOOST_TITLE |
headings_explicit |
Exact Phrase | BOOST_EXPLICIT * BOOST_PHRASE * BOOST_HEADINGS |
headings |
Exact Phrase | BOOST_PHRASE * BOOST_HEADINGS |
headings_explicit |
AND | BOOST_EXPLICIT * BOOST_AND * BOOST_HEADINGS |
headings |
AND | BOOST_AND * BOOST_HEADINGS |
headings_explicit |
OR | BOOST_EXPLICIT * BOOST_HEADINGS |
headings |
OR | BOOST_HEADINGS |
content_explicit |
Exact Phrase | BOOST_EXPLICIT * BOOST_PHRASE * BOOST_CONTENT |
content |
Exact Phrase | BOOST_PHRASE * BOOST_CONTENT |
content_explicit |
AND | BOOST_EXPLICIT * BOOST_AND * BOOST_CONTENT |
content |
AND | BOOST_AND * BOOST_CONTENT |
content_explicit |
OR | BOOST_EXPLICIT * BOOST_CONTENT |
content |
OR | BOOST_CONTENT |
title |
OR / Fuzzy | BOOST_FUZZY |
Notes
The rest of this document is notes for implementation and not documentation
Specific content types
~~These content-specific boosts are applied to results in their own types and multiplied with any other boosts already applied.~~
Indexed fields are laid out below ~~along with a mapping to the relevbatn search field~~
- ContentPage (N.B. all pages will inherit from / implement these fields)
- title
- body (mapped to "content" and "headings" in search index)
- live filter
- slug filter (??)
- first_published_at filter
- last_published_at filter
- PageTopic
- topic
- PageWithTopics (most pages inherit from here)
- topics
- Network page
- NewsCategory
- category title
- NewsPage
- news_categories
- Document (i.e. PDFs etc) ??
- Person
- full_name
- contact_email
- profile_completness (for ordering)
- has_photo (for ordering)
- roles (job_titles)
- team related field
- town_city_or_region
- regional_building
- international_building
- ~~location_in_building~~
- key_skills
- other_key_skills
- fluent_languages
- intermediate_languages
- learning_interests
- other_learning_interests
- networks
- professions
- additional_roles
- other_additional_roles
- buildings
- is_active filter
- country filter
- grade filter
- do_not_work_for_ditc filter ???
- Team
- name
- abbreviation (can we do this with/without spaces to catch e.g. "UK DSE" vs "UKDSE")
- people ?
- description
- leaders ??
- job_titles ?
- Topic ???
- Tool (will this be a page?)
News
- BOOST_RECENCY
Tools
- BOOST_EXACT
Teams
People
Resolved Questions
- Teams - do we want to index e.g. all the job titles within the team? (Y)
- Teams - do we want to index members' names so searhcing a person also returns their team? (N)
- Can we do anything to help find "teams' main pages" as per a piece of feedback? Not a real thing - it was private (but maybe change private to hidden)
- index people against role and team membership
- Index excerpts with their own boost variable
- We add alt tags (actually figcaptions) to content
- people search maybe actually works ok at the moment
Open questions
- People indexing probably deserves dedicated treatment re fields, boosting, fuzzy matching etc
- What are all the different categories from an indexing point of view (e.g. is "working at DBT" content distinct from general content in any significant way?)
- Do we index excerpts any differently to content? Slightly higher boosted?
- some sort of self-serve debug (why doesnt something turn up)
- people - boosts on having profile pic and profile completeness
- do we need UI to show the matching field - so when you search people against team name you can see why they match
- News needs to prioritise recency - does all content?