Content indexing is the process of analyzing and organizing digital content, such as pages, news posts, and files, so that the Staffbase search engine can quickly return relevant results when you perform a search.
How Does Content Indexing Work?
When content is uploaded or created in the platform, the Staffbase algorithm analyzes it and extracts key elements, such as the title, text, and metadata. These elements are then stored in a structured index that helps the search engine deliver fast and accurate results.
When content is indexed, Staffbase applies the following strategies so that users can find content easily, even if they don’t use the exact wording. For example, for the title "International Volunteer Day":
- Phrase Search Indexing
  Multi-word phrases (usually 2–3 words) are indexed together so they can be matched as a unit during a search.
  Example: "International Volunteer Day", "Volunteer Day", "International Volunteer"
- Full Word Search Indexing
  Each word is indexed in its full form, allowing exact matches during a search.
  Example: "International", "Volunteer", "Day"
- Prefix Search Indexing
  Words are broken into multiple prefixes (from 1 to 20 characters), allowing matches based on the beginning of a word.
  Example: For "International": "I", "In", "Int", "Inte", "Inter", … up to "International" (1–20 characters)
This layered approach ensures that users receive relevant results whether they search for the full phrase, a single word, or even just the beginning of a term.
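The three strategies can be illustrated with a small sketch. This is only an approximation of the idea, not Staffbase's implementation: the function name `index_terms` and its parameters are hypothetical, and the limits (phrases of 2–3 words, prefixes of 1–20 characters) are taken from the description above.

```python
def index_terms(title, max_phrase_words=3, max_prefix_len=20):
    """Return the phrase, full-word, and prefix terms for a title (illustrative only)."""
    words = title.split()

    # Phrase terms: every run of 2..max_phrase_words consecutive words.
    phrases = [
        " ".join(words[i:i + n])
        for n in range(2, max_phrase_words + 1)
        for i in range(len(words) - n + 1)
    ]

    # Prefix terms: the leading 1..max_prefix_len characters of each word.
    prefixes = {
        word: [word[:k] for k in range(1, min(len(word), max_prefix_len) + 1)]
        for word in words
    }

    return {"phrase": phrases, "full_word": words, "prefix": prefixes}


terms = index_terms("International Volunteer Day")
print(terms["phrase"])
print(terms["full_word"])
print(terms["prefix"]["International"][:5])
```

A search for "Volun" would then match through the prefix terms of "Volunteer", while "Volunteer Day" would match through the phrase terms.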
What is Indexed?
| | | Pages | News | Files | Users | Apps & Links | Plugins (e.g. Forms & Surveys) |
|---|---|---|---|---|---|---|---|
| Title | Phrase | | | | | | |
| | Full word | | | | | | |
| | Prefix | | | | | | |
| Content | Phrase | | | | User fields | | |
| | Full word | | | | User fields | | |
| | Prefix | | | | | | |
| Metadata | - | Description | Teaser text | - | - | Description | - |
| | Phrase | | | | | | |
| | Full word | | | | | | |
| | Prefix | | | | | | |
| Additional Fields | - | | | | - | - | - |
Title: Refers to the name given to a page, a news post, or a file.
Content: Refers to the body of a page, a news post, or a file.
Metadata: Refers to information that is not part of the core content of a news post or a page. For example, the teaser text for a news post and the description field for a page.
Description: Refers to additional context about a page. The description is available for Pages only and shows in the search results.
Teaser Text: Refers to the purpose and aim of a news post. The teaser text is available for News only.
Full word: Refers to exact word matching. Example: If users search for “Staffbase”, the search results show content that matches the word “Staffbase” exactly.
Phrase: Refers to multi-word search terms. Example: If users search for “Staffbase Studio”, the results show content that matches the exact combination of both words.
Prefix: Refers to matching on the beginning of a word. Example: If users search for “Staff”, the search results show content containing words that start with “Staff”, such as “Staffbase”.
User fields: Refers to the profile fields (system and custom) created by an admin in the Studio settings.
Additional Fields: Other structured data that supports search, such as hashtags used in pages or news posts.
Data Processing During Indexing
During indexing, data is processed to make it easier to retrieve information later.
When indexing, the search analyzes text in the following ways:
- Case insensitivity: All characters in the content are treated as lowercase. For example, apple and Apple are considered the same.
- ASCII characters: The indexing converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 128 characters, which correspond to ASCII) to their ASCII equivalent, if one exists. For example, the indexing process changes à to a.
- Language analyzer: Language analyzers are applied to the content and teaser text of News and to the content and description of Pages. They analyze text according to the rules of its language and handle the following:
  - Stop words: Stop words are commonly used words that are typically ignored in search queries or text analysis because they add little to the meaning of a sentence. They are often short and occur frequently but carry little specific information about the content, such as "the", "and", "is", "in", "of", and "to".
  - Stemming: Stemming removes suffixes from words to obtain a common linguistic base. This groups variations of a word, reduces the dimensionality of the data, and improves the efficiency of text processing and analysis.
- Special characters: Special characters, such as !"#$%&'()*+,-./:;<=>?@[]^_`{|}~§°, are replaced with a space.
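The processing steps above can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not the actual indexing pipeline: `normalize`, `STOP_WORDS`, and `SPECIAL` are hypothetical names, the stop-word list contains only the examples given above, ASCII conversion is approximated by stripping combining marks, and stemming is left out because it depends on the language analyzer.

```python
import unicodedata

# Assumption: only the example stop words and special characters from the text.
STOP_WORDS = {"the", "and", "is", "in", "of", "to"}
SPECIAL = "!\"#$%&'()*+,-./:;<=>?@[]^_`{|}~§°"


def normalize(text):
    """Lowercase, replace special characters, fold accents, drop stop words."""
    text = text.lower()

    # Special characters are replaced with a space.
    text = text.translate({ord(ch): " " for ch in SPECIAL})

    # Approximate ASCII conversion: decompose characters, then drop the
    # combining accent marks (so "à" becomes "a").
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Split on whitespace and remove stop words.
    return [token for token in folded.split() if token not in STOP_WORDS]


print(normalize("The Café in Zürich!"))  # → ['cafe', 'zurich']
```

Because the same normalization would apply to "Apple" and "apple", both end up as the same indexed token.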
How is Content Ranked?
Staffbase uses a full-text search algorithm based on BM25, an industry-standard ranking model. This algorithm determines which results are most relevant to your search query. The key ranking factors include:
- Term frequency (TF): The number of times a given word (term) appears in a document
- Inverse Document Frequency (IDF): How rare a term is across all documents. Terms that appear in fewer documents carry more weight than terms that appear in many
- Document Length (DL): The length of a document compared to the average length of all documents
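To show how these three factors interact, here is a sketch of the commonly published BM25 formula for a single term. The function name is hypothetical, and the parameters `k1=1.2` and `b=0.75` are widely used defaults, not values confirmed for Staffbase.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.2, b=0.75):
    """Score one term in one document with the standard BM25 formula."""
    # IDF: terms found in fewer documents get a higher weight.
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

    # DL: documents longer than average are penalized, shorter ones favored.
    length_norm = 1 - b + b * doc_len / avg_doc_len

    # TF: more occurrences raise the score, with diminishing returns.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


rare = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, docs_with_term=5)
common = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, docs_with_term=500)
print(rare > common)  # → True
```

With everything else equal, a term that occurs in 5 of 1,000 documents contributes more to the score than one that occurs in 500 of them.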
The algorithm considers different parts of each document, such as the title, description, content, and additional fields, and evaluates them based on three types of matches:
- Phrase matches (for example, “employee handbook”)
- Individual word matches (for example, “employee” or “handbook”)
- Prefix matches (for example, “hand” matches “handbook”)
Each type of match receives a different boost, depending on where it appears. Each boost is a multiplier applied to the relevance score. For example, in the title of a page:
- Phrase match: x 15
- Word match: x 8
- Prefix match: x 2
The boost multipliers for phrase, word, and prefix matches differ depending on placement within the content, such as the page title or description.
The algorithm adds up the scores for all matches within each field, and the highest scoring field determines the page’s final relevance score.
For News posts and Pages, Staffbase then applies the following boosts on top of this score:
- Direct Access Boost: If a user has direct access to a page or news post, it receives an extra boost, since it’s more likely to be relevant.
- Freshness Boost: To prioritize more recent content, pages and news posts receive date-based boosts:
  - Published within the last 5 weeks: +30
  - Published within the last 6 months: +20
  - Published within the last 12 months: +10
The results are then ranked, with the most relevant ones appearing at the top.
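The scoring steps described in this section can be sketched as follows. This is an illustration under assumptions, not the actual ranking code: the function names are hypothetical, only the title multipliers (15, 8, 2) and the freshness values (+30, +20, +10) come from the text, "6 months" is approximated as 182 days, the freshness value is assumed to be added to the score, and the size of the direct access boost is not documented, so it is left as a parameter.

```python
from datetime import date, timedelta

# Title multipliers from the text; other fields use different, undocumented values.
TITLE_BOOSTS = {"phrase": 15, "word": 8, "prefix": 2}


def field_score(match_scores, boosts):
    """Add up the boosted scores of all match types within one field."""
    return sum(boosts[kind] * score for kind, score in match_scores.items())


def freshness_boost(published, today):
    """Date-based boost; 182 days is an assumed stand-in for "6 months"."""
    age = today - published
    if age <= timedelta(weeks=5):
        return 30
    if age <= timedelta(days=182):
        return 20
    if age <= timedelta(days=365):
        return 10
    return 0


def final_score(field_scores, published, today, direct_access_boost=0.0):
    """Highest-scoring field plus freshness and (assumed additive) access boosts."""
    return max(field_scores.values()) + freshness_boost(published, today) + direct_access_boost


title = field_score({"phrase": 1.0, "word": 0.5, "prefix": 0.25}, TITLE_BOOSTS)
score = final_score({"title": title, "content": 3.0}, date(2024, 1, 1), date(2024, 1, 20))
print(title, score)  # → 19.5 49.5
```

Here the title field wins over the content field, and the page receives the +30 freshness boost because it was published 19 days before the search.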
You can also use the dropdown menus to sort your search results, for example, by date or alphabetically.