In many cases, organisations ask search consultants to “make search work”, and treat it as a single, straightforward IT project: the consultant does some ‘magic’ and search just starts working.
In reality, it’s rarely that simple.
Nobody has a ‘search magic wand’. Usually, significant work has to be done on the content before anything can be done with search. With this case study, let me show you one very typical example that demonstrates the value and challenges of information quality.
Note: The project is on Microsoft 365 but could be on any other enterprise system.
An international organisation with offices in almost 50 countries realised that the more content it migrates to and stores in Microsoft 365, the worse the findability of everything becomes – despite the promise of out-of-the-box Microsoft Search.
There are some common mistakes organisations make when they migrate and create their content without planning and governance. The company in this example is no exception; they had the following issues and challenges before we started working with them:
- inconsistent information architecture
- missing metadata
- inconsistent metadata
- lack of content lifecycle
- lack of content curation
- inconsistent use of languages and translations.
Inconsistent information architecture
When there’s no plan and no guidance, nobody knows where and how to store the content. People do their best, but everyone has different backgrounds, experiences, and knowledge – therefore the way they store their content will be different.
Some might create top-level containers (site collections) for their team documents – others save all of the company archive into a single folder structure (library).
Some store their collaboration files in SharePoint – others share high importance corporate policies from their personal OneDrive. And everyone creates teams and channels in Microsoft Teams without knowing the implications.
And when none of these applications satisfies the users, new applications arrive, including shadow IT apps, because their promise always looks better than the messy reality. This spiral of adding more and more applications gets worse over time, with no real long-term benefit. Some examples are shown below.
- The organisation standard is to use Microsoft 365, but the users are not educated. They create SharePoint sites for everything, but they don’t use the available functionality – instead, they store the content in deeply nested folders, with no metadata. This is no better than storing the documents on a network drive.
- The organisation opens Microsoft Teams to enhance collaboration. With no guidance or governance, thousands of teams and channels are created. Every user becomes a member of dozens of teams, and eventually, the communication becomes too noisy, and users stop using Teams completely.
- The organisation stores documents in SharePoint Online, but sharing is disabled. Users start to use OneDrive, and eventually OneDrive becomes the primary source of truth. And since OneDrive is personal and is not meant for teamwork, the organisation will face major issues when the file owner leaves the company.
- The organisation stores documents in SharePoint Online, but external sharing is disabled. The users recognise that there must be a better option than sending documents as email attachments – so they start using Dropbox or Google Docs to share content with external partners. In many cases, these (sensitive) documents can be found on the public internet, too.
Missing metadata
Another challenge is when even the most advanced content management systems are being used as “smart” file shares. Users store their documents there, maybe organised into folders, but with no metadata at all. When they need something, they navigate to the content through the folder structure. However, if they don’t know where the content they need is stored, they’re lost.
At the same time, folder structures follow some logic: if you ask the users, they can tell you the first level is the client, the second level is the year, the third level is the project, etc. But using explicit metadata instead of cascaded folders is not something they’re familiar with or understand.
Some examples include:
- Many use folders instead of metadata. Storing year, client, customer, project name, project ID, etc. as metadata can provide much better filtering, sorting, ordering, grouping, and search options; therefore, the overall findability of content and user satisfaction will improve.
- In many cases, the metadata can be found in the document (implicit) but not added to the document as explicit metadata. While full-text search works to some extent in this case, explicit metadata can improve filtering, sorting, ordering, grouping, etc. options.
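The folder logic that users can describe (“the first level is the client, the second is the year, the third is the project”) can often be turned directly into explicit metadata during a migration. A minimal sketch in Python, assuming a hypothetical `client/year/project/file` convention like the one described above – the field names and path layout are illustrative, not a real client’s schema:

```python
from dataclasses import dataclass

@dataclass
class DocumentMetadata:
    client: str
    year: int
    project: str
    filename: str

def metadata_from_path(path: str) -> DocumentMetadata:
    """Derive explicit metadata from an implicit folder convention.

    Assumes the (hypothetical) convention: <client>/<year>/<project>/<file>.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"Path does not match client/year/project/file: {path!r}")
    client, year, project, filename = parts
    return DocumentMetadata(client=client, year=int(year),
                            project=project, filename=filename)

doc = metadata_from_path("AcmeCorp/2021/Website-Redesign/proposal.docx")
print(doc.client, doc.year, doc.project)
```

Once the values are explicit fields rather than folder names, they can drive the filtering, sorting, and grouping options mentioned above – and the folder hierarchy no longer has to be the only way to reach a document.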
Inconsistent metadata
If there’s something worse than no metadata, it’s inconsistent metadata. Below are a few examples of what we find when doing a content inventory at our clients:
- There’s no managed taxonomy, and users enter various synonyms for the same term. For example, “The Search Network”, “Search Network”, “Search Netw.”, “TSN”, etc. – to name just a few.
- Inconsistent use of languages and translations. If a user knows everything is in a common language (for example, English), it’s a very clear approach. However, if the organisation uses multiple languages, there must be a language strategy. What should be translated? What types of content are available locally? Also, if there’s a taxonomy, it has to be multi-lingual so that everyone can use it consistently and coherently.
- All metadata are free text, with no guidance or governance. Everyone enters whatever value they want to. In many cases, they are not even consistent with themselves.
- Date fields are used inconsistently. When there’s a “date” field, users’ understanding of what the “date” is about might be different. For example, during the scoping workshops, we often find that there’s only one “date” field assigned to the documents. Users use it to enter the date of approval, uploading, valid from, valid to, effective from, effective to, etc. – just to name a few, from the same environment. Different date formats also add to the complexity: in many cases, it’s not obvious whether 1/2/2021 means Jan 2 or Feb 1. It’s not only confusing when a user is entering the date but also when filtering or searching for a specific date range.
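The “1/2/2021” ambiguity above is easy to demonstrate. In the Python sketch below, the same string yields two different dates depending on which convention the reader assumes, while an ISO 8601 value (YYYY-MM-DD) has only one reading – which is why it is the safer choice for stored metadata and for search filters:

```python
from datetime import datetime

raw = "1/2/2021"

# The same string parses to two different dates depending on the
# assumed convention -- neither interpretation is "wrong" on its own.
us_style = datetime.strptime(raw, "%m/%d/%Y")   # January 2, 2021
eu_style = datetime.strptime(raw, "%d/%m/%Y")   # February 1, 2021
print(us_style.date(), eu_style.date())  # 2021-01-02 2021-02-01

# An ISO 8601 date string has exactly one possible reading.
iso = datetime.fromisoformat("2021-02-01")
assert iso == eu_style
```

The same reasoning applies to the semantics of the field itself: “approved on”, “valid from”, and “effective to” should be separate, clearly named fields rather than a single generic “date”.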
Lack of content lifecycle
Knowing which document is the “real”, the “official”, the most recent, or the approved one is essential. However, if there is no content lifecycle in place, and users send various versions of documents back and forth by email, there is no way to know which one is the “right” one: multiple versions with different and often conflicting content are very common.
Lack of content curation
Related to content lifecycle, content curation often fails, too. Even when the information architecture, taxonomy, and metadata are all in place, users make mistakes. This is why content curation should be part of the lifecycle: a formal process to review and correct information structure and metadata as needed.
When content curation is missing, information silos and multiple document versions accumulate, and the quality of metadata becomes messier and messier. After a while, the result is an information jungle where nothing can be found.
Inconsistent use of languages and translations
In a multi-lingual organisation, localised and translated content is an integral part of operations. Not every document has to be translated and localised, but for the ones that do, a consistent and appropriate translation is a must. In many cases, the localised pages and documents are not synchronised and updated when the original (mostly English) page changes, resulting in the same inconsistencies as the information architecture and metadata problems described above.
The (false) hope and promise of AI and auto-tagging
In the last few years, we’ve seen the rise of intelligent classification and auto-tagging solutions. While these might work in specific domains, applying them to any generic use case is as risky as the conditions described above. AI models have to be trained, maintained, and curated – and this might require no fewer resources in total, and even more planning and preparation in advance. A training set of content has to be identified and validated, and an AI model has to be trained, evaluated, and tested – all in an iterative way, improving and enhancing the model until the quality of the applied tags and terms is good enough.
And this leads us to the question: how do we measure information quality? How do we know if the content is trustworthy and of good quality? How do we know if it’s not? How do we measure the quality of auto-tagging, and how do we compare it to human tagging? What metrics should we apply? These questions always have to be answered before the implementation starts; otherwise, how do we know if and when we’re successful?
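One common way to compare auto-tagging against human tagging is to treat the human-applied tags as the gold standard and compute precision, recall, and F1 per document. The sketch below is a minimal illustration of that idea, with made-up tag sets – it is one possible metric among several, not a complete evaluation methodology:

```python
def tag_quality(auto_tags: set[str], human_tags: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 of auto-applied tags, scored
    against a human-curated gold set for the same document."""
    true_positives = len(auto_tags & human_tags)
    precision = true_positives / len(auto_tags) if auto_tags else 0.0
    recall = true_positives / len(human_tags) if human_tags else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: the auto-tagger applied 3 tags, 2 of which
# match the 4 tags a human curator applied to the same document.
scores = tag_quality({"finance", "policy", "2021"},
                     {"finance", "policy", "gdpr", "hr"})
print(scores)  # precision ≈ 0.667, recall = 0.5
```

Averaging such scores over a representative sample of documents gives a baseline to track across the iterative training rounds described above, and a concrete threshold for “good enough”.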
As you can see, the challenge is quite complex. Moreover, it is unique to every organisation – there is no one-size-fits-all solution. The best you can do is to undertake a detailed content inventory, evaluate and analyse what information your organisation has, and classify it by the following dimensions:
- type of content
- metadata requirements
- content lifecycle requirements
- permission and accessibility requirements.
Once the inventory is done, define the priorities.
Set up SMART (Specific, Measurable, Achievable, Relevant, Time-bound) goals, and commit to short-, mid-, and long-term processes to improve information quality. Don’t rush the process, but take significant steps towards the desired goals. Measure often, and adjust your next steps as needed – but always have a plan to follow. It will lead you to your goals, step by step.
This chapter was originally published in Search Insights 2021. Download the report for free here: