
To ensure every update produces an accurate and reliable RTIC list, we use a three-stage QA method designed to capture emerging sectors with precision. In the Build and Refine stages, we ground taxonomy changes in research, tighten sector boundaries through iterative testing, and verify relevance using language checks and cross-validation. Only once these steps are passed do we move to the Release stage, where we benchmark against external sources and manually review a random sample for final assurance. Together, these methods maintain consistency, limit false positives, and deliver sector classifications we can trust.
In this article, when we say “lists,” we mean the machine-learning lists that form the verticals on our platform. These verticals represent the sub-sectors of each industry we cover.

Build Stage

Ground-truthing

We begin by supporting each change to the taxonomy with journal articles, company reports, and a review of relevant trade associations. This ensures that the companies brought into our positive training set, or removed from it, are grounded in real data. It also gives us clarity in identifying the company attributes and keywords that best capture the new industry boundaries.

We update our positive training set by adding companies whose matched website text best reflects new aspects of an industry; we find these companies through in-depth research. We update our negative training set manually, checking that each company added to it establishes a clear boundary around what we don't want included.

Minimise unnecessary change

An inevitable byproduct of machine learning is that small changes can produce sizeable differences across lists. We use our similarity score to mitigate this. The score lets us identify companies that can replace those no longer active in our training sets, ensuring that the language we are looking for remains consistent.
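The article does not describe how the similarity score itself is computed. As a purely illustrative sketch, replacement candidates could be ranked by cosine similarity between TF-IDF vectors of website text; everything below (scikit-learn as the tooling, the function and variable names) is an assumption, not The Data City's actual implementation.

```python
# Hypothetical sketch: ranking candidate replacements for an inactive
# training-set company by textual similarity. Names are illustrative;
# the platform's real similarity score may be computed differently.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def suggest_replacements(inactive_text, candidate_texts, top_k=5):
    """Rank candidate companies by how closely their website text
    matches an inactive training-set company, so the language the
    model learns from stays consistent after the swap."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on all texts so the inactive company and candidates share one vocabulary.
    matrix = vectorizer.fit_transform([inactive_text] + list(candidate_texts))
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)
    return [(idx, round(float(score), 3)) for idx, score in ranked[:top_k]]
```

Ranking against the inactive company's own text, rather than against sector keywords alone, is what would keep the training-set language consistent.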

Refine Stage

Boundary testing

Updating a training set is an iterative process. It involves checking that the companies with the lowest scores in the list are still relevant to the sector; this score is based on how similar a company is to those provided in the positive training set as good examples of the sector. If the lowest-scoring companies aren't relevant, they are added to the negative training set and the check repeats. More details on our list-building process can be found here.

Once we confirm that low-scoring companies are relevant to the sector, we know our ML list has formed a boundary. We then set the score cutoff to a negative value to check whether this boundary is accurate. Companies with a low negative score were almost included but fall just outside the list. If these companies are actually relevant to the sector, the list is too restrictive and we need to adjust the negative training set. If these borderline companies work in similar but ultimately different or irrelevant industries, it suggests the list is correctly capturing the edge of the sector.

Language review

Once we've assessed a list's boundaries, we use keywords gathered from our research to internally QA that all companies are relevant. For example, in our RTIC005508 Renewable Energy Infrastructure vertical, most companies should include keywords like "grid," "energy," and "renewable" on their websites. We use this keyword search to identify companies that don't match and manually check their relevancy, adding irrelevant ones to the negative training set.

Sometimes a company may contain relevant language but should still be excluded: for instance, an energy charity campaigning for renewable energy. Although its website contains sector-relevant terms, it does not generate or distribute energy itself, so we manually remove it.

We also assess overall keyword accuracy through sector keyword enrichment, which compares the average prominence of keywords across all matched company websites with their prominence in our list. If irrelevant terms like "natural gas" or "fracking" are overrepresented, the list is capturing the wrong companies and needs further adjustment.

Cross-validation and sense-checking

Now that a list has been thoroughly checked, we finalise our refinements by confirming that the companies with the highest turnover and employee counts are prime examples of what should be found in the list. We use this feature to identify other large players that need to be sense-checked and researched, ensuring that their inclusion, and their large impact on financial summary statistics, is justified.

Whilst our primary aim when creating any vertical is to capture a distinct aspect of the emerging economy, we can use small overlaps across verticals to illustrate the confidence we have in our lists. As an example, in our RTIC00518 Pollution Remediation vertical, the main overlap is with other verticals within our Net Zero RTIC, but over 300 companies are also found in our RTIC0075 Land Remediation vertical. This is what we would expect, given the role that soil bioremediation and groundwater restoration play across both verticals. Since we have accurately classified these 300 companies into industries that we know are adjacent to what we are trying to capture, we can be confident that they also belong in this list.
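The overlap check itself amounts to set arithmetic over company identifiers. A minimal sketch, with made-up company numbers (on the platform this comparison is a built-in feature, not something scripted by hand):

```python
# Illustrative only: the company numbers are invented, and the two
# example verticals stand in for Pollution Remediation and Land
# Remediation as described above.
def vertical_overlap(vertical_a: set[str], vertical_b: set[str]) -> set[str]:
    """Return the companies classified into both verticals."""
    return vertical_a & vertical_b

pollution_remediation = {"01234567", "07654321", "04455667"}
land_remediation = {"07654321", "04455667", "09988776"}

shared = vertical_overlap(pollution_remediation, land_remediation)
print(f"{len(shared)} companies appear in both verticals: {sorted(shared)}")
```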
This cross-validation draws on the accuracy of our prior classifications to communicate our confidence in the accuracy of an updated list. It's important to note that whenever the positive or negative training set changes, the previous steps are repeated until the list passes all of them. Only at that point do we begin the Release stage.
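Conceptually, the whole Refine stage is a loop: any training-set change sends the list back through every check. The sketch below captures only that control flow; `build_list` and the entries of `checks` are hypothetical callables standing in for the boundary test, language review, and cross-validation described above.

```python
# Purely illustrative control flow. `build_list` rebuilds the candidate
# list from the current training sets; each check returns True on a pass
# and is assumed to adjust the training sets as a side effect on a fail.
def refine_until_stable(build_list, checks, max_rounds=20):
    """Re-run every check after any training-set change, repeating
    until the list passes all of them."""
    for _ in range(max_rounds):
        candidates = build_list()  # rebuild from the current training sets
        if all(check(candidates) for check in checks):
            return candidates      # ready for the Release stage
    raise RuntimeError("list did not stabilise; training sets need rework")
```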

Release Stage

External benchmarking

Using relevant trade associations and in-depth research, we ensure that the scale of our final list is accurate. As an example, REScoop, the European federation for energy communities and cooperatives, highlighted that 213 of its members are present in the UK. If our Community-Focused Renewable Energy vertical contained only 100 members, this would indicate that our list is failing to capture the entire sector. We critically review any such third-party lists to ensure that the companies we are comparing against are relevant.

Random sample QA

As the final step in our QA process, we randomly sample and manually review a subset of each list's results to verify accuracy and identify any potential issues with the classification. This is the final step of a lengthy QA process, and we do not arrive here unless we are already confident in the accuracy of our list: it would not be efficient to run a manual review that merely surfaces unexpected problems and triggers a further iteration and a second manual review. As such, we aim to conduct a manual review only once we are confident that the list is likely to pass. If a list does not 'pass' this review, it is developed further to address the issues found.

Companies that are completely irrelevant could indicate a deeper problem that may be capturing many other false positives. In that case we revisit the training set, rebuild the list, and QA another random sample from it. Companies that fail the QA due to slight technical differences, such as makers of domestic smart meters captured within an energy infrastructure list, are isolated, checked to ensure they are outliers rather than a sign of a skewed list, and removed from the final list.

Our sample review provides a statistically reliable estimate of how well the classification assigns companies to the list, ensuring that the measured accuracy meaningfully reflects the accuracy of the overall list. Combined with the Build and Refine stages, this gives us confidence that the classifier performs consistently and meets our quality standards.
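The article doesn't state which estimator underpins the sample-based accuracy figure; a Wilson score interval on the sampled precision is one standard choice, shown here purely as an assumption rather than The Data City's actual method.

```python
# Hypothetical sketch: estimating list precision from a random sample
# using a Wilson score interval. The article does not specify which
# estimator is actually used.
import math

def wilson_interval(correct, sampled, z=1.96):
    """95% confidence interval for the true share of correctly
    classified companies, given a manual review of `sampled`
    companies of which `correct` were judged relevant."""
    if sampled == 0:
        raise ValueError("sample must be non-empty")
    p = correct / sampled
    denom = 1 + z**2 / sampled
    centre = (p + z**2 / (2 * sampled)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2))
    return centre - margin, centre + margin

# e.g. 96 of 100 sampled companies judged relevant:
low, high = wilson_interval(96, 100)
print(f"Sampled precision: 96%, 95% CI ({low:.1%}, {high:.1%})")
```

Under these illustrative numbers, 96 relevant companies out of a sample of 100 yields a 95% interval of roughly 90% to 98%, which is the kind of quantified statement a random-sample review supports.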