Address Data Quality: An AI-Powered Approach

Modified on Tue, 8 Jul at 2:50 PM

In today's data-driven environment, the quality of your data significantly impacts operational efficiency. This is particularly true for address data, where inconsistencies, errors, and free-text entries can lead to practical challenges in logistics, customer relationship management, and analysis. Utilising AI and Large Language Models (LLMs) can provide an effective and intelligent solution for maintaining high-quality address data.

This article explores an Omniscope workflow designed to address the complexities of free-text address data, focusing on classification, merging, and extraction. This approach leverages AI to streamline processes that might traditionally be more manual.




The Challenge: Unstructured Address Data


Addresses entered as free text often present difficulties in standardisation and analysis. They can include variations in spelling, missing information, incorrect postcodes, and a mix of structured and unstructured elements. Traditional data quality methods often rely on rigid rules or extensive manual intervention, which can be time-consuming and prone to errors.




AI-Driven Data Quality Enhancements


Our featured Omniscope workflow demonstrates an AI-driven method for improving address data quality. By incorporating LLMs, we've developed a workflow that interprets, classifies, and transforms address data, and that copes with variations in data entry more flexibly than fixed rules can.

The workflow, which is attached to this article and can also be found here, consists of three key AI-powered blocks, each addressing a specific aspect of address data quality:





1. Intelligent Postcode Classification: Valid, Almost Valid, or Incorrect?


The first step involves accurately classifying postcodes. This block uses an external list of valid UK postcodes and the interpretive capabilities of an LLM to categorise postcodes into three groups:

  • Valid: An exact match with an official UK postcode.

  • Almost Valid: These postcodes may have minor discrepancies, typically in the last three characters (the inward code). The AI is designed to identify the closest valid match, which is useful for catching common data entry mistakes.

  • Incorrect: Postcodes with significant errors in the inward code, regardless of the validity of the outward code, are flagged as incorrect.


How AI Contributes: The AI block was configured with a prompt providing examples of address formats and clear definitions of "valid" and "almost valid" postcodes. This allows the LLM to identify patterns and make informed decisions, even with variations in data entry.
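For comparison, the three categories can also be sketched as a small rule-based function. This is not the prompt or logic used in the AI block: the function names are hypothetical, the valid list is a tiny stand-in built from postcodes quoted in this article rather than the official list, and "minor discrepancy" is assumed to mean exactly one differing character in the inward code with a matching outward code.

```python
from typing import Optional, Set, Tuple

# Stand-in for the external list of valid UK postcodes (assumption:
# taken from examples in this article, stored without spaces).
VALID_POSTCODES = {"W74XL", "LU41JF", "L242SF", "PL112LD", "E76AY"}


def normalise(postcode: str) -> str:
    """Upper-case and remove all whitespace, so 'w7 4xl' becomes 'W74XL'."""
    return "".join(postcode.split()).upper()


def classify(postcode: str, valid: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (category, closest valid postcode or None)."""
    pc = normalise(postcode)
    if pc in valid:
        return "Valid", pc
    if len(pc) < 5:  # too short to hold an outward and an inward code
        return "Incorrect", None
    outward, inward = pc[:-3], pc[-3:]  # the inward code is the last three characters
    for candidate in sorted(valid):
        if candidate[:-3] != outward:
            continue
        # Assumption: "almost valid" means one inward character differs.
        differences = sum(a != b for a, b in zip(inward, candidate[-3:]))
        if differences == 1:
            return "Almost Valid", candidate
    return "Incorrect", None


print(classify("w7 4xl", VALID_POSTCODES))
print(classify("W7 4XK", VALID_POSTCODES))
print(classify("W7 9ZZ", VALID_POSTCODES))
```

A fixed rule like this is easy to audit but only catches the error patterns it was written for, which is the trade-off the AI block is meant to relax.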



2. Smart Merging of Postcode Data: Connecting Relevant Information


The second block focuses on merging full postcodes with their corresponding district parts. While a core Omniscope Join block could accomplish this with an additional extraction step, our AI-powered approach demonstrates the LLM's capacity to handle merging tasks with intuitive instructions.


How AI Contributes: Given examples of full postcodes (e.g., L242SF, PL112LD, E76AY) and district postcodes (e.g., AB2, CM12, NE33), the AI infers the relationship between them and performs the merge. The prompt explicitly instructs the AI to retain all rows from the first input, so that no data is lost, even for non-matching or empty postcodes. This illustrates the LLM's ability to interpret and execute data manipulation tasks based on natural language prompts.
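The extraction-plus-join alternative mentioned above can be sketched in a few lines of plain Python. This is an illustration only, not the Omniscope Join block or the AI prompt: the function names, the field names and the placeholder district values are hypothetical. It relies on the inward code always being the last three characters of a UK postcode.

```python
from typing import Dict, List, Optional


def district_of(postcode: Optional[str]) -> str:
    """Return the district (outward code), e.g. 'L242SF' -> 'L24'."""
    compact = "".join((postcode or "").split()).upper()
    # Anything shorter than five characters cannot be a full postcode.
    return compact[:-3] if len(compact) >= 5 else ""


def left_merge(rows: List[dict], districts: Dict[str, str]) -> List[dict]:
    """Keep every row from the first input; unmatched rows get None."""
    merged = []
    for row in rows:
        key = district_of(row.get("postcode"))
        merged.append({**row, "district": key, "district_info": districts.get(key)})
    return merged


# Hypothetical inputs: full postcodes from the article, placeholder district data.
rows = [
    {"id": 1, "postcode": "L242SF"},
    {"id": 2, "postcode": "E7 6AY"},
    {"id": 3, "postcode": ""},
    {"id": 4, "postcode": "ZZ99ZZ"},
]
districts = {"L24": "info for L24", "E7": "info for E7", "AB2": "info for AB2"}

merged = left_merge(rows, districts)
for row in merged:
    print(row)
```

Rows 3 and 4 stay in the output with an empty `district_info`, which mirrors the instruction to retain all rows from the first input.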



3. Granular Address Component Extraction: Breaking Down Free Text


The final AI-powered block is dedicated to extracting individual components from the free-text address field, such as House number, Street, Town, and District. This can often be a challenging aspect of address data quality due to the unstructured nature of addresses.


How AI Contributes: The AI is provided with examples of various address formats (e.g., "71 Junction drive London W7 4XL", "Bingo Association Dunstable Bedfordshire LU41JF") along with clear instructions. Importantly, the AI is prompted to leverage common street endings (e.g., "road," "place," "drive") to identify street names, even when not explicitly stated. It also handles the removal of punctuation, helping to ensure cleaner extracted data. This ability to extract structure from seemingly unstructured text highlights the utility of LLMs in data preparation.
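To make the street-ending idea concrete, here is a minimal rule-based sketch of the same extraction. It is not the prompt given to the AI block: the function name, the regular expression and the output keys are assumptions, and only "road", "place" and "drive" come from the article (the other endings are added for illustration). Splitting what is left into Town and District would need a place-name lookup, so the sketch leaves that part as a single field.

```python
import re

# "road", "place", "drive" are quoted in the article; the rest are assumptions.
STREET_ENDINGS = {"road", "place", "drive", "street", "lane", "avenue"}

# Simplified UK postcode pattern at the end of the text (assumption, not a full validator).
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})$", re.IGNORECASE)


def extract_components(address: str) -> dict:
    """Split a free-text address into house number, street, remainder, postcode."""
    # Replace punctuation with spaces, then collapse repeated whitespace.
    cleaned = " ".join(re.sub(r"[^\w\s]", " ", address).split())
    result = {"house_number": "", "street": "", "remainder": "", "postcode": ""}

    match = POSTCODE_RE.search(cleaned)
    if match:
        result["postcode"] = f"{match.group(1).upper()} {match.group(2).upper()}"
        cleaned = cleaned[:match.start()].strip()

    tokens = cleaned.split()
    if tokens and tokens[0].isdigit():  # purely numeric house numbers only
        result["house_number"] = tokens.pop(0)

    # The street is everything up to and including the first street ending.
    for i, token in enumerate(tokens):
        if token.lower() in STREET_ENDINGS:
            result["street"] = " ".join(tokens[: i + 1])
            tokens = tokens[i + 1:]
            break

    result["remainder"] = " ".join(tokens)
    return result


print(extract_components("71 Junction drive London W7 4XL"))
print(extract_components("Bingo Association Dunstable Bedfordshire LU41JF"))
```

The second example has no house number and no street ending, so everything before the postcode ends up in `remainder`. Cases like this are where an LLM's contextual reading is expected to help.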



The Results: Practical Improvements in Data Quality


The workflow concludes with a report, offering two informative tabs:

  • Visualising Data Quality Issues: This tab provides a clear visual representation of the postcode classifications: precise (valid), not precise (almost valid), and incorrect postcodes. This immediate feedback helps in quickly spotting areas where data quality can be improved.




  • Comparison with Manual Workflow: To illustrate the effectiveness of our AI-driven approach, we compared its results with a manually created workflow using core Omniscope blocks. Both workflows identified the correct addresses, but they differed in the "almost valid" and "incorrect" categories. The AI workflow also split some addresses slightly differently from the highly specific manual method. These differences give a sense of where AI-powered processing matches, and where it departs from, hand-built rules on complex natural language tasks.
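A comparison of this kind can be summarised by tallying how often the two workflows assign the same category to a record. The sketch below uses hypothetical function names and made-up labels purely to show the calculation; it does not reproduce the results in the attached report.

```python
from collections import Counter
from typing import List


def agreement_table(manual: List[str], ai: List[str]) -> Counter:
    """Count (manual label, AI label) pairs for records in the same order."""
    if len(manual) != len(ai):
        raise ValueError("both workflows must label the same records")
    return Counter(zip(manual, ai))


def agreement_rate(table: Counter) -> float:
    """Share of records where both workflows chose the same category."""
    total = sum(table.values())
    same = sum(n for (m, a), n in table.items() if m == a)
    return same / total if total else 0.0


# Made-up labels for four records, for illustration only.
manual_labels = ["Valid", "Valid", "Almost Valid", "Incorrect"]
ai_labels = ["Valid", "Valid", "Incorrect", "Incorrect"]

table = agreement_table(manual_labels, ai_labels)
for (manual_label, ai_label), count in sorted(table.items()):
    print(f"manual={manual_label!r:16} ai={ai_label!r:16} count={count}")
print("agreement:", agreement_rate(table))
```

The off-diagonal pairs (where the two labels differ) are the records worth reviewing by hand.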






Why This Matters


This workflow, utilising AI and LLMs, offers several practical benefits for businesses managing address data:

  • Adaptability: AI can adapt to various data patterns in free-text addresses more flexibly than rigid rule-based systems.

  • Reduced Manual Effort: Automating complex classification and extraction tasks can help free up resources, allowing your team to focus on other priorities.

  • Improved Accuracy: While not always a perfect match for meticulously crafted manual solutions, the AI's ability to interpret context can contribute to overall improved accuracy of your address data.

  • Streamlined Development: Configuring AI blocks with natural language prompts can help accelerate the development of data quality solutions.
