May 23, 2024

Migrating to GPT-4o: How Assembled achieved rapid transition

Vikram Tiwari

In the rapidly evolving field of AI, staying ahead is not just a goal but a necessity. At Assembled, we are dedicated to leveraging the latest technology to enhance customer experience and optimize efficiency. Recently, we migrated all our customers from GPT-4 to the newer, faster GPT-4o (Omni) model within 24 hours of its launch. Achieving this required precise execution and a robust metrics and testing framework to evaluate model quality. Here's how we did it.

The challenge: A leap in AI capability

OpenAI's release of GPT-4o, a multimodal model capable of understanding text, images, and more, presented a significant opportunity. The model promised enhanced performance and broader capabilities, making it a valuable upgrade for our services. However, migrating our entire customer base to a new AI model within a day posed a substantial risk. The task involved not just a simple swap of models but a comprehensive overhaul of our integration processes, ensuring that the new model met our quality and reliability standards without causing disruptions to our customers' operations across tens of thousands of support tickets.

The Assembled approach: Key strategies for rapid innovation

Culture of experimentation

At Assembled, we embrace change and actively seek to innovate. Our team is encouraged to challenge assumptions and test hypotheses, fostering an environment where new ideas can thrive. This culture is crucial for tackling large-scale projects like the GPT-4o migration. Instead of seeing the task as insurmountable, we viewed it as an opportunity to push our limits and demonstrate our innovative capabilities.

A/B testing and metrics tracking

To manage the migration efficiently, we ran an A/B test in parallel with our existing models. By running GPT-4 and GPT-4o concurrently on the same traffic, we tracked a range of metrics to ensure a smooth transition: the response text from the LLM, the documents used to augment the LLM's knowledge, tokens used for queries and responses, and latency to the first and final tokens. This approach gave us data on the quality and accuracy of each model's responses so we could make an informed decision about switching. It also enabled us to identify discrepancies or issues early, ensuring that our customers experienced minimal disruption during the migration.
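As a rough illustration, the shadow comparison above might look something like the following sketch. This is not Assembled's actual implementation; the function names (`run_with_metrics`, `shadow_compare`) and the word-count token proxy are hypothetical, and a real system would use the API's streaming interface and reported token usage.

```python
import time
from dataclasses import dataclass


@dataclass
class CompletionMetrics:
    """The per-request metrics described above."""
    model: str
    response_text: str = ""
    completion_tokens: int = 0
    time_to_first_token: float = 0.0
    time_to_final_token: float = 0.0


def run_with_metrics(model_name, stream_completion, prompt):
    """Call a streaming completion function and record timing and size metrics."""
    start = time.monotonic()
    first = None
    chunks = []
    for chunk in stream_completion(prompt):
        if first is None:
            first = time.monotonic()  # latency to first token
        chunks.append(chunk)
    end = time.monotonic()
    text = "".join(chunks)
    return CompletionMetrics(
        model=model_name,
        response_text=text,
        completion_tokens=len(text.split()),  # crude proxy for real token counts
        time_to_first_token=(first or end) - start,
        time_to_final_token=end - start,
    )


def shadow_compare(prompt, baseline_fn, candidate_fn):
    """Send the same prompt to both models and log their metrics side by side."""
    return (
        run_with_metrics("gpt-4", baseline_fn, prompt),
        run_with_metrics("gpt-4o", candidate_fn, prompt),
    )
```

In practice the candidate model would run on shadow traffic only, with its responses logged for comparison rather than shown to customers.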

Golden dataset for benchmarking

We maintain a golden dataset of user tickets and human replies to benchmark updates. This dataset is specifically tailored to reflect the diverse range of queries our customers encounter, ensuring that any system changes are analyzed on a representative set of our customers' data. By consistently using this dataset, we can accurately measure the performance and quality of new models, ensuring that they meet our high standards before full deployment. This practice helps us maintain the reliability and effectiveness of our AI solutions, providing our customers with consistent and high-quality service. For our GPT-4o rollout, we ran the new model on the golden dataset, manually reviewed the quality of its responses, and found that GPT-4o performed significantly better on most benchmarks.
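A benchmark run over such a dataset could be sketched as follows. The scoring function here is an assumption: token-overlap (Jaccard) similarity against the human reply is a deliberately crude proxy, and the post makes clear that Assembled relied on manual review rather than any single automated score.

```python
def token_overlap(reference: str, candidate: str) -> float:
    """Jaccard overlap between the human reply and the model reply (crude proxy)."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)


def benchmark(model_fn, golden_dataset):
    """Run the model over every golden (ticket, human_reply) pair, return mean score."""
    scores = [
        token_overlap(human_reply, model_fn(ticket))
        for ticket, human_reply in golden_dataset
    ]
    return sum(scores) / len(scores)
```

Running this for both the current and candidate models on the same dataset gives a quick quantitative signal to complement the manual review.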

Comprehensive analysis tools

Analyzing collected data accurately is crucial for maintaining the quality and performance of our AI models. We utilize several strategies to ensure consistent analysis:

  • Blind evaluations: Human reviewers compare responses from the current and updated systems without knowing which is which, reducing bias and ensuring objective assessments.
  • Metrics evaluation: Text replies are evaluated against set metrics, such as inclusion of URLs, sentiment, length, and use of bullet points. These metrics provide clear, quantifiable measures of performance and quality.
  • External data sources and tools: We analyze the use of external data sources to minimize hallucinations and ensure predictable results. By doing so, we ensure that the information provided by the LLM is accurate and relevant. We are also exploring the use of LLM Comparator to streamline this process, enhancing our ability to quickly and effectively evaluate new models.
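The metrics-evaluation checks above are simple enough to automate. The sketch below is illustrative only, assuming a hypothetical `reply_metrics` helper; sentiment is omitted since it would require a separate model or library.

```python
import re


def reply_metrics(text: str) -> dict:
    """Compute quantifiable checks on a reply: URL inclusion, length, bullet use."""
    return {
        "contains_url": bool(re.search(r"https?://\S+", text)),
        "length_chars": len(text),
        "bullet_points": sum(
            1
            for line in text.splitlines()
            if line.lstrip().startswith(("-", "*", "•"))
        ),
    }
```

Computing the same metrics for both models' replies makes drift easy to spot, e.g. a new model that suddenly stops linking to documentation or writes much longer answers.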

Robust fallback mechanisms

To ensure reliability, we have robust fallback mechanisms in place. If any issues were detected during the migration, customers were seamlessly switched back to GPT-4, ensuring uninterrupted service. These mechanisms are designed to provide an extra layer of protection during major transitions, minimizing the impact of potential issues on our customers. By prioritizing safety and resilience, we can confidently implement significant changes, knowing that we have safeguards in place to maintain service continuity and customer trust.
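A fallback of this kind can be sketched as a thin wrapper: try the new model first, and revert to the previous one on an error or a failed acceptance check. The function and parameter names here are hypothetical, not Assembled's actual code, and a production version would also log the fallback event and alert on its rate.

```python
def complete_with_fallback(prompt, primary_fn, fallback_fn, is_acceptable):
    """Try the new model; fall back to the stable one on error or a rejected reply.

    Returns (reply_text, model_name_used).
    """
    try:
        reply = primary_fn(prompt)
        if is_acceptable(reply):
            return reply, "gpt-4o"
    except Exception:
        pass  # fall through to the stable model
    return fallback_fn(prompt), "gpt-4"
```

Because the check runs per request, a regression in the new model degrades gracefully to the old behavior instead of surfacing to customers.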


By combining a culture of experimentation, parallel A/B testing, automated analysis, and robust fallback mechanisms, we successfully migrated our entire customer base to GPT-4o in under 24 hours. As we continue to innovate, we're excited to keep delivering reliable, high-quality answers that enhance customer experience and drive support efficiency.

See us in action.

Want to know how Assembled can help your team rise to the occasion? Set up time with us to learn more!
