Improving the Azure OpenAI models in Power Automate to save people time

This post is co-authored by Walter Sun, PhD, Vice President of Applied AI, Microsoft’s Business Applications & Platform Group.

In October 2022 we released the Describe it to design it feature, which lets you write a simple sentence and get a flow based on that description. In a previous blog post, we shared how we initially built this feature. Since release, we have seen it dramatically improve the experience of building flows: it cuts flow creation time in half, and the resulting flows are 1.8x more likely to run week-over-week compared to flows built from blank. In this post, we’ll dive into the details of the improvements we’ve been making to this feature.

Model fine tuning

We track two metrics for each NL2Flow model:

1) did NL2Flow suggest one or more flows, and

2) did the user review and accept the suggested flow

For privacy reasons, Microsoft does not, by default, get access to the sentences that you type or the details of the flows that are generated. However, administrators can opt in to share this data with Microsoft, which helps us improve our models. We used this opted-in data to identify which sentences had the lowest rates of successful suggestion and acceptance. With that data, we were able to generate more sample flows targeting those areas and use those samples for further fine-tuning of the model.
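To make the two metrics concrete, here is a minimal sketch of how they could be computed and used to find weak areas. It is not our production pipeline: the `Attempt` record, the `category` grouping of sentences, and the choice to divide acceptances by all attempts are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    """One opted-in Describe it to design it attempt (hypothetical schema)."""
    category: str    # hypothetical grouping of the typed sentence
    suggested: bool  # metric 1: did NL2Flow suggest one or more flows?
    accepted: bool   # metric 2: did the user review and accept a suggested flow?


def rates_by_category(attempts):
    """Return {category: (suggestion_rate, acceptance_rate)}.

    Assumption: both rates use the number of attempts as the denominator.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # total, suggested, accepted
    for a in attempts:
        c = counts[a.category]
        c[0] += 1
        c[1] += a.suggested
        c[2] += a.suggested and a.accepted  # acceptance requires a suggestion
    return {cat: (s / n, acc / n) for cat, (n, s, acc) in counts.items()}


def weakest_categories(attempts, k=3):
    """Categories with the lowest acceptance rate first (ties: lowest suggestion rate)."""
    rates = rates_by_category(attempts)
    return sorted(rates, key=lambda cat: (rates[cat][1], rates[cat][0]))[:k]
```

The categories returned by `weakest_categories` would be the natural places to generate additional sample flows for the next fine-tuning round.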

We have since had two additional rounds of model improvements – NL2Flow-003 (which was released in November) and NL2Flow-004 (which was released in January). Each release led to improvements in these two metrics. For example, between NL2Flow-002 and 003, we observed an increase in the flow acceptance rate by 3.0 points to 52.6%, and between NL2Flow-003 and 004 we saw an improvement of 2.2 points to 90.3% for successful suggestions.
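The gains above are quoted in percentage points, not percent. As a small worked example, the sketch below derives the rates implied before each release and the corresponding relative gains. These derived values follow from simple arithmetic on the figures quoted above; they are not separately reported measurements, and the function names are illustrative.

```python
# (rate after release in %, gain in percentage points), as quoted above
ACCEPTANCE_003 = (52.6, 3.0)  # NL2Flow-002 -> NL2Flow-003, flow acceptance
SUGGESTION_004 = (90.3, 2.2)  # NL2Flow-003 -> NL2Flow-004, successful suggestions


def implied_previous(after, gain_points):
    """Rate before the release, in %, rounded to one decimal place."""
    return round(after - gain_points, 1)


def relative_gain(after, gain_points):
    """Gain relative to the previous rate, in %, rounded to one decimal place."""
    return round(100 * gain_points / (after - gain_points), 1)


for name, (after, gain) in [("acceptance", ACCEPTANCE_003),
                            ("suggestion", SUGGESTION_004)]:
    print(name, implied_previous(after, gain), relative_gain(after, gain))
```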

This type of fine tuning is only possible because of the number of people trying out this feature: over 29K this month alone. As more people use Power Automate (and opt in to helping Microsoft improve the model with their data), the models will only get better.

Finally, it’s worth noting that this process of fine tuning is distinct from traditional reinforcement learning (RL). In the future, we plan to apply RL, leveraging our A/B platform to handle the explore and exploit processes, but we need a large enough amount of data before we embark on this learning methodology for NL2Flow (as has been done with ChatGPT).

Validating improvements

To ensure that the model is directly responsible for the improvements we observed, we show some users the current model and some a new one, a process called A/B testing. This is similar to the way we release new versions of the Power Platform, where a percentage of users get the newer release to test it before full deployment to everyone. By doing so, we can know that the model itself is responsible for the changes, not other factors.
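For readers less familiar with A/B testing, here is a minimal sketch of the two pieces involved: assigning each user to a model consistently, and checking whether the difference in a rate between the two groups is larger than chance would explain. This is a generic illustration, not a description of our A/B platform. The function names, the hash-based bucketing, and the pooled two-proportion z-test are all assumptions chosen for simplicity.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user so they always see the same model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "new_model" if bucket < treatment_share else "current_model"


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z statistic for the rate in group B minus the rate in group A (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

Because both groups are drawn from the same population over the same period, a large positive z value for, say, flow acceptance points to the new model rather than to seasonality or other product changes.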

At the same time, it’s critical that we know that people are more successful with Describe it to design it, compared to other ways to build flows. And this is exactly what we see:

  • 76% of flows created by NL2Flow in the last 7 days ran at least once (compared to 64% for flows created from templates, or 43% when created from scratch)

Together, A/B testing and comparison against the other ways users build flows give us mechanisms to ensure we’re continually improving the experience for users.

Microsoft is committed to creating Responsible AI by design. Our work is guided by a core set of principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. We are helping our customers use our AI products responsibly, sharing our learnings, and building trust-based partnerships. For these new services, we provide our customers with information about the intended uses, capabilities, and limitations of our AI platform service, so they have the knowledge necessary to make responsible deployment choices.

What’s next

As we shared at the Future of Work with AI event on March 16th, the next step is to support interactive design with Copilot for Power Automate, so you can use natural language while you build out your flow. This will build on the NL2Flow models that have already been developed and add new interactive behaviors powered by ChatGPT. We’re excited to share more details on this in the coming weeks.