Which Large Language Models are best for regulatory work?

The Regulatory Institute has carried out a series of tests of Large Language Models (LLMs) with a view to their use in regulatory work. Some of the tests also aimed to establish whether its model laws are the most comprehensive reference laws in their respective sectors. This article presents the main results, including a surprising ranking of LLM performance. It also draws some conclusions on the best approach to designing comprehensive, relatively complete laws.

The main results of the five studies

The first study [1] tested the capacity of a range of Large Language Models (LLMs) to identify comprehensive legislation that could be utilised as a reference for legislative drafters. The LLM designated as Claude Sonnet 3.5 (Anthropic) performed best. Five LLMs also performed well: Mistral’s Le Chat, Sonar Huge / Llama evaluated via Perplexity, xAI’s Grok-2 evaluated via Perplexity, You.com Research, and Meta’s Llama 3.1; within this group, Sonar Huge / Llama stood out slightly. Other LLMs, including the highly publicised ChatGPT 4o (OpenAI) and Gemini (Google), produced significantly inferior results.

A secondary aim of this study was to identify the most comprehensive real or model anti-alcohol reference laws. These appear to be the Model Law on Alcohol, Cannabis, and Tobacco Products by the Regulatory Institute, followed by various WHO model laws, and then by the laws of Sweden, Thailand and New Zealand.

The second study likewise tested the capacity of a range of LLMs to identify comprehensive legislation that could be utilised as a reference for legislative drafters. Again, Claude Sonnet 3.5 (Anthropic) performed best. Five LLMs also performed well: Mistral’s Le Chat, Sonar Huge / Llama evaluated via Perplexity, You.com Research, Deepai.org, and Meta’s Llama 3.1, which stood out slightly within this group. Other LLMs, including the highly publicised ChatGPT 4o (OpenAI) and Gemini (Google), again produced significantly inferior results.

A secondary aim of this study was to identify the most comprehensive real or model emergency management laws. These appear to be the Regulatory Institute’s model law, followed by the Emergency Management Act of Canada, the UNDRR Model Law for Disaster Preparedness and Response, the European Union Civil Protection Mechanism, the Emergency and Disaster Management Act of British Columbia (Canada) and the Civil Contingencies Act 2004 of the United Kingdom.

The purpose of the third study was to identify freely available Large Language Models (LLMs) that are suitable for drafting, in a single pass, a comprehensive set of provisions on the basis of a submitted legislative structure.

In the two previous studies, Anthropic’s Claude Sonnet 3.5 and Mistral’s Le Chat were amongst the best performing LLMs. This time, however, Mistral’s Le Chat offered considerably more granularity than Anthropic’s Claude Sonnet 3.5, once we pushed it hard to accomplish the task. The two have different “drafting styles”. Even Sonar Huge (accessible via Perplexity, based on Meta’s Llama) provided much more granularity than Claude Sonnet 3.5, but it dragged its feet even more than the other two in completing the task, which is tiresome for the user. Still, Sonar Huge merits being placed ahead of Claude Sonnet 3.5 this time, with Claude Sonnet taking only third position.

As in previous studies, the much-hyped ChatGPT 4o from OpenAI and Google’s Gemini produced only mediocre results. You.com Research and Meta’s Llama were quite disappointing in this third study as well; this time they fall into the same low-performance class as ChatGPT 4o and Google’s Gemini.

However, we now also tested ChatGPT o1 preview, which has been promoted as OpenAI’s next step in terms of “reasoning” [2]. In its third response it reached a granularity not far from that of Claude Sonnet, whilst shifting back to “key points” instead of actual drafting in its fourth response. We would rank ChatGPT o1 preview fourth, rather closer to the top class than to the lower class. As the final ChatGPT o1 is still to come, we recommend checking its performance, whilst discarding ChatGPT 4o.

The fourth and fifth tests assessed the LLMs’ ability to suggest changes and to produce revised texts. For the first time, the Chinese DeepSeek R1 [3] took part. The tests assessed this ability in two different scenarios: without and with reference documents. The results were as follows:

Making suggestions:
  • Without reference documents (not commendable): ChatGPT 4o, DeepSeek R1, Sonar Large and xAI’s Grok-2 performed relatively best. Mistral’s Le Chat had no internet access.
  • With reference documents: With reference documents uploaded, Mistral’s Le Chat performed best. With just internet references, Deepai.org was best, followed by You.com Research, ChatGPT o1 and xAI’s Grok-2.

Amending a draft:
  • Without reference documents (not commendable): Mistral’s Le Chat was relatively best, followed by DeepSeek R1. No other LLM produced usable results.
  • With reference documents: Mistral’s Le Chat was best with reference documents uploaded, followed by xAI’s Grok-2 and Perplexity Pro.

Methodological consequences for regulatory work

The secondary aim of the first two tests was to confirm that the Regulatory Institute’s model laws are the most comprehensive reference laws in their respective sectors, which they are, much to our relief. However, the emergence of LLMs has moved the gold standard. Even the Regulatory Institute’s model laws, and even more so the model laws of international organisations, lacked important policy and regulatory elements that LLMs were able to identify or propose by extrapolation. Therefore, in order to design a truly complete law, it is not enough to choose the most comprehensive reference law(s) as a basis. It is also necessary to ask different LLMs to identify policy and regulatory elements and to merge their lists.
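
To illustrate the merging step, the following minimal Python sketch collects candidate policy and regulatory elements from several LLMs and merges them into one deduplicated list. The query wrappers (`ask_claude`, `ask_le_chat`, `ask_llama`) are hypothetical placeholders for whichever chat interfaces or APIs are actually used, and the crude case-insensitive deduplication would in practice be complemented by human review.

```python
# Minimal sketch (assumptions noted above): gather candidate policy and
# regulatory elements from several LLMs and merge them into one list.

PROMPT = (
    "List, one per line, the policy and regulatory elements that a "
    "comprehensive law on {sector} should contain."
)

def merge_element_lists(responses: list[str]) -> list[str]:
    """Split each LLM response into lines and deduplicate case-insensitively."""
    seen: set[str] = set()
    merged: list[str] = []
    for response in responses:
        for line in response.splitlines():
            element = line.strip(" \t-•*")   # drop leading bullets and dashes
            if not element:
                continue
            key = element.lower()
            if key not in seen:              # crude deduplication; near-duplicates
                seen.add(key)                # still need merging by a human reviewer
                merged.append(element)
    return merged

# Hypothetical usage; ask_claude, ask_le_chat and ask_llama stand for real
# calls to the respective providers' chat interfaces or APIs.
# responses = [ask(PROMPT.format(sector="alcohol control"))
#              for ask in (ask_claude, ask_le_chat, ask_llama)]
# for element in merge_element_lists(responses):
#     print(element)
```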

However, the Regulatory Institute also recommends working with comprehensive reference laws rather than just with lists of policy and regulatory elements. LLMs are good at identifying policy elements. They are less effective when it comes to regulatory elements, such as individual powers to act or duties to cooperate, individual sanctions or incentives, powers to amend the law to take account of rapidly evolving technical aspects, and so on. These, sometimes tiny, regulatory elements may all be crucial to the effective, efficient and equitable application of the future law. They are more likely to be found in comprehensive reference legislation, notably the Regulatory Institute’s model laws, or in the Regulatory Institute’s Cross-sectoral Standard Provisions, its List of Powers/Obligations and its List of Sanctions.

Hence, the new “gold standard” for ensuring regulatory completeness consists of a mix of approaches:

  • identification of relatively comprehensive references, taking one or several of them as a basis or at least as a benchmark;
  • identification of candidate policy and regulatory elements by means of various LLMs [4] and, where available, by dedicated articles such as those of the Regulatory Institute;
  • using LLMs as completeness checkers (sketched below).
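
A minimal sketch of the completeness-check step follows. It assumes a merged element list as produced above and an `ask_llm` callable standing in for whatever LLM interface is used; the strict YES/NO prompt and the naive answer parsing are simplifications meant to show the workflow, not a production tool.

```python
from typing import Callable

# Minimal sketch (assumptions noted above): for each candidate policy or
# regulatory element, ask an LLM whether the draft addresses it and collect
# the elements reported as missing, for the drafter to review.

CHECK_PROMPT = (
    "Draft law:\n{draft}\n\n"
    "Does this draft address the following element: \"{element}\"?\n"
    "Answer strictly YES or NO."
)

def completeness_check(draft: str,
                       elements: list[str],
                       ask_llm: Callable[[str], str]) -> list[str]:
    """Return the elements that the LLM reports as not covered by the draft."""
    missing: list[str] = []
    for element in elements:
        answer = ask_llm(CHECK_PROMPT.format(draft=draft, element=element))
        if answer.strip().upper().startswith("NO"):  # naive parsing of the verdict
            missing.append(element)
    return missing

# Hypothetical usage, with ask_le_chat standing for a real LLM call:
# gaps = completeness_check(draft_text, merged_elements, ask_le_chat)
# for gap in gaps:
#     print("Possibly not covered:", gap)
```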

 

[1] Access to the studies can be provided on request.

[2] We had not tested ChatGPT o1 preview in our Tests 1 and 2, as enhanced reasoning capacity should not change the results of internet research and, accordingly, of the identification of comprehensive reference regulations.

[3] It is commendable to use this LLM via the Perplexity platform or via You.com, so as to ensure that data are stored in the U.S. or Europe and not in China.

[4] Our tests have demonstrated that no LLM was able to identify more than half of all policy and regulatory elements.

 
