Causality and LLMs (part 1)
A partial review of the paper Causal Reasoning and Large Language Models: Opening a New Frontier for Causality.
Welcome to the first post in my series on Causality and LLMs.
In this post, I review a paper on causality and LLMs by Emre Kiciman, Robert Ness, Amit Sharma (all Microsoft), and Chenhao Tan (Chicago).
TL;DR
How do LLMs such as GPT-3.5 and GPT-4 compare with traditional statistical causal discovery methods? Can LLMs perform causal reasoning?
The LLMs and other models are compared on two types of data: pairwise cause-effect pairs, and a larger dataset for full graph discovery.
On pairwise data: traditional statistical methods scored at most 83% accuracy. However, GPT-3.5 and GPT-4 scored 89% and 96%, respectively.
For full graph causal discovery, an Arctic Sea Ice dataset was used. Using two metrics, GPT-3.5 and GPT-4 outperformed covariance-based (statistical) methods, with GPT-4 again showing considerable improvement.
Prompts matter: the types of prompts were shown to affect LLM performance.
Traditional causal discovery often requires human domain knowledge to complete causal graphs. LLMs show they can be used as a proxy or assistant for humans alongside existing causal methods.
Can large language models (LLMs) do causal reasoning?
This post presents my main takeaways from the first part of the paper by Kiciman et al. (2023) called Causal Reasoning and Large Language Models: Opening a New Frontier for Causality.
LLMs like ChatGPT and GPT-4 are exceptionally impressive and highly useful AI tools, trained on massive datasets. They are built on the transformer neural network architecture and fine-tuned using reinforcement learning from human feedback. These models are text-based and generate their output one word at a time from an input prompt.
Although they provide considerable utility in generating answers to a vast range of questions, can they do causal reasoning?
Why Causal Reasoning in Machine Learning?
Machine learning models primarily excel at correlating inputs to outputs.
An LLM predicts the next word based on its probability of occurrence. Like other deep learning and predictive models, it focuses on correlation rather than explicitly modeling causality.
Why prioritize causal reasoning in machine learning models?
One crucial reason is for counterfactual reasoning. Predicting outcomes is valuable, but understanding causality enables us to explore how to alter predicted outcomes. Knowing what drives predictions allows us to strategize for desirable changes. Without a causal model, altering predicted outcomes remains a challenge.
Given the impressive capabilities of LLMs, do they exhibit causal understanding? Can they discern causal relationships between features, construct causal graphs involving multiple features, and handle counterfactual queries effectively?
Can LLMs perform causal discovery?
In the first part of the paper, the focus was on evaluating LLM performance using two distinct datasets. The first dataset, known as the Tubingen set, is a benchmark for pairwise cause-effect relationships. The second involved testing LLMs on full causal graph discovery, primarily using data related to Arctic sea ice. The overarching theme of this section was causal discovery across different types of datasets.
Covariance- and logic-based causality
The paper begins by distinguishing between covariance-based and logic-based (or domain-specific) causality.
Covariance-based causality: This approach uses statistical methods like the PC method, conducting conditional independence tests on numerical data to uncover causal relationships. It typically results in generating a Markov Equivalence Class of graphs, representing all graphs indistinguishable in terms of conditional independence relationships.
Logic-based causal reasoning: In contrast, logic-based causality employs logical reasoning and domain knowledge to understand causal relationships. For instance, in legal cases determining liability, understanding causal chains is crucial and relies on deep domain knowledge. LLMs, due to their text-based nature and ability to describe complex domains, might operate effectively within this framework of causality.
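To make the covariance-based idea above concrete, here is a minimal Python sketch of the kind of conditional independence test the PC method builds on. The partial-correlation test, the linear-Gaussian assumption, and the toy chain X → Y → W are my own illustration, not code from the paper.

```python
import numpy as np
from scipy import stats

def conditional_independence_test(x, y, z, alpha=0.05):
    """Test X independent of Y given Z via partial correlation
    (simple linear-Gaussian assumption).

    Regress X and Y on the conditioning variable(s) Z, then check whether
    the residuals are still correlated. Returns True if independence is
    NOT rejected at significance level alpha.
    """
    Z = np.column_stack([np.ones(len(x))] + ([z] if z.ndim == 1 else list(z.T)))
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # residual of X given Z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # residual of Y given Z
    r, p_value = stats.pearsonr(rx, ry)                # correlation of residuals
    return p_value > alpha                             # True -> treat as independent

# Toy chain X -> Y -> W: X and W are correlated, but independent given Y.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 2 * x + rng.normal(size=2000)
w = -y + rng.normal(size=2000)
print(conditional_independence_test(x, w, y))  # usually True once we condition on Y
```

The PC method repeats tests like this over many variable pairs and conditioning sets, then orients edges, which is why it typically recovers only a Markov Equivalence Class rather than a single graph.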
Pairwise Causal Discovery
Using the Tubingen pairwise cause-effect dataset, ChatGPT (GPT-3.5) and GPT-4 outperformed the results from other studies that used covariance-based, statistical methods for causal discovery. The statistical methods ranged from 68% to 83% in performance accuracy whereas GPT-3.5 models ranged from 81% to 89% and GPT-4 scored an impressive 96% accuracy.
GPT-4 left the other methods in the dust.
Importance of Prompts!
You will notice that GPT-3.5 had a range of results (81% - 89%). These results were due to different types of prompts given to the model.
Prompt 1: “Does changing A cause a change in B?” Used to initiate basic causal inference.
Prompt 2: “You are a helpful assistant for causal reasoning.” This aimed to set the system message and align it with causal reasoning.
Prompt 3: “Explain your reasoning in a step-by-step manner.” This required the LLM to explain its thought process and reveal insights into the model’s reasoning capabilities.
The high results for GPT-4 (96%) were based on Prompt 3.
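For illustration, here is a minimal sketch of how a pairwise query combining these prompt styles might be sent to GPT-4 through the OpenAI chat API. The helper function, the exact wording, and the altitude/temperature variable pair are my own assumptions for the example; the paper's precise prompt templates differ.

```python
# Minimal sketch of posing a pairwise cause-effect query to an LLM via the
# OpenAI chat API. Illustrative only, not the paper's exact prompt text.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pairwise_causal_query(a: str, b: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Prompt 2 style: set the system message for causal reasoning.
            {"role": "system",
             "content": "You are a helpful assistant for causal reasoning."},
            # Prompt 1 + Prompt 3 style: ask the pairwise question and
            # request step-by-step reasoning before the final answer.
            {"role": "user",
             "content": (f"Does changing {a} cause a change in {b}? "
                         "Explain your reasoning in a step-by-step manner, "
                         "then answer with 'A causes B' or 'B causes A'.")},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# Example variable pair in the spirit of the Tubingen benchmark.
print(pairwise_causal_query("the altitude of a weather station",
                            "its mean temperature"))
```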
Full Graph Discovery
In the evaluation of full causal discovery using an Arctic Sea Ice dataset containing 12 nodes and 48 edges in the ground-truth set, GPT-3.5 and GPT-4 demonstrated superior performance compared to covariance-based methods. Particularly, GPT-4 exhibited significant improvement over GPT-3.5 and statistical approaches. Evaluation metrics included the Normalized Hamming Distance (NHD) and the Ratio of NHD.
NHD measures the number of differing edges between predicted and ground-truth graphs, normalized by the total possible edges. A lower NHD indicates better prediction accuracy. Meanwhile, the Ratio of NHD compares the algorithm's NHD to a baseline worst-case scenario, where lower ratios signify superior performance over random guessing.
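Below is a hedged sketch of how these two metrics can be computed from adjacency matrices. The worst-case baseline construction reflects my reading of the description above (a graph with the same number of predicted edges, all placed incorrectly), so treat it as illustrative rather than the paper's exact formula.

```python
import numpy as np

def nhd(pred: np.ndarray, truth: np.ndarray) -> float:
    """Normalized Hamming Distance: fraction of the n*n possible directed
    edges on which the predicted and ground-truth adjacency matrices differ."""
    n = truth.shape[0]
    return float(np.sum(pred != truth)) / (n * n)

def nhd_ratio(pred: np.ndarray, truth: np.ndarray) -> float:
    """Ratio of NHD: the prediction's NHD divided by the NHD of a worst-case
    baseline that outputs the same number of edges but places every one
    incorrectly (so all of its edges and all true edges are mismatches)."""
    n = truth.shape[0]
    baseline_nhd = (pred.sum() + truth.sum()) / (n * n)
    return nhd(pred, truth) / baseline_nhd

# Toy 3-node example (1 = directed edge present).
truth = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
pred  = np.array([[0, 1, 0],
                  [0, 0, 0],
                  [0, 1, 0]])   # one edge correct, one missing, one reversed
print(nhd(pred, truth))        # 2 disagreements / 9 possible edges ~ 0.22
print(nhd_ratio(pred, truth))  # lower is better; 0 would be a perfect graph
```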
GPT-3.5 showed comparable NHD values to statistical methods but excelled in terms of the Ratio of NHD, indicating more accurate predictions relative to the baseline. GPT-4 surpassed all other models, achieving superior performance in both NHD and Ratio of NHD metrics.
Not Causal Reasoning But Mimicking Domain Knowledge
Although the LLMs perform well above traditional methods, the authors believe this is not due to causal reasoning but to the models having been trained on vast amounts of text, which lets them recall domain-specific knowledge and generate causal-sounding responses.
Quote from the Paper
“Irrespective of whether LLMs are truly performing causal reasoning or not, their empirically observed ability to perform certain causal tasks is strong enough to provide a useful augmentation for aspects of causal reasoning where we currently rely on humans alone.”
More to come in the following posts. Stay tuned…

