1 Introduction
Market making generally refers to the act of providing liquidity to a market. A market maker (MM) will continuously quote prices to buy and sell a given financial instrument. The difference between these prices is called the spread, and the goal of the market maker is to repeatedly earn this by transacting in both directions, yielding a net zero investment in the asset. This, however, is not done without risk. Market makers expose themselves to a process known as adverse selection, which occurs when toxic agents exploit their technological or informational edge over the MM, leading to the accumulation of a large absolute stake. The problem is that, in general, as prices move, the MM’s inventory grows and it makes losses. This is known as inventory risk, and has been the subject of a great deal of research in the optimal control, artificial intelligence, and reinforcement learning (RL) literature.
A standard assumption in existing work has been that the market maker has perfect knowledge of its environment. Indeed, robustness of market making strategies to model ambiguity and adversarial behaviour in the market (the topic of this paper) has only recently received attention: [cartea2017algorithmic] extended optimal control approaches for the market making problem to address the risk of model misspecification. In this paper, we design market making agents that are robust to adversarial and adaptively chosen market conditions by applying adversarial RL (ARL). Our starting point is the well-known single-agent mathematical model of market making [avellaneda2008high], which has been used extensively in quantitative finance, e.g., [cartea2015algorithmic, cartea2017algorithmic, gueant2013dealing, gueant2017optimal]. We convert it to a discrete-time game with a new “market player”, the adversary, that can be thought of as a proxy for other market participants that would like to profit from the market maker. The adversary controls the dynamics of the market environment in a zero-sum game against the market maker.
We evaluate the sensitivity of RL-based strategies to three core parameters of the model dynamics which affect price and execution for the market maker, each of which naturally varies over time in real markets. We do this by going beyond the fixed parametrisation of existing models (henceforth called the Fixed setting) with two extended learning settings. The Random setting initialises each instance of the model with (bounded) uniformly random values for the three parameters. The Strategic setting features an independent “market” learner whose objective is to minimise the market maker’s cumulative reward in a zero-sum game. The Random and Strategic settings are, on the one hand, more realistic than the Fixed setting, but on the other hand, significantly more complex for the market making agent. We show that market making strategies trained in each of these settings yield significantly different behaviour, and suggest that robustness to adversarial environments has benefits beyond its immediate implications.
1.1 Contributions
The key contributions of this paper are as follows:


We thoroughly investigate the impact of three environmental settings (one adversarial) on learning market making. We show that training against a Strategic adversary strictly dominates all three test settings in terms of a set of standard desiderata, including the Sharpe ratio (Section 5).

We prove that in several key instances of the Strategic setting the one-shot (single-stage) instantiation of our game has a Nash equilibrium that resembles that found by our ARL algorithm for the multi-stage game. We then confirm broader existence of equilibria in the multi-stage game by empirical best response computations (Sections 3 and 5).
1.2 Related work
Optimal control and market making.
The theoretical study of market making started with the pioneering work of [ho1981optimal], [glosten1985bid] and [grossman1988liquidity], among others. Subsequent work focused on characterising optimal behaviour under different market dynamics and contexts. Most relevant is that of [avellaneda2008high], who incorporated new insights into the dynamics of the limit order book to give a new market model, the one used in this paper. They derived closed-form expressions for the optimal market maker with an exponential utility function when it has perfect knowledge of the market model parameters. This same problem was subsequently studied for other utility functions [fodra2012high, gueant2013dealing, cartea2015algorithmic, cartea2015order, gueant2017optimal]. In particular, [cartea2017algorithmic] study the impact of parameter uncertainty in the model of [avellaneda2008high]: dropping the assumption of perfect knowledge of market dynamics, they consider how to optimally trade while being robust to misspecification of the underlying model. This type of epistemic risk is the primary focus of our paper.
Machine learning and market making.
Several papers have applied AI techniques to design automated market makers for financial markets. (Footnote 1: A separate strand of work in AI and Economics and Computation has studied automated market makers for prediction markets; see, e.g., [othman]. While some similarities to the financial market making problem pertain, that strand of work focusses much more on price discovery and information aggregation.) [chan2001electronic] focussed on the impact of noise from uninformed traders on the quoting behaviour of a market maker trained with reinforcement learning. [AbernethyK13] used an online learning approach. [spooner2018market] later explored the use of reinforcement learning to train inventory-sensitive market making agents in a fully data-driven limit order book model. Most recently, [gueant2019deep] addressed scaling issues of finite difference approaches for high-dimensional, multi-asset market making using model-based RL. While the approach taken in this paper is also based on RL, unlike the majority of these works, our underlying market model is taken from the mathematical finance literature. These models are typically analysed using methods from optimal control. We justify this choice in Section 2. To the best of our knowledge, we are the first to apply ARL to derive trading strategies that are robust to epistemic risk.
Risk-sensitive reinforcement learning.
Risk-sensitivity and safety in RL have been a highly active topic for some time. This is especially true in robotics, where exploration is very costly. For example, [tamar2012policy] studied policy search in the presence of variance-based risk criteria, and [bellemare2017distributional] presented a technique for learning the full distribution of (discounted) returns. For more details, see [garcia2015comprehensive]. These techniques are powerful, but can be complex to implement and can suffer from numerical instability. This is especially true when using exponential utility functions which, without careful calibration, may diverge early in training due to large negative rewards [madisson2017particle]. An alternative approach is to train agents in an adversarial setting [pinto2017robust, perolat2018actor] taking the form of a zero-sum game. These methods tackle the problem of epistemic risk. (Footnote 2: The problem of robustness has also been studied outside of the use of adversarial learning; see, e.g., [rajeswaran2016epopt].) That is, they explicitly account for the misspecification between train-time simulations and test-time conditions. This robustness to test conditions and adversarial disturbances is especially relevant in financial problems and motivated the approach taken in this paper.
2 Trading Model
We consider a standard model of market making as studied by [avellaneda2008high] and [cartea2017algorithmic]. The MM trades a single asset for which the price, , evolves stochastically. In discrete-time,
(1) 
where and are the drift and volatility coefficients, respectively. The randomness derives from the sequence of independent, Normally distributed random variables, , each with zero mean and variance . The process begins with initial value and continues until step is reached.
The market maker interacts with the environment at each step by placing limit orders about . Upon arrival of a matching market order, the agent is committed to buy or sell a single unit of the asset. The prices at which the MM is willing to buy (bid) and sell (ask) are denoted by and , respectively, and may be expressed as offsets from :
(2) 
these may be updated at each timestep at no cost to the agent. Equivalently, we may define:
(3)  
called the quoted spread and reservation price, respectively. These relate to the agent’s need for immediacy and bias in execution, amongst other things.
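For concreteness, one way to write Eqs. (1) to (3) out in full is sketched below. The symbol names are assumptions chosen to follow the usual notation for this family of models, not a transcription: $Z_n$ for the asset price, $b_n$ and $\sigma_n$ for drift and volatility, $W_n$ for the noise term, $\Delta t$ for the step length (assumed to be the noise variance), $p^\pm_n$ and $\delta^\pm_n$ for the quotes and their offsets, and $\psi_n$, $r_n$ for the quoted spread and reservation price. The convention that $+$ marks the bid side and $-$ the ask side is likewise assumed.

```latex
\begin{align*}
Z_{n+1} &= Z_n + b_n \Delta t + \sigma_n W_n,
  \qquad W_n \sim \mathcal{N}(0, \Delta t), \\
p^+_n &= Z_n - \delta^+_n, \qquad p^-_n = Z_n + \delta^-_n, \\
\psi_n &= \delta^+_n + \delta^-_n, \qquad
r_n = \tfrac{1}{2}\left(p^+_n + p^-_n\right)
    = Z_n + \tfrac{1}{2}\left(\delta^-_n - \delta^+_n\right).
\end{align*}
```

Under this reading, a wider quoted spread trades execution probability for a larger premium per trade, while shifting the reservation price away from the asset price skews execution towards one side of the book.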
The probability of executing either buy or sell orders is dictated by the liquidity in the market and the values . Transactions occur when market orders, arriving at random intervals, have sufficient size to consume the agent’s order from the book. We model these interactions by independent Poisson processes, denoted by and for the bid and ask sides, respectively, with intensities ; not to be confused with the terminal timestep . The dynamics of the agent’s inventory process, or holdings, , are then captured by the difference between these two terms,
(4)
where is known and the values of are constrained such that trading stops on the opposing side of the book when either limit is reached. Following the model of [avellaneda2008high], we define the order arrival intensities by,
(5) 
where describe the rate of arrival of market orders and distribution of volume in the book, respectively. This particular form derives from assumptions and observations on the structure and behaviour of limit order books which we omit here for simplicity; see [avellaneda2008high, gould2013limit, abergel2016limit] for more details.
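The exponential intensity of [avellaneda2008high] can be sketched in a few lines. The function and argument names below are hypothetical, and the numerical values in the loop are illustrative assumptions only; `A` plays the role of the market order arrival rate and `k` the role of the volume distribution parameter described above.

```python
import math


def arrival_intensity(A: float, k: float, delta: float) -> float:
    """Rate at which market orders reach a quote placed `delta` away from the price.

    Uses the exponential form A * exp(-k * delta): quoting further from the
    price lowers the chance of being executed.
    """
    return A * math.exp(-k * delta)


def fill_probability(A: float, k: float, delta: float, dt: float) -> float:
    """Probability of at least one Poisson arrival during a step of length dt."""
    return 1.0 - math.exp(-arrival_intensity(A, k, delta) * dt)


# Illustrative values only (assumed, not taken from the text above).
for delta in (0.0, 0.5, 1.0):
    rate = arrival_intensity(140.0, 1.5, delta)
    prob = fill_probability(140.0, 1.5, delta, dt=0.005)
    print(f"offset={delta:.1f} intensity={rate:.2f} fill_prob={prob:.3f}")
```

Converting the intensity to a per-step probability in this way is one common discretisation; for small steps it is close to the product of intensity and step length.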
In this framework, the evolution of the market maker’s cash is given by the difference relation,
(6) 
where . We have that the cash flow is a combination of: the profit due to executing at the premium ; and the change in value of the agent’s holdings. The total value accumulated by the agent by timestep may thus be expressed as the sum of the cash held and value invested: , where
(7) 
and . This is known as the mark-to-market (MtM) value of the agent’s portfolio.
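The model of this section can be simulated end to end with a short script. The following is a minimal sketch under stated assumptions, not the implementation used for the experiments: the MM quotes a constant, symmetric offset; at most one unit trades per side per step; the bid side stops trading at the upper inventory limit and the ask side at the lower one; and every default parameter value is an illustrative placeholder.

```python
import math
import random


def simulate_episode(n_steps=200, dt=0.005, z0=100.0, drift=0.0, vol=2.0,
                     A=140.0, k=1.5, offset=0.5, h_min=-50, h_max=50, seed=0):
    """Simulate one episode and return (cash, inventory, price, mtm_value)."""
    rng = random.Random(seed)
    price, cash, inventory = z0, 0.0, 0
    # Per-step fill probability for a quote placed `offset` from the price.
    p_fill = 1.0 - math.exp(-A * math.exp(-k * offset) * dt)
    for _ in range(n_steps):
        bid, ask = price - offset, price + offset
        # Bid side: the MM buys one unit unless already at the upper limit.
        if inventory < h_max and rng.random() < p_fill:
            cash -= bid
            inventory += 1
        # Ask side: the MM sells one unit unless already at the lower limit.
        if inventory > h_min and rng.random() < p_fill:
            cash += ask
            inventory -= 1
        # Price innovation: drift plus Normal noise with variance dt.
        price += drift * dt + vol * rng.gauss(0.0, math.sqrt(dt))
    # Mark-to-market value: cash held plus holdings valued at the current price.
    return cash, inventory, price, cash + inventory * price


cash, inventory, price, mtm = simulate_episode(seed=1)
print(f"cash={cash:.2f} inventory={inventory} price={price:.2f} MtM={mtm:.2f}")
```

Valuing the inventory at the current price is what exposes the MM to inventory risk: the same holdings can gain or lose value between steps without any further trading.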
Why not use a data-driven approach?
Previous research into the use of RL for market making, and trading more generally, has focussed on data-driven limit order book models; see [nevmyvaka2006reinforcement, spooner2018market, vyetrenko2019risk]. These methods, however, are not amenable to the type of analysis presented in Section 3. Using an analytical model allows us to examine the characteristics of adversarial training in isolation while minimising systematic error due to bias often present in historical data.
3 Game formulation and one-shot analysis
We use the market dynamics above to define a zero-sum stochastic game between a market maker and an adversary, which acts as a proxy for all other market participants.
Definition 1.
[Market Making Game] The game between the MM and the adversary has stages. At each stage, the MM chooses and the adversary . The resulting stage payoff is given by the expected change in MtM value of the MM’s portfolio, i.e., , see (7). The total payoff paid by the adversary to the MM is the sum of the stage payoffs.
In the remainder of this section, we study the game when , i.e., when there is a single stage. Later, in Section 5, we will analyse the game for .
One-shot analysis.
At each stage, the MM’s payoff may be unrolled to give:
(8) 
For certain parameter ranges, this equation is concave in .
Lemma 1 (Payoff Concavity in ).
The payoff (8) is a concave function of on the intervals , and , respectively.
Proof.
The first derivative of the payoff w.r.t. is given by:
(9) 
The Hessian matrix is thus given by,
(10) 
which is negative semidefinite iff . ∎
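The structure of this argument can be illustrated with a worked derivation under explicitly assumed notation and conventions; it is a sketch and may differ from Eqs. (8) to (10) in sign or timing. Assume a unit time step, inventory $H$ at the start of the stage, drift $b$, order arrival intensities $A^\pm e^{-k^\pm \delta^\pm}$ with $+$ denoting the bid side, executions independent of the price innovation, and that inventory acquired during the stage also earns the subsequent price move.

```latex
\begin{align*}
f(\delta^\pm; b, A^\pm, k^\pm)
  &= \mathbb{E}\left[\delta^+ \Delta N^+ + \delta^- \Delta N^-
     + \left(H + \Delta N^+ - \Delta N^-\right)\Delta Z\right] \\
  &= A^+ \left(\delta^+ + b\right) e^{-k^+ \delta^+}
   + A^- \left(\delta^- - b\right) e^{-k^- \delta^-} + bH, \\[4pt]
\frac{\partial f}{\partial \delta^\pm}
  &= A^\pm e^{-k^\pm \delta^\pm}\left[1 - k^\pm\left(\delta^\pm \pm b\right)\right], \\[4pt]
\frac{\partial^2 f}{\partial (\delta^\pm)^2}
  &= A^\pm k^\pm e^{-k^\pm \delta^\pm}\left[k^\pm\left(\delta^\pm \pm b\right) - 2\right],
  \qquad
  \frac{\partial^2 f}{\partial \delta^+ \, \partial \delta^-} = 0.
\end{align*}
```

Under these assumptions the Hessian is diagonal, so it is negative semidefinite exactly when both diagonal entries are non-positive, that is, when $\delta^\pm \le 2/k^\pm \mp b$. Setting the first derivatives to zero gives the stationary point $\delta^\pm = 1/k^\pm \mp b$, and the payoff is visibly linear in $b$ and in $A^\pm$.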
Note that (8) is linear in both and . From this, we next show that there exists a Nash equilibrium (NE) when and are controlled and and are fixed.
Theorem 1 (NE for ).
The MM Game has a pure strategy Nash equilibrium for and (with finite ),
(11) 
which is unique for , and indefinite when .
Proof.
Equating (9) to zero and solving, we arrive at the solutions in (11). To prove that these correspond to a pure strategy Nash Equilibrium of the game we must show that the payoff is quasiconcave (resp. quasiconvex) in the MM’s (resp. adversary’s) strategy to satisfy the requirements of Sion’s minimax theorem [sion1958general]. Using Lemma 1, and noting the linearity with respect to , we have that for , there is a unique solution based on the extrema and . When , there exists a continuum of strategies with equal payoff. ∎
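The two ingredients of this proof, a concave maximisation for the MM and linearity for the adversary, can be checked numerically on a toy instance. The sketch below assumes a one-shot payoff of the form A(d_bid + b)exp(-k d_bid) + A(d_ask - b)exp(-k d_ask) + bH with symmetric A and k; this form, the function names, and all parameter values are assumptions made for illustration rather than a transcription of Eq. (8).

```python
import math


def stage_payoff(d_bid, d_ask, b, H, A=140.0, k=1.5):
    """Assumed one-shot payoff: expected premium earned plus expected drift on holdings."""
    lam_bid = A * math.exp(-k * d_bid)  # intensity of buys at the bid
    lam_ask = A * math.exp(-k * d_ask)  # intensity of sells at the ask
    return lam_bid * (d_bid + b) + lam_ask * (d_ask - b) + b * H


def grid_best_response(b, A=140.0, k=1.5, n=4001, d_max=4.0):
    """Brute-force the MM's best offsets against a fixed drift b.

    The payoff is separable in the two offsets and the b*H term does not
    depend on them, so each side can be maximised independently.
    """
    grid = [i * d_max / (n - 1) for i in range(n)]
    best_bid = max(grid, key=lambda d: A * math.exp(-k * d) * (d + b))
    best_ask = max(grid, key=lambda d: A * math.exp(-k * d) * (d - b))
    return best_bid, best_ask


b = 0.2
best_bid, best_ask = grid_best_response(b)
print("grid best response:", best_bid, best_ask)
print("first-order condition:", 1 / 1.5 - b, 1 / 1.5 + b)
```

Because the assumed payoff is linear in the drift for fixed quotes, an adversary restricted to an interval of drifts always has a best response at one of the endpoints (or is indifferent), which is the mechanism the proof relies on.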
The solution (11) has a similar form to that of the optimal strategy for linear utility with terminal inventory penalty [fodra2012high], or equivalently that of a myopic agent with running penalty [cartea2015order]. Interestingly, the extension of Theorem 1 to an adversary with control over all three model parameters yields a similar result. In this case we omit the proof for brevity, and leave uniqueness to future work.
Theorem 2 (NE for ).
There exists a pure strategy Nash equilibrium of the MM game for (11), and .
4 Adversarial training
The one-shot setting is informative, but unrealistic. We now investigate a range of multi-stage settings with different restrictions on the adversary and explore how adversarial training can improve the robustness of MM strategies. The following three types of adversary in turn increase the freedom of the adversary to control the market’s dynamics:

 Fixed.

The simplest possible adversary, which always plays the same fixed strategy: , and ; these values were chosen to match those originally used by [avellaneda2008high]. This amounts to a single-agent learning setting with stationary transition dynamics.
 Random.

The second type of adversary instantiates each episode with parameters chosen independently and uniformly at random in the following ranges: , and . These are chosen at the start of each episode and remain fixed until the terminal timestep . This is analogous to single-agent RL with non-stationary transition dynamics.
 Strategic.

The final type of adversary chooses the model parameters (bounded as in the previous setting) at each step of the game. This represents a fully adversarial and adaptive learning environment, and unlike the models presented in related work [cartea2017algorithmic], the source of risk here is exogenous and reactive to the quotes of the MM.
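The difference between the three adversaries is mainly a question of when the model parameters are chosen and what they may depend on. The sketch below makes this explicit. Everything in it is hypothetical: the `MarketParams` container, the numerical defaults and bounds (placeholders, since the exact values are not reproduced here), and in particular `toy_strategic_policy`, which is a hand-written stand-in for a learnt adversary.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketParams:
    drift: float
    A: float  # arrival rate of market orders
    k: float  # decay of execution intensity with quote distance


# Placeholder values and bounds, assumed for illustration only.
FIXED = MarketParams(drift=0.0, A=140.0, k=1.5)
BOUNDS = {"drift": (-5.0, 5.0), "A": (105.0, 175.0), "k": (1.125, 1.875)}


def sample_uniform(rng: random.Random) -> MarketParams:
    """Draw each parameter independently and uniformly within its bounds."""
    return MarketParams(*(rng.uniform(*BOUNDS[name]) for name in ("drift", "A", "k")))


def toy_strategic_policy(step: int, inventory: int) -> MarketParams:
    """Stand-in for a learnt adversary: move the price against the MM's inventory."""
    low, high = BOUNDS["drift"]
    drift = low if inventory > 0 else high if inventory < 0 else 0.0
    return MarketParams(drift=drift, A=FIXED.A, k=FIXED.k)


def parameter_path(setting: str, inventories: list, rng: random.Random) -> list:
    """Return the parameters in force at each step, given the MM's inventory path."""
    if setting == "fixed":
        return [FIXED for _ in inventories]
    if setting == "random":
        episode_params = sample_uniform(rng)  # drawn once, held for the episode
        return [episode_params for _ in inventories]
    if setting == "strategic":
        # Chosen afresh at every step and allowed to react to the state.
        return [toy_strategic_policy(n, h) for n, h in enumerate(inventories)]
    raise ValueError(f"unknown setting: {setting}")


rng = random.Random(0)
for setting in ("fixed", "random", "strategic"):
    print(setting, [p.drift for p in parameter_path(setting, [0, 1, 2, -1], rng)])
```

In the Strategic setting proper, the hand-written rule would be replaced by a policy trained to minimise the MM's reward.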
The principle of adversarial learning here, as with other successful applications [goodfellow2014generative], is that any deviation by the MM away from the most robust strategy will be exploited. Compared to conventional training, this setting induces the MM to solve the objective under a worst-case criterion. While there are no guarantees that a NE will be reached, we show in Section 5 that this approach consistently outperforms previous approaches in terms of absolute performance and robustness to model ambiguity.
Robustness through ARL was first introduced by [pinto2017robust], who demonstrated its effectiveness across a number of standard OpenAI Gym domains. We adapt their RARL algorithm to support incremental actor-critic based methods and facilitate asynchronous training, though many of the features remain the same. The adversary is trained in parallel with the market maker, is afforded the same access to state (including the inventory), and uses the same method for learning, which is described below.
4.1 Learning configuration
Both agents use the NAC-S(λ) algorithm, a natural actor-critic method [thomas2014bias] for stochastic policies (i.e., mixed strategies) using semi-gradient SARSA(λ) [rummery1994line] for policy evaluation. The value functions are represented by compatible [peters2008natural] radial basis function networks of 100 Gaussian prototypes with accumulating eligibility traces [sutton2018reinforcement].

 States.

The state of the environment contains only the current time and the agent’s inventory , where the transition dynamics are governed by the definitions introduced in Section 2.
 Policies.

The market maker learns a bivariate Normal policy for and with diagonal covariance matrix. The mean and variance vectors are modelled by linear function approximators using 3rd-order polynomial bases [lagoudakis2003least]. The standard deviation is kept positive through a softplus transformation. A softplus transformation is also applied to the mean and the values of are clipped below at zero.
The adversary learns a Beta policy [chou2017improving], shifted and scaled to cover the market parameter intervals. The two shape parameters are learnt in the same way as the standard deviation of the Normal distribution above, with a translation of to ensure unimodality and concavity.
 Rewards.

The reward function is an adaptation of the optimisation objective of Cartea, Jaimungal, and others, for the RL setting of incremental, discretetime feedback:
(12)
where refers to the MtM value of the agent’s holdings (Eq. 7). This formulation can be either risk-neutral (RN) or risk-averse (RA). For example, if and , then the agent is punished if the terminal inventory is non-zero, and the solution becomes time-dependent. We will generally refer to the case of as RN, and the latter as RA.
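The policy parameterisations described under Policies above can be sketched as follows. This is an illustration under assumptions rather than the configuration used in our experiments: the pre-activation inputs stand in for the outputs of the linear function approximators, both action components are clipped at zero for simplicity, and the shift of +1 on the Beta shape parameters is an assumed value (a Beta density is unimodal when both shape parameters exceed one). All function names are hypothetical.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); maps any real number to a non-negative one."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_mm_action(mean_pre, std_pre, rng: random.Random):
    """Sample a two-component action from independent Normals (diagonal covariance).

    mean_pre and std_pre are pairs of unconstrained numbers, e.g. outputs of
    linear function approximators; softplus makes them valid parameters.
    """
    action = []
    for m, s in zip(mean_pre, std_pre):
        draw = rng.gauss(softplus(m), softplus(s))
        action.append(max(draw, 0.0))  # clip below at zero
    return tuple(action)


def sample_adversary_param(alpha_pre, beta_pre, low, high, rng: random.Random):
    """Sample a market parameter from a Beta policy scaled to [low, high]."""
    alpha = softplus(alpha_pre) + 1.0  # assumed shift so that alpha > 1
    beta = softplus(beta_pre) + 1.0    # assumed shift so that beta > 1
    return low + (high - low) * rng.betavariate(alpha, beta)


rng = random.Random(0)
print(sample_mm_action((0.0, 0.0), (-1.0, -1.0), rng))
print(sample_adversary_param(0.0, 0.0, -5.0, 5.0, rng))
```

A bounded Beta policy suits the adversary because its actions must stay inside fixed parameter intervals, whereas a clipped Normal would place probability mass on the boundaries.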



5 Experiments
In each of the experiments to follow, the value function was pre-trained for 1000 episodes (with a learning rate of ) to reduce variance in early policy updates. Both the value function and policy were then trained for episodes, with policy updates every 100 time steps, and a learning rate of for both the critic and policy. The value function was configured to learn returns. The starting time was chosen uniformly at random from the interval , with starting price and inventory . Innovations in occurred with fixed volatility between with increment .
5.1 Market maker desiderata
In almost every area of finance, the risk appetite of an agent varies depending on their tolerance to losses over gains. This phenomenon has been studied extensively in economics in the form of expected utility theory, as well as in cognitive psychology in the form of prospect theory. In general, it is not sufficient to describe a strategy only by the expected wealth it creates. Variance and other metrics also play an important role. In this work — as in much of the literature — we will be interested in the following set of quantitative desiderata:

 Profit and loss

The distribution of profit and loss made over an episode, and its moments and , are fundamental to the problem of trading. In general, solving for optimal trading strategies amounts to identifying solutions that maximise the former while minimising the latter.
 Sharpe ratio

A common metric used in financial literature and in industry to measure the level of expected compensation per unit risk taken by the investor. It is defined by the ratio: . While larger values are better, it is important to understand that the Sharpe ratio is not a sufficient statistic.
 Terminal inventory

The distribution of terminal inventory, , tells us about the robustness of the strategy to adverse price movements. In practice, it is a desirable trait for market makers to finish the trading day with small absolute values for ; though this is not a strict requirement.
 Quoted spread

The “competitiveness” of a market maker is often discussed in terms of the average quoted spread: . Tighter spreads imply more efficient markets and exchanges often compensate (with rebates) for smaller spreads.
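The four desiderata above can be computed from per-episode test results with a few lines of code. The sketch below makes some assumptions that are not fixed by the definitions above: the Sharpe ratio is taken as mean terminal wealth divided by its sample standard deviation, terminal inventory is summarised by its mean absolute value, and the function and key names are hypothetical.

```python
import math
import statistics


def summarise(terminal_wealth, terminal_inventory, average_spreads):
    """Summarise a batch of test episodes.

    Each argument holds one number per episode: terminal wealth, terminal
    inventory, and the average quoted spread over that episode.
    """
    mean_wealth = statistics.fmean(terminal_wealth)
    var_wealth = statistics.variance(terminal_wealth)  # sample variance
    sharpe = mean_wealth / math.sqrt(var_wealth) if var_wealth > 0 else float("nan")
    return {
        "mean_wealth": mean_wealth,
        "var_wealth": var_wealth,
        "sharpe": sharpe,
        "mean_abs_inventory": statistics.fmean([abs(h) for h in terminal_inventory]),
        "mean_spread": statistics.fmean(average_spreads),
    }


# Toy inputs for three episodes (made-up numbers, for illustration only).
print(summarise([1.0, 2.0, 3.0], [-1, 0, 3], [1.0, 1.5, 2.0]))
```

Reporting all four quantities side by side guards against reading too much into a high Sharpe ratio that was achieved with wide spreads or a large terminal inventory.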
In general, a market maker aims to maximise wealth subject to preferences on inventory and quoted spread. Higher terminal wealth with lower variance, less exposure to inventory risk and smaller spreads are all indicators of the risk aversion (or lack thereof) of a strategy. This is reflected in the majority of objective functions studied in the literature.
5.2 Results
Fixed setting.
Random setting.
We then trained MMs in an environment with a Random adversary; a simple extension to the training procedure that aims to develop robustness to epistemic risk. To compare with earlier results, the strategies were also tested against the Fixed adversary — a summary of which, for the same set of risk parameters, is given in Table 1(b).
We then examined more precisely how training against the Random adversary impacts test performance in the face of model ambiguity. To do so, we took the market makers from the Fixed setting, and those trained against the Random adversary in this section, and ran out-of-sample tests in an environment with a Random adversary. This means that the model dynamics at test-time were different from those at training time. While not explicitly adversarial, this misspecification of the model represents a non-trivial challenge for robustness. Overall, we found that market makers trained against the Fixed adversary exhibited no change in average wealth creation, but an increase of 98.1% in the variance across all risk parametrisations. On the other hand, market makers originally trained against the Random adversary yielded a smaller average increase of 86.0% in the variance. The Random adversary helps, but not by much compared with the Strategic adversary, as we will see next.
Strategic setting.
We first consider a Strategic adversary that controls the drift only; i.e., constant and . With RN rewards, we found that the adversary learns a (time-independent) binary policy (Fig. 2) that is identical to the strategy in the corresponding one-shot game; see Section 3. We also found that the strategy learnt by the MM in this setting generates profits and associated Sharpe ratios in excess of all other strategies seen thus far when testing against a Fixed adversary (see Table 1(c)). This is true also when comparing with tests run against the Random or Strategic adversaries, suggesting that the adversarially trained MM is indeed more robust to test-time model discrepancies.
This, however, does not extend to Strategic adversaries with control over either only or only . In these cases, performance was found to be no better than the corresponding MMs trained in the Fixed setting with a conventional learning setup. The intuition for this again derives from the one-shot analysis. That is, the adversary almost surely chooses a strategy that minimises (equiv. maximises ) in order to decrease the probability of execution, thus decreasing profits derived from execution and the MM’s ability to manage inventory with any degree of certainty. The control afforded to the adversary must correlate in some way with sources of variance, such as inventory speculation, in order for robustness to correspond to practicable forms of risk-aversion.
The natural question is then whether an adversary with simultaneous control over and produces strategies outperforming those that are robust to manipulation of alone. This is plausible since combining the three model parameters can lead to more interesting strategies, e.g., driving inventory up/down only to increase drift at the peak (i.e., pump and dump). To investigate, we trained with an adversary with control over all three parameters. The resulting performance is quoted in Table 1(c), which shows an improvement in the Sharpe ratio of 0.27 and lower variance on terminal wealth. Interestingly, these MMs also quote tighter spreads on average, with values even approaching that of the risk-neutral MM trained against a Fixed adversary. This indicates that the strategies are able to achieve epistemic risk aversion without charging more to counterparties.
As in previous sections, we can explore the impact of varying and . In all cases, we found that the MM strategies trained against a Strategic adversary with RA reward outperformed their counterparts in Tables 1(a) and 1(b). It is unclear, however, if changes to the reward function away from the RN variant actually improved the strategy in general. Excluding when , all values appear to do worse than for an adversarially trained MM with RN reward. It may well be that the addition of inventory penalty terms actually boosts the power of the adversary and results in strategies that try to avoid trading at all, a frequent problem in this domain.
Verification of approximate equilibria.
Holding the strategy of one player fixed, we empirically computed the best response against it by training. We found consistently that neither the trader nor the adversary deviated from their policy. This would suggest that equilibria do exist and that the solutions found by our ARL algorithm are Nash equilibria of the corresponding game. While we do not provide a full theoretical analysis of the stochastic game, these findings are corroborated by those in Section 3. In other words, the “projection” of the equilibrium strategy onto the one-shot case is the same as that of the stochastic game; see, for example, Figure 2. This may explain why ARL performs so well in this domain and justifies its use in financial problems which often have similar mathematical underpinnings.
6 Conclusions
We have introduced a new approach for deriving trading strategies with ARL that are robust to the discrepancies between the market model in training and testing. In the proposed new domain, we show that our approach leads to strategies that outperform previous methods in terms of PnL and Sharpe ratio, and have comparable spread efficiency. This is shown to be the case for out-of-sample tests in all three of the proposed settings. In other words, our MMs are not only more robust to misspecification, but also dominate in overall performance, regardless of the reward function used. In some special cases we are even able to show that these strategies correspond to Nash equilibria of the corresponding one-shot instance, and more widely validate the existence of equilibria in the full stochastic game. In future we plan to:


Extend to oligopolies of market makers.

Apply to data-driven and multi-asset models.

Further explore existence of equilibria and design provably convergent algorithms.
Finally, we remark that, while our paper focuses on market making, the approach can be applied to other trading scenarios such as optimal execution and statistical arbitrage, where we believe it is likely to offer similar benefits. Further, it is important to acknowledge that this methodology has significant implications for the safety of RL in finance. Training strategies that are explicitly robust to model misspecification makes deployment in the real world considerably more practicable.