01/09/1978
The criterion of maximizing expected rewards has been widely used in Markov decision processes, following Howard [2]. Recently, considerations related to higher moments of rewards have also been incorporated by Jaquette [4] and Goldwerger [1]. This paper considers mean-variance criteria for discounted Markov decision processes. Variability in rewards arising both from the variability of rewards within each period and from the stochastic nature of transitions is considered. It is shown that randomized policies need not be considered when a function of the mean and variance (m − αs) is to be optimized. However, an example illustrates that policies which simultaneously minimize the variance for all states may not exist. We therefore provide a dynamic programming formulation for optimizing m_i − αs_i for each state i. An example is given to illustrate the procedure.
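To make the mean-variance objective concrete, the sketch below (an illustration, not the paper's formulation) computes the mean m_i and standard deviation s_i of the discounted return under a fixed stationary policy, and then picks a deterministic policy maximizing m_i − αs_i for a given start state by enumeration. It assumes deterministic per-period rewards and a discount factor β, and takes s_i to be the standard deviation; the moment equations follow from conditioning the return X_i = r_i + βX_J on the first transition, giving m = r + βPm for the means and q = c + β²Pq for the second moments, where c_i = r_i² + 2βr_i(Pm)_i.

```python
import numpy as np
from itertools import product

def evaluate_policy(P, r, beta):
    """Mean and standard deviation of the discounted return under a
    fixed stationary policy.  P is the n x n transition matrix induced
    by the policy, r the per-period reward vector (assumed
    deterministic here), beta the discount factor in (0, 1).

    The return X_i satisfies X_i = r_i + beta * X_J with J ~ P[i, :],
    which yields linear equations for the first and second moments."""
    n = len(r)
    I = np.eye(n)
    # First moments: m = r + beta P m
    m = np.linalg.solve(I - beta * P, r)
    # Second moments: q = c + beta^2 P q, with
    # c_i = r_i^2 + 2 beta r_i (P m)_i
    c = r**2 + 2 * beta * r * (P @ m)
    q = np.linalg.solve(I - beta**2 * P, c)
    # Variance = E[X^2] - (E[X])^2; clip tiny negatives from rounding
    s = np.sqrt(np.maximum(q - m**2, 0.0))
    return m, s

def best_deterministic_policy(Ps, rs, beta, alpha, state):
    """Enumerate deterministic stationary policies and pick the one
    maximizing m_i - alpha * s_i for the given start state.  Ps[a] is
    the transition matrix and rs[a] the reward vector under action a;
    every action is assumed available in every state."""
    n = Ps[0].shape[0]
    A = len(Ps)
    best, best_val = None, -np.inf
    for policy in product(range(A), repeat=n):
        # Assemble the transition matrix and reward vector this
        # policy induces, row by row.
        P = np.array([Ps[policy[i]][i] for i in range(n)])
        r = np.array([rs[policy[i]][i] for i in range(n)])
        m, s = evaluate_policy(P, r, beta)
        val = m[state] - alpha * s[state]
        if val > best_val:
            best, best_val = policy, val
    return best, best_val
```

The enumeration is exponential in the number of states and serves only to illustrate the objective on small examples; the dynamic programming formulation the paper provides avoids this search, and is not reproduced here.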