We have had quite a few threads about loot and loot analysis in recent weeks. Jimmy's insane loot test is one of them. Moreover, thanks to Starfinder we now have enough data to do some deeper analysis.
To analyze loot, some statistical background is needed. At the least one should be able to understand the main results, hence this “How to”.
The standard question about loot in game is “how is loot today?”, so let’s see how to answer it.
----
Update: 2008/06/18
Before we start with the basics, here is a summary of what we were able to assess so far.
We were able to estimate some basic loot characteristics for mobs with health between 1000 HP and 2000 HP. Since we have only limited data on mobs with other HP values, we can’t say whether those findings are valid for them as well.
The mean return rate for hunting activities is estimated to be somewhere between 80% and 90%. This means that when you hunt for quite some time, you should get 80% to 90% of the PED spent on that mob back as loot.
We were not able to identify any trends or cycles with respect to loot value. For loot triggering, however, we have some indications that there is some kind of grouping, i.e. there are periods where the number of globals or valued loots is rather high compared to other periods within an avatar’s hunting activities. So it seems that the grouping is mainly an avatar effect, but we can’t exclude that other factors are involved.
One of the main problems in all statistical analysis is to have a sufficiently large dataset. Therefore I have to thank Jimmy, Kolobok, Noodles and Woody for providing their data and I hope that maybe somebody else is willing to do the same.
----
Basics
All we have is observed data. We can describe it in a first step and try to model it in a second one. The latter one is only needed if one is interested in deeper insights.
There are two possibilities to describe a dataset. First, you can calculate and plot the distribution of the data and second, you can use some measures like the mean.
A frequency distribution describes the frequencies of single observations, while a probability distribution gives us the probabilities of random events. Both distributions are closely related to each other, but it would lead too far to explain that here.
Let’s denote a random observation as xi and the random variable to which it belongs as X (capital x). In observing loot we observe single xi’s. If we don’t know the loot function (i.e. how loot is calculated, I for sure do not), those xi’s are random and therefore we enter the field of statistics.
“How is loot today?” can mean “what is the probability of getting some nice loot over x PED?”. This is mathematically expressed as
S(x) = P(X > x),
where S(x) is a function of x giving us the probability that the realization of X (a single xi) is greater than x. S can be estimated by the empirical cumulative distribution function (ECDF) as
S = 1 – ECDF.
We will see in the next chapter how to do it. One further note: we are going to model probabilities of observed data. This has nothing to do with a loot function as implemented by MA. That would be something like
L(p) = x,
where L is a loot function that depends on the parameter vector p and gives a loot value as a result.
As already mentioned, it is also possible to use the mean to describe data. There are other measures as well, the median for example. This is the number that stands in the middle, i.e. 50% of the observations are higher than it and 50% are lower. The median is better suited for skewed data.
If the distribution is symmetric, then the mean is the same as the median. Both measures give us the location of the underlying distribution and hence describe only one aspect. What about variability? To describe the variability of a sample, measures like the variance, the standard deviation or quartiles are used. They give us an indication of how the data is spread around the mean.
“How is loot today?” can therefore also mean, “what is the mean loot today or how variable is loot today?”.
Survivor function
As mentioned above the survivor function for a random variable X is defined as
S(x) = P(X > x).
We will see now how to estimate it.
In our example we observe 5 events denoted as x(1),..,x(5) and therefore n=5 (number of cases in the sample). To get S we first have to order them from smallest to largest; let’s denote the ordered sample as x1,..,x5.
S(x1) is then (n-1)/n,
S(x2) is then (n-2)/n,
..
S(x5) is then (n-5)/n = 0
so
S(xi) = (n-i)/n.
Remark: This procedure does not account for ties and will interpolate them. A better estimator is the Kaplan-Meier estimator, but I won’t explain that here. Furthermore, there is also the possibility to do a continuity correction using S(xi) = (n-i+.5)/n. However, for our purposes the above mentioned method is sufficiently precise.
Example 1
Code:
x     rank   S
51    1      0.8
52    2      0.6
67    3      0.4
80    4      0.2
120   5      0
The probability of getting loot higher than 80 is 20%. The mean of the sample is 74 PED and the median is 67 PED. Since the median and the mean are quite different, the distribution is skewed.
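The estimator S(xi) = (n-i)/n can be reproduced in a few lines of Python. This is only a minimal sketch using the five values from Example 1; the function name `empirical_survivor` is made up for illustration, and ties are not treated specially (as in the remark above).

```python
from statistics import mean, median

def empirical_survivor(sample):
    """Return (x, S(x)) pairs with S(x_i) = (n - i) / n for the ordered sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return [(x, (n - i) / n) for i, x in enumerate(ordered, start=1)]

# Values from Example 1, deliberately given unsorted.
loots = [80, 51, 120, 67, 52]

survivor = empirical_survivor(loots)
for x, s in survivor:
    print(x, s)

sample_mean = mean(loots)
sample_median = median(loots)
print("mean:", sample_mean, "median:", sample_median)
```

Sorting is done inside the function, so the observations can be passed in the order they were recorded.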
Example 2
Now let’s use some real data. I will use a dataset that I copied and pasted from Starfinder’s site over the last few days. Since this dataset consists of captured globals, I use x - 50 instead of the observed values. You can use the original data as well, but since we would like to fit some models, it is better to transform them first.
Fig. 1

Quite impressive I would say, but where is the loot? Let’s zoom in.
Fig. 2

Oh, now things get clearer. We observe a high probability in the beginning, that drops quite fast till 500 PED and seems to disappear thereafter. So the distribution is rather skewed.
The median can be read from the graph at 50% survival and is 27; hence for the global data it would be 27 + 50 = 77. The mean is 65.22 and hence 115.22 for the original data. Quite some difference. As we know from the median, 50% of all globals will be less than 77 and 50% higher.
Using the survivor function we can find a percentage for the mean as well, and I get about 22%. So 22% of all globals are higher than 115. This implies that using the mean gives a wrong expectation about loot: it will be reached in only 22% of all cases.
The models
As we have seen before, we are able to use the survivor function to describe the loot distribution. Is it possible to find a function for this distribution? As a statistician I can tell you: maybe. So let’s try it.
In the second figure we have seen that loot seems to follow a specific curve, with not much variability around it. This is due to the rather high sample size and the way MA calculates loot, hence modeling is easier. The curve looks quite like an exponential distribution. Let’s first see how an exponential distribution is defined:
S(x) = exp(-a*x),
where exp is the exponential function and a is a rate parameter (1/a is the scale). We know further that for an exponentially distributed random variable the mean is 1/a, the variance is 1/a^2 and the median is ln(2)/a. A good estimator for 1/a is the observed mean.
Moreover, one can show that for an exponential distribution the percentage above the mean is exp(-1) = 36.8%. From the previous analysis we know that only 22% of all globals are greater than the mean, so the two do not match. Nevertheless, let’s try to fit it.
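The two statements about the exponential model can be checked numerically. A minimal sketch, assuming only the observed mean of 65.22 quoted earlier for the transformed data; the variable names are illustrative.

```python
import math

def exp_survivor(x, a):
    """S(x) = exp(-a * x) for an exponential distribution with rate a."""
    return math.exp(-a * x)

observed_mean = 65.22          # mean of the transformed data (x - 50) quoted above
a_hat = 1.0 / observed_mean    # the observed mean estimates 1/a

# Share of observations above the mean under the model: exp(-1), whatever a is.
share_above_mean = exp_survivor(observed_mean, a_hat)

# Median implied by the model: ln(2) / a.
model_median = math.log(2) / a_hat

print(f"share above the mean: {share_above_mean:.3f}")
print(f"model median: {model_median:.2f} (observed median: 27)")
```

Both the share above the mean (36.8% against the observed 22%) and the implied median (against the observed 27) point in the same direction: a single exponential distribution is not skewed enough for this data.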
Fig. 3

As already mentioned, we won’t have a perfect fit. The exponential distribution looks similar but overestimates loot up to 200 PED and underestimates it thereafter. As an alternative one can try a different model. There are a lot of them. To keep things short, the best model to describe the data is a Pareto distribution.
Fig. 4

Looks much better now. The Pareto distribution is a power function and is defined as
S(x) = (x/s)^(-k),
where s is a scale and k is a shape parameter.
The mean would be k*s/(k-1) (for k > 1) and the median s * 2^(1/k).
From the data I get
k = 2.59 and s = 34.76,
therefore the estimated mean is 56.62 and the median is 45.43.
Those values are closer to our observed ones but not perfect yet.
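The Pareto mean and median follow directly from the two formulas. A short sketch that recomputes them from the fitted k and s given above; the function names are made up for illustration, and the survivor function is assumed to be 1 below the scale s.

```python
def pareto_survivor(x, s, k):
    """S(x) = (x / s) ** (-k) for x >= s; below the scale s nothing has dropped out yet."""
    return (x / s) ** (-k) if x >= s else 1.0

def pareto_mean(s, k):
    """Mean k * s / (k - 1); only finite for k > 1."""
    if k <= 1:
        raise ValueError("the mean is infinite for k <= 1")
    return k * s / (k - 1)

def pareto_median(s, k):
    """Median s * 2 ** (1 / k), i.e. the x where S(x) = 0.5."""
    return s * 2 ** (1 / k)

k, s = 2.59, 34.76   # fitted values quoted in the text

est_mean = pareto_mean(s, k)
est_median = pareto_median(s, k)
print(f"mean: {est_mean:.2f}, median: {est_median:.2f}")
print(f"S(median) = {pareto_survivor(est_median, s, k):.3f}")
```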
So why am I not able to find a perfectly fitting model? Maybe MA invented some new system that leads to a yet unknown distribution? Let’s take a step back.
We have global data. All values are greater than 50. It might be that globals are only shown when higher than 50, but the distribution starts below that. This is what we call left truncation. Furthermore, we have quite heterogeneous data: many different mob types with maybe different loot, and different days.
It is known that a random variable Y = b * exp(X) that depends on an exponentially distributed X will follow a Pareto distribution. Functions of the form b*f(X) typically arise in statistics when a mixture of several distributions is involved. So it is rather plausible to assume that loot follows a mixture distribution. What we need now is to identify the components of that mixture.
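The link between the exponential and the Pareto distribution can be illustrated with a small simulation. This is a sketch on simulated numbers only, not on loot data: X is drawn with rate k, so that P(Y > y) = P(X > ln(y/b)) = (y/b)^(-k). The values of b and k are simply borrowed from the fit above, and the seed and sample size are arbitrary.

```python
import math
import random

def simulate_pareto_from_exp(b, k, n, seed=1):
    """Draw Y = b * exp(X) with X exponentially distributed with rate k."""
    rng = random.Random(seed)
    return [b * math.exp(rng.expovariate(k)) for _ in range(n)]

b, k = 34.76, 2.59          # borrowed from the Pareto fit above, for illustration only
ys = simulate_pareto_from_exp(b, k, 100_000)

y0 = 2 * b                  # any point above the scale b will do
empirical = sum(y > y0 for y in ys) / len(ys)
theoretical = (y0 / b) ** (-k)   # Pareto survivor function at y0

print(f"empirical S(y0) = {empirical:.4f}, Pareto S(y0) = {theoretical:.4f}")
```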
Some more models, the effect of health
From Jimmy’s thread we know that loot depends on mob health. Finding a model where health is an additional variable might be rather tricky, since we don’t know the model yet. Fortunately, there is a solution that does not require a fully parametric model. As with the ECDF, we can try to model the data empirically using health as an additional variable. I’m going to use the Cox proportional hazards model.
Fig. 5

Quite a good fit with some minor underestimation (to see the differences, the x axis is now limited to 500 PED). The effect of health is highly significant (p < 0.001), but we already knew that.
So our observed distribution depends on mob health. In fig. 5 the distribution is plotted for the overall mean health. Let’s check how this looks for 500, 1000, 1500 and 2000 HP.
Fig. 6

Loot increases proportionally to health. Mean health from the data is 1146 and the ECDF lies between the plotted 1000 and 1500 HP curves, so this again fits quite well. (Btw., Ambus are the most hunted mobs, therefore a mean HP > 1000 is rather obvious.)
If I find the time I will show you how we can fit models for constant HP and what else I have been able to identify so far. There is still a lot to explain, but I hope this intro was helpful.
Btw., loot didn’t change over days.

The graph is called a boxplot. It shows the median in the middle and the 25th and 75th percentiles. To test the day effect one needs a statistical test. The one appropriate for this data is the Kruskal-Wallis test. It gives p = .275. Since this value is higher than 0.05, we say the day effect is not statistically significant.
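To give an idea of what the Kruskal-Wallis test computes, here is a hand-rolled sketch of its H statistic. The loot values per day are made up for illustration (they are not the Starfinder data), ties get mid-ranks but no tie correction is applied, and the p-value shortcut exp(-H/2) only holds for exactly three groups (chi-square with 2 degrees of freedom). A statistics package would normally be used instead.

```python
import math

def kruskal_wallis_h(groups):
    """H statistic based on rank sums; tied values share their average rank."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2   # average of the ranks i+1 .. j
        i = j
    weighted = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

# Hypothetical loot values (x - 50) for three days.
day1 = [12.0, 45.0, 60.0]
day2 = [18.0, 40.0, 75.0]
day3 = [25.0, 31.0, 52.0]

h = kruskal_wallis_h([day1, day2, day3])
p_value = math.exp(-h / 2)   # chi-square survival function for 2 degrees of freedom

print(f"H = {h:.4f}, p = {p_value:.3f}")
```

With these made-up values the rank sums per day are almost equal, so H is close to zero and the p-value is far above 0.05, which is the same kind of outcome as reported above.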
Section 2:
Mixtures
We have seen in fig. 6 that loot depends on the total health of the mob. Since we used a semi-parametric model, we have to verify this finding. So let’s take the most hunted mobs: Ambu Young (1010 HP), Argo Y (300 HP) and Aurli Ravager (2800 HP).
Fig. 7

Since Ambu is the most hunted mob (10%), its distribution is quite close to that of total loot. As expected, Argo Y is below Ambu’s distribution and Aurli is above. The sample is now much smaller due to lack of data (copy & paste is quite time consuming), therefore we will have lower precision. Means (after subtracting the 50 PED) are 39.9, 80.7 and 116.1 for Argo, Ambu and Aurli. Relating Ambu to Argo and Aurli to Ambu on the original scale (with the 50 PED added back) gives 1.46 and 1.27, which shows the proportionality to health as expected.
Now let’s try to fit a model for Ambus:
Fig. 8

Once again, the Pareto distribution fits better than the exponential, indicating once again that some mixture distribution might be involved. So where does the mixture come from? Ambus are fast regenerating mobs. So some hunters are able to kill one faster and some will be slower, doing more damage in total as a result. This might explain a portion of the mixture. Furthermore, it is possible that globals come from more than one distribution. Right, we have hof’s as well.
Now some formal things again. A mixture distribution arises when the data you observe in one dataset comes from two or more distributions. Mathematically this is written as
f(x)= p * f1(x) + (1-p)*f2(x),
where f(x) is the density of the overall sample, p is the proportion of the first distribution in the sample and (1-p) that of the second one. I’m using the density here for simplicity; the survivor function of a mixture is weighted in the same way, S(x) = p * S1(x) + (1-p) * S2(x).
It is quite tricky to estimate a mixture from one dataset and there are several approaches. I’m using an MLE (maximum likelihood estimation) approach for that. So let’s see what this leads to when using two exponential distributions.
Fig. 9

Using MLE I get p = .994 for the first distribution with a mean of 45.35 + 50 and (1-p) = .006 with a mean of 5958.8 + 50 for the second one. There is still some over- and underestimation, but it shows that the data clearly comes from a mixture. Since the second mean is quite large, it is rather obvious that we have two distributions, one for globals and one for hof’s. The estimated means are not very precise due to the small sample size. Moreover, I have two very large observations above 15000 PED in the data. Let’s try to exclude them.
p is now .98 with a mean of 43.2 + 50 and (1-p) = .02 with a mean of 422.5 + 50 for the second one. This seems more reasonable and corresponds quite well to the hof data. There is still the same over- and underestimation as in fig. 9. This might be related to the variability in HP that we can’t observe.
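One common way to obtain maximum likelihood estimates for such a two-exponential mixture is the EM algorithm. The sketch below is not necessarily the procedure used for the fits above, and it runs on simulated data with made-up parameters (p = 0.9, means 40 and 400) rather than on the globals data; the starting values and the number of iterations are arbitrary choices.

```python
import math
import random

def exp_pdf(x, m):
    """Density of an exponential distribution with mean m."""
    return math.exp(-x / m) / m

def fit_two_exp_mixture(data, iterations=300):
    """EM iterations for f(x) = p * f1(x) + (1 - p) * f2(x) with exponential f1, f2."""
    n = len(data)
    overall = sum(data) / n
    p, m1, m2 = 0.5, overall / 2, overall * 2      # crude starting values
    for _ in range(iterations):
        # E-step: probability that each observation belongs to component 1.
        resp = []
        for x in data:
            a = p * exp_pdf(x, m1)
            b = (1 - p) * exp_pdf(x, m2)
            resp.append(a / (a + b))
        # M-step: weighted proportion and weighted means.
        w1 = sum(resp)
        w2 = n - w1
        p = w1 / n
        m1 = sum(r * x for r, x in zip(resp, data)) / w1
        m2 = sum((1 - r) * x for r, x in zip(resp, data)) / w2
    return p, m1, m2

# Simulated sample with hypothetical parameters, not real loot data.
rng = random.Random(42)
true_p, true_m1, true_m2 = 0.9, 40.0, 400.0
data = [rng.expovariate(1 / (true_m1 if rng.random() < true_p else true_m2))
        for _ in range(5000)]

p_hat, m1_hat, m2_hat = fit_two_exp_mixture(data)
print(f"p = {p_hat:.3f}, mean 1 = {m1_hat:.1f}, mean 2 = {m2_hat:.1f}")
```

As with the real data, the estimate of the second mean is the least precise one, because only a small share of the observations comes from that component.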
So to conclude: recorded globals data is composed of two distributions, one for globals and one for hof’s. There is still room for a third distribution (ath’s). Hof’s have a frequency of 2% within the globals data.
To know how to break even, we need to know the frequency of globals. There is not much data to estimate it from, but I’ll show what we can do until we have it.
Some experimental stuff
As we have seen, global loot is a mixture of several distributions. Maybe we can find some parameters to describe it. What I’m going to show now is very experimental. Furthermore, I don’t have enough data for reliable estimates. Therefore you can ignore the estimated values. What’s interesting about this experiment is that the relation between loot and health is quite linear.
Code:
hp     p1        p2        m1       m2       m3
300    0.23311   0.74469   13.753   34.667   485.73
500    0.2911    0.69868   13.656   41.443   662.81
1010   0.36811   0.62644   20.078   59.993   966.42
2000   0.55995   0.43653   27.231   101.84   2038.6
For estimation I used a mixture of 3 exp distributions. Since there are 5 parameters to estimate (p1, p2, m1, m2, m3), there is a lot of leeway in the fit.
Fig. 10

For a hp = 2000 mob the model with 3 exp’s does quite a good job. So let’s see if we find a relation between hp and the estimated means m1, .., m3.
Fig. 11

Fig. 12

The means of the exp distributions seem to increase linearly with HP. So if I’m right with my assumption of 3 exp distributions and the relation to HP is linear, then it does not matter which mob you hunt to get the same return. However, this does not imply that one mob will global more often than another one.
added: 2008/04/22
As the mean increases, the probability of getting one decreases (fig. 12). Overall loot expectation is 40, 40, 50 and 67 PED for 300, 500, 1010 and 2000 HP, so this does not follow the cost to kill a mob. I.e. when a mob with 500 HP brings in 40 PED on average, then one with 2000 HP should bring in at least 4 times as much (160 PED). I get 67 PED.
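The overall loot expectations can be recomputed from the table of the 3 exp mixture. A short sketch, assuming that the weight of the third component is p3 = 1 - p1 - p2 (it is not listed in the table) and that the expectation is the weighted sum of the three means.

```python
# (hp, p1, p2, m1, m2, m3) as listed in the table above
rows = [
    (300,  0.23311, 0.74469, 13.753, 34.667, 485.73),
    (500,  0.2911,  0.69868, 13.656, 41.443, 662.81),
    (1010, 0.36811, 0.62644, 20.078, 59.993, 966.42),
    (2000, 0.55995, 0.43653, 27.231, 101.84, 2038.6),
]

def mixture_mean(p1, p2, m1, m2, m3):
    """Expected value of a 3 component mixture; p3 is assumed to be 1 - p1 - p2."""
    p3 = 1.0 - p1 - p2
    return p1 * m1 + p2 * m2 + p3 * m3

expected = {hp: mixture_mean(p1, p2, m1, m2, m3) for hp, p1, p2, m1, m2, m3 in rows}

for hp, value in expected.items():
    print(f"{hp:5d} HP: expected loot {value:6.2f} PED")

# What the 2000 HP mob would have to bring if loot scaled with HP from the 500 HP mob.
scaled_from_500 = expected[500] * 2000 / 500
print(f"scaled from 500 HP: {scaled_from_500:.0f} PED, estimated: {expected[2000]:.0f} PED")
```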
So what is the problem here?
As explained above, we have the problem that a global is only recorded when loot is above 50 PED. As explained in Starfinder’s post, there are two possibilities that could apply to the observed data: loot is shifted by 50 PED or truncated at 50 PED. Unfortunately, there is no method to find this out from global data alone. From Buckaroo (thx Buck) I know that in his data 1/3 of the looted value comes from globals. This is an indication that the observed globals are truncated at 50 PED and not shifted.
If this is the case, then the estimated means from the 50 PED corrected data would be the real means. The estimated weights (probabilities) per mean from the 3 exp mix model might be wrong and should be revised. I have to check this with some simulations.
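Why truncation would leave the means intact but distort the weights can be illustrated for exponential components. The sketch below uses made-up parameters (a single component with mean 60 PED, and a two component mix with weights 0.9 / 0.1 and means 40 / 400 PED); it is not one of the simulations announced above, only an illustration of the memoryless property of the exponential distribution.

```python
import math
import random

rng = random.Random(7)
true_mean = 60.0     # hypothetical mean of one exponential loot component
threshold = 50.0     # globals are only recorded above 50 PED

# Truncate at the threshold, then subtract it (the "50 PED corrected" data).
draws = [rng.expovariate(1.0 / true_mean) for _ in range(200_000)]
recorded = [x - threshold for x in draws if x > threshold]
corrected_mean = sum(recorded) / len(recorded)
print(f"true mean {true_mean:.1f}, mean of corrected globals {corrected_mean:.1f}")

def recorded_shares(weights, means, threshold):
    """Share of each exponential component among the values above the threshold."""
    passing = [w * math.exp(-threshold / m) for w, m in zip(weights, means)]
    total = sum(passing)
    return [v / total for v in passing]

# Hypothetical true weights 0.9 / 0.1; the recorded globals show different ones.
shares = recorded_shares([0.9, 0.1], [40.0, 400.0], threshold)
print("weights among recorded globals:", [round(s, 3) for s in shares])
```

The component with the larger mean passes the 50 PED threshold more often, so it is over-represented among recorded globals even though its mean is estimated correctly.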
Edit 08/05/05
As a result from the above mentioned experimental chapter I was able to derive a loot model that is posted here.
See the post below for a model proposal.