Maximum Entropy Methods Tutorial: Modeling the Open Source Ecosystem 3


In the previous unit, what I showed you was a way to rescue MaxEnt models of the language abundance P(n) by reference to a hidden variable, epsilon, which we referred to as “programmer time”. What we assumed was that the system was constrained in two ways: not only did we fix the average number of projects per language (the average popularity of languages), we also fixed the average programmer time devoted to projects in a particular language. And so this distribution, here, looks like this in functional form. And when we integrate out this variable, epsilon, we get something that looks like this. So we get a different prediction for the language distribution, and a prediction that, I argue, looks reasonably good. It certainly looks better than the exponential distribution.

I feel honor-bound to tell you about the controversy that arises when we try to make these models, and in particular that there is a very different, mechanistic model that looks quite similar. So this is the Fisher log series. And the argument behind the Fisher log series, to explain this distribution, involves the idea of a “hidden” additional constraint. In the open source question, what I’ve done is describe that additional constraint as “programmer time”, just because it seems like it might be a constraint in the system. OK?
That is, the average programmer time for languages is fixed, not just the average number of projects. And so that means that languages can vary in their popularity, but also in their efficiency. In the ecological modelling, languages are species, and n is the abundance of a species, meaning the number of particular instances of that species in the wild. The other variable is metabolic rate: how much energy a particular species consumes. In that case, the system is constrained to a certain average species abundance and a certain average species energy consumption. Here, languages are constrained to a certain average abundance and a certain average consumption of programmer energy. That’s the analogy.

So, we can also build a mechanistic model of programming language popularity. Previously, when we studied the taxi-cab problem and produced a mechanistic model, we were able to find a very simple one that made the same predictions as the MaxEnt model did. Here, by contrast, what we’re going to find is that the mechanistic model produces similar behavior, but the functional form will actually be slightly distinct. So, here’s the mechanistic
model… we imagine that languages all start out with a baseline popularity. Whoever invents the language, for example, has to write at least one project. There is at least one programmer at the beginning of a language’s invention who knows how to program in that language, somewhat by definition. And so there are two ways that popularity can grow. It can grow, for example, linearly. On day 1 there’s 1 programmer. On day 2, that one programmer is joined by another programmer. On day 3, those two programmers are joined by a third. And so, over time, what you have is growth that’s linear in time.

But perhaps a more plausible model for how languages accrue popularity is multiplicative. At time 1, there’s 1 programmer, and he has some efficiency of converting other programmers to his cause. So maybe he’s able to double the number of programmers, because his language is particularly good, and perhaps people who like to program in that language happen to be particularly persuasive. And so on the second day, those two programmers each go out and convert one more person, because they are the same as the original programmer in their effectiveness, and the language itself is just as convincing as it was before. So the number doubles again and we go to 4. And by a similar argument, we go to 8. This would be the exponential growth model, where the number of programmers as a function of time increases multiplicatively, as opposed to additively.

So, let’s make this model a little more realistic. In particular, let’s allow the multiplicative factor, which in this case we set to 2, to vary. In fact, we’re going to draw this multiplicative factor, alpha, from some distribution. And it doesn’t really matter what that distribution is, as long as alpha is always greater than zero, so it’s not possible for all programmers to suddenly disappear, and as long as it’s bounded at some point, so it’s impossible for a language to become infinitely popular after a finite number of steps. So each day we’re going to draw a number, alpha, from this distribution, here. After one day, there are alpha(1) programmers. (This is the draw on the first day.) On the second day there are alpha(2) times alpha(1) programmers, and so on.
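
The daily draws described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lecture leaves the distribution of alpha unspecified (beyond being strictly positive and bounded), so the uniform range [0.5, 2.0], the function name `simulate_growth`, and the seed are all hypothetical choices made here.

```python
import random


def simulate_growth(days, low=0.5, high=2.0, seed=0):
    """Return the programmer count on each day for one language.

    The language starts with a single programmer (its inventor). Each day
    the count is multiplied by a fresh random factor alpha. The uniform
    distribution on [low, high] is an assumption for illustration; the
    argument only needs alpha to be strictly positive and bounded.
    """
    rng = random.Random(seed)
    count = 1.0  # baseline popularity: the inventor
    trajectory = [count]
    for _ in range(days):
        alpha = rng.uniform(low, high)  # today's multiplicative kick
        count *= alpha                  # multiply, rather than add
        trajectory.append(count)
    return trajectory


if __name__ == "__main__":
    for day, count in enumerate(simulate_growth(days=10)):
        print(f"day {day}: {count:.3f} programmers")
```

Because every alpha is strictly positive, the count can shrink but never reaches zero, and because alpha is bounded, it stays finite after any finite number of days.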
On the third day, alpha(3) times alpha(2) times alpha(1). So this is now growth that occurs through a random multiplicative process. It’s similar to growth that would happen through a random additive process, except now, instead of adding a random number of programmers each day, you multiply the total number of programmers each day by some factor alpha drawn from this distribution.

You can always convert this multiplicative process into an additive process by the very simple trick of taking the logarithm. If we count programmer numbers, we’re multiplying, but if we’re working in log space, we’re just adding: each day we add a random number, log(alpha), to the running total. As long as alpha is always strictly greater than zero, these logarithms will always be well defined. Now, all of a sudden, it looks like the additive model in log space. And what we know from the central limit theorem is that if you add together lots of random numbers, the distribution of the sum tends towards a Gaussian distribution, with some particular mean (mu) and some particular variance (sigma squared). Let’s not worry about what mu and sigma are in particular, but rather note that the growth happens in log space. The distribution of these sums over long time scales will end up looking like a Gaussian distribution. The average boost per day to a language looks in log space like a Gaussian
distribution. What that means is that the exponential growth model with random multiplicative kicks actually looks like a Gaussian in log space, or what we call a log-normal in actual number space. So, instead of looking at the logarithm of the popularity of the language, just look at the total popularity of the language, n. What that means is that the distribution looks like the exponential of minus (log(n) minus mu) squared, over two sigma squared… and then you just have to be careful to normalize things properly here. So, this is the log-normal distribution.

This is a mechanistic model where language growth happens multiplicatively: a language gains new adherents in proportion to the number of adherents it already has, or gains new projects in proportion to the number of projects it already has, with a proportionality that depends upon the environment. That’s where the multiplicative randomness comes from. Alpha is a random number; it’s not a constant. It’s not 2. It’s not that the language always necessarily doubles. But the fact that it grows through a multiplicative random process, as opposed to an additive process, means that you get a log-normal distribution. And so now you can say, “OK, let’s imagine that languages grow through this
log-normal process. And let’s find the best-fit parameters for mu and sigma.” And if you do that, you find that the mechanistic log-normal model looks pretty good as well. We were impressed by how well the blue line fit this distribution compared to the red exponential model, the MaxEnt model constraining only N. That was the red model. Here the blue model does well… this is the Fisher log series. Unfortunately, a mechanistic model, and I’ve given you a short account of the mechanistic model here, where what’s happening is you’re accumulating lots of small random multiplicative kicks, also works.

I will tell you that the Fisher log series fits better. Both of these models have two parameters, and if you do a statistical analysis, the Fisher log series actually fits better. In particular, it’s able to explain these really high-popularity languages better. These deviations here seem larger than the deviations here, but you have to remember that this is on a log scale. So this gets much closer up here than this does here. So the mechanistic model, at least visually, looks like it’s extremely competitive with the Fisher log series model derived from a MaxEnt argument. Statistically speaking, if you look at
these two, the log-normal is actually slightly dispreferred. But, like many people, what you want is some ironclad evidence for one versus the other. And I think the best way to look for that kind of evidence is to figure out what, if anything, this epsilon really is in the real world. If we were able to build a solid theory about what epsilon was, and how we could measure it in the data, then we could see if this joint distribution, here, was well reproduced. We could look for evidence, for example, that these two co-vary: here we have a term that boosts the popularity of a language if it becomes more efficient. So, if epsilon goes down, n can get higher, and the language can still have the same probability of being found with those properties. And of course, the problem is that we don’t know how to measure this, sort of, mysterious programmer time, or programmer efficiency.

The ecologists have a much better time with this, because the ecologists know what their epsilon is. They know that their epsilon is metabolic energy intake: “how much a particular instance of this species consumes in energy over the course of a day, or over the course of its lifetime.” And they’re able to measure that, and in fact, they’re able to measure this joint distribution. When we come to study the open source ecosystem, so far we don’t really have a way to measure epsilon, and so we’re unable to measure the joint distribution. So now we’re left with one model that’s mechanistic, a popularity accrual model, and, over here, this model that says there are two constraints on the system: average number and average programmer time.
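
For reference, the two competing forms discussed in this unit can be written out. The log-normal and log-series expressions are the standard textbook forms, given up to normalization. The joint distribution is an assumption, since the slide is not reproduced in this transcript; it is chosen because it couples n and epsilon in the way described and because integrating out epsilon yields the log series. The symbols \lambda_1 and \lambda_2 (Lagrange multipliers for the two constraints) are notation introduced here.

```latex
% Two-constraint MaxEnt model (assumed form, not taken from the slide)
P(n, \epsilon) \propto e^{-\lambda_1 n - \lambda_2 n \epsilon}

% Integrating out epsilon gives the Fisher log series shape
P(n) \propto \int_0^\infty e^{-\lambda_1 n - \lambda_2 n \epsilon}\, d\epsilon
     = \frac{e^{-\lambda_1 n}}{\lambda_2 n}
     \propto \frac{e^{-\lambda_1 n}}{n}

% Mechanistic multiplicative-growth model: log-normal shape
P(n) \propto \frac{1}{n} \exp\!\left( -\frac{(\ln n - \mu)^2}{2 \sigma^2} \right)
```

Both forms fall off more slowly than the pure exponential e^{-\lambda_1 n}, which is why either can do better on the high-popularity languages than the single-constraint model.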

