Inside TensorFlow: TensorFlow Lite

Thanks, everybody, for showing up. My name is Jared. I’m an engineer on the
TensorFlow Lite team. Today I will be giving a
very high level overview with a few deep dives
into the TensorFlow Lite stack, what it is, why we have
it, what it can do for you. Again, this is a
very broad topic. So there will be
some follow up here. And if you have any questions,
feel free to interrupt me. And you know, this is meant
to be enlightening for you. But it will be a
bit of a whirlwind. So let’s get started. First off, I do want
to talk about some of the origins of
TensorFlow Lite and what motivated its
creation, why we have it in the first place and why we can’t
just use TensorFlow on devices. I’ll briefly review how you
actually use TensorFlow Lite. That means how you
use the converter. How you use the runtime. And then talk a little bit about
performance considerations. How you can get the best
performance on device when you’re using
TensorFlow Lite. OK. Why do you need TensorFlow
Lite in your life? Well, again, here’s some kind
of boilerplate motivation for why we need on device ML. But these are actually
important use cases. You don’t always
have a connection. You can’t just always be
running inference in the cloud and streaming that
to your device. A lot of devices, particularly
in developing countries, have restrictions on bandwidth. They can’t just be
streaming live video to get their selfie
segmentation. They want that done
locally on their phone. There’s issues with
latency if you need real time object detection. Streaming to the cloud,
again, is problematic. And then there’s
issues with power. On a mobile device,
often the radio is using the most
power on your device. So if you can do things locally,
particularly with a hardware backend like a
DSP or an NPU, you will extend your battery life. But along with
mobile ML execution, there are a number of challenges
with memory constraints, with the low powered CPUs that
we have on mobile devices. There’s also a very kind of
fragmented and heterogeneous ecosystem of hardware backends. This isn’t like the
cloud where often you have a primary provider
of your acceleration backend with, say, NVIDIA GPUs or TPUs. There’s a large class
of different kinds of accelerators. And there’s a problem
with how can we actually leverage all of these. So again, TensorFlow works great
on large well-powered devices in the cloud, locally on
beefy workstation machines. But TensorFlow Lite is not
focused on these cases. It’s focused on the edge. So stepping back a bit,
we’ve had TensorFlow for a number of years. And why couldn’t we
just trim this down and run it on a mobile device? This is actually what we call
the TensorFlow mobile project. And we tried this. And after a lot of effort,
and a lot of hours, and blood, sweat,
and tears, we were able to create kind of a
reduced variant of TensorFlow with a reduced operator set
and a trimmed down runtime. But we were hitting
a lower bound on where we could go in terms
of the size of the binary. And there was also
issues in how we could make that runtime
a bit more extensible, how we could map it onto
all these different kinds of accelerators that you
get in a mobile environment. And while there have been
a lot of improvements in the TensorFlow ecosystem
with respect to modularity, it wasn’t quite
where we needed it to be to make that a reality. AUDIENCE: How small a memory
do you need to get to? JARED DUKE: Memory? AUDIENCE: Yeah. Three [INAUDIBLE] seem too much. JARED DUKE: So this is
just the binary size. AUDIENCE: Yeah. Yeah. [INAUDIBLE] JARED DUKE: So in app size. In terms of memory, it’s
highly model dependent. So if you’re using
a very large model, then you may be required
to use lots of memory. But there are different
considerations that we’ve taken into
account with TensorFlow Lite to reduce the
memory consumption. AUDIENCE: But your
size, how small is it? JARED DUKE: With
TensorFlow Lite? AUDIENCE: Yeah. JARED DUKE: So the core
interpreter runtime is 100 kilobytes. And then with our
full set of operators, it’s less than a megabyte. So TFMini was a
project that shares some of the same origins
with TensorFlow Lite. And this was,
effectively, a tool chain where you could take
your frozen model. You could convert it. And it did some kind of
high level operator fusings. And then it would do code
gen. And it would kind of bake your model into
your actual binary. And then you could run this
on your device and deploy it. And it was well-tuned
for mobile devices. But again, there are
problems with portability when you’re baking the
model into an actual binary. You can’t always stream
this from the cloud and rely on this
being a secure path. And it’s often discouraged. And this is more of a
first party solution for a lot of vision-based use
cases and not a general purpose solution. So enter TensorFlow Lite: a lightweight machine
learning library for mobile and embedded devices. The goals behind this
were making ML easier, making it faster, and making the
kind of binary size and memory impact smaller. And I’ll dive into each of
these a bit more in detail in terms of what it looks like
in the TensorFlow Lite stack. But again, the
chief considerations were reducing the footprint
in memory and binary size, making conversion
straightforward, having a set of APIs that were
focused primarily on inference. So you’ve already crafted
and authored your models. How can you just run and deploy
these on a mobile device? And then taking advantage again
of mobile-specific hardware like these ARM CPUs,
like these DSP and NPUs that are in development. So let’s talk about
the actual stack. TensorFlow Lite has
a converter where you ingest the graph def, the
saved model, the frozen graphs. You convert it to a TensorFlow
Lite specific model file format. And I’ll dig into
the specifics there. There’s an interpreter for
actually executing inference. There’s a set of ops. We call it the
TensorFlow Lite dialect of operators, which is
slightly different than the core TensorFlow operators. And then there’s a way to plug
in these different hardware accelerators. Just walking through
this briefly, again, the converter spits
out a TFLite model. You feed it into your runtime. It’s got a set of
optimized kernels and then some hardware plugins. So let’s talk a little bit
more about the converter itself and things that are
interesting there. It does things like
constant folding. It does operator
fusing where you’re baking the activations
and the bias computation into these high level operators
like convolution, which we found to provide a
pretty substantial speed up on mobile devices. Quantization was one of
the chief considerations with developing this
converter, supporting both quantization-aware training
and post-training quantization. And it was based
on FlatBuffers. So FlatBuffers are an analog
to protobufs, which are used extensively in TensorFlow. But they were developed with
more real time considerations in mind, specifically
for video games. And the idea is that you
can take a FlatBuffer. You can map it into memory and
then read and interpret that directly. There’s no unpacking step. And this has a lot
of nice advantages. You can actually map this
into a page and it’s clean. It’s not a dirty page. You’re not dirtying
up your heap. And this is extremely important
in mobile environments where you are
constrained on memory. And often the app is going
in and out of foreground. And there’s low memory pressure. And there’s also a
smaller binary size impact when you use
FlatBuffers relative to protobufs. So the interpreter, again,
was built from the ground up with mobile devices in mind. It has fewer dependencies. We try not to depend on
really anything at base. We have very few
absolute dependencies. I already talked about
the binary size here. It’s quite a bit smaller than– the minimum binary size we were
able to get with TensorFlow Mobile was about three
megabytes for just the runtime. And that’s without
any operators. It was engineered
to start up quickly. That’s kind of a combination of
being able to map your models directly into memory but then
also having a static execution plan: during conversion,
we basically map out directly the sequence of
nodes that will be executed. And then for the
memory planning, basically there’s a pass when
you’re running your model where we prepare each operator. And they kind of queue up
a bunch of allocations. And those are all baked into
a single pass where we then allocate a single block
of memory and tensors are just fused into that large
contiguous block of memory. We don’t yet support
control flow. But I will be talking about
that later in the talk. It’s something that we’re
thinking about and working on. It’s on the near horizon
for actual shipping models. So what about the operator set? So we support float
and quantized types for most of our operators. A lot of these are backed by
hand-tuned NEON and assembly-based kernels that
are specifically optimized for ARM devices. Ruy is our newest GEMM
backend for TensorFlow Lite. And it was built from the
ground up with mobile execution in mind, a
[INAUDIBLE] execution. We support about 120
built-in operators right now. You will probably
realize that that’s quite a bit smaller than the set
of TensorFlow ops, which is probably into the
thousands by now. I’m not exactly sure. So that can cause problems. But I’ll dig into some solutions
we have on the table for that. I already talked about
some of the benefits of these high level
kernels having fused activations and biases. And then we have a way for you
to kind of, at conversion time, stub out custom operators
that you would like. Maybe we don’t yet
support them in TF Lite or maybe it’s a one off
operator that’s not yet supported in TensorFlow. And then you can plug-in
your operator implementation at runtime. So the hardware
acceleration interface, we call them delegates. This is basically
an abstraction that allows you to plug
in and accelerate subgraphs of the overall graph. We have NNAPI, GPU, EdgeTPU,
and DSP backends on Android. And then on iOS, we have
a Metal delegate backend. And I’ll be digging into some
of these and their details here in a few slides. OK. So what can I do with it? Well, I mean this is largely
a lot of the same things that you can do with TensorFlow. There’s a lot of speech and
vision-related use cases. I think often we think
of mobile inference as being image classification
and speech recognition. But there are quite
a few other use cases that are being used
now and are in deployment. We’re being used broadly across
a number of both first party and third party apps. OK. So let’s start with models. We have a number of
models in this model repo that we host online. You can use models
that have already been authored in TensorFlow and
feed those into the converter. We have a number of
tools and tutorials on how you can apply transfer
learning to your models to make them more
specific to your use case, or you can author
models from scratch and then feed that into
the conversion pipeline. So let’s dig into conversion and
what that actually looks like. Well, here’s a
brief snippet of how you would take a saved model,
feed that into our converter, and output a TFLite model. It looks really simple. In practice, we
would like to say that this always just works. That’s sadly not yet a reality. There’s a number of failure
points that people run into. I’ve already highlighted
this mismatch in terms of supported operators. And that’s a big pain point. And we have some things in
the pipeline to address that. There’s also different semantics
in TensorFlow that aren’t yet natively supported in TFLite,
things like control-flow, which we’re working on,
things like assets, hash tables, TensorLists,
those kinds of concepts. Again, they’re not yet natively
supported in TensorFlow Lite. And then certain types
we just don’t support. They haven’t been prioritized
in TensorFlow Lite. You know, double execution,
or bfloat16, none of those, or even FP16 kernels are
not natively supported by the TFLite
built-in operators. So how can we fix that? Well, a number of months ago,
we started a project called– well, the name is
a little awkward. It’s using select TensorFlow
operators in TensorFlow Lite. And effectively,
what this does is it allows you to,
as a last resort, convert your model for
the set of operators that we don’t yet support. And then at runtime, you
could plug-in this TensorFlow select piece of code. And it would let you run
these TensorFlow kernels within the TFLite runtime at
the expense of a modest increase in your binary size. What does that actually mean? So the converter basically,
it recognizes these TensorFlow operators. And if you say, I
want to use them, if there’s no TFLite
built-in counterpart, then it will take that node def. It’ll bake it in to the TFLite
custom operator that’s output. And then at runtime,
we have a delegate which resolves this
custom operator and then does some
data marshaling into the eager execution of
TensorFlow, which again would be built into the TFLite
runtime and then marshaling that data back out into
the TFLite tensors. There’s some more information
that I’ve linked to here. And the way you can actually
take advantage of this, here’s our original Python
conversion script. You drop in this
line basically saying the target ops set includes
these select TensorFlow ops. So that’s one thing that
can improve the conversion and runtime experience
for models that aren’t yet natively supported. Another issue that
we’ve had historically– our converter was called TOCO. And its roots were in
this TFMini project, which was trying to
statically compute and bake this graph into your runtime. And it was OK for it to
fail because it would all be happening at build time. But what we saw is
that that led to a lot of hard to decipher opaque
error messages and crashes. And we’ve since set out to
build a new converter based on MLIR, which is
just basically tooling that’s feeding
into this converter helping us map from the
TensorFlow dialect of operators to a TensorFlow Lite
dialect of operators with far more formal
mechanisms for translating between the two. And this, we think will give
us far better debugging, and error messages,
and hints on how we can actually fix conversion. And the other reason that
motivated this switch to a new converter was
to support control flow. This will initially start by
supporting functional control flow forms, so if and
while conditionals. We’re still considering
how we can potentially map legacy control flow
forms to these new variants. But this is where
we’re going to start. And so far, we see
that this will unlock a pretty large class
of useful models, the RNN class type
models that so far have been very difficult to
convert to TensorFlow Lite. TensorFlow 2.0. It’s supported. There’s not a whole lot that
changes on the conversion end and certainly nothing
that changes on the TFLite end except for maybe the
change that saved model is now the primary serialization
format with TensorFlow. And we’ve also made a few
tweaks and added some sugar for our conversion APIs
when using quantization. OK. So you’ve converted your model. How do you run it? Here’s an example of
our API usage in Java. You basically create your input
buffer, your output buffer. It doesn’t necessarily
need to be a byte buffer. It could be a single or
multidimensional array. You create your interpreter. You feed it your TFLite model. There are some options
that you can give it. And we’ll get to those later. And then you run inference. And that’s about it. We have different bindings
for different platforms. Our first class bindings
are Python, C++, and Java. We also have a set of
experimental bindings that we’re working on, which are in
various states of both use and stability. But soon we plan to have
our Objective C and Swift bindings be stable. And they’ll be available as
the normal deployment libraries that you would get
on iOS via CocoaPods. And then for Android, you can
use our JCenter or Bintray AARs for Java. But those are primarily focused
on third party developers. There are other ways
you can actually reduce the binary size of TFLite. I mentioned that the core
runtime is 100 kilobytes. There’s about 800
or 900 kilobytes for the full set of operators. But there are ways
that you can basically trim that down and only include
the operators that you use. And everything else gets
stripped by the linker. We expose a few build
rules that help with this. You feed it your TFLite model. It’ll parse that in output. Basically, a CC file, which
does the actual op registration. And then you can
rely on your linker to strip the unused kernels. OK. So you’ve got your
model converted. It’s up and running. How do you make it run fast? So we have a number of
tools to help with this. We have a number of backends
that I talked about already. And I’ll be digging
into a few of these to highlight how they can
help and how you can use them. So we have a benchmarking tool. It allows you to identify
bottlenecks when actually deploying your model
on a given device. It can output profiles
for which operator’s taking the most amount of time. It lets you plug in
different backends and explore how this actually
affects inference latency. Here’s an example
of how you would build this benchmark tool. You would push it
to your device. You would then run it. You can give it different
configuration options. And we have some helper
scripts that kind of help do this all atomically for you. What does the output look like? Well, here you can get
a breakdown of timing for each operator in
your execution plan. You can isolate
bottlenecks here. And then you get a nice summary
of where time is actually being spent. AUDIENCE: In the
information, there is just about operation
type or we also know if it’s the earlier
convolution of the network or the later convolutions in the
network or something like that? JARED DUKE: Yeah. So there’s two breakdowns. One is the run
order which actually is every single
operator in sequence. And then there’s
the summary where it coalesces each operator
into a single class. And you get a nice
summary there. So this is useful for, one,
identifying bottlenecks. If you have control over a graph
and then the authoring side of things, then you
can maybe tailor the topology of your graph. But otherwise, you can file
a bug on the TFLite team. And we can investigate
these bottlenecks and identify where there’s
room for improvement. But it also affords– it affords you, I
guess, the chance to explore some of the more
advanced performance techniques like using these
hardware accelerators. I talked about delegates. The real power, I
think, of delegates is that it’s a nice way
to holistically optimize your graph for a given backend. That is you’re not just
delegating each op one by one to this hardware accelerator. But you can take an entire
subgraph of your graph and run that on an accelerator. And that’s particularly
advantageous for things like GPUs or neural
accelerators where you want to do as much
computation on the device as possible with no CPU
interop in between. So NNAPI is the abstraction in
Android for accelerating ML. And it was actually developed
fairly closely in tandem with TFLite. You’ll see a lot of similarities
in the high level op definitions that
are found in NNAPI and those found in TFLite. And this is effectively
an abstraction layer at the platform level that we
can hook into on the TensorFlow Lite side. And then vendors can plug
in their particular drivers for DSP, for GPUs. And with Android Q, it’s really
getting to a nice stable state where it’s approaching parity
in terms of features and ops with TensorFlow Lite. And there are increasingly– there’s increased adoption
both in terms of user base but also in terms
of hardware vendors that are contributing to
these drivers. More recently, we’ve released our GPU back
end, and we’ve also open sourced it. This can yield a pretty
substantial speedup on many floating point
convolution models, particularly larger models. There is a small binary size
cost that you have to pay. But if it’s a good
match for your model, then it can be a huge win. And this is– we found
a number of clients that are deploying this with
things like face detection and segmentation. AUDIENCE: Because if you’re
on top of [INAUDIBLE] GPU. JARED DUKE: Yeah, so on Android,
there’s a GLES back end. There’s also an
OpenCL back end that’s in development that
will afford a kind of 2 to 3x speed up over
the GLES back end. There’s also a Vulkan
back end, and then on iOS, it’s Metal-based. There’s other delegates
and accelerators that are in various
states of development. One is for the Edge TPU
project, which can either use kind of runtime
on device compilation, or you can use or take
advantage of the conversion step to bake the compiled model
into the TFLite graph itself. We also announced,
at Google I/O, support for Qualcomm’s
Hexagon DSPs that we’ll be releasing
publicly soon-ish. And then there’s some more
kind of exotic optimizations that we’re making for the
floating point CPU back end. So how do you take advantage
of some of these back ends? Well, here is kind
of our standard usage of the Java APIs for inference. If you want to use NnAPI, you
create your NnAPI delegate. You feed it into your model
options, and away you go. And it’s quite similar for
using the GPU back end. There are some more
sophisticated and advanced techniques for both NNAPI
and GPU interop. This is one example where you
can basically use a GL texture as the input to your graph. That way, you avoid
needing to copy– marshal data back and
forth from CPU to GPU. What are some other things
we’ve been working on? Well, the default out
of the box performance is something that’s critical. And we recently landed a pretty
substantial speed up there with this ruy library. Historically, we’ve used
what’s called gemmlowp for quantized matrix
multiplication, and then Eigen for floating
point multiplication. Ruy was built from the
ground up basically to [INAUDIBLE]
throughput much sooner in terms of the size of the
inputs to, say, a given matrix multiplication
operator, whereas more desktop and cloud-oriented
matrix multiplication libraries are focused on peak
performance with larger sizes. And we found this, for a large
class of convolution models, is providing at
least a 10% speed-up. But then on kind of our
multi-threaded floating point models, we see two to
three times the speed-up, and then the same on more recent
hardware that has these NEON dot product intrinsics. There’s some more
optimizations in the pipeline. We’re also looking
at different types, like sparse and fp16 tensors, to take
advantage of mobile hardware, and we’ll be announcing
related tooling and feature support soon-ish. OK, so a number of
best practices here to get the best
performance possible– just pick the right model. We find a lot of developers
come to us with Inception, and it’s hundreds of megabytes. And it takes seconds
to run inference, when they can get
just as good accuracy, sometimes even better, with
an equivalent MobileNet model. So that’s a really
important consideration. We have tools to improve
benchmarking and profiling. Take advantage of
quantization where possible. I’m going to dig into
this in a little bit how you can actually
use quantization. And it’s really a
topic for itself, and there will be,
I think, a follow-up session about quantization. But it’s a cheap way of
reducing the size of your model and making it run faster
out of the box on CPU. Take advantage of
accelerators, and then for some of these
accelerators, you can also take advantage of zero copy. So with this kind of
library of accelerators and many different permutations
of quantized or floating point models, it can be quite daunting
for many developers, probably most developers,
to figure out how best to optimize their model
and get the best performance. So we’re thinking of some more
and working on some projects to make this easy. One is just accelerator
whitelisting. When is it better to use, say,
a GPU or NNAPI versus the CPU, both local tooling to
identify that for, say, a device you plugged
into your dev machine or potentially as a service,
where we can farm this out across a large bank of devices
and automatically determine this. There’s also cases where you may
want to run parts of your graph on different accelerators. Maybe parts of it map
better to a GPU or a DSP. And then there’s also the issue
of when different apps are running ML simultaneously, so
you have hotword detection running at the same time you’re
running selfie segmentation with a camera feed. And they’re both trying to
access the same accelerator. How can you coordinate efforts
to make sure everyone’s playing nicely? So these are things
we’re working out. We plan on releasing tooling
that can improve this over the next quarter or two. So we talked about quantization. There are a number of
tools available now to make this possible. There are a number of
things being worked on. In fact, yesterday,
we just announced our new post-training
quantization that does full quantization. I’ll be talking
about that more here in the next couple of slides. Actually, going
back a bit, we’ve long had what’s called our
legacy quantized training path, where you would instrument
your graph at authoring time with these fake quant nodes. And then you could use
that to actually generate a fully quantized model as
the output from the TFLite conversion process. And this worked quite
well, but it was– it can be quite painful
to use and quite tedious. And we’ve been
working on tooling to make that a lot easier
to get the same performance both in terms of
model size reduction and runtime
acceleration speed-up. AUDIENCE: Is part about
the accuracy– it seems like training time [INAUDIBLE]. JARED DUKE: Yeah,
you generally do. So we first introduced this
post-training quantization path, which is hybrid, where we
are effectively just quantizing the weights, and then
dequantizing that at runtime and running everything
in fp32, and there was an accuracy hit here. It depends on the
model, how bad that is, but sometimes it
was far enough off the mark from quantization
aware training that it was not usable. And so that’s where– so again, with the
hybrid quantization, there’s a number of benefits. I’m flying through slides
just in the interest of time. The way to enable that
post-training quantization– you just add a flag to
the conversion paths, and that’s it. But on the accuracy side,
that’s where we came up with some new tooling. We’re calling it per axis or
per channel quantization, where with the weights,
you wouldn’t just have a single quantized
set of parameters for the entire
tensor, but it would be per channel in the tensor. And we found that that, in
combination with feeding it kind of an evaluation data
set during conversion time, where you would explore the
range of possible quantization parameters, we could get
accuracy that’s almost on par with quantization
aware training. AUDIENCE: I’m curious, are
some of these techniques also going to be used
for TensorFlow JS, or did they not have this– do they not have similarities? They use MobileNet,
right, for a browser? JARED DUKE: They do. These aren’t yet, as far as
I’m aware, used or hooked into the TFJS pipeline. There’s no reason
it couldn’t be. I think part of the problem
is just very different tool chains for development. But– AUDIENCE: How do you
do quantized operations in JavaScript? [INAUDIBLE] JARED DUKE: Yeah, I mean
I think the benefit isn’t as clear, probably not as
much as if you were just quantizing to fp16. That’s where you’d probably
get the biggest win for TFJS. In fact, I left it
out of these slides, but we are actively working
on fp16 quantization. You can reduce the size
of your model by half, and then it maps really
well to GPU hardware. But I think one thing
that we want to have is that quantization is
not just a TFLite thing, but it’s kind of a
universally shared concept in the TensorFlow ecosystem. And how can we take the
tools that we already have that are sort
of coupled to TFLite and make them more
generally accessible? So to use this new
post-training quantization path, where you can get
comparable accuracy to training time quantization, effectively,
the only difference here is feeding in this
representative data set of what the inputs would
look like to your graph. It can be a– for an
image-based model, maybe you feed it 30 images. And then it is able to explore
the space of quantization and output values
that would largely match or be close
to what you would get with quantization-aware
training. We have lots of
documentation available. We have a model repo that
we’re going to be investing heavily in to expand this. What we find is that a lot
of TensorFlow developers– or not even TensorFlow
developers– app developers will find some random graph when
they search Google or GitHub. And they try to convert
it, and it fails. And a lot of times,
either we have a model that’s
already been converted or a similar model that’s
better suited for mobile. And we would rather have
a very robust repository that people start
with, and then only if they can’t find
an equivalent model, they resort to our conversion
tools or even authoring tools. AUDIENCE: Is there a TFLite
compatible section in TFHub? JARED DUKE: Yeah,
we’re working on that. Talked about the
model repo training. So what if you want to
do training on device? That is a thing. We have an entire project
called The [INAUDIBLE] Team Federated Learning
Team, who’s focused on this. But we haven’t supported
this in TensorFlow Lite for a number of reasons,
but it’s something that we’re working on. There’s quite a few bits and
components that still have yet to land to support
this, but it’s something that we’re thinking about,
and there is increasing demand for this kind of on-device
tuning or transfer learning scenario. In fact, this is something
that was announced at WWDC, so. So we have a roadmap up. It’s now something that
we publish publicly to make it clear what
we’re working on, what our priorities are. I touched on a lot
of the things that are in the pipeline, things
like control flow and training, improving our runtime. Another thing that we
want to make easier is just to use TFLite in
the kind of native types that you are used to using. If you’re an Android developer,
say, if you have a bitmap, you don’t want to convert
it to a byte buffer. You just want to feed us your
bitmap, and things just work. So that’s something
that we’re working on. A few more links here
to authoring apps with TFLite, different roadmaps
for performance and model optimization. That’s it. So any questions,
any areas you’d like to dive into more deeply? AUDIENCE: So this [INAUDIBLE]. So what is [INAUDIBLE] has more
impact like a fully connected [INAUDIBLE]? JARED DUKE: Sorry. What’s– AUDIENCE: For a speed-up. JARED DUKE: Oh. Why does it? AUDIENCE: Yeah. JARED DUKE: So certain
operators have been, I guess, more optimized to
take advantage of quantization than others. And so in the hybrid
quantization path, we’re not always doing
computation in eight-bit types. We’re doing it in a mix of
floating point and eight-bit types, and that’s why there’s
not always the same speed-up you would get with like an LSTM
and an RNN versus a [INAUDIBLE] operator. AUDIENCE: So you
mentioned that TFLite is on billions of mobile devices. How many apps have you seen
added to the Play Store that have TFLite in them? JARED DUKE: Tim would
have the latest numbers. It’s– I want to say it’s
into the tens of thousands, but I don’t know
that I can say that. It’s certainly in the
several thousands, but we’ve seen pretty
dramatic uptick, though, just tracking Play Store analytics. AUDIENCE: And in the near
term, are you thinking more about trying to increase
the number of devices that are using TFLite or
trying to increase the number of developers
that are including it in the applications
that they built. JARED DUKE: I think both. I mean there are projects
like the TF Micro, where we want to support
actual microcontrollers and running TFLite on extremely
restricted, low power arm devices. So that’s one class
of efforts on– we have seen demand for actually
running TFLite in the cloud. There’s a number of
benefits with TFLite like the startup time, the
lower memory footprint that do make it attractive. And some developers
actually want to run the same model they’re
running on device in the cloud, and so there is
demand for having like a proper x86
optimized back end. But at the same time, I
think one of our big focuses is just making it easier
to use– meet developers where they’re at. And part of that is
a focus on creating a very robust model
repository and more idiomatic APIs they can use
on Android or iOS and use the types
they’re familiar with, and then just making
conversion easy. Right now, if you do take
kind of a random model that you found off
the cloud and try to feed it into our
converter, chances are that it will probably fail. And some of that is
just teaching developers how to convert just the
part of the graph they want, not necessarily all of the
training that’s surrounding it. And part of it is just
adding the features and types to TFLite that would match
the semantics of TensorFlow. I mean, I will say
that in the long run, we want to move toward a more
unified path with TensorFlow and not live in somewhat
disjoint worlds, where we can take advantage
of the same core runtime libraries, the same core
conversion pipelines, and optimization pipelines. So that’s things that we’re
thinking about for the longer term future. AUDIENCE: Yeah, and
also [INAUDIBLE] like the longer term. I’m wondering what’s the
implication of the ever increasing network speed
on the [INAUDIBLE] TFLite? [INAUDIBLE], which maybe
[INAUDIBLE] faster than current that we’ve [INAUDIBLE]
take [INAUDIBLE] of this. JARED DUKE: We haven’t thought
a whole lot about that, to be honest. I mean, I think we’re still
kind of betting on the reality that there will always be
a need for on device ML. I do think, though,
that 5G probably unlocks some interesting
hybrid scenarios, where you’re doing some on device,
some cloud-based ML, and I think for a while, the
fusion of on device hotword detection, as soon as the
OK, Google is detected, then it starts feeding
things into the cloud. That’s kind of an example
of where there is room for these hybrid solutions. And maybe those will become
more and more practical. Everyone is going
to run to your desk and start using TensorFlow
Lite after this? AUDIENCE: You probably
already are, right? [INAUDIBLE] if you have one of
the however many apps that was listed on Tim’s slide, right? JARED DUKE: I mean, yeah. If you’ve ever done,
OK, Google, then you’re using TensorFlow Lite. AUDIENCE: [INAUDIBLE]. Thank you. JARED DUKE: Thank you. [APPLAUSE]
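
The conversion path described in the talk (a saved model goes into the converter and a TFLite model comes out, with select TensorFlow ops as a last-resort fallback and post-training quantization driven by a small representative data set) might look roughly like the following in Python. This is a minimal sketch, not the snippet from the slides. It assumes the TensorFlow 2.x `tf.lite` API, whose attribute names may differ from those shown at the time of the talk, and `convert_saved_model`, `make_representative_dataset`, and their parameters are hypothetical names chosen for illustration.

```python
def make_representative_dataset(samples):
    """Wrap calibration samples in the generator form the converter expects.

    Each yielded item is a list with one entry per model input; this sketch
    assumes a single-input model.
    """
    def generator():
        for sample in samples:
            yield [sample]
    return generator


def convert_saved_model(saved_model_dir,
                        allow_select_tf_ops=False,
                        quantize=False,
                        calibration_samples=None):
    """Convert a SavedModel directory to a serialized TFLite model (bytes)."""
    # Imported lazily so the helper above carries no TensorFlow dependency.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)

    if allow_select_tf_ops:
        # Fall back to TensorFlow kernels for ops without a TFLite builtin,
        # at the cost of a larger binary.
        converter.target_spec.supported_ops = [
            tf.lite.OpsSet.TFLITE_BUILTINS,
            tf.lite.OpsSet.SELECT_TF_OPS,
        ]

    if quantize:
        # Without calibration data this requests weight-only (hybrid)
        # quantization; with it, the converter can also calibrate activations.
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        if calibration_samples is not None:
            converter.representative_dataset = make_representative_dataset(
                calibration_samples)

    return converter.convert()
```

A call such as `convert_saved_model("/path/to/saved_model", quantize=True, calibration_samples=images)` would return the serialized model as bytes, which can then be written to a `.tflite` file. Each calibration sample is assumed to be a float32 NumPy array shaped like the model input.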

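The per-channel (per-axis) weight quantization mentioned in the talk can be illustrated with plain arithmetic. The sketch below assumes symmetric 8-bit quantization with integer values in [-127, 127] and compares a single scale for the whole tensor against one scale per channel. The function names and the toy weights are hypothetical, and the real converter's scheme may differ in detail.

```python
def scale_for(values):
    """Symmetric scale that maps the largest magnitude onto 127."""
    peak = max(abs(v) for v in values)
    return peak / 127 if peak > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest integer step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def worst_error_per_channel(weights, per_channel):
    """Largest round-trip error in each channel after quantize + dequantize."""
    tensor_scale = scale_for([v for channel in weights for v in channel])
    errors = []
    for channel in weights:
        scale = scale_for(channel) if per_channel else tensor_scale
        quantized = quantize(channel, scale)
        errors.append(
            max(abs(v - q * scale) for v, q in zip(channel, quantized)))
    return errors


# Toy weights: one channel with large values, one with very small values.
weights = [[10.0, -7.5, 2.5], [0.02, -0.01, 0.005]]

per_tensor_errors = worst_error_per_channel(weights, per_channel=False)
per_channel_errors = worst_error_per_channel(weights, per_channel=True)

print("per-tensor :", per_tensor_errors)
print("per-channel:", per_channel_errors)
```

With a single tensor-wide scale, the small channel's weights all round to zero, so the error there is as large as the weights themselves. With one scale per channel, that error shrinks by orders of magnitude while the large channel is unaffected, which is consistent with the accuracy improvement described in the talk.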