Compiling TensorFlow Serving for GPU

Background

The business team deployed a BERT model, so we needed to build a GPU version of the TF Serving sidecar image.

Compilation

I found a way to compile directly inside Docker, without installing a pile of toolchains on the host: use the devel image.

The devel image

The TF Serving devel image ships with everything needed to compile TensorFlow Serving (bazel, gcc, glibc, and so on), which is why it is so large. Once the build is done, we copy the binary into the non-devel image and run it there.

Pull the images from Docker Hub: https://hub.docker.com/r/tensorflow/serving/tags?page=1&name=2.0

Pull these two:

tensorflow/serving   2.0.0-gpu           af288d8e0730        11 months ago       2.49GB
tensorflow/serving   2.0.0-devel-gpu     111028dae1da        11 months ago       11.8GB

Run

Start the container and open a shell in it:

docker run -itd --name tfs --network=host tensorflow/serving:2.0.0-devel-gpu /bin/bash
docker exec -it tfs /bin/bash

Modify the code inside the container, then build:

bazel build -c opt --config=cuda //tensorflow_serving/model_servers:tensorflow_model_server --verbose_failures
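
If the build succeeds, the binary ends up under the bazel-bin convenience symlink:

ls -lh bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server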

Problems

You will still run into all sorts of problems during the build; here are a few representative ones:

no such package

During the build you will hit this kind of error several times:

ERROR: /tensorflow-serving/tensorflow_serving/model_servers/BUILD:318:1: no such package '@grpc//': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz, https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz] to /root/.cache/bazel/_bazel_root/e53bbb0b0da4e26d24b415310219b953/external/grpc/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz: Tried to reconnect at offset 5,847,203 but server didn't support it and referenced by '//tensorflow_serving/model_servers:server_lib'
ERROR: Analysis of target '//tensorflow_serving/model_servers:tensorflow_model_server' failed; build aborted: no such package '@grpc//': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz, https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz] to /root/.cache/bazel/_bazel_root/e53bbb0b0da4e26d24b415310219b953/external/grpc/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz: Tried to reconnect at offset 5,847,203 but server didn't support it
INFO: Elapsed time: 1346.453s

Fix: retry a few times, or use one of the two approaches below.

Run a file server on the host

Set one up with nginx:

vim /usr/local/etc/nginx/nginx.conf
http {
    autoindex on;
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;

    keepalive_timeout  65;
    server {
        listen       8001;
        server_name  127.0.0.1;

        location / {
            root   <your_path>;
            index  index.html index.htm;
        }
    }
}
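
To actually make Bazel use the file server, one option is to pre-download the failing archive and let Bazel pick it up from a local directory. A sketch, assuming your Bazel version supports --distdir, with <host_ip> and <your_path> as placeholders:

# On the host: put the archive that keeps failing into the nginx root
cd <your_path>
curl -LO https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz

# In the container: fetch it from the file server, then tell Bazel to look in
# that directory (matched by file name and checksum) before hitting the network
mkdir -p /tmp/distdir && cd /tmp/distdir
curl -O http://<host_ip>:8001/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz
cd /tensorflow-serving
bazel build -c opt --config=cuda //tensorflow_serving/model_servers:tensorflow_model_server --distdir=/tmp/distdir

The alternative described in the references is to edit the download URLs in the workspace files to point at the file server directly.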
    

Use the host's proxy

  1. Find the host's IP address:
(base) ➜  bin ifconfig | grep "inet " | grep -v 127.0.0.1
	inet xxx.xxx.xxx.xxx netmask 0xfffffff0 broadcast xxx.xxx.xxx.xxx
	inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx
	inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx
  2. Set the proxy inside the container (see the caveat after this list):
export ALL_PROXY='socks5://xxx.xxx.xxx.xxx:1080'
  3. Check that it took effect:
curl cip.cc
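
One caveat: curl honors ALL_PROXY, but as far as I know Bazel's downloader only reads the HTTP(S) proxy variables and does not speak SOCKS5, so to fix the bazel download failures you may also need the proxy's HTTP endpoint (the port 1087 below is hypothetical):

# Hypothetical: assumes the host proxy also exposes an HTTP endpoint on 1087
export http_proxy='http://xxx.xxx.xxx.xxx:1087'
export https_proxy='http://xxx.xxx.xxx.xxx:1087'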

gcc: Internal error: Killed (program cc1)

Out of memory. Increase the memory allocated to Docker:
Preferences -> Advanced
I had to go up to 12 GB, plus 2 GB of swap, before the build would finish.
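
If you cannot give Docker more memory, a complementary trick is to throttle Bazel's parallelism so peak memory stays lower, at the cost of a slower build; --jobs is a standard Bazel flag:

bazel build -c opt --config=cuda //tensorflow_serving/model_servers:tensorflow_model_server --jobs=4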

can not be used when making a shared object; recompile with -fPIC

/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow_serving/model_servers/_objs/tensorflow_model_server/tensorflow_serving/model_servers/version.o: relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
bazel-out/k8-opt/bin/tensorflow_serving/model_servers/_objs/tensorflow_model_server/tensorflow_serving/model_servers/version.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status

I found a fix on GitHub: https://github.com/netfs/serving/commit/be7c70d779a39fad73a535185a4f4f991c1d859a, but the local code already contained it. In the end the build only succeeded after I removed the version stamping:

Changes to the BUILD file:

cc_library(
    name = "tensorflow_model_server_main_lib",
    srcs = [
        "main.cc",
    ],
    #hdrs = [
    #    "version.h",
    #],
    #linkstamp = "version.cc",
    visibility = [
        ":tensorflow_model_server_custom_op_clients",
        "//tensorflow_serving:internal",
    ],
    deps = [
        ":server_lib",
        "@org_tensorflow//tensorflow/c:c_api",
        "@org_tensorflow//tensorflow/core:lib",
        "@org_tensorflow//tensorflow/core/platform/cloud:gcs_file_system",
        "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system",
        "@org_tensorflow//tensorflow/core/platform/s3:s3_file_system",
    ],
)

Changes to main.cc:

//#include "tensorflow_serving/model_servers/version.h"

...
if (display_version) {
    std::cout << "TensorFlow ModelServer: " << "r1.12" << "\n"
              << "TensorFlow Library: " << TF_Version() << "\n";
    return 0;
  }

Saving the image

After the build finishes, commit the container:

docker commit -a "xxx" -m "tfserving gpu build" b629d5936020 tensorflow/serving:2.0.0-devel-gpu-build

Exporting and importing the image:

docker save -o xxx.tar tensorflow/serving:mkl
docker load -i xxx.tar
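
As mentioned at the start, the leaner option is to copy just the compiled binary into the non-devel image. A sketch, assuming the standard layout of the 2.0.0 images (source tree at /tensorflow-serving in the devel container, server binary at /usr/bin/tensorflow_model_server in the slim one):

# Copy the binary out of the devel container (via /tmp to sidestep the bazel-bin symlink)
docker exec tfs cp /tensorflow-serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /tmp/
docker cp tfs:/tmp/tensorflow_model_server .

# Overwrite the binary in a container based on the slim GPU image, then commit it
docker create --name tfs-slim tensorflow/serving:2.0.0-gpu
docker cp tensorflow_model_server tfs-slim:/usr/bin/tensorflow_model_server
docker commit tfs-slim tensorflow/serving:2.0.0-gpu-custom
docker rm tfs-slim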

Launch parameters

sudo nvidia-docker run -p 8500:8500 \
  --mount type=bind,source=xxx/models,target=xxx \
  -t --entrypoint=tensorflow_model_server tensorflow/serving:latest-gpu \
  --port=8500 --per_process_gpu_memory_fraction=0.5 \
  --enable_batching=true --model_name=east --model_base_path=/models/east_model &

What the parameters mean:

  • -p 8500:8500: publish the gRPC port 8500.
  • --mount type=bind,source=/your/local/model,target=/models: mount your exported local model directory at /models inside the container; TensorFlow Serving looks for models under /models.
  • -t --entrypoint=tensorflow_model_server tensorflow/serving:latest-gpu: with the non-devel image you cannot get a bash shell inside the container; --entrypoint lets you "indirectly" enter it by invoking the tensorflow_model_server command to start TensorFlow Serving, which is what makes the flags that follow take effect. The image tensorflow/serving:latest-gpu can be swapped for any version you like.
  • --port=8500: serve gRPC on port 8500 (requires the entrypoint flag above; the same goes for the flags below).
  • --per_process_gpu_memory_fraction=0.5: the fraction of GPU memory the model is allowed to use, a value in [0, 1].
  • --enable_batching: enable batched inference to improve GPU utilization (see the sketch after this list).
  • --model_name: the model's name, as set when the model was exported.
  • --model_base_path: the model's path inside the container. The mount above put everything under /models; this narrows it down to one model directory, e.g. /models/east_model means "serve the model in the /models/east_model directory".
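
On batching: --enable_batching alone uses default parameters; it can be tuned with --batching_parameters_file, which takes a text-format proto. A sketch with illustrative values, not recommendations:

# Write a batching config into the mounted models directory
cat > /models/batching.conf <<'EOF'
max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }
num_batch_threads { value: 4 }
EOF
# Then append to the server flags:
#   --batching_parameters_file=/models/batching.conf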

Building with the repo's own tooling

You can also build the TF Serving image with the tooling that ships in the repo:

Get the code

git clone --recurse-submodules https://github.com/tensorflow/serving.git
cd serving
git checkout r2.0

Build the ModelServer

After modifying the code, build an optimized ModelServer.

CPU version:

docker build --pull -t $USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.devel .

If the machine has Intel's MKL library installed (reportedly faster than the open-source OpenBLAS), you can use:

docker build --pull -t $USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.devel-mkl .

GPU version:

docker build --pull -t $USER/tensorflow-serving-devel-gpu \
  -f tensorflow_serving/tools/docker/Dockerfile.devel-gpu .

Whichever variant you pick, this builds the $USER/tensorflow-serving-devel image (or $USER/tensorflow-serving-devel-gpu for the GPU build).

Build the TensorFlow Serving image

Next, use the $USER/tensorflow-serving-devel image built above to build the TensorFlow Serving image.

CPU version:

docker build -t $USER/tensorflow-serving \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile .

For the MKL CPU version:

docker build -t $USER/tensorflow-serving \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.mkl .

GPU version:

docker build -t $USER/tensorflow-serving \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel-gpu \
  -f tensorflow_serving/tools/docker/Dockerfile.gpu .
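
A quick smoke test of the finished image, overriding the entrypoint so the binary just prints its version and exits (assuming the server's --version flag behaves as in the main.cc snippet above):

docker run --rm --entrypoint=tensorflow_model_server $USER/tensorflow-serving --version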

ref

File server: https://blog.csdn.net/qq_39567427/article/details/104877041
bazel: https://blog.gmem.cc/bazel-study-note
https://www.cnblogs.com/zjutzz/p/10305995.html
Pointing bazel at a file server for downloads: http://www.jeepxie.net/article/392509.html
Using the host's proxy inside Docker: https://arminli.com/blog/183
https://www.jianshu.com/p/01f0ee9086e2
fPIC: https://www.cnblogs.com/zl1991/p/11465111.html
http://webcache.googleusercontent.com/search?q=cache:ZulKFDzVupwJ:fancyerii.github.io/books/tfserving-docker/+&cd=4&hl=zh-CN&ct=clnk&gl=us
