Background
A business team deployed a BERT model, so we needed to build a GPU version of the TF Serving sidecar image.
Building
There is a way to build directly inside Docker, without installing anything on the host: use the devel variant of the image.
The devel image
The TF Serving devel image ships with everything needed to compile TensorFlow Serving (bazel, gcc, glibc, and so on), which makes it very large. Once the build is done, we copy the binary into the non-devel image and run it there.
Pull the images from Docker Hub: https://hub.docker.com/r/tensorflow/serving/tags?page=1&name=2.0
Pull these two:
tensorflow/serving 2.0.0-gpu af288d8e0730 11 months ago 2.49GB
tensorflow/serving 2.0.0-devel-gpu 111028dae1da 11 months ago 11.8GB
Running
Start and enter the container:
docker run -itd --name tfs --network=host tensorflow/serving:2.0.0-devel-gpu /bin/bash
docker exec -it tfs /bin/bash
Modify the code inside the container, then build:
bazel build -c opt --config=cuda //tensorflow_serving/model_servers:tensorflow_model_server --verbose_failures
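If the build succeeds, bazel drops the binary under the bazel-bin convenience symlink. A quick smoke test (paths assumed from bazel's default output layout, not spelled out in the original steps):
ls -lh bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --version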
Issues
You will still run into assorted problems during the build; here are a few representative ones:
no such package
Errors like the following come up several times during the build:
ERROR: /tensorflow-serving/tensorflow_serving/model_servers/BUILD:318:1: no such package '@grpc//': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz, https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz] to /root/.cache/bazel/_bazel_root/e53bbb0b0da4e26d24b415310219b953/external/grpc/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz: Tried to reconnect at offset 5,847,203 but server didn't support it and referenced by '//tensorflow_serving/model_servers:server_lib'
ERROR: Analysis of target '//tensorflow_serving/model_servers:tensorflow_model_server' failed; build aborted: no such package '@grpc//': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz, https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz] to /root/.cache/bazel/_bazel_root/e53bbb0b0da4e26d24b415310219b953/external/grpc/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz: Tried to reconnect at offset 5,847,203 but server didn't support it
INFO: Elapsed time: 1346.453s
Fix: retry a few times, or use one of the two approaches below.
Run a file server on the host
Set one up with nginx (how to point bazel at it is sketched after the config):
vim /usr/local/etc/nginx/nginx.conf
http {
    autoindex on;
    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    keepalive_timeout 65;

    server {
        listen 8001;
        server_name 127.0.0.1;
        location / {
            root <your_path>;
            index index.html index.htm;
        }
    }
}
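To make bazel actually use the file server, pre-download the archive that keeps failing into the nginx root and point the failing rule at the local mirror. The snippet below is a sketch: the exact file to edit depends on which package failed (grepping the source tree for the archive name is the quickest way to find it), and since the container runs with --network=host, 127.0.0.1:8001 is reachable from inside it.
# on the host: fetch the archive that keeps timing out into the nginx root
cd <your_path>
wget https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz
nginx -s reload    # or plain "nginx" if it is not running yet

# in the container: prepend the local mirror to the urls list of the failing rule, e.g.
# urls = [
#     "http://127.0.0.1:8001/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz",
#     ...the original urls...
# ],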
Use the host's proxy
- Find the host's IP
(base) ➜ bin ifconfig | grep "inet " | grep -v 127.0.0.1
inet xxx.xxx.xxx.xxx netmask 0xfffffff0 broadcast xxx.xxx.xxx.xxx
inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx
inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx
- Set the proxy after entering the container (see the note after this list)
export ALL_PROXY='socks5://xxx.xxx.xxx.xxx:1080'
- Check that it took effect:
curl cip.cc
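Note that curl honors ALL_PROXY, but bazel's downloader runs on the JVM and generally reads HTTP_PROXY/HTTPS_PROXY instead. If your proxy also exposes an HTTP endpoint, export those too (the port below is an assumption; use whatever your proxy listens on):
export HTTP_PROXY='http://xxx.xxx.xxx.xxx:1087'
export HTTPS_PROXY='http://xxx.xxx.xxx.xxx:1087'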
gcc: Internal error: Killed (program cc1)
Out of memory. Give the Docker VM more memory:
Preferences -> Advanced
In my case the build only finished after raising memory to 12G and giving it 2G of swap.
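If you cannot give Docker more memory, an alternative (my own workaround, not from the original steps) is to throttle bazel's parallelism so fewer compiler processes run at once, trading build time for peak memory:
bazel build -c opt --config=cuda --jobs=4 //tensorflow_serving/model_servers:tensorflow_model_server --verbose_failures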
can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow_serving/model_servers/_objs/tensorflow_model_server/tensorflow_serving/model_servers/version.o: relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
bazel-out/k8-opt/bin/tensorflow_serving/model_servers/_objs/tensorflow_model_server/tensorflow_serving/model_servers/version.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
I found a relevant fix on GitHub: https://github.com/netfs/serving/commit/be7c70d779a39fad73a535185a4f4f991c1d859a, but the local code already contained it. The version.o in the error comes from the linkstamp compilation, which apparently does not get the -fPIC flag the rest of the build uses, so in the end I got past it by stripping out the version stamp:
Changes to the BUILD file:
cc_library(
    name = "tensorflow_model_server_main_lib",
    srcs = [
        "main.cc",
    ],
    #hdrs = [
    #    "version.h",
    #],
    #linkstamp = "version.cc",
    visibility = [
        ":tensorflow_model_server_custom_op_clients",
        "//tensorflow_serving:internal",
    ],
    deps = [
        ":server_lib",
        "@org_tensorflow//tensorflow/c:c_api",
        "@org_tensorflow//tensorflow/core:lib",
        "@org_tensorflow//tensorflow/core/platform/cloud:gcs_file_system",
        "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system",
        "@org_tensorflow//tensorflow/core/platform/s3:s3_file_system",
    ],
)
Changes to main.cc:
//#include "tensorflow_serving/model_servers/version.h"
...
  if (display_version) {
    std::cout << "TensorFlow ModelServer: " << "r1.12" << "\n"
              << "TensorFlow Library: " << TF_Version() << "\n";
    return 0;
  }
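With this change the ModelServer version is hard-coded; I would expect --version to print something like the following (TF_Version() comes from the linked TensorFlow, so "2.0.0" here is an assumption based on the source tag, not captured output):
$ tensorflow_model_server --version
TensorFlow ModelServer: r1.12
TensorFlow Library: 2.0.0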
Saving the image
Once the build is done, commit the container:
docker commit -a "xxx" -m "tfserving gpu build" b629d5936020 tensorflow/serving:2.0.0-devel-gpu-build
Export/import the image with:
docker save -o xxx.tar tensorflow/serving:mkl
docker load -i xxx.tar
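As mentioned at the top, the devel image is ~11.8GB, so rather than shipping the committed devel image you can move just the binary into the slim runtime image. A minimal sketch, assuming the source tree sits at /tensorflow-serving in the devel container (as the error logs above suggest) and that the official runtime image keeps the binary at /usr/bin/tensorflow_model_server:
# -L follows the bazel-bin symlink inside the container
docker cp -L tfs:/tensorflow-serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server .
docker create --name tfs-slim tensorflow/serving:2.0.0-gpu
docker cp tensorflow_model_server tfs-slim:/usr/bin/tensorflow_model_server
docker commit tfs-slim tensorflow/serving:2.0.0-gpu-build
docker rm tfs-slim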
Launch parameters
sudo nvidia-docker run -p 8500:8500 \
--mount type=bind,source=xxx/models,target=xxx \
-t --entrypoint=tensorflow_model_server tensorflow/serving:latest-gpu \
--port=8500 --per_process_gpu_memory_fraction=0.5 \
--enable_batching=true --model_name=east --model_base_path=/models/east_model &
What the parameters mean:
- -p 8500:8500: exposes the gRPC port 8500.
- --mount type=bind,source=/your/local/model,target=/models: bind-mounts your exported model directory into /models inside the container; TensorFlow Serving looks up your model under /models.
- -t --entrypoint=tensorflow_model_server tensorflow/serving:latest-gpu: with a non-devel image you cannot get a bash shell inside the container; --entrypoint lets you "indirectly" enter it by invoking the tensorflow_model_server command to start TensorFlow Serving, which is what makes the flags that follow take effect. The image tensorflow/serving:latest-gpu right after it can be swapped for whichever version you want.
- --port=8500: serves gRPC on port 8500 (requires the entrypoint flag above, as do the flags below).
- --per_process_gpu_memory_fraction=0.5: the fraction of GPU memory the model is allowed to use, a value in [0, 1].
- --enable_batching: enables batched inference for better GPU utilization (a sample batching config is sketched after this list).
- --model_name: the model name, as set when the model was exported.
- --model_base_path: the model's path inside the container. The mount above already placed things under /models; this narrows it down to one model directory, e.g. /models/east_model serves the model inside /models/east_model.
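For --enable_batching, the batching behavior can be tuned through a protobuf text file passed via --batching_parameters_file. A sample config (the values are illustrative, not from the original post):
# batching.cfg, passed as --batching_parameters_file=/models/batching.cfg
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 4 }
max_enqueued_batches { value: 100 }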
Building with the in-repo tooling
You can also build the TF Serving image with the tooling that ships in the source tree:
Clone the code
git clone --recurse-submodules https://github.com/tensorflow/serving.git
cd serving
git checkout r2.0
Build the ModelServer
After modifying the code, build an optimized ModelServer.
CPU version:
docker build --pull -t $USER/tensorflow-serving-devel \
    -f tensorflow_serving/tools/docker/Dockerfile.devel .
If the machine has Intel's MKL library installed (reportedly faster than the open-source OpenBLAS), use:
docker build --pull -t $USER/tensorflow-serving-devel \
    -f tensorflow_serving/tools/docker/Dockerfile.devel-mkl .
GPU version:
docker build --pull -t $USER/tensorflow-serving-devel-gpu \
    -f tensorflow_serving/tools/docker/Dockerfile.devel-gpu .
Whichever one you pick, the result is the $USER/tensorflow-serving-devel image (tagged $USER/tensorflow-serving-devel-gpu for the GPU variant).
Build the TensorFlow Serving image
Next, use the $USER/tensorflow-serving-devel image built above to build the TensorFlow Serving image.
CPU version:
docker build -t $USER/tensorflow-serving \
    --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel \
    -f tensorflow_serving/tools/docker/Dockerfile .
For the MKL CPU version:
docker build -t $USER/tensorflow-serving \
    --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel \
    -f tensorflow_serving/tools/docker/Dockerfile.mkl .
GPU version (note that it must consume the GPU devel image built above):
docker build -t $USER/tensorflow-serving \
    --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel-gpu \
    -f tensorflow_serving/tools/docker/Dockerfile.gpu .
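The resulting image can be launched the same way as the prebuilt one shown earlier. For example, assuming the stock entrypoint script from the official Dockerfile (model name and paths are placeholders):
sudo nvidia-docker run -p 8500:8500 \
    --mount type=bind,source=/your/local/east_model,target=/models/east_model \
    -e MODEL_NAME=east_model $USER/tensorflow-serving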
ref
File server: https://blog.csdn.net/qq_39567427/article/details/104877041
bazel: https://blog.gmem.cc/bazel-study-note
bazel: https://www.cnblogs.com/zjutzz/p/10305995.html
Making bazel download from a file server: http://www.jeepxie.net/article/392509.html
Using the host's proxy inside Docker: https://arminli.com/blog/183
Using the host's proxy inside Docker: https://www.jianshu.com/p/01f0ee9086e2