RESTART: Shell (cuDNN version mismatch)

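Problem description

  • Some TensorFlow programs run fine when launched directly from Python IDLE, but programs that use convolution make the shell exit and restart on its own. Symptom: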

    >>> 
    RESTART: G:\...\lenet5\mnist_lenet5_backward.py
    Extracting ./data/train-images-idx3-ubyte.gz
    Extracting ./data/train-labels-idx1-ubyte.gz
    Extracting ./data/t10k-images-idx3-ubyte.gz
    Extracting ./data/t10k-labels-idx1-ubyte.gz

    =============================== RESTART: Shell ===============================
    >>>
  • Running the same Python program from the command line shows the actual error. Symptom:

    G:\...\lenet5>python mnist_lenet5_backward.py
    Extracting ./data/train-images-idx3-ubyte.gz
    Extracting ./data/train-labels-idx1-ubyte.gz
    Extracting ./data/t10k-images-idx3-ubyte.gz
    Extracting ./data/t10k-labels-idx1-ubyte.gz
    2019-08-21 13:33:59.248820: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    2019-08-21 13:33:59.777150: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1105] Found device 0 with properties:
    name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 0.8605
    pciBusID: 0000:01:00.0
    totalMemory: 2.00GiB freeMemory: 1.66GiB
    2019-08-21 13:33:59.782067: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
    2019-08-21 13:34:01.714271: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:378] Loaded runtime CuDNN library: 7501 (compatibility version 7500) but source was compiled with 7003 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
    2019-08-21 13:34:01.725498: F C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\kernels\conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)

    Note the latter part of the output:

    Loaded runtime CuDNN library: 7501 (compatibility version 7500) but source was compiled with 7003 (compatibility version 7000).

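This says that the cuDNN library loaded at runtime is version 7.5.1 (7501), while this TensorFlow binary was compiled against cuDNN 7.0.3 (7003); the two versions are incompatible.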
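
To make the numbers in that message easier to read, here is a minimal sketch that pulls the two version codes out of the log line and splits them into major, minor, and patch parts. The helper names are hypothetical. The "compatibility version" rule (patch digits set to zero, then compared for equality) is an assumption inferred from the values printed in the log, not taken from the TensorFlow source.

```python
import re

# Message copied from the log above.
MESSAGE = ("Loaded runtime CuDNN library: 7501 (compatibility version 7500) "
           "but source was compiled with 7003 (compatibility version 7000).")


def decode_cudnn_version(code):
    """Split an integer code such as 7501 into (major, minor, patch)."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def compatibility_version(code):
    """Zero the patch digits, e.g. 7501 -> 7500 (rule inferred from the log)."""
    return code - code % 100


def parse_mismatch(message):
    """Return the (loaded, compiled) integer codes found in the log message."""
    match = re.search(r"library: (\d+) .*compiled with (\d+)", message)
    if match is None:
        raise ValueError("no cuDNN version pair found in message")
    return int(match.group(1)), int(match.group(2))


loaded, compiled = parse_mismatch(MESSAGE)
print("loaded  :", decode_cudnn_version(loaded))
print("compiled:", decode_cudnn_version(compiled))
print("compatible:",
      compatibility_version(loaded) == compatibility_version(compiled))
```

For the message above, the loaded library decodes to 7.5.1 and the compiled-against version to 7.0.3, so the check reports them as not compatible.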

Current environment

The environment in use is:

  • Windows 10 + Python 3.6.8
  • tensorflow-gpu 1.5.0
  • CUDA 9.0.176 + GeForce 940MX
  • cudnn-9.0-v7.5.1.10
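
A list like the one above can be collected with a short script. The following is only a sketch: the function names are hypothetical, and the DLL name `cudnn64_7.dll` is an assumption about how the cuDNN 7 library is named on Windows. It reports the OS, the Python version, the TensorFlow version if it can be imported, and every directory on PATH that contains the cuDNN DLL.

```python
import os
import platform


def find_on_path(filename, path=None):
    """Return every directory on PATH that contains `filename`, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


def environment_report(dll_name="cudnn64_7.dll"):
    """Collect OS, Python, TensorFlow, and cuDNN DLL locations in one dict."""
    report = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "cudnn_dll_dirs": find_on_path(dll_name),  # assumed DLL name
    }
    try:
        import tensorflow as tf  # optional; reported as None if unavailable
        report["tensorflow"] = tf.__version__
    except Exception:
        report["tensorflow"] = None
    return report


for key, value in environment_report().items():
    print(key, "=", value)
```

If more than one directory is listed for the DLL, an older or newer copy elsewhere on PATH may be the one that actually gets loaded, which is worth checking before reinstalling anything.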

Solution

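Download and install cuDNN again. The only matching release that could still be downloaded at the time was [Download cuDNN v7.0.5 (Dec 5, 2017), for CUDA 9.0]; from it, get "cuDNN v7.0.5 Library for Windows 10". Installing it just means copying the 3 files in the archive into the CUDA installation path, which is usually: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0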
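
The copy step can also be scripted. The sketch below assumes the extracted archive contains a `cuda` folder with `bin\cudnn64_7.dll`, `include\cudnn.h`, and `lib\x64\cudnn.lib`. That layout and the function name are assumptions, so compare them with the archive you actually downloaded. Each file is copied to the same relative location under the CUDA installation path.

```python
import os
import shutil

# Assumed layout of the extracted cuDNN archive, relative to its "cuda" folder.
CUDNN_FILES = [
    os.path.join("bin", "cudnn64_7.dll"),
    os.path.join("include", "cudnn.h"),
    os.path.join("lib", "x64", "cudnn.lib"),
]


def install_cudnn(extracted_cuda_dir, cuda_home, files=CUDNN_FILES):
    """Copy each cuDNN file into the same relative location under cuda_home."""
    copied = []
    for relative in files:
        source = os.path.join(extracted_cuda_dir, relative)
        target = os.path.join(cuda_home, relative)
        if not os.path.isfile(source):
            raise FileNotFoundError(source)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(source, target)  # overwrites an existing copy of the file
        copied.append(target)
    return copied


# Example call (the first path is a hypothetical download location):
# install_cudnn(r"C:\Downloads\cudnn-9.0-windows10-x64-v7.0.5\cuda",
#               r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0")
```

Writing under C:\Program Files normally requires an elevated (administrator) prompt. Copying the files by hand in Explorer has the same effect.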

After this, convolution training runs normally.

  • Result when run in IDLE

    >>> 
    RESTART: G:\...\lenet5\mnist_lenet5_backward.py
    Extracting ./data/train-images-idx3-ubyte.gz
    Extracting ./data/train-labels-idx1-ubyte.gz
    Extracting ./data/t10k-images-idx3-ubyte.gz
    Extracting ./data/t10k-labels-idx1-ubyte.gz
    After 6803 training step(s), loss on training batch is 0.780407.
    After 6903 training step(s), loss on training batch is 0.767262.
    After 7003 training step(s), loss on training batch is 0.713715.
    After 7103 training step(s), loss on training batch is 0.692996.
  • Result when run from the command line

    G:\TestProject\python\tensorflow\peking_caojian\7CNNbase\lenet5>python mnist_lenet5_backward.py
    Extracting ./data/train-images-idx3-ubyte.gz
    Extracting ./data/train-labels-idx1-ubyte.gz
    Extracting ./data/t10k-images-idx3-ubyte.gz
    Extracting ./data/t10k-labels-idx1-ubyte.gz
    2019-08-21 14:46:54.309030: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    2019-08-21 14:46:54.870614: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1105] Found device 0 with properties:
    name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 0.8605
    pciBusID: 0000:01:00.0
    totalMemory: 2.00GiB freeMemory: 1.66GiB
    2019-08-21 14:46:54.876303: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
    After 9006 training step(s), loss on training batch is 0.671621.
    After 9106 training step(s), loss on training batch is 0.784085.
    After 9206 training step(s), loss on training batch is 0.738665.
    After 9306 training step(s), loss on training batch is 0.691877.

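Note: the AVX message printed when running the Python program from the command line above is only an informational warning. If you have the GPU build of TensorFlow installed, you can ignore it; for the reason, see the link below: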