I have been running code on a Google Cloud GPU instance for some days now. Recently, one problem has been occurring frequently and has been a headache for days.
From time to time, this error message appears:
packet_write_wait: Connection to x.x.x.x: Broken pipe
and the program will just stop running.
I have tried some suggestions found online, such as sending keepalive messages to the server, but nothing has helped.
I would really appreciate your help here!
2 Answers
When connecting to a server using SSH, an idle connection can be terminated if there is no "apparent" activity over the SSH connection. If you start a program over the SSH connection and that program has no terminal input or output activity for a period of time, the server may kill the connection.
As an example, the HAProxy software running on the server may disconnect idle client connections after a preset time has elapsed, say 30 minutes.
If your problem is caused by a seemingly idle SSH session, you can work around it by asking SSH to keep the connection alive with the ServerAliveInterval and ServerAliveCountMax parameters. In some cases, it may be sufficient to set ServerAliveInterval to 30 or 60 seconds and leave ServerAliveCountMax at its default value of 3. With an interval of 30 and the default count of 3, ssh disconnects after roughly 90 seconds if the server stops responding. Please read the man page to see how the combination affects the behavior in various cases (idle connections vs. links with connection issues).
ServerAliveInterval
Sets a timeout interval in seconds after which if no data has been received from the server, ssh(1) will send a message through the encrypted channel to request a response from the server. The default is 0, indicating that these messages will not be sent to the server.
From the ssh_config man page (man ssh_config):
ssh(1) obtains configuration data from the following sources in the following order:
1. command-line options
2. user's configuration file (~/.ssh/config)
3. system-wide configuration file (/etc/ssh/ssh_config)
See man ssh for how to set these options on the command line. For example (followed by your usual destination):
ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=5
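The same options can also be made persistent in the user's configuration file mentioned above, so they do not have to be typed each time. This is a minimal sketch, not a definitive setup: the values are the ones from the command line above, and applying them to every host with `Host *` is an assumption that you may want to narrow to a specific host.

```
# ~/.ssh/config
# Assumption: apply to all hosts; replace * with a host name or pattern to narrow it.
Host *
    # Send a keepalive probe after 30 seconds without data from the server
    ServerAliveInterval 30
    # Give up after 5 unanswered probes
    ServerAliveCountMax 5
```

Because command-line options are read first, an explicit `-o` on the command line still takes precedence over this file. On reasonably recent OpenSSH versions, `ssh -G` followed by the host name prints the configuration ssh would actually use, which is a quick way to check that the values were picked up.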
I had the same problem because someone plugged a new device into the network and mistakenly gave it the same IP address as the device I was accessing. I identified this by running
arp {IP} (on Linux) repeatedly and seeing the MAC address change. After the device was removed from the network, I got a stable SSH connection to the host.
Another option is to black-hole the MAC address on the switch if you can't physically find the device.
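The ARP check described above can be automated instead of rerunning the command by hand. The following is a rough sketch under some assumptions: it relies on the Linux `arp` command from net-tools and on `grep -o`, and the function names `extract_mac` and `watch_mac` are made up for this illustration. Note that the ARP cache is only refreshed when there is traffic to the address, so a changing entry may take a while to show up.

```shell
#!/bin/sh
# Print the first MAC address found on stdin (e.g. the output of `arp -n`).
extract_mac() {
    grep -oiE '([0-9a-f]{2}:){5}[0-9a-f]{2}' | head -n 1
}

# Poll the ARP cache for one IP and report whenever its MAC address changes.
# $1 = IP address, $2 = polling interval in seconds (default 5).
watch_mac() {
    ip=$1
    interval=${2:-5}
    last=""
    while :; do
        mac=$(arp -n "$ip" | extract_mac)
        # Report only when both the previous and current lookups found a MAC.
        if [ -n "$mac" ] && [ -n "$last" ] && [ "$mac" != "$last" ]; then
            echo "$(date): MAC for $ip changed from $last to $mac"
        fi
        # Remember the last MAC that was actually resolved.
        if [ -n "$mac" ]; then
            last=$mac
        fi
        sleep "$interval"
    done
}
```

After loading the functions into a shell, `watch_mac {IP} 5` checks the given address every five seconds until interrupted with Ctrl-C. If a line is printed, two different devices are answering for the same IP address.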