How does ping roughly work over IPv4 on Linux?

Introduction

The ping program is one of the most common programs which is used to check the “aliveness” of a host and a typical execution looks as follows:

$ ping 127.0.0.1 -c 1 -4

PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.062 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.062/0.062/0.062/0.000 ms

The -c switch indicates that we want to send a single “probe”. The -4 switch limits the ping program to stay confined to making network operations related to IPv4 only.

It basically works by sending a special network packet to your destination host and waits for the host to reply back. Then, it prints if any packets were lost and the timing statistics. I wanted to understand how the program works - what does it send? what does it receive? The final product ideally would be a C program which will be a basic version of ping.

Theory

This pdf here has a good description of the working of ping. The non-detailed version is that we create a special ICMP packet, package it up within a IP packet and send it across to the destination. The destination Linux kernel receives the packet, and sends a reply ICMP packet embedded within a IP packet. The destination host doesn’t have any user space program running to receive the “ping” packet. Each packet only has header information. You can embed specific data into the ICMP packet, but that’s not required. The post here describes the packet structure a bit more along with a graphical representation.

With that bit of theory under our belt, let’s look into what system calls are made as part of the above invocation of the ping program using strace.

System calls made as part of ping

If you don’t have strace installed, please install it using your package manager. Let’s now execute the above ping program under strace:

$ strace -e trace=network ping 127.0.0.1 -c 1 -4

You will see the output of the above command similar to:

..
socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) = 3
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
connect(4, {sa_family=AF_INET, sin_port=htons(1025), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(34117), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0
setsockopt(3, SOL_IP, IP_RECVTTL, [1], 4) = 0
setsockopt(3, SOL_IP, IP_RETOPTS, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_SNDBUF, [324], 4) = 0
setsockopt(3, SOL_SOCKET, SO_RCVBUF, [65536], 4) = 0
getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_SNDTIMEO, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
sendto(3, "\10\0q9\0\0\0\1\254k\331Z\0\0\0\0B,\0\0\0\0\0\0\20\21\22\23\24\25\26\27"..., 64, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 64
recvmsg(3, {msg_name={sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, msg_namelen=128->16, msg_iov=[{iov_base="\0\0x\314\0m\0\1\254k\331Z\0\0\0\0B,\0\0\0\0\0\0\20\21\22\23\24\25\26\27"..., iov_len=192}], msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=0x1d /* SCM_??? */}, {cmsg_len=20, cmsg_level=SOL_IP, cmsg_type=IP_TTL, cmsg_data=[64]}], msg_controllen=56, msg_flags=0}, 0) = 64
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.188 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.188/0.188/0.188/0.000 ms
+++ exited with 0 +++
a

Let’s first look at the first four lines of the trace:

socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) = 3

The above creates a socket of type SOCK_DGRAM and the protocol as IPPROTO_ICMP. The IPPROTO_ICMP socket protocol was added to allow a friendlier way to create ICMP packets. This eliminates the need to create “RAW” sockets which in turn eliminates the need to have the CAP_NET_RAW capability. The file descriptor returned is important to note here - 3.

socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 4

This creates another socket with IPPROTO_IP protocol and then uses it to connect to the UDP port 1025 on the target host:

connect(4, {sa_family=AF_INET, sin_port=htons(1025), sin_addr=inet_addr(“127.0.0.1”)}, 16) = 0

And then, uses getsockname to get the address the socket is bound to:

getsockname(4, {sa_family=AF_INET, sin_port=htons(34117), sin_addr=inet_addr(“127.0.0.1”)}, [16]) = 0

The above three steps are needed to figure out the IP address of the network interface that will be used to send the ICMP packets to the destination host. I am not quite sure why we need the new socket, hence I created an issue on the iputils project to request a clarification.

Let’s now continue with the trace. We can see a bunch of setsockopt system calls, but they are all on the first socket that was created i.e. for IPPROTO_ICMP with the file descriptor, 3.

Finally, we have the call to, sendto and recvmsg system calls which are used to send the IP packet (with the ICMP packet embedded in it) to the destination host and then receive the reply from the destination host respectively.

Implementation

We now know enough to copy bits and pieces from the ping implementation of the iputils project to transform our above understanding into code. The C implementation is in ping.c. It just sends a single ping and does a blocking read to read the reply. It has a number of inline comments to help understand what’s going on.

One of the key comments there in is about the use of the ICMP identifier RFC. When use a RAW socket, i.e. IPPROTO_RAW as the protocol type, we have to set the ICMP identifier when sending and check if it’s the same on receipt of an ICMP reply that whether it is meant for us or not. We don’t need to do that for IPPROTOCOL_ICMP since the Kernel automatically does that for us.

You can compile it on a Linux system as:

$ gcc ping.c

If we now try to execute the created binary, we will likely get a permission denied error:

$ ./a.out 127.0.0.1
Error creating socket: Permission denied

That’s because the IPPROTO_ICMP support was added to Linux along with a configurable sysctl parameter: ping_group_Range. To print the current value of this:

$ sudo sysctl net.ipv4.ping_group_range
net.ipv4.ping_group_range = 1   0

Now, we can update the parameter above to include our group ID:

$ id -g
1000
$ sudo sysctl -w net.ipv4.ping_group_range="0 1000"
net.ipv4.ping_group_range = 0 2000

(If you are a member of multiple groups, the range has to include only one of the groups)

Now, let’s try sending a single ping:

$ ./a.out 127.0.0.1
Sent 64 bytes
127.0.0.1
Reply of 64 bytes received
icmp_seq = 1

Or an external host:

04:48 $ ./a.out 8.8.8.8
Sent 64 bytes
8.8.8.8
Reply of 64 bytes received
icmp_seq = 1

Parting notes

If you have been following along starting from strace at the beginning you can see that I could run ping without needed sudo or having to set the group sysctl parameter. What happened? The ping program has the setuid bit set:

 $ ls -lrt /bin/ping
-rwsr-xr-x 1 root root 64424 Mar  9  2017 /bin/ping

Hence, we could do the same for our ./a.out file above:

05:30 $ sudo chown root:root ./a.out
05:30 $ sudo chmod u+s ./a.out
05:30 $ sudo chmod g+s ./a.out

Then, we would not need to change the sysctl parameter.

Resources