I think what you are asking is whether GTP processing in the kernel is slow because of:
- buffer copy to skb
- network stack traversal to reach UDP
- locks and critical regions in GTP processing
- no dedicated polling to deliver GTP packets to the desired CPU
- context switches.
Yes, all of these affect your packet acquisition and processing time. With userspace direct-buffer DMA and burst RX/TX, they are mitigated. Even so, performance still depends on the code itself and on the cache locality of the data. So well-written code, whether it uses a RAW socket, AF_PACKET, DPDK or PF_RING, is what to pursue, rather than just faster RX/TX.
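To make the "burst plus locality" point concrete, here is a rough sketch in plain C of handling a whole burst of GTP-U packets in one tight loop, with results written into small contiguous arrays. It is not tied to any framework: `struct pkt`, `gtpu_parse`, `gtpu_process_burst` and `BURST_MAX` are names I made up for illustration, and a real application would get its packet pointers from whatever RX call it uses. The sketch assumes `data` already points at the GTP-U header (the UDP payload), only accepts GTPv1 G-PDUs, and skips the 4 optional bytes when E, S or PN is set without walking any extension header chain or validating the length field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GTPU_MSG_GPDU 0xFF  /* G-PDU: carries an encapsulated user packet */
#define GTPU_HDR_MIN  8     /* mandatory GTP-U header length in bytes */
#define BURST_MAX     32    /* hypothetical burst size for this sketch */

/* Hypothetical packet descriptor; a real framework supplies its own
 * (for example an mbuf or a ring-frame pointer). */
struct pkt {
    const uint8_t *data;  /* start of the GTP-U header (UDP payload) */
    size_t len;           /* bytes available from data */
};

/* Parse one GTP-U header. Returns 0 and stores the TEID and payload offset
 * on success, -1 if the packet is not a usable GTPv1 G-PDU. */
int gtpu_parse(const struct pkt *p, uint32_t *teid, size_t *payload_off)
{
    if (p->len < GTPU_HDR_MIN)
        return -1;

    const uint8_t *h = p->data;
    if ((h[0] >> 5) != 1 || !(h[0] & 0x10))  /* version 1, protocol type GTP */
        return -1;
    if (h[1] != GTPU_MSG_GPDU)
        return -1;

    size_t off = GTPU_HDR_MIN;
    if (h[0] & 0x07)      /* E, S or PN set: 4 more bytes of optional fields */
        off += 4;         /* extension header chain is NOT walked here */
    if (p->len < off)
        return -1;

    /* TEID is a 32-bit big-endian field at bytes 4..7 */
    *teid = ((uint32_t)h[4] << 24) | ((uint32_t)h[5] << 16) |
            ((uint32_t)h[6] << 8)  |  (uint32_t)h[7];
    *payload_off = off;
    return 0;
}

/* Process a whole burst in one call, with no locks or syscalls in the loop.
 * Valid results are packed at the front of teids[] and offs[].
 * Returns the number of valid G-PDUs found. */
size_t gtpu_process_burst(const struct pkt *burst, size_t n,
                          uint32_t *teids, size_t *offs)
{
    assert(n <= BURST_MAX);
    size_t ok = 0;
    for (size_t i = 0; i < n; i++) {
        if (gtpu_parse(&burst[i], &teids[ok], &offs[ok]) == 0)
            ok++;
    }
    return ok;
}
```

The same loop works regardless of where the burst came from. What matters is that each core gets its packets in batches, touches each header once, and keeps the per-packet state small enough to stay in cache.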