Note: This post is written at a somewhat higher level than I would like. This is because I don’t want to reveal any proprietary information. I apologize in advance, since you may need to do some googling on the side to get more specifics.
Low latency means getting data to the output as quickly as possible after receiving the input. The input can be triangles in the case of graphics or Ethernet frames in the case of low-latency trading. The output is the display, or the packets or messages sent to trade clients.
Latency is measured in absolute time. For example, given a 60Hz refresh rate, the time available to draw a single frame of video is about 16.7ms (1/60 of a second). This means that if you were running a graphics card that could produce one image every 16.7ms, you would be keeping up with the maximum frame rate of the monitor.
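The frame budget follows directly from the refresh rate, since the period is simply the inverse of the frequency. A minimal sketch of that arithmetic (the function name is just for illustration, and the 120Hz and 144Hz rates are extra examples, not figures from this post):

```python
def frame_period_ms(refresh_hz):
    """Time budget for one frame, in milliseconds, at the given refresh rate."""
    return 1000.0 / refresh_hz


for hz in (60, 120, 144):
    print(f"{hz} Hz -> {frame_period_ms(hz):.2f} ms per frame")
# 60 Hz -> 16.67 ms per frame
# 120 Hz -> 8.33 ms per frame
# 144 Hz -> 6.94 ms per frame
```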
In finance, there is no minimum latency that needs to be achieved. The only thing that matters is whether you are faster or slower than your competition.
Let’s first look at what contributes to latency:
- Crossing clock domains.
- Waiting to fetch from storage.
- How fast your physical interfaces can provide or accept data.
- Data Overlap.
Looking at each individually:
Clock domain crossing:
Please see my blog posts on www.asicsolutions.com for more details on clock crossing. Looking at Ethernet, at least one clock crossing is inevitable. Ethernet RX recovers a clock from the physical wires carrying the data. Ethernet TX uses a generated clock at a specified frequency required by the interface, typically generated from a PLL. Therefore, you would need to add a FIFO to cross clock domains from RX to TX.
Why does clock domain crossing cost latency?
Referring back to the CDC articles I wrote, you can see that it costs two to three clock cycles to pass a signal from one domain to another. In the case of a FIFO, you need to pass the write pointer to the read side and compare it against the read pointer to determine whether data is available. Additional latency may be required in an FPGA due to routing to and from storage elements in the case of BRAM.
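To see where those cycles come from, here is a small behavioral model of a two-flop synchronizer, the structure commonly used to bring a signal into a new clock domain. This is a sketch in Python rather than HDL, the class and function names are made up for illustration, and it ignores metastability and clock phase entirely. In real hardware the input can change just after a destination clock edge, which is roughly where the extra third cycle comes from.

```python
class TwoFlopSynchronizer:
    """Behavioral model of a two-flip-flop synchronizer (destination clock domain)."""

    def __init__(self):
        self.stage1 = 0  # first flop, samples the asynchronous input
        self.stage2 = 0  # second flop, drives the synchronized output

    def clock(self, async_in):
        """Advance one destination-clock edge and return the synchronized output."""
        # Both flops update on the same edge, so stage2 captures the OLD stage1.
        self.stage2, self.stage1 = self.stage1, async_in
        return self.stage2


def cycles_until_visible(sync, value=1, limit=10):
    """Count destination-clock edges until `value` appears at the output."""
    for edge in range(1, limit + 1):
        if sync.clock(value) == value:
            return edge
    return None


print(cycles_until_visible(TwoFlopSynchronizer()))  # 2
```

A FIFO pays this cost on the pointer crossing, and then adds the pointer compare and the memory read on top of it.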
Fetching from storage:
Reading from internal BRAMs takes one or more clock cycles. External memory such as DDR or QDR can cost upwards of 15-20 clock cycles. The advantage of external memory is sheer size. External memory can hold many orders of magnitude more data than BRAM, so large tables, packet data, etc. would need to be stored there.
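Since latency is measured in absolute time, it helps to convert those cycle counts into nanoseconds. The sketch below does that conversion. The 250MHz clock is purely an assumed figure for illustration, so substitute the clock of your own design.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1000.0 / clock_mhz


CLOCK_MHZ = 250  # assumed fabric clock, not a figure from this post

print(cycles_to_ns(1, CLOCK_MHZ))   # BRAM read, best case: 4.0
print(cycles_to_ns(15, CLOCK_MHZ))  # external memory, low end: 60.0
print(cycles_to_ns(20, CLOCK_MHZ))  # external memory, high end: 80.0
```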
Physical interfaces:
There is a serialization delay for the SerDes in the case of Ethernet or DisplayPort. This is unavoidable, and the only way to mitigate it is to choose a vendor that has a lower latency.
The IP talking to the physical interfaces also has latency. Some latency is unavoidable, but there are often places to save if you are willing to invest the time and effort.
Data overlap:
If you need to stall the pipeline to perform a calculation, you will reduce your overall throughput.
Overlap occurs when data comes in too fast to handle with a single pipeline, so new data arrives while previous data is still being processed.
How can we cut latency:
- Clock domain crossings should be minimized. I can’t go much more into this.
- If you need to wait for external data, you have a couple of choices:
  - Caching can provide fast access to data, depending on the scheme selected. A general-purpose cache helps in some cases, but a miss still incurs the full fetch latency. Since you will usually have well-defined access patterns whenever you need external memory, a specialized cache or prefetch buffer can help.
  - Cycle hiding allows you to keep the pipeline working while the fetch is in flight, so that when the data is returned, you are ready for it. This tends to help more in the case of throughput.
- Run at a faster clock speed. If you can process data at a faster clock rate, preferably a multiple of the base input rate, you can use a narrower datapath.
- To fix stalling, the design should be pipelined as much as possible. If a FIFO or stall logic is necessary, then the overall throughput is negatively affected.
- If stalling is inevitable and data overlap is occurring, then parallel pipelines that ping-pong may need to be introduced.
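The ping-pong idea from the last point can be illustrated with a small model. This is a rough sketch under stated assumptions, not a description of any real design: items arrive at a fixed interval, each pipeline is busy for a fixed number of cycles per item, and incoming items are handed to the pipelines in strict rotation. All names and numbers are made up for illustration.

```python
def total_wait_cycles(arrival_interval, busy_cycles, num_pipelines, num_items):
    """Return the total cycles items spend waiting for their pipeline to free up."""
    free_at = [0] * num_pipelines  # cycle at which each pipeline can accept new data
    total_wait = 0
    for i in range(num_items):
        arrival = i * arrival_interval
        lane = i % num_pipelines  # ping-pong (round-robin) selection
        start = max(arrival, free_at[lane])
        total_wait += start - arrival  # nonzero means this item overlapped earlier data
        free_at[lane] = start + busy_cycles
    return total_wait


# Data arrives every 4 cycles, but each item occupies a pipeline for 6 cycles.
print(total_wait_cycles(4, 6, num_pipelines=1, num_items=4))  # 12
print(total_wait_cycles(4, 6, num_pipelines=2, num_items=4))  # 0
```

With a single pipeline the waiting grows with every item, which in hardware means a FIFO filling up or data being dropped. With two pipelines taking turns, each one sees new data only every 8 cycles, so the 6-cycle occupancy no longer causes overlap.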