Message
From: Mark McDougall<markm@v...>
Date: Tue Feb 19 01:52:33 CET 2008
Subject: [oc] PCI core question
Howard Harte wrote:
> One issue I'd like some advice on is improving read latency. For single
> 32-bit writes, they complete in about 300ns, which is fine. For 32-bit
> reads, they complete in 2.5uS, which is a really long time. I'm reading
> and writing to a FIFO, which occupies a single address on the wishbone
> backplane.
It's been a while since I used the core, but we did some profiling and analysis because we had similar issues.
By far the most significant factor in read performance is the fact that all reads are posted, which means the core will immediately disconnect after latching the address/command. It is then up to the chipset on the motherboard to decide when to issue a retry to the core in order to receive the result. We were seeing gaps of 8 or more PCI clocks, for example, between retries.
IIRC the *absolute* best performance you can hope to get on reads on an Intel mobo is around 3MHz (300ns), or 10 clocks per transfer, due to this disconnect. For 32-bit reads, that of course equates to 12MBps. The best we actually measured on our platform was around 9MBps for internal register accesses, and ~4.4MBps when accessing external SRAM, because that required 2 retries per access.
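To make the arithmetic above easy to redo for other figures, here is a rough sketch in Python. It assumes a conventional 33 MHz PCI clock (implied by "10 clocks" being about 300ns, but not stated outright) and one 32-bit word per transfer. The function names are made up for illustration. With an exact 33 MHz clock the 10-clock case works out slightly above the rounded 12MBps figure.

```python
# Back-of-envelope model of non-burst PCI read throughput.
# Assumption: 33 MHz PCI clock, one 32-bit (4-byte) word per transfer.
PCI_CLOCK_HZ = 33_000_000
BYTES_PER_TRANSFER = 4


def throughput_mb_s(clocks_per_transfer, clock_hz=PCI_CLOCK_HZ,
                    bytes_per_transfer=BYTES_PER_TRANSFER):
    """Throughput in MB/s if every transfer costs the given number of PCI clocks."""
    transfers_per_second = clock_hz / clocks_per_transfer
    return transfers_per_second * bytes_per_transfer / 1e6


def clocks_per_transfer_for(mb_s, clock_hz=PCI_CLOCK_HZ,
                            bytes_per_transfer=BYTES_PER_TRANSFER):
    """Inverse: PCI clocks spent per transfer for a measured throughput in MB/s."""
    return clock_hz * bytes_per_transfer / (mb_s * 1e6)


def clocks_for_latency(latency_s, clock_hz=PCI_CLOCK_HZ):
    """PCI clocks that elapse during a given latency in seconds."""
    return latency_s * clock_hz


if __name__ == "__main__":
    print("10 clocks/transfer :", throughput_mb_s(10), "MB/s")
    print("9 MB/s measured    :", clocks_per_transfer_for(9.0), "clocks/transfer")
    print("4.4 MB/s measured  :", clocks_per_transfer_for(4.4), "clocks/transfer")
    print("2.5 us read        :", clocks_for_latency(2.5e-6), "clocks")
```

Under those assumptions, the 2.5us read you reported corresponds to roughly 80 PCI clocks per read, which is a lot more than the ~10-clock floor and suggests several retries per access.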
You're seeing the above-mentioned 300ns per write because that likely corresponds to the gaps between successive commands on the PCI bus as well. That's pretty much exactly what we saw too.
> Some things I've thought about are mapping the FIFO to a separate BAR,
> ignoring the lower address bits, and enabling read prefetching, but
> figured this might be dangerous since it's a FIFO.
Mapping to a separate BAR and ignoring the lower address bits likely won't make any difference at all.
And you definitely *don't* want to enable pre-fetching, as your performance will actually drop, as it fetches an entire cache-line from wishbone space before allowing the retry to resume... we tried that! ;) Besides, Intel mobo chipsets won't burst on memory reads, so the rest of the cacheline will *always* be discarded! All that aside, you have a FIFO so it wouldn't help you at all, unless you did something really dangerous like alias the read port to multiple consecutive addresses... but then you have issues on FIFO empty... don't go there...
> Another thing I considered is doing bus mastering to empty the FIFO into
> main memory, but this is a large change to my design.
We had DMA bus mastering for memory access and IDE and the throughput was limited by the memory/IDE, so I can't recall what the bus utilisation was during the actual block transfer, but we did see upwards of 16MB/s overall performance.
> Any other thoughts on how to proceed?
Simulate your design and you'll see exactly where your latencies are. Use a PCI bus analyser to look at how often the mobo is attempting to connect to the core and issue retries etc.
Regards,
--
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255 Fax: +612-9599-3266