FPGA Tools Synthesis QoR Benchmark
The tools compared (with the target device used for synthesis in parentheses) are:
Synplify Premier H-2013.03 (Virtex5)
Quartus 13.0 (Cyclone III)
ISE 14.6 (Spartan6)
Vivado 2013.2 (Artix7)
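The tests use five simple scenarios:
1. in2reg Timing-Driven QoR Test -- optimization of input-to-register paths
2. reg2reg Timing-Driven QoR Test -- optimization of register-to-register paths
3. Sequential Optimization Test -- optimization across register boundaries
4. Register Replication Test (Logical) -- register replication driven by logical relationships
5. Register Replication Test (Physical) -- register replication driven by placement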
========= Test Case 1. in2reg Timing-Driven QoR Test =========
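This example ANDs 256 input signals together and feeds the result to one register, with a timing constraint on only one of the inputs. Before the constraint is added, the logic should be a symmetric tree (8 levels if built from 2-input AND gates, since 2^8 = 256); after the constraint is added, the combinational logic on the constrained path should collapse to a single level. Here the timing constraint is placed on the 128th input signal, dat_i[127], and the global clock buffer is disabled (otherwise the clock delay would improve the slack). The RTL is: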
- module i2r(input clk,input [255:0] dat_i,output reg dat_o);
- always@(posedge clk) dat_o <= &dat_i;
- endmodule
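As a rough cross-check of the "8 levels" figure, the sketch below counts the levels of a balanced reduction tree for a given gate or LUT fan-in. The function name `tree_levels` is made up for this illustration, and the model deliberately ignores carry chains and other dedicated resources that a real mapper may use.

```python
def tree_levels(n_inputs: int, fan_in: int) -> int:
    """Number of levels in a balanced reduction tree of fan_in-input gates."""
    if n_inputs < 1 or fan_in < 2:
        raise ValueError("need n_inputs >= 1 and fan_in >= 2")
    levels = 0
    remaining = n_inputs
    while remaining > 1:
        # Each level merges up to fan_in signals into one (ceiling division).
        remaining = -(-remaining // fan_in)
        levels += 1
    return levels


if __name__ == "__main__":
    for fan_in in (2, 4, 6):
        print(f"fan-in {fan_in}: {tree_levels(256, fan_in)} levels")
```

Under this model a 256-input AND needs 8 levels of 2-input gates and 4 levels of either LUT4s or LUT6s. A timing-driven result should do better than the balanced tree for the one constrained input by moving it to the last level.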
SDC constraints used for Synplify/Quartus/Vivado:
- create_clock [get_ports clk] -period 1
- set_input_delay -clock clk -max 0.9 [get_ports {dat_i[127]}]
XCF/UCF constraints used for ISE (in XST, the bus delimiter was changed from <> to []):
- NET clk TNM_NET = clk;
- TIMESPEC TS_clk = PERIOD clk 1.000 ns;
- NET "dat_i[127]" OFFSET = IN 0.1 BEFORE clk;
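The SDC and UCF constraint sets are meant to express the same requirement: with a 1 ns clock and 0.9 ns of external input delay, the data arrives 0.1 ns before the capturing edge. A minimal sketch of that conversion, assuming a single-clock design; the helper names `offset_in_before` and `ucf_offset_line` are hypothetical and not part of any vendor tool.

```python
from decimal import Decimal


def offset_in_before(period_ns: str, input_delay_ns: str) -> Decimal:
    """Time before the clock edge at which data is valid (period - input delay)."""
    period = Decimal(period_ns)
    delay = Decimal(input_delay_ns)
    if not (0 <= delay <= period):
        raise ValueError("input delay must lie within one clock period")
    return period - delay


def ucf_offset_line(net: str, clock: str, period_ns: str, input_delay_ns: str) -> str:
    """Format a UCF OFFSET IN constraint equivalent to an SDC set_input_delay -max."""
    offset = offset_in_before(period_ns, input_delay_ns)
    return f'NET "{net}" OFFSET = IN {offset} BEFORE {clock};'


if __name__ == "__main__":
    print(ucf_offset_line("dat_i[127]", "clk", "1", "0.9"))
```

Decimal arithmetic is used so that 1 - 0.9 comes out as exactly 0.1 rather than a binary floating-point approximation.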
Synplify synthesis result: dat_i[127] passes through only a single LUT4 before reaching dat_o:
- Instance / Net Pin Pin Arrival
- Name Type Name Dir Delay Time
- -----------------------------------------------------------------------
- dat_i[255:0] Port dat_i[127] In 0.000 0.900
- dat_i[127] Net - - 0.000 -
- dat_i_ibuf[127] IBUF I In - 0.900
- dat_i_ibuf[127] IBUF O Out 0.992 1.892
- dat_i_c[127] Net - - 0.360 -
- dat_o_RNO LUT4_L I3 In - 2.252
- dat_o_RNO LUT4_L LO Out 0.376 2.628
- dat_o_2 Net - - 0.000 -
- dat_o FD D In - 2.628
- =======================================================================
Quartus synthesis result: dat_i[127] likewise passes through only a single LCCOMB before reaching dat_o:
- 0.900 0.000 FF IC IOIBUF_X0_Y3_N1 dat_i[127]~input|i
- 1.786 0.886 FF CELL IOIBUF_X0_Y3_N1 dat_i[127]~input|o
- 2.277 0.491 FF IC LCCOMB_X1_Y3_N20 WideAnd0~84|datad
- 2.387 0.110 FF CELL LCCOMB_X1_Y3_N20 WideAnd0~84|combout
- 2.387 0.000 FF IC FF_X1_Y3_N21 dat_o~reg0|d
- 2.478 0.091 FF CELL FF_X1_Y3_N21 dat_o~reg0
ISE synthesis result (commentary omitted...):
- Location Delay type Delay(ns) Physical Resource
- Logical Resource(s)
- ------------------------------------------------- -------------------
- R22.I Tiopi 0.902 dat_i[127]
- dat_i[127]
- dat_i_127_IBUF
- ProtoComp3.IMUX.143
- SLICE_X52Y37.B6 net (fanout=1) 2.846 dat_i_127_IBUF
- SLICE_X52Y37.COUT Topcyb 0.423 out_wg_cy[19]
- out_wg_lut[17]
- out_wg_cy[19]
- SLICE_X52Y38.CIN net (fanout=1) 0.003 out_wg_cy[19]
- SLICE_X52Y38.COUT Tbyp 0.086 out_wg_cy[23]
- out_wg_cy[23]
- SLICE_X52Y39.CIN net (fanout=1) 0.003 out_wg_cy[23]
- SLICE_X52Y39.COUT Tbyp 0.086 out_wg_cy[27]
- out_wg_cy[27]
- SLICE_X52Y40.CIN net (fanout=1) 0.214 out_wg_cy[27]
- SLICE_X52Y40.COUT Tbyp 0.086 out_wg_cy[31]
- out_wg_cy[31]
- SLICE_X52Y41.CIN net (fanout=1) 0.003 out_wg_cy[31]
- SLICE_X52Y41.COUT Tbyp 0.086 out_wg_cy[35]
- out_wg_cy[35]
- SLICE_X52Y42.CIN net (fanout=1) 0.003 out_wg_cy[35]
- SLICE_X52Y42.COUT Tbyp 0.086 out_wg_cy[39]
- out_wg_cy[39]
- SLICE_X52Y43.CIN net (fanout=1) 0.003 out_wg_cy[39]
- SLICE_X52Y43.CLK Tckcin (-Th) -0.107 dat_o_OBUF
- out_wg_cy[42]
- dat_o
Vivado synthesis result (commentary omitted...):
- Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
- ---------------------------------------------------- -------------------
- (clock clk rise edge) 0.000 0.000 r
- input delay 0.900 0.900
- D19 0.000 0.900 r dat_i[127]
- net (fo=0) 0.000 0.900 dat_i[127]
- D19 IBUF (Prop_ibuf_I_O) 0.978 1.878 r dat_i_IBUF[127]_inst/O
- net (fo=1, routed) 2.250 4.128 dat_i_IBUF[127]
- SLICE_X1Y113 LUT6 (Prop_lut6_I5_O) 0.124 4.252 r dat_o_reg_i_25/O
- net (fo=1, routed) 0.151 4.403 n_0_dat_o_reg_i_25
- SLICE_X1Y113 LUT6 (Prop_lut6_I2_O) 0.124 4.527 r dat_o_reg_i_17/O
- net (fo=1, routed) 0.303 4.830 n_0_dat_o_reg_i_17
- SLICE_X0Y111 LUT6 (Prop_lut6_I0_O) 0.124 4.954 r dat_o_reg_i_4/O
- net (fo=1, routed) 0.290 5.244 n_0_dat_o_reg_i_4
- SLICE_X1Y110 LUT3 (Prop_lut3_I2_O) 0.124 5.368 r dat_o_reg_i_1/O
- net (fo=1, routed) 0.000 5.368 p_0_in
- SLICE_X1Y110 FDRE r dat_o_reg/D
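One way to read the Vivado report above is to split the data path into cell delay and net delay. The sketch below does this for the increments copied from that report; the tuple layout and the name `split_delay` are assumptions made for this illustration, not a report-parsing API.

```python
# (kind, increment in ns), copied from the Vivado path report for Test Case 1.
VIVADO_CASE1_PATH = [
    ("net", 0.000),   # dat_i[127] port net
    ("cell", 0.978),  # IBUF
    ("net", 2.250),   # dat_i_IBUF[127]
    ("cell", 0.124),  # LUT6 dat_o_reg_i_25
    ("net", 0.151),
    ("cell", 0.124),  # LUT6 dat_o_reg_i_17
    ("net", 0.303),
    ("cell", 0.124),  # LUT6 dat_o_reg_i_4
    ("net", 0.290),
    ("cell", 0.124),  # LUT3 dat_o_reg_i_1
    ("net", 0.000),   # p_0_in
]


def split_delay(path):
    """Sum the increments per kind ('cell' or 'net')."""
    totals = {"cell": 0.0, "net": 0.0}
    for kind, incr_ns in path:
        totals[kind] += incr_ns
    return totals


if __name__ == "__main__":
    totals = split_delay(VIVADO_CASE1_PATH)
    for kind, value in totals.items():
        print(f"{kind}: {value:.3f} ns")
```

The cell and net totals (1.474 ns and 2.994 ns) plus the 0.900 ns input delay add up to the 5.368 ns arrival time in the report. Roughly two thirds of the data path is routing, and the four LUT levels show that the constrained input was not moved toward the register.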
========= Test Case 2. reg2reg Timing-Driven QoR Test =========
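This example is similar to the previous one, except that the outputs of 256 registers are ANDed together and fed to one register, and the outputs of 255 of those registers are declared false paths. The one remaining path should be shortened as far as possible by timing-driven optimization. The path kept here is the one from the output of dat_buf[127]. The RTL is: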
- module r2r(input clk,input [255:0] dat_i,output reg dat_o);
- reg [255:0] dat_buf;
- always@(posedge clk) dat_buf <= dat_i;
- always@(posedge clk) dat_o <= &dat_buf;
- endmodule
SDC constraints used for Synplify/Quartus:
- create_clock [get_ports clk] -period 1
- set_false_path -from [get_cells {dat_buf[0]}]
- set_false_path -from [get_cells {dat_buf[1]}]
- ...
- set_false_path -from [get_cells {dat_buf[126]}]
- set_false_path -from [get_cells {dat_buf[128]}]
- ...
- set_false_path -from [get_cells {dat_buf[255]}]
SDC constraints used for Vivado (Vivado SDC does not support -from with a cell):
- create_clock [get_ports clk] -period 1
- set_false_path -through [get_pins {dat_buf_reg[0]/Q}]
- set_false_path -through [get_pins {dat_buf_reg[1]/Q}]
- ...
- set_false_path -through [get_pins {dat_buf_reg[126]/Q}]
- set_false_path -through [get_pins {dat_buf_reg[128]/Q}]
- ...
- set_false_path -through [get_pins {dat_buf_reg[255]/Q}]
XCF/UCF constraints used for ISE:
- NET clk TNM_NET = clk;
- TIMESPEC TS_clk = PERIOD clk 1.000 ns;
- INST "dat_buf_0" TIG;
- INST "dat_buf_1" TIG;
- ...
- INST "dat_buf_126" TIG;
- INST "dat_buf_128" TIG;
- ...
- INST "dat_buf_255" TIG;
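Writing 255 near-identical constraint lines by hand is error-prone, so a small generator script is a natural fit here. The sketch below emits the false-path lines in the three styles shown above, skipping index 127; the function name `false_path_lines` and the tool keys are invented for this example, and the clock constraints still have to be added separately.

```python
def false_path_lines(tool: str, width: int = 256, keep: int = 127) -> list:
    """Return one false-path constraint per dat_buf bit, except the kept bit."""
    templates = {
        # Synplify / Quartus style (cell-based -from)
        "sdc": "set_false_path -from [get_cells {{dat_buf[{i}]}}]",
        # Vivado style used in this post (pin-based -through)
        "vivado": "set_false_path -through [get_pins {{dat_buf_reg[{i}]/Q}}]",
        # ISE XCF/UCF style
        "ucf": 'INST "dat_buf_{i}" TIG;',
    }
    if tool not in templates:
        raise ValueError(f"unknown tool style: {tool}")
    template = templates[tool]
    return [template.format(i=i) for i in range(width) if i != keep]


if __name__ == "__main__":
    for style in ("sdc", "vivado", "ucf"):
        lines = false_path_lines(style)
        print(f"# {style}: {len(lines)} lines")
        print(lines[0])
        print(lines[-1])
```

Redirecting the joined lines into the constraint file for each tool keeps the three constraint sets consistent with one another.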
Synplify synthesis result: dat_buf[127] passes through only a single LUT4 before reaching dat_o:
- Instance / Net Pin Pin Arrival
- Name Type Name Dir Delay Time
- ----------------------------------------------------------------
- dat_buf[127] FD Q Out 0.450 4.588
- dat_buf[127] Net - - 0.360 -
- dat_o_RNO LUT4_L I3 In - 4.948
- dat_o_RNO LUT4_L LO Out 0.376 5.324
- dat_o_2 Net - - 0.000 -
- dat_o FD D In - 5.324
Quartus synthesis result: dat_buf[127] passes through only a single LCCOMB before reaching dat_o:
- 2.624 0.199 uTco FF_X40_Y14_N19 dat_buf[127]
- 2.624 0.000 FF CELL FF_X40_Y14_N19 dat_buf[127]|q
- 2.910 0.286 FF IC LCCOMB_X40_Y14_N24 WideAnd0~84|datad
- 3.020 0.110 FF CELL LCCOMB_X40_Y14_N24 WideAnd0~84|combout
- 3.020 0.000 FF IC FF_X40_Y14_N25 dat_o~reg0|d
- 3.111 0.091 FF CELL FF_X40_Y14_N25 dat_o~reg0
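However, in both the previous example and this one, Quartus does not always converge to the best result as the constraints change. Sometimes the path takes two levels, and sometimes a path that was a single level after synthesis gains several more levels once physical synthesis is enabled. In short, its convergence is not particularly stable.
ISE synthesis result (commentary omitted...):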
- Location Delay type Delay(ns) Physical Resource
- Logical Resource(s)
- ------------------------------------------------- -------------------
- SLICE_X25Y35.DQ Tcko 0.430 dat_buf[127]
- dat_buf_127
- SLICE_X26Y35.B6 net (fanout=1) 0.351 dat_buf[127]
- SLICE_X26Y35.COUT Topcyb 0.483 out_wg_cy[19]
- out_wg_lut[17]
- out_wg_cy[19]
- SLICE_X26Y36.CIN net (fanout=1) 0.003 out_wg_cy[19]
- SLICE_X26Y36.COUT Tbyp 0.093 out_wg_cy[23]
- out_wg_cy[23]
- SLICE_X26Y37.CIN net (fanout=1) 0.003 out_wg_cy[23]
- SLICE_X26Y37.COUT Tbyp 0.093 out_wg_cy[27]
- out_wg_cy[27]
- SLICE_X26Y38.CIN net (fanout=1) 0.003 out_wg_cy[27]
- SLICE_X26Y38.COUT Tbyp 0.093 out_wg_cy[31]
- out_wg_cy[31]
- SLICE_X26Y39.CIN net (fanout=1) 0.003 out_wg_cy[31]
- SLICE_X26Y39.COUT Tbyp 0.093 out_wg_cy[35]
- out_wg_cy[35]
- SLICE_X26Y40.CIN net (fanout=1) 0.214 out_wg_cy[35]
- SLICE_X26Y40.COUT Tbyp 0.093 out_wg_cy[39]
- out_wg_cy[39]
- SLICE_X26Y41.CIN net (fanout=1) 0.003 out_wg_cy[39]
- SLICE_X26Y41.CLK Tcinck 0.295 dat_o_OBUF
- out_wg_cy[42]
- dat_o
Vivado synthesis result (commentary omitted...):
- Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
- ----------------------------------------------------- -------------------
- SLICE_X55Y110 FDRE (Prop_fdre_C_Q) 0.456 5.517 r dat_buf_reg[127]/Q
- net (fo=1, routed) 0.165 5.682 dat_buf[127]
- SLICE_X54Y110 LUT6 (Prop_lut6_I5_O) 0.124 5.806 r dat_o_reg_i_25/O
- net (fo=1, routed) 0.497 6.304 n_0_dat_o_reg_i_25
- SLICE_X52Y106 LUT6 (Prop_lut6_I2_O) 0.124 6.428 r dat_o_reg_i_17/O
- net (fo=1, routed) 0.447 6.874 n_0_dat_o_reg_i_17
- SLICE_X52Y102 LUT6 (Prop_lut6_I0_O) 0.124 6.998 r dat_o_reg_i_4/O
- net (fo=1, routed) 0.159 7.157 n_0_dat_o_reg_i_4
- SLICE_X52Y102 LUT3 (Prop_lut3_I2_O) 0.124 7.281 r dat_o_reg_i_1/O
- net (fo=1, routed) 0.000 7.281 p_0_in
- SLICE_X52Y102 FDRE r dat_o_reg/D
========= Test Case 3. Sequential Optimization Test =========
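This example mainly tests optimization across register boundaries. The RTL describes a completely redundant (dummy) circuit, but recognizing that requires looking across registers: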
- module seq_opt(input clk,dat_i,output reg dat_o);
- reg dat_buf,dat_buf2,dat_buf2n;
- always@(posedge clk) dat_buf <= dat_i;
- always@(posedge clk) dat_buf2 <= dat_buf;
- always@(posedge clk) dat_buf2n <= ~dat_buf;
- always@(posedge clk) dat_o <= dat_buf2 | dat_buf2n;
- endmodule
In the SDC/XCF/UCF, clk is declared as a 1 GHz clock.
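To see why the circuit is redundant, note that dat_buf2 and dat_buf2n always capture complementary values of dat_buf, so their OR is constant. The cycle-accurate sketch below models the four registers of seq_opt; the power-up value `init` is an assumption of this model, since the RTL does not specify one.

```python
import random


def simulate_seq_opt(inputs, init=0):
    """Model seq_opt; returns the value of dat_o after each rising clock edge."""
    dat_buf = dat_buf2 = dat_buf2n = dat_o = init
    outputs = []
    for dat_i in inputs:
        # All registers sample on the same edge, so compute every next
        # value from the current state before updating anything.
        next_o = dat_buf2 | dat_buf2n
        next_buf2 = dat_buf
        next_buf2n = 1 - dat_buf  # bitwise NOT of a 1-bit value
        next_buf = dat_i
        dat_buf, dat_buf2, dat_buf2n, dat_o = next_buf, next_buf2, next_buf2n, next_o
        outputs.append(dat_o)
    return outputs


if __name__ == "__main__":
    random.seed(0)
    stimulus = [random.randint(0, 1) for _ in range(20)]
    print(simulate_seq_opt(stimulus, init=0))
    print(simulate_seq_opt(stimulus, init=1))
```

From the second clock edge onward dat_o is 1 regardless of dat_i or the assumed power-up value, which is the constant a tool has to discover by reasoning across the register stage.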
Synplify synthesis result: the circuit is optimized away entirely:
- @W:CL169 : seq_opt.v(6) | Pruning register dat_buf2
- @W:CL169 : seq_opt.v(7) | Pruning register dat_buf2n
- @W:CL169 : seq_opt.v(5) | Pruning register dat_buf
- @W:CL189 : seq_opt.v(8) | Register bit dat_o is always 1, optimizing ...
- @W:CL159 : seq_opt.v(1) | Input clk is unused
- @W:CL159 : seq_opt.v(1) | Input dat_i is unused
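Quartus synthesis result: the circuit is also optimized away entirely. Note that if the first stage, dat_buf, is removed, Quartus no longer performs the optimization; this behavior is safer, because it will not optimize away circuits such as phase detectors: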
- Register Reason for Removal Register Causing Removal
- dat_buf2n Merged with dat_buf2
- dat_buf2 Lost fanout
- dat_buf Lost fanout dat_buf2
- dat_o~reg0 Stuck at VCC due to stuck port data_in
ISE optimizes the circuit away only when Register Balancing is enabled (even though this optimization is not actually retiming-related):
- WARNING:Xst:2677 - Node <dat_buf> of sequential type is unconnected in block <seq_opt>.
Vivado keeps the circuit completely intact:
- Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
- --------------------------------------------------- -------------------
- (clock clk rise edge) 0.000 0.000 r
- N13 0.000 0.000 r clk
- net (fo=0) 0.000 0.000 clk
- N13 IBUF (Prop_ibuf_I_O) 1.012 1.012 r clk_IBUF_inst/O
- net (fo=1, routed) 2.016 3.028 clk_IBUF
- BUFGCTRL_X0Y0 BUFG (Prop_bufg_I_O) 0.096 3.124 r clk_IBUF_BUFG_inst/O
- net (fo=4, routed) 1.722 4.846 clk_IBUF_BUFG
- SLICE_X5Y54 r dat_buf2n_reg/C
- --------------------------------------------------- -------------------
- SLICE_X5Y54 FDRE (Prop_fdre_C_Q) 0.456 5.302 r dat_buf2n_reg/Q
- net (fo=1, routed) 0.280 5.582 dat_buf2n
- SLICE_X4Y54 LUT2 (Prop_lut2_I0_O) 0.124 5.706 r dat_o_reg_i_1/O
- net (fo=1, routed) 0.000 5.706 n_0_dat_o_reg_i_1
- SLICE_X4Y54 FDRE r dat_o_reg/D
========= Test Case 4. Register Replication Test (Logical) =========
This example contains a path from one register to 256 registers. Clearly, replicating this register several times can greatly improve timing. The RTL is:
- module reg_dup(input clk,dat_i,output reg [255:0] dat_o);
- reg dat_buf,dat_buf2;
- always@(posedge clk) dat_buf <= dat_i;
- always@(posedge clk) dat_buf2 <= dat_buf;
- always@(posedge clk) dat_o <= {({255{dat_buf2}} & dat_o[254:0]),dat_buf2};
- endmodule
In the SDC/XCF/UCF, clk is declared as a 1 GHz clock.
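In this RTL dat_buf2 feeds all 256 bits of dat_o (bit 0 directly, bits 1 to 255 through the AND), so its fanout is 256. The sketch below estimates the worst-case fanout per driver after replication, assuming the original register is kept and the loads are split as evenly as possible; both assumptions and the name `loads_per_driver` are mine, and real tools may partition loads differently.

```python
def loads_per_driver(total_loads: int, replicas: int) -> int:
    """Worst-case loads on any one driver with an even split over original + replicas."""
    if total_loads < 0 or replicas < 0:
        raise ValueError("total_loads and replicas must be non-negative")
    drivers = replicas + 1  # the original register plus its copies
    return -(-total_loads // drivers)  # ceiling division


if __name__ == "__main__":
    # Sample replica counts; 0 means no replication at all.
    for replicas in (0, 2, 16):
        print(f"{replicas:2d} replicas -> {loads_per_driver(256, replicas)} loads per driver")
```

Under these assumptions, two replicas still leave about 86 loads per driver, while 16 replicas bring it down to about 16.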
Synplify does nothing when Physical Synthesis is off; with it enabled, dat_buf2 is replicated only twice:
- Starting Points with Worst Slack
- ********************************
- Starting
- Instance Reference Type Pin Net
- Clock
- ------------------------------------------------------------------
- dat_buf2_rep_0 clk FD Q dat_buf2_0_rep_0
- dat_o[23] clk FDR Q dat_o_0[23]
- dat_buf2 clk FD Q dat_buf2_0
- dat_buf2_rep_1 clk FD Q dat_buf2_0_rep_1
Quartus also does nothing when physical synthesis is off; with it enabled, dat_buf2 is duplicated 16 times:
- Node Action Operation Destination Node
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_1
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_3
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_5
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_7
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_9
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_11
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_13
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_15
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_17
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_19
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_21
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_23
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_25
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_27
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_29
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_31
ISE (XST) synthesis result:
- FlipFlop dat_buf2 has been replicated 17 time(s)
Vivado does nothing regardless of the settings; even setting the physical synthesis strategy to AggressiveFanoutOpt has no effect:
- Phase 9 Very High Fanout Optimization
- INFO: [Physopt 32-76] Pass 1. Identified 1 candidate net for fanout optimization.
- INFO: [Physopt 32-29] End Pass 1. Optimized 0 net. Created 0 new instance.
========= Test Case 5. Register Replication Test (Physical) =========
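This example contains a path from one register to 2 registers, but those 2 registers are placed far apart, so an analysis based on logical fanout cannot identify this path. Only by reading placement information can a tool apply register replication to it, that is, duplicate the original register into two copies, one moved toward each of the 2 destination registers. The RTL is as follows (dat_o[0] and dat_o[1] are also packed into IOBs and placed far apart):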
- module reg_dup_phy(input clk,dat_i,output reg [1:0] dat_o);
- reg dat_buf,dat_buf2;
- always@(posedge clk) dat_buf <= dat_i;
- always@(posedge clk) dat_buf2 <= dat_buf;
- always@(posedge clk) dat_o <= {dat_buf2,~dat_buf2};
- endmodule
In the SDC/XCF/UCF, clk is declared as a 1 GHz clock.
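The benefit of placement-aware replication can be illustrated with a toy model: one driver sitting between two distant sinks versus one copy next to each sink. The grid coordinates and the use of Manhattan distance as a stand-in for routing delay are assumptions of this sketch, not data from any of the tools.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) grid sites."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def worst_wire(driver_sites, sink_sites):
    """Longest driver-to-sink distance when each sink uses its nearest driver."""
    return max(min(manhattan(d, s) for d in driver_sites) for s in sink_sites)


# Hypothetical locations of the two IOB registers dat_o[0] and dat_o[1].
SINKS = [(0, 0), (100, 100)]
SINGLE = [(50, 50)]               # one dat_buf2 placed midway
DUPLICATED = [(1, 0), (99, 100)]  # one copy of dat_buf2 beside each IOB

if __name__ == "__main__":
    print("single driver:", worst_wire(SINGLE, SINKS))
    print("duplicated   :", worst_wire(DUPLICATED, SINKS))
```

With a fanout of only 2, nothing in the netlist hints at a problem; the long wire only becomes visible once the sink locations are known.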
Synplify shows no reaction with either Physical Synthesis or Physical Plus.
Quartus optimization result:
- Node Action Operation Destination Node
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_1
- dat_buf2 Duplicated Physical Synthesis dat_buf2~_Duplicate_3
- dat_buf2 Deleted Physical Synthesis
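ISE and Vivado show no reaction either.
It looks like the 500+ person-years Xilinx spent developing Vivado were wasted.
Could the OP synthesize a benchmark suite such as MCNC or ISCAS with each tool and then compare the maximum operating frequency of the resulting circuits? That might give a comparison closer to real-world use.
The vendors themselves now run OpenCores projects to compete against each other; I hope a benchmark like that becomes available.
By this comparison Altera wins outright. If the OP has time, it would be worth evaluating the synthesis tool for CME's Chinese-made FPGAs.
Download page for the CME (京微雅格) tools: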
http://www.capital-micro.com/download_software_p4.htm
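Notes:
1. FTP username: ftp_temp, FTP password: hello123;
2. For a license, please request one by replying to this thread!
Where can the synthesis tool for CME's Chinese-made FPGAs be downloaded? Please post a link.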
Could the OP also add Mentor Precision to the comparison?
Few people use this tool. I ran the five cases from the top post on Precision RTL Plus 2012c.14 (Virtex 5); the results are:
- +---------------------------------------------------------------------------+
- | FPGA Tools Synthesis QoR Benchmark @2013 |
- +---------------------------------------------------------------------------+
- | | in2reg Opt | reg2reg Opt | Seq. Opt | Reg Rep(Log)| Reg Rep(Phy)|
- |---------+-----------------------------------------------------------------+
- | Synplify| Excellent | Excellent | Excellent| Average | Poor |
- |---------+-----------------------------------------------------------------+
- | Quartus | Good | Good | Excellent| Good | Excellent |
- |---------+-----------------------------------------------------------------+
- | ISE | Poor | Poor | Good | Excellent | Poor |
- |---------+-----------------------------------------------------------------+
- | Vivado | Poor | Poor | Poor | Poor | Poor |
- +---------------------------------------------------------------------------+
- |Precision| Average | Average | Poor | Good | Poor |
- +---------------------------------------------------------------------------+
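In cases 1 and 2, Precision converges to LUT4 + MUXCY rather than the shortest option, a single LUT4_L. In case 3, nothing happens even with Retiming and Xilinx Advanced Sequential Optimization enabled. In case 4, enabling physical synthesis replicates dat_buf2 41 times. In case 5, nothing happens.
Note that the Retiming in Precision's Physical Synthesis is very aggressive (unsafe). To keep the results comparable, map_only was applied to some registers when running physical synthesis (for example dat_o in the first case). In addition, its SDC does not support syntax such as get_cells, and its register renaming is unusual: dat_buf[127] has to be written as reg_dat_buf(127) in the constraints.
Mentor always brings up DO-254 when presenting Precision, so in principle Precision should not be that aggressive. I only run this tool occasionally, and it strikes me as a middle-of-the-road tool.
I have been learning FPGAs for over a year, but much of what is in the OP's posts is new to me and I do not fully understand it. I hope the OP can recommend a few books or study methods to point beginners like me in the right direction. Many thanks!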
Learned something.
Studying this.
Experts, please help me.
jordar bhura
Very nice!
Thanks, OP.
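Good stuff, bookmarked. I happen to be writing a synthesis tool.
Bookmarked.
Nice material, really practical. Thanks for sharing.
The performance of Xilinx's synthesis tools looks painful to watch.
Even Vivado is this weak.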