=Paper=
{{Paper
|id=Vol-1482/145
|storemode=property
|title=Достижение рекордных показателей в GreenGraph500 для вычислительных систем на ПЛИС. Теория и практика
(How to reach GreenGraph500 top with FPGA-based supercomputer? Theory and practice)
|pdfUrl=https://ceur-ws.org/Vol-1482/145.pdf
|volume=Vol-1482
}}
==Достижение рекордных показателей в GreenGraph500 для вычислительных систем на ПЛИС. Теория и практика
(How to reach GreenGraph500 top with FPGA-based supercomputer? Theory and practice)==
Суперкомпьютерные дни в России 2015 // Russian Supercomputing Days 2015 // RussianSCDays.org
Achieving record GreenGraph500 results with FPGA-based computing systems. Theory and practice
A.D. Sizov, S.G. Elizarov
Lomonosov Moscow State University
Recent worldwide advances in energy-efficient field-programmable gate arrays (FPGAs), extensive experience in applying reconfigurable special-purpose processors to application-specific supercomputers, and the already demonstrated feasibility of building FPGA-based memory controllers and communication processors with ultra-low latency suggest that computing systems with record GreenGraph500 results can be built on this hardware base today. The paper discusses the requirements for an FPGA-based computing system with external memory as applied to breadth-first search (BFS) on a graph, taking into account existing worldwide experience and the features of the best existing parallel BFS algorithms. A real compute node containing a Kintex UltraScale FPGA with four RLDRAMIII memory controllers is considered. The performance of a system of 32 such nodes is estimated, its energy efficiency is calculated by the GreenGraph500 criteria, and recommendations for further hardware optimization are given.
1. Introduction
Graph500 is a worldwide ranking of supercomputers designed for problems that involve processing large graphs. Systems are ranked by BFS, a breadth-first search in an undirected sparse graph. This benchmark mainly loads the communication subsystem and the memory controllers, because the algorithm works with large volumes of irregular data, in contrast to Top500, which targets floating-point computation on the HPL Linpack benchmark. Green500, an energy-efficiency ranking of computing systems on the Linpack benchmark, is a widely used complement to Top500. GreenGraph500, proposed in 2012, combines these approaches and ranks Graph500 systems by performance in GTEPS (10^9 traversed edges per second) per watt of power consumed. The importance of this benchmark is hard to overestimate, since energy efficiency and the speed of processing very large volumes of irregular data are the main requirements for future supercomputers and data centers [1].
Current experience shows that one of the most successful approaches to building custom application-specific computing systems with maximum energy efficiency is the use of special accelerators based on programmable logic (FPGAs) [2]. On the other hand, the critical factor limiting performance on graph problems is the speed of random memory access. It has been shown [3] that traditional CPU/GPU architectures, which are oriented toward block access to external memory and combine deep instruction pipelines with several levels of data caching, lose 1-2 orders of magnitude in performance on BFS-like problems. On an FPGA, however, it is possible to implement specialized memory controllers that are practically free of this drawback, as, for example, in the Convey MX-100 system, which ranks in the top hundred of Graph500 [4]. Memory subsystem performance can be increased further by moving to architectures other than DDR3/DDR4 [6]. Another subsystem that determines BFS performance is the communication network connecting the compute modules [7]. The fastest low-latency communication networks for FPGA-based application-specific systems are known to be built on multi-gigabit transceivers and commercially available PCIe switches [8].
When proposing an FPGA as the main compute node, one has to take into account the known drawbacks of FPGAs relative to CPU/GPU-based nodes. The nominal operating frequency of an FPGA is 300-600 MHz, which is 5-10 times lower than that of modern commercial processors. The fast memory located directly on the FPGA die is limited to 1-10 MB, so large problems cannot be solved on an FPGA without external memory. Top-end FPGAs cost an order of magnitude more than comparable CPUs/GPUs. In addition, building an FPGA-based application-specific system requires, for each particular problem, developing and debugging the compute engine in a hardware description language, which is an order of magnitude more complex than writing a program in a high-level language for a traditional architecture.
This paper analyzes the literature and the hardware requirements for building top-ranking GreenGraph500 solutions. The parameters of an optimal configuration are calculated, and recommendations are given for building application-specific systems for graph problems of various sizes. The applicability of the breadth-first search algorithm developed for this system is analyzed. The performance of a single node on the BFS algorithm is to be determined by simulating the operation of a real FPGA.
2. Shared memory
As noted above, BFS involves a large number of random accesses to the shared memory of the whole computing system. It is shown in [3] that memory controllers in traditional CPU/GPU architectures, which are designed for block reads, reach peak performance only with large data blocks of 4 KB or more, and that their performance drops by orders of magnitude when individual machine words are read. The graphs processed by BFS algorithms range in size from gigabytes to petabytes, while each BFS read request operates on a few machine words (4 or 8 bytes per word) and request addresses are practically random, so efficient reading in large blocks is impossible. The memory controller architecture of classical CPU/GPU systems is therefore a factor that limits overall system performance on the BFS benchmark. This suggests that moving to application-specific memory controllers that can reach maximum throughput on random-address accesses is a promising direction for building application-specific systems for graph problems.
Memory subsystem performance can also be increased by using other types of RAM; for example, the random-read performance of RLDRAMIII (reduced-latency DRAM) is 2-3 times higher than that of comparable DDR3 [6].
3. Communication subsystem
It is shown in [7] that in multi-node computing systems BFS performance is determined by the communication subsystem that exchanges data between compute nodes, so reducing the amount of transferred data significantly improves the performance of the system as a whole. The feasibility and efficient scalability of a system of several FPGAs whose communication subsystem is built on multi-gigabit transceivers is demonstrated in [8]. A commercially available network of this type is the PCIe packet network with a star topology, which under the PCIe Gen3 standard provides throughput of up to 16 GB/s in duplex mode.
4. BFS design for the application-specific system
 for (v = 0; v < size(V); v++)
     lvl[v] = Inf;
 lvl[s] = 0;
 write_to_bfs_queue(n, s);  // write source vertex s to the queue on chip n
 // On every chip, on every level:
 while (Q is not empty)
     for (all u in Q)                        // 1 read
         for (all v in CSR[u])               // 3 reads
             if (v located in local_mem)
                 if (lvl[v] > lvl[u] + 1)    // 2 reads
                     d[v] = u;               // write: parent of v
                     lvl[v] = lvl[u] + 1;    // write: level of v
                     // add v into local queue
                     write_to_bfs_queue(local, v);
             else
                 // send remote check request
                 write_to_check_queue(n, v);
Fig. 1. Design of the distributed breadth-first search algorithm for the application-specific system
An FPGA-based application-specific system needs a multithreaded algorithm in which only local reads are allowed, global synchronization does not require a large number of transfers, and the capabilities of the FPGA and the memory subsystem are used to the fullest. A design of such an algorithm is shown in Fig. 1. Initially, the vertices of the graph are partitioned among the nodes so that the edges belonging to the vertices processed on a given node reside in the local memory of the corresponding FPGA. The memory of each node also holds the local part of the frontier. The graph is stored in Compressed Sparse Row (CSR) format. Each local frontier at a given search level is processed independently. If an edge being processed connects a local vertex to a vertex whose data is stored in remote memory, information about that vertex is sent to the remote compute node, where it is then processed. Conflicts between threads within a compute node are resolved by atomic operations and full/empty bits on data cells, both implemented in hardware at the memory controller level.
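
The scheme in Fig. 1 can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the FPGA implementation: vertices are assigned to nodes round-robin (`v % num_nodes`), per-node queues are plain Python lists, remote check requests are delivered at the end of each level, and the names `build_csr` and `distributed_bfs` are hypothetical.

```python
from math import inf


def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) for an undirected graph."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    row_ptr = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_ptr[v + 1] = row_ptr[v] + degree[v]
    col_idx = [0] * row_ptr[-1]
    fill = row_ptr[:-1].copy()  # next free slot in each vertex's row
    for u, v in edges:
        col_idx[fill[u]] = v
        fill[u] += 1
        col_idx[fill[v]] = u
        fill[v] += 1
    return row_ptr, col_idx


def distributed_bfs(num_vertices, edges, source, num_nodes=4):
    """Level-synchronous BFS; each simulated node touches only vertices it owns."""
    row_ptr, col_idx = build_csr(num_vertices, edges)

    def owner(v):
        return v % num_nodes  # assumption: round-robin vertex partition

    lvl = [inf] * num_vertices
    parent = [None] * num_vertices
    lvl[source] = 0
    parent[source] = source
    frontier = [[] for _ in range(num_nodes)]
    frontier[owner(source)].append(source)

    while any(frontier):
        next_frontier = [[] for _ in range(num_nodes)]
        check_queue = [[] for _ in range(num_nodes)]  # incoming remote requests

        def visit(node, u, v, level):
            # Executed on the node that owns v.
            if lvl[v] > level:
                lvl[v] = level
                parent[v] = u
                next_frontier[node].append(v)

        for node in range(num_nodes):
            for u in frontier[node]:
                for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
                    if owner(v) == node:
                        visit(node, u, v, lvl[u] + 1)
                    else:
                        # Remote vertex: send (requested, requesting, level).
                        check_queue[owner(v)].append((v, u, lvl[u] + 1))
        for node in range(num_nodes):
            for v, u, level in check_queue[node]:
                visit(node, u, v, level)
        frontier = next_frontier
    return lvl, parent


if __name__ == "__main__":
    sample_edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5)]
    print(distributed_bfs(7, sample_edges, source=0))
```

Each request tuple mirrors the payload described in the text (requested vertex, requesting vertex, level); in hardware the two inner loops would run as concurrent threads guarded by atomic operations rather than sequentially.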
5. Performance estimate for the FPGA-based application-specific system
5.1. Node performance estimate
The proposed system consists of 32 compute nodes connected by a communication subsystem of multi-gigabit transceivers running the PCIe Gen3 x4 protocol. Each compute node comprises a Kintex UltraScale XCKU095 FPGA with a capacity of 940 thousand LUTs running at 660 MHz and four RLDRAMIII external memory controllers running at 800 MHz, with 64 MB each. The performance of this node is estimated by comparison with existing FPGA-based computing systems from Convey, whose performance is known [4]. The Convey MX-100 system consists of four V6 HX565T compute FPGAs with a capacity of 585 thousand LUTs running at 550 MHz and a memory subsystem of 32 DDR3 channels.
It was shown in [9] that the direction-optimizing algorithm reduces the number of processed graph edges to the size of a minimum spanning tree, that is, by a factor of 16 for a graph with a density of 16 edges per vertex. The storage scheme used in [4] reduces the number of read requests needed to process one edge to three, so the memory throughput required in the MX-100 system can be estimated as 14.6 GTEPS [5] / 16 × 3 = 2.7 GR/s (GR/s is 10^9 reads per second). According to [6], the random-read performance of the DDR3 memory used in the Convey system can be estimated at 0.6 GR/s for 32 channels. The reordering of memory access requests considered in [10] increases read performance by a factor of 4-5 relative to the random-read rate. Our simulation of the RLDRAMIII controller showed that the proposed memory subsystem delivers 140 × 16 = 2.24 GR/s on random reads and 750 × 16 = 11.5 GR/s on sequential reads with a block size of 18 bytes, which makes it reasonable to speak of a memory subsystem with throughput of up to 10 GR/s.
Then, judging by memory subsystem performance, the proposed system will be able to process 10/2.7 = 3.7 times more edges per second than MX-100. Consequently, assuming linear scalability of the compute engine, the performance of one FPGA can be estimated as (14.6/4) × (940 thousand LUTs × 660 MHz) / (565 thousand LUTs × 550 MHz) = 7.3 GTEPS.
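
The arithmetic of this subsection can be reproduced with a short script. It is a sketch that only restates the figures quoted above; the function and constant names are hypothetical, and the linear scaling by LUT count and clock frequency is the assumption made in the text, not a measured property.

```python
# Figures quoted in Section 5.1 (decimal points instead of commas).
MX100_GTEPS = 14.6     # MX-100 Graph500 result [5], four FPGAs in total
EDGE_REDUCTION = 16    # direction optimization on a 16-edges-per-vertex graph [9]
READS_PER_EDGE = 3     # read requests per edge with the storage scheme of [4]
PROPOSED_GRS = 10.0    # memory throughput assumed for the proposed node, GR/s


def required_reads_grs(gteps, edge_reduction, reads_per_edge):
    """Memory reads per second (GR/s) needed to sustain a given GTEPS figure."""
    return gteps / edge_reduction * reads_per_edge


def scale_by_logic(ref_gteps, luts_k, mhz, ref_luts_k, ref_mhz):
    """Scale a per-FPGA result linearly with LUT count and clock frequency."""
    return ref_gteps * (luts_k * mhz) / (ref_luts_k * ref_mhz)


mx100_reads = required_reads_grs(MX100_GTEPS, EDGE_REDUCTION, READS_PER_EDGE)
memory_ratio = PROPOSED_GRS / mx100_reads
node_gteps = scale_by_logic(MX100_GTEPS / 4, 940, 660, 565, 550)

print(f"MX-100 memory demand: {mx100_reads:.2f} GR/s")
print(f"Memory throughput ratio: {memory_ratio:.2f}")
print(f"Estimated single-FPGA performance: {node_gteps:.1f} GTEPS")
```

Without intermediate rounding the throughput ratio comes out near 3.65; the value 3.7 in the text follows from dividing by the rounded 2.7 GR/s.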
5.2. System scalability
As shown in Section 4, when an edge connects a local vertex to a vertex located in remote memory, a request for remote processing of that vertex is sent over the interconnect. The request carries 8 bytes of payload: the index of the requested vertex, the index of the requesting vertex, and its level. With vertices distributed evenly among the nodes, and taking the maximum achievable single-node performance estimated in the previous section, the required bandwidth for a system with a sufficiently large number of compute nodes is 8 bytes × (7.6 GTEPS / 16) = 3.8 GB/s. In the proposed system the compute nodes are connected by a network built on PCIe Gen3 x4 switches, whose write bandwidth from one node to another is 4 GB/s. However, it is known [11] that PCIe throughput is approximately 30% of the maximum for 8-byte messages and exceeds 90% for messages of 100 bytes or more. Fully loading the FPGA therefore requires either a wider PCIe link or a mechanism for aggregating remote-write messages. These optimizations will allow a 4-node system to reach practically linear scalability. In the prototype under consideration, however, the 4-node blocks are interconnected by PCIe Gen3 x8; by similar reasoning, the performance of the 32-node system can be estimated as 7.6 GTEPS × 32 nodes × (8 GB/s / 4) / 3.8 GB/s = 128 GTEPS, or 200 MTEPS/W (assuming a power consumption of 20 W per FPGA).
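
The scalability estimate can be checked numerically as follows. This is a sketch restating the numbers in the text; the names are hypothetical, and reading "8 GB/s / 4" as the share of the x8 link available to each of the four nodes in a block is an interpretation, not something the text states explicitly.

```python
# Figures quoted in Section 5.2.
NODE_GTEPS = 7.6              # single-node performance used in the text
EDGE_DIVISOR = 16             # divisor applied to GTEPS in the text's formula
REQUEST_BYTES = 8             # payload of one remote check request
NODES = 32
NODES_PER_BLOCK = 4
X4_LINK_GBS = 4.0             # PCIe Gen3 x4 write bandwidth between nodes
X8_LINK_GBS = 8.0             # PCIe Gen3 x8 link between 4-node blocks
SMALL_MSG_EFFICIENCY = 0.30   # about 30% of peak for 8-byte messages [11]
WATTS_PER_FPGA = 20


def required_bandwidth_gbs(node_gteps=NODE_GTEPS):
    """Outgoing bandwidth one node needs to stay fully loaded, in GB/s."""
    return REQUEST_BYTES * node_gteps / EDGE_DIVISOR


def system_gteps():
    """32-node estimate limited by the per-node share of the x8 link."""
    per_node_share = X8_LINK_GBS / NODES_PER_BLOCK  # assumed reading of "8 GB/s / 4"
    return NODE_GTEPS * NODES * per_node_share / required_bandwidth_gbs()


def mteps_per_watt():
    """Energy efficiency with 20 W assumed per FPGA."""
    return system_gteps() * 1000 / (NODES * WATTS_PER_FPGA)


effective_x4 = X4_LINK_GBS * SMALL_MSG_EFFICIENCY

print(f"Required bandwidth per node: {required_bandwidth_gbs():.1f} GB/s")
print(f"x4 link with 8-byte messages: {effective_x4:.1f} GB/s")
print(f"System estimate: {system_gteps():.0f} GTEPS, {mteps_per_watt():.0f} MTEPS/W")
```

With unaggregated 8-byte messages the x4 link delivers well under the required bandwidth, which is why the text calls for a wider link or message aggregation.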
5.3. Comparison with existing systems
By the size of the BFS problem it can solve, the system under consideration falls into the Small Data category of the GreenGraph500 classification. In the July 2015 edition, the top ten of this category consists of 4 systems with original architectures and 6 systems based on microprocessors for mobile phones and tablets. The second and third tens of the GreenGraph500 leaders are almost entirely occupied by SMP systems with one, two or four top-end Intel Sandy Bridge x86 processors. Practically all the leaders use the heavily optimized algorithms of the GraphCREST group [9]. Entering the top ten requires an energy efficiency of about 130 MTEPS/W, which, as shown above, can be reached by the system in the configuration considered in this paper, provided the FPGA implementation of the BFS algorithm is as efficient as in the Convey systems.
Table 1. Comparison of the proposed system with existing solutions.
{| class="wikitable"
! Parameter !! Convey MX-100 [12] !! Xperia Z1 [13] !! Fermi GPU [14] !! Cray XE6 Hopper [15] !! Intel SB EP [13] !! Proposed system, 32 nodes
|-
| Possible graph size || 29 || 20 || 20 || 31 || 28 || 22
|-
| Result, GTEPS || 14.6 || 1.03 || 0.63 || 62 || 28.61 || 128
|-
| Result, MTEPS/W || 146 || 235 || 2.6 || 0.15 || 61.48 || 200
|-
| GreenGraph500 rank and category (Small Data, SD, or Big Data, BD) || 8 SD || 2 SD || 29 SD || 19 BD || 1 BD || 5 SD
|-
| Graph500 rank || 79 || 153 || 171 || 54 || 70 || 46
|}
Note that practically all systems in the Small Data category have a single node and cannot be scaled, because they are either inherently single-processor (Snapdragon-based systems and similar) or use non-scalable solutions (an SMP system with four Intel Sandy Bridge processors). In contrast, the proposed system is multi-node and already contains 32 compute nodes. It is assigned to Small Data only because of a property of the RLDRAMIII memory used (small module capacity). In the Big Data class, entering the top ten requires an energy efficiency of about 20 MTEPS/W, which will evidently be reached after moving to higher-capacity memory modules such as DDR3/4.
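
As a consistency check on Table 1, the power implied by each GTEPS and MTEPS/W pair can be computed and compared with the Small Data top-ten threshold quoted above. This is a sketch: the dictionary simply restates the table, the helper names are hypothetical, and the implied power is a derived quantity rather than a figure reported in the sources.

```python
# (GTEPS, MTEPS/W, GreenGraph500 category) as listed in Table 1.
TABLE_1 = {
    "Convey MX-100": (14.6, 146, "SD"),
    "Xperia Z1": (1.03, 235, "SD"),
    "Fermi GPU": (0.63, 2.6, "SD"),
    "Cray XE6 Hopper": (62, 0.15, "BD"),
    "Intel SB EP": (28.61, 61.48, "BD"),
    "Proposed system, 32 nodes": (128, 200, "SD"),
}
SMALL_DATA_TOP10_MTEPS_W = 130  # threshold quoted in the text


def implied_power_watts(gteps, mteps_per_watt):
    """Power consumption implied by a performance and efficiency pair."""
    return gteps * 1000 / mteps_per_watt


def small_data_top10_candidates(table, threshold=SMALL_DATA_TOP10_MTEPS_W):
    """Small Data systems whose efficiency meets the top-ten threshold."""
    return [name for name, (_, eff, cat) in table.items()
            if cat == "SD" and eff >= threshold]


for name, (gteps, eff, _) in TABLE_1.items():
    print(f"{name}: about {implied_power_watts(gteps, eff):.0f} W")
print(small_data_top10_candidates(TABLE_1))
```

For the proposed system the implied power equals 32 FPGAs at 20 W each, which matches the assumption used in the scalability estimate.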
6. Conclusion
A broad class of problems that require irregular access to large and very large volumes of data, including breadth-first graph search, can be solved efficiently on an array of FPGAs equipped with memory controllers and connected by a PCIe communication bus. With joint optimization of the search algorithm, the number and type of memory controllers, and the parameters of the communication bus, systems with record GreenGraph500 results can be built.
References
1. Francesquini, Emilio and Castro et al. // On the energy efficiency and performance of irregular application executions on multicore, NUMA and manycore platforms, Journal of Parallel and Distributed Computing, 2014, Elsevier.
2. Francisco, Phil and others // The Netezza data appliance architecture: a platform for high performance data warehousing and analytics, IBM Redbooks, 2011.
3. Agarwal, Virat and Petrini, Fabrizio and Pasetto, Davide and Bader, David A // Scalable graph exploration on multicore processors, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, P. 1-11, 2010, IEEE Computer Society.
4. Attia, Osama G and Johnson, Tyler and Townsend, Kevin and Jones, Philip and Zambreno, Joseph // CyGraph: A Reconfigurable Architecture for Parallel Breadth-First Search, Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, P. 228-235, 2014, IEEE.
5. Graph500 List July 2015, URL: http://www.graph500.org/results_jul_2015
6. Avnet // Optimal Memory Interface Design with Xilinx 7 Series, Xfest-2012 presentation, 2012, URL: http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-FEST%202012%20PRESENTATIONS/xfest12_pdf_memory_v1_2_may15.pdf
7. Checconi, Fabio and Petrini, Fabrizio // Traversing Trillions of Edges in Real Time: Graph Exploration on Large-Scale Parallel Machines, Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, P. 425-434, 2014, IEEE.
8. Theodore Markettos, A and Fox, Paul J and Moore, Simon W and Moore, Andrew W // Interconnect for commodity FPGA clusters: standardized or customized?, Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, P. 1-8, 2014, IEEE.
9. Yasui, Yuichiro and Fujisawa, Katsuki and Goto, Keisuke // NUMA-optimized parallel breadth-first search on multicore single-node system, Big Data, 2013 IEEE International Conference on, P. 394-402, 2013, IEEE.
10. Jin, Zheming and Bakos, Jason D // Memory Access Scheduling on the Convey HC-1, 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines.
11. Understanding Performance of PCI Express Systems, Xilinx White paper, October 2014, URL: http://www.xilinx.com/support/documentation/white_papers/wp350.pdf
12. Convey // Convey MX Series Architectural Overview, White paper, URL: http://www.conveycomputer.com/files/5913/5266/3278/CONV-12-036.1MXarchOvrvwWeb.pdf
13. Yasui, Yuichiro and Fujisawa, Katsuki and Sato, Yukinori // Fast and energy-efficient breadth-first search on a single NUMA system, Supercomputing, P. 365-381, 2014, Springer.
14. Hong, Sungpack and Oguntebi, Tayo and Olukotun, Kunle // Efficient parallel graph exploration on multi-core CPU and GPU, Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, P. 78-88, 2011, IEEE.
15. Beamer, Scott and Buluç, Aydin and Asanović, Krste and Patterson, David // Distributed memory breadth-first search revisited: Enabling bottom-up search, Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, P. 1618-1627, 2013, IEEE.
How to reach GreenGraph500 top with FPGA-based
supercomputer? Theory and practice
Anatoliy Sizov and Sergey Elizarov
Keywords: Field-Programmable Gate Array, hybrid computing systems, application-specific supercomputer, breadth-first search algorithm, low-latency communications processor, low-latency memory controller, GreenGraph500, graph
Worldwide advances in energy-efficient Field-Programmable Gate Arrays (FPGAs), broad experience in the field of reconfigurable application-specific supercomputers, and new FPGA-based low-latency communications processors and memory controllers allow us to expect excellent efficiency from custom FPGA-based supercomputers on the GreenGraph500 benchmark. This paper discusses the implementation of the breadth-first search (BFS) algorithm on FPGA systems. A single node consisting of a Kintex UltraScale FPGA with 4 RLDRAMIII memory controllers is examined together with the BFS algorithm. The performance of a 32-node system is estimated, and its energy efficiency is calculated using the GreenGraph500 criteria.