Суперкомпьютерные дни в России 2015 // Russian Supercomputing Days 2015 // RussianSCDays.org

Achieving Record GreenGraph500 Results with FPGA-Based Computing Systems: Theory and Practice

A.D. Sizov, S. Elizarov
Lomonosov Moscow State University

Recent worldwide advances in energy-efficient field-programmable gate arrays (FPGAs), extensive experience in using reconfigurable special-purpose processors in application-specific supercomputers, and the already demonstrated ability to build FPGA-based memory controllers and communication processors with ultra-low latency suggest that computing systems with record GreenGraph500 results can be built on this hardware today. This paper discusses the requirements for an FPGA-based computing system with external memory as applied to breadth-first search (BFS) on a graph, taking into account existing experience and the properties of the best existing parallel BFS algorithms. A real compute node containing a Kintex UltraScale FPGA with four RLDRAMIII memory controllers is considered. The performance of a system of 32 such nodes is estimated, its energy efficiency is calculated according to the GreenGraph500 criteria, and recommendations for further hardware optimization are given.

1. Introduction

Graph500 is a worldwide ranking of supercomputers designed for problems that involve processing large graphs. The systems are ranked by BFS, a breadth-first search in an undirected sparse graph. This benchmark mostly loads the communication subsystem and the memory controllers, because the algorithm works with a large volume of irregular data, in contrast to Top500, which targets floating-point computation on the HPL Linpack benchmark.
In addition to Top500, the Green500 list, which ranks the energy efficiency of computing systems on the Linpack benchmark, is in high demand. GreenGraph500, proposed in 2012, combines the two approaches and ranks Graph500 systems by performance in GTEPS (10^9 traversed edges per second) per watt of power consumption. The importance of this benchmark is hard to overstate, because energy efficiency and the speed of working with very large volumes of irregular data are the main requirements for the supercomputers and data centers of the future [1].

Current experience shows that one of the most successful approaches to building custom application-specific computing systems with maximum energy efficiency is the use of special accelerators based on programmable logic (FPGAs) [2]. On the other hand, the critical factor that limits performance on graph problems is the speed of random memory access. It has been shown [3] that traditional CPU/GPU architectures, which are oriented toward block access to external memory and combine deep instruction pipelines with several levels of data caching, lose 1-2 orders of magnitude in performance on BFS-type problems. An FPGA, however, can implement specialized memory controllers that are practically free of this drawback, as in the Convey MX-100 system, which is in the top hundred of the Graph500 list [4]. The performance of the memory subsystem can be increased further by moving to architectures other than DDR3/DDR4 [6]. Another subsystem that determines BFS performance is the communication network that connects the compute modules [7]. The fastest and lowest-latency communication networks for FPGA-based application-specific systems are known to be built on multi-gigabit transceivers and commercially available PCIe switches [8].
When proposing an FPGA as the main compute node, one has to take into account the known drawbacks of FPGAs relative to CPU/GPU-based nodes. The nominal operating frequency of an FPGA is 300-600 MHz, which is 5-10 times lower than that of modern commercial processors. The amount of fast memory located directly on the FPGA die is limited to 1-10 MB, so large problems cannot be solved on an FPGA without external memory. The price of top-end FPGAs is an order of magnitude higher than that of comparable CPUs/GPUs. In addition, building an application-specific computing system on FPGAs requires, for every particular problem, designing and debugging the processing logic in a hardware description language, which is an order of magnitude harder than writing a program for traditional architectures in a high-level language.

This paper analyzes the literature and the hardware requirements for building top-ranked GreenGraph500 solutions. The parameters of an optimal configuration are calculated, and recommendations are given for building application-specific systems for graph problems of various sizes. The applicability of the breadth-first search algorithm developed for this system is analyzed. The performance of a single node on the BFS algorithm is to be determined by simulating the operation of a real FPGA.

2. Shared memory

As stated above, BFS involves many random accesses to the shared memory of the whole computing system. It is shown in [3] that the peak performance of memory controllers in traditional CPU/GPU architectures, which are designed for block reads, is reached only with large data blocks of 4 KB or more and drops by orders of magnitude when individual machine words are read. The graphs processed by BFS algorithms range in size from gigabytes to petabytes, while each read request in BFS operates on a few machine words (4 or 8 bytes per word) and the request addresses are practically random, so efficient reading in large blocks is impossible.
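As a rough illustration of this mismatch, the Python sketch below computes what share of a transfer is useful when only one machine word per transfer is needed. The 4 KB block and the 4/8-byte words come from the text; the 64-byte transfer size (a typical cache line) and the function name are assumptions added for comparison only.

```python
def useful_fraction(word_bytes: int, transfer_bytes: int) -> float:
    """Share of a transfer that carries requested data when only one
    machine word per transfer is needed (fully random addresses)."""
    if word_bytes <= 0 or transfer_bytes < word_bytes:
        raise ValueError("a transfer must hold at least one word")
    return word_bytes / transfer_bytes


# 4096 B is the block size named in the text;
# 64 B is an assumed cache-line size used only for comparison.
for transfer in (64, 4096):
    for word in (4, 8):
        share = useful_fraction(word, transfer)
        print(f"{word}-byte word in a {transfer}-byte transfer: {share:.2%} useful")
```

Under these assumptions, less than 1% of a 4 KB block carries requested data, which is consistent with the loss of orders of magnitude reported in [3].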
Thus, the memory-controller architecture of classical CPU/GPU systems is the factor that limits overall system performance on the BFS benchmark. This suggests that moving to application-specific memory controllers, which can reach maximum throughput on random-address accesses, is one of the promising directions in building application-specific computing systems for graph problems.

The performance of the memory subsystem can also be increased by using other types of RAM. For example, the random-read performance of RLDRAMIII (reduced-latency DRAM) is 2-3 times higher than that of comparable DDR3 [6].

3. Communication subsystem

It is shown in [7] that in multi-node computing systems the performance of the BFS algorithm is determined by the communication subsystem that exchanges data between compute nodes, so reducing the amount of transferred data considerably improves the performance of the system as a whole. The feasibility and efficient scalability of a system of several FPGAs whose communication subsystem is built on multi-gigabit transceivers is shown in [8]. A commercially available network of this type is the packet-switched PCIe network with a star topology, which under the PCIe Gen3 standard reaches a throughput of up to 16 GB/s in duplex mode.

4. BFS design for the application-specific computing system

    for (i = 0; i < size(V); i++)
        lvl[i] = Inf;
    lvl[s] = 0;
    write_to_bfs_queue(n, s);              // write s to the queue on chip n
    // On every chip, on every level:
    while (Q is not empty)
        for (all u in Q)                   // 1 read
            for (all v in CSR[u])          // 3 reads
                if (v located in local_mem)
                    if (lvl[v] > lvl[u] + 1)   // 2 reads
                        d[v] = u;              // write
                        lvl[v] = lvl[u] + 1;   // write
                        // add v into local queue
                        write_to_bfs_queue(local, v);
                else
                    // send remote check request
                    write_to_check_queue(n, v);

Fig. 1. Design of the distributed breadth-first search algorithm for the application-specific computing system

An FPGA-based system needs a multithreaded algorithm in which only local reads are allowed, global synchronization does not require a large number of transfers, and the capabilities of the FPGA and the memory subsystem are used to the fullest. A design of such an algorithm is shown in Fig. 1. Initially, the vertices of the graph are partitioned among the nodes so that the edges belonging to the vertices processed on a given node reside in the local memory of the corresponding FPGA. The memory of each node also holds the local part of the frontier. The graph is stored in Compressed Sparse Row (CSR) format. Each local frontier at a given search level is processed independently. If an edge being processed connects a local vertex to a vertex whose data are stored in remote memory, information about that vertex is sent to the remote compute node, where it is processed further. Conflicts between threads within a node are resolved by atomic operations and full/empty bits on data cells, both implemented in hardware at the memory-controller level.

5. Performance estimate for the FPGA-based application-specific computing system

5.1. Node performance estimate

The proposed system consists of 32 compute nodes connected by a communication subsystem of multi-gigabit transceivers running the PCIe Gen3 x4 protocol. Each compute node consists of a Kintex UltraScale XCKU095 FPGA with a capacity of 940 thousand LUTs running at 660 MHz and four external RLDRAMIII memory controllers running at 800 MHz, with 64 MB each.
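Returning briefly to the algorithm of Fig. 1, its behavior can be checked in software before it is committed to hardware. The Python sketch below is a single-process emulation written under several assumptions that are not in the paper: vertices are assigned to nodes by `v % num_nodes`, all function names are hypothetical, and the per-node memories are modeled by shared Python lists in which every node touches only the entries of vertices it owns. The remote message carries the requested vertex, the requesting vertex and its level, as described in Section 5.2.

```python
from collections import deque

INF = float("inf")


def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) for an undirected graph."""
    adj = [[] for _ in range(num_vertices)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    offsets, targets = [0], []
    for neighbours in adj:
        targets.extend(neighbours)
        offsets.append(len(targets))
    return offsets, targets


def distributed_bfs(num_vertices, edges, source, num_nodes=4):
    """Level-synchronous BFS emulating num_nodes chips with local-only reads."""
    offsets, targets = build_csr(num_vertices, edges)

    def owner(v):
        return v % num_nodes  # assumed partitioning rule

    lvl = [INF] * num_vertices
    parent = [None] * num_vertices
    queues = [deque() for _ in range(num_nodes)]  # local frontier of each node
    lvl[source] = 0
    parent[source] = source
    queues[owner(source)].append(source)
    remote_messages = 0

    def visit(v, u, level_u, next_queues):
        """Runs on the owner of v: claim v if it has not been reached yet."""
        if lvl[v] > level_u + 1:
            lvl[v] = level_u + 1
            parent[v] = u
            next_queues[owner(v)].append(v)

    while any(queues):
        next_queues = [deque() for _ in range(num_nodes)]
        check_queues = [[] for _ in range(num_nodes)]
        for node in range(num_nodes):
            for u in queues[node]:
                for v in targets[offsets[u]:offsets[u + 1]]:
                    if owner(v) == node:
                        visit(v, u, lvl[u], next_queues)
                    else:
                        # remote check request: (requested, requester, level)
                        check_queues[owner(v)].append((v, u, lvl[u]))
                        remote_messages += 1
        for node in range(num_nodes):
            for v, u, level_u in check_queues[node]:
                visit(v, u, level_u, next_queues)
        queues = next_queues
    return lvl, parent, remote_messages


example_edges = [(0, 1), (1, 2), (2, 3), (0, 4)]
levels, parents, sent = distributed_bfs(5, example_edges, source=0, num_nodes=2)
print(levels, parents, sent)
```

The returned message count gives a first idea of how much traffic a given partitioning generates. With `num_nodes=1` every edge is local and no messages are sent.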
The performance of such a compute node is estimated by comparison with existing FPGA-based computing systems from Convey, whose performance is known [4]. The Convey MX-100 system consists of four V6 HX565T compute FPGAs with a capacity of 565 thousand LUTs each, running at 550 MHz, and a memory subsystem of 32 DDR3 channels.

It was shown in [9] that the direction-optimizing algorithm reduces the number of processed graph edges to the size of a minimum spanning tree, that is, by a factor of 16 for a graph with a density of 16 edges per vertex. The storage scheme used in [4] reduces the number of read requests per processed edge to three, so the required memory throughput of the MX-100 system can be estimated as 14.6 [5] / 16 × 3 = 2.7 GR/s (GR/s is 10^9 reads per second). According to [6], the random-read performance of the DDR3 memory used in the Convey system can be estimated at 0.6 GR/s for 32 channels. The reordering of memory requests considered in [10] increases read performance by a factor of 4-5 relative to the random-read rate. Our simulation of the RLDRAMIII controller showed that the performance of the proposed memory subsystem is 140 × 16 = 2.24 GR/s on random reads and 750 × 16 = 11.5 GR/s on sequential reads with a block size of 18 bytes, which makes a memory subsystem with a throughput of up to 10 GR/s realistic. Judging by memory-subsystem performance alone, the proposed system will then be able to process 10 / 2.7 = 3.7 times more edges per second than the MX-100. Consequently, assuming linear scaling of the compute logic, the performance of a single FPGA can be estimated as (14.6 / 4) × (940 thousand LUTs × 660 MHz) / (565 thousand LUTs × 550 MHz) = 7.3 GTEPS.

5.2. Scalability of the system

Section 4 showed that when an edge connects a local vertex to a vertex residing in remote memory, a request for remote processing of that vertex is sent over the interconnect. The request carries 8 bytes of payload: the index of the requested vertex, the index of the requesting vertex, and its level. With vertices distributed uniformly among the nodes, and taking the maximum possible performance of a single node estimated in the previous section, the required throughput can be calculated as 8 bytes × (7.6 GTEPS / 16) = 3.8 GB/s for a system with a sufficiently large number of compute nodes. In the proposed system the compute nodes are connected by a network built on PCIe Gen3 x4 switches, whose write throughput from one node to another is 4 GB/s. It is known [11], however, that PCIe throughput for 8-byte messages is approximately 30% of the maximum, and more than 90% for messages of 100 bytes or more. This means that to load the FPGA fully it is necessary either to move to a wider PCIe link or to aggregate remote-write messages. These optimizations will allow a 4-node system to scale almost linearly. In the prototype under consideration, however, the 4-node blocks are connected by PCIe Gen3 x8. Repeating the same calculation, the performance of the 32-node system can be estimated as 7.6 GTEPS × 32 nodes × (8 GB/s / 4) / 3.8 GB/s = 128 GTEPS, or 200 MTEPS/W (assuming a power consumption of 20 W per FPGA).

5.3. Comparison with existing devices

By the size of the BFS problem it solves, the system under consideration falls into the Small Data section of the GreenGraph500 classification. In the July 2015 edition, the top ten of this section consists of 4 systems with an original architecture and 6 systems based on microprocessors for mobile phones and tablets.
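The estimates of Sections 5.1 and 5.2 are plain arithmetic and can be re-derived with the short Python sketch below, which also makes it easy to vary an input such as the link width or the power per FPGA. All function and parameter names are illustrative assumptions; the numbers are the ones quoted in the text, including the 7.6 GTEPS per-node figure used in the formulas of Section 5.2.

```python
import math

MX100_GTEPS = 14.6   # Convey MX-100 result quoted from [5]
EDGE_REDUCTION = 16  # direction optimization, 16 edges per vertex [9]
READS_PER_EDGE = 3   # storage scheme of [4]


def mx100_read_rate():
    """Memory read rate needed by MX-100, in GR/s (Section 5.1)."""
    return MX100_GTEPS / EDGE_REDUCTION * READS_PER_EDGE


def node_gteps(luts_k=940, mhz=660, ref_luts_k=565, ref_mhz=550, ref_chips=4):
    """Per-FPGA estimate, scaled linearly by LUT count and clock."""
    return MX100_GTEPS / ref_chips * (luts_k * mhz) / (ref_luts_k * ref_mhz)


def required_link_gbs(node=7.6, msg_bytes=8):
    """Outgoing request traffic of one node, in GB/s (Section 5.2)."""
    return msg_bytes * node / EDGE_REDUCTION


def small_message_link_gbs(peak_gbs=4.0, efficiency=0.30):
    """Usable PCIe Gen3 x4 bandwidth for 8-byte messages, per [11]."""
    return peak_gbs * efficiency


def messages_per_packet(target_bytes=100, msg_bytes=8):
    """Requests to aggregate so that a packet reaches 100 bytes."""
    return math.ceil(target_bytes / msg_bytes)


def system_gteps(node=7.6, nodes=32, link_gbs=8.0, nodes_per_link=4,
                 required_gbs=3.8):
    """System estimate limited by the PCIe Gen3 x8 link shared by 4 nodes."""
    return node * nodes * (link_gbs / nodes_per_link) / required_gbs


def mteps_per_watt(gteps, nodes=32, watts_per_fpga=20):
    """Energy efficiency, counting only the FPGAs as in the text."""
    return gteps * 1000 / (nodes * watts_per_fpga)


total = system_gteps()
print(f"MX-100 read rate:          {mx100_read_rate():.2f} GR/s")
print(f"Node estimate (5.1):       {node_gteps():.2f} GTEPS")
print(f"Required link bandwidth:   {required_link_gbs():.2f} GB/s")
print(f"x4 link, 8-byte messages:  {small_message_link_gbs():.2f} GB/s")
print(f"Requests per 100-byte pkt: {messages_per_packet()}")
print(f"32-node system:            {total:.1f} GTEPS")
print(f"Energy efficiency:         {mteps_per_watt(total):.0f} MTEPS/W")
```

Comparing `small_message_link_gbs()` with `required_link_gbs()` shows directly why the text calls for either a wider link or message aggregation.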
The second and third tens of the GreenGraph500 leaders are almost entirely occupied by SMP systems with one, two or four top-end Intel Sandy Bridge x86 processors. Almost all leaders use the highly optimized algorithms of the GraphCREST group [9]. Entering the top ten requires an energy efficiency of about 130 MTEPS/W, which, as shown above, can be reached by the system in the configuration considered in this paper, provided that its FPGA implementation of the BFS algorithm is as efficient as that of the Convey systems.

Table 1. Comparison of the proposed application-specific computing system with existing solutions.

Parameter               Convey        Xperia Z1   Fermi GPU   Cray XE6      Intel SB EP   Proposed system,
                        MX-100 [12]   [13]        [14]        Hopper [15]   [13]          32 nodes
Possible graph size     29            20          20          31            28            22
Result, GTEPS           14.6          1.03        0.63        62            28.61         128
Result, MTEPS/W         146           235         2.6         0.15          61.48         200
GreenGraph500 position  8 SD          2 SD        29 SD       19 BD         1 BD          5 SD
Graph500 position       79            153         171         54            70            46

SD: Small Data category; BD: Big Data category.

Note that almost all systems in the Small Data section have a single node and cannot be scaled, because they are either inherently single-processor (systems based on Snapdragon and similar chips) or use non-scalable solutions (an SMP system with four Intel Sandy Bridge processors). In contrast, the proposed system is multi-node and already contains 32 compute nodes. It falls into the Small Data section only because of a property of the RLDRAMIII memory used (small module capacity). In the Big Data class, entering the top ten requires an energy efficiency of about 20 MTEPS/W, which should be reachable after moving to higher-capacity memory modules such as DDR3/4.

6. Conclusion

A wide class of problems that require irregular access to large and very large volumes of data, including breadth-first search on a graph, can be solved efficiently on an array of FPGAs equipped with memory controllers and connected by a PCIe communication bus.
With joint optimization of the search algorithm, the number and type of memory controllers, and the parameters of the communication bus, systems with record GreenGraph500 results can be developed.

References

1. Francesquini, Emilio and Castro et al. // On the energy efficiency and performance of irregular application executions on multicore, NUMA and manycore platforms, Journal of Parallel and Distributed Computing, 2014, Elsevier.
2. Francisco, Phil and others // The Netezza data appliance architecture: a platform for high performance data warehousing and analytics, IBM Redbooks, 2011.
3. Agarwal, Virat and Petrini, Fabrizio and Pasetto, Davide and Bader, David A // Scalable graph exploration on multicore processors, Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, P. 1-11, 2010, IEEE Computer Society.
4. Attia, Osama G and Johnson, Tyler and Townsend, Kevin and Jones, Philip and Zambreno, Joseph // CyGraph: A Reconfigurable Architecture for Parallel Breadth-First Search, Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, P. 228-235, 2014, IEEE.
5. Graph500 List July 2015, URL: http://www.graph500.org/results_jul_2015
6. Avnet // Optimal Memory Interface Design with Xilinx 7 Series, X-fest 2012 presentation, 2012, URL: http://www.em.avnet.com/en-us/design/trainingandevents/Documents/X-FEST%202012%20PRESENTATIONS/xfest12_pdf_memory_v1_2_may15.pdf
7. Checconi, Fabio and Petrini, Fabrizio // Traversing Trillions of Edges in Real Time: Graph Exploration on Large-Scale Parallel Machines, Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, P. 425-434, 2014, IEEE.
8. Theodore Markettos, A and Fox, Paul J and Moore, Simon W and Moore, Andrew W // Interconnect for commodity FPGA clusters: standardized or customized?, Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, P. 1-8, 2014, IEEE.
9. Yasui, Yuichiro and Fujisawa, Katsuki and Goto, Keisuke // NUMA-optimized parallel breadth-first search on multicore single-node system, Big Data, 2013 IEEE International Conference on, P. 394-402, 2013, IEEE.
10. Jin, Zheming and Bakos, Jason D // Memory Access Scheduling on the Convey HC-1, 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines.
11. Understanding Performance of PCI Express Systems, Xilinx White paper, October 2014, URL: http://www.xilinx.com/support/documentation/white_papers/wp350.pdf
12. Convey // Convey MX Series Architectural Overview, White paper, URL: http://www.conveycomputer.com/files/5913/5266/3278/CONV-12-036.1MXarchOvrvwWeb.pdf
13. Yasui, Yuichiro and Fujisawa, Katsuki and Sato, Yukinori // Fast and energy-efficient breadth-first search on a single NUMA system, Supercomputing, P. 365-381, 2014, Springer.
14. Hong, Sungpack and Oguntebi, Tayo and Olukotun, Kunle // Efficient parallel graph exploration on multi-core CPU and GPU, Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, P. 78-88, 2011, IEEE.
15. Beamer, Scott and Buluç, Aydin and Asanović, Krste and Patterson, David // Distributed memory breadth-first search revisited: Enabling bottom-up search, Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, P. 1618-1627, 2013, IEEE.

How to reach GreenGraph500 top with FPGA-based supercomputer? Theory and practice

Anatoliy Sizov and Sergey Elizarov

Keywords: Field-Programmable Gate Array, hybrid computing systems, application-specific supercomputer, breadth-first search algorithm, low-latency communications processor, low-latency memory controller, GreenGraph500, graph

Worldwide achievements in energy-efficient Field-Programmable Gate Arrays (FPGAs), wide experience in the field of reconfigurable application-specific supercomputers, and new FPGA-based low-latency communications processors and memory controllers allow us to expect excellent efficiency from custom FPGA-based supercomputers on the GreenGraph500 benchmark. This paper discusses the implementation of the breadth-first search (BFS) algorithm on FPGA systems. A single node consisting of a Kintex UltraScale FPGA with 4 RLDRAMIII memory controllers and the BFS algorithm are examined. The performance of a 32-node system is estimated, and its energy efficiency is calculated using the GreenGraph500 criteria.