avva: (Default)
[personal profile] avva
Эту запись имеет смысл читать только тем, кто знает язык программирования C.

Мне его выдал форчун, уже не в первый раз, и я решил на этот раз записать. Если вам
нечего делать, разгадывайте, что делает этот код с 32-битным числом n:

n = ((n >> 1) & 0x55555555) | ((n << 1) & 0xaaaaaaaa);
n = ((n >> 2) & 0x33333333) | ((n << 2) & 0xcccccccc);
n = ((n >> 4) & 0x0f0f0f0f) | ((n << 4) & 0xf0f0f0f0);
n = ((n >> 8) & 0x00ff00ff) | ((n << 8) & 0xff00ff00);
n = ((n >> 16) & 0x0000ffff) | ((n << 16) & 0xffff0000);


Update: [livejournal.com profile] 37 нашёл отличную ссылку по теме.

Date: 2001-09-07 08:25 pm (UTC)
From: [identity profile] 37.livejournal.com
strtoul - äà ÿ, êàê ÿ äóìàþ, ïðîñòî íå îòðåçàë åãî â àââèíîì êîäå. Õîòÿ, êîíå÷íî, ïðîâåðþ åùå ðàç, íî äåëàë ýòè ïðèìåðû èç îäíîãî øàáëîíà. Ê ñîæàëåíèþ, êîä îñòàëñÿ ó ìåíÿ íà ðàüîòå, çàíîâî íàáèâàòü åãî íå áóäó, ìîãó ïîñëàòü Âàì äëÿ ïðîâåðêè òîëüêî â ïîíåäåëüíèê (íà òîò ñëó÷àé, åñëè Âû ñîìíåâàåòåñü â äîáðîñîâåñòíîñòè ìîèõ óïðàæíåíèé:-))).  ñâîþ î÷åðåäü, ìîæåòå ïðèñëàòü ìíå ñâîè ïðèìåðû äëÿ ïðîâåðêè, íî ñðàçó ñêàæó, ÷òî ðàçíèöà áîëåå ÷åì â 20 ðàç(!) è, îñîáåííî, âî âòîðîì ñëó÷àå â 20 ðàç âûçûâàåò áîëüøîå ñîìíåíèå è ïîäîçðåíèå â ÷èñòî ìåòîäè÷åñêîé îøèáêå. Íå î÷åíü ïîíÿòíî Âàøå íåâåðèå â ñàìó âîçìîæíîñòü ýòîãî - óæ íå ñ÷èòàåòå ëè Âû, ÷òî êîëè÷åñòâî ñòðîê Ñ-êîäà îäíîçíà÷íî êîððåëèðóåò ñ äëèíîé åãî âûïîëíåíèÿ? Òîãäà åùå ðàç ïîñìîòðèòå íà àññåìáëåðíûé êîä (î strtoul, ðàçóìååòñÿ, çàáóäüòå).

Date: 2001-09-07 10:26 pm (UTC)
From: (Anonymous)
20 times difference indeed sounds excessive. I just checked and looks like in attempt to optimize your code I slowed it down twice. From quick look at the assembly you gave, difference should be 5-10 times.

Reason is that for branches Intel (and generally all modern CPUs) takes a performance penalty due to pipeline stalls and flushes. Also in your main
loop there is 1 instruction between branches where as potentially Intel can execute several instructions simultaneosly.

As for assembly from avva's code, it's linear and the only possible problem I see with it is that on Pentium IV shifts are slower and have higher latency than ALU operations. But then, there is 7 assembly shifts in his variant and 16 in yours.

You are right in a sense that this all is mostly compiler problem. Given your code I don't see any reason why compiler shouldn't be able to generate code which executes at most 2-3 times slower than
original tricky implementation.

Date: 2001-09-08 06:24 am (UTC)
From: [identity profile] 37.livejournal.com
Êàê çàêëþ÷åíèå. Âû ïðàâû: îñíîâíîå çëî - â branch. Èíòåë âûïîëíÿåò ýòî î÷åíü ìåäëåííî, ÷åãî ÿ, êîíå÷íî, íå ó÷åë â îöåíêàõ (ïðî ñåáÿ ñ÷èòàë, ÷òî if(...) something ñòîèò î÷åíü ãðóáî ãîâîðÿ â ñðåäíåì 1.5 îïåðàöèè, à îêàçàëîñü - â ñðåäíåì áîëüøå 2-õ, ò.å., êîãäà ñðàâíåíèå - èñòèíà, âûïîëíåíèå èäåò áûñòðåå, ÷åì êîãäà ïðîïóñêàåòñÿ ñëåäóþùèé îïåðàòîð). Êàê ïàðàäîêñ, ÿ äëÿ ïðèìåðà ïîñòðîèë âîîáùå ÷èñòî ëèíåéíûé äëèíþùèé êîä ïî òèïó:
n1 |= (n & 0x800000000) >> 31;
.....
È ýòîò îãðîìíûé (è â àññåáëåðå as well) êîä âûïîëíÿëñÿ âñåãî â 2.5 ðàçà ìåäëåííåå, ÷åì êîä àââû, è íàìíîãî áûñòðåå ÷åì åñëè èñïîëüçîâàòü êîìàíäû òèïà
if(n & 0x800000000) n1 |= 0x00000001;
 ïîíåäåëüíèê ïîïðîáóþ ðàäè èíòåðåñà íà íåñêîëüêèõ äðóãèõ òèïàõ ïðîöåññîðîâ.
Çà ñèì ïðîùàþñü, èáî è òàê íàïëîäèë óæå íåèìîâåðíîå êîëè÷åñòâî òåêñòà â ÷óæîì æóðíàëå.

December 2025

S M T W T F S
  123 4 56
78 9 10 11 1213
1415 1617181920
21 22 23 24 2526 27
28293031   

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Dec. 28th, 2025 01:29 pm
Powered by Dreamwidth Studios