Optimizing Code on CH32X035 (RISC-V)

I am currently trying to get the best performance out of this 48 MHz RISC-V part, in order to implement an HD6301 emulator (clocked at 4 MHz, i.e. an instruction clock of 1M ticks/sec).

Because of the RISC instruction set, you almost have to write the C with the compiler's expected output already in mind…

Take this code (if C != 0, branch to PC+2+depl, where depl is the signed displacement; otherwise just advance to PC+2; either way, add 3 to the tick counter):

void op_bcs(void)
{
    if (regs.CC_C != 0)
        regs.PC += 2 + (signed char)rom[(regs.PC + 1) & 0xFFF];
    else
        regs.PC += 2;
    TICKS(3);
}

It produces the following code (18 instructions, 46 bytes):

00002688 <op_bcs>:
    2688:	38c18793          	addi	a5,gp,908 # 2000100c <regs>
    268c:	4398                	lw	a4,0(a5)
    268e:	01d7c603          	lbu	a2,29(a5)
    2692:	00270693          	addi	a3,a4,2
    2696:	ca19                	beqz	a2,26ac <op_bcs+0x24>
    2698:	0705                	addi	a4,a4,1		// PC+1
    269a:	0752                	slli	a4,a4,0x14	// <<20 (And FFF)
    269c:	661d                	lui	a2,0x7
    269e:	8351                	srli	a4,a4,0x14	// >>20
    26a0:	cc060613          	addi	a2,a2,-832 # 6cc0 <rom>
    26a4:	9732                	add	a4,a4,a2
    26a6:	00070703          	lb	a4,0(a4)
    26aa:	96ba                	add	a3,a3,a4
    26ac:	4f98                	lw	a4,24(a5)
    26ae:	c394                	sw	a3,0(a5)
    26b0:	070d                	addi	a4,a4,3
    26b2:	cf98                	sw	a4,24(a5)
    26b4:	8082                	ret
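
The slli/srli pair at 269a and 269e is how the compiler implements the & 0xFFF mask: andi only takes a 12-bit signed immediate (-2048..2047), so 0xFFF cannot be encoded directly, and shifting the upper 20 bits out and back costs just two compressed instructions. A minimal host-side sketch (the function names are mine, not from the emulator) showing that the two forms are equivalent:

```c
#include <stdint.h>

/* Mask written the way the C source has it. */
uint32_t mask_and(uint32_t x)
{
    return x & 0xFFFu;
}

/* Mask written the way the compiler emitted it:
   slli by 20 drops the upper 20 bits, srli by 20 brings the rest back. */
uint32_t mask_shift(uint32_t x)
{
    return (x << 20) >> 20;
}
```

Either way, the mask still costs instructions on the taken path, together with the lui/addi pair that builds the address of rom.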

In the end, though, I coded the BCC opcode (the same thing, only with bnez instead of beqz) like this, using a few tricks:

void op_bcc(void)
{
    // romendp1 points to ROM+0x1001, i.e. end+1; PCw is the int16_t part of PC
    u8 depl = (regs.CC_C == 0) ? regs.romendp1[regs.PCw] : 0;
    // Do the arithmetic on the full PC, to avoid an adjustment back to a 16-bit word
    regs.PC = regs.PC + 2 + (signed char)depl;
    TICKS(3);
}

Which produces the following (16 instructions, 40 bytes). Can it be shorter?

00002660 <op_bcc>:
    2660:	38c18793          	addi	a5,gp,908 # 2000100c <regs>
    2664:	01d7c703          	lbu	a4,29(a5)
    2668:	4601                	li	a2,0
    266a:	e719                	bnez	a4,2678 <op_bcc+0x18>
    266c:	00079683          	lh	a3,0(a5)
    2670:	4bd8                	lw	a4,20(a5)
    2672:	9736                	add	a4,a4,a3
    2674:	00070603          	lb	a2,0(a4)
    2678:	4398                	lw	a4,0(a5)
    267a:	4f94                	lw	a3,24(a5)
    267c:	0709                	addi	a4,a4,2
    267e:	9732                	add	a4,a4,a2
    2680:	068d                	addi	a3,a3,3
    2682:	c398                	sw	a4,0(a5)
    2684:	cf94                	sw	a3,24(a5)
    2686:	8082                	ret
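
Why the romendp1 trick works: assuming the 4 KB ROM image is mapped at 0xF000..0xFFFF (my assumption, consistent with the & 0xFFF mask in op_bcs but not stated above), the low 16 bits of PC read as an int16_t lie between -4096 and -1. Indexing back from a pointer placed 0x1001 bytes past the start of the image then lands on the same byte as rom[(PC+1) & 0xFFF], with no add, mask or address constant needed. A small sketch to check this on a host; ROM_SIZE, the padding byte and the function names are hypothetical:

```c
#include <stdint.h>

#define ROM_SIZE 0x1000u

/* Hypothetical ROM image; one padding byte keeps rom + 0x1001 a valid
   one-past-the-end pointer in portable C. */
uint8_t rom[ROM_SIZE + 1];

/* Operand fetch as in op_bcs: mask PC+1 into the 4 KB image. */
int8_t operand_masked(uint32_t pc)
{
    return (int8_t)rom[(pc + 1) & 0xFFFu];
}

/* Operand fetch as in op_bcc: the signed 16-bit PC indexes backwards
   from a pointer placed 0x1001 bytes past the start of the image. */
int8_t operand_biased(uint32_t pc)
{
    const uint8_t *romendp1 = rom + ROM_SIZE + 1;
    int16_t pcw = (int16_t)(uint16_t)pc;   /* 0xF000..0xFFFF -> -4096..-1 */
    return (int8_t)romendp1[pcw];
}
```

The two forms only differ for an opcode at 0xFFFF, where the masked version wraps to rom[0] and the biased one reads one byte past the image. That address is part of the 6800-family reset vector, so a branch opcode should not normally be fetched from there.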

Carefully coding the rest of the opcodes like this, and using unions/structs for the HD6301 registers, allowed the small CH32X035 to emulate more than 1,000,000 ticks per second…
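
To illustrate the union idea, here is a minimal sketch of how PC and PCw could share storage. Only the names PC, PCw, CC_C and TICKS come from the code above; the layout, the ticks field and the macro body are my assumptions, and the real struct clearly has more fields (the listings suggest romendp1 at offset 20, the tick counter at 24 and CC_C at 29). It relies on a little-endian target, which the RISC-V core is.

```c
#include <stdint.h>

/* Hypothetical excerpt of the register file; the layout is an assumption. */
typedef struct {
    union {
        uint32_t PC;    /* full-width PC: plain lw/sw, no re-truncation */
        int16_t  PCw;   /* low 16 bits, sign-extended by a single lh */
    };
    uint32_t ticks;     /* hypothetical cycle counter */
    uint8_t  CC_C;      /* carry flag kept in its own byte, tested with lbu */
} regs_t;

regs_t regs;

/* Hypothetical stand-in for the TICKS() macro used in the opcodes. */
#define TICKS(n) (regs.ticks += (n))
```

Keeping each flag in its own byte and the PC in a native 32-bit word means every access is a single load or store, which seems to be what keeps the listings above so short.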

Edit: I think that this version (no depl=0 default; the displacement is only added on the taken path)

void op_bcc(void)
{
    if (regs.CC_C == 0)
        regs.PC = (regs.PC + 2) + (s8)regs.romendp1[regs.PCw];
    else
        regs.PC = regs.PC + 2;
    TICKS(3);
}

is even better: 15 instructions, same size (40 bytes), but only 10 instructions executed when C=1, instead of 12.

00002660 <op_bcc>:
    2660:	38c18793          	addi	a5,gp,908 # 2000100c <regs>
    2664:	4398                	lw	a4,0(a5)
    2666:	01d7c683          	lbu	a3,29(a5)
    266a:	0709                	addi	a4,a4,2
    266c:	ea81                	bnez	a3,267c <op_bcc+0x1c>
    266e:	00079603          	lh	a2,0(a5)
    2672:	4bd4                	lw	a3,20(a5)
    2674:	96b2                	add	a3,a3,a2
    2676:	00068683          	lb	a3,0(a3)
    267a:	9736                	add	a4,a4,a3
    267c:	4f94                	lw	a3,24(a5)
    267e:	c398                	sw	a4,0(a5)
    2680:	00368713          	addi	a4,a3,3
    2684:	cf98                	sw	a4,24(a5)
    2686:	8082                	ret
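
When reshuffling an opcode like this, it is cheap to check on a host that both forms still behave identically. The sketch below is a hypothetical harness, not the emulator's code: cpu_t, the rom array with its padding byte and the function names are all assumptions, and the opcodes take a pointer instead of using the global regs.

```c
#include <stdint.h>

uint8_t rom[0x1001];                 /* hypothetical image plus one pad byte */

typedef struct {
    uint32_t       PC;
    const uint8_t *romendp1;         /* rom + 0x1001 */
    uint32_t       ticks;
    uint8_t        CC_C;
} cpu_t;

/* First form: the operand defaults to 0 when the branch is not taken. */
void bcc_ternary(cpu_t *r)
{
    uint8_t depl = (r->CC_C == 0) ? r->romendp1[(int16_t)r->PC] : 0;
    r->PC = r->PC + 2 + (int8_t)depl;
    r->ticks += 3;
}

/* Second form: the operand is only added on the taken path. */
void bcc_ifelse(cpu_t *r)
{
    if (r->CC_C == 0)
        r->PC = (r->PC + 2) + (int8_t)r->romendp1[(int16_t)r->PC];
    else
        r->PC = r->PC + 2;
    r->ticks += 3;
}

/* Returns 1 if both forms agree for the given PC and carry flag. */
int bcc_forms_agree(uint32_t pc, uint8_t carry)
{
    cpu_t a = { pc, rom + 0x1001, 0, carry };
    cpu_t b = a;
    bcc_ternary(&a);
    bcc_ifelse(&b);
    return a.PC == b.PC && a.ticks == b.ticks;
}
```

Looping bcc_forms_agree over 0xF000..0xFFFE with both carry values covers every case for this opcode, so the instruction counts can be compared knowing the behaviour is unchanged.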
