Hi Chris,
Thank you so much for your kind response. We are currently implementing modulo scheduling as a compiler optimization at the machine-code level (using the LLVM-VPO compiler). The objective is to close the performance gap between the Cortex-A8 and the Cortex-A9. To model the architectural restrictions within our compiler, we have been trying to understand the cycle counts reported by the Carbon bare-metal processor tool. I think we are getting better :)