When cascaded like this, the chips collectively form one long shift register whose length is 16 bits times the number of chips. As long as you keep LOAD low, bits are simply shifted along, one position per clock cycle. You can think of the data bits as a train running on a track through a series of stations, where each station is one of the cascaded chips. The clock is the train's engine: when you stop the engine, the train stops, and the passengers (the bits) are at whichever station they happen to be in when it stopped. You then open the doors by raising LOAD to let the passengers off. As for your questions:
1. The above should answer that
2. Yes, but it's only necessary to raise LOAD once the bits are in place
3. I've never used the emulator, so I wouldn't know, but an external device generally cannot tell whether it is being bit-banged or talking to an actual hardware peripheral (e.g. hardware SPI)
4. Yes, only one LOAD once the bits have arrived at the designated chip (see 3)
I wouldn't know about power consumption.
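
To make the train picture concrete, here is a rough Python sketch of the shifting and latching described above. It is a toy model only, not code for any particular chip or library: the `Chain` class, its method names, and the choice to send each word most significant bit first are all assumptions made for illustration.

```python
class Chain:
    """Toy model of N cascaded chips, each holding 16 bits (hypothetical)."""

    WORD_BITS = 16

    def __init__(self, num_chips):
        self.num_chips = num_chips
        # One long shift register; index 0 is the bit nearest the data input.
        self.shift = [0] * (self.WORD_BITS * num_chips)
        # What each chip has actually latched; only changes on LOAD.
        self.latched = [0] * num_chips

    def clock(self, bit):
        """One clock pulse: every bit moves one place down the chain."""
        self.shift = [bit & 1] + self.shift[:-1]

    def send_word(self, word):
        """Clock in a 16-bit word, most significant bit first (assumed order)."""
        for i in range(self.WORD_BITS - 1, -1, -1):
            self.clock((word >> i) & 1)

    def load(self):
        """Raise LOAD: each chip latches the 16 bits currently sitting in it."""
        for chip in range(self.num_chips):
            start = chip * self.WORD_BITS
            bits = self.shift[start:start + self.WORD_BITS]
            value = 0
            # The bit that entered the chip first has travelled furthest,
            # so read from the far end back toward the input.
            for b in reversed(bits):
                value = (value << 1) | b
            self.latched[chip] = value


chain = Chain(3)
for word in (0xAAAA, 0xBBBB, 0xCCCC):
    chain.send_word(word)      # LOAD stays low, bits just move along

before = list(chain.latched)   # nothing has been latched yet
chain.load()                   # a single LOAD for the whole chain
after = list(chain.latched)
```

In this model the first word sent ends up in the chip furthest from the input, and nothing any chip has latched changes until `load()` is called, which is why one LOAD after all the words are in place is enough.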