fix: restore cross-device synchronization in device_to_device copies

When source and destination are on different devices, wait for source
stream writes to complete before scheduling the copy on the destination
stream. Without this, async CUDA/Metal copies could read stale data.
Addresses review feedback on PR #12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>