Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos