The model comprises five key components: an adaptive image divider, a vision encoder, a large language model (LLM), a grounding vision encoder, and a pixel decoder. The adaptive image divider partitions each input image into local tiles and a global view, enabling efficient encoding at resolutions up to 4K. The vision encoder (a scaled CLIP ViT-L/14) processes the local tiles and the global view, and the merged features are passed to the InternLM2 LLM, which aligns the visual and textual modalities using Partial LoRA (Low-Rank Adaptation applied only to the visual tokens). For pixel-level grounding, GeoPixel incorporates SAM-2, whose hierarchical (Hiera) image encoder, pretrained as a masked autoencoder, serves as the grounding vision encoder, together with a lightweight pixel decoder that generates accurate segmentation masks. By leveraging pretrained encoders and modality-specific adaptations, GeoPixel achieves robust cross-modal alignment and strong performance on RS tasks, enabling precise segmentation and interpretation of high-resolution imagery.
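The divider's local/global tiling can be illustrated with a short sketch. The following is a minimal, illustrative implementation rather than GeoPixel's actual code; the 336-pixel tile size, the tile budget, and the function name `adaptive_divide` are assumptions made for exposition.

```python
import math
import torch
import torch.nn.functional as F

def adaptive_divide(image: torch.Tensor, tile: int = 336, max_tiles: int = 144):
    """Split a high-resolution image into local tiles plus one global view.

    image: (C, H, W) tensor. Returns (tiles, global_view), where tiles is
    (N, C, tile, tile) and global_view is (C, tile, tile).
    """
    c, h, w = image.shape
    # Choose a grid that covers the image without exceeding the tile budget.
    rows = min(math.ceil(h / tile), int(math.sqrt(max_tiles)))
    cols = min(math.ceil(w / tile), max_tiles // max(rows, 1))
    # Resize so the image divides evenly into the chosen grid.
    resized = F.interpolate(image[None], size=(rows * tile, cols * tile),
                            mode="bilinear", align_corners=False)[0]
    tiles = (resized.view(c, rows, tile, cols, tile)
                    .permute(1, 3, 0, 2, 4)
                    .reshape(rows * cols, c, tile, tile))
    # The global view is the whole image downsampled to a single tile.
    global_view = F.interpolate(image[None], size=(tile, tile),
                                mode="bilinear", align_corners=False)[0]
    return tiles, global_view

# Example: a 4K frame yields a grid of local tiles plus a global thumbnail.
tiles, glob = adaptive_divide(torch.rand(3, 2160, 3840))
print(tiles.shape, glob.shape)  # torch.Size([84, 3, 336, 336]) torch.Size([3, 336, 336])
```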
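Partial LoRA confines the low-rank update to the visual tokens, so text tokens continue to pass through the frozen pretrained weights. A minimal sketch of that mechanism, assuming a boolean mask marking which sequence positions are image tokens (the class name `PartialLoRALinear` and the dimensions are hypothetical):

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Linear layer whose low-rank update applies only to masked (visual) tokens."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(in_features, rank, bias=False)   # trainable
        self.lora_b = nn.Linear(rank, out_features, bias=False)  # trainable
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, T, in_features); visual_mask: (B, T) bool, True on image tokens.
        out = self.base(x)
        delta = self.lora_b(self.lora_a(x))
        # Text tokens pass through the frozen weights untouched.
        return out + delta * visual_mask.unsqueeze(-1).to(delta.dtype)

# Example: a 10-token sequence whose first 6 positions are visual tokens.
layer = PartialLoRALinear(64, 64, rank=4)
x = torch.randn(2, 10, 64)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :6] = True
print(layer(x, mask).shape)  # torch.Size([2, 10, 64])
```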
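For the pixel decoder, a common pattern in grounded LMMs of this kind is to project a segmentation-token hidden state from the LLM into the grounding feature space and score it against per-pixel features to produce mask logits. The sketch below shows that general pattern only; the module name, dimensions, and interfaces are illustrative assumptions, not GeoPixel's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoder(nn.Module):
    """Lightweight decoder: dot-product between a projected query embedding and
    per-pixel grounding features, upsampled to full-resolution mask logits."""

    def __init__(self, llm_dim: int = 4096, feat_dim: int = 256):
        super().__init__()
        # Map the LLM hidden state into the grounding feature space.
        self.proj = nn.Linear(llm_dim, feat_dim)

    def forward(self, seg_hidden: torch.Tensor, feats: torch.Tensor,
                out_size: tuple) -> torch.Tensor:
        # seg_hidden: (B, llm_dim) hidden state of a segmentation token.
        # feats: (B, feat_dim, H, W) grounding features (e.g. from SAM-2's encoder).
        q = self.proj(seg_hidden)                        # (B, feat_dim)
        logits = torch.einsum("bc,bchw->bhw", q, feats)  # per-pixel similarity
        return F.interpolate(logits.unsqueeze(1), size=out_size,
                             mode="bilinear", align_corners=False).squeeze(1)

decoder = PixelDecoder()
mask_logits = decoder(torch.randn(1, 4096), torch.randn(1, 256, 64, 64), (1024, 1024))
print(mask_logits.shape)  # torch.Size([1, 1024, 1024])
```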