Distributed Mode

Why distribute?

Even a powerful GPU has limits. To process massive volumes (tens of TB) within a reasonable time, DRAGON can combine multiple GPUs, whether local or remote.

Horizontal scalability

flowchart LR
    subgraph Single["1 GPU"]
        G1["RTX 4090<br/>18 GB/s"]
    end
    subgraph Dual["2 Local GPUs"]
        G2A["RTX 4090"]
        G2B["RTX 4090"]
        G2T["32 GB/s"]
    end
    subgraph Cluster["4 Distributed GPUs"]
        G4A["GPU Node 1"]
        G4B["GPU Node 2"]
        G4C["GPU Node 3"]
        G4D["GPU Node 4"]
        G4T["58 GB/s"]
    end
    Single --> |"x1.8"| Dual
    Dual --> |"x1.8"| Cluster
    style Single fill:#1e40af,stroke:#3b82f6
    style Dual fill:#065f46,stroke:#10b981
    style Cluster fill:#7c2d12,stroke:#f59e0b

Near-linear scalability

With a well-designed architecture, doubling the number of GPUs nearly doubles processing capacity (about x1.8 per doubling in the diagram above). The network overhead (~10%) is offset by parallelization.
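The throughput figures from the diagram above can be turned into speedup and parallel efficiency. The sketch below only restates that arithmetic; the function name `scaling_report` is illustrative and not part of DRAGON.

```python
# Throughput figures (GB/s) taken from the scalability diagram.
throughput_gbps = {1: 18.0, 2: 32.0, 4: 58.0}


def scaling_report(measurements):
    """Return (gpu_count, speedup, efficiency) relative to the smallest setup."""
    base_count = min(measurements)
    base_rate = measurements[base_count]
    report = []
    for count in sorted(measurements):
        speedup = measurements[count] / base_rate
        # Efficiency compares the measured speedup with ideal linear scaling.
        efficiency = speedup / (count / base_count)
        report.append((count, speedup, efficiency))
    return report


for count, speedup, efficiency in scaling_report(throughput_gbps):
    print(f"{count} GPU(s): speedup x{speedup:.2f}, efficiency {efficiency:.0%}")
```

With these numbers, two GPUs reach roughly 89% of ideal scaling and four GPUs roughly 81%.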

Distributed architecture

Topology of the distributed system

flowchart TB
    subgraph Client["DRAGON Client"]
        TUI["dragon-analyze-tui"]
        SCHED["Unified GPU Manager"]
    end
    subgraph Local["Local Machine"]
        LG1["Local GPU 0<br/>RTX 4090"]
        LG2["Local GPU 1<br/>RTX 3080"]
    end
    subgraph Remote1["GPU Server 1<br/>192.168.1.10"]
        SERVER1["dragon-gpu-server"]
        RG1["RTX 4090"]
        RG2["RTX 4090"]
    end
    subgraph Remote2["GPU Server 2<br/>192.168.1.11"]
        SERVER2["dragon-gpu-server"]
        RG3["A100"]
    end
    TUI --> SCHED
    SCHED --> LG1
    SCHED --> LG2
    SCHED <-->|"TCP:9876"| SERVER1
    SCHED <-->|"TCP:9876"| SERVER2
    SERVER1 --> RG1
    SERVER1 --> RG2
    SERVER2 --> RG3
    style Client fill:#1e40af,stroke:#3b82f6
    style Local fill:#065f46,stroke:#10b981
    style Remote1 fill:#7c2d12,stroke:#f59e0b
    style Remote2 fill:#7c2d12,stroke:#f59e0b

Components of the distributed system

Unified GPU Manager

Single entry point that abstracts away the complexity of local and remote GPUs.

- Automatic detection of local GPUs
- Connection to remote servers
- Unified interface for the scheduler
- Disconnection handling

GPU Server

Standalone service that exposes a machine's GPUs over TCP.

- Listens on a configurable port
- Optional authentication
- Multiple concurrent clients
- Metrics and monitoring

Remote GPU Worker

Abstraction of a remote GPU, which the scheduler manages like a local GPU.

- Persistent TCP connection
- Serialization of WorkBlocks
- Optional compression
- Automatic retry

Communication protocol

Sequence of a remote request

sequenceDiagram
    participant Client as Client
    participant Worker as Remote Worker
    participant Server as GPU Server
    participant GPU as GPU
    Note over Client,GPU: Phase 1: Connection
    Client->>Worker: create(host, port)
    Worker->>Server: TCP Connect
    Server-->>Worker: ACK + Capabilities
    Worker-->>Client: ready
    Note over Client,GPU: Phase 2: Processing
    Client->>Worker: submit(WorkBlock)
    Worker->>Server: HASH_REQUEST + data
    Server->>GPU: cudaMemcpy + kernel
    GPU-->>Server: hashes
    Server-->>Worker: HASH_RESPONSE + results
    Worker-->>Client: completed(results)
    Note over Client,GPU: Phase 3: Shutdown
    Client->>Worker: shutdown()
    Worker->>Server: DISCONNECT
    Server-->>Worker: BYE

Message format

Protocol structure

Header (16 bytes):
- magic: 4 bytes ("DRGN")
- version: 2 bytes
- type: 2 bytes (REQUEST, RESPONSE, ERROR, ...)
- payload_size: 8 bytes
Payload (variable):
- For HASH_REQUEST: block_id + data
- For HASH_RESPONSE: block_id + hashes[]
Checksum (4 bytes): CRC32 of the payload
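The layout above can be sketched with Python's `struct` and `zlib` modules. Several details here are assumptions that the description does not specify: big-endian byte order, the numeric type codes, the protocol version value, and placing the checksum after the payload. Treat it as an illustration of the framing, not as the actual wire format.

```python
import struct
import zlib

MAGIC = b"DRGN"
# Assumption: big-endian fields, in the order listed in the header description.
HEADER = struct.Struct(">4sHHQ")  # magic, version, type, payload_size
CHECKSUM = struct.Struct(">I")    # CRC32 of the payload

# Hypothetical numeric codes; the real values are not documented here.
MSG_HASH_REQUEST = 3
MSG_HASH_RESPONSE = 4


def encode_message(msg_type, payload, version=1):
    """Build header + payload + CRC32 trailer."""
    header = HEADER.pack(MAGIC, version, msg_type, len(payload))
    return header + payload + CHECKSUM.pack(zlib.crc32(payload))


def decode_message(frame):
    """Validate a complete frame and return (version, msg_type, payload)."""
    if len(frame) < HEADER.size + CHECKSUM.size:
        raise ValueError("frame too short")
    magic, version, msg_type, size = HEADER.unpack_from(frame, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    end = HEADER.size + size
    if len(frame) != end + CHECKSUM.size:
        raise ValueError("length mismatch")
    payload = frame[HEADER.size:end]
    (expected_crc,) = CHECKSUM.unpack_from(frame, end)
    if zlib.crc32(payload) != expected_crc:
        raise ValueError("CRC mismatch")
    return version, msg_type, payload


frame = encode_message(MSG_HASH_REQUEST, b"block data")
print(HEADER.size, decode_message(frame))
```

The format string `">4sHHQ"` packs to exactly 16 bytes, matching the header size given above.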

Message Type	Direction	Description
HELLO	Client -> Server	Connection initiation with version
CAPABILITIES	Server -> Client	Available GPUs, memory, features
HASH_REQUEST	Client -> Server	Data to hash
HASH_RESPONSE	Server -> Client	Computed hashes
PING / PONG	Bidirectional	Keepalive
ERROR	Server -> Client	Error with code and message
DISCONNECT	Client -> Server	Clean shutdown

Load Balancing

The scheduler distributes work across GPUs according to their performance and current load.

Distribution algorithm

flowchart TB
    subgraph Input["New WorkBlock"]
        BLOCK["64 MB to process"]
    end
    subgraph Scoring["Score computation"]
        S1["Local GPU 0<br/>Score: 95"]
        S2["Local GPU 1<br/>Score: 78"]
        S3["Remote 1<br/>Score: 82"]
        S4["Remote 2<br/>Score: 45"]
    end
    subgraph Factors["Factors"]
        F1["+ Recent throughput"]
        F2["+ Available memory"]
        F3["- Latency"]
        F4["- Queue length"]
        F5["+ Local bonus"]
    end
    subgraph Selection["Selection"]
        BEST["Best score<br/>Local GPU 0"]
    end
    BLOCK --> Scoring
    Factors --> Scoring
    Scoring --> Selection
    Selection --> |"dispatch"| S1
    style BEST fill:#14532d,stroke:#22c55e

Scoring algorithm

For each available worker, compute:
- estimated_time = latency + (block_size / throughput)
- base_score = 1000 / estimated_time
Apply the modifiers:
- If queue_length > 5: score *= 0.8
- If local: score *= 1.2 (locality bonus)
- If recent error: score *= 0.5
Select the worker with the highest score
Fallback: if all scores < threshold, wait
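The scoring steps above can be sketched as follows. The units (milliseconds, MB, MB/s), the `WorkerStats` fields and the `min_score` threshold value are assumptions made for illustration; the description does not fix them.

```python
from dataclasses import dataclass


@dataclass
class WorkerStats:
    """Hypothetical per-worker statistics used for scoring."""
    name: str
    latency_ms: float
    throughput_mb_s: float
    queue_length: int = 0
    is_local: bool = False
    recent_error: bool = False


def score(worker, block_mb):
    # estimated_time = latency + (block_size / throughput), here in ms
    estimated_ms = worker.latency_ms + block_mb / worker.throughput_mb_s * 1000.0
    value = 1000.0 / estimated_ms
    if worker.queue_length > 5:
        value *= 0.8
    if worker.is_local:
        value *= 1.2  # locality bonus
    if worker.recent_error:
        value *= 0.5
    return value


def select_worker(workers, block_mb, min_score=1.0):
    """Return the best worker, or None when every score is below min_score."""
    if not workers:
        return None
    best = max(workers, key=lambda w: score(w, block_mb))
    return best if score(best, block_mb) >= min_score else None


workers = [
    WorkerStats("local0", latency_ms=0.1, throughput_mb_s=18000, is_local=True),
    WorkerStats("remote1", latency_ms=2.0, throughput_mb_s=1000),
]
print(select_worker(workers, block_mb=64).name)
```

Returning `None` corresponds to the fallback case: the caller waits and scores the workers again later.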

Distribution strategies

Weighted Round Robin

Each GPU receives blocks in proportion to its relative capacity.

pie title Typical distribution
    "Local GPU 0 (40%)" : 40
    "Local GPU 1 (25%)" : 25
    "Remote 1 (25%)" : 25
    "Remote 2 (10%)" : 10
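One common way to obtain such a proportional split is the "smooth" weighted round-robin scheme, sketched below with the weights from the chart. Which variant DRAGON actually uses is not stated, so this is an assumption; the function name is illustrative.

```python
from collections import Counter
from itertools import islice


def weighted_round_robin(weights):
    """Yield worker names forever, in proportion to their integer weights."""
    total = sum(weights.values())
    current = {name: 0 for name in weights}
    while True:
        # Every worker gains its weight; the current leader is picked
        # and pays back the total, which spreads picks evenly over time.
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        yield chosen


weights = {"GPU Local 0": 40, "GPU Local 1": 25, "Remote 1": 25, "Remote 2": 10}
counts = Counter(islice(weighted_round_robin(weights), 100))
print(counts)
```

Over 100 consecutive blocks, each worker receives exactly its weight in blocks, and the picks are interleaved instead of being sent in long runs to the same GPU.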

Prefer Local

Favor local GPUs to minimize latency, and use remote GPUs for overflow.

- Local first, until saturation
- Configurable threshold (e.g. 80%)
- Spillover to remote
- Ideal for low bandwidth
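A minimal sketch of the spillover rule, assuming that "saturation" is measured as a utilization fraction between 0 and 1 per GPU. The function name and the load dictionaries are hypothetical.

```python
def pick_prefer_local(local_load, remote_load, threshold=0.80):
    """Pick the least-loaded local GPU below the threshold, else spill over.

    local_load and remote_load map a worker name to its utilization (0.0-1.0).
    Returns a worker name, or None if no worker is available.
    """
    candidates = {name: load for name, load in local_load.items() if load < threshold}
    if not candidates:
        # All local GPUs are saturated: spill over to remote workers.
        candidates = dict(remote_load)
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


print(pick_prefer_local({"local0": 0.50, "local1": 0.90}, {"remote1": 0.10}))
print(pick_prefer_local({"local0": 0.85, "local1": 0.90}, {"remote1": 0.10}))
```

In the first call a local GPU is still below 80%, so it is chosen even though the remote worker is less loaded; in the second call both local GPUs are saturated and the block goes to the remote worker.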

Fault tolerance

In a distributed environment, failures are inevitable. DRAGON handles worker failures automatically.

Failure handling

stateDiagram-v2
    [*] --> Submitted: submit()
    Submitted --> Processing: dispatch()
    Processing --> Completed: success
    Processing --> Timeout: timeout
    Processing --> Error: exception
    Timeout --> Retry1: retry < max
    Error --> Retry1: retry < max
    Retry1 --> Processing: another worker
    Timeout --> Failed: retry >= max
    Error --> Failed: retry >= max
    Completed --> [*]
    Failed --> [*]: error callback
    note right of Retry1
        Faulty worker excluded for this block
    end note
    note right of Failed
        After 2 retries on different workers
    end note

Failure type	Detection	Action
Timeout	Timer exceeded (computed dynamically)	Retry on another worker
GPU error (OOM, etc.)	Propagated exception	Retry with a smaller block
Network disconnection	Socket error / ping timeout	Automatic reconnection, retry in-flight blocks
Server crash	Connection refused	Mark worker offline, redistribute
Corrupted data	CRC mismatch	Request the block again
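The retry path of the state diagram (retry on another worker, faulty worker excluded for this block, failure after 2 retries) can be sketched as below. Workers are modeled as plain callables in a dictionary, which is an assumption made for illustration; splitting a block after an OOM error is not covered.

```python
def process_with_retry(block, workers, max_retries=2):
    """Try a block on successive workers, excluding each one that fails.

    workers maps a worker name to a callable taking the block.
    Returns (worker_name, result) or raises RuntimeError once retries run out.
    """
    excluded = set()
    last_error = None
    for _ in range(max_retries + 1):  # first attempt + max_retries retries
        candidates = [name for name in workers if name not in excluded]
        if not candidates:
            break
        name = candidates[0]
        try:
            return name, workers[name](block)
        except Exception as exc:  # timeout or processing error
            last_error = exc
            excluded.add(name)  # faulty worker excluded for this block
    raise RuntimeError(f"block failed on {len(excluded)} worker(s)") from last_error


def flaky(block):
    raise TimeoutError("worker too slow")


def healthy(block):
    return block.upper()


print(process_with_retry("abc", {"remote1": flaky, "local0": healthy}))
```

A real scheduler would pick the next candidate by score instead of taking the first remaining worker.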

Reconnection algorithm

1. Detect the disconnection (socket error or ping timeout)
2. Mark in-flight blocks as "to retry"
3. Attempt reconnection with exponential backoff:
- Attempt 1: immediate
- Attempt 2: +1 second
- Attempt 3: +2 seconds
- Attempt 4: +4 seconds
- Max: 30 seconds between attempts
4. If reconnection succeeds: resume processing
5. After 5 failures: mark the worker as "offline"
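The backoff schedule above can be sketched as a small helper. The `connect` callable and the injectable `sleep` parameter are hypothetical, introduced only so the example is self-contained and testable.

```python
import time


def backoff_delays(max_attempts=5, base_s=1.0, cap_s=30.0):
    """Delay before each attempt: 0, 1, 2, 4, ... seconds, capped at cap_s."""
    delays = []
    for attempt in range(max_attempts):
        if attempt == 0:
            delays.append(0.0)  # first attempt is immediate
        else:
            delays.append(min(base_s * 2 ** (attempt - 1), cap_s))
    return delays


def reconnect(connect, max_attempts=5, sleep=time.sleep):
    """Call connect() with backoff; return its result, or None if all attempts fail."""
    for delay in backoff_delays(max_attempts):
        sleep(delay)
        try:
            return connect()
        except OSError:
            continue  # socket errors are OSError subclasses
    return None  # caller marks the worker as "offline"


print(backoff_delays())   # [0.0, 1.0, 2.0, 4.0, 8.0]
print(backoff_delays(8))  # [0.0, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With the default of 5 attempts the 30-second cap is never reached; it only matters if the attempt limit is raised.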

Configuration

dragon_gpu.yaml file

Client configuration

dragon_gpu.yaml

local_gpus:
  enabled: true
  device_ids: [0, 1]
remote_gpus:
  - name: "server1"
    hostname: "192.168.1.10"
    port: 9876
    enabled: true
  - name: "server2"
    hostname: "192.168.1.11"
    port: 9876
    enabled: true
prefer_local: true
connection_timeout_ms: 5000
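As an illustration of how a client might consume this file, the sketch below works on a dictionary with the same keys, such as a YAML parser (for example PyYAML's `yaml.safe_load`) would return. The nesting of the dictionary, the function name, and the choice to treat a missing `enabled` key as true are assumptions.

```python
# Dictionary mirroring the keys of dragon_gpu.yaml (nesting is assumed).
config = {
    "local_gpus": {"enabled": True, "device_ids": [0, 1]},
    "remote_gpus": [
        {"name": "server1", "hostname": "192.168.1.10", "port": 9876, "enabled": True},
        {"name": "server2", "hostname": "192.168.1.11", "port": 9876, "enabled": True},
    ],
    "prefer_local": True,
    "connection_timeout_ms": 5000,
}


def enabled_remote_endpoints(cfg):
    """Return (name, hostname, port) for each enabled remote server."""
    endpoints = []
    for entry in cfg.get("remote_gpus", []):
        if not entry.get("enabled", True):
            continue
        port = int(entry["port"])
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port for {entry['name']}: {port}")
        endpoints.append((entry["name"], entry["hostname"], port))
    return endpoints


print(enabled_remote_endpoints(config))
```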

Server configuration

gpu_server.yaml

server:
  bind_address: "0.0.0.0"
  port: 9876
  max_clients: 10
gpu:
  device_ids: [0, 1]
  memory_limit_percent: 70
security:
  require_auth: false
  allowed_ips: ["192.168.1.0/24"]
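The `allowed_ips` entry uses CIDR notation. Assuming it acts as an allowlist checked against each client's address, the test can be sketched with Python's standard `ipaddress` module; the function name is illustrative.

```python
import ipaddress


def is_client_allowed(client_ip, allowed_ips):
    """Return True if client_ip falls inside any of the allowed CIDR ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(entry, strict=False)
        for entry in allowed_ips
    )


allowed = ["192.168.1.0/24"]
print(is_client_allowed("192.168.1.42", allowed))  # True
print(is_client_allowed("10.0.0.5", allowed))      # False
```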

Deployment

Typical deployment architecture

flowchart TB
    subgraph Workstation["Workstation"]
        CLIENT["dragon-analyze-tui"]
        LOCAL_GPU["Local GPU<br/>RTX 3080"]
    end
    subgraph Network["Local Network"]
        SWITCH["10 GbE Switch"]
    end
    subgraph GPUServer1["GPU Server 1"]
        S1["dragon-gpu-server"]
        S1G1["RTX 4090 #1"]
        S1G2["RTX 4090 #2"]
    end
    subgraph GPUServer2["GPU Server 2"]
        S2["dragon-gpu-server"]
        S2G1["A100 80GB"]
    end
    subgraph Storage["Storage"]
        NAS["NAS / SAN<br/>Source data"]
    end
    CLIENT --> LOCAL_GPU
    CLIENT <--> SWITCH
    SWITCH <--> S1
    SWITCH <--> S2
    SWITCH <--> NAS
    S1 --> S1G1
    S1 --> S1G2
    S2 --> S2G1
    NAS --> CLIENT
    style Workstation fill:#1e40af,stroke:#3b82f6
    style GPUServer1 fill:#065f46,stroke:#10b981
    style GPUServer2 fill:#065f46,stroke:#10b981

Starting the services

1. Start the GPU servers

On each machine with a GPU:
- Check that the CUDA drivers are installed
- Configure gpu_server.yaml
- Run: dragon-gpu-server --config gpu_server.yaml
Check the logs: "Listening on 0.0.0.0:9876"

2. Launch the distributed analysis

Configure dragon_gpu.yaml with the remote servers
Run: dragon-analyze-tui --source /path/to/data
The interface shows the connected GPUs in the GPU tab
Start the analysis; blocks are distributed automatically
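Before launching the analysis, it can be useful to confirm that each GPU server accepts TCP connections. The helper below is a generic reachability check, not a DRAGON tool; it only opens and closes a socket and does not speak the DRAGON protocol.

```python
import socket


def is_reachable(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

For the example configuration, calling `is_reachable("192.168.1.10", 9876)` should return True once the server logs "Listening on 0.0.0.0:9876".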

Network performance

Impact of bandwidth

xychart-beta
    title "Efficiency by network type"
    x-axis ["1 GbE", "2.5 GbE", "10 GbE", "25 GbE", "Local"]
    y-axis "Efficiency %" 0 --> 100
    bar [35, 55, 85, 95, 100]

Network	Bandwidth	Effective throughput	Recommendation
1 GbE	~120 MB/s	~100 MB/s	1-2 remote GPUs max
2.5 GbE	~300 MB/s	~250 MB/s	2-4 remote GPUs
10 GbE	~1.2 GB/s	~1 GB/s	Recommended for clusters
25 GbE	~3 GB/s	~2.5 GB/s	Optimal for multi-GPU
InfiniBand	~12 GB/s	~10 GB/s	HPC / Datacenter
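To make the table concrete, the sketch below estimates how long one 64 MB WorkBlock (the block size shown in the load-balancing diagram) takes to cross each link at the effective throughput listed. It assumes 1 GB = 1000 MB and ignores latency and protocol overhead.

```python
# Effective throughput from the table, converted to MB/s (1 GB = 1000 MB assumed).
EFFECTIVE_MB_S = {
    "1 GbE": 100,
    "2.5 GbE": 250,
    "10 GbE": 1000,
    "25 GbE": 2500,
    "InfiniBand": 10000,
}


def transfer_time_ms(block_mb, throughput_mb_s):
    """Time to send one block at the given sustained throughput."""
    return block_mb / throughput_mb_s * 1000.0


for network, rate in EFFECTIVE_MB_S.items():
    print(f"{network}: {transfer_time_ms(64, rate):.1f} ms per 64 MB block")
```

On 1 GbE a single block takes about 640 ms to transfer, against about 64 ms on 10 GbE, which is consistent with the recommendation to limit the number of remote GPUs on slower links.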

Optimization: compression

On slow networks (< 10 GbE), DRAGON can compress data with LZ4 before transfer (ratio ~2x, minimal CPU overhead). Net gain on 1 GbE: +40% effective throughput.
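The "compress only when it helps" idea can be sketched as follows. LZ4 is not part of the Python standard library, so this example substitutes `zlib` at its fastest level as a stand-in; the flag-plus-body return format is also an assumption, not DRAGON's wire format.

```python
import zlib


def pack_payload(data, level=1):
    """Return (compressed_flag, body), compressing only if the result is smaller."""
    candidate = zlib.compress(data, level)  # stand-in for LZ4
    if len(candidate) < len(data):
        return True, candidate
    return False, data  # incompressible data is sent as-is


def unpack_payload(compressed, body):
    """Reverse pack_payload on the receiving side."""
    return zlib.decompress(body) if compressed else body


flag, body = pack_payload(b"A" * 4096)
print(flag, len(body))
assert unpack_payload(flag, body) == b"A" * 4096
```

Skipping compression when it does not shrink the payload avoids paying CPU time for data that is already compressed or encrypted.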