University of Florida - DeepCloud Team Presents

Principal Investigators: Dapeng Oliver Wu, Xiaolin (Andy) Li
Students: Xiaoyong Yuan, Zhuoyu Yuan, Junran Xie, Christian Fowler, Matthew Ruiz

Project Goals

1. Design for scientific research with deep learning applications:


Researchers focus on the science;
We manage the infrastructure.



2. Provide scalability (data sizes from exabytes, 10^18 bytes, to zettabytes, 10^21 bytes), efficiency, and flexibility for deep learning.



3. Integrate with various learning forms: distributed learning, federated learning, real-time learning, online learning, etc.



Designs

DeepCloud Architecture

DeepCloud is built on the MERN stack: MongoDB, Express, React.js, and Node.js. The frontend is written in React.js and makes API calls to a backend built with the Express framework on Node.js. MongoDB stores user and account information. We additionally use Cassandra to store information about models; the schema and usage details should be available in the backend documentation. The frontend is bundled with Webpack, with Babel used as the loader.

Key Components:
Docker - OS-level virtualization with containers
Kubernetes - automated container deployment, scaling, and management
Istio/Envoy - microservices and service mesh
Kubeflow - deployment of machine learning (ML) workflows on Kubernetes
MLflow - management of the machine learning lifecycle, including experimentation, reproducibility, and deployment

Kubeflow and MLflow deployment.

Implement distributed deep learning using Kubeflow.

Deep Learning model training on local cluster and HPC via Kubernetes controller.

Implement Multi-Purpose Store.
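
To make the distributed deep learning item above more concrete, here is a minimal sketch of how a distributed training job could be described for Kubeflow. It assumes the job is submitted as a TFJob custom resource (the Kubeflow training operator's resource for TensorFlow); the source does not say which job type DeepCloud uses. The job name and image name are hypothetical placeholders, and the function make_tfjob_manifest is ours, not part of any library.

```python
import json


def make_tfjob_manifest(name, image, workers):
    """Build a TFJob manifest as a plain dict (assumed job type, see lead-in)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    # Number of worker pods that train in parallel.
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [
                                # The training operator expects the
                                # container to be named "tensorflow".
                                {"name": "tensorflow", "image": image}
                            ]
                        }
                    },
                }
            }
        },
    }


# Hypothetical job name and image, for illustration only.
manifest = make_tfjob_manifest(
    "mnist-demo", "example.org/deepcloud/mnist:latest", workers=4
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied to a cluster that has the Kubeflow training operator installed; the number of workers is the only knob this sketch exposes.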

Workflow

Workflow of Kubeflow

Kubeflow is integrated into our platform and supports multiple features.

User Workflow

Workflow of DeepCloud

First, a user or researcher launches a job through DeepCloud, for example by selecting a model from the Model Store or starting a deep learning notebook. Second, Kubeflow manages the jobs created by the DL Job Manager. Third, Kubernetes allocates resources for the job from local clusters or public cloud clusters and creates a Docker container to complete the task. Users interact with DeepCloud only in the first step; DeepCloud takes charge of the remaining steps needed to complete the job.
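
The three steps above can be sketched as a small simulation. This is only an illustration of the hand-off between stages, not DeepCloud's actual code: the class and function names, the cluster names, and the "first cluster with enough free GPUs" placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cluster:
    name: str       # hypothetical names, e.g. "local" or "public-cloud"
    free_gpus: int


@dataclass
class Job:
    name: str
    gpus: int
    status: str = "created"
    cluster: Optional[str] = None
    history: List[str] = field(default_factory=list)


def launch(job: Job) -> Job:
    """Step 1: the only step the user performs."""
    job.status = "launched"
    job.history.append(job.status)
    return job


def manage(job: Job) -> Job:
    """Step 2: the job manager takes over the launched job."""
    job.status = "managed"
    job.history.append(job.status)
    return job


def allocate_and_run(job: Job, clusters: List[Cluster]) -> Job:
    """Step 3: place the job on the first cluster with enough free GPUs."""
    for cluster in clusters:
        if cluster.free_gpus >= job.gpus:
            cluster.free_gpus -= job.gpus
            job.cluster = cluster.name
            job.status = "running"
            break
    else:
        job.status = "pending"  # no cluster can host the job right now
    job.history.append(job.status)
    return job


clusters = [Cluster("local", free_gpus=2), Cluster("public-cloud", free_gpus=8)]
job = allocate_and_run(manage(launch(Job("notebook-1", gpus=4))), clusters)
print(job.status, job.cluster, job.history)
```

In this run the local cluster is too small for the request, so the job lands on the public cloud cluster; the user-facing code only ever calls the first step.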

Features

Single Sign On (SSO) Authentication

The login component was updated to include social login using Google and GatorLink. This component consists of two sub-components: Login and Register.

Multi-Purpose Store Function

Once uploaded, models can be viewed in the Model Store. Uploads of pre-trained models are validated by file extension, and the set of accepted extensions can be easily modified or extended in case we support other file types in the future.
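
A file-extension check of the kind described above can be sketched as follows. The list of accepted extensions is hypothetical, since the source does not name the supported formats, and the function name is_allowed_model_file is ours.

```python
import os

# Hypothetical allow-list; the formats DeepCloud accepts are not stated.
ALLOWED_EXTENSIONS = {".h5", ".pb", ".pt", ".onnx"}


def is_allowed_model_file(filename, allowed=ALLOWED_EXTENSIONS):
    """Return True if the file's extension is in the allow-list.

    The comparison is case-insensitive, so "model.H5" passes.
    """
    _, ext = os.path.splitext(filename)
    return ext.lower() in allowed


print(is_allowed_model_file("resnet50.h5"))
print(is_allowed_model_file("notes.txt"))

# Supporting a new format later only requires extending the set.
extended = ALLOWED_EXTENSIONS | {".tflite"}
print(is_allowed_model_file("mobilenet.tflite", extended))
```

Keeping the accepted extensions in one set is what makes the check easy to update: no validation logic changes when a format is added. An extension check alone does not verify file contents.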



Research Methods



1. Provide microservices for deep learning training using Istio, Kubeflow, and MLflow.

2. Implement AutoML, distributed deep learning, and federated learning using Kubeflow and MLflow.

3. Implement a Multi-Purpose Store (MPS) for easy use of microservices, trained models, and deep learning infrastructure.

4. Use Istio and Kubernetes to manage and orchestrate the containers on the cluster with elastic scaling, and manage resources with Harbor, Dex, Rook/Minio, etc.

5. Scale the infrastructure based on efficient monitoring of resource utilization using Grafana, Prometheus, and the ELK stack.
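
As an illustration of utilization-based scaling, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas x current metric / target metric). Whether DeepCloud relies on this autoscaler is an assumption; the function name, the replica bounds, and the sample numbers are hypothetical, and the real autoscaler's tolerance band and stabilization window are left out.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization, min_replicas=1, max_replicas=10):
    """Replica count from observed vs. target utilization, clamped to bounds."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% utilization against a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 4 replicas at 30% utilization against a 60% target: scale in.
print(desired_replicas(4, 30, 60))
```

In practice the utilization figures would come from the monitoring stack (for example, metrics collected by Prometheus) rather than being passed in by hand.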

Impacts

1. Provide researchers with the ability and a platform to manage and distribute training and testing for deep learning.


2. Enable researchers to deploy automated machine learning and distributed deep learning frameworks.




Potential Future Work

1. Allow users to provide compute resources in exchange for a blockchain-based virtual currency that can be used across the platform.


2. Allow external users to access and contribute to the platform.

Project Demo

Contact Us