Serverless Survey Application End-to-End on AWS with Python and Terraform

Introduction

I like the paradigm where you break your whole infrastructure into microservices, so I thought it would be fun to create an app in Python and go fully serverless on AWS.
Check out the full code here --> https://github.com/bobocuillere/Serverless-AWS-Project

The “serverless” approach lets you concentrate on what truly matters: creating the application logic and delivering features. With AWS’s serverless services, you get automatic scaling, high availability, and a pay-as-you-go billing model, all while AWS handles the heavy lifting behind the scenes.

For my survey web application—built entirely from scratch—I wanted an environment where I could iterate quickly and scale effortlessly.

Achieving this meant two things:

  1. Go Serverless: Use AWS Lambda for the backend logic, Amazon S3 for hosting the frontend, Amazon Cognito for user authentication, Amazon DynamoDB for data storage, Amazon API Gateway for routing requests, and Amazon SNS for notifications.
  2. Automate Everything with Infrastructure as Code (IaC): Employ Terraform to define all of these resources and their configurations in code. This means I can spin up or tear down the entire environment with a single command.

By the end of this article, you’ll see how these AWS components fit together to form a coherent, secure, and scalable serverless application.

Architecture Overview: A Survey App Without Servers


The entire goal is to have a browser-based survey frontend that never talks directly to a server you manage. Instead, it relies on AWS-managed services and logic running in AWS Lambda functions—both of which scale and operate without you ever provisioning a single VM.

We’ll look at the full request flow, authentication steps, data storage logic, and how secure isolation is maintained. We’ll also consider how each piece is wired together with Infrastructure as Code (IaC) and how the environment remains consistent across deployments.

Core AWS Services and Their Detailed Roles

Amazon Cognito (User Authentication & Identity Management)

  • What It Does: Cognito handles sign-up, sign-in, and token issuance. Users provide their credentials through the frontend, and the frontend code exchanges those credentials with Cognito for a JWT token.
  • Tokens & JWT Verification:
    Once a user logs in successfully, Cognito returns an ID token (JWT) and an access token. The frontend stores these tokens. On each subsequent API request, the token is included in the Authorization header. Backend Lambdas verify the token’s signature against Cognito’s JWKS (JSON Web Key Set) endpoints, ensuring requests are from authenticated users and haven’t been tampered with.
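
Here is a minimal sketch of that verification step in Python using the PyJWT library (the actual project may use a different library; the region, user pool ID, and app client ID below are placeholders):

import jwt
from jwt import PyJWKClient

# Assumed placeholders -- substitute your own Cognito user pool values.
REGION = "eu-west-1"
USER_POOL_ID = "eu-west-1_example"
APP_CLIENT_ID = "example-client-id"
JWKS_URL = f"https://cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}/.well-known/jwks.json"

def verify_cognito_token(token: str) -> dict:
    # Fetch the public key whose 'kid' matches the token's header from Cognito's JWKS endpoint.
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    # Verify signature, expiry, and audience; raises jwt.PyJWTError if anything is wrong.
    return jwt.decode(token, signing_key.key, algorithms=["RS256"], audience=APP_CLIENT_ID)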

Amazon S3 (Frontend Hosting & Private Assets)
Public S3 Bucket (Frontend):

  • The frontend assets (HTML, CSS) are hosted here. The bucket is configured for public-read access, so when the user opens the site URL, the browser downloads these static files directly. This simple setup eliminates the need for a web server.

Private S3 Bucket (Protected Assets):

  • Certain assets should never be directly exposed on the public internet. The private S3 bucket is locked down with IAM policies. Only the frontend Lambda function, running with an IAM role that grants s3:GetObject, can retrieve these files. The browser never sees a direct link to these private files. For example, if a user wants to retrieve a special survey template, the browser sends an API request that the frontend Lambda handles. If the request is authorized, the Lambda fetches the needed object from the private bucket and returns the data to the user. The user never sees a public URL to that file, preserving confidentiality (a code sketch follows after this list).
  • Why Two Buckets: Separating public and private content enforces a clear security boundary. Public assets are global and cacheable; private assets require controlled, token-validated access via Lambda.
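
Here's a rough sketch of that private-bucket read from inside the frontend Lambda (the bucket name is an illustrative placeholder, not the project's actual name):

import boto3

s3 = boto3.client("s3")

def get_private_template(key: str) -> bytes:
    # The Lambda execution role must grant s3:GetObject on this bucket's ARN.
    response = s3.get_object(Bucket="survey-private-assets", Key=key)
    return response["Body"].read()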

Amazon API Gateway

  • Purpose: API Gateway provides a stable URL and a set of RESTful endpoints for frontend code to interact with the backend logic. For example:
  • POST /flask/login for authenticating the user and retrieving a JWT.
  • POST /survey/create for creating a new survey.
  • GET /survey/responses?survey_id=XYZ to fetch existing responses.
  • Integration Choices: Each route in API Gateway is integrated with the frontend Lambda, not the backend Lambda. This is a deliberate design choice.

AWS Lambda Functions (Backend and Frontend Logic in Python)

  • Frontend Lambda:
    This Python-based Lambda is the “gatekeeper.” It handles the user-facing logic: it verifies JWT tokens and performs operations such as creating or fetching surveys by invoking the backend Lambda.
  • Backend Lambda:
    Another Python-based Lambda responsible for the “core business logic.” It deals with survey creation, reading and writing responses to DynamoDB, and, when needed, publishing messages to SNS for notifications. The backend Lambda is never directly integrated with API Gateway. It is only called by the frontend Lambda, using the AWS SDK internally (e.g., boto3.client('lambda').invoke(...)); see the sketch after this list. This design ensures that even if someone tried to craft a direct request to the backend functions, they’d fail—no public endpoint exists. The backend code remains protected behind the frontend Lambda’s logic.
  • Why Two-Lambda Setup: Separating frontend from backend logic enforces a clean security boundary and layered architecture. The frontend Lambda is responsible for authentication and preliminary checks; the backend Lambda handles database operations and notifications.
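
A hedged sketch of that internal invocation from the frontend Lambda (the backend function name and payload shape are assumptions, not necessarily what the repository uses):

import json
import boto3

lambda_client = boto3.client("lambda")

def call_backend(action: str, payload: dict) -> dict:
    # Synchronous invoke; the backend Lambda has no public endpoint, so this is the only way in.
    response = lambda_client.invoke(
        FunctionName="survey-backend",  # placeholder function name
        InvocationType="RequestResponse",
        Payload=json.dumps({"action": action, "data": payload}),
    )
    return json.loads(response["Payload"].read())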

Amazon DynamoDB (Primary Data Store)

  • Why DynamoDB: It’s serverless, scales automatically, and doesn’t require capacity planning. Perfect for unpredictable workloads, like a survey that might suddenly go viral.
  • Integration with Backend Lambda: The backend Lambda uses boto3 to interact with DynamoDB. For example, when creating a survey, it puts an item into the Surveys table. When fetching responses, it queries by survey_id.
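
For illustration, the backend Lambda's DynamoDB calls might look roughly like this with boto3 (attribute names beyond survey_id and the exact key schema are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
surveys_table = dynamodb.Table("Surveys")
responses_table = dynamodb.Table("Responses")

def create_survey(item: dict) -> None:
    # 'item' must contain the table's primary key, e.g. survey_id.
    surveys_table.put_item(Item=item)

def get_responses(survey_id: str) -> list:
    # Assumes survey_id is the partition key of the Responses table.
    result = responses_table.query(KeyConditionExpression=Key("survey_id").eq(survey_id))
    return result["Items"]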

Amazon SNS (Event Notifications)

  • When Used: Suppose a survey is deleted. We might want to send an alert email to an admin or trigger a cleanup process. The backend Lambda publishes a message to an SNS topic.
  • Why SNS: It decouples the event producer (backend Lambda) from event consumers (emails, other Lambdas, etc.). If we later want SMS alerts or a Slack integration, we just add a new SNS subscription—no need to modify the backend code.
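
A minimal sketch of that publish call (the topic ARN is a placeholder):

import boto3

sns = boto3.client("sns")

def notify_survey_deleted(survey_id: str) -> None:
    # Consumers (email, other Lambdas, Slack, ...) are attached as subscriptions on the topic.
    sns.publish(
        TopicArn="arn:aws:sns:eu-west-1:123456789012:survey-events",  # placeholder ARN
        Subject="Survey deleted",
        Message=f"Survey {survey_id} was deleted.",
    )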

Putting It All Together Step-by-Step

Let’s walk through the entire flow of the survey application’s architecture from start to finish, showing exactly how each part interacts with the others. This step-by-step approach will help clarify the roles of the frontend Lambda, backend Lambda, and all the AWS services in between.

1- User Opens the Survey App:

  • The user navigates to the application’s URL. The frontend files (HTML, CSS, JavaScript) are served from a public Amazon S3 bucket.
  • Because it’s a static site, the user’s browser directly fetches these assets from S3, no servers needed.

2 - User Authenticates with Cognito:

  • In the browser, the user either signs up or logs in.
  • The frontend JavaScript sends credentials (username/password) to the API endpoint (e.g., POST /flask/login).
  • This request goes through Amazon API Gateway, which routes it to the frontend Lambda.
  • The frontend Lambda calls Cognito using the AWS SDK to verify credentials. If correct, Cognito returns a JWT token (a sketch of this call follows below).
  • The frontend Lambda sends that JWT token back to the browser. Now the user’s browser stores this token for future requests.
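
One way the frontend Lambda could perform that credential exchange is with the cognito-idp client in boto3; this is a sketch under assumptions (the project may use a different auth flow, and the client ID is a placeholder):

import boto3

cognito = boto3.client("cognito-idp")

def login(username: str, password: str) -> dict:
    # USER_PASSWORD_AUTH must be enabled on the app client for this flow to work.
    result = cognito.initiate_auth(
        ClientId="example-client-id",  # placeholder
        AuthFlow="USER_PASSWORD_AUTH",
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    # Contains IdToken, AccessToken, and RefreshToken on success.
    return result["AuthenticationResult"]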

3- User Creates or Manages a Survey:

  • With a JWT token in hand, the user wants to create a new survey. The browser sends POST /survey/create with the JWT included in the Authorization header.
  • Again, API Gateway receives this request and routes it to the frontend Lambda.
  • The frontend Lambda checks the JWT token to ensure it’s valid and that the user is allowed to perform the action.
  • If valid, the frontend Lambda then invokes the backend Lambda internally using the AWS SDK (e.g., boto3 for Python). There’s no direct public route to the backend Lambda; all calls must pass through the frontend Lambda first. This ensures strict control and an additional security layer.

4 - Backend Lambda Interacts with DynamoDB:

  • Upon receiving the validated request from the frontend Lambda, the backend Lambda executes the logic for creating the survey.
  • It constructs a DynamoDB put_item request to store the new survey record in the Surveys table.
  • DynamoDB responds immediately, and the backend Lambda returns a success message (including survey_id) back to the frontend Lambda.

5 - User Fetches Survey Responses (Another Example Flow):

  • Suppose the user requests GET /survey/responses?survey_id=XYZ to view all responses for a given survey.
  • The browser again includes the JWT token.
  • API Gateway routes the call to the frontend Lambda.
  • Frontend Lambda validates the JWT token and checks if this user is indeed allowed to view the specified survey’s responses.
  • If allowed, the frontend Lambda calls the backend Lambda, which queries the Responses table in DynamoDB.

6 - Working with Private Assets (On the private bucket):

  • If a particular survey requires loading a private template file (stored in a private S3 bucket), the process is similar.
  • The browser calls a special endpoint handled by the frontend Lambda.
  • The frontend Lambda checks the JWT and ensures the user can access that template.
  • If authorized, the frontend Lambda uses its IAM permissions to read the private S3 object directly (the browser never sees a private URL).

7 - Notifications via SNS:

  • Suppose the user deletes a survey. The browser calls DELETE /survey?survey_id=XYZ with the JWT token.
  • This request, like all others, passes through API Gateway to the frontend Lambda, which validates permissions.
  • If authorized, the frontend Lambda calls the backend Lambda to perform the deletion.
  • The backend Lambda deletes the survey record in DynamoDB.
  • After a successful deletion, the backend Lambda publishes a message to an SNS topic; in my case I tested this with an email subscription.

Troubleshooting Common Issues

Throughout development, I encountered a few issues. Here are some highlights and how I solved them:

CORS Errors on the Frontend:

  • Symptom: My frontend JavaScript called API Gateway endpoints but got CORS errors in the browser console.
  • Solution: I updated the API Gateway integration in Terraform to return Access-Control-Allow-Origin and other CORS headers. This involved setting method.response.header.Access-Control-Allow-Origin to '*' and ensuring the OPTIONS method was configured. Re-applying Terraform resolved the issue.

Invalid JWT Token Errors in Lambda:

  • Symptom: Even with a valid token, I got "Unauthorized" responses.
  • Solution: I made sure the Lambda functions fetched Cognito’s JSON Web Key Set (JWKS) and validated tokens correctly. Adjusting my Python code to parse the token’s kid, retrieve the right public key, and verify signatures fixed the problem.

DynamoDB Throttling Under Heavy Load:

  • Symptom: High-traffic tests caused some "ProvisionedThroughputExceeded" errors.
  • Solution: Switching DynamoDB tables to On-Demand capacity mode in Terraform allowed automatic scaling with no manual provisioning. After this change and re-apply, throttling disappeared under normal test loads.

Permission Denied for Private S3 Bucket Access:

  • Symptom: Backend Lambda failed to read private assets.
  • Solution: I updated Terraform IAM policies to grant s3:GetObject on the private bucket’s ARN to the Lambda execution role. After a quick apply, the function could access files properly.

Conclusion

Building a serverless survey application on AWS—and automating every aspect of its infrastructure with Terraform—has shown just how far cloud computing and DevOps practices have come. Instead of setting up servers, manually creating users in an identity service, or worrying about scaling databases, we focused on writing clear configuration files and straightforward application code.

In essence, this approach transforms the way you build and run applications. It takes you from a world where operations can be slow, error-prone, and costly, to one where agility, reliability, and cost-efficiency are the natural byproducts of well-chosen architectural patterns and tools. Whether you’re working on a small hobby project, a startup’s MVP, or a complex enterprise system, the principles and workflows described here will help you embrace modern cloud-native development with confidence.

Kubernetes Networking: Pod and Service Networking

Understand Pod and Service Networking


Welcome back to my Kubernetes Networking series! In the first article, we covered the fundamentals of Kubernetes networking, including the basic components and the overall networking model. Now, we'll take a look into how Pods and Services communicate within a Kubernetes cluster.

Series Outline:

  1. Fundamentals of Kubernetes Networking
  2. Understand Pod and Service Networking
  3. Network Security with Policies and Ingress Controllers [In Progress]
  4. Service Meshes and Traffic Management in Kubernetes [In Progress]
  5. Kubernetes Networking Best Practices and Future Trends [In Progress]

Table of Contents

  1. Recap of the Kubernetes Networking Model
  2. Pod Networking in Depth
  3. Understanding Kubernetes Services
  4. Traffic Routing and Load Balancing
  5. Practical Examples
  6. Conclusion

1. Recap of the Kubernetes Networking Model

Let's briefly recap the key principles of the Kubernetes networking model:

  • Flat Network Structure: Every Pod in the cluster can communicate with every other Pod without Network Address Translation (NAT).
  • IP-per-Pod: Each Pod gets its own unique IP address within the cluster.
  • Consistent IP Addressing: The IP a Pod sees itself as is the same IP others use to reach it.

These principles simplify application development by abstracting away the underlying network complexities.


2. Pod Networking


2.1 Pod IP Allocation

When a Pod is created, it is assigned an IP address that allows it to communicate with other network entities in the cluster.

  • IPAM (IP Address Management): The Container Network Interface (CNI) plugin handles IP address allocation.
  • IP Range: The cluster has a predefined CIDR range for Pod IP addresses, configured at the time of cluster creation.
  • Per-Node IP Pools: Each Node may have a subset of IPs allocated to it to assign to the Pods it hosts.

Example:

If your cluster Pod CIDR is 10.244.0.0/16, Node 1 might be assigned 10.244.1.0/24, and Node 2 10.244.2.0/24. Pods on Node 1 get IPs like 10.244.1.5, 10.244.1.6, and so on.
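
To make the arithmetic concrete, here is a small Python snippet (standard library only) that carves that example cluster CIDR into per-Node /24 pools:

import ipaddress

cluster_cidr = ipaddress.ip_network("10.244.0.0/16")
# Split the cluster Pod CIDR into /24 blocks, one per Node.
node_subnets = list(cluster_cidr.subnets(new_prefix=24))

print(node_subnets[1])                    # 10.244.1.0/24 -> e.g. Node 1's pool
print(list(node_subnets[1].hosts())[4])   # 10.244.1.5    -> an IP a Pod on Node 1 might get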

2.2 Pod Network Namespace

  • Isolation: Each Pod runs in its own network namespace, providing isolation from other Pods.
  • Shared by Containers in a Pod: Containers within the same Pod share the network namespace, IP address, and network interfaces.
  • Loopback Communication: Containers in a Pod communicate over localhost.

2.3 Inter-Pod Communication

Pods communicate with each other using their IP addresses over the cluster network.

  • Same Node Communication:

    • Bridge Network: On the same Node, Pods communicate via a virtual bridge (e.g., cbr0).
    • Efficient Routing: Packets are switched locally without leaving the host.
  • Cross-Node Communication:

    • Routing Between Nodes: The cluster network routes packets between Nodes.
    • CNI Plugins Role: The CNI plugin sets up the necessary routes and network interfaces (e.g., VXLAN tunnels, BGP peering).

Key Points:

  • No NAT: Direct communication without the need for NAT simplifies connectivity.
  • Flat Address Space: Uniform addressing makes network policies and service discovery straightforward.

3. Understanding Kubernetes Services

Services provide stable endpoints to access a set of Pods.

  • Abstraction: Decouple the frontend from the backend Pods.
  • Load Balancing: Distribute traffic among healthy Pods.
  • Discovery: Allow clients to find services via DNS.

3.1 Service Types Explained

3.1.1 ClusterIP


  • Default Service Type.
  • Access Scope: Exposes the Service on an internal cluster IP.
  • Use Case: Ideal for internal communication within the cluster.
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  • ClusterIP Assigned: Kubernetes assigns a ClusterIP (e.g., 10.96.0.1).
  • Accessing the Service: Other Pods use my-service as the hostname.

3.1.2 NodePort


  • Access Scope: Exposes the Service on a static port on each Node's IP.
  • Port Range: Ports between 30000 and 32767.
  • Use Case: Accessing the Service from outside the cluster without a cloud provider's load balancer.
apiVersion: v1
kind: Service
metadata:
  name: my-nodeport-service
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
      nodePort: 31000
  • Accessing the Service: Use <NodeIP>:31000 from outside the cluster.

3.1.3 LoadBalancer

  • Cloud Provider Integration: Provisions an external load balancer (e.g., AWS ELB, GCP Load Balancer).
  • Access Scope: Exposes the Service externally with a public IP.
  • Use Case: Recommended for production environments requiring external access.
apiVersion: v1
kind: Service
metadata:
  name: my-loadbalancer-service
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  • External IP Assigned: The cloud provider assigns a public IP.
  • Accessing the Service: Use the external IP to reach the Service.

3.1.4 Headless Services

A headless service is a type of Kubernetes Service that does not allocate a ClusterIP. Instead, it allows direct access to the individual Pods' IPs. This is useful for applications that require direct Pod access, such as databases or stateful applications where each Pod needs to be addressed individually.

  • No ClusterIP: Specify clusterIP: None in the Service definition.
  • Direct Pod Access: Clients receive the Pod IPs directly, not the Service IP.
  • Use Case: Stateful applications, databases, and when you need direct control over load balancing.
apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None
  selector:
    app: my-db
  ports:
    - port: 5432
      targetPort: 5432
  • Service Discovery: DNS responds with all the Pod IPs under the Service, allowing clients to connect directly to each Pod.

3.2 Service Discovery Mechanisms

3.2.1 Environment Variables

  • Legacy Mechanism: Relies on environment variables set at Pod creation; they are not updated if the Service changes afterward.
  • Limited Use: Not suitable for dynamic environments.

3.2.2 DNS-Based Discovery

  • CoreDNS: Kubernetes uses CoreDNS for internal DNS resolution.
  • Naming Convention: Services are reachable at service-name.namespace.svc.cluster.local.
  • Automatic Updates: DNS records are updated dynamically as Pods come and go.
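
To see what this means for application code, a Pod can resolve a Service by its cluster DNS name with nothing more than the standard library; a minimal Python sketch (the service and namespace names are placeholders, and it only works from inside the cluster):

import socket

# Resolve the Service by its cluster DNS name; within the same namespace, "my-service" alone also works.
infos = socket.getaddrinfo("my-service.default.svc.cluster.local", 80, proto=socket.IPPROTO_TCP)
for family, socktype, proto, canonname, sockaddr in infos:
    print(sockaddr[0])  # the Service's ClusterIP, e.g. 10.96.0.123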

4. Traffic Routing and Load Balancing

4.1 kube-proxy and Its Modes

kube-proxy is a network proxy that runs on each Node and reflects the Services defined in Kubernetes.
It watches Service and Endpoints (and EndpointSlice) objects and updates the routing rules on its host Node accordingly, so that traffic addressed to a Service reaches a backend Pod.

4.1.1 Operating Modes

  • Userspace Mode (Legacy):

    • How It Works: Intercepts Service traffic in userspace and forwards it to the backend Pod.
    • Performance: Less efficient due to context switching between kernel and userspace.
  • iptables Mode (Default):

    • How It Works: Uses iptables rules to route traffic directly in the kernel space.
    • Performance: More efficient, better scalability.
  • IPVS Mode:

    • How It Works: Uses IP Virtual Server (IPVS) for load balancing in the Linux kernel.
    • Benefits: Scales better for large numbers of Services and endpoints.

4.2 Session Affinity

Session affinity ensures that requests from a client are directed to the same Pod.

  • Client IP Affinity:
    • Configuration: Set sessionAffinity: ClientIP in the Service spec.
    • Timeout: Controlled by service.spec.sessionAffinityConfig.clientIP.timeoutSeconds.
apiVersion: v1
kind: Service
metadata:
  name: my-affinity-service
spec:
  selector:
    app: my-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  • Use Case: Applications that require stateful sessions.

5. Practical Examples

5.1 Creating a Service

Let's create a Deployment and expose it with a Service.

Step 1: Create a Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.17
          ports:
            - containerPort: 80

Step 2: Expose the Deployment

apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  type: ClusterIP
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

Accessing the Service:

  • From another Pod in the same namespace: curl http://nginx-service
  • Using the fully qualified domain name: curl http://nginx-service.default.svc.cluster.local

5.2 Using a Headless Service

Headless Services can be used in combination with StatefulSets.

apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  clusterIP: None
  selector:
    app: mysql
  ports:
    - port: 3306
      targetPort: 3306

StatefulSet Pods:

  • Pods get DNS entries like mysql-0.mysql.default.svc.cluster.local
  • Useful for databases that require stable network identities.

6. Conclusion

Understanding Pod and Service networking at a deeper level equips you to design and troubleshoot applications in Kubernetes more effectively.

Key Takeaways:

  • Pod Networking:

    • Pods are assigned unique IPs from the cluster's Pod CIDR range.
    • Containers within a Pod share the same network namespace.
  • Service Types:

    • ClusterIP: For internal cluster communication.
    • NodePort: Exposes Service on each Node's IP at a static port.
    • LoadBalancer: Integrates with cloud providers to provide external access.
    • Headless Service: No ClusterIP; clients receive Pod IPs directly.
  • Traffic Routing:

    • kube-proxy handles traffic routing using iptables or IPVS.
    • Session affinity ensures consistent routing for client sessions.
  • Service Discovery:

    • CoreDNS allows Pods to resolve Services by name.
    • Headless Services provide direct access to Pod IPs.

You're now ready to build good, scalable applications on Kubernetes :).


In the Next Article:

We'll explore Network Security with Policies and Ingress Controllers, where we'll look at securing your cluster's network communication and managing external access to your services.

Kubernetes Networking: Fundamentals of Kubernetes Networking

Welcome to this new series of articles on Kubernetes networking. My goal is to give you all the information you need so that you never feel lost when it comes to networking in Kubernetes.

We’ll explore the basics of Kubernetes networking, including the networking model, core components, and common networking solutions. This will give you a solid foundation for understanding the more advanced topics we'll cover in later articles.

This will be a five-part series:

  • #1 Fundamentals of Kubernetes Networking
  • #2 Pod-to-Pod and Service Networking
  • #3 Network Security and Ingress Controllers [In Progress]
  • #4 Service Meshes and Advanced Networking [In Progress]
  • #5 Best Practices and Future Trends [In Progress]

1- Understanding The Kubernetes Networking Model

(Diagram: the Kubernetes networking model)

Networking in Kubernetes revolves around a unique and robust model that ensures efficient communication between various components. Understanding the key aspects of this model is crucial to deploying and operating Kubernetes effectively.

1.1 Overview of Kubernetes Networking

Kubernetes networking is essential for application communication within a cluster. The networking model is designed to solve several challenges that arise when deploying and managing containerized applications:

  1. Flat Networking:
  • Kubernetes employs a flat networking model, where all Pods can communicate with each other without Network Address Translation (NAT).
  • It simplifies communication, as every Pod has its own IP address and can directly communicate with other Pods.
  • A flat network means that all Nodes and Pods are on a single network, without any subnets or segmentation.
  2. Cluster Networking:
  • The cluster network is shared by all Pods and Nodes in the Kubernetes cluster.
  • It allows seamless communication across all components, both intra- and inter-Node.
  3. Service Abstractions:
  • Kubernetes uses Services to provide stable network endpoints for a set of Pods.
  • These Services act as load balancers and provide consistent IP addresses and DNS names, facilitating communication between different parts of an application or between applications.
  4. Network Plugins:
  • Kubernetes relies on network plugins known as CNI (Container Network Interface) plugins to implement networking.
  • These plugins manage the network configuration for Pods and provide different networking capabilities.

1.2 Core Networking Requirements

To maintain seamless communication, Kubernetes imposes several key requirements on the network infrastructure:

  1. Pod-to-Pod Communication:
  • All Pods in a Kubernetes cluster should be able to communicate directly with each other, without the need for NAT.
  • This ensures that distributed applications can function correctly, as they often require communication between various Pods.
  2. Node-to-Pod Communication:
  • The IP address that a Pod sees itself as should be the same IP address that other Pods and Nodes see it as.
  • This consistency is crucial for applications relying on knowing their own IP, as well as for network policies and service discovery.

1.3 Networking in Pods, Nodes and the Cluster

Kubernetes networking encompasses several layers of communication, including:

  1. Networking in Pods:
  • Pods are the smallest deployable units in Kubernetes and represent single instances of an application.
  • Each Pod has its own IP address, which is unique within the cluster.
  • Containers within a Pod share the same network namespace, which means they can communicate with each other using localhost.
  2. Networking in Nodes:
  • Nodes are the machines that run the Pods in a Kubernetes cluster.
  • Each Node has a range of IP addresses that it can allocate to the Pods running on it.
  • Nodes communicate with each other over the cluster network, enabling inter-Node communication.
  3. Networking in the Cluster:
  • The cluster network is a flat network shared by all Pods and Nodes.
  • This network allows seamless communication across the entire cluster, regardless of where the Pods and Nodes are located.
  • The cluster network is typically implemented using overlay networks, which encapsulate network traffic to provide seamless connectivity. An overlay network is a virtual network that is built on top of another network.

We've explored the Kubernetes networking model, including an overview of how it works, the core requirements, and how networking functions at different layers within the cluster. Understanding these concepts is essential for anyone looking to deploy and manage applications on Kubernetes.

2- Basic Networking Components

In this section, we'll go into more detail about the fundamental networking components of Kubernetes. We’ll cover Pods, Services, Network Policies, and DNS in Kubernetes, all of which play a key role in Kubernetes networking.

2.1 Pods

Pods are the smallest deployable units in Kubernetes, typically representing one or more containers that share the same context. Let’s go into the details:

Pods are designed to encapsulate application containers, storage resources, a unique network identity (IP address), and other configurations:

  1. Containers:
  • A Pod can have one or more containers running within it.
  • The containers share the same network namespace, which means they can communicate with each other using localhost.
  • Multiple containers in a Pod are often used for closely related functions, such as a main application container and a sidecar container that helps with logging or monitoring.
  2. Networking:
  • Each Pod has a unique IP address within the cluster.
  • Pods can communicate with each other directly using these IP addresses.
  • The IP address of a Pod is typically assigned from the range of IPs available on the Node where the Pod is running.
  3. Lifespan:
  • Pods are ephemeral, meaning they can be created, destroyed, and replaced.
  • When a Pod is replaced, the new Pod will have a different IP address, which is important to consider when designing applications.

How do Pods communicate with each other?

  1. Direct Communication:
  • Pods can communicate directly with each other using their IP addresses.
  • This direct communication is straightforward when Pods are on the same Node, as they share the same local network.
  • For Pods on different Nodes, communication occurs over the cluster network.
  2. Pod-to-Service Communication:
  • Direct communication between Pods can become cumbersome as Pod IPs are ephemeral.
  • Instead, Kubernetes provides Services to enable stable communication, which we’ll explore next.


2.2 Services

Services are an abstraction that defines a logical set of Pods and a policy by which to access them. They are a crucial part of Kubernetes networking because they provide stable endpoints for applications and manage load balancing:

  1. Types of Services:
  • ClusterIP:
    • The default type of Service.
    • Exposes the Service on a cluster-internal IP, which means the Service is only accessible from within the cluster.
    • Ideal for internal communication between different parts of an application.
  • NodePort:
    • Exposes the Service on a specific port on each Node, which allows external traffic to reach the Service.
    • The Node’s IP address and the allocated port are used to access the Service.
  • LoadBalancer:
    • Exposes the Service externally, for example using a cloud provider’s load balancer.
    • Provides a stable IP address and balances traffic across the Pods.
  • ExternalName:
    • Maps a Service to the contents of the externalName field (e.g., a DNS name).
    • Used for integrating external services or exposing internal services with custom DNS names.
  2. Service Proxies:
  • Services use kube-proxy to handle routing and load balancing.
  • kube-proxy runs on each Node and maintains network rules for directing traffic to the appropriate Pod.
  3. Service Discovery:
  • Services can be discovered using their names or through DNS.
  • Kubernetes provides built-in DNS to resolve Service names to IP addresses.

2.3 Network Policies

Network Policies are a Kubernetes resource that controls the traffic allowed to and from Pods:

  1. Purpose:
  • Network Policies provide a way to enforce security controls between Pods and other network entities.
  • They are useful for restricting communication to only what is necessary, which enhances security.
  2. Use Cases:
  • Isolation: by default, all Pods can communicate with each other; Network Policies can isolate Pods or groups of Pods.
  • Security: Network Policies can prevent unintended communication, which helps to secure sensitive workloads.
  3. Implementation:
  • Network Policies are implemented by the CNI plugin used in the cluster.
  • Different plugins may offer varying levels of support and features for Network Policies.

2.4 DNS in Kubernetes

DNS is a fundamental part of Kubernetes networking, enabling name resolution for Pods and Services:

  1. Built-in DNS:
  • Kubernetes has a built-in DNS Service that provides name resolution for Services and Pods.
  • This Service is typically backed by CoreDNS, which is a flexible and extensible DNS server.
  2. Service and Pod DNS:
  • Each Service gets a DNS entry in the format service-name.namespace.svc.cluster.local
  • This allows Pods to refer to Services by name, even if the underlying IP addresses change.
  • Pods can also have DNS entries, which can be customized through annotations or other configurations.
  3. DNS Resolution:
  • DNS resolution allows applications to refer to each other by name, which is easier to manage and more robust than using IP addresses directly.
  • The built-in DNS Service ensures that name resolution is consistent across the cluster.

We've covered in this section the basic networking components in Kubernetes, including Pods, Services, Network Policies, and DNS. Each of these components plays a crucial role in Kubernetes networking, enabling communication within the cluster and with external networks.

3-Kubernetes Networking Tools

Kubernetes networking relies on various tools and plugins to handle the complexities of network communication within and across clusters. In this section, we'll cover some of the most common networking solutions available in Kubernetes and how to choose the right Container Network Interface (CNI) for your cluster.

3.1 Common Networking Solutions


Kubernetes networking is modular, allowing different solutions to be plugged in based on your needs. These solutions implement the CNI specification, providing networking functionality for Pods and Services. Here, we will explore some of the most common and widely used solutions:

  1. Flannel:
  • Overview:
    • Flannel is a simple and easy-to-configure networking solution.
    • It creates an overlay network using VXLAN or other mechanisms, which enables Pod-to-Pod communication across Nodes.
    • Flannel operates on Layer 3 of the OSI model, providing IP connectivity.
  • Use Cases:
    • Ideal for small to medium-sized clusters.
    • Useful for those who want a straightforward setup without advanced features.
  2. Calico:
  • Overview:
    • Calico provides networking and network policy capabilities.
    • It operates at Layer 3, using BGP (Border Gateway Protocol) to distribute routing information between Nodes.
    • Calico also supports eBPF (extended Berkeley Packet Filter) for efficient packet processing.
  • Use Cases:
    • Suitable for environments requiring advanced networking and security features.
    • Useful for enforcing complex network policies and ensuring high-performance networking.
  3. Weave Net:
  • Overview:
    • Weave Net creates a mesh overlay network for Kubernetes clusters.
    • It provides encrypted communication and uses fast data paths for efficient packet delivery.
  • Use Cases:
    • Ideal for secure networking needs.
    • Suitable for environments where simple and robust networking is desired.
  4. Cilium:
  • Overview:
    • Cilium provides advanced networking and security using eBPF.
    • It operates at Layer 3 and 4, offering features like network policies, load balancing, and deep packet inspection.
  • Use Cases:
    • Ideal for environments requiring high security and deep observability.
    • Suitable for advanced networking needs where performance is critical.
  5. Other Solutions:
  • Kube-router:
    • A network plugin that provides Pod networking, network policy, and service proxy functionality.
    • Useful for those who want a unified solution integrating various aspects of networking.
  • Open vSwitch (OVS):
    • A multilayer virtual switch used for advanced networking setups.
    • Suitable for environments requiring complex networking configurations.

Each of these solutions has its strengths and weaknesses, catering to different needs and use cases.

3.2 Choosing the Right CNI for Your Cluster

Choosing the right CNI for your Kubernetes cluster depends on several factors, including the size of your cluster, your networking requirements, and the features you need. Here’s how you can decide which CNI is right for you:

  1. Cluster Size:
  • For small to medium-sized clusters, simpler solutions like Flannel or Weave Net are often sufficient.
  • For larger clusters or environments requiring advanced routing, solutions like Calico or Cilium are more suitable.
  2. Networking Features:
  • If you need basic networking without advanced features, solutions like Flannel or Weave Net are ideal.
  • For environments requiring network policies, load balancing, or deep packet inspection, solutions like Calico or Cilium are better suited.
  3. Security:
  • If security is a primary concern, consider solutions that offer robust network policies and encryption, such as Calico or Weave Net.
  • Cilium, with its eBPF-based packet processing, also provides high-security capabilities.
  4. Performance:
  • For high-performance networking, solutions like Calico with eBPF or Cilium are excellent choices.
  • These solutions provide efficient packet processing and low-latency communication.
  5. Complexity:
  • If you want a straightforward setup without much configuration, Flannel or Weave Net are good options.
  • For more complex and feature-rich environments, Calico, Cilium, or OVS are suitable.
  6. Integration:
  • Consider the integration with other tools and services you may be using.
  • For example, if you need advanced service proxy functionality, kube-router might be a good choice as it integrates networking and service proxies.

In this section, we've explored the common networking solutions available for Kubernetes and how to choose the right CNI for your cluster. Each solution caters to specific use cases, ranging from simple networking to advanced security and performance.
The choice of CNI depends on factors like cluster size, networking features, security, performance, complexity, and integration.

4- Conclusion

The goal of this Kubernetes Networking series is to equip you with a comprehensive understanding of how networking functions within Kubernetes. This first article provided a foundational understanding of the Kubernetes networking model, core components, and essential networking tools.

Key Takeaways:

  1. Kubernetes Networking Model:
  • We saw the unique networking model used by Kubernetes, which revolves around the concepts of flat networking, seamless communication between Pods, and the use of network plugins.
  • This model ensures that all Pods and Nodes within a cluster can communicate with each other directly, without Network Address Translation (NAT).
  2. Core Networking Components:
  • We covered the key networking components, including Pods, Services, Network Policies, and DNS.
  • Pods are the smallest deployable units, each with a unique IP address, while Services provide stable network endpoints.
  • Network Policies offer fine-grained control over communication, and DNS facilitates name resolution for Pods and Services.
  3. Networking Tools:
  • Kubernetes networking relies on various tools and plugins to manage network communication.
  • We explored common solutions like Flannel, Calico, Weave Net, and Cilium, each offering different features and catering to different use cases.

Prometheus and Grafana: Everything to Know for Effective Monitoring

Monitoring is the continuous observation of system metrics, logs, and operations to ensure everything functions as expected. Effective monitoring can preemptively alert you to potential issues before they escalate into major problems. It's about gaining visibility into your IT environment's performance, availability, and overall health, enabling you to make informed decisions.

Together, Prometheus and Grafana form a powerful duo. Prometheus collects and stores the data, while Grafana brings that data to life through visualization.

In this article, you'll learn about their basics and advanced features, how they complement each other, and the best practices.

I - Prometheus


Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It excels in gathering numerical data over time, making it ideal for monitoring the performance of systems and applications. Its philosophy is centered on reliability and simplicity.

Prometheus collects and stores its metrics as time series data: each metric value is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.

When to Use Prometheus

Prometheus excels in environments where you need to track the performance and health of IT systems and applications. It's particularly well-suited for:

  • Machine-centric monitoring: Ideal for keeping an eye on the servers, databases, and other infrastructure components.
  • Dynamic service-oriented architectures: Microservices and cloud-native applications benefit from Prometheus's ability to handle service discovery and frequent changes in the monitored landscape.
  • Quick diagnostics during outages: Due to its autonomous nature, Prometheus is reliable when other systems fail, allowing you to troubleshoot and resolve issues swiftly.
  • Situations where high precision is not critical: Prometheus is perfect for monitoring trends over time, alerting on thresholds, and gaining operational insights.

Real-life example: Consider a scenario where you have a Kubernetes-based microservices architecture. Prometheus can dynamically discover new service instances, collect metrics, and help you visualize the overall health of your system. If a service goes down, Prometheus can still function independently, allowing you to diagnose issues even if parts of your infrastructure are compromised.

When Not to Use Prometheus

However, Prometheus might not be the best fit when:

  • Absolute accuracy is required: For tasks like per-request billing, where every single data point must be accounted for, Prometheus's data might not be granular enough.
  • Long-term historical data analysis: If you need to store and analyze data over very long periods, Prometheus might not be the best tool due to its focus on real-time monitoring.

Real-life example: If you're running an e-commerce platform and need to bill customers for each API request, relying on Prometheus alone might lead to inaccuracies because it's designed to monitor trends and patterns, not to track individual transactions with 100% precision. In this case, you'd want a system that logs each transaction in detail for billing, while still using Prometheus for overall system monitoring and alerting.

Prometheus is a particularly good fit when you need reliability and can tolerate slight imprecision in favor of overall trends and diagnostics.

Architecture Overview


Prometheus includes several components that work together to provide a comprehensive monitoring solution:

  • Prometheus Server: The core component where data retrieval, storage, and processing occur. It consists of:

    • Retrieval Worker: Pulls metrics from the configured targets at regular intervals.
    • Time-Series Database (TSDB): Stores the retrieved time-series data efficiently on the local disk.
    • HTTP Server: Provides an API for queries, administrative actions, and to receive pushed metrics if using the Pushgateway.
  • Pushgateway: For supporting short-lived jobs that cannot be scraped, the Pushgateway acts as an intermediary, allowing these ephemeral jobs to push metrics. The Prometheus server then scrapes the aggregated data from the Pushgateway.

  • Jobs/Exporters: These are external entities or agents that expose the metrics of your target systems (e.g., databases, servers, applications) in a format that Prometheus can retrieve. They are either part of the target system or stand-alone exporters that translate existing metrics into the appropriate format.

  • Service Discovery: Prometheus supports automatic discovery of targets in dynamic environments like Kubernetes, as well as static configuration, which simplifies the management of target endpoints that Prometheus needs to monitor.

  • Alertmanager: Handles the alerts sent by the Prometheus server. It manages the routing, deduplication, grouping, and silencing of alert notifications. It can notify end-users through various methods, such as email, PagerDuty, webhooks, etc.

  • Prometheus Web UI and Grafana: The Web UI is built into the Prometheus server and provides basic visualizations and a way to execute PromQL queries directly. Grafana is a more advanced visualization tool that connects to Prometheus as a data source and allows for the creation of rich dashboards.

  • API Clients: These are the tools or libraries that can interact with the Prometheus HTTP API for further processing, custom visualization, or integration with other systems.

Core Features


Prometheus is designed with a set of core features that make it an efficient tool for monitoring and alerting. These features are centered around a multi-dimensional data model, a powerful query language, and a flexible data collection approach.

  • Prometheus uses a multi-dimensional data model where time series data is identified by a metric name and a set of key-value pairs, known as labels. This allows precise representation of monitoring data, enabling you to distinguish between different instances of a metric or to categorize metrics across various dimensions.
  • Prometheus's data retrieval is predominantly based on a pull model over HTTP, which means that it fetches metrics from configured targets at defined intervals. However, it also offers a push model for certain use cases via the Pushgateway.
  • Service discovery can automatically identify monitoring targets in dynamic environments, or through static configuration for more stable setups.
  • Lastly, the built-in query language, PromQL, provides a powerful way to query this data, allowing users to slice and dice metrics in a multitude of ways to gain insights.

These core features, when leveraged together, provide a powerful platform for monitoring at scale, capable of handling the complex and dynamic nature of modern IT infrastructure.

Metrics Collection


Instrumenting your applications is about embedding monitoring code within them so that Prometheus can collect relevant metrics. It's like giving your applications a voice, allowing them to report on their health and behavior.

To instrument an application:

  1. Choose Libraries: Select the appropriate client library for your programming language that Prometheus supports. For instance, if your application is written in Python, you would use the Prometheus Python client.
  2. Expose Metrics: Use the library to define and expose the metrics you want to monitor. These could be anything from the number of requests your application handles to the amount of memory it's using.
  3. Create an Endpoint: Set up a metrics endpoint, typically /metrics, which is a web page that displays metrics in a format Prometheus understands.
  4. Configure Scraping: Tell Prometheus where to find this endpoint by adding the application as a target in Prometheus’s configuration.

Assuming you have a Python web application and you want to expose metrics for Prometheus to scrape. You would use the Prometheus Python client to define and expose a simple metric, like the number of requests received.

Here's an example using Flask:

from flask import Flask, Response
from prometheus_client import Counter, generate_latest

# Create a Flask application
app = Flask(__name__)

# Define a Prometheus counter metric
REQUEST_COUNTER = Counter('app_requests_total', 'Total number of requests')

@app.route('/')
def index():
    # Increment the counter
    REQUEST_COUNTER.inc()
    return 'Hello, World!'

@app.route('/metrics')
def metrics():
    # Expose the metrics
    return Response(generate_latest(), mimetype='text/plain')

if __name__ == '__main__':
    app.run(host='0.0.0.0')

This snippet shows a simple web server with two endpoints: the root (/) that increments a counter every time it's accessed, and the /metrics endpoint that exposes the metrics.

Prometheus Configuration Example

For Prometheus to scrape metrics from the instrumented application, you need to add the application as a target in Prometheus's configuration file (prometheus.yml). Here's a simple example:

global:
  scrape_interval: 15s  # By default, scrape targets every 15 seconds.

scrape_configs:
  - job_name: 'python_application'
    static_configs:
      - targets: ['localhost:5000']

This configuration tells Prometheus to scrape our Python application (which we're running locally on port 5000) every 15 seconds.

Service Discovery

Service discovery in Prometheus automates the process of finding and monitoring targets. It ensures Prometheus always knows what to monitor.

Prometheus supports several service discovery mechanisms:

  • Static Configuration: Define targets manually in the Prometheus configuration file.
  • Dynamic Discovery: Use integrations with systems like Kubernetes, Consul, or AWS to automatically discover targets as they change.

This means if a new instance of your application is up, Prometheus will automatically start monitoring it without manual intervention.

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

This code would go in your prometheus.yml file and tells Prometheus to discover all pods in a Kubernetes cluster.

Alertmanager

Alertmanager handles alerts sent by the Prometheus server and is responsible for deduplicating, grouping, and routing them to the correct receiver, such as email or PagerDuty.

Here's an example alertmanager.yml configuration file for Alertmanager:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'your-email@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_identity: 'alertmanager@example.com'
    auth_password: 'password'

This Alertmanager configuration sets up email notifications as the alerting method. It groups alerts by the alertname and instance labels, waits 10 seconds (group_wait) before sending the first notification for a new group, waits 10 minutes (group_interval) before notifying about changes to an existing group, and re-sends unresolved alerts at most once per hour (repeat_interval).

By instrumenting your applications, leveraging service discovery, and configuring Alertmanager, Prometheus becomes a vigilant guardian of your infrastructure, always on the lookout for anomalies and equipped to notify you the moment something needs attention.

II - Grafana


Grafana is an open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts when connected to supported data sources, including Prometheus. Essentially, Grafana allows you to turn your time-series database data into beautiful graphs and visualizations.

Overview and Core Functionalities

Grafana is known for its powerful and elegant dashboards. It is feature-rich and widely used for its:

  • Dashboards: They are versatile, allowing users to create and design comprehensive overviews of metrics, complete with panels of various types that can display data from multiple sources all in one place.
  • Data Sources: Grafana supports a wide array of data sources, from Prometheus to Elasticsearch, InfluxDB, MySQL, and many more. Each data source has a dedicated query editor that is customized to feature the full capabilities of the source, allowing intricate control over the data display.
  • Visualization Panels: These are the building blocks of Grafana dashboards. Panels can be used to create a variety of visualization elements, from line charts to histograms and even geospatial maps.
  • Annotations: This feature allows you to mark events on your graphs, providing a rich context that can be invaluable during analysis.
  • Security Features: Grafana provides robust security features, including data source proxying, user roles, and authentication integrations with Google, GitHub, LDAP, and others.

Advanced Features of Grafana


While Grafana is recognized for its dashboarding capabilities, it also offers a suite of advanced features that enable more detailed data analysis and manipulation. Here are some of these features:

Query management in Explore

Explore is an ad-hoc query workspace in Grafana, designed for iterative and interactive data exploration. It is particularly useful for:

  • Troubleshooting: Quickly investigate issues by freely writing queries and immediately visualizing results.
  • Data exploration: Go deeper into your metrics and logs beyond the pre-defined dashboards, enabling you to uncover insights that aren’t immediately visible.
  • Comparison: Run queries side by side to compare data from different time ranges, sources, or to visualize the effect of certain events.

Transformations

Transformations in Grafana allow you to manipulate the data returned from a query before it's visualized. This feature is crucial when you want to:

  • Join Data: Combine data from multiple sources into a single table, which is especially useful when you're comparing or correlating different datasets.
  • Calculate: Perform calculations to create new fields from the queried data, such as calculating the average response time from a set of individual requests.
  • Filter and Organize: Reduce the dataset to the relevant fields or metrics, and reorganize them to suit the requirements of your visualization.

State Timeline Panel

The State Timeline panel is one of Grafana's visualizations that displays discrete state changes over time. This is beneficial for:

  • Status Tracking: Monitoring the on/off status of servers, the state of feature flags, or the availability of services.
  • Event Correlation: Visualizing when particular events occurred and how they correlate with other time-based data on your dashboard.

Alerting

Grafana's alerting feature allows you to define alert rules for the visualizations. You can:

  • Set Conditions: Define conditions based on the data patterns or thresholds that, when met, will trigger an alert.
  • Notify: Set up various notification channels like email, Slack, webhooks, and more, to inform the relevant parties when an alert is triggered.

Templating and Variables

With templating and variables, you can:

  • Create Dynamic Dashboards: Adjust the data being displayed based on user selection, without modifying the dashboard itself.
  • Reuse Dashboards: Use the same dashboard for different servers, applications, or environments by changing the variable, saving time, and ensuring consistency across different views.

Best Practices


Creating dashboards that are both informative and clear is not easy. Here are some best practices:

  • Clarity: Keep your dashboards uncluttered. Each dashboard should have a clear purpose and focus on the most important metrics.
  • Organization: Group related metrics together. For instance, CPU, memory, and disk usage metrics could be on the same dashboard for server monitoring.
  • Consistency: Use consistent naming and labeling conventions across your dashboards to make them easier to understand and use.
  • Annotations: Use annotations to mark significant events, like deployments, so you can see how they affect your metrics at a glance.
  • Interactivity: Use Grafana’s interactive features like variables to allow users to explore data within the dashboard.
  • Refresh Rates: Set reasonable dashboard refresh rates to ensure up-to-date data without overloading your data source or Grafana server.

Naming Conventions for Metrics

Consistent naming conventions are crucial. They ensure that metrics are easily identifiable, understandable, and maintainable. For instance, a metric name like http_requests_total is clear and indicates that it’s a counter metric tallying the total number of HTTP requests.
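
To make this concrete, here is a minimal sketch using the Python prometheus_client library (an illustrative assumption, not something required by the setup described here) that defines a counter following this convention:

from prometheus_client import Counter

# The client appends the "_total" suffix to counters on exposition,
# so this metric is scraped as http_requests_total.
HTTP_REQUESTS = Counter(
    "http_requests",
    "Total number of HTTP requests",
    ["method", "endpoint", "status"],   # labels keep one clear metric name reusable
)

HTTP_REQUESTS.labels(method="GET", endpoint="/health", status="200").inc()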

Efficient Querying

For efficient PromQL queries (a short sketch follows this list):

  • Be Specific: Craft your queries to be as specific as possible to fetch only the data you need.
  • Use Filters: Apply label filters to reduce the data set and increase query speed.
  • Avoid Heavy Operations: Functions like count_over_time can be resource-intensive. Use them judiciously.
  • Time Ranges: Be cautious with the time range; querying a massive range can lead to high load and slow responses.
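
As a small sketch of these ideas (the Prometheus URL, job, and labels are illustrative assumptions), the query below is specific, label-filtered, and bounded to a narrow time range:

import requests

PROMETHEUS_URL = "http://localhost:9090"   # assumed address of the Prometheus server

# A specific, label-filtered expression instead of a bare metric name.
query = 'rate(http_requests_total{job="fintech-app", status="500"}[5m])'

response = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": query,
        "start": "2023-01-01T09:00:00Z",   # keep the queried range narrow
        "end": "2023-01-01T10:00:00Z",
        "step": "60s",
    },
    timeout=10,
)
response.raise_for_status()
for series in response.json()["data"]["result"]:
    print(series["metric"], series["values"][:3])
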
Security Considerations

For security in Prometheus and Grafana:

  • Data Encryption: Use HTTPS to encrypt data in transit between Prometheus, Grafana, and users.
  • Access Control: Implement strict access controls in Grafana, with different user roles and permissions.
  • Authentication: Use Grafana’s built-in authentication mechanisms, or integrate with third-party providers.
  • Regular Updates: Keep Prometheus and Grafana up to date with the latest security patches.

By integrating Prometheus with Grafana and following these best practices, you can create a monitoring environment for your systems that is secure, efficient, and user-friendly. This combination can be a powerful asset in any infrastructure, enabling teams to detect and address issues proactively.

III - Conclusion


Grafana makes sense of all the numbers Prometheus collects by turning them into easy-to-read dashboards and graphs. This makes it easier for you to see what's going on and make smart decisions.

However, Prometheus isn't great for everything. If you need to track every single event with perfect accuracy, for example for billing, it's not the best choice; it's better suited for looking at overall trends and issues.

]]>
<![CDATA[AWS IAM: From Basics to Advanced]]>https://thelearningjourney.co/aws-iam-from-basics-to-advanced-strategies/65ac0c6c17195d00014d7705Wed, 24 Jan 2024 13:21:00 GMT

IAM is a crucial part of managing security in Amazon Web Services (AWS).

But what is it?

At its core, IAM is all about who can do what in your AWS account. Think of it as the gatekeeper. It lets you decide who is allowed to enter your AWS cloud space and what they can do once they're in. In AWS, “who” could be a person, a service, or even an application, and “what they can do” ranges from reading data from a database to launching a new virtual server.

In this guide, we'll explore the key aspects of IAM, from basic concepts like users and permissions to more advanced features like permissions boundaries and trust policies. So, let's get started on this journey to master AWS IAM.

I — How IAM Works


Let's break down some key IAM terms and concepts:

  • Principal: A principal is like a person or thing that can make a request in AWS. It can be a user (like an employee), a service (like Amazon S3), or even an application. The principal is essentially who is asking for access.
  • Request: When a principal wants to do something in AWS, they make a request. For example, a user might request to open a file stored in AWS, or a service might request to launch a new virtual server. Each request is evaluated against IAM policies to determine whether it should be allowed or denied.
  • Authentication: This is the process of proving who the principal is. Think of it like showing your ID card before entering a secure building. In AWS, this often means providing a username and password, an access key, or some other form of identity verification.
  • Authorization: Authorization is about deciding if a principal can do what they're asking to do. After the principal is authenticated, AWS checks if they have the right permissions.
  • Actions: Actions are the specific tasks that a principal can perform. These can be simple things like reading a file or more complex actions like creating a new database.
  • Resources: Resources are the actual items in AWS that the principals interact with. These could be anything like an Amazon S3 bucket where data is stored, an EC2 instance (virtual server), or a database in RDS.


Let's use an example.

Note that in a standard IAM policy, “Authentication” is not a field since it's a process rather than a policy attribute.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:user/Sophnel"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*"
        }
    ]
}
IAM Policy

In this policy:

Principal: "AWS": "arn:aws:iam::123456789012:user/Sophnel"

  • The principal is the entity (user, role, service, etc.) that is allowed or denied access. In IAM policies attached to users or roles, this field is usually implicit and not included. It's more commonly used in resource-based policies, like those in S3 or SNS.

Action: "s3:GetObject"

  • This specifies what actions the principal can perform.
  • In this example, the action is s3:GetObject, which allows the principal to read objects from the specified S3 bucket.

Resource: "arn:aws:s3:::example-bucket/*"

  • This defines on which resources the actions can be performed.
  • Here, it's all objects (/*) in the example-bucket S3 bucket.

Effect: "Allow"

  • This indicates whether the actions are allowed or denied. Possible values are “Allow” or “Deny”.
  • In this case, it's allowing the specified actions.

Authentication:

  • Not a field in the policy, but an essential part of the IAM process.
  • The principal must authenticate (e.g., log in) before AWS evaluates this policy to grant or deny the specified actions.

Optional and Required Fields

  • Principal: Optional in IAM user/role policies but required in resource-based policies.
  • Action: Required. You must specify what actions the policy allows or denies.
  • Resource: Required. Policies must specify which resources the actions apply to.
  • Effect: Required. Policies need to specify whether they allow or deny the actions.

Remember, the structure and fields of an IAM policy can vary depending on its use (e.g., attached to a user/role or used as a resource-based policy). In user/role policies, the principal is usually the user/role itself and not explicitly stated. In resource-based policies (like those used in S3 buckets), the principal must be specified.

Understanding these terms helps to grasp how IAM works. It's all about managing who (principal) can do what (actions) with which items (resources) securely (authentication and authorization).

In the following sections, we'll learn about each of these aspects and see how they come together to form the backbone of AWS security.

II — IAM Fundamentals

In this section, we will learn more about the core components of IAM.

  • IAM Users: Think of an IAM User as an individual identity for anyone or anything that needs to access AWS. It's like having a personal profile.
  • IAM Groups: Groups are like departments in a company. Each group contains multiple users, and you manage permissions for the whole group, not each person individually.
  • IAM Roles: They are used to grant specific permissions not to users directly, but to entities (like users, applications, or services) for a limited time.
  • IAM Policies: Policies are documents that clearly outline permissions. They describe what actions are allowed or denied in AWS.

Least Privilege Principle

This principle means giving someone only the permissions they need to do their job – nothing more. It’s like giving a key card that only opens the doors someone needs to access. It reduces the risk of someone accidentally (or intentionally) doing something harmful.
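
For illustration, here is a sketch of applying least privilege programmatically with boto3; the user name reuses the earlier example, while the policy name and bucket are hypothetical:

import json
import boto3

iam = boto3.client("iam")

# Grant read-only access to a single bucket and nothing more.
least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-bucket",
                "arn:aws:s3:::example-bucket/*",
            ],
        }
    ],
}

iam.put_user_policy(
    UserName="Sophnel",
    PolicyName="ReadOnlyExampleBucket",
    PolicyDocument=json.dumps(least_privilege_policy),
)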

This foundational understanding of IAM will set the stage for exploring more advanced topics.

III — Advanced Concepts in IAM

In this section, we go deeper into some advanced IAM concepts that play a critical role in managing access and security in AWS.

  1. Service Roles

Service Roles are special types of IAM roles designed specifically for AWS services to interact with other services. They are like permissions given to AWS services to perform specific tasks on your behalf.

For example, you might have an AWS Lambda function that needs to access your S3 buckets. You create a service role for Lambda that gives it the necessary permissions to read and write to your S3 buckets.

These roles are crucial for automating tasks within AWS and for enabling different AWS services to work together seamlessly and securely.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::MyExampleBucket/*"
    }
  ]
}

When this policy is attached to a role (for example, the role used by MyLambdaFunction), it allows that role to perform any action on MyExampleBucket.

2. IAM Boundaries

IAM permissions boundaries are an advanced feature in AWS IAM that helps further tighten security. These boundaries are essentially guidelines or limits you set to control the maximum permissions that IAM roles and users can have.

  • Purpose: Boundaries are used to ensure that even administrators or other high-privileged users can't grant more permissions than intended. This is particularly useful in large organizations or environments where strict compliance and security guidelines are necessary.
  • How They Work: Imagine setting up a fence around what an IAM user or role can do. Even if someone can assign permissions, they can't go beyond what's defined by this fence. It's a safety net to prevent excessive privileges.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::MyExampleBucket/*"
    }
  ]
}

In this case, with this boundary applied, DevRole can only read from and write to MyExampleBucket, regardless of what other permissions it's granted.

3. Trust Policies

In AWS, a trust policy is attached to a role and defines which entities (like an AWS service or another AWS account) are allowed to assume the role. It's like saying, “This ID badge can only be used by these specific people or roles.”

  • Function: They act as a rule book stating which AWS entities can use a certain set of permissions.
  • Example: If you have a service role for AWS EC2, the trust policy specifies that only EC2 can assume this role, ensuring that the permissions are used only by the intended service.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
      "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Here, EC2 instances can assume the MyEC2Function role.
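
To make this concrete, here is a sketch of creating such a role with boto3; the trust policy mirrors the JSON above and is passed as the AssumeRolePolicyDocument:

import json
import boto3

iam = boto3.client("iam")

# Only the EC2 service is allowed to assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="MyEC2Function",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Role assumable only by EC2 instances",
)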

4. IAM Policy with Conditions and Context Keys

IAM policy conditions are additional specifications you can include in a policy to control when it applies. Think of them like extra rules in a game that apply only under certain circumstances.
Context keys are the specific elements within a request that a condition checks against. They are the details that trigger the conditions.

  • How They Work: These conditions can check many factors, like the time of the request, the requester's IP address, whether they used Multi-Factor Authentication (MFA), and more.
  • Example: You could create a policy that allows access to a resource only if the request comes during business hours or from a specific office location.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::MyExampleBucket/*",
      "Condition": {
        "IpAddress": {"aws:SourceIp": "192.168.100.0/24"},
        "DateGreaterThan": {"aws:CurrentTime": "2023-01-01T09:00:00Z"},
        "DateLessThan": {"aws:CurrentTime": "2023-01-01T17:00:00Z"}
      }
    }
  ]
}

This policy grants the roles, users, or groups it is attached to access to MyExampleBucket, but only from the specified IP range and during the specified hours.

5. Cross-Account Access Management

In larger organizations, you often have multiple AWS accounts. Cross-account access management is about allowing users from one AWS account to access resources in another.

  • Method: You create a role in one account (the resource account) and grant permissions to a user in another account (the accessing account) to assume this role.
  • Security: Trust policies play a crucial role here, specifying which accounts can access the role.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
      "Action": "sts:AssumeRole"
    }
  ]
}

In this example, users from Account B (ID 123456789012) can assume the CrossAccountRole in Account A.

6. IAM Role Chaining

Role chaining occurs when an IAM role assumes another role. It's like passing a baton in a relay race; one role hands over permissions to another.

  • Limitations: This process has a one-hour maximum duration and some restrictions, like not being able to access resources for which the original role didn't have permission.
  • Use Case: It's often used in complex scenarios where multiple levels of access and separation of duties are needed.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::123456789012:role/SecondRole"
    }
  ]
}

Here, InitialRole can pass its permissions to SecondRole for further actions.

Let's say you have an IAM user named DevUser who needs to perform actions that require different permissions from time to time, perhaps for deploying a service. Instead of giving DevUser a broad range of permissions, you give them permission to assume SecondRole, which has the necessary permissions for that deployment task.

When DevUser runs a deployment script, they first assume SecondRole using the AssumeRole action. AWS STS then provides DevUser with temporary credentials. With these credentials, DevUser temporarily has the permissions of SecondRole and can carry out the deployment.
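
A sketch of that flow with boto3 might look like this; the role ARN reuses the example account ID from above, and the session name is arbitrary:

import boto3

# DevUser assumes SecondRole via STS and uses the temporary credentials
# for the deployment task.
sts = boto3.client("sts")

assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/SecondRole",
    RoleSessionName="devuser-deployment",
    DurationSeconds=3600,   # chained role sessions are capped at one hour
)

credentials = assumed["Credentials"]

# Build a session that acts with SecondRole's permissions.
deploy_session = boto3.Session(
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

print(deploy_session.client("sts").get_caller_identity()["Arn"])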

This follows the principle of least privilege by only granting permissions as needed for specific tasks, which is a security best practice. It also provides an audit trail because actions taken with the assumed role can be traced back to the original entity that assumed the role, ensuring accountability and easier troubleshooting in complex environments.

By incorporating these advanced IAM concepts into your AWS strategy, you'll be better equipped to handle complex security and access requirements. These features allow for greater flexibility, precision, and control in managing access to your AWS resources, ensuring that your cloud environment remains both robust and secure.

IV - Relationships Between AWS IAM Components



The schema shows the relationships between different IAM components. The goal is to provide an easy way of understanding how they work with each other.

IAM Users:

  • Direct Permissions: Users can be assigned Identity-Based Policies that grant them direct permissions to perform actions on AWS resources.
  • Group Membership: Users can be part of Groups, and through this membership, they inherit any Identity-Based Policies attached to those Groups.
  • Permissions Boundaries: Users can have Permissions Boundaries applied to them, which restrict the maximum permissions they can have, regardless of what is directly assigned to them or inherited from Groups.
  • Role Assumption: Users can assume Roles if allowed by the Role's Trust Policy, temporarily gaining the permissions associated with those Roles.

IAM Groups:

  • Shared Permissions: Groups are used to manage permissions for multiple Users. They don't have their own permissions but can have Identity-Based Policies attached. These policies grant permissions to all Users within the Group.
  • No Direct Boundaries: Groups cannot have Permissions Boundaries applied to them. Boundaries are applied at the User or Role level.

IAM Roles:

  • Assumable Permissions: Roles are assumable sets of permissions that Users or AWS services can take on temporarily. They allow for more flexible and secure permission delegation.
  • Identity-Based Policies: Roles can have Identity-Based Policies attached, which specify the permissions that the entity assuming the Role will obtain.
  • Trust Policies: Roles have Trust Policies, which define which principals (Users or AWS services) are allowed to assume the Role.
  • Permissions Boundaries: Roles can also have Permissions Boundaries that restrict the maximum permissions that can be granted by the Role.

To conclude, AWS Identity and Access Management (IAM) is an essential component for managing security and access within the AWS ecosystem.

Key takeaways include:

  • The importance of the principle of least privilege
  • The role of trust policies in defining who can assume specific roles
  • The use of permissions boundaries to limit the maximum permissions assignable to users and roles.

Additionally, advanced concepts like service roles, cross-account access management, and role chaining offer sophisticated methods to handle complex access requirements and ensure seamless, secure operations across multiple services and accounts.

]]>
<![CDATA[How to Master DevOps with Python, Terraform, and Kubernetes on AWS]]>https://thelearningjourney.co/automating-the-cloud-the-evolution-of-a-python-app-with-docker-kubernetes-and-terraform/658eb42074deeb0001f2590fTue, 16 Jan 2024 21:20:28 GMT

Introduction

As a DevOps engineer, I am usually involved with pipelines, automation, and cloud services. However, I've always been curious about the other side of the tech world, which is application development. So, I thought, why not mix things up a bit? That's how I found myself building a Python financial app, complete with a REST API.

This blog post documents the journey of developing and deploying my mock financial application from scratch, from coding the initial app to its deployment on AWS using Docker, Kubernetes (EKS), Terraform, and Ansible. And guess what? I've automated the whole process - every single bit of it!

If you're itching to see how it all came together, check out my GitHub repository for all the details.

GitHub - bobocuillere/DevOpsProject-FintechAPP-AWS-Terraform: a mock financial application built with DevOps tools such as Kubernetes, Docker, AWS, RDS, monitoring, and more.

In this article, we'll learn:

  1. Python Development: We'll explore how I used Flask to create a Python-based REST API application.
  2. Containerization and Orchestration: I'll share the benefits I discovered in using Docker and Kubernetes for deploying applications.
  3. Scripting: You'll learn how Python and Bash scripting became my go-to for automating key tasks, boosting both efficiency and reliability.
  4. Terraform for Infrastructure: I'll walk you through my experience using Terraform to define and provision cloud infrastructure.
  5. Ansible for Configuration Management: I'll show how Ansible made configuring monitoring tools (like Prometheus and Grafana) a breeze.
  6. CI/CD Practices: And finally, I'll share insights on implementing a CI/CD pipeline with GitHub Actions for a deployment that's both consistent and automated.

I - Architecture


In our project, we integrate cloud-based services with external tools and continuous integration/deployment (CI/CD) using GitHub Actions to create a robust and scalable application.

Step 1: Code Commit to Deployment

  • Git Users: It all starts with developers committing code to a Git repository.
  • Git Push: Committing code triggers a push to the remote repository, setting in motion the automated CI/CD pipeline.
  • CI/CD Pipeline: Using GitHub Actions, this pipeline automates the building, testing, and deployment of the application and the infrastructure seamlessly.

Step 2: Building and Storing the Docker Image

  • Amazon ECR: Once built, the application’s Docker image is sent to Amazon ECR, a container registry.

Step 3: Infrastructure Provisioning

  • VPC and Availability Zones: The VPC forms the core of the network architecture, providing isolation and routing, while Availability Zones offer redundancy and high availability.
  • Public Subnet: Hosts internet-facing services, including EKS Worker Nodes and monitoring tools.
  • Private Subnet: Enhances security by hosting services like the Amazon RDS out of direct internet reach.
  • Network Load Balancer: This acts as a traffic director, ensuring efficient and reliable handling of user requests.

Step 4: Securing Secrets and Database Connectivity

  • AWS Secrets Manager: A vault for sensitive information like database credentials and API keys, crucial for secure database and service access.

Step 5: User Interaction with the Application

  • User: The end-user interacts with the application via the Network Load Balancer, which intelligently routes their requests to the backend services on EKS Worker Nodes.

Step 6: Interaction Between Components

  • EKS Worker Nodes and AWS Secrets Manager: The application is hosted on the EKS nodes; to communicate with RDS, it retrieves the credentials from Secrets Manager.
  • RDS Database and AWS Secrets Manager: The RDS instance integrates with Secrets Manager for credential retrieval.

II - Building with Python


I structured the application with the following main components:

  • app.py: The main Flask application file.
  • models.py: To define database models (like Users, Accounts, Transactions).
  • views.py: To handle the routing and logic of the application.
  • templates/: A directory for HTML templates.
  • static/: A directory for static files like CSS, JavaScript, and images.
  • tests/: For unit test scripts.

Key Features

Building the REST API was interesting from a learning perspective, as it was my first time coding one. I structured endpoints to manage user authentication, accounts, and transactions.
I coded the logic to manage different transaction types such as deposits and withdrawals, ensuring accurate balance updates and transaction validations.
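
As a simplified sketch of that transaction logic (not the project's actual views.py; the route and field names are assumptions), a withdrawal endpoint can validate the balance before applying the update:

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-in for the Accounts table used by the real application.
accounts = {1: {"balance": 100.0}}

@app.route("/api/accounts/<int:account_id>/withdraw", methods=["POST"])
def withdraw(account_id):
    account = accounts.get(account_id)
    if account is None:
        return jsonify({"error": "account not found"}), 404

    payload = request.get_json(silent=True) or {}
    amount = float(payload.get("amount", 0))
    if amount <= 0:
        return jsonify({"error": "amount must be positive"}), 400
    if amount > account["balance"]:
        return jsonify({"error": "insufficient funds"}), 400

    account["balance"] -= amount
    return jsonify({"balance": account["balance"]}), 200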


  1. User Registration and Authentication:
    • Secure user registration and login system.
    • Password hashing for security.
    • Session-based user authentication.

2. Account Operations:
- Creation of financial accounts.
- Viewing account details including balance and created date.

3. Transaction Management:
- Performing deposit and withdrawal transactions.
- Viewing a list of transactions for each account.

III - Docker

Docker is a platform for developing, shipping, and running applications inside containers. Containers are lightweight, standalone, and executable software packages that include everything needed to run an application: code, runtime, system tools, system libraries, and settings.

The main goal was to ensure our application scales efficiently and runs consistently across different environments.

I'll share how I approached this task, the key questions that guided my decisions, and the rationale behind the choices I made.

The Dockerfile

A Dockerfile is a text document containing commands to assemble a Docker image. The Docker image is a lightweight, standalone, executable package that includes everything needed to run your application.

  1. Selecting the Base Image:

    • FROM python:3.10.2-slim: A slim version of Python was chosen as the base image for its balance between size and functionality. It provided just the necessary components required to run our Flask application without the overhead of a full-fledged OS.
  2. Installing PostgreSQL Client:

    • The decision to install a PostgreSQL client (postgresql-client) was made to support our wait-for-postgres.sh script. This was a crucial part of ensuring that the Flask application only starts after the database is ready to accept connections.
  3. Optimizing for Docker Cache:

    • By copying only the requirements.txt file initially and installing dependencies, we leveraged Docker’s caching mechanism. This meant faster builds during development, as unchanged dependencies wouldn't need to be reinstalled each time.
  4. Setting Up the Application:

    • Copying the application code into the /app directory and setting it as the working directory established a clear and organized structure within the container.
  5. Exposing Ports and Setting Environment Variables:

    • EXPOSE 5000 and ENV FLASK_APP=app.py: These commands made our application accessible on port 5000 and specified the entry point for our Flask app.
  6. Implementing the Wait Script:

    • The wait-for-postgres.sh was a decision to handle dependencies between services, particularly ensuring the Flask app doesn’t start before the database is ready.

Docker Compose for Local Development


The docker-compose.yml file played a role in defining and linking multiple services (the Flask app and PostgreSQL database).

  • Environment variables like DATABASE_URI were configured to dynamically construct the database connection string, ensuring flexibility and ease of configuration.

Database Configuration:

  • Setting up PostgreSQL as a separate service with its environment variables (POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD) allowed for an isolated and controlled database environment. This separation is crucial in a microservices-oriented architecture.

Reflecting on this dockerization process, several key lessons stand out:

  • Flexibility and Control: Docker provided a level of control over the application's environment that is hard to achieve with traditional deployment methods.
  • Problem-Solving: Implementing the wait-for-postgres.sh script was a practical solution to a common problem in containerized environments – managing service dependencies.
  • Best Practices: The entire process reinforced the importance of understanding and implementing Docker best practices, from choosing the right base image to optimizing build times and ensuring service availability.

IV - Migrate to AWS

After successfully containerizing the application, the next phase was moving it to AWS and creating a monitoring infrastructure with Prometheus and Grafana.

Terraform for Infrastructure as Code

To automate the cloud infrastructure setup, I used Terraform.

You can find below the architecture.

.
├── backend.tf
├── fintech-monitoring.pem
├── main.tf
├── modules
│   ├── ec2
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── eks
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── rds
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── security_groups
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   └── vpc
│       ├── main.tf
│       ├── outputs.tf
│       └── variables.tf
├── outputs.tf
├── provider.tf
├── terraform.tfvars
└── variables.tf
Terraform architecture

I structured my Terraform configuration into distinct modules, each focusing on different aspects of the AWS infrastructure. This modular design enhanced readability, reusability, and maintainability.

  1. VPC Module: Set up the Virtual Private Cloud (VPC) for network isolation, defining subnets, route tables, and internet gateways.
  2. EKS Module: Deployed the Amazon Elastic Kubernetes Service (EKS) cluster. This module handled the setup of the Kubernetes control plane, worker nodes, and necessary IAM roles for EKS.
  3. RDS Module: Created a PostgreSQL database instance using Amazon RDS. This module managed database configurations, including instance sizing, storage, and network access.
  4. Security Groups Module: Defined security groups needed for any parts of our infrastructure, ensuring tight security boundaries.
  5. Monitoring Module: Created the EC2 instances for the monitoring infrastructure (the Prometheus and Grafana servers).

Ansible for Prometheus and Grafana

To further automate the monitoring configuration, I used Ansible. It played a crucial role in automating repetitive tasks, ensuring that the environment was configured correctly.

  • Prometheus Configuration: I wrote an Ansible playbook to set up Prometheus. This playbook handled the installation of a Prometheus server.
  • Grafana Setup: Another playbook was dedicated to Grafana. It automated the installation of Grafana.
  • Templates: We utilized Ansible templates to dynamically generate configuration files. For instance, we had Jinja2 templates for Prometheus and Grafana.
  • Handlers: Ansible handlers were used to manage services post-configuration changes. For example, when the Prometheus configuration was altered, a handler ensured that the Prometheus service was reloaded to apply the new settings.

Custom Scripts

I wrote Python and Bash scripts to automate various aspects of the infrastructure and monitoring setup. These scripts were designed to complement the Terraform and Ansible configurations, ensuring a seamless and automated workflow. Here's a detailed overview of each script and its purpose in the project:

1. update_inventory.py

import json
import boto3
import yaml

# Load vars.yml
with open('vars.yml') as file:
    vars_data = yaml.safe_load(file)
aws_region = vars_data['aws_region'] 

def get_instance_ip(instance_name):
    ec2 = boto3.client('ec2', region_name=aws_region)

    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:Name', 'Values': [instance_name]},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )

    for reservation in response['Reservations']:
        for instance in reservation['Instances']:
            ip_address = instance.get('PublicIpAddress')
            print(f"IP for {instance_name}: {ip_address}")
            return ip_address
    print(f"No running instance found for {instance_name}")
    return None

def update_inventory():
    grafana_ip = get_instance_ip('Grafana-Server')
    prometheus_ip = get_instance_ip('Prometheus-Server')

    inventory_content = f'''
all:
  children:
    grafana:
      hosts:
        {grafana_ip}:
    prometheus:
      hosts:
        {prometheus_ip}:
    '''

    with open('./inventory.yml', 'w') as file:
        file.write(inventory_content.strip())

    with open('./roles/grafana/templates/grafana.ini.j2', 'r') as file:
        lines = file.readlines()

    with open('./roles/grafana/templates/grafana.ini.j2', 'w') as file:
        for line in lines:
            if line.strip().startswith('domain'):
                file.write(f'domain = {grafana_ip}\n')
            else:
                file.write(line)

def update_env_file(grafana_ip, prometheus_ip):
    env_content = f'''
export GRAFANA_URL='https://{grafana_ip}:3000'
export GRAFANA_ADMIN_USER='admin'
export GRAFANA_ADMIN_PASSWORD='admin'
export PROMETHEUS_URL='https://{prometheus_ip}:9090'
'''
    with open('.env', 'w') as file:
        file.write(env_content.strip())

if __name__ == '__main__':
    update_inventory()
    grafana_ip = get_instance_ip('Grafana-Server')
    prometheus_ip = get_instance_ip('Prometheus-Server')
    update_env_file(grafana_ip, prometheus_ip)

  • Context: Managing dynamic IP addresses in a cloud environment can be challenging, especially when dealing with Ansible inventories.
  • Purpose: This script was designed to automatically update the Ansible inventory and the .env file with the latest IP addresses of the deployed AWS instances.
  • How it Works: It queries the AWS API to retrieve the current IP addresses of instances and updates the Ansible inventory file accordingly. This ensured that Ansible always had the correct IPs for configuration tasks, especially useful after infrastructure updates that might change instance IPs.

2. generate_grafana_api_key.py

import requests
import os
import boto3
import json
import yaml
from dotenv import load_dotenv
import subprocess
import time

load_dotenv()  # This loads the variables from .env into the environment

# Load vars.yml
with open('vars.yml') as file:
    vars_data = yaml.safe_load(file)
aws_region = vars_data['aws_region'] 

def get_terraform_output(output_name):
    command = f" cd ../terraform/ && terraform output -raw {output_name}"
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    stdout, stderr = process.communicate()

    if stderr:
        print("Error fetching Terraform output:", stderr.decode())
        return None

    return stdout.decode().strip()

# Function to generate Grafana API key
def generate_grafana_api_key(grafana_url, admin_user, admin_password):
    headers = {
        "Content-Type": "application/json",
    }
    timestamp = int(time.time())
    payload = {
        "name": f"terraform-api-key-{timestamp}",
        "role": "Admin"
    }
    response = requests.post(f"{grafana_url}/api/auth/keys", headers=headers, json=payload, auth=(admin_user, admin_password))
    if response.status_code == 200:
        print("API key generated successfully.")
        return response.json()['key']
    else:
        print(f"Response status code: {response.status_code}")
        print(f"Response body: {response.text}")
        raise Exception("Failed to generate Grafana API key")

# Function to update AWS Secrets Manager
def update_secret(secret_id, new_grafana_api_key):
    client = boto3.client('secretsmanager', region_name=aws_region)
    secret_dict = json.loads(client.get_secret_value(SecretId=secret_id)['SecretString'])
    secret_dict['grafana_api_key'] = new_grafana_api_key

    client.put_secret_value(SecretId=secret_id, SecretString=json.dumps(secret_dict))

    # Debugging step: Check if the secret is really updated on AWS
    updated_secret_dict = json.loads(client.get_secret_value(SecretId=secret_id)['SecretString'])
    if updated_secret_dict['grafana_api_key'] == new_grafana_api_key:
        print("Secret successfully updated on AWS.")
    else:
        print("Failed to update secret on AWS.")

if __name__ == "__main__":
    grafana_url = os.environ.get('GRAFANA_URL')
    admin_user = os.environ.get('GRAFANA_ADMIN_USER')
    admin_password = os.environ.get('GRAFANA_ADMIN_PASSWORD')
    secret_id = get_terraform_output("rds_secret_arn")  # From the terraform output

    api_key = generate_grafana_api_key(grafana_url, admin_user, admin_password)
    update_secret(secret_id, api_key)
  • Context: For integrating Grafana with external tools or automating dashboard configurations, an API key is required.
  • Purpose: This script automated the creation of a Grafana API key and saved it to AWS Secrets Manager.
  • How it Works: It interacted with Grafana's API to generate a new API key with the necessary permissions. This key was then used in subsequent scripts for dashboard setup and data source configuration, avoiding manual intervention in the Grafana UI.

3. add_grafana_dashboard.py

import requests
import boto3
import json
import os
import yaml
import subprocess
from dotenv import load_dotenv

load_dotenv()  # This loads the variables from .env into the environment

# Load vars.yml
with open('vars.yml') as file:
    vars_data = yaml.safe_load(file)
aws_region = vars_data['aws_region'] 

def get_terraform_output(output_name):
    command = f"cd ../terraform && terraform output -raw {output_name}"
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    stdout, stderr = process.communicate()

    if stderr:
        print("Error fetching Terraform output:", stderr.decode())
        return None

    return stdout.decode().strip()

def get_grafana_api_key(secret_id):
    client = boto3.client('secretsmanager', region_name=get_terraform_output("aws_region"))
    secret = json.loads(client.get_secret_value(SecretId=secret_id)['SecretString'])
    return secret['grafana_api_key']


def add_prometheus_data_source(grafana_url, api_key, prometheus_url):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    # Check if data source exists
    get_response = requests.get(f"{grafana_url}/api/datasources/name/Prometheus", headers=headers)
    
    if get_response.status_code == 200:
        # Data source exists, update it
        data_source_id = get_response.json()['id']
        data_source_config = get_response.json()
        data_source_config['url'] = prometheus_url
        update_response = requests.put(
            f"{grafana_url}/api/datasources/{data_source_id}",
            headers=headers,
            json=data_source_config
        )
        if update_response.status_code == 200:
            print("Prometheus data source updated successfully.")
        else:
            print(f"Failed to update Prometheus data source: {update_response.content}")
    else:
        # Data source does not exist, create it
        data_source_config = {
            "name": "Prometheus",
            "type": "prometheus",
            "access": "proxy",
            "url": prometheus_url,
            "isDefault": True
        }
        create_response = requests.post(f"{grafana_url}/api/datasources", headers=headers, json=data_source_config)
        if create_response.status_code == 200:
            print("New Prometheus data source added successfully.")
        else: 
            print(f"Failed to add as a new data source: {create_response.content}") 


def add_dashboard(grafana_url, api_key, dashboard_json):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    response = requests.post(f"{grafana_url}/api/dashboards/db", headers=headers, json=dashboard_json)
    if response.status_code == 200:
        print("Dashboard added successfully.")
    else:
        print(f"Failed to add dashboard: {response.content}")

if __name__ == "__main__":
    grafana_url = os.environ.get('GRAFANA_URL')
    secret_id = get_terraform_output("rds_secret_arn")  # From the terraform output
    dashboard_json = {
        "dashboard": {
            "id": None,
            "title": "Simple Prometheus Dashboard",
            "timezone": "browser",
            "panels": [
                {
                    "type": "graph",
                    "title": "Up Time Series",
                    "targets": [
                        {
                            "expr": "up",
                            "format": "time_series",
                            "intervalFactor": 2,
                            "refId": "A"
                        }
                    ],
                    "gridPos": {
                        "h": 9,
                        "w": 12,
                        "x": 0,
                        "y": 0
                    }
                }
            ]
        }
    }

    api_key = get_grafana_api_key(secret_id)
    prometheus_url = os.environ.get('PROMETHEUS_URL') 
    add_prometheus_data_source(grafana_url, api_key, prometheus_url)
    add_dashboard(grafana_url, api_key, dashboard_json)
  • Context: Setting up dashboards in Grafana can be time-consuming, especially when dealing with complex metrics.
  • Purpose: To automate the creation of predefined dashboards in Grafana.
  • How it Works: This script used Grafana’s API to create dashboards from JSON templates. It allowed for quick and consistent dashboard setup across different environments, ensuring that monitoring was always up-to-date with minimal manual effort.

4. wrapper-rds-k8s.sh

#!/bin/bash

cd ../terraform

SECRET_ARN=$(terraform output -raw rds_secret_arn)
REGION=$(terraform output -raw aws_region)

# Fetch secrets
DB_CREDENTIALS=$(aws secretsmanager get-secret-value --secret-id $SECRET_ARN --region $REGION --query 'SecretString' --output text)
DB_USERNAME=$(echo $DB_CREDENTIALS | jq -r .username)
DB_PASSWORD=$(echo $DB_CREDENTIALS | jq -r .password)
DB_ENDPOINT=$(terraform output -raw rds_instance_endpoint)
DB_NAME=$(terraform output -raw rds_db_name)

cd -
# Create Kubernetes secret manifest
cat <<EOF > db-credentials.yaml
apiVersion: v1
kind: Secret
metadata:
  name: fintech-db-secret
type: Opaque
data:
  username: $(echo -n $DB_USERNAME | base64)
  password: $(echo -n $DB_PASSWORD | base64)
EOF

# Create Kubernetes ConfigMap manifest for database configuration
cat <<EOF > db-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fintech-db-config
data:
  db_endpoint: $DB_ENDPOINT
  db_name: $DB_NAME
EOF

# Apply the Kubernetes manifests

  • Context: Kubernetes environments require precise configuration, especially when integrating with other services like Amazon RDS.
  • Purpose: This bash script was created to automate the configuration of Kubernetes resources and ensure seamless integration with the RDS database.
  • How it Works:
  • ConfigMap Updates: The script facilitated updating Kubernetes ConfigMaps with database connection details. This allowed our application running in the Kubernetes cluster to connect to the RDS instance with the correct credentials and endpoints.
  • Secrets Management: It also handled the creation and updating of Kubernetes Secrets. This was essential for securely storing sensitive information, such as database passwords, and making them accessible to the application pods.

Kubernetes

.
├── configmap.yaml
├── db-config.yaml
├── db-credentials.yaml
├── db-service.yaml
├── deployment.yaml
├── fintech-ingress.yaml
├── secret.yaml
├── service.yaml
└── wrapper-rds-k8s.sh
k8s folder

Kubernetes, or K8s, is a powerful system for automating the deployment, scaling, and management of containerized applications.

  • Creating the EKS Cluster: Using Amazon Elastic Kubernetes Service (EKS), I deployed a Kubernetes cluster. This cluster served as the backbone for managing and scaling the Docker containers.
  • Seamless Integration: The Docker containers of the Flask application were efficiently managed by Kubernetes, ensuring they were properly deployed and scaled based on the application's requirements.
  • Configuration Management: Kubernetes ConfigMaps and Secrets were used to manage configuration data and sensitive information, crucial for the Flask application to interact seamlessly with other components like the RDS database (a small sketch of how the app consumes these values follows below).
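
Here is a sketch of how the application side can consume those values; the exact environment variable names are assumptions, and in the cluster they would be injected from the fintech-db-config ConfigMap and the fintech-db-secret Secret created by the wrapper script:

import os

# Sketch: build the database URI from environment variables that the
# Deployment would inject from fintech-db-config and fintech-db-secret
# (the exact variable names are assumptions).
def build_database_uri():
    username = os.environ["DB_USERNAME"]   # from the fintech-db-secret Secret
    password = os.environ["DB_PASSWORD"]   # from the fintech-db-secret Secret
    endpoint = os.environ["DB_ENDPOINT"]   # from the fintech-db-config ConfigMap
    db_name = os.environ["DB_NAME"]        # from the fintech-db-config ConfigMap
    return f"postgresql://{username}:{password}@{endpoint}/{db_name}"

if __name__ == "__main__":
    print(build_database_uri())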

Advantages of Kubernetes in the Project

  • Scalability: Kubernetes excelled in scaling the application effortlessly. It allowed the Flask app to handle varying loads by adjusting the number of running containers.
  • Self-healing: Kubernetes’ self-healing capabilities automatically restarted failed containers, ensuring high availability.
  • Load Balancing: Kubernetes provided efficient load balancing, distributing network traffic to ensure stable application performance.
  • Easy Updates and Rollbacks: Kubernetes deployments also simplified updating the application with zero downtime and facilitated easy rollbacks to previous versions if needed.

Process of Transformation

Kubernetes Objects Creation: We created various Kubernetes objects, each serving a specific role in the application deployment:

  • ConfigMap (configmap.yaml): Used to store non-confidential configuration data, like database connection strings.
  • Secrets (db-credentials.yaml, secret.yaml): Managed sensitive data like database passwords, ensuring they're stored securely and accessible only to the relevant components.
  • Deployment (deployment.yaml): Defined the desired state of the application, including the Docker image to use, the number of replicas, and other specifications.
  • Service (service.yaml): Provided a stable interface to access the application pods.
  • Ingress (fintech-ingress.yaml): Managed external access to the application, routing traffic to the appropriate services.
  1. Database Integration: We used separate Kubernetes objects (db-config.yaml, db-service.yaml) to manage the database configuration and service, ensuring a decoupled architecture where the application and database are managed independently.
  2. Implementing Ingress: The fintech-ingress.yaml file defines the rules for external access to our application, including URL routing and SSL termination.

To access the app on AWS, get the address of the ingress by running kubectl get ingress

Best Practices Implemented

  • Immutable Infrastructure: We adopted the practice of immutable infrastructure by defining every aspect of our application in code, including the Kubernetes configurations. This approach reduces inconsistencies and potential errors during deployments.
  • Declarative Configuration: Kubernetes objects were defined using YAML files, making our setup declarative. This method ensures that the environment is reproducible and version-controlled, which aligns with Infrastructure as Code (IaC) principles.

By leveraging Kubernetes alongside Terraform and Ansible, I was able to create a highly efficient, automated, and error-resistant environment. This not only saved time but also enhanced the reliability and consistency of the infrastructure and monitoring setup.

V - GitHub CI/CD: A Detailed Overview


This pipeline was designed to automate the build, test, deploy, and monitoring processes, ensuring a smooth and efficient workflow. Here's an in-depth look at the choices I made:

Preparation Stage

  # ---- Preparation Stage ----
  preparation:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1
  • Git Checkout: We started with checking out the code from the GitHub repository. This step ensures that the most recent version of the code is used in the pipeline.
  • Docker Buildx Setup: Docker Buildx is an extended build tool for Docker, providing us with the ability to create multi-architecture builds. This was critical for ensuring our Docker images could run on various platforms.
  • AWS Credentials Configuration: We securely configured AWS credentials using GitHub secrets. This step is crucial for allowing the pipeline to interact with AWS services like ECR and EKS.

Terraform Provisioning Stage

# ---- Terraform Provisioning Stage ----
  terraform-provisioning:
    needs: preparation
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: AWS Configure Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Terraform Init and Apply
        run: |
          cd ./terraform
          terraform init
          terraform apply -auto-approve
  • Infrastructure as Code: We used Terraform, to manage and provision our cloud infrastructure. Terraform's declarative configuration files allowed us to automate the setup of AWS resources like EKS clusters, RDS instances, and VPCs.
  • Init and Apply: Terraform's init and apply commands were used to initialize the working directory containing Terraform configurations and to apply the changes required to reach the desired state of the configuration.

Build Stage

# ---- Build Stage ----
  build:
    needs: terraform-provisioning
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: AWS Configure Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1

      - name: Build Docker Image
        run: docker build -t fintech-app-repo:${{ github.sha }} ./src
      - name: Save Docker Image
        run: |
          docker save fintech-app-repo:${{ github.sha }} > fintech-app.tar
      - name: Upload Docker Image Artifact
        uses: actions/upload-artifact@v4
        with:
          name: docker-image
          path: fintech-app.tar
  • Docker Image Building: We built Docker images for our application, tagging them with the specific GitHub SHA - a unique identifier for each commit.
  • Artifact Uploading: The built Docker images were saved and uploaded as artifacts, which could be used in subsequent stages of the pipeline.

Publish Stage

# ---- Publish Stage ----
  publish:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: AWS Configure Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1
      - name: Login to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v2
      - uses: actions/download-artifact@v4
        with:
          name: docker-image
          path: .
      - name: Load Docker Image
        run: docker load < fintech-app.tar
      - uses: aws-actions/amazon-ecr-login@v2
      - name: Push Docker Image to Amazon ECR
        run: |
          docker tag fintech-app-repo:${{ github.sha }} ${{ secrets.ECR_REGISTRY }}:${{ github.sha }}
          docker push ${{ secrets.ECR_REGISTRY }}:${{ github.sha }}
  • ECR Login and Image Push: We logged into AWS Elastic Container Registry (ECR) and pushed our Docker images. This ensured our images were stored in a secure, scalable, and managed AWS Docker container registry.

Deployment Stage

# ---- Deployment Stage ----
  deployment:
    needs: publish
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: AWS Configure Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1
      - name: Retrieve and Set up Kubernetes Config
        run: |
          cd ./terraform
          terraform init
          eval "$(terraform output -raw configure_kubectl)"

      - name: Install eksctl
        run: |
          ARCH=amd64
          PLATFORM=$(uname -s)_$ARCH
          curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
          tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp
          sudo mv /tmp/eksctl /usr/local/bin

      - name: Check and Add IAM User to EKS Cluster
        env:
          CLUSTER_NAME: fintech-eks-cluster # Replace with your actual cluster name
          USER_ARN: ${{ secrets.USER_ARN }}
        run: |
          # Check if the user is already mapped to the EKS cluster
          if eksctl get iamidentitymapping --cluster "$CLUSTER_NAME" --arn "$USER_ARN" | grep -q "$USER_ARN"; then
            echo "User ARN $USER_ARN is already mapped to the EKS cluster"
          else
            # Add the user to the EKS cluster
            eksctl create iamidentitymapping --cluster "$CLUSTER_NAME" --arn "$USER_ARN" --username wsl2 --group system:masters
            echo "User ARN $USER_ARN added to the EKS cluster"
          fi

      - name: run k8s script
        run: |
          cd ./k8s/
          chmod +x ./wrapper-rds-k8s.sh
          ./wrapper-rds-k8s.sh

      - name: Update Kubernetes Deployment Image Tag
        run: |
          sed -i "s|image:.*|image: ${{ secrets.ECR_REGISTRY }}:${{ github.sha }}|" ./k8s/deployment.yaml

      - name: Apply Kubernetes Ingress
        run: |
          kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.2/deploy/static/provider/aws/deploy.yaml
          sleep 25
      - name: Apply Kubernetes Manifests
        run: |
          kubectl apply -f ./k8s/
          sleep 30

      - name: Check Pods Status
        run: kubectl get pods -o wide

      - name: Get Ingress Address
        run: kubectl get ingress -o wide
  • Kubernetes Configuration: We retrieved and set up Kubernetes configuration using outputs from Terraform. This allowed us to interact with our EKS cluster.
  • IAM Identity Mapping: We checked and added IAM users to the EKS cluster using eksctl, enhancing our cluster's access and security management (I created a specific IAM user for the CI/CD pipeline, but I also wanted to interact with the cluster from my local machine, which uses another IAM user).
  • Kubernetes Script Execution: Executed a custom bash script (wrapper-rds-k8s.sh) to manage Kubernetes resources and settings, showcasing our ability to automate complex Kubernetes tasks.
  • Image Update and Manifests Application: Updated the Kubernetes deployment image tags and applied various Kubernetes manifests, including deployment and service configurations.

Monitoring Setup Stage

 # ---- Monitoring Setup Stage ----
  monitoring-setup:
    needs: deployment
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4
      - name: AWS Configure Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1
      - name: install with pip the following packages
        run: |
          pip3 install boto3
          pip3 install requests
          pip3 install python-dotenv

      - name: Update Inventory with Latest IP Addresses
        run: |
          cd ./terraform
          terraform init
          cd ../ansible
          python3 update_inventory.py

      - name: Create PEM Key File
        run: |
          cd ./ansible
          echo -e "${{ secrets.PEM_KEY }}" > ../terraform/fintech-monitoring.pem
          chmod 400 ../terraform/fintech-monitoring.pem

      - name: ansible playbook for the monitoring
        run: |
          cd ./ansible
          ansible-playbook playbook.yml -vv

      - name: Generate Grafana API Key and Update AWS Secret
        run: |
          cd ./ansible
          python3 generate_grafana_api_key.py

      - name: Add Dashboard to Grafana
        run: |
          cd ./ansible
          python3 add_grafana_dashboard.py
  • Ansible Playbook Execution: Used Ansible to configure Prometheus and Grafana on our AWS infrastructure. This automated the setup and ensured consistent configuration across environments.
  • Python Scripts for Grafana: Automated Grafana dashboard creation and API key management using custom Python scripts (a minimal sketch of the API key script is shown below), demonstrating how different technologies can be combined into a comprehensive monitoring solution.
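The actual generate_grafana_api_key.py isn't reproduced in this post, so here is only a rough sketch of what such a script might look like, assuming Grafana is reachable over HTTP with admin credentials and that the key should be written into an existing Secrets Manager secret; the URL, secret name, and key name are placeholder assumptions, not the project's real values.

import json
import os

import boto3
import requests

GRAFANA_URL = os.environ.get("GRAFANA_URL", "http://grafana.example.com:3000")   # placeholder endpoint
SECRET_NAME = os.environ.get("GRAFANA_SECRET_NAME", "fintech/grafana-api-key")   # placeholder secret name

def create_api_key(admin_user, admin_password):
    # Ask Grafana for a new admin-level API key through its HTTP API.
    resp = requests.post(
        f"{GRAFANA_URL}/api/auth/keys",
        json={"name": "ci-cd-pipeline", "role": "Admin"},
        auth=(admin_user, admin_password),
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["key"]

def store_in_secrets_manager(api_key):
    # Overwrite the secret value so later steps (like the dashboard script) can read the key.
    client = boto3.client("secretsmanager", region_name="eu-central-1")
    client.put_secret_value(SecretId=SECRET_NAME, SecretString=json.dumps({"api_key": api_key}))

if __name__ == "__main__":
    key = create_api_key(os.environ["GRAFANA_ADMIN_USER"], os.environ["GRAFANA_ADMIN_PASSWORD"])
    store_in_secrets_manager(key)

A companion script like add_grafana_dashboard.py could then read the same secret and push a dashboard JSON to Grafana's /api/dashboards/db endpoint.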

Best Practices and Key Takeaways

  • Securing Secrets: We used GitHub secrets to manage sensitive information, ensuring that credentials were not hardcoded or exposed in the pipeline.
  • Modular Approach: By structuring our CI/CD pipeline into distinct stages, we achieved clarity and better control over each phase of the deployment process.
  • Infrastructure as Code: Leveraging Terraform for provisioning allowed us to maintain a consistent and reproducible infrastructure setup, reducing manual errors and improving efficiency.
  • Containerization and Registry Management: Using Docker and ECR ensured that our application was packaged consistently and stored securely, facilitating smoother deployments.
  • Automated Monitoring Setup: The integration of Ansible and custom Python scripts for setting up Prometheus and Grafana streamlined our monitoring setup, illustrating our focus on automation and reliability.

VI - Lessons Learned and Growth

As I conclude this project, it's important to reflect on the lessons learned and the personal and professional growth that came with it.

Understanding the Full Spectrum: From coding in Python to orchestrating containers with Kubernetes, every step was a puzzle piece, contributing to a bigger picture.

The Power of Automation: One of the key takeaways from this experience is the incredible power and efficiency of automation. Whether it was using Terraform for infrastructure setup, Ansible for configuration, or GitHub Actions for CI/CD, automating repetitive and complex tasks not only saved time but also reduced the scope for errors.

Collaboration and Community: The role of community resources and collaboration was invaluable, whether it was seeking help from online forums, GitHub repositories, or friends directly.


]]>
<![CDATA[How to create a fully scalable, highly available, multi-server web application on AWS with Terraform]]>https://thelearningjourney.co/how-to-create-a-fully-scalable-highly-available-multi-server-web-application-on-aws-2/6428a5324cdf85000161cf2bSun, 25 Apr 2021 21:05:00 GMT

Check the code on GITHUB --> https://github.com/bobocuillere/HA-web-application-workshop-AWS

Nowadays, using Terraform or any Infrastructure as Code tool is vital, especially in the cloud computing world where everything is "disposable": you can create an entire infrastructure in minutes and make it available globally, instead of waiting weeks as before.

Terraform comes into play by allowing you to rapidly provision and manage infrastructure.
This blog post transforms the entire "AWS Highly Available Web Application" workshop into Terraform. You can find the workshop by following this link:
(https://ha-webapp.workshop.aws/introduction/overview.html).



Why? Because I don't want to spend four hours each time I want to build it, and more importantly, it provides versioning, meaning I can quickly review, change, or even destroy the infrastructure.

1. Architecture

Prerequisites:

  • AWS Account
  • Terraform 0.12 or later (I'm using 0.14.9)
Reference architecture (source: https://docs.aws.amazon.com/whitepapers/latest/best-practices-wordpress/reference-architecture.html)

So let's start by explaining the architecture!

The workshop doesn't use Route 53 or a CDN.

We will start with number 3.

  • 3) The Internet Gateway allows communication between the internet and the components (instances) in your VPC.
  • 4) The Application Load Balancer distributes traffic across the EC2 instances of the Auto Scaling group in the different AZs.
  • 5) NAT Gateways give the instances in the Application subnets a way to reach the internet, but not the other way around.
  • 6) WordPress is installed and runs in an Auto Scaling group.
  • 7) ElastiCache (Memcached) adds a caching layer for frequently accessed data.
  • 8) RDS Aurora, a serverless database on AWS.
  • 9) EFS, the file sharing system on AWS.

The goal of the workshop is to build a highly available web application using regional services and multiple Availability Zones.

We have six subnets in total (three in each of the two AZs).

  • Public subnets: The only subnets directly connected to the internet; a bastion host could be placed here.
  • Application subnets: This is where the WordPress instances will be.
  • Data subnets: This is where the database is located.

Every part is redundant and resilient to failure.

2. Terraform

Find the Terraform code by following this link: GitHub for the project

So how does the code work?

We have 11 files in total:

  • 7 of them, one for each lab of the workshop
  • VARS and a TFVARS file
  • An output file
  • And the template file for the bash script.

Start by changing your AWS_PROFILE name in the TFVARS.

Your profile name is found at %USERPROFILE%\.aws\credentials (Windows) and ~/.aws/credentials (Linux & Mac).


The variable named "linux_ami" is a RHEL 8 image.

The AMI has to be changed if you use a region other than us-east-1.
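If you deploy to another region and need to find an equivalent RHEL 8 AMI there, a quick lookup like the sketch below can help. This is only an illustration; the owner ID and name filter reflect how Red Hat publishes its official images and are assumptions on my part, not values taken from the project.

import boto3

# Look up recent official RHEL 8 AMIs in a given region (309956199498 is Red Hat's owner ID).
ec2 = boto3.client("ec2", region_name="eu-west-1")  # change to your target region
images = ec2.describe_images(
    Owners=["309956199498"],
    Filters=[{"Name": "name", "Values": ["RHEL-8*"]}],
)["Images"]

# Print the newest images first so you can copy an ID into the linux_ami variable.
for image in sorted(images, key=lambda i: i["CreationDate"], reverse=True)[:5]:
    print(image["ImageId"], image["Name"])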


You can change any other variable to suit your needs, but let's keep it vanilla, as AWS set it up in the workshop.

Everything is set up to launch terraform.

Open a PowerShell shell and type:

terraform init

and

terraform apply

The entire workshop will be created (resources, security groups, configurations, etc.).

When it's finished, click on the LoadBalancer_DNS_OUTPUT value to access the website.


I have modified the script in the workshop because it didn't work (many dependencies and packages failed), so it's just the default Apache homepage.

You will have to change the bash script if you want more than that; my goal was simply to transform the workshop into Terraform.

]]>
<![CDATA[How to federate your on premise users to AWS using ADFS and SAML 2.0 PART 2]]>
https://thelearningjourney.co/how-to-federate-your-on-premise-users-to-aws-using-adfs-and-saml-2-0-part-2-2/6428a5324cdf85000161cf2aFri, 08 Jan 2021 19:12:00 GMT

PART 1 is available by clicking HERE.

Quick recap: An Active Directory with ADFS was configured.

We created a user named Jean and added him to two previously created groups.


1. Configuring AWS

Now it's time to configure AWS.

  • Connect to the AWS Dashboard --> go to IAM --> select Identity Providers
  • Choose SAML, enter a provider name, and upload the XML metadata file. Get the file from your ADFS server at https://<yourservername>/FederationMetadata/2007-06/FederationMetadata.xml.

Once finished, create two IAM roles, choosing the SAML option and the identity provider (SAML provider) we made earlier.

We shall call our roles AWS-PROD-ADMIN and AWS-PROD-DEV.

Putting AWS before the name is for a good reason I'll explain later (see error #3 in the Bugs and Errors section below).

These names should look familiar because they are the same as our two AD groups created at the beginning (PART 1).

IAM wizard for SAML roles
Summary: In layman's terms, we authorized our ADFS (the identity provider here) to communicate with and be used in AWS, and we created two roles from which our future users will get their permissions; these roles are linked to the on-premise AD groups.
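The post does all of this through the console, but for illustration, a minimal boto3 sketch of the same two steps (registering the identity provider and creating one of the SAML roles) could look like the following; the metadata file path is a placeholder, and the trust policy shown is the standard one the console wizard generates, not copied from this setup.

import json

import boto3

iam = boto3.client("iam")

# 1) Register ADFS as a SAML identity provider from its FederationMetadata.xml (placeholder path).
with open("FederationMetadata.xml") as f:
    provider = iam.create_saml_provider(SAMLMetadataDocument=f.read(), Name="Federation-Demo")
provider_arn = provider["SAMLProviderArn"]

# 2) Create one of the roles with a SAML trust policy equivalent to choosing "SAML" in the console wizard.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": provider_arn},
        "Action": "sts:AssumeRoleWithSAML",
        "Condition": {"StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}},
    }],
}
iam.create_role(RoleName="AWS-PROD-ADMIN", AssumeRolePolicyDocument=json.dumps(trust_policy))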

2. Configuring AWS in ADFS as a Trusted Relying Party

AWS now trusts this identity provider (ADFS); we also have to do it the other way around, meaning ADFS needs to trust AWS.

  • From the ADFS console, right-click Relying Party Trusts, add a new one, and click Next.
  • Add https://signin.aws.amazon.com/static/saml-metadata.xml like below.
  • Set the display name and click Next.
  • For testing purposes, we will permit all users. We can configure MFA authentication if needed.
  • After this, breeze through to the end.

AWS is now a trusted relying party in ADFS.

3. Configuring the Claim Rules for AWS

Claim rules provide elements that AWS needs (NameId, RoleSessionName, and Roles) and that ADFS doesn't send by default.

  • Right-click on the relying party trust just created, then select Edit Claim Issuance Policy

Adding NameId

  • Click Add Rule below the dialog box.
  • Select Transform an Incoming Claim, then Next.
  • Add the following settings
  • Click Finish.

Adding RoleSessionName

  • Click Add Rule
  • Select Send LDAP Attributes as Claims
  • Add the following settings:

Claim rule name: RoleSessionName

Attribute store: Active Directory

LDAP Attribute: E-Mail-Addresses

Outgoing Claim Type: https://aws.amazon.com/SAML/Attributes/RoleSessionName

  • Click Finish.

Now we only need to add the Role attributes to finish, but first let me explain what will happen.

The next two claim rules are custom made. The first one retrieves all the groups of the authenticated user (when they authenticate on the ADFS page), while the second transforms those group names into the Role claim by matching them against the role names.

Hope it's clear!

Adding Role Attributes

  • Click Add Rule
  • Select Send Claims Using a Custom Rule and click Next.
  • For the name type Get AD Groups and in Custom Rule, enter the following:
c:[Type == "https://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"]
 => add(store = "Active Directory", types = ("https://temp/variable"), query = ";tokenGroups;{0}", param = c.Value);
  • Click Finish.
  • Click Add Rule again
  • Repeat the same steps, but the claim rule name is Roles and the custom rule is:
c:[Type == "https://temp/variable", Value =~ "(?i)^AWS-"] => issue(Type = "https://aws.amazon.com/SAML/Attributes/Role", Value = RegExReplace(c.Value, "AWS-", "arn:aws:iam::123456789012:saml-provider/Federation-Demo,arn:aws:iam::123456789012:role/"));

Make sure you add the correct ARN for the identity provider (green) which is here:


Same for the role's ARN (yellow)


4. Testing

  • Go to this address on your ADFS server: https://localhost/adfs/ls/IdpInitiatedSignOn.aspx. If you're getting an error while trying to access this page, you might have the IdP-initiated sign-on option deactivated.

Open a PowerShell shell as Administrator and enter:

Get-AdfsProperties

Check that the option EnableIdpInitiatedSignOnPage is True.


To set it to True:

Set-AdfsProperties -EnableIdpInitiatedSignonPage $true

Once done, you can restart the service:

Restart-Service adfssrv
  • Select Sign in to one of the following sites (second option):
Some pages are in French since I'm French :).
  • You'll be prompted to enter an account (Jean's account created earlier)
Source is my domain's name.
  • Select the role you want to sign in with (if you're mapped to only one role, you'll skip this page and go directly to AWS).
  • Congratulations! You can now see the role and the user you are signed in with :)

5. Bugs and Errors

Error #1: Specified provider doesn't exist.

This error is due to a faulty provider ARN in the issuance claim rules.


Solution: Check if everything is correctly entered for the Identity Provider we created on AWS at the beginning.


Error #2: Error: RoleSessionName is required in AuthnResponse

This error occurs when the AD user is missing some necessary attributes, which are:

  • distinguishedName
  • mail
  • sAMAccountName
  • userPrincipalName

Solution: Check that each of these attributes is correctly filled.

Some of these attributes require you to activate the Advanced Features view in Active Directory.


Error #3: Not authorized to perform sts:AssumeRoleWithSAML


This is a name-matching error between the AD group's name and the AWS role's name you're trying to connect with.

Remember what I told you in the Configuring AWS section about why we put AWS before the name? This is it!

The roles we created in AWS are the following, and we're interested in the highlighted role below:


I purposely created an AD group without the AWS prefix (PROD-TEST) at the beginning to show you what happens.


The two names don't match.

Solution: Make them match, either by changing the name of your AD group to add the AWS prefix (example: AWS-845125856994-AWS-PROD-TEST) or by adding the AWS prefix in the Roles issuance claim here:


I'll be glad to answer any questions or help if needed; feel free to contact me.

]]>
<![CDATA[How to federate your on premise users to AWS using ADFS and SAML 2.0 PART 1]]>https://thelearningjourney.co/how-to-federate-your-on-premise-users-to-aws-using-adfs-and-saml-2-0-part-1-2/6428a5324cdf85000161cf29Sun, 03 Jan 2021 00:17:00 GMT

1. Introduction


Most enterprises using the cloud want to federate their existing user base, meaning creating an SSO (Single Sign-On) environment that authorizes, with specific rights, what a person can do in the AWS cloud.

This post describes how to set up enterprise federation by integrating ADFS with AWS.

Prerequisites:

  • Windows Server 2019 (2016 should be fine)
  • AWS account (free tier is fine)

2. How it works

ADFS Integration with AWS - ARCHITECTURE

1 - The user connects to their ADFS portal.

2 - ADFS checks the user's access and authenticates them against AD.

3 - A response is received as a SAML assertion with group membership information.

4 - The ARNs are dynamically built from the AD group membership information for the IAM roles, while the user attributes (distinguishedName, mail, sAMAccountName) are used for the AWS account IDs.

Finally, ADFS sends a signed assertion to AWS STS.

5 - Temporary credentials are issued by STS AssumeRoleWithSAML.

6 - The user is authenticated and given access to the AWS Management Console.
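To make steps 5 and 6 more concrete, here is a minimal sketch of the STS call that the SAML assertion ultimately drives. When you federate into the console, the AWS sign-in endpoint performs this call for you; with the CLI or an SDK you can make it yourself. The ARNs, file path, and session duration below are placeholders, not values from this setup.

import boto3

# The base64-encoded SAML response posted by ADFS, saved to a file for this example (placeholder path).
with open("saml_response.b64") as f:
    saml_assertion = f.read().strip()

# AssumeRoleWithSAML is an unsigned call: no AWS credentials are needed, only a valid assertion.
sts = boto3.client("sts")
response = sts.assume_role_with_saml(
    RoleArn="arn:aws:iam::123456789012:role/AWS-PROD-ADMIN",                  # placeholder role ARN
    PrincipalArn="arn:aws:iam::123456789012:saml-provider/Federation-Demo",   # placeholder provider ARN
    SAMLAssertion=saml_assertion,
    DurationSeconds=3600,
)

creds = response["Credentials"]
print(creds["AccessKeyId"], creds["SecretAccessKey"], creds["SessionToken"])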

3. Configuring Active Directory

Before configuring ADFS, you'll need a working Active Directory.

  • Create two AD Groups named exactly AWS-accountId-AWS-PROD-ADMIN & AWS-accountId-AWS-PROD-DEV (uppercase and lowercase are important). Your account ID is found on your AWS dashboard.
  • Create a user named Jean with an email address.
  • Add Jean in the two Groups created above.
  • Create another user named ADFSAWS; this is a service account used by the ADFS service.

4. Installing and Configuring ADFS

Before installing our ADFS role, we created a user named Jean and added him to two groups.

After installing the role, configuring and setting up the environment is easy if you keep the default settings.

Launch the ADFS management page by searching AD FS.

Search on Windows

Click to configure the wizard:

  • Select Create a new Federation (default setting).
  • Select New federation server farm (default setting).
  • In the Specify Service Properties step, I'm using a self-signed SSL certificate generated with IIS for demo purposes.
  • ADFSAWS is the service account created earlier.
Specify Service Account
  • Select the first option Create a database on this server
  • A review of our choices.
Review options

If you encounter the error below, it means we have to register a Service Principal Name (SPN) for the service account created earlier:

SPN error
setspn -a host/localhost adfsaws
Note: ADFSAWS is the service account created earlier.

If the command succeeds, you should see something like that:

Result

Configuring AWS will be covered in PART 2 --> CLICK HERE

]]>