
Chapter 5: Code Analysis (AI Understanding)#

Welcome back! In Chapter 4: Code Fetching, we successfully gathered all the relevant code files from your chosen repository or local directory. Now, we have a collection of text files containing the raw code. But this is just like having a stack of blueprints without knowing what building they describe or how the different parts fit together.

To create a helpful tutorial, we need to understand the codebase's structure, purpose, and key concepts. This is where Code Analysis, powered by AI Understanding, comes in.

The Problem: Making Sense of Code#

Imagine looking at a large, complex piece of software for the first time. It's thousands (or millions!) of lines of code spread across hundreds of files. It's like trying to understand a whole city just by looking at a map of every single pipe and wire. You need a higher-level view.

You need to identify:

  1. The Main Ideas (Abstractions): What are the core components, modules, classes, or key pieces of logic that make this project work? (Like identifying the major buildings, districts, or power grid in a city).
  2. How They Connect (Relationships): How do these main ideas interact? Which parts use other parts? What's the flow of data or control between them? (Like understanding how traffic flows between districts, how buildings connect to the power grid, or how water gets to each house).

Doing this manually for a large project is difficult and time-consuming, even for experienced developers! For beginners, it's nearly impossible.

The Solution: AI Understanding the Code#

Our project uses a powerful tool – a Large Language Model (LLM), which is a type of AI – to act like a super-fast, super-smart code analyst. This AI reads the raw code content we fetched and processes it to figure out the Abstractions and Relationships.

Think of the AI as an expert who can quickly read through all the blueprints (the code) and tell you:

  • "Okay, these files seem to be about the 'User Management' system (that's an Abstraction)."
  • "And these files look like the 'Database Interface' (another Abstraction)."
  • "It looks like the 'User Management' system frequently 'Uses' the 'Database Interface' to save user data (that's a Relationship)."

This structured understanding is the foundation upon which the entire tutorial is built. It helps us decide what needs explaining and in what order.

How Our Project Analyzes Code#

In our Workflow Engine (Pocket Flow) from Chapter 3, the Code Analysis step is primarily handled by two Nodes: IdentifyAbstractions and AnalyzeRelationships. These nodes run in sequence: the second builds on the results of the first, and both draw on the original fetched code.

Let's break down what happens:

Step 1: Identify Abstractions#

This is where the AI reads the code and tries to find the big-picture components.

  • Node: IdentifyAbstractions
  • Input: The dictionary of fetched files (shared["files"]) containing file paths and their content.
  • Process:

    • The prep method takes the list of files from shared. It formats the code content into a large text block that can be sent to the AI. It also creates a simple listing of the files with their index number (which is important for the AI to reference files later).
    • The exec method takes this formatted text and listing and sends it to the AI (call_llm function, which we'll cover in Chapter 6) with a specific prompt. The prompt asks the AI to identify the 5-10 most important concepts or components (Abstractions), give them a name and beginner-friendly description (using an analogy), and list which files are most relevant to each abstraction (using the file index numbers from the listing).
    • The AI responds, and the exec method then validates this response. The prompt asks for the output in a specific format (like YAML). The code checks that the response is valid YAML, that it contains the expected fields (name, description, file_indices), and that the file indices are valid numbers within the range of files provided. If validation fails, the node can retry the AI call (as configured). An illustrative example of this expected output and validation appears just after this list.
    • The post method takes the validated list of abstractions and stores it in the shared dictionary under the key "abstractions". Each item in this list is a dictionary like {"name": "...", "description": "...", "files": [index1, index2, ...]}.
  • Output: A list of identified abstractions, each with a name, description, and a list of indices pointing to relevant files. Stored in shared["abstractions"].
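
To make the expected output more concrete, here is a small, hypothetical sketch of the kind of YAML the prompt asks for and the index check the node performs. The abstraction names and the validate_abstractions helper below are illustrative assumptions, not the project's actual prompt output or validation code.

# Illustrative sketch only -- hypothetical YAML response and index validation
import yaml

example_response = """
- name: User Management
  description: Handles user accounts, like a front desk that signs people in.
  file_indices: [0, 2]
- name: Database Interface
  description: Stores and retrieves data, like the app's filing cabinet.
  file_indices: [1]
"""

def validate_abstractions(raw_yaml, file_count):
    abstractions = yaml.safe_load(raw_yaml)  # Parse the AI's YAML text
    validated = []
    for item in abstractions:
        # Every abstraction needs a name, a description, and file indices
        assert "name" in item and "description" in item and "file_indices" in item
        indices = [int(i) for i in item["file_indices"]]
        # Indices must point at files that actually exist in shared["files"]
        assert all(0 <= i < file_count for i in indices)
        validated.append({"name": item["name"], "description": item["description"], "files": indices})
    return validated

print(validate_abstractions(example_response, file_count=3))

A failed check here corresponds to the validation-failure path in the node, which can trigger a retry of the call_llm request.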

sequenceDiagram
    participant Shared as Shared Store
    participant IdentifyNode as IdentifyAbstractions Node
    participant LLM_Util as call_llm Utility
    participant LLM as Large Language Model (AI)

    IdentifyNode->>Shared: Read "files"
    IdentifyNode->>IdentifyNode: prep(shared) <br> Format code & file list
    IdentifyNode->>LLM_Util: exec(...) <br> Call call_llm(prompt)
    LLM_Util->>LLM: Send Prompt (code, file list, instructions)
    LLM-->>LLM_Util: Respond with YAML (abstractions)
    LLM_Util-->>IdentifyNode: Return LLM response
    IdentifyNode->>IdentifyNode: Validate LLM response
    alt If validation fails (or other error)
        IdentifyNode->>LLM_Util: Retry call_llm (if configured)
    end
    IdentifyNode->>Shared: post(shared) <br> Write "abstractions"
Simplified flow showing how the IdentifyAbstractions node prepares data, calls the AI via call_llm, validates the response, and saves the results.

Step 2: Analyze Relationships#

With the main concepts identified, the AI now figures out how they connect.

  • Node: AnalyzeRelationships
  • Input: The list of abstractions (shared["abstractions"]) and the original fetched files (shared["files"]).
  • Process:

    • The prep method reads the abstractions (with their names, descriptions, and relevant file indices) and the original files from shared. It creates a new formatted context for the AI. This context includes the list of identified abstractions and snippets of code from the relevant files linked to those abstractions. This helps the AI see how the abstractions are implemented and interact in the actual code.
    • The exec method takes this context and sends it to the AI (call_llm) with a prompt. This prompt asks for a high-level summary of the entire project and a list of key relationships between the identified abstractions. For each relationship, it asks for the source abstraction (by index and name), the target abstraction (by index and name), and a brief label describing the interaction (like "Uses" or "Manages").
    • The AI responds, and the exec method validates this response. It checks for the correct format (YAML), the presence of a summary and relationships list, and the structure of each relationship entry (valid indices, a label). It also includes a check to make sure every identified abstraction appears in at least one relationship, forcing the AI to provide a complete overview of how everything fits together. Retries are configured here too. An illustrative sketch of this check appears just after this list.
    • The post method takes the validated project summary and list of relationships and stores it in the shared dictionary under the key "relationships". The structure is typically like {"summary": "...", "details": [{"from": index1, "to": index2, "label": "..."}, ...]}.
  • Output: A high-level project summary and a list of relationships between abstractions. Stored in shared["relationships"].
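
As with the previous step, here is a small, hypothetical sketch of the relationships YAML and the completeness check described above. The field names, summary text, and check_relationships helper are illustrative assumptions, not the project's actual code.

# Illustrative sketch only -- hypothetical relationships YAML and completeness check
import yaml

example_response = """
summary: A small web app that manages user accounts and stores them in a database.
relationships:
  - from_abstraction: 0  # User Management
    to_abstraction: 1    # Database Interface
    label: Uses
"""

def check_relationships(raw_yaml, num_abstractions):
    data = yaml.safe_load(raw_yaml)
    assert "summary" in data and "relationships" in data
    seen = set()
    details = []
    for rel in data["relationships"]:
        src, dst = int(rel["from_abstraction"]), int(rel["to_abstraction"])
        # Both ends must refer to abstractions identified in Step 1
        assert 0 <= src < num_abstractions and 0 <= dst < num_abstractions
        assert rel["label"]  # Every relationship needs a short label
        seen.update([src, dst])
        details.append({"from": src, "to": dst, "label": rel["label"]})
    # Every abstraction must appear in at least one relationship
    assert seen == set(range(num_abstractions))
    return {"summary": data["summary"], "details": details}

print(check_relationships(example_response, num_abstractions=2))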

sequenceDiagram
    participant Shared as Shared Store
    participant AnalyzeNode as AnalyzeRelationships Node
    participant LLM_Util as call_llm Utility
    participant LLM as Large Language Model (AI)

    AnalyzeNode->>Shared: Read "abstractions" <br> Read "files"
    AnalyzeNode->>AnalyzeNode: prep(shared) <br> Format abstractions <br> & relevant code snippets
    AnalyzeNode->>LLM_Util: exec(...) <br> Call call_llm(prompt)
    LLM_Util->>LLM: Send Prompt (abstractions, code snippets, instructions)
    LLM-->>LLM_Util: Respond with YAML (summary, relationships)
    LLM_Util-->>AnalyzeNode: Return LLM response
    AnalyzeNode->>AnalyzeNode: Validate LLM response
    alt If validation fails (or other error)
        AnalyzeNode->>LLM_Util: Retry call_llm (if configured)
    end
    AnalyzeNode->>Shared: post(shared) <br> Write "relationships"
Simplified flow showing how the AnalyzeRelationships node uses abstractions and relevant code to generate a summary and relationships.

Using the Analysis Results#

The results stored in shared["abstractions"] and shared["relationships"] are then used by subsequent nodes in the workflow:

  • The OrderChapters node uses the abstractions and relationships to decide the best sequence for the tutorial chapters (this will be covered in a later chapter).
  • The WriteChapters node uses the ordered list of abstractions, their descriptions, relevant code snippets, and the project summary/relationships to generate the actual Markdown content for each chapter.
  • The CombineTutorial node uses the project summary and relationships to create the index page and the relationship diagram.

This is the core idea: turn unstructured code text into structured data about concepts and their connections using AI.
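
For a hypothetical two-abstraction project, the structured data handed to the later nodes might look roughly like this (all values invented for illustration):

# Hypothetical contents of the shared store after code analysis (illustrative only)
shared = {
    "abstractions": [
        {"name": "User Management", "description": "Signs users in, like a front desk.", "files": [0, 2]},
        {"name": "Database Interface", "description": "Stores data, like a filing cabinet.", "files": [1]},
    ],
    "relationships": {
        "summary": "A small web app that manages user accounts and stores them in a database.",
        "details": [
            {"from": 0, "to": 1, "label": "Uses"},  # User Management -> Database Interface
        ],
    },
}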

Looking at the Code#

The logic for Code Analysis lives within the IdentifyAbstractions and AnalyzeRelationships classes in the function_app/nodes.py file.

Here's a simplified look at IdentifyAbstractions:

# function_app/nodes.py (Simplified IdentifyAbstractions)
from pocketflow import Node
from utils.call_llm import call_llm # Imports the utility to talk to the AI
import yaml # To parse the AI's YAML output

class IdentifyAbstractions(Node):
    def prep(self, shared):
        files_data = shared["files"] # Get the list of (path, content) tuples

        # --- Code to format files_data into a prompt string (omitted) ---
        # This creates the "Codebase Context" and "List of file indices" shown to the AI
        context = "...formatted file content with indices..."
        file_listing_for_prompt = "...list of indices and paths..."
        # --- End formatting code ---

        return context, file_listing_for_prompt, len(files_data), shared["project_name"], shared.get("language", "english")

    def exec(self, prep_res):
        context, file_listing_for_prompt, file_count, project_name, language = prep_res
        print(f"Identifying abstractions using LLM for project: {project_name}...")

        # --- Code to build the detailed prompt for the AI (omitted) ---
        # Includes instructions on what to identify and the required YAML format
        prompt = f"""
For the project `{project_name}`:
Codebase Context:
{context}
... instructions ...
List of file indices and paths:
{file_listing_for_prompt}
Format the output as a YAML list of dictionaries:
```yaml
... example format ...
...""" # Actual prompt includes language hints if not English

        response = call_llm(prompt) # Send the prompt to the AI!

        # --- Code to validate and parse the LLM's YAML response (omitted) ---
        # Checks format, extracts data, validates file indices
        abstractions = yaml.safe_load(response.strip().split("```yaml")[1].split("```")[0].strip())
        validated_abstractions = [] # Build list of {"name":..., "description":..., "files":[...]}
        # --- End validation/parsing code ---

        print(f"Identified {len(validated_abstractions)} abstractions.")
        return validated_abstractions # This is passed to post

    def post(self, shared, prep_res, exec_res):
        shared["abstractions"] = exec_res # Store the validated results in shared
        # Returns "default" implicitly to move to the next node in the flow

This simplified snippet shows how `IdentifyAbstractions` gets file data in `prep`, uses `call_llm` in `exec` to ask the AI, performs validation on the response, and then saves the result in `shared` in `post`.
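
The file-formatting details are omitted in the prep method above. As a rough illustration only (the actual separator text and layout used by the project may differ), the formatting could look something like this:

# Illustrative sketch of the omitted prep formatting (not the project's exact code)
def format_files_for_prompt(files_data):
    context_parts = []
    listing_lines = []
    for i, (path, content) in enumerate(files_data):
        # Label each file with its index so the AI can refer back to it by number
        context_parts.append(f"--- File {i}: {path} ---\n{content}")
        listing_lines.append(f"{i} # {path}")
    return "\n\n".join(context_parts), "\n".join(listing_lines)

# Example usage with two tiny files
context, file_listing_for_prompt = format_files_for_prompt([
    ("app/users.py", "class UserManager: ..."),
    ("app/db.py", "class Database: ..."),
])
print(file_listing_for_prompt)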

And here's a simplified look at AnalyzeRelationships:

# function_app/nodes.py (Simplified AnalyzeRelationships)
from pocketflow import Node
from utils.call_llm import call_llm # Imports the utility to talk to the AI
import yaml # To parse the AI's YAML output

# Helper function to get content for specific file indices (defined above the class)
def get_content_for_indices(files_data, indices):
    ...

class AnalyzeRelationships(Node):
    def prep(self, shared):
        abstractions = shared["abstractions"] # Get the list of abstractions identified previously
        files_data = shared["files"] # Get the original file data

        # --- Code to format context for the AI (omitted) ---
        # Includes abstraction details and relevant code snippets using get_content_for_indices
        context = "...formatted abstraction details and relevant code..."
        abstraction_listing = "...list of indices and names..." # For clear reference in prompt
        # --- End formatting code ---

        return context, abstraction_listing, len(abstractions), shared["project_name"], shared.get("language", "english")

    def exec(self, prep_res):
        context, abstraction_listing, num_abstractions, project_name, language = prep_res
        print(f"Analyzing relationships using LLM for project: {project_name}...")

        # --- Code to build the detailed prompt for the AI (omitted) ---
        # Includes instructions for summary and relationships, and the required YAML format
        prompt = f"""
Based on the following abstractions and relevant code snippets for project {project_name}:
List of Abstraction Indices and Names:
{abstraction_listing}
Context:
{context}
... instructions ...
Format the output as YAML:
```yaml
summary: |
  ...
relationships:
  - ...
...""" # Actual prompt includes language hints if not English and validation constraints

        response = call_llm(prompt) # Send the prompt to the AI!

        # --- Code to validate and parse the LLM's YAML response (omitted) ---
        # Checks format, extracts summary and relationships, validates indices and structure
        relationships_data = yaml.safe_load(response.strip().split("```yaml")[1].split("```")[0].strip())
        validated_relationships = [] # Build list of {"from":..., "to":..., "label":...}
        # --- End validation/parsing code ---

        print("Generated project summary and relationship details.")
        return {"summary": relationships_data["summary"], "details": validated_relationships} # Passed to post

    def post(self, shared, prep_res, exec_res):
        shared["relationships"] = exec_res # Store the validated results in shared
        # Returns "default" implicitly

Similar to `IdentifyAbstractions`, the `AnalyzeRelationships` node fetches the data it needs from `shared` in `prep`, calls the AI with a relevant prompt in `exec`, validates the response, and stores the output in `shared` in `post`. Notice that it uses the output of the previous analysis step (`abstractions`) as part of its input context, in addition to the original file data.
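
The body of get_content_for_indices is elided above. A plausible sketch, assuming it simply maps each requested index back to its file path and content (an assumption, not the project's actual implementation), might be:

# Plausible sketch of the elided helper (assumed behavior, not the actual implementation)
def get_content_for_indices(files_data, indices):
    content_map = {}
    for i in indices:
        if 0 <= i < len(files_data):
            path, content = files_data[i]
            # Key each snippet by index and path so the AI can tell files apart
            content_map[f"{i} # {path}"] = content
    return content_map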

These two nodes, powered by the LLM via the call_llm utility, are the core of the code analysis process, transforming raw code into a structured understanding of the codebase.
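
In the Pocket Flow terms from Chapter 3, the two analysis nodes sit back to back in the flow. A minimal wiring sketch might look like the following; the retry and wait values are placeholders rather than the project's actual settings:

# Minimal wiring sketch (placeholder retry/wait values, not the project's actual flow definition)
from pocketflow import Flow

identify = IdentifyAbstractions(max_retries=3, wait=10)  # retry the LLM call if validation fails
analyze = AnalyzeRelationships(max_retries=3, wait=10)

identify >> analyze  # run relationship analysis after abstractions are identified

analysis_flow = Flow(start=identify)
# analysis_flow.run(shared)  # shared must already contain "files" from Chapter 4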

Key Takeaways#

  • Code Analysis is necessary to move beyond raw code text and understand the structure and concepts.
  • Our project uses AI (LLMs) to perform this analysis automatically.
  • The process is broken down into identifying Abstractions (core components) and analyzing their Relationships.
  • These steps are implemented as Pocket Flow Nodes (IdentifyAbstractions and AnalyzeRelationships).
  • These nodes prepare data, send it to the AI via a utility function (call_llm), validate the AI's structured response (like YAML), and store the results in the shared dictionary for subsequent steps.
  • The output of Code Analysis (Abstractions and Relationships) is the foundation for determining the tutorial structure and generating content.

Now that we've seen what analysis is done and how it fits into the workflow, the next crucial piece is understanding how we actually communicate with the Large Language Model (AI).

Next Chapter: AI Brain (LLM Communication)


Generated by AI Codebase Knowledge Builder. References: 1(https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/function_app/nodes.py), 2(https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/nodes.py)