
Chapter 4: Code Fetching#

Welcome back! In Chapter 3: Workflow Engine (Pocket Flow), we saw how our project uses Pocket Flow to orchestrate the complex steps of tutorial generation, making sure tasks like fetching code, analyzing it, and writing chapters happen in the correct order.

The very first step in that orchestrated process is getting the raw material: the code itself! Before we can analyze anything or ask AI questions about the code, we first need to actually get the code files.

This is the job of Code Fetching.

What is Code Fetching?#

Think of Code Fetching as the project's library assistant. When you want to research a topic, the assistant goes to the library (or online!), finds the specific books you need, brings them back, and maybe even sorts out the irrelevant ones (like novels when you asked for science books).

In our project, Code Fetching is the part of the system that goes to where your code lives – either online in a place like GitHub or right on your local computer – finds the code files, reads their content, and gathers them up for processing.

It's the crucial first step because everything that comes after it (analysis, AI interaction, tutorial writing) depends on having the actual code content to work with.

Why is Code Fetching Important?#

  • Getting the Data: Without fetching, there's no code to analyze. Simple as that!
  • Handling Different Sources: Code can be in many places. Code fetching needs to know how to access online repositories (like GitHub) and local folders.
  • Filtering: Not every file in a project is relevant for a tutorial. We often only care about source code (.py, .js), configuration (.yml), or documentation (.md) files. We also want to ignore large files that might contain data or binaries, or ignore entire folders like test suites or build outputs. Code fetching handles this filtering before passing the code on.
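
To make the pattern filtering concrete, here is a tiny standalone sketch (illustrative only, not the project's actual code) of how fnmatch-style patterns decide which files are kept:

# Illustrative sketch of fnmatch-based filtering (example patterns, not project defaults)
import fnmatch

include_patterns = {"*.py", "*.md"}
exclude_patterns = {"tests/*", "*.log"}

def keep(path):
    # Keep a file only if it matches an include pattern
    # and does not match any exclude pattern.
    included = any(fnmatch.fnmatch(path, p) for p in include_patterns)
    excluded = any(fnmatch.fnmatch(path, p) for p in exclude_patterns)
    return included and not excluded

print(keep("utils/helpers.py"))   # True: matches *.py
print(keep("tests/test_api.py"))  # False: excluded by tests/*
print(keep("notes.log"))          # False: excluded by *.log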

Our Project's Code Fetching#

In our Pocket Flow workflow (from Chapter 3), the FetchRepo node is responsible for code fetching.

When you submit the tutorial request through the Web Interface (Frontend), the initial request sent to the Azure Function (as discussed in Chapter 2: Serverless Deployment (Azure Functions)) contains all the details about where the code is and how to fetch it. This information is stored in the shared dictionary that the Pocket Flow uses.

The FetchRepo node reads these details from the shared dictionary to know what to do.

Here are the key inputs this node needs from the shared dictionary:

| Input Parameter | Description | Example Value |
| --- | --- | --- |
| repo_url | URL of the GitHub repository (if fetching from GitHub) | "https://github.com/owner/repo/tree/main/src" |
| local_dir | Path to a local directory (if fetching from the local machine) | "/Users/you/myproject" |
| github_token | Your GitHub token (needed for private repos or higher rate limits) | "ghp_abc123..." |
| include_patterns | File patterns to include (e.g., *.py, *.md) | {"*.py", "*.js", "*.md"} |
| exclude_patterns | File patterns to exclude (e.g., tests/*, *.log) | {"tests/*", "docs/*", "*.log"} |
| max_file_size | Maximum size of files to fetch, in bytes | 100000 (~100 KB) |
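
For example, a request to fetch a GitHub repository might populate the fetch-related part of the shared dictionary roughly like this (the values are illustrative):

# Illustrative fetch-related entries in the shared dictionary (example values)
shared = {
    "repo_url": "https://github.com/owner/repo/tree/main/src",
    "local_dir": None,                        # not used when repo_url is set
    "github_token": "ghp_abc123...",          # optional: private repos / higher rate limits
    "include_patterns": {"*.py", "*.js", "*.md"},
    "exclude_patterns": {"tests/*", "docs/*", "*.log"},
    "max_file_size": 100000,                  # in bytes (~100 KB)
}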

The FetchRepo node takes these parameters and uses specialized helper functions to perform the actual fetching and filtering.

How Code Fetching Works (Under the Hood)#

Let's look at the process inside the FetchRepo node and its helper functions:

  1. Check Source: The FetchRepo node first checks if repo_url or local_dir is provided in the shared data.
  2. Delegate Fetching:
    • If repo_url is present, it calls the crawl_github_files helper function.
    • If local_dir is present, it calls the crawl_local_files helper function.
  3. Helper Does the Work:
    • For GitHub (crawl_github_files): This function uses the GitHub API to navigate the repository's file structure, starting from the specified URL (which can include a branch, commit, or subdirectory). It recursively fetches details about files and directories. For each file:
      • It checks the file size against max_file_size. If the file is too large, it is skipped.
      • It checks the file path against include_patterns and exclude_patterns. If it doesn't match an include pattern, or matches an exclude pattern, it is skipped.
      • If it passes the checks, it downloads the file's content. A GitHub token is used here to authenticate (especially for private repos) and increase API rate limits.
    • For Local (crawl_local_files): This function uses Python's built-in file system functions (os.walk) to traverse the local directory. For each file found:
      • It performs the same size check (max_file_size) and pattern checks (include_patterns, exclude_patterns) using the fnmatch library (which handles patterns like *.py or tests/*).
      • If it passes the checks, it reads the file's content.
  4. Gather Results: Both helper functions collect the paths and contents of all files that pass the filters. They return this as a dictionary, typically like {"files": {"path/to/file1.py": "content1", "another/file2.md": "content2"}}.
  5. Store in Shared: The FetchRepo node takes this result (the dictionary of files) and stores it in the shared dictionary under the key "files".
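
Putting these steps together, once FetchRepo has run, the shared dictionary contains an entry along these lines (the paths and contents here are made up for illustration):

# Illustrative shape of shared["files"] after FetchRepo has run (example data)
shared["files"] = {
    "src/app.py": "import os\n...",
    "docs/guide.md": "# Guide\n...",
}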

Here's a simplified sequence of this process within the FetchRepo node:

sequenceDiagram
    participant Shared as Shared Store
    participant FetchRepo as FetchRepo Node
    participant GH_Helper as crawl_github_files
    participant Local_Helper as crawl_local_files
    participant GitHubAPI as GitHub API
    participant FileSystem as Local File System

    FetchRepo->>Shared: Read repo_url, local_dir, patterns, size
    alt If repo_url exists
        FetchRepo->>GH_Helper: Call with url, token, patterns, size
        GH_Helper->>GitHubAPI: List files (recursive)
        GitHubAPI-->>GH_Helper: File list & metadata
        loop For each file
            GH_Helper->>GH_Helper: Apply size & pattern filters
            alt If filters pass
                GH_Helper->>GitHubAPI: Download file content
                GitHubAPI-->>GH_Helper: File content
                GH_Helper->>GH_Helper: Store path: content
            end
        end
        GH_Helper-->>FetchRepo: Return {"files": {...}}
    else If local_dir exists
        FetchRepo->>Local_Helper: Call with dir, patterns, size
        Local_Helper->>FileSystem: Walk directory
        FileSystem-->>Local_Helper: File paths
        loop For each file path
            Local_Helper->>Local_Helper: Apply size & pattern filters
            alt If filters pass
                Local_Helper->>FileSystem: Read file content
                FileSystem-->>Local_Helper: File content
                Local_Helper->>Local_Helper: Store path: content
            end
        end
        Local_Helper-->>FetchRepo: Return {"files": {...}}
    end
    FetchRepo->>Shared: Write {"files": {...}}

This diagram illustrates how the FetchRepo node delegates the work and how the helper functions interact with the source (GitHub or local) while applying the specified filters before returning the gathered code content.

Looking at the Code#

Let's examine simplified snippets from the actual code files:

The FetchRepo node is defined in function_app/nodes.py. It's quite simple because it just calls the external helper functions.

# function_app/nodes.py (Simplified FetchRepo)
from pocketflow import Node
# Import our helper functions
from .utils.crawl_github_files import crawl_github_files
from .utils.crawl_local_files import crawl_local_files

class FetchRepo(Node):
    def prep(self, shared):
        # Read necessary inputs from the shared dictionary
        return {
            "repo_url": shared.get("repo_url"),
            "local_dir": shared.get("local_dir"),
            "token": shared.get("github_token"),
            "include_patterns": shared.get("include_patterns"),
            "exclude_patterns": shared.get("exclude_patterns"),
            "max_file_size": shared.get("max_file_size")
        }

    def exec(self, prep_res):
        # Decide which helper to call based on inputs
        if prep_res["repo_url"]:
            print(f"Crawling GitHub repository: {prep_res['repo_url']}...")
            return crawl_github_files(
                repo_url=prep_res["repo_url"],
                token=prep_res["token"],
                include_patterns=prep_res["include_patterns"],
                exclude_patterns=prep_res["exclude_patterns"],
                max_file_size=prep_res["max_file_size"]
            )
        elif prep_res["local_dir"]:
            print(f"Crawling local directory: {prep_res['local_dir']}...")
            return crawl_local_files(
                directory=prep_res["local_dir"],
                include_patterns=prep_res["include_patterns"],
                exclude_patterns=prep_res["exclude_patterns"],
                max_file_size=prep_res["max_file_size"]
            )
        else:
            raise ValueError("No repository URL or local directory provided.")

    def post(self, shared, prep_res, exec_res):
        # exec_res contains the result from the crawl_* function
        shared["files"] = exec_res.get("files", {}) # Store the dictionary of files
        shared["fetch_stats"] = exec_res.get("stats", {}) # Store any fetching statistics
        print(f"Fetched {len(shared['files'])} files.")
        # The default action is to proceed to the next node in the flow
        # return "default" is implicit

This node's prep method simply collects the parameters needed for fetching from the shared dictionary. The exec method performs the core task: calling the appropriate crawl_* function based on whether a repo_url or local_dir was provided. The post method then takes the result (the fetched files and stats) and saves them back into the shared dictionary so the next node in the flow (IdentifyAbstractions) can access them.

Let's look at a simplified part of the crawl_github_files.py helper:

# function_app/utils/crawl_github_files.py (Simplified snippet)
import requests
import fnmatch
# ... other imports ...

def crawl_github_files(repo_url, token=None, max_file_size=None, include_patterns=None, exclude_patterns=None):
    # ... parse repo_url to get owner, repo, path, ref ...

    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"

    files = {}
    skipped_files = []

    def should_include_file(file_path: str, file_name: str) -> bool:
        # Logic to check if file_name/file_path matches include/exclude patterns
        # using fnmatch.fnmatch
        # ... returns True or False ...
        pass # Simplified

    def fetch_contents(path):
        url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
        params = {"ref": ref} if ref is not None else {}

        response = requests.get(url, headers=headers, params=params)

        # ... handle rate limits, errors (403, 404) ...

        contents = response.json()
        if not isinstance(contents, list):
            contents = [contents] # Handle single file response

        for item in contents:
            item_path = item["path"]
            if item["type"] == "file":
                # Apply filters
                file_size = item.get("size", 0)
                if max_file_size and file_size > max_file_size:
                    skipped_files.append((item_path, file_size))
                    print(f"Skipping {item_path}: exceeds size limit")
                    continue

                if not should_include_file(item_path, item["name"]):
                    print(f"Skipping {item_path}: patterns mismatch")
                    continue

                # If filters pass, download content
                if "download_url" in item:
                    file_response = requests.get(item["download_url"], headers=headers)
                    if file_response.status_code == 200:
                        files[item_path] = file_response.text # Store content
                        print(f"Downloaded: {item_path}")
                    # ... handle download errors ...
            elif item["type"] == "dir":
                fetch_contents(item_path) # Recurse into subdirectory

    fetch_contents(specific_path or "") # Start the process
    return {"files": files, "stats": {"downloaded_count": len(files), "skipped_count": len(skipped_files)}}

# ... rest of the file ...

This snippet shows how crawl_github_files uses the requests library to interact with the GitHub API. It defines a recursive fetch_contents function to traverse directories. Inside the loop, it checks the item type ("file" or "dir"), applies the size and pattern filters using should_include_file, and if it's a file that passes filters, it downloads its content. Directories trigger recursive calls.
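
The should_include_file helper is elided above. Conceptually it applies the same include/exclude logic as the local crawler; a minimal reconstruction (a sketch, not the repository's exact code) could look like this:

# Hypothetical reconstruction of should_include_file's pattern logic
def should_include_file(file_path: str, file_name: str) -> bool:
    # With no include patterns, every file is a candidate;
    # otherwise the path or bare file name must match one of them.
    if include_patterns:
        included = any(
            fnmatch.fnmatch(file_path, p) or fnmatch.fnmatch(file_name, p)
            for p in include_patterns
        )
    else:
        included = True

    # Any match against an exclude pattern drops the file.
    excluded = any(
        fnmatch.fnmatch(file_path, p) or fnmatch.fnmatch(file_name, p)
        for p in (exclude_patterns or [])
    )
    return included and not excluded

Here include_patterns and exclude_patterns are the parameters of the enclosing crawl_github_files function, captured by the nested definition just as in the snippet above.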

And a simplified part of the crawl_local_files.py helper:

# function_app/utils/crawl_local_files.py (Simplified snippet)
import os
import fnmatch
# ... other imports ...

def crawl_local_files(directory, include_patterns=None, exclude_patterns=None, max_file_size=None):
    if not os.path.isdir(directory):
        raise ValueError(f"Directory does not exist: {directory}")

    files_dict = {}
    # os.walk traverses the directory tree
    for root, _, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            relpath = os.path.relpath(filepath, directory) # Get path relative to start

            # Apply filters similar to the github crawler
            # Check size
            if max_file_size and os.path.getsize(filepath) > max_file_size:
                print(f"Skipping {relpath}: exceeds size limit")
                continue

            # Check patterns
            included = False
            if include_patterns:
                for pattern in include_patterns:
                    if fnmatch.fnmatch(relpath, pattern):
                        included = True
                        break
            else:  # No include patterns means include all
                included = True

            excluded = False
            if exclude_patterns:
                for pattern in exclude_patterns:
                    if fnmatch.fnmatch(relpath, pattern):
                        excluded = True
                        break

            if not included or excluded:
                print(f"Skipping {relpath}: patterns mismatch")
                continue

            # If filters pass, read content
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    content = f.read()
                files_dict[relpath] = content # Store content
                print(f"Added {relpath}")
            except Exception as e:
                print(f"Warning: Could not read file {filepath}: {e}")

    return {"files": files_dict}

# ... rest of the file ...

This snippet for local crawling uses os.walk to iterate through files. It constructs the relative path (os.path.relpath), applies the same size and pattern filters using os.path.getsize and fnmatch.fnmatch, and if a file passes, it reads its content using open() and stores it.
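
If you wanted to exercise the local crawler on its own, outside the Pocket Flow workflow, a call might look roughly like this (the directory and patterns below are placeholders):

# Illustrative standalone use of crawl_local_files (placeholder path and patterns)
result = crawl_local_files(
    directory="/Users/you/myproject",
    include_patterns={"*.py", "*.md"},
    exclude_patterns={"tests/*"},
    max_file_size=100000,  # ~100 KB
)
print(f"Collected {len(result['files'])} files")
for path in sorted(result["files"]):
    print(" -", path)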

Both helper functions ensure that only relevant, non-oversized files matching your criteria are collected, and their content is stored in the dictionary that FetchRepo returns.

Conclusion#

Code Fetching, implemented by the FetchRepo node and its helper functions, is the essential first step in our tutorial generation pipeline. It acts like a targeted file collector, intelligently gathering the code content from GitHub or a local directory while applying filters for size and file patterns. This ensures that the subsequent steps of analysis and generation work with a manageable and relevant set of code files.

Now that we have the code, the next challenge is to understand what it does. In the next chapter, we'll explore how the project uses AI to analyze this fetched code.

Next Chapter: Code Analysis (AI Understanding)


Generated by AI Codebase Knowledge Builder. References: 1(https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/function_app/function_app.py), 2(https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/function_app/utils/crawl_github_files.py), 3(https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/function_app/utils/crawl_local_files.py), 4(https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/main.py), 5(https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/utils/crawl_github_files.py), 6(https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/utils/crawl_local_files.py)