# Chapter 4: Code Fetching
Welcome back! In Chapter 3: Workflow Engine (Pocket Flow), we saw how our project uses Pocket Flow to orchestrate the complex steps of tutorial generation, making sure tasks like fetching code, analyzing it, and writing chapters happen in the correct order.
The very first step in that orchestrated process is getting the raw material: the code itself! Before we can analyze anything or ask the AI questions about the code, we need to actually get the code files.
This is the job of Code Fetching.
## What is Code Fetching?
Think of Code Fetching as the project's library assistant. When you want to research a topic, the assistant goes to the library (or online!), finds the specific books you need, brings them back, and maybe even sorts out the irrelevant ones (like novels when you asked for science books).
In our project, Code Fetching is the part of the system that goes to where your code lives (either online in a place like GitHub, or right on your local computer), finds the code files, reads their content, and gathers them up for processing.
It's the crucial first step because everything that comes after it (analysis, AI interaction, tutorial writing) depends on having the actual code content to work with.
## Why is Code Fetching Important?
- Getting the Data: Without fetching, there's no code to analyze. Simple as that!
- Handling Different Sources: Code can be in many places. Code fetching needs to know how to access online repositories (like GitHub) and local folders.
- Filtering: Not every file in a project is relevant for a tutorial. We often only care about source code (`.py`, `.js`), configuration (`.yml`), or documentation (`.md`) files. We also want to ignore large files that might contain data or binaries, and skip entire folders like test suites or build outputs. Code Fetching handles this filtering before passing the code on.
## Our Project's Code Fetching
In our Pocket Flow workflow (from Chapter 3), the `FetchRepo` node is responsible for code fetching.

When you submit the tutorial request through the Web Interface (Frontend), the initial request sent to the Azure Function (as discussed in Chapter 2: Serverless Deployment (Azure Functions)) contains all the details about where the code is and how to fetch it. This information is stored in the `shared` dictionary that the Pocket Flow uses.

The `FetchRepo` node reads these details from the `shared` dictionary to know what to do.

Here are the key inputs this node needs from the `shared` dictionary:
| Input Parameter | Description | Example Value |
|---|---|---|
| `repo_url` | URL of the GitHub repository (if fetching from GitHub) | `"https://github.com/owner/repo/tree/main/src"` |
| `local_dir` | Path to a local directory (if fetching from the local machine) | `"/Users/you/myproject"` |
| `github_token` | Your GitHub token (needed for private repos or rate limits) | `"ghp_abc123..."` |
| `include_patterns` | File patterns to include (e.g., `*.py`, `*.md`) | `{"*.py", "*.js", "*.md"}` |
| `exclude_patterns` | File patterns to exclude (e.g., `tests/*`, `*.log`) | `{"tests/*", "docs/*", "*.log"}` |
| `max_file_size` | Maximum size of files to fetch (in bytes) | `100000` (~100 KB) |
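For example, a request to generate a tutorial from a GitHub repository might populate these fields roughly like this (the values below are purely illustrative):

```python
# Illustrative FetchRepo inputs in the shared dictionary (values are made up).
shared = {
    "repo_url": "https://github.com/owner/repo/tree/main/src",
    "local_dir": None,                        # unused when repo_url is set
    "github_token": "ghp_abc123...",          # optional: private repos / higher rate limits
    "include_patterns": {"*.py", "*.js", "*.md"},
    "exclude_patterns": {"tests/*", "docs/*", "*.log"},
    "max_file_size": 100000,                  # bytes (~100 KB)
}
```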
The `FetchRepo` node takes these parameters and uses specialized helper functions to perform the actual fetching and filtering.
## How Code Fetching Works (Under the Hood)
Let's look at the process inside the `FetchRepo` node and its helper functions:
- Check Source: The `FetchRepo` node first checks whether `repo_url` or `local_dir` is provided in the `shared` data.
- Delegate Fetching:
  - If `repo_url` is present, it calls the `crawl_github_files` helper function.
  - If `local_dir` is present, it calls the `crawl_local_files` helper function.
- Helper Does the Work:
  - For GitHub (`crawl_github_files`): This function uses the GitHub API to navigate the repository's file structure, starting from the specified URL (which can include a branch, commit, or subdirectory). It recursively fetches details about files and directories. For each file:
    - It checks the file size against `max_file_size`. If the file is too big, it is skipped.
    - It checks the file path against `include_patterns` and `exclude_patterns`. If the path doesn't match an include pattern or matches an exclude pattern, the file is skipped.
    - If the file passes the checks, its content is downloaded. A GitHub token is used here to authenticate (especially for private repos) and to increase API rate limits.
  - For local directories (`crawl_local_files`): This function uses Python's built-in file system functions (`os.walk`) to traverse the local directory. For each file found:
    - It performs the same size check (`max_file_size`) and pattern checks (`include_patterns`, `exclude_patterns`) using the `fnmatch` library (which handles patterns like `*.py` or `tests/*`); a condensed sketch of this filtering logic appears right after this list.
    - If the file passes the checks, its content is read.
- Gather Results: Both helper functions collect the paths and contents of all files that pass the filters. They return this as a dictionary, typically like `{"files": {"path/to/file1.py": "content1", "another/file2.md": "content2"}}`.
- Store in Shared: The `FetchRepo` node takes this result (the dictionary of files) and stores it in the `shared` dictionary under the key `"files"`.
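To make the filtering rules concrete, here is a condensed sketch of that decision in isolation (not the project's exact helper, just the same checks expressed as one small function):

```python
import fnmatch

def passes_filters(relpath, size, include_patterns, exclude_patterns, max_file_size):
    """Return True if a file should be kept, mirroring the checks described above."""
    if max_file_size and size > max_file_size:
        return False  # too big
    if include_patterns and not any(fnmatch.fnmatch(relpath, p) for p in include_patterns):
        return False  # matches no include pattern
    if exclude_patterns and any(fnmatch.fnmatch(relpath, p) for p in exclude_patterns):
        return False  # matches an exclude pattern
    return True

print(passes_filters("src/app.py", 2000, {"*.py"}, {"tests/*"}, 100000))         # True
print(passes_filters("tests/test_app.py", 2000, {"*.py"}, {"tests/*"}, 100000))  # False (excluded)
print(passes_filters("data/big.csv", 5_000_000, {"*.csv"}, None, 100000))        # False (too big)
```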
Here's a simplified sequence of this process within the `FetchRepo` node:
```mermaid
sequenceDiagram
    participant Shared as Shared Store
    participant FetchRepo as FetchRepo Node
    participant GH_Helper as crawl_github_files
    participant Local_Helper as crawl_local_files
    participant GitHubAPI as GitHub API
    participant FileSystem as Local File System

    FetchRepo->>Shared: Read repo_url, local_dir, patterns, size
    alt If repo_url exists
        FetchRepo->>GH_Helper: Call with url, token, patterns, size
        GH_Helper->>GitHubAPI: List files (recursive)
        GitHubAPI-->>GH_Helper: File list & metadata
        loop For each file
            GH_Helper->>GH_Helper: Apply size & pattern filters
            alt If filters pass
                GH_Helper->>GitHubAPI: Download file content
                GitHubAPI-->>GH_Helper: File content
                GH_Helper->>GH_Helper: Store path: content
            end
        end
        GH_Helper-->>FetchRepo: Return {"files": {...}}
    else If local_dir exists
        FetchRepo->>Local_Helper: Call with dir, patterns, size
        Local_Helper->>FileSystem: Walk directory
        FileSystem-->>Local_Helper: File paths
        loop For each file path
            Local_Helper->>Local_Helper: Apply size & pattern filters
            alt If filters pass
                Local_Helper->>FileSystem: Read file content
                FileSystem-->>Local_Helper: File content
                Local_Helper->>Local_Helper: Store path: content
            end
        end
        Local_Helper-->>FetchRepo: Return {"files": {...}}
    end
    FetchRepo->>Shared: Write {"files": {...}}
```
This diagram illustrates how the `FetchRepo` node delegates the work, and how the helper functions interact with the source (GitHub or the local file system) while applying the specified filters before returning the gathered code content.
## Looking at the Code
Let's examine simplified snippets from the actual code files:
The `FetchRepo` node is defined in `function_app/nodes.py`. It's quite simple because it just calls the external helper functions.
```python
# function_app/nodes.py (Simplified FetchRepo)
from pocketflow import Node

# Import our helper functions
from .utils.crawl_github_files import crawl_github_files
from .utils.crawl_local_files import crawl_local_files


class FetchRepo(Node):
    def prep(self, shared):
        # Read necessary inputs from the shared dictionary
        return {
            "repo_url": shared.get("repo_url"),
            "local_dir": shared.get("local_dir"),
            "token": shared.get("github_token"),
            "include_patterns": shared.get("include_patterns"),
            "exclude_patterns": shared.get("exclude_patterns"),
            "max_file_size": shared.get("max_file_size"),
        }

    def exec(self, prep_res):
        # Decide which helper to call based on inputs
        if prep_res["repo_url"]:
            print(f"Crawling GitHub repository: {prep_res['repo_url']}...")
            return crawl_github_files(
                repo_url=prep_res["repo_url"],
                token=prep_res["token"],
                include_patterns=prep_res["include_patterns"],
                exclude_patterns=prep_res["exclude_patterns"],
                max_file_size=prep_res["max_file_size"],
            )
        elif prep_res["local_dir"]:
            print(f"Crawling local directory: {prep_res['local_dir']}...")
            return crawl_local_files(
                directory=prep_res["local_dir"],
                include_patterns=prep_res["include_patterns"],
                exclude_patterns=prep_res["exclude_patterns"],
                max_file_size=prep_res["max_file_size"],
            )
        else:
            raise ValueError("No repository URL or local directory provided.")

    def post(self, shared, prep_res, exec_res):
        # exec_res contains the result from the crawl_* function
        shared["files"] = exec_res.get("files", {})        # Store the dictionary of files
        shared["fetch_stats"] = exec_res.get("stats", {})  # Store any fetching statistics
        print(f"Fetched {len(shared['files'])} files.")
        # The default action is to proceed to the next node in the flow
        # (returning "default" is implicit)
```
This node's `prep` method simply collects the parameters needed for fetching from the `shared` dictionary. The `exec` method performs the core task: calling the appropriate `crawl_*` function based on whether a `repo_url` or `local_dir` was provided. The `post` method then takes the result (the fetched files and stats) and saves it back into the `shared` dictionary so the next node in the flow (`IdentifyAbstractions`) can access it.
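After `post` runs, the shared store contains the fetched files and (for GitHub fetches) some statistics. A hypothetical view of its shape, with made-up values:

```python
# Hypothetical contents of the shared store after FetchRepo completes.
# The keys follow the snippets in this chapter; the values are invented.
shared = {
    "files": {
        "src/app.py": "print('hello')\n",
        "README.md": "# My Project\n",
    },
    # Returned by the GitHub crawler; the local crawler returns no stats,
    # so this ends up as {} for local runs.
    "fetch_stats": {"downloaded_count": 2, "skipped_count": 1},
}

# The next node (IdentifyAbstractions) can simply iterate over the files:
for path, content in shared["files"].items():
    print(f"{path}: {len(content)} characters")
```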
Let's look at a simplified part of the `crawl_github_files.py` helper:
```python
# function_app/utils/crawl_github_files.py (Simplified snippet)
import requests
import fnmatch
# ... other imports ...

def crawl_github_files(repo_url, token=None, max_file_size=None,
                       include_patterns=None, exclude_patterns=None):
    # ... parse repo_url to get owner, repo, specific_path, ref ...

    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"

    files = {}
    skipped_files = []

    def should_include_file(file_path: str, file_name: str) -> bool:
        # Logic to check whether file_name/file_path matches the
        # include/exclude patterns using fnmatch.fnmatch
        # ... returns True or False ...
        pass  # Simplified

    def fetch_contents(path):
        url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
        params = {"ref": ref} if ref is not None else {}
        response = requests.get(url, headers=headers, params=params)
        # ... handle rate limits, errors (403, 404) ...
        contents = response.json()
        if not isinstance(contents, list):
            contents = [contents]  # Handle single-file response

        for item in contents:
            item_path = item["path"]
            if item["type"] == "file":
                # Apply filters
                file_size = item.get("size", 0)
                if max_file_size and file_size > max_file_size:
                    skipped_files.append((item_path, file_size))
                    print(f"Skipping {item_path}: exceeds size limit")
                    continue

                if not should_include_file(item_path, item["name"]):
                    print(f"Skipping {item_path}: patterns mismatch")
                    continue

                # If filters pass, download content
                if "download_url" in item:
                    file_response = requests.get(item["download_url"], headers=headers)
                    if file_response.status_code == 200:
                        files[item_path] = file_response.text  # Store content
                        print(f"Downloaded: {item_path}")
                    # ... handle download errors ...
            elif item["type"] == "dir":
                fetch_contents(item_path)  # Recurse into subdirectory

    fetch_contents(specific_path or "")  # Start the process

    return {"files": files,
            "stats": {"downloaded_count": len(files),
                      "skipped_count": len(skipped_files)}}

# ... rest of the file ...
```
This snippet shows how `crawl_github_files` uses the `requests` library to interact with the GitHub API. It defines a recursive `fetch_contents` function to traverse directories. Inside the loop, it checks the item type (`"file"` or `"dir"`), applies the size and pattern filters using `should_include_file`, and, if a file passes the filters, downloads its content. Directories trigger recursive calls.
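The snippet elides how the URL is parsed into `owner`, `repo`, `ref`, and `specific_path`. A rough, hypothetical sketch for URLs of the form `https://github.com/owner/repo/tree/<branch>/<subdir>` could look like this (the project's actual parsing is more thorough; this is only an illustration):

```python
from urllib.parse import urlparse

def parse_github_url(repo_url):
    # Split "https://github.com/owner/repo/tree/<ref>/<subdir>" into its parts.
    parts = urlparse(repo_url).path.strip("/").split("/")
    owner, repo = parts[0], parts[1]
    ref, specific_path = None, ""
    if len(parts) > 3 and parts[2] == "tree":
        ref = parts[3]                       # branch, tag, or commit
        specific_path = "/".join(parts[4:])  # optional subdirectory
    return owner, repo, ref, specific_path

print(parse_github_url("https://github.com/owner/repo/tree/main/src"))
# ('owner', 'repo', 'main', 'src')
```

Note that branch names containing slashes would need extra handling; this is only a sketch.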
And here is a simplified part of the `crawl_local_files.py` helper:
```python
# function_app/utils/crawl_local_files.py (Simplified snippet)
import os
import fnmatch
# ... other imports ...

def crawl_local_files(directory, include_patterns=None, exclude_patterns=None, max_file_size=None):
    if not os.path.isdir(directory):
        raise ValueError(f"Directory does not exist: {directory}")

    files_dict = {}

    # os.walk traverses the directory tree
    for root, _, filenames in os.walk(directory):
        for filename in filenames:
            filepath = os.path.join(root, filename)
            relpath = os.path.relpath(filepath, directory)  # Path relative to the start

            # Apply filters similar to the GitHub crawler
            # Check size
            if max_file_size and os.path.getsize(filepath) > max_file_size:
                print(f"Skipping {relpath}: exceeds size limit")
                continue

            # Check patterns
            included = False
            if include_patterns:
                for pattern in include_patterns:
                    if fnmatch.fnmatch(relpath, pattern):
                        included = True
                        break
            else:  # No include patterns means include all
                included = True

            excluded = False
            if exclude_patterns:
                for pattern in exclude_patterns:
                    if fnmatch.fnmatch(relpath, pattern):
                        excluded = True
                        break

            if not included or excluded:
                print(f"Skipping {relpath}: patterns mismatch")
                continue

            # If filters pass, read content
            try:
                with open(filepath, 'r', encoding='utf-8') as f:
                    content = f.read()
                files_dict[relpath] = content  # Store content
                print(f"Added {relpath}")
            except Exception as e:
                print(f"Warning: Could not read file {filepath}: {e}")

    return {"files": files_dict}

# ... rest of the file ...
```
This snippet for local crawling uses `os.walk` to iterate through files. It constructs the path relative to the starting directory (`os.path.relpath`), applies the same size and pattern filters using `os.path.getsize` and `fnmatch.fnmatch`, and, if a file passes, reads its content with `open()` and stores it.
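One detail worth noting: the patterns are matched against the path relative to the starting directory, and `fnmatch`'s `*` also matches `/`. So `*.py` matches Python files at any depth, while `tests/*` only matches paths under a top-level `tests` folder. A quick, self-contained demonstration:

```python
import fnmatch

paths = [
    "main.py",
    "src/utils/helpers.py",
    "tests/test_main.py",
    "src/tests/test_utils.py",
]

for relpath in paths:
    print(relpath,
          "| *.py ->", fnmatch.fnmatch(relpath, "*.py"),
          "| tests/* ->", fnmatch.fnmatch(relpath, "tests/*"))

# Expected: every path matches *.py, but only tests/test_main.py matches tests/*
# (the nested src/tests/... path does not).
```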
Both helper functions ensure that only relevant, non-oversized files matching your criteria are collected, and their content is stored in the dictionary that `FetchRepo` returns.
## Conclusion
Code Fetching, implemented by the `FetchRepo` node and its helper functions, is the essential first step in our tutorial generation pipeline. It acts like a targeted file collector, intelligently gathering the code content from GitHub or a local directory while applying filters for size and file patterns. This ensures that the subsequent steps of analysis and generation work with a manageable and relevant set of code files.
Now that we have the code, the next challenge is to understand what it does. In the next chapter, we'll explore how the project uses AI to analyze this fetched code.
Next Chapter: Code Analysis (AI Understanding)
Generated by AI Codebase Knowledge Builder. References: [1](https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/function_app/function_app.py), [2](https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/function_app/utils/crawl_github_files.py), [3](https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/function_app/utils/crawl_local_files.py), [4](https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/main.py), [5](https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/utils/crawl_github_files.py), [6](https://github.com/hieuminh65/Tutorial-Codebase-Knowledge/blob/be7f595a38221b3dd7b1585dc226e47c815dec6e/utils/crawl_local_files.py)