DocuGen – Automated Documentation Analysis & Enhancement

Byeongjun Moon, 2024, bj.moon@usc.edu

A machine learning-powered tool that analyzes the quality of a GitHub repository's documentation and suggests improvements using pretrained language models (DistilBERT, CodeBERT, and OPT-125m).

Features

Analyzes README quality using DistilBERT for semantic understanding
Evaluates code-documentation alignment using CodeBERT
Generates enhanced documentation using OPT-125m (optimized for Apple Silicon)
Provides detailed section-by-section analysis with quality scores
Suggests actionable improvements based on best practices
Provides a web interface for interactive use
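
To illustrate what section-by-section analysis involves, the sketch below splits a Markdown README into (heading, body) pairs that could then be scored one at a time. The helper name `split_sections` and the heading pattern are assumptions made for illustration; they are not taken from DocuGen's code.

```python
import re

# Hypothetical helper, not part of DocuGen's code base.
# Matches ATX-style Markdown headings such as "## Usage".
# Limitation: "#" lines inside fenced code blocks are also treated as headings.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def split_sections(readme_text):
    """Split Markdown text into (heading, body) pairs, one per heading."""
    sections = []
    title, body = "(preamble)", []
    for line in readme_text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if body or title != "(preamble)":
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2), []
        else:
            body.append(line)
    sections.append((title, "\n".join(body).strip()))
    return sections
```

Each returned pair can be passed to a scoring function independently, which is what makes per-section quality scores possible.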

Project Structure

src/
├── models/                             # Core ML models and analyzers
│   ├── code_documentation_analyzer.py  # CodeBERT-based code analysis
│   ├── quality_enhancer.py             # OPT-125m-based enhancement
│   └── unified_scorer.py               # Combined scoring system
├── trainers/                           # Model training scripts
└── main.py                             # Application entry point
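
One plausible reading of "combined scoring system" is a weighted average of the per-analyzer scores. The sketch below assumes each analyzer returns a score in [0, 1]; the component names and weights are hypothetical and are not taken from unified_scorer.py.

```python
def combine_scores(scores, weights):
    """Weighted average of component scores, each expected in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights for the given scores must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical component names and weights, chosen only for illustration.
weights = {"readme_quality": 0.5, "code_alignment": 0.3, "structure": 0.2}
scores = {"readme_quality": 0.8, "code_alignment": 0.6, "structure": 1.0}
overall = combine_scores(scores, weights)
```

Normalizing by the total weight keeps the result in [0, 1] even when only some components are available.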

Installation

Clone the repository:

git clone https://github.com/yourusername/docugen.git
cd docugen

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Set up environment variables:

cp .env.example .env
# Edit .env with your GitHub token
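
As a minimal sketch of how the token might be consumed, the function below builds GitHub REST API request headers from an environment variable. It assumes the values in .env have already been loaded into the process environment (for example by a loader such as python-dotenv), and the variable name `GITHUB_TOKEN` is an assumption; check .env.example for the name the project actually uses.

```python
import os

def github_headers(env=os.environ):
    """Build GitHub REST API headers from a token in the environment."""
    # GITHUB_TOKEN is an assumed variable name, not confirmed by the project.
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set; add it to .env")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
```

Failing early with a clear message is easier to debug than an unauthenticated request that later hits a rate limit.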

Usage

Start the web interface:

python src/main.py

Open your browser and navigate to:

http://127.0.0.1:7862

Enter a GitHub repository URL to analyze
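
Before a repository can be analyzed, the URL has to be reduced to an owner and a repository name. The sketch below shows one way to do that with the standard library; `parse_repo_url` is a hypothetical helper, and the URL forms DocuGen actually accepts are not documented here.

```python
from urllib.parse import urlparse

def parse_repo_url(url):
    """Return (owner, repo) for a https://github.com/<owner>/<repo> URL."""
    parsed = urlparse(url.strip())
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {url!r}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected /<owner>/<repo> in: {url!r}")
    owner, repo = parts[0], parts[1]
    # Drop the optional ".git" suffix used in clone URLs.
    if repo.endswith(".git"):
        repo = repo[:-4]
    return owner, repo

owner, repo = parse_repo_url("https://github.com/yourusername/docugen.git")
```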

Models

DistilBERT & CodeBERT

DistilBERT: semantic analysis of documentation content
CodeBERT: evaluation of code-documentation alignment
Measures documentation completeness and quality
Analyzes docstrings and code comments
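
A common way to turn embeddings into an alignment score is the cosine similarity between the vector for a piece of code and the vector for its documentation. Whether DocuGen scores alignment this way is not stated, so treat the following as a sketch under that assumption; the toy vectors stand in for model embeddings, which are not computed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of a function and its docstring.
code_vec = [0.2, 0.7, 0.1]
doc_vec = [0.25, 0.6, 0.05]
alignment = cosine_similarity(code_vec, doc_vec)
```

Values near 1 indicate that the two vectors point in nearly the same direction; values near 0 indicate little overlap.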

OPT-125m Enhancer

Lightweight model optimized for Apple Silicon
Generates contextual documentation improvements
Memory-efficient operation
Enhanced performance on M1/M2 chips

ReadmeQualityModel

Custom model for quality scoring
Fine-tuned on high-quality documentation examples
Evaluates clarity, completeness, and structure
Provides section-specific quality metrics
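
To make the "completeness" dimension concrete, here is a rule-based stand-in that checks which expected sections a README contains. It is not the fine-tuned ReadmeQualityModel, and the list of expected sections is a hypothetical choice made for illustration.

```python
# Hypothetical list of sections a README is expected to contain.
EXPECTED_SECTIONS = ("installation", "usage", "features", "license")

def completeness(headings, expected=EXPECTED_SECTIONS):
    """Return (score, missing): the share of expected sections present, and those absent."""
    present = {h.strip().lower() for h in headings}
    missing = [name for name in expected if name not in present]
    score = 1.0 - len(missing) / len(expected)
    return score, missing

score, missing = completeness(["Features", "Installation", "Usage"])
```

A learned model can capture clarity and structure as well, but a simple check like this is a useful baseline to compare it against.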

Development

Training the Quality Model

Add your GitHub token to .env
Run the data collection script:

python src/models/getting_data.py

Train the README quality model:

python src/trainers/train_readme_model.py

Run the main script to start the web interface:

python src/main.py
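
What getting_data.py does is not shown here. Assuming it collects READMEs through the GitHub REST API, a single request could be built as in the sketch below, which constructs the request without sending it. `readme_request` is a hypothetical helper; the endpoint and media type are those documented for GitHub's "Get a repository README" API.

```python
import urllib.request

def readme_request(owner, repo, token=None):
    """Build (but do not send) a GitHub REST API request for a repository's README."""
    url = f"https://api.github.com/repos/{owner}/{repo}/readme"
    # The raw media type asks GitHub for the file contents instead of JSON metadata.
    headers = {"Accept": "application/vnd.github.raw+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)

request = readme_request("octocat", "Hello-World", token="example-token")
```

Passing the request to `urllib.request.urlopen` would perform the download; authenticated requests get a higher rate limit than anonymous ones.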
