Building an Intelligent Attack and Defense Knowledge Base: Vulnerability Management Practices Combining AI Technology

Project Origin

One day a classmate dropped a link to a GitHub repository of PoCs into the group chat. It contained many PoC documents in Markdown, and I saved it right away. Building knowledge bases on top of vector databases has recently become very convenient, so I thought: why not use a tool like AnythingLLM to build one directly? AnythingLLM can even import documents straight from a GitHub repository.

The result looked like this. Question: “What are the vulnerabilities of X-OA?” Answer:

**PoC Name:** oa-l329158
**Vulnerability Description:**
This vulnerability exploits the Java Encoder library to perform file hiding and malicious encoding attacks. When a user uploads a specific type of file, the attacker can generate a file with malicious code to trigger a DoS attack.

(Much content omitted)

This works to a degree, but most of the time when I look up one of these documents, I want it to hand me the PoC directly. So I decided to build a system myself.

System Design Philosophy

Core Function Design

Intelligent Search: Semantic search based on vector similarity, no longer limited to keyword matching.
AI-Assisted PoC Generation: Enter a vulnerability description and the AI helps generate a first version of the PoC code. Once the documentation is in place, PoC generation is a must.
Knowledge Graph: Establish relationships between vulnerabilities, technologies, and tools.
Multi-Platform Support: Support multiple PoC formats such as Nuclei, Pocsuite3, and Python.

Technology Selection

Considering development efficiency and maintenance costs, we chose a relatively mature technology stack:

Backend Architecture:

FastAPI: Build REST API, integrate AI model calls. Mainly because Python is convenient.
PostgreSQL + pgvector: Store data and support vector retrieval. Because PostgreSQL stores everything. 😂
Redis: Cache hot data.

Frontend Interface:

Vue3 + Tailwind CSS: Build user interface.
ECharts: Data visualization.
The frontend was entirely generated by AI.

AI Capabilities:

Local LLM: Locally deployed Ollama. Main models used:
- bge-large: Used for vectorization; produces 1024-dimensional vectors and is said to give good results.
- gpt-oss: Used for generation based on the documents; actually, Qwen’s performance is also quite good.
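As a minimal sketch of how the vectorization step might call a local Ollama instance: the snippet below posts texts to Ollama's `/api/embed` endpoint using only the standard library. The URL (default local port), the model tag `bge-large`, and the helper names `http_post_json` and `embed` are assumptions for illustration, not the project's actual code.

```python
import json
import urllib.request
from typing import Callable

# Assumptions: default local Ollama address and an embedding model tag
# that has already been pulled into Ollama.
OLLAMA_URL = "http://localhost:11434/api/embed"
EMBED_MODEL = "bge-large"


def http_post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def embed(
    texts: list[str],
    post: Callable[[str, dict], dict] = http_post_json,
) -> list[list[float]]:
    """Return one embedding vector per input text."""
    reply = post(OLLAMA_URL, {"model": EMBED_MODEL, "input": texts})
    vectors = reply["embeddings"]
    if len(vectors) != len(texts):
        raise ValueError("embedding count does not match input count")
    return vectors
```

Passing the transport in as `post` keeps the function easy to test without a running model server; in the FastAPI backend an async HTTP client would probably be a better fit.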

System Architecture Overview

Overall Design

graph TB
    subgraph "User Interface"
        UI1[Web Interface]
        UI2[PoC Editor]
        UI3[Search Function]
    end
    subgraph "API Layer"
        API[FastAPI Backend]
        AUTH[Auth Module]
        NucleiTemplate[YAML Template]
    end
    subgraph "AI Processing Layer"
        LLM[Large Language Model]
        EMBED[Vectorization Model]
        SEARCH[Vector Search]
    end
    subgraph "Data Storage"
        PG[PostgreSQL+pgvector]
        REDIS[Redis Cache]
    end
    UI1 --> API
    UI2 --> API
    UI3 --> API
    API --> AUTH
    API --> LLM
    API --> NucleiTemplate
    API --> EMBED
    EMBED --> SEARCH
    SEARCH --> PG
    NucleiTemplate --> PG
    API --> REDIS
    LLM --> PG

1. Vulnerability Management

Support importing vulnerability documents in Markdown format.
Automatically extract key vulnerability information (CVE number, CVSS score, etc.).
Support tag classification and relationship management.
Support various Python scripts, Lua scripts, etc.
Of course, Nuclei scripts are also supported.
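The automatic extraction of key fields could be sketched roughly as follows. This assumes the Markdown documents start with a `# ` heading and write scores in a form like `CVSS: 9.8` or `CVSS score: 9.8`; the `VulnMeta` class and `extract_meta` function are hypothetical names, and real documents will likely need more patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional

TITLE_PATTERN = re.compile(r"^#\s+(.+)$", re.MULTILINE)
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
# Assumption: scores appear as "CVSS: 9.8", "CVSS 9.8" or "CVSS score: 9.8".
CVSS_PATTERN = re.compile(r"CVSS(?:\s+score)?\s*[:=]?\s*(10\.0|\d\.\d)", re.IGNORECASE)


@dataclass
class VulnMeta:
    title: Optional[str]
    cve_ids: list[str]
    cvss: Optional[float]


def extract_meta(markdown: str) -> VulnMeta:
    """Pull the title, CVE identifiers and CVSS score out of a Markdown document."""
    title_match = TITLE_PATTERN.search(markdown)
    cvss_match = CVSS_PATTERN.search(markdown)
    # Normalize to upper case and de-duplicate repeated mentions.
    cve_ids = sorted({match.upper() for match in CVE_PATTERN.findall(markdown)})
    return VulnMeta(
        title=title_match.group(1).strip() if title_match else None,
        cve_ids=cve_ids,
        cvss=float(cvss_match.group(1)) if cvss_match else None,
    )
```

Fields that cannot be found stay `None` or empty rather than being guessed, so an import step can flag incomplete documents for manual review.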

2. Intelligent Search

Semantic search based on vector similarity.
Support natural language queries.
Recommend relevant vulnerabilities.
Support keyword search.

3. AI-Assisted PoC Generation

Generate initial PoC code based on vulnerability description.
Support multiple formats: Nuclei YAML, Python script, Pocsuite3, etc.
Code quality assessment and optimization suggestions.

1. Vectorized Semantic Search

Unlike traditional keyword search, we use vector similarity for semantic search. For example, a search for “SQL injection” can also surface related content such as “database attack” and “parameterized query bypass”.
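To make the idea concrete, here is a small sketch of ranking documents by cosine similarity. The three-dimensional vectors in the usage note are hand-made toy values, not real embeddings, and the function names are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(
    query_vec: list[float],
    docs: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

With a toy query vector `[1.0, 0.0, 0.2]` and documents `{"database attack": [0.9, 0.1, 0.3], "xss": [0.0, 1.0, 0.0]}`, the “database attack” entry ranks first even though it shares no keyword with the query, which is the behavior the paragraph above describes.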

2. AI-Assisted PoC Generation

This is the core highlight of the system. After the user inputs the Markdown description of the vulnerability, the AI will:

Analyze the vulnerability type and attack vector.
Retrieve similar historical PoCs.
Generate a suitable PoC code framework.
Provide code optimization suggestions.
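One plausible way to wire these steps together is to put the retrieved historical PoCs into the prompt as references. The sketch below only assembles the prompt text; `PocExample` and `build_poc_prompt` are hypothetical names, and the wording of the instructions is an assumption rather than the prompt the system actually uses.

```python
from dataclasses import dataclass


@dataclass
class PocExample:
    title: str
    fmt: str   # e.g. "nuclei" or "python"
    code: str


def build_poc_prompt(
    description: str,
    examples: list[PocExample],
    target_format: str,
    max_examples: int = 3,
) -> str:
    """Assemble a prompt from the vulnerability description and similar PoCs."""
    parts = [
        "You are assisting an authorized security tester.",
        f"Write a {target_format} PoC skeleton for the vulnerability described below.",
        "First state the vulnerability type and attack vector, then give the code, "
        "then list possible improvements.",
        "",
        "## Vulnerability description",
        description.strip(),
    ]
    # Only the closest few references are included to keep the prompt short.
    for index, example in enumerate(examples[:max_examples], start=1):
        parts += [
            "",
            f"## Reference PoC {index}: {example.title} ({example.fmt})",
            example.code.strip(),
        ]
    return "\n".join(parts)
```

The returned string would then be sent to whichever model handles generation; capping the number of references is a simple guard against overflowing the context window.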

3. Multi-Format Support

The system supports generating PoCs in multiple formats:

Nuclei YAML: For batch scanning.
Python Script: For custom verification.
Pocsuite3: Integrated into existing frameworks.
Metasploit Module: For penetration testing.
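Multi-format output can be organized as a small registry that maps a format name to a renderer. The sketch below covers two of the formats with a deliberately tiny, hypothetical `spec` dictionary (`id`, `name`, `path`, `match_word`); it emits a minimal Nuclei HTTP template and an equivalent standalone Python check, and is not the project's actual generator.

```python
import json
from typing import Callable

quote = json.dumps  # JSON string literals are valid YAML double-quoted scalars


def render_nuclei(spec: dict) -> str:
    """Render a minimal Nuclei HTTP template with a single word matcher."""
    lines = [
        f"id: {spec['id']}",
        "",
        "info:",
        f"  name: {quote(spec['name'])}",
        f"  author: {quote(spec.get('author', 'unknown'))}",
        f"  severity: {spec.get('severity', 'info')}",
        "",
        "http:",
        f"  - method: {spec.get('method', 'GET')}",
        "    path:",
        "      - " + quote("{{BaseURL}}" + spec["path"]),
        "    matchers:",
        "      - type: word",
        "        words:",
        "          - " + quote(spec["match_word"]),
    ]
    return "\n".join(lines) + "\n"


def render_python(spec: dict) -> str:
    """Render a standalone script that performs the same check."""
    lines = [
        "import sys",
        "import urllib.request",
        "",
        "base_url = sys.argv[1].rstrip('/')",
        f"with urllib.request.urlopen(base_url + {spec['path']!r}, timeout=10) as resp:",
        "    body = resp.read().decode('utf-8', errors='replace')",
        f"print('vulnerable' if {spec['match_word']!r} in body else 'not vulnerable')",
    ]
    return "\n".join(lines) + "\n"


RENDERERS: dict[str, Callable[[dict], str]] = {
    "nuclei": render_nuclei,
    "python": render_python,
}


def render_poc(fmt: str, spec: dict) -> str:
    """Dispatch to the renderer registered for the requested format."""
    try:
        renderer = RENDERERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported PoC format: {fmt}") from None
    return renderer(spec)
```

Adding Pocsuite3 or Metasploit output would then mean registering one more renderer, without touching the callers.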

Technical Implementation Highlights

1. Vector Database Optimization

We use PostgreSQL’s pgvector extension and tune the indexes for the characteristics of security-domain text. Compared with traditional full-text search, vector search captures semantic relevance: for example, “buffer overflow” and “stack overflow” are identified as related concepts.
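As a rough sketch of what the storage side could look like with pgvector: the table and column names below are hypothetical, and the HNSW index with cosine distance is one reasonable choice in recent pgvector versions, not necessarily the configuration used here. The dimension matches the 1024-dimensional embeddings mentioned earlier.

```python
EMBEDDING_DIM = 1024  # matches the embedding model's output size

# Hypothetical schema: one row per vulnerability document.
SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS vulnerabilities (
    id        BIGSERIAL PRIMARY KEY,
    title     TEXT NOT NULL,
    body      TEXT NOT NULL,
    embedding vector({EMBEDDING_DIM})
);
CREATE INDEX IF NOT EXISTS vulnerabilities_embedding_idx
    ON vulnerabilities USING hnsw (embedding vector_cosine_ops);
"""

# "<=>" is pgvector's cosine-distance operator; similarity = 1 - distance.
SEARCH_SQL = """
SELECT id, title, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM vulnerabilities
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s;
"""


def to_vector_literal(values: list[float]) -> str:
    """Format a Python list as pgvector's text representation, e.g. "[1.0,0.5]"."""
    return "[" + ",".join(repr(float(value)) for value in values) + "]"


def search_params(query_embedding: list[float], limit: int = 10) -> dict:
    """Build the named parameters for SEARCH_SQL."""
    if len(query_embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(query_embedding)}")
    return {"query": to_vector_literal(query_embedding), "limit": limit}
```

With a driver that supports named `%(name)s` parameters, such as psycopg, a query would be run as `cursor.execute(SEARCH_SQL, search_params(vector))`.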

2. AI Model Integration

The system integrates multiple AI models, each handling a different task:

Text Understanding: Use LLMs like GPT to understand vulnerability descriptions.
Code Generation: Based on code generation models like Code Llama.
Vectorization: Use Sentence Transformers for text embedding.

3. Progressive PoC Generation

The AI generates a PoC progressively:

First generate basic detection logic.
Then add specific attack payloads.
Finally optimize code structure and exception handling.
Provide code quality assessment and improvement suggestions.
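The staged process above can be sketched as a loop in which each stage receives the previous draft. The stage instructions and the `generate` callable (a stand-in for whatever LLM client is used) are assumptions for illustration.

```python
from typing import Callable

# Assumed stage instructions, mirroring the four steps listed above.
STAGES = [
    ("detection", "Write only the basic detection logic for this vulnerability."),
    ("payload", "Extend the draft with the specific request payload."),
    ("hardening", "Refactor the draft and add exception handling."),
    ("review", "Assess the code quality and add improvement suggestions as comments."),
]


def generate_progressively(
    description: str,
    generate: Callable[[str], str],
) -> list[tuple[str, str]]:
    """Run each stage in order and return the (stage name, draft) history."""
    draft = ""
    history: list[tuple[str, str]] = []
    for name, instruction in STAGES:
        prompt = (
            f"{instruction}\n\n"
            f"Vulnerability:\n{description}\n\n"
            f"Current draft:\n{draft or '(empty)'}"
        )
        draft = generate(prompt)  # each stage builds on the previous output
        history.append((name, draft))
    return history
```

Keeping the per-stage history makes it possible to show intermediate drafts in the PoC editor or to roll back when a later stage makes the code worse.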

4. Real-time Feedback Learning

The system collects user feedback on AI-generated PoCs to:

Optimize generation quality.
Adjust recommendation algorithms.
Improve search result ranking.
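How feedback feeds into ranking is not specified, so the following is only one possible scheme: blend the vector similarity with a smoothed approval rate computed from upvotes and downvotes. The smoothing prior and the `weight` parameter are assumed values.

```python
def feedback_score(upvotes: int, downvotes: int, prior: float = 1.0) -> float:
    """Smoothed approval rate in (0, 1); items without votes score 0.5."""
    return (upvotes + prior) / (upvotes + downvotes + 2 * prior)


def rerank(
    results: list[tuple[str, float]],
    votes: dict[str, tuple[int, int]],
    weight: float = 0.3,
) -> list[tuple[str, float]]:
    """Blend similarity with user feedback and sort best first.

    results: (doc_id, similarity) pairs from the search step.
    votes:   doc_id -> (upvotes, downvotes); missing ids count as no votes.
    """
    rescored = []
    for doc_id, similarity in results:
        up, down = votes.get(doc_id, (0, 0))
        blended = (1 - weight) * similarity + weight * feedback_score(up, down)
        rescored.append((doc_id, blended))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

The prior keeps a single vote from dominating the score, which matters early on when most PoCs have little or no feedback.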

Future Planning

Enhance AI Capabilities

Train specialized security domain models.
Support more programming languages and frameworks.
Improve code generation quality.
Combine some mature tools, such as LangGraph and related tools.

Extend Functions

Integrate more security tools.
Add vulnerability trend analysis.
Support automated testing.

About Search

At first I planned to rely on vector search alone, but vector search has problems of its own and cannot solve everything. The core job of a vector database (LanceDB, Chroma, Milvus, Pinecone, etc.) is:

Text/Data → Vectorization → Search using a measure such as cosine similarity.
It does not guarantee structured results (e.g., returning the “Nth PoC”).

If you simply vectorize PoC code as text chunks, precise search becomes difficult:

A search might hit the code, but the surrounding context will be incomplete.
It is hard to guarantee that a search for “CVE-2023-xxxx” returns the PoC for that exact vulnerability.

A good solution is Hybrid Search:

Full-Text Search (FTS) → Ensure keywords, IDs, terms, code snippets, etc., can be hit accurately.
Vector Search → Supplement semantic similarity search to help discover relevant content.
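A minimal sketch of combining the two result lists, under assumptions: Reciprocal Rank Fusion (RRF) is used here as the merging technique, which the text above does not specify, and exact CVE-ID matches are pinned to the top. The `fts_search`, `vector_search`, and `exact_lookup` callables are hypothetical stand-ins that each return document IDs, best first.

```python
import re
from typing import Callable

CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
Search = Callable[[str], list[str]]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_search(
    query: str,
    fts_search: Search,
    vector_search: Search,
    exact_lookup: Search,
    limit: int = 10,
) -> list[str]:
    """Exact CVE-ID hits first, then the fused full-text and vector results."""
    exact: list[str] = []
    for cve_id in CVE_PATTERN.findall(query):
        exact.extend(exact_lookup(cve_id.upper()))
    fused = reciprocal_rank_fusion([fts_search(query), vector_search(query)])
    ordered = list(dict.fromkeys(exact + fused))  # de-duplicate, keep order
    return ordered[:limit]
```

RRF only needs ranks, so the full-text score and the vector distance never have to be put on a common scale. On the PostgreSQL side, the full-text half could be served by the built-in `to_tsvector` and `plainto_tsquery` functions next to the pgvector query.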

Summary

Building this intelligent attack and defense knowledge base has been a very worthwhile exercise. By bringing in AI technology, we not only addressed the pain points of traditional knowledge management but also opened up new possibilities for security research.

Although the system is still being refined, it already plays an important role in day-to-day work. I hope this write-up offers some reference and inspiration for readers with similar needs. If you are interested in the project, you are welcome to get in touch and discuss it. We also plan to open source part of the code at an appropriate time, to advance security technology together with the community.

Article first published on Security Tech Blog; please credit the source when reposting.
