diff --git a/crates/cudnn/README.md b/crates/cudnn/README.md
deleted file mode 100644
index 8b55c1da..00000000
--- a/crates/cudnn/README.md
+++ /dev/null
@@ -1,88 +0,0 @@
-# cudnn
-Type safe cuDNN wrapper for the Rust programming language.
-
-## Project status
-The current version of cuDNN targeted by this wrapper is the 8.3.2. You can refer to the official [release notes](https://docs.nvidia.com/deeplearning/cudnn/release-notes/index.html) and to the [support matrix](https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html) by NVIDIA.
-
-The legacy API is somewhat complete and it is usable but the backend API is still to be considered a work in progress and its usage is therefore much discouraged. Both APIs are still being developed so expect bugs and reasonable breaking changes whilst using this crate.
-
-The project is part of the Rust CUDA ecosystem and is actively maintained by [frjnn](https://github.com/frjnn).
-
-## Primer
-
-Here follows a list of useful concepts that should be taken as a handbook for the users of the crate. This is not intended to be the full documentation, as each wrapped struct, enum and function has its own docs, but rather a quick sum up of the key points of the API. As a matter of fact, for a deeper view, you should refer both to the docs of each item and to the [official ones](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#overview) by NVIDIA. Furthermore, if you are new to cuDNN we strongly suggest reading the [official developer guide](https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#overview).
-
-### Device buffers
-
-This crate is built around [`cust`](https://docs.rs/cust/latest/cust/index.html) which is the core wrapper for interfacing with the CUDA driver API of our choice.
-
-### cuDNN statuses and Result
-
-All cuDNN library functions return their status. This crate uses [`Result`](https://doc.rust-lang.org/std/result/enum.Result.html) to achieve a leaner, idiomatic and easier to manage API.
-
-### cuDNN handles and RAII
-
-The main entry point of the cuDNN library is the `CudnnContext` struct. This handle is tied to a device and it is explicitly passed to every subsequent library function that operates on GPU data. It manages resources allocations both on the host and the device and takes care of the synchronization of all the the cuDNN primitives.
-
-The handles, and the other cuDNN structs wrapped by this crate, are implementors of the [`Drop`](https://doc.rust-lang.org/std/ops/trait.Drop.html) trait which implicitly calls their destructors on the cuDNN side when they go out of scope.
-
-cuDNN contexts can be created as shown in the following snippet:
-
-```rust
-use cudnn::CudnnContext;
-
-let ctx = CudnnContext::new().unwrap();
-```
-
-### cuDNN data types
-
-In order to enforce type safety as much as possible at compile time, we shifted away from the original cuDNN enumerated data types and instead opted to leverage Rust's generics. In practice, this means that specifying the data type of a cuDNN tensor descriptor is done as follows:
-
-```rust
-use cudnn::{CudnnContext, TensorDescriptor};
-
-let ctx = CudnnContext::new().unwrap();
-
-let shape = &[5, 5, 10, 25];
-let strides = &[1250, 250, 25, 1];
-
-// f32 tensor
-let desc = TensorDescriptor::<f32>::new_strides(shape, strides).unwrap();
-```
-
-This API also allows for using Rust own types as cuDNN data types, which we see as a desirable property.
-
-Safely manipulating cuDNN data types that do not have any such direct match, such as vectorized ones, whilst still performing compile time compatibility checks can be done as follows:
-
-```rust
-use cudnn::{CudnnContext, TensorDescriptor, Vec4};
-
-let ctx = CudnnContext::new().unwrap();
-
-let shape = &[4, 32, 32, 32];
-
-// in cuDNN this is equal to the INT8x4 data type and CUDNN_TENSOR_NCHW_VECT_C format
-let desc = TensorDescriptor::<i8>::new_vectorized::<Vec4>(shape).unwrap();
-```
-
-The previous tensor descriptor can be used together with a `i8` device buffer and cuDNN will see it as being a tensor of `CUDNN_TENSOR_NCHW_VECT_C` format and `CUDNN_DATA_INT8x4` data type.
-
-Currently this crate does not support `f16` and `bf16` data types.
-
-### cuDNN tensor formats
-
-We decided not to check tensor format configurations at compile time, since it is too strong of a requirement. As a consequence, should you mess up, the program will fail at run-time. A proper understanding of the cuDNN API mechanics is thus fundamental to properly use this crate.
-
-You can refer to this [extract](https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#data-layout-formats) from the cuDNN developer guide to learn more about tensor formats.
-
-We split the original cuDNN tensor format enum, which counts 3 variants, in 2 parts: the `ScalarC` enum and the `TensorFormat::NchwVectC` enum variant. The former stands for "scalar channel" and it encapsulates the `Nchw` and `Nhwc` formats. Scalar channel formats can be both converted to the `TensorFormat` enum with [`.into()`](https://doc.rust-lang.org/std/convert/trait.Into.html).
-
-```rust
-use cudnn::{ScalarC, TensorFormat};
-
-let sc_fmt = ScalarC::Nchw;
-
-let vc_fmt = TensorFormat::NchwVectC;
-
-let sc_to_tf: TensorFormat = sc_fmt.into();
-```
diff --git a/crates/cudnn/src/lib.rs b/crates/cudnn/src/lib.rs
index 3939d399..de91aa7f 100644
--- a/crates/cudnn/src/lib.rs
+++ b/crates/cudnn/src/lib.rs
@@ -1,5 +1,159 @@
+//! # cudnn
+//! Type safe cuDNN wrapper for the Rust programming language.
+//!
+//! ## Project status
+//!
+//! The current version of cuDNN targeted by this wrapper is 8.3.2. You can refer to the official
+//! [release notes] and to the [support matrix] by NVIDIA.
+//!
+//! [release notes]: https://docs.nvidia.com/deeplearning/cudnn/release-notes/index.html
+//! [support matrix]: https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html
+//!
+//! The legacy API is somewhat complete and usable but the backend API is still a work in progress
+//! and its usage is discouraged. Both APIs are still being developed so expect bugs and reasonable
+//! breaking changes whilst using this crate.
+//!
+//! The project is part of the Rust CUDA ecosystem and is actively maintained by
+//! [frjnn](https://github.com/frjnn).
+//!
+//! ## Primer
+//!
+//! Here follows a list of useful concepts that should be taken as a handbook for users of the
+//! crate. This is not intended to be the full documentation, as each wrapped struct, enum and
+//! function has its own docs, but rather a quick summary of the key points of the API. For a
+//! deeper view, you should refer both to the docs of each item and to the [official ones] by
+//! NVIDIA. Furthermore, if you are new to cuDNN we strongly suggest reading the [official
+//! developer guide].
+//!
+//! [official ones]: https://docs.nvidia.com/deeplearning/cudnn/api/index.html
+//! [official developer guide]: https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#overview
+//!
+//! ### Device buffers
+//!
+//! This crate is built around [`cust`](https://docs.rs/cust/latest/cust/index.html), which is our
+//! wrapper of choice for interfacing with the CUDA driver API.
+//!
+//! ### cuDNN statuses and Result
+//!
+//! All cuDNN library functions return their status. This crate uses
+//! [`Result`](https://doc.rust-lang.org/std/result/enum.Result.html) to achieve a leaner, more
+//! idiomatic and easier-to-manage API.
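+//!
+//! As a minimal sketch of how statuses surface (assuming the crate exports its error type as
+//! `CudnnError`), a failing call can be propagated with the `?` operator:
+//!
+//! ```rust
+//! use cudnn::{CudnnContext, CudnnError};
+//!
+//! fn init() -> Result<CudnnContext, CudnnError> {
+//!     // A non-successful cuDNN status becomes an `Err` value here.
+//!     let ctx = CudnnContext::new()?;
+//!     Ok(ctx)
+//! }
+//! ```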
+//!
+//! ### cuDNN handles and RAII
+//!
+//! The main entry point of the cuDNN library is the `CudnnContext` struct. This handle is tied to
+//! a device and is explicitly passed to every subsequent library function that operates on GPU
+//! data. It manages resource allocations both on the host and the device and takes care of the
+//! synchronization of all the cuDNN primitives.
+//!
+//! The handles, and the other cuDNN structs wrapped by this crate, are implementors of the
+//! [`Drop`](https://doc.rust-lang.org/std/ops/trait.Drop.html) trait, which implicitly calls their
+//! destructors on the cuDNN side when they go out of scope.
+//!
+//! cuDNN contexts can be created as shown in the following snippet:
+//!
+//! ```rust
+//! use cudnn::CudnnContext;
+//!
+//! let ctx = CudnnContext::new().unwrap();
+//! ```
+//!
+//! ### cuDNN data types
+//!
+//! In order to enforce type safety as much as possible at compile time, we shifted away from the
+//! original cuDNN enumerated data types and instead opted to leverage Rust's generics. In
+//! practice, this means that specifying the data type of a cuDNN tensor descriptor is done as
+//! follows:
+//!
+//! ```rust
+//! use cudnn::{CudnnContext, TensorDescriptor};
+//!
+//! let ctx = CudnnContext::new().unwrap();
+//!
+//! let shape = &[5, 5, 10, 25];
+//! let strides = &[1250, 250, 25, 1];
+//!
+//! // f32 tensor
+//! let desc = TensorDescriptor::<f32>::new_strides(shape, strides).unwrap();
+//! ```
+//!
+//! This API also allows for using Rust's own types as cuDNN data types, which we see as a
+//! desirable property.
+//!
+//! Safely manipulating cuDNN data types that do not have any such direct match, such as vectorized
+//! ones, whilst still performing compile time compatibility checks can be done as follows:
+//!
+//! ```rust
+//! use cudnn::{CudnnContext, TensorDescriptor, Vec4};
+//!
+//! let ctx = CudnnContext::new().unwrap();
+//!
+//! let shape = &[4, 32, 32, 32];
+//!
+//! // in cuDNN this is equal to the INT8x4 data type and CUDNN_TENSOR_NCHW_VECT_C format
+//! let desc = TensorDescriptor::<i8>::new_vectorized::<Vec4>(shape).unwrap();
+//! ```
+//!
+//! The previous tensor descriptor can be used together with an `i8` device buffer, and cuDNN will
+//! see it as a tensor of `CUDNN_TENSOR_NCHW_VECT_C` format and `CUDNN_DATA_INT8x4` data
+//! type.
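+//!
+//! As a sketch of how such a descriptor pairs with device memory (`quick_init` and
+//! `DeviceBuffer::from_slice` being `cust` APIs, since this crate is built around `cust`):
+//!
+//! ```rust
+//! use cust::memory::DeviceBuffer;
+//!
+//! let _ctx = cust::quick_init().unwrap();
+//!
+//! // 4 * 32 * 32 * 32 `i8` elements, matching the vectorized descriptor's shape.
+//! let buf = DeviceBuffer::from_slice(&vec![0i8; 4 * 32 * 32 * 32]).unwrap();
+//! ```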
+//!
+//! Currently this crate does not support `f16` and `bf16` data types.
+//!
+//! ### cuDNN tensor formats
+//!
+//! We decided not to check tensor format configurations at compile time, because it is too strong
+//! a requirement. As a consequence, should you mess up, the program will fail at run-time. A
+//! proper understanding of the cuDNN API mechanics is thus fundamental to properly use this crate.
+//!
+//! You can refer to this [extract] from the cuDNN developer guide to learn more about tensor
+//! formats.
+//!
+//! [extract]: https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#data-layout-formats
+//!
+//! We split the original cuDNN tensor format enum, which has 3 variants, into 2 parts: the
+//! `ScalarC` enum and the `TensorFormat::NchwVectC` enum variant. The former stands for "scalar
+//! channel" and encapsulates the `Nchw` and `Nhwc` formats. Both scalar channel formats can be
+//! converted to the `TensorFormat` enum with
+//! [`.into()`](https://doc.rust-lang.org/std/convert/trait.Into.html).
+//!
+//! ```rust
+//! use cudnn::{ScalarC, TensorFormat};
+//!
+//! let sc_fmt = ScalarC::Nchw;
+//!
+//! let vc_fmt = TensorFormat::NchwVectC;
+//!
+//! let sc_to_tf: TensorFormat = sc_fmt.into();
+//! ```
+
 #![deny(rustdoc::broken_intra_doc_links)]
-#[doc = include_str!("../README.md")]
+
 mod activation;
 mod attention;
 mod backend;
diff --git a/crates/cust/README.md b/crates/cust/README.md
deleted file mode 100644
index 264a2743..00000000
--- a/crates/cust/README.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Cust
-
-Featureful, Safe, and Fast CUDA Driver API wrapper for the Rust CUDA Project.
-
-Cust is a fork of rustacuda with a lot of API changes, added functions, etc. Big thanks to everyone who worked on RustaCUDA!
diff --git a/crates/cust/src/error.rs b/crates/cust/src/error.rs
index d5912014..d4423f92 100644
--- a/crates/cust/src/error.rs
+++ b/crates/cust/src/error.rs
@@ -1,6 +1,6 @@
 //! Types for error handling
 //!
-//! # Error handling in CUDA:
+//! # Error handling in CUDA
 //!
 //! cust uses the [`CudaError`](enum.CudaError.html) enum to represent the errors returned by
 //! the CUDA API. It is important to note that nearly every function in CUDA (and therefore
diff --git a/crates/cust/src/function.rs b/crates/cust/src/function.rs
index 67a5fb84..d18c4f21 100644
--- a/crates/cust/src/function.rs
+++ b/crates/cust/src/function.rs
@@ -474,7 +474,7 @@ impl Function<'_> {
 /// Launch a kernel function asynchronously.
 ///
-/// # Syntax:
+/// # Syntax
 ///
 /// The format of this macro is designed to resemble the triple-chevron syntax used to launch
 /// kernels in CUDA C. There are two forms available:
diff --git a/crates/cust/src/lib.rs b/crates/cust/src/lib.rs
index 638b523d..5e81edd0 100644
--- a/crates/cust/src/lib.rs
+++ b/crates/cust/src/lib.rs
@@ -6,9 +6,9 @@
 //! provides unsafe functions for retrieving and setting handles to raw CUDA objects.
 //! This allows advanced users to embed libraries that rely on CUDA, such as OptiX.
 //!
-//! # CUDA Terminology:
+//! # CUDA Terminology
 //!
-//! ## Devices and Hosts:
+//! ## Devices and Hosts
 //!
 //! This crate and its documentation uses the terms "device" and "host" frequently, so it's worth
 //! explaining them in more detail. A device refers to a CUDA-capable GPU or similar device and its
@@ -16,7 +16,7 @@
 //! must be transferred from host memory to device memory before the device can use it for
 //! computations, and the results must then be transferred back to host memory.
 //!
-//! ## Contexts, Modules, Streams and Functions:
+//! ## Contexts, Modules, Streams and Functions
 //!
 //! A CUDA context is akin to a process on the host - it contains all of the state for working with
 //! a device, all memory allocations, etc. Each context is associated with a single device.
@@ -30,7 +30,7 @@
 //! stream. Work within a single stream will execute sequentially in the order that it was
 //! submitted, and may interleave with work from other streams.
 //!
-//! ## Grids, Blocks and Threads:
+//! ## Grids, Blocks and Threads
 //!
 //! CUDA devices typically execute kernel functions on many threads in parallel. These threads can
 //! be grouped into thread blocks, which share an area of fast hardware memory known as shared
@@ -44,7 +44,7 @@
 //! hand, if the thread blocks are too small each processor will be under-utilized and the
 //! code will be unable to make effective use of shared memory.
 //!
-//! # Usage:
+//! # Usage
 //!
 //! Before using cust, you must install the CUDA development libraries for your system. Version
 //! 9.0 or newer is required. You must also have a CUDA-capable GPU installed with the appropriate
diff --git a/crates/cust/src/memory/pointer.rs b/crates/cust/src/memory/pointer.rs
index d40beed0..40973fac 100644
--- a/crates/cust/src/memory/pointer.rs
+++ b/crates/cust/src/memory/pointer.rs
@@ -429,7 +429,7 @@ impl<T: DeviceCopy> UnifiedPointer<T> {
     /// Returns a null unified pointer.
     ///
-    /// # Examples:
+    /// # Examples
     ///
     /// ```
     /// # let _context = cust::quick_init().unwrap();
diff --git a/crates/cust/src/module.rs b/crates/cust/src/module.rs
index 124a67fd..a26be5ab 100644
--- a/crates/cust/src/module.rs
+++ b/crates/cust/src/module.rs
@@ -338,7 +338,7 @@ impl Module {
     /// Get a reference to a global symbol, which can then be copied to/from.
     ///
-    /// # Panics:
+    /// # Panics
     ///
     /// This function panics if the size of the symbol is not the same as the `mem::sizeof()`.
     ///
diff --git a/crates/cust_derive/README.md b/crates/cust_derive/README.md
deleted file mode 100644
index bf2b30f7..00000000
--- a/crates/cust_derive/README.md
+++ /dev/null
@@ -1 +0,0 @@
-Custom derive macro crate for [RustaCUDA](https://github.com/bheisler/RustaCUDA).
\ No newline at end of file
diff --git a/crates/cust_derive/src/lib.rs b/crates/cust_derive/src/lib.rs
index 5a3a4011..06d240a0 100644
--- a/crates/cust_derive/src/lib.rs
+++ b/crates/cust_derive/src/lib.rs
@@ -1,3 +1,20 @@
+//! Custom derive macro crate for cust.
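+//!
+//! A minimal sketch of the intended use (assuming the exported derive is named `DeviceCopy`,
+//! as in the `rustacuda_derive` crate that this one forks):
+//!
+//! ```rust,ignore
+//! use cust_derive::DeviceCopy;
+//!
+//! // Plain-old-data types can be marked as safe to copy into device memory.
+//! #[derive(Clone, Copy, DeviceCopy)]
+//! struct Vec3 {
+//!     x: f32,
+//!     y: f32,
+//!     z: f32,
+//! }
+//! ```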
+
 #[macro_use]
 extern crate quote;
 extern crate proc_macro;
diff --git a/crates/gpu_rand/README.md b/crates/gpu_rand/README.md
deleted file mode 100644
index 25f45889..00000000
--- a/crates/gpu_rand/README.md
+++ /dev/null
@@ -1,35 +0,0 @@
-# gpu_rand
-
-gpu_rand is the Rust CUDA Project's equivalent of cuRAND. cuRAND unfortunately does not work with
-the CUDA Driver API, therefore, we reimplement (and extend) some of its algorithms and provide them in this crate.
-
-This crate is meant to be gpu-centric, which means it may special-case certain things to run faster on the GPU by using PTX
-assembly. However, it is supposed to also work on the CPU, allowing you to reuse the same random states across CPU and GPU.
-
-A lot of the initial code is taken from the [rust-random project](https://github.com/rust-random) and modified to make it able to
-pass to the GPU, as well as cleaning up certain things and updating it to edition 2021.
-
-The random generators currently implemented are:
-
-32-bit:
-- Xoroshiro64**
-- Xoroshiro64*
-- Xoroshiro128+
-- Xoroshiro128++
-- Xoroshiro128**
-
-64-bit:
-- Xoroshiro128+
-- Xoroshiro128++
-- Xoroshiro128**
-- Xoroshiro256+
-- Xoroshiro256++
-- Xoroshiro256**
-- Xoroshiro512+
-- Xoroshiro512++
-- Xoroshiro512**
-
-- SplitMix64
-
-We also provide a default 64-bit generator which should be more than enough for most applications. The default
-currently uses Xoroshiro128** but that is subject to change in the future.
diff --git a/crates/gpu_rand/src/lib.rs b/crates/gpu_rand/src/lib.rs
index d8189cc7..51383463 100644
--- a/crates/gpu_rand/src/lib.rs
+++ b/crates/gpu_rand/src/lib.rs
@@ -1,13 +1,51 @@
-//! gpu_rand is the Rust CUDA Project's equivalent of cuRAND. cuRAND unfortunately does not work with
-//! the CUDA Driver API, therefore, we reimplement (and extend) some of its algorithms and provide them in this crate.
+//! gpu_rand is the Rust CUDA Project's equivalent of cuRAND. cuRAND unfortunately does not work
+//! with the CUDA Driver API, therefore, we reimplement (and extend) some of its algorithms and
+//! provide them in this crate.
 //!
-//! This crate is meant to be gpu-centric, which means it may special-case certain things to run faster on the GPU by using PTX
-//! assembly. However, it is supposed to also work on the CPU, allowing you to reuse the same random states across CPU and GPU.
+//! This crate is meant to be GPU-centric, which means it may special-case certain things to run
+//! faster on the GPU by using PTX assembly. However, it is supposed to also work on the CPU,
+//! allowing you to reuse the same random states across CPU and GPU.
+//!
+//! A lot of the initial code is taken from the [rust-random
+//! project](https://github.com/rust-random) and modified so that it can be passed to the GPU, as
+//! well as cleaning up certain things and updating it to edition 2024.
 //!
-//! A lot of the initial code is taken from the [rust-random project](https://github.com/rust-random) and modified to make it able to
-//! pass to the GPU, as well as cleaning up certain things and updating it to edition 2024.
 //! The following generators are implemented:
 //!
+//! 32-bit:
+//! - Xoroshiro64**
+//! - Xoroshiro64*
+//! - Xoroshiro128+
+//! - Xoroshiro128++
+//! - Xoroshiro128**
+//!
+//! 64-bit:
+//! - Xoroshiro128+
+//! - Xoroshiro128++
+//! - Xoroshiro128**
+//! - Xoroshiro256+
+//! - Xoroshiro256++
+//! - Xoroshiro256**
+//! - Xoroshiro512+
+//! - Xoroshiro512++
+//! - Xoroshiro512**
+//! - SplitMix64
+//!
+//! We also provide a default 64-bit generator which should be more than enough for most
+//! applications. The default currently uses Xoroshiro128** but that is subject to change in the
+//! future.
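+//!
+//! A minimal host-side sketch (assuming the default generator is exported as `DefaultRand`
+//! and implements the `rand_core` traits):
+//!
+//! ```rust
+//! use gpu_rand::DefaultRand;
+//! use rand_core::{RngCore, SeedableRng};
+//!
+//! // Seed a state on the host; the same state type can also be handed to the GPU.
+//! let mut rng = DefaultRand::seed_from_u64(0xDEAD_BEEF);
+//! let sample = rng.next_u64();
+//! ```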
 
 #![deny(missing_docs)]
 #![deny(missing_debug_implementations)]